1. Oct 2025
    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, James Lee, Lu Bai, and colleagues use a multifaceted approach to investigate the relationship between transcription factor condensate formation, transcription, and 3D gene clustering of the MET regulon in the model organism S. cerevisiae. This study represents a second clear example of inducible transcriptional condensates in budding yeast, as most evidence for transcriptional condensates arises from studies of mammalian systems. In addition, this study links the genomic location of transcriptional condensates to the potency of transcription of a reporter gene regulated by the master transcription factor contained in the condensate. The strength of evidence supporting these two conclusions is strong. Less strong is evidence supporting the claim that Met4-containing condensates mediate the clustering of genes in the MET regulon.

      Strengths:

      The manuscript is for the most part clearly written, with the overriding model and specific hypothesis being tested clearly explained. Figure legends are particularly well written. An additional strength of the manuscript is that most of the main conclusions are supported by the data. This includes the propensity of Met4 and Met32 to form puncta-like structures under inducing conditions, formation of Met32-containing LLPS-like droplets in vitro (within which Met4 can colocalize), colocalization of Met4-GFP with Met4-target genes under inducing conditions, enhanced transcription of a Met3pr-GFP reporter when targeted within 1.5 - 5 kb of select Met4 target genes, and most impressively, evidence that several MET genes appear to reposition under transcriptionally inducing conditions. The latter is based on a recently reported novel in vivo methylation assay, MTAC, developed by the Bai lab.

      Weaknesses:

      My principal concern is that the authors fail to show convincing evidence for a key conclusion, highlighted in the title, that nuclear condensates per se drive MET gene clustering. Figure 4E demonstrates that Met4 molecules, not condensates per se, are necessary for fostering distant cis and trans interactions between MET6 and three other Met4 targets under -met inducing conditions. In addition, the paper would be strengthened by discussing a recent study conducted in yeast that comes to many of the same conclusions reported here, including the role of inducible TF condensates in driving 3D genome reorganization (Chowdhary et al, Mol. Cell 2022).

      Following the reviewer’s advice, we carried out MTAC with the VP near MET6 in WT Met4 and ΔIDR2.3 strains (results shown below). The conclusions are somewhat ambiguous. For long-distance interactions with MUP1, YKG9, STR3, and MET13, we indeed observe decreased MTAC signals close to background levels in the ΔIDR2.3 strain, which aligns with the model suggesting that Met4 condensation promotes clustering among Met4 targeted genes. However, we also noticed significant decreases in the local MTAC signals (HIS3 and MET6). It is possible that the changes in Met4 condensates alter the chromosomal folding near MET6, thereby affecting the local MTAC signals. Alternatively, LacI-M.CviPI (the methyltransferase) could be induced to a lesser extent in the ΔIDR2.3 strain, leading to a genome-wide decrease in MTAC signals. Due to this ambiguity, we decided not to include the following plot in the main figure.

      Author response image 1.

      We discussed Hsf1 and added the suggested reference on page 13.

      Other concerns:

      (1) A central premise of the study is that the inducible formation of condensates underpins the induction of MET gene transcription and MET gene clustering. Yet, Figure 1 suggests (and the authors acknowledge) that puncta-like Met4-containing structures pre-exist in the nuclei of non-induced cells. Thus, the transcription and gene reorganization observed is due to a relatively modest increase in condensate-like structures. Are we dealing with two different types of Met4 condensates? (For example, different combinations of Met4 with its partners; Mediator- or Pol II-lacking vs. Mediator- or Pol II-containing; etc.?) At the very least, a comment to this effect is necessary.

      Although Met4 can form smaller puncta in the +met condition (Figure 1A), it cannot be recruited to its target genes due to the absence of its sequence-specific binding partners, Met31 and Met32 (these two factors are actively degraded in the +met condition). Consistent with this, in the +met condition, Met4 shows extremely low genome-wide ChIP signals (Figure 3C). Therefore, these Met4 puncta in +met neither organize the 3D genome nor have gene regulatory functions. This discussion has been added on page 12.

      (2) Using an in vitro assay, the authors demonstrate that Met4 colocalizes with Met32 LLPS droplets (Figure 2F). Is the same true in vivo - that is, is Met32 required for Met4 condensation? This could be readily tested using auxin-induced degradation of Met32. Along similar lines, the claim that Met32 is required for MET gene clustering (line 250) requires auxin-induced degradation of this protein.

      As the reviewer pointed out above, cells in the +met condition also show small Met4 puncta. In this condition, Met32 is essentially undetectable (the Met31 level is even lower and remains undetectable even in the -met condition). Therefore, Met4 puncta formation does not strictly require Met32 in vivo, although it may require other factors or modifications. Met4 lacks DNA-binding activity and therefore cannot target and organize chromosomes on its own. Although we did not perform the Met32 degradation experiment, we measured the 3D genome conformation in +met and showed that there are no detectable interactions among Met4 target genes.

      (3) The authors use a single time point during -met induction (2 h) to evaluate TF clustering, transcription (mRNA abundance), and 3D restructuring. It would be informative to perform a kinetic analysis since such an analysis could reveal whether TF clustering precedes transcriptional induction or MET gene repositioning. Do the latter two phenomena occur concurrently or does one precede the other?

      We appreciate the reviewer’s insightful question. It is indeed intriguing to consider whether TF clustering precedes transcriptional induction and MET gene clustering. However, as mentioned on page 12 of our manuscript, this experiment poses significant challenges. The low intensities of the Met4 and Met32 signals necessitate high excitation for imaging, which also makes them prone to photo-bleaching. Consequently, we have been unable to measure the dynamics of Met4 and Met32 puncta in vivo, let alone co-image them with DNA/RNA. Undertaking this experiment will require considerable effort, which we plan to pursue in the future.

      (4) Based on the MTAC assay, MET13 does not appear to engage in trans interactions with other Met4 targets, whereas MET6 does (Figures 4C and 4E). Does this difference stem from the greater occupancy of Met4 at MET6 vs. MET13, greater association of another Met co-factor with the chromatin of MET6 vs. MET13, or something else?

      We were also surprised by this result, given that MET13 emerged as one of the strongest transcriptional hotspots in our previous screen. It also exhibits one of the highest Met4 ChIP signals and is closely associated with the nuclear pore complex. Our earlier findings indicate that DNA dynamics near the VP significantly influence the MTAC signal; specifically, a VP with constrained motion is less effective at methylating interacting sites (Li et al., 2024). Therefore, it is plausible that MET13 is associated with a large Met4 condensate, which constrains the motion of nearby chromatin and diminishes MTAC efficiency.

      Reviewer #2 (Public Review):

      Summary:

      This manuscript combines live yeast cell imaging and other genomic approaches to study how transcription factor (TF) condensates might help organize and enhance the transcription of the target genes in the methionine starvation response pathway. The authors show that the TFs in this response can form phase-separated condensates through their intrinsically disordered regions (IDRs), and mediate the spatial clustering of the related endogenous genes as well as reporter inserted near the endogenous target loci.

      Strengths:

      This work uses rigorous experimental approaches, such as imaging of endogenously labeled TFs, determining expression and clustering of endogenous target genes, and reporter integration near the endogenous target loci. The importance of TFs is shown by rapid degradation. Single-cell data are combined with genomic sequencing-based assays. Control loci engineered in the same way are usually included. Some of these controls are very helpful in showing the pathway-specific effect of the TF condensates in enhancing transcription.

      Weaknesses:

      Perhaps the biggest weakness of this work is that the role of IDR and phase separation in mediating the target gene clustering is unclear. This is an important question. TF IDRs may have many functions including mediating phase separation and binding to other transcriptional molecules (not limited to proteins and may even include RNAs). The effect of IDR deletion on reduced Fano number in cells could come from reduced binding with other molecules. This should be tested on phase separation of the purified protein after IDR deletion. Also, the authors have not shown IDR deletion affects the clustering of the target genes, so IDR deletion may affect the binding of other molecules (not the general transcription machinery) that are specifically important for target gene transcription. If the self-association of the IDR is the main driving force of the clustering and target gene transcription enhancement, can one replace this IDR with totally unrelated IDRs that have been shown to mediate phase separation in non-transcription systems and still see the gene clustering and transcription enhancement effects? This work has all the setup to test this hypothesis.

      We thank the reviewer for raising this point. We performed additional in vitro and in vivo experiments with Met4 IDR deletions. See the answer to Reviewer 1 for the in vivo 3D mapping experiment.

      We purified Met4-ΔIDR2 with an MBP tag, but its low yield made labeling and thorough experiments challenging. At concentrations above ~10 μM, the protein tends to aggregate, while at lower concentrations it remains diffuse in solution and does not form condensates. When we mixed purified Met4-ΔIDR2 with Met32, we observed reduced partitioning inside Met32 condensates compared to full-length Met4. As the reviewer noted, this diminished interaction may contribute to the decreased puncta formation observed in vivo. This result has been added to the manuscript on page 11 and in Supplementary Figure 5.

      The Met4 protein was tagged with MBP but Met32 was not. MBP tag is well known to enhance protein solubility and prevent phase separation. This made the comparison of their in vitro phase behavior very different and led the authors to think that maybe Met32 is the scaffold in the co-condensates. If MBP was necessary to increase yield and solubility during expression and purification, it should be cleaved (a protease cleavage site should be engineered) to allow phase separation in vitro.

      Following the reviewer’s advice, we purified Met4-TEV-MBP so that the MBP can be cleaved off. Unfortunately, concentrated Met4-TEV-MBP needs to be stored at high salt (400 mM) to be soluble. When exchanged into a suitable buffer for TEV cleavage (≤200 mM NaCl), nearly all soluble protein aggregates. Attempts to digest the protein in storage buffer result in observable aggregation before significant cleavage (see below).

      Author response image 2.

      Are ATG36 and LDS2 also supposed to be induced by -met? This should be explained clearly. The signals are high at -met.

      Genomic loci ATG36 and LDS2 were chosen as controls because they are not bound by Met TFs (ChIP-seq tracks) and their expression is not induced by -met (RNA-seq data). This information has been added to the manuscript on page 9. When the MET3pr-GFP reporter is inserted into these loci, GFP is induced by -met (because it is driven by the MET3 promoter), but the induction level is lower than that of the same reporter inserted into transcriptional hotspots such as MET13 and MET6 (Figure 6E, also see Du et al., PLoS Genetics, 2017).

      ChIP-seq data:

      Author response image 3.

      RNA-seq counts:

      Author response table 1.

      Figure 6B, the Met4-GFP seems to form condensates at all three loci without a very obvious difference, though 6C shows a difference. 6C is from only one picture each. The authors should probably quantify the signals from a large number of randomly selected pictures (cells) and do statistics.

      If we understand this comment correctly, the reviewer is referring to the fact that all three loci in Figure 6B appear to show a peak in GFP intensity. This pattern emerges because these images are averaged among many cells (the number of cells analyzed in 6B has been added to the figure legend). GFP intensities near the center will always be higher because peripheral pixels are more likely to fall outside the nuclear boundaries, where Met4 signals are absent (same as in Figure 3F). Importantly, the MET6 locus shows higher intensity near the center in comparison to PUT1 and ATG36, indicating its co-localization with Met4 condensates.

      Reviewer #3 (Public Review):

      Summary:

      In this study, the authors probe the connections between clustering of the Met4/32 transcription factors (TFs), clustering of their regulatory targets, and transcriptional regulation. While there is an increasing number of studies on TF clustering in vitro and in vivo, there is an important need to probe whether clustering plays a functional role in gene expression. Another important question is whether TF clustering leads to the clustering of relevant gene targets in vivo. Here the authors provide several lines of evidence to make a compelling case that Met4/32 and their target genes cluster and that this leads to an increase in transcription of these genes in the induced state. First, they found that, in the induced state, Met4/32 forms co-localized puncta in vivo. This is supported by in vitro studies showing that these TFs can form condensates in vitro with Met32 being the driver of these condensates. They found that two target genes, MET6 and MET13 have a higher probability of being co-localized with Met4 puncta compared with non-target loci. Using a targeted DNA methylation assay, they found that MET13 and MET6 show Met4-dependent long-range interactions with other Met4-regulated loci, consistent with the clustering of at least some target genes under induced conditions. Finally, by inserting a Met4-regulated reporter gene at variable distances from MET6, they provide evidence that insertion near this gene is a modest hotspot for activity.

      Weaknesses:

      (1) Please provide more information on the assay for puncta formation (Figure 1). It's unclear to me from the description provided how this assay was able to quantitate the number of puncta in cells.

      Due to the variation in puncta size and intensity (as illustrated in Figure 1A), counting the number of puncta would be highly subjective with arbitrary cutoffs. Therefore, we chose to calculate the CV and Fano values instead, which are unbiased measures. Proteins that form puncta will exhibit greater pixel-to-pixel variations in GFP intensity, resulting in higher CV and Fano values.
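
      The two statistics named above can be made concrete with a small sketch. It assumes the usual definitions (CV = standard deviation / mean, Fano = variance / mean) applied to the pixel intensities of a single nucleus. The function name, the use of the population variance, and the toy intensity values are illustrative assumptions, not the authors' actual pipeline.

```python
def cv_and_fano(pixels):
    """Return (CV, Fano) for a flat list of pixel intensities from one nucleus.

    Assumed definitions: CV = std / mean, Fano = variance / mean,
    using the population variance over the listed pixels.
    """
    n = len(pixels)
    if n == 0:
        raise ValueError("need at least one pixel")
    mean = sum(pixels) / n
    if mean <= 0:
        raise ValueError("mean intensity must be positive")
    variance = sum((p - mean) ** 2 for p in pixels) / n
    return variance ** 0.5 / mean, variance / mean


# Hypothetical nuclei with the same mean intensity (100):
diffuse = [100, 102, 98, 101, 99, 100]   # evenly spread signal
punctate = [60, 60, 60, 60, 60, 300]     # one bright punctum

cv_diffuse, fano_diffuse = cv_and_fano(diffuse)
cv_punctate, fano_punctate = cv_and_fano(punctate)
print(f"diffuse:  CV={cv_diffuse:.3f}  Fano={fano_diffuse:.3f}")
print(f"punctate: CV={cv_punctate:.3f}  Fano={fano_punctate:.3f}")
```

      Because both toy nuclei share the same mean, only the pixel-to-pixel spread separates them, and no threshold for what counts as a punctum is needed.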

      (2) How does the number of puncta in cells correspond with the number of Met-regulated genes? What are the implications of this calculation?

      As previously mentioned, defining the exact number of Met4 puncta is challenging. The number of puncta does not necessarily have one-to-one correspondence to the number of Met4 target genes. Some puncta may not be associated with chromosomes, while others may interact with multiple genes.

      (3) A control for chromosomal insertion of the Met-regulated reporter was a GAL4 promoter derivative reporter. However, this control promoter seems 5-10 fold more active than the Met-regulated promoter (Figure 6). It's possible that the high activity from the control promoter overcomes some other limiting step such that chromosomal location isn't important. It would be ideal if the authors used a promoter with comparable activity to the Met-reporter as a control.

      We agree with the reviewer that it would be better to use another promoter with comparable activity. Indeed, this was our rationale for selecting the attenuated GAL1 promoter over the WT version; however, it still exhibited substantially higher activity than MET3pr. Unfortunately, we do not have a promoter from a different pathway that is calibrated to match the activity level of MET3pr. Nonetheless, MET17pr has much higher activity (~3-fold) than MET3pr, and we observed a similar degree of stimulation at the hotspot relative to the control locus for both promoters (1.5-2-fold increase in GFP expression) (Figure 6E & F). This suggests that the observed effects are more likely to depend on the activation pathway and TF identity than on promoter strength.

      (4) It seems like transcription from a very large number of genes is altered in the Met4 IDR mutant (Figure 7F). Why is this and could this variability affect the conclusions from this experiment?

      We agree with the reviewer that the ΔIDR2.3 truncation affects the expression of 2711 genes (P-adj < 0.05; 1339 up, 1372 down). We suspect that this is due to the decreased expression of Met4 target genes, leading to altered levels of methionine and other sulfur-containing metabolites. Such changes would have a global impact on gene expression. Importantly, although similar numbers of genes are up- and down-regulated in the ΔIDR2.3 strain, almost all Met4 targets showed decreased expression (Fig 7F). This supports the model in which Met4 condensates increase the expression of Met4 target genes.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for The Authors):

      (1) The introduction contains multiple miscitations. Rather than gene clustering, most of the studies and reviews cited (e.g., lines 35-39) report interactions between genomic loci (E-E, E-P, and P-P). There are other claims not supported by the papers cited. Moreover, the authors lump together original research papers and reviews within a given group without distinguishing which is which.

      We thank the reviewer for pointing this out. We reorganized the references in the introduction.

      (2) One option to address the concern regarding the lack of evidence that nuclear condensates per se drive MET gene clustering is to test the impact of Met4 ΔIDR2.3 on MTAC signals.

      We carried out the suggested experiment. See answer above (Reviewer #1, Question #1).

      (3) Authors claim that there are significant differences between values depicted in Figures 1B and 3G. Statistical tests are necessary to show this.

      Significance values were calculated in comparison to free GFP using a two-tailed Student’s t-test in Figures 1B, 1C, and 3G. The corresponding figure legends have been updated.

      (4) How are the data in Figures 3F, G, and 6B, C generated? This is unclear from the information provided in the Figure legends and Materials and Methods.

      For each cell, we projected the highest mCherry and GFP intensity at each pixel across all z positions onto a 2D plane (maximum intensity projection, MIP). The MIP images were aligned with the mCherry dot at the center and averaged among all cells. To calculate the GFP intensities as in Figures 3G and 6C, a single line was drawn across the center and the GFP profile was analyzed in ImageJ. We now describe this in the corresponding figure legends, and the Materials and Methods have also been updated.
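
      The procedure described above (project, align on the mCherry dot, average, read a line profile) can be sketched as follows. This is a minimal illustration under stated assumptions: z-stacks are plain nested lists, the mCherry dot is taken to be the brightest pixel of the mCherry MIP, the crop half-width is arbitrary, and the center row stands in for the line drawn in ImageJ. All function names and the toy data are hypothetical.

```python
def max_intensity_projection(stack):
    """Collapse a z-stack (list of 2D planes) to 2D via the per-pixel maximum over z."""
    return [[max(values) for values in zip(*rows)] for rows in zip(*stack)]


def brightest_pixel(image):
    """Return (y, x) of the brightest pixel; assumed to mark the mCherry dot."""
    coords = [(y, x) for y, row in enumerate(image) for x in range(len(row))]
    return max(coords, key=lambda yx: image[yx[0]][yx[1]])


def crop_centered(image, center, half):
    """Cut a (2*half+1)-wide square around center; None if it leaves the image."""
    cy, cx = center
    if cy - half < 0 or cx - half < 0:
        return None
    if cy + half >= len(image) or cx + half >= len(image[0]):
        return None
    return [row[cx - half:cx + half + 1] for row in image[cy - half:cy + half + 1]]


def average_images(images):
    """Pixel-wise mean of equally sized 2D images."""
    return [[sum(values) / len(images) for values in zip(*rows)]
            for rows in zip(*images)]


def make_stack(height, width, spots):
    """Hypothetical toy data: a two-plane stack with (z, y, x, value) spots."""
    stack = [[[0] * width for _ in range(height)] for _ in range(2)]
    for z, y, x, value in spots:
        stack[z][y][x] = value
    return stack


# Two toy cells, each a (mCherry stack, GFP stack) pair.
cells = [
    (make_stack(5, 5, [(1, 2, 2, 9)]), make_stack(5, 5, [(0, 2, 2, 4), (1, 2, 1, 2)])),
    (make_stack(5, 5, [(0, 1, 3, 9)]), make_stack(5, 5, [(1, 1, 3, 6)])),
]

aligned = []
for mcherry_stack, gfp_stack in cells:
    dot = brightest_pixel(max_intensity_projection(mcherry_stack))
    window = crop_centered(max_intensity_projection(gfp_stack), dot, half=1)
    if window is not None:          # skip cells whose dot is too close to the edge
        aligned.append(window)

mean_image = average_images(aligned)
profile = mean_image[len(mean_image) // 2]   # GFP along the row through the center
print(profile)
```

      Averaging after alignment is what produces the central peak in the population image: signal that sits at the dot in each cell adds up, while signal at varying offsets is diluted.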

      (5) Typos/ unclear writing: lines 24, 58, 79, 82, 84, 96, 117, 121, 131, 142, 147, 161 (terminus, not "terminal"), 250, 325, 349, 761 (was, not "are"). For several of these: "condense" is not "condensate"; for many others: inappropriate use of "the". Supplementary Figure 1 legend: not "a single nuclei" instead "a single nucleus".

      We thank the reviewer for pointing this out. We tried our best to correct grammatical errors.

      (6) Define GAL1Spr (Figure 6F).

      The GAL1S promoter is an attenuated GAL1 promoter that lacks two of the four Gal4 binding sites. The original paper is now cited in the manuscript on page 10.

      (7) Figure 7B, C: there appears to be an inconsistency between the image and bar graph value for ΔIDR3.

      The Fano values calculated in 7C are averaged among a population of cells (we added the cell numbers to the legend), while the image in 7B is an example of an individual nucleus. There is some cell-to-cell variability in how Met4 appears. To be more representative, we chose a different image for ΔIDR3.

      (8) Supplementary Tables: use descriptive titles for file names.

      This is corrected.

      Reviewer #2 (Recommendations For The Authors):

      Minor:

      Figure 4F is not cited in the text, and the color legend seems wrong for targeted and control.

      Figure 4F is now cited in the text. The labels were corrected.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary: 

      The investigators in this study analyzed the dataset assembly from 540 Salmonella isolates, and those from 45 recent isolates from Zhejiang University of China. The analysis and comparison of the resistome and mobilome of these isolates identified a significantly higher rate of cross-region dissemination compared to localized propagation. This study highlights the key role of the resistome in driving the transition and evolutionary 

      Thank you for summarizing our work. We have carefully considered your comments, responded to each of them, and revised the text accordingly. Additionally, to better contextualize the background and clarify the major points of this study, we have added some references.

      Upon further review of our initial manuscript, we realized that the original submission did not strictly follow the lineage order proposed by Zhou et al. (Natl Sci Rev. 2023 Sep 2;10(10):nwad228). To avoid confusion and keep the typing system consistent, we have adjusted the lineage nomenclature throughout the revised manuscript to reflect the corrected order as follows:

      Author response table 1.

      To ensure consistency with previous studies, we have revised the nomenclature for the different lineages of bvSP.

      Strengths: 

      The isolates included in this study were from 16 countries in the past century (1920 to 2023). While the study uses S. Gallinarum as the prototype, the conclusion from this work will likely apply to other Salmonella serotypes and other pathogens.

      Thanks for the constructive comments and the positive reception of the manuscript.

      Weaknesses: 

      While the isolates came from 16 countries, most strains in this study were originally from China. 

      We appreciate the reviewer's observation regarding the sampling distribution of isolates in this study. We acknowledge that, although the isolates were collected from 15 different countries, a significant proportion originated from China (Author response image 1). There are several reasons for this focus:

      Author response image 1.

      Geographic distribution of 580 S. Gallinarum. Different colors indicate the countries of origin for the 580 S. Gallinarum strains in the dataset. Darker shades represent higher numbers of strains.

      (1) Once a globally prevalent pathogen throughout the 20th century, S. Gallinarum was listed by the World Organization for Animal Health (WOAH) because of its economic importance. After 30 years of implementation of the National Poultry Improvement Plan in the US, it was almost eradicated in high-income countries; interestingly, it became an endemic pathogen with sporadic outbreaks in most low- or middle-income countries, such as China and Brazil. Given China's vast land area and economic factors, implementing the same measures there remains challenging.

      (2) S. Gallinarum is an avian-specific pathogen, particularly affecting chickens, and its distribution is closely linked to chicken meat production in different countries. Fowl typhoid is reported more frequently in some developing countries with high chicken production. Data from the United States Department of Agriculture (USDA) on annual chicken meat production for 2023/2024 show that the global distribution of S. Gallinarum aligns closely with the overall chicken meat production of these countries (https://fas.usda.gov/data/production/commodity/0115000).

      Author response image 2.

      The United States Department of Agriculture (USDA) data on annual chicken meat production for 2023/2024 across different countries globally.

      (3) Our primary objective was to investigate the localized resistome adaptation of S. Gallinarum in specific regions. As a region with a significant disease burden, China has reported numerous outbreaks (Sci Data. 2022 Aug 13;9(1):495; Sci Data. 2024 Feb 27;11(1):244) and a high AMR prevalence of this serovar (Natl Sci Rev. 2023 Sep 2;10(10):nwad228; mSystems. 2023 Dec 21;8(6):e0088323), making it an excellent example for understanding localized resistance mechanisms.

      (4) As China is the primary country of origin for the strains in this study, it is necessary to ensure that the strains from China are consistent with the local geographic characteristics of the country. Therefore, we conducted a correlation analysis between the number of strains from different provinces in China and the total GDP/population size of those provinces (Author response image 3). The results show that most points fall within the 95% confidence interval of the regression line. Although some provinces deviate from this trend in their number of S. Gallinarum strains, most of them have a small sample size (n < 15). Overall, we found that the prevalence of S. Gallinarum in different regions of China is consistent with the overall nationwide trend.

      Author response image 3.

      Correlation analysis between the number of S. Gallinarum collected from different provinces in China and the total GDP/population size. The figure depicts a series of points representing individual provinces. The x-axis indicates the number of S. Gallinarum included in the dataset, while the y-axis displays the values for total GDP and total population size, respectively.

      Nevertheless, a search of nearly a decade of literature on PubMed and a survey of the S. Gallinarum genomes available in public databases indicate that the dataset used is the most complete to date. Furthermore, focusing on a specific region within China allowed us to conduct a detailed and thorough analysis. However, we fully agree that expanding the study to include more isolates from other countries would enhance the generalizability of our findings, and we are actively collecting additional S. Gallinarum genome data. In the revised manuscript, we have further emphasized the limitations as follows:

      Lines 427-429: “However, the current study has some limitations. Firstly, despite assembling the most comprehensive WGS database for S. Gallinarum from public and laboratory sources, there are still biases in the examined collection. The majority (438/580) of S. Gallinarum samples were collected from China, possibly since the WGS is a technology that only became widely available in the 21st century. This makes it impractical to sequence it on a large scale in the 20th century, when S. Gallinarum caused a global pandemic. So, we suspect that human intervention in the development of this epidemic is the main driving force behind the fact that most of the strains in the data set originated in China. In our future work, we aim to actively gather more data to minimize potential biases within our dataset, thereby improving the robustness and generalizability of our findings.”

      Reviewer #2 (Public review): 

      Summary: 

      The authors sequence 45 new samples of S. Gallinarum, a commensal Salmonella found in chickens, which can sometimes cause disease. They combine these sequences with around 500 from public databases, determine the population structure of the pathogen, and coarse relationships of lineages with geography. The authors further investigate known anti-microbial genes found in these genomes, how they associate with each other, whether they have been horizontally transferred, and date the emergence of clades. 

      Thank you for your constructive suggestions, which are valuable and highly beneficial for improving our paper. We have carefully considered your comments, responded to each of them, and revised the text accordingly. Furthermore, to better contextualize the background and clarify the major points of this study, we have added some references to support our findings and policy implications.

      Upon further review of our initial manuscript, we realized that the original submission did not strictly follow the lineage order proposed by Zhou et al. (Natl Sci Rev. 2023 Sep 2;10(10):nwad228). To avoid confusion in the typing system, we have adjusted the lineage nomenclature in the revised manuscript to reflect the corrected order (see Author response table 1).

      Strengths: 

      (1) It doesn't seem that much is known about this serovar, so publicly available new sequences from a high-burden region are a valuable addition to the literature. 

      (2) Combining these sequences with publicly available sequences is a good way to better contextualise any findings. 

      Thank you so much for your thorough review and constructive comments on the manuscript.

      Weaknesses: 

      There are many issues with the genomic analysis that undermine the conclusions, the major ones I identified being: 

      (1) Recombination removal using gubbins was not presented fully anywhere. In this diversity of species, it is usually impossible to remove recombination in this way. A phylogeny with genetic scale and the gubbins results is needed. Critically, results on timing the emergence (fig2) depend on this, and cannot be trusted given the data presented. 

      We sincerely thank you for pointing out this issue. In the original manuscript, we aimed to present different lineages of S. Gallinarum within a single phylogenetic tree constructed using BEAST. However, in the revised manuscript, we have addressed this issue by applying the approach recommended by Gubbins to remove recombination events for each lineage defined by FastBAPs. Additionally, to better illustrate the removal of recombination regions in the genome, we have included a figure generated by Gubbins (New Supplementary Figure 12). 

      Our results indicate that recombination events are relatively infrequent in Lineage 1, followed by Lineage 3, but occur more frequently in Lineage 2. In the revised manuscript, we have included additional descriptions in the Methods section to clarify this analysis. We hope these modifications adequately address the reviewer’s concerns and enhance the trustworthiness of our findings.

      (2) The use of BEAST was also only briefly presented, but is the basis of a major conclusion of the paper. Plot S3 (root-to-tip regression) is unconvincing as a basis of this data fitting a molecular clock model. We would need more information on this analysis, including convergence and credible intervals. 

      Thank you very much for raising this issue. We re-ran BEAST separately for each lineage so that the evolutionary scale reflects the improvements described above. The per-lineage BEAST analysis comprised the following steps:

      (1) Using R51 as the reference, a reference-mapped multiple core-genome SNP sequence alignment was created, and recombination regions were detected and removed as described above.

      (2) TreeTime was used to assess the temporal structure by regressing the root-to-tip branch distances in the maximum likelihood tree against sampling date (New Supplementary Figure 6). However, the root-to-tip regression presented in New Supplementary Figure 6 was not intended as a basis for selecting the best molecular clock model; its purpose was to clean the dataset with appropriate measurements.

      (3) To determine the optimal model for running BEAST, we tested a total of six combinations in the initial phase of our study. These combinations paired two clock models (strict clock and relaxed lognormal clock) with three population models (Bayesian SkyGrid, Bayesian Skyline, and Constant Size). Before conducting the complete BEAST analysis, we evaluated each combination using a Markov Chain Monte Carlo (MCMC) analysis with a total chain length of 100 million and sampling every 10,000 iterations. We then summarized the results using NSLogAnalyser and determined the optimal model based on the marginal likelihood value for each combination. The results indicated that the model incorporating the Bayesian Skyline and the relaxed lognormal clock yielded the highest marginal likelihood value in our sample. We then performed a time-calibrated Bayesian phylogenetic inference analysis for each lineage with the following settings: the "GTR" substitution model, “4 gamma categories”, the "Relaxed Clock Log Normal" model, the "Coalescent Bayesian Skyline" tree prior, and an MCMC chain length of 100 million, with sampling every 10,000 iterations.

      (4) Convergence was assessed using Tracer, with the effective sample size (ESS) of every parameter exceeding 200. Maximum clade credibility trees were generated using TreeAnnotator. Finally, key divergence time points (with 95% credible intervals) were estimated, and the tree was visualized using FigTree.
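
      The temporal-signal check in step (2) can be illustrated with a minimal sketch. This is not the TreeTime implementation; it only shows the ordinary least-squares regression of root-to-tip distance on sampling date that underlies such a check. The function name and the toy numbers are hypothetical and are not taken from the study.

```python
from statistics import mean


def root_to_tip_regression(dates, distances):
    """Ordinary least-squares fit of root-to-tip distance on sampling date.

    Returns (slope, x_intercept, r_squared). Under a strict clock the slope
    approximates the substitution rate and the x-intercept the root date.
    """
    if len(dates) != len(distances) or len(dates) < 3:
        raise ValueError("need at least three (date, distance) pairs")
    mx, my = mean(dates), mean(distances)
    sxx = sum((x - mx) ** 2 for x in dates)
    syy = sum((y - my) ** 2 for y in distances)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dates, distances))
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    x_intercept = -intercept / slope if slope != 0 else float("nan")
    return slope, x_intercept, r_squared


# Hypothetical toy data: a perfect clock of 1e-6 substitutions/site/year
# with the root placed at the year 1900.
dates = [1950.0, 1980.0, 2000.0, 2020.0]
distances = [(d - 1900.0) * 1e-6 for d in dates]
slope, root_date, r2 = root_to_tip_regression(dates, distances)
```

      A low R² or a negative slope in such a regression would suggest weak temporal signal; real data will scatter around the line far more than this toy example.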

      For the key lineages, L2b and L3b (carrying the resistome, posing antimicrobial resistance (AMR) risks, and exhibiting intercontinental transmission events), we have redrawn Figure 2 based on the updated BEAST analysis results (New Figure 2). For L1, L2a, and L3c, we have added supplementary figures to provide a more detailed visualization of their respective BEAST analysis outcomes (New Supplementary Figures 3-5). The revised BEAST analysis indicates that the origin of L3b in China can be traced back to as early as 1683 (95% CI: 1608 to 1839). In contrast, the earliest possible origin of L2b in China dates back to 1880 (95% CI: 1838 to 1902). This indicates that the previous manuscript's assumption that L2b is an older lineage compared to L3b may be inaccurate. 

      Furthermore, in the revised manuscript, we specifically estimated the time points of the first intercontinental transmission events for the two major lineages, L2b and L3b. Our results indicate that L2b likely underwent two major intercontinental transmission events. The first occurred around 1893 (95% CI: 1870 to 1918), with transmission from China to South America. The second occurred around 1923 (95% CI: 1907 to 1940), involving spread from South America to Europe. In contrast, the transmission pattern of L3b appears more straightforward. Our findings show that L3b, an S. Gallinarum lineage originating in China, underwent only one intercontinental transmission event, from China to Europe, likely occurring around 1790 (95% CI: 1661 to 1890) (New Supplementary Figure 7). Based on the more stringent BEAST analysis for each lineage, we have revised the corresponding conclusions in the manuscript. We believe that the updated BEAST analysis, performed using a more accurate recombination removal approach, significantly enhances the rigor and credibility of our findings.

      (3) Using a distance of 100 SNPs for a transmission is completely arbitrary. This would at least need to be justified in terms of the evolutionary rate and serial interval. 

      Using single nucleotide polymorphism (SNP) distance to trace pathogen transmission is a common approach (J Infect Dis. 2015 Apr 1;211(7):1154-63) that we have also used in our previous studies (hLife 2024; 2(5):246-256. mLife 2024; 3(1):156-160.). When the SNP distance within a cluster falls below a set threshold, the strains in that cluster are considered to have a potential direct transmission link. It is generally accepted that the lower the threshold, the more stringent the screening process becomes. However, there is little agreement in the literature regarding what such a threshold should be, and the appropriate SNP cut-off for inferring transmission likely depends critically on the context (Mol Biol Evol. 2019 Mar 1;36(3):587-603).

      In this study, we compared various thresholds (SNPs = 5, 10, 20, 25, 30, 35, 40, 50, 100) to ensure clustering in an appropriate manner. First, we summarized the tracing results under each threshold (Author response image 4), which demonstrated that, regardless of the threshold used, all strains associated with transmission events originated from the same location (New Figure 3a).

      Author response image 4.

      Clustering results of 45 newly isolated S. Gallinarum strains using different SNP thresholds of 5, 10, 15, 20, 25, 28, 30, 50, and 100 SNPs. The nine subplots represent the clustering results under each threshold. Each point corresponds to an individual strain, and lines connect strains with potential transmission relationships.

      In response to your comments regarding the evolutionary rate, we estimated the overall evolutionary rate of S. Gallinarum using BEAST, following the methodology described by Arthur W. Pightling et al. (Front Microbiol. 2022 Jun 16; 13:797997). The number of SNPs per year was determined by multiplying the evolutionary rate estimated with BEAST by the number of core SNP sites identified in the alignments. We hypothesize that a slower evolutionary rate in bacteria typically requires a lower SNP threshold when tracing transmission events using SNP distance analysis. Pightling et al. found an average evolutionary rate of 1.97 SNPs per year (95% HPD, 0.48 to 4.61) across 22 different Salmonella serotypes. Our updated BEAST estimate for S. Gallinarum is approximately 0.74 SNPs per year (95% HPD, 0.42 to 1.06). Based on these findings, and our previous experience with similar studies (mBio. 2023 Oct 31;14(5):e0133323.), we set a threshold of 5 SNPs in the revised manuscript.
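
      The rate conversion described above can be written out as a short worked sketch. Only the figure of 0.74 SNPs per year and the 5-SNP threshold come from this response; the per-site rate and the number of core SNP sites below are hypothetical placeholders chosen to reproduce that figure. The divergence-time helper additionally assumes that two isolates accumulate SNPs independently along both branches since their common ancestor, which is a common simplification and not a method stated in the response.

```python
def snps_per_year(rate_per_site_per_year, n_core_snp_sites):
    """SNPs/year = clock rate (substitutions/site/year) x number of core SNP sites."""
    return rate_per_site_per_year * n_core_snp_sites


def years_since_common_ancestor(snp_distance, snps_year):
    """Pairwise distance accumulates along two branches, hence the factor of 2.

    This two-branch model is an assumption of the sketch.
    """
    if snps_year <= 0:
        raise ValueError("rate must be positive")
    return snp_distance / (2.0 * snps_year)


# Hypothetical inputs, chosen only so that the product equals 0.74 SNPs/year.
example_rate = snps_per_year(3.7e-4, 2000)

# Divergence time implied by the 5-SNP threshold at 0.74 SNPs/year.
years_at_threshold = years_since_common_ancestor(5, 0.74)
```

      Under these assumptions, a 5-SNP threshold corresponds to roughly 3.4 years of divergence from a shared ancestor, whereas a 100-SNP threshold would span several decades.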

      We then adopted the newly established SNP distance threshold (n=5) to update Figure 3a and New Supplementary Figure 8. The heatmap on the far right of New Figure 3a illustrates the SNP distances among 45 newly isolated S. Gallinarum strains from two locations in Zhejiang Province (Taishun and Yueqing). New Supplementary Figure 8 simulates potential transmission events between the bvSP strains isolated from Zhejiang Province (n=95) and those from China with available provincial information (n=435). These analyses collectively demonstrate the localized transmission pattern of bvSP within China. With the new threshold, the 45 strains isolated from Taishun and Yueqing exhibit a highly localized transmission pattern: every pair of strains with an SNP distance below the threshold came from a single location. Subsequently, we conducted the SNP distance-based tracing analysis for the 95 strains from Zhejiang Province and those from China with available provincial information (n=435) (New Supplementary Figure 8, New Supplementary Table S8). Under the SNP distance threshold (n=5), we identified a total of 91 potential transmission events, all of which occurred exclusively within Zhejiang Province. No inter-provincial transmission events were detected. Based on these findings, we revised the methods and conclusions in the manuscript accordingly. We believe that the updated version addresses your concerns.
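
      The threshold-based tracing described above can be sketched as follows. This is an illustrative reading, not the authors' pipeline: the strain names and distances are hypothetical, and treating "at or below the threshold" as the cut-off is an assumption of the sketch.

```python
from itertools import combinations


def transmission_links(snp_distance, location, threshold=5):
    """Return strain pairs whose SNP distance is at or below the threshold.

    snp_distance: dict mapping frozenset({a, b}) -> pairwise SNP count
    location:     dict mapping strain -> sampling location
    Each result is (strain_a, strain_b, distance, same_location).
    """
    links = []
    for a, b in combinations(sorted(location), 2):
        d = snp_distance[frozenset((a, b))]
        if d <= threshold:  # inclusive cut-off is an assumption
            links.append((a, b, d, location[a] == location[b]))
    return links


# Hypothetical strains and distances, for illustration only.
location = {"s1": "Taishun", "s2": "Taishun", "s3": "Yueqing"}
snp_distance = {
    frozenset(("s1", "s2")): 3,
    frozenset(("s1", "s3")): 41,
    frozenset(("s2", "s3")): 38,
}
links = transmission_links(snp_distance, location)
all_local = all(same_site for _, _, _, same_site in links)
```

      Rerunning the same function with a larger `threshold` shows how quickly the set of putative links grows, which is why the choice of cut-off needs to be tied to the evolutionary rate.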

      Nevertheless, the final revised and updated results do not change the conclusions presented in our original manuscript. Instead, applying a more stringent SNP distance threshold allows us to provide solid evidence supporting the localized transmission pattern of S. Gallinarum in China. 

      (4) The HGT definition is non-standard, and phylogeny (vertical inheritance) is not controlled for.  

      The cited method: 

      'In this study, potentially recently transferred ARGs were defined as those with perfect identity (more than 99% nucleotide identity and 100% coverage) in distinct plasmids in distinct host bacteria using BLASTn (E-value ≤ 10<sup>−5</sup>)' 

      This clearly does not apply here, as the application of distinct hosts and plasmids cannot be used. Subsequent analysis using this method is likely invalid, and some of it (e.g. Figure 6c) is statistically very poor. 

      Thank you for raising this important question. In our study, Horizontal Gene Transfer (HGT) is defined as the transfer of genetic information between different organisms, a process that facilitates the spread of antibiotic resistance genes (ARGs) among bacteria. This definition of HGT is consistent with that used in previous studies (Evol Med Public Health. 2015; 2015(1):193–194; ISME J. 2024 Jan 8;18(1):wrad032). In Salmonella, the transfer of antimicrobial resistance genes via HGT is not solely dependent on plasmids; other mobile genetic elements (MGEs), such as transposons, integrons, and prophages, also play significant roles. This has also been documented in our previous work (mSystems. 2023 Dec 21;8(6):e0088323). Given the involvement of various MGEs in the horizontal transfer of ARGs, we propose that the criteria for evaluating horizontal transfer via plasmids can also be applied to ARGs mediated by other MGEs.

      In this study, we adopted stricter criteria than those used by Xiaolong Wang et al. Specifically, we defined two ARGs as identical only if they exhibited 100% nucleotide identity and 100% coverage. To address concerns regarding the potential influence of vertical inheritance in our analysis, we have made the following improvements. In the revised manuscript, we provide a more detailed table that includes the co-localization analysis of each ARG with mobile genetic elements (New Supplementary Table 9). For prophages and plasmids, we required that ARGs be located directly within these elements. In contrast, for transposons and integrons, we considered ARGs to be associated if they were located within a 5 kb region upstream or downstream of these elements (Nucleic Acids Res. 2022 Jul 5;50(W1):W768-W773). 
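
      The co-localization rule above can be expressed as a small interval check. This is a minimal sketch under stated assumptions, not the authors' code: the record layout (`contig`, `start`, `end`, `type`) is hypothetical, and requiring the whole ARG to lie inside the element (or inside the element extended by 5 kb) is one possible reading of "located within".

```python
FLANK_BP = 5_000  # 5 kb window for transposons and integrons, as stated above


def is_mge_associated(arg, mge):
    """Decide whether an ARG counts as associated with a mobile genetic element.

    arg: dict with 'contig', 'start', 'end'
    mge: dict with 'contig', 'start', 'end', 'type'
    Prophages and plasmids: the ARG must lie inside the element itself.
    Transposons and integrons: the ARG must lie inside the element
    extended by FLANK_BP on each side.
    """
    if arg["contig"] != mge["contig"]:
        return False
    if mge["type"] in ("prophage", "plasmid"):
        flank = 0
    elif mge["type"] in ("transposon", "integron"):
        flank = FLANK_BP
    else:
        raise ValueError(f"unknown MGE type: {mge['type']}")
    return mge["start"] - flank <= arg["start"] and arg["end"] <= mge["end"] + flank


# Hypothetical coordinates: the ARG sits 2.1 kb upstream of the element.
arg = {"contig": "contig_1", "start": 12_000, "end": 12_900}
integron = {"contig": "contig_1", "start": 15_000, "end": 17_000, "type": "integron"}
prophage = {"contig": "contig_1", "start": 15_000, "end": 17_000, "type": "prophage"}
near_integron = is_mge_associated(arg, integron)
in_prophage = is_mge_associated(arg, prophage)
```

      The same coordinates give different answers for the two element types, which reflects the stricter requirement applied to prophages and plasmids.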

      In the revised manuscript, we first categorized a total of 621 ARGs carried by 436 bvSP isolates collected in China according to the aforementioned criteria and found that 415 ARGs were located on MGEs. After excluding the ARGs not associated with MGEs, we recalculated the overall HGT frequency of 10 types of ARGs in China, the horizontal ARGs transfer frequency in three key regions, and the horizontal ARGs transfer frequency within a single region (New Supplementary Table 7). Based on the results, we updated relevant sections of the manuscript and remade Figure 6. The updated manuscript describes the results of this section as follows:

      “Horizontal transfer of resistome occurs widely in localized bvSP

      Horizontal transfer of the resistome facilitates the acquisition of AMR among bacteria, which may record the distinct acquisition event in the bacterial genome. To compare these events in a geographic manner, we further investigated the HGT frequency of each ARG carried by bvSP isolated from China and explored the HGT frequency of resistome between three defined regions. Potentially horizontally transferred ARGs were defined as those with perfect identity (100% identity and 100% coverage) that were located on MGEs across different strains (Fig. 6a). We first categorized a total of 621 ARGs carried by 436 bvSP isolates collected in China and found that 415 ARGs were located on MGEs. After excluding the ARGs not associated with MGEs, our findings reveal that horizontal gene transfer of ARGs is widespread among Chinese bvSP isolates, with an overall transfer rate of 92%. Specifically, 50% of the ARGs exhibited an HGT frequency of 100%, indicating that these ARGs might have undergone extensive and frequent horizontal transfer events (Fig. 6b). It is noteworthy that certain resistance genes, such as tet(A), aph(3'')-Ib, and aph(6)-Id, appear to be less susceptible to horizontal transfer.

      However, different regions generally exhibited a considerable difference in resistome HGT frequency. Overall, bvSP from the southern areas in China showed the highest HGT frequency (HGT frequency=95%). The HGT frequencies for bvSP within the eastern and northern regions of China are lower, at 92% and 91%, respectively (Fig. 6c). For specific ARG types, we found tet(A) is more prone to horizontal transfer in the southern region, and this proportion was considerably lower in the eastern region. Interestingly, certain ARGs, such as aph(6)-Id, undergo horizontal transfer only within the eastern and northern regions of China (Fig. 6d). Notably, as a localized transmission pathogen, resistome carried by bvSP exhibited a dynamic potential among inter-regional and local demographic transmission, especially from northern region to southern region (HGT frequency=93%) (Fig. 6e, Supplementary Table 7).”

      We also modified the current version of the pipeline used to calculate the HGT frequency of resistance genes. In the revised pipeline, users are required to provide a file specifying the locations of the mobilome on the genome before formally calculating the HGT frequency of the target ARGs. The specific code and data used in the calculation have been uploaded to https://github.com/tjiaa/Cal_HGT_Frequency.

      However, we also acknowledge that the current in silico method has some limitations. This approach relies heavily on prior information in existing resistome/mobilome databases. Additionally, the characteristics of second-generation sequencing data make it challenging to locate gene positions precisely. Using complete genome assemblies might be a crucial approach to address this issue effectively. In the revised manuscript, we have also provided a more detailed explanation of the implications of the current pipeline.

      Regarding your second concern, "some of it (e.g., Figure 6c) is statistically very poor," the horizontal ARG transfer frequency calculation for each region was based on the proportion of horizontal transfer events of ARGs in that region to the total possible transfer events. As a result, we are unable to calculate the statistical significance between the two regions. Our aim with this approach is to provide a rough estimate of the extent of horizontal ARG transfer within the S. Gallinarum population in each region. In future studies, we will refine our conclusions by developing a broader range of evaluation methods to ensure more comprehensive assessment and validation.
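
      The proportion described above can be illustrated with a minimal sketch. It shows one plausible reading of "transfer events over total possible transfer events" and should not be taken as the logic of the Cal_HGT_Frequency pipeline, which may differ. The assumption here is that every pair of distinct strains carrying the same MGE-associated ARG is a possible event, and that a pair counts as a transfer event when the two ARG sequences are identical (100% identity over 100% of the length). Strain names and sequences are hypothetical.

```python
from itertools import combinations


def hgt_frequency(records):
    """Fraction of strain pairs that share an identical copy of one ARG.

    records: list of (strain_id, arg_sequence) for a single ARG type,
             already restricted to MGE-associated copies.
    Returns None when no pair of distinct strains can be formed.
    """
    pairs = [(a, b) for a, b in combinations(records, 2) if a[0] != b[0]]
    if not pairs:
        return None
    identical = sum(1 for a, b in pairs if a[1] == b[1])
    return identical / len(pairs)


# Hypothetical records: two strains share an identical copy, one differs.
records = [("strainA", "ATGGCT"), ("strainB", "ATGGCT"), ("strainC", "ATGGCA")]
freq = hgt_frequency(records)
```

      Because the result is a single proportion per region, it carries no sampling distribution of its own; a resampling scheme such as a bootstrap over strains would be one way to attach uncertainty to it.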

      (5) Associations between lineages, resistome, mobilome, etc do not control for the effect of genetic background/phylogeny. So e.g. the claim 'the resistome also demonstrated a lineage-preferential distribution' is not well-supported. 

      Thank you for your comments. We acknowledge that the associations between lineages and the mobilome/resistome may be influenced by the genetic background or phylogeny of the strains. For instance, our conclusion regarding the lineage-preferential distribution of the resistome was primarily based on New Figure 4a, where L3 is clearly shown to carry the most ARGs. Furthermore, we observed that L3b tends to harbor bla<sub>TEM-1B</sub>, sul2, and tet(A) more frequently than other lineages. However, we recognize that this evidence is insufficient to support a definitive conclusion of “demonstrated a lineage-preferential distribution”. Therefore, we have re-examined the current manuscript and described these findings as a potential association between the mobilome/resistome and lineages.

      (6) The invasiveness index is not well described, and the difference in means is not biologically convincing as although it appears significant, it is very small. 

      Thank you for pointing this out. For the invasiveness index mentioned in the manuscript, we used the method described in previous studies (PLoS Genet. 2018 May 8;14(5), Nat Microbiol. 2021 Mar;6(3):327-338). Specifically, Salmonella’s ability to cause intestinal or extraintestinal infections in hosts is related to the degree of genome degradation. We evaluated the potential for extraintestinal infection by 45 newly isolated S. Gallinarum strains (L2b and L3b) using a model that quantitatively assesses genome degradation. We analyzed samples using the 196 top predictor genes, employing a machine-learning approach that utilizes a random forest classifier and delta-bitscore functional variant-calling. This method evaluated the invasiveness of S. Gallinarum towards the host, and the distribution of invasiveness index values for each region was statistically tested using an unpaired t-test. The code used for calculating the invasiveness index is available at https://github.com/Gardner-BinfLab/invasive_salmonella. In the revised manuscript, we added a more detailed description of the invasiveness index calculation in the Methods section as follows:

      Lines 592-603: “Specifically, Salmonella’s ability to cause intestinal or extraintestinal infections in hosts is related to the degree of genome degradation. We evaluated the potential for extraintestinal infection by 45 newly isolated S. Gallinarum strains (L2b and L3b) using a model that quantitatively assesses genome degradation. We analyzed each sample using the 196 top predictor genes for measuring the invasiveness of S. Gallinarum, employing a machine-learning approach that utilizes a random forest classifier and delta-bitscore functional variant-calling. This method evaluated the invasiveness of S. Gallinarum towards the host, and the distribution of invasiveness index values for each region was statistically tested using an unpaired t-test. The code used for calculating the invasiveness index is available at: https://github.com/Gardner-BinfLab/invasive_salmonella.”
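
      The statistical comparison mentioned above can be sketched as follows, together with an effect size that speaks to whether a significant difference is also large. The response only states that an unpaired t-test was used; the pooled-variance (Student) form and the use of Cohen's d are assumptions of this sketch, and the index values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def unpaired_t(sample_a, sample_b):
    """Student's unpaired t statistic (pooled variance) and Cohen's d.

    Returns (t_statistic, degrees_of_freedom, cohens_d).
    """
    na, nb = len(sample_a), len(sample_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two observations")
    pooled_var = (
        (na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)
    ) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero")
    diff = mean(sample_a) - mean(sample_b)
    t_stat = diff / sqrt(pooled_var * (1 / na + 1 / nb))
    cohens_d = diff / sqrt(pooled_var)
    return t_stat, na + nb - 2, cohens_d


# Hypothetical invasiveness index values for two groups of isolates.
group_a = [0.52, 0.54, 0.53, 0.55]
group_b = [0.50, 0.51, 0.52, 0.51]
t_stat, dof, effect = unpaired_t(group_a, group_b)
```

      Reporting the effect size alongside the test statistic separates "statistically detectable" from "large relative to within-group spread"; it does not by itself establish biological relevance.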

      Regarding the second question, 'the difference in means is not biologically convincing as although it appears significant, it is very small,' we believe that this difference is biologically meaningful. In our previous work, we infected chicken embryos with different lineages of S. Gallinarum (Natl Sci Rev. 2023 Sep 2;10(10):nwad228). The virulence of thirteen strains of Salmonella Gallinarum, comprising five from lineage L2b and eight from lineage L3b, was evaluated in 16-day-old SPF chicken embryos through inoculation into the allantoic cavity. Controls were embryos inoculated with phosphate-buffered saline (PBS). The embryos were incubated in a thermostatic incubator maintained at 37.5°C with a relative humidity ranging from 50% to 60%. Prior to inoculation, the viability of the embryos was assessed by examining the integrity of their venous system and their movements; any dead embryos were excluded from the study. Overnight cultures resuspended in PBS at a concentration of 1000 CFU per 100 μL were administered to the embryos. Mortality was recorded daily for a period of five days, concluding upon the hatching of the chicks. 

      It is generally accepted that strains with higher invasive capabilities are more likely to cause chicken embryo mortality. Our experimental results showed that L2b, which exhibits higher invasiveness, was slightly more likely to cause chicken embryo death (Author response image 5). 

      Author response image 5.

      The survival curves of chicken embryos infected with bvSP isolates from S. Gallinarum L2b and S. Gallinarum L3b. Embryos inoculated with phosphate-buffered saline (PBS) served as controls. 

      (7) 'In more detail, both the resistome and mobilome exhibited a steady decline until the 1980s, followed by a consistent increase from the 1980s to the 2010s. However, after the 2010s, a subsequent decrease was identified.' 

      Where is the data/plot to support this? Is it a significant change? Is this due to sampling or phylogenetics? 

      Thank you for highlighting these critical points. The description in this statement is based on New Supplementary Figure 11. On the right side of New Supplementary Figure 11, we presented the average number of Antimicrobial Resistance Genes (ARGs) and Mobile Genetic Elements (MGEs) carried by S. Gallinarum isolates from different years, and we described the overall trend across these years. However, we realized that this statement might overinterpret the data. Given that this sentence does not impact our emphasis on the overall increasing trends observed in the resistome and mobilome, as well as their potential association, we decided to remove it in the revised manuscript.

      The revised paragraph would read as follows:

      Lines 261-268: “Variations in regional antimicrobial use may result in uneven pressure for selecting AMR. The mobilome is considered the primary reservoir for spreading resistome, and a consistent trend between the resistome and the mobilome has been observed across different lineages, from L1-L3c. We observed an overall gradual rise in the resistome quantity carried by bvSP across various lineages, correlating with the total mobilome content (S11 Fig). Furthermore, we investigated the interplay between particular mobile elements and resistome types in bvSP.”

      (8) It is not clear what the burden of disease this pathogen causes in the population, or how significant it is to agricultural policy. The article claims to 'provide valuable insights for targeted policy interventions.', but no such interventions are described. 

      Thank you for your constructive suggestions. Salmonella Gallinarum is an avian-specific pathogen that induces fowl typhoid, a severe systemic disease characterized by high mortality rates in chickens, thereby posing a significant threat to the poultry industry, particularly in developing countries (Rev Sci Tech. 2000 Aug;19(2):405-24). In our previous research, we conducted a comprehensive meta-analysis of 201 publications encompassing over 900 million samples to investigate the global impact of S. Gallinarum (Sci Data. 2022 Aug 13;9(1):495). Our findings estimated that the global prevalence of S. Gallinarum is 8.54% (with a 95% confidence interval of 8.43% to 8.65%), with notable regional variations in incidence rates.

      Our previously analysis focused on the prevalence of S. Gallinarum (including biovars SP and SG) across six continents. The results revealed that all continents, except Oceania, exhibited positive prevalences of S. Gallinarum. Asia had the highest prevalence at 17.31%, closely followed by Europe at 16.03%. In Asia, the prevalence of biovar SP was higher than that of biovar SG, whereas in Europe, biovar SG was observed to be approximately two hundred times more prevalent than biovar SP. In South America, the prevalence of S. Gallinarum was higher than that of biovar SP, at 10.06% and 13.20% respectively. Conversely, the prevalence of S. Gallinarum was relatively lower in North America (4.45%) compared to Africa (1.10%) (Author response image 6).

      Given the significant economic losses caused by S. Gallinarum to the poultry industry and the potential risk of escalating antimicrobial resistance, more targeted policy interventions are urgently needed. Further elaboration on this implication is provided in the revised “Discussion” section as follows:

      Lines 401-416: “In summary, the findings of this study highlight that S. Gallinarum remains a significant concern in developing countries, particularly in China. Compared to other regions, S. Gallinarum in China poses a notably higher risk of AMR, necessitating the development of additional therapies, e.g., vaccines, probiotics, and bacteriophage therapy, in response to the government's policy aimed at reducing antimicrobial use (J Infect Dev Ctries. 2014 Feb 13;8(2):129-36). Furthermore, given the dynamic nature of S. Gallinarum risks across different regions, it is crucial to prioritize continuous monitoring in key areas, particularly in China's southern regions, where extensive poultry farming is located. Lastly, from a One-Health perspective, controlling AMR in S. Gallinarum should not solely focus on local farming environments, with improved overall welfare on poultry and farming style. The breeding pyramid of industrialized poultry production should be targeted on the top, with enhanced and accurate detection techniques (mSphere. 2024 Jul 30;9(7):e0036224). More importantly, comprehensive efforts should be made to reduce antimicrobial usage overall and mitigate potential AMR transmission from environmental sources or other hosts (Vaccines (Basel). 2024 Sep 18;12(9):1067; Vaccines (Basel). 2023 Apr 18;11(4):865; Front Immunol. 2022 Aug 11;13:973224).”

      Author response image 6.

      A comparison of the global prevalence of S. Gallinarum across continents.

      (9) The abstract mentions stepwise evolution as a main aim, but no results refer to this. 

      Thank you for raising this issue. In the revised manuscript, we have changed “stepwise evolution” to simply “evolution” to ensure a more accurate and precise description.

      (10) The authors attribute changes in population dynamics to normalisation in China-EU relations and hen fever. However, even if the date is correct, this is not a strongly supported causal claim, as many other reasons are also possible (for example other industrial processes which may have changed during this period). 

      Thank you for raising this critical issue. In the revised manuscript, we conducted a more stringent BEAST analysis for each lineage, as described earlier. This led to some changes in the inferred evolutionary timelines. Consequently, we have removed the corresponding statement from the “Results” section. Instead, we now only provide a discussion of historical events, supported by literature, that could have facilitated the intercontinental spread of L2b and L3b in the “Discussion” section. We believe these revisions have made the manuscript more rigorous and precise.

      Lines 332-342: “The biovar types of S. Gallinarum have been well-defined as bvSP, bvSG, and bvSD historically (J Vet Med B Infect Dis Vet Public Health. 2005 Jun;52(5):2148). Among these, bvSP can be further subdivided into five lineages (L1, L2a, L2b, L3b, and L3c) using hierarchical Bayesian analysis. Different sublineages exhibited preferential geographic distribution, with L2b and L3b of bvSP being predominant global lineage types with a high risk of AMR. The historical geographical transmission was verified using a spatiotemporal Bayesian framework. The result shows that L3b was initially spread from China to Europe in the 18<sup>th</sup>-19<sup>th</sup> century, which may be associated with the European hen fever event in the mid-19th century (Burnham GP. 1855. The history of the hen fever: a humorous record). L2b, on the other hand, appears to have spread to Europe via South America, potentially contributing to the prevalence of bvSP in the United States.”  

      (11) No acknowledgment of potential undersampling outside of China is made, for example, 'Notably, all bvSP isolates from Asia were exclusively found in China, which can be manually divided into three distinct regions (southern, eastern, and northern).'.

      Perhaps we just haven't looked in other places?

      We appreciate the reviewer's observation regarding the sampling distribution of isolates in this study. We acknowledge that while the isolates were collected from 15 different countries, a significant proportion originated from China (Author response image 1). This focus is due to several reasons:

      (1) S. Gallinarum was a globally prevalent pathogen throughout the 20th century and was listed by the World Organization for Animal Health (WOAH) due to its economic importance. After 30 years of implementing the National Poultry Improvement Plan in the US, it was almost eradicated in high-income countries; interestingly, it became an endemic pathogen with sporadic outbreaks in most low- or middle-income countries, such as China and Brazil. Given the vast expanse of China's land area and the country's economic factors, implementing the same measures remains a challenging endeavour. 

      (2) S. Gallinarum is an avian-specific pathogen, particularly affecting chickens, and its distribution is closely linked to chicken meat production in different countries. In some high chicken-producing developing countries, such as China and Brazil, there are more frequent reports of fowl typhoid. Data from the United States Department of Agriculture (USDA) on annual chicken meat production for 2023/2024 show that the global distribution of S. Gallinarum aligns closely with the overall chicken meat production of these countries (https://fas.usda.gov/data/production/commodity/0115000).  

      (3) Our primary objective was to investigate the localized resistome adaptation of S. Gallinarum in regions. Being a region with significant disease burden, China has reported numerous outbreaks (Sci Data. 2022 Aug 13;9(1):495; Sci Data. 2024 Feb 27;11(1):244) and a high AMR prevalence of this serovar (Natl Sci Rev. 2023 Sep 2;10(10):nwad228; mSystems. 2023 Dec 21;8(6):e0088323), making it an excellent example for understanding localized resistance mechanisms. 

      Nevertheless, a search of nearly a decade of literature on PubMed and a summary of the S. Gallinarum genomes available in public databases indicate that the dataset used is the most complete. Furthermore, focusing on a specific region within China allowed us to conduct a detailed and thorough analysis. However, we fully agree that expanding the study to include more isolates from other countries would enhance the generalizability of our findings, and we are actively collecting additional S. Gallinarum genome data. In the revised manuscript, we modified this sentence to indicate that this phenomenon is only observed in the current dataset, thereby avoiding an overly absolute statement:

      Lines 131-135: “For the bvSP strains from Asia included in our dataset, we found that all originated from China. To further investigate the distribution of bvSP across different regions in China, we categorized them into three distinct regions: southern, eastern, and northern (Supplementary Table 3)”.

      (12) Many of the conclusions are highly speculative and not supported by the data. 

      Thank you for your comment. We have carefully revised the manuscript to address your concerns. We hope that the changes made in the revised version meet your expectations and provide a clearer and more accurate interpretation of our findings.

      (13) The figures are not always the best presentation of the data: 

      a. Stacked bar plots in Figure 1 are hard to interpret, the total numbers need to be shown.

      Panel C conveys little information. 

      b. Figure 4B: stacked bars are hard to read and do not show totals. 

      c. Figure 5 has no obvious interpretation or significance. 

      Thank you for your comments. We have revised the figures to improve the clarity and presentation of the data.

      In summary, the quality of analysis is poor and likely flawed (although there is not always enough information on methods present to confidently assess this or provide recommendations for how it might be improved). So, the stated conclusions are not supported. 

      Thank you for your valuable feedback. We have carefully revised the manuscript to address your concerns. We hope that the updated figures and tables and the new data in the revised version meet your expectations and provide a more appropriate interpretation of our findings.

      Recommendations for the authors:  

      Reviewer #1 (Recommendations for the authors): 

      This reviewer enjoyed reading this well-written manuscript. The authors are encouraged to address the following comments and revise the manuscript accordingly. 

      (1) Title: The authors use avian-restrict Salmonella to refer to Salmonella Gallinarum. Please consider using Salmonella Gallinarum in the title. Also, your analysis relates to resistome and mobilome. Would it make sense to add mobilome in the manuscript? 

      Thank you for your guidance. In the revised manuscript, we have changed the title to “Avian-specific Salmonella enterica Serovar Gallinarum transition to endemicity is accompanied by localized resistome and mobilome interaction”. We believe that this revised title more accurately reflects the content of our study.

      (2) Abstract: This study uses 45 isolates from your labs. However, you failed to include these 45 isolates in the Abstract. Also, please clarify the sources of these isolates (from dead chickens, or dead chicken embryos? You wrote in two different ways in this manuscript). Also, I am not entirely convinced how the results from these 45 isolates will support the overall conclusion of this work. 

      Thank you for your thorough review and constructive comments on the manuscript. In the revised version, we have added a description of 45 newly isolated S. Gallinarum strains in the Abstract to provide readers with a clearer understanding of the dataset used in this study.

      Lines 36-41: “Using the most comprehensive whole-genome sequencing dataset of Salmonella enterica serovar Gallinarum (S. Gallinarum) collected from 16 countries, including 45 newly recovered samples from two related local regions, we established the relationship among avian-specific pathogen genetic profiles and localization patterns.”

      Furthermore, the newly isolated S. Gallinarum strains were obtained from dead chicken embryos. We think your second concern may arise from the following description in the manuscript: “All 734 samples of dead chicken embryos were collected from Taishun and Yueqing in Zhejiang Province, China. After the thorough autopsy, the liver, intestines, and spleen were extracted and added separately into 2 mL centrifuge tubes containing 1 mL PBS. The organs were then homogenized by grinding.” In fact, all the collected dead chicken embryos were aged 19 to 20 days. At this developmental stage, collecting the liver, intestines, and spleen for isolation and cultivation of S. Gallinarum is possible. To avoid any confusion, we have included a more detailed description of the dead chicken embryos in the revised manuscript as follows:

      Lines 447-451: “All 734 samples of dead chicken embryos aged 19 to 20 days were collected from Taishun and Yueqing in Zhejiang Province, China. After a thorough autopsy, the liver, intestines, and spleen were extracted and added separately into 2 mL centrifuge tubes containing 1 mL PBS. The organs were then homogenized by grinding.”

      Regarding your concern about the statement, “I am not entirely convinced how the results from these 45 isolates will support the overall conclusion of this work,” we would like to clarify the significance of these new isolates. Our research first identified distinct characteristics in the 45 newly isolated S. Gallinarum strains from Taishun and Yueqing, Zhejiang Province. Specifically, we found that most of the strains from Yueqing belonged to sequence type ST92, whereas the majority from Taishun were ST3717. Additionally, there were significant differences between these geographically close strains in terms of SNP distance and predicted invasion capabilities. These findings suggest that S. Gallinarum may exhibit localized transmission patterns, which forms the basis of the scientific question and hypothesis we originally aimed to address. Furthermore, in our previous work, we collected 325 S. Gallinarum strains. By incorporating the newly isolated 45 strains, we aim to provide a more comprehensive view of the population diversity, transmission pattern and potential risk of S. Gallinarum. We will continue to endeavour to understand the global genomic and population diversity in this field.

      Finally, we revised the sentences that could potentially raise concerns for readers: 

      Lines 175-177: “To investigate the dissemination pattern of bvSP in China, we obtained forty-five newly isolated bvSP from 734 samples (6.1% overall isolation rate) collected from diseased chickens at two farms in Yueqing and Taishun, Zhejiang Province.”  >  “To investigate the dissemination pattern of bvSP, we obtained forty-five newly isolated bvSP from 734 samples (6.1% overall isolation rate) collected from diseased chickens at two farms in Yueqing and Taishun, Zhejiang Province.”

      (3) The manuscript uses nomenclature and classification into different sublineages. Did the authors establish the approaches for defining these sublineages in this group or did you follow the accepted standards? 

      Thank you very much for raising this important issue. The biovar types of Salmonella Gallinarum have historically been well-defined as S. Gallinarum biovar Pullorum (bvSP), S. Gallinarum biovar Gallinarum (bvSG), and S. Gallinarum biovar Duisburg (bvSD) (J Vet Med B Infect Dis Vet Public Health. 2005 Jun;52(5):214-8). However, there seems to be no widespread consensus on the population nomenclature for the key biovar bvSP. In a previous study, Zhou et al. classified bvSP into six lineages: L1, L2a, L2b, L3a, L3b, and L3c (Natl Sci Rev. 2023 Sep 2;10(10):nwad228). However, our more comprehensive analysis of S. Gallinarum using a larger dataset and hierarchical Bayesian clustering revealed that L3a, previously considered a distinct lineage, is actually a sublineage of L3c. Upon further review of our initial manuscript, we realized that the original submission did not strictly follow the lineage order proposed by Zhou et al. To avoid confusion in the typing system, we have adjusted the lineage nomenclature in the revised manuscript to reflect the corrected order (see Author response table 1).

      (4) This reviewer is convinced with the analysis approaches and conclusion of this work.

      In the meantime, the authors are encouraged to discuss the application of the conclusion of this study: a) can the data be somehow used in the prediction model? b) would the conclusion from S. Gallinarum have generalized application values for other pathogens. 

      Thank you for your constructive comments on the manuscript. 

      a) can the data be somehow used in the prediction model?

      We believe that genomic data can be effectively used for constructing prediction models; however, the success of such models largely depends on the specific traits being predicted. In this study, we utilized a random forest prediction model based on 196 top genes (PLoS Genet. 2018 May 8;14(5)) to predict the invasiveness of 45 newly isolated strains. In relation to the antimicrobial resistance (AMR) issue discussed in this paper, we also conducted relevant analyses. For instance, we explored the use of image-based models to predict whether a genome is resistant to specific antibiotics (Comput Struct Biotechnol J. 2023 Dec 29:23:559-565). We are confident that the incorporation of newly generated data will facilitate the development of future predictive models, and we plan to pursue further research in this area.

      b) would the conclusion from S. Gallinarum have generalized application values for other pathogens.

      This can be addressed from two perspectives. First, the key role of the mobilome in facilitating the spread of the resistome, as emphasized in this study, has also been confirmed in research on other pathogens (mBio. 2024 Oct 16;15(10):e0242824). Thus, we believe that the pipeline we developed to assess the horizontal transfer frequency of different resistance genes across regions applies to various pathogens. On the other hand, due to distinct evolutionary histories, different pathogens exhibit varying levels of adaptation to their environments. In this study, we found that S. Gallinarum tends to spread in a highly localized manner; however, this conclusion may not necessarily hold for other pathogens.

      Reviewer #2 (Recommendations for the authors): 

      The authors would need to: 

      (1) Address my concerns about genomic analyses listed in the public review. 

      Thank you for your valuable feedback. We have carefully reviewed your concerns and made the necessary revisions to address the points raised about genomic analyses in the public review. We sincerely hope that these modifications meet your expectations and provide a more robust analysis. We appreciate your thoughtful input and remain open to further suggestions to improve the manuscript.

      (2) Add more detail on the genomic methods and their outputs, as suggested above. 

      We have added further details to clarify the methodologies and outputs as mentioned above. Specifically, we expanded the description of the data processing and the bioinformatic tools used for analysis. To ensure clarity, we also included an expanded discussion of the key outputs, highlighting their implications. We hope these revisions meet your expectations.

      (3) Critically rewrite their introduction to make it clear what problem they are trying to address. 

      Thank you for your guidance. In the revised manuscript, we have made the necessary modifications to the Introduction section to more clearly articulate the problem we aim to address.

      (4) Critically rewrite their conclusions so they are supported by the data they present, and make it clear when claims are more speculative. 

      Thank you for your guidance. In the revised manuscript, we have made the recommended modifications to the relevant sections of the conclusion as outlined above.

      More minor issues I identified: 

      (1) Typo in the title 'avian-restrict'. 

      Done.

      Line 1: “Avian-specific Salmonella enterica Serovar Gallinarum transition to endemicity is accompanied by localized resistome and mobilome interaction.”

      (2) 'By utilizing the pipeline we developed' -- a pipeline has not been introduced at this point. 

      In the revised manuscript, we have removed this section from the 'Abstract'.

      Lines 46-48: “Notably, the mobilome-resistome combination among distinct lineages exhibits a geographical-specific manner, further supporting a localized endemic mobilome-driven process.”

      (3) 'has more than 90% serovars' -- doesn't make sense. 

      Revised.

      Lines 82-83: “Salmonella, a pathogen with distinct geographical characteristics, has more than 90% of its serovars frequently categorized as geo-serotypes.”

      (4) 'horrific mortality rates that remain a disproportionate burden'. 

      Revised.

      Lines 83-87: “Among the thousands of geo-serotypes, Salmonella enterica Serovar Gallinarum (S. Gallinarum) is an avian-specific pathogen that causes severe mortality, with particularly detrimental effects on the poultry industry in low- and middle-income countries.”

      (5) What is the rate, what is a comparison, how is it disproportionate? 

      Thank you for your valuable feedback. It is challenging to accurately estimate the specific prevalence of S. Gallinarum, particularly due to the lack of comprehensive data in many countries. Numerous cases likely go unreported. However, S. Gallinarum is more commonly detected in low- and middle-income countries. Here, we provide three lines of evidence supporting this observation. First, in our previous research, we conducted a comprehensive meta-analysis of 201 studies, involving over 900 million samples, to evaluate the global impact of S. Gallinarum (Sci Data. 2022 Aug 13;9(1):495). The estimated prevalence in 17 countries showed that Bangladesh had the highest rate (25.75%) of S. Gallinarum infections. However, for biovar Pullorum (bvSP), Argentina (20.69%) and China (18.18%) reported the highest prevalence rates. Second, previous studies have also reported that S. Gallinarum predominantly occurs in low- and middle-income countries (Vet Microbiol. 2019 Jan:228:165-172; BMC Microbiol. 2024 Oct 18;24(1):414). Finally, S. Gallinarum was once a globally prevalent pathogen in the 20th century. Following the implementation of eradication programs in most high-income countries, it was listed by the World Organization for Animal Health and subsequently became an endemic pathogen with sporadic outbreaks. However, similar eradication efforts are challenging to implement in low- and middle-income countries, leading to a disproportionately higher incidence of S. Gallinarum in these regions.

      In the revised manuscript, we have rephrased this sentence to enhance its accuracy:

      Lines 83-87: “Among the thousands of geo-serotypes, Salmonella enterica serovar Gallinarum (S. Gallinarum) is an avian-specific pathogen that causes severe mortality, with particularly detrimental effects on the poultry industry in low- and middle-income countries.”

      (6) 'we collected the most comprehensive set of 580 S. Gallinarum isolates', -> 'we collected the most comprehensive set S. Gallinarum isolates, consisting of 580 genomes'. 

      Revised.

      Lines 97-100: “To fill the gaps in understanding the evolution of S. Gallinarum under regional-associated AMR pressures and its adaptation to endemicity, we collected the most comprehensive set S. Gallinarum isolates, consisting of 580 genomes, spanning the period from 1920 to 2023.” 

      (7) Sequence reads are not available, and use a non-standard database. The eLife policy states: 'Sequence reads and assembly must be included for reference genomes, while novel short sequences, including epitopes, functional domains, genetic markers and haplotypes should be deposited, together with surrounding sequences, into Genbank, DNA Data Bank of Japan (DDBJ), or EMBL Nucleotide Sequence Database (ENA). DNA and RNA sequencing data should be deposited in NCBI Trace Archive or NCBI Sequence Read Archive (SRA).' So the sequences assemblies and reads should ideally be mirrored appropriately. 

      Thank you for your valuable suggestion regarding submitting the genome data for the newly isolated 45 S. Gallinarum strains. The genome data have been deposited in the NCBI Sequence Read Archive (SRA) under two BioProjects. The “SRA Accession number” for each strain has been added to New Supplementary Table 1. We believe this will ensure that the data are more readily accessible to a broader audience of researchers for download and analysis. We have revised the corresponding paragraph in the manuscript as follows:

      Lines 606-608: “For the newly isolated 45 strains of Salmonella Gallinarum, genome data have been deposited in NCBI Sequence Read Archive (SRA) database. The “SRA Accession” for each strain are listed in Supplementary Table 1.”

      (8) You should state at the start of the results which data is public, and how much is newly sequenced. 

      Revised.

      Lines 109-112: “To understand the global geographic distribution and genetic relationships of S. Gallinarum, we assembled the most comprehensive S. Gallinarum WGS dataset (n=580), comprising 535 publicly available genomes and 45 newly sequenced genomes.”

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Chan et al. tried identifying the binding sites or pockets for the KCNQ1-KCNE1 activator mefenamic acid. Because the KCNQ1-KCNE1 channel is responsible for cardiac repolarization, genetic impairment of either the KCNQ1 or KCNE1 gene can cause cardiac arrhythmia. Therefore, the development of activators without side effects is highly demanded. Because the binding of mefenamic acid requires both KCNQ1 and KCNE1 subunits, the authors performed drug docking simulation by using KCNQ1-KCNE3 structural model (because this is the only available KCNQ1-KCNE structure) with substitution of the extracellular five amino acids (R53-Y58) into D39-A44 of KCNE1. That could be a limitation of the work because the binding mode of KCNE1 might differ from that of KCNE3. Still, they successfully identified some critical amino acid residues, including W323 of KCNQ1 and K41 and A44 of KCNE1. They subsequently tested these identified amino acid residues by analyzing the point mutants and confirmed that they attenuated the effects of the activator. They also examined another activator, yet structurally different DIDS, and reported that DIDS and mefenamic acid share the binding pocket, and they concluded that the extracellular region composed of S1, S6, and KCNE1 is a generic binding pocket for the IKS activators.

      The data are solid and well support their conclusions, although there are a few concerns regarding the choice of mutants for analysis and data presentation.

      Other comments:

      1. One of the limitations of this work is that they used psKCNE1 (mostly KCNE3), not real KCNE1, as written above. It is also noted that KCNQ1-KCNE3 is in the open state. Unbinding may be facilitated in the closed state, although evaluating that in the current work is difficult.

      We agree that it is difficult to evaluate the role of unbinding from our model. Our data showing that longer interpulse intervals have a normalizing effect on the GV curve (Figure 3-figure supplement 2) could be interpreted to suggest that unbinding occurs in the closed state. Alternatively, the slowing of deactivation caused by S1-S6 interactions and facilitated by the activators may effectively be exceeded at the longer interpulse intervals.

      2. According to Figure 2-figure supplement 2, some amino acid residues (S298 and A300) of the turret might be involved in the binding of mefenamic acid. On the other hand, Q147 showing a comparable delta G value to S298 and A300 was picked for mutant analysis. What are the criteria for the following electrophysiological study?

      EP experiments interrogated selected residues with significant contributions to mefenamic acid and DIDS coordination as revealed by the MM/GBSA and MM/PBSA methods. A300 was identified as potentially important. We did attempt A300C but were never able to get adequate expression for analysis.

      3. It is an interesting speculation that K41C and W323A stabilize the extracellular region of KCNE1 and might increase the binding efficacy of mefenamic acid. Is it also the case for DIDS? K41 may not be critical for DIDS, however.

      Yes, we found that K41 was less critical to the binding/action of DIDS than to that of Mef. In electrophysiological experiments with the K41C mutation, DIDS induced a leftward GV shift (~ -25 mV) whereas the normalized response was statistically non-significant. In MD simulation studies, we observed detachment of DIDS from K41C-IKs in only 3 of 8 simulations. This is in contrast to Mef, where the drug left the binding site of the K41C-IKs complex in all simulations.

      4. Same to #2, why was the pore turret (S298-A300) not examined in Figure 7?

      Again, we attempted A300C but could not get high enough expression.

      Reviewer #3 (Public Review):

      Weaknesses:

      1. The computational aspect of the work is rather under-sampled - Figure 2 and Figure 4. The lack of quantitative analysis on the molecular dynamic simulation studies is striking, as only a video of a single representative replica is being shown per mutant/drug. Given that the simulations shown in the video are extremely short; some video only lasts up to 80 ns. Could the author provide longer simulations in each simulation condition (at least to 500 ns or until a stable binding pose is obtained in case the ligand does not leave the binding site), at least with three replicates per each condition? If not able to extend the length of the simulations due to resources issue, then further quantitative analysis should be conducted to prove that all simulations are converged and are sufficient. Please see the rest of the quantitative analysis in other comments.

      We provide more quantitative analysis for the existing MD simulations and ran five additional simulations of 500 ns duration with the channel embedded in a POPC lipid membrane. For the new MD simulations, we used a different force field in order to minimize ambiguity related to force fields as well. Analysis of these data has led to new data and supplemental figures regarding RMSD of ligands during the simulations (Figure 4-figure supplement 1 and Figure 6-figure supplement 3), clustering of MD trajectories based on Mef conformation (Figure 2-figure supplement 3 and Figure 6-figure supplement 2), and H-bond formation over the simulations (Figure 2-figure supplement 4 and Figure 6-figure supplement 1). We have edited the manuscript to include this new information where appropriate.

      2. Given that the protein is a tetramer, at least 12 datasets could have been curated to improve the statistic. It was also unclear how frequently the frames from the simulations were taken in order to calculate the PBSA/GBSA.

      By using one ligand for each ps-IKs channel complex we tried to keep the molecular system and corresponding analysis as simple as possible. Our initial results showed that 4D docking and subsequent MD simulations with only one ligand bound to ps-IKs were complicated enough. Our attempts to dock 4 ligands simultaneously and analyze the properties of such a system were ineffective due to difficulties in: i) obtaining stable complexes during conformational sampling and 4D docking procedures, since the ligand interaction covers a region including three protein chains with dynamic properties, ii) possible changes of receptor conformation properties at three other subunits when one ligand is already occupying its site, iii) marked diversity of the binding poses of the ligand, as cluster analysis of the ligand-channel complex shows (Figure 2-figure supplement 3).

      We have added a line in the methods to clarify the use of only one ligand per channel complex in simulations.

      In order to calculate MMPBSA/MMGBSA we used a frame every 0.3 ns throughout the 300 ns simulation (1000 frames/simulation) or during the time the ligand remained bound. We have clarified this in the Methods.
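
      As a minimal sketch of the sampling described above (not the code used in the study), the frame times can be enumerated as follows. The function name `frame_times` and the convention that the first frame is taken at 0.3 ns are assumptions made for illustration; only the 0.3 ns stride, the 300 ns length, and the truncation at unbinding come from the response.

```python
# Hypothetical helper: times (in ns) of the frames used for MM/PBSA and MM/GBSA.
# Assumption: frames are taken at stride, 2*stride, ... up to the end point.
def frame_times(sim_length_ns=300.0, stride_ns=0.3, bound_until_ns=None):
    """Return sampled frame times, truncated when the ligand unbinds."""
    end = sim_length_ns
    if bound_until_ns is not None:
        end = min(sim_length_ns, bound_until_ns)
    n_frames = int(round(end / stride_ns))
    return [round((i + 1) * stride_ns, 6) for i in range(n_frames)]


full = frame_times()                          # ligand bound for the whole run
truncated = frame_times(bound_until_ns=25.0)  # ligand leaves at 25 ns
print(len(full), len(truncated))
```

      With these assumptions a full 300 ns run yields 1000 frames, matching the figure quoted above, while a run in which the ligand detaches early contributes proportionally fewer frames.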

      3. The lack of labels on several structures is rather unhelpful (Figure 2B, 2C, 4B). The lack of clarity of the interaction map in Figures 2D and 6A.

      We updated figures considering the reviewer's comments and added labels. For 2D interaction maps, we provided additional information in figure legends to improve clarity.

      4. The RMSF analysis is rather unclear and unlabelled thoroughly. In fact, I still don't quite understand why n = 3, given that the protein is a tetramer. If only one out of four were docked and studied, this rationale needs to be explained and accounted for in the manuscript.

      The rationale of conducting MD simulations with one ligand bound to IKs is explained in response to point 2 of the reviewer’s comments.

      RMSF analysis in Figure 4C-E was calculated using the chain to which Mef was docked but after Mef had left the binding site. Details were added to the methods.

      5. For the condition that the ligands suppose to leave the site (K42C for Mef and Y46A for DIDS), can you please provide simulations at a sufficient length of time to show that ligand left the site over three replicates? Given that the protein is a tetramer, I would be expecting three replicates of data to have four data points from each subunit. I would be expecting distance calculation or RMSD of the ligand position in the binding site to be calculated either as a time series or as a distribution plot to show the difference between each mutant in the ligand stability within the binding pocket. I would expect all the videos to be translatable to certain quantitative measures.

      We have shown in the manuscript that the Mef molecule detaches from the K41C/IKs channel complex in all three simulations (at 25 ns, 70 ns and 20 ns, Table 4). Similarly, the ligand left the site in all five new 500 ns simulations. We did not provide simulations for Y46A, but with Y46C the ligand left the binding site in 4 of the 5 500 ns simulations and changed binding pose in the other.
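
      The kind of quantitative measure the reviewer asks for can be sketched as a detachment-time estimate from a ligand-to-pocket distance time series. This is an illustrative sketch only: the function name, the 10 Å cutoff, and the requirement of several consecutive frames above the cutoff are assumptions, not the criteria used in the manuscript.

```python
# Hypothetical sketch: estimate when a ligand leaves its binding site.
# Assumptions: distances_A[i] is the ligand-to-pocket distance (in angstroms)
# at times_ns[i]; the cutoff and persistence window are illustrative values.
def detachment_time(times_ns, distances_A, cutoff_A=10.0, min_frames=5):
    """Return the first time at which the distance exceeds cutoff_A and
    stays above it for min_frames consecutive frames, or None."""
    run = 0
    for i, d in enumerate(distances_A):
        run = run + 1 if d > cutoff_A else 0
        if run == min_frames:
            return times_ns[i - min_frames + 1]
    return None


times = list(range(10))                          # toy time axis in ns
dists = [3, 3, 12, 3, 11, 12, 13, 14, 15, 16]    # toy distances in angstroms
print(detachment_time(times, dists, cutoff_A=10.0, min_frames=3))
```

      Requiring a persistent excursion avoids counting a brief fluctuation (the single 12 Å frame in the toy data) as unbinding.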

      The difficulties encountered upon extending the docking and MD simulations to all 4 receptor sites of the channel complex are discussed in our response to point #2 of the reviewer.

      6. Given that K41 (Mef) and Y46 are very important in the coordination, could you calculate the frequency at which such residues form hydrogen bonds with the drug in the binding site? Can you also calculate the occupancy or the frequency of contact that the residues are making to the ligand (close 4-angstrom proximity etc.) and show whether those agree with the ligand interaction map obtained from ICM pro in Figure 2D?

      We thank the reviewer for the suggestion to analyze the H-bond contribution to ligand dynamics in the binding site. In the plots shown in Figure 2-figure supplement 4 and Figure 6-figure supplement 1, we now provide detailed information about the dynamics of the H-bond formation between the ligand and the channel-complex throughout simulations. In addition, we have quantified this and have added these numbers to a table (Table 2) and in the text of the results.
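
      For illustration, the contact occupancy mentioned by the reviewer can be computed as the fraction of frames in which a residue lies within a distance cutoff of the ligand. The sketch below is distance-only and uses toy numbers; it is not the H-bond analysis reported in the figures (H-bond criteria usually also include an angle term), and the function name and input layout are assumptions.

```python
# Hypothetical sketch: per-residue contact occupancy from minimum
# residue-ligand distances (in angstroms), one value per frame.
def occupancy(distances_A, cutoff_A=4.0):
    """Fraction of frames in which the distance is within cutoff_A."""
    if not distances_A:
        raise ValueError("no frames supplied")
    in_contact = sum(1 for d in distances_A if d <= cutoff_A)
    return in_contact / len(distances_A)


# Toy per-frame distances; the residue labels follow the text, the values do not.
toy = {
    "K41": [3.0, 3.5, 5.0, 4.0],
    "W323": [6.0, 7.5, 3.9, 8.0],
}
contact_fraction = {res: occupancy(d) for res, d in toy.items()}
print(contact_fraction)
```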

      7. Given that the author claims that both molecules share the same binding site and the mode of ligand binding seems to be very dynamic, I would expect the authors to show the distribution of the position of ligand, or space, or volume occupied by the ligand throughout multiple repeats of simulations, over sufficient sampling time that both ligand samples the same conformational space in the binding pocket. This will prove the point in the discussion - Line 463-464. "We can imagine a dynamic complex... bind/unbind from Its at a high frequency".

      To support our statement regarding a dynamic complex, we analyzed longer MD simulations and clustered the trajectories. An average conformation was extracted from each cluster and is provided as supplementary information, which shows the different binding modes for Mef (Figure 2-figure supplement 3). DIDS was more stable in MD simulations, and although there were also several clusters, they were similar enough that, when using the same cut-off distance as for mefenamic acid, they could be grouped into one cluster. (Note the scale differences on the dendrograms between Figure 2-figure supplement 3 and Figure 6-figure supplement 2.)

      8. I would expect the authors to explain the significance and the importance of the PBSA/GBSA analysis as they are not reporting the same energy in several cases, especially K41 in Figure 2 - figure supplement 2. It was also questionable that Y46, which seems to have high binding energy, show no difference in the EPhys works in figure 3. These need to be commented on.

      Several studies indicate that G values calculated using MM/PBSA and MM/GBSA methods may vary. Some studies report marked differences and the reasons for such a discrepancy is thoroughly discussed in a review by Genheden and Ryde (PMID: 25835573). Therefore, we used both methods to be sure that key residues contributing to ligand binding identified with one method appear in the list of residues for which the calculations are done with the other method.

      Y46C, which showed only a slightly less favorable binding energy and did not unbind during 300 ns simulations, unbound or changed pose in 4 out of 5 of the longer simulations in the presence of a lipid membrane (Figure 4-figure supplement 1). The discrepancy between electrophysiological and MD data is commented on in the manuscript (pages 12-13).

      9. Can the author prove that the PBSA/GBSA analysis yielded the same average free energy throughout the MD simulation? This should be the case when the simulations are converged. The author may takes the snapshots from the first ten ns, conduct the analysis and take the average, then 50, then 100, then 250 and 500 ns. The author then hopefully expects that as the simulations get longer, the system has reached equilibrium, and the free energy obtained per residue corresponds to the ensemble average.

      As we mention in the manuscript, Mef-channel interactions are quite dynamic and vary even from simulation to simulation. The frequent change of the binding pose of the ligands observed during simulations (represented in Figure 2-figure supplement 3 as clusters) is a clear reflection of such a dynamic process. Therefore, we do not expect the same average energy throughout the simulation, but we do expect that ΔG values stand above the background for key residues, which was generally the case (Figure 2-figure supplement 2 and Figure 6).
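
      For readers who want to see what the windowed check proposed by the reviewer would look like, a minimal sketch follows. It averages per-frame energies over the first 10, 50, 100, 250 and 500 ns; the function name, the toy energy trace, and the 1 ns stride are assumptions for illustration and do not reproduce any analysis from the manuscript.

```python
# Hypothetical sketch: cumulative averages of a per-frame energy series
# over growing time windows, as a simple convergence diagnostic.
def cumulative_means(energies, stride_ns, windows_ns=(10, 50, 100, 250, 500)):
    """Map each window length (ns) to the mean energy of the frames in it.
    Windows longer than the available data are skipped."""
    result = {}
    for window in windows_ns:
        n_frames = int(round(window / stride_ns))
        if n_frames == 0 or n_frames > len(energies):
            continue
        result[window] = sum(energies[:n_frames]) / n_frames
    return result


# Toy trace: the ligand changes pose at 100 ns, shifting the energy.
toy_energies = [-1.0] * 100 + [-3.0] * 400   # one frame per ns, kcal/mol
means = cumulative_means(toy_energies, stride_ns=1.0)
print(means)
```

      In the toy trace the running mean keeps drifting after the pose change, which is the behaviour expected for a ligand that samples several binding modes rather than settling into one.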

      10. The phrase "Lowest interaction free energy for residues in ps-KCNE1 and selected KCNQ1 domains are shown as enlarged panels (n=3 for each point)" needs further explanation. Is this from different frames? I would rather see this PBSA and GBSA calculated on every frame of the simulations, maybe at the one ns increment across 500 ns simulations, in 4 binding sites, in 3 replicas, and these are being plotted as the distribution instead of plotting the smallest number. Can you show each data point corresponding to n = 3?

      The MMPBSA/MMGBSA was calculated for 1000 frames from each of the three 300 ns simulations (0.3 ns sampling interval), 3000 frames in total. The results are shown in Figure 2-figure supplement 2, which includes error bars to show the differences across runs. We have updated the legend for greater clarity.

      11. I cannot wrap my head around what you are trying to show in Figure 2B. This could be genuinely improved with better labelling. Can you explain whether this predicted binding pose for Mef in the figure is taken from the docking or from the last frame of the simulation? Given that the binding mode seems to be quite dynamic, a single snapshot might not be very helpful. I suggest a figure describing different modes of binding. Figure 2B should be combined with figure 2C as both are not very informative.

      We have updated Figure 2B with better labelling and added a new figure showing the different modes of binding (Figure 2-figure supplement 3).

      12. Similar to the comment above, but for Figure 4B. I do not understand the argument. If the author is trying to say that the pocket is closed after Mef is removed - then can you show, using MD simulation, that the pocket is openable in an apo to the state where Mef can bind? I am aware that the open pocket is generated through batches of structures through conformational sampling - but as the region is supposed to be disordered, can you show that there is a possibility of the allosteric or cryptic pocket being opened in the simulations? If not, can you show that the structure with the open pocket, when the ligand is removed, is capable of collapsing down to the structure similar to the cryo-EM structure? If none of the above work, the author might consider using PocketMiner tools to find an allosteric pocket (https://doi.org/10.1038/s41467-023-36699-3) and see a possibility that the pocket exists.

      Please see the attached screenshot which depicts the binding pocket from the longest run we performed (1250 ns) before drug detachment (grey superimposed structures) and after (red superimposed structures). Mefenamic acid is represented as licorice and colored green. Snapshots for superimposition were collected every 10 ns. As can be seen in the figure, when the drug leaves the binding site (after 500 ns, structures colored red), the N-terminal residue of psKCNE1, W323, and other residues that form the pocket shift toward the binding site, overlapping with where Mefenamic acid once resided. The surface structure in Figure 4B shows this collapse.

      Author response image 1.

      In the manuscript, we propose that drug binding is best described by an induced-fit model, in which the formation of a firm complex (channel-Mef complex) results from multi-state conformational adjustments of the bimolecular interaction. These interactions do not necessarily need to have large interfaces at the initial phase. This seems to be the case for the Mef-IKs interaction, since we could not identify a pocket of appropriate size using either the PocketMiner software suggested by the reviewer or the PocketFinder tool of the ICM-pro software.

      1. Figure 4C - again, can you show the RMSF analysis of all four subunits leading to 12 data points? If it is too messy to plot, can you plot a mean with a standard deviation? I would say that a 1-1.5 angstroms increase in the RMSF is not a "markedly increased", as stated on line 280. I would also encourage the authors to label whether the RMSF is calculated from the backbone, side-chain or C-alpha atoms and, ideally, compare them to see where the dynamical properties are coming from.

      Please see the answer to comment #4. We agree that the changes are not so dramatic and have modified the text accordingly. RMSF was calculated for backbone atoms to allow comparison of residues with different side chains; a note of this is now in the Methods, and the statistical significance of ps-IKs vs K41C, W323A and Y46C is indicated in Figures 4C-4E.
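
      For readers unfamiliar with the quantity raised in this comment, a per-atom RMSF can be sketched as below. This is an illustrative calculation only, assuming the frames have already been superimposed on a common reference; the two-atom trajectory is hypothetical, and the function is not taken from the software used in the study.

```python
import math


def rmsf(trajectory):
    """Per-atom RMSF from aligned frames.

    trajectory: list of frames, each a list of (x, y, z) tuples,
    already superimposed on a common reference.
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    result = []
    for atom in range(n_atoms):
        coords = [frame[atom] for frame in trajectory]
        # Time-averaged position of this atom.
        mean = [sum(c[k] for c in coords) / n_frames for k in range(3)]
        # Mean squared deviation from the average position.
        msd = sum(
            sum((c[k] - mean[k]) ** 2 for k in range(3)) for c in coords
        ) / n_frames
        result.append(math.sqrt(msd))
    return result


# Hypothetical two-atom, two-frame trajectory (angstroms).
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
values = rmsf(frames)
print(values)
```

      Restricting the input to backbone atoms, as described above, makes the values comparable between residues with different side chains.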

      1. In the discussion - Lines 464-467. "Slowed deactivation of the S1/KCNE1/Pore domain/drug complex... By stabilising the activated complex. MD simulation suggests the latter is most likely the case." Can you point out explicitly where this has been proven? If the drug really stabilised the activated complex, can you show which intermolecular interaction within E1/S1/Pore has the drug broken and re-form to strengthen the complex formation? The authors have not disproven the point on steric hindrance either. Can this be disproved by further quantitative analysis of existing unbiased equilibrium simulations?

      The stabilization of S1/KCNE1/Pore by drugs does not necessarily have to involve a creation of new contacts between protein parts or breakage of interfaces between them. The stabilization of activated complexes by drugs may occur when the drug simultaneously binds to both moveable parts of the channel, such as voltage sensor(s) or upper KCNE1 region, and static region(s) of the channel, such as the pore domain. We have changed the corresponding text for better clarity.

      1. Figure 4D - Can you show this RMSF analysis for all mutants you conducted in this study, such as Y46C? Can you explain the difference in F dynamics in the KCNE3 for both Figure 4C and 4D?

      We now show the RMSF for K41C, W323A and Y46C in Figure 4C-E. We speculate that K41 (magenta) and W323 (yellow), given their location at the lipid interface (see Author response image 1), may be important stabilizing residues for the KCNE N-terminus, whereas Y46 (green) which is further down the TMD has less of an impact.

      Author response image 2.

      1. Line 477: the author suggested that K41 and Mef may stabilise the protein-protein interface at the external region of the channel complex. Can you prove that through the change in protein-protein interaction, contact is made over time on the existing MD trajectories, whether they are broken or formed? The interface from which residues help to form and stabilise the contact? If this is just a hypothesis for future study, then this has to be stated clearly.

      It is known that crosslinking of several residues of external E1 with the external pore residues dramatically stabilizes voltage-sensors of KCNQ1/KCNE1 complex in the up-state conformation. This prevents movable protein regions in the voltage-sensors returning to their initial positions upon depolarization, locking the channel in an open state. We suggest that MEF may restrain the backward movement of voltage-sensors in a similar way that stabilizes open conformation of the channel. The stabilization of the voltage sensor domain through MEF occurs due to contacts of the drug with both static (pore domain) and dynamic protein parts (voltage-sensors and external KCNE1 regions). We have changed the corresponding part of the text.

      1. The author stated on lines 305-307 that "DIDS is stabilised by its hydrophobic and vdW contacts with KCNQ1 and KCNE1 subunits as well as by two hydrogen bonds formed between the drug and ps-KCNE1 residue L42 and KCNQ1 residue Q147" Can you show, using H-bond analysis that these two hydrogen bonds really exist stably in the simulations? Can you show, using minimum distance analysis, that L42 are in the vdW radii stably and are making close contact throughout the simulations?

      We performed a detailed H-bond analysis (Figure 6-figure supplement 1), which shows that DIDS forms multiple H-bonds over the simulations, though only some of them (GLU43, TYR46, ILE47, SER298, TYR299, TRP323) are stable. Thus, the H-bonds that we observed in DIDS-docking experiments were unstable in MD simulations. As in the case of the IKs-MEF complex, the prevailing H-bonds exhibit marked quantitative variability from simulation to simulation. We have added a table detailing the most frequent H-bonds during MD simulations (Table 2).
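
      The idea of ranking H-bonds by how often they occur can be sketched as an occupancy calculation. The example below is illustrative only: the per-frame partner lists are hypothetical, and the 0.5 cutoff for calling a bond "stable" is an assumption, not a criterion taken from the study.

```python
from collections import Counter


def hbond_occupancy(frames):
    """Fraction of frames in which each partner residue is H-bonded.

    frames: list of sets, one per frame, holding the residues that
    form an H-bond with the ligand in that frame.
    """
    counts = Counter()
    for partners in frames:
        counts.update(set(partners))
    n_frames = len(frames)
    return {residue: n / n_frames for residue, n in counts.items()}


# Hypothetical per-frame H-bond partners of the ligand.
frames = [
    {"TYR46", "SER298"},
    {"TYR46"},
    {"TYR46", "GLN147"},
    {"TYR46", "SER298"},
]
occupancy = hbond_occupancy(frames)

STABLE_CUTOFF = 0.5  # illustrative threshold, not taken from the study
stable = sorted(r for r, f in occupancy.items() if f >= STABLE_CUTOFF)
print(occupancy, stable)
```

      Computing the occupancy separately for each run would expose the run-to-run variability mentioned above.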

      1. Discussion - In line 417, the author stated that the "S1 appears to pull away from the pore" and supplemented the claim with the movie. This is insufficient. The author should demonstrate distance calculation between the S1 helix and the pore, in WT and mutants, with and without the drug. This could be shown as a time series or distribution of centre-of-mass distance over time.

      We tried to analyze the distance changes between the upper S1 and the pore domain but failed to see a strong correlation. We have removed this statement from the discussion.

      1. Given that all the work were done in the open state channel with PIP2 bound (PDB entry: 6v01), could the author demonstrate, either using docking, or simulations, or alignment, or space-filling models - that the ligand, both DIDS and Mef, would not be able to fit in the binding site of a closed state channel (PDB entry: 6v00). This would help illustrate the point denoted Lines 464-467. "Slowed deactivation of the S1/KCNE1/Pore domain/drug complex... By stabilising the activated complex. MD simulation suggests the latter is most likely the case."

      As of now, a structure representing the closed state of the channel does not exist. 6V00 is the closed inactivated state of the channel pore with voltage-sensors in the activated conformation. In order to create simulation conditions that reliably describe the electrophysiological experiments, at least a good model for closed channels with resting state voltage sensors is necessary.

      1. The author stated that the binding pose changed in one run (lines 317 to 318). Can you comment on those changes? If the pose has changed - what has it changed to? Can you run longer simulations to see if it can reverse back to the initial confirmation? Or will it leave the site completely?

      Longer simulations and trajectory clustering revealed several binding modes, of which one pose dominated in approximately 50% of all simulations (encircled with a blue frame in Figure 2-figure supplement 3).

      1. Binding free energy of -32 kcal/mol = -134 kJ/mol. If you try to do dG = -RTlnKd, your lnKd is -52. Your Kd is e^-52, which means it will never unbind if it exists. I am aware that this is the caveat with the methodologies. But maybe these should be highlighted throughout the manuscript.

      We thank the reviewer for this comment. ΔG values, and the corresponding Kd values, calculated from simulations of the Mef-ps-IKs complex do not reflect the apparent Kd values determined in electrophysiological experiments, nor do they reflect Kd values of drug binding that could be determined in biochemical assays. The important measures are the changes observed in simulations of mutant channel complexes relative to wild type. We now briefly mention this issue in the manuscript.
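
      The reviewer's arithmetic can be reproduced with a short calculation. The sketch below uses the convention ΔG = RT ln(Kd) for a binding free energy and assumes a temperature of 310 K, which is not stated in the text; it is meant only to illustrate why such a ΔG implies an unrealistically small Kd.

```python
import math

R_KCAL = 1.987204e-3   # gas constant, kcal/(mol*K)
KCAL_TO_KJ = 4.184     # thermochemical calorie


def ln_kd_from_dg(dg_kcal_per_mol, temperature_k):
    """ln(Kd) from a binding free energy, using dG = RT ln(Kd)."""
    return dg_kcal_per_mol / (R_KCAL * temperature_k)


dg = -32.0             # kcal/mol, value quoted by the reviewer
temperature = 310.0    # K, assumed here (not stated in the text)

dg_kj = dg * KCAL_TO_KJ
ln_kd = ln_kd_from_dg(dg, temperature)
kd_molar = math.exp(ln_kd)

print(f"dG = {dg_kj:.0f} kJ/mol, ln(Kd) = {ln_kd:.1f}, Kd = {kd_molar:.1e} M")
```

      Under these assumptions ln(Kd) is close to -52, in line with the reviewer's estimate, which is why only relative changes between wild type and mutants are interpreted.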

      Reviewer #1 (Recommendations For The Authors):

      1) It would be nice to have labels of amino acid residues in Figure 2B.

      We updated Figure 2B and added some residue labels.

      2) Fig. 3A and 7A. In what order the current traces are presented? I don't see the rule.

      We have now arranged the current traces in a more orderly manner, listing them first by ascending KCNE1 residue numbers and then by ascending KCNQ1 residue numbers. This order is now consistent with the normalized response and ΔV1/2 plots in Figures 3 and 7.

      3) Line 312 "A44 and Y46 were more so." A44 may be more critical, but I can't see Y46 is more, according to Figure 2-figure supplement2 and Figure 6.

      Indeed, comparison of the energy decomposition data indicates approximately the same ∆G values for Y46. We have revised this in the text correspondingly.

      4) Line 267 "Mefenamic acid..." I would like to see the movie.

      We no longer have access to this original movie.

      5) In supplemental movies 5-7, the side chains of some critical amino acid residues (W323, K41) would be better presented as in movies 1-4.

      We have retained the original presentations of these movies as the original files are no longer available.

      Reviewer #2 (Recommendations For The Authors):

      General comments:

      1) To determine the effect of mefenamic acid and DIDS on channel closing kinetics, a protocol in which they step from an activating test pulse to a repolarizing tail pulse to -40 mV for 1 s is used. If I understand it right, the drug response is assessed as the difference in instantaneous tail current amplitude and the amplitude after 1 s (row 599-603). The drug response of each mutant is then normalized to the response of the WT channel. However, for several mutants there is barely any sign of current decay during this relatively brief pulse (1 s) at this specific voltage. To determine drug effects more reliably on channel closing kinetics/the extent of channel closing, I wonder if these protocols could be refined? For instance, to cover a larger set of voltages and consider longer timescales?

      To clarify, the drug response of each mutant is not normalized to the response of the WT channel. In fact, our analysis is not meant to compare mutant and WT tail current decay but rather how isochronal tail current decay is changed in response to drug treatment in each channel construct. As noted by the reviewer, the peak-to-end difference currents were calculated by subtracting the minimum amplitude of the deactivating current from the peak amplitude of the deactivating current. The difference current in mefenamic acid or DIDS was then normalized to the maximum control (in the absence of drug) difference current and subtracted from 1.0 to obtain the normalized response. Thus, the difference in tail current decay in the absence and in the presence of drug is measured within the same time scale and allows a direct comparison between before and after drug treatment. As shown in Fig 3D and 7C, a large drug response such as the one measured in WT channels is reflected by a value close to 1. A smaller drug response is indicated by low values. We recognize that some mutations resulted in an intrinsic inhibition of tail current decay in the absence of drug, which potentially leads to underestimating the normalized response value. Our goal was not to study in detail the effects of the drug on channel closing kinetics, but only to determine the impact of the mutation on drug binding by using tail current decay as a readout. Consequently, we believe that the duration of the deactivating tail current used in this experiment was sufficient to detect drug-induced tail current decay inhibition.
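
      The normalization described above can be written out as a short sketch. The tail current values are hypothetical and in arbitrary units, and the function names are introduced here for illustration only; this is our reading of the described calculation rather than the analysis script itself.

```python
def difference_current(tail_current):
    """Peak-to-end difference: peak amplitude minus minimum amplitude."""
    return max(tail_current) - min(tail_current)


def normalized_response(control_tail, drug_tail):
    """1 - (drug difference current / control difference current)."""
    control_diff = difference_current(control_tail)
    if control_diff == 0:
        raise ValueError("control tail current shows no decay")
    return 1.0 - difference_current(drug_tail) / control_diff


# Hypothetical tail currents (arbitrary units) sampled over the 1 s pulse.
control = [1.00, 0.70, 0.45, 0.30, 0.20]    # clear decay without drug
with_drug = [1.00, 0.99, 0.97, 0.96, 0.96]  # decay largely inhibited

response = normalized_response(control, with_drug)
print(round(response, 3))
```

      A construct whose tail current barely decays in control has a small control difference current, so the ratio becomes sensitive to noise. This is the underestimation caveat acknowledged above.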

      2) The effect of mefenamic acid seems to be highly dependent on the pulse-to-pulse interval in the experiments. For instance, for WT in Figure 3 - Figure supplement 1, a 15 s pulse-to-pulse interval provides a -100 mV shift in V1/2 induced by mefenamic acid, whereas there is no shift induced when using a 30 s pulse-to-pulse interval. Can the authors explain why they generally consider a 15 s pulse-to-pulse interval more suitable (physiologically relevant?) in their experiments to assess drug effects?

      In our previous experiments, we determined that a 15 s inter-pulse interval is generally adequate for the WT IKs channels to fully deactivate before the onset of the next pulse. Consistent with our previous work (Wang et al. 2019), we observed that in wild-type EQ channels, there is no current summation from one pulse to the next one (see Fig 1A, bottom panel). This is important as the IKs channel complex is known to be frequency dependent, i.e. current amplitude increases as the inter-pulse interval gets shorter. Such current summation results in a leftward shift of the conductance-voltage (GV) relationship. This is also important with regard to drug effects. As indicated by the reviewer, mefenamic acid effects are prominent with a 15 s inter-pulse interval but less so with a 30 s inter-pulse interval, when enough time is given for channels to more completely deactivate. Full effects of mefenamic acid would therefore have been concealed with a 30 s inter-pulse interval.

      Moreover, our patch-clamp recordings aim to explore the distinct responses of mutant channels to mefenamic acid and DIDS in comparison to the wild-type channel. It is important to note that the inter-pulse interval's physiological relevance is not necessarily crucial in this context.

      3) Related to comment 1 and 2, there is a large diversity in the intrinsic properties of tested mutants. For instance, V1/2 ranges from 4 to 70 mV. Also, there is large variability in the slope of the G-V curves. Whether channel closing kinetics, or the impact of pulse-to-pulse interval, vary among mutants is not clear. Could the authors please discuss whether the intrinsic properties of mutants may affect their ability to respond to mefenamic acid and DIDS? Also, please provide representative current families and G-V curves for all assessed mutants in supplementary figures.

      The intrinsic properties of some mutants vary from the WT channels and influence their responsiveness to mefenamic acid and DIDS. The impact of the mutations on the IKs channel complex are reflected by changes in V1/2 (Table 1, 4) and tail current decay (Figs. 3, 7). But, it is the examination of the drug effects on these intrinsic properties (i.e. GV curve and tail current decay) that constitutes the primary endpoint of our study. We consider that the degree by which mef and DIDS modify these intrinsic properties reflects their ability to bind or not to the mutated channel. In our analysis, we compared each mutant's response to mefenamic acid and DIDS with its respective control. Consequently, the intrinsic properties of the mutant channels have already been considered in our evaluation. As requested, we have provided representative current families and G-V curves for all assessed mutants in Figure 3-figure supplement 1 and Figure 7-figure supplement 1.

      4) The A44C and Y148C mutants give strikingly different currents in the examples shown in Figure 3 and Figure 7. What is the reason for this? In the examples in figure 7, it almost looks like KCNE1 is absent. Although linked constructs are used, is there any indication that KCNE1 is not co-assembled properly with KCNQ1 in those examples?

      The size of the current is critical to determining its shape, as during the test pulse there is some endogenous current mixed in which impacts shape. A44C and Y148C currents shown in Figure 7 are smaller with a larger contribution of the endogenous current, mostly at the foot of the current trace. In our experience there is little endogenous current in the tail current at -40 mV and for this reason we focus our measurements there.

      Although constructs with tethered KCNQ1 and KCNE1 were used, we cannot rule out the possibility that Q1 and E1 interaction was altered by some of the mutations. Several KCNE1 and KCNQ1 residues have been identified as points of contact between the two subunits. For instance, the KCNE1 loop (position 36-47) has been shown to interact with the KCNQ1 S1-S2 linker (position 140-148) (Wang et al, 2011). Thus, it is conceivable that mutation of one or several of those residues may alter KCNQ1/KCNE1 interaction and modify the activation/deactivation kinetics of the IKs channel complex.

      5) I had a hard time following the details of the simulation approaches used. If not already stated (I could not find it), please provide: i) details on whether the whole channel protein was considered for 4D docking or a docking box was specified, ii) information on how simulations with mutant ps-IKs were prepared (for instance with the K41C mutant), especially whether the in silico mutated channel was allowed to relax before evaluation (and for how long). Also, please make sure that information on simulation time and number of repeats are provided in the Methods section.

      For 4D docking, only residues within 0.8 nm of psKCNE1 residues D39-A44 were selected. Complexes with mutated residues were relaxed using the same protocol as the WT channel (equilibration with gradually released restraints, followed by a final 10 ns equilibration in which only the backbone was restrained with 50 kcal/mol/nm2). We have updated the methods accordingly.

      Specific comments:

      In figure legends, please provide information on whether data represents mean +/- SD or SEM. Also, please provide information on which statistical test was used in each figure.

      We revised the figure legend to add the nature of the statistical test used.

      G-V curves are normalized between 0 and 1. However, for many mutants the G-V relationship does not reach saturation at depolarized voltages. Does this affect the estimated V1/2? I could not really tell as I was not sure how V1/2 was determined for different mutants (could the explanation on row 595-598 be clarified)?

      The primary focus here is in the shift between the control response and drug response for each mutant, rather than the absolute V1/2 values. The isochronal G-V curves that are generated for each construct (WT and mutant) utilize an identical voltage protocol. This approach ensures a uniform comparison among all mutants. By observing the shifts in these curves, we can gain insight into the response of mutant channels to the drug. This information ultimately helps elucidate the inherent properties of the mutant channels and contributes to our understanding of the drug's binding mechanism to the channel.

      As requested by the reviewer, we also clarified the way V1/2 was generated: When the G-V curve did not reach zero, the V1/2 value was directly read from the plot at the voltage point where the curve crossed the 0.5 value on the y coordinate.
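
      Reading V1/2 from the point where the normalized G-V curve crosses 0.5 can be sketched with linear interpolation between neighbouring data points. The G-V values below are hypothetical, and interpolating linearly between points is an assumption made for this illustration.

```python
def v_half_from_crossing(voltages, conductance, level=0.5):
    """Voltage at which a normalized G-V curve first crosses `level`.

    Uses linear interpolation between the two neighbouring points.
    Returns None when the curve never crosses the level.
    """
    points = list(zip(voltages, conductance))
    for (v0, g0), (v1, g1) in zip(points, points[1:]):
        if (g0 - level) * (g1 - level) <= 0 and g0 != g1:
            return v0 + (level - g0) * (v1 - v0) / (g1 - g0)
    return None


# Hypothetical normalized G-V data (mV, fraction of maximum conductance).
# The curve does not reach zero at negative voltages.
voltages = [-80, -60, -40, -20, 0, 20, 40, 60]
conductance = [0.15, 0.18, 0.25, 0.40, 0.60, 0.80, 0.93, 1.00]

v_half = v_half_from_crossing(voltages, conductance)
print(v_half)
```

      Because the same reading is applied to the control and drug curves of each construct, the quantity of interest remains the shift between the two values.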

      A general comment is that the Discussion is fairly long and some sections are quite redundant to the Results section. The authors could consider focusing the text in the Discussion.

      We changed the discussion correspondingly wherever it was appropriate.

      I found it a bit hard to follow the authors interpretation on whether their drug molecules remain bound throughout the experiments, or whether there is fast binding/unbinding. Please clarify if possible.

      In the 300 ns MD simulations, mefenamic acid and DIDS remained stably bound to WT-ps-IKs; binding of the drugs to mutant complexes is described in Table 3 and Table 5. In longer simulations with the channel embedded in a lipid environment, mefenamic acid unbinds in two out of five runs for WT-ps-IKs (Figure 4-figure supplement 1), and DIDS shows a few events where it briefly unbinds (Figure 6-figure supplement 3). Based on electrophysiological data, we speculate that drugs might bind to and unbind from WT-ps-IKs during the gating process. We do not see such binding and unbinding in MD simulations, since the model we used reflects only the open conformation of the channel complex with an activated-state voltage sensor, whereas a resting-state voltage sensor condition was not considered.

      The authors have previously shown that channels with no, one or two KCNE1 subunits are not, or only to a small extent, affected by mefenamic acid (Wang et al., 2020). Could the details of the binding site and proposed mechanisms of action provide clues as to why all binding sites need to be occupied to give prominent drug effects?

      In the manuscript, we propose that the binding of drugs induces conformational changes in the pocket region that stabilize the S1/KCNE1/Pore complex. In the tetrameric channel with 4:4 alpha to beta stoichiometry, the drugs are likely to occupy all four sites, with complete stabilization of S1/KCNE1/Pore. When one or more KCNE1 subunits are absent, as in the case of the EQQ or EQQQQ constructs, drugs will bind to the site(s) where KCNE1 is available. This will lead to stabilization of only a certain part of the S1/KCNE1/Pore complex. We believe that, in this case, the drug will be only partially effective.

      There is a bit of jumping in the order of when some figures are introduced (e.g. row 178 and 239). The authors could consider changing the order to make the figures easier to follow.

      We have changed the corresponding section appropriately to improve the reading flow.

      Row 237: "Data not shown", please show data.

      The G-V curve of the KCNE1 Y46C mutant displays a complex, double Boltzmann relationship which does not allow for the calculation of a meaningful V1/2 nor would it allow for an accurate determination of drug effects. Consequently, we have excluded it from the manuscript.

      In the Discussion, the author use the term "KCNE1/3". Does this correspond to the previous mention of "ps-KCNE1"?

      Yes, this refers to ps-KCNE1. We have changed it correspondingly.

      Row 576: When was HMR 1556 used?

      While HMR 1556 was used in preliminary experiments to confirm that the recorded current was indeed IKs, it does not provide substantial value to the data presented in our study or our experiments. As a result, we have excluded HMR 1556 experiments from the final results and have revised the Methods section accordingly.

      Reviewer #3 (Recommendations For The Authors):

      1) Figures 2D and 6A are very unclear. Can the authors provide labels as text rather than coloured circles, whether the residue is on Q1 or E1? There is also a distance label in the figure in the small font with the faintest shade of grey, which I believe is supposed to be hydrogen bonds. Can this be improved for clarity?

      We feel that additional labels on the ligand diagrams would be more confusing. Instead, we updated the description in the legend and added labels to Figure 2B and Figure 6B to improve the clarity of residue positions. In addition, we have added two new figures with more detailed information about H-bonds (Figure 2-figure supplement 4, Figure 6-figure supplement 1).

      2) Figure 2B - all side chains need labelling in different binding modes. The green ligand on blue protein is very difficult to see. Suddenly, the ligand turns light blue in panel 2C. Can this be consistent throughout the manuscript?

      Figure 2B is updated according to this comment.

      3) Figure 2 - figure supplement 2, and figure 6B. Can the author show the residue number on the x-axis instead of just the one-letter abbreviation? This requires the reader to count and is not helpful when we try to figure out where the residue is at a glance. I would suggest a structure label adjacent to the plot to show whether they are located with respect to the drug molecule.

      Since the numbers for residues on either end of the cluster are indicated at the bottom of each boxed section, we feel that adding residue numbers would just further clutter the figure.

      4) Figure 2 - figure supplement 2, and Figure 6B. Can you explain what is being shown in the error bar? I assume standard deviation?

      Error bars on Figure 2-figure supplement 2 represent SEM. We added corresponding text in the figure legend.

      5) Figure 2 - figure supplement 2, and figure 6B. Can you explain how many frames are being accounted for in this PBSA calculation?

      For Figure 2- figure supplement 2 and Figure 6B a frame was made every 0.3 ns over 3x300 ns simulation, 1000 frames for each simulation, 3000 frames overall.

      6) Figure 3D/E and 7C/D, it would be helpful to show which mutant show agreeable results with the simulations, PBSA/GBSA and contact analyses as suggested above.

      The inconsistencies and discrepancies between the results of MD simulations and electrophysiological experiments are discussed throughout the manuscript.

      7) Figure legend, figure 3E - I assume that there is a type that is different mutants with respect to those without the drug. Otherwise, how could WT, with respect to WT, has -105 mV dV1/2?

      The reviewer is correct in that the bars indicate the difference in V1/2 between control and drug treatment. Thus, the difference in V1/2 (∆V1/2) between the V1/2 calculated for WT control and the V1/2 for mefenamic acid is indeed -105 mV. We have now revised Figure 3E's legend to accurately reflect this and ensure a clear understanding of the data presented.

      8) Figure 3 - figure supplement 1B is very messy, and I could not extract the key point from it. Can this be plotted on a separate trace? At least 1 WT trace and one mutant trace, 1 with WT+drug and one mut+drug as four separate plots for clarity?

      The key message of this figure is to illustrate the similarities of EQ WT + Mef and EQ L142C data. Thus, after thorough consideration, we have concluded that maintaining the current figure, which displays the progressive G-V curve shift in EQ WT and L142C in a superimposed manner, best illustrates the gradual shift in the G-V curves. This presentation allows for a clearer and more immediate comparison of the curve shifts, which may be more challenging to discern if the G-V curves were separated into individual figures. We believe that the existing format effectively communicates the relevant information in a comprehensive and accessible manner.

      9) Figure 4B - the label Voltage is blended into the orange helix. Can the label be placed more neatly?

      We altered the labels for this figure and added that information in the figure description.

      10) Can you show the numerical label of the residue, at least only to the KCNE1 portion in Figures 4C and 4D?

      We updated these figures and added residue numbering for clarity.

      11) Can you hide all non-polar hydrogen atoms in figure 8 and colour each subunit so that it agrees with the rest of the manuscripts? Can you adjust the position of the side chain so that it is interpretable? Can you summarise this as a cartoon? For example, Q147 and Y148 are in grey and are very far hidden away. So as S298. Can you colour-code your label? The methionine (I assume M45) next to T327 is shown as the stick and is unlabelled. Maybe set the orthoscopic view, increase the lighting and rotate the figures in a more interpretable fashion?

      We agree that Fig.8 is rather small as originally presented. We have tried to emphasize those residues we feel most critical to the study and inevitably that leads to de-emphasis of other, less important residues. As long as the figure is reproduced at sufficient size we feel that it has sufficient clarity for the purposes of the Discussion.

      12) Line 538-539. Can you provide more detail on how the extracellular residues of KCNE3 are substituted? Did you use Modeller, SwissModel, or AlphaFold to substitute this region of the KCNEs?

      We used ICM-pro to substitute extracellular residues of KCNE3 and create mutant variants of the IKs channel. This information is now provided in the methods section.

      13) Line 551: The PIP2 density was solved using cryo-EM, not X-ray crystallography.

      We corrected this.

      14) Line 555: The system was equilibrated for ten ns. In which ensemble? Was there any restraint applied during the equilibration run? If yes, at what force constant?

      The system was equilibrated in NVT and NPT ensembles with restraints. These details have been added to the methods. In the new simulations, we performed equilibrations while gradually releasing spatial restraints from the backbone, sidechains, lipids, and ligands. A final 30 ns equilibration in the NPT ensemble was performed with restraints only on backbone atoms, with a force constant of 50 kJ/mol/nm2. Methods were edited accordingly.

      15) Line 557: Kelvin is a unit without a degree.

      Corrected

      16) Line 559: PME is an electrostatic algorithm, not a method.

      Corrected

      17) Line 566: Collecting 1000 snapshots at which intervals. Given your run are not equal in length, how can you ensure that these are representative snapshots?

      Please see comment #5.

      18) Table 3 - Why SD for computational data and SEM for experimental data?

      There was no particular reason for using SD in some graphs. We used appropriate statistical tests to compare the groups where the difference was not obvious.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Using lineage tracing and single-cell RNA sequencing, Li et al. reported brain ECs can differentiate into pericytes after stroke. This finding is novel and important to the field.

      Strengths:

      Detailed characterization of each time point and genetic manipulation of genes to study the roles of ECs and E-pericytes.

      Weaknesses:

      Genetic evidence for lineage tracing of ECs and E-pericytes requires more convincing data that includes staining, FACS, and scRNA-seq analysis.

      We appreciate the reviewer’s recommendation to explore more convincing data, including staining, FACS, and scRNA-seq analysis. We initially employed traditional lineage tracing methods to demonstrate that endothelial cells can transform into pericytes after stroke. We utilized Cdh5CreERT2;Ai47 mice, Tie2-Dre;Mfsd2aCreER;Ai47 mice, and AAV-BI30 virus-infected Ai47 mice. However, our validation of the transformed cells as pericytes has limitations. While three pericyte markers (CD13, NG2, and PDGFRβ) were used in Cdh5CreERT2;Ai47 mice, only one marker (CD13) was applied in Tie2-Dre;Mfsd2aCreER;Ai47 and AAV-BI30 virus-infected Ai47 mice. This is insufficient, and the other two pericyte markers (NG2 and PDGFRβ) need to be verified in these models.

      In the scRNA-seq analysis, although we observed an increased proportion of pericyte/EGFP<sup>+</sup> cells after stroke, we did not rule out potential contamination by pericytes, nor did we include sufficient replicates. To address these issues, we can explore additional methods for analyzing scRNA-seq data, increase sample replicates, and eliminate pericyte contamination using advanced algorithms. Furthermore, we can use chimeric-related mutations to compare normal endothelial cells, normal pericytes, endothelial-derived pericytes (E-pericytes), and intermediate fibroblast-like cells at the DNA level. This approach will help identify and trace chimeric-related mutations across different cell types and developmental stages. Finally, we can track the entire process of endothelial cell transformation into pericytes using two-photon imaging in vivo.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Li and colleagues study the fate of endothelial cells in a mouse model of ischemic stroke. Using genetic lineage tracing approaches, they found that endothelial cells give rise to non-endothelial cells, which they term "E-pericytes." They further show that depleting these cells exacerbates blood-brain barrier leakage and worsens functional recovery. The authors also provide evidence that endothelial-to-mesenchymal transition, myeloid cell-derived TGFβ1, and endothelial TGFβRII are involved in this process. These are potentially interesting findings, however, the experimental evidence that endothelial cells undergo transdifferentiation to non-endothelial cells is weak, as is the evidence that these cells are pericytes. Addressing this foundational weakness will facilitate the interpretation of the other findings.

      Strengths:

      (1) The authors address an important question about blood vessel function and plasticity in the context of stroke.

      (2) The authors use a variety of genetic approaches to understand cell fate in the context of stroke. Particularly commendable is the use of several complementary lineage tracing strategies, including an intersectional strategy requiring both endothelial Cre activity and subsequent mural cell NG2 promoter activity.

      (3) The authors address upstream cellular and molecular mechanisms, including roles for myeloid-derived TGFβ.

      Weaknesses:

      (1) The authors use Cdh5-CreERT2; Ai47 mice to permanently label endothelial cells and their progeny with eGFP. They then isolate eGFP<sup>+</sup> cells from control and MCAO RP7D and RP34D brains, and use single-cell RNA-seq to identify the resulting cell types. Theoretically, all eGFP<sup>+</sup> cells should be endothelial cells or their progeny. This is a very powerful and well-conceived experiment. The authors use the presence of a pericyte cluster as evidence that endothelial-to-pericyte transdifferentiation occurs. However, pericytes are also present in the scRNA-seq data from sham mice, as are several other cell types such as fibroblasts and microglia. This suggests that pericytes and these other cell types might have been co-purified (e.g., as doublets) with eGFP<sup>+</sup> endothelial cells during FACS and may not themselves be eGFP<sup>+</sup>. Pericyte-endothelial doublets are common in scRNA-seq given that these cell types are closely and tightly associated. Additionally, tight association (e.g., via peg-socket junctions) can cause fragments of endothelial cells to be retained on pericytes (and vice-versa) during dissociation. Finally, it is possible that after stroke or during the dissociation process, endothelial cells lyse and release eGFP that could be taken up by other cell types. All of these scenarios could lead to the purification of cells that were not derived (transdifferentiated) from endothelial cells. The authors note that the proportion of pericytes increased in the stroke groups, but it does not appear this experiment was replicated and thus this conclusion is not supported by statistical analysis. The results of pseudotime and trajectory analyses rely on the foundation that the pericytes in this dataset are endothelial-derived, which, as discussed above, has not been rigorously demonstrated.

      Thank you for your thoughtful comment.

      Indeed, we face the challenge of obtaining pure cells. As the reviewer has pointed out, several factors may contribute to cell contamination. For instance, the meninges of adult mice are difficult to remove completely, which may lead to fibroblast contamination. Although Cdh5CreERT2 can specifically label endothelial cells in the normal brain parenchyma, there may still be a small number of nonspecifically labeled cells in certain brain regions, such as the choroid plexus and periventricular areas, resulting in the presence of ependymal cells. To address these issues, we can improve our methodology by carefully removing the meninges, choroid plexus, and periventricular cells during sample preparation. Additionally, we need to increase the number (N) of transcriptome samples to enhance the reliability of our data.

      (2) I have the same concern regarding the inadvertent purification of cells that were not derived from endothelial cells in the context of the bulk RNA-seq experiment (Figure S4), especially given the sample-to-sample variability in gene expression in the RP34D, eGFP<sup>+</sup> non-ECs-group (e.g., only 2/5 samples are enriched for mesenchymal transcription factor Tbx18, only 1/5 samples are enriched for mural cell TF Heyl). If the sorted eGFP<sup>+</sup> non-ECs were pericytes, I would expect a strong and consistent pericyte-like gene expression profile.

      This is an interesting question.

      Indeed, significant differences were observed in the expression of pericyte-related transcriptional profiles within the eGFP<sup>+</sup> non-ECs group. For instance, transcription factors such as Hic1 and Fosl1 were nearly absent in the eGFP<sup>+</sup> non-ECs group. We propose several potential explanations for these observations:

      (1) The sorted eGFP<sup>+</sup> non-ECs group may contain other cell types, leading to contamination.

      (2) The eGFP<sup>+</sup> non-ECs group may not uniformly express all pericyte-related transcriptional profiles.

      (3) The temporal dynamics of transcription factor expression (i.e., different factors being expressed at different stages) could contribute to the observed variability.

      (4) The heterogeneity in the timing of endothelial-to-pericyte transformation (i.e., some cells have already transformed into pericytes while others are in the process of transformation at the early stage) may result in significant differences in transcriptional profiles.

      (3) The authors use immunohistochemistry to understand localization, morphology, and marker expression of eGFP<sup>+</sup> cells in situ. The representative "E-pericytes" shown in Figure 3A-D are not associated with blood vessels, and the authors' quantification also shows that the majority of such cells are not vessel-associated ("avascular"). By definition, pericytes are a component of blood vessels and are embedded within the vascular basement membrane. Thus, concluding that these cells are pericytes ("E-pericytes") may be erroneous.

      Yes, we found that 72.2% of E-pericytes were free and not associated with blood vessels. Normally, pericytes surround blood vessels and connect to endothelial cells. However, in certain diseases, such as Alzheimer's disease, stroke, and diabetic encephalopathy, pericytes can detach from blood vessels. In our stroke model, we observed that pericytes detach from blood vessels. This phenomenon can be explained by two possible scenarios:

      (1) After endothelial cells transform into E-pericytes, the E-pericytes detach from blood vessels due to the pathological environment following stroke.

      (2) After stroke, blood vessel function is impaired, leading to vascular degeneration. Endothelial cells shed from the blood vessels and subsequently transform into E-pericytes.

      Therefore, preventing pericyte detachment from blood vessels after stroke represents an important scientific challenge.

      (4) CD13 flow cytometry and immunohistochemistry are used extensively to identify pericytes. In the context of several complementary lineage tracing strategies noted in Strength #2, CD13 immunohistochemistry is the only marker used to identify putative pericytes (Figure S3J-M). In stroke, CD13 is not specific to pericytes; dendritic cells and other monocyte-derived cells express CD13 (Anpep) in mouse brain after stroke (PMID: 38177281, https://anratherlab.shinyapps.io/strokevis/).

      We thank the reviewer for their valuable input. In the context of stroke, CD13 is not specific to pericytes. Additionally, pericytes lack a single specific marker; instead, their identity is determined by a combination of multiple markers. To more convincingly validate the identity of pericytes, it is necessary to incorporate additional pericyte markers alongside several complementary lineage tracing strategies.

      (5) The authors conclude that "EC-specific overexpression of the Tgfbr2 protein by a virus (Tgfbr2) decreases Evans blue leakage, promotes CBF recovery, alleviates neurological deficits and facilitates spontaneous behavioral recovery after stroke by increasing the number of E-pericytes." All data in Figure 10, however, compare endothelial Tgfbr2 overexpression to a DsRed overexpression control. There is no group in which Tgfbr2 is overexpressed but "E-pericytes" are eliminated with DTA (this is done in Figure 9B, but this experiment lacks the Tgfbr2 overexpression-only control). Thus, the observed functional outcomes cannot be ascribed to "E-pericytes"; it remains possible that endothelial Tgfbr2 overexpression affects EB leakage, CBF, and behavior through alternative mechanisms.

      We thank the reviewer for their valuable comment. In Figures 9A-B, we observed no significant difference in Evans blue leakage between the Tgfbr2 overexpression group and the Tgfbr2 overexpression + DTA group (P=0.8153), which suggests that the impact of Tgfbr2 overexpression on the blood-brain barrier (BBB) is primarily attributable to the E-pericytes generated by Tgfbr2 expression. Furthermore, in Figure 10A, the inclusion of the Tgfbr2 overexpression + DTA group would provide stronger evidence that the effects of Tgfbr2 overexpression on the BBB and neurobehavioral outcomes are mainly due to the E-pericytes derived from Tgfbr2 expression.

      (6) Single-cell and bulk RNA-seq data are not available in a public repository (such as GEO). Depositing these data would facilitate their independent reevaluation and reuse.

      Thank you for the suggestion. We have uploaded the single-cell and bulk RNA-seq data (assignment of the GEO accession number is pending).

      Reviewer #3 (Public review):

      Summary:

      The data and experiments presented in that study convincingly show that a subpopulation of endothelial cells undergo transformation into pericyte-like cells after stroke in mice. These so-called "E-pericytes" are protective and might present a new target for stroke recovery. The authors used a huge battery of different techniques and modified signaling pathways and cellular interactions using several genetic and pharmacological tools to show that TGFbeta and EndoMT are causes of this transformation.

      Strengths:

      The amount of different genetic and pharmacological approaches in combination with sophisticated techniques such as single-cell RNAseq is impressive and convincing. The results support their conclusions and the authors achieved their aims. The findings will strongly impact the field of cerebrovascular recovery after stroke and might open up new therapeutic targets.

      Weaknesses:

      The written and graphic presentation of the findings needs substantial improvement. Language editing is strongly recommended (there are a lot of spelling and grammatical errors in the text and illustrations, including legends).

      Thank you for raising this important point. We will place greater emphasis on the written and graphic presentation of the findings.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      In this study, Li et al. reported that endothelial cells in the brain can differentiate into pericytes to promote the restoration of blood-brain barrier (BBB) function after stroke. Understanding the mechanisms underlying BBB restoration post-stroke is crucial to the field. Using lineage tracing, RNA sequencing (RNA-seq), and immunostaining, Li et al. detected the transdifferentiation of endothelial cells (ECs) into E-pericytes in the middle cerebral artery occlusion (MCAO) model. The specific knockout of Tgfbr2 in ECs reduced the number of E-pericytes, exacerbated BBB leakage, and worsened neurological deficits. This observation of EC to pericyte differentiation is novel; however, the conclusions at this stage are not fully supported by the evidence provided.

      (1) The authors claimed, based on the EdU assay, that 12.9% of pericytes present at RP34D originated from self-proliferation, while the origin of the remaining 27.6% of new pericytes remains unclear. This raises concerns, as the EdU assay is not 100% efficient in detecting all proliferating cells. If EdU<sup>+</sup> ECs account for fewer than 10% of all ECs, it follows that other EdU-ECs must have alternative origins.

      That is an interesting question. To address this issue, we need to consider the following aspects:

      (1) The EdU assay is not 100% efficient in detecting all proliferating cells, which means that the actual proportion of proliferating pericytes may be higher than 12.9%, while the proportion of pericytes from other sources may be lower than 27.6% (as determined by FACS). This is consistent with the observation in Figure 3H (immunofluorescence analysis), where EGFP<sup>+</sup> pericytes accounted for only 24.5% of all pericytes.

      (2) The dose of EdU administered in our study was relatively high (200 mg/kg, intraperitoneal injection, daily), which may increase the efficiency of EdU labeling.

      (3) When EdU<sup>+</sup> endothelial cells (ECs) constitute less than 10% of all ECs, it does suggest that EdU-ECs could be a source of pericytes. However, EdU<sup>+</sup> ECs at least do not appear to transform into pericytes, as we did not detect any EdU<sup>+</sup>EGFP<sup>+</sup> pericytes.

      (2) The reference for Cdh5CreERT2 is cited as 25, which is a review article published in ATVB. This review lists many different drivers, and the specific Cdh5CreERT2 line used in this study is not identified. This specificity is critical for accurate lineage tracing of ECs.

      Although the review we cited did not address this, the specificity of Cdh5CreERT2 in the brain has been demonstrated in other studies (Boyé K, et al. Nat Commun. 2022 Mar 4;13(1):1169; Patel A, et al. Proc Natl Acad Sci U S A. 2024 Dec 3;121(49):e2322124121). We have further confirmed that Cdh5CreERT2 specifically labels endothelial cells in the brain parenchyma (Figure S1). Additionally, we found nonspecific labeling in the blood (less than 1% CD45+ blood cells, primarily myeloid cells) and meninges outside the brain parenchyma. We ruled out nonspecific transdifferentiation labeling in the blood through bone marrow reconstitution experiments and in the meninges using in vivo two-photon imaging (results not shown).

      (3) The scRNA-seq data should include GFP signals to track the increasing number of pericytes from early to late stages post-injury. This is the only independent method from staining to verify that the pericytes are indeed derived from GFP<sup>+</sup> ECs after brain injury. Sham samples should be utilized as strict side-by-side controls.

      This is a valuable suggestion. We observed that, despite being positive for EGFP protein, only 50% of the sorted cells expressed the EGFP gene at the transcriptome level. This phenomenon has also been reported in other studies (Rodor J, et al. Cardiovasc Res. 2022 Aug 24;118(11):2519-2534). For these reasons, we did not rely on GFP signals to track the increase in pericyte numbers from early to late stages post-injury.

      (4) Since Ai47 is employed, there are three different variants of green fluorescent proteins, including ZsGreen, which may result in signals being spotted in the staining. The GFP signal detected could also represent dead cells that have lost CD31 expression.

      The detected GFP signal could also originate from dead cells that have lost CD31 expression, which is a plausible explanation. As shown in Figure 3I, EGFP<sup>+</sup> non-ECs peak at RP14D and then decline, suggesting that some EGFP<sup>+</sup> non-ECs either die or revert to endothelial cells (ECs). Therefore, it cannot be ruled out that we captured some dead EGFP<sup>+</sup> non-ECs; however, as indicated in Figure 3I, this proportion is likely less than 25%. Additionally, pericytes are prone to death in ischemic and hypoxic environments (Figure 1A), which explains why some of the transformed EGFP<sup>+</sup> non-ECs may die. Nevertheless, at RP514D, we can still detect EGFP<sup>+</sup> non-ECs, indicating that a subset of these cells can survive for an extended period (Figure S3F).

      (5) The quality of the staining images is not convincing, as some non-ECs and ECs are in close proximity, leading to potential artifacts in signal interpretation. The reviewer cannot rely solely on single staining techniques to be convinced of EC differentiation into pericytes. Although it has been reported that ECs can differentiate into pericytes during development, this phenomenon in the adult brain is surprising; thus, more rigorous evidence with strong lineage tracing data should be provided through multiple measurements.

      Why some non-ECs and ECs are located nearby:

      (1) Non-ECs exhibit characteristics of pericytes, which are typically adjacent to ECs.

      (2) Could this proximity lead to potential artifacts in signal interpretation? We believe this is unlikely, as we also observed a significant number of non-ECs located far from ECs on blood vessels (Figure 3A-B, Figure S3M).

      (3) Three pericyte markers (CD13, NG2, and PDGFRβ) were also used to verify the transformed cells, while the three pericyte markers were not expressed in normal endothelial cells.

      (6) FACS (Fluorescence-activated cell sorting) should be employed to quantitatively assess the contribution of GFP<sup>+</sup> ECs to pericytes at each stage after injury, compared to sham controls.

      Yes, if the contribution of GFP<sup>+</sup> ECs to pericytes could be assessed at each time point, the role of E-pericytes in the pericyte pool could be better explained, and the proportion of E-pericytes would become more prominent. In Figure 3, we did not use FACS to evaluate the contribution of GFP<sup>+</sup> ECs to pericytes at each stage post-injury. Instead, we only assessed the ratio of EGFP<sup>+</sup> non-ECs to all EGFP<sup>+</sup> cells. However, we did verify the contribution of GFP<sup>+</sup> ECs (E-pericytes) to pericytes at RP34D using FACS (CD13+ DsRed/CD13 = 25.6%, Figure 4C). This ratio is consistent with the immunofluorescence data (Figure 3H).

      (7) In Tie2Dre;Mfsd2aCreER;Ai47 mice, ECs in the brain are specifically labeled, indicating that ECs could give rise to CD13+ EGFP<sup>+</sup> non-ECs at RP34D (Figure S3L). However, the GFP signal for Ai47 is not homogeneous, displaying many spotted patterns. Using tdTomato as an alternative for detection could enhance clarity.

      We repeated the experiment using tdTomato as the reporter gene in mice and observed results consistent with those obtained using Ai47 as the reporter gene. For consistency, all results presented are based on Ai47. Regarding the spotted patterns observed with Ai47, this phenomenon can be attributed to the relatively low laser intensity (2%). Higher laser intensity would cause overexposure of EGFP<sup>+</sup> ECs. To address the issue of spotted patterns in Ai47 imaging, we can improve the visualization of complete cell morphology (as shown in Figure S3M) by increasing the gain value, which enhances the background signal.

      (8) The data concerning the genetic ablation of pericytes lacks specificity. There is insufficient evidence to support that DTA is specifically expressed in E-pericytes. The authors should utilize DTR (Diphtheria Toxin Receptor) and confirm that DTR expression is restricted to pericytes derived from GFP<sup>+</sup> ECs. Treatment with diphtheria toxin, but not PBS as a control, should specifically ablate these E-pericytes without affecting any other GFP-pericytes in the brain following injury.

      We did not verify that DTA expression was restricted to E-pericytes. To ensure that DTA is only expressed in converted E-pericytes, we employed two strategies:

      (1) Specific Targeting of Endothelial Cells: We used the AAV-BI30 virus to specifically infect endothelial cells. Although not 100% exclusive, 98.5% of the expression occurred in endothelial cells, with minimal infection in neurons and microglia. Additionally, we combined this with Cdh5CreERT2 to control the DIO action in the virus. This means that only endothelial cells that both express Cdh5CreERT2 and are infected with AAV-BI30 could undergo cell fate changes and transform into pericytes, subsequently expressing markers such as NG2 and driving DTA expression in E-pericytes (Figure 4A).

      (2) Validation of DTA Expression: To prevent off-target expression of DTA in other cell types, we plan to verify DTA protein expression using specific antibodies to confirm whether DTA is expressed in unintended cells. Alternatively, as suggested, we could utilize the Diphtheria Toxin Receptor (DTR) system. By ensuring that DTR expression is restricted to pericytes derived from GFP<sup>+</sup> ECs, treatment with diphtheria toxin would specifically ablate these E-pericytes without affecting other GFP- pericytes in the brain post-injury.

      (9) There is currently no convincing genetic data demonstrating that Tgfb signaling overexpression or deletion modulates the transdifferentiation of ECs to pericytes.

      Yes, this is an important consideration. Although we knocked out the TGFβ receptor in endothelial cells (ECs) and observed a reduction in the formation of E-pericytes (Figure 6D and 6G), it would be more informative to specifically knock out the Tgfb gene in myeloid cells or monocyte-macrophage lineages to determine whether these cells are the primary source of TGFβ driving endothelial cell transformation. Additionally, injecting TGFβ protein directly into the brains of mice could help explore whether exogenous TGFβ promotes the formation of E-pericytes.

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 1D, there does not appear to be a clear PDGFRβ-positive population. In this case, it is necessary to include the negative control that served as the basis for drawing the positive gate.

      Author response image 1 below shows the negative control for CD31 and PDGFRβ.

      Author response image 1.

      (2) Figures 3A-D, Figures S3J-M, the authors statistically compare % negative to % positive. It appears % negative = 100% - % positive. If this is the case, these groups are not independent and should not be statistically compared.

      This is a very important point, and such a comparison is not appropriate. The statistical comparison mentioned above has now been removed.
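The dependence noted by the reviewer can be made concrete with a small sketch. It assumes hypothetical per-animal values of % positive (the numbers below are illustrative and are not data from the study). Because % negative is defined as 100% minus % positive, the two groups are perfectly anticorrelated, and their paired difference carries no information beyond how far % positive lies from 50%.

```python
import math

# Illustrative values only; these are not data from the study.
percent_positive = [72.0, 68.5, 75.2, 70.1]

# Each "negative" value is fully determined by its "positive" partner.
percent_negative = [100.0 - p for p in percent_positive]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# The two groups are perfectly anticorrelated (r is -1 up to rounding).
r = pearson_r(percent_positive, percent_negative)

# Paired differences reduce algebraically to 2 * p - 100, so comparing
# positive vs. negative only asks whether p differs from 50%.
differences = [p - n for p, n in zip(percent_positive, percent_negative)]

print(f"r = {r:.6f}")
print("differences:", differences)
```

If a test is wanted at all in this situation, a one-sample comparison of % positive against a stated reference value would be one way to avoid treating the two complementary percentages as independent groups.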

      (3) Figure 4B, in addition to the cells indicated with arrows, there is a substantial additional DsRed+ signal of similar intensity in this image. It would be helpful to show a negative control.

      Author response image 2 below shows the contralateral and ipsilateral sides, respectively. On the contralateral side, the DsRed signal is sparse, lacks complete cell morphology, and is separated from the Hoechst+ nuclei. On the ipsilateral side, the DsRed signals are strong, show intact cell morphology, and are closely associated with the Hoechst+ nuclei. Some DsRed signals on the ipsilateral side may come from dying cells.

      Author response image 2.

      (4) Figure 6G, the y-axis title is "E-pericytes/all EGFP<sup>+</sup> cells (%)" but the y-axis scale goes from 0 to 900. Is this an error?

      Thank you. We intended to report the number of E-pericytes per unit area; the y-axis title should be "E-pericytes/mm<sup>2</sup>".

      (5) Figure 9B, in the representative images, the 6th group is labeled "Tgfb2 + DTA" but in the plot below, the 6th group is labeled Tgfbr2 + DsRed. Which is correct?

      Thank you. The label "Tgfb2 + DTA" is correct. We have changed the label of the 6th group in Figure 9B to "Tgfb2 + DTA".

      (6) Figure S1I, error bars and/or individual data points should be shown.

      The purpose of this diagram is to demonstrate the number of mice in which EGFP<sup>+</sup> cells are 100% co-labeled with endothelial markers (CD31, ERG, GLUT1, and VE-Cadherin), as EGFP<sup>+</sup> cells are exclusively found in endothelial cells within the brain parenchyma. Additionally, the diagram illustrates the number of mice in which EGFP<sup>+</sup> cells show no co-labeling (0%) with mural cell markers (CD13, PDGFRβ, α-SMA, and NG2), as EGFP<sup>+</sup> cells are not present in mural cells within the brain parenchyma.

      (7) The authors write: "When Tgfbr2 was overexpressed and DTA was expressed specifically in the same ECs, DTA prevented the EC-specific overexpression of the Tgfbr2 gene and increased the proportion of E-pericytes.". The authors' strategy for DTA expression involves the NG2 promoter, which, in principle, is not active in ECs. Thus how can DTA be "expressed specifically in the same ECs" and how can DTA "prevent EC-specific overexpression" of Tgfbr2?

      Our meaning was not clearly expressed. The statement should be revised to: “When Tgfbr2 was overexpressed to increase E-pericytes and DTA was expressed in transformed cells to deplete E-pericytes, we found that there was no significant change in the number of E-pericytes in the Tgfbr2 + DTA group compared with the DTA group.”

      (8) The interpretation of Evans blue leakage as "low molecular weight" leakage should be revised since Evans blue binds serum albumin and thus it is the molecular weight of this complex (~67 kDa) that is relevant.

      We agree with the reviewer. Evans blue should not be described as low molecular weight, as it binds to serum albumin to form complexes. The text has been revised to: “Interestingly, no obvious leakage of dextran-rhodamine B (~70 kDa) (Figure S8C) or Texas Red (~71 kDa) was detected (Figure S8D). However, the elimination of E-pericytes allowed Evans blue and trypan blue to cross the blood-brain barrier (BBB).”

      (9) It is critical that the sequencing data be made available through a public repository (such as GEO).

      Thank you. Now we've uploaded it to GEO.

      (10) It would be extremely helpful if the authors would make their viral plasmids available through a public repository (such as Addgene).

      Thank you. We have now uploaded them to Addgene (assignment of the Addgene number is pending).

      Reviewer #3 (Recommendations for the authors):

      (1) The distribution and expression of pericytic and fibroblast markers at different time points after stroke is confusing while reading the manuscript, e.g., vimentin is not expressed on day 34 but on day 8, whereas CD13 is expressed on day 34 but not on day 8, if I understood the text correctly. To make it easier to follow, the authors could add a label of the day after stroke to each of the subfigures which show images and co-expression of different markers (e.g. Figures 3 and S3).

      The table below lists the expression of specific markers in each cell type.

      “√” stands for positive; “×” stands for negative.

      Author response table 1.

      (2) The authors need to check the N numbers again, e.g., Figure S3L: 4 dots per group are shown in the graph but an N of 3 is mentioned in the legend.

      Thank you for raising this important point. The N has been corrected to 4 in the legend of Figure S3L. We also checked the other N numbers.

      (3) Labelling of graphs should be consistent (e.g., S4C: "I-ECs" vs. S4F: "Ipsi-ECs") and correct (e.g., "DsRed" instead of "DeRed" in Figure 4B).

      Yes, we need uniform labels: "Ipsi-ECs" and "DsRed". Thank you.

      (4) Figure 4: In the text, the injection is described to be done on day 34 whereas in Figure 4A the injections are described to take place before MCAO, please clarify. Does day 34 mean 34 days after injection or after MCAO (as in the former experiments)?

      In the text, the sentence, “Then we used AAV2/9-BI30-NG2 promoter-DIO-DTA (DTA) to deplete E-pericytes at RP34D (Figure 4D),” could be misinterpreted as suggesting that the virus was injected at RP34D. To avoid confusion, it has been revised to: “We used the AAV2/9-BI30-NG2 promoter-DIO-DTA (DTA) virus, which was injected before MCAO (Figure 4A), to deplete E-pericytes (Figure 4D).” Yes, day 34 means 34 days after injection or after MCAO, and we have unified this as 34 days.

      (5) Some images are too dark to recognize clear structures and prove the findings (e.g., Figure S6B).

      Thank you for raising this important point.

      (6) There is no Figure S8D (as mentioned in the text).

      Thank you for raising this important point. This problem has been corrected.

      (7) Figure S9: the text only states, that Tgfbr2 overexpression increases CBF recovery and effective perfusion. Also with the legend, it is not clear what was done and measured, especially in Figure S9B - what do the graphs show? Also, the y-axis labeling is missing for the traces.

      In Figure S9A, we assessed changes in blood flow using laser speckle imaging. Laser speckle imaging relies on random interference patterns formed by scattered light when a laser strikes tissue. Moving red blood cells alter the contrast of the speckle pattern: faster blood flow results in quicker speckle changes and lower contrast, while slower blood flow leads to slower speckle changes and higher contrast. By analyzing these changes in speckle contrast, blood flow dynamics can be evaluated in real-time and non-invasively.
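The relationship described above can be sketched numerically. The snippet below is a minimal illustration, assuming the conventional definition of speckle contrast as the ratio of the standard deviation to the mean intensity within a small pixel window; the window values are made up for illustration and the function name is hypothetical, not taken from the authors' analysis pipeline.

```python
from statistics import mean, pstdev


def speckle_contrast(intensities):
    """Return K = sigma / mean for the pixel intensities in one window.

    The population standard deviation is used here; a sample standard
    deviation is an equally common choice and is an assumption either way.
    """
    mu = mean(intensities)
    if mu == 0:
        raise ValueError("mean intensity must be non-zero")
    return pstdev(intensities) / mu


# Hypothetical flattened 3x3 windows (arbitrary intensity units).
# Slow flow: speckle stays sharp during the exposure, so contrast is high.
slow_flow_window = [10, 200, 35, 180, 15, 220, 40, 190, 20]
# Fast flow: speckle blurs during the exposure, so contrast is low.
fast_flow_window = [95, 110, 100, 105, 98, 102, 108, 97, 101]

k_slow = speckle_contrast(slow_flow_window)
k_fast = speckle_contrast(fast_flow_window)
print(f"K (slow flow) = {k_slow:.3f}, K (fast flow) = {k_fast:.3f}")
```

In practice the contrast is computed over a sliding window across the whole image, and converting it to a flow index requires an additional model that is not shown here.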

      In Figure S9B, we measured blood flow changes using laser Doppler flowmetry (LDF). When a laser interacts with flowing blood, the moving red blood cells scatter the light, causing a frequency shift (Doppler shift). Faster blood flow results in a greater frequency shift, while slower blood flow leads to a smaller frequency shift. By detecting the frequency shift of the scattered light, blood flow velocity and changes can be measured in real time and non-invasively. In LDF, the unit of the vertical axis is typically Perfusion Units (PU). PU is a relative unit used to represent changes in blood flow rather than absolute blood flow velocity. These methods have now been further explained in the diagram.
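The proportionality between flow speed and frequency shift can be illustrated with a short sketch. It assumes the textbook single-backscatter approximation, in which the shift is twice the velocity component along the beam divided by the wavelength in the medium. The wavelength and speeds below are hypothetical, and real tissue measurements involve multiple scattering at many angles, which is one reason LDF reports relative Perfusion Units rather than absolute velocity.

```python
import math


def doppler_shift_hz(speed_m_per_s, wavelength_m, angle_deg=0.0):
    """Frequency shift for light backscattered from a moving scatterer.

    angle_deg is the angle between the scatterer's velocity and the beam axis.
    Assumes a single backscattering event (illustrative simplification).
    """
    along_beam = speed_m_per_s * math.cos(math.radians(angle_deg))
    return 2.0 * along_beam / wavelength_m


# Hypothetical values: a 780 nm source and red blood cells at 1 or 2 mm/s.
wavelength = 780e-9
shift_slow = doppler_shift_hz(1e-3, wavelength)
shift_fast = doppler_shift_hz(2e-3, wavelength)
print(f"slow: {shift_slow:.0f} Hz, fast: {shift_fast:.0f} Hz")
```

Doubling the speed doubles the shift in this approximation, and motion perpendicular to the beam produces essentially no shift.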

      (8) Which regions of the brain were used to take images (e.g., to count neurons)?

      We captured images and quantified neurons in the cortex and striatum of the brain. Our statistical analysis further demonstrated that, at RP34D, the presence of E-pericytes in the brain does not exhibit region-specificity. Instead, the formation of E-pericytes is driven by TGFβ1, which is regulated by immune cells. Ultimately, the distribution and activity of these immune cells are influenced by the severity of ischemia and hypoxia.

      (9) The sentence "Protein C receptor-expressing (Procr+) ECs could give rise to de novo formation of ECs and pericytes in the mammary gland13." is repeated almost identically in three different places in the text. However, whether Procr+ cells are involved in the described transdifferentiation or whether "E-pericytes" do express the protein C receptor is not shown and needs additional investigation.

      The reason for referencing this literature is to highlight that endothelial cells (ECs) can give rise to pericytes during breast development, which serves as background knowledge supporting our research. To further explore this phenomenon in the brain, we used ProcrCreERT2;Ai47 mice subjected to MCAO (middle cerebral artery occlusion) to investigate whether Procr+ ECs could transform into pericytes, similar to what occurs in mammary glands. However, since ProcrCreERT2 labels not only ECs but also pericytes in the brain, the results did not achieve our goal and were therefore not included in the study.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:  

      This paper investigates the relationship between ocular drift - eye movements long thought to be random - and visual acuity. This is a fundamental issue for how vision works. The work uses adaptive optics retinal imaging to monitor eye movements and where a target object is in the cone photoreceptor array. The surprising result is that ocular drift is systematic - causing the object to move to the center of the cone mosaic over the course of each perceptual trial. The tools used to reach this conclusion are state-of-the-art and the evidence presented is convincing.

      Strengths  

      P1.1. The central question of the paper is interesting, as far as I know, it has not been answered in past work, and the approaches employed in this work are appropriate and provide clear answers.

      P1.2. The central finding - that ocular drift is not a completely random process - is important and has a broad impact on how we think about the relationship between eye movements and visual perception.

      P1.3. The presentation is quite nice: the figures clearly illustrate key points and have a nice mix of primary and analyzed data, and the writing (with one important exception) is generally clear.

      Thank you for your positive feedback.

      Weaknesses

      P1.4. The handling of the Nyquist limit is confusing throughout the paper and could be improved. It is not clear (at least to me) how the Nyquist limit applies to the specific task considered. I think of the Nyquist limit as saying that spatial frequencies above a certain cutoff set by the cone spacing are being aliased and cannot be disambiguated from the structure at a lower spatial frequency. In other words, there is a limit to the spatial frequency content that can be uniquely represented by discrete cone sampling locations. Acuity beyond that limit is certainly possible with a stationary image - e.g. a line will set up a distribution of responses in the cones that it covers, and without noise, an arbitrarily small displacement of the line would change the distribution of cone responses in a way that could be resolved. This is an important point because it relates to whether some kind of active sampling or movement of the detectors is needed to explain the spatial resolution results in the paper. This issue comes up in the introduction, results, and discussion. It arises in particular in the two Discussion paragraphs starting on line 343.

      We thank you for pointing out a possible confusion for readers. Overall, we contrast our results to the static Nyquist limit because it is generally regarded as the upper limit of resolution acuity. We updated our text in a few places, especially the Discussion, and added a reference to make our use of the Nyquist limit clearer.

      We agree with the reviewer on how the Nyquist limit is interpreted within the context of visual structure. If visual structure is under-sampled, it is not lost, but creates new, interfered visual structure at lower spatial frequency. For regular patterns like gratings, interference patterns akin to Moiré patterns may emerge. These have been shown to occur in the human eye, and their form depends on the arrangement and regularity of the photoreceptor mosaic (Williams, 1985). We note, however, that the successfully resolved lower-frequency pattern does not necessarily carry the same structural information, specifically orientation, and the aliased structure might indeed mask the original stimulus. Please compare Figure 1f, where we show individual static snapshots of such aliased patterns, especially visible when the optotypes are small (towards the lower right of the figure). We note that theoretical work predicts that, with prior knowledge about the stimulus, even such static images might be possible to de-alias (Ruderman & Bialek, 1992). We added this to our manuscript.
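      The aliasing argument above can be illustrated with a minimal one-dimensional sketch. It assumes a regular row of point-like receptors rather than the real two-dimensional cone mosaic, and all numbers (receptor spacing, grating frequencies) are hypothetical values chosen for illustration, not values from the manuscript. It shows that a grating above the Nyquist limit produces exactly the same receptor samples as a lower-frequency alias.

```python
import math

def sample_grating(freq_cpd, spacing_deg, n_samples):
    """Sample a cosine grating (cycles/deg) at regularly spaced receptor positions."""
    return [math.cos(2 * math.pi * freq_cpd * i * spacing_deg)
            for i in range(n_samples)]

# Hypothetical numbers for illustration only (not taken from the manuscript).
spacing_deg = 1.0 / 120.0            # receptor spacing in degrees
sampling_rate = 1.0 / spacing_deg    # about 120 samples per degree
nyquist = sampling_rate / 2.0        # about 60 cycles per degree

stimulus_freq = 80.0                         # grating above the Nyquist limit
alias_freq = sampling_rate - stimulus_freq   # about 40 cycles per degree

above = sample_grating(stimulus_freq, spacing_deg, 24)
alias = sample_grating(alias_freq, spacing_deg, 24)

# The two gratings are indistinguishable from the receptor samples alone.
max_diff = max(abs(a - b) for a, b in zip(above, alias))
print(f"Nyquist limit: {nyquist:.1f} c/deg, alias at {alias_freq:.1f} c/deg")
print(f"largest sample difference: {max_diff:.2e}")
```

      In this static, one-dimensional case the under-sampled structure is not lost but reappears at a lower frequency, which is the sense in which the static Nyquist limit is used as a reference in the response.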

      We think the reviewer’s following point, about the resolution of a line position, is only partially connected to the first, however. In our manuscript we note in the Introduction that resolution of the relative position of visual objects is a so-called hyperacuity phenomenon. The fact that it occurs in humans and other animals demonstrates that visual brains have come up with neuronal mechanisms to determine relative stimulus position with sub-Nyquist resolution. The exact mechanism is however not fully clear. One solution is that relative cone signal intensities could be harnessed, similar to what is employed technically, e.g. in a quadrant-cell detector. Its positional precision is much higher than the individual cell’s size (or Nyquist limit), predominantly determined by the detector’s sensitivity and to a lesser degree its size. On the other hand, such a detector, being hyperacute with object location, would not have the same resolution for, for instance, letter-E orientation discrimination.
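      The quadrant-cell detector analogy can be sketched as follows. This is a minimal, noise-free illustration under stated assumptions: the light spot is a symmetric Gaussian of known width, the four cells extend far beyond the spot, and the function names (`quadrant_signals`, `estimate_position`) are hypothetical. It shows that a displacement much smaller than a cell can be recovered from relative signal intensities alone.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)

def quadrant_signals(x0, y0, sigma):
    """Fraction of light on each quadrant for a Gaussian spot centred at (x0, y0)."""
    right = STD_NORMAL.cdf(x0 / sigma)   # fraction of light falling at x > 0
    top = STD_NORMAL.cdf(y0 / sigma)     # fraction of light falling at y > 0
    return {
        "top_right": right * top,
        "top_left": (1 - right) * top,
        "bottom_left": (1 - right) * (1 - top),
        "bottom_right": right * (1 - top),
    }

def estimate_position(signals, sigma):
    """Recover the spot centre from relative quadrant intensities (assumes known sigma)."""
    total = sum(signals.values())
    right = (signals["top_right"] + signals["bottom_right"]) / total
    top = (signals["top_right"] + signals["top_left"]) / total
    return sigma * STD_NORMAL.inv_cdf(right), sigma * STD_NORMAL.inv_cdf(top)

# Hypothetical example: displacement of 1-2 % of a cell width (cell width = 1 unit).
true_x, true_y, sigma = 0.01, -0.02, 0.5
est_x, est_y = estimate_position(quadrant_signals(true_x, true_y, sigma), sigma)
print(f"estimated centre: ({est_x:.4f}, {est_y:.4f})")
```

      Without noise, the estimate is limited only by numerical precision, which mirrors the statement that positional precision depends mainly on detector sensitivity. The same four signals carry no information about the orientation of a small optotype, which is the distinction drawn in the response.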

      Note that in all the above occasions, a static image-sensor-relationship is assumed. In our paper, we were aiming to convey, like others did before, that a moving stimulus may give rise to sub-Nyquist structural resolution, beyond what is already known for positional acuity and hence, classical hyperacuity. 

      Based on the data shown in this manuscript and other experimental data currently collected in the lab, it seems to us that eye movements are indeed the crucial point in achieving sub-Nyquist resolution. For example, ultra-short presentation durations, allowing virtually no retinal slip, push thresholds close to the Nyquist limit and above. Furthermore, with AOSLO stimulation, it is possible to stabilize a stimulus on the retina, which would be a useful tool for studying this hypothesis. Our current level of stabilization is however not accurate enough to completely mitigate retinal image motion in the foveola, where cells are smallest, and transients could occur. From what we observe, and from other studies that looked at resolution thresholds at more peripheral retinal locations, we would predict that foveolar resolution of a perfectly stabilized stimulus would indeed be limited by the Nyquist limit of the receptor mosaic.

      P1.5. One question that came up as I read the paper was whether the eye movement parameters depend on the size of the E. In other words, to what extent is ocular drift tuned to specific behavioral tasks?

      This is an interesting question. Unfortunately, the experimental data collected for the current manuscript do not contain enough dispersion in target size to give a definitive answer. A larger range of stimulus sizes and especially a similar number of trials per size would be required. Nonetheless, when individual trials were re-grouped to percentiles of all stimulus sizes (scaled for each eye individually), we found that drift length and directionality were not significantly different between any percentile group of stimulus sizes (Wilcoxon signed-rank test, p > 0.12, see also Author response image 1). Our experimental trials started with a stimulus demanding visual acuity of 20/16 (logMAR = -0.1), therefore all presented stimulus sizes were rather close to threshold. The high visual demand in this AO resolution task might bring the oculomotor system to a limit, where ocular drift length can’t be decreased further. However, given the limitation due to the small range of stimulus sizes, further investigations would be needed. Given this, and because this topic is also ongoing research in our lab where more complex dynamics of FEM patterns are also considered, we refrain from showing this analysis in the current manuscript.

      Author response image 1.

      Drift length does not depend on stimulus sizes close to threshold. All experimental trials were sorted by stimulus size and then grouped into percentiles for each participant (left). Additionally, 10 % of trials with stimulus sizes just above or below threshold are shown for comparison (right). For each group, median drift lengths (z-scored) are shown as box and whiskers plot. Drift length was not significantly different across groups.  

      Reviewer #2 (Public Review):

      Summary:

      In this work, Witten et al. assess visual acuity, cone density, and fixational behavior in the central foveal region in a large number of subjects.

      This work elegantly presents a number of important findings, and I can see this becoming a landmark work in the field. First, it shows that acuity is determined by the cone mosaic, hence, subjects characterized by higher cone densities show higher acuity in diffraction-limited settings. Second, it shows that humans can achieve higher visual resolution than what is dictated by cone sampling, suggesting that this is likely the result of fixational drift, which constantly moves the stimuli over the cone mosaic. Third, the study reports a correlation between the amplitude of fixational motion and acuity, namely, subjects with smaller drifts have higher acuities and higher cone density. Fourth, it is shown that humans tend to move the fixated object toward the region of higher cone density in the retina, lending further support to the idea that drift is not a random process, but is likely controlled. This is a beautiful and unique work that furthers our understanding of the visuomotor system and the interplay of anatomy, oculomotor behavior, and visual acuity.

      Strengths:

      P2.1. The work is rigorously conducted, it uses state-of-the-art technology to record fixational eye movements while imaging the central fovea at high resolution and examines exactly where the viewed stimulus falls on individuals' foveal cone mosaic with respect to different anatomical landmarks in this region. The figures are clear and nicely packaged. It is important to emphasize that this study is a real tour-de-force in which the authors collected a massive amount of data on 20 subjects. This is particularly remarkable considering how challenging it is to run psychophysics experiments using this sophisticated technology. Most of the studies using psychophysics with AO are, indeed, limited to a few subjects. Therefore, this work shows a unique set of data, filling a gap in the literature.

      Thank you, we are very grateful for your positive feedback.

      Weaknesses:

      P2.2. No major weakness was noted, but data analysis could be further improved by examining drift instantaneous direction rather than start-point-end-point direction, and by adding a statistical quantification of the difference in direction tuning between the three anatomical landmarks considered.

      Thank you for these two suggestions. We now show the development of directionality with time (after the first frame, 33 ms as well as 165 ms, 330 ms and 462 ms), and performed a Rayleigh test for non-uniformity of circular data. Please also see our response to comment R2.4.

      Briefly, directional tuning was already visible at 33 ms after stimulus onset and continuously increased with longer analysis duration. Directionality is thus less pronounced at shorter analysis windows. These results have been added to the text and figures (Figure 4 - figure supplement 1).

      The statistical tests showed that circular sample directionality was not uniformly distributed for all three retinal locations. The circular average was between -10 ° and 10 ° in all cases and the variance decreased with increasing time (from 48.5 ° to 34.3 ° for CDC, 49.6 ° to 38.6 ° for PRL and 53.9 ° to 43.4 ° for PCD location, between frame 2 and 15). As we have discussed in the paper, we would expect all three locations to come out as significant, given their vicinity to the CDC (which is systematic in the case of PRL, and random in the case of PCD, see also comment R2.2).
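      For readers unfamiliar with circular statistics, the Rayleigh test mentioned above can be sketched as follows. This is a minimal illustration, not the analysis code used for the manuscript: the p-value uses a closed-form approximation that is common in circular statistics toolboxes (an assumption, since the software used is not stated here), and the angle lists are made-up example data.

```python
import math

def circular_mean_and_resultant(angles_deg):
    """Return the circular mean (deg) and mean resultant length r (0..1)."""
    n = len(angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    return math.degrees(math.atan2(s, c)), math.hypot(c, s)

def rayleigh_test(angles_deg):
    """Rayleigh test for non-uniformity; returns (z statistic, approximate p-value)."""
    n = len(angles_deg)
    _, r = circular_mean_and_resultant(angles_deg)
    big_r = n * r                      # resultant length
    z = big_r ** 2 / n                 # Rayleigh z statistic
    # Closed-form approximation of the p-value (assumed, see lead-in).
    p = math.exp(math.sqrt(1 + 4 * n + 4 * (n ** 2 - big_r ** 2)) - (1 + 2 * n))
    return z, p

# Made-up example data: drift directions (deg) relative to a retinal landmark.
directed = [-40, -25, -10, -5, 0, 5, 10, 20, 30, 45]
uniform = list(range(0, 360, 30))

mean_dir, r_dir = circular_mean_and_resultant(directed)
z_dir, p_dir = rayleigh_test(directed)
z_uni, p_uni = rayleigh_test(uniform)
print(f"directed: mean = {mean_dir:.1f} deg, r = {r_dir:.2f}, p = {p_dir:.2g}")
print(f"uniform:  p = {p_uni:.2g}")
```

      A small p-value indicates that directions cluster around a preferred angle. The test does not say which landmark is targeted, which is why significance for all three nearby landmarks is expected.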

      Reviewer #3 (Public Review):

      Summary:

      The manuscript by Witten et al., titled "Sub-cone visual resolution by active, adaptive sampling in the human foveola," aims to investigate the link between acuity thresholds (and hyperacuity) and retinal sampling. Specifically, using in vivo foveal cone-resolved imaging and simultaneous microscopic photostimulation, the researchers examined visual acuity thresholds in 16 volunteers and correlated them with each individual's retinal sampling capacity and the characteristics of ocular drift.

      First, the authors found that although visual acuity was highly correlated with the individual spatial arrangement of cones, for all participants, visual resolution exceeded the Nyquist sampling limit - a well-known phenomenon in the literature called hyperacuity.

      Thus, the researchers hypothesized that this increase in acuity, which could not be explained in terms of spatial encoding mechanisms, might result from exploiting the spatiotemporal characteristics of visual input, which is continuously modulated over time by eye movements even during so-called fixations (e.g., ocular drift).

      Authors reported a correlation between subjects, between acuity threshold and drift amplitude, suggesting that the visual system benefits from transforming spatial input into a spatiotemporal flow. Finally, they showed that drift, contrary to the traditional view of it as random involuntary movement, appears to exhibit directionality: drift tends to move stimuli to higher cone density areas, therefore enhancing visual resolution.

      Strengths:

      P3.1. The work is of broad interest, the methods are clear, and the results are solid.

      Thank you.

      Weaknesses:

      P3.2. Literature (1/2): The authors do not appear to be aware of an important paper published in 2023 by Lin et al. (https://doi.org/10.1016/j.cub.2023.03.026), which nicely demonstrates that (i) ocular drifts are under cognitive influence, and (ii) specific task knowledge influences the dominant orientation of these ocular drifts even in the absence of visual information. The results of this article are particularly relevant and should be discussed in light of the findings of the current experiment.

      Thank you for pointing to this important work which we were aware of. It simply slipped through during writing. It is now discussed in lines 390-393. 

      P3.3. Literature (2/2): The hypothesis that hyperacuity is attributable to ocular movements has been proposed by other authors and should be cited and discussed (e.g., https://doi.org/10.3389/fncom.2012.00089, https://doi.org/10.10

      Thank you for pointing us towards these works which we have now added to the Discussion section. We would like to stress however, that we see a distinction between classical hyperacuity phenomena (Vernier, stereo, centering, etc.) as a form of positional acuity, and orientation discrimination.  

      P3.4. Drift Dynamic Characterization: The drift is primarily characterized as the "concatenated vector sum of all frame-wise motion vectors within the 500 ms stimulus duration.". To better compare with other studies investigating the link between drift dynamics and visual acuity (e.g., Clark et al., 2022), it would be interesting to analyze the drift-diffusion constant, which might be the parameter most capable of describing the dynamic characteristics of drift.

      During our analysis, we have computed the diffusion coefficient (D) and it showed qualitatively similar results to the drift length (see figures below). We decided to not show these results, because we are convinced that D is indeed not the most capable parameter to describe the typical drift characteristic seen here. The diffusion coefficient is computed as the slope of the mean square displacement (MSD). In our view, there are two main issues with applying this metric to our data, one conceptual, one factual:

      (1) Computation of a diffusion coefficient is based upon the assumption that the underlying movement is similar to a random walk process. From a historical perspective, where drift has been regarded as more random, this makes sense. We also agree that D can serve as a valuable metric, depending on the individual research question. In our data, however, we clearly show that drift is not random, and a metric quantifying randomness is thus ill-defined. 

      (2) We often observed out- and in-type motion traces, i.e. where the eye somewhat backtracks towards where it started. Such traces are just as long (and fast) as traces with a single direction, but D would be much smaller in this case, as the MSD first increases and then decreases. In reality, the same number of cones would have been traversed as with the larger D of straight outward movement, albeit not unique cones. For our current analyses, the drift length captures this relationship better.
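      The second point can be made concrete with a minimal sketch comparing drift length, amplitude and a diffusion coefficient for two synthetic traces. The assumptions are ours: the MSD is averaged over all time lags, D is taken as the slope of a line through the origin divided by 4 (two-dimensional case), the frame interval is set to roughly 33 ms, and the traces are idealized examples, not recorded data.

```python
import math

def drift_length(trace):
    """Sum of frame-wise displacements (path length of the trace)."""
    return sum(math.dist(a, b) for a, b in zip(trace, trace[1:]))

def amplitude(trace):
    """Distance between the start and end point of the trace."""
    return math.dist(trace[0], trace[-1])

def mean_square_displacement(trace, max_lag):
    """Lag-averaged MSD for lags 1..max_lag (in frames)."""
    msd = []
    for lag in range(1, max_lag + 1):
        sq = [math.dist(trace[i], trace[i + lag]) ** 2
              for i in range(len(trace) - lag)]
        msd.append(sum(sq) / len(sq))
    return msd

def diffusion_coefficient(trace, dt, max_lag):
    """Slope of MSD vs. time lag (least squares through the origin), divided by 4."""
    msd = mean_square_displacement(trace, max_lag)
    lags = [dt * k for k in range(1, max_lag + 1)]
    slope = sum(t * m for t, m in zip(lags, msd)) / sum(t * t for t in lags)
    return slope / 4.0

dt = 1.0 / 30.0  # assumed frame interval in seconds (about 33 ms)

# Idealized traces with 16 unit steps each (arbitrary units, e.g. arcmin).
straight = [(float(i), 0.0) for i in range(17)]
out_and_back = [(float(i), 0.0) for i in range(9)] + \
               [(float(i), 0.0) for i in range(7, -1, -1)]

len_straight, len_back = drift_length(straight), drift_length(out_and_back)
d_straight = diffusion_coefficient(straight, dt, max_lag=16)
d_back = diffusion_coefficient(out_and_back, dt, max_lag=16)
print(f"drift length: {len_straight:.1f} vs {len_back:.1f}")
print(f"amplitude:    {amplitude(straight):.1f} vs {amplitude(out_and_back):.1f}")
print(f"D:            {d_straight:.1f} vs {d_back:.1f}")
```

      Both traces cover the same drift length, yet the out-and-back trace yields a much smaller D and zero amplitude, which is the reason given above for preferring drift length in this analysis.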

      Author response image 2.

      Diffusion coefficient (D) and the relation to visual acuity (see Figure 3 e-g for comparison to drift length). a, D was strongly correlated between fellow eyes. b, Cone density and D were not significantly correlated. c, The median D had a moderate correlation with visual acuity thresholds in dominant as well as non-dominant eyes. Dominant eyes are indicated by filled, non-dominant eyes by open markers.

      We would like to put forward that, in general, better metrics are needed, especially in respect to the visual signals arising from the moving eye. We are actively looking into this in follow-up work, and we hope that the current manuscript might spark also others to come up with new ways of characterizing the fine movements of the eye during fixation.

      P3.5. Possible inconsistencies: Binocular differences are not expected based on the hypothesis; the authors may speculate a bit more about this. Additionally, the fact that hyperacuity does not occur with longer infrared wavelengths but the drift dynamics do not vary between the two conditions is interesting and should be discussed more thoroughly.

      Binocularity: the differences in performance between fellow eyes are rather subtle, and we do not have a firm grip on differences other than the cone mosaic and fixational motor behavior between the two eyes. We would rather not speculate beyond what we already do, namely that some factor related to the development of ocular dominance is at play. What we do show with our data is that cone density and drift patterns seem to have no part in it.

      Effect of wavelength: even with the longer 840 nm wavelength, most eyes resolve below the Nyquist limit, with a general increase in thresholds (getting worse) compared to 788 nm. As we wrote in the manuscript, we assume that the increased image blur and reduced cone contrast introduced by the longer wavelength are key to why there is an overall reduction in acuity. No changes were made to the manuscript. As a more general remark, we would not consider the sub-Nyquist performance seen in our data to be hyperacuity, although technically it is. The reason is that hyperacuity is usually associated with stimuli that require resolving positional shifts, and not orientation. There is a log unit of difference between thresholds in these tasks.

      P3.6. As a Suggestion: can the authors predict the accuracy of individual participants in single trials just by looking at the drift dynamics?

      That’s a very interesting point that we are indeed currently looking at in another project. As a comment, we can add that by purely looking at the drift dynamics in the current data, we could not predict the accuracy (percent correct) of the participant. When comparing drift length or diffusion coefficients between trials with correct or false responses, we do not observe a significant difference. Also, when adding an anatomical correlate and comparing between trials where sampling density increases or decreases, there is no significant trend. We think that it is a more complex interplay between all the influencing factors that can perhaps be met by a model considering all drift dynamics, photoreceptor geometry and stimulus characteristics.

      No changes were made to the manuscript.

      Recommendations for the authors:

      Reviewing Editor (Recommendations For The Authors):

      As you will see, the reviewers were quite enthusiastic about your work, but have a few issues for your consideration. We hope that this is helpful. We'll consider any revisions in composing a final eLife assessment.

      Reviewer #1 (Recommendations For The Authors):

      R1.1:  Discussion of myopia. Myopia takes a fair bit of space in the Discussion, but the paper does not include any subjects that are sufficiently myopic to test the predictions. I would suggest reducing the amount of space devoted to this issue, and instead making the prediction that myopia may help with resolution quickly. The introduction (lines 54-56) left me expecting a test of this hypothesis, and I think similarly that issue could be left out of the introduction.

      We have removed this part from the Introduction and shortened the Discussion.  

      R1.2: Line 118: define CDC here.

      Thank you for pointing this out, it is now defined at this location.  

      R1.3: Line 159-162: suggest breaking this sentence into two. This sentence also serves as a transition to the next section, but the wording suggests it is a result that is shown in the prior section. Suggest rewording to make the transition part clear. Maybe something like "Hence the spatial arrangement of cones only partially ... . Next we show that ocular motion and the associated ... are another important factor."

      Text was changed as suggested.  

      R1.4.: Figure 3: The retina images are a bit hard to see - suggest making them larger to take an entire row. As a reader, I also was wondering about the temporal progression of the drift trajectories and the relation to the CDC. Since you get to that in Figure 4, you could clarify in the text that you are starting by analyzing distance traveled and will return to the issue of directed trajectories.

      Visibility was probably an issue during the initial submission and review process, where images were produced at lower resolution. The original figures are of sufficient resolution to fully appreciate the underlying cone mosaic, and readers will later be able to zoom in in the online publication.

      We added a mention of the order of analysis in the Results section (LL 163-165).

      R1.5: Line 176: define "sum of piecewise drift amplitude" (e.g. refer to Figure where it is defined).

      We now refer to this metric as the drift length (as rightfully pointed out by reviewer #2), and added its definition at this location.

      R1.6: Lines 205-208: suggest clarifying this sentence is a transition to the next section. As for the earlier sentence mentioned above, this sounds like a result rather than a transition to an issue you will consider next.

      This sentence was changed to make the transition clearer. 

      R1.7: Line 225: suggest starting a new paragraph here.

      Done as suggested.

      Reviewer #2 (Recommendations For The Authors):

      I don't have any major concerns, mostly suggestions and minor comments.

      R2.1: (1) The authors use piecewise amplitude as a measure of the amount of retinal motion introduced by ocular drift. However, to me, this sounds like what is normally referred to as the path length of a trace rather than its amplitude. I would suggest using the term length rather than amplitude, as amplitude is normally considered the distance between the starting and the ending point of a trace.

      This was changed as suggested throughout the manuscript. 

      R2.2: (2) It would be useful to elaborate more on the difference between CDC and PCD, I know the authors do this in other publications, but to the naïve reader, it comes a bit as a surprise that drift directionality is toward the CDC but less so toward the PCD. Is the difference between these metrics simply related to the fact that defining the PCD location is more susceptible to errors, especially if image quality is not optimal? If indeed the PCD is the point of peak cone density, assuming no errors or variability in the estimation of this point, shouldn't we expect drift moving stimuli toward this point, as the CDC will be characterized by a slightly lower density? I.e., is the absence of a PCD directionality trend as strong as the trend seen for the CDC simply the result of variability and error in the estimate of the PCD or it is primarily due to the distribution of cone density not being symmetrical around the PCD?

      Thank you for this comment. We already refer in the Methods section to the respective papers where this difference is analyzed in more detail, and shortly discuss it here.

      To briefly answer the reviewer’s final question: PCD location is too variable, and ought to be avoided as a retinal landmark. While we believe there is value in reporting the PCD as a metric of maximum density, it has been shown recently (Reiniger et al., 2021; Warr et al., 2024; Wynne et al., 2022) and is visible in our own (partly unpublished) data, that its location will change with changing one or more of these factors: cone density metric, window size or cone quantity selected, cone annotation quality, image quality (e.g. across days), individual grader, annotation software, and likely more. Each of these factors alone can change the PCD location quite drastically, all while of course, the retina does not change. The CDC on the other hand, given its low-pass filtering nature, is immune to the aforementioned changes within a much wider range and will thus reflect the anatomical and, shown here, functional center of vision, better. However, there will always be individual eyes where PCD location and the CDC are close, and thus researchers might be inclined to also use the PCD as a landmark. We strongly advise against this. In a way, the PCD is a non-sense location while its dimension, density, can be a valuable metric, as density does not vary that much (see e.g. data on CDC density and PCD density reported in this manuscript).  

      Below we append a direct comparison of PCD vs CDC location stability when only one of the mentioned factors is changed. Sixteen retinas imaged on two different days were annotated and analyzed by the same grader with the same approach, and the differences in both locations are shown.

      Author response image 3.

      Reproducibility of CDC and PCD location in comparison. Two retinal mosaics which were recorded at two different timepoints, maximum 1 year apart from each other, were compared for 16 eyes. The retinal mosaics were carefully aligned. The retinal locations for CDC and PCD that were computed for the first timepoint were used as the spatial anchor (coordinate center), the locations plotted here as red circles (CDC) and gray diamonds (PCD) represent the deviations that were measured at the second timepoint for both metrics.  
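      The difference in stability between the two landmarks can be illustrated with a toy density map. This is a sketch under our own assumptions, not the computation used in the cited papers: the CDC is approximated here as the density-weighted centroid of the 20 % highest density values, the PCD as the location of the single highest value, and the map is a synthetic Gaussian with one locally elevated pixel standing in for an annotation or image-quality difference.

```python
import math

def density_map(size, sigma, peak=100.0):
    """Synthetic, radially symmetric density map (arbitrary units)."""
    c = (size - 1) / 2
    return [[peak * math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
             for x in range(size)] for y in range(size)]

def peak_location(dmap):
    """Location (x, y) of the single highest density value (PCD-like)."""
    _, x, y = max((v, x, y) for y, row in enumerate(dmap) for x, v in enumerate(row))
    return x, y

def weighted_centroid(dmap, top_fraction=0.2):
    """Density-weighted centroid of the highest-density pixels (CDC-like, assumed definition)."""
    values = sorted(v for row in dmap for v in row)
    threshold = values[int(len(values) * (1 - top_fraction))]
    sw = sx = sy = 0.0
    for y, row in enumerate(dmap):
        for x, v in enumerate(row):
            if v >= threshold:
                sw += v
                sx += v * x
                sy += v * y
    return sx / sw, sy / sw

first = density_map(size=41, sigma=10.0)
second = [row[:] for row in first]
second[20][23] += 5.0   # one locally elevated pixel, 3 pixels from the centre

pcd_shift = math.dist(peak_location(first), peak_location(second))
cdc_shift = math.dist(weighted_centroid(first), weighted_centroid(second))
print(f"PCD shift: {pcd_shift:.3f} px, CDC shift: {cdc_shift:.5f} px")
```

      In this toy case a single changed pixel moves the peak location by three pixels, while the centroid moves by a small fraction of a pixel, in line with the low-pass filtering argument made above.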

      R2.3.: I don't see a statistical comparison between the drift angle tuning for CDC, PRL, and PCD. The distributions in Figure 4F look very similar and all with a relatively wide std. It would be useful to mark the mean of the distributions and report statistical tests. What are the data shown in this figure, single subjects, all subjects pooled together, average across subjects? Please specify in the caption.

      We added a Rayleigh test to test each distribution for non-uniformity and Kolmogorov-Smirnov tests to compare the distributions towards the different landmarks. We added the missing specifications to the figure caption of Figure 4 – figure supplement 1.

      R2.4: I would suggest also calculating drift direction based on the average instantaneous drift velocity, similarly to what is done with amplitude. From Figure 3B it is clear that some drifts are more curved than others. For curved drifts with small amplitudes the start-point- end-point (SE) direction is not very meaningful and it is not a good representation of the overall directionality of the segment. Some drifts also seem to be monotonic and then change direction (eg. the last three examples from participant 10). In this case, the SE direction is likely quite different from the average instantaneous direction. I suspect that if direction is calculated this way it may show the trend of drifting toward the CDC more clearly.

      In response to this and a comment of reviewer #1, we added a calculation of initial drift direction (and for increasing duration) and show it in Figure 4 – figure supplement 1. By doing so, we hope to capture initial directionality, irrespective of whether later parts in the path change direction. We find that directionality increases with increasing presentation duration.

      R2.5: I find the discussion point on myopia a bit confusing. Considering that this is a rather tangential point and there are only two myopic participants, I would suggest either removing it from the discussion or explaining it more clearly.

      We changed this section, also in response to comment R1.1.

      R2.6: I would suggest adding to the discussion more elaboration on how these results may relate to acuity in normal conditions (in the presence of optical aberrations). For example, will this relationship between sampling cone density and visual acuity also hold natural viewing conditions?

      We added only a half sentence to the first paragraph of the discussion. We are hesitant to extend this because there is very likely a non-straightforward relationship between acuity in normal and fully corrected conditions. We would predict that, if each eye were given the same type and magnitude of aberrations (similar to what we achieved by removing them), cone density will be the most prominent factor of acuity differences. Given that individual aberrations can vary substantially between eyes, this effect will be diluted, up to the point where aberrations will be the most important factor to acuity. As an example, under natural viewing conditions, pupil size will dominantly modulate the magnitude of aberrations.

      R2.7: Line 398 - the point on the superdiffusive nature of drift comes out of the blue and it is unclear. What is it meant by "superdiffusive"?

      We simply wanted to express that some drift properties seem to be adaptable while others aren’t. The text was changed at this location to remove this seemingly unmotivated term. 

      R2.8: Although it is true that drift has been assumed to be a random motion, there has been mounting evidence, especially in recent years, showing a degree of control and knowledge about ocular drift (eg. Poletti et al, 2015, JN; Lin et al, 2023, Current Biology).

      We agree, of course. We mention this fact several times in the paper and adjusted some sentences to prevent misunderstandings. The mentioned papers are now cited in the Discussion. 

      R2.9: Reference 23 is out of context and should be removed as it deals with the control of fine spatial attention in the foveola rather than microsaccades or drift.

      We removed this reference. 

      R2.10: Minor point: Figures appear to be low resolution in the pdf.

      This seemed to have been an issue with the submission process. All figures will be available in high resolution in the final online version. 

      R2.11: Figure S3, it would be useful to mark the CDC at the center with a different color maybe shaded so it can be visible also on the plot on the left.

      We changed the color and added a small amount of transparency to the PRL markers to make the CDC marker more visible. 

      R2.12: Figure S2, it would be useful to show the same graphs with respect to the PCD and PRL and maybe highlight the subjects who showed the largest (or smallest) distance between PRL and CDC).

      Please find new Figure 4 supplement 1, which contains this information in the group histograms. Also, Figure 4 supplement 2 is now ordered by the distance PRL-CDC (while the participant naming is kept as maximum acuity exhibited). In this way, it should be possible to infer whether PRL-CDC distance plays a role. For us it does not seem to be crucial. Rather, stimulus onset and drift length were related, which is captured in Figure 4g.

      R2.13: There is a typo in Line 410.

      We could not find a typo in this line, nor in the ones above and below. “Interindividual” was written on purpose, maybe “intraindividual” was expected? No changes were made to the text. 

      References

      Reiniger, J. L., Domdei, N., Holz, F. G., & Harmening, W. M. (2021). Human gaze is systematically offset from the center of cone topography. Current Biology, 31(18), 4188–4193. https://doi.org/10.1016/j.cub.2021.07.005

      Ruderman, D. L., & Bialek, W. (1992). Seeing Beyond the Nyquist Limit. Neural Computation, 4(5), 682–690. https://doi.org/10.1162/neco.1992.4.5.682

      Warr, E., Grieshop, J., Cooper, R. F., & Carroll, J. (2024). The effect of sampling window size on topographical maps of foveal cone density. Frontiers in Ophthalmology, 4, 1348950. https://doi.org/10.3389/fopht.2024.1348950

      Williams, D. R. (1985). Aliasing in human foveal vision. Vision Research, 25(2), 195–205. https://doi.org/10.1016/0042-6989(85)90113-0

      Wynne, N., Cava, J. A., Gaffney, M., Heitkotter, H., Scheidt, A., Reiniger, J. L., Grieshop, J., Yang, K., Harmening, W. M., Cooper, R. F., & Carroll, J. (2022). Intergrader agreement of foveal cone topography measured using adaptive optics scanning light ophthalmoscopy. Biomedical Optics Express, 13(8), 4445–4454. https://doi.org/10.1364/boe.460821

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      The manuscript by Mäkelä et al. presents compelling experimental evidence that the amount of chromosomal DNA can become limiting for the total rate of mRNA transcription and consequently protein production in the model bacterium Escherichia coli. Specifically, the authors demonstrate that upon inhibition of DNA replication the single-cell growth rate continuously decreases, in direct proportion to the concentration of active ribosomes, as measured indirectly by single-particle tracking. The decrease of ribosomal activity with filamentation, in turn, is likely caused by a decrease of the concentration of mRNAs, as suggested by an observed plateau of the total number of active RNA polymerases. These observations are compatible with the hypothesis that DNA limits the total rate of transcription and thus translation. The authors also demonstrate that the decrease of RNAp activity is independent of two candidate stress response pathways, the SOS stress response and the stringent response, as well as an anti-sigma factor previously implicated in variations of RNAp activity upon variations of nutrient sources.

      Remarkably, the reduction of growth rate is observed soon after the inhibition of DNA replication, suggesting that the amount of DNA in wild-type cells is tuned to provide just as much substrate for RNA polymerase as needed to saturate most ribosomes with mRNAs. While previous studies of bacterial growth have most often focused on ribosomes and metabolic proteins, this study provides important evidence that chromosomal DNA has a previously underestimated important and potentially rate-limiting role for growth. 

      Thank you for the excellent summary of our work.

      Strengths: 

      This article links the growth of single cells to the amount of DNA, the number of active ribosomes and to the number of RNA polymerases, combining quantitative experiments with theory. The correlations observed during depletion of DNA, notably in M9gluCAA medium, are compelling and point towards a limiting role of DNA for transcription and subsequently for protein production soon after reduction of the amount of DNA in the cell. The article also contains a theoretical model of transcription-translation that contains a Michaelis-Menten type dependency of transcription on DNA availability and is fit to the data. While the model fits well with the continuous reduction of relative growth rate in rich medium (M9gluCAA), the behavior in minimal media without casamino acids is a bit less clear (see comments below). 

      At a technical level, single-cell growth experiments and single-particle tracking experiments are well described, suggesting that different diffusive states of molecules represent different states of RNAp/ribosome activities, which reflect the reduction of growth. However, I still have a few points about the interpretation of the data and the measured fractions of active ribosomes (see below). 

      Apart from correlations in DNA-deplete cells, the article also investigates the role of candidate stress response pathways for reduced transcription, demonstrating that neither the SOS nor the stringent response are responsible for the reduced rate of growth. Equally, the anti-sigma factor Rsd recently described for its role in controlling RNA polymerase activity in nutrient-poor growth media, seems also not involved according to mass-spec data. While other (unknown) pathways might still be involved in reducing the number of active RNA polymerases, the proposed hypothesis of the DNA substrate itself being limiting for the total rate of transcription is appealing. 

      Finally, the authors confirm the reduction of growth in the distant Caulobacter crescentus, which lacks overlapping rounds of replication and could thus have shown a different dependency on DNA concentration. 

      Weaknesses: 

      There are a range of points that should be clarified or addressed, either by additional experiments/analyses or by explanations or clear disclaimers. 

      First, the continuous reduction of growth rate upon arrest of DNA replication initiation observed in rich growth medium (M9gluCAA) is not equally observed in poor media. Instead, the relative growth rate is immediately/quickly reduced by about 10-20% and then maintained for long times, as if the arrest of replication initiation had an immediate effect but would then not lead to saturation of the DNA substrate. In particular, the long plateau of a constant relative growth rate in M9ala is difficult to reconcile with the model fit in Fig 4S2. Is it possible that DNA is not limiting in poor media (at least not for the cell sizes studied here) while replication arrest still elicits a reduction of growth rate in a different way? Might this have something to do with the naturally much higher oscillations of DNA concentration in minimal medium?

      The reviewer is correct that there are interesting differences between nutrient-rich and -poor conditions. They were originally noted in the discussion, but we understand how our original presentation made them confusing. We reorganized the text and figures to better explain our results and interpretations. In the revised manuscript, the data related to the poor media are now presented separately (new Figure 6) from the data related to the rich medium (Figures 1-3). The total RNAP activity (abundance x active fraction) is significantly reduced in poor media (Figure 6A-B), similarly to rich medium (Figure 3H). Thus, DNA is limiting for transcription across conditions. However, the total ribosome activity in poor media (Figure 6C-D), and thus the growth rate (Figure 6E-F), was less affected than in rich medium (Figures 2H and 1C). Our interpretation of these results is that while DNA is limiting for transcription in all tested nutrient conditions (as shown by the total active RNAP data), post-transcriptional buffering activities compensate for the reduction in transcription in poor media, thereby maintaining a better scaling of growth rates under DNA limitation.

      The authors argue that DNA becomes limiting in the range of physiological cell sizes, in particular for M9glCAA (Fig. 1BC). It would be helpful to know by how much (fold-change) the DNA concentration is reduced below wild-type (or multi-N) levels at t=0 in Fig 1B and how DNA concentration decays with time or cell area, to get a sense by how many-fold DNA is essentially 'overexpressed/overprovided' in wild-type cells. 

      We now provide crude estimates in the Discussion section. The revised text reads: “Crude estimations suggest that ≤ 40% DNA dilution is sufficient to negatively affect transcription (total RNAP activity) in M9glyCAAT, whereas the same effect was observed after less than 10% dilution in nutrient-poor media (M9gly or M9ala) (see Materials and Methods).” We obtained these numbers based on calculations and estimates described in the Materials and Methods section and Appendix 1 (Appendix 1 – Table 1).

      Fig. 2: The distribution of diffusion coefficients of RpsB is fit to Gaussians on the log scale. Is this based on a model or on previous work or simply an empirical fit to the data? An exact analytical model for the distribution of diffusion constants can be found in the tool anaDDA by Vink, ..., Hohlbein Biophys J 2020. Alternatively, distributions of displacements are expressed analytically in other tools (e.g., in SpotOn). 

      We use an empirical fit of a three-state Gaussian mixture model (GMM) to the data and extract the fractions of molecules in each state. This avoids making too many assumptions about the underlying processes, e.g., a Markovian system with Brownian diffusion. The model in anaDDA (Vink et al.) is currently limited to two transitioning states with a maximum of 8 steps per track for a computationally efficient solution (longer tracks are truncated). Using a short subset of the trajectories is less accurate than using the entire trajectory and, because of this, we consider full tracks with at least 9 displacements. Meanwhile, Spot-On supports a three-state model, but it is still based on a semi-analytical model with a pre-calculated library of parameters created by fitting simulated data. Neither of these models considers the effect of cell confinement, which plays a major role in single-molecule diffusion in small cells such as bacteria. For these reasons, we opted to use an empirical fit to the data. We note that the fractions of active ribosomes in WT cells, which we extracted from these diffusion measurements, are consistent with the range of estimates obtained by others using similar or different approaches (Forchhammer and Lindahl, 1971; Mohapatra and Weisshaar, 2018; Sanamrad et al., 2014).
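
      As an illustration of the kind of empirical fit described above, the following is a minimal sketch of a three-state Gaussian mixture fitted by expectation-maximization to log-scale diffusion coefficients. It is not the authors' analysis code: the function names, the initialization scheme, and the synthetic data in `main` are assumptions made for this example only.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_gmm_1d(values, n_states=3, n_iter=100):
    """Fit a 1D Gaussian mixture by expectation-maximization (EM).

    Returns (weights, means, sds), with states sorted by increasing mean.
    The weights are the estimated fractions of molecules in each state.
    """
    data = list(values)
    n = len(data)
    lo, hi = min(data), max(data)
    # Assumed initialization: means spread evenly across the data range.
    means = [lo + (k + 0.5) * (hi - lo) / n_states for k in range(n_states)]
    sds = [(hi - lo) / (4.0 * n_states)] * n_states
    weights = [1.0 / n_states] * n_states

    for _ in range(n_iter):
        # E-step: responsibility of each state for each data point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: update weights, means, and standard deviations.
        for k in range(n_states):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor to avoid collapse

    order = sorted(range(n_states), key=lambda k: means[k])
    return ([weights[k] for k in order],
            [means[k] for k in order],
            [sds[k] for k in order])


def main():
    # Synthetic log10(D) values for three hypothetical diffusive states.
    rng = random.Random(0)
    sample = ([rng.gauss(-2.0, 0.2) for _ in range(500)]
              + [rng.gauss(-1.0, 0.2) for _ in range(300)]
              + [rng.gauss(0.0, 0.2) for _ in range(200)])
    weights, means, sds = fit_gmm_1d(sample)
    for w, m, s in zip(weights, means, sds):
        print(f"fraction={w:.2f}  mean log10(D)={m:.2f}  sd={s:.2f}")


if __name__ == "__main__":
    main()
```

      In practice, an established implementation such as scikit-learn's `GaussianMixture` would usually be preferable to a hand-written EM loop. The sketch is only meant to show how state fractions fall out of the fitted mixture weights.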

      The estimated fraction of active ribosomes in wild-type cells shows a very strong reduction with decreasing growth rate (down from 75% to 30%), twice as strong as measured in bulk experiments (Dai et al Nat Microbiology 2016; decrease from 90% to 60% for the same growth rate range) and probably incompatible with measurements of growth rate, ribosome concentrations, and almost constant translation elongation rate in this regime of growth rates. Might the different diffusive fractions of RpsB not represent active/inactive ribosomes? See also the problem of quantification above. The authors should explain and compare their results to previous work. 

      We agree that our measured range is somewhat larger than the estimated range from Dai et al., 2016. However, they used different media, strains, and growth conditions. We also note that Dai et al. did not make actual measurements of the active ribosome fraction. Instead, they calculated the “active ribosome equivalent” based on a model that includes growth rate, protein synthesis rate, RNA/protein abundance, and the total number of amino acids in all proteins in the cell. Importantly, our measurements show the same overall trend (a ~30% decrease) as Dai et al., 2016. Furthermore, our results are within the range of previous experimental estimates from ribosome profiling (Forchhammer and Lindahl, 1971) or single-ribosome tracking (Mohapatra and Weisshaar, 2018; Sanamrad et al., 2014). We clarified this point in the revised manuscript.

      To measure the reduction of mRNA transcripts in the cell, the authors rely on the fluorescent dye SYTO RNAselect. They argue that 70% of the dye signal represents mRNA. The argument is based on the previously observed reduction of the total signal by 70% upon treatment with rifampicin, an RNA polymerase inhibitor (Bakshi et al 2014). The idea here is presumably that mRNA should undergo rapid degradation upon rif treatment while rRNA or tRNA are stable. However, work from Hamouche et al. RNA (2021) 27:946 demonstrates that rifampicin treatment also leads to a rapid degradation of rRNA. Furthermore, the timescale of fluorescent-signal decay in the paper by Bakshi et al. (half-life about 10 min) is not compatible with the previously reported rapid decay of mRNA (2-4 min) but rather compatible with the slower, still somewhat rapid, decay of rRNA reported by Hamouche et al. A bulk method to measure total mRNA as in the cited Balakrishnan et al. (Science 2022) would thus be a preferred method to quantify mRNA. Alternatively, the authors could also test whether the mass contribution of total RNA remains constant, which would suggest that rRNA decay does not contribute to signal loss. However, since rRNA dominates total RNA, this measurement requires high accuracy. The authors might thus tone down their conclusions on mRNA concentration changes while still highlighting the compelling data on RNAp diffusion.

      Thank you for bringing the Hamouche et al., 2021 paper to our attention. To address this potential issue, we have performed fluorescence in situ hybridization (FISH) microscopy using a 16S rRNA probe (EUB338) to quantify rRNA concentration in 1N cells. We found that the rRNA signal only slightly decreases with cell size (i.e., genome dilution) compared to the RNASelect signal (e.g., a ~5% decrease for the rRNA signal vs. 50% for RNASelect for a cell size range of 4 to 10 µm²). We have revised the text and added a figure to include the new rRNA FISH data (Figure 4). In addition, as a control, we validated our rRNA FISH method by comparing the intracellular concentration of 16S rRNA in poor vs. rich media (new Figure 4 – Figure supplement 3).

      The proteomics experiments are a great addition to the single-cell studies, and the correlations between distance from ori and protein abundance is compelling. However, I was missing a different test, the authors might have already done but not put in the manuscript: If DNA is indeed limiting the initiation of transcription, genes that are already highly transcribed in non-perturbed conditions might saturate fastest upon replication inhibition, while genes rarely transcribed should have no problem to accommodate additional RNA polymerases. One might thus want to test, whether the (unperturbed) transcription initiation rate is a predictor of changes in protein composition. This is just a suggestion the authors may also ignore, but since it is an easy analysis, I chose to mention it here. 

      We did not find any correlation when we examined the potential relation between RNA slopes and mRNA abundance (from our first CRISPRi oriC time point) or the transcription initiation rate (from Balakrishnan et al., 2022, PMID: 36480614) across genes. These new plots are presented in Figure 7 – Figure supplement 2B. In contrast, we found a small but significant correlation between RNA slopes and mRNA decay rates (from Balakrishnan et al., 2022, PMID: 36480614), specifically for genes with short mRNA lifetimes (new Figure 7F). This effect is consistent with our model prediction (Figure 5 – Figure supplement 2). 

      Related to the proteomics, in l. 380 the authors write that the reduced expression close to the ori might reflect a gene-dosage compensatory mechanism. I don't understand this argument. Can the authors add a sentence to explain their hypothesis? 

      We apologize for the confusion. While performing additional analyses for the revisions, we realized that while the proteins encoded by genes close to oriC tend to display subscaling behavior, this is not true at the mRNA level (new Figure 7 – Figure supplement 3B). In light of this result, we no longer have a hypothesis for the observed negative correlation at the protein level (originally Figure 5D, now Figure 7 – Figure supplement 3A). The text was revised accordingly.  

      In Fig. 1E the authors show evidence that growth rate increases with cell length/area. While this is not a main point of the paper it might be cited by others in the future. There are two possible artifacts that could influence this experiment: a) segmentation: an overestimation of the physical length of the cell based on phase-contrast images (e.g., 200 nm would cause a 10% error in the relative rate of 2 µm cells, but not of longer cells). b) time-dependent changes of growth rate, e.g., due to change from liquid to solid or other perturbations. To test for the latter, one could measure growth rate as a function of time, restricting the analysis to short or long cells, or measuring growth rate for short/long cells at selected time points. For the former, I recommend comparison of phase-contrast segmentation with FM4-64-stained cell boundaries.

      As the reviewer notes, the small increase in relative growth was just a minor observation that does not affect our story, whether it is biologically meaningful or the result of a technical artefact. But we agree with the reviewer that others might cite it in future works and that it should thus be interpreted with caution.

      An artefact associated with time-dependent changes (e.g. changing from liquid cultures to more solid agarose pads) is unlikely for two reasons. 1. We show that varying the time that cells spend on agarose pads relative to liquid cultures does not affect the cell size-dependent growth rate results (Figure 1 – supplement 5A). 2. We show that the growth rate is stable from the beginning of the time-lapse with no transient effects upon cell placement on agarose pads for imaging (Figure 1 – supplement 1). These results were described in the Methods section where they could easily be missed. We revised the text to discuss these controls more prominently in the Results section.

      As for cell segmentation, we have run simulations and agree with the reviewer that a small overestimation of cell area (which is possible with any cell segmentation methods including ours) could lead to a small increase in relative growth with increasing cell areas (new Figure 1 – Figure supplement 3). Since the finding is not important to our story, we simply revised the text and added the simulation results to alert the readers to the possibility that the observation may be due to a small cell segmentation bias.
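
      The segmentation argument can be made concrete with a short worked example. This is a minimal sketch, assuming exponential growth and a constant additive overestimate of cell size. It is not the simulation reported in the new Figure 1 – Figure supplement 3, and the function name and numbers are hypothetical, taken from the reviewer's 200 nm example.

```python
def apparent_relative_growth_rate(true_rate, measured_size, overestimate):
    """Relative growth rate inferred from sizes that carry a constant
    additive overestimate (same units as measured_size).

    A constant offset leaves the absolute growth d(size)/dt unchanged,
    but that growth is divided by the inflated measured size.
    """
    true_size = measured_size - overestimate
    return true_rate * true_size / measured_size


if __name__ == "__main__":
    # A 0.2 µm overestimate matters for short cells and fades for long ones.
    for size_um in (2.0, 4.0, 10.0):
        rate = apparent_relative_growth_rate(1.0, size_um, 0.2)
        print(f"measured size {size_um:4.1f} µm: apparent/true rate = {rate:.3f}")
```

      Under these assumptions, the apparent relative growth rate rises with measured size and approaches the true rate for long cells, which is the direction of the trend under discussion.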

      Reviewer #2 (Public Review): 

      In this work, the authors uncovered the effects of DNA dilution on E. coli, including a decrease in growth rate and a significant change in proteome composition. The authors demonstrated that the decline in growth rate is due to the reduction of active ribosomes and active RNA polymerases because of the limited DNA copy numbers. They further showed that the change in the DNA-to-volume ratio leads to concentration changes in almost 60% of proteins, and these changes mainly stem from the change in the mRNA levels. 

      Thank you for the support and accurate summary!

      Reviewer #3 (Public Review): 

      Summary: 

      Mäkelä et al. here investigate genome concentration as a limiting factor on growth.

      Previous work has identified key roles for transcription (RNA polymerase) and translation (ribosomes) as limiting factors on growth, which enable an exponential increase in cell mass. While a potential limiting role of genome concentration under certain conditions has been explored theoretically, Mäkelä et al. here present direct evidence that when replication is inhibited, genome concentration emerges as a limiting factor. 

      Strengths: 

      A major strength of this paper is the diligent and compelling combination of experiment and modeling used to address this core question. The use of origin- and ftsZ-targeted CRISPRi is a very nice approach that enables dissection of the specific effects of limiting genome dosage in the context of a growing cytoplasm. While it might be expected that genome concentration eventually becomes a limiting factor, what is surprising and novel here is that this happens very rapidly, with growth transitioning even for cells within the normal length distribution for E. coli. Fundamentally, it demonstrates the fine balance of bacterial physiology, where the concentration of the genome itself (at least under rapid growth conditions) is no higher than it needs to be. 

      Thank you!

      Weaknesses: 

      One limitation of the study is that genome concentration is largely treated as a single commodity. While this facilitates their modeling approach, one would expect that the growth phenotypes observed arise due to copy number limitation in a relatively small number of rate-limiting genes. The authors do report shifts in the composition of both the proteome and the transcriptome in response to replication inhibition, but while they report a positional effect of distance from the replication origin (reflecting loss of high-copy, origin-proximal genes), other factors shaping compositional shifts and their functional effects on growth are not extensively explored. This is particularly true for ribosomal RNA itself, which the authors assume to grow proportionately with protein. More generally, understanding which genes exert the greatest copy number-dependent influence on growth may aid both efforts to enhance (biotechnology) and inhibit (infection) bacterial growth. 

      We agree but feel that identifying the specific limiting genes is beyond the scope of the study. That said, we carried out additional experiments and analyses to address the reviewer’s comment and identify potential contributing factors and limiting gene candidates. First, we examined the intracellular concentration of 16S ribosomal RNA (rRNA) by rRNA FISH microscopy and found that it decays much more slowly than the bulk of mRNAs as measured using RNASelect staining (new Figure 4 and Figure 4 – Figure supplements 1 and 3). We found that the rRNA signal is far more stable in 1N cells than the RNASelect signal, the former decreasing by only ~5% versus ~50% for the latter in response to the same range of genome dilution (Figure 4C). Second, we carried out new correlation analyses between our proteomic/transcriptomic datasets and published genome-wide datasets that report various variables under unperturbed conditions (e.g., mRNA abundance, mRNA degradation rates, fitness cost, transcription initiation rates, essentiality for viability); see new Figure 7E-G and Figure 7 – Figure supplement 2. In the process, we found that genes essential for viability tend, on average, to display superscaling behavior (Figure 7G). This suggests that cells have evolved mechanisms that prioritize expression of essential genes over nonessential ones during DNA-limited growth. Furthermore, this analysis identified a small number of essential genes that display strong negative RNA slopes (Figure 7C, Datasets 1 and 2), indicating that the concentration of their mRNA decreases rapidly relative to the rest of the transcriptome upon genome dilution. These essential genes with strong subscaling behavior are candidates for being growth-limiting.

      The text and figures were revised to include these new results.

      Overall, this study provides a fundamental contribution to bacterial physiology by illuminating the relationship between DNA, mRNA, and protein in determining growth rate. While coarse-grained, the work invites exciting questions about how the composition of major cellular components is fine-tuned to a cell's needs and which specific gene products mediate this connection. This work has implications not only for biotechnology, as the authors discuss, but potentially also for our understanding of how DNA-targeted antibiotics limit bacterial growth. 

      Thank you!

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors): 

      Below are my comments. 

      (1) I noticed that a paper by Li et al. on biorxiv has found similar results as this work ("Scaling between DNA and cell size governs bacterial growth homeostasis and resource allocation," https://doi.org/10.1101/2021.11.12.468234), including the linear growth of E. coli when the DNA concentration is low. This relevant reference was not cited or discussed in the current manuscript. 

      We agree that authors should cite and discuss relevant peer-reviewed literature. But broadly speaking, we feel that extending this responsibility to all preprints (and by extension any online material) that have not been reviewed is a bit dangerous. It would effectively legitimize unreviewed claims and risk their propagation in future publications. We think that while imperfect, the peer-reviewing process still plays an important role. 

      Regarding the specific 2021 preprint that the reviewer pointed out, we think that the presented growth rate data are quite noisy and that the experiments lack a critical control (multi-N cells), making interpretation difficult. Their report that plasmid-borne expression is enhanced when DNA is severely diluted is certainly interesting and makes sense in light of our measurements that the activities, but not the concentrations, of RNA polymerases and ribosomes are reduced in 1N cells. However, we do not know why this preprint has not been published since it was posted in 2021. There could be many possible reasons for this. Therefore, we feel that it is safer to limit our discussion to peer-reviewed literature.

      (2) I think the kinetic Model B in the Appendix has been studied in previous works, such as Klumpp & Hwa, PNAS 2008, https://doi.org/10.1073/pnas.0804953105

      Indeed, Klumpp & Hwa 2008 modeled the kinetics of RNA polymerase and promoter association prior to our study. But there is a difference between their model and ours. Their model is based on Michaelis-Menten-type (MM) functions in which the RNAP is analogous to the “substrate” and the promoter to the “enzyme” in the MM equation. In contrast, our model uses functions based on the law of mass action (instead of MM-type functions). We have revised the text, included the Klumpp & Hwa 2008 reference, and revised the Materials & Methods section to clarify these points.
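
      To illustrate the distinction in generic terms, the sketch below compares a textbook mass-action equilibrium for RNAP-promoter binding with an MM-type approximation. It is not Model B from Appendix 1 or the Klumpp & Hwa model: the single-step binding scheme, the dissociation constant `k_d`, and all numerical values are assumptions chosen for illustration.

```python
import math


def bound_mass_action(rnap_total, promoter_total, k_d):
    """Equilibrium RNAP-promoter complex C from the law of mass action.

    Solves (R - C)(P - C) = K_d * C for the physical (smaller) root, so
    depletion of both free RNAP and free promoters is accounted for.
    """
    s = rnap_total + promoter_total + k_d
    return (s - math.sqrt(s * s - 4.0 * rnap_total * promoter_total)) / 2.0


def bound_michaelis_menten(rnap_total, promoter_total, k_d):
    """MM-type approximation with the promoter as the 'enzyme' and RNAP
    as the 'substrate', assuming free RNAP is close to total RNAP."""
    return promoter_total * rnap_total / (k_d + rnap_total)


if __name__ == "__main__":
    # (total RNAP, total promoters, K_d) in arbitrary concentration units.
    for r, p, k in ((1000.0, 1.0, 10.0), (100.0, 100.0, 10.0)):
        print(f"R={r:g}, P={p:g}: mass action={bound_mass_action(r, p, k):.3f}, "
              f"MM-type={bound_michaelis_menten(r, p, k):.3f}")
```

      In this toy setting the two expressions agree when RNAP is in large excess over promoters, and the MM-type form overestimates the bound complex when the two are present at comparable levels.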

      (3) On lines 284-285, if I understand correctly, the fractions of active RNAPs and active ribosomes are relative to the total protein number. It would be helpful if the authors could mention this explicitly to avoid confusion. 

      The fractions of active RNAPs and active ribosomes are expressed as the percentage of the total RNAPs and ribosomes. We have revised the text to be more explicit. Thank you.

      (4) On line 835, I am not sure what the bulk transcription/translation rate means. I guess it is the maximum transcription/translation rate if all RNAPs/ribosomes are working according to Eq. (1,2). It would be helpful if the authors could explain the meaning of r_1 and r_2 more explicitly. 

      Our apology for the lack of clarity. We have added the following equations:

      (5) Regarding the changes in protein concentrations due to genome dilution, a recent theoretical paper showed that it may come from the heterogeneity in promoter strengths (Wang & Lin, Nature Communications 2021). 

      In the Wang and Lin model, the heterogeneity in promoter strength predicts that the “mRNA production rate equivalent”, which is the mRNA abundance multiplied by the mRNA decay rate, will correlate with the RNA slopes. However, we found these two variables to be uncorrelated (see below). The Spearman correlation coefficient ρ was 0.02 with a p-value of 0.24, indicating non-significance (NS).

      Author response image 1.

      The mRNA production rate equivalent (mRNA abundance at the first time point after CRISPRi oriC induction multiplied by the mRNA degradation rate measured by Balakrishnan et al., 2022, PMID: 36480614, expressed in transcript counts per minute) does not correlate (Spearman correlation’s p-value = 0.24) with the RNA slope in 1N-rich cells.  Data from 2570 genes are shown (grey markers, Gaussian kernel density estimation - KDE), and their binned statistics (mean +/- SEM, ~280 genes per bin, orange markers). 

      In addition, we found no significant correlation between RNA slopes and mRNA abundance or transcription initiation rate. These plots are now included in Figure 7E and Figure 7 – Figure supplement 2B. Thus, the promoter strength does not appear to be a predictor of the RNA (and protein) scaling behavior under DNA limitation.
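
      For readers unfamiliar with the statistic used in this comparison, the following is a minimal sketch of how an mRNA production rate equivalent (abundance multiplied by decay rate) can be rank-correlated with per-gene RNA slopes. The function names and the toy values are hypothetical and are not taken from the datasets discussed here. The sketch computes only the Spearman coefficient; a p-value would normally come from a statistics library such as SciPy (`scipy.stats.spearmanr`).

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def production_rate_equivalent(abundance, decay_rate):
    """Per-gene mRNA abundance multiplied by mRNA decay rate."""
    return [a * d for a, d in zip(abundance, decay_rate)]


if __name__ == "__main__":
    # Hypothetical per-gene values, for illustration only.
    abundance = [12.0, 3.5, 40.0, 8.0, 1.2]      # transcript counts
    decay_rate = [0.30, 0.10, 0.25, 0.50, 0.20]  # per minute
    rna_slope = [0.05, -0.10, 0.02, -0.30, 0.15]
    production = production_rate_equivalent(abundance, decay_rate)
    print(f"Spearman rho = {spearman_rho(production, rna_slope):.2f}")
```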

      Reviewer #3 (Recommendations For The Authors): 

      One general area that could be developed further is analysis of changes in the proteome/transcriptome composition, given that there may be specific clues here as to the phenotypic effects of genome concentration limitation. Specifically: 

      • In Figure 5D, the authors demonstrate an effect of origin distance on sensitivity to replication inhibition, presumably as a copy number effect. However, the authors note that the effect was only slight and postulated a compensatory mechanism. Due to the stability of proteins, one should expect relatively small effects - even if synthesis of a protein stopped completely, its concentration would only decrease twofold with a doubling of cell area (slope = -1, if I'm interpreting things correctly). It would be helpful to display the same information shown in Figure 5D at the mRNA level, since I would anticipate that higher mRNA turnover rates mean that effects on transcription rate should be felt more rapidly. 

      We thank the reviewer for this suggestion. To our surprise, we found that there is no correlation between gene location relative to the origin and RNA slope across genes. This suggests that the observed correlation between gene location and protein slopes does not occur at the mRNA level. Given that we do not have an explanation for the underlying mechanism, we decided to present these data (the original data in Figure 5D and the new data for the RNA slope) in a supplementary figure (Figure 7 – Figure supplement 3).

      • Related to this, did the authors see any other general trends? For example, do highly expressed genes hit saturation faster, making them more sensitive to limited genome concentration? 

      We found that the RNA slopes do not correlate with mRNA abundance or transcription initiation rates. However, they do correlate with mRNA decay. That is, short-lived mRNAs tend to have negative RNA slopes. The new analyses have been added as Figure 7E-F and Figure 7 – Figure supplement 2B. The text has been revised to incorporate this information. 

      • Presumably loss of growth is primarily driven by a subset of genes whose copy number becomes limiting. Previously, it has been reported that there is a wide variety among "essential" genes in their expression-fitness relationship - i.e. how much of a reduction in expression you need before growth is reduced (e.g. PMID 33080209). It would be interesting to explore the shifts in proteome/transcriptome composition to see whether any genes particularly affected by restricted genome concentration are also especially sensitive to reduced expression - overlap in these datasets may reveal which genes drive the loss of growth. 

      This is a very interesting idea – thank you! We did not find a correlation between the protein/RNA slope and the relative gene fitness as previously calculated (PMID 33080209), as shown below.

      Author response image 2.

      The relative fitness of each gene (data by Hawkins et al., 2020, PMID: 33080209, median fitness from the highest sgRNA activity bin) plotted versus the gene-specific RNA and protein slopes that we measured in 1Nrich cells after CRISPRi oriC induction. More than 260 essential genes are shown (262 RNA slopes and 270 protein slopes, grey markers), and their binned statistics (mean +/- SEM, 43-45 essential genes per bin, orange markers). The spearman correlations (ρ) with p-values above 10-3 are considered not significant (NS). In our analyses, we only considered correlations significant if they have a Spearman correlation p-value below 10-10.

      However, while doing this suggested analysis, we noticed that the essential genes that were included in the aforementioned study have RNA slopes above zero on average. This led us to compare the RNA slope distributions of essential genes relative to all genes (now included in Figure 7G). We found that they tend to display superscaling behavior (positive RNA slopes), suggesting the existence of regulatory mechanisms that prioritize the expression of essential genes over less important ones when genome concentration becomes limiting for growth. The text has been revised to include this new information.

      Other suggestions: 

      • In Figure 3 the authors report that total RNAP concentration increases with increasing cytoplasmic volume. This is in itself an interesting finding as it may imply a compensatory mechanism - can the authors offer an explanation for this? 

      We do not have a straightforward explanation. But we agree that it is very interesting and should be investigated in future studies given that this superscaling behavior is common among essential genes. 

      • The explanation of the modeling within the main text could be improved. Specifically, equations 1 and 2, as well as a discussion of models A and B (lines 290-301), do not explicitly relate DNA concentration to downstream effects. The authors provide the key information in Appendix 1, but for a general reader, it would be helpful to provide some intuition within the main text about how genome concentration influences transcription rate (i.e. via 𝛼RNAP).  

      We apologize for the lack of clarity. We have added information that hopefully improves clarity.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This article presents important results describing how the gathering, integration, and broadcasting of information in the brain changes when consciousness is lost either through anesthesia or injury. They provide convincing evidence to support their conclusions, although the paper relies on a single analysis tool (partial information decomposition) and could benefit from a clearer explication of its conceptual basis, methodology, and results. The work will be of interest to both neuroscientists and clinicians interested in fundamental and clinical aspects of consciousness.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this paper, Luppi et al. apply the recently developed integrated information decomposition to the question of how the architecture of information processing changes when consciousness is lost. They explore fMRI data from two different populations: healthy volunteers undergoing reversible anesthesia, and patients who have long-term disorders of consciousness. They show that, in both populations, synergistic integration of information is disrupted in common ways. These results are interpreted in the context of the SAPHIRE model (recently proposed by this same group), which describes information processing in the brain as being composed of several distinct steps: 1) gatekeeping (where gateway regions introduce sensory information to the global synergistic workspace), where 2) it is integrated or "processed" before 3) being broadcast back to the brain.

      I think that this paper is an excellent addition to the literature on information theory in neuroscience, and consciousness science specifically. The writing is clear, the figures are informative, and the authors do a good job of engaging with existing literature. While I do have some questions about the interpretations of the various information-theoretic measures, all in all, I think this is a significant piece of science that I am glad to see added to the literature.

      One specific question I have is that I am still a little unsure about what "synergy" really is in this context. From the methods, it is defined as that part of the joint mutual information that is greater than the maximum marginal mutual information. While this is a perfectly fine mathematical measure, it is not clear to me what that means for a squishy organ like the brain. What should these results mean to a neuro-biologist or clinician?

      Right now the discussion is very high level, equating synergy to "information processing" or "integrated information", but it might be helpful for readers not steeped in multivariate information theory to have some kind of toy model that gets worked out in detail. On page 15, the logical XOR is presented in the context of the single-target PID, but 1) the XOR is discrete, while the data analyzed here are continuous BOLD signals w/ Gaussian assumptions and 2) the XOR gate is a single-target system, while the power of the Phi-ID approach is the multi-target generality. Is there a Gaussian analog of the single-target XOR gate that could be presented? Or some multi-target, Gaussian toy model with enough synergy to be interesting? I think this would go a long way to making this work more accessible to the kind of interdisciplinary readership that this kind of article will inevitably attract.

      We appreciate this observation. We now clarify that:

      “redundancy between two units occurs when their future spontaneous evolution is predicted equally well by the past of either unit. Synergy instead occurs when considering the two units together increases the mutual information between the units’ past and their future – suggesting that the future of each is shaped by its interactions with the other. At the microscale (e.g., for spiking neurons) this phenomenon has been suggested as reflecting “information modification” 36,40,47. Synergy can also be viewed as reflecting the joint contribution of parts of the system to the whole, that is not driven by common input48.”

      In the Methods, we have also added the following example to provide additional intuition about synergy in the case of continuous rather than discrete variables:

      “As another example for the case of Gaussian variables (as employed here), consider a 2-node coupled autoregressive process with two parameters: a noise correlation c and a coupling parameter a. As c increases, the system is flooded by “common noise”, making the system increasingly redundant because the common noise “swamps” the signal of each node. As a increases, each node has a stronger influence both on the other and on the system as a whole, and we expect synergy to increase. Therefore, synergy reflects the joint contribution of parts of the system to the whole that is not driven by common noise. This has been demonstrated through computational modelling (Mediano et al 2019 Entropy).”

      See below for the relevant parts of Figures 1 and 2 from Mediano et al (2019 Entropy), where Psi refers to the total synergy in the system.

      Author response image 1.
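
      To give a concrete feel for how common noise and coupling affect synergy with Gaussian variables, the sketch below works through a deliberately simplified static, single-target toy: Z = a*X + a*Y + noise, where X and Y have unit variance and correlation c. This is an illustrative assumption and not the two-node autoregressive model or the PhiID analysis used in the manuscript. Redundancy is taken as the minimum marginal mutual information, and synergy as the joint mutual information minus the maximum marginal mutual information, following the definition summarised in the Reviewer's comment. The function name and parameters are hypothetical.

```python
import math


def gaussian_mmi_pid(a, c, noise_var=1.0):
    """Toy decomposition for Z = a*X + a*Y + noise (all Gaussian).

    X and Y have unit variance and correlation c; noise is independent
    with variance noise_var. Values are in nats. This is an assumed
    illustrative model, not the authors' autoregressive system.
    """
    var_z = a * a * (2.0 + 2.0 * c) + noise_var
    cov_xz = a * (1.0 + c)  # equals cov(Y, Z) by symmetry

    # Mutual information between two scalar Gaussians: -0.5 * ln(1 - rho^2).
    mi_x = -0.5 * math.log(1.0 - cov_xz ** 2 / var_z)
    mi_y = mi_x

    # Given both X and Y, only the independent noise remains in Z.
    mi_joint = 0.5 * math.log(var_z / noise_var)

    redundancy = min(mi_x, mi_y)
    synergy = mi_joint - max(mi_x, mi_y)
    return {"redundancy": redundancy, "synergy": synergy, "joint": mi_joint}
```

      In this toy the synergy simplifies to 0.5 * ln(1 + a^2 * (1 - c^2)) when noise_var is 1, so it grows with the coupling a and shrinks as the correlation c between the sources approaches 1, while redundancy grows with c. This mirrors the qualitative intuition in the quoted example, under the stated simplifying assumptions.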

      Strengths

      The authors have a very strong collection of datasets with which to explore their topic of interest. By comparing fMRI scans from patients with disorders of consciousness, healthy resting state, and various stages of propofol anesthesia, the authors have a very robust sample of the various ways consciousness can be perturbed, or lost. Consequently, it is difficult to imagine that the observed effects are merely a quirk of some biophysical effect of propofol specifically, or a particular consequence of long-term brain injury, but do in fact reflect some global property related to consciousness. The data and analyses themselves are well-described, have been previously validated, and are generally strong. I have no reason to doubt the technical validity of the presented results.

      The discussion and interpretation of these results is also very nice, bringing together ideas from the two leading neurocognitive theories of consciousness (Global Workspace and Integrated Information Theory) in a way that feels natural. The SAPHIRE model seems plausible and amenable to future research. The authors discuss this in the paper, but I think that future work on less radical interventions (e.g. movie watching, cognitive tasks, etc) could be very helpful in refining the SAPHIRE approach.

      Finally, the analogy between the PID terms and the information provided by each eye redundantly, uniquely, and synergistically is superb. I will definitely be referencing this intuition pump in future discussions of multivariate information sharing.

      We are very grateful for these positive comments, and for the feedback on our eye metaphor.

      Weaknesses

      I have some concerns about the way "information processing" is used in this study. The data analyzed, fMRI BOLD data, are extremely coarse, both in spatial and temporal terms. I am not sure I am convinced that this is the natural scale at which to talk about information "processing" or "integration" in the brain. In contrast to measures like sample entropy or Lempel-Ziv complexity (which just describe the statistics of BOLD activity), synergy and Phi are presented here as quasi-causal measures: as if they "cause" or "represent" phenomenological consciousness. While the theoretical arguments linking integration to consciousness are compelling, is this the right data set to explore them in? For example, in the work by Newman, Beggs, and Sherrill (nee Faber), synergy is associated with "computation" performed in individual neurons: the information about the future state of a target neuron that is only accessible when knowing both inputs (analogous to the synergy in computing the sum of two dice). Whether one thinks that this is a good approach to neural computation or not, it fits within the commonly accepted causal model of neural spiking activity: neurons receive inputs from multiple upstream neurons, integrate those inputs and change their firing behavior accordingly.

      In contrast, here, we are looking at BOLD data, which is a proxy measure for gross-scale regional neural activity, which itself is a coarse-graining of millions of individual neurons to a uni-dimensional spectrum that runs from "inactive to active." It feels as though a lot of inferences are being made from very coarse data.

      We appreciate the opportunity to clarify this point. It is not our intention to claim that Phi-R and synergy, as measured at the level of regional BOLD signals, represent a direct cause of consciousness, or are identical to it. Rather, our work is intended to use these measures similarly to the use of sample entropy and LZC for BOLD signals: as theoretically grounded macroscale indicators, whose empirical relationship to consciousness may reveal the relevant underlying phenomena. In other words, while our results do show that BOLD-derived Phi-R tracks the loss and recovery of consciousness, we do not claim that they are the cause of it: only that an empirical relationship exists, which is in line with what we might expect on theoretical grounds. We have now clarified this in the Limitations section of our revised manuscript, as well as revising our language accordingly in the rest of the manuscript.

      We also clarify that the meaning of “information processing” that we adopt pertains to “intrinsic” information that is present in the system’s spontaneous dynamics, rather than extrinsic information about a task:

      “Information decomposition can be applied to neural data from different scales, from electrophysiology to functional MRI, with or without reference to behaviour 34. When behavioural data are taken into account, information decomposition can shed light on the processing of “extrinsic” information, understood as the translation of sensory signals into behavioural choices across neurons or regions 41,43,45,47. However, information decomposition can also be applied to investigate the “intrinsic” information that is present in the brain’s spontaneous dynamics in the absence of any tasks, in the same vein as resting-state “functional connectivity” and methods from statistical causal inference such as Granger causality 49. In this context, information processing should be understood in terms of the dynamics of information: where and how information is stored, transferred, and modified 34.”

      References:

      (1) Newman, E. L., Varley, T. F., Parakkattu, V. K., Sherrill, S. P. & Beggs, J. M. Revealing the Dynamics of Neural Information Processing with Multivariate Information Decomposition. Entropy 24, 930 (2022).

      Reviewer #2 (Public Review):

      The authors analysed functional MRI recordings of brain activity at rest, using state-of-the-art methods that reveal the diverse ways in which the information can be integrated in the brain. In this way, they found brain areas that act as (synergistic) gateways for the 'global workspace', where conscious access to information or cognition would occur, and brain areas that serve as (redundant) broadcasters from the global workspace to the rest of the brain. The results are compelling and consistent with the already assumed role of several networks and areas within the Global Neuronal Workspace framework. Thus, in a way, this work comes to stress the role of synergy and redundancy as complementary information processing modes, which fulfill different roles in the big context of information integration.

      In addition, to prove that the identified high-order interactions are relevant to the phenomenon of consciousness, the same analysis was performed in subjects under anesthesia or with disorders of consciousness (DOC), showing that indeed the loss of consciousness is associated with a deficient integration of information within the gateway regions.

      However, there is something confusing in the redundancy and synergy matrices shown in Figure 2. These are pair-wise matrices, where the PID was applied to identify high-order interactions between pairs of brain regions. I understand that synergy and redundancy are assessed in the way the brain areas integrate information in time, but it is still a little contradictory to speak about high-order in pairs of areas. When talking about a "synergistic core", one expects that all or most of the areas belonging to that core are simultaneously involved in some (synergistic) information processing, and I do not see this being assessed with the currently presented methodology. Similarly, if redundancy is assessed only in pairs of areas, it may be due to simple correlations between them, so it is not a high-order interaction. Perhaps it is a matter of language, or about the expectations that the word 'synergy' evokes, so a clarification about this issue is needed. Moreover, as the rest of the work is based on these 'pair-wise' redundancy and synergy matrices, it becomes a significant issue.

      We are grateful for the opportunity to clarify this point. We should highlight that PhiID is in fact assessing four variables: the past of region X, the past of region Y, the future of region X, and the future of region Y. Since X and Y each feature both in the past and in the future, we can re-conceptualise the PhiID outputs as reflecting the temporal evolution of how X and Y jointly convey information: the persistent redundancy that we consider corresponds to information that is always present in both X and Y; whereas the persistent synergy is information that X and Y always convey synergistically. In contrast, information transfer would correspond to the phenomenon whereby information was conveyed by one variable in the past, and by the other in the future (see Luppi et al., 2024 TICS; and Mediano et al., 2021 arXiv for more thorough discussions on this point). We have now added this clarification in our Introduction and Results, as well as adding the new Figure 2 to clarify the meaning of PhiID terms.
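
      The bookkeeping behind this description can be sketched as follows. For two regions, the information carried in the past can be redundant, unique to X, unique to Y, or synergistic, and the same four modes apply to the future, which gives 16 past-to-future combinations. The short labels and the dictionary of named atoms below are illustrative choices, not notation taken from the manuscript, and the sketch only enumerates the atoms rather than estimating them from data.

```python
from itertools import product

# The four ways two regions X and Y can carry information at one time point.
PID_MODES = ("Red", "UnX", "UnY", "Syn")


def phiid_atoms():
    """Enumerate all past->future combinations of the four modes."""
    return [f"{past}->{future}" for past, future in product(PID_MODES, repeat=2)]


# Atoms discussed in the response above (labels are illustrative).
NAMED_ATOMS = {
    "Red->Red": "persistent redundancy",
    "Syn->Syn": "persistent synergy",
    "UnX->UnY": "information transfer from X to Y",
    "UnY->UnX": "information transfer from Y to X",
}
```

      Calling phiid_atoms() lists all 16 combinations; the analyses described here focus on the two persistent atoms, Red->Red and Syn->Syn.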

      We would also like to clarify that all the edges that we identify as significantly changing are indeed simultaneously involved in the difference between consciousness and unconsciousness. This is because the Network-Based Statistic differs from other ways of identifying edges that are significantly different between two groups or conditions, because it does not consider edges in isolation, but only as part of a single connected component.
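
      The component-based logic of the Network-Based Statistic can be sketched as below: edges whose test statistic exceeds a primary threshold are kept, and the surviving edges are grouped into connected components. This is a minimal sketch of that one step under assumed inputs (a dictionary of per-edge statistics with integer node labels, and a one-sided threshold); the permutation test that assigns a p-value to each component's size is omitted, and the function name is hypothetical.

```python
from collections import defaultdict, deque


def suprathreshold_components(edge_stats, threshold):
    """Group supra-threshold edges into connected components.

    edge_stats: dict mapping (u, v) node pairs to a test statistic.
    Returns a list of components, each a set of (u, v) edges with u < v,
    sorted from the largest component to the smallest.
    """
    adjacency = defaultdict(set)
    for (u, v), stat in edge_stats.items():
        if stat > threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search over the supra-threshold graph.
        queue = deque([start])
        seen.add(start)
        nodes = {start}
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    nodes.add(neighbour)
                    queue.append(neighbour)
        edges = {tuple(sorted((u, v))) for u in nodes for v in adjacency[u]}
        components.append(edges)
    return sorted(components, key=len, reverse=True)
```

      Because significance is assessed for a whole component rather than for each edge separately, every edge reported as changing belongs to one connected set of regions.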

      Reviewer #3 (Public Review):

      The work proposes a model of neural information processing based on a 'synergistic global workspace,' which processes information in three principal steps: a gatekeeping step (information gathering), an information integration step, and finally, a broadcasting step. The authors determined the synergistic global workspace based on previous work and extended the role of its elements using 100 fMRI recordings of the resting state of healthy participants of the HCP. The authors then applied network analysis and two different measures of information integration to examine changes in reduced states of consciousness (such as anesthesia and after-coma disorders of consciousness). They provided an interpretation of the results in terms of the proposed model of brain information processing, which could be helpful to be implemented in other states of consciousness and related to perturbative approaches. Overall, I found the manuscript to be well-organized, and the results are interesting and could be informative for a broad range of literature, suggesting interesting new ideas for the field to explore. However, there are some points that the authors could clarify to strengthen the paper. Key points include:

      (1) The work strongly relies on the identification of the regions belonging to the synergistic global workspace, which was primarily proposed and computed in a previous paper by the authors. It would be great if this computation could be included in a more explicit way in this manuscript to make it self-contained. Maybe include some table or figure being explicit in the Gradient of redundancy-to-synergy relative importance results and procedure.

      We have now added the new Supplementary Figure 1 to clarify how the synergistic workspace is identified, as per Luppi et al (2022 Nature Neuroscience).

      (2) It would be beneficial if the authors could provide further explanation regarding the differences in the procedure for selecting the workspace and its role within the proposed architecture. For instance, why does one case use the strength of the nodes while the other case uses the participation coefficient? It would be interesting to explore what would happen if the workspace was defined directly using the participation coefficient instead of the strength. Additionally, what impact would it have on the procedure if a different selection of modules was used? For example, instead of using the RSN, other criteria, such as modularity algorithms, PCA, Hidden Markov Models, Variational Autoencoders, etc., could be considered. The main point of my question is that the RSNs are probably quite redundant networks, whereas other methods, such as PCA, generate independent networks. It would be helpful if the authors could offer some comments on their intuition regarding these points without necessarily requiring additional computations.

      We appreciate the opportunity to clarify this point. Our rationale for the procedure used to identify the workspace is to find regions where synergy is especially prominent. This is due to the close mathematical relationship between synergistic information and integration of information (see also Luppi et al., 2024 TICS), which we view as the core function of the global workspace. This identification is based on the strength ranking, as per Luppi et al (2022 Nature Neuroscience), which demonstrated that regions where synergy predominates (i.e., our proposed workspace) are also involved with high-level cognitive functions and anatomically coincide with transmodal association cortices at the confluence of multiple information streams. This is what we should expect of a global workspace, which is why we use the strength of synergistic interactions to identify it, rather than the participation coefficient. Subsequently, to discern broadcasters from gateways within the synergistic workspace, we seek to encapsulate the meaning of a “broadcaster” in information terms. We argue that this corresponds with making the same information available to multiple modules. Sameness of information corresponds to redundancy, and multiplicity of modules can be reflected in the network-theoretic notion of participation coefficient. Thus, a broadcaster is a region in the synergistic workspace (i.e., a region with strong synergistic interactions) that in addition has a high participation coefficient for its redundant interactions.
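
      For readers unfamiliar with the participation coefficient, the sketch below shows its standard network-theoretic form: for each node, one minus the sum over modules of the squared fraction of the node's total connection strength that falls within each module. It is assumed here that this standard definition is the one applied to the redundancy matrix with resting-state networks as modules; the function name and the input format are illustrative.

```python
def participation_coefficient(weights, modules):
    """Participation coefficient of every node.

    weights: symmetric matrix (list of lists) of non-negative weights,
             e.g. pairwise redundancy between regions.
    modules: list giving the module label of each node,
             e.g. the resting-state network it belongs to.
    A value of 0 means all of a node's strength lies within one module;
    values near 1 mean it is spread evenly across many modules.
    """
    coefficients = []
    for i, row in enumerate(weights):
        strength_by_module = {}
        for j, weight in enumerate(row):
            if j == i:
                continue  # ignore self-connections
            label = modules[j]
            strength_by_module[label] = strength_by_module.get(label, 0.0) + weight
        total = sum(strength_by_module.values())
        if total == 0:
            coefficients.append(0.0)  # isolated node
        else:
            coefficients.append(
                1.0 - sum((s / total) ** 2 for s in strength_by_module.values())
            )
    return coefficients
```

      Under this reading, a broadcaster would be a workspace region whose redundant interactions yield a high value from this function, because its redundancy is distributed across several modules rather than concentrated in one.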

      Pertaining specifically to the use of resting-state networks as modules, indeed our own (Luppi et al., 2022 Nature Neuroscience) and others’ research has shown that each RSN entertains primarily redundant interactions among its constituent regions. This is not surprising, since RSNs are functionally defined: their constituent elements need to process the same information (e.g., pertaining to a visual task in case of the visual network). We used the RSNs as our definition of modules, because they are widely understood to reflect the intrinsic organisation of brain activity into functional units; for example, Smith et al., (2009 PNAS) and Cole et al (2014 Neuron) both showed that RSNs reflect task-related co-activation of regions, whether directly quantified from fMRI in individuals performing multiple tasks, or inferred from meta-analysis of the neuroimaging literature. This is the aspect of a “module” that matters from the global workspace perspective: modules are units with distinct function, and RSNs capture this well. This is therefore why we use the RSNs as modules when defining the participation coefficient: they provide an a-priori division into units with functionally distinct roles.

      Nonetheless, we also note that RSN organisation is robustly recovered using many different methods, including seed-based correlation from specific regions-of-interest, or Independent Components Analysis, or community detection on the network of inter-regional correlations - demonstrating that they are not merely a function of the specific method used to identify them. In fact, we show significant correlation between participation coefficient defined in terms of RSNs, and in terms of modules identified in a purely data-driven manner from Louvain consensus clustering (Figure S4).

      (3) The authors acknowledged the potential relevance of perturbative approaches in terms of PCI and quantification of consciousness. It would be valuable if the authors could also discuss perturbative approaches in relation to inducing transitions between brain states. In other words, since the authors investigate disorders of consciousness where interventions could provide insights into treatment, as suggested by computational and experimental works, it would be interesting to explore the relationship between the synergistic workspace and its modifications from this perspective as well.

      We thank the Reviewer for bringing this up: we now cite several studies that in recent years have applied perturbative approaches to induce transitions between states of consciousness.

      “The PCI is used as a means of assessing the brain’s current state, but stimulation protocols can also be adopted to directly induce transitions between states of consciousness. In rodents, carbachol administration to frontal cortex awakens rats from sevoflurane anaesthesia120, and optogenetic stimulation was used to identify a role of central thalamus neurons in controlling transitions between states of responsiveness121,122. Additionally, several studies in non-human primates have now shown that electrical stimulation of the central thalamus can reliably induce awakening from anaesthesia, accompanied by the reversal of electrophysiological and fMRI markers of anaesthesia 123–128. Finally, in human patients suffering from disorders of consciousness, stimulation of intra-laminar central thalamic nuclei was reported to induce behavioural improvement 129, and ultrasonic stimulation 130,131 and deep-brain stimulation are among potential therapies being considered for DOC patients 132,133. It will be of considerable interest to determine whether our corrected measure of integrated information and topography of the synergistic workspace are also restored by these causal interventions.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      I would appreciate it if the authors could revisit the figures and make sure that:

      (1) All fonts are large enough to be readable for people with visual impairments (for ex. the ranges on the colorbars in Fig. 2 are unreadably small).

      Thank you: we have increased font sizes.

      (2) The colormaps are scaled to show meaningful differences (Fig. 2A)

      We have changed the color scale in Figure 2A and 2B.

      Also, the authors may want to revisit the references section: some of the papers that were pre-prints at one point have now been published and should be updated.

      Thank you: we have updated our references.

      Minor comments:

      • In Eqs. 2 and 3, the unique information term uses the bar notation ( | ) that is typically indicative of "conditioned on." Perhaps the authors could use a slash notation (e.g. Unq(X ; Z / Y)) to avoid this ambiguity? My understanding of the Unique information is that it is not necessarily "conditioned on", so much as it is "in the context of".

      Indeed, the “|” sign of “conditioning” could be misleading; however, the “/” sign could also be misleading, if interpreted as division. Therefore, we have opted for the “\” sign of “set difference”, in Eq 2 and 3, which is conceptually more appropriate in this context.

      • The font on the figures is a little bit small - for readers with poor eyes, it might be helpful to increase the wording size.

      We have increased font sizes in the figures where relevant.

      • I don't quite understand what is happening in Fig. 2A - perhaps it is a colormap issue, but it seems as though it's just a big white square? It looks like redundancy is broadly correlated with FC (just based on the look of the adjacency matrices), but I have no real sense of what the synergistic matrix looks like, other than "flat."

      We have now changed the color scale in Figure 2.

      Reviewer #2 (Recommendations For The Authors):

      Besides the issues mentioned in the Public review, I have the following suggestions to improve the manuscript:

      • At the end of the introduction, a few lines could be added explaining why the study of DOC patients and subjects under anesthesia will be informative in the context of this work.

      By comparing functional brain scans from transient anaesthetic-induced unconsciousness and from the persistent unconsciousness of DOC patients, which arises from brain injury, we can search for common brain changes associated with loss of consciousness – thereby disambiguating what is specific to loss of consciousness.

      • On page and in general the first part of Results, it is not evident that you are working with functional connectivity. Many times the word 'connection' is used and sometimes I was wondering whether they were structural or functional. Please clarify. Also, the meaning of 'synergistic connection' or 'redundant connection' could be explained in lay terms.

      Thank you for bringing this up. We have now replaced the word “connection” with “interaction” to disambiguate this issue, further adding “functional” where appropriate. We have also provided, in the Introduction, an intuitive explanation of what synergy and redundancy mean in the context of spontaneous fMRI signals.

      • Figure 2 needs a lot of improvement. The matrix of synergistic interactions looks completely yellow-ish with some vague areas of white. So everything is above 2. What does it mean? Pretty uninformative. The matrix of redundant connections looks mostly black, with some red here and there. So everything is below 0.6. Also, what are the meaning and units of the colorbars?

      We agree: we have increased font sizes, added labels, and changed the color scale in Figure 2. We hope that the new version of Figure 2 will be clearer.

      • Caption of Figure 2 mentions "... brain regions identified as belonging to the synergistic global workspace". It was not clear to me how you define these areas. Are they just the sum of gateways and broadcasters, or is there another criterion?

      Regions belonging to the synergistic workspace are indeed the set comprising gateways and broadcasters; they are the regions that are synergy-dominated, as defined in Luppi et al., 2022 Nature Neuroscience. We have now clarified this in the figure caption.

      • In the first lines of page 7, it is said that data from DOC and anesthesia was parcellated in 400 + 54 regions. However, it was said in a manner that made me think it was a different parcellation than the other data. Please make it clear that the parcellation is the same (if it is).

      We have now clarified that the 400 cortical regions are from the Schaefer atlas, and 54 subcortical regions from the Tian atlas, as for the other analysis. The only other parcellation that we use is the Schaefer-232, for the robustness analysis. This is also reported in the Methods.

      • Figure 3: the labels in the colorbars cannot be read, please make them bigger. Also, the colorbars and colorscales should be centered in white, to make it clear that red is positive and blue is negative. Or at least maintain consistency across the panels (I can't tell because of the small numbers).

      Thank you: we have increased font sizes, added labels, indicated that white refers to zero (so that red is always an increase, and blue is always a decrease), and changed the color scale in Figure 2.

      • The legend of Figure 4 is written in a different style, interpreting the figure rather than describing it. Please describe the figure in the caption, in order to let the reader know what they are looking at.

      We have endeavoured to rewrite the legend of Figure 4 in a style that is more consistent with the other figures.

      • In several parts the 'whole-minus-sum' phi measure is mentioned and it is said that it did not decrease during loss of consciousness. However, I did not see any figure about that nor any conspicuous reference to that in Results text. Where is it?

      We apologise for the confusion: this is Figure S3A, in the Supplementary. We have now clarified this in the text.

      Reviewer #3 (Recommendations For The Authors):

      (1) In the same direction, regarding Fig. 2, in my opinion, it does not effectively aid in understanding the selection of regions as more synergistic or redundant. In panels A) and B), the color scales could be improved to better distinguish regions in the matrices (panel A) is saturated at the upper limit, while panel B) is saturated at the lower limit). Additionally, I suggest indicating in the panels what is being measured with the color scales.

      Thank you: we have increased font sizes, added labels, and changed the color scale in Figure 2.

      (2) When investigating the synergistic core of human consciousness and interpreting the results of changes in information integration measures in terms of the proposed framework, did the authors consider the synergistic workspace computed in HCP data? If the answer is positive, it would be helpful for the authors to be more explicit about it and elaborate on any differences that may be found, as well as the potential impact on interpretation.

      This is correct: the synergistic workspace, including gateways and broadcasters, is identified from the Human Connectome Project dataset. We now clarify this in the manuscript.

      Minors:

      (1) I would suggest improving the readability of figures 2 and 3, considering font size (letters and numbers) and color bars (numbers and indicate what is measured with this scale). In Figure 1, the caption defines steps instead of the stages that are indicated in the figure.

      Thank you: we have increased font sizes, added labels, and replaced steps with “stages” in Figure 1.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We summarized the main changes:

      (1) In the Introduction part, we give a general definition of habitat fragmentation to avoid confusion, as reviewers #1 and #2 suggested.

      (2) We clarify the two aspects of the observed “extinction”: “true dieback” and “emigration”, as reviewers #2 and #3 suggested.

      (3) In the Methods part, we 1) clarify the reason for testing the temporal trend in colonization/extinction dynamics and describe how to select islands as reviewer #1 suggested; 2) describe how to exclude birds from the analysis as reviewer #2 suggested.

      (4) In the Results part, we modified and rearranged Figures 4-6 as reviewers #1, #2 and #3 suggested.

      (5) In the Discussion part, we 1) discuss the multiple aspects of the metric of isolation for future research as reviewer #3 suggested; 2) provide concrete evidence about the relationship between habitat diversity or heterogeneity and island area and 3) provide a wider perspective about how our results can inform conservation practices in fragmented habitats as reviewer #2 suggested.

      eLife Assessment

      This important study enhances our understanding of how habitat fragmentation and climate change jointly influence bird community thermophilization in a fragmented island system. The evidence supporting some conclusions is incomplete, as while the overall trends are convincing, some methodological aspects, particularly the isolation metrics and interpretation of colonization/extinction rates, require further clarification. This work will be of broad interest to ecologists and conservation biologists, providing crucial insights into how ecosystems and communities react to climate change.

      We sincerely extend our gratitude to you and the esteemed reviewers for acknowledging the importance of our study and for raising these concerns. We have clarified the rationale behind our analysis of temporal trends in colonization and extinction dynamics, as well as the choice of distance to the mainland as the isolation metric. Additionally, we further discuss the multiple aspects of the metric of isolation for future research and provide concrete supporting evidence about the relationship between habitat diversity or heterogeneity and island area.

      Incorporating these valuable suggestions, we have thoroughly revised our manuscript, ensuring that it now presents a more comprehensive and nuanced account of our research. We are confident that these improvements will further enhance the impact and relevance of our work for ecologists and conservation biologists alike, offering vital insights into the resilience and adaptation strategies of communities facing the challenges of climate change.

      Reviewer #1 (Public Review):

      Summary:

      This study reports on the thermophilization of bird communities in a network of islands with varying areas and isolation in China. Using data from 10 years of transect surveys, the authors show that warm-adapted species tend to gradually replace cold-adapted species, both in terms of abundance and occurrence. The observed trends in colonisations and extinctions are related to the respective area and isolation of islands, showing an effect of fragmentation on the process of thermophilization.

      Strengths:

      Although thermophilization of bird communities has been already reported in different contexts, it is rare that this process can be related to habitat fragmentation, despite the fact that it has been hypothesized for a long time that it could play an important role. This is made possible thanks to a really nice study system in which the construction of a dam has created this incredible Thousand Islands lake. Here, authors do not simply take observed presence-absence as granted and instead develop an ambitious hierarchical dynamic multi-species occupancy model. Moreover, they carefully interpret their results in light of their knowledge of the ecology of the species involved.

      Response: We greatly appreciate your recognition of our study system and the comprehensive approach and careful interpretation of results. 

      Weaknesses:

      Despite the clarity of this paper on many aspects, I see a strong weakness in the authors' hypotheses, which obscures the interpretation of their results. Looking at Figure 1, and in many sentences of the text, a strong baseline hypothesis is that thermophilization occurs because of an increasing colonisation rate of warm-adapted species and extinction rate of cold-adapted species. However, there does not need to be a temporal trend! Any warm-adapted species that colonizes a site has a positive net effect on CTI; similarly, any cold-adapted species that goes extinct contributes to thermophilization.

      Thank you very much for these thoughtful comments. The understanding depends on the time frame of the study and specifically, whether the system is at equilibrium. We think your claim is based on this background: if the system is not at equilibrium, then CTI can shift simply by having differential colonization (or extinction) rates for warm-adapted versus cold-adapted species. We agree with you in this case.

      On the other hand, if a community is at equilibrium, then there will be no net change in CTI over time. Imagine we have an archipelago where the average colonization of warm-adapted species is larger than the average colonization of cold-adapted species, then over time the archipelago will reach an equilibrium with stable colonization/extinction dynamics where the average CTI is stable over time. Once it is stable, then if there is a temporal trend in colonization rates, the CTI will change until a new equilibrium is reached (if it is reached).

      For our system, the question then is whether we can assume that the system is or has ever been at equilibrium. If it is not at equilibrium, then CTI can shift simply by having differential colonization (or extinction) rates for warm-adapted versus cold-adapted species. If the system is at equilibrium (at the beginning of the study), then CTI will only shift if there is a temporal change or trend in colonization or extinction rates.
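
      The equilibrium argument above can be illustrated with a minimal sketch. It assumes a deterministic two-species occupancy model in which a species occupying a fraction psi of islands persists with probability 1 - e and colonizes empty islands with probability c, so that equilibrium occupancy is c / (c + e). All rates, STI values and function names below are hypothetical and are not taken from the manuscript or its R scripts.

```python
def step(psi, col, ext):
    """Expected occupancy next year: occupied sites persist, empty sites are colonized."""
    return psi * (1.0 - ext) + (1.0 - psi) * col


def cti(psis, stis):
    """Occupancy-weighted mean STI across species (a stand-in for CTI)."""
    return sum(p * s for p, s in zip(psis, stis)) / sum(psis)


def simulate(psi0, col, ext, stis, years, col_trend=(0.0, 0.0)):
    """Return the CTI series; col_trend adds a yearly increment to each colonization rate."""
    psis = list(psi0)
    series = [cti(psis, stis)]
    for t in range(years):
        psis = [step(p, c + k * t, e)
                for p, c, e, k in zip(psis, col, ext, col_trend)]
        series.append(cti(psis, stis))
    return series


# Hypothetical values: species 0 is cold-adapted, species 1 is warm-adapted.
STIS = [15.0, 20.0]
COL = [0.2, 0.3]   # the warm-adapted species colonizes faster
EXT = [0.2, 0.2]
EQ = [c / (c + e) for c, e in zip(COL, EXT)]  # equilibrium occupancies

# Case 1: constant differential rates, system starts at equilibrium.
at_equilibrium = simulate(EQ, COL, EXT, STIS, years=10)
# Case 2: same constant rates, system starts away from equilibrium.
off_equilibrium = simulate([0.5, 0.2], COL, EXT, STIS, years=10)
# Case 3: starts at equilibrium, warm-adapted colonization rate rises each year.
with_trend = simulate(EQ, COL, EXT, STIS, years=10, col_trend=(0.0, 0.01))

for name, series in [("at equilibrium", at_equilibrium),
                     ("off equilibrium", off_equilibrium),
                     ("with trend", with_trend)]:
    print(f"{name}: CTI change over 10 years = {series[-1] - series[0]:+.3f}")
```

      In this toy setting CTI stays flat in the first case even though the warm-adapted species colonizes faster, and shifts in the other two, which is the distinction drawn in the paragraphs above.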

      Habitat fragmentation can affect biomes for decades after dam formation. The “Relaxation effect” (Gonzalez, 2000) refers to the fact that the continent acts as a potential species pool for island communities. Under relaxation, some species will be filtered out over time, mainly through the selective extinction of species that are highly sensitive to fragmentation. Meanwhile, for a 100-hectare patch, it takes about ten years to lose 50% of bird species; the smaller the patch area, the shorter the time required (Ferraz et al., 2003; Haddad et al., 2015). This study was conducted 50 to 60 years after the formation of the TIL, so the system has a high probability of having reached “equilibrium” through the “Relaxation effect” (Si et al., 2014). However, we have no way of knowing for certain whether our system is at “equilibrium”. Thus, testing for changing rates of colonization-extinction over time is actually a much stronger test of thermophilization, which makes our inference more robust.

      We add a note to the legend of Figure 1 on Lines 781-786:

      “CTI can also change simply due to differential colonization-extinction rates by thermal affinity if the system is not at equilibrium prior to the study. In our study system, we have no way of knowing whether our island system was at equilibrium at the onset of the study, thus, focusing on changing rates of colonization-extinction over time presents a much stronger test of thermophilization.”

      We hope this statement can make it clear. Thank you again for this meaningful question.

      Another potential weakness is that fragmentation is not clearly defined. Generally, fragmentation sensu lato involves both loss of habitat area and changes in the spatial structure of habitats (i.e. fragmentation per se). Here, both area and isolation are considered, which may be slightly confusing for the readers if not properly defined.

      Thank you for reminding us of that. Habitat fragmentation in this study involves both habitat loss and fragmentation per se. We have clarified the general definition in the Introduction on Lines 61-63:

      “Habitat fragmentation, usually defined as the shifts of continuous habitat into spatially isolated and small patches (Fahrig, 2003), in particular, has been hypothesized to have interactive effects with climate change on community dynamics.”

      Reviewer #2 (Public Review):

      Summary:

      This study addresses whether bird community reassembly in time is related to climate change by modelling a widely used metric, the community temperature index (CTI). The authors first computed the temperature index of 60 breeding bird species thanks to distribution atlases and climatic maps, thus obtaining a measure of the species realized thermal niche.

      These indices were aggregated at the community level, using 53 survey transects of 36 islands (repeated for 10 years) of the Thousand Islands Lake, eastern China. Any increment of this CTI (i.e. thermophilization) can thus be interpreted as a community reassembly caused by a change in climate conditions (given no confounding correlations).

      Using a mix of Bayesian and frequentist mixed-effect models, the authors show an increment of CTI at the island level, driven by both extinction (or emigration) of cold-adapted species and colonization of newly adapted warm-adapted species. Less isolated islands displayed higher colonization and extinction rates, confirming that dispersal constraints (created by habitat fragmentation per se) on colonization and emigration are the main determinants of thermophilization. The authors also had the opportunity to test for habitat amount (here island size). They show that the lack of microclimatic buffering resulting from less forest amount (a claim backed by understory temperature data) exacerbated the rates of cold-adapted species extinction while fostering the establishment of warm-adapted species.

      Overall these findings are important to range studies as they reveal the local change in affinity to the climate of species comprising communities while showing that the habitat fragmentation VS amount distinction is relevant when studying thermophilization. As is, the manuscript lacks a wider perspective about how these results can be fed into conservation biology, but would greatly benefit from it. Indeed, this study shows that in a fragmented reserve context, habitat amount is very important in explaining trends of loss of cold-adapted species, hinting that it may be strategic to prioritize large habitats to conserve such species. Areas of diverse size may act as stepping stones for species shifting range due to climate change, with small islands fostering the establishment of newly adapted warm-adapted species while large islands act as refugia for cold-adapted species. This study also shows that the removal of dispersal constraints with low isolation may help species relocate to the best suitable microclimate in a heterogenous reserve context.

      Thank you very much for your valuable feedback. We greatly appreciate your recognition of the scientific question, the extensive dataset and the diverse approach. In particular, you provided constructive suggestions and examples on how to extend the results to conservation guidance. This is something we should not have overlooked in the manuscript. We have added a paragraph to the end of the Discussion, stating how our results can inform conservation, on Lines 339-347:

      ‘Overall, our findings have important implications for conservation practices. First, we confirmed the role of isolation in limiting range shifting. Better connected landscapes should be developed to remove dispersal constraints and facilitate species’ relocation to the best suitable microclimate. Second, small patches can foster the establishment of newly adapted warm-adapted species while large patches can act as refugia for cold-adapted species. Therefore, preserving patches of diverse sizes can act as stepping stones or shelters in a warming climate depending on the thermal affinity of species. These insights are an important supplement to the previous emphasis on the role of habitat diversity in fostering (Richard et al., 2021) or reducing (Gaüzère et al., 2017) community-level climate debt.’

      Strength:

      The strength of the study lies in its impressive dataset of bird resurveys, that cover 10 years of continued warming (as evidenced by weather data), 60 species in 36 islands of varying size and isolation, perfect for disentangling habitat fragmentation and habitat amount effects on communities. This distinction allows us to test very different processes mediating thermophilization; island area, linked to microclimatic buffering, explained rates for a variety of species. Dispersal constraints due to fragmentation were harder to detect but confirms that fragmentation does slow down thermophilization processes.

      This study is a very good example of how the expected range shift at the biome scale of the species materializes in small fragmented regions. Specifically, the regional dynamics the authors show are analogous to what processes are expected at the trailing and colonizing edge of a shifting range: warmer and more connected places display the fastest turnover rates of community reassembly. The authors also successfully estimated extinction and colonization rates, allowing a more mechanistic understanding of CTI increment, being the product of two processes.

      The authors showed that regional diversity and CTI computed only by occurrences do not respond in 10 years of warming, but that finer metrics (abundance-based, or individual islands considered) do respond. This highlights the need to consider a variety of case-specific metrics to address local or regional trends. Figure Appendix 2 is a much-appreciated visualization of the effect of different data sources on Species thermal Index (STI) calculation.

      The methods are long and diverse, but they are documented enough so that an experienced user with the use of the provided R script can follow and reproduce them.

      Thank you very much for your profound Public Review. We greatly appreciate your recognition of the scientific question, the extensive dataset and the diverse approach. 

      Weaknesses:

      While the overall message of the paper is supported by data, the claims are not uniformly backed by the analysis. The trends of island-specific thermophilization are very credible (Figure 3), however, the variable nature of bird observations (partly compensated by an impressive number of resurveys) propagate a lot of errors in the estimation of species-specific trends in occupancy, abundance change, and the extinction and colonization rates. This materializes into a weak relationship between STI and their respective occupancy and abundance change trends (Figure 4a, Figure 5, respectively), showing that species do not uniformly contribute to the trend observed in Figure 3. This is further shown by the results presented in Figure 6, which present in my opinion the topical finding of the study. While a lot of species rates response to island areas are significant, the isolation effect on colonization and extinction rates can only be interpreted as a trend as only a few species have a significant effect. The actual effect on the occupancy change rates of species is hard to grasp, and this trend has a potentially low magnitude (see below).

      Thank you very much for pointing out this shortcoming. The R2 between STI and the species' occupancy trends is relatively small (R2 = 0.035), whereas the R2 between STI and the abundance change trends is larger (R2 = 0.123), which is appreciable in the context of ecological research. The R2 values between STI and the colonization rate trends (R2 = 0.083) and extinction rate trends (R2 = 0.053) are also relatively small. A low R2 indicates that we cannot make predictions using the current model, and that factors other than STI may influence species-specific occupancy trends. Nonetheless, it is important to note that the standardized coefficient estimates are not minor and the trends are significant, indicating that species-specific responses are at least related to STI.

      The number of species that have significant interaction terms for isolation (Figure 6) is indeed low. Although there is uncertainty in the estimation of relationships, there are also consistent trends in response to habitat fragmentation of colonization of warm-adapted species and extinction of cold-adapted species. This is especially true for the effect of isolation, where on islands nearer to the mainland, warm-adapted species (15 out of 15 investigated species) increased their colonization probability at a higher rate over time, while most cold-adapted species (21 out of 23 species) increased their extinction probability at a higher rate. We now better highlight these results in the Results and Discussion.

      While being well documented, the myriad of statistical methods used by the authors hampers the interpretation of the figures, as the posterior means presented in Figure 4b and Figure 6 need to be transformed again by an inverse logit and fed into the equation of the respective model to make sense of them. I suggest a rewording of the caption to limit its dependence on the method section for interpretation.

      Thank you for this suggestion. The value on the Y axis indicates the posterior mean of each variable (year, area, isolation and their interaction effects) extracted from the MSOM, where logit(extinction rate) or logit(colonization rate) was the response variable. All variables were standardized before analysis to make them comparable, so interpretation is actually quite straightforward: positive values indicate positive influence while negative values indicate negative influence. Because the goal of Figure 6 is to display the negative/positive effect, we didn’t back-transform them. Following your advice, we modified the caption of Figure 6 (now renumbered as Figure 5, following a comment from Reviewer #3 to move Figure 5 to Figure 4c). The modified title and legend of Figure 5 are on Lines 817-820:

      “Figure 5. Posterior estimates of logit-scale parameters related to cold-adapted species’ extinction rates and warm-adapted species’ colonization rates. Points are species-specific posterior means on the logit-scale, where parameters >0 indicate positive effects (on extinction [a] or colonization [b]) and parameters <0 indicate negative effects...”
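
      For readers who do want values on the probability scale, the back-transformation mentioned by the reviewer can be sketched as follows. This is an illustration only: the intercept and effect size are hypothetical numbers, not posterior means from the MSOM, and `rate` is a made-up helper name.

```python
import math


def inv_logit(x):
    """Inverse logit: maps a logit-scale value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def rate(intercept, effect, covariate):
    """Colonization or extinction probability for a standardized covariate value."""
    return inv_logit(intercept + effect * covariate)


# Hypothetical logit-scale parameters, for illustration only.
intercept = -1.0
positive_effect = 0.5
negative_effect = -0.5

baseline = rate(intercept, positive_effect, 0.0)   # covariate at its mean
raised = rate(intercept, positive_effect, 1.0)     # one SD above the mean
lowered = rate(intercept, negative_effect, 1.0)    # one SD above the mean

print(f"baseline={baseline:.3f} positive effect={raised:.3f} negative effect={lowered:.3f}")
```

      Because the inverse logit is monotonically increasing, the sign of a logit-scale parameter always matches the direction of its effect on the rate, which is why the sign alone is enough to read the figure.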

      By using a broad estimate of the realized thermal niche, a common weakness of thermophilization studies is the inability to capture local adaptation in species' physiological or behavioral response to a rise in temperature. The authors however acknowledge this limitation and provide specific examples of how species ought to evade high temperatures in this study region.

      We appreciate your recognition. This is a common problem in STI studies. We hope that future studies can take more detailed information on the microclimate of species’ true habitats across regions into account when calculating STI. Although challenging, focusing on a smaller portion of a species’ distribution range may make this more achievable.

      Reviewer #3 (Public Review):

      Summary:

      Juan Liu et al. investigated the interplay between habitat fragmentation and climate-driven thermophilization in birds in an island system in China. They used extensive bird monitoring data (9 surveys per year per island) across 36 islands of varying size and isolation from the mainland covering 10 years. The authors use extensive modeling frameworks to test a general increase in the occurrence and abundance of warm-dwelling species and vice versa for cold-dwelling species using the widely used Community Temperature Index (CTI), as well as the relationship between island fragmentation in terms of island area and isolation from the mainland on extinction and colonization rates of cold- and warm-adapted species. They found that indeed there was thermophilization happening during the last 10 years, which was more pronounced for the CTI based on abundances and less clearly for the occurrence-based metric. Generally, the authors show that this is driven by an increased colonization rate of warm-dwelling and an increased extinction rate of cold-dwelling species. Interestingly, they unravel some of the mechanisms behind this dynamic by showing that warm-adapted species increased while cold-dwelling decreased more strongly on smaller islands, which is - according to the authors - due to lowered thermal buffering on smaller islands (which was supported by air temperature monitoring done during the study period on small and large islands). They argue, that the increased extinction rate of cold-adapted species could also be due to lowered habitat heterogeneity on smaller islands. With regards to island isolation, they show that also both thermophilization processes (increase of warm and decrease of cold-adapted species) were stronger on islands closer to the mainland, due to closer sources to species populations of either group on the mainland as compared to limited dispersal (i.e. range shift potential) in more isolated islands.

      The conclusions drawn in this study are sound, and mostly well supported by the results. Only a few aspects leave open questions and could quite likely be further supported by the authors themselves thanks to their apparent extensive understanding of the study system.

      Strengths:

      The study questions and hypotheses are very well aligned with the methods used, ranging from field surveys to extensive modeling frameworks, as well as with the conclusions drawn from the results. The study addresses a complex question on the interplay between habitat fragmentation and climate-driven thermophilization which can naturally be affected by a multitude of additional factors than the ones included here. Nevertheless, the authors use a well-balanced method of simplifying this to the most important factors in question (CTI change, extinction, and colonization, together with habitat fragmentation metrics of isolation and island area). The interpretation of the results presents interesting mechanisms without being too bold on their findings and by providing important links to the existing literature as well as to additional data and analyses presented in the appendix.

      We very much appreciate your positive and constructive comments and suggestions. Thank you for your recognition of the scientific question, the modeling approach and the conclusions.

      Weaknesses:

      The metric of island isolation based on the distance to the mainland seems a bit too oversimplified as in real life the study system rather represents an island network where the islands of different sizes are in varying distances to each other, such that smaller islands can potentially draw from the species pools from near-by larger islands too - rather than just from the mainland. Thus a more holistic network metric of isolation could have been applied or at least discussed for future research. The fact, that the authors did find a signal of island isolation does support their method, but the variation in responses to this metric could hint at a more complex pattern going on in real-life than was assumed for this study.

      Thank you for this meaningful question. Isolation can be measured in different ways in the study region. We chose the distance to the mainland as a measure of isolation based on the results of a previous study. One study in our system provided evidence that the colonization rate and extinction rate of breeding bird species were best fitted using distance to the nearest mainland over other distance-based measures (distance to the nearest landmass, distance to the nearest bigger landmass)(Si et al., 2014). Besides, their results produced almost identical patterns of the relationship between isolation and colonization/extinction rate (Si et al., 2014). That’s why we only selected “Distance to the mainland” in our current analysis and we do find some consistent patterns as expected. The plants on all islands were cleared out about 60 years ago due to dam construction, with all bird species coming from the mainland as the original species pool through a process called “relaxation”. This could be the reason why distance to the nearest mainland is the best predictor.

      We agree with you that it’s still necessary to consider more aspects of “isolation” at least in discussion for future research. In our Discussion, we address these on Lines 292-299:

      “As a caveat, we only consider the distance to the nearest mainland as a measure of fragmentation, consistent with previous work in this system (Si et al., 2014), but we acknowledge that other distance-based metrics of isolation that incorporate inter-island connections could reveal additional insights on fragmentation effects. The spatial arrangement of islands, like the arrangement of habitat, can influence niche tracking of species (Fourcade et al., 2021). Future studies should take these metrics into account to thoroughly understand the influence of isolation and spatial arrangement of patches in mediating the effect of climate warming on species.”

      Further, the link between larger areas and higher habitat diversity or heterogeneity could be presented by providing evidence for this relationship. The authors do make a reference to a paper done in the same study system, but a more thorough presentation of it would strengthen this assumption further.

      Thank you very much for this question. We now add more details about the relationship between habitat diversity or heterogeneity and island area, based on a related study in the same system. The observed number of species significantly increased with increasing island area (slope = 4.42, R2 = 0.70, p < .001), as did the rarefied species richness per island (slope = 1.03, R2 = 0.43, p < .001), species density (slope = 0.80, R2 = 0.33, p = .001) and the rarefied species richness per unit area (slope = 0.321, R2 = 0.32, p = .001). We added this supporting evidence on Lines 317-321:

      “We thus suppose that habitat heterogeneity could also mitigate the loss of these relatively cold-adapted species as expected. Habitat diversity, including the observed number of species, the rarefied species richness per island, species density and the rarefied species richness per unit area, all increased significantly with island area instead of isolation in our system (Liu et al., 2020)”

      Despite the general clear patterns found in the paper, there were some idiosyncratic responses. Those could be due to a multitude of factors which could be discussed a bit better to inform future research using a similar study design.

      Thank you for these suggestions. We added a summary statement about the reasons for idiosyncratic responses on Lines 334-338:

      “Overall, these idiosyncratic responses reveal several possible mechanisms in regulating species' climate responses, including resource demands and biological interactions like competition and predation. Future studies are needed to take these factors into account to understand the complex mechanisms by which habitat loss mediates species range shifts.”

      Reviewer #1 (Recommendations For The Authors):

      (1) Figure 1: I disagree that there should be a temporal trend in colonisation/extinction dynamics.

      Thank you again for these thoughtful comments. We have explained in detail in the response to the Public Review.

      (2) L 485-487: As explained before I disagree. I don't see why there needs to be a temporal trend in colonization and extinction.

      Thank you again for these thoughtful comments. Because we can’t guarantee that the study system has reached equilibrium, changing rates of colonization-extinction over time is actually a much stronger test of thermophilization. A more detailed explanation is given in the response to the Public Review.

      (3) L 141: which species' ecological traits?

      Sorry for the confusion. The traits included continuous variables (dispersal ability, body size, body mass and clutch size) and categorical variables (diet, active layer, residence type). Specifically, we tested the correlation between STI and dispersal ability, body size, body mass and clutch size using the Pearson correlation test. We also tested the difference in STI between trait groups using the Wilcoxon signed-rank test for the three categorical variables: diet (carnivorous/omnivorous/herbivorous), active layer (canopy/mid/low), and residence type (resident species/summer visitor). There was no significant difference between any two groups for any of the three categorical variables (p > 0.2). We added these on Lines 141-145:

      “No significant correlation was found between STI and species’ ecological traits; specifically, the continuous variables of dispersal ability, body size, body mass and clutch size (Pearson correlations for each, |r| < 0.22), and the categorical variables of diet (carnivorous/omnivorous/herbivorous), active layer (canopy/mid/low), and residence type (resident species/summer visitor)”

      (4) L 143: CTIoccur and CTIabun were not defined before.

      Because CTIoccur and CTIabun are first defined in the Methods (section 4.4), we changed the sentence here to a more general statement on Lines 147-150:

      “At the landscape scale, considering species detected across the study area, occurrence-based CTI (CTIoccur; see section 4.4) showed no trend (posterior mean temporal trend = 0.414; 95% CrI: -12.751, 13.554) but abundance-based CTI (CTIabun; see section 4.4) showed a significant increasing trend.”
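
      The difference between the two indices can be made concrete with a small sketch. It assumes the common definitions (occurrence-based CTI as the unweighted mean STI of the species present, abundance-based CTI as the abundance-weighted mean STI); section 4.4 is not reproduced here, so the exact formulas used in the manuscript may differ. Species names, STI values and counts are hypothetical.

```python
def cti_occur(community, sti):
    """Unweighted mean STI of species present (count > 0) in the community."""
    present = [sp for sp, n in community.items() if n > 0]
    return sum(sti[sp] for sp in present) / len(present)


def cti_abun(community, sti):
    """Abundance-weighted mean STI of the community."""
    total = sum(community.values())
    return sum(n * sti[sp] for sp, n in community.items()) / total


# Hypothetical STI values (degrees C) and counts, for illustration only.
sti = {"cold_sp": 14.0, "warm_sp": 20.0}
year_1 = {"cold_sp": 10, "warm_sp": 10}
year_2 = {"cold_sp": 5, "warm_sp": 15}   # same species present, abundances shifted

print("CTIoccur:", cti_occur(year_1, sti), "->", cti_occur(year_2, sti))
print("CTIabun: ", cti_abun(year_1, sti), "->", cti_abun(year_2, sti))
```

      In this toy case the species list is unchanged, so the occurrence-based index stays flat while the abundance-based index rises, which is one way the two metrics can diverge at the landscape scale.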

      (5) Figure 4: what is the dashed vertical line? I assume the mean STI across species?

      Sorry for the unclear description. The vertical dashed line indicates the median value of STI for 60 species, as a separation of warm-adapted species and cold-adapted species. We have added these details on Lines 807-809:

      “The dotted vertical line indicates the median of STI values. Cold-adapted species are plotted in blue and warm-adapted species are plotted in orange.”

      (6) Figure 6: in the legend, replace 'points in blue' with 'points in blue/orange' or 'solid dots' or something similar.

      Thank you for this suggestion. We changed it to “points in blue/orange” on Line 823.

      (7) L 176-176: unclear why the interaction parameters are particularly important for explaining the thermophilization mechanism: if e.g. colonization rate of warm-adapted species is constantly higher in less isolated islands, (and always higher than the extinction rate of the same species), it means that thermophilization is increased in less isolated islands, right?

      Thank you for this question. This is also related to the question of why we use temporal trends in colonization/extinction rates to test for thermophilization mechanisms. Changing rates of colonization-extinction over time is actually a much stronger test of thermophilization (for more details, see our response to the Public Review and to Recommendations 1 and 2).

      Based on this, the two main processes driving thermophilization are the increasing colonization rate of warm-adapted species and the increasing extinction rate of cold-adapted species over the years. The interaction between island area (or isolation) and year on colonization rate (or extinction rate) tells us how habitat fragmentation mediates the year effect. For example, if the interaction term between year and isolation is negative for a warm-adapted species whose colonization rate increased with year, the colonization rate increased faster on less isolated islands. This signals a faster thermophilization rate on less isolated islands.
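
      The reading of the interaction sign can be illustrated with a minimal sketch. It assumes a logit-linear model in which the year effect on colonization is `beta_year + beta_interaction * isolation_z`, with isolation z-scored; the function name and the coefficient values are hypothetical and are not estimates from the MSOM.

```python
# Illustrative only: the coefficients below are hypothetical, not MSOM estimates.

def year_slope(beta_year: float, beta_interaction: float, isolation_z: float) -> float:
    """Effective year effect on logit(colonization) at a given z-scored isolation."""
    return beta_year + beta_interaction * isolation_z


beta_year = 0.5           # assumed positive temporal trend in colonization
beta_interaction = -0.25  # assumed negative year x isolation interaction

# One standard deviation below / above the mean isolation.
slope_near = year_slope(beta_year, beta_interaction, isolation_z=-1.0)
slope_far = year_slope(beta_year, beta_interaction, isolation_z=1.0)

# With a negative interaction, the year effect is larger on the less isolated island.
print(f"near mainland: {slope_near}, far from mainland: {slope_far}")
```

      Under these assumed values, the colonization trend is steeper on the less isolated island, which is the pattern interpreted above as faster thermophilization.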

      (8) L201-203: this is only little supported by the results that actually show that there is NO significant interaction for most species.

      Thank you for this comment. Although most species showed a non-significant interaction effect, the overall trend is relatively consistent, especially for the effect of isolation. To emphasize the “trend” rather than a “significant effect”, we reworded this sentence more rigorously on Lines 205-208:

      “We further found that habitat fragmentation influences two processes of thermophilization: colonization rates of most warm-adapted species tended to increase faster on smaller and less isolated islands, while the loss rates of most cold-adapted species tended to be exacerbated on less isolated islands.”

      (9) Section 2.3: can't you have a population-level estimate? I struggled a bit to understand all the parameters of the MSOM (because of my lack of statistical/mathematical proficiency) so I cannot provide more advice here.

      Thank you for this suggestion. We understand that you are referring to an overall estimate across all species for each variable. From the MSOM, we obtain a standardized estimate of every variable (year, area, isolation, interaction) for each species separately. Because the divergent or consistent responses among species are what we are interested in, we did not go further and calculate a population-level estimate.

      (10) L 291: a dot is missing.

      Done. Thank you for your correction.

      (11) L 305, 315: a space is missing

      Done

      (12) L 332: how were these islands selected?

      Thank you for this question. The 36 islands were selected along a gradient of island area and isolation, spreading across the whole lake region. The selection ensured that there is no significant correlation between island area and isolation (Pearson correlation coefficient r = -0.21, p = 0.21). The seven largest of the 36 islands are also the only islands larger than 30 ha in the whole lake region. We have modified this in the Methods on Lines 360-363:

      “We selected 36 islands according to a gradient of island area and isolation with a guarantee of no significant correlation between island area and isolation (Pearson r = -0.21, p = 0.21). For each island, we calculated island area and isolation (measured in the nearest Euclidean distance to the mainland) to represent the degree of habitat fragmentation.”

      (13) L 334: "Distance to the mainland" was used as a metric of isolation, but elsewhere in the text you argue that the observed thermophilization is due to interisland movements. It sounds contradictory. Why not include the average or shortest distance to the other islands?

      Thank you very much for raising this comment. Yes, “Distance to the mainland” was the only metric we used for isolation. We carefully checked the manuscript to find where the notion of “interisland movement” comes from and what caused the misunderstanding. It must come from Discussion 3.1 (on Lines 217-221): “Notably, when tested on the landscape scale (versus on individual island communities), only the abundance-based thermophilization trend was significant, indicating thermophilization of bird communities was mostly due to inter-island occurrence dynamics, rather than exogenous community turnover.”

      Sorry, the word “inter-island” is not exactly what we wanted to express here. We meant that “the thermophilization was mostly due to occurrence dynamics within the region, rather than exogenous community turnover outside the region”. We have changed the sentence in the Discussion on Lines 217-221:

      “Notably, when tested on the landscape scale (versus on individual island communities), only the abundance-based thermophilization trend was significant, indicating thermophilization of bird communities was mostly due to occurrence dynamics within the region, rather than exogenous community turnover outside the region.”

      In addition, we would like to explain why we used distance to the mainland. We chose it as the measure of isolation based on a previous study in our system, which showed that the colonization and extinction rates of breeding bird species were best fitted using distance to the nearest mainland rather than other distance-based measures (distance to the nearest landmass, distance to the nearest bigger landmass) (Si et al., 2014). Moreover, the different measures produced almost identical patterns in the relationship between isolation and colonization/extinction rate (Si et al., 2014). That is why we selected only “Distance to the mainland” in our current analysis, and we do find some consistent patterns, as expected. The plants on all islands were cleared out about 60 years ago due to dam construction, with all bird species coming from the mainland as the original species pool through a process called “relaxation”. This may be the reason why distance to the nearest mainland is the best predictor.

      In the Discussion, we added the following text, which addresses the other measures, on Lines 292-299:

      “As a caveat, we only consider the distance to the nearest mainland as a measure of fragmentation, consistent with previous work in this system (Si et al., 2014), but we acknowledge that other distance-based metrics of isolation that incorporate inter-island connections could reveal additional insights on fragmentation effects. The spatial arrangement of islands, like the arrangement of habitat, can influence niche tracking of species (Fourcade et al., 2021). Future studies should take these metrics into account to thoroughly understand the influence of isolation and spatial arrangement of patches in mediating the effect of climate warming on species.”

      (14) L 347: you write 'relative' abundance but this measure is not relative to anything. Better write something like "we based our abundance estimate on the maximum number of individuals recorded across the nine annual surveys".

      Thank you for this suggestion. We have changed the sentence on Lines 377-379:

      “We based our abundance estimate on the maximum number of individuals recorded across the nine annual surveys.”

      (15) L 378: shouldn't the formula for CTIoccur be (equation in latex format):

      CTI_{occur, j, t} = \frac{\sum_{i=1}^{N_{j,t}} STI_{i}}{N_{j,t}}

      Where Nj,t is the total number of species surveyed in the community j in year t

      Thank you very much for this careful check. We have revised it on Lines 415 and 417:

      “where Nj,t is the total number of species surveyed in the community j in year t.”
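
      As a minimal sketch of the formula discussed in this point, the occurrence-based index is the unweighted mean STI of the species detected in community j in year t. The abundance-based version shown alongside it assumes the common abundance-weighted definition, which is not spelled out in this response; the species names, STI values and counts are hypothetical.

```python
from typing import Dict


def cti_occur(sti: Dict[str, float], abundance: Dict[str, int]) -> float:
    """Mean STI over species detected (count > 0) in one community-year."""
    detected = [sti[sp] for sp, n in abundance.items() if n > 0]
    return sum(detected) / len(detected)  # N_{j,t} = number of detected species


def cti_abun(sti: Dict[str, float], abundance: Dict[str, int]) -> float:
    """Abundance-weighted mean STI (assumed definition, for illustration)."""
    total = sum(abundance.values())
    return sum(sti[sp] * n for sp, n in abundance.items()) / total


# Hypothetical STI values (degrees C) and counts for one island in one year.
sti = {"sp_a": 18.0, "sp_b": 20.0, "sp_c": 25.0}
abundance = {"sp_a": 1, "sp_b": 1, "sp_c": 2}

print(cti_occur(sti, abundance), cti_abun(sti, abundance))
```

      In this toy community the warmest species is also the most abundant, so the abundance-weighted index exceeds the occurrence-based one.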

      Reviewer #2 (Recommendations For The Authors):

      (1) Line 76: "weakly"

      Done. Thank you for your correction.

      (2) Line 98: I suggest a change to this sentence: "For example, habitat fragmentation renders habitats to be too isolated to be colonized, causing sedentary butterflies to lag more behind climate warming in Britain than mobile ones"

      Thank you for this modification, we have changed it on Lines 99-101.

      (3) Line 101: remove either "higher" or "increasing"

      Done, we have removed “higher”. Thank you for this advice.

      (4) Line 102: "benefiting from near source of"

      Done.

      (5) Line 104: "emigrate"

      Done.

      (6) Introduction: I suggest making it more explicit what process you describe under the word "extinction". At first read, I thought you were only referring to the dieback of individuals, but you also included emigration as an extinction process. It also needs to be reworded in Fig 1 caption.

      Thank you for this suggestion. Yes, we cannot distinguish in our system between local extinction and emigration. The observed “extinction” of cold-adapted species over 10 years may involve two processes that usually occur in order: first “emigration” and then, if individuals can neither emigrate nor withstand the conditions, “real local dieback”. This should also be included in the legend of Figure 1, as you said. We have modified the legend on Lines 780-781:

      “Note that extinction here may include both the emigration of species and then the local extinction of species.”

      There is also one part in the Discussion that mentions this on Lines 287-291: “While we cannot truly distinguish in our system between local extinction and emigration, we suspect that given two islands equal except in isolation, and if both lose suitability due to climate change, individuals can easily emigrate from the island nearer to the mainland, while individuals on the more isolated island would be more likely to be trapped in place until the species went locally extinct due to a lack of rescue”.

      (7) I also suggest differentiating habitat fragmentation (distances between islands) and habitat amount (area) as explained in Fahrig 2013 (Rethinking patch size and isolation effects: the habitat amount hypothesis) and her later paper. This will help the reader understand what lies behind the general trend of fragmentation: fragmentation per se and habitat amount reduction.

      Thank you for this suggestion! Habitat fragmentation in this study involves both habitat loss and fragmentation per se. We now give a general definition of habitat fragmentation on Lines 61-63:

      “Habitat fragmentation, usually defined as the shifts of continuous habitat into spatially isolated and small patches (Fahrig, 2003), in particular, has been hypothesized to have interactive effects with climate change on community dynamics.”

      (8) Line 136: does the "+-" refer to the standard deviation or the confidence interval? I suggest being explicit about it once at the start of the results.

      Thank you for the reminder. The "+-" refers to the standard deviation (SD). The modified sentence is now on Lines 135-139:

      “The number of species detected in surveys on each island across the study period averaged 13.37 ± 6.26 (mean ± SD) species, ranging from 2 to 40 species, with an observed gamma diversity of 60 species. The STI of all 60 birds averaged 19.94 ± 3.58 ℃ (mean ± SD) and ranged from 9.30 ℃ (Cuculus canorus) to 27.20 ℃ (Prinia inornata), with a median of 20.63 ℃ (Appendix 1—figure 2; Appendix 1—figure 3).”

      (9) Line 143: please specify the unit of thermophilization.

      The unit of the thermophilization rate is the change in degrees per unit year; we write “unit year” because predictor variables were z-transformed in all analyses to make their effects comparable. We have added this on Line 151:

      “When measuring CTI trends for individual islands (expressed as °/ unit year)”

      (10) Line 289: check if no word is missing from the sentence.

      The sentence is: “In our study, a large proportion (11 out of 15) of warm-adapted species increasing in colonization rate and half (12 out of 23) of cold-adapted species increasing in extinction rate were changing more rapidly on smaller islands.”

      We have already defined, in both the Methods and the Results, the species included in testing the third prediction: 15 warm-adapted species that increased in colonization rate and 23 cold-adapted species that increased in extinction rate. We therefore removed this redundant information and rewrote the sentence as below on Lines 300-302:

      “In our study, the colonization rate of a large proportion of warm-adapted species (11 out of 15) and the extinction rate of half of cold-adapted species (12 out of 23) were increasing more rapidly on smaller islands.”

      (11) Line 319: I really miss a concluding statement of your discussion, your results are truly interesting and deserve to be summarized in two or three sentences, and maybe a perspective about how it can inform conservation practices in fragmented settings.

      Thank you for this thoughtful suggestion, made both in the Public Review and here. We have added a paragraph to the end of the Discussion, stating how our results can inform conservation, on Lines 339-347:

      “Overall, our findings have important implications for conservation practices. First, we confirmed the role of isolation in limiting range shifting. Better connected landscapes should be developed to remove dispersal constraints and facilitate species’ relocation to the best suitable microclimate. Second, small patches can foster the establishment of newly adapted warm-adapted species while large patches can act as refugia for cold-adapted species. Therefore, preserving patches of diverse sizes can act as stepping stones or shelters in a warming climate depending on the thermal affinity of species. These insights are an important supplement to the previous emphasis on the role of habitat diversity in fostering (Richard et al., 2021) or reducing (Gaüzère et al., 2017) community-level climate debt.”

      (12) Line 335: I suggest " ... the islands has been protected by forbidding logging, ..."

      Thanks for this wonderful suggestion. Done. The new sentence is now on Lines 365-366:

      “Since lake formation, the islands have been protected by forbidding logging, allowing natural succession pathways to occur.”

      (13) Line 345: this speed is unusually high for walking, check the speed.

      We apologize for the error; it should be 2.0 km/h. It has been corrected on Lines 375-376:

      “In each survey, observers walked along each transect at a constant speed (2.0 km/h) and recorded all the birds seen or heard on the survey islands.”

      (14) Line 351: you could add a sentence explaining why that choice of species exclusion was made. Was it made from the start of the monitoring program or did you exclude species afterward?

      We excluded them afterward. We excluded non-breeding species, nocturnal and crepuscular species, high-flying species passing over the islands (e.g., raptors, swallows) and strongly water-associated birds (e.g., cormorants). These species were recorded during monitoring; some were on the shore of an island or flying high above it, and some nocturnal species were spotted only by accident.

      We now describe in more detail how species were excluded on Lines 379-387:

      “We excluded non-breeding species, nocturnal and crepuscular species, high-flying species passing over the islands (e.g., raptors, swallows) and strongly water-associated birds (e.g., cormorants) from our record. First, our surveys were conducted during the day, so some nocturnal and crepuscular species, such as owls and nightjars, were excluded for inadequate survey design. Second, wagtails, kingfishers, and water birds such as ducks and herons were excluded because we were only interested in forest birds. Third, birds such as swallows and eagles, which were usually flying or soaring in the air rather than staying on islands, were also excluded because it was difficult to assign them to a particular island. Following these operations, 60 species were finally retained.”

      (15) Line 370: I suggest adding the range and median of STI.

      Thanks for this good suggestion. The range and mean ± SD of STI were already in the Results, and we have added the median of STI there as well. The new sentence is now in the Results on Lines 137-139:

      “The STI of all 60 birds averaged 19.94 ± 3.58 ℃ (mean ± SD) and ranged from 9.30 ℃ (Cuculus canorus) to 27.20 ℃ (Prinia inornata), with a median of 20.63 ℃ (Appendix 1—figure 2; Appendix 1—figure 3).”

      (16) Figure 4.b: Is it possible to be more explicit about what that trend is? the coefficient of the regression Logit(ext/col) ~ year + ...... ?

      Thank you for this advice. Your understanding is right: we can interpret it as the coefficient of the ‘year’ effect in the model. More specifically, the ‘year’ effect or temporal trend here is the ‘posterior mean’ of the posterior distribution of ‘year’ in the MSOM (Multi-species Occupancy Model), in the context of the Bayesian framework. We modified this sentence on Lines 811-813:

      “Each point in (b) represents the posterior mean estimate of year in colonization, extinction or occupancy rate for each species.”

      (17) Figure 6: is it possible to provide an easily understandable meaning of the prior presented in the Y axis? E.g. "2 corresponds to a 90% probability for a species to go extinct at T+1", if not, please specify that it is the logit of a probability.

      Thank you for this question both in Public Review and here. The value on the Y axis indicates the posterior mean of each variable (year, area, isolation and their interaction effects) extracted from the MSOM model, where the logit(extinction rate) or logit(colonization rate) was the response variable. All variables were standardized before analysis to make them comparable. So, positive values indicate positive influence while negative values indicate negative influence. Because the goal of Figure 6 is to display the negative/positive effect, we didn’t back-transform them. Following your advice, we thus modified the caption of Figure 6 (now renumbered as Figure 5, following a comment from Reviewer #3, to move Figure 5 to Figure 4c). The modified title and legends of Figure 5 are on Lines 817-820:

      “Figure 5. Posterior estimates of logit-scale parameters related to cold-adapted species’ extinction rates and warm-adapted species’ colonization rates. Points are species-specific posterior means on the logit-scale, where parameters >0 indicate positive effects (on extinction [a] or colonization [b]) and parameters <0 indicate negative effects.”
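
      To make the logit scale more tangible, the following minimal sketch shows the standard back-transformation. It assumes nothing about the fitted model: the function names, the baseline probability and the effect size are hypothetical. Note that a coefficient on the Y axis is an effect added to the linear predictor, so it maps to a probability only in combination with a baseline.

```python
import math


def inv_logit(x: float) -> float:
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    """Map a probability to the logit scale."""
    return math.log(p / (1.0 - p))


def shifted_probability(baseline_p: float, effect: float) -> float:
    """Probability after adding a logit-scale effect to a baseline probability."""
    return inv_logit(logit(baseline_p) + effect)


# A linear predictor of 2 on the logit scale corresponds to a probability of about 0.88.
print(round(inv_logit(2.0), 2))

# A hypothetical positive effect of 1.0 raises a hypothetical baseline of 0.2.
print(shifted_probability(0.2, 1.0) > 0.2)
```

      Because the same coefficient shifts different baselines by different amounts on the probability scale, reporting signs and magnitudes on the logit scale keeps species comparable.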

      (18) Line 773: points in blue only are significant? I suggest "points in color".

      Thank you for your reminder. Points in blue and orange are all significant. We have revised the sentence on Line 823:

      “Points in blue/orange indicate significant effects.”

      These are all small suggestions that may help you improve the readability of the final manuscript. I warmly thank you for the opportunity to review this impressive study.

      We appreciate your careful review and thoughtful suggestions. We believe these modifications will improve the final manuscript.

      Reviewer #3 (Recommendations For The Authors):

      I have a few minor suggestions for paper revision for your otherwise excellent manuscript. I wish to emphasize that it was a pleasure to read the manuscript and that I especially enjoyed a very nice flow throughout the ms from a nicely rounded introduction that led well into the research questions and hypotheses all the way to a good and solid discussion.

      Thank you very much for your review and recognition. We have carefully checked all recommendations and addressed them in the manuscript.

      (1) L 63: space before the bracket missing and I suggest moving the reference to the end of the sentence (directly after habitat fragmentation does not seem to make sense).

      Thank you very much for this suggestion. The missing space has been added, and the reference has been moved to the end of the sentence. We also added a general definition of habitat fragmentation. The new sentence is on Lines 61-64:

      “Habitat fragmentation, usually defined as the shifts of continuous habitat into spatially isolated and small patches (Fahrig, 2003), in particular, has been hypothesized to have interactive effects with climate change on community dynamics.”

      (2) L 102: I suggest to write "benefitting ..." instead.

      Done.

      (3) L 103: higher extinction rates (add "s").

      Done.

      (4) L 104: this should probably say "emigrate" and "climate warming".

      Done.

      (5) L 130-133: this is true for emigration (more isolated islands show slower emigration). But what about increased local extinction, especially for small and isolated islands? Especially since you mentioned later in the manuscript that often emigration and extinction are difficult to identify or differentiate. Might be worth a thought here or somewhere in the discussion?

      Thank you for this good question. We would like to answer it in two parts:

      Yes, we cannot distinguish between true local extinction and emigration. The observed local “extinction” of cold-adapted species over 10 years may involve two processes that usually occur in order: first “emigration” and then, if individuals can neither emigrate nor withstand the conditions, “real local dieback”. Over 10 years, cold-adapted species on remote islands would have to tolerate the conditions before going truly extinct because of dispersal limitation, while on less isolated islands the same species could easily emigrate and find more suitable habitat. Consequently, it is harder for us to observe “extinction” of species on more isolated islands, while it is easier to observe apparent “extinction” on less isolated islands due to emigration. As a result, the observed extinction rate is expected to increase more sharply for species on less remote islands and relatively moderately for the same species on remote islands.

      We have modified the legend of Figure 1 on Lines 780-781:

      “Note that extinction here may include both the emigration of species and then the local extinction of species.”

      There is also one part in the Discussion that mentions this on Lines 287-291: “While we cannot truly distinguish in our system between local extinction and emigration, we suspect that given two islands equal except in isolation, if both lose suitability due to climate change, individuals can easily emigrate from the island nearer to the mainland, while individuals on the more isolated island would be more likely to be trapped in place until the species went locally extinct due to a lack of rescue”.

      In addition, regarding your question “But what about increased local extinction, especially for small and isolated islands?”, we think you are referring to the “high extinction rate per se on remote islands”. We want to test the “trend” of extinction rate on a temporal scale, rather than the extinction rate per se on a spatial scale. Even if species have a high extinction rate on remote islands, that rate can still change more slowly over time.

      We hope these answers address your concern.

      (6) L 245: I think this is the first time the acronym appears in the ms (as the methods come after the discussion), so please write the full name here too.

      Thank you for pointing this out. We realized that “Thousand Island Lake” appears for the first time in the last paragraph of the Introduction, so we added “TIL” there on Lines 108-109:

      “Here, we use 10 years of bird community data in a subtropical land-bridge island system (Thousand Island Lake, TIL, China, Figure 2) during a period of consistent climatic warming.”

      (7) L 319: this section could end with a summary statement on idiosyncratic responses (i.e. some variation in the responses you found among the species) and the potential reasons for this, such as e.g. the role of other species traits or interactions, as well as other ways to measure habitat fragmentation (see main comments in public review).

      Thank you for this suggestion both in Public Review and here. We added a summary statement about the reasons for idiosyncratic responses on Lines 334-338:

      “Overall, these idiosyncratic responses reveal several possible mechanisms in regulating species' climate responses, including resource demands and biological interactions like competition and predation. Future studies are needed to take these factors into account to understand the complex mechanisms by which habitat loss mediates species range shifts.”

      We emphasize only “habitat loss” here because idiosyncratic responses mainly come from the mediating effect of habitat loss. For the mediating effect of isolation, the response is relatively consistent (see Page 8, Lines 183-188): “In particular, the effect of isolation on temporal dynamics of thermophilization was relatively consistent across cold- and warm-adapted species (Figure 5a, b); specifically, on islands nearer to the mainland, warm-adapted species (15 out of 15 investigated species) increased their colonization probability at a higher rate over time, while most cold-adapted species (21 out of 23 species) increased their extinction probability at a higher rate”.

      (8) L 333: what about the distance to other islands? it's more of a network than a island-mainland directional system (Figure 2). You could address this aspect in the discussion.

      Thank you for this good question again. Isolation can be measured in different ways in the study region. We chose distance to the mainland because it was the best predictor of the colonization and extinction rates of breeding birds in the study region, and it produced results similar to those of the other distance-based measures, including distance to the nearest landmass and distance to the nearest larger landmass (Si et al., 2014). We agree with you that it is necessary to consider more aspects of “isolation”, at least in the Discussion, for future research. In the Discussion, we addressed these on Lines 292-299. For more details, refer to the response to the Public Review.

      (9) Figure 2: Is B1 one of the sampled islands? It is clearly much larger than most other islands and I think it could thus serve as an important population source for many of the adjacent smaller islands? Thus, the nearest neighbor distance to B1 could be as important in addition to the distance to the mainland?

      Yes, B1 is one of the sampled islands and is also the biggest island. In previous research in our study system, we tried distance to the nearest landmass, to the nearest larger landmass, and to the nearest mainland; they produced similar results (for more details, refer to the response to the Public Review). We agree with you that the nearest-neighbor distance to B1 could be a potentially important measure, but this needs further research. In our Discussion, we address these on Lines 292-299:

      “As a caveat, we only consider the distance to the nearest mainland as a measure of fragmentation, consistent with previous work in this system (Si et al., 2014), but we acknowledge that other distance-based metrics of isolation that incorporate inter-island connections could reveal additional insights on fragmentation effects. The spatial arrangement of islands, like the arrangement of habitat, can influence niche tracking of species (Fourcade et al., 2021). Future studies should take these metrics into account to thoroughly understand the influence of isolation and spatial arrangement of patches in mediating the effect of climate warming on species.”

      (10) L 345: 20km/h walking seems impressively fast? I assume this is a typo.

      We apologize for the error; it should be 2.0 km/h. It has been corrected on Lines 375-376:

      “In each survey, observers walked along each transect at a constant speed (2.0 km/h) and recorded all the birds seen or heard on the survey islands.”

      (11) L 485: I had difficulties fully understanding the models that were fitted here and could not find them in the codes you provided (which were otherwise very well documented!). Could you explain this modeling step in a bit more detail?

      Thank you for your recognition! According to Line 485 in the online PDF version (Methods part 4.6.3), it says: “An increasing colonization trend of warm-adapted species and increasing extinction trend of cold-adapted species are two main expected processes that cause thermophilization (Fourcade et al., 2021). To test our third prediction about the mediating effect of habitat fragmentation, we selected warm-adapted species that had an increasing trend in colonization rate (positive year effect in colonization rate) and cold-adapted species that had an increasing extinction rate (positive year effect in extinction rate)…..”

      We carefully checked the code at the Figshare link and found that the MSOM JAGS code had not been uploaded. We apologize for that. It can now be found in the document [MOSM.R] at https://figshare.com/s/7a16974114262d280ef7. We hope the code, together with the description of the modeling process in section 4.5 of the Methods, helps in understanding the whole modeling process. In addition, we would like to explain how we determined the temporal trend in colonization or extinction of each species, in relation to Line 485. Take the model of species-specific extinction rate as an example:

      In this model, “Island” was a random effect and “Year” was added as a random slope, thus allowing the “year effect” (that is, the temporal trend) of the extinction rate of a species to vary with “island”. Further, the interaction between year and the island variables (isolation, area) was added to test whether the “year effect” was related to island area or isolation.
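
      As a hedged sketch only, one logit-linear form consistent with this verbal description could be written as below. All symbols are illustrative assumptions rather than the notation of the manuscript or of the JAGS code: \epsilon_{i,j,t} is the extinction probability of species i on island j in year t, u_{i,j} and v_{i,j} are the island-level random intercept and random year slope, and the predictors are z-transformed.

```latex
\operatorname{logit}(\epsilon_{i,j,t}) =
  \alpha_{i} + u_{i,j}
  + \left(\beta^{\mathrm{year}}_{i} + v_{i,j}\right)\mathrm{Year}_{t}
  + \beta^{\mathrm{area}}_{i}\,\mathrm{Area}_{j}
  + \beta^{\mathrm{iso}}_{i}\,\mathrm{Iso}_{j}
  + \beta^{\mathrm{year\times area}}_{i}\,\mathrm{Year}_{t}\,\mathrm{Area}_{j}
  + \beta^{\mathrm{year\times iso}}_{i}\,\mathrm{Year}_{t}\,\mathrm{Iso}_{j}
```

      In this form, a positive \beta^{\mathrm{year}}_{i} corresponds to an increasing extinction trend for species i, and the two interaction coefficients describe how that trend changes with island area and isolation.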

      Because we are only interested in warm-adapted species that have a positive temporal trend in colonization and cold-adapted species that have a positive temporal trend in extinction, which are the two main processes underlying thermophilization, we chose warm-adapted species that have a positive year effect in colonization and cold-adapted species that have a positive year effect in extinction.

      We hope this explanation and the JAGS code make this part clearer.

      (12) Figure 1: to me, it would be more intuitive to put the landscape configuration in the titles of the panels b, c, and d instead of "only" the mechanisms. E.g. they could be: a) fragmented islands with low climate buffering; b) small islands with low habitat heterogeneity; c) isolated islands with dispersal limitations?

      It is also slightly confusing that the bird communities are above "island" in the middle of the three fragmented habitats - which all look a bit different in terms of tree species and structure which makes the reader first think that it has something to do with the "new" species community. so maybe worth rethinking how to illustrate the three fragmented islands?

      Thank you for this helpful suggestion. First, it is a good idea to put the landscape configuration in the title of the panels b, c, d. The new title (a) is “Fragmented islands with low climate buffering”, title (b) is “Small islands with low habitat heterogeneity”, and title (c) is “Isolated patches with dispersal limitations”.

      Second, we realized that putting the “bird community” above “island” in the middle of the three patches is a bit confusing. Actually, we wanted to show bird communities only on that one island in the middle. The other two patches are only there to represent a fragmented background. To avoid misunderstanding, we added a sentence in the legend of Figure 1 on Lines 778-780:

      “The three distinct patches signify a fragmented background and the community in the middle of the three patches was selected to exhibit colonization-extinction dynamics in fragmented habitats.”

      (13) Figure 4: please add the description of the color code for panel a.

      Sorry for the unclear description. The vertical dashed line indicates the median value of STI for 60 species, as a separation of warm-adapted species and cold-adapted species. We have added these details on Lines 807-809:

      “The dotted vertical line indicates the median of STI values. Cold-adapted species are plotted in blue and warm-adapted species are plotted in orange.”

      (14) Figure 5: You could consider adding this as panel c to Figure 4 as it depicts the same thing as in 4a but for CTI-abundance.

      Thank you for this advice. We have moved the original Figure 5 to Figure 4c. The previous Figure 6 has thus become Figure 5. All corresponding citations in the main text were checked and updated to match the new numbering. The new figure is now on Lines 801-815.

      References

      Ferraz, G., Russell, G. J., Stouffer, P. C., Bierregaard Jr, R. O., Pimm, S. L., & Lovejoy, T. E. (2003). Rates of species loss from Amazonian forest fragments. Proceedings of the National Academy of Sciences, 100(24), 14069-14073. doi:10.1073/pnas.2336195100

      Fourcade, Y., WallisDeVries, M. F., Kuussaari, M., van Swaay, C. A., Heliölä, J., & Öckinger, E. (2021). Habitat amount and distribution modify community dynamics under climate change. Ecology Letters, 24(5), 950-957. doi:10.1111/ele.13691

      Gaüzère, P., Princé, K., & Devictor, V. (2017). Where do they go? The effects of topography and habitat diversity on reducing climatic debt in birds. Global Change Biology, 23(6), 2218-2229. doi:10.1111/gcb.13500

      Gonzalez, A. (2000). Community relaxation in fragmented landscapes: the relation between species richness, area and age. Ecology Letters, 3(5), 441-448. doi:10.1046/j.1461-0248.2000.00171.x

      Haddad, N. M., Brudvig, L. A., Clobert, J., Davies, K. F., Gonzalez, A., Holt, R. D., . . . Collins, C. D. (2015). Habitat fragmentation and its lasting impact on Earth’s ecosystems. Science Advances, 1(2), e1500052. doi:10.1126/sciadv.1500052

      Richard, B., Dupouey, J. L., Corcket, E., Alard, D., Archaux, F., Aubert, M., . . . Macé, S. (2021). The climatic debt is growing in the understorey of temperate forests: Stand characteristics matter. Global Ecology and Biogeography, 30(7), 1474-1487. doi:10.1111/geb.13312

      Si, X., Pimm, S. L., Russell, G. J., & Ding, P. (2014). Turnover of breeding bird communities on islands in an inundated lake. Journal of Biogeography, 41(12), 2283-2292. doi:10.1111/jbi.12379

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations For The Authors): 

      Figures 1 and 2. How do the authors know that the lysine mutations are specific to constitutive activity and not because it is causing the channel to be now voltage sensitive? 

      As shown in the revised Figs. 1b, S2a, and 3b, TMEM16F I521K/M522K, TMEM16F I521E, and TMEM16A I546K/I547K spontaneously expose PS, respectively. Neither membrane depolarization nor calcium stimulation was introduced under these conditions, and the cells were grown in calcium-free media after transfection to limit calcium-dependent activation. Our new experiments further demonstrate that TMEM16F T526K (Fig. 1b) and TMEM16A E551K (Fig. 3b), which are further away from the activation gate, either show strongly attenuated spontaneous lipid scrambling activity or lack it altogether. According to these results, the gain-of-function mutants (TMEM16F I521K/M522K/I521E and TMEM16A I546K/I547K) are indeed constitutively active. This constitutive scramblase activity is not due to a gain of voltage sensitivity, as ion channel activity is also minimal around the resting membrane potential of a HEK cell (Fig. 1d, e and Fig. 3d, e).

      The authors see very large currents of 5-10 nA in their electrophysiology experiments in Figures 2D and 3D. I understand that Figure 2D are whole-cell recordings, but are the authors confident that the currents that they are recording from the mutants are indeed specific to TMEM16A? More importantly, in Figure 3D they see 3-5 nA currents in inside-out patches, which is huge. They have no added divalent in their bath solution, which could lead to larger single-channel amplitudes, but 3-5 nA seems excessive. Some control to demonstrate that these are indeed OSCA1.2 currents is important.

      TMEM16A and TMEM16F are well-known for their high cell surface expression. Therefore, the current amplitude is usually huge even in excised inside-out or outside-out patches—please see our previous publications for details: 1) 10.1016/j.cell.2012.07.036, 2) 10.7554/eLife.02772, 3) 10.1038/s41467-019-11784-8, 4) 10.1038/s41467-019-09778-7, 5) 10.1016/j.celrep.2020.108570, 6) 10.1085/jgp.202012704, and 7) 10.1085/jgp.202313460. 

      HEK293 cells do not have endogenous TMEM16A (https://doi.org/10.1038/nature07313, 10.1016/j.cell.2008.09.003 , DOI: 10.1126/science.1163518). It therefore serves as a widely used cell line for studying TMEM16A biophysics. As overexpressing the WT control barely elicited any obvious current in 0 Ca2+ (Fig. 3d), there is no doubt that the large outward-rectifying current (hallmark of CaCC) in the revised Fig. 3d (previous Fig. 2D) was elicited from the mutant TMEM16A channels. The strong outward rectification also rules out the possibility of this being leak current.

      Regarding Fig. 4d (previous Fig. 3D), OSCA1.2 has excellent surface expression as shown in Fig. 4b. OSCA1.2 also has much higher single channel conductance (121.8 ± 3.4 pS, 10.7554/eLife.41844) than TMEM16A (~3-8 pS) and TMEM16F (<1 pS). Therefore, recording nA OSCA1.2 current from excised patches is normal, given that OSCA1.2 current is larger at depolarized voltages than at hyperpolarized voltages (please see our explanation in the next response). As the reviewer pointed out, the lack of divalent ions in our experimental conditions may also partially contribute to the large conductance. To further verify, we conducted mock transfection recordings (please see Author response image 1 below). WT- but not mock (GFP)-transfected cells gave rise to large current, further supporting that the recorded current was indeed through OSCA1.2.

      Author response image 1.

      Representative inside-out currents for mock (GFP)- and OSCA1.2 WT-transfected cells. OSCA1.2 is responsible for nA currents elicited by the pressure and voltage protocols shown.
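
      As a rough plausibility check on the nA-scale patch currents discussed above, the sketch below estimates how many simultaneously open channels such a current would imply. It assumes an ohmic single-channel current, i = γ·(V − E_rev). Only the 121.8 pS conductance comes from the cited work; the +100 mV test voltage, the 0 mV reversal potential, the 3 nA target, and all function names are illustrative assumptions, not values or methods from the manuscript.

```python
def single_channel_current_pA(conductance_pS, voltage_mV, reversal_mV=0.0):
    """Ohmic single-channel current in pA: i = gamma * (V - E_rev)."""
    # pS * mV = 1e-12 S * 1e-3 V = 1e-15 A = 1e-3 pA
    return conductance_pS * (voltage_mV - reversal_mV) * 1e-3


def open_channels_needed(total_current_nA, conductance_pS, voltage_mV, reversal_mV=0.0):
    """Number of simultaneously open channels needed to carry total_current_nA."""
    i_pA = single_channel_current_pA(conductance_pS, voltage_mV, reversal_mV)
    return total_current_nA * 1e3 / i_pA


OSCA12_CONDUCTANCE_PS = 121.8  # single-channel conductance quoted in the response
ASSUMED_VOLTAGE_MV = 100.0     # assumption: illustrative depolarized test voltage
ASSUMED_CURRENT_NA = 3.0       # assumption: illustrative patch current

if __name__ == "__main__":
    i = single_channel_current_pA(OSCA12_CONDUCTANCE_PS, ASSUMED_VOLTAGE_MV)
    n = open_channels_needed(ASSUMED_CURRENT_NA, OSCA12_CONDUCTANCE_PS, ASSUMED_VOLTAGE_MV)
    print(f"single-channel current: {i:.2f} pA; open channels needed: {n:.0f}")
```

      Under these assumptions, a few hundred open channels would account for a 3 nA patch current, whereas a channel with a conductance of about 8 pS would need roughly 15 times as many.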

      Figure 3D and 5D. Most of the traces and current quantification is done at positive potentials and is outward current. Do the authors observe inward currents? It is difficult to judge by the figures since currents are so large. OSCA/TMEM63s are cationic channels and all published data on these channels have demonstrated robust inward currents at negative, physiologically relevant potentials. The lack of inward currents but only large outward currents suggests that these mutations could be doing something else to the channel. 

      Yes. We indeed observe inward current at negative holding potentials under pressure clamp (Author response image 2). However, mechanosensitive OSCA and TMEM63A channels are also voltage dependent. Their outward current is an order of magnitude larger at depolarized voltages (e.g., Author response image 2, also 10.7554/eLife.41844, see Fig. 1H). 

      Author response image 2.

      Voltage-dependent rectification of OSCA1.2 current. a. Representative OSCA1.2 trace (bottom) elicited by a voltage-ramp under -50 mmHg (top). b. The difference in inward and outward current amplitudes. 

      We found that quantifying the OSCA1.2 outward current has advantages over the inward current. Usually, using the gold standard pressure clamp protocol at negative holding voltages, peak inward current amplitude is quantified. However, OSCA inward current quickly inactivates (10.7554/eLife.41844, see Fig. 1C). This makes robust quantification and comparison with mutant channels difficult. Holding the membrane at a constant pressure and measuring OSCA1.2 G-V overcomes these issues associated with the classical inward current measurements. The large depolarization-driven outward current does not inactivate, and robust tail current (Response Fig. 1, 2) allows us to construct G-V relationships. We found quantifying mutants’ voltage dependence at constant pressure is more consistent than quantifying pressure dependence at constant voltage. These advantages make our new protocol preferable to the commonly used gold standard pressure clamp protocol for characterizing and comparing the gating mutations identified in this manuscript. 
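
      To illustrate how a G-V relationship built from tail currents can be summarized by a half-activation voltage (V50), here is a minimal sketch. It assumes a two-state Boltzmann function, a common choice that this response does not confirm as the authors' fitting method, and it estimates V50 and the slope factor k by linearizing the Boltzmann equation (logit transform) rather than by nonlinear fitting. The data are synthetic, and the function names are hypothetical.

```python
import math


def boltzmann(v_mV, v50_mV, slope_mV):
    """Normalized conductance G/Gmax for a two-state Boltzmann model."""
    return 1.0 / (1.0 + math.exp(-(v_mV - v50_mV) / slope_mV))


def fit_gv(voltages_mV, g_norm, eps=1e-6):
    """Estimate (V50, k) from normalized conductance values.

    Uses logit(G) = (V - V50) / k, which is linear in V, and solves the
    ordinary least-squares line. Points at or beyond 0 and 1 are skipped
    because the logit is undefined there.
    """
    xs, ys = [], []
    for v, g in zip(voltages_mV, g_norm):
        if eps < g < 1.0 - eps:
            xs.append(float(v))
            ys.append(math.log(g / (1.0 - g)))
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points with 0 < G < 1")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # equals 1 / k
    intercept = mean_y - slope * mean_x  # equals -V50 / k
    return -intercept / slope, 1.0 / slope


if __name__ == "__main__":
    # Synthetic, noise-free G-V curve (assumed V50 = 120 mV, k = 25 mV).
    voltages = list(range(-40, 201, 20))
    g = [boltzmann(v, 120.0, 25.0) for v in voltages]
    v50, k = fit_gv(voltages, g)
    print(f"V50 = {v50:.1f} mV, k = {k:.1f} mV")
```

      With real recordings, a nonlinear least-squares fit is usually preferable because the logit transform amplifies noise near G = 0 and G = 1; the linearized form is used here only to keep the example dependency-free.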

      Figure 3 and 5. Why are mechanically activated currents being recorded at random pressure stimuli (-50 mmHg for OSCA) and (-80 mmHg for Tmem63a)? The gold standard in the field is to run an entire pressure response curve. Given that only outward currents are observed at membrane potentials +120mV and above at 0mmHg, this questions whether they are indeed constitutively active. 

      As we explained in the previous response, both voltage and membrane stretch activate OSCA/TMEM63A channels. We found measuring voltage dependence under constant pressure provided more consistent quantification than the gold standard pressure response protocol. This may be due to the variability of applied membrane tension under repeated stretches versus the more consistent applied voltage. Additionally, we chose -50 mmHg and -80 mmHg to reflect the reported differences in half-maximal pressures between OSCA1.2 and TMEM63A (e.g., P50 ~55 mmHg for 1.2 and ~61 mmHg for 63A in 10.7554/eLife.41844 versus ~86 mmHg for 1.2 and -123 mmHg for 63A in 10.1016/j.neuron.2023.07.006).

      We also used higher pressure in cell-attached mode to increase TMEM63A current amplitudes, which are usually tiny. We have updated our method section (Lines 329-334) to further clarify why we used these protocols.

      Please note that in TMEM16 proteins, ions and lipids might not always co-transport. This means that under certain conditions, only one type of substrate may go through. For instance, in WT TMEM16F, Ca2+ stimulation can easily trigger PS exposure at resting membrane potential. No ionic currents are elicited until strong depolarization is applied. Similarly, the TMEM16F GOF mutations spontaneously transport lipids, leading to loss of lipid asymmetry (Fig. 1b, c). However, in 0 Ca2+, these TMEM16F mutant channels still need strong depolarization for ion conduction (Fig. 1d, e). Although the detailed mechanism still needs to be further investigated, the OSCA1.2 and TMEM63A GOF mutations share similar features with TMEM16 proteins, exhibiting ion conduction under high pressures and depolarizing voltages, yet constitutively active scrambling.

      Some clarity is needed for their choice of residues. I understand that a lot of this is also informed by the structures of these ion channels. According to the alignment shown in Supplementary Figure 1, they chose LA for OSCA1.2, which is in line with the IM (TMEM16F) and II(TMEM16A) residues but for Tmem63a they chose the hydrophobic gate residue W and S. Was the A476 tested? Also, OSCA1.2 already has a K in the hydrophobic gating residue region. How do the authors reconcile this with their model? 

      We appreciate this critical comment. We have included the characterization of TMEM63A A476K (Fig. 6, corresponding to M522 in 16F, I547 in 16A, and A439 in OSCA1.2). Interestingly, A476K transfected cells did not show obvious spontaneous PS exposure yet exhibited a modest shift in V50 comparable to W472K and S475K. These differences may reflect the high-tension activated nature of the TMEM63 proteins (10.1016/j.neuron.2023.07.006) as compared to OSCA1.2, where the corresponding mutation (A439K, Fig. 4b, c) showed very little spontaneous activity and required hypotonic stimulation to promote more robust PS exposure (Fig. 5). 

      Furthermore, as we showed in Figs. 1b-c and 3b-c, there is a lower limit (towards the C-terminus) of the TM 4 lysine mutation effect, beyond which the mutation becomes insufficient to cause a constitutively open pore for spontaneous lipid scrambling. It is possible that TMEM63A A476K represents the lower limit of TM 4 mutations that can convert TMEM63A into a spontaneous lipid scramblase.

      Regarding OSCA1.2 K435 and TMEM63A W472, these sites correspond to the hydrophobic gate residues on TM 4 in TMEM16F (F518, Fig. 1a) and TMEM16A (L543, Fig. 3a), so it is unsurprising to us that a lysine mutation at this site causes constitutive scramblase activity in TMEM63A (Fig. 6b, c). For OSCA1.2, it is more intriguing since this residue is already a lysine (K435). In Supplementary Fig. 5, our new experiments show that neutralizing K435 with leucine (K435L) in the background of L438K significantly attenuates spontaneous PS exposure from ~63% PS positive for L438K alone (two lysine residues) to ~31% for K435L/L438K (one lysine). On the other hand, the K435L mutation by itself is also insufficient to induce PS exposure. Therefore, the endogenous lysine at residue 435 has an additive effect on the spontaneous scramblase activity of L438K. We believe the explanation for this result lies in experiments conducted in model transmembrane helices, which have shown that stacking hydrophilic side chains within the membrane interior promotes trans-bilayer lipid flipping (see 10.1248/cpb.c22-00133).

      These same studies also support our observation (10.1038/s41467-019-09778-7) that highly hydrophilic side chains (such as lysine or glutamic acid) accelerate trans-bilayer lipid flipping more effectively than hydrophobic side chains such as isoleucine or alanine (Author response image 3, see also 10.1021/acs.jpcb.8b00298).

      Author response image 3.

      Trans-bilayer lipid flipping rates (kflip) accelerate with increasing side chain hydropathy for a residue placed in the center of a model transmembrane helical peptide

      How do the authors know that osmotic shock is indeed activating OSCA1.2 and TMEM63A? If they can record from the channels then electrophysiology data that confirms activation of the channel in the presence of hypoosmotic shock will strengthen the osmolarity active scramblase activity demonstrated in Figure 4. So far, there is conclusive data showing that they are mechanically activated but conclusive electrophysiological data for OSCA/TMEM63 osmolarity activation is not described yet, including the reference (38) they indicate in line 132. Although osmotic shock can perturb mechanical properties of the membrane it can also activate volume-regulated anion channels, which are also present in HEK cells. 

      Thank you for raising this important question. While reference 38 (now reference 39) shows direct electrophysiological evidence of hypertonicity-induced current (e.g., Fig. 4 f, g, i, and j in 10.1038/nature13593), direct electrophysiological evidence that OSCA/TMEM63 can be activated by hypotonic stimulation is still missing. To address this question, we conducted whole-cell patch clamp experiments on mock-transfected and OSCA1.2 WT-transfected cells stimulated with 120 mOsm/kg hypotonic solution, matching the conditions used for the hypotonic-induced scrambling shown in Fig. 5. As shown in Supplementary Fig. 6, our whole-cell recording detected a slowly evolving yet robust outward rectifying current in OSCA1.2-transfected cells, which was not observed in mock-transfected cells.

      To avoid the contamination from endogenous SWELL osmo-/volume-regulated chloride channels, our new experiment used 140 mM Na gluconate to replace NaCl in both the pipette and the bath solution. Because SWELL/VRAC channels are minimally permeable to gluconate anions (e.g., 10.1007/BF00374290), we conclude that hypotonic stimulation can indeed activate OSCA1.2 albeit with perhaps lower efficiency compared to mechanical stimulation.  

      Minor comments 

      What is the timeline for the scramblase assay for all the experiments (except Figure 4)? How long is the AnnexinV incubated before imaging? 

      Thank you for pointing out that we had not provided sufficient detail here. Cells in the scramblase assay (including in Fig. 4, now revised Fig. 5) were imaged in AnnexinV-containing buffer immediately, without a formal incubation period, because AnnexinV binding to exposed PS proceeds rapidly. We have included additional detail in the methods section to eliminate any confusion (Lines 310-312).

      In some places of the document, it says OSCA/TMEM63, and in other places, it is denoted as TMEM63/OSCA. The literature so far has always called the family OSCA/TMEM63- please stay consistent with the field. 

      Thank you for pointing this out. We have corrected these instances to be consistent with the field.

      Reviewer #2 (Recommendations For The Authors): 

      (1) The authors' statement that the channel/scramblase family members have a relatively low "energetic barrier for scramblase" activity needs further support. While mutating the hydrophobic channel gate certainly could destabilize ion conduction to cause a GOF effect on channel activity, it is still not clear why scramblase activity, which is tantamount to altered permeation, happens in the mutant channels. Are permeation and channel gating (opening) coupled in these channels? If so, what is the basis for the coupling? Is scramblase activity only observed when the gating is destabilized or are they separable? 

      We appreciate these great questions. For the question about the ‘energetic barrier’ statement, please see our response to point (3) where we have carried out MD simulations of the OSCA1.2 WT and L438K mutant to provide insight into how the permeation pathway is altered by these mutations. 

      Regarding why TMEM16A can be converted into a scramblase, we use the extensively studied TMEM16 proteins as examples to improve our current understanding of OSCA/TMEM63 proteins. For further details please see our original paper (10.1038/s41467-019-09778-7) and our review (10.3389/fphys.2021.787773), which are summarized as follows: 

      (1) The “neck region”, consisting of the exofacial halves of TMs 3-6, forms the pore-gate region for both ion and lipid permeation (Author response image 4B). In the closed state, the neck region is constricted and TMs 4 and 6 interact with each other, preventing substrate permeation. The hydrophobic inner activation gate that we identified (10.1038/s41467-019-09778-7) resides right underneath the inner mouth of the neck region, controlling both ion and lipid permeation.

      (2) Based on our functional observations and the available scramblase structures of TMEM16 proteins in multiple conformations, we proposed a clamshell-like gating model to describe TMEM16 lipid scrambling (Author response image 4D). According to this model, Ca2+-induced conformational changes weaken the TM 4/6 interface. This promotes the separation of the two transmembrane segments, analogous to the opening of a clam shell, allowing a membrane-spanning groove to facilitate permeation of the lipid headgroup.

      (3) For the CaCC, TMEM16A, Ca2+ binding dilates the pore. However, the binding energy likely cannot open the TM 4/6 interface at the neck region so, in the absence of groove formation, only Cl- ions but not lipids can permeate (pore-dilation model, Author response image 4C).

      (4) Introducing charged residues near the inner activation gate disrupts the neck region, potentially by weakening the hydrophobic interactions between TMs 4 and 6. This mutational effect results in constitutively active TMEM16F scramblases and enables spontaneous lipid permeation in the TMEM16A CaCC. 

      (5) In our revision, we tested additional mutations with different side chain properties (Supplementary Fig. 2), validating previous findings by us (10.1038/s41467-019-09778-7) and others (10.1038/s41467-022-34497-x) that gate disruption increases with the side chain hydropathy of the mutation.

      (6) We further extended lysine mutations to two helical turns below the inner activation gate on TM 4 and identified a lower limit for mutation-induced spontaneous scramblase activity in TMEM16F and TMEM16A (Figs. 1b, c and 3b, c, respectively). Together, all these points lend additional support to our proposed gating models for TMEM16 proteins, which we postulate may also relate to the OSCA/TMEM63 family based on the evidence provided in our manuscript.

      Author response image 4.

      Model of gating (and regulatory) mechanisms in the TMEM16 family. (B) overall architecture and proposed modules, (C) pore-dilation gating model for CaCCs, (D) Clamshell gating model for CaPLSases.

      Regarding the relationship between ion and lipid permeation through TMEM16 scramblases, the following is the summary of our current understanding: 

      (1) Functionally, ion and lipid permeation are not obligatorily coupled. This is evidenced by our previous biophysical characterizations of TMEM16F ion channel and lipid scramblase activities. Ca2+ can trigger TMEM16F lipid scrambling at resting membrane potentials; however, Ca2+ alone is insufficient to elicit TMEM16F current. Strong membrane depolarization acting synergistically with elevated intracellular Ca2+ is required to activate ion permeation. Based on these observations, we postulate that ions and lipids may have different extracellular gates, despite sharing an inner activation gate (10.1038/s41467-019-09778-7). Ca2+ alone may sufficiently open the inner gate (and extracellular gate) for lipids, whereas depolarization is likely required to open the extracellular gate and allow ion flux. Further structure-function studies are needed to test this hypothesis.

      (2) Structurally, the open conformations of TMEM16 scramblases such as the fungal orthologs and human TMEM16K (Supplementary Fig. 1 b-d) are wide open, which allows lipid and ion co-transport. Ion and lipid co-transport has also been demonstrated in various MD simulations (e.g., 10.7554/eLife.28671, 10.3389/fmolb.2022.903972, and 10.1038/s41467-021-22724-w).

      (3) Functionally, we (10.1085/jgp.202012704) and others (10.7554/eLife.06901.001) have performed dual recordings of channel and scramblase activities, also demonstrating that ions and lipids are co-transported simultaneously when the proteins are fully activated.

      (4) In this manuscript, we also provide multiple examples (TMEM16F in Fig. 1, TMEM16A in Fig. 3, OSCA1.2 in Fig. 4, and TMEM63A in Fig. 6) of mutations showing spontaneous phospholipid scramblase activities, yet their channel activities require strong depolarization or, in the case of TMEM63A, high pressures to be elicited.

      Together, this new evidence further supports our hypothesis that there might be multiple gates for ion and lipid permeation, in addition to the shared inner gate we previously identified. We hope these detailed explanations help convey the intricacy of these intriguing questions. Of course, future studies are needed to test our hypothesis and elucidate the complex relationship between ion and lipid permeation of these proteins. 

      (2) One weakness in the experimental approach is the very limited number of substitutions used to infer the conclusion regarding the energetic barrier and other conclusions relating to scramblase activity. Additional substitutions of charged and polar amino acids at the hydrophobic gate would be helpful in illuminating the molecular determinants of the GOF phenotype and also reveal varying patterns of lipid permeation which could be enormously informative. These additional mutations for analysis of TMEM16F and OSCA should be added to the study. 

      We appreciate these great suggestions which were shared by multiple reviewers. We have included our duplicated response below.

      “Response to reviewers 2 & 3: In our 2019 paper (10.1038/s41467-019-09778-7), we systematically tested the effect of side chain properties at the inner activation gate of TMEM16F on lipid scrambling activity (Response Fig. 6) and, since then, these results have been supplemented by others as well (10.1038/s41467-022-34497-x). In summary, mutating the inner activation gate residues to polar or charged residues generally results in constitutively activated scramblases without requiring Ca2+ (Fig 5a in 10.1038/s41467-019-09778-7). Because these residues form a hydrophobic gate, introducing smaller side chains via alanine substitution is also gain-of-function, with the Y563A mutant as well as the F518A/Y563A/I612A variant being constitutively active (Fig. 3a in 10.1038/s41467-019-09778-7). Meanwhile, mutating these gate residues to hydrophobic amino acids causes no change for I612W, a slight gain-of-function for F518W, slight loss-of-function for F518L, and complete loss-of-function for Y563W (Fig. 4b in 10.1038/s41467-019-09778-7). These findings clearly demonstrate that the side-chain properties are critical for regulating the gate opening. Charged mutations including lysine and glutamic acid are the most effective at promoting gate opening (Fig 5a in 10.1038/s41467-019-09778-7).

      Similarly, others have observed that side chain hydropathy at the F518 site in TMEM16F correlates with shifts in the Ca2+ EC50 (Fig. 2 of 10.1038/s41467-022-34497-x). Note that this publication resolved the structure of the TMEM16F F518H mutant, revealing a previously unseen conformation that we have highlighted in Supplementary Fig. 1e and discussed in lines 235-238. Please also see our response to Reviewer #1 above, where we discuss discoveries in model transmembrane helical peptide systems showing that transbilayer lipid flipping rates correlate with side chain hydropathy (Author response image 3), distance between stacked hydropathic residues (schematic in 10.1248/cpb.c22-00133), and even helical angle between stacked side chains (not shown).

      Following the reviewers’ suggestions, we have tested additional mutations in alternative locations and with different side chains.  

      (1) We have added data for TMEM16F I521A and I521E to demonstrate a similar effect of alternative side chains to what has previously been reported by us and others. We found that I521A failed to show spontaneous scrambling activity (Supplementary Fig. 2), yet I521E (Supplementary Fig. 2) is a constitutively active lipid scramblase, similar to I521K (Fig. 1). This further demonstrates that gate disruption correlates with the side chain hydropathy and that this site lines a critical gating interface.

      (2) We also added lysine mutations two helical turns below the conserved inner activation gate for TMEM16F T526 (Fig. 1) and TMEM16A E551 (Fig. 3). We found that there is indeed a lower limit for the observed effect in TMEM16, where lysine mutations no longer induce spontaneous lipid scrambling activity. This indicates that when the TM 4/6 interaction is weaker toward the intracellular side (Figs. 1a, 3a), the TM 4 lysine mutation loses the ability to promote lipid scrambling by disrupting the TM 4/6 interface to enable clamshell-like opening of the permeation pathway.

      (3) We added a TMEM16F lysine mutation on TM 6 at residue I611 (Fig. 2). Similar to I612K (Response Fig. 6), I611K also leads to spontaneous lipid scrambling and enhanced channel activity in the absence of calcium (Fig. 2). This shows that charged mutations along TM 6 can also promote lipid scrambling, strengthening our model that hydrophobic interactions along the TM 4/6 interface are critical for gating and lipid permeation.”

      (3) Related to the above point, it would be enormously useful to perform even limited computational modelling to support the "energetic barrier" statement. Specifically, can the authors model waters in the putative pore to examine water occupancy in the WT and mutant channels to better understand how the barrier for ions and lipids is altered in the TMEM16? 

      We appreciate this suggestion and have now conducted atomistic MD simulations of OSCA1.2 WT and the L438K mutant for ~1 μs (Supplementary Fig. 4). The simulations revealed elevated water occupancy in the pore region of the L438K mutant, likely due to a widening at the TM 4/6 interface. Conversely, the WT interface remained constricted, largely disallowing water occupancy. These computational results support our previously proposed clamshell-like gating model for TMEM16 scramblases and provide strong support that the L438K mutation disrupts the interaction at the TM 4/6 interface, in turn reducing the energetic barrier for both ion and lipid permeation.
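
      For readers unfamiliar with how pore water occupancy can be quantified from a trajectory, the sketch below counts water oxygens inside a cylinder aligned with the membrane normal and averages the count over frames. This is an illustration only: the response does not describe the authors' analysis procedure, the cylinder geometry (4 Å radius, z between -10 and 10 Å) is an assumption, the function names are hypothetical, and the coordinates are toy values rather than simulation data.

```python
def waters_in_pore(water_xyz, center_xy, radius, z_min, z_max):
    """Count water oxygens inside a z-aligned cylinder (coordinates in angstroms)."""
    cx, cy = center_xy
    r2 = radius * radius
    return sum(
        1
        for x, y, z in water_xyz
        if z_min <= z <= z_max and (x - cx) ** 2 + (y - cy) ** 2 <= r2
    )


def mean_occupancy(frames, center_xy, radius, z_min, z_max):
    """Average per-frame water count over a list of frames."""
    counts = [waters_in_pore(f, center_xy, radius, z_min, z_max) for f in frames]
    return sum(counts) / len(counts)


# Assumed pore definition, for illustration only.
PORE = dict(center_xy=(0.0, 0.0), radius=4.0, z_min=-10.0, z_max=10.0)

# Toy frames: each frame is a list of (x, y, z) water-oxygen positions.
toy_constricted = [
    [(0.0, 0.0, 12.0), (9.0, 0.0, 0.0)],
    [(1.0, 1.0, 3.0), (0.0, 8.0, 0.0)],
]
toy_widened = [
    [(0.5, 0.0, -2.0), (1.0, 1.0, 4.0), (0.0, 0.0, 9.0)],
    [(2.0, 2.0, 0.0), (0.0, 3.0, -8.0), (10.0, 0.0, 0.0)],
]

if __name__ == "__main__":
    print("constricted:", mean_occupancy(toy_constricted, **PORE))
    print("widened:", mean_occupancy(toy_widened, **PORE))
```

      In practice the pore axis and boundaries would be defined from the protein coordinates in each frame (for example, from residues lining the TM 4/6 interface) rather than fixed in space.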

      (4) I am puzzled about the ability of OSCA and the TMEM63 proteins which are cation channels to conduct negatively charged lipids. How can the pore be selective for cations and yet permeate negatively charged molecules when lipids are presented? 

      This is a great question. TMEM16 scramblases (as well as other known scramblases, such as the Xkr and Opsin families) are surprisingly non-selective among phospholipids (they transport all major phospholipid species, not just anionic lipids like PS). It is still debated whether lipid headgroups indeed insert into an open pore or hydrophilic groove (Response Fig. 5), or if they may traverse the bilayer by the so-called ‘out-of-groove’ model. Regardless of the model, the consensus is that Ca2+-induced conformational changes catalyze lipid permeation, and the mutations we have introduced are designed to mimic these conformational changes by separating the TM 4/6 interface.

      Additionally, TMEM16F channel activity was first characterized as cation non-selective (10.1016/j.cell.2012.07.036), similar to OSCA/TMEM63s, which may even exhibit some chloride permeability (10.7554/eLife.41844.001). Thus, it appears as though scramblase activity is agnostic to headgroup charge and compatible with both a mutant anion channel (TMEM16A) and mutant cation channels (TMEM16F, OSCA1.2, and TMEM63A). However, more detailed structural, functional, and computational studies are needed to further clarify ion and lipid co-transport mechanisms.

      (5) Do pore blockers like Gd3+ which block permeation also inhibit the scramblase activity of the mutant channels? This should be tested for the mutant channels. 

      While extracellular Gd3+ has been previously reported as an inhibitor of OSCA1.2 (10.7554/eLife.41844.001), we did not observe this effect (Author response image 5), but instead saw inhibition by intracellular Gd3+ (Author response image 6). Given this discrepancy, we did not test Gd3+ inhibition of the OSCA1.2 scramblases, but instead tested Ani9, a paralog-specific inhibitor of TMEM16A, on the TMEM16A I546K gain-of-function mutant and found it attenuated both ion channel and phospholipid scramblase activities (Supplementary Fig. 3).

      Author response image 5.

      200 µM Gd3+ext fails to inhibit OSCA1.2 currents in cell-attached patches. Pressure-elicited peak currents (n=6 each). Statistical test is an unpaired Student’s t-test.

      Author response image 6.

      200 µM Gd3+int completely inhibits OSCA1.2 currents in inside-out patches. (a) Representative traces before (black), during (red), and after (blue) Gd3+ application. (b) Representative application timecourse. (c) Quantification of peak currents (n=8 each). Statistical test is one-way ANOVA.

      Minor: 

      - Some of the current amplitudes shown in Figures 2 and 3 are enormous. Is liquid junction potential corrected in these experiments? If not, it would be preferable to correct this to avoid voltage errors. 

      Thanks for the question. The large current amplitude is due to 1) great surface expression of the proteins, 2) large single channel conductance of OSCA channels, and 3) much larger current at positive voltages for OSCA channels. Our control experiment showed that WT TMEM16A at 0 Ca2+ did not give rise to any current (Fig. 3d), further demonstrating that the large current was not due to liquid junction potential. For the OSCA recordings, we also did not observe current in mock-transfected cells, further excluding the possible interference of liquid junction potential (Response Fig. 1).

      - Related, authors could consider adding some evidence using selective pharmacology to support the conclusions that the observed currents arise from TMEM or OSCA channels. 

      Thanks for the suggestion. As mentioned above, we have added experiments with Ani9, a specific inhibitor of TMEM16A, in Supplementary Fig. 3. We found that Ani9 robustly attenuated both ion channel and phospholipid scramblase activities for the TMEM16A I546K gain-of-function mutant. This is also consistent with our previous publication (10.1038/s41467-019-09778-7), where Ani9 efficiently inhibited the TMEM16A L534K mutant scramblases. Additionally, we have provided mock controls (Response Fig. 1, Fig. 6d, e) to show that the observed currents are indeed attributable to OSCA1.2 and TMEM63A.

      Reviewer #3 (Recommendations For The Authors): 

      Given that the authors postulate that the introduction of a positive charge via the lysine side chain is essential to the constitutive activity of these proteins, additional mutation controls for side chain size (e.g. glutamine/methionine) or negative charge (e.g. glutamic acid), or a different positive charge (i.e. arginine) would have strengthened their argument. To more comprehensively understand the TM4/TM6 interface, mutations at locations one turn above and one turn below could be studied until there is no phenotype. In addition, the equivalent mutations on the TM6 side should be explored to rule out the effects of conformational changes that arise from mutating TM4 and to increase the strength of evidence for the importance of side-chain interactions at the TM6 interface. 

      We appreciate these great suggestions which were shared by multiple reviewers. We have included our previous responses below.

      “Response to reviewers 2 & 3: In our 2019 paper (10.1038/s41467-019-09778-7), we systematically tested the effect of side chain properties at the inner activation gate of TMEM16F on lipid scrambling activity (Response Fig. 6) and, since then, these results have been supplemented by others as well (10.1038/s41467-022-34497-x). In summary, mutating the inner activation gate residues to polar or charged residues generally results in constitutively activated scramblases without requiring Ca2+ (Fig 5a in 10.1038/s41467-019-09778-7). Because these residues form a hydrophobic gate, introducing smaller side chains via alanine substitution also causes gain-of-function, with the Y563A mutant as well as the F518A/Y563A/I612A variant being constitutively active (Fig. 3a in 10.1038/s41467-019-09778-7). Meanwhile, mutating these gate residues to hydrophobic amino acids causes no change for I612W, a slight gain-of-function for F518W, a slight loss-of-function for F518L, and complete loss-of-function for Y563W (Fig. 4b in 10.1038/s41467-019-09778-7). These findings clearly demonstrate that the side-chain properties are critical for regulating the gate opening. Charged mutations including lysine and glutamic acid are the most effective at promoting gate opening (Fig 5a in 10.1038/s41467-019-09778-7).

      Similarly, others have observed that side chain hydropathy at the F518 site in TMEM16F correlates with shifts in the Ca2+ EC50 (Fig. 2 of 10.1038/s41467-022-34497-x). Note that this publication resolved the structure of the TMEM16F F518H mutant, revealing a previously unseen conformation that we have highlighted in Supplementary Fig. 1e and discussed in lines 235-238. Please also see our response to Reviewer #1 above, where we discuss discoveries in model transmembrane helical peptide systems showing that transbilayer lipid flipping rates correlate with side chain hydropathy (Author response image 3), distance between stacked hydropathic residues (schematic in 10.1248/cpb.c22-00133), and even helical angle between stacked side chains (not shown). 

      Following the reviewers’ suggestions, we have tested additional mutations in alternative locations and with different side chains.  

      (1) We have added data for TMEM16F I521A and I521E to demonstrate a similar effect of alternative side chains to what has previously been reported by us and others. We found that I521A failed to show spontaneous scrambling activity (Supplementary Fig. 2), yet I521E (Supplementary Fig. 2) is a constitutively active lipid scramblase, similar to I521K (Fig. 1). This further demonstrates that gate disruption correlates with the side chain hydropathy and that this site lines a critical gating interface.

      (2) We also added lysine mutations two helical turns below the conserved inner activation gate for TMEM16F T526 (Fig. 1) and TMEM16A E551 (Fig. 3). We found that there is indeed a lower limit for the observed effect in TMEM16, where lysine mutations no longer induce spontaneous lipid scrambling activity. This indicates that when the TM 4/6 interaction is weaker toward the intracellular side (Figs. 1a, 3a), the TM 4 lysine mutation loses the ability to promote lipid scrambling by disrupting the TM 4/6 interface to enable clamshell-like opening of the permeation pathway. 

      (3) We added a TMEM16F lysine mutation on TM 6 at residue I611 (Fig. 2). Similar to I612K (Response Fig. 6), I611K also leads to spontaneous lipid scrambling and enhanced channel activity in the absence of calcium (Fig. 2). This shows that charged mutations along TM 6 can also promote lipid scrambling, strengthening our model that hydrophobic interactions along the TM 4/6 interface are critical for gating and lipid permeation.”

      The experiments for OSCA1.2 osmolarity effects on gating and scramblase in Figure 4 could be improved by adding different levels of osmolarity in addition to time in the hypotonic solution.

      We thank the reviewer for this excellent suggestion. We extensively tested this idea and found evidence (Response Fig. 10) that intermediate osmolarity (220 and 180 mOsm/kg) can also enhance the scramblase activity of the A439K mutant, albeit to a milder extent compared to 120 mOsm/kg stimulation. This suggests that swelling-induced membrane stretch may proportionally induce A439K activation and lipid scrambling. Due to the relatively mild sensitivity of OSCA to osmolarity and the variations induced by the experimental conditions, we believe it is better not to include these data to avoid overclaiming. We hope the reviewer will agree. 

      Author response image 7.

      AnV intensities of WT- and A439K-transfected cells after 10 minutes of hypotonic stimulation at the listed osmolarities.

      Some confocal images appear to be rotated relative to each other (e.g. Figures 2b and 3b).

      Thank you for identifying these errors; they are corrected in the revision.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      We wish to thank the Reviewers for their critical analysis of the article and for their suggestions and comments.

      In addition to the point-by-point answers to the Reviewers, we wish here to emphasize three essential points that have been raised: First, we never intended (nor pretended) to address the incidence of the two EHT cell emergence processes on downstream fate, after release from the aortic floor (see for example the last paragraph of our initially submitted manuscript). We only wished to bring evidence on the cell biological heterogeneity of the HE, particularly relying on cell polarity control and polarity reestablishment/reinforcement in the case of EHT pol+ cells, thus leading to emergence morphodynamic complexity. In the general context of cell extrusion, in which all polarity features are generally downregulated, these are remarkable features.

      Second, we inform the Reviewers that we have performed a major revision of the work on the Pard3 proteins, the outcome of which, hopefully, significantly substantiates the idea of a tuning of cell polarity features in the HE and all along the EHT time-window, for supporting EHT pol- and EHT pol+ types of emergence. To achieve this, we entirely revised the experimental strategy to increase the specificity and sensitivity of detection of Pard3 protein isoforms expressed in the vascular system, based on endothelial FACS-sorting, qRT-PCR and single-molecule whole mount in situ hybridization using RNAscope. Importantly, we wish to stress that, by addressing Pard3 proteins, we initially aimed at substantiating our observations on the localization of our podxl2 construct (del-podxl2) used to label apical membranes. Hence, we sought to bring correlative evidence on the variation of expression of polarity proteins at early and later time points of the EHT time-window (suggesting tightly regulated expression control of polarity determinants, possibly at the mRNA level). This was clearly written and justified in the text, lines 227 or 303 of the initial manuscript. Also, this might have led to the identification of (a) specific isoform(s), including splicing variants as initially addressed.

      As the Reviewers will see, while performing the revision of our work, we have now been able to point at a specific isoform of Pard3, namely Pard3ba, whose mRNA expression level, in aortic cells and at single-cell resolution, is uniquely and specifically enhanced in cells contacting emergence ‘hot spots’. Using our Runx1 mutant fish line (dt-Runx1), we also show that expression of Pard3ba mRNAs, in these specific aortic regions, is sensitive to interference with Runx1 activity (i.e., dt-Runx1 increases Pard3ba expression). Altogether, our new results strongly support our initially proposed idea on the regulation of polarity features during EHT; they indicate intercellular coordination, through cooperative cross-talk between aortic and HE/EHT cells. This is compatible with the idea of a ‘tuning’ of apico-basal polarity during the entire EHT time-window (including maturation of the HE to become competent for emergence and the emergence process per se, whose morphodynamic complexity relies on regulating apico-basal polarity associated functions (e.g., for controlling the specific junctional recycling modes of EHT pol+ and EHT pol- cells, as we suggest using JAM proteins that we have chosen owing to their function in the recruitment of Pard3 proteins for apico-basal polarity establishment)). This complements our work nicely and highlights the relevance of studying the interplay between aortic and HE/EHT cells (which we have started to dissect in the second part of our manuscript). Further work is obviously required to address local, dynamic variations of mRNAs encoding this specific isoform of Pard3 as well as specific interference with its functions at the spatial and temporal levels (hence on live tissues), which is far beyond the scope of our currently submitted work.

      Finally, this emphasizes the importance of the aortic context, at the mesoscopic level, in the regulation of the EHT.

      Third, based on these major points and the Reviewers’ suggestions, we take into account the fact that the heterogeneity in emergence morphodynamics was not highlighted and propose the following title:

      ‘Tuning apicobasal polarity and junctional recycling in the hemogenic endothelium orchestrates the morphodynamic complexity of emerging pre-hematopoietic stem cells’

      Regarding Results and Figures, the previous Figures 3 and 4 have been entirely revised, with the support of Supplement Figures (3 and 4 supplement figures, respectively, as well as a supplement video to Figure 3). Supplement Figures have also been included in the revised version for nearly all results that appeared as data not shown (Figure 1 – figure supplement 2: illustrating the maintenance of EHT pol+ and EHT pol- cells after division; Figure 1 – figure supplement 3: illustrating the expression of the hematopoietic marker CD41 by EHT pol+ and EHT pol- cells). Also, a new supplemental figure, Figure 7 – figure supplement 7, has been added to substantiate the impact of interfering with ArhGEF11/PDZ-RhoGEF alternative splicing on hematopoiesis. Finally, a Figure for the Reviewers is added at the end of this file that shows that virtually 100% of aortic floor cells that we consider as hemogenic cells are positive for the hematopoietic marker Gata2b, which is upstream of Runx1 (using RNAscope, which allows achieving cellular resolution unambiguously).

      Reviewer #1 (Public Review):

      Summary:

      In this research article, the authors utilized the zebrafish embryo to explore the idea that two different cell types emerge with different morphodynamics from the floor of the dorsal aorta based on their apicobasal polarity establishment. The hypothesis that the apical-luminal polarity of the membrane could be maintained after EHT and confer different functionality to the cell is exciting, however, this could not be established. There is a general lack of data supporting several of the main statements and conclusions. In addition, the manuscript is difficult to follow and needs refinement. We present below some questions and suggestions with the goal of guiding the authors to improve the manuscript and solidify their findings.

      Here, we wish to emphasize that we do not make the hypothesis that ‘…the apical-luminal polarity of the membrane could be maintained after EHT …’ but that the apico-basal polarity establishment/maintenance controls the type of emergence and their associated cell biological features (EHT pol+ and EHT pol- cellular morphodynamics, establishment of membrane domains). Hence, our work suggests that these emergence modes, as a consequence of their intrinsic characteristics and differences, might have an impact on cellular behavior after the release (to place the work in the broader context of hematopoietic cell fate and differentiation). More specifically, the difference in the biological features of the luminal versus abluminal membrane for the two EHT types (ex: membrane signaling territories, membrane pools devoted to specific functions), might endow the cells with specific functional properties, after the release. What happens to those cells thereafter, except for illustrating the evolution of the luminal membrane for pol+ EHT cells, is beyond the scope of this paper. Here, we analyze and characterize some of the cell biological features of the EHT process per se (the emergence from the aortic floor), including the dynamic interface with adjoining endothelial cells.

      Strengths:

      New transgenic zebrafish lines developed. Challenging imaging.

      Weaknesses:

      (1) The authors conclude that the truncated version of Podxl2 fused to a fluorophore is enriched within the apical site of the cell. However, based on the images provided, an alternative interpretation is that the portion of the membrane within the apical side is less stretched than in the luminal side, and therefore the fluorophore is more concentrated and easier to identify by confocal. This alternative interpretation is also supported by data presented later in the paper where the authors demonstrate that the early HE is not polarized (membranes are not under tension and stretched yet). Could the authors confirm their interpretation with a different technique/marker like TEM?

      The argument of the apparent enrichment, or exclusion, of a marker depending on membrane stretching (and hence molecular packing) would be valid for any type of molecule embedded in these membranes, including of course endogenous ones (this is one of the general biophysical principles leading to the establishment of membrane domains, structurally and functionally speaking); hence, using another marker would not solve the issue because it would depend on its behavior in regard to packing (in particular lipid packing), which is difficult to anticipate and is a topic of its own (especially in this system, which has been poorly investigated in regard to its biophysical and biochemical properties in vivo (including its exposure to the hemodynamics)).

      If we follow the logic of the Reviewer, it appears that it is not consistent with our results on the maturing HE. Indeed, in our dt-Runx1 mutants, mKate2-podxl2 is enriched at the luminal membrane of HE cells (HE cells are elongated, and the two membrane domains have relatively equal surface and bending); in comparison, HE cells have the same morphology in control animals as in mutants but, in controls, eGFP-podxl2 and mKate2-podxl2 are equally partitioned between the luminal and abluminal membranes (see Figure 3 – figure supplement 2 (for mKate2-podxl2) and Figure 2 – figure supplement 1 and 2 (for eGFP-podxl2)). In addition, we took care while designing the eGFP and mKate2 fusions to keep the natural podxl2 sequence containing critical cysteine residues to maintain assembly properties and distance from the transmembrane segment (hence the fluorescent protein per se is not directly exposed to membrane stretching).

      Finally, electron microscopy is not the approach to use for this issue because it requires tissue fixation, which always risks significantly modifying membrane properties. Along these lines, when we fix embryos (and hence membranes, see our new Figure 4 and its Supplemental Figures), we do not appear to maintain obvious EHT pol+ and pol- cell shapes. In addition, to be conclusive, the work would require not TEM but immuno-EM to be able to visualize the marker(s), which is another challenge with this system.

      (2) Could the authors confirm that the engulfed membranes are vacuoles as they claimed, using, for example, TEM? Why is it concluded that "these vacuoles appear to emanate from the abluminal membrane (facing the sub-aortic space) and not from the lumen?" This is not clear from the data presented.

      The same argument regarding electron microscopy mentioned in the previous point is valid here (in addition, if technically feasible at all, it would require serial sectioning to make sure not to miss the very tiny connection that may only suggest ultimate narrowing down of the facing adjacent bilayers, which is quite challenging). The term vacuole, which we use with caution (in fact, more often, we use the term pseudo-vacuoles in the initial manuscript, lines 140, 146, 1467 (legend to Figure 1 – figure supplement 1), or apparent vacuole-like in the same legend, lines 1465 and 1476), is legitimate here because we cannot say that they are portions of the invaginated luminal membrane, as we could be accused of not showing that these membranes are still connected to the luminal surface; we are here at the limit of the resolution that in vivo imaging currently allows with this system, and we draw the Reviewer’s attention to the fact that we are reaching here a sub-cellular level, which is already a challenge by itself.

      In addition, if vacuoles (or pseudo-vacuoles) were not formed at some point in this system (membrane-bounded organelles), it would be difficult to conceive how, after release of the cell, the fluid inherited from the aortic lumen would efficiently be chased from these membranes/organelles (see also our model Figure 1 – figure Supplement 1B).

      Why is it concluded that "these vacuoles appear to emanate from the abluminal membrane (facing the sub-aortic space) and not from the lumen?" This is not clear from the data presented.

      This is not referring to our data but to the Sato et al 2023 work. For EHT undergoing cells leading to aortic clusters in mammals and avians, vacuolar structures indeed appear to emanate from the ab-luminal side facing the sub-aortic space (we cannot call it basal because we do not know the polarity status of these cells). In the Revised version of the manuscript, we have moved this paragraph referring to the Sato et al work to the Discussion, which gives the possibility to expand a bit on this issue, for more clarity (see the second paragraph of our new Discussion).

      (3) It is unclear why the authors conclude that "their dynamics appears to depend on the activity of aquaporins and it is very possible that aquaporins are active in zebrafish too, although rather in EHT cells late in their emergence and/or in post-EHT cells, for water chase and vacuolar regression as proposed in our model (Figure 1 - figure supplement 1B)." In our opinion, these figures do not confirm this statement.

      This part of the text has been upgraded and moved to the Discussion (see our answer to point 2), to address the Reviewers’ concern about the clarity of the Results section and to allow us to elaborate a bit more on this issue. We only wished to draw attention to the described presence of intracellular vacuolar structures recently addressed in the Sato et al 2023 paper showing EHT cell vacuoles that are proposed to contribute to cellular deformation during the emergence. We take this example to rationalize the regression of the vacuolar structures described in Figure 1 - figure supplement 1B, which is why we have written ‘… it is very possible that aquaporins are active in zebrafish too’; the first part of the sentence refers to the Sato et al 2023 paper.

      (4) Could the authors prove and show data for their conclusions "We observed that both EHT pol+ and EHT pol- cells divide during the emergence"; "both EHT pol+ and EHT pol- cells express reporters driven by the hematopoietic marker CD41 (data not shown), which indicates that they are both endowed with hematopoietic potential"; and "the full recovery of their respective morphodynamic characteristics (not shown)?".

      To the new version of our manuscript, we have added new Supplemental information to Figure 1 (two new Supplemental Figures):

      • Figure 1 - figure Supplement 2 that illustrates that both EHT pol+ and EHT pol- cells divide during the emergence as well as the maintenance of morphology for both EHT cell types. We wish also to add here that the maintenance of the EHT pol+ morphology is the most critical point, showing that dividing cells in this system do not necessarily lead to EHT pol- cells.

      • Figure 1 - figure Supplement 3 that shows that both EHT cell types express CD41.

      (5) The authors do not demonstrate the conclusion traced from Fig. 2B. Is there a fusion of the vacuoles to the apical side in the EHT pol+ cells? Do the cells inheriting less vacuoles result in pol- EHT? It looks like the legend for Fig. 2-fig supp is missing.

      As said previously, showing fusion here is not technically possible, but indeed, this is the idea, which fits with the images corresponding to time points 0-90 minutes (Figure 2A), showing (in particular for the right cell) a large pseudo-vacuole whose membrane is heavily enriched with the polarity marker podxl2 (based on fluorescence signal in a membrane-bounded organelle that, based on its curvature radius, should be more under tension than the more convoluted EHT pol+ cell luminal membrane). Also, EHT pol- cells may be born from HE cells that either inherit fewer intracellular vesicles after division or are derived from HE cells that are less (or not) exposed to polarity-dependent signaling (see our data presented in the new Figure 4 and the new version of the Discussion, paragraphs ‘Characteristics of the HE and complexity of pre-hematopoietic stem cell emergence’ and ‘Spatially restricted control of Pard3ba mRNAs by Runx1’).

      Finally, the cartoon in Figure 2B is a hypothetical model, consistent with our data, that is meant to help the reader understand the idea extrapolated from images that may not be so easy to interpret for people not working on this system. In the legend of Figure 2 that describes this issue in the first version of our manuscript (lines 1241-1243), we were cautious and wrote, in parentheses: ‘note that exocytosis of the large vacuolar structure may have contributed to increase the surface of the apical/luminal membrane (the green asterisk labels the lumen of the EHT pol+ cell)’.

      The legend to Figure 2 – figure supplement 1 is not missing (see lines 1492 – 1499 of the first manuscript). The images of this supplement are not extracted from a time-lapse sequence and show that as early as 30hpf (shortly after the beginning of the EHT time-window – around 28hpf), cells on the aortic floor already exhibit podxl2-containing pseudo-vacuolar structures (which we propose is a prerequisite for HE cell maturation into EHT competent cells; see also Figure 2 – figure supplement 2).

      (6) The title of the paper "Tuning apico-basal polarity and junctional recycling in the hemogenic endothelium orchestrates pre-hematopoietic stem cell emergence complexity" could be interpreted as functional heterogeneity within the HSCs, which is not demonstrated in this work. A more conservative title denoting that there are two types of EHT from the DA could avoid misinterpretations and be more appropriate.

      There was no ambiguity, throughout our initial manuscript, on what we meant when using the word ‘emergence’; it refers only to the extrusion process from the aortic floor.

      Reducing our title only to the 2 types of EHT cells would be very reductionist in regard to our work, which also addresses essential aspects of the interplay between hemogenic cells, cells undergoing extrusion (EHT pol+ and pol- cells), and their endothelial neighbors (not to mention what we show in terms of the cell biology for the maturing HE and the regulation of its interface with endothelial cells (evidence for vesicular trafficking, specific regulation of HE-endothelial cell intercalation required for EHT progression, etc.)). However, to take this specific comment into account, we propose a slightly changed title saying that there are emergences differentially characterized by their morphodynamic characteristics:

      ‘Tuning apicobasal polarity and junctional recycling in the hemogenic endothelium orchestrates the morphodynamic complexity of emerging pre-hematopoietic stem cells’

      (7) There are several conclusions not supported by data: "Finally, we have estimated that the ratio between EHT pol+ and EHT pol- cells is of approximately 2/1". "We observed that both EHT pol+ and EHT pol- cells divide during the emergence and remain with their respective morphological characteristics". "We also observed that both EHT pol+ and EHT pol- cells express reporters driven by the hematopoietic marker CD41 (data not shown), which indicates that they are both endowed with hematopoietic potential." These conclusions are key in the paper, and therefore they should be supported by data.

      Most of the requests of the Reviewer in this point have already been asked in point 4 and were added to the revised version.

      Regarding the EHT pol+/pol- ratio, we will keep the ratio at approximately 2/1. The Reviewer should be aware that quantification of EHT cells is a tricky issue and a source of important variability, as can be assessed by the quantifications that we have been performing (see for example figures in which we compare the dt-Runx1 phenotype with Ctrl). This is inherent to this system, more specifically because the EHT process is asynchronous, ranging from approx. 28 hpf to 3 days post fertilization (we have even observed EHT at 5 dpf). We systematically observed heterogeneity in EHT numbers and EHT types between animals and also between experiments (some days we observe EHTs at 48 hpf, others more around 55 hpf or even later). In addition, emergence also proceeds on the lateral side of the aorta and, while it is relatively easy to identify EHT pol+ cells because of their highly characteristic morphology, it is more difficult for EHT pol- cells, which can be mistaken for round HE cells preparing for division. In the current revision of our work, we provide additional facts and potential explanations on the mechanisms that control this asynchrony and the apparent stochasticity of the EHT process (see results of new Figures 3 and 4).

      Reviewer #2 (Public Review):

      In this study, Torcq and colleagues make careful observations of the cellular morphology of haemogenic endothelium undergoing endothelial to haematopoietic transition (EHT) to become stem cells, using the zebrafish model. To achieve this, they used an extensive array of transgenic lines driving fluorescent markers, markers of apico-basal polarity (podocalyxin-FP fusions), or tight junction markers (jamb-FP fusions). The use of the runx truncation to block native Runx1 only in endothelial cells is an elegant tool to achieve something akin to tissue-specific deletion of Runx1. Overall, the imaging data is of excellent quality. They demonstrate that differences in apico-basal polarity are strongly associated with different cellular morphologies of cells undergoing EHT from HE (EHT pol- and EHT pol+) which raises the exciting possibility that these morphological differences reflect the heterogeneity of HE (and therefore HSCs) at a very early stage. They then overexpress a truncated form of Runx1 (just the runt domain) to block Runx1 function and show that more HE cells abort EHT and remain associated with the embryonic dorsal aorta. They identify pard3aa and pard3ab as potential regulators of cell polarity. However, despite showing that loss of runx1 function leads to (late) decreases in the expression of these genes, no evidence for their role in EHT is presented. The FRAP experiments and the 2d-cartography, albeit very elegant, are difficult to interpret and not very clearly described throughout the text, making interpretation difficult for someone less familiar with the techniques. Finally, while it is clear that ArhGEF11 is playing an important role in defining cell shapes and junctions between cells during EHT, there is very little statistical evidence to support the limited data presented in the (very beautiful) images.

      As mentioned in the response to reviewer 1, we revised our whole strategy for the analysis of the role of Pard3 proteins in regulating the emergence of hematopoietic precursors. Our new data, obtained using refined gene expression analysis by qRT-PCR on FACS-sorted populations and by in situ gene expression analysis at single-cell resolution using RNAscope, show first that a unique Pard3 isoform (Pard3ba) is sensitive to runx1 activity, and that its expression is specifically localized in aortic cells contacting hemogenic (HE)/EHT cells. We show a clear correlation between the densification of Pard3ba mRNAs and the presence of contacting HE/EHT cells, suggesting a key role for Pard3ba in a cross-talk between aortic and hemogenic cells. Furthermore, we show that our dt-runx1 mutant impacts the maturation of HE cells; when this mutant is expressed, we observe, in comparison to control, an accumulation of HE cells that are abnormally polarized as well as unusually high numbers of EHT pol+ cells. This strongly suggests that the polarity status of HE cells controls the mode of emergence. Overall, our work shows that regulation of apico-basal polarity features is essential for the maturation of the HE and the proper proceeding of the EHT.

      We made efforts to explain the FRAP experiments and the analysis of 2D-cartography more clearly throughout the text to facilitate readers’ comprehension. 2D-cartography is an invaluable tool to precisely discriminate between endothelial and hemogenic cells, and its usage was essential during the FRAP sessions, to point at specific junctional complexes accurately. Performing FRAP at cellular junctions during aortic development was extremely challenging technically and the outcome was subject to quite significant variability (which often leads to quantitative results at the limit of statistical significance, which is why we speak of tendencies in our results section reporting on this type of experiment). Apart from constant movement and drifting of the embryos, which are sources of variability, the EHT process per se evolves over time and does so at a heterogeneous pace (for example, the apical closure of EHT pol+ cells is characterized by a succession of contraction and stabilization phases, see Lancino et al. 2018), which is an additional source of variability in the measurements. Despite all this, our data collectively and consistently suggest a differential regime of junctional dynamics between EHT cell types and support the critical function of ArhGEF11/PDZ-RhoGEF in the control of junctional turnover at the interface between HE and aortic cells as well as between HE cells to regulate cell-cell intercalation.

      There is a sense that this work is both overwhelming in terms of the sheer amount of imaging data, and the work behind it to generate all the lines they required, and at the same time that there is very little evidence supporting the assertion that pard3 (and even ArhGEF11) are important mediators of cell morphology and cell fate in the context of EHT. For instance, the pard3 expression data, and levels after blocking runx1 (part of Figure 3 and Figure 4) don't particularly add to the manuscript beyond indicating that the pard3 genes are regulated by Runx1.

      We thank the reviewer for the comment on the Pard3 data, particularly because it led us to reconsider our strategy and to address with more precision, at cellular resolution, the potential function of this protein family during the time-window of the EHT. As summarized in the header of the Public Review, we identified one specific isoform of Pard3 in the zebrafish, Pard3ba, whose sensitivity to runx1 interference and spatially restricted expression reinforce the idea of a fine control of apico-basal polarity features and associated functions while the EHT is proceeding. Our new data also reinforce the interplay between HE/EHT cells and their direct endothelial neighbors.

      Weaknesses

      The writing style is quite convoluted and could be simplified for clarity. For example, there is plenty of discussion and speculation throughout the presentation of the results. A clearer separation of the results from this speculation/discussion would help with understanding. Figures are frequently presented out of order in the text; modifying the figures to accommodate the flow of the text (or the other way around) - would make it much easier to follow the narrative. While the evidence for the different cellular morphologies of cells undergoing EHT is strong, the main claim (or at least the title of the manuscript) that tuning apico-basal polarity and junctional recycling orchestrate stem cell emergence complexity is not well supported by the data.

      We refined our text when necessary, in particular taking care of transferring and substantiating the arguments that appeared in the Results section, to the Discussion. We also made efforts, on several occasions and for clarity, to describe more precisely the results presented in the different panels of the Figures.

      As mentioned in the header of the text of the Public Review and the response to the 6th point of the Public Review of Reviewer 1, we modified slightly the title to avoid ambiguity. In addition, we added a new paragraph to the beginning of our discussion that summarizes the impact of our findings and, we believe, legitimates our title.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Embryonic stages should be indicated in all images presented for clarification.

      We thank the reviewer for this point. We added stages where they were missing from the figures (Figure 1, Figure 1 - Figure supplement 1, Figure 2, Figure 2 - Figure supplement 1, Figure 5, Figure 6, Figure 6 - Figure supplement 1, Figure 7 - Figure supplement 3, Figure 7 - Figure supplement 5, Figure 7 - Figure supplement 6).

      (2) In which anatomical site/s were images from Fig 1C and D taken? The surrounding environment looks different, for example, cells in Fig1D seem to be surrounded by other cells, resembling the endothelial plexus at the CHT, while the cells in Fig. 1C seem to be in the dorsal aorta. Is there a spatial difference depending on where cells are budding off? The authors state that there are no differences, but no quantification or data demonstrating that statement is provided.

      As mentioned in the figure legend (lines 1206-1209 of the original manuscript), images for Figure 1C and 1D were both taken at the boundary between the end of the AGM and the entry into the caudal hematopoietic tissue. As the images were acquired from different embryos, the labelling of the underlying vein differs between the two panels, with venous tissues being more sparsely labelled in panel C than in panel D. These images were chosen to illustrate the clearly opposite morphology between the two EHT types that we describe. However, for the rest of the paper, all images and all analyses were exclusively acquired / performed in the dorsal aorta in the AGM, in a region spanning approximately 10-12 inter-segmentary vessels, starting from the end of the elongated yolk up to the start of the balled yolk. In light of the work from the lab of Zilong Wen showing that only cells emerging anteriorly exhibit long-term replenishment potential (Tian et al. 2017), we specifically chose to limit our comparative analysis to the AGM region and did not quantitatively investigate emergences occurring in the caudal region of the aorta. Additionally, although we routinely observe both types of emergences occurring in the caudal region of the dorsal aorta, we did not quantify the frequency of either type of EHT event in this region.

      Finally, the images of EHT pol+ cells shown in Figure 1C are of the highest quality we have ever obtained; one reason is that these two cells emerge at the entry of the CHT, a region that is much easier to image at high resolution than the trunk because the sample is less thick and because imaging is less perturbed by heartbeats.

      (3) Which figure shows "EHT pol- cells were observed in all other Tg fish lines that we are routinely imaging, including the Tg(Kdrl:Gal4;UAS:RFP) parental line that was used for transgenesis, thus excluding the possibility that these cells result from an artefact due to the expression of a deleted form of Podxl2 and/or to its overexpression."? It would be informative to include this figure.

      Other examples of EHT pol- cells were shown in Figure 5C as well as in Figure 6B using the Tg(kdrl:Jam3b-eGFP; kdrl:nls-mKate2) fish line, which was routinely used for junctional dynamic analyses by FRAP. Furthermore, we now add a new figure (New Figure 1 – figure supplement 3) to illustrate the presence of EHT pol- cells using the Tg(CD41:eGFP) transgenic background, additionally illustrating that EHT pol- cells are CD41 positive.

      (4) Are the spinning disk confocal images a single plane? Or maximum projections? Sometimes this is not specified.

      We took this remark into account and went through all figure legends to specify the type of images presented (Figure 1 – figure supplement 1, Figure 2, Figure 2 – figure supplement 1, Figure 2 – figure supplement 2, Figure 7 – figure supplement 3); when relevant, we also added this information directly to the figure panels (Figure 6A – 6B).

      (5) Could the expression data by RT-qPCR for the Pard3 isoforms be shown? Additionally, it would be appreciated if this expression data could be complemented using Daniocell (https://daniocell.nichd.nih.gov/).

      As mentioned in the first paragraph of our response to Public Reviews, and based on reviewers’ comments, we revised our strategy for the investigation of pard3 proteins expression in the vascular system, for their potential role in EHT and sensitivity to runx1. First, we used FACS sorting as well as tissue dissection to enrich in aortic endothelial cells and perform our qPCR analyses (see the new Figure 4 – figure supplement 1A and Figure 4 – figure supplement 3A for the strategy). As asked by the reviewers and for more transparency, we show the expression relative to the housekeeping gene ef1a in our different control samples (new Figure 4 – figure supplement 1C). Furthermore, we used single-molecule FISH to precisely characterise in situ the expression of several of the Pard3 isoforms (Pard3aa, Pard3ab and Pard3ba, which, based on qPCR, were the most relevant for our investigation in the vascular system) (see lines 386 to 412 in text relative to Figure 4 – figure supplement 2). This new addition nicely shows the different pattern of expression of 3 of the Pard3 zebrafish isoforms in the trunk of 2dpf embryos, outlining interesting specificities of each isoform expression in different tissues.

      We thank the reviewer for this suggestion to complement our data with the published Daniocell dataset. However, and potentially due to the poor annotation of the different pard3 genes in public databases, gene expression information was absent for two of our isoforms of interest (pard3aa and pard3ba), which we ultimately show to be the most enriched in the vascular system in the trunk. Daniocell gene expression data for the Pard3ab isoform show expression in the pronephric duct at 48-58hpf, as well as in intestine progenitors and neuronal progenitors, which is consistent with our in situ observations using RNAscope. However, pard3ab is poorly detected within the hematopoietic and vascular clusters. This observation is coherent with our data, which do not show any enrichment of this isoform in vascular tissues compared to other structures. On the other hand, pard3bb does not seem to be particularly enriched in vascular/hematopoietic clusters at 48-58hpf in the Daniocell dataset, in accordance with what we observe with our qPCR. Finally, in the Daniocell dataset, all of the pard3 variants (pard3ab, pard3bb, PARD3 and PARD3 (1 of many)) seem to be either scarcely or not detected in the hematopoietic/vascular system. In our case, for all the isoforms we studied in control condition (pard3aa, pard3ab and pard3ba), and although the technique is only semi-quantitative due to the presence of an amplification step, RNAscope assays seem to indicate a very low expression in aortic cells (sometimes with as little as one mRNA copy per cell); this explains the low detection in single-cell RNAseq datasets and is coherent with the Daniocell dataset.

      (6) It would be informative to add in the introduction some information on apico-basal polarity, tight junctions, JAMs (ArhGEF11/PDZ-RhoGEF).

      We modified the introduction so as to add relevant information on Pard3 proteins, their link with our JAMs reporters in the context of polarity establishment, as well as the role of ArhGEF11/PDZ-RhoGEF and its alternative splicing variants in regulating junctional integrity in the context of epithelial-to-mesenchymal transition (lines 99 to 127). This modification of the introduction also allowed us to lighten some parts of the result section (lines 222 to 224, 345 to 349 and 454 to 456 of the original manuscript).

      Reviewer #2 (Recommendations For The Authors):

      (1) There is lots of data (and lots of work) in this paper; I feel that the pard3 data doesn't substantially add to the paper, and at the same time there is data missing (see point 10, point 11 below for an example).

      To improve clarity and substantiate our findings on Pard3, we entirely revised our investigation strategy, as mentioned in previous paragraphs. We refined the characterization of Pard3 isoform expression in the vascular tissue, using both cell enrichment by FACS for gene expression analysis and single-molecule FISH (RNAscope) to access spatial information on the expression of pard3 isoforms, reaching sub-cellular resolution.

      This new strategy allowed us to show the unexpected localization of Pard3ba mRNAs in mRNA-enriched regions in the vicinity of HE/EHT cells (new Figure 4, and paragraph Interfering with Runx1 activity unravels its function in the control of Pard3ba expression and highlights heterogeneous spatial distribution of Pard3ba mRNAs along the aortic axis, see the new manuscript). Overall, the new spatial analysis we performed allowed us to substantiate our findings on Pard3ba and suggests a direct interplay between hemogenic cells and their endothelial aortic neighbors; this interplay supposedly relies on apico-basal polarity features that are at least in part regulated by runx1 in the context of HE maturation and EHT.

      (2) Labelling of the figures could be substantially improved. In many instances, the text refers to a figure (e.g. Fig 6A), but it has several panels that are not well annotated (in the case of Fig 6A, four panels) or labelled sparsely in a way that makes it easy to follow the text and identify the correct panel in the figure. Even supplementary figures are sparsely labelled. Labelling to include embryonic stages, which transgenic is being used, etc should be added to the panels to improve clarity for the reader.

      We revised the figures to add relevant information, including stages, types of images and annotations to facilitate comprehension, including Figure 6A – 6B, Figure 5B – 5C (see response to Reviewer 1, first comment, for a more complete list of all revised figures, transgenic fish lines and embryonic stage annotations). Furthermore, we revised the entire manuscript to fit the figures as closely as possible and added some annotations to more easily link the text to the figures and panels.

      (3) The current numbering of supplementary figures is quite confusing to follow.

      We revised the manuscript so as to make sure all principal and supplementary figures were called in the right order and that supplementary figures appearance was coherent with the unfolding of the text. For Figure 7 only, the majority of the supplemental figures are called before the principal figure, as they relate to our experimental strategy that we comment on before describing the results.

      (4) Graphs in Fig 4, Fig 7 supplement 1 and some of the supplementary figures miss statistical info for some comparison (I assume when non-significant), and sometimes present a p-value of a statistical test being done between samples across stages - but these are not dealt with in the text. Throughout all graphs, the font size used in graphs for annotation (labelling of samples, x-axis, and in some cases the p values) is very small and difficult to read.

      For Figure 7 - figure supplement 1, non-significant p-values of statistical tests were not displayed (as mentioned in the Figure legend, line 1614 of the original manuscript). For the new Figure 4, all p-values are displayed. For new Figure 4 - figure Supplement 1, statistical tests were only performed to compare RFP+ and RFP- cells in the trunk condition (3 biological replicates) and not in the whole embryo condition, for which we did not perform enough replicates for statistical analysis (biological duplicates).

      (5) The results are generally very difficult to follow, with a fair amount of discussion included but then very little detail of the experiments per se.

      We thank the reviewers for these comments that helped us improve the clarity of the manuscript.

      The Results section was revised to move some of the paragraphs to the introduction (see response to Reviewer 1, 6th comment), and some of them to the Discussion (such as lines 149 to 156 or 410 to 416 in the first version of the manuscript referring to vacuolar structures or to the recycling modes of JAMs in EHT pol+ and EHT pol- cells).

      (6) The truncated version of runx1 is introduced but its expected effect is not explained until the discussion. Related to this, is it expected that blocking runx1 with this construct (leading to accumulation of cells in the aorta before they undergo EHT) then leads to increased numbers of T-cell progenitors in the thymus? Abe et al (2005, J Immunol) have used the same strategy to overexpress the runt domain in thymocytes and found a decrease in these cells, rather than an increase. Can you explain this apparent discrepancy?

      We thank the reviewer for this interesting point on the effect of runx1 interference. This phenotype (increased number of thymic cells) seems to be in agreement with the phenotype that was described in zebrafish using homozygous runx1 mutants (Sood et al. 2010 PMID: 20154212), in which the authors show an increase of lymphoid progenitors in the kidney marrow of adult runx1W84X/W84X mutants compared to controls as well as a similar number of intra-thymic lck:eGFP cells in mutants and controls. Notably, the T-lymphoid lineage seems to be the only lineage spared by the mutation of runx1. This could suggest that in this case either the T-lymphoid lineage can develop independently of runx1 or that a compensation phenomenon (for example by another protein of the runx family) occurs to rescue the generation of T-lymphocytes.

      Although our data show an impact on T-lymphopoiesis, we do not elucidate the exact mechanism leading to an increased number of thymic cells. In our case, we do not know the half-life of our dt-runx1 protein in newly generated hematopoietic cells when our transgene, expressed under the control of the kdrl vascular promoter, ceases to be produced after emergence. The effect we observe could be direct, due to the presence of our mutant protein after 3 days in thymic cells, or indirect, due to the impact of our mutant on the HE, which could lead to the preferential generation of lymphoid-biased progenitors. Similarly, we do not know whether the cells we observe at this stage in the thymus are generated from long-term HSC or short-term progenitors. Indeed, cell tracing analysis from the lab of Zilong Wen (Tian et al. 2017, see our Ref list) shows the simultaneous presence of short-term PBI-derived and long-term AGM-derived thymic cells at 5dpf. Based on this, we can imagine for example that the supernumerary cells we observe in the thymus are transient populations that could multiply faster in the absence of definitive populations. Conversely, based on our observation of an accumulation of EHT pol+ events, we can imagine that the EHT pol+ and EHT pol- cells are indeed differentially fated and that EHT pol+ cells may be biased toward a lymphoid lineage. In addition, at the stage we observe (5dpf), our RNAscope assay for runx1 shows that a vast majority of thymic cells do not express runx1 (our preliminary data), suggesting that the effect we observe would be an indirect one caused by upstream events rather than by direct interference with the endogenous expression of runx1 in thymic cells.

      The article referred to by the reviewer (Sato et al. 2005, PMID: 16177090) investigates the role of runx1 during TCR selection for thymic cell maturation and shows that runx1 signaling lowers the apoptotic sensitivity of double-positive thymocytes when artificially activated, leading to a reduced number of single-positive thymic cells. Furthermore, this paper references another study from the same lab (Hayashi et al. 2000, PMID: 11120804) that used the same strategy to study the role of runx1 in the positive and negative selection steps of T lymphocyte maturation. This paper, although showing that runx1 is important for later stages of T lymphocyte differentiation (the double-positive to single-positive stage maturation), also shows a relative increase in the amount of double-negative and double-positive thymocytes, which could be coherent with our observations. Indeed, in our case, although we show an increased number of thymic cells, we do not know the relative proportion of the different thymocyte subsets. We could explain the increased number of thymic cells by an increased number of DN/DP thymocytes, which would not preclude a decrease in single-positive thymocytes. Finally, the cells we observe in the thymus of our dt-runx1 mutants may also be different lymphoid populations, namely ILCs, that would react differently to runx1 interference.

      (7) Lines 154-155 refer to aquaporins but are missing a reference. This is a bit of speculation right in the results section and I struggled to understand what the point of it was.

      To clarify the argument and ease the flow of the text, as suggested by the reviewers, we transferred this paragraph (lines 149 to 156 of the initial manuscript) to the Discussion section (lines 763-789). We additionally made sure to add the missing reference (Sato et al. 2023, see our Ref list).

      (8) Lines 173-175, indicating that both EHTpol+ and pol- express the CD41 transgenic marker - would be useful to show this data.

      We provide a new supplement Figure (Figure 1 – figure supplement 3), where, using an outcross of the CD41:eGFP and kdrl:mKate2-podxl2 transgenic lines, we show unambiguously and for multiple cells that both polarized EHT pol+ cells and non-polarized EHT pol- cells are CD41 positive. In addition, but not commented on in the main text, we can also see that an HE cell, characterized by its elongated morphology (in the middle of the field), its thickened nucleus and its position on the aortic floor, is also CD41 positive.

      (9) Lines 181-201 - it's not clear how HE cells were identified in the first place - was it just morphology? Or were they identified retrospectively?

      HE cells were identified solely on morphology and spatial criteria (as mentioned in the Methods section, lines 1073-1082 and 1108-1111 of the first manuscript). Furthermore, a recent investigation by the lab of Zilong Wen (Zhao et al. 2022, see our Ref list) questioning the common origin of HE cells and of endothelial cells as well as their respective capacity to extrude from the aorta to generate hematopoietic cells showed, by single-cell tracing, that 96% of floor cells are indeed hemogenic endothelial cells. In addition, as mentioned in the response to the 8th point, we show in Figure 1 – figure supplement 3 that all floor cells express CD41. Finally, we also used an alternative method to validate the true hemogenic identity of aortic floor cells and show, using RNAscope, that virtually 100% of floor cells that we consider as typical HE cells are indeed expressing a hematopoietic transcription factor upstream of Runx1, namely Gata2b (see Author response image 1).

      Author response image 1.

      All cells from the aortic floor, at 48hpf, express the hematopoietic marker Gata2b. 48 hpf Tg(Kdrl:eGFP) fixed embryos were used for RNAscope using a probe designed to detect Gata2b mRNAs. Subsequently, images were taken using spinning disk confocal microscopy. The image in the top panel is a z-projection of the entire aortic volume of one embryo and shows the full portion of the dorsal aorta from the anterior part (left side, at the limit of the balled yolk) down to the urogenital orifice (UGO, right side). The 4 boxes (1 - 4) delineate regions that have been magnified beneath (2X). The 2X images corresponding to each box are z-projections (top views) or z-sections (bottom views). The bottom views make it possible to visualize the aortic floor and to mark its position on the top views. Pink arrows point at HE cells (elongated in the anteroposterior direction) and at EHT cells (ovoid/round cells; EHT pol+ cell morphology is not preserved after fixation and RNAscope; thus, it cannot be distinguished from ovoid/round EHT pol- cells). Pink dots = RNAscope spots of various sizes. The green cells in the subaortic space that are marked by RNAscope spots are newly born hematopoietic stem and progenitor cells (see for example box 1). This embryo is representative of n = 5 embryos treated and imaged.

      (1) Line 276 - the difference between the egfp-podxl2 and mKate-podxl2 - could that be due to the fluorophore used? Also, it would be good to label Fig 3 supplement 2 better and to see a control alongside the runt overexpression.

      Line 276 does not point at a difference in control conditions between eGFP-podxl2 and mKate-podxl2 (see in new Figure 1 – figure supplement 3, Figure 2 or in new Figure 3 - figure supplement 2 several examples of non-polarized HE cells in control conditions using both fluorophores) but between control and dt-runx1 conditions, both expressing the mKate2-podxl2 transgene. Similarly, the new example that we now provide in the CD41 figure (Figure 1 – figure supplement 3) clearly shows that mKate-podxl2 is enriched at the apical/luminal membrane of EHT pol+ cells while no such enrichment is observed for EHT pol- cells. We note that EHT cells are not always the most typical in shape, in particular because cells can be squeezed by underlying tissues, for example the vein, or from the luminal side by flow and tensions on the aortic wall caused by the heartbeat (the further up the trunk we image, the more difficult the imaging and the less stable the cell shape during long time-lapse sequences). To also take into account the reviewer’s comments, we added a control condition next to the dt-runx1 condition in the new Figure 3 – figure supplement 2A.

      (2) There is no quantitation data on the number of excess EHT pol+ cells in the DA, or in the thymus data (Figs 3 Supp1 and Fig 3 Supp 3). Can you quantify this data? This would better support the claim that tuning apico-basal polarity alters the morphology of the emerging HE cells.

      We added quantifications relative to both the emergence process itself, showing the accumulation of HE and EHT pol+ cells (new Figure 3B), and hematopoiesis per se (new Figure 3 – figure supplement 1). Indeed, we show a decrease in the number of newly generated cmyb+ cells in the sub-aortic space. Furthermore, we improved our quantification of the later phenotype in the thymus (new Figure 3 – figure supplement 3), using improved segmentation methods, which indeed validates the increased number of thymic cells that we described.

      (3) The observed changes in pard3 isoforms are just reading out changes in their expression in the runt1 transgenics, rather than demonstrating a role in apico-basal polarity.

      We entirely revised our strategy regarding Pard3 expression analyses (see also the text at the beginning of this file, for the Public Review). However, we wish to stress that we did not initially intend to show directly a role of Pard3 proteins in controlling apico-basal polarity in the system; we only intended to provide correlative evidence supporting our observations with the polarity marker podxl2. As written in the text, interfering with their function would have impaired apico-basal polarity, which is essential for aortic lumenization and maintenance, thereby blurring interpretations.

      During the revision, we obtained the unexpected finding, using RNAscope, that one Pard3 isoform, namely Pard3ba, is the only one that is expressed non-homogeneously along the aortic axis and, in vast majority, by aortic cells in the direct vicinity of emergence domains of the aortic floor (see the new Figure 4 and Figure 4 – figure supplements 2, 3).

      This correlative relation between expression of Pard3ba in aortic endothelial cells neighbouring HE/EHT cells suggests, as we propose, that a cross talk occurs between hemogenic and aortic cells, and that this cross talk relies, at least in part, on the expression of key components of apico-basal polarity and their associated functional features. In addition, we show that junctional recycling differs between both EHT types, based on our observations on the different dynamics in the turnover of JAM molecules, in the two EHT types. As JAM molecules are also required for the recruitment of Pard3, which initiates the establishment of apico-basal polarity, these different dynamics suggest that the control of apico-basal polarity is involved in supporting the morphodynamic complexity of EHT cell types.

      (4) There is a Fig 5, Supp 2 that is neither mentioned nor described anywhere in the manuscript.

      Figure 5 - figure Supplement 2 is mentioned in lines 366-370 of the original manuscript, to describe the initial validation that was performed for our eGFP-JAM constructs in multiple cell types using a ubiquitous heat-shock promoter. We expanded our description of this supplemental figure in the new manuscript (lines 504 to 514).

      (5) Lines 445-456 - these read like a bit of discussion, not results. There are other similar parts of the results section that also read like a discussion (e.g. 526-533)

      Although we decided to keep this paragraph in the Results section, as it justifies the rationale behind the choice of ArhGEF11/PDZ-RhoGEF, we took the reviewer's comment into account and, as mentioned in the response to Reviewer 1's 6th comment, lightened the Results section by transferring some of the paragraphs to the Introduction or Discussion sections.

      (6) The description of Fig 7A (from line 505) is missing the stages at which the experiments were performed (also not labelled on the figure).

      The stages at which the experiments were performed are stated in the figure legend (line 1366) as well as in the Methods section of the original manuscript (line 1033). We added the information on top of panels A and B for more clarity.

      (7) Some figures have multiple panels (e.g. Fig 7Aa'), so when referred to in the text, it remains unclear which panel is being referred to.

      We modified the text so as to refer more clearly to the different panels when mentioned in the text, particularly with regards to Figure 7 and 8 but also for all the other figures.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This important study investigates the transcriptional changes in neurons that underlie loss of learning and memory with age in C. elegans, and how cognition is maintained in insulin/IGF-1-like signaling mutants. The presented evidence is convincing, utilizing a cutting-edge method to isolate neurons from worms for genomics that is clearly conveyed with a rigorous experimental approach. Overall, this study supports that older daf-2 worms maintain cognitive function via mechanisms that are unique from younger wild type worms, which will be of interest to neuroscientists and researchers studying ageing.

      Thank you, we appreciate the positive comments.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      The authors perform RNA-seq on FACS-isolated neurons from adult worms at days 1 and 8 of adulthood to profile the gene expression changes that occur with cognitive decline. Supporting data are included indicating that by day 7 of adulthood, learning and memory are reduced, indicating that this time point or after represents cognitively aged worms. Neuronal identity genes are reduced in expression within cognitively aged worms, whereas genes involved in proteostasis, transcription/chromatin, and stress response are elevated. A number of specific examples are provided, representing markers of specific neuronal subtypes, and correlating expression changes to the erosion of particular functions (e.g. motor neurons, chemosensory neurons, aversive learning neurons, etc). 

      To investigate whether the upregulation of genes in neurons with age is compensatory or deleterious, the authors reduced the expression of a set of three significantly upregulated genes and performed behavioral assays in young adults. In each case, reduction of expression improved memory, consistent with a model in which age-associated increases impair neuronal function. This claim would be bolstered by an experiment elevating the expression of these genes in young neurons, which should reduce the learning index if the hypothesis is correct. 

      This is an interesting suggestion. Our long-term goal is to find ways to improve memory, and to better understand the “rules” that might govern changes with age. In this case, we were interested in addressing the hypothesis that genes that rise with age must be compensatory, which is a frequently stated theory that is not often tested. Here we showed that knocking down three genes that are upregulated in aged animals improved memory; our results suggest that the wild-type functions of these genes are likely deleterious for learning and memory functions, and further, that their increased expression with age is not a compensatory function. Certainly for future work, it might be interesting to better understand how and why these specific genes have a deleterious function that increases with age, and whether that function is different in younger animals where they are not highly expressed.

      The authors then characterize learning and memory in wild-type, daf-2, and daf-2/daf-16 worms with age and find that daf-2 worms have an extended ability to learn for approximately 10 days longer than wild types. This was daf-16 dependent. Memory was extended in daf-2 as well, and strikingly, daf-2;daf-16 had no short-term memory even at day 1. Transcriptomic analysis of FACS-sorted neurons was performed on the three groups at day 8. The authors focus their analysis on daf-2 vs. daf-2;daf-16 and present evidence that daf-2 neurons express a stress-resistance gene program. One question that remains unanswered is how well the N2 and daf-2;daf-16 correlate overall, and are there differences? This may be informative as wild type and daf-2;daf-16 mutants are not phenotypically identical when it comes to memory, and there may be differences that can be detected despite the overlap in the PCA. This analysis could reveal the daf-16 targets involved in memory. 

      Re. daf-2;daf-16 vs N2: This is a good suggestion. Our analysis in Fig. S5 showed that the daf-2 vs N2 comparison gives results similar to the daf-2 vs daf-16;daf-2 comparison, but some additional genes are differentially expressed. Interestingly, the daf-2 vs N2 comparison shows that the bZip transcription factors are upregulated in daf-2 compared with N2 worms (Fig. S6f). This may indicate that, in addition to the DAF-16/FOXO transcription factor, other transcription factors are controlled by the daf-2 mutation in the nervous system.

      Author response image 1.

      We also identified the differentially expressed genes in the Day 8 neuronal daf-16;daf-2 to N2 comparison, as the reviewer asks. The samples from different genotypes do separate from one another in the PCA plot, indicating there are differences between daf-16;daf-2 and N2 neurons. However, the difference is smaller and there are fewer genes differentially expressed between daf-16;daf-2 and N2: only 38 genes are significantly higher in daf-16;daf-2, and only 53 genes are significantly higher in N2 (log2FC > 0.5, p-adj < 0.05). The genes higher in N2 are enriched in endopeptidase inhibitors, and the genes higher in daf-16;daf-2 are not enriched in any gene ontology terms. These results indicate that there are some differences between daf-16;daf-2 and N2 neurons, which correlates with the behavioral differences we see, but the difference is small compared to daf-2 neurons. We have added these data to the paper (Fig. S4e,f); thank you for the suggestion.

      The authors tested eight candidate genes that were more highly expressed in daf-2 neurons vs. daf-2;daf-16 and showed that reduction of 2 and 5 of these genes impaired learning and memory, respectively, in daf-2 worms. This finding implicates specific neuronal transcriptional targets of IIS in maintaining cognitive ability in daf-2 with age, which, importantly, are distinct from those in young wild type worms. 

      Reviewer #2 (Public Review): 

      Weng et al. perform a comprehensive study of gene expression changes in young and old animals, in wild-type and daf-2 insulin receptor mutants, in the whole animal, and specifically in the nervous system. Using this data, they identify gene families that are correlated with neuronal ageing, as well as a distinct set of genes that are upregulated in neurons of aged daf-2 mutants. This is particularly interesting as daf-2 mutants show both extended lifespans and healthier neurons in aged animals, reflected by better learning/memory in older animals compared with wild-type controls. Indeed, the knockdown of several of these upregulated genes resulted in poorer learning and memory. In addition, the authors showed that several genes upregulated during ageing in wild-type neurons also contribute to learning and memory; specifically knockdown of these genes in young animals resulted in improved memory. This indicates that (at least in this small number of cases), genes that show increased transcript levels with age in the nervous system somehow suppress memory, potentially by having damaging effects on neuronal health. 

      Finally, from a resource perspective, the neuronal transcriptome provided here will be very useful for C. elegans researchers as it adds to other existing datasets by providing the transcriptome of older animals (animals at day 8 of adulthood) and demonstrating the benefits of performing tissue-specific RNAseq instead of whole-animal sequencing. 

      Thank you!

      The work presented here is of high quality and the authors present convincing evidence supporting their conclusions.

      Thanks!

      I only have a few comments/suggestions: 

      (1) Do the genes identified to decrease learning/memory capacity in daf-2 animals (Figure 4d/e) also impact neuronal health? daf-2 mutant worms show delayed onset of age-related changes to neuron structure (Tank et al., 2011, J Neurosci). Does knockdown of the genes shown to affect learning also affect neuron structure during ageing, potentially one mechanism through which they modulate learning/memory? 

      Thank you for this suggestion, which would be a good future direction, particularly for genes that might have some relationship to previously identified cellular structural processes. The genes we tested here include dod-24, alh-2, mtl-1, F08H9.4, C44B7.5, hsp-12.3, hsp-12.6, and cpi-1, which fall into stress response, proteolysis inhibitor, metabolic, and innate immunity GO categories and are thus associated with stress resistance, proteolysis, and lipid metabolism processes; none are obvious choices for morphological effects.

      However, it is worth noting that learning and memory decline much faster (Days 4-8) than morphological differences are observed (generally after Day 12-15). Moreover, those morphological differences have been studied primarily in mechanosensory neurons (touch neurons) rather than the chemosensory neurons that are involved in learning and memory, so additional genes that we were not focusing on in this study may be required for those differences.

      (2) The learning and memory assay data presented in this study uses the butanone olfactory learning paradigm, which is well established by the same group. Have the authors tried other learning assays when testing for learning/memory changes after the knockdown of candidate genes? Depending on the expression pattern of these genes, they may have more or less of an effect on olfactory learning versus for example gustatory or mechanosensory-based learning. 

      We use the butanone olfactory learning paradigm because it is more similar to learning of information (association of a neutral odorant with a positive cue, food), which is the kind of memory we would like to preserve in humans, rather than a stress-induced memory, such as starvation or pathogenesis-associated aversive learning paradigms, which are more like PTSD. (There is likely to be quite a bit of overlap in mechanism, however, including the role of genes such as magi-1 and casy-1, so it would not be surprising if many of these genes were also required for other learning paradigms.)

      (3) I have a comment on the 'compensatory vs dysregulatory' model as stated by the authors on page 7. I understand that this model presents the two main options, but perhaps this is slightly too simplistic: the gene expression that rises during ageing may be detrimental for memory (= dysregulatory), but at the same time may also be beneficial for other physiological roles in other tissues (=compensatory). 

      This is a good point, and we have made that clarification in the text: “There may be other scenarios in which a gene with multiple functions may be detrimental for some behaviors but beneficial for other physiological roles.”

      Reviewer #3 (Public Review): 

      Summary: 

      In this manuscript, Weng et al. detect a neuron-specific transcriptome that regulates aging. The authors first profile neuron-specific responses during aging at a time point where a loss in memory function is present. They discover signatures unique to neurons which validate their pipeline and reveal the loss of neuron identity with age. For example, old neurons reduce the expression of genes related to synaptic function and neuropeptide signaling and increase the expression of chromatin regulators, insulin peptides, and glycoproteins. The authors discover the detrimental effect of selected upregulated genes (utx-1, ins-19, and nmgp-1) by knocking them down in the whole body and detecting improvement of short memory functions. They then use their pipeline to test neuronal profiles of long-lived insulin/IGF mutants. They discover that genes related to stress response pathways are upregulated upon longevity (e.g. dod-24, F08H9.4) and that they are required for improved neuron function in long-lived individuals. 

      Strengths: 

      Overall, the manuscript is well-written, and the experiments are well-described. The authors take great care to explain their reasoning for performing experiments in a specific way and guide the reader through the interpretation of the results, which makes this manuscript an enjoyable and interesting read. Using neuron-specific transcriptomic analysis in aged animals the authors discover novel regulators of learning and memory, which underlines the importance of cell-specific deep sequencing. The time points of the transcriptomic profiling are elegantly chosen, as they coincide with the loss of memory and can be used to specifically reveal gene expression profiles related to neuron function. The authors showcase on the dod-24 example how powerful this approach is. In long-lived insulin/IGF-1 receptor mutants body-wide dod-24 expression differs from neuron-specific profiles. Importantly, the depletion of dod-24 has an opposing effect on lifespan and learning memory. The dataset will provide a useful resource for the C. elegans and aging community. 

      Thank you, we do hope people will find the data useful.

      Weaknesses: 

      While this study nicely describes the neuron-specific profiles, the authors do not test the relevance in a tissue-specific way. It remains unclear if modifying the responses only in neurons has implications for either memory or potentially for lifespan. The authors point to this in the text and refer to tissue-specific datasets. However, it is possible that the tissue-specific profile changes with age. The authors should consider mining publicly available cell-specific aging datasets and performing neuron-specific RNAi to test the functional relevance of the neuron-specific response. This would strengthen the importance of cell-specific profiling.

      Thank you for your suggestions. As we have mentioned in the text, our candidate genes are either (1) expressed only in neurons (alh-2 and F08H9.4), or (2) more highly expressed in daf-2 than in wild type only in the nervous system (C44B7.5 and dod-24). Thus, the effects we see from knocking down these genes in daf-2 are likely neuron-specific. Additionally, we performed our assays with the neuron-sensitive RNAi strain CQ745: daf-2(e1370) III; vIs69 [pCFJ90(Pmyo-2::mCherry + Punc-119::sid-1)] V. It has been previously shown that neuronal expression of sid-1 decreases non-neuronal RNAi, suggesting that neurons expressing transgenic sid-1(+) serve as a sink for dsRNA (Calixto et al., 2010). Thus, this neuron-sensitive RNAi is likely neuron-specific, and our results are unlikely to arise from knocking down these genes in non-neuronal tissues. However, we do acknowledge this issue.

      To identify the expression pattern of these genes in a more cell-specific way in the adults, we examined the expression of our candidate genes that affected learning and memory, namely dod-24, F08H9.4, C44B7.5, alh-2, and mtl-1, in the Calico database (Roux et al., 2023). From that database, we can see that dod-24 is mainly expressed in the PHC and PVM neurons, and F08H9.4 is largely expressed in various neurons. Both have only slight expression outside the nervous system. C44B7.5 and mtl-1 are more broadly expressed, but C44B7.5 was not found to be differentially expressed in other tissues in daf-2, and mtl-1 only had a slight effect on learning and memory. Perhaps due to their sequencing depth and detection limit, Roux et al. didn’t detect alh-2 expression anywhere in their data.

      Thus, the neuron-specific expression and daf-2 differential expression pattern of these genes indicate that the learning and memory improvement in aged daf-2 is unlikely due to neuronal non-autonomous effects.

      To better address the concern that, for the genes that we found to be expressed only in neurons, the neuron-confined expression may change with age, we examined how the expression pattern of these genes changes with age. As is shown below, from the Calico database, we can see that the expression in the nervous system persists, and even slightly increases, with age; thus age-related changes in expression pattern are not a concern for our analysis.

      Author response image 2.

      Author response image 3.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      Most of my comments are in the public section. A few additional recommendations for the authors regarding the formatting/presentation: 

      The presentation of Figure S6e-h in the introduction is somewhat confusing and feels out of order. If presented first, it should be S1. Otherwise, discussion of this figure should go at the end of the results section or in the discussion if appropriate. 

      Thank you for pointing this out. We have moved the discussion of this figure to the Discussion section.

      I do not see Figure S5 described in the text.

      Good catch, thank you. We have added the descriptions for Figure S5 in the text.

      In general, check the figures, figure legends, and how they are referenced in the text, particularly the supplemental figures and legends.

      Minor comments:

      There is a typo in the Figure 4 legend: Neuronal IIX should be IIS. 

      Thanks for pointing this out. We have corrected it in the text.

      Reviewer #2 (Recommendations For The Authors): 

      • There are multiple instances throughout the manuscript where there are statements in brackets that provide justification or explanation for some of the approaches used. There is no reason for 'side note' brackets to be used. I suggest removing them and incorporating these statements into the narrative.

      Thank you, we have now incorporated these points into the main text.

      • Introduction: page 4 "here we RNA-sequenced FACS-isolated neurons" should be "here we performed RNA sequencing on FACS-isolated neurons...".

      Thank you, we have changed the text accordingly.

      • Figure 2A: I do not understand the legend for this panel "Tissue Query for wild-type genes expressed at higher levels in aged worms show lower nervous system and neuron prediction score." Please clarify.

      We have clarified the Figure 2A legend:

      (A)  Tissue prediction score for wild-type genes expressed at higher levels in aged worms.

      • Page 8: "We previously observed that loss of single genes that play a role in complex behaviors like learning and memory can have a large impact on function 60, unlike the additive roles of longevity-promoting genes 11." - a large impact on what function?

      Thank you for noting, we have clarified it in the text accordingly:

      “We previously observed that for genes that play a role in complex behaviors like learning and memory, the loss of single genes can have a large impact on these complex behaviors 60, unlike the additive roles of longevity-promoting genes 11.”

      • Next line "Therefore, one mechanism by which wild-type worms lose their function with age..." - again, what function?

      Thank you for noting this, we have clarified the text to say we refer to the learning and memory functions.

      • Page 9: "Thus, daf-2 mutants maintain their higher cognitive quality of life longer than wild-type worms, while daf-16;daf-2 mutants spend their whole lives without memory ability (Figure 3d), in contrast to claims that daf-2 mutants are less healthy than wild-type or daf-16 worms23." - since ref 23 did not perform any learning/memory tests, the definition of 'health' in ref 23 is different to 'cognitive health' as studied here. So the findings in this study are not 'in contrast' to ref 23 but rather add to these findings.

      Learning and memory ability is an important function for a healthy individual; thus we would assert that cognitive health is indeed an important part of the “health” of daf-2 worms. In ref 23, Bansal et al. claim that daf-2 worms are less healthy without assessing their learning and memory ability; their lack of data is an insufficient reason for us to remove our statement, as cognitive health is part of healthspan. Here we find that the “learning span” of daf-2 is at least proportionally as long as, if not longer than, that of wild type. We have also previously shown that daf-2 worms have a longer maximum velocity span with age (Hahm et al., 2015), in direct contrast with Bansal et al.’s claim that daf-2 worms move less well and thus are less healthy: daf-2 worms simply stop sooner when presented with food and switch to feeding, due to their higher odr-10 levels. The Bansal paper continues to be frequently cited as finding that daf-2 mutants are less healthy than wild type, a claim for which we can still find no supporting experimental evidence. Therefore, it is important that we make the point that daf-2 worms have extended cognitive health, which is part of healthspan.

      • Page 13: I feel like the sentence "Furthermore, memory maintenance with age might require additional functions that were not previously uncovered in analyses of young animals" is both vague (what functions are referred to?) and a little bit obvious (obvious that age-related changes would not be revealed in analyses of young animals). Perhaps rephrase to make the desired point clearer? 

      We have clarified the sentence in the text:

      “Furthermore, memory maintenance with age might require additional genes that function in promoting stress resistance and neuronal resilience, which were not previously uncovered in analyses of young animals.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary and Strengths:

      The ability of Wolbachia to be transmitted horizontally during parasitoid wasp infections is supported by phylogenetic data here and elsewhere. Experimental analyses have shown evidence of wasp-to-wasp transmission during coinfection (eg Huigins et al), host to wasp transmission (eg Heath et al), and mechanical ('dirty needle') transmission from host to host (Ahmed et al). To my knowledge this manuscript provides the first experimental evidence of wasp to host transmission. Given the strong phylogenetic pattern of host-parasitoid Wolbachia sharing, this may be of general importance in explaining the distribution of Wolbachia across arthropods. This is of interest as Wolbachia is extremely common in the natural world and influences many aspects of host biology.

      Weaknesses:

      The first observation of the manuscript is that the Wolbachia strains in hosts are more closely related to those in their parasitoids. This has been reported on multiple occasions before, dating back to the late 1990s. The introduction cites five such papers (the observation is made in other studies too that could be cited) but then dismisses them by stating "However, without quantitative tests, this observation could simply reflect a bias in research focus." As these studies include carefully collected datasets that were analysed appropriately, I felt this claim of novelty was rather strong. It is unclear why downloading every sequence in GenBank avoids any perceived biases, when presumably the authors are reanalysing the data in these papers.

      Thank you for bringing this to our attention. In this study, we downloaded all wsp sequences from GenBank and conducted a systematic analysis. We acknowledge that there could still be a bias in research focus, but a systematic analysis, compared to a limited dataset, may reduce this bias. We agree with the reviewer's point, and we have revised this statement to make it more accurate. Now the new sentence reads: "However, there is still a lack of systematic statistical analyses to support this hypothesis." (Lines 69–70 in the revised manuscript)

      I do not doubt the observation that host-parasitoid pairs tend to share related Wolbachia, as it is corroborated by other studies, the effect size is large, and the case study of whitefly is clearcut. It is also novel to do this analysis on such a large dataset. However, the statistical analysis used is incorrect as the observations are pseudo-replicated due to phylogenetic non-independence. When analysing comparative data like this it is essential to correct for the confounding effects of related species tending to be similar due to common ancestry. In this case, it is well-known that this is an issue as it is a repeated observation that related hosts are infected by related Wolbachia. However, the authors treat every pairwise combination of species (nearly a million pairs) as an independent observation. Addressing this issue is made more complex because there are both the host and symbiont trees to consider. The additional analysis in lines 123-124 (including shuffling species pairs) does not explicitly address this issue.

      We agree with your point about the non-independence of data due to phylogenetic relationships. In the analysis of species traits, a conventional phylogenetic correction assumes that traits follow a Brownian motion model (Felsenstein, 1985). The variance of the trait values for a species i is given by:

      $$\mathrm{Var}[Y_i] = \sigma^2 T_i,$$

      where $T_i$ represents the time from the root to the tip for species $i$. Consequently, the covariance between traits of species $i$ and $j$ is:

      $$\mathrm{Cov}[Y_i, Y_j] = \sigma^2 T_{ij},$$

      where $T_{ij}$ is the time from the root to the most recent common ancestor (MRCA) of species $i$ and $j$. Linear model analysis incorporates the covariance matrix to correct for the effects of non-independence. Mathematically, this method is equivalent to the independent contrasts approach (Felsenstein, 1985).

      In our analysis, we treat the minimum interspecific wsp distance between two species as a trait for the species pair $(i, j)$. Similarly, for any two pairs of species $(i, j)$ and $(k, l)$, we postulate that the covariance between their traits is given by:

      $$\mathrm{Cov}[Y_{ij}, Y_{kl}] = \sigma^2 \cdot (T_{ik} + T_{jl}),$$

      where $T_{ik}$ denotes the time from the root to the MRCA of species $i$ and $k$, and $T_{jl}$ represents the time from the root to the MRCA of species $j$ and $l$. This covariance matrix is then incorporated into our linear model analysis to account for the effects of phylogenetic non-independence.
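
      As a minimal sketch of how such a pair-level covariance matrix could be assembled, the Python snippet below builds it from a matrix of root-to-MRCA times. The four-species tree, its branch times, and the names `T` and `pair_covariance` are illustrative assumptions and are not taken from the deposited scripts.

```python
import itertools
import numpy as np

# Hypothetical shared-time matrix for a toy ultrametric tree ((0,1),(2,3)):
# T[i, j] is the time from the root to the MRCA of species i and j,
# and T[i, i] is the root-to-tip time of species i. Values are made up.
T = np.array([
    [1.0, 0.6, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.3],
    [0.0, 0.0, 0.3, 1.0],
])

def pair_covariance(T, pairs, sigma2=1.0):
    """Return the matrix with entries sigma2 * (T[i, k] + T[j, l])
    for every two species pairs (i, j) and (k, l)."""
    first = np.array([p[0] for p in pairs])   # first member of each pair
    second = np.array([p[1] for p in pairs])  # second member of each pair
    return sigma2 * (T[np.ix_(first, first)] + T[np.ix_(second, second)])

# All unordered species pairs: (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
pairs = list(itertools.combinations(range(T.shape[0]), 2))
V = pair_covariance(T, pairs)
print(V.shape)  # one row and one column per species pair
```

      Indexing the shared-time matrix once with the first members and once with the second members of the pairs avoids an explicit double loop over pairs; the result is symmetric because `T` is symmetric.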

      However, when extending trait analysis to pairs of species, the computational demands increase substantially. For instance, with a dataset of 1,377 species, forming all possible pairs yields 947,376 unique species combinations. Consequently, constructing a covariance matrix for these pairs would necessitate storing 897,521,285,376 entries, a requirement that far exceeds the memory capabilities of standard computing systems.
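
      The counts quoted above can be checked with a few lines of Python. The memory estimate additionally assumes a dense matrix stored as 8-byte floating-point values, which is our assumption for illustration and is not stated in the analysis.

```python
from math import comb

n_species = 1377
n_pairs = comb(n_species, 2)   # unordered species pairs
n_entries = n_pairs ** 2       # entries of a dense pair-by-pair covariance matrix

# Assumption: each entry is stored as an 8-byte float (float64).
terabytes = n_entries * 8 / 1e12

print(n_pairs)     # 947376
print(n_entries)   # 897521285376
print(round(terabytes, 2))  # 7.18
```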

      To address this, we randomly sampled 1,000 pairs from the total of 947,376 species pairs within the 'Others' category, thereby reducing the computational load without compromising the representativeness of our analysis. Ultimately, even after accounting for phylogenetic correction using covariance, the effect of parasitism remains highly significant (p < 0.0001).
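
      For readers unfamiliar with how a covariance matrix enters a linear model, the following is a minimal generalized least squares (GLS) sketch in Python. The data are synthetic, the covariance matrix is a random positive-definite stand-in rather than one derived from a phylogeny, and the names `gls_fit` and `is_parasitism` are hypothetical; it is not the deposited analysis script.

```python
import numpy as np

def gls_fit(X, y, V):
    """Generalized least squares estimate: (X' V^-1 X)^-1 X' V^-1 y."""
    V_inv_X = np.linalg.solve(V, X)  # V^-1 X without forming V^-1 explicitly
    V_inv_y = np.linalg.solve(V, y)  # V^-1 y
    return np.linalg.solve(X.T @ V_inv_X, X.T @ V_inv_y)

rng = np.random.default_rng(0)
n = 50  # number of species pairs in this toy example

# Synthetic positive-definite covariance standing in for the phylogenetic one.
A = rng.normal(size=(n, n))
V = A @ A.T / n + np.eye(n)

# Design matrix: intercept plus an indicator for host-parasitoid pairs.
is_parasitism = (np.arange(n) < 10).astype(float)
X = np.column_stack([np.ones(n), is_parasitism])

# Simulate a trait with correlated noise and a made-up parasitism effect.
true_beta = np.array([0.5, -0.3])
noise = 0.01 * (np.linalg.cholesky(V) @ rng.normal(size=n))
y = X @ true_beta + noise

beta = gls_fit(X, y, V)
print(beta)  # estimated intercept and parasitism effect
```

      Weighting by the inverse covariance is what lets correlated observations (here, species pairs that share ancestry) count for less than independent ones when the effect is estimated.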

      We have added a “Phylogenetic correction” section to Materials and Methods (Lines 392–405 in the revised manuscript). The corresponding results are described on lines 120–121 and in supplementary Note 1. The data and scripts for this analysis are available at https://doi.org/10.6084/m9.figshare.24718119.

      REFERENCE

      Felsenstein J, 1985. Phylogenies and the comparative method. The American Naturalist, 125(1), 1-15.

      The sharing of Wolbachia between whitefly and their parasitoids is very striking, although this has been reported before (eg the authors recently published a paper entitled "Diversity and Phylogenetic Analyses Reveal Horizontal Transmission of Endosymbionts Between Whiteflies and Their Parasitoids"). In Lines 154-164 it is suggested that from the tree the direction of transfer between host and parasitoid can be inferred from the data. This is not obvious to me given the poor resolution of the tree due to low sequence divergence. There are established statistical approaches to test the direction of trait changes on a tree that could have been used (a common approach is to use the software BEAST).

      We thank the reviewer for this constructive feedback on our interpretation of Wolbachia transfer between whiteflies and their parasitoids. Inspired by the reviewer's comments, we have now incorporated a trait-based approach, using the taxonomic order of the source species of the wsp gene as a discrete trait for ancestral state reconstruction on the wsp tree. The estimated ancestral trait state for one clade, which clusters wsp sequences from whiteflies and parasitoids, is Hymenoptera, suggesting that within this clade, the direction of Wolbachia transfer may have been from parasitoids to hosts. Conversely, in another clade characterized by the ancestral trait state of Hemiptera, the inferred direction of transfer appears to be from hosts to parasitoids. We have added an “Ancestral state reconstruction” section to Materials and Methods (Lines 406–412 in the revised manuscript). The corresponding results are described on lines 159–163 and 167–168. The data and script for this analysis are available at https://doi.org/10.6084/m9.figshare.24718119.
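
      To illustrate the general idea of reconstructing a discrete ancestral state, the sketch below applies Fitch parsimony to a small hypothetical clade in Python. This response does not state which reconstruction method or software was used, so Fitch parsimony is swapped in purely as an illustration; the tree, the tip names, and the function `fitch_root_states` are all invented for the example.

```python
def fitch_root_states(tree, tip_states):
    """Bottom-up pass of Fitch parsimony on a binary tree of nested tuples.

    Leaves are tip names (str); internal nodes are 2-tuples of subtrees.
    Returns the set of candidate states at the root of `tree` and the
    minimum number of state changes implied below it.
    """
    if isinstance(tree, str):
        return {tip_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_root_states(left, tip_states)
    right_set, right_cost = fitch_root_states(right, tip_states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # Children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical clade: three parasitoid-derived sequences and one whitefly-derived one.
clade = ((("wasp_A", "wasp_B"), "whitefly_A"), "wasp_C")
orders = {
    "wasp_A": "Hymenoptera",
    "wasp_B": "Hymenoptera",
    "whitefly_A": "Hemiptera",
    "wasp_C": "Hymenoptera",
}

root_set, changes = fitch_root_states(clade, orders)
print(root_set, changes)
```

      In this toy clade the root is reconstructed as Hymenoptera with a single change, which is the pattern one would read as a transfer from a parasitoid lineage into a host lineage.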

      Reviewer #2 (Public Review):

      The paper by Yan et al. aims to provide evidence for horizontal transmission of the intracellular bacterial symbiont Wolbachia from parasitoid wasps to their whitefly hosts. In my opinion, the paper in its current form consists of major flaws.

      Weaknesses:

      The dogma in the field is that although horizontal transmission events of Wolbachia occur, in most systems they are so rare that the chances of observing them in the lab are very slim.

      For the idea of bacteria moving from a parasitoid to its host, the authors have rightfully cited the paper by Hughes, et al. (2001), which presents the main arguments against the possibility of documenting such transmissions. Thus, if the authors want to provide data that contradict the large volume of evidence showing the opposite, they should present a very strong case.

      In my opinion, the paper fails to provide such concrete evidence. Moreover, it seems the work presented does not meet the basic scientific standards.

      We are grateful for your critical perspective on our work. Nonetheless, we are confident in the credibility of our findings regarding the horizontal transmission of Wolbachia from En. formosa to B. tabaci. Our study has documented this phenomenon through phylogenetic tree analyses, and we have further substantiated our observations with rigorous experiments in both cages and petri dishes. The horizontal transfer of Wolbachia was confirmed via PCR, with the wsp sequences in B. tabaci showing complete concordance with those in En. formosa. Additionally, we utilized FISH, vertical transmission experiments, and phenotypic assays to demonstrate that the transferred Wolbachia could be vertically transmitted and induce significant fitness cost in B. tabaci. All experiments were conducted with strict negative controls and a sufficient number of replicates to ensure reliability, thereby meeting basic scientific standards. The collective evidence we present points to a definitive case of Wolbachia transmission from the parasitoid En. formosa to the whitefly B. tabaci.

      My main reservations are:

      - I think the distribution pattern of bacteria stained by the probes in the FISH pictures presented in Figure 4 looks very much like Portiera, the primary symbiont found in the bacteriome of all whitefly species. In order to make a strong case, the authors need to include Portiera probes along with the Wolbachia ones.

      We thank you for your critical evaluation regarding the specificity of FISH in our study. We are confident in the reliability of our FISH results for several reasons.

      (1) We implemented rigorous negative controls which exhibited no detectable signal, thereby affirming the specificity of our hybridization. (2) The central region of the whitefly nymphs is a typical oviposition site for En. formosa. Post-parasitism, we observed FISH signals around the introduced parasitoid eggs, distinct from bacteriocyte cells which are rich in endosymbionts including Portiera (Fig. 3e-f). This observation supports the high specificity of our FISH method. (3) In the G3 whiteflies, we detected the presence of Wolbachia in bacteriocytes in nymphs and at the posterior end of eggs in adult females (Fig. 4). This distribution pattern aligns with previously reported localizations of Wolbachia in B. tabaci (Shi et al., 2016; Skaljac et al., 2013). Furthermore, the distribution of Wolbachia in the whiteflies does indeed exhibit some overlap with that of Portiera (Skaljac et al., 2013; Bing et al., 2014). (4) The probes used in our FISH assays have been widely cited (Heddi et al., 1999) and validated in studies on B. tabaci and other systems (Guo et al., 2018; Hegde et al., 2024; Krafsur et al., 2020; Rasgon et al., 2006; Uribe-Alvarez et al., 2019; Zhao et al., 2013).

      Taking all these points into consideration, we stand by the reliability of our FISH results.

      REFERENCES

      Bing XL, Xia WQ, Gui JD, et al., 2014. Diversity and evolution of the Wolbachia endosymbionts of Bemisia (Hemiptera: Aleyrodidae) whiteflies. Ecol Evol, 4(13):2714-37.

      Guo Y, Hoffmann AA, Xu XQ, et al., 2018. Wolbachia-induced apoptosis associated with increased fecundity in Laodelphax striatellus (Hemiptera: Delphacidae). Insect Mol Biol, 27:796-807.

      Heddi A, Grenier AM, Khatchadourian C, Charles H, Nardon P, 1999. Four intracellular genomes direct weevil biology: nuclear, mitochondrial, principal endosymbiont, and Wolbachia. Proc Natl Acad Sci USA, 96:6814-6819.

      Hegde S, Marriott AE, Pionnier N, et al., 2024. Combinations of the azaquinazoline anti-Wolbachia agent, AWZ1066S, with benzimidazole anthelmintics synergise to mediate sub-seven-day sterilising and curative efficacies in experimental models of filariasis. Front Microbiol, 15:1346068.

      Krafsur AM, Ghosh A, Brelsfoard CL, 2020. Phenotypic response of Wolbachia pipientis in a cell-free medium. Microorganisms, 8.

      Rasgon JL, Gamston CE, Ren X, 2006. Survival of Wolbachia pipientis in cell-free medium. Appl Environ Microbiol, 72:6934-6937.

      Shi P, He Z, Li S, et al., 2016. Wolbachia has two different localization patterns in whitefly Bemisia tabaci AsiaII7 species. PLoS One, 11: e0162558.

      Skaljac M, Zanić K, Hrnčić S, et al., 2013. Diversity and localization of bacterial symbionts in three whitefly species (Hemiptera: Aleyrodidae) from the east coast of the Adriatic Sea. Bull Entomol Res, 103(1):48-59.

      Uribe-Alvarez C, Chiquete-Félix N, Morales-García L, et al., 2019. Wolbachia pipientis grows in Saccharomyces cerevisiae evoking early death of the host and deregulation of mitochondrial metabolism. MicrobiologyOpen, 8: e00675.

      Zhao DX, Zhang XF, Chen DS, Zhang YK, Hong XY, 2013. Wolbachia-host interactions: Host mating patterns affect Wolbachia density dynamics. PLoS One, 8: e66373.

      - If I understand the methods correctly, the phylogeny presented in Figure 2a is supposed to be based on a wide search for Wolbachia wsp gene done on the NCBI dataset (p. 348). However, when I checked the origin of some of the sequences used in the tree to show the similarity of Wolbachia between Bemisia tabaci and its parasitoids, I found that most of them were deposited by the authors themselves in the course of the current study (I could not find this mentioned in the text), or originated in a couple of papers that in my opinion should not have been published to begin with.

      We appreciate your meticulous examination of the sources for our sequence data. All the sequences included in our phylogenetic analysis were indeed downloaded from the NCBI database as of July 2023. The sequences used to illustrate the similarity of Wolbachia between B. tabaci and its parasitoids include those from our previously published study (Qi et al., 2019), which were sequenced from field samples. Additionally, some sequences were obtained from other laboratories (Ahmed et al., 2009; Baldo et al., 2006; Van Meer et al., 1999). We acknowledge that in our prior research (Qi et al., 2019), the sequences were directly submitted to NCBI and, regrettably, we did not update the corresponding publication information after the article was published. It is not uncommon for sequences on NCBI to lack publication information: some are never followed by a published paper (e.g., FJ710487-FJ710511 and JF426137-JF426149), and others do not have their associated publication details updated post-publication (for instance, sequences MH918776-MH918794 from Qi et al., 2019, and KF017873-KF017878 from Fattah-Hosseini et al., 2018). We recognize that this practice can lead to confusion and apologize for the oversight in our work.

      REFERENCES

      Ahmed MZ, Shatters RG, Ren SX, Jin GH, Mandour NS, Qiu BL, 2009. Genetic distinctions among the Mediterranean and Chinese populations of Bemisia tabaci Q biotype and their endosymbiont Wolbachia populations. J Appl Entomol, 133:733-741.

      Baldo L, Dunning Hotopp JC, Jolley KA, et al., 2006. Multilocus sequence typing system for the endosymbiont Wolbachia pipientis. Appl Environ Microbiol. 72(11):7098-110.

      Fattah-Hosseini S, Karimi J, Allahyari H, 2018. Molecular characterization of Iranian Encarsia formosa Gahan populations with natural incidence of Wolbachia infection. J Entomol Res Soc, 20(1):85–100.

      Qi LD, Sun JT, Hong XY, Li YX, 2019. Diversity and phylogenetic analyses reveal horizontal transmission of endosymbionts between whiteflies and their parasitoids. J Econ Entomol, 112(2):894-905.

      Van Meer MM, Witteveldt J, Stouthamer R, 1999. Phylogeny of the arthropod endosymbiont Wolbachia based on the wsp gene. Insect Mol Biol, 8(3):399-408.

      - The authors fail to discuss or even acknowledge a number of published studies that specifically show no horizontal transmission, such as the one claimed to be detected in the study presented.

      Thank you for bringing this to our attention. We have made corresponding modifications to the discussion section (Lines 256–271 in the revised manuscript) and have discussed the published studies that report no evidence of horizontal transmission (Lines 260–263 in the revised manuscript). The added sentences read: “Experimental confirmations of Wolbachia horizontal transfer remain relatively rare, with only a limited number of documented cases (24, 27, 37, 38). Additionally, some experiments have found no evidence of horizontal transmission of Wolbachia (39-42).”

      Reviewer #3 (Public Review):

      This is a very ordinary research paper. The horizontal transmission of endosymbionts, including Wolbachia, Rickettsia, etc., has been reported in detail in the last 10 years, and parasitoid-vectored as well as plant-vectored horizontal transmission is the mainstream of research. For example, Ahmed et al. 2013 PLoS One, 2015 PLoS Pathogens, Chiel et al. 2014 Environmental Entomology, Ahmed et al. 2016 BMC Evolutionary Biology, Qi et al. 2019 JEE, Liu et al. 2023 Frontiers in Cellular and Infection Microbiology, all of these reported the parasitoid-vectored horizontal transmission of endosymbionts. While Caspi-Fluger et al. 2012 Proc Roy Soc B, Chrostek et al. 2017 Frontiers in Microbiology, Li et al. 2017 ISME Journal, Li et al. 2017 FEMS, Shi et al. 2024 mBio, all of these reported the plant-vectored horizontal transmission of endosymbionts. For the effects of endosymbionts on the biology of the host, Ahmed et al. 2015 PLoS Pathogens explained the effects in detail.

      Thank you for the insightful comments and for highlighting the relevant literature in the field of horizontal transmission of endosymbionts, including Wolbachia and Rickettsia. After careful consideration of the studies mentioned in the comments, we believe that our work presents significant novel contributions to the field. 1) Regarding the parasitoid-mediated horizontal transmission of Wolbachia, most of the cited articles, such as Ahmed et al. 2013 in PLoS One and Ahmed et al. 2016 in BMC Evolutionary Biology, propose hypotheses but do not provide definitive evidence. The transmission of Wolbachia within the whitefly cryptic species complex (Ahmed et al. 2013) or between moths and butterflies (Ahmed et al. 2016) could be mediated by parasitoids, plants, or other unknown pathways. 2) Chiel et al. 2014 in Environmental Entomology reported “no evidence for horizontal transmission of Wolbachia between and within trophic levels” in their study system. 3) The literature you mentioned about Rickettsia, rather than Wolbachia, indirectly reflects the relative scarcity of evidence for Wolbachia horizontal transmission. For example, the evidence for plant-mediated transmission of Wolbachia remains isolated, with Li et al. 2017 in the ISME Journal being one of the few reports supporting this mode of transmission. 4) While the effects of endosymbionts on their hosts are not the central focus of our study, the effects of transgenerational Wolbachia on whiteflies are primarily demonstrated to confirm the infection of Wolbachia into whiteflies. Furthermore, the effects we report of Wolbachia on whiteflies are notably different from those reported by Ahmed et al. 2015 in PLoS Pathogens, likely due to different whitefly species and Wolbachia strains. 5) More importantly, our study reveals a mechanism of parasitoid-mediated horizontal transmission of Wolbachia that is distinct from the mechanical transmission suggested by Ahmed et al. 2015 in PLoS Pathogens. Their study implies transmission primarily through a "dirty needle" mechanism, without Wolbachia infection of the parasitoid, suggesting host-to-host transmission at the same trophic level, where parasitoids serve as phoretic vectors. In contrast, our findings demonstrate transmission from parasitoids to hosts through unsuccessful parasitism, which represents cross-trophic level transmission. To our knowledge, this is the first experimental evidence that Wolbachia can be transmitted from parasitoids to hosts. We believe these clarifications and the novel insights provided by our research contribute valuable knowledge to the field.

      REFERENCES

      Ahmed MZ, De Barro PJ, Ren SX, Greeff JM, Qiu BL, 2013. Evidence for horizontal transmission of secondary endosymbionts in the Bemisia tabaci cryptic species complex. PLoS One, 8(1):e53084.

      Ahmed MZ, Li SJ, Xue X, Yin XJ, Ren SX, Jiggins FM, Greeff JM, Qiu BL, 2015. The intracellular bacterium Wolbachia uses parasitoid wasps as phoretic vectors for efficient horizontal transmission. PLoS Pathog, 10(2):e1004672.

      Ahmed MZ, Breinholt JW, Kawahara AY, 2016. Evidence for common horizontal transmission of Wolbachia among butterflies and moths. BMC Evol Biol, 16(1):118.

      Caspi-Fluger A, Inbar M, Mozes-Daube N, Katzir N, Portnoy V, Belausov E, Hunter MS, Zchori-Fein E, 2012. Horizontal transmission of the insect symbiont Rickettsia is plant-mediated. Proc Biol Sci, 279(1734):1791-6.

      Chiel E, Kelly SE, Harris AM, Gebiola M, Li X, Zchori-Fein E, Hunter MS, 2014. Characteristics, phenotype, and transmission of Wolbachia in the sweet potato whitefly, Bemisia tabaci (Hemiptera: Aleyrodidae), and its parasitoid Eretmocerus sp. nr. emiratus (Hymenoptera: Aphelinidae). Environ Entomol, 43(2):353-62.

      Chrostek E, Pelz-Stelinski K, Hurst GDD, Hughes GL, 2017. Horizontal transmission of intracellular insect symbionts via plants. Front Microbiol, 8:2237.

      Li SJ, Ahmed MZ, Lv N, Shi PQ, Wang XM, Huang JL, Qiu BL, 2017. Plant-mediated horizontal transmission of Wolbachia between whiteflies. ISME J, 11(4):1019-1028.

      Li YH, Ahmed MZ, Li SJ, Lv N, Shi PQ, Chen XS, Qiu BL, 2017. Plant-mediated horizontal transmission of Rickettsia endosymbiont between different whitefly species. FEMS Microbiol Ecol, 93(12).

      Liu Y, He ZQ, Wen Q, Peng J, Zhou YT, Mandour N, McKenzie CL, Ahmed MZ, Qiu BL, 2023. Parasitoid-mediated horizontal transmission of Rickettsia between whiteflies. Front Cell Infect Microbiol, 12:1077494.

      Qi LD, Sun JT, Hong XY, Li YX, 2019. Diversity and phylogenetic analyses reveal horizontal transmission of endosymbionts between whiteflies and their parasitoids. J Econ Entomol, 112(2):894-905.

      Shi PQ, Wang L, Chen XY, Wang K, Wu QJ, Turlings TCJ, Zhang PJ, Qiu BL, 2024. Rickettsia transmission from whitefly to plants benefits herbivore insects but is detrimental to fungal and viral pathogens. mBio, 15(3):e0244823.

      Weaknesses:

      In the current study, the authors downloaded the MLST or wsp genes from a public database and analyzed the data using other methods. I think the authors may not be familiar with the research progress in the field of insect symbiont transmission, and the current version of this manuscript lacks sufficient novelty.

      We appreciate your critical perspective on our study. However, we respectfully disagree with the viewpoint that our manuscript lacks sufficient novelty.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      The data and scripts from the experimental section of the paper are not made publicly available. This would be good practice. It may well be a requirement for this journal too, but I have not read the journal policy on this matter.

      Thank you for the kind reminder. We have uploaded the data and scripts to a public database at https://doi.org/10.6084/m9.figshare.24718119.

      • Line 16 should read 'intertrophic' not 'intertropical'.

      Corrected.

      • Line 50 should not say 'the most infectious' as this is an incorrect use of the word 'infectious'. Maybe 'common'? Should also add something like 'likely' here.

      Corrected. The new sentence reads “Together, these characteristics make Wolbachia likely the most common microbe on Earth in terms of the number of species it infects (7, 8).” (Lines 47–49 in the revised manuscript).

      • Line 54 These references are all about mosquito disease vectors, not pests. More generally, in this paragraph, the research interest in Wolbachia relates overwhelmingly to blocking arbovirus transmission and not controlling pest populations.

      To enhance consistency with our statements, we have revised the supporting references as follows:

      X. Zheng et al., "Combined incompatible and sterile insect techniques eliminate mosquitoes," Nature 572, 56-61 (2019).

      A. A. Hoffmann et al., "Wolbachia establishment in Aedes populations to suppress dengue transmission," Nature 476, 454-457 (2011).

      J. T. Gong, T. P. Li, M. K. Wang, X. Y. Hong, "Prospects of Wolbachia in agricultural pest control," Current Opinion in Insect Science 57, 101039 (2023).

      J. T. Gong et al., "Stable integration of plant-virus-inhibiting Wolbachia into planthoppers for rice protection," Current Biology 30, 4837-4845.e4835 (2020).

      Regarding the content of the articles:

      Zheng et al. (2019) detail the successful suppression of wild mosquito populations through the release of male mosquitoes artificially infected with Wolbachia.

      Gong et al. (2020) present the potential of releasing Wolbachia-infected brown planthoppers to inhibit plant viruses and control pest populations.

      Gong et al. (2023) provide a comprehensive review on the application and future of Wolbachia in managing agricultural pests.

      • Line 60-61. This sentence seems poorly supported by theory or data. I suggest it is deleted. Why should CI cause extinction, and why would it have a major effect on genetic diversity beyond mtDNA?

      We have deleted the statements about extinction or genetic diversity. Now the sentence reads “It may also spread to nontarget organisms, potentially disrupting their population dynamics.” (Lines 57–58 in the revised manuscript)

      • Line 66. Reword to make clear these routes are not an exhaustive list.

      We have reworded these sentences. The new sentences now read “Similar to other symbionts, Wolbachia host shifts may occur through three main routes: parasitism, predation, and shared plant or other food sources (17). However, it is important to note that these are not the only routes through which transmission may occur, and the specific contributions of each to the overall process of host shift are not yet fully understood.” (Lines 62–66 in the revised manuscript).

      • Line 77-79. This could do with mentioning studies of parasitoid-to-host transmission like Ahmed et al., given that it is common knowledge that insects commonly survive parasitoid attacks.

      We have added sentences acknowledging the common occurrence of insects surviving parasitoid attacks and referenced and described the Ahmed et al. 2015 study. The added sentences read:

      “However, it is common in nature for hosts to survive parasitoid attacks (27-29). For example, whiteflies can survive after attacks of Eretmocerus parasitoids (27). These parasitoids can act as phoretic vectors, facilitating the spread of Wolbachia within whitefly populations through the contamination of their mouthparts and ovipositors with Wolbachia during the probing process (27).” (Lines 77–82 in the revised manuscript).

      • Line 173. Mention that there are three replicates of each cage. In Figures 2C and D, it is better to show each replicate as a separate line to see how consistent they are.

      In accordance with the reviewer's suggestion, we have included a statement highlighting the replication of our experiments: “Notably, each cage setup was replicated three times to ensure experimental rigor.” (Lines 179–180 in the revised manuscript).

      Regarding Figures 2C and D, we revised the figures to display each replicate as a separate line, as suggested. However, the resulting visual clutter detracts from the clarity of the figures. Additionally, in Figure 2C, the three black lines, all representing zero values, cannot be distinguished from one another. Therefore, we recommend retaining the original figure format. In accordance with eLife's data policy, we have also provided the source data for all figures, ensuring that readers can access the detailed data and thus balancing the need for visual simplicity with the provision of comprehensive data.

      Author response image 1.

      • The GloBI database is central to the phylogenetic analysis and it would be helpful to have a few words in the results stating where this information comes from.

      The revised sentence now reads: “To investigate potential horizontal transmission of Wolbachia, we retrieved 4685 wsp sequences from the NCBI database, and species interaction relationships were extracted from the GloBI database (for details, see Methods and Materials).” (Lines 94–96 in the revised manuscript).

      Reviewer #3 (Recommendations For The Authors):

      To improve the quality of this manuscript, I have some questions and suggestions.

      Introduction:

      Line 41-42, I don't agree with this statement, as mentioned above, the ways of insect symbiont transmission have been studied in the last 10 years.

      According to the reviewer’s suggestion, we have deleted this statement.

      Line 75-76, Again, the statement is not correct, many studies have clearly revealed and confirmed that Wolbachia CAN be transferred from parasitoid to their insect hosts including whitefly Bemisia tabaci.

      Thank you for your insightful comments. After careful consideration of the studies you have mentioned above, none of these articles provided definitive evidence supporting the transfer of Wolbachia from parasitoids to their insect hosts. A closely related study is Ahmed et al. (2015) in PLoS Pathogens. This article demonstrates that parasitoid wasps can act as phoretic vectors mediating the transmission of Wolbachia between whiteflies. However, Wolbachia did not infect the parasitoid wasps themselves. Therefore, this study does not provide evidence for intertrophic transmission of Wolbachia from parasitoids to their hosts. To avoid confusion, we have cited the Ahmed et al. (2015) reference following this statement and described its findings accordingly. (Lines 88-92 in revised manuscript).

      Results:

      Line 133-134, Ahmed et al. 2016 BMC Evolutionary Biology clearly revealed and confirmed the "common horizontal transmission of Wolbachia between butterflies and moths".

      We thank you for guiding us to the relevant study. Ahmed et al. 2016 in BMC Evolutionary Biology suggested common horizontal transmission of Wolbachia between butterflies and moths and proposed that this horizontal transmission might be caused by parasitoid wasps. Here, we present the potential Wolbachia transfer between Trichogramma and their lepidopteran hosts (Lines 135–136 in revised manuscript). Taken together with the results from Ahmed et al. 2016, our result also suggests that Trichogramma wasps may be the vectors for horizontal transmission of Wolbachia among lepidopteran hosts. We have discussed this point in the discussion section and cited Ahmed et al. 2016 (Lines 239–246 in revised manuscript).

      Line 176-177, as we know, Wolbachia in Encarsia formosa is a parthenogenesis-inducing strain. Why did it reduce the female ratio of whitefly progeny after it was transmitted to whitefly B. tabaci? This needs a convincing explanation.

      Wolbachia induces parthenogenesis in En. formosa. However, we observed that Wolbachia from En. formosa failed to induce parthenogenesis in B. tabaci, possibly due to the requirement for host gene compatibility. Additionally, we noted a reduced female ratio in B. tabaci infected with En. formosa Wolbachia. We speculate that this might result from the burden imposed by En. formosa Wolbachia on the new host, potentially reducing fertilization success rates and indirectly leading to a decrease in the female ratio. Similarly, we observed a decline in female fecundity, egg hatching rate, and immature survival rate in B. tabaci infected with En. formosa Wolbachia. The mechanisms underlying these fitness costs remain unclear and warrant further in-depth research.

      Line 189-190, do the authors have convincing evidence that the 60Gy irradiation only has effects on the reproduction of En. formosa, but does not have any negative effects on the activity of Wolbachia? I think there may be.

      We observed that after irradiation, the titer of Wolbachia within En. formosa significantly decreased (Fig S3). We agree that the irradiation may cause other negative effects on Wolbachia, which are worth close investigation. However, even with a significant reduction in Wolbachia titer, irradiation increased the infection rate of Wolbachia in surviving B. tabaci after wasp attacks (Fig 3C). We speculate that this may be due to irradiation of En. formosa increasing the rate of parasitic failure. While the full extent of the effects of irradiation on Wolbachia is not yet clear in our experiments, it does not alter our conclusion that Wolbachia can be transmitted from En. formosa to whitefly hosts through failed parasitism.

      Discussion:

      Line 289-290, I don't understand, why the authors think from parasitoid Eretmocerus to whitefly, and from Trichogramma to moth, are the same trophic level, they are indeed two different trophic levels.

      Thank you for your feedback. We have conducted a thorough search but were unable to locate the specific statement you are referring to. If there has been any ambiguity in our manuscript that has led to confusion, we sincerely apologize for any misunderstanding it may have caused. We agree with your perspective and have always considered the parasitoid Eretmocerus and whitefly, as well as Trichogramma and moth, to be at different trophic levels. However, in the context of specific references, such as Ahmed et al. 2015 in PLoS Pathogens, we believe that Wolbachia is transmitted within the same trophic level without infecting the parasitoid Eretmocerus, which merely serves as a phoretic vector to facilitate the spread of Wolbachia among whitefly hosts. Similarly, in the case of Huigens et al. 2000 in Nature, Wolbachia uses lepidopteran hosts as vectors to promote its transmission among Trichogramma without the need to infect the lepidopteran hosts themselves.

      Materials and Methods

      Line 348, what is tblastn?

      We have corrected tblastn to TBLASTN and are grateful to the reviewer for pointing this out. We used TBLASTN instead of BLASTN to avoid missing rapidly evolving wsp sequences, because alignment at the protein level is generally more sensitive than alignment at the nucleotide level. TBLASTN is a tool within the BLAST (Basic Local Alignment Search Tool) suite that compares a protein query sequence against a nucleotide database. Specifically, TBLASTN translates each nucleotide sequence in the database in all six reading frames and aligns the resulting protein sequences with the query protein sequence.
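
      The translation step described above (three reading frames on each strand, six in total) can be sketched as follows. This is a minimal illustration and not the TBLASTN implementation: the function names are hypothetical, the standard genetic code is assumed, and no alignment or scoring is performed.

```python
from itertools import product

# Standard genetic code in the canonical TCAG ordering (assumption: the
# standard table is adequate for illustration; "*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(nucleotides):
    """Hypothetical helper: translate a sequence in all six reading frames."""
    seq = nucleotides.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


if __name__ == "__main__":
    # Toy sequence, not a real wsp fragment.
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

      A protein query would then be compared against each of the six translated frames, which is why a protein-level search can still recover homologs whose nucleotide sequences have diverged.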

      Line 383, how was the Wolbachia-free line of B. tabaci established, by antibiotics? If so, how do we ensure the antibiotic does not have any negative effect on other symbionts in whitefly B. tabaci?

      The Wolbachia-free line of B. tabaci was collected from the field, without antibiotic treatment. We have made revisions in the Materials and Methods section to clarify this, stating, "An iso-female line of B. tabaci, which is naturally Wolbachia-free and has not been treated with antibiotics, was established." (Lines 417–418 in the revised manuscript)

      Line 419-421, as I mentioned before, the irradiation may have negative effects on Wolbachia too, and so change the biology of both Encarsia and its whitefly host.

      We observed that after irradiation, the titer of Wolbachia within En. formosa significantly decreased (Fig S3). However, even with a significant reduction in Wolbachia titer, irradiation increased the infection rate of Wolbachia in surviving B. tabaci after wasp attacks (Fig 3C). We speculate that this may be due to irradiation of En. formosa increasing the rate of parasitic failure. While the full extent of the effects of irradiation on Wolbachia is not yet clear in our experiments, it does not alter our conclusion that Wolbachia can be transmitted from En. formosa to whitefly hosts through failed parasitism.

      Line 452-453, from egg to eclosion, it takes about 21 days under suitable temperature and other conditions. During this period, the eggs and nymphs cannot move, so how is the cut leaf kept fresh enough in a Petri dish for 21 days?

      We apologize for not clearly describing the materials and methods. By wrapping the end of the leaf petiole in wet cotton, we can keep the leaves fresh for up to a month. We have included this detail in the materials and methods to enhance the reproducibility of the experiment. “A single irradiated wasp was subsequently introduced into a Petri dish, which contained a tomato leaf infested with Wolbachia-free third or fourth instar whitefly nymphs, and wet cotton was used to wrap the end of the leaf petiole to keep the leaf fresh.” (Lines 455–458 in the revised manuscript)

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The manuscript describes a series of experiments using human intracranial neural recordings designed to evaluate the processing of self-generated speech in the setting of feedback delays. Specifically, the authors aim to address the relationship between speech-induced suppression and feedback sensitivity in the auditory cortex, on which the literature has been conflicting. They found a correlation between speech suppression and feedback delay sensitivity, suggesting a common process. Additional controls were done for possible forward suppression/adaptation, as well as controlling for other confounds due to amplification, etc.

      Strengths:

      The primary strength of the manuscript is the use of human intracranial recording, which is a valuable resource and gives better spatial and temporal resolution than many other approaches. The use of delayed auditory feedback is also novel and has seen less attention than other forms of shifted feedback during vocalization. Analyses are robust, and include demonstrating a scaling of neural activity with the degree of feedback delay, and more robust evidence for error encoding than simply using a single feedback perturbation.

      Weaknesses:

      Some of the analyses performed differ from those used in past work, which limits the ability to directly compare the results. Notably, past work has compared feedback effects between production and listening, which was not done here. There were also some unusual effects in the data, such as increased activity with no feedback delay when wearing headphones, that the authors attempted to control for with additional experiments, but remain unclear. Confounds by behavioral results of delayed feedback are also unclear.

      Overall the work is well done and clearly explained. The manuscript addresses an area of some controversy and does so in a rigorous fashion, namely the correlation between speech-induced suppression and feedback sensitivity (or lack thereof). While the data presented overlaps that collected and used for a previous paper, this is expected given the rare commodity these neural recordings represent. Contrasting these results to previous ones using pitch-shifted feedback should spawn additional discussion and research, including verification of the previous finding, looking at how the brain encodes feedback during speech over multiple acoustic dimensions, and how this information can be used in speech motor control.

      We thank the reviewer for their comments and have addressed the concerns point by point in the section “Recommendation for Authors”.

      Reviewer #2 (Public Review):

      Summary:

      In "Speech-induced suppression and vocal feedback sensitivity in human cortex", Ozker and colleagues use intracranial EEG to understand audiomotor feedback during speech production using a speech production and delayed auditory feedback task. The purpose of the paper is to understand where and how speaker-induced suppression occurs, and whether this suppression might be related to feedback monitoring. First, they identified sites that showed auditory suppression during speech production using a single-word auditory repetition task and a visual reading task, then observed whether and how these electrodes show sensitivity to auditory feedback using a DAF paradigm. The stimuli were single words played auditorily or shown visually and repeated or read aloud by the participant. Neural data were recorded from regular- and high-density grids from the left and right hemispheres. The main findings were:

      • Speaker-induced suppression is strongest in the STG and MTG, and enhancement is generally seen in frontal/motor areas except for small regions of interest in the dorsal sensorimotor cortex and IFG, which can also show suppression.

      • Delayed auditory feedback, even when simultaneous, induces larger response amplitudes compared to the typical auditory word repetition and visual reading tasks. The authors presume this may be due to the effort and attention required to perform the DAF task.

      • The degree of speaker-induced suppression is correlated with sensitivity to delayed auditory feedback.

      • pSTG (behind TTS) is more strongly modulated by DAF than mid-anterior STG.

      Strengths:

      Overall, I found the manuscript to be clear, the methodology and statistics to be solid, and the findings mostly quite robust. The large number of participants with high-density coverage over both the left and right lateral hemispheres allows for a greater dissection of the topography of speaker-induced suppression and changes due to audiomotor feedback. The tasks were well-designed and controlled for repetition suppression and other potential caveats.

      Weaknesses:

      (1) In Figure 1D, it would make more sense to align the results to the onset of articulation rather than the onset of the auditory or visual cue, since the point is to show that the responses during articulation are relatively similar. In this form, the more obvious difference is that there is an auditory response to the auditory stimulus, and none to the visual, which is expected, but not what I think the authors want to convey.

      We agree with the reviewer. We have updated Figure 1 accordingly.

      (2) The DAF paradigm includes playing auditory feedback at 0, 50, 100, and 200 ms lag, and it is expected that some of these lags are more likely to induce dysfluencies than others. It would be helpful to include some analysis of whether the degree of suppression or enhancement varies by performance on the task, since some participants may find some lags more interfering than others.

      We thank the reviewer for this suggestion. In the original analysis, we calculated a Sensitivity Index for each electrode by correlating the high gamma response with the delay condition across trials. To address the reviewer’s question, we have now compared delay conditions in pairs (DAF0 vs DAF50, DAF0 vs DAF100, DAF0 vs DAF200, DAF50 vs DAF100, DAF50 vs DAF200 and DAF100 vs DAF200).

      Similar to our Suppression Index calculation, where we compared neural responses to the listening and speaking conditions ((Listen – Speak) / (Listen + Speak)), we now calculated the Sensitivity Index by comparing neural responses to two delay conditions as follows:

      e.g., Sensitivity Index = (DAF50 – DAF0) / (DAF50 + DAF0). We used the raw high gamma broadband signal power instead of percent signal change to ensure that the Sensitivity Index values varied between -1 and 1.
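
      For illustration, both indices share the same normalized-contrast form and can be sketched as below. This is a minimal sketch, not the analysis code used in the study: the function names and the power values are hypothetical, and the inputs are assumed to be non-negative raw power values (which is what keeps the index within -1 and 1).

```python
from itertools import combinations


def contrast_index(a, b):
    """Normalized contrast (a - b) / (a + b) for non-negative power values."""
    if a < 0 or b < 0:
        raise ValueError("power values must be non-negative")
    if a + b == 0:
        raise ValueError("index is undefined when both values are zero")
    return (a - b) / (a + b)


def suppression_index(listen_power, speak_power):
    """(Listen - Speak) / (Listen + Speak); positive means suppression."""
    return contrast_index(listen_power, speak_power)


def pairwise_sensitivity(power_by_delay):
    """Sensitivity Index for every pair of delays, longer minus shorter.

    power_by_delay maps a delay in ms (e.g. 0, 50, 100, 200) to the mean
    raw high gamma power of one electrode in that condition.
    """
    delays = sorted(power_by_delay)
    return {
        (short, long): contrast_index(power_by_delay[long], power_by_delay[short])
        for short, long in combinations(delays, 2)
    }


if __name__ == "__main__":
    # Made-up values for a single hypothetical electrode.
    electrode = {0: 1.0, 50: 1.0, 100: 3.0, 200: 3.0}
    print(suppression_index(3.0, 1.0))
    for pair, index in pairwise_sensitivity(electrode).items():
        print(pair, index)
```

      With four delay conditions this yields the six pairwise comparisons listed above, one Sensitivity Index per pair and electrode.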

      As shown in the figure below, even when we break down the analysis by feedback delay, we still find a significant association between suppression and sensitivity (except when sensitivity indices are calculated by comparing DAF50 and DAF100). The strongest correlation (Pearson’s correlation) was found when sensitivity indices were calculated by comparing DAF0 and DAF200.
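
      The association reported here is a Pearson correlation across electrodes between the two indices. A minimal sketch of that coefficient is given below; the function name and the example values are hypothetical, and the sketch computes only the coefficient, not the significance test.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(ss_x * ss_y)


if __name__ == "__main__":
    # Made-up per-electrode indices, one entry per electrode.
    suppression = [0.10, 0.25, 0.40, 0.55]
    sensitivity = [0.02, 0.08, 0.11, 0.20]
    print(pearson_r(suppression, sensitivity))
```

      Repeating this calculation once per delay pair gives one coefficient per comparison, which is how the pairs can be ranked by strength of association.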

      As the reviewer suggested, participants found DAF200 more interfering than the other delays and slowed down their speech the most (articulation duration; DAF0: 0.698, DAF50: 0.726, DAF100: 0.737, and DAF200: 0.749 seconds; Ozker, Doyle et al. 2022).

      Author response image 1.

      (3) Figure 3 shows data from only two electrodes from one patient. An analysis of how amplitude changes as a function of the lag across all of the participants who performed this task would be helpful to see how replicable these patterns of activity are across patients. Is sensitivity to DAF always seen as a change in amplitude, or are there ever changes in latency as well? The analysis in Figure 4 gets at which electrodes are sensitive to DAF but does not give a sense of whether the temporal profile is similar to those shown in Figure 3.

      In Figure 4A, electrodes from all participants are color-coded to reflect the correlation between neural response amplitude and auditory feedback delay. A majority of auditory electrodes in the STG exhibit a positive correlation, indicating that response amplitude increases with increasing feedback delays. To demonstrate the replicability of the response patterns in Figure 3, here we show auditory responses averaged across 23 STG electrodes from 6 participants.

      Author response image 2.

      Response latency in auditory regions also increases with increasing auditory feedback delays, but this delayed auditory response to delayed auditory feedback is expected. In Figure 3, signals were aligned to the perceived auditory feedback onset, so we don’t see the latency differences. Below, we replotted the same responses by aligning the signal to the onset of articulation. It is now clearer that responses are delayed as the auditory feedback delay increases. This is because participants start speaking at time=0 but hear their voice with a lag, so the response onset in these auditory regions is delayed.

      According to models of speech production, when there is a mismatch between expected and perceived auditory feedback, the auditory cortex encodes this mismatch with an enhanced response, reflecting an error signal. Therefore, we referred to changes in response amplitude as a measure of sensitivity to DAF.

      (4) While the sensitivity index helps to show whether increasing amounts of feedback delay are correlated with increased response enhancement, it is not sensitive to nonlinear changes as a function of feedback delay, and it is not clear from Figure 3 or 4 whether such relationships exist. A deeper investigation into the response types observed during DAF would help to clarify whether this is truly a linear relationship, dependent on behavioral errors, or something else.

      We compared responses to delay conditions in pairs in the analysis presented above (response #2). We hope these new results also clarify this issue and address the reviewer’s concerns.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Major points:

      (1) While the correlation between SuppI and SensI is clear here (as opposed to Chang et al), it is unclear if this difference is a byproduct of how SensI was calculated (and not just different tasks). In that paper, the feedback sensitivity was calculated as a metric comparing feedback responses during production and listening, whereas here the SensI is a correlation coefficient during production only. If the data exists, it would be very helpful to also show an analysis similar to that used previously (i.e. comparing DAF effects in both production and playback, either in correlations or just the 200ms delay response). One could imagine that some differences are due to sensory properties, though it is certainly less clear what delay effects would be on listening compared to say pitch shift.

      We thank the reviewer for pointing this out. Indeed, the calculation of SensI is different in the two studies. In the Chang et al. study, SensI was calculated by comparing perturbed feedback responses during production and passive listening. This is a very meticulous approach as it controls for the acoustic properties of the auditory stimuli under both conditions.

      In our study, we didn’t have a passive listening condition. This would require recording the participants’ voice as they were speaking with DAF and playing it back to them in a subsequent passive listening condition. Therefore, we can’t completely eliminate the possibility that some differences are due to sensory properties. However, to address the reviewer’s concern, we examined the voice recordings of 8 participants for acoustic differences. Specifically, we compared voice intensities for different auditory feedback delays (0, 50, 100, and 200 ms) and found no significant differences (F=0, p=0.091).

      We think that the difference with the Chang et al. study is an important point to emphasize, so we have now added the following to the Discussion:

      “In contrast, to replicate this finding in humans, a previous iEEG study by Chang et al. (Chang, Niziolek et al. 2013) used frequency-shifted feedback during vowel production and found that most suppressed auditory sites did not overlap with those sensitive to feedback alterations. Using DAF instead of frequency-shifted feedback, we demonstrated a significant overlap of two neural populations in the STG, along with a strong correlation between the degree of speech-induced suppression and sensitivity to auditory feedback. This discrepancy may be due to different methods of calculating sensitivity to altered feedback. In our study, sensitivity was determined by comparing responses to delayed and non-delayed feedback during production, whereas Chang et al. compared perturbed feedback responses during production and listening. One possibility is that our approach identifies a larger auditory neural population in the STG sensitive to altered feedback. Alternatively, it could indicate a larger population highly sensitive to temporal rather than spectral perturbations in auditory feedback. Thus, we observe a wide overlap of the two neural populations in the STG showing both speech-induced suppression and sensitivity to auditory feedback. Replaying a recording of the participants' own delayed voice back to them, which we were unable to complete in this study, would have made the results of the two studies more comparable while also completely eliminating the possibility of a sensory explanation for the observed response enhancement.”

      (2) I am still a bit unclear on how Experiment 4 is different than the no-delay condition in Experiment 3. Please clarify. Also, to be clear, in Experiments 1+2 the subjects were not wearing any headphones and had no additional sidetone?

      It is correct that participants were not wearing earphones in Experiments 1&2 (with no additional sidetone), and that they were wearing earphones in Experiments 3&4.

      For the “no delay” condition in the DAF experiment (Experiment 3), participants were wearing earphones and reading words with simultaneous auditory feedback. So, this condition was equivalent to visual word reading (Experiment 2), except participants were wearing earphones. Yet, neural responses were much larger for the “no delay” condition in the DAF experiment compared to visual word reading.

      We suspected that larger neural responses in the DAF experiment were caused by hearing auditory feedback through earphones. To test and control for this possibility, in a subset of participants, we ran an additional visual word reading experiment (Experiment 4) with earphones and used the same volume settings as in the DAF experiment. We found that response magnitudes were now similar in the two experiments (Experiment 3 and 4) and earphones (with the associated increased sound amplitude) were indeed the reason for larger neural responses. Thus, Experiment 4 differs from the no-delay condition in Experiment 3 only in the stimuli read aloud.

      (3) In Figure 3, why is the DAF200 condition activity so much bigger than the other conditions, even prior to the DAF onset? I worry this might bias the rest of the response differences.

      In Figure 3B and 3D, time=0 indicates the onset of the perceived auditory feedback. Below, we replotted the responses in the same two electrodes, but now time=0 indicates the onset of articulation. We see that the peak time of the responses is delayed as the auditory feedback delay increases. This is because participants start speaking at time=0 but hear their voice with a lag, so the response onset in these auditory regions is delayed. However, as the reviewer pointed out, the response for the DAF200 condition in Electrode G54 is slightly larger even at the very beginning. We think that this small, early response might reflect a response to the bone-conducted auditory feedback, which might be more prominent for the DAF200 condition. Nevertheless, we still see that response amplitude increases with increasing feedback delays in Electrode 63.

      (4) Figure 4C, are the labeled recording sites limited to those with significant DAF and/or suppression?

      In Figure 4C, we show electrodes that had significant high-gamma broadband responses during all tasks. We write in the Methods: “Electrodes that showed significant response increase (p < 10^−4) either before (−0.5 to 0 s) or after speech onset (0 to 0.5 s) with respect to a baseline period (−1 to −0.6 s) and at the same time had a large signal-to-noise ratio (μ/σ > 0.7) during either of these time windows were selected. Electrode selection was first performed for each task separately, then electrodes that were commonly selected were further analyzed.”
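      The selection rule quoted from the Methods can be sketched as below. This is an illustration under stated assumptions, not the authors' pipeline: the p-values and signal-to-noise ratios are taken as precomputed inputs (the quoted text does not name the statistical test), the significant window and the high-SNR window are not required to be the same window, and all function names and numbers are hypothetical.

```python
def select_electrodes(p_pre, p_post, snr_pre, snr_post, alpha=1e-4, min_snr=0.7):
    """Return the set of electrodes passing both criteria for one task.

    Each argument maps an electrode label to its value in the pre-speech
    (-0.5 to 0 s) or post-speech (0 to 0.5 s) window.
    """
    selected = set()
    for name in p_pre:
        # Significant response increase versus baseline in either window.
        significant = p_pre[name] < alpha or p_post[name] < alpha
        # Large signal-to-noise ratio (mean / standard deviation) in either window.
        strong = snr_pre[name] > min_snr or snr_post[name] > min_snr
        if significant and strong:
            selected.add(name)
    return selected


def common_electrodes(per_task_selections):
    """Keep only the electrodes that were selected in every task."""
    return set.intersection(*per_task_selections)


# Made-up values for three electrodes in two tasks.
task_a = select_electrodes(
    p_pre={"E1": 1e-6, "E2": 0.01, "E3": 1e-5},
    p_post={"E1": 0.5, "E2": 1e-7, "E3": 0.2},
    snr_pre={"E1": 0.9, "E2": 0.3, "E3": 0.5},
    snr_post={"E1": 0.2, "E2": 0.8, "E3": 0.6},
)
task_b = select_electrodes(
    p_pre={"E1": 0.3, "E2": 1e-5, "E3": 1e-6},
    p_post={"E1": 0.2, "E2": 0.4, "E3": 1e-6},
    snr_pre={"E1": 1.0, "E2": 0.75, "E3": 0.9},
    snr_post={"E1": 1.1, "E2": 0.1, "E3": 0.9},
)
shared = common_electrodes([task_a, task_b])
print(sorted(shared))
```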

      (5) Were there any analyses done to control for the effects of vocal changes on the DAF neural responses? The authors' previous paper did note a behavioral effect. This is probably not trivial, as we may not know the 'onset time' of the response, in contrast to pitch shift where it is more regular. If the timing is unknown, one thing that could be tried is to only look early in DAF responses (first 50ms say) to make sure the DAF effects hold.

      DAF involves two different perturbations: the absence of feedback at speech onset and the introduction of delayed feedback during playback. The timing of the behavioral effect in response to these two perturbations remains unclear. Aligning the neural responses to the production onset and examining the first 50ms would only capture the response to the acoustic feedback for the no-delay condition within that time window. Conversely, aligning the responses to the playback onset might miss the onset of the behavioral effect, which likely starts earlier as a response to the lack of feedback. We acknowledge the reviewer's point that this is a limitation of the DAF paradigm, and the behavioral effect is not as straightforward as that of pitch perturbation. However, we believe there is no clear solution to this issue.

      Minor points:

      (1) Figure 3, it might be nice to show the SuppI and SensI on the plots to give the reader a better sense of what those values look like.

      We included SuppI and SensI values in the new version of Figure 3.

      Reviewer #2 (Recommendations For The Authors):

      Minor Comments:

      (1) In Figure 1, it is unclear whether the responses shown in B-D correspond to the ROIs shown in Figure A - I am guessing so, but the alignment of the labels makes this slightly unclear, so I suggest these be relabeled somehow for clarity.

      This is fixed in the updated version of Figure 1.

      (2) In Figure 1D the difference in colors between AWR and VWR is difficult to appreciate - I suggest using two contrasting colors.

      This is fixed in the updated version of Figure 1.

      (3) Please add y-axis labels for Fig 3B-D. (I believe these are % signal change, but it would be clearer if the label were included).

      This is fixed in the updated version of Figure 3.

      (4) Can the authors comment on whether the use of speakers for AWR and VWR versus earphones for DAF and VWR-AF may have had an influence on the increased response in this condition? If the AWR were rerun using the headphone setup, or if DAF with 0 ms feedback were run with no other trials including lags, would the large differences in response amplitude be observed?

      Participants were not wearing earphones in Experiments 1&2, and they were wearing earphones in Experiments 3&4.

      For the “no delay” condition in the DAF experiment (Experiment 3), participants were wearing earphones and reading words with simultaneous auditory feedback. So, this condition was equivalent to VWR (Experiment 2), except participants were wearing earphones. Yet, neural responses were much larger for the “no delay” condition in the DAF experiment compared to VWR.

      Supporting the reviewer’s concerns, we suspected that larger neural responses in the DAF experiment were caused by hearing auditory feedback through earphones. To test and control for this possibility, in a subset of participants, we ran the VWR-AF experiment (Experiment 4) with earphones and used the same volume settings as in the DAF experiment. We found that response magnitudes were now similar in the two experiments (Experiment 3 and 4) and earphones were indeed the reason for larger neural responses.

      (5) No data or code were available, I did not see any statement about this nor any github link or OSF link to share their data and/or code.

      Data are available in the GitHub repository: flinkerlab/Sensitivity-Suppression

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Here, the authors propose that changes in m6A levels may be predictable via a simple model that is based exclusively on mRNA metabolic events. Under this model, m6A mRNAs are "passive" victims of RNA metabolic events with no "active" regulatory events needed to modulate their levels by m6A writers, readers, or erasers; looking at changes in RNA transcription, RNA export, and RNA degradation dynamics is enough to explain how m6A levels change over time.

      The relevance of this study is extremely high at this stage of the epitranscriptome field. This compelling paper is in line with more and more recent studies showing how m6A is a constitutive mark reflecting overall RNA redistribution events. At the same time, it reminds every reader to carefully evaluate changes in m6A levels if observed in their experimental setup. It highlights the importance of performing extensive evaluations on how much RNA metabolic events could explain an observed m6A change.

      Weaknesses:

      It is essential to notice that m6ADyn does not exactly recapitulate the observed m6A changes. First, this can be due to m6ADyn's limitations. The authors do a great job in the Discussion highlighting these limitations. Indeed, they mention how m6ADyn cannot interpret m6A's implications on nuclear degradation or splicing and cannot model more complex scenario predictions (i.e., a scenario in which m6A both impacts export and degradation) or the contribution of single sites within a gene.

      Secondly, since predictions do not exactly recapitulate the observed m6A changes, "active" regulatory events may still play a partial role in regulating m6A changes. The authors themselves highlight situations in which data do not support m6ADyn predictions. Active mechanisms to control m6A degradation levels or mRNA export levels could exist and may still play an essential role.

      We are grateful for the reviewer’s appreciation of our findings and their implications, and are in full agreement with the reviewer regarding the limitations of our model and the discrepancies, in some cases, with our experimental measurements, which potentially point to more complex biology than is captured by m6ADyn. We certainly cannot dismiss the possibility that active mechanisms may play a role in shaping m6A dynamics at some sites, or in some contexts. Our study aims to broaden the discussion in the field, and to introduce the possibility that passive models can explain a substantial extent of the variability observed in m6A levels.

      (1) "We next sought to assess whether alternative models could readily predict the positive correlation between m6A and nuclear localization and the negative correlations between m6A and mRNA stability. We assessed how nuclear decay might impact these associations by introducing nuclear decay as an additional rate, δ. We found that both associations were robust to this additional rate (Supplementary Figure 2a-c)."

      Based on the data, I would say that model 2 (m6A-dep + nuclear degradation) is better than model 1. The discussion of these findings in the Discussion could help clarify how to interpret this prediction. Is nuclear degradation playing a significant role, more than expected by previous studies?

      This is an important point, which we’ve now clarified in the discussion. Including nonspecific nuclear degradation in the m6ADyn framework provides a model that better aligns with the observed data, particularly by mitigating unrealistic predictions such as excessive nuclear accumulation for genes with very low sampled export rates. This adjustment addresses potential artifacts in nuclear abundance and half-life estimations. However, we continued to use the simpler version of m6ADyn for most analyses, as it captures the key dynamics and relationships effectively without introducing additional complexity. While including nuclear degradation enhances the model's robustness, it does not fundamentally alter the primary conclusions or outcomes. This balance allows for a more straightforward interpretation of the results.

      (2) The authors classify m6A levels as "low" or "high," and it is unclear how "low" differs from unmethylated mRNAs.

      We thank the reviewer for this observation. We analyzed gene methylation levels using the m6A-GI (m6A gene index) metric, which reflects the enrichment of the IP fraction across the entire gene body (CDS + 3′UTR). While some genes may have minimal or no methylation, most genes likely exist along a spectrum from low to high methylation levels. Unlike earlier analyses that relied on arbitrary thresholds to classify sites as methylated, GLORI data highlight the presence of many low-stoichiometry sites that are typically overlooked. To capture this spectrum, we binned genes into equal-sized groups based on their m6A-GI values, allowing a more nuanced interpretation of methylation patterns as a continuum rather than a binary or discrete classification (e.g. no, low, or high methylation).
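      Equal-sized binning of this kind can be sketched as follows. The snippet is only an illustration of rank-based binning: the gene names, m6A-GI values, number of bins, and function name are all hypothetical and are not taken from the manuscript.

```python
from collections import Counter


def equal_size_bins(values, n_bins):
    """Assign each gene to one of n_bins rank-based bins of (nearly) equal size.

    values maps gene -> m6A-GI; the result maps gene -> bin index,
    where bin 0 holds the least methylated genes.
    """
    if not 1 <= n_bins <= len(values):
        raise ValueError("n_bins must be between 1 and the number of genes")
    ranked = sorted(values, key=values.get)  # genes ordered by increasing m6A-GI
    n = len(ranked)
    # Integer arithmetic spreads the ranks evenly over the bins.
    return {gene: rank * n_bins // n for rank, gene in enumerate(ranked)}


# Made-up m6A-GI values for six genes, split into three bins of two genes each.
example_gi = {
    "geneA": 0.1, "geneB": 2.3, "geneC": 0.7,
    "geneD": 1.5, "geneE": 3.8, "geneF": 0.2,
}
bins = equal_size_bins(example_gi, n_bins=3)
bin_sizes = Counter(bins.values())
print(bins)
```

      Binning on ranks rather than on fixed m6A-GI cut-offs keeps every group the same size, so no threshold has to be chosen to separate "unmethylated" from "low" genes.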

      (3) The authors explore whether m6A changes could be linked with differences in mRNA subcellular localization. They tested this hypothesis by looking at mRNA changes during heat stress, a complex scenario to predict with m6ADyn. According to the collected data, heat shock is not associated with dramatic changes in m6A levels. However, the authors observe a redistribution of m6A mRNAs during the treatment and recovery time, with highly methylated mRNAs being retained in the nucleus, being associated with a shorter half-life, and being transcriptionally induced by HSF1. Based on this observation, the authors use m6ADyn to predict the contribution of RNA export, RNA degradation, and RNA transcription to the observed m6A changes. However:

      (a) Do the authors have a comparison of m6ADyn predictions based on the assumption that RNA export and RNA transcription may change at the same time?

      We thank the reviewer for this point. Under the simple framework of m6ADyn in which RNA transcription and RNA export are independent of each other, the effect of simultaneously modulating two rates is additive. In Author response image 1, we simulate some scenarios wherein we simultaneously modulate two rates. For example, transcriptional upregulation and decreased export during heat shock could reinforce m6A increases, whereas transcriptional downregulation might counteract the effects of reduced export. Note that while production and export can act in similar or opposing directions, the former can only lead to temporary changes in m6A levels but without impacting steady-state levels, whereas the latter (changes in export) can alter steady-state levels. We have clarified this in the manuscript results to better contextualize how these dynamics interact.

      Author response image 1.

      m6ADyn predictions of m6A gene levels (left) and Nuc to Cyt ratio (right) upon varying perturbations of a sampled gene. The left panel depicts the simulated dynamics of log2-transformed m6A gene levels under varying conditions. The lines represent the following perturbations: (1) export is reduced to 10% (β), (2) production is increased 10-fold (α) while export is reduced to 10% (β), (3) export is reduced to 10% (β) and production is reduced to 10% (α), and (4) export is only decreased for methylated transcripts (β^m6A) to 10%. The right panel shows the corresponding nuclear:cytoplasmic (log2 Nuc:Cyt) ratios for perturbations 1 and 4.
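      The qualitative behavior described above (production changes are transient, export changes shift the steady state) can be reproduced with a deliberately simplified two-compartment sketch. This is not the authors' m6ADyn implementation: the four-pool structure, the decay symbols gamma_u and gamma_m, the Euler integration, and every parameter value are assumptions chosen for illustration. The m6A level is taken as the ratio of methylated to unmethylated transcripts.

```python
def steady_state(alpha_u, alpha_m, beta, gamma_u, gamma_m):
    """Steady-state pools (N_u, N_m, C_u, C_m) of the simplified model.

    alpha_*: production of unmethylated / methylated transcripts,
    beta: export rate, gamma_*: cytoplasmic decay rates (gamma_m > gamma_u).
    """
    return (alpha_u / beta, alpha_m / beta, alpha_u / gamma_u, alpha_m / gamma_m)


def m6a_level(state):
    """Ratio of methylated to unmethylated transcripts over both compartments."""
    n_u, n_m, c_u, c_m = state
    return (n_m + c_m) / (n_u + c_u)


def simulate(state, alpha_u, alpha_m, beta, gamma_u, gamma_m, t_end, dt=0.01):
    """Integrate the four pools forward with explicit Euler steps."""
    n_u, n_m, c_u, c_m = state
    for _ in range(round(t_end / dt)):
        dn_u = alpha_u - beta * n_u
        dn_m = alpha_m - beta * n_m
        dc_u = beta * n_u - gamma_u * c_u
        dc_m = beta * n_m - gamma_m * c_m
        n_u, n_m = n_u + dt * dn_u, n_m + dt * dn_m
        c_u, c_m = c_u + dt * dc_u, c_m + dt * dc_m
    return (n_u, n_m, c_u, c_m)


# Hypothetical rates for one gene (arbitrary, dimensionless time units).
base = dict(alpha_u=1.0, alpha_m=1.0, beta=1.0, gamma_u=0.2, gamma_m=1.0)
start = steady_state(**base)
baseline = m6a_level(start)

# 10-fold production increase: the m6A level rises transiently, then relaxes back.
boosted = dict(base, alpha_u=10.0, alpha_m=10.0)
transient = m6a_level(simulate(start, t_end=1.0, **boosted))
relaxed = m6a_level(simulate(start, t_end=100.0, **boosted))

# Export reduced to 10%: the steady-state m6A level itself shifts upward.
blocked = dict(base, beta=0.1)
export_shifted = m6a_level(steady_state(**blocked))

print(baseline, transient, relaxed, export_shifted)
```

      In this sketch, scaling both production rates cancels out of the steady-state ratio, whereas lowering export enlarges the nuclear pool, where methylated transcripts are not yet subject to faster decay.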

      (b) They arbitrarily set the global reduction of export to 10%, but I'm not sure we can completely rule out whether m6A mRNAs have an export rate during heat shock similar to the non-methylated mRNAs. What happens if the authors simulate that the block in export could be preferential for m6A mRNAs only?

      We thank the reviewer for this interesting suggestion. While we cannot fully rule out such a scenario, we can identify arguments against it being an exclusive explanation. Specifically, an exclusive reduction in the export rate of methylated transcripts would be expected to increase the relationship between steady-state m6A levels (the ratio of methylated to unmethylated transcripts) and changes in localization, such that genes with higher m6A levels would exhibit a greater relative increase in the nuclear-to-cytoplasmic (Nuc:Cyt) ratio. However, the attached analysis shows only a weak association during heat stress, where genes with higher m6A-GI levels tend to increase just a little more in the Nuc:Cyt ratio, likely due to cytoplasmic depletion. A global reduction of export (β 10%) produces a similar association, while a scenario where only the export of methylated transcripts is reduced (β^m6A 10%) results in a significantly stronger association (Author response image 2). This supports the plausibility of a global export reduction. Additionally, genes with very low methylation levels in control conditions also show a significant increase in the Nuc:Cyt ratio, which is inconsistent with a scenario of preferential export reduction for methylated transcripts (data not shown).

      Author response image 2.

      Wild-type MEFs m6A-GIs (x-axis) vs. fold change nuclear:cytoplasmic localization heat shock 1.5 h and control (y-axis), Pearson’s correlation indicated (left panel). m6ADyn, rates sampled for 100 genes based on gamma distributions and simulation based on reducing the global export rate (β) to 10% (middle panel). m6ADyn simulation for reducing the export rate for m6A methylated transcripts (β^m6A) to 10% (right panel).

      (c) The dramatic increase in the nucleus: cytoplasmic ratio of mRNA upon heat stress may not reflect the overall m6A mRNA distribution upon heat stress. It would be interesting to repeat the same experiment in METTL3 KO cells. Of note, m6A mRNA granules have been observed within 30 minutes of heat shock. Thus, some m6A mRNAs may still be preferentially enriched in these granules for storage rather than being directly degraded. Overall, it would be interesting to understand the authors' position relative to previous studies of m6A during heat stress.

      The reviewer suggests that methylation is actively driving localization during heat shock, rather than being passively regulated. To address this question, we have now knocked down WTAP, an essential component of the methylation machinery, and monitored nuclear:cytoplasmic localization over the course of a heat shock response. Even with reduced m6A levels, high PC1 genes exhibit increased nuclear abundance during heat shock. Notably, the dynamics of this trend are altered, with the peak effect delayed from 1.5 h of heat shock in siCTRL samples to 4 h in siWTAP samples (Supplementary Figure 4). This finding underscores that m6A is not the primary driver of these mRNA localization changes but rather reflects broader mRNA metabolic shifts during heat shock. These findings have been added as panel (e) of Supplementary Figure 4.

      (d) Gene Ontology analysis based on the top 1000 PC1 genes shows an enrichment of GOs involved in post-translational protein modification more than GOs involved in cellular response to stress, which is highlighted by the authors and used as justification to study RNA transcriptional events upon heat shock. How do the authors think that GOs involved in post-translational protein modification may contribute to the observed data?

      High PC1 genes exhibit increased methylation and a shift in nuclear-to-cytoplasmic localization during heat stress. While the enriched GO terms for these genes are not exclusively related to stress-response proteins, one could speculate that their nuclear retention reduces translation during heat stress. Of particular interest are the heat stress response genes, which are massively transcriptionally induced and display increased methylation. This observation supports m6ADyn predictions that elevated methylation levels in these genes are driven by transcriptional induction rather than solely by decreased export rates.

      (e) Additionally, the authors first mention that there is no dramatic change in m6A levels upon heat shock, "subtle quantitative differences were apparent," but then mention a "systematic increase in m6A levels observed in heat stress". It is unclear which systematic increase they are referring to. Are the authors referring to previous studies? It is confusing in the field what exactly is going on after heat stress. For instance, in some papers, a preferential increase of 5'UTR m6A has been proposed rather than a systematic and general increase.

      We thank the reviewer for raising this point. In our manuscript, we sought to emphasize, on the one hand, that m6A profiles are, to a first approximation, “constitutive”, as indicated by high Pearson correlations between conditions (Supplementary Figure 4a). On the other hand, we sought to emphasize that, the above notwithstanding, subtle quantitative differences are apparent in heat shock, encompassing large numbers of genes, and that these differences are coherent with time following heat shock (and in this sense ‘systematic’), rather than randomly fluctuating across time points. Based on our analysis, these changes do not appear to be preferentially enriched at 5′UTR sites but occur more broadly across gene bodies (potentially with a slight 3′ bias). A quick analysis of the HSF1-induced heat stress response genes, focusing on their relative enrichment of methylation upon heat shock, shows that the 5′UTR regions exhibit a roughly similar increase in methylation after 1.5 hours of heat stress compared to the rest of the gene body (Author response image 3). Although a prominent previous publication (Zhou et al. 2015) suggested that m6A levels specifically increase in the 5′UTR of HSPA1A in a YTHDF2- and HSF1-dependent manner, and highlighted the role of 5′UTR m6A methylation in regulating cap-independent translation, our findings do not support a 5′UTR-specific enrichment. However, we do observe that the methylation changes are still HSF1-dependent. Of note, the m6A-GI (m6A gene index) is a metric that captures the m6A enrichment of the gene body excluding the 5′UTR, owing to an overlap with transcription start site-associated m6Am-derived signal.

      Author response image 3.

      Fold change of m6A enrichment (m6A-IP / input) comparing 1.5 h heat shock and control conditions for the 5′UTR region and the rest of the gene body (CDS and 3′UTR) in the 10 HSF1-dependent stress response genes.

      Reviewer #2 (Public review):

      Dierks et al. investigate the impact of m6A RNA modifications on the mRNA life cycle, exploring the links between transcription, cytoplasmic RNA degradation, and subcellular RNA localization. Using transcriptome-wide data and mechanistic modelling of RNA metabolism, the authors demonstrate that a simplified model of m6A primarily affecting cytoplasmic RNA stability is sufficient to explain the nuclear-cytoplasmic distribution of methylated RNAs and the dynamic changes in m6A levels upon perturbation. Based on multiple lines of evidence, they propose that passive mechanisms based on the restricted decay of methylated transcripts in the cytoplasm play a primary role in shaping condition-specific m6A patterns and m6A dynamics. The authors support their hypothesis with multiple large-scale datasets and targeted perturbation experiments. Overall, the authors present compelling evidence for their model which has the potential to explain and consolidate previous observations on different m6A functions, including m6A-mediated RNA export.

      We thank the reviewer for the spot-on suggestions and comments on this manuscript.

      Reviewer #3 (Public review):

      Summary:

      This manuscript works with a hypothesis where the overall m6A methylation levels in cells are influenced by mRNA metabolism (sub-cellular localization and decay). The basic assumption is that m6A causes mRNA decay and this happens in the cytoplasm. They go on to experimentally test their model to confirm its predictions. This is confirmed by sub-cellular fractionation experiments which show high m6A levels in the nuclear RNA. Nuclear localized RNAs have higher methylation. Using a heat shock model, they demonstrate that RNAs with increased nuclear localization or transcription, are methylated at higher levels. Their overall argument is that changes in m6A levels are rather determined by passive processes that are influenced by RNA processing/metabolism. However, it should be considered that erasers have their roles under specific environments (early embryos or germline) and are not modelled by the cell culture systems used here.

      Strengths:

      This is a thought-provoking series of experiments that challenge the idea that active mechanisms of recruitment or erasure are major determinants for m6A distribution and levels.

      We sincerely thank the reviewer for their thoughtful evaluation and constructive feedback.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Supplementary Figure 5A Data: Please double-check the label of the y-axis and the matching legend.

      We corrected this.

      (2) A better description of how the nuclear: cytoplasmic fractionation is performed.

      We added missing information to the Material & Methods section.

      (3) Rec 1hr or Rec 4hr instead of r1 and r4 to indicate the recovery.

      For brevity in Figure panels, we have chosen to stick with r1 and r4.

      (4) Figure 2D: are hours plotted?

      Plotted is the fold change (FC) of the calculated half-lives, in hours (right). For the model (left), the plotted values are the fold change, in a dimensionless time unit, between the conditions with and without m6A-facilitated degradation. We have now clarified this in the legend.

      (5) How many genes do we have in each category? How many genes are you investigating each time?

      We thank the reviewer for this question. In all cases where we binned genes, we used equal-sized bins of genes that met the required coverage thresholds. We have reviewed the manuscript to ensure that the number of genes included in each analysis or the specific coverage thresholds used are clearly stated throughout the text.

      (6) Simulations on 1000 genes or 2000 genes?

      We thank the reviewer for this question and have gone over the text to correct cases in which this was not clearly stated.

      Reviewer #2 (Recommendations for the authors):

      Specific comments:

      (1) The manuscript is very clear and well-written. However, some arguments are a bit difficult to understand. It would be helpful to clearly discriminate between active and passive events. For example, in the sentence: "For example, increasing the m6A deposition rate (⍺m6A) results in increased nuclear localization of a transcript, due to the increased cytoplasmic decay to which m6A-containing transcripts are subjected", I would directly write "increased relative nuclear localization" or "apparent increase in nuclear localization".

      We thank the reviewer for this careful observation. We have modified the quoted sentence, and also sought to correct additional instances of ambiguity in the text.
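
      To illustrate the distinction between an absolute and a relative (apparent) increase in nuclear localization, the following is a minimal steady-state sketch. It is not the m6ADyn implementation: the function name, the parameter names and all rate values are hypothetical, and the model assumes that methylation only adds an extra cytoplasmic decay term.

```python
def steady_state(meth_fraction, production=1.0, export=1.0,
                 decay=0.2, m6a_extra_decay=0.8):
    """Steady-state nuclear and cytoplasmic levels for one gene (toy model).

    All parameters are hypothetical, in arbitrary units.
    meth_fraction: fraction of transcripts carrying m6A (stand-in for the deposition rate).
    """
    # Nuclear pool: production balanced by export, unaffected by m6A in this toy model.
    nuclear = production / export
    # Cytoplasmic pools: export flux (equal to production at steady state) balanced by decay.
    cyto_unmeth = production * (1.0 - meth_fraction) / decay
    cyto_meth = production * meth_fraction / (decay + m6a_extra_decay)
    cytoplasmic = cyto_unmeth + cyto_meth
    relative_nuclear = nuclear / (nuclear + cytoplasmic)
    return nuclear, cytoplasmic, relative_nuclear


low = steady_state(meth_fraction=0.1)
high = steady_state(meth_fraction=0.9)
print("nuclear level (low, high m6A):", low[0], high[0])
print("relative nuclear fraction (low, high m6A):", low[2], high[2])
```

      In this sketch the nuclear level itself does not change with methylation. Only the cytoplasmic pool shrinks, so the nuclear fraction rises passively, which is the sense of "relative" or "apparent" in the reviewer's suggested wording.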

      Also, it is important to ensure that all relationships are described correctly. For example, in the sentence: "This model recovers the positive association between m6A and nuclear localization but gives rise to a positive association between m6A and decay", I think "decay" should be replaced with "stability". Similarly, the sentence: "Both the decrease in mRNA production rates and the reduction in export are predicted by m6ADyn to result in increasing m6A levels, ..." should it be "Both the increase in mRNA production and..."?

      We have corrected this.

      This sentence was difficult for me to understand: "Our findings raise the possibility that such changes could, at least in part, also be indirect and be mediated by the redistribution of mRNAs secondary to loss of cytoplasmic m6A-dependent decay." Please consider rephrasing it.

      We rephrased this sentence as suggested.

      (2) Figure 2d: "A final set of predictions of m6ADyn concerns m6A-dependent decay. m6ADyn predicts that (a) cytoplasmic genes will be more susceptible to increased m6A mediated decay, independent of their m6A levels, and (b) more methylated genes will undergo increased decay, independently of their relative localization (Figure 2d left) ... Strikingly, the experimental data supported the dual, independent impact of m6A levels and localization on mRNA stability (Figure 2d, right)."

      I do not understand, either from the text or from the figure, why the authors claim that m6A levels and localization independently affect mRNA stability. It is clear that "cytoplasmic genes will be more susceptible to increased m6A mediated decay", as they always show shorter half-lives (top-to-bottom perspective in Figure 2d). Nonetheless, as I understand it, the effect is not "independent of their m6A levels", as half-lives are clearly the shortest with the highest m6A levels (left-to-right perspective in each row).

      The two-dimensional heatmaps allow for exploring conditional independence between conditions. If an effect (in this case delta half-life) is a function of the X axis (in this case m6A levels), continuous increases should be seen going from one column to another. Conversely, if it is a function of the Y axis (in this case localization), a continuous effect should be observed from one row to another. Given that effects are generally observed both across rows and across columns, we concluded that the two act independently. The fact that half-life is shortest when genes are most cytoplasmic and have the highest m6A levels is therefore not necessarily inconsistent with two effects acting independently, but instead interpreted by us as the additive outcome of two independent effects. Having said this, a close inspection of this plot does reveal a very low impact of localization in contexts where m6A levels are very low, which could point at some degree of synergism between m6A levels and localization. We have therefore now revised the text to avoid describing the effects as "independent."
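
      The difference between additive and synergistic effects in a two-dimensional heatmap can be made concrete with a small sketch. The grids below are synthetic, the effect sizes are arbitrary, and the double-centering check is a generic device chosen for illustration, not the analysis used in the manuscript.

```python
import numpy as np


def interaction_residual(grid):
    """Remove row and column main effects; what remains is the interaction."""
    grid = np.asarray(grid, dtype=float)
    row_mean = grid.mean(axis=1, keepdims=True)
    col_mean = grid.mean(axis=0, keepdims=True)
    return grid - row_mean - col_mean + grid.mean()


m6a = np.linspace(0.0, 1.0, 5)              # columns: binned m6A level (synthetic)
cyto = np.linspace(0.0, 1.0, 4)[:, None]    # rows: binned cytoplasmic localization (synthetic)

# Additive case: each axis shifts the half-life change by its own amount.
additive = -1.0 * m6a - 0.5 * cyto
# Synergistic case: localization matters only when m6A is present.
synergistic = -1.5 * m6a * cyto

print(np.abs(interaction_residual(additive)).max())
print(np.abs(interaction_residual(synergistic)).max())
```

      For the additive grid the residual vanishes up to rounding, while the synergistic grid leaves a clear residual and shows no localization effect in the lowest m6A column, similar to the pattern noted in the response above.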

      (3) The methods part should be extended. For example, the description of the mRNA half-life estimation is far too short and lacks details. Also, information on the PCA analysis (Figure 4e & f) is completely missing. The code should be made available, at least for the differential model.

      We thank the reviewer for this point and have expanded the Methods section on mRNA stability analysis and PCA. Additionally, we added a supplementary file providing R code for a basic m6ADyn simulation comparing m6A-depleted to normal conditions (added as Source Code 1).

      https://docs.google.com/spreadsheets/d/1Wy42QGDEPdfT-OAnmH01Bzq83hWVrYLsjy_B4n CJGFA/edit?usp=sharing

      (4) Figure 4e, f: The authors use a PCA analysis to achieve an unbiased ranking of genes based on their m6A level changes. From the present text and figures, it is unclear how this PCA was performed. Besides a description in the methods sections, the authors could show additional evidence that the PCA results in a meaningful clustering and that PC1 indeed captures induced/reduced m6A level changes for high/low-PC1 genes.

      We have added passages to the text, hoping to clarify the analysis approach.

      (5) In Figure 4i, I was surprised about the m6A dynamics for the HSF1-independent genes, with two clusters of increasing or decreasing m6A levels across the time course. Can the model explain these changes? Since expression does not seem to be systematically altered, are there differences in subcellular localization between the two clusters after heat shock?

      A general aspect of our manuscript is attributing changes in m6A levels during heat stress to alterations in mRNA metabolism, such as production or export. As shown in Supplementary Figure 4d, even in WT conditions, m6A level changes are not strictly associated with apparent changes in expression; we try to show that they instead reflect the decreased export rate. In the specific context of HSF1-dependent stress response genes, we observe a clear co-occurrence of increased m6A levels with increased expression levels, which we propose to attribute to enhanced production rates during heat stress. This suggests that transcriptional induction can drive the apparent rise in m6A levels. We try to control for this with the HSF1 KO cells, in which both the m6A level changes and the increased production rates are absent for the specific cluster of stress-induced genes, further supporting the role of transcriptional activation in shaping m6A levels for these genes. For HSF1-independent genes, the HSF1 KO cells mirror the behavior of WT conditions when looking at the 500 highest- and lowest-PC1 genes (based on the prior analysis in WT cells), suggesting that changes in m6A levels are primarily driven by altered export rates rather than changes in production.

      Among the HSF1 targets, Hspa1a seems to show an inverse behaviour, with the highest methylation in ctrl, even though expression strongly goes up after heat shock. Is this related to the subcellular localization of this particular transcript before and after heat shock?

      Upon reviewing the heat stress target genes, we identified an issue with the proper labeling of the gene symbols, which has now been corrected (Figure 4 panel i). The inverse behavior observed for Hspb1 and partially for Hsp90aa1 is not accounted for by the m6ADyn model, and is indeed an interesting exception with respect to all other induced genes. Further investigation will be required to understand the methylation dynamics of Hspb1 during the response to heat stress.

      Reviewer #3 (Recommendations for the authors):

      Page 4. Indicate reference for "a more recent study finding reduced m6A levels in chromatin-associated RNA.".

      We thank the reviewer for this point and have added two publications, including a very recent one, both showing that chromatin-associated nascent RNA has less m6A methylation.

      The manuscript is perhaps a bit too long. It took me a long time to get to the end. The findings can be clearly presented in a more concise manner and that will ensure that anyone starting to read will finish it. This is not a weakness, but a hope that the authors can reduce the text.

      We have respectfully chosen to maintain the length of the manuscript. The model, its predictions and their relationship to experimental observations are somewhat complex, and we felt that further reduction of the text would come at the expense of clarity.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      This manuscript presents an interesting new framework (VARX) for simultaneously quantifying effective connectivity in brain activity during sensory stimulation and how that brain activity is being driven by that sensory stimulation. The core idea is to combine the Vector Autoregressive model that is often used to infer Granger-causal connectivity in brain data with an encoding model that maps the features of a sensory stimulus to that brain data. The authors do a nice job of explaining the framework. And then they demonstrate its utility through some simulations and some analysis of real intracranial EEG data recorded from subjects as they watched movies. They infer from their analyses that the functional connectivity in these brain recordings is essentially unaltered during movie watching, that accounting for the driving movie stimulus can protect one against misidentifying brain responses to the stimulus as functional connectivity, and that recurrent brain activity enhances and prolongs the putative neural responses to a stimulus.

      Overall, I thought this was an interesting manuscript with some rich and intriguing ideas. That said, I had some concerns also - one potentially major - with the inferences drawn by the authors on the analyses that they carried out.

      Main comments:

      (1) My primary concern with the way the manuscript is written right now relates to the inferences that can be drawn from the framework. In particular, the authors want to assert that, by incorporating an encoding model into their framework, they can do a better job of accounting for correlated stimulus-driven activity in different brain regions, allowing them to get a clearer view of the underlying innate functional connectivity of the brain. Indeed, the authors say that they want to ask "whether, after removing stimulus-induced correlations, the intrinsic dynamic itself is preserved". This seems a very attractive idea indeed. However, it seems to hinge critically on the idea of fitting an encoding model that fully explains all of the stimulus-driven activity. In other words, if one fits an encoding model that only explains some of the stimulus-driven response, then the rest of the stimulus-driven response still remains in the data and will be correlated across brain regions and will appear as functional connectivity in the ongoing brain dynamics - according to this framework. This residual activity would thus be misinterpreted. In the present work, the authors parameterize their stimulus using fixation onsets, film cuts, and the audio envelope. All of these features seem reasonable and valid. However, they surely do not come close to capturing the full richness of the stimuli, and, as such, there is surely a substantial amount of stimulus-driven brain activity that is not being accounted for by their "B" model and that is being absorbed into their "A" model and misinterpreted as intrinsic connectivity. This seems to me to be a major limitation of the framework. Indeed, the authors flag this concern themselves by (briefly) raising the issue in the first paragraph of their caveats section. But I think it warrants much more attention and discussion.

      We agree. One can never be sure that all stimulus induced correlation is accounted for. We now formulate our question more cautiously: 

      “We will ask here whether, after removing some of the stimulus-induced correlations, the intrinsic dynamic is similar between stimulus and rest conditions.”

      We also highlight that one may expect the opposite result of what we found: 

      “A general observation of these studies is that a portion of the functional connectivity is preserved between rest and stimulus conditions, while some aspects are altered by the perceptual task [12,16], sometimes showing increased connectivity during the stimulus [15].”

      We have added a number of additional features (acoustic edges, fixation novelty, and motion) and more carefully characterize how much “connectivity” each one explains in the neural data: 

      “Removing any of the input features increased the effect size of recurrent connections compared to a model with all features (Fig. S4). We then cumulatively added each feature to the VARX model. Effect size monotonically decreases with each feature added (Fig. 3F). Decreases of effect size are significant when adding film cuts (ΔR=-3.6×10⁻⁶, p<0.0001, N=26, FDR correction, α=0.05) and the sound envelope (ΔR=-3.59×10⁻⁶, p=0.002, N=26, FDR correction, α=0.05). Thus, adding more input features progressively reduces the strength of recurrent “connections”.”
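
      The reviewer's concern, that stimulus-driven activity left out of the input model is absorbed into the recurrent term, can be reproduced in a small simulation. This is an illustrative sketch, not the authors' VARX code: two channels have no coupling at all, share one input that reaches them with different delays, and are fitted by ordinary least squares with and without that input. All coefficients and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20000
x = rng.standard_normal(T)           # shared stimulus feature
e = rng.standard_normal((T, 2))      # independent innovations
y = np.zeros((T, 2))
for t in range(1, T):
    # No cross-channel coupling: each channel depends only on its own past.
    y[t, 0] = 0.5 * y[t - 1, 0] + x[t] + e[t, 0]       # stimulus arrives immediately
    y[t, 1] = 0.5 * y[t - 1, 1] + x[t - 1] + e[t, 1]   # stimulus arrives one step later


def fit_recurrent(y, inputs=None):
    """Least-squares fit of y(t) on y(t-1) plus optional input regressors.

    Returns the 2x2 recurrent matrix, A[i, j] = effect of channel j at t-1 on channel i at t.
    """
    past = y[:-1]
    regressors = past if inputs is None else np.hstack([past, inputs])
    coef, *_ = np.linalg.lstsq(regressors, y[1:], rcond=None)
    return coef[:2].T


A_without = fit_recurrent(y)
X = np.column_stack([x[1:], x[:-1]])   # x(t) and x(t-1), aligned with y(t)
A_with = fit_recurrent(y, X)

print("channel 0 -> 1 without input:", A_without[1, 0])
print("channel 0 -> 1 with input:   ", A_with[1, 0])
```

      Without the input, the fit reports a sizeable "connection" from channel 0 to channel 1 even though none exists. Including the input removes it, in line with the observation that adding features reduces the strength of recurrent "connections".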

      We also added more data to the analysis comparing movies vs rest. We now use 4 different movie segments instead of 1 and find reduced recurrent connectivity during movies: 

      “The number of significant recurrent connections in A were significantly reduced during movie watching compared to rest (Fig. 4C, fixed effect of stimulus: beta = -3.8×10⁻³, t(17) = -3.9, p<0.001), as is the effect size R (Fig. 4D, fixed effect of stimulus: beta = -2.5×10⁻⁴, t(17) = -4.1, p<0.001).”

      The additional analysis is described in the Methods section:

      “To compare recurrent connectivity between movies and the resting-state, we compute VARX models in four different movie segments of 5 minutes length to match the length of the resting state recording. We use the first and second half of ‘Despicable Me English’, the first half of ‘Inscapes’ and one of the ‘Monkey’ movies. 18 patients include each of these recordings. For each recording in each patient we compute the fraction of significant channels (p<0.001) and average the effect size R across all channel pairs, excluding the diagonal. We test the difference between movies and resting-state with linear mixed-effect models with stimulus as fixed effect (movie vs rest), and patient as random effect, using matlab’s fitlme() routine.”

      We had already seen this trend of decreasing connectivity during movie watching before, and reported on it cautiously as “largely unaltered”. We updated the Abstract correspondingly from “largely unaltered” to “reduced”: 

      “We also find that the recurrent connectivity during rest is reduced during movie watching.”

      We mentioned this possibility in the Discussion before, namely, that additional input features may reduce recurrent connectivity in the model, and therefore show a difference. We discuss this result now as follows: 

      “The stimulus features we included in our model capture mostly low-level visual and auditory input. It is possible that regressing out a richer stimulus characterization would have removed additional stimulus-induced correlation. While we do not expect that this would change the overall effect of a reduced number of “connections” during movie watching compared to resting state, the interpretation of changes in specific connections will be affected by the choice of features. For example, in sensory cortices, higher recurrent connectivity in the LFP during rest would be consistent with the more synchronized state we saw in rest, as reflected by larger oscillatory activity. Synchronization in higher-order cortices, however, is expected to be more strongly influenced by semantic content of external input.”

      In the Discussion we expand on what might happen if additional stimulus features were to be included into the model:  

      “Previous literature often does not distinguish between intrinsic dynamics and extrinsic effects. By factoring out some of the linear effects of the external input we conclude here that recurrent connectivity is reduced on average. From our prior work [49], we know that the stimulus features we included here capture a substantial amount of variance across the brain in intracranial EEG. Arguably, however, the video stimuli had rich semantic information that was not captured by the low-level features used here. Adding such semantic features could have further reduced shared variance, and consequently further reduced average recurrent connectivity in the model.”

      “Similarities and differences between rest and movie watching conditions reported previously do not draw a firm conclusion as to whether overall “functional connectivity” is increased or reduced. Results seem to depend on the time scale of neural activity analyzed, and the specific brain networks [12,16,63]. However, in fMRI, the conclusion seems to be that functional connectivity during movies is stronger than during rest [15], which likely results from stimulus induced correlations. The VARX model can remove some of the effects of these stimuli, revealing that average recurrent connectivity may be reduced rather than increased during stimulus processing.”

      And in the conclusion we now write: 

      “The model revealed a small but significant decrease of recurrent connectivity when watching movies.”

      (2) Related to the previous comment, the authors make what seems to me to be a complex and important point on page 6 (of the pdf). Specifically, they say "Note that the extrinsic effects captured with filters B are specific (every stimulus dimension has a specific effect on each brain area), whereas the endogenous dynamic propagates this initial effect to all connected brain areas via matrix A, effectively mixing and adding the responses of all stimulus dimensions. Therefore, this factorization separates stimulus-specific effects from the shared endogenous dynamic." It seems to me that the interpretation of the filter B (which is analogous to the "TRF") for the envelope, say, will be affected by the fact that the matrix A is likely going to be influenced by all sorts of other stimulus features that are not included in the model. In other words, residual stimulus-driven correlations that are captured in A might also distort what is going on in B, perhaps. So, again, I worry about interpreting the framework unless one can guarantee a near-perfect encoding model that can fully account for the stimulus-driven activity. I'd love to hear the authors' thoughts on this. (On this issue - the word "dominates" on page 12 seems very strong.)

      This is an interesting point we had not thought about. After some theoretical considerations and some empirical testing we conclude that the effect of missing inputs is relevant, but can be easily anticipated. 

      We have added the following to the Results section, explaining and empirically demonstrating the effects of adding features and signals to the model:

      “As with conventional linear regression, the estimate in B for a particular input and output channel is not affected by which other signals are included in x(t) or y(t), provided those other inputs are uncorrelated. We confirmed this here empirically by removing dimensions from y(t) (Fig. S11A), and by adding uncorrelated input to x(t) (Fig. S11B, adding fixation onset does not affect the estimate for auditory envelope responses). In other words, to estimate B, we do not require all possible stimulus features and all brain activity to be measured and included in the model. In contrast, B does vary when correlated inputs are added to x(t) (Fig. S11C, adding acoustic edges changes the auditory envelope response). Evidently the auditory envelope and acoustic edges are tightly coupled in time, whereas fixation onset is not. When a correlated input is missing (acoustic edges) then the other input (auditory envelope) absorbs the correlated variance, thus capturing the combined response of both.”
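
      The regression property described in this passage can be checked with synthetic data. The sketch below is a single-lag simplification with made-up signals named after the features in the text. The weights, the correlation of 0.8 between envelope and edges, and the helper name are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
envelope = rng.standard_normal(n)
edges = 0.8 * envelope + 0.6 * rng.standard_normal(n)   # correlated with the envelope
fixation = rng.standard_normal(n)                       # uncorrelated with both
# Synthetic response with unit weight on every feature.
response = envelope + edges + fixation + rng.standard_normal(n)


def envelope_weight(*features):
    """Least-squares weight of the first feature when fitted jointly with the others."""
    X = np.column_stack(features)
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    return coef[0]


w_env_only = envelope_weight(envelope)
w_plus_fixation = envelope_weight(envelope, fixation)
w_plus_edges = envelope_weight(envelope, fixation, edges)

print("envelope only:        ", w_env_only)
print("plus fixation:        ", w_plus_fixation)
print("plus fixation + edges:", w_plus_edges)
```

      Adding the uncorrelated feature leaves the envelope weight essentially unchanged, whereas adding the correlated feature moves it from about 1.8 (its own weight plus 0.8 of the missing feature's weight) back to about 1.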

      (3) Regarding the interpretation of the analysis of connectivity between movies and rest... that concludes that the intrinsic connectivity pattern doesn't really differ. This is interesting. But it seems worth flagging that this analysis doesn't really account for the specific dynamics in the network that could differ quite substantially between movie watching and rest, right? At the moment, it is all correlational. But the dynamics within the network could be very different between stimulation and rest I would have thought.

      As discussed above, with more data and additional stimulus features we now see detectable changes in the connectivity. The example in Figure 4G also shows that specific connections may change in different directions, while overall the strength of connections slightly decreases during movie watching compared to rest. We added the following to the results:

      “While the effect size decreases on average, there is some variation across different brain areas (Fig. 4E-G).”

      But even if the connectivity were unchanged, the activity on this network can be different with varying inputs. We actually also saw that there were changes in the variability of activity (Figs. 6 and S13) that may point to non-linear effects. It seems that injecting the input will cause an overall change in power, which can be explained by a relatively simple non-linear gain adaptation. These effects are already discussed at some length in the paper. 

      (4) I didn't really understand the point of comparing the VARX connectivity estimate with the sparse-inverse covariance method (Figure 2D). What was the point of this? What is a reader supposed to appreciate from it about the validity or otherwise of the VARX approach?

      We added the following motivation and clarification on this topic: 

      “To test the descriptive validity [43] of the VARX model we follow the approach of recovering structural connectivity from functional activity in simulation [44]. Specifically, we will compare the recurrent connectivity A derived from brain activity simulated assuming a given structural connectivity, i.e. we ask, can the VARX model recover the underlying structural connectivity, at least in a simulated whole-brain model with known connectivity? … For comparison, we also used the sparse-inverse covariance method to recover connectivity from the correlation matrix (functional connectivity). This method is considered state-of-the-art as it is more sensitive than other methods in detecting structural connections [48].”

      (5) I think the VARX model section could have benefitted a bit from putting some dimensions on some of the variables. In particular, I struggled a little to appreciate the dimensionality of A. I am assuming it has to involve both time lags AND electrode channels so that you can infer Granger causality (by including time) between channels. Including a bit more detail on the dimensionality and shape of A might be helpful for others who want to implement the VARX model.

      Your assumption is correct. We added the following to make this easier for readers: 

      “Therefore, A has dimensions [d_y, d_y, n_a] and B has dimensions [d_y, d_x, n_b], where d_y and d_x are the dimensions of y(t) and x(t), respectively.”
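
      For readers who want to implement the model, the sketch below shows one possible array layout for the two filter sets and a one-step prediction. The layout (output channel, input channel, lag), the symbol names d_y, d_x, n_a, n_b and the helper function are assumptions made for this illustration and may not match the published VARX code.

```python
import numpy as np

d_y, d_x = 3, 2      # number of recorded channels and of stimulus features (arbitrary)
n_a, n_b = 4, 5      # number of lags in the recurrent and input filters (arbitrary)

rng = np.random.default_rng(1)
A = 0.1 * rng.standard_normal((d_y, d_y, n_a))   # recurrent filters: channel j -> channel i, per lag
B = rng.standard_normal((d_y, d_x, n_b))         # input filters: feature j -> channel i, per lag


def predict_next(y_past, x_recent, A, B):
    """One-step prediction of y(t), without the innovation term.

    y_past:   (n_a, d_y) array, most recent first: y(t-1), ..., y(t-n_a)
    x_recent: (n_b, d_x) array, most recent first: x(t), ..., x(t-n_b+1)
    """
    recurrent = np.einsum("ijk,kj->i", A, y_past)   # sum over source channels and lags
    driven = np.einsum("ijk,kj->i", B, x_recent)    # sum over features and lags
    return recurrent + driven


y_past = rng.standard_normal((n_a, d_y))
x_recent = rng.standard_normal((n_b, d_x))
y_next = predict_next(y_past, x_recent, A, B)
print(A.shape, B.shape, y_next.shape)
```

      Because A carries a lag axis in addition to the two channel axes, a non-zero entry A[i, j, k] means that the past of channel j helps predict channel i, which is the basis for the Granger-style interpretation mentioned by the reviewer.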

      (6) A second issue I had with the inferences drawn by the authors was a difficulty in reconciling certain statements in the manuscript. For example, in the abstract, the authors write "We find that the recurrent connectivity during rest is largely unaltered during movie watching." And they also write that "Failing to account for ... exogenous inputs, leads to spurious connections in the intrinsic "connectivity".

      Perhaps this segment of the abstract needed more explanation. To enhance clarity we have also changed the ordering of the findings. Hopefully this is more clear now: 

      “This model captures the extrinsic effect of the stimulus and separates that from the intrinsic effect of the recurrent brain dynamic. We find that the intrinsic dynamic enhances and prolongs the neural responses to scene cuts, eye movements, and sounds. Failing to account for these extrinsic inputs, leads to spurious recurrent connections that govern the intrinsic dynamic. We also find that the recurrent connectivity during rest is reduced during movie watching.”

      Reviewer #2 (Public review):

      Summary:

      The authors apply the recently developed VARX model, which explicitly models intrinsic dynamics and the effect of extrinsic inputs, to simulated data and intracranial EEG recordings. This method provides a directed method of 'intrinsic connectivity'. They argue this model is better suited to the analysis of task neuroimaging data because it separates the intrinsic and extrinsic activity. They show: that intrinsic connectivity is largely unaltered during a movie-watching task compared to eyes open rest; intrinsic noise is reduced in the task; and there is intrinsic directed connectivity from sensory to higher-order brain areas.

      Strengths:

      (1) The paper tackles an important issue with an appropriate method.

      (2) The authors validated their method on data simulated with a neural mass model.

      (3) They use intracranial EEG, which provides a direct measure of neuronal activity.

      (4) Code is made publicly available and the paper is written well.

      Weaknesses:

      It is unclear whether a linear model is adequate to describe brain data. To the authors' credit, they discuss this in the manuscript. Also, the model presented still provides a useful and computationally efficient method for studying brain data - no model is 'the truth'.

      We fully agree and have nothing much to add to this, except to highlight the benefit of a linear model even as explanation for non-linear phenomena: 

      “The [noise-quenching] effect we found here can be explained by a VARX model with the addition of a divisive gain adaptation mechanism … The noise-quenching result and its explanation via gain adaptation shows the benefit of using a parsimonious linear model, which can suggest nonlinear mechanisms as simple corrections from linearity.”

      Appraisal of whether the authors achieve their aims:

      As a methodological advancement highlighting a limitation of existing approaches and presenting a new model to overcome it, the authors achieve their aim. Generally, the claims/conclusions are supported by the results.

      The wider neuroscience claims regarding the role of intrinsic dynamics and external inputs in affecting brain data could benefit from further replication with another independent dataset and in a variety of tasks - but I understand if the authors wanted to focus on the method rather than the neuroscientific claims in this manuscript.

      We fully agree. We added the following to the Discussion section:

      “Future studies should test if our findings replicate in independent iEEG datasets, including active tasks and whether they generalize to other neuroimaging modalities.”

      Impact:

      The authors propose a useful new approach that solves an important problem in the analysis of task neuroimaging data. I believe the work can have a significant impact on the field.

      Recommendations for the authors:  

      Reviewer #1 (Recommendations for the authors):

      Minor comments:

      (1) Did you mean "less" or "fewer" in the following sentence "..larger values lead to overfitting, i.e. less significant connections..."?

      We mean fewer. Thanks for catching this. 

      (2) I didn't see any equations showing how the regularization parameter lambda is incorporated into the framework.

      We prefer to defer the math and details of the algorithm to an earlier paper that has now been published. Instead we added the following clarification:

      “The VARX models were fitted to data with the Matlab version of the code [31] using conventional L2-norm regularization. The corresponding regularization parameter was set to 𝜆=0.3.”
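
      Since the response does not spell out how the regularization parameter enters the fit, the sketch below shows the textbook form of L2-regularized (ridge) least squares. It is an assumption that the published code follows this form; in particular, how 𝜆 is scaled relative to the data there may differ. The data and names are synthetic.

```python
import numpy as np


def ridge_fit(X, Y, lam):
    """Closed-form ridge solution: minimizes ||Y - X W||^2 + lam * ||W||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ Y)


rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))                     # regressors (e.g. lagged signals)
true_w = np.array([1.0, -2.0, 0.5, 0.0])
Y = X @ true_w + 0.1 * rng.standard_normal(50)

w_unregularized = ridge_fit(X, Y, lam=0.0)
w_ridge = ridge_fit(X, Y, lam=0.3)
print(np.linalg.norm(w_unregularized), np.linalg.norm(w_ridge))
```

      With 𝜆=0 this reduces to ordinary least squares. Any positive 𝜆 shrinks the weights toward zero, trading a small bias for lower variance in the estimates.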

      (3) I think some readers of this might struggle to understand the paragraph beginning "Connectivity plots are created with nilearn's plot_connectome() function...". It's all quite opaque for the uninitiated.

      Agreed. We now write more simply: 

      “Connectivity plots in Fig. 4 were created with routines from the nilearn toolbox [51].”

      (4) The paragraph beginning "The length of responses for Figure 5..." is also very opaque and could do with being explained more fully. Or this text could be removed from the methods and incorporated into the relevant results section where you actually discuss this analysis.

      Thank you for flagging this. We expand on the details in the Methods as follows: 

      “The length of responses for each channel in B and H to external inputs in Fig. 5 is computed with Matlab's findpeaks() function. This function returns the full-width at half of the peak maximum minus baseline. Power in each channel is computed as the squares of the responses averaged over the time window that was analyzed (0-0.6s).”
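
      The two quantities described here, response length as full width at half maximum and power as the mean squared response, can be sketched without Matlab. The code below is a simplified stand-in and not a port of findpeaks(): it handles a single peak, measures the half height relative to a user-supplied baseline, and does not interpolate between samples. The function name and the test signal are assumptions.

```python
import numpy as np


def response_length_and_power(response, times, t_min=0.0, t_max=0.6, baseline=0.0):
    """Full width at half maximum of the largest peak, and mean squared response.

    Both are computed within the window [t_min, t_max] (seconds).
    """
    response = np.asarray(response, dtype=float)
    times = np.asarray(times, dtype=float)
    window = (times >= t_min) & (times <= t_max)
    r, t = response[window], times[window]

    peak = int(np.argmax(r))
    half_level = baseline + 0.5 * (r[peak] - baseline)
    # Walk outward from the peak while the response stays at or above half height.
    left = peak
    while left > 0 and r[left - 1] >= half_level:
        left -= 1
    right = peak
    while right < len(r) - 1 and r[right + 1] >= half_level:
        right += 1

    width = t[right] - t[left]
    power = float(np.mean(r ** 2))
    return width, power


times = np.linspace(0.0, 0.6, 601)     # 1 ms resolution
sigma = 0.05
response = np.exp(-0.5 * ((times - 0.3) / sigma) ** 2)   # synthetic Gaussian response
width, power = response_length_and_power(response, times)
print(width, power)
```

      For a Gaussian the expected width is about 2.355 times sigma, so the synthetic example gives a quick sanity check of the width computation at the chosen sampling resolution.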

      (5) I think adding some comments to the text or caption related to Figures 3C and 3D would be helpful so readers can understand these numbers a bit better. One seems to be the delta log p value and the other is the delta ratio. What does positive or negative mean? Readers might appreciate a little more help.

      We expanded it as follows, hopefully this helps: 

      “C) Difference of log p-values for the VARX model without minus with inputs (panel A - B). Both models are fit to the same data. D) Thresholding panels A and B at p<0.0001 gives a fraction of significant connections. Here we show the fraction of significant channels for models with and without input. Each line is a patient, with color indicating increase or decrease. E) Mean over all channels for VARX models with and without inputs. Each line is a patient.”

      (6) It is not clear what the colors mean in Figures 4 E, F, G.

      We updated the color scheme for those figure panels and carefully explained it in the caption. Please see the manuscript for updated figure 4.   

      (7) It might be nice to slightly unpack what you mean by the "variability of the internal dynamic" and why it can be equated with the power of the innovation process.

      In the methods we added the following clarification right after defining the VARX model: 

      “The innovation process captures the internal variability of the model. Without it, repeating the same input would always result in a fixed deterministic output y(t).”

      In the results section we added the following: 

      “As a metric of internal variability we measured the power of the intrinsic innovation process e(t), which captures the unobserved “random” brain activity which leads to variations in the responses.”
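
      The role of the innovation process can be illustrated with a one-channel, first-order toy model. This is a sketch under assumed coefficients, not the fitted VARX model: the same input is presented repeatedly, with and without innovation, and the innovation power is then recovered from the one-step prediction residual.

```python
import numpy as np


def simulate(x, a=0.5, b=1.0, innovation_std=0.0, seed=None):
    """Toy model y(t) = a*y(t-1) + b*x(t) + e(t) with Gaussian innovation e(t)."""
    rng = np.random.default_rng(seed)
    y = np.zeros(len(x))
    for t in range(1, len(x)):
        e_t = innovation_std * rng.standard_normal()
        y[t] = a * y[t - 1] + b * x[t] + e_t
    return y


x = np.sin(np.linspace(0.0, 6.0 * np.pi, 300))   # the same input on every repeat

# Without innovation, repeats of the same input are identical.
repeat_1 = simulate(x)
repeat_2 = simulate(x)

# With innovation, repeats differ from trial to trial.
noisy_1 = simulate(x, innovation_std=0.5, seed=1)
noisy_2 = simulate(x, innovation_std=0.5, seed=2)

# Innovation power: mean squared one-step prediction residual (true a, b used here).
residual = noisy_1[1:] - 0.5 * noisy_1[:-1] - 1.0 * x[1:]
innovation_power = float(np.mean(residual ** 2))
print(np.array_equal(repeat_1, repeat_2), innovation_power)
```

      The residual power estimates the innovation variance (0.25 in this example), which is the quantity used above as a measure of internal variability.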

      (8) Typos etc.

      a) "... has been attributed to variability of ongoing dynamic"

      b) The manuscript refers to a Figure 3G, but there is no Figure 3G.

      c) n_a = n_a = 1. Is that a typo?

      d) fiction

      Thank you for catching these. We fixed them. 

      Reviewer #2 (Recommendations for the authors):

      (1) I'm curious about the authors' opinions on the conditions studied. Naively, eyes open rest and passive movie watching seem like similar conditions - were the authors expecting to see a difference with VARX? Do the authors expect that they would see bigger differences when there is a larger difference in sensory input, e.g. eyes closed rest vs movie watching? Given the authors are arguing the need to explicitly model external inputs, a real data example contrasting two very different external inputs might better demonstrate the model's utility.

      Thank you for this suggestion. We added an analysis of eyes-closed rest recordings, available in 8 patients (Fig. S8). The difference between movie and rest is indeed more pronounced than for eyes open rest. The result is described in the methods:

      “In a subset of patients with eyes-closed resting state we find the same effect, that is qualitatively more pronounced (Fig. S8).”

      This complements our updated finding of a difference between movie and eyes-open rest, which is significant after adding more data to this analysis. The results have been updated as follows:

      “The number of significant recurrent connections in  were significantly reduced during  movie watching compared to rest (Fig. 4C, fixed effect of stimulus:

      beta = -3.8*10<sup>-3</sup>, t(17) = -3.9, p<0.001), as is the effect size R (Fig. 4D, fixed effect of stimulus: beta = -2.5*10<sup>-4</sup>, t(17) = -4.1, p<0.001).”

      The abstract has been updated accordingly:

      “We also find that the recurrent connectivity during rest is reduced during movie watching.”

      (2) It would also have been interesting to see how the proposed model compares to DCM - however, I understand if the authors wanted to focus on their model rather than a comparison with other models.

      We did not try DCM for a number of reasons: (1) it does not allow for delays in the model dynamic (i.e. the entire time course of the response has to be captured by the recurrent dynamic of a single time step A); (2) it is computationally prohibitive and would not allow us to analyze large channel counts; (3) the available code is custom-made for fMRI or EEG analysis, with very specific signal generation models that do not obviously apply to iEEG. We added the following to the Discussion of DCM:

      “Similar to the VARX model, DCM includes intrinsic and extrinsic effects A and B. However, the modeling is limited to first-order dynamics (i.e. n<sub>a</sub>=n<sub>b</sub>=1). Thus, prolonged responses have to be entirely captured with a first-order recurrent A. … In contrast, here we have analyzed up to 300 channels per subject across the brain, which would be prohibitive with DCM. By analyzing a large number of recordings we were able to draw more general conclusions about whole-brain activity.”

      (3) I believe improving the consistency of the terminology used would improve the manuscript:

      a) Intrinsic dynamics vs intrinsic connectivity vs recurrent connectivity:

      - The term 'intrinsic dynamic' is first introduced in paragraph 3 of the introduction. An explicit definition of what is meant by this term would benefit the manuscript.

      - Sometimes the terminology changes to 'intrinsic connectivity' or 'recurrent connectivity'. An explicit definition of these terms (if they refer to different things) would also benefit the manuscript.

      We had used the terms “intrinsic” and “recurrent” interchangeably. We now try to mostly say “intrinsic dynamic” when we talk about the more general phenomenon of recurrent brain dynamic, while using “recurrent connectivity” when we refer to the model parameters A.

      We provide now a definition already at the start of the Abstract: 

      “Sensory stimulation of the brain reverberates in its recurrent neural networks. However, current computational models of brain activity do not separate immediate sensory responses from this intrinsic dynamic. We apply a vector-autoregressive model with external input (VARX), combining the concepts of “functional connectivity” and “encoding models”, to intracranial recordings in humans. This model captures the extrinsic effect of the stimulus and separates that from the intrinsic effect of the recurrent brain dynamic.”

      And at the start of the introduction: 

      “The primate brain is highly interconnected between and within brain areas. … We will refer to the dynamic driven by this recurrent architecture as the intrinsic dynamic of the brain.”

      b) Intrinsic vs Endogenous and Extrinsic vs Exogenous:

      - Footnote 1 defines the 'intrinsic' and 'extrinsic' terminology.

      - However, there are instances where the authors switch back to endogenous/exogenous.

      - Methods section: "Overall system response", paragraph 2.

      - Results section: "Recurrent dynamic enhances and prolongs stimulus responses".

      - Conclusions section.

      With a foot in both neuroscience and systems identification, it’s a hard habit to break. Thanks for catching it. We searched and replaced all instances of endogenous and exogenous.  

      (4) Methods:

      a) The model equation would be clearer if the convolution was written out fully. (I had to read reference 1 to understand the model.).

      We now spell out the full equation and hope it's not too cumbersome to read:  

      “For the th signal channel the recurrence of the VARX model is given by: 
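      For orientation, a written-out VARX recurrence typically takes the form sketched here. This is an assumption based on the standard definition of a vector-autoregressive model with external input, not a transcription of the authors' equation: the channel index i, the channel counts d<sub>y</sub> and d<sub>x</sub>, and the convention that the input filter starts at lag zero are all illustrative choices.

```latex
y_i(t) \;=\; \sum_{j=1}^{d_y} \sum_{k=1}^{n_a} A_{ij}(k)\, y_j(t-k)
\;+\; \sum_{j=1}^{d_x} \sum_{k=0}^{n_b-1} B_{ij}(k)\, x_j(t-k)
\;+\; e_i(t)
```

      Here the first term is the recurrent (intrinsic) contribution, the second is the feed-forward (extrinsic) contribution of the input x, and e<sub>i</sub>(t) is the innovation.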

      b) How is an individual dimension omitted in the reduced model, are the values in the y, x set to zero?

      No, it is actually removed from the linear prediction. We added: 

      “… omitted from the prediction …”

      c) "The p-value quantifies the probability that a specific connection in A or B is zero" - for each of n_a/n_b filters?

      d) It should be clarified that D is a vector.

      We hope the following clarification addresses both these questions: 

      “The p-value quantifies the probability that a specific connection in either A or B is zero. Therefore, D,P and R<sup>2</sup> all have dimensions or for A or B  respectively.”

      (5) Results:

      a) Stimulus-induced reduction of noise in the intrinsic activity: would be good to define the frequency range for theta and beta in paragraph 2.

      Added. 

      b) Neural mass model simulation:

      - A brief description of what was simulated is needed.

      We basically ran the sample code of the neurolib library. With that in mind maybe the description we already provide is sufficient:  

      “We used the default model simulation of the neurolib python library (using their sample code for the “ALNModel”), which is a mean-field approximation of adaptive exponential integrate-and-fire neurons. This model can generate simulated mean firing rates in 80 brain areas based on connectivity and delay matrices determined with diffusion tensor imaging (DTI). We used 5 min of “resting state” activity (no added stimulus, simulated at 0.1ms resolution, subsequently downsampled to 100Hz).”

      - It's not clear to me why the A matrix should match the structural connectivity.

      We added the following introduction to make the purpose of this simulation clear:

      “To test the descriptive validity [43] of the VARX model we follow the approach of recovering structural connectivity from functional activity in simulation. [44] Specifically, we will compare the “connectivity” A derived from brain activity simulated assuming a given structural connectivity, i.e. we ask, can the VARX model recover the underlying structural connectivity, at least in a simulated whole-brain model with known connectivity?”

      - It would be interesting to see the inferred A matrix.

      We added a Supplement figure for this and the following: 

      “The VARX model was estimated with n<sub>a</sub>=2, and no input. The resulting estimate for A is dominated by the diagonal elements that capture the autocorrelation within brain areas (Fig. S1).”
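      Estimating a model with n<sub>a</sub>=2 and no input amounts to a vector-autoregressive fit. The idea can be sketched with ordinary least squares on synthetic data. This is an illustrative assumption, not the authors' estimator: the function name `fit_var`, the toy matrices in `A_true`, and the use of `numpy.linalg.lstsq` are all chosen here for the example. The toy system has strong diagonal (autocorrelation) terms and weak off-diagonal coupling, which the fit recovers.

```python
import numpy as np

def fit_var(y, n_a):
    """Least-squares estimate of A[k], k = 1..n_a, in y(t) = sum_k A[k] y(t-k) + e(t).

    y has shape (T, d). Returns an array of shape (n_a, d, d).
    """
    T, d = y.shape
    # Each row of X holds [y(t-1), y(t-2), ..., y(t-n_a)] for t = n_a .. T-1.
    X = np.hstack([y[n_a - k:T - k] for k in range(1, n_a + 1)])
    Y = y[n_a:]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)   # shape (d * n_a, d)
    # Rearrange so that A_hat[k-1][i, j] multiplies y_j(t-k) in the equation for y_i(t).
    return coef.T.reshape(d, n_a, d).transpose(1, 0, 2)

# Hypothetical ground truth: diagonal-dominated, stable, two lags.
A_true = np.array([
    [[0.5, 0.1, 0.0], [0.0, 0.5, 0.1], [0.0, 0.0, 0.5]],
    [[-0.2, 0.0, 0.0], [0.0, -0.2, 0.0], [0.0, 0.0, -0.2]],
])

rng = np.random.default_rng(0)
T = 20000
y = np.zeros((T, 3))
for t in range(2, T):
    y[t] = A_true[0] @ y[t - 1] + A_true[1] @ y[t - 2] + rng.standard_normal(3)

A_hat = fit_var(y, n_a=2)
print(np.round(A_hat[0], 2))   # first-lag estimate, dominated by the diagonal
```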

      - How many filters were used here?

      No input filters were used for this simulation:

      We used 5 min of “resting state” activity (no added stimulus, simulated at 0.1ms resolution, subsequently downsampled to 100Hz). 

      c) Intracranial EEG:

      - It's not clear how overfitting was measured and how the selection of the number of filters (n_a and n_b) was done.

      We have removed the statement about overfitting. Mostly the word is used in the context of testing on a separate dataset, which we did not do here. So this “overfitting” can be confusing. Instead we used the analytic p-value as indication that a larger model order is not supported by the data. We write this now as follows: 

      “Increasing the number of delays n<sub>a</sub>, increases estimated effect size R (Fig. S3A,B), however, larger values lead to fewer significant connections (Fig. S3C). Significance (p-value) is computed analytically, i.e. non-parametrically, based on deviance. Values around n<sub>a</sub>=6 time delays appear to be the largest model order supported by this statistical analysis.”

      d) Figure 1:

      - Typo: "auto-regressive"

      Fixed. Thanks for catching that. 

      - LFP and BHA in C are defined much later in the text, would be useful to define these in the caption.

      - Shouldn't B (the VARX model parameter) be a 2x3 matrix for different time lags?

      Hopefully the following clarifications address both these points: 

      “C) Example of neural signal y(t) recorded at a single location in the brain. We will analyze local field potentials (LFP) and broad-band high frequency activity (BHA) in separate analyses.  D) Examples of filters B for individual feed-forward connections between an extrinsic input and a specific recording location in the brain.”

      (6) Discussion:

      I could not find Muller et al 2016 listed in the references.

      Added. Thanks for catching that omission. 

      Additional edits prompted by reviewers, but not in the context of any particular comment.

      While reviewers did not raise the following point, we felt the need to clarify the terminology in the Methods to make sure there is no misunderstanding in the proposed interpretation of the model:

      “We will refer to the filters in matrices A and B as recurrent and feed-forward “connections”, but avoid the use of the word “causal” which can be misleading.”

      In addressing questions to Figure 4, we noticed that there is quite a bit of variability across patients, so the analysis for Figure 4 and 7 which combines data across patients now accounts for a random effect of patient (previously we have used mean values for repeated measures). We added the following to the Methods to explain this:

      “To compare recurrent connectivity between movies and the resting-state (in Fig. 4), we compute VARX models in four different movie segments of 5 minutes length to match the length of the resting state recording. We use the first and second half of ‘Despicable Me English’, the first half of ‘Inscapes’ and one of the ‘Monkey’ movies. 18 patients include each of these recordings. For each recording in each patient we compute the fraction of significant channels (p<0.001) and average the effect size R across all channel pairs, excluding the diagonal. We test the difference between movies and resting-state with linear mixed-effect models with stimulus as fixed effect (movie vs rest), and patient as random effect (to account for the repeated measures for the different video segments), using matlab’s fitlme() routine. For the analysis of asymmetry of recurrent connectivity (in Fig. 4) we also used a mixed-effect model with T1w/T2w ratio as fixed effect and patients as random effect (to account for the repeated measures in multiple brain locations).”

      All analyses were rerun with more data (eyes closed resting) and 2 additional patients that have become available since the first submission. Therefore all figures and statistics have been updated throughout the paper. Other than the difference between movies and resting state which was trending before and is now significant, no results changed.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer 1:

      Comment 0: In this paper, the authors develop a comprehensive program to investigate the organization of chromosome structures at 100 kb resolution. It is extremely well executed. The authors have thought through all aspects of the problem. The resulting software will be most useful to the community. Interestingly they capture many experimental observations accurately.

      I have very few complaints.

      We appreciate the reviewer’s strong assessment of the paper’s significance, novelty, and broad interest, and we thank them for the detailed suggestions and comments.

      Comment 1: The number of parameters in the energy function is very large. Is there any justification for this? Could they simplify the functions?

      We extend our gratitude to the reviewer for their insightful remarks. The parameters within our model can be categorized into two groups: those governing chromosome-chromosome interactions and those governing chromosome-nuclear landmark interactions.

      In terms of chromosome-chromosome interactions, the parameter count is relatively modest compared to the vast amount of Hi-C data available. For instance, while the whole-genome Hi-C matrix at 100 kb resolution encompasses approximately 30321<sup>2</sup> contacts, our model comprises merely six parameters for interactions among different compartments, along with 1000 parameters for the ideal potential. As outlined in the supporting information, the ideal potential is contingent upon sequence separation, with 1000 chosen to encompass bead separations of up to 100 Mb. While it is theoretically plausible to reduce the number of parameters by assuming interactions cease beyond a certain sequence separation, determining this scale a priori presents a challenge.

      During the parameterization process, we observed that interchromosomal contacts predicted solely based on compartmental interactions inadequately mirrored Hi-C data. Consequently, we introduced 231 additional parameters to more accurately capture interactions between distinct pairs of autosomes. These interactions may stem from factors such as non-coding RNA or proteins not explicable by simple, non-specific compartmental interactions.

      Regarding parameters concerning chromosome-nuclear landmark interactions, we have 30321 parameters for speckles and 30321 for the nuclear lamina. To streamline the model, we opted to assign a unique parameter to each chromatin bead. However, it is conceivable that many chromatin beads share a similar mechanism for interacting with nuclear lamina or speckles, potentially allowing for a common parameter assignment. Nonetheless, implementing such simplification necessitates a deeper mechanistic understanding of chromosome-nuclear landmark interactions, an aspect currently lacking.

      As our comprehension of nuclear organization progresses, the interpretability of parameter counts may improve, facilitating their reduction.
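      The counts quoted in this response can be cross-checked with simple arithmetic. The sketch below only restates the numbers given above; the variable names are illustrative, and the assumption of 22 autosomes is added here to show that the 231 interchromosomal parameters coincide with the number of distinct autosome pairs.

```python
from math import comb

n_beads = 30321            # 100 kb beads, one landmark parameter per bead (as quoted)
n_compartment = 6          # compartment-compartment interaction parameters
n_ideal = 1000             # ideal-potential parameters, one per bead separation
bead_size_kb = 100

n_autosomes = 22           # assumption: human autosomes
n_interchrom = comb(n_autosomes, 2)   # distinct pairs of autosomes

chr_chr_params = n_compartment + n_ideal + n_interchrom
landmark_params = 2 * n_beads          # speckles + nuclear lamina
hic_entries = n_beads ** 2             # size of the whole-genome contact matrix
ideal_range_mb = n_ideal * bead_size_kb / 1000

print("interchromosomal pair parameters:", n_interchrom)
print("chromosome-chromosome parameters:", chr_chr_params)
print("landmark parameters:", landmark_params)
print("Hi-C matrix entries:", hic_entries)
print("ideal potential range (Mb):", ideal_range_mb)
```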

      Comment 2: What would the modification be if the resolution is increased?

      To increase the resolution of chromatin, we can in principle keep the same energy function as defined in Eq. S6. In this case, we only need to carry out further parameter optimization.

      However, transitioning to higher resolutions may unveil additional features not readily apparent at 100kb. Notably, chromatin loops with an average size of 200kb or smaller have been identified in high-resolution Hi-C data [1]. To effectively capture these loops, new terms in the energy function must be incorporated. For instance, Qi and Zhang [2] employed additional contact potentials between CTCF sites to account for loop formation. Alternatively, an explicit loop-extrusion process could be introduced to model loop formation more accurately.

      Comment 3: They should state that the extracted physical values are scale-dependent. For example, viscosity.

      We thank the reviewer for the comment and would like to clarify that our model does not predict the viscosity. The nucleoplasmic viscosity was set to 1 Pa·s to produce a diffusion coefficient that reproduces the experimental value. The exact value of the nucleoplasmic viscosity is still rather controversial, and our selected value falls within the range of reported experimental values, from 10<sup>−1</sup> Pa·s to 10<sup>2</sup> Pa·s.

      We have modified the main text to clarify the calculation of the diffusion coefficient.

      “The exponent and the diffusion coefficient D<sub>α</sub> = (27±11)×10<sup>−4</sup> μm<sup>2</sup>·s<sup>−α</sup> both match well with the experimental values [cite], upon setting the nucleoplasmic viscosity as 1 Pa·s (see Supporting Information Section: Mapping the reduced time unit to real time for more details).”

      Reviewer 2:

      Comment 0: In this work, Lao et al. develop an open-source software (OpenNucleome) for GPU-accelerated molecular dynamics simulation of the human nucleus accounting for chromatin, nucleoli, nuclear speckles, etc. Using this, the authors investigate the steady-state organization and dynamics of many of the nuclear components.

      We thank the reviewer for the summary of our work.

      Comment 1: The authors could introduce a table having every parameter and the optimal parameter value used. This would greatly help the reader.

      We would like to point out that model parameters are indeed provided in Table S1, S2, S3, S4, and Fig. S7. In these tables, we further provided details on how the parameters were determined.

      Given the large number of parameters for the ideal potential (1000), we opted to plot it rather than listing out all the numbers. We added three new figures to plot the interaction parameters between chromosomes, between chromosomes and speckles, and between chromosomes and the nuclear lamina. Numerical values can be found online in the GitHub repository (parameters).

      Comment 2: How many total beads are simulated? Do all beads have the same size?

      The total number of coarse-grained beads is 70542, including 60642 chromatin beads, 300 nucleolus beads, 1600 speckle beads, and 8000 nuclear lamina beads. The radius of the chromatin, nucleolus, and speckle beads is 0.25, while that of the lamina beads is 0.5. More information on the size and number of the beads is provided in the Section: Components of the whole nucleus model.

      Comment 3: In Equation S17, what do the 3rd and 4th powers mean? What necessitates them?

      The potential defined in Equation S17 follows the definition of the class2 bond in the LAMMPS package (LAMMPS docs). Compared to a typical harmonic potential, the presence of higher-order terms produces a sharper increase in the energy at large distances (Author response image 1). This essentially reduces the fluctuation of bond lengths in simulations.

      Author response image 1.

      Comparison between the Class2 potential (defined in Eq. S17) and the Harmonic potential (K(r − r<sub>0</sub>)<sup>2</sup>, with K = 20 and r<sub>0</sub> = 0.5).
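      The comparison in the image can be reproduced numerically. The sketch below uses the LAMMPS class2 bond form, E = K<sub>2</sub>(r − r<sub>0</sub>)<sup>2</sup> + K<sub>3</sub>(r − r<sub>0</sub>)<sup>3</sup> + K<sub>4</sub>(r − r<sub>0</sub>)<sup>4</sup>, together with the harmonic parameters K = 20 and r<sub>0</sub> = 0.5 quoted above. The class2 coefficients used here (all set to 20) are placeholders chosen for illustration, not the values from Eq. S17.

```python
import numpy as np

def harmonic(r, k=20.0, r0=0.5):
    """Harmonic bond energy K (r - r0)^2."""
    return k * (r - r0) ** 2

def class2(r, k2, k3, k4, r0=0.5):
    """LAMMPS class2 bond energy: quadratic plus cubic and quartic terms."""
    d = r - r0
    return k2 * d ** 2 + k3 * d ** 3 + k4 * d ** 4

# Bond stretching from the equilibrium length outward.
r = np.linspace(0.5, 1.5, 11)
e_harmonic = harmonic(r)
# Placeholder coefficients (assumption); the model's values are given in the SI.
e_class2 = class2(r, k2=20.0, k3=20.0, k4=20.0)

for ri, eh, ec in zip(r, e_harmonic, e_class2):
    print(f"r = {ri:.1f}  harmonic = {eh:6.2f}  class2 = {ec:6.2f}")
```

      With positive higher-order coefficients, the class2 energy is never below the harmonic one on the stretching side (r ≥ r<sub>0</sub>), and the gap widens with distance, which is what penalizes large bond extensions.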

      Comment 4: What do the X-axis and Y-axis numbers in Figure 5A and 5B mean? What are their units?

      We apologize for the lack of clarity in our original figure. In Fig. 5A, the X and Y axes depict the simulated and experimental radius of gyration (Rg) for individual chromosomes, as indicated in the title of the figure. Similarly, in Fig. 5B, the X and Y axes depict the simulated and experimental radial position of individual chromosomes.

      We have converted the chromosome Rg values into reduced units and labeled the corresponding axes in the updated figure (Fig. 5). The normalized radial position is unitless and its detailed definition is included in the supporting information Section: Computing simulated normalized chromosome radial positions. We updated the figure caption to provide an explicit reference to the SI text.

      Reviewer 3:

      Comment 0: In this work, the authors present the development of OpenNucleome, a software for simulating the structure and dynamics of the human nucleus. It provides a detailed model of nuclear components such as chromosomes and nuclear bodies, and uses GPU acceleration for better performance based on the OpenMM package. The work also shows the model’s accuracy in comparisons with experimental data and highlights the utility in the understanding of nuclear organization. While I consider this work a good tool for the genome architecture scientific community, I have some comments and questions that could further clarify the usage of this tool and help potential users. I also have a few questions that would help to clarify the technique and results and some suggestions for references.

      We appreciate the reviewer’s strong assessment of the paper’s significance, novelty, and broad interest, and we thank them for the detailed suggestions and comments.

      Comment 1: Could the authors elaborate on what they consider to be ’well-established and easily adoptable modeling tools’?

      By well established, we meant models that have been extensively validated and verified, and are highly regarded by the community.

      By easily adoptable, we meant tools that are well documented and can be learned relatively easily by new groups without help from the developers.

      We have revised the text to clarify our meaning.

      “Despite the progress made in computational modeling, the absence of well-documented software with easy-to-follow tutorials poses a challenge.”

      Comment 2: Recognizing the value of a diverse range of tools in the community, the Open-MiChroM tool is also an open-source platform built on top of OpenMM. The documentation shows various modeling approaches and many tutorials that contain different approaches besides the MiChroM energy function. How does OpenNucleome compare in terms of facilitating cross-validation and user accessibility? The two tools seem to be complementary, which is a gain to the field. I recommend adding one or two sentences on the matter. Also, while navigating the OpenNucleome GitHub, I have not found the tutorials mentioned in the text. I also see a barrier in the process of generating the necessary input files. I would suggest expanding the tutorials and documentation to help potential users.

      We thank the reviewer for the excellent comments. We agree that while many of the tutorials were included in the original package, they were not clearly documented. We have revised them extensively and now present:

      • A tutorial for optimizing chromosome-chromosome interactions.

      • A tutorial for optimizing chromosome-nuclear landmark interactions.

      • A tutorial for building initial configurations.

      • A tutorial for relaxing the initial configurations.

      • A tutorial for selecting the initial configurations.

      • A tutorial for setting up and performing Langevin dynamics simulations.

      • A tutorial for setting up and performing Brownian dynamics simulations.

      • A tutorial for setting up and performing simulations with a deformed nucleus.

      • A tutorial for analyzing simulation trajectories.

      • A tutorial for introducing new features to the model.

      These tutorials and our well-documented, open-source code (https://zhanggroup-mitchemistry.github.io/OpenNucleome) should significantly promote user accessibility. The Python scripts we include for analyzing simulation trajectories should allow users to compute various quantities for evaluating and comparing model quality.

      We added a new paragraph in the Section: Conclusions and Discussion of the main text to compare OpenNucleome with existing software for genome modeling.

      “Our software enhances the capabilities of existing genome simulation tools [cite]. Specifically, OpenNucleome aligns with the design principles of Open-MiChroM [cite], prioritizing open-source accessibility while expanding simulation capabilities to the entire nucleus. Similar to software from the Alber lab [cite], OpenNucleome offers high-resolution genome organization that faithfully reproduces a diverse range of experimental data. Furthermore, beyond static structures, OpenNucleome facilitates dynamic simulations with explicit representations of various nuclear condensates, akin to the model developed by [citet].”

      Comment 3: Lastly, I would appreciate it if the authors could expand their definition of ’standardized practices’.

      We apologize for any confusion caused. By “standardized practices,” we refer to the fact that different groups often employ their own procedures for structural modeling. These procedures differ in the representation of chromosomes, the nuclear environment, and the algorithms for parameter optimization. This absence of a consensus on the optimal practices for genome modeling can be daunting for newcomers to the field.

      We have revised the text to the following to avoid confusion:

      “Many research groups develop their own independent software, which complicates cross-validation and hinders the establishment of best practices for genome modeling [3–5].”

      Comment 4: On page 7, the authors refer to the SI Section: Components of the whole nucleus model for further details. Could the authors provide more information on the simulated density of nuclear bodies? Is there experimental data available that details the ratio of chromatin to other nuclear components, which was used as a reference in the simulation?

      We thank the reviewer for the comment. Imaging studies have provided quantitative measures of the size and number of various nuclear bodies. For example, there are 2 ∼ 5 nucleoli per nucleus, with the typical size R<sub>No</sub> ≈ 0.5 μm [6–10]. In the review by Spector and Lamond [11], the authors showed that there are 20 ∼ 50 speckles, with the typical size R<sub>Sp</sub> ≈ 0.3 μm. We used these numbers to guide our simulation of nuclear bodies. This information is given in the Section: Chromosomes as beads on the string polymers of the supporting information.

      The chromatin density is fixed by the average size of chromatin bead and the nucleus size. We chose the size of chromatin based on imaging studies as detailed in the Subsection: Mapping chromatin bead size to real unit of the supporting information. Upon fixing the bead size, the chromatin volume is determined.

      Comment 5: In the statement, ’the ideal potential is only applied for beads from the same chromosome to approximate the effect of loop extrusion by Cohesin molecules for chromosome compaction and territory formation,’ it would be helpful if the authors could clarify the scope of this potential. Specifically, the code indicates that the variable ’dend ideal’ is set at 1000, suggesting an interaction along a 100Mb polymer chain at a resolution of 100Kb per bead. Could the authors elaborate on their motivation for the Cohesin complex’s activity having a significant effect over such long distances within the polymer chain?

      We thank the reviewer for the insightful comment. They are correct that the ideal potential was introduced to capture chromosome folding beyond the interactions between compartments, including loop extrusion. Practically, we parameterized the ideal potential such that the simulated average contact probabilities as a function of sequence separation match the experimental values. The reviewer is correct that beyond a specific value of sequence separation, one would expect the impact of loop extrusion on chromosome folding to be negligible, due to Cohesin dissociation. Correspondingly, the interaction potential should be zero at large sequence separations.

      However, it is important to note that the precise separation scale cannot be known a priori. We chose 100 Mb as a conservative estimate. As can be seen from Fig. S7, our parameterization scheme indeed produced interaction parameters that are mainly zero at large sequence separations. Interestingly, the scale at which the potential approaches 0 (∼ 500 kb) agrees with the estimated length traveled by Cohesin molecules before dissociation [12].

      Comment 6: On pages 8 and 9, the authors discuss the optimization process. However, in reviewing the code and documentation available on the GitHub page, I could not find specific sections related to the optimization procedure described in the paper. In this context, I have a few questions: Could the authors provide more details or direct me to the parts of the documentation and the text/SI that address the optimization procedure used in their study? Additional clarification on the cost/objective function employed during the optimization process would be highly beneficial, as this was not readily apparent in the text.

      We thank the reviewer for the comment. We revised the SI to include the definition of the cost function for the Adam optimizer.

      “During the optimization process, our aim was to minimize the disparity between experimental findings and simulated data. To achieve this, we defined the cost function as follows:

      where the index i iterates over all the constraints defined in Eq. S28.”

      The detailed optimization procedure was included in the SI as quoted below

      “The details of the algorithm for parameter optimization are as follows

      (1) Starting with a set of values for and we performed 50 independent 3-million-step long MD simulations to obtain an ensemble of nuclear configurations. The first 500K steps of each trajectory are discarded as equilibration. We collected the configurations at every 2000 simulation steps from the rest of the simulation trajectories to compute the ensemble averages defined on the left-hand side of Eq. S13.

      (2) Check the convergence of the optimization by calculating the percentage of error defined as . The summation over i includes all the average contact probabilities defined in Eq. S28.

      (3) If the error is less than a tolerance value e<sub>tol</sub>, the optimization has converged, and we stop the simulations. Otherwise, we update the parameters, α, using the Adam optimizer [13]. With the new parameter values, we return to step one and restart the iteration.”
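      The simulate, check, update loop described in the quoted procedure can be sketched in a few lines. Everything below is a toy stand-in rather than the authors' implementation: the MD ensemble average is replaced by a hypothetical noisy surrogate (`noisy_simulated_contacts`), the cost is assumed to be a sum of squared differences, the relative-error formula is an assumption, and the Adam hyperparameters are common defaults chosen for illustration.

```python
import numpy as np

def noisy_simulated_contacts(alpha, rng, n_samples=200):
    """Toy surrogate for an MD ensemble average (assumption): contact
    probability exp(-alpha) plus sampling noise that shrinks with n_samples."""
    noise = rng.standard_normal(alpha.shape) * 0.05 / np.sqrt(n_samples)
    return np.exp(-alpha) + noise

def adam_fit(f_exp, alpha0, rng, lr=0.05, beta1=0.9, beta2=0.999,
             eps=1e-8, tol=0.02, max_iter=5000):
    """Iterate: estimate observables, check relative error, Adam update."""
    alpha = alpha0.astype(float).copy()
    m = np.zeros_like(alpha)   # running mean of the gradient
    v = np.zeros_like(alpha)   # running mean of the squared gradient
    for step in range(1, max_iter + 1):
        f_sim = noisy_simulated_contacts(alpha, rng)               # step (1)
        error = np.sum(np.abs(f_sim - f_exp)) / np.sum(np.abs(f_exp))
        if error < tol:                                            # step (2)
            break
        # Gradient of the assumed cost sum_i (f_sim_i - f_exp_i)^2,
        # evaluated with the noisy estimate f_sim.
        grad = -2.0 * (f_sim - f_exp) * np.exp(-alpha)
        m = beta1 * m + (1 - beta1) * grad                         # step (3)
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** step)
        v_hat = v / (1 - beta2 ** step)
        alpha -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return alpha, error, step

rng = np.random.default_rng(0)
alpha_true = np.array([0.3, 0.6, 0.9])     # hypothetical target parameters
f_exp = np.exp(-alpha_true)                # "experimental" contact probabilities
alpha_fit, final_error, n_iter = adam_fit(f_exp, np.zeros(3), rng)
print(alpha_fit, final_error, n_iter)
```

      Because each gradient is computed from a finite, noisy ensemble estimate, the objective is stochastic, which is the setting Adam was designed for; only first derivatives are needed, so no Hessian has to be estimated.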

      Previously, the optimization code was included as part of the analysis folder. To avoid confusion and improve readability, a separate folder named optimization has been created. This folder provides the Adam optimization of chromosome-chromosome interactions (chr-chr optimization) and chromosome-nuclear landmarks interactions (chr-NL optimization).

      Comment 7: What was the motivation for choosing the Adam algorithm for optimization? Adam is designed for training on stochastic objective functions. Could the authors elucidate on the ’stochastic’ aspect of their function to be optimized? Why the Adam algorithm was considered the most appropriate choice for this application?

      We thank the reviewer for the comment. As defined in Eq. R1, the cost function measures the difference between the simulated constraints and the corresponding experimental values. The estimation of simulated values, by averaging over an ensemble of chromosome configurations, is inherently noisy and stochastic. Exact ensemble averages can only be achieved with unlimited samples obtained from infinitely long simulations.

      In the past, we have used Newton’s method for parameterization, and the detailed algorithm can be found in the SI of Ref. 14. However, we found that Adam is more efficient as it is a first-order method. Newton’s method, on the other hand, is a second-order method and requires estimation of the Hessian matrix. When the number of constraints is large, as in our case, the computational cost of estimating the Hessian matrix can be significant. Another advantage of the Adam algorithm lies in its adjustment of the learning rate during the optimization to further speed up convergence.

      Comment 8: The authors mention that examples of setting up simulations, parameter optimization, and introducing new features are provided in the GitHub repository. However, I was unable to locate these examples. Could the authors guide me to these specific resources or consider adding them if they are not currently available?

      We thank the reviewer for the comment. We have improved the GitHub repository and all the tutorials can be found using the links provided in Response to Comment 2.

      Comment 9: Furthermore, the paper states that ’a configuration file that provides the position of individual particles in the PDB file format is needed to initialize the simulations.’ It would be beneficial for new users if the authors could elaborate on how this file is generated. And all other input files in general. Detailing the procedures for a new user to run their system using OpenNucleome would be helpful.

      We thank the reviewer for the comment. The procedure for generating initial configurations was explained in the SI Section: Initial configurations for simulations and quoted below.

      “We first created a total of 1000 configurations for the genome by sequentially generating the conformation of each one of the 46 chromosomes as follows. For a given chromosome, we start by placing the first bead at the center (origin) of the nucleus. The positions of the following beads, i, were determined from the (i − 1)-th bead as $\mathbf{r}_i = \mathbf{r}_{i-1} + 0.5\,\mathbf{v}$. v is a normalized random vector, and 0.5 was selected as the bond length between neighboring beads. To produce globular chromosome conformations, we rejected vectors, v, that led to bead positions with distance from the center larger than 4σ. Upon creating the conformation of a chromosome i, we shift its center of mass to a value $\mathbf{r}_i^{\mathrm{com}}$ determined as follows. We first compute a mean radial distance, with the following equation

      where Di is the average value of Lamin B DamID profile for chromosome i. Dhi and Dlo represent the highest and lowest average DamID values of all chromosomes, and 6σ and 2σ represent the upper and lower bound in radial positions for chromosomes. As shown in Fig. S6, the average Lamin B DamID profiles are highly correlated with normalized chromosome radial positions as reported by DNA MERFISH [cite], supporting their use as a proxy for estimating normalized chromosome radial positions. We then select as a uniformly distributed random variable within the range . Without loss of generality, we randomly chose the directions for shifting all 46 chromosomes.

      We further relaxed the 1000 configurations to build more realistic genome structures. Following an energy minimization process, one-million-step molecular dynamics (MD) simulations were performed starting from each configuration. Simulations were performed with the following energy function

      where UGenome is defined as in Eq. S7. UG-La is the excluded volume potential between chromosomes and lamina, i.e, only the second term in Eq. S24. Parameters in UGenome were from a preliminary optimization. The end configurations of the MD simulations were collected to build the final configuration ensemble (FCE).”

      The tutorial for preparing initial configurations can be found at this link.
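      The chain-growth step quoted above can be sketched as follows. This is an illustrative reimplementation, not the code from the OpenNucleome tutorial: the function names are hypothetical, lengths are assumed to be in units of σ, and the target center-of-mass distance `r_com` is passed in as an argument because the DamID-based expression for it is not reproduced here.

```python
import numpy as np

def random_walk_chromosome(n_beads, bond=0.5, r_max=4.0, rng=None):
    """Grow one chain from the origin, rejecting steps that leave a sphere of radius r_max."""
    rng = np.random.default_rng() if rng is None else rng
    pos = np.zeros((n_beads, 3))  # first bead sits at the nucleus center
    for i in range(1, n_beads):
        while True:
            v = rng.normal(size=3)
            v /= np.linalg.norm(v)           # normalized random direction
            trial = pos[i - 1] + bond * v
            if np.linalg.norm(trial) <= r_max:
                pos[i] = trial
                break
    return pos

def shift_to_radial_position(pos, r_com, rng):
    """Move the chain so its center of mass lies at distance r_com in a random direction."""
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    return pos - pos.mean(axis=0) + r_com * direction

rng = np.random.default_rng(1)
chain = random_walk_chromosome(200, rng=rng)
shifted = shift_to_radial_position(chain, r_com=3.0, rng=rng)
```

      The rejection step is what keeps each chain globular; the subsequent relaxation by energy minimization and molecular dynamics described in the quoted text is not part of this sketch.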

      Comment 10: In the section discussing the correlation between simulated and experimental contact maps, as referenced in Figure 4A and Figure S2, the authors mention a high degree of correlation. Could the authors specify the exact value of this correlation and explain the method used for its computation? Considering that comparing two Hi-C matrices involves a large number of data points, it would be helpful to know if all data points were included in this analysis.

      We have updated Fig 4A and S2 to include Pearson correlation coefficients next to the contact maps. The reviewer is correct in that all the non-redundant data points of the contact maps are included in computing the correlation coefficients.

      For improved clarity, we added a new section in the supporting information to detail the calculations. The section is titled Computing Pearson correlation coefficients between experimental and simulated contact maps, and the relevant text is quoted below.

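      “We computed the Pearson correlation coefficients (PCC) between experimental and simulated contact maps in Fig. 4A and Fig. S2 as

      $$\mathrm{PCC} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

      xi and yi represent the experimental and simulated contact probabilities, x̄ and ȳ are their means, and n is the total number of data points. Only non-redundant data points, i.e., half of the pairwise contacts, are used in the PCC calculation.”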
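      As a concrete illustration of using only the non-redundant entries, the sketch below computes the Pearson correlation over the upper triangle of two symmetric contact maps. The function name and the choice to exclude the main diagonal by default are assumptions, not details taken from the manuscript.

```python
import numpy as np

def contact_map_pcc(exp_map, sim_map, include_diagonal=False):
    """Pearson correlation over the upper-triangle entries of two symmetric maps."""
    exp_map = np.asarray(exp_map, dtype=float)
    sim_map = np.asarray(sim_map, dtype=float)
    if exp_map.shape != sim_map.shape or exp_map.shape[0] != exp_map.shape[1]:
        raise ValueError("maps must be square and of equal shape")
    k = 0 if include_diagonal else 1
    iu = np.triu_indices(exp_map.shape[0], k=k)  # each pair counted once
    x, y = exp_map[iu], sim_map[iu]
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

# Toy example: a random symmetric map and a linearly rescaled copy.
rng = np.random.default_rng(0)
a = rng.random((30, 30))
exp_demo = (a + a.T) / 2.0
sim_demo = 2.0 * exp_demo + 1.0
pcc_demo = contact_map_pcc(exp_demo, sim_demo)
```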

      Comment 11: In addition, the author said: ”Moreover, the simulated and experimental average contact probabilities between pairs of chromosomes agree well, and the Pearson correlation coefficient between the two datasets reaches 0.89.” How does this correlation behave when not accounting for polymer compaction or scaling? An analysis presenting the correlation as a function of genomic distance would be interesting.

      Author response image 2.

      Pearson correlation coefficient between experimental and simulated contact probabilities as a function of the sequence separation within specific chromosomes. For each chromosome, we first gathered a set of experimental contacts alongside a matching set of simulated ones for genomic pairs within a particular separation range. The Pearson correlation coefficient at the corresponding sequence separation was then determined using Equation R4. We limited the calculations to half of the chromosome length to ensure the availability of sufficient data.

      We thank the reviewer for the comment. The analysis presenting the correlation as a function of genomic distance (sequence separation) for each chromosome is shown in Figure S12 and also included in the SI. While the correlation coefficients decrease at larger separations, values around 0.5 are quite reasonable and comparable to results obtained using Open-MiChroM.
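      A distance-resolved correlation of this kind can be sketched by correlating the two maps one diagonal at a time. The version below is a simplification under stated assumptions: it treats each single separation as its own bin (the analysis above groups pairs within a separation range), it stops at half the chromosome length, and the function name is hypothetical.

```python
import numpy as np

def pcc_by_separation(exp_map, sim_map, max_sep=None):
    """Pearson correlation between two intra-chromosomal maps at each sequence separation."""
    exp_map = np.asarray(exp_map, dtype=float)
    sim_map = np.asarray(sim_map, dtype=float)
    n = exp_map.shape[0]
    max_sep = n // 2 if max_sep is None else max_sep  # keep enough pairs per separation
    result = {}
    for s in range(1, max_sep + 1):
        x = np.diagonal(exp_map, offset=s)  # all pairs (i, i + s)
        y = np.diagonal(sim_map, offset=s)
        if x.size < 3 or x.std() == 0.0 or y.std() == 0.0:
            continue  # correlation undefined for constant or tiny samples
        result[s] = float(np.corrcoef(x, y)[0, 1])
    return result

rng = np.random.default_rng(0)
a = rng.random((20, 20))
demo_map = (a + a.T) / 2.0
profile = pcc_by_separation(demo_map, demo_map)
```

      Because each diagonal contains contacts at a fixed separation, the dominant decay of contact probability with genomic distance no longer contributes to the correlation.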

      We also computed the correlation of whole genome contact maps after excluding intra-chromosomal contacts. The PCC decreased from 0.89 to 0.4. Again, the correlation coefficient is quite reasonable considering that these contacts are purely predicted by the compartmental interactions and were not directly optimized.

      Comment 12: I recommend using the web-server that is familiar to the authors to benchmark the OpenNucleome tool/model: ”3DGenBench: A Web-Server to Benchmark Computational Models for 3D Genomics.” Nucleic Acids Research, vol. 50, no. W1, July 2022, pp. W4-12.

      We appreciate the reviewer’s suggestion. Unfortunately, the website was no longer active at the time of the revision. However, as detailed in the Response to Comment 11, we used one of the popular metrics to exclude the polymer compaction effect and evaluate the agreement between simulation and experiments.

      Comment 13: Regarding the comparison of simulation results with microscopy data from reference 34. Given their different resolutions and data point/space groupings, how do the authors align these datasets? Could the authors describe how they performed this comparison? How were the radial positions calculated in both the simulations and experiments? Since the data from reference 34 indicates a non-globular shape of the nucleus; how did this factor into the calculation of radial distributions?

      We thank the reviewer for the comment and apologize for the confusion. First, the average properties we examined, including radial positions and interchromosomal contacts, were averaged over all genomic loci. Therefore, they are independent of data resolution.

      Secondly, instead of calculating the absolute radial positions, which are subject to variations in nucleus shape and size, we defined the normalized radial positions. They measure the ratio between the distance from the nucleus center to the chromosome center and the distance from the nucleus center to the lamina. This definition was frequently used in prior imaging studies to measure chromosome radial positions.

      The calculation of the simulated normalized radial positions and the experimental normalized radial positions are discussed in the Section: Computing simulated normalized chromosome radial positions

      “For a given chromosome i, we first determined its center of mass position, denoted as Ci. Starting from the center of the nucleus, O, we extend the vector vOC to identify the intersection point with the nuclear lamina as Pi. The normalized chromosome radial position for chromosome i is then defined as $r_i = \lVert \mathbf{v}_{OC} \rVert / \lVert \mathbf{v}_{OP} \rVert$, where ||·|| represents the L2 norm.”

      and Section: Computing experimental normalized chromosome radial positions.

      “We followed the same procedure outlined in Section: Computing simulated normalized chromosome radial positions to compute the experimental values. To determine the center of the nucleus using DNA MERFISH data, we used the algorithm, minimum volume enclosing ellipsoid (MVEE)[15], to fit an ellipsoid for each genome structure. The optimal ellipsoid, defined as $\{\mathbf{x} : (\mathbf{x} - \mathbf{c})^{T} A\, (\mathbf{x} - \mathbf{c}) \le 1\}$, is obtained by minimizing $\log \det A^{-1}$ subject to the constraint that $(\mathbf{x}_i - \mathbf{c})^{T} A\, (\mathbf{x}_i - \mathbf{c}) \le 1$. xi correspond to the list of chromatin positions determined experimentally.”
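      The two steps described above (fitting an enclosing ellipsoid, then measuring a chromosome's position relative to it) can be sketched as follows. This uses the widely circulated Khachiyan iteration for the minimum volume enclosing ellipsoid and assumes the ellipsoid is written as (x − c)ᵀ A (x − c) ≤ 1; the function names are hypothetical and this is not the authors' analysis script. For that ellipsoid form, the ray from the center c through the chromosome center C meets the surface at P = c + (C − c)/√q with q = (C − c)ᵀ A (C − c), so the ratio ||OC||/||OP|| reduces to √q.

```python
import itertools
import numpy as np

def mvee(points, tol=1e-4, max_iter=10000):
    """Approximate minimum volume enclosing ellipsoid via Khachiyan's algorithm.

    Returns (A, c) such that (x - c)^T A (x - c) <= 1 holds for the points,
    up to the numerical tolerance of the iteration.
    """
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    Q = np.vstack([points.T, np.ones(n)])  # lifted points, shape (d + 1, n)
    u = np.full(n, 1.0 / n)                # weights over the points
    for _ in range(max_iter):
        X = (Q * u) @ Q.T
        M = np.sum(Q * (np.linalg.inv(X) @ Q), axis=0)  # diag(Q^T X^-1 Q)
        j = int(np.argmax(M))
        step = (M[j] - d - 1.0) / ((d + 1.0) * (M[j] - 1.0))
        new_u = (1.0 - step) * u
        new_u[j] += step
        converged = np.linalg.norm(new_u - u) < tol
        u = new_u
        if converged:
            break
    c = points.T @ u
    A = np.linalg.inv(points.T @ (points * u[:, None]) - np.outer(c, c)) / d
    return A, c

def normalized_radial_position(com, A, c):
    """Ratio of center-to-chromosome distance over center-to-surface distance along the same ray."""
    diff = np.asarray(com, dtype=float) - c
    return float(np.sqrt(diff @ A @ diff))

# Toy check on the corners of a cube, whose enclosing ellipsoid is a sphere.
corners = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
A_demo, c_demo = mvee(corners)
r_corner = normalized_radial_position(corners[0], A_demo, c_demo)
```

      Because the position is a ratio along a single ray, it is insensitive to the overall size of the nucleus and accommodates non-spherical shapes.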

      Comment 14: In the sentence: ”It is evident that telomeres exhibit anomalous subdiffusive motion.” I recommend mentioning the work ”Di Pierro, Michele, et al., ”Anomalous Diffusion, Spatial Coherence, and Viscoelasticity from the Energy Landscape of Human Chromosomes.” Proceedings of the National Academy of Sciences, vol. 115, no. 30, July 2018, pp. 7753-58.”.

      We have revised the sentence to include the citation as follows.

      “In line with previous research [cite], telomeres display anomalous subdiffusive motion. When fitted with the equation $\mathrm{MSD}(t) = D_\alpha t^{\alpha}$, these trajectories yield a spectrum of α values, with a peak around 0.59.”
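      An exponent α of this kind is commonly extracted by fitting a power law to the mean squared displacement on a log-log scale. The sketch below assumes the form MSD(t) = D·t^α and a time-averaged MSD over a single trajectory; the function names and the toy trajectories are illustrative and do not come from the manuscript.

```python
import numpy as np

def time_averaged_msd(traj, max_lag):
    """Time-averaged mean squared displacement of one trajectory of shape (T, 3)."""
    traj = np.asarray(traj, dtype=float)
    lags = np.arange(1, max_lag + 1)
    msd = np.array([
        np.mean(np.sum((traj[lag:] - traj[:-lag]) ** 2, axis=1)) for lag in lags
    ])
    return lags, msd

def fit_power_law(lag_times, msd):
    """Least-squares fit of log(MSD) = log(D) + alpha * log(t); returns (alpha, D)."""
    alpha, log_d = np.polyfit(np.log(lag_times), np.log(msd), 1)
    return float(alpha), float(np.exp(log_d))

# Toy ballistic trajectory: one unit step along x per frame, so MSD = lag^2.
ballistic = np.outer(np.arange(50.0), [1.0, 0.0, 0.0])
lags_demo, msd_demo = time_averaged_msd(ballistic, max_lag=10)
alpha_demo, d_demo = fit_power_law(lags_demo, msd_demo)
```

      An exponent below 1 indicates subdiffusion, 1 corresponds to normal diffusion, and the ballistic toy trajectory above gives 2.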

      Comment 15: Regarding the observation that ’chromosomes appear arrested and no significant changes in their radial positions are observed over timescales comparable to the cell cycle,’ could the authors provide more details on the calculations or analyses that led to this conclusion? Specifically, information on the equilibration/relaxation time of chromosome territories relative to rearrangements within a cell cycle would be interesting.

      Our conclusion here was mostly based on the time trace of normalized radial positions shown in Figure 6A of the main text. Over the timescale of an entire cell cycle (24 hours), the little to no change in the radial positions supports glassy dynamics of chromosomes. We further determined the mean squared displacement (MSD) for the chromosome centers of mass. As shown in the left panel of Fig. S12, the MSDs are much smaller than the average size of chromosomes (see Rg values in Fig. 5A), supporting arrested dynamics.

      We further computed the auto-correlation function of the normalized chromosome radial position as

      where t indexes over the trajectory frames and r̄ is the mean position. As shown in Fig. S12, the positions are not completely decorrelated over 10 hours, again supporting slow dynamics. It would be interesting to examine the relaxation timescale more closely in future studies.
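      One common way to compute such an auto-correlation function from a time series of radial positions is sketched below. The normalization by the variance, so that the value at zero lag is 1, is an assumption, as is the function name.

```python
import numpy as np

def autocorrelation(r, max_lag):
    """Normalized auto-correlation of a 1D series for lags 0..max_lag."""
    r = np.asarray(r, dtype=float)
    dr = r - r.mean()          # fluctuation around the mean position
    var = np.mean(dr ** 2)
    n = r.size
    return np.array([
        np.mean(dr[: n - lag] * dr[lag:]) / var for lag in range(max_lag + 1)
    ])

# Toy series that flips sign every frame: perfectly anti-correlated at lag 1.
alternating = np.array([1.0, -1.0] * 10)
acf_demo = autocorrelation(alternating, max_lag=2)
```

      A curve that stays well above zero over the longest accessible lags indicates that the positions have not relaxed on that timescale.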

      Comment 16: The authors also comment on the SI ”Section: Initial configurations for simulations provides more details on preparing the 1000 initial configurations.” and related to reference 34 mentioning that ”the average Lamin B DamID profiles are highly correlated with chromosome radial positions as reported by DNA MERFISH”. How do the authors account for situations where homologous chromosomes are neighbors or have an interacting interface? Ref. 34 indicates that distinguishing between these scenarios can be challenging, potentially leading to ’invalid distributions’ that are filtered out. Clarification on how such cases were handled in the simulations would be helpful.

      We would like to first clarify that when comparing with experimental data, we averaged over the homologous chromosomes to obtain haploid data. We added the following text in the manuscript to emphasize this point

      “Given that the majority of experimental data were analyzed for the haploid genome, we adopted a similar approach by averaging over paternal and maternal chromosomes to facilitate direct comparison. More details on data analysis can be found in the Supporting Information Section: Details of simulation data analysis.”

      Furthermore, we used the processed DNA MERFISH data from the Zhuang lab, which unambiguously assigns a chromosome ID to each data point. Therefore, the issue mentioned by the reviewer is not present in the processed data. In our simulations, since we keep track of the explicit connection between genomic segments, the trace of individual chromosomes can be determined for any configuration. Therefore, there is no ambiguity in terms of simulation data.

      Comment 17: When discussing the interaction with the nuclear lamina and nuclear envelope deformation, I suggest mentioning the following studies: The already cited ref 52 and ”Contessoto, Vinícius G., et al. ”Interphase Chromosomes of the Aedes Aegypti Mosquito Are Liquid Crystalline and Can Sense Mechanical Cues.” Nature Communications, vol. 14, no. 1, Jan. 2023, p. 326.”

      We updated the text to include the suggested reference.

      “Numerous studies have highlighted the remarkable influence of nuclear shape on the positioning of chromosomes and the regulation of gene expression [16, 17].”

      Comment 18: The authors state that ’Tutorials in the format of Python Scripts with extensive documentation are provided to facilitate the adoption of the model by the community.’ However, as I mentioned, the documentation appears to be limited, and the available tutorials could benefit from further expansion. I suggest that the authors consider enhancing these resources to better assist users in adopting and understanding the model.

      As detailed in the Response to Comment 2, we have updated the GitHub repository to better document the included Jupyter notebooks and tutorials.

      Comment 19: In the Methods section, the authors discuss using Langevin dynamics for certain simulations and Brownian dynamics for others. Could the authors provide more detailed reasoning behind the choice of these different dynamics for different aspects of the simulation? Furthermore, it would be insightful to know how the results might vary if only one of these dynamics was utilized throughout the study. Such clarification would help in understanding the implications of these methodological choices on the outcomes of the simulations.

      We thank the reviewer for the comment. As detailed in the supporting information Section: Mapping the Reduced Time Unit to Real Time, the Brownian dynamics simulations provide a rigorous mapping to the biological timescale. By choosing a specific value for the nucleoplasmic viscosity, we determined the time unit in simulations as τ = 0.65 s. With this time conversion, the simulated diffusion coefficients of telomeres match well with experimental values. Therefore, Brownian dynamics simulations are recommended for computing time-dependent quantities, and the large damping coefficient mimics the complex nuclear environment well.

      On the other hand, the large damping coefficient slows down the configuration relaxation of the system significantly. For computing equilibrium statistical properties, it is useful to use a small coefficient and the Langevin integrator with large time steps to facilitate conformational relaxation.
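      For readers less familiar with the distinction, the sketch below shows a generic textbook Euler-Maruyama step for overdamped (Brownian) dynamics together with a conversion from reduced time to seconds using the τ = 0.65 s quoted above. It is not the integrator used in OpenNucleome; the function names, the reduced units, and the time step are assumptions.

```python
import numpy as np

TAU_SECONDS = 0.65  # reduced time unit, as quoted in the response above

def brownian_step(x, force, dt, gamma, kT, rng):
    """One Euler-Maruyama step of overdamped dynamics: drift from the force plus thermal noise."""
    noise = rng.normal(size=x.shape)
    return x + force(x) * dt / gamma + np.sqrt(2.0 * kT * dt / gamma) * noise

def reduced_to_seconds(n_steps, dt_reduced):
    """Convert a number of integration steps to physical time in seconds."""
    return n_steps * dt_reduced * TAU_SECONDS

# Toy run: free diffusion of many independent particles (zero force).
rng = np.random.default_rng(0)
x = np.zeros((20000, 3))
for _ in range(100):
    x = brownian_step(x, lambda pos: np.zeros_like(pos), dt=0.01, gamma=1.0, kT=1.0, rng=rng)
msd_free = float(np.mean(np.sum(x ** 2, axis=1)))  # expected near 6 * (kT / gamma) * t with t = 1
```

      In this scheme the displacement per step scales inversely with the damping coefficient, which is why a large damping coefficient yields realistic slow dynamics but inefficient sampling of equilibrium properties.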

      References

      [1] Rao, S. S.; Huntley, M. H.; Durand, N. C.; Stamenova, E. K.; Bochkov, I. D.; Robinson, J. T.; Sanborn, A. L.; Machol, I.; Omer, A. D.; Lander, E. S.; et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 2014, 159, 1665–1680.

      [2] Qi, Y.; Zhang, B. Predicting three-dimensional genome organization with chromatin states. PLoS computational biology 2019, 15, e1007024.

      [3] Yildirim, A.; Hua, N.; Boninsegna, L.; Zhan, Y.; Polles, G.; Gong, K.; Hao, S.; Li, W.; Zhou, X. J.; Alber, F. Evaluating the role of the nuclear microenvironment in gene function by population-based modeling. Nature Structural & Molecular Biology 2023, 1–14.

      [4] Junior, A. B. O.; Contessoto, V. G.; Mello, M. F.; Onuchic, J. N. A scalable computational approach for simulating complexes of multiple chromosomes. Journal of molecular biology 2021, 433, 166700.

      [5] Fujishiro, S.; Sasai, M. Generation of dynamic three-dimensional genome structure through phase separation of chromatin. Proceedings of the National Academy of Sciences 2022, 119, e2109838119.

      [6] Caragine, C. M.; Haley, S. C.; Zidovska, A. Nucleolar dynamics and interactions with nucleoplasm in living cells. Elife 2019, 8, e47533.

      [7] Brangwynne, C. P.; Mitchison, T. J.; Hyman, A. A. Active liquid-like behavior of nucleoli determines their size and shape in Xenopus laevis oocytes. Proceedings of the National Academy of Sciences 2011, 108, 4334–4339.

      [8] Farley, K. I.; Surovtseva, Y.; Merkel, J.; Baserga, S. J. Determinants of mammalian nucleolar architecture. Chromosoma 2015, 124, 323–331.

      [9] Qi, Y.; Zhang, B. Chromatin network retards nucleoli coalescence. Nature Communications 2021, 12, 6824.

      [10] Caragine, C. M.; Haley, S. C.; Zidovska, A. Surface fluctuations and coalescence of nucleolar droplets in the human cell nucleus. Physical review letters 2018, 121, 148101.

      [11] Spector, D. L.; Lamond, A. I. Nuclear speckles. Cold Spring Harbor perspectives in biology 2011, 3, a000646.

      [12] Banigan, E. J.; Mirny, L. A. Loop extrusion: theory meets single-molecule experiments. Current opinion in cell biology 2020, 64, 124–138.

      [13] Kingma, D. P.; Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 2014.

      [14] Zhang, B.; Wolynes, P. G. Topology, structures, and energy landscapes of human chromosomes. Proceedings of the National Academy of Sciences 2015, 112, 6062–6067.

      [15] Moshtagh, N.; et al. Minimum volume enclosing ellipsoid. Convex optimization 2005, 111, 1–9.

      [16] Brahmachari, S.; Contessoto, V. G.; Di Pierro, M.; Onuchic, J. N. Shaping the genome via lengthwise compaction, phase separation, and lamina adhesion. Nucleic Acids Res. 2022, 50, 1–14.

      [17] Contessoto, V. G.; Dudchenko, O.; Aiden, E. L.; Wolynes, P. G.; Onuchic, J. N.; Di Pierro, M. Interphase chromosomes of the Aedes aegypti mosquito are liquid crystalline and can sense mechanical cues. Nature Communications 2023, 14, 326.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Mackie and colleagues compare chemosensory preferences between C. elegans and P. pacificus, and the cellular and molecular mechanisms underlying them. The nematodes have overlapping and distinct preferences for different salts. Although P. pacificus lacks the lsy-6 miRNA important for establishing asymmetry of the left/right ASE salt-sensing neurons in C. elegans, the authors find that P. pacificus ASE homologs achieve molecular (receptor expression) and functional (calcium response) asymmetry by alternative means. This work contributes an important comparison of how these two nematodes sense salts and highlights that evolution can find different ways to establish asymmetry in small nervous systems to optimize the processing of chemosensory cues in the environment.

      Strengths:

      The authors use clear and established methods to record the response of neurons to chemosensory cues. They were able to show clearly that ASEL/R are functionally asymmetric in P. pacificus, and combined with genetic perturbation establish a role for che-1-dependent gcy-22.3 in the asymmetric response to NH<sub>4</sub>Cl.

      Weaknesses:

      The mechanism of lsy-6-independent establishment of ASEL/R asymmetry in P. pacificus remains uncharacterized.

      We thank the reviewer for recognizing the novel contributions of our work in revealing the existence of alternative pathways for establishing neuronal lateral asymmetry without the lsy-6 miRNA in a divergent nematode species. We are certainly encouraged now to search for genetic factors that alter the exclusive asymmetric expression of gcy-22.3.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Mackie et al. investigate gustatory behavior and the neural basis of gustation in the predatory nematode Pristionchus pacificus. First, they show that the behavioral preferences of P. pacificus for gustatory cues differ from those reported for C. elegans. Next, they investigate the molecular mechanisms of salt sensing in P. pacificus. They show that although the C. elegans transcription factor gene che-1 is expressed specifically in the ASE neurons, the P. pacificus che-1 gene is expressed in the Ppa-ASE and Ppa-AFD neurons. Moreover, che-1 plays a less critical role in salt chemotaxis in P. pacificus than C. elegans. Chemogenetic silencing of Ppa-ASE and Ppa-AFD neurons results in more severe chemotaxis defects. The authors then use calcium imaging to show that both Ppa-ASE and Ppa-AFD neurons respond to salt stimuli. Calcium imaging experiments also reveal that the left and right Ppa-ASE neurons respond differently to salts, despite the fact that P. pacificus lacks lsy-6, a microRNA that is important for ASE left/right asymmetry in C. elegans. Finally, the authors show that the receptor guanylate cyclase gene Ppa-gcy-22.3 is expressed in the right Ppa-ASE neuron (Ppa-ASER) but not the left Ppa-ASE neuron (Ppa-ASEL) and is required for some of the gustatory responses of Ppa-ASER, further confirming that the Ppa-ASE neurons are asymmetric and suggesting that Ppa-GCY-22.3 is a gustatory receptor. Overall, this work provides insight into the evolution of gustation across nematode species. It illustrates how sensory neuron response properties and molecular mechanisms of cell fate determination can evolve to mediate species-specific behaviors. However, the paper would be greatly strengthened by a direct comparison of calcium responses to gustatory cues in C. elegans and P. pacificus, since the comparison currently relies entirely on published data for C. elegans, where the imaging parameters likely differ. In addition, the conclusions regarding Ppa-AFD neuron function would benefit from additional confirmation of AFD neuron identity. Finally, how prior salt exposure influences gustatory behavior and neural activity in P. pacificus is not discussed.

      Strengths:

      (1) This study provides exciting new insights into how gustatory behaviors and mechanisms differ in nematode species with different lifestyles and ecological niches. The results from salt chemotaxis experiments suggest that P. pacificus shows distinct gustatory preferences from C. elegans. Calcium imaging from Ppa-ASE neurons suggests that the response properties of the ASE neurons differ between the two species. In addition, an analysis of the expression and function of the transcription factor Ppa-che-1 reveals that mechanisms of ASE cell fate determination differ in C. elegans and P. pacificus, although the ASE neurons play a critical role in salt sensing in both species. Thus, the authors identify several differences in gustatory system development and function across nematode species.

      (2) This is the first calcium imaging study of P. pacificus, and it offers some of the first insights into the evolution of gustatory neuron function across nematode species.

      (3) This study addresses the mechanisms that lead to left/right asymmetry in nematodes. It reveals that the ASER and ASEL neurons differ in their response properties, but this asymmetry is achieved by molecular mechanisms that are at least partly distinct from those that operate in C. elegans. Notably, ASEL/R asymmetry in P. pacificus is achieved despite the lack of a P. pacificus lsy-6 homolog.

      Weaknesses:

      (1) The authors observe only weak attraction of C. elegans to NaCl. These results raise the question of whether the weak attraction observed is the result of the prior salt environment experienced by the worms. More generally, this study does not address how prior exposure to gustatory cues shapes gustatory responses in P. pacificus. Is salt sensing in P. pacificus subject to the same type of experience-dependent modulation as salt sensing in C. elegans?

      We tested whether starving animals in the presence of a certain salt would result in those animals avoiding it. However, under our experimental conditions we were unable to detect experience-dependent modulation either in P. pacificus or in C. elegans.

      Author response image 1.

      (2) A key finding of this paper is that the Ppa-CHE-1 transcription factor is expressed in the Ppa-AFD neurons as well as the Ppa-ASE neurons, despite the fact that Ce-CHE-1 is expressed specifically in Ce-ASE. However, additional verification of Ppa-AFD neuron identity is required. Based on the image shown in the manuscript, it is difficult to unequivocally identify the second pair of CHE-1-positive head neurons as the Ppa-AFD neurons. Ppa-AFD neuron identity could be verified by confocal imaging of the CHE-1-positive neurons, co-expression of Ppa-che-1p::GFP with a likely AFD reporter, thermotaxis assays with Ppa-che-1 mutants, and/or calcium imaging from the putative Ppa-AFD neurons.

      In the revised manuscript, we provide additional and, we believe, conclusive evidence that the second pair of CHE-1-expressing neurons is correctly identified as the Ppa-AFD neurons. Specifically, we constructed and characterized two independent reporter strains of Ppa-ttx-1, a putative homolog of the AFD terminal selector in C. elegans. There are two pairs of ttx-1p::rfp-expressing amphid neurons. The anterior neuronal pair has finger-like endings that are unique to AFD neurons compared with the dendritic endings of the 11 other amphid neuron pairs (no neuron type has a wing morphology in P. pacificus). Their cell bodies are detected in the newly tagged TTX-1::ALFA strain and co-localize with the anterior pair of che-1::gfp-expressing amphid neurons (n=15, J2-Adult).

      We note that the identity of the posterior pair of amphid neurons differs between the ttx-1p::rfp promoter fusion reporter and the TTX-1::ALFA strains: the ttx-1p::rfp posterior amphid pair overlaps with the gcy-22.3p::gfp reporter (ASER), but the TTX-1::ALFA posterior amphid pair does not overlap with the posterior pair of che-1::gfp-expressing amphid neurons (n=15). Given that there are 4 splice forms detected by RNAseq (Transcriptome Assembly Trinity, 2016; www.pristionchus.org), this discrepancy between the Ppa-ttx-1 promoter fusion reporter and the endogenous expression of Ppa-TTX-1, which is C-terminally tagged on the only splice form containing Exon 18 (ppa_stranded_DN30925_c0_g1_i5, the most 3’ exon), may be due to differential expression of different splice variants in AFD, ASE, and another unidentified amphid neuron type.

      Although we also made reporter strains of two putative AFD markers, Ppa-gcy-8.1 (PPA24212)p::gfp; csuEx101 and Ppa-gcy-8.2 (PPA41407)p::gfp; csuEx100, neither reporter showed neuronal expression.

      (3) Loss of Ppa-che-1 causes a less severe phenotype than loss of Ce-che-1. However, the loss of Ppa-che-1::RFP expression in ASE but not AFD raises the question of whether there might be additional start sites in the Ppa-che-1 gene downstream of the mutation sites. It would be helpful to know whether there are multiple isoforms of Ppa-che-1, and if so, whether the exon with the introduced frameshift is present in all isoforms and results in complete loss of Ppa-CHE-1 protein.

      According to www.pristionchus.org (Transcriptome Assembly Trinity), there is only a single splice form detectable by RNAseq. Once we have a Ppa-AFD-specific marker, we will be able to determine how much of the AFD terminal effector identity (e.g. expression of gcy-8 paralogs) is affected by the loss of Ppa-che-1 function.

      (4) The authors show that silencing Ppa-ASE has a dramatic effect on salt chemotaxis behavior. However, these data lack control with histamine-treated wild-type animals, with the result that the phenotype of Ppa-ASE-silenced animals could result from exposure to histamine dihydrochloride. This is an especially important control in the context of salt sensing, where histamine dihydrochloride could alter behavioral responses to other salts.

      We had inadvertently left out this important control. Because the HisCl1 transgene is on a randomly segregating transgene array, we scored worms with and without the transgene expressing the co-injection marker (Ppa-egl-20p::rfp, a marker in the tail) to show that the presence of the transgene is necessary for the histamine-dependent knockdown of NH<sub>4</sub>Br attraction. This control has been added as Figure S2.

      (5) The calcium imaging data in the paper suggest that the Ppa-ASE and Ce-ASE neurons respond differently to salt solutions. However, to make this point, a direct comparison of calcium responses in C. elegans and P. pacificus using the same calcium indicator is required. By relying on previously published C. elegans data, it is difficult to know how differences in growth conditions or imaging conditions affect ASE responses. In addition, the paper would be strengthened by additional quantitative analysis of the calcium imaging data. For example, the paper states that 25 mM NH<sub>4</sub>Cl evokes a greater response in ASEL than 250 mM NH<sub>4</sub>Cl, but a quantitative comparison of the maximum responses to the two stimuli is not shown.

      We understand that side-by-side comparisons with C. elegans using the same calcium indicator would lend more credence to the differences we observed in P. pacificus versus published findings in C. elegans from the past decades, but we are not currently in a position to conduct these experiments in parallel.

      (6) It would be helpful to examine, or at least discuss, the other P. pacificus paralogs of Ce-gcy-22. Are they expressed in Ppa-ASER? How similar are the different paralogs? Additional discussion of the Ppa-gcy-22 gene expansion in P. pacificus would be especially helpful with respect to understanding the relatively minor phenotype of the Ppa-gcy-22.3 mutants.

      In P. pacificus, there are 5 gcy-22-like paralogs and 3 gcy-7-like paralogs, which together form a subclade that is clearly distinct from the 1-1 Cel-gcy-22, Cel-gcy-5, and Cel-gcy-7 orthologs in a phylogenetic tree containing all rGCs in P. pacificus, C. elegans, and C. briggsae (Hong et al, eLife, 2019). In Ortiz et al (2006 and 2009), Cel-gcy-22 stands out from other ASER-type gcy genes (gcy-1, gcy-4, gcy-5) in being located on a separate chromosome (Chr. V) as well as in having a wider range of defects in chemoattraction towards salt ions. Given that the 5 P. pacificus gcy-22-like paralogs are located on 3 separate chromosomes without clear synteny to their C. elegans counterparts, it is likely that the gcy-22 paralogs emerged from independent and repeated gene duplication events after the separation of the Caenorhabditis and Pristionchus lineages. Our reporter strains for two other P. pacificus gcy-22-like paralogs either did not exhibit expression in amphid neurons (Ppa-gcy-22.1p::GFP) or exhibited expression in multiple neuron types in addition to a putative ASE neuron (Ppa-gcy-22.4p::GFP). We have expanded the discussion on the other P. pacificus gcy-22 paralogs.

      (7) The calcium imaging data from Ppa-ASE is quite variable. It would be helpful to discuss this variability. It would also be helpful to clarify how the ASEL and ASER neurons are being conclusively identified during calcium imaging.

      For each animal, the orientation of the nose and vulva were recorded and used as a guide to determine the ventral and dorsal sides of the worm, and subsequently, the left and right sides of the worm. Accounting for the plane of focus of the neuron pairs as viewed through the microscope, it was then determined whether the imaged neuron was the worm’s left or right neuron of each pair. We added this explanation to the Methods.

      (8) More information about how the animals were treated prior to calcium imaging would be helpful. In particular, were they exposed to salt solutions prior to imaging? In addition, the animals are in an M9 buffer during imaging - does this affect calcium responses in Ppa-ASE and Ppa-AFD? More information about salt exposure, and how this affects neuron responses, would be very helpful.

      Prior to calcium imaging, animals were picked from their cultivation plates (using an eyelash pick to minimize bacteria transfer) and placed in loading solution (M9 buffer with 0.1% Tween20 and 1.5 mM tetramisole hydrochloride, as indicated in the Methods) until they were visibly completely immobilized.

      (9) In Figure 6, the authors say that Ppa-gcy-22.3::GFP expression is absent in the Ppa-che-1(ot5012) mutant. However, based on the figure, it looks like there is some expression remaining. Is there a residual expression of Ppa-gcy-22.3::GFP in ASE or possibly ectopic expression in AFD? Does Ppa-che-1 regulate rGC expression in AFD? It would be helpful to address the role of Ppa-che-1 in AFD neuron differentiation.

      In Figure 6C, the green signal is autofluorescence in the gut, and there is no GFP expression detected in any of the 55 che-1(-) animals we examined. We are currently developing AFD-specific rGC markers (gcy-8 homologs) to be able to examine the role of Ppa-CHE-1 in regulating AFD identity.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Abstract: 'how does sensory diversity prevail within this neuronal constraint?' - could be clearer as 'numerical constraint' or 'neuron number constraint'.

      We have clarified this passage as ‘…constraint in neuron number’.

      (2) 'Sensory neurons in the Pristionchus pacificus' - should get rid of the 'the'.

      We have removed the ‘the’.

      (3) Figure 2: We have had some good results with the ALFA tag using a similar approach (tagging endogenous loci using CRISPR). I'm not sure if it is a Pristionchus thing, or if it is a result of our different protocols, but our staining appears stronger with less background. We use an adaptation of the Finney-Ruvkun protocol, which includes MeOH in the primary fixation with PFA, and overcomes the cuticle barrier with some LN2 cracking, DTT, then H2O2. No collagenase. If you haven't tested it already it might be worth comparing the next time you have a need for immunostaining.

      We appreciate this suggestion. Our staining protocol uses paraformaldehyde fixation. We observed consistent and clear staining in only 4 neurons in CHE-1::ALFA animals, but more background signal from TTX-1::ALFA in Figure 2I-J that could benefit from an improved immunostaining protocol.

      (4) Page 6: 'By crossing the che-1 reporter transgene into a che-1 mutant background (see below), we also found that che-1 autoregulates its own expression (Figure 2F), as it does in C. elegans' - it took me some effort to understand this. It might make it easier for future readers if this is explained more clearly.

      We understand this confusion and have changed the wording in the Results, adding a supporting table with a more detailed account of che-1p::RFP expression in both ASE and AFD neurons in wildtype and che-1(-) backgrounds.

      (5) Line numbers would make it easier for reviewers to reference the text.

      We have added line numbers.

      (6) Page 7: is 250mM NH<sub>4</sub>Cl an ecologically relevant concentration? When does off-target/nonspecific activation of odorant receptors become an issue? Some discussion of this could help readers assess the relevance of the salt concentrations used.

      This is a great question, but one that is difficult to answer because experimental conditions often use 2.5 M salt as a point source to establish salt gradients, whereas ecologically relevant environments are very heterogeneous in salinity. Previous studies have reported that C. elegans can tolerate similar levels of salinity (0.20-0.30 M) without adverse effects (Hu et al., Analytica Chimica Acta 2015; Mah et al. Expedition 2017).

      (7) It would be nice for readers to have a short orientation to the ecological relevance of the different salts - e.g. why Pristionchus has a particular taste for ammonium salts.

      Pristionchus species are entomophilic and most frequently found to be associated with beetles in a necromenic manner. Insect cadavers could thus represent sources of ammonium in the soil. Additionally, ammonium salts could represent a biological signature of other nematodes that the predatory morphs of P. pacificus could interpret as prey. We have added the possible ecological relevance of ammonium salts into the Discussion.

      (8) Page 11: 'multiple P. pacificus che-1p::GCaMP strains did not exhibit sufficient basal fluorescence to allow for image tracking and direct comparison'. 500ms exposure to get enough signal from RCaMP is slow, but based on the figures it still seems enough to capture things. If image tracking was the issue, then using GCaMP6s with SL2-RFP or similar in conjunction with a beam splitter enables tracking when the GCaMP signal is low. Might be an option for the future.

      These are very helpful suggestions and we hope to eventually develop an improved che-1p::GCaMP strain for future studies.

      (9) Sometimes C. elegans genes are referred to as 'C. elegans [gene name]' and sometimes 'Cel [gene name]'. Should be consistent. Same with Pristionchus.

      We have now combed through and corrected the inconsistencies in nomenclature.

      (10) Pg 12 - '...supports the likelihood that AFD receives inputs, possibly neuropeptidergic, from other amphid neurons' - the neuropeptidergic part could do with some justification.

      Because the AFD neurons are not exposed directly to the environment through the amphid channel like the ASE and other amphid neurons, the calcium responses to salts detected in the AFD likely originate from sensory neurons connected to the AFD. However, because there is no synaptic connection from other amphid neurons to the AFD neurons in P. pacificus (unlike in C. elegans; Hong et al, eLife, 2019), it is likely that neuropeptides connect other sensory neurons to the AFDs. To avoid unnecessary confusion, we have removed “possibly neuropeptidergic.”

      (11) Pg16: the link to the Hallem lab codon adaptor has a space in the middle. Also, the paper should be cited along with the web address (Bryant and Hallem, 2021).

      We have now added the proper link, plus in-text citation. https://hallemlab.shinyapps.io/Wild_Worm_Codon_Adapter/ (Bryant and Hallem, 2021)

      Full citation:

      Astra S Bryant, Elissa A Hallem, The Wild Worm Codon Adapter: a web tool for automated codon adaptation of transgenes for expression in non-Caenorhabditis nematodes, G3 Genes|Genomes|Genetics, Volume 11, Issue 7, July 2021, jkab146, https://doi.org/10.1093/g3journal/jkab146

      Reviewer #2 (Recommendations for the authors):

      (1) In Figure 1, the legend states that the population tested was "J4/L4 larvae and young adult hermaphrodites," whereas in the main text, the population was described as "adult hermaphrodites." Please clarify which ages were tested.

      We have tested J4-Adult stage hermaphrodites and have made the appropriate corrections in the text.

      (2) The authors state that "in contrast to C. elegans, we find that P. pacificus is only moderately and weakly attracted to NaCl and LiCl, respectively." However, this statement does not reflect the data shown in Figure 1, where there is no significant difference between C. elegans and P. pacificus - both species show at most weak attraction to NaCl.

      Although there is no statistically significant difference in NaCl attraction between P. pacificus and C. elegans, NaCl attraction in P. pacificus is significantly lower than its attraction to all 3 ammonium salts when compared to C. elegans. We have rephrased this statement as relative differences in the Results and updated the Figure legend.

      (3) In Figure 1, the comparisons between C. elegans and P. pacificus should be made using a two-way ANOVA rather than multiple t-tests. Also, the sample sizes should be stated (so the reader does not need to count the circles) and the error bars should be defined.

      We performed the 2-way ANOVA to detect differences between C. elegans and P. pacificus for the same salt and between salts within each species. We also indicated the sample size on the figure and defined the error bars.

      Significance:

      For comparisons of different salt responses within the same species:

      - For C. elegans, NH<sub>4</sub>Br vs NH<sub>4</sub>Cl (**p<0.01), NH<sub>4</sub>Cl vs NH<sub>4</sub>I (* p<0.05), and NH<sub>4</sub>Cl vs NaCl (* p<0.05). All other comparisons are not significant.

      - For P. pacificus, all salts showed (****p<0.0001) when compared to NaAc and to NH<sub>4</sub>Ac, except for NH<sub>4</sub>Ac and NaAc compared to each other (ns). Also, NH<sub>4</sub>Cl showed (*p<0.05) and NH<sub>4</sub>I showed (***p<0.001) when compared with LiCl and NaCl. All other comparisons are not significant.

      For comparisons of salt responses between different species (N2 vs PS312):

      - NH<sub>4</sub>I and LiCl (*p<0.05); NaAc and NH<sub>4</sub>Ac (****p<0.0001)

      (4) It might be worth doing a power analysis on the data in Figure 3B. If the data are underpowered, this might explain why there is a difference in NH<sub>4</sub>Br response with one of the null mutants but not the other.

      For responses to NH<sub>4</sub>Cl, since both che-1 mutants (rather than just one) showed significant difference compared to wildtype, we conducted a power analysis based on the effect size of that difference (~1.2; large). Given this effect size, the sample size for future experiments should be 12 (ANOVA).

      For responses to NH<sub>4</sub>Br, given the effect size of the difference seen between wildtype (PS312) and ot5012 (~0.8; large), the sample size for future experiments should be 18 (ANOVA) for a power value of 0.8. Therefore, it is possible that the sample size of 12 for the current experiment was too small to detect a possible difference between the ot5013 allele and wildtype.
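
      As a rough illustration of how the required sample size scales with effect size, the calculation can be sketched as below. This is a minimal sketch under stated assumptions, not the ANOVA-based procedure used for the values above: it treats the effect size as Cohen's d for a two-group, two-sided comparison, uses a normal approximation, and assumes alpha = 0.05. The function name `n_per_group` is hypothetical. Its outputs are therefore not expected to reproduce the sample sizes of 12 and 18 quoted above.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided, two-group comparison.

    Assumptions (illustrative only): effect size is Cohen's d, both groups
    have equal size, and the test statistic is approximately normal.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = standard_normal.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)


# Effect sizes taken from the text (~1.2 and ~0.8); alpha is an assumption.
for d in (1.2, 0.8):
    print(f"d = {d}: about {n_per_group(d)} animals per group")
```

      The sketch only shows the qualitative point made above: a smaller effect size requires a larger sample to reach the same power.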

      (5) It would be helpful to discuss why silencing Ppa-ASE might result in a switch from attractive to repulsive responses to some of the tested gustatory cues.

      For similar assays using Ppa-odr-3p::HisCl1, increasing histamine concentration led to decreasing C.I. for a given odorant (myristate, a P. pacificus-specific attractant). It is likely that the amount of histamine treatment for knockdown to zero (i.e. without a valence change) will differ depending on the attractant.

      (6) The statistical tests used in Figure 3 are not stated.

      For Figure 3, we used two-way ANOVA with Dunnett’s post hoc test. We have now added the test to the figure legend.

      (7) It would be helpful to examine the responses of ASER to the full salt panel in the Ppa-gcy-22.3 vs. wild-type backgrounds.

      We understand that future experiments examining neuron responses to the full salt panel for wildtype and gcy-22.3 mutants would provide further information about the salts and specific ions associated with the GCY-22.3 receptor. However, we have tested a broader range of salts (although not yet the full panel) for behavioral assays in wildtype vs gcy-22.3 mutants, which we have included as part of an added Figure 8.

      (8) The controls shown in Figure S1 may not be adequate. Ideally, the same sample size would be used for the control, allowing differences between control worms and experimental worms to be quantified.

      Although we had not conducted an equal number of negative controls using green light without salt stimuli due to resource constraints (6 control vs ~10-19 test), we provided individual recordings with stimuli to show that conditions we interpreted as having responses rarely showed responses resembling the negative controls. Similarly, those we interpreted as having no responses to stimuli mostly resembled the no-stimuli controls (e.g. WT to 25 mM NH<sub>4</sub>Cl, gcy-22.3 mutant to 250 mM NH<sub>4</sub>Cl).

      (9) An osmolarity control would be helpful for the calcium imaging experiments.

      We acknowledge that future calcium imaging experiments featuring different salt concentrations could benefit from osmolarity controls.

      (10) In Figure S7, more information about the microfluidic chip design is needed.

      The chip design features a U-shaped worm trap to facilitate loading the worm head-first, with a tapered opening to ensure the worm fits snugly and will not slide too far forward during recording. The outer two chip channels hold buffer solution and can be switched open (ON) or closed (OFF) by the Valvebank. The inner two chip channels hold experimental solutions. The inner channel closer to the worm trap holds the control solution, and the inner channel farther from the worm trap holds the stimulant solution.

      We have added an image of the chip in Figure S7 and further description in the legend.

      (11) Throughout the manuscript, the discussion of the salt stimuli focuses on the salts more than the ions. More discussion of which ions are eliciting responses (both behavioral and neuronal responses) would be helpful.

      In Figure 7, the gcy-22.3 defect resulted in a statistically significant reduction in response only towards NH<sub>4</sub>Cl but not towards NaCl, which suggests ASER is the primary neuron detecting NH<sub>4</sub><sup>+</sup> ions. To extend the description of the gcy-22.3 mutant defects to other ions, we have added a Figure 8: chemotaxis on various salt backgrounds. We found only a mild increase in attraction towards NH<sub>4</sub><sup>+</sup> by both gcy-22.3 mutant alleles, whereas the mutants were wild-type in their responses toward Cl<sup>-</sup>, Na<sup>+</sup>, or I<sup>-</sup>. The switch in the direction of change between the behavioral (enhanced) and calcium imaging result (reduced) suggests the behavioral response to ammonium ions likely involves additional receptors and neurons.

      Minor comments:

      (1) The full species name of "C. elegans" should be written out upon first use.

      We have added ‘Caenorhabditis elegans’ to its first mention.

      (2) In the legend of Figure 1, "N2" should not be in italics.

      We have made the correction.

      (3) The "che-1" gene should be in lowercase, even when it is at the start of the sentence.

      We have made the correction.

      (4) Throughout the manuscript, "HisCl" should be "HisCl1."

      We have made these corrections to ‘HisCl1’.

      (5) Figure 3A would benefit from more context, such as the format seen in Figure 7A. It would also help to have more information in the legend (e.g., blue boxes are exons, etc.).

      (6) "Since NH<sub>4</sub>I sensation is affected by silencing of che-1(+) neurons but is unaffected in che-1 mutants, ASE differentiation may be more greatly impacted by the silencing of ASE than by the loss of che-1": I don't think this is exactly what the authors mean. I would say, "ASE function may be more greatly impacted...".

      We have changed ‘differentiation’ to ‘function’ in this passage.

      (7) In Figure 7F-G, the AFD neurons are referred to as AFD in the figure title but AM12 in the graph. This is confusing.

      Thank you for noticing this oversight. We have corrected “AM12” to “AFD”.

      (8) In Figure 7, the legend suggests that comparisons within the same genotype were analyzed. I do not see these comparisons in the figure. In which cases were comparisons within the same genotype made?

      Correct, we performed additional tests between ON and OFF states within the same genotypes (WT and mutant) but did not find significant differences. To avoid unnecessary confusion, we have removed this sentence.

      (9) The nomenclature used for the transgenic animals is unconventional. For example, normally the calcium imaging line would be listed as csuEx93[Ppa-che-1p::optRCaMP] instead of Ppa-che-1p::optRCaMP(csuEx93).

      We have made these corrections to the nomenclature.

      (10) Figure S6 appears to come out of order. Also, it would be nice to have more of a legend for this figure. The format of the figure could also be improved for clarity.

      We have corrected Figure S6 (now S8) and added more information to the legend.

      (11) Methods section, Chemotaxis assays: "Most assays lasted ~3.5 hours at room temperature in line with the speed of P. pacificus without food..." It's not clear what this means. Does it take the worms 3.5 hours to crawl across the surface of the plate?

      Correct, P. pacificus requires 3-4 hours to crawl across the surface of the plate, which is the standard time for chemotaxis assays for some odors and all salts. We have added this clarification to the Methods.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study provides valuable insights into how the brain parses the syntactic structure of a spoken sentence. A unique contribution of the work is to use a large language model to quantify how the mental representation of syntactic structure updates as a sentence unfolds in time. Solid evidence is provided that distributive cortical networks are engaged for incremental parsing of a sentence, although the contribution could be further strengthened if the authors would further highlight the main results and clarify the benefit of using a large language model.

      We thank the editors for the overall positive assessment. We have revised our manuscript to further emphasize our main findings and highlight the advantages of using a large language model (LLM) over traditional behavioural and corpus-based data.

      This study aims to investigate the neural dynamics underlying the incremental construction of structured interpretation during speech comprehension. While syntactic cues play an important role, they alone do not define the essence of this parsing process. Instead, this incremental process is jointly determined by the interplay of syntax, semantics, and non-linguistic world knowledge, evoked by the specific words heard sequentially by listeners. To better capture these multifaceted constraints, we derived structural measures from BERT, which dynamically represent the evolving structured interpretation as a sentence unfolds word-by-word.

      Typically, the syntactic structure of a sentence can be represented by a context-free parse tree, such as a dependency parse tree or a constituency-based parse tree, which abstracts away from specific content, assigning a discrete parse depth to each word regardless of its semantics. However, this context-free parse tree merely represents the result rather than the process of sentence parsing and does not elucidate how a coherent structured interpretation is concurrently determined by multifaceted constraints. In contrast, BERT parse depth, trained to approach the context-free discrete dependency parse depth, is a continuous variable. Crucially, its deviation from the corresponding discrete parse depth indicates the preference for the syntactic structure represented by this context-free parse. As BERT processes a sentence delivered word-by-word, the dynamic change of BERT parse depth reflects the incremental nature of online speech comprehension.

      Our results reveal a behavioural alignment between BERT parse depth and human interpretative preference for the same set of sentences. In other words, BERT parse depth could represent a probabilistic interpretation of a sentence’s structure based on its specific contents, making it possible to quantify the preference for each grammatically correct syntactic structure during incremental speech comprehension. Furthermore, both BERT and human interpretations show correlations with linguistic knowledge, such as verb transitivity, and non-linguistic knowledge, like subject noun thematic role preference. Both types of knowledge are essential for achieving a coherent interpretation, in accordance with the “constraint-based hypothesis” of sentence processing.

      Motivated by the observed behavioural alignment between BERT and human listeners, we further investigated BERT structural measures in source-localized EEG/MEG using representational similarity analyses (RSA). This approach revealed the neural dynamics underlying incremental speech comprehension on millisecond scales. Our main findings include: (1) a shift from bi-hemispheric lateral frontal-temporal regions to left-lateralized regions in representing the current structured interpretation as a sentence unfolds, (2) a pattern of sequential activations in the left lateral temporal regions, updating the structured interpretation as syntactic ambiguity is resolved, and (3) the influence of lexical interpretative coherence activated in the right hemisphere over the resolved sentence structure represented in the left hemisphere.

      From our perspective, the advantages of using an LLM (or deep language model) like BERT are twofold. Conceptually, BERT structural measures offer a deep contextualized structural representation for any given sentence by integrating the multifaceted constraints unique to the specific contents described by the words within that sentence. Modelling this process on a word-by-word basis is challenging to achieve with behavioural or corpus-based metrics. Empirically, as demonstrated in our responses to the reviewers below, BERT measures show better performance compared to behavioural and corpus-based metrics in aligning with listeners’ neural activity. Moreover, when it comes to integrating multiple sources of constraints for achieving a coherent interpretation, BERT measures also show a better fit with the behavioural data of human listeners than corpus-based metrics.

      Taken together, we propose that LLMs, akin to other artificial neural networks (ANNs), can be considered as computational models for formulating and testing specific neuroscientific hypotheses, such as the “constraint-based hypothesis” of sentence processing in this study. However, we by no means overlook the importance of corpus-based and behavioural metrics. These metrics play a crucial role in interpreting and assessing whether and how ANNs simulate human cognitive processes, a fundamental step in employing ANNs to gain new insights into the neural mechanisms of human cognition.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this study, the authors investigate where and when brain activity is modulated by incoming linguistic cues during sentence comprehension. Sentence stimuli were designed such that incoming words had varying degrees of constraint on the sentence's structural interpretation as participants listened to them unfolding, i.e. due to varying degrees of verb transitivity and the noun's likelihood of assuming a specific thematic role. Word-by-word "online" structural interpretations for each sentence were extracted from a deep neural network model trained to reproduce language statistics. The authors relate the various metrics of word-by-word predicted sentence structure to brain data through a standard RSA approach at three distinct points of time throughout sentence presentation. The data provide convincing evidence that brain activity reflects preceding linguistic constraints as well as integration difficulty immediately after word onset of disambiguating material.

      We thank Reviewer #1 (hereinafter referred to as R1) for their recognition of the objectives of our study and the analytical approaches we have employed in this study.

      The authors confirm that their sentence stimuli vary in degree of constraint on sentence structure through independent behavioral data from a sentence continuation task. They also show a compelling correlation of these behavioral data with the online structure metric extracted from the deep neural network, which seems to pick up on the variation in constraints. In the introduction, the authors argue for the potential benefits of using deep neural network-derived metrics given that it has "historically been challenging to model the dynamic interplay between various types of linguistic and nonlinguistic information". Similarly, they later conclude that "future DLMs (...) may provide new insights into the neural implementation of the various incremental processing operations(...)".

      We appreciate R1’s positive comments on the design, quantitative modelling and behavioural validation of the sentence stimuli used in this experiment.

      By incorporating structural probing of a deep neural network, a technique developed in the field of natural language processing, into the analysis pipeline for investigating brain data, the authors indeed take an important step towards establishing advanced machine learning techniques for researching the neurobiology of language. However, given the popularity of deep neural networks, an argument for their utility should be carefully evidenced.

      We fully concur with R1 regarding the need for cautious evaluation and interpretation of deep neural networks’ utility. In fact, this perspective underpinned our decision to conduct extensive correlation analyses using both behavioural and corpus-based metrics to make sense of BERT metrics. These analyses were essential to interpret and validate BERT metrics before employing them to investigate listeners’ neural activity during speech comprehension. We do not in any way undermine the importance of behavioural or corpus-based data in studying language processing in the brain. On the contrary, as evidenced by our findings, these traditional metrics are instrumental in interpreting and guiding the use of metrics derived from LLMs.

      However, the data presented here don't directly test how large the benefit provided by this tool really is. In fact, the authors show compelling correlations of the neural network-derived metrics with both the behavioral cloze-test data as well as several (corpus-)derived metrics. While this is a convincing illustration of how deep language models can be made more interpretable, it is in itself not novel. The correlation with behavioral data and corpus statistics also raises the question of what is the additional benefit of the computational model? Is it simply saving us the step of not having to collect the behavioral data, not having to compute the corpus statistics or does the model potentially uncover a more nuanced representation of the online comprehension process? This remains unclear because we are lacking a direct comparison of how much variance in the neural data is explained by the neural network-derived metrics beyond those other metrics (for example the main verb probability or the corpus-derived "active index" following the prepositional phrase).

      From our perspective, a primary advantage of using the neural network-derived metrics (or LLMs as computational models of language processing), compared to traditional behavioural and corpus-based metrics, lies in their ability to offer more nuanced, contextualized representations of natural language inputs. There seemed to be no effective way of computationally capturing the distributed and multifaceted constraints within specific contexts until the current generation of LLMs came along. While it is feasible to quantify lexical properties or contextual effects based on the usage of specific words via corpora or behavioural tests, this method appears less effective in modelling the composition of meanings across more words at the sentence level. More critically, it struggles with capturing how various lexical constraints collectively yield a coherent structured interpretation.

      Accumulating evidence suggests that models designed for context prediction or next-word prediction, such as word2vec and LLMs, outperform classic count-based distributional semantic models (Baroni et al. 2014) in aligning with neural activity during language comprehension (Schrimpf et al. 2021; Caucheteux and King 2022). Relevant to this, we have conducted additional analyses to directly assess the additional variance of neural data explained by BERT metrics, over and above what traditional metrics account for. Specifically, using RSA, we re-tested model RDMs based on BERT metrics while controlling for the contribution from traditional metrics (via partial correlation).
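
      The partial-correlation step can be sketched as follows. This is a minimal sketch, assuming a rank-based (Spearman) comparison of the upper triangles of the RDMs and a single control RDM. The correlation type, the function names, and the toy 4 x 4 RDMs are illustrative assumptions and are not taken from the analysis pipeline.

```python
from math import sqrt


def upper_triangle(rdm):
    """Flatten the off-diagonal upper triangle of a square, symmetric RDM."""
    n = len(rdm)
    return [rdm[i][j] for i in range(n) for j in range(i + 1, n)]


def ranks(values):
    """Average ranks (tied values share their mean rank), as in Spearman's rho."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_spearman(data_rdm, model_rdm, control_rdm):
    """Correlation of data and model RDMs after partialling out a control RDM."""
    d, m, c = (ranks(upper_triangle(r)) for r in (data_rdm, model_rdm, control_rdm))
    r_dm, r_dc, r_mc = pearson(d, m), pearson(d, c), pearson(m, c)
    return (r_dm - r_dc * r_mc) / sqrt((1 - r_dc ** 2) * (1 - r_mc ** 2))


# Toy RDMs (hypothetical values): neural data, a BERT-like model, a control metric.
data_rdm = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
model_rdm = [[0, 1, 3, 2], [1, 0, 4, 6], [3, 4, 0, 5], [2, 6, 5, 0]]
control_rdm = [[0, 2, 1, 4], [2, 0, 3, 6], [1, 3, 0, 5], [4, 6, 5, 0]]
print(partial_spearman(data_rdm, model_rdm, control_rdm))
```

      In this sketch, a model fit that stays high after partialling out the control RDM indicates variance in the data RDM that the control metric does not account for.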

      During the first verb (V1) epoch, we tested model RDMs of V1 transitivity based on data from either the behavioural pre-test (i.e., continuations following V1) or massive corpora. Contrasting sharply with the significant model fits observed for BERT V1 parse depth in bilateral frontal and temporal regions, the two metrics of V1 transitivity did not exhibit any significant effects (see Author response image 1).

      Author response image 1

      RSA model fits of BERT structural metrics and behavioural/corpus-based metrics in the V1 epoch. (upper) Model fits of BERT V1 parse depth (relevant to Appendix 1-figure 10A); (middle) Model fits of the V1 transitivity based on the continuation pre-test conducted at the end of V1 (e.g., completing “The dog found …”); (bottom) Model fits of the V1 transitivity based on the corpus data (as described in Methods). Note that verb transitivity is quantified as the proportion of its transitive uses (i.e., followed by a direct object) relative to its intransitive uses.

      In the PP1 epoch, which was aligned to the onset of the preposition in the prepositional phrase (PP), we tested the probability of a PP continuation following V1 (e.g., the probability of a PP after “The dog found…”). While no significant results were found for PP probability, we have plotted the uncorrected results for PP probability (Author response image 2). These model fits have very limited overlap with those of BERT parse depth vector (up to PP1) in the left inferior frontal gyrus (approximately at 360 ms) and the left temporal regions (around 600 ms). It is noteworthy that the model fits of the BERT parse depth vector (up to PP1) remained largely unchanged even when PP probability was controlled for, indicating that the variance explained by BERT metrics cannot be effectively accounted for by the PP probability obtained from the human continuation pre-test.

      Author response image 2

      Comparison between the RSA model fits of BERT structural metrics and behavioural / corpus-based metrics in the PP1 epoch. (upper) Model fits of BERT parse depth vector up to PP1 (relevant to Figure 6B in the main text); (middle) Model fits of the probability of a PP continuation in the pre-test conducted at the end of the first verb; (bottom) Model fits of BERT parse depth vector up to PP1 after partialling out the variance explained by PP probability.

      Finally, in the main verb (MV) epoch, we tested the model RDM based on the probability of a MV continuation following the PP (e.g., the probability after “The dog found in the park…”). When compared with the BERT parse depth vector (up to MV), we observed a similar effect in the left dorsal frontal regions (see Author response image 3). However, this effect did not survive after the whole-brain multiple comparison correction. Subsequent partial correlation analyses revealed that the MV probability accounted for only a small portion of the variance in neural data explained by the BERT metric, primarily the effect observed in the left dorsal frontal regions around 380 ms post MV onset. Meanwhile, the majority of the model fits of the BERT parse depth vector remained largely unchanged after controlling for the MV probability.

      Note that the probability of a PP/MV continuation reflects participants’ predictions based on speech input preceding the preposition (e.g., “The dog found…”) or the main verb (e.g., “The dog found in the park…”), respectively. In contrast, BERT parse depth vector is designed to represent the structure of the (partial) sentence in the speech already delivered to listeners, rather than to predict a continuation after it. Therefore, in the PP1 and MV epochs, we separately tested BERT parse depth vectors that included the preposition (e.g., “The dog found in…”) and the main verb (e.g., “The dog found in the park was…”) to accurately capture the sentence structure at these specific points in a sentence. Despite the differences in the nature of information captured by these two types of metrics, the behavioural metrics themselves did not exhibit significant model fits when tested against listeners’ neural activity.

      Author response image 3

      Comparison between the RSA model fits of BERT structural metrics and behavioural / corpus-based metrics in the MV epoch. (upper) Model fits of BERT parse depth vector up to MV (relevant to Figure 6C in the main text); (middle) Model fits of the probability of a MV continuation in the pre-test conducted at the end of the prepositional phrase (e.g., “The dog found in the park …”); (bottom) Model fits of BERT parse depth vector up to MV after partialling out the variance explained by MV probability.

      Regarding the corpus-derived interpretative preference, we observed that neither the Active index nor the Passive index showed significant effects in the PP1 epoch. In the MV epoch, significant model fits of the Passive index were observed, and these temporally overlapped with those of the BERT parse depth vector (up to MV) after the recognition point of the MV. However, the effects of these two model RDMs emerged in different hemispheres, as illustrated in Figures 6C and 8D in the main text. Consequently, we opted not to pursue further partial correlation analysis with the corpus-derived interpretative preference. In addition, as shown in Figures 8A, 8B and 8C, the subject noun thematic role preference and the non-directional index exhibited significant model fits in the PP1 or the MV epoch. Interestingly, these effects preceded the corresponding effects of BERT metrics in the same epoch (see Figures 6B and 6C), suggesting that the overall structured interpretation emerges after the evaluation and integration of multifaceted lexical constraints.

      In summary, our findings indicate that, in comparison to corpus-derived or behavioural metrics, BERT structural metrics are more effective in explaining neural data, in terms of modelling both the unfolding sentence input (i.e., incremental BERT parse vector) and individual words (i.e., V1) within specific sentential contexts. This advantage of BERT metrics might be due to the hypothesized capacity of LLMs to capture more contextually rich representations. Such representations effectively integrate the diverse constraints present in a given sentence, thereby outperforming corpus-based or behavioural metrics in this respect. At the same time, it is important to recognize the significant role of corpus-based / behavioural metrics as explanatory variables. They are instrumental not only in interpreting BERT metrics but also in understanding their fit to listeners’ neural activity (by examining the temporal sequence and spatial distribution of model fits of these two types of metrics). Such an integrative approach allows for a more comprehensive understanding of the complex neural processes underpinning speech comprehension.

      With regards to the neural data, the authors show convincing evidence for early modulations of brain activity by linguistic constraints on sentence structure and importantly early modulation by the coherence between multiple constraints to be integrated. Those modulations can be observed across bilateral frontal and temporal areas as well as parts of the default mode network. The methods used are clear and rigorous and allow for a detailed exploration of how multiple linguistic cues are neurally encoded and dynamically shape the final representation of a sentence in the brain. However, at times the consequences of the RSA results remain somewhat vague with regard to the motivation behind different metrics and how they differ from each other. Therefore, some results seem surprising and warrant further discussion, for example: Why does the neural network-derived parse depth metric fit neural data before the V1 uniqueness point if the sentence pairs begin with the same noun phrase? This suggests that the lexical information preceding V1, is driving the results. However, given the additional results, we can already exclude an influence of subject likelihood for a specific thematic role as this did not model the neural data in the V1 epoch to a significant degree.

      As pointed out by R1, model fits of BERT parse depth vector (up to V1) and its mismatch for the active interpretation were observed before the V1 uniqueness point (Figures 6A and 6D). These early effects could be attributed to the inclusion of different subject nouns in the BERT parse depth vectors. In our MEG data analyses, RSA was performed using all LoTrans and HiTrans sentences. Each of the 60 sentence sets contained one LoTrans sentence and one HiTrans sentence, which resulted in a 120 x 120 neural data RDM for each searchlight ROI across the brain within each sliding time window. Although LoTrans and HiTrans sentences within the same sentence set shared the same subject noun, subject nouns varied across sentence sets. This variation was expected to be reflected in both the model RDM of BERT metrics and the data RDM, a point further clarified in the revised manuscript.
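
      To make the structure of this analysis concrete, the sketch below builds a model RDM and a data RDM over 120 sentences and correlates their upper triangles. It is an illustration under stated assumptions: Euclidean distance as the dissimilarity measure and random toy vectors in place of BERT parse depth vectors and searchlight patterns are both hypothetical choices, and the helper names are not taken from the study.

```python
from itertools import combinations
import math
import random

def rdm_upper(vectors):
    """Upper-triangle dissimilarities (Euclidean distance) for all item pairs."""
    return [math.dist(a, b) for a, b in combinations(vectors, 2)]

def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy stand-ins: 60 sentence sets x 2 sentence types = 120 sentences
rng = random.Random(0)
model_vectors = [[rng.random() for _ in range(3)] for _ in range(120)]     # depth vectors up to V1
neural_patterns = [[rng.random() for _ in range(10)] for _ in range(120)]  # one searchlight, one time window

model_rdm = rdm_upper(model_vectors)   # 120 * 119 / 2 sentence pairs
data_rdm = rdm_upper(neural_patterns)
rho = spearman(model_rdm, data_rdm)
```

      Because every sentence pair enters the RDM, differences between subject nouns across sentence sets contribute to both RDMs even before V1 is heard.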

      In contrast, when employing a model RDM constructed solely from the BERT V1 parse depth, we observed model fits peaking precisely at the uniqueness point of V1 (see Appendix 1-figure 10). It is important to note that BERT V1 parse depth is a contextualized metric influenced by the preceding subject noun, which could account for the effects of BERT V1 parse depth observed before the uniqueness point of V1.

      Relatedly, in Fig 2C it seems there are systematic differences between HiTrans and LoTrans sentences regarding the parse depth of determiner and subject noun according to the neural network model, while this is not expected according to the context-free parse.

      We thank R1 for pointing out this issue. Relevant to Figure 3D (Figure 2C in the original manuscript), we presented the distributions of BERT parse depth for individual words as the sentence unfolds in Appendix 1-figure 2. Our analysis revealed that the parse depth of the subject noun in high transitivity (HiTrans) and low transitivity (LoTrans) sentences did not significantly differ, except for the point at which the sentence reached V1 (two-tailed two-sample t-test, P = 0.05).

      However, we observed a significant difference in the parse depth of the determiner between HiTrans and LoTrans sentences (two-tailed two-sample t-test, P < 0.05 for all results in Appendix 1-figure 2). Additionally, the parse depth of the determiner was found to covary with that of V1 as the input unfolded to different sentence positions (Pearson correlation, P < 0.05 for all plots in Appendix 1-figure 2). This difference, unexpected in terms of the context-free (dependency) parse used for training the BERT structural probing model, might be indicative of a “leakage” of contextual information during the training of the structural probing model, given the co-variation between the determiner and V1 which was designed to be different in their transitivity in the two types of sentences.

      Despite such unexpected differences observed in the BERT parse depths of the determiner, we considered the two sentence types as one group with distributed features (e.g., V1 transitivity) in the RSA, and used the BERT parse depth vector including all words in the sentence input to construct the model RDMs. Moreover, as indicated in Appendix 1-figure 3, compared to the content words, the determiner contributed minimally to the incremental BERT parse depth vector. Consequently, the noted discrepancies in BERT parse depth of the determiner between HiTrans and LoTrans sentences are unlikely to significantly bias our RSA results.

      "The degree of this mismatch is proportional to the evidence for or against the two interpretations (...). Besides these two measures based on the entire incremental input, we also focused on Verb1 since the potential structural ambiguity lies in whether Verb1 is interpreted as a passive verb or the main verb." The neural data fits in V1 epoch differ in their temporal profile for the mismatch metrics and the Verb 1 depth respectively. I understand the "degree of mismatch" to be a measure of how strongly the neural network's hidden representations align with the parse depth of an active or passive sentence structure. If this is correct, then it is not clear from the text how far this measure differs from the Verb 1 depth alone, which is also indicating either an active or passive structure.

      Within the V1 epoch, we tested three distinct types of model RDMs based on BERT metrics: (1) The BERT parse depth vector, representing the neural network’s hidden representation of the incremental sentence structure including all words up to V1. (2) The mismatch metric for either the Active or Passive interpretation, calculated as the distance between the BERT parse depth vector and the context-free parse depth vector for each interpretation. (3) The BERT parse depth of V1, crucial in representing the preferred structural interpretation of the unfolding sentence given its syntactic role as either a passive verb or the main verb.

      While the BERT parse depth vector per se does not directly indicate a preferred interpretation, its mismatch with the context-free parse depth vectors of the two possible interpretations reveals the favoured interpretation, as significant neural fit is only anticipated for the mismatch with the interpretation being considered. The contextualized BERT depth of V1 is also indicative of the preferred structure, given the context-free V1 parse depth corresponding to different syntactic roles. However, compared to the interpretative mismatch, it does not fully capture contributions from other words in the input. Consequently, we expected the interpretative mismatch and the BERT V1 depth to yield different results. Indeed, our analysis revealed that, although both metrics extracted from the same BERT layer (i.e., layer 13) demonstrated early RSA fits in the left fronto-temporal regions, the V1 depth showed relatively more prolonged effects with a notable peak occurring precisely at the uniqueness point of V1 (compare Figure 6C and Appendix 1-figure 10). These complementary results underscore the capability of BERT metrics to align with neural responses, in terms of both an incrementally unfolding sentence and a specific word within it.

      In previous studies, differences in neural activity related to distinct amounts of open nodes in the parse tree have been interpreted in terms of distinct working memory demands (Nelson et al. pnas 2017, Udden et al tics 2020). It seems that some of the metrics, for example the neural network-derived parse depth or the V1 depth may be similarly interpreted in the light of working memory demands. After all, during V1 epoch, the sentences do not only differ with respect to predicted sentence structure, but also in the amount of open nodes that need to be maintained. In the discussion, however, the authors interpret these results as "neural representations of an unfolding sentence's structure".

      We agree with the reviewer that the Active and Passive interpretations differ in terms of the number of open nodes before the actual main verb is heard. Given the syntactic ambiguity in our sentence stimuli (i.e., LoTrans and HiTrans sentences), it is infeasible to determine the exact number of open nodes in each sentence as it unfolds. Nevertheless, the RSA fits observed in the dorsal lateral frontal regions could be indicative of the varying working memory demands involved in building the structured interpretations across sentences. We have added this perspective in the revised manuscript.

      Reviewer #2 (Public Review):

      This article is focused on investigating incremental speech processing, as it pertains to building higher-order syntactic structure. This is an important question because speech processing in general is lesser studied as compared to reading, and syntactic processes are lesser studied than lower-level sensory processes. The authors claim to shed light on the neural processes that build structured linguistic interpretations. The authors apply modern analysis techniques, and use state-of-the-art large language models in order to facilitate this investigation. They apply this to a cleverly designed experimental paradigm of EMEG data, and compare neural responses of human participants to the activation profiles in different layers of the BERT language model.

      We thank Reviewer #2 (hereinafter referred to as R2) for the overall positive remarks on our study.

      Strengths:

      (1) The study aims to investigate an under-explored aspect of language processing, namely syntactic operations during speech processing

      (2) The study is taking advantage of technological advancements in large language models, while also taking linguistic theory into account in building the hypothesis space

      (3) The data combine EEG and MEG, which provides a valuable spatio-temporally resolved dataset

      (4) The use of behavioural validation of high/low transitive was an elegant demonstration of the validity of their stimuli

      We thank R2 for recognizing and appreciating the motivation and the methodology employed in this study.

      Weaknesses:

      (1) The manuscript is quite hard to understand, even for someone well-versed in both linguistic theory and LLMs. The questions, design, analysis approach, and conclusions are all quite dense and not easy to follow.

      To address this issue, we have made dedicated efforts to clarify the key points in our study. We also added figures to visualize our experimental design and methods (see Figure 1, Figure 3C and Figure 5 in the revised main text). We hope that these revisions have made the manuscript more comprehensible and straightforward for the readers.

      (2) The analyses end up seeming overly complicated when the underlying difference between sentence types is a simple categorical distinction between high and low transitivity. I am not sure why tree depth and BERT are being used to evaluate the degree to which a sentence is being processed as active or passive. If this is necessary, it would be helpful for the authors to motivate this more clearly.

      Indeed, as pointed out by R2, the only difference between LoTrans and HiTrans sentences is the first verb (V1), whose transitivity is crucial for establishing an initial preference for either an Active or a Passive interpretation as the sentence unfolds. Nonetheless, in line with the constraint-based approach to sentence processing and supported by previous research findings, a coherent structured interpretation of a sentence is determined by the combined constraints imposed by all words within that sentence. In our study, the transitivity of V1 alone is insufficient to fully explain the interpretative preference for the sentence structure. The overall sentence-level interpretation also depends on the thematic role preference of the subject noun – its likelihood of being an agent performing an action or a patient receiving the action.

      This was evident in our findings, as shown in Author response image 1 above, where the V1 transitivity based on corpus or behavioural data did not fit the neural data during the V1 epoch. In contrast, BERT structural measures [e.g., BERT parse depth vector (up to V1) and BERT V1 parse depth] offered contextualized representations that are presumed to integrate various lexical constraints present in each sentence. These BERT metrics exhibited significant model fits for the same neural data in the V1 epoch. In addition, a notable feature of BERT is its bi-directional attention mechanism, which allows for the dynamic updating of an earlier word’s representation as more of the sentence is heard; this is challenging to achieve with corpus or behavioural metrics. For instance, the parse depth of the word “found” in the BERT parse depth vector for “The dog found…” differs from its parse depth in the vector for “The dog found in…”. This feature of BERT is particularly advantageous for investigating the dynamic nature of structured interpretation during speech comprehension, as it simulates the continual updating of interpretation that occurs as a sentence unfolds (as shown by Figure 7 in the main text). We have elaborated on the rationale for employing BERT parse depth in this regard in the revised manuscript.

      (3) The main data result figures comparing BERT and the EMEG brain data are hard to evaluate because only t-values are provided, and those, only for significant clusters. It would be helpful to see the full 600 ms time course of rho values, with error bars across subjects, to really be able to evaluate it visually. This is a summary statistic that is very far away from the input data

      We appreciate this suggestion from R2. In Appendix 1 of the revised manuscript, we have provided individual participants’ Spearman’s rho time courses for every model RDM tested in all three epochs (see Appendix 1-figures 8-10 & 14-15). Because RSA was conducted on source-localized E/MEG data, it is infeasible to plot the rho time course for the searchlight at each of the 8196 vertices on the cortical surface mesh. Instead, we plotted the rho time course of each ROI reported in the original manuscript. These plots complement the time-resolved heatmaps of peak t-values in Figures 6-8 in the main text.
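
      The relation between the per-participant rho time courses and a group-level t-value can be sketched as follows. This is a hedged illustration only: it assumes a one-sample t-test against zero at each time point, and the rho values and the function name are hypothetical.

```python
import math
import statistics

def summarise_time_course(rho_by_subject):
    """Per time point: mean rho, SEM, and one-sample t-value against zero."""
    n = len(rho_by_subject)
    summary = []
    for samples in zip(*rho_by_subject):   # one tuple of subjects per time point
        mean = statistics.fmean(samples)
        sem = statistics.stdev(samples) / math.sqrt(n)
        summary.append((mean, sem, mean / sem))
    return summary

# Hypothetical rho values for one ROI: 4 subjects x 3 time points
rho_by_subject = [
    [0.01, 0.04, 0.02],
    [-0.01, 0.05, 0.03],
    [0.00, 0.03, 0.01],
    [0.02, 0.06, 0.02],
]
summary = summarise_time_course(rho_by_subject)
```

      Plotting the mean with the SEM as error bars gives the kind of ROI time course requested, while the third value in each tuple corresponds to the group statistic shown in a heatmap.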

      (4) Some details are omitted or not explained clearly. For example, how was BERT masked to give word-by-word predictions? In its default form, I believe that BERT takes in a set of words before and after the keyword that it is predicting. But I assume that here the model is not allowed to see linguistic information in the future.

      In our analyses, we utilized the pre-trained version of BERT (Devlin et al. 2019) as released by Hugging Face (https://github.com/huggingface). It is noteworthy that BERT, as described in the original paper, was initially trained using the Cloze task, involving the prediction of masked words within an input. In our study, however, we neither retrained nor fine-tuned the pre-trained BERT model, nor did we employ it for word-by-word prediction tasks. We used BERT to derive the incremental representation of a sentence’s structure as it unfolded word-by-word.

      Specifically, we sequentially input the text of each sentence into BERT, akin to how a listener would receive the spoken words in a sentence (see Figure 3C in the main text). For each incremental input (such as “The dog found”), we extracted the hidden representations of each word from BERT. These representations were then transformed into their respective BERT parse depths using a structural probing model (which was trained using sentences with annotated dependency parse trees from the Penn Treebank Dataset). The resulting BERT parse depths were subsequently used to create model RDMs, which were then tested against neural data via RSA.

      Crucially, in our approach, BERT was not exposed to any future linguistic information in the sentence. We never tested BERT parse depth of a word in an epoch where this word had not been heard by the listener. For example, the three-dimensional BERT parse depth vector for “The dog found” was tested in the V1 epoch corresponding to “found”, while the four-dimensional BERT parse depth vector for “The dog found in” was tested in the PP1 epoch of “in”.
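
      The incremental procedure can be sketched as below. This is a schematic under explicit assumptions, not the pipeline used in the study: `embed` is a hypothetical stand-in that returns random context-dependent vectors instead of BERT hidden states, the probe matrix `B` is random rather than trained on a treebank, and the depth probe is assumed to take the form of a squared norm of a linear projection.

```python
import random

words = ["The", "dog", "found", "in", "the", "park", "was"]
HIDDEN, RANK = 8, 4  # toy sizes, not BERT's

rng = random.Random(0)
# Hypothetical probe matrix B (RANK x HIDDEN); a real one is learned from a treebank
B = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(RANK)]

def embed(prefix):
    """Stand-in for BERT: one context-dependent vector per word in the prefix."""
    local = random.Random(" ".join(prefix))  # seeded by the whole prefix
    return [[local.gauss(0, 1) for _ in range(HIDDEN)] for _ in prefix]

def probe_depth(h):
    """Depth probe assumed here: squared norm of the linear projection B h."""
    return sum(sum(b * x for b, x in zip(row, h)) ** 2 for row in B)

def incremental_depth_vectors(words, start=3):
    """Depth vector for every prefix, from the first `start` words onwards."""
    vectors = {}
    for n in range(start, len(words) + 1):
        prefix = words[:n]  # only words already heard, no future input
        vectors[" ".join(prefix)] = [probe_depth(h) for h in embed(prefix)]
    return vectors

vectors = incremental_depth_vectors(words)
```

      In this sketch `vectors["The dog found"]` has three entries and `vectors["The dog found in"]` has four, and the entry for “found” is recomputed for each prefix, mirroring the way an earlier word’s depth can change as more of the sentence is heard.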

      How were the auditory stimuli recorded? Was it continuous speech or silences between each word? How was prosody controlled? Was it a natural speaker or a speech synthesiser?

      Consistent with our previous studies (Kocagoncu et al. 2017; Klimovich-Gray et al. 2019; Lyu et al. 2019; Choi et al. 2021), all auditory stimuli in this study were recorded by a female native British English speaker, ensuring a neutral intonation throughout. We have incorporated this detail into the revised version of our manuscript for clarity.

      It is difficult for me to fully assess the extent to which the authors achieved their aims, because I am missing important information about the setup of the experiment and the distribution of test statistics across subjects.

      We are sorry for the previously omitted details regarding the experimental setup and the results of individual participants. As detailed in our responses above, we have now included the necessary information in the revised manuscript.

      Reviewer #3 (Public Review):

      Syntactic parsing is a highly dynamic process: When an incoming word is inconsistent with the presumed syntactic structure, the brain has to reanalyze the sentence and construct an alternative syntactic structure. Since syntactic parsing is a hidden process, it is challenging to describe the syntactic structure a listener internally constructs at each time moment. Here, the authors overcome this problem by (1) asking listeners to complete a sentence at some break point to probe the syntactic structure mentally constructed at the break point, and (2) using a DNN model to extract the most likely structure a listener may extract at a time moment. After obtaining incremental syntactic features using the DNN model, the authors analyze how these syntactic features are represented in the brain using MEG.

      We extend our thanks to Reviewer #3 (referred to as R3 below) for recognizing the methods we used in this study.

      Although the analyses are detailed, the current conclusion needs to be further specified. For example, in the abstract, it is concluded that "Our results reveal a detailed picture of the neurobiological processes involved in building structured interpretations through the integration across multifaceted constraints". The readers may remain puzzled after reading this conclusion.

      Following R3’s suggestion, we have revised the abstract and refined our conclusions in the main text to explicitly highlight our principal findings. These include: (1) a shift from bihemispheric lateral frontal-temporal regions to left-lateralized regions in representing the current structured interpretation as a sentence unfolds, (2) a pattern of sequential activations in the left lateral temporal regions, updating the structured interpretation as syntactic ambiguity is resolved, and (3) the influence of lexical interpretative coherence activated in the right hemisphere over the resolved sentence structure represented in the left hemisphere.

      Similarly, for the second part of the conclusion, i.e., "including an extensive set of bilateral brain regions beyond the classical fronto-temporal language system, which sheds light on the distributed nature of language processing in the brain." The more extensive cortical activation may be attributed to the spatial resolution of MEG, and it is quite well acknowledged that language processing is quite distributive in the brain.

      We fully agree with R3 on the relatively low spatial resolution of MEG. Our emphasis was on the observed peak activations in specific regions outside the classical brain areas related to language processing, such as the precuneus in the default mode network, which are unlikely to be artifacts due to the spatial resolution of MEG. We have revised the relevant contents in the Abstract.

      The authors should also discuss:

      (1) individual differences (whether the BERT representation is a good enough approximation of the mental representation of individual listeners).

      To address the issue of individual differences which was also suggested by R2, we added individual participants’ model fits in ROIs with significant effects of BERT representations in Appendix 1 of the revised manuscript (see Appendix 1-figures 8-10 & 14-15).

      (2) parallel parsing (I think the framework here should allow the brain to maintain parallel representations of different syntactic structures but the analysis does not consider parallel representations).

      In the original manuscript, we did not discuss parallel parsing because the methods we used do not support a direct test of this hypothesis. In our analyses, we assessed the preference for one of two plausible syntactic structures (i.e., Active and Passive interpretations) based on the BERT parse vector of an incremental sentence input. This assessment was accomplished by calculating the mismatch between the BERT parse depth vector and the context-free dependency parse depth vector representing each of the two structures. However, we only observed one preferred interpretation in each epoch (see Figures 6D-6F) and did not find evidence supporting the maintenance of parallel representations of different syntactic structures in the brain. Nevertheless, in the revised manuscript, we have mentioned this possibility, which could be properly explored in future studies.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Consider fitting the behavioral data from the continuation pre-test to the brain data in order to illustrate the claimed advantage of using a computational model beyond more traditional methods.

      Following R1’s suggestion, we conducted additional RSA using more behavioural and corpus-based metrics. We then directly compared the fits of these traditional metrics to brain data with those of BERT metrics in the same epoch to provide empirical evidence for the advantage of using a computational model like BERT to explain listeners’ neural data (see Appendix 1-figures 11-13).

      Clarify the use of "neural representations": For a clearer assessment of the results, please discuss your results (especially the fits with BERT parse depth) in terms of the potential effects of distinct sentence structure expectations on working memory demands and make clear where these can be disentangled from neural representations of an unfolding sentence's structure.

      In the revised manuscript, we have noted the working memory demands associated with the online construction of a structured interpretation during incremental speech comprehension. As mentioned in our response to the relevant comment by R1 above, our experimental paradigm is not suitable for quantitatively assessing working memory demands since it is difficult to determine the exact number of open nodes for our stimuli with syntactic ambiguity before the disambiguating point (i.e., the main verb) is reached. Therefore, while we can speculate about the potential contribution of varying working memory demands (which might correlate with BERT V1 parse depth) to RSA model fits, we think it is not possible to disentangle their effects from the neural representation of an unfolding sentence’s structure modelled by BERT parse depths in our current study.

      Please add in methods a description of how the uniqueness point was determined.

      In this study, we defined the uniqueness point of a word as the earliest point in time when this word can be fully recognized after removing all of its phonological competitors. To determine the uniqueness point for each word of interest, we first identified the phoneme by which this word can be uniquely recognized according to CELEX (Baayen et al. 1993). Then, we manually labelled the offset of this phoneme in the auditory file of the spoken sentence in which this word occurred. We have added a description of how the uniqueness point was determined to the Methods section of the revised manuscript.
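
      The two steps of this definition can be illustrated with a small sketch. Everything in it is hypothetical: the toy lexicon and phoneme symbols are made up rather than taken from CELEX, the offsets stand in for manually labelled times, and `uniqueness_index` is an illustrative helper rather than the procedure used in the study.

```python
def uniqueness_index(target, lexicon):
    """Index (0-based) of the first phoneme at which `target` has no competitors.

    Returns None if another entry still matches the whole target.
    """
    competitors = [w for w in lexicon if w != target]
    for i in range(len(target)):
        prefix = target[: i + 1]
        # Keep only competitors that still share the phonemes heard so far
        competitors = [w for w in competitors if w[: i + 1] == prefix]
        if not competitors:
            return i
    return None

# Toy lexicon; tuples of made-up phoneme symbols (not CELEX transcriptions)
lexicon = [
    ("f", "aU", "n", "d"),            # found
    ("f", "aU", "n", "t", "I", "n"),  # fountain
    ("f", "aU", "l"),                 # foul
    ("f", "O", "n", "d"),             # fond
]
target = lexicon[0]
idx = uniqueness_index(target, lexicon)

# Hypothetical phoneme offsets (seconds) labelled in the audio file
offsets = [0.08, 0.21, 0.29, 0.36]
uniqueness_point = offsets[idx]
```

      In this toy case the last competitor drops out only at the final phoneme, so the uniqueness point coincides with the offset of that phoneme.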

      I found the name "interpretative mismatch" very opaque. Maybe instead consider "preference".

      We chose to use the term “interpretative mismatch” rather than “preference” based on the operational definition of this metric, which is the distance between a BERT parse depth vector and one of the two context-free parse depth vectors representing the two possible syntactic structures, so that a smaller distance value (or mismatch) signifies a stronger preference for the corresponding interpretation.
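
      A minimal numerical sketch of this operational definition is given below. The depth values are invented for illustration, the context-free depths assume a dependency parse in which “found” is either the root (Active) or a modifier of “dog” (Passive), and Euclidean distance is an assumed choice of distance measure.

```python
import math

def mismatch(bert_depths, context_free_depths):
    """Euclidean distance between a BERT depth vector and a context-free one."""
    return math.dist(bert_depths, context_free_depths)

# "The dog found": hypothetical depths, for illustration only
bert_depths = [1.8, 1.1, 0.6]   # contextualised, continuous values
active_depths = [2, 1, 0]       # "found" as the main verb (root)
passive_depths = [2, 1, 2]      # "found" as a passive verb modifying "dog"

active_mismatch = mismatch(bert_depths, active_depths)
passive_mismatch = mismatch(bert_depths, passive_depths)

# The smaller mismatch marks the interpretation that is preferred
preferred = "Active" if active_mismatch < passive_mismatch else "Passive"
```

      Here the BERT depth of “found” lies closer to the root than to a modifier depth, so the Active mismatch is smaller and the Active interpretation is the preferred one.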

      In the abstract, the authors describe the cognitive process under investigation as one of incremental combination subject to "multi-dimensional probabilistic constraint, including both linguistic and non-linguistic knowledge". The non-linguistic knowledge is later also referred to as "broad world knowledge". These terms lack specificity and across studies have been operationalized in distinct ways. In the current study, this "world knowledge" is operationalized as the likelihood of a subject noun being an agent or patient and the probability for a verb to be transitive, so here a more specific term may have been the "knowledge about statistical regularities in language".

      In this study, we specifically define “non-linguistic world knowledge” as the likelihood of a subject noun assuming the role of an agent or patient, which relates to its thematic role preference. This type of knowledge is primarily non-linguistic in nature, as exemplified by comparing nouns like “king” and “desk”. Although it could be reflected by statistical regularities in language, thematic role preference hinges more on world knowledge, plausibility, or real-world statistics. In contrast, “linguistic knowledge” in our study refers to verb transitivity, which focuses on the grammatically correct usage of a verb and is tied to statistical regularities within language itself. In the revised manuscript, we have provided clearer operational definitions for these two concepts and have ensured consistent usage throughout the text.

      Please spell out what exactly the "constraint-based hypothesis" is (even better, include an explicit description of the alternative hypothesis?).

      The “constraint-based hypothesis”, as summarized in a review (McRae and Matsuki 2013), posits that various sources of information, referred to as “constraints”, are simultaneously considered by listeners during incremental speech comprehension. These constraints encompass syntax, semantics, knowledge of common events, contextual pragmatic biases, and other forms of information gathered from both intra-sentential and extra-sentential context. Notably, there is no delay in the utilization of these multifaceted constraints once they become available, neither is a fixed priority assigned to one type of constraint over another. Instead, a diverse set of constraints is immediately brought into play for comprehension as soon as they become available as the relevant spoken word is recognized.

      An alternative hypothesis, proposed earlier, is the two-stage garden path model (Frazier and Rayner 1982; Frazier 1987). According to this model, there is an initial parsing stage that relies solely on syntax. This is followed by a second stage where all available information, including semantics and other knowledge, is used to assess the plausibility of the results obtained in the first-stage analysis and to conduct re-analysis if necessary (McRae and Matsuki 2013). In the Introduction of our revised manuscript, we have elaborated on the “constraint-based hypothesis” and mentioned this two-stage garden path model as its alternative.

      Fig1 B&C: In order to make the data more interpretable, could you estimate how many possible grammatical structural configurations there are / how many different grammatical structures were offered in the pretest, and based on this what would be the "chance probability" of choosing a random structure or for example show how many responded with a punctuation vs alternative continuations?

      In our analysis of the behavioural results, we categorized the continuations provided by participants in the pre-test at the offset of Verb1 (e.g., “The dog found/walked …”) into 6 categories, including DO (direct object), INTRANS (intransitive), PP (prepositional phrase), INF (infinitival complement), SC (sentential complement) and OTHER (gerund, phrasal verb, etc.).

      Author response table 1.

      Similarly, we categorized the continuations that followed the offset of the prepositional phrase (e.g., “The dog found/walked in the park …”) into 7 categories, including MV (main verb), END (i.e., full stop), PP (prepositional phrase), INF (infinitival complement), CONJ (conjunction), ADV (adverb) and OTHER (gerund, sentential complement, etc.).

      Author response table 2.

      It is important to note that the results of these two pre-tests, including the types of continuations and their probabilities, exhibited considerable variability between and within each sentence type (see also Figures 2B and 2C).
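      As a rough illustration of the "chance probability" the reviewer asks about, a uniform baseline over the coding categories can be set against the observed proportions of continuations. The sketch below is not the analysis used in the study: the category labels are taken from the response above, the example counts are hypothetical, and a uniform baseline assumes all categories are equally likely a priori, which is unlikely to hold for natural continuations.

```python
from collections import Counter

# Category labels as listed in the response; everything else is illustrative.
V1_CATEGORIES = ["DO", "INTRANS", "PP", "INF", "SC", "OTHER"]
PP_CATEGORIES = ["MV", "END", "PP", "INF", "CONJ", "ADV", "OTHER"]


def uniform_chance(categories):
    """Probability of any one category if all categories were equally likely."""
    return 1.0 / len(categories)


def continuation_proportions(labels, categories):
    """Observed proportion of each category among the coded continuations."""
    counts = Counter(labels)
    total = len(labels)
    return {category: counts.get(category, 0) / total for category in categories}


# Hypothetical coded continuations for a single sentence fragment.
example_labels = ["DO"] * 12 + ["PP"] * 5 + ["INTRANS"] * 3
props = continuation_proportions(example_labels, V1_CATEGORIES)
chance_v1 = uniform_chance(V1_CATEGORIES)   # 1/6 for the six Verb1 categories
chance_pp = uniform_chance(PP_CATEGORIES)   # 1/7 for the seven post-PP categories
```

      Comparing each entry of `props` with `chance_v1` gives a simple reference point, although the variability between sentences noted above means any single baseline should be read with caution.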

      Typo: "In addition, we found that BERT structural interpretations were also a correlation with the main verb probability" >> correlated instead of correlation.

      We apologize for this typo. We have conducted a thorough proofreading to identify and correct any other typos present in the revised manuscript.

      "In this regard, DLMs excel in a flexible combination of different types of features embedded in their rich internal representations". What are the "different types", spell out at least some examples for illustration.

      We have rephrased this sentence to give a more detailed description.

      Fig 2 caption: "Same color scheme as in (A)" >> should be 'as in (B)'?, and later A instead of B.

      We are sorry for this typo. We have corrected it in the revised manuscript.

      Reviewer #2 (Recommendations For The Authors):

      My biggest recommendation is to make the paper clearer in two ways: (i) writing style, by hand-holding the reader through each section, and the motivation for each step, in both simple and technical language; (ii) schematic visuals, of the experimental design and the analysis. A schematic of the main experimental manipulation would be helpful, rather than just listing two example sentences. It would also be helpful to provide a schematic of the experimental setup and the analysis approach, so that people can refer to a visual aid in addition to the written explanation. For example, it is not immediately clear what is being correlated with what - I needed to go to the methods to understand that you are doing RSA across all of the trials. Make sure that all of the relevant details are explained, and that you motivate each decision.

      We thank R2 for these suggestions. In the revised manuscript, we have enhanced the clarity of the main text by providing a more detailed explanation of the motivation behind each analysis and the interpretation of the corresponding results. Additionally, in response to R2’s recommendation, we have added a few figures, including the illustration of the experimental design (Figure 1) and methods (see Figure 3C and Figure 5).

      Different visualisation of neural results - The main data result figures comparing BERT and the EMEG brain data are hard to evaluate because only t-values are provided, and those, are only for significant clusters. It would be helpful to see the full 600 ms time course of rho values, with error bars across subjects, to really be able to evaluate it visually.

      In the original manuscript, we opted to present t-value time courses for the sake of simplicity in illustrating the fits of the 12 model RDMs tested in 3 epochs. Following R2’s suggestion, we have included the ROI model fit time courses of each model RDM for all individual participants, as well as the mean model fit time course with standard error, in Appendix 1-figures 8-10 & 14-15.
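      For readers who want to reproduce this kind of summary, the mean time course and its standard error across participants can be computed as sketched below. This is a minimal sketch under stated assumptions: the array shape (participants by time points), the number of participants and the simulated values are all hypothetical and do not come from the study's data or code.

```python
import numpy as np


def mean_and_sem(fits):
    """Return the mean and standard error of the mean across participants.

    fits: array-like of shape (n_participants, n_timepoints), one model fit
    value (e.g. a correlation coefficient) per participant and time point.
    """
    fits = np.asarray(fits, dtype=float)
    mean = fits.mean(axis=0)
    # Sample standard deviation (ddof=1) divided by sqrt(n) gives the SEM.
    sem = fits.std(axis=0, ddof=1) / np.sqrt(fits.shape[0])
    return mean, sem


# Hypothetical data: 16 participants, 600 time points.
rng = np.random.default_rng(0)
example_fits = rng.normal(loc=0.02, scale=0.01, size=(16, 600))
mean_fit, sem_fit = mean_and_sem(example_fits)
```

      Plotting `mean_fit` with a band of plus or minus `sem_fit` gives the kind of time course with error bars that the reviewer requested.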

      How are the authors dealing with prosody differences that disambiguate syntactic structures, that BERT does not have access to?

      All spoken sentence stimuli were recorded by a female native British English speaker, ensuring a neutral intonation throughout. Therefore, prosody is unlikely to vary systematically between different sentence types or be utilized to disambiguate syntactic structures. Sample speech stimuli have been made available in the following repository: https://osf.io/7u8jp/.

      A few writing errors: "was kept updated every time"

      We are sorry for the typos. We have carefully proofread the revised manuscript to identify and correct them throughout.

      Explain why the syntactic trees have "in park the" rather than "in the park"?

      The dependency parse trees (e.g., Figure 3A) were generated according to the conventions of dependency parsing (de Marneffe et al. 2006).

      Why are there mentions of the multiple demand network in the results? I'm not sure where this comes from.

      The mention of the multiple demand network was made due to the significant RSA fits observed in the dorsal lateral prefrontal regions and the superior parietal regions, which are parts of the multiple demand network. This observation was particularly notable for the BERT parse depth vector in the main verb epoch when the potential syntactic ambiguity was being resolved. It is plausible that the observed effects are partly attributable to the varying working memory demands required to maintain the “opening nodes” in the different syntactic structures being considered by listeners at this point in the sentence.

      Reviewer #3 (Recommendations For The Authors):

      The study first asked human listeners to complete partial sentences, and incremental parsing of the partial sentences can be captured based on the completed sentences. This analysis is helpful and I wonder if the behavioral data here are enough to model the E/MEG responses. For example, if I understood it correctly, the parse depth up to V1 can be extracted based on the completed sentences and used for the E/MEG analysis.

      The behavioural data alone do not suffice to model the E/MEG data. As we elucidated in our responses to R1, we employed three behavioural metrics derived from the continuation pretests. These metrics include the V1 transitivity and the PP probability, given the continuations after V1 (e.g., after “The dog found…”), as well as the MV probability, given the continuations after the prepositional phrase (e.g., after “The dog found in the park…”). These metrics aimed to capture participants’ prediction based on their structured interpretations at various positions in the sentence. However, none of these behavioural metrics yielded significant model fits to the listeners’ neural activity, which sharply contrasts with the substantial model fits of the BERT metrics in the same epochs. Besides, we also tried to model V1 parse depth as a weighted average based on participants’ continuations. As shown in Figure 3A, V1 parse depth is 0 in the active interpretation, 2 in the passive interpretation, while the parse depth of the determiner and the subject noun does not differ. However, this continuation-based V1 parse depth [i.e., 0 × Probability(active interpretation) + 2 × Probability(passive interpretation)] did not show significant model fits.
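      The continuation-based V1 parse depth described above is a probability-weighted average of the two interpretation-specific depths. A minimal sketch follows; the function name and the example probabilities are hypothetical, while the depths 0 (active) and 2 (passive) are those stated above.

```python
def continuation_based_v1_depth(p_active, p_passive,
                                depth_active=0.0, depth_passive=2.0):
    """Weighted V1 parse depth: 0 x P(active) + 2 x P(passive) by default."""
    for p in (p_active, p_passive):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return depth_active * p_active + depth_passive * p_passive


# Hypothetical sentence for which 25% of continuations imply a passive reading.
example_depth = continuation_based_v1_depth(p_active=0.75, p_passive=0.25)
```

      With these illustrative probabilities the weighted depth is 0.5, lying between the active (0) and passive (2) values.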

      Related to this point, I wonder if the incremental parse extracted using BERT is consistent with the human results (i.e., parsing extracted based on the completed sentences) on a sentence-by-sentence basis.

      In fact, we did provide evidence showing the alignment between the incremental parse extracted using BERT and the human interpretation for the same partial sentence input (see Figure 4 in the main text and Appendix 1-figures 4-6).

      Furthermore, in Fig 1d, is it possible to calculate how much variance of the 3 probabilities is explained by the 4 factors, e.g., using a linear model? If these factors can already explain most of the variance of human parsing, is it possible to just use these 4 factors to explain neural activity?

      Following R3’s suggestion, we have conducted additional linear modelling analyses to compare the extent to which human behavioural data can be explained by corpus metrics and BERT metrics separately. Specifically, for each of the three probabilities obtained in the pretests (i.e., DO, PP, and MV), we constructed two linear models. One model utilized the four corpus-based metrics as regressors (i.e., SN agenthood, V1 transitivity, Passive index, and Active index), while the other model used BERT metrics as regressors (i.e., BERT parse depth of each word up to V1 from layer 13 for DO/PP probability and BERT parse depth of each word up to the end of PP from layer 14 for MV probability, consistent with the BERT layers reported in Figure 6).
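      The two-model comparison described above can be sketched as a pair of ordinary least-squares fits to the same pretest probability, one per set of regressors. The code below is illustrative only: the data are randomly generated, the number of sentences and the variable names are hypothetical, and the use of (adjusted) R-squared as the comparison measure is an assumption, since the response does not state which goodness-of-fit statistic was reported.

```python
import numpy as np


def r_squared(predictors, outcome):
    """R-squared of an ordinary least-squares fit with an intercept."""
    X = np.asarray(predictors, dtype=float)
    y = np.asarray(outcome, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    coef, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise R-squared for the number of regressors in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical data: 60 sentences, 4 corpus metrics, 3 BERT parse depths.
rng = np.random.default_rng(0)
n_sentences = 60
corpus_metrics = rng.normal(size=(n_sentences, 4))
bert_depths = rng.normal(size=(n_sentences, 3))
mv_probability = rng.uniform(size=n_sentences)

r2_corpus = r_squared(corpus_metrics, mv_probability)
r2_bert = r_squared(bert_depths, mv_probability)
adj_corpus = adjusted_r_squared(r2_corpus, n_sentences, corpus_metrics.shape[1])
adj_bert = adjusted_r_squared(r2_bert, n_sentences, bert_depths.shape[1])
```

      Because the two models differ in the number of regressors, an adjusted statistic is one reasonable way to put them on a comparable footing.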

      As shown in the table below, corpus metrics demonstrate a more effective fit than BERT metrics for predicting the DO/PP probability. The likelihood of a DO/PP continuation is chiefly influenced by the lexical syntactic property of V1 (i.e., transitivity), and appears to rely less on contextual factors. Since V1 transitivity is explicitly included as one of the corpus metrics, it is thus expected to align more closely with the DO/PP probability compared to BERT metrics, primarily reflecting transitive versus intransitive verb usage.

      Author response table 3.

      Actually, BERT V1 parse depth was not correlated with V1 transitivity when the sentence only unfolds to V1 (see Appendix 1-figure 6). This lack of correlation may stem from the fact that the BERT probing model was designed to represent the structure of a (partially) unfolded sentence, rather than to generate a continuation or prediction. Moreover, V1 transitivity alone does not conclusively determine the Active or Passive interpretation by the end of V1. For instance, both transitive and intransitive continuations after V1 are compatible with an Active interpretation. Consequently, the initial preference for an Active interpretation (as depicted by the early effects before V1 was recognized in Figure 6D) might be predominantly driven by the animate subject noun (SN) at the beginning of the sentence, a word order cue in languages like English (Mahowald et al. 2023).

      In contrast, when assessing the probability of a MV following the PP (e.g., after “The dog found in the park ...”), BERT metrics significantly outperformed corpus metrics in terms of fitting the MV probability. Although SN thematic role preference and V1 transitivity were designed to be the primary factors constraining the structured interpretation in this experiment, we could only obtain their context-independent estimates from corpora (i.e., considering all contexts). Additionally, although the Active/Passive index (a product of these two factors) is correlated with the MV probability, it may oversimplify the task of capturing the specific context of a given sentence. Furthermore, the PP following V1 is also expected to influence the structured interpretation: for instance, it matters whether “in the park” is a more plausible scenario for people to find a dog or for a dog to find something. Thus, this finding suggests that the corpus-based metrics are not as effective as BERT in representing contextualized structured interpretations (for a longer sentence input), which might require the integration of constraints from every word in the input.

      In summary, corpus-based metrics excel in explaining human language behaviour when it primarily relies on specific lexical properties. However, they significantly lag behind BERT metrics when more complex contextual factors come into play at the same time. Regarding their performance in fitting neural data, among the four corpus-based metrics, we only observed significant model fits for the Passive index in the MV epoch when the intended structure for a Passive interpretation was finally resolved, while the other three metrics did not exhibit significant model fits in any epoch. Note that subject noun thematic role preference did fit neural data in the PP and MV epochs (Figure 8A and 8B). In contrast, the incremental BERT parse depth vector exhibited significant model fits in all three epochs we tested (i.e., V1, PP1, and MV).

      To summarize, I feel that I'm not sure if the structural information BERT extracts reflect the human parsing of the sentences, especially when the known influencing factors are removed.

      Based on the results presented above and in the manuscript, BERT metrics align closely with human structured interpretations in terms of both behavioural and neural data. Furthermore, they outperform corpus-based metrics when it comes to integrating multiple constraints within the context of a specific sentence as it unfolds.

      Minor issues:

      Six types of sentences were presented. Three types were not analyzed, but the results for the UNA sentences are not reported either.

      In this study, we only analysed two out of the six types of sentences, i.e., HiTrans and LoTrans sentences. The remaining four types of sentences were included to ensure a diverse range of sentence structures and to avoid potential adaptation to the same syntactic structure.

      Fig 1b, If I understood it correctly, each count is a sentence. Providing examples of the sentences may help. Listing the sentences with the corresponding probabilities in the supplementary materials can also help.

      Yes, each count in Figure 2B (Figure 1B in the original manuscript) is a sentence. All sentence stimuli and results of pre-tests are available in the following repository https://osf.io/7u8jp/.

      "trajectories of individual HiTrans and LoTrans sentences are considerably distributed and intertwined (Fig. 2C, upper), suggesting that BERT structural interpretations are sensitive to the idiosyncratic contents in each sentence." It may also mean the trajectories are noisy.

      We agree with R3 that there might be unwanted noise underlying the distributed and intertwined BERT parse depth trajectories of individual sentences. Meanwhile, it is also important to note that the correlation between BERT parse depths and lexical constraints of different words at the same position across sentences is statistically supported.

      References

      Baayen RH, Piepenbrock R, van H R. 1993. The CELEX lexical data base on CD-ROM.

      Baroni M, Dinu G, Kruszewski G. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Vol 1. 238-247.

      Caucheteux C, King JR. 2022. Brains and algorithms partially converge in natural language processing. Communications Biology. 5:134.

      Choi HS, Marslen-Wilson WD, Lyu B, Randall B, Tyler LK. 2021. Decoding the Real-Time Neurobiological Properties of Incremental Semantic Interpretation. Cereb Cortex. 31:233-247.

      de Marneffe M-C, MacCartney B, Manning CD editors. Generating typed dependency parses from phrase structure parses, Proceedings of the 5th International Conference on Language Resources and Evaluation; 2006 May 22-28, 2006; Genoa, Italy:European Language Resources Association. 449-454 p.

      Devlin J, Chang M-W, Lee K, Toutanova K editors. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2019 June 2-7, 2019; Minneapolis, MN, USA:Association for Computational Linguistics. 4171-4186 p.

      Frazier L. 1987. Syntactic processing: evidence from Dutch. Natural Language & Linguistic Theory. 5:519-559.

      Frazier L, Rayner K. 1982. Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology. 14:178-210.

      Klimovich-Gray A, Tyler LK, Randall B, Kocagoncu E, Devereux B, Marslen-Wilson WD. 2019. Balancing Prediction and Sensory Input in Speech Comprehension: The Spatiotemporal Dynamics of Word Recognition in Context. Journal of Neuroscience. 39:519-527.

      Kocagoncu E, Clarke A, Devereux BJ, Tyler LK. 2017. Decoding the cortical dynamics of sound-meaning mapping. Journal of Neuroscience. 37:1312-1319.

      Lyu B, Choi HS, Marslen-Wilson WD, Clarke A, Randall B, Tyler LK. 2019. Neural dynamics of semantic composition. Proceedings of the National Academy of Sciences of the United States of America. 116:21318-21327.

      Mahowald K, Diachek E, Gibson E, Fedorenko E, Futrell R. 2023. Grammatical cues to subjecthood are redundant in a majority of simple clauses across languages. Cognition. 241:105543.

      McRae K, Matsuki K. 2013. Constraint-based models of sentence processing. Sentence processing. 519:51-77.

      Schrimpf M, Blank IA, Tuckute G, Kauf C, Hosseini EA, Kanwisher N, Tenenbaum JB, Fedorenko E. 2021. The neural architecture of language: Integrative modeling converges on predictive processing. Proceedings of the National Academy of Sciences of the United States of America. 118:e2105646118.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers for their careful reading of our manuscript and their considered feedback. Please see our detailed response to reviewer comments inset below.

      In addition to requested modifications we have also uploaded the proteomics data from 2 of the experiments contained within the manuscript onto the Immunological Proteome Resource (ImmPRes) website: immpres.co.uk making the data available in an easy-to-use graphical format for interested readers to interrogate and explore. We have added the following text to the data availability section (lines 1085-1091) to indicate this:

      “An easy-to-use graphical interface for examining protein copy number expression from the 24-hour TCR WT and Pim dKO CD4 and CD8 T cell proteomics and IL-2 and IL-15 expanded WT and Pim dKO CD8 T cell proteomics datasets is also available on the Immunological Proteome Resource website: immpres.co.uk (Brenes et al., 2023) under the Cell type(s) selection: “T cell specific” and Dataset selection: “Pim1/2 regulated TCR proteomes” and “Pim1/2 regulated IL2 or IL15 CD8 T cell proteomes”.”

      As well as indicating in figure legends where proteomics datasets are first introduced in Figures 1, 2 and 4 with the text:

      “An interactive version of the proteomics expression data is available for exploration on the Immunological Proteome Resource website: immpres.co.uk”

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary and Strengths:

      The study focuses on PIM1 and 2 in CD8 T cell activation and differentiation. These two serine/threonine kinases belong to a large network of Serine/Threonine kinases that acts following engagement of the TCR and of cytokine receptors and phosphorylates proteins that control transcriptional, translational and metabolic programs that result in effector and memory T cell differentiation. The expression of PIM1 and PIM2 is induced by the T-cell receptor and several cytokine receptors. The present study capitalized on high-resolution quantitative analysis of the proteomes and transcriptomes of Pim1/Pim2-deficient CD8 T cells to decipher how the PIM1/2 kinases control TCR-driven activation and IL-2/IL-15-driven proliferation, and differentiation into effector T cells.

      Quantitative mass spectrometry-based proteomics analysis of naïve OT1 CD8 T cells stimulated with their cognate peptide showed that the PIM1 protein was induced within 3 hours of TCR engagement, and its expression was sustained at least up to 24 hours. The kinetics of PIM2 expression was protracted as compared to that of PIM1. Such TCR-dependent expression of PIM1/2 correlated with the analysis of both Pim1 and Pim2 mRNA. In contrast, Pim3 mRNA was only expressed at very low levels and the PIM3 protein was not detected by mass spectrometry. Therefore, PIM1 and 2 are the major PIM kinases in recently activated T cells. Pim1/Pim2 double knockout (Pim dKO) mice were generated on a B6 background and found to have a lower number of splenocytes. No difference in TCR/CD28-driven proliferation was observed between WT and Pim dKO T cells over 3 days in culture. Quantitative proteomics of >7000 proteins further revealed no substantial quantitative or qualitative differences in protein content or proteome composition. Therefore, other signaling pathways can compensate for the lack of PIM kinases downstream of TCR activation.

      Considering that PIM1 and PIM2 kinase expression is regulated by IL-2 and IL-15, antigen-primed CD8 T cells were expanded in IL-15 to generate memory phenotype CD8 T cells or expanded in IL-2 to generate effector cytotoxic T lymphocytes (CTL). Analysis of the survival, proliferation, proteome, and transcriptome of Pim dKO CD8 T cells kept for 6 days in IL-15 showed that PIM1 and PIM2 are dispensable to drive the IL-15-mediated metabolic or differentiation programs of antigen-primed CD8 T cells. Moreover, Pim1/Pim2-deficiency had no impact on the ability of IL-2 to maintain CD8 T cell viability and proliferation. However, WT CTL downregulated the expression of CD62L whereas the Pim dKO CTL sustained higher CD62L expression. Pim dKO CTL were also smaller and less granular than WT CTL. Comparison of the proteome of day 6 IL-2 cultured WT and Pim dKO CTL showed that the latter expressed lower levels of the glucose transporters, SLC2A1 and SLC2A3, of a number of proteins involved in fatty acid and cholesterol biosynthesis, and CTL effector proteins such as granzymes, perforin, IFNg, and TNFa. Parallel transcriptomics analysis showed that the reduced expression of perforin and some granzymes correlated with a decrease in their mRNA whereas the decreased protein levels of granzymes B and A, and the glucose transporters SLC2A1 and SLC2A3 did not correspond with decreased mRNA expression. Therefore, PIM kinases are likely required for IL-2 to maximally control protein synthesis in CD8 CTL. Along that line, the translational repressor PDCD4 was increased in Pim dKO CTL and pan-PIM kinase inhibitors caused a reduction in protein synthesis rates in IL-2-expanded CTL. Finally, the differences between Pim dKO and WT CTL in terms of CD62L expression resulted in Pim dKO CTL, but not WT CTL, retaining the capacity to home to secondary lymphoid organs.

      In conclusion, this thorough and solid study showed that the PIM1/2 kinases shape the effector CD8 T cell proteomes rather than transcriptomes and are important mediators of IL2-signalling and CD8 T cell trafficking.

      Weaknesses:

      None identified by this reviewer.

      Reviewer #2 (Public Review):

      Summary:

      Using a suite of techniques (e.g., RNA seq, proteomics, and functional experiments ex vivo) this paper extensively focuses on the role of PIM1/2 kinases during CD8 T-cell activation and cytokine-driven (i.e., IL-2 or IL-15) differentiation. The authors' key finding is that PIM1/2 enhances protein synthesis in response to IL-2 stimulation, but not IL-15, in CD8+ T cells. Loss of PIM1/2 made T cells less 'effector-like', with lower granzyme and cytokine production, and a surface profile that maintained homing towards secondary lymphoid tissue. The cytokines the authors focus on are IL-15 and IL-2, which drive naïve CD8 T cells towards memory or effector states, respectively. Although PIM1/2 are upregulated in response to T-cell activation and cytokine stimulation (e.g., IL-15, and to a greater extent, IL-2), using T cells isolated from a global mouse genetic knockout background of PIM1/2, the authors find that PIM1/2 did not significantly influence T-cell activation, proliferation, or expression of anything in the proteome under anti-CD3/CD28 driven activation with/without cytokine (i.e., IL-15) stimulation ex vivo. This is perhaps somewhat surprising given PIM1/2 is upregulated, albeit to a small degree, in response to IL-15, and yet PIM1/2 did not seem to influence CD8+ T cell differentiation towards a memory state. Even more surprising is that IL-15 was previously shown to influence the metabolic programming of intestinal intraepithelial lymphocytes, suggesting cell-type specific effects from PIM kinases. What the authors went on to show, however, is that PIM1/2 KO altered CD8 T cell proteomes in response to IL-2. Using proteomics, they saw increased expression of homing receptors (i.e., L-selectin, CCR7), but reduced expression of metabolism-related proteins (e.g., GLUT1/3 & cholesterol biosynthesis) and effector-function related proteins (e.g., IFNy and granzymes). Rather neatly, by performing both RNA-seq and proteomics on the same IL-2 stimulated WT vs. PIM1/2 KO cells, the authors found that changes at the proteome level were not corroborated by differences in RNA uncovering that PIM1/2 predominantly influence protein synthesis/translation. Effectively, PIM1/2 knockout reduced the differentiation of CD8+ T cells towards an effector state. In vivo adoptive transfer experiments showed that PIM1/2KO cells homed better to secondary lymphoid tissue, presumably owing to their heightened L-selectin expression (although this was not directly examined).

      Strengths:

      Overall, I think the paper is scientifically good, and I have no major qualms with the paper. The paper as it stands is solid, and while the experimental aim of this paper was quite specific/niche, it is overall a nice addition to our understanding of how serine/threonine kinases impact T cell state, tissue homing, and functionality. Of note, they hint towards a more general finding that kinases may have distinct behaviour in different T-cell subtypes/states. I particularly liked their use of matched RNA-seq and proteomics to first suggest that PIM1/2 kinases may predominantly influence translation (then going on to verify this via their protein translation experiment - although I must add this was only done using PIM kinase inhibitors, not the PIM1/2KO cells). I also liked that they used small molecule inhibitors to acutely reduce PIM1/2 activity, which corroborated some of their mouse knockout findings - this experiment helps resolve any findings resulting from potential adaptation issues from the PIM1/2 global knockout in mice but also gives it a more translational link given the potential use of PIM kinase inhibitors in the clinic. The proteomics and RNA seq dataset may be of general use to the community, particularly for analysis of IL-15 or IL-2 stimulated CD8+ T cells.

      We thank the reviewer for their comments supporting the robustness and usefulness of our data.

      Weaknesses:

      It would be good to perform some experiments in human T cells too, given the ease of e.g., the small molecule inhibitor experiment.

      The suggestion to check PIM inhibitor effects in human T cells is a good one. We think an ideal experiment would be to use naïve cord blood derived CD4 and CD8 cells as a model to avoid the impact of variability in adult PBMC and to really look at what PIM kinases do as T cells first respond to antigen and cytokines. In this context there is good evidence that the signalling pathways used by antigen receptors or the cytokines IL-2 and IL-15 are not substantially different in mouse and human. We have also previously compared proteomes of mouse and human IL-2 expanded cytotoxic T cells and they are remarkably similar. As such we feel that mature mouse CD8 T cells are a genetically tractable model to use to probe the signalling pathways that control cytotoxic T cell function. To repeat the full set of experiments performed in this study with human T cells would represent 1 year of work by an experienced postdoctoral fellow.

      Unfortunately, the funding for the project has come to an end and there is no capacity to complete this work.

      Would also be good for the authors to include a few experiments where PIM1/2 have been transduced back into the PIM1/2 KO T cells, to see if this reverts any differences observed in response to IL-2 - although the reviewer notes that the timeline for altering primary T cells via lentivirus/CRISPR may be on the cusp of being practical such that functional experiments can be performed on day 6 after first stimulating T cells.

      A rescue experiment could indeed be informative, though it of course comes with challenges and caveats: both deleted proteins would have to be re-expressed at once, and the level of PIM kinase that is re-expressed would be difficult to control. This work using the Pim dKO mice was performed from 2019-2021 and was seriously impacted by the work restrictions during the COVID-19 pandemic. We had to curtail all mouse colonies to allow animal staff to work within the legal guidelines. We had to make choices and the Pim1/2 dKO colony was stopped because we felt we had generated very useful data from the work but could not justify continued maintenance of the colony at such a difficult time. As such we no longer have this mouse line to perform these rescue experiments.

      We have, however, performed a limited number of retroviral overexpression studies in WT IL-2-expanded CTL, where T cells were transfected after 24 hours of activation and the phenotype was measured on day 6 of culture. We chose to leave these out of the initial manuscript as these were overexpression under conditions where PIM expression was already high, rather than a true test of the ability of PIM1 or PIM2 to rescue the Pim dKO phenotype. A more robust test would also have required doing these overexpression experiments in IL-15 expanded or cytokine deprived CTL where PIM kinase expression is low; however, we ran out of time and funding to complete this work.

      We have provided Author response image 1 below from the experiments performed in the IL-2 CTL for interested readers. The limited experiments that were performed do support some key phenotypes observed with the Pim dKO mice or PIM inhibitors, finding that PIM1 or PIM2 overexpression was sufficient to increase S6 phosphorylation, and provided a small further increase in GzmB expression above the already very high levels in IL-2 expanded CTL.

      Author response image 1.

      PIM1 or PIM2 overexpression drives increased GzmB expression and S6 phosphorylation in WT IL-2 CTL. OT1 lymph node cell suspensions were activated for 24 hours with SIINFEKL peptide (10 ng/mL), IL-2 (20 ng/mL) and IL-12 (2 ng/mL) then transfected with retroviruses to drive expression of PIM1-GFP, PIM2-GFP fusion proteins or a GFP only control. T cells were split into fresh media and IL-2 daily and (A) GzmB expression and (B) S6 phosphorylation assessed by flow cytometry in GFP+ve vs GFP-ve CD8 T cells 5 days post-transfection (i.e. day 6 of culture). Histograms are representative of 2 independent experiments.

      Other experiments could also look at how PIM1/2 KO influences the differentiation of T cell populations/states during ex vivo stimulation of PBMCs or in vivo infection models using (high-dimensional) flow cytometry (rather than using bulk proteomics/RNA seq which only provide an overview of all cells combined).

      We did consider the idea of in vivo experiments with the Pim1/2 dKO mice but rejected this idea as the mice have lost PIM kinases in all tissues and so we would not be able to understand if any phenotype was CD8 T cell selective. To note, the Pim1/2 dKO mice are smaller than normal wild type mice (discussed further below) and clearly have complex phenotypes. An ideal experiment would be to make mice with floxed Pim1 and Pim2 alleles so that one could use cre recombinase to make a T cell-specific deletion and then study the impact of this in in vivo models. We did not have the budget or ethical approval to make these mice. Moreover, this study was carried out during the COVID pandemic when all animal experiments in the UK were severely restricted. So our objective was to get a molecular understanding of the consequences of losing these kinases for CD8 T cells, focusing on using controlled in vitro systems. We felt that this would generate important data that would guide any subsequent experiments by other groups interested in these enzymes.

      We do accept the comment about bulk population proteomics. Unfortunately, single cell proteomics is still not an option at this point in time. High resolution multidimensional flow cytometry is a valuable technique but is limited to looking at only a few proteins for which good antibodies exist compared to the data one gets with high resolution proteomics.

      Alongside this, performing a PCA of bulk RNA seq/proteomes of untreated vs. IL-2 vs. IL-15 WT and PIM1/2 knockout T cells would help cement their argument in the discussion about PIM1/2 knockout cells being distinct from a memory phenotype.

      We thank the reviewer for this very good suggestion. We have now included PCAs for the RNAseq and proteomics datasets of IL-2 and IL-15 expanded WT vs Pim dKO CTL in Fig S5 and added the following text to the discussion section of the manuscript (lines 429-431):

      “… and PCA plots of IL-15 and IL-2 proteomics and RNAseq data show that Pim dKO IL-2 expanded CTL are still much more similar to IL-2 expanded WT CTL than to IL-15 expanded CTL (Fig S5)”.
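      As an illustration of the kind of comparison described above, the following is a minimal sketch of a first-principal-component projection. It is not the analysis pipeline used for Fig S5: the sample names, the four-protein abundance values and the helper `first_pc_scores` are all hypothetical, and the leading component is found by simple power iteration rather than by any particular statistics package.

```python
import math

# Hypothetical log2 abundances (rows = samples, columns = 4 proteins).
# These numbers are made up for illustration and are NOT data from the study.
samples = {
    "WT_IL2":   [10.0, 8.0, 6.1, 4.0],
    "dKO_IL2":  [9.5, 8.2, 6.0, 4.5],
    "WT_IL15":  [5.0, 8.1, 5.9, 9.0],
    "dKO_IL15": [5.5, 7.9, 6.0, 8.5],
}

def first_pc_scores(data, iterations=200):
    """Project each sample onto the first principal component (PC1).

    PC1 is estimated by power iteration on the feature covariance matrix.
    The sign of PC1 is arbitrary, so only distances between scores matter.
    """
    names = list(data)
    rows = [data[name] for name in names]
    n, p = len(rows), len(rows[0])

    # Centre each feature (column) on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in rows]

    # Sample covariance matrix (p x p).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]

    # Power iteration from an arbitrary, non-symmetric start vector.
    v = [1.0 / (j + 1) for j in range(p)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    return {name: sum(c[j] * v[j] for j in range(p))
            for name, c in zip(names, centred)}

scores = first_pc_scores(samples)
gap_to_wt_il2 = abs(scores["dKO_IL2"] - scores["WT_IL2"])
gap_to_wt_il15 = abs(scores["dKO_IL2"] - scores["WT_IL15"])
```

      In this toy dataset the cytokine condition dominates PC1, so the hypothetical dKO IL-2 sample sits much closer to the WT IL-2 sample than to the WT IL-15 sample, which is the pattern the quoted sentence describes.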

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      In panel B of Figure S1, are the smaller numbers of splenocytes found in dKO fully accounted for by a reduction in the numbers of T cells or also correspond to a reduction in B cell numbers? Are the thymus and lymph nodes showing the same trend?

      We’re happy to clarify on this.

      Since we were focused on T cell phenotypes in the paper this is what we have plotted in this figure, however there is also a reduction in total number of B, NK and NKT cells in the Pim dKO mice (see James et al, Nat Commun, 2021 for additional subset percentages). We find that all immune subsets we have measured make up the same % of the spleen in Pim dKO vs WT mice (we show this for T cell subsets in what was formerly Fig S1C and is now Fig S1A), the total splenocyte count is just lower in the Pim dKO mice (which we show in what was formerly Fig S1B and is now Fig S1C). To note, the Pim dKO mice were smaller than their WT counterparts (though we have not formally weighed and quantified this) and we think this is likely the major factor leading to lower total splenocyte numbers.

      We have not checked the thymus so can’t comment on this. We can confirm that lymph nodes from Pim dKO mice had the same number and % CD4 and CD8 T cells as in WT.

      For our in vitro studies we have made sure either to use co-cultures or, for single WT and Pim dKO cultures, to equalise starting cell densities between wells to account for the difference in total splenocyte number. We have now clarified this point in the methods section (lines 682-684):

      “For generation of memory-like or effector cytotoxic T lymphocytes (CTL) from mice with polyclonal T cell repertoires, LN or spleen single cell suspensions at an equal density for WT and Pim dKO cultures (~1-3 million live cells/mL)….”

      Reviewer #2 (Recommendations For The Authors):

      Line 89-99 - PIM kinase expression is elevated in T cells in autoimmunity and inhibiting therefore may make some sense if PIM is enhancing T cell activity. Why then would you use an inhibitor in cancer settings? This needs better clarification for readers, with reference to T cells, particularly given this is an important justification for looking at PIM kinases in T cells.

      We thank the reviewer for highlighting the lack of clarity in our explanation here.

      PIM kinase inhibitors alone are proposed as anti-tumour therapies for select cancers to block tumour growth. However, so far these monotherapies haven’t been very effective in clinical trials, and combination treatment options with a number of strategies are being explored. There are two lines of logic for why PIM kinase inhibitors might combine well with, for example, anti-PD1 or adoptive T cell immunotherapy. 1) PIM kinase inhibition has been shown to reduce inhibitory/suppressive surface proteins (e.g. PDL1) and cytokine (e.g. TGFbeta) expression in tumour cells and macrophages in the tumour microenvironment. 2) Inhibiting glycolysis and increasing memory/stem-like phenotype has been identified as desirable for longer-lasting, more potent anti-tumour T cell immunity. PIM kinase inhibition has been shown to reduce glycolytic function and increase several ‘stemness’ promoting transcription factors, e.g. TCF7, in a previous study. Controlled murine cancer models have shown improvement in clearance with the combination of pan-Pim kinase inhibitors and anti-PD1/PDL1 treatments (Xin et al, Cancer Immunol Res, 2021 and Chatterjee et al, Clin Cancer Res 2019).

      It is worth noting, this is seemingly contradictory with other studies of Pim kinases in T cells that have generally found Pim1/2/3 deletion or inhibition in T cells to be suppressive of their function.

      We have clarified this reasoning/seeming conflict of results in the introductory text as follows (lines 90-101):

      “PIM kinase inhibitors have also entered clinical trials to treat some cancers (e.g. multiple myeloma, acute myeloid leukaemia, prostate cancer), and although they have not been effective as a monotherapy, there is interest in combining these with immunotherapies. This is due to studies showing PIM inhibition reducing expression of inhibitory molecules (e.g. PD-L1) on tumour cells and macrophages in the tumour microenvironment and a reported increase of stem-like properties in PIM-deficient T cells which could potentially drive longer lasting anti-cancer responses (Chatterjee et al., 2019; Xin et al., 2021; Clements and Warfel, 2022). However, PIM kinase inhibition has also generally been shown to be inhibitory for T cell activation, proliferation and effector activities (Fox et al., 2003; Mikkers et al., 2004; Jackson et al., 2021) and use of PIM kinase inhibitors could have the side effect of diminishing the anti-tumour T cell response.”  

      Line 93 - The use of 'some cancers' is rather vague and unscientific - please correct phrasing like this. The same goes for lines 54 and 77 (some kinases and some analyses).

      We have clarified the sentence in what is now Line 91 to include examples of some of the cancers that PIM kinase inhibitors have been explored for (see text correction in response to previous reviewer comment), which are predominantly haematological malignancies. The use of the phrases ‘some kinases’ and ‘some analyses’ in what are now Lines 52 and 75 is in our view appropriate, as the subsequent sentence(s) provide specific details on the kinases and analyses that are being referred to.

      Lines 146-147 - Could it be that rather than redundancies, PIM KO is simply not influential on TCR/CD28 signalling in general but influences other pathways in the T cell?

      We agree that the lack of PIM1/2 effect could also be because PIM targets downstream of TCR/CD28 are not influential and have clarified the text as follows (lines 156-161):

      “These experiments quantified expression of >7000 proteins but found no substantial quantitative or qualitative differences in protein content or proteome composition in activated WT versus Pim dKO CD4 and CD8 T cells (Fig 1G-H) (Table S1). Collectively these results indicate that PIM kinases do not play an important unique role in the signalling pathways used by the TCR and CD28 to control T cell activation.”

      Line 169 - Instead of specifying control - maybe put upregulate or downregulate for clarity.

      We have changed the text as per reviewer suggestion (see line 183)

      Line 182-183 - I would move the call out for Figure 2D to after the last call out for Figure 2C to make it more coherent for readers.

      We have changed the text as per reviewer suggestion (see lines 197-200)

      Line 190 - 14,000 RNA? total, unique? mRNA?

      These are predominantly mRNA since a polyA enrichment was performed as part of the standard TruSeq stranded mRNA sample preparation process, however, a small number of lncRNA etc were also detected in our RNA sequencing. We left the results in as part of the overall analysis since it may be of interest to others but don’t look into it further. We do mention the existence of the non-mRNA briefly in the subsequent sentence when discussing the total number of DE RNA that were classified as protein coding vs non-coding.

      We have edited this sentence as follows to more accurately reflect that the RNA being referred to is polyA+ (lines 205-207):

      “The RNAseq analysis quantified ~14,000 unique polyA+ mRNA and using a cut off of >1.5 fold-change and q-value <0.05 we saw that the abundance of 381 polyA+ RNA was modified by Pim1/Pim2-deficiency (Fig 2E) (Table S2A).”
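      The cut-off quoted above can be sketched as a simple filter. This is a minimal illustration only: the gene names and values are hypothetical, and applying the 1.5-fold threshold symmetrically to increases and decreases (via the absolute log2 fold change) is our assumption about how such a cut-off is typically used, not a statement of the exact procedure behind Table S2A.

```python
import math

# Hypothetical records: (gene, fold change dKO/WT, q-value).
# Values are invented for illustration and are NOT from the study.
records = [
    ("GeneA", 2.10, 0.010),  # increased and significant
    ("GeneB", 0.55, 0.030),  # decreased ~1.8-fold and significant
    ("GeneC", 1.20, 0.001),  # significant, but change is below 1.5-fold
    ("GeneD", 3.00, 0.200),  # large change, but q-value is too high
]

def is_differentially_expressed(fold_change, q_value,
                                fc_cutoff=1.5, q_cutoff=0.05):
    """Return True if the change exceeds the cut-off in either direction.

    Assumption: the fold-change threshold is symmetric, i.e. a ratio
    below 1/1.5 counts the same as a ratio above 1.5.
    """
    exceeds_fc = abs(math.log2(fold_change)) > math.log2(fc_cutoff)
    return exceeds_fc and q_value < q_cutoff

hits = [gene for gene, fc, q in records
        if is_differentially_expressed(fc, q)]
```

      With these toy records only the first two genes pass both criteria; the third fails on fold change and the fourth on q-value.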

      Questions/points regarding figures:

      Figure 1 - Is PIM3 changed in expression with the knockout of PIM1/2 in mice? Although the RNA is low could there be some compensation here? The authors put a good amount of effort in to showing that mouse T cells do not exhibit differences from knocking out pim1/2 i.e., Efforts have been made to address this using activation markers and cell size, cytokines, and proliferation and proteomics of activated T cells. What do the resting T cells look like though? Although TCR signalling is not impacted, other pathways might be. Resting-state comparison may identify this.

      In all experiments Pim3 mRNA was only detected at very low levels and no PIM3 protein was detected by mass spectrometry in either wild type or PIM1/2 double KO TCR activated or cytokine expanded CD8 T cells (See Tables S1, S3, S4). There was similarly no change in Pim3 mRNA expression in RNAseq of IL-2 or IL-15 expanded CD8 T cells (See Tables S2, S6). While we have not confirmed this in resting state cells for all the conditions examined, there is no evidence that PIM3 compensates for PIM1/2 deficiency or that PIM3 is substantially expressed in T cells.

      Figure 1A&B - Does PIM kinase stay elevated when removing TCR stimulus? During egress from lymph node and trafficking to infection/tumour/autoimmune site, T cells experience a period of 'rest' from T-cell activation so is PIM upregulation stabilized, or does it just coincide with activation? This could be a crucial control given the rest of the study focuses on day 6 after initial activation (which includes 4 days of 'rest' from TCR stimulation). Nice resolution on early time course though.

      This is an interesting question. Unfortunately, we do not know how sensitive PIM kinases are to TCR stimulus withdrawal, as we have not tried removing the TCR stimulus during early activation and measuring PIM expression.

      Based on the data in Fig 2A there is a hint that 4 hours withdrawal of peptide stimulus may be enough to lose PIM1/2 expression (after ~36 hrs of TCR activation), however, we did not include a control condition where peptide is retained within the culture. Therefore, we cannot resolve this question from the current experimental data, as this difference could also be due to a further increase in PIMs in the cytokine treated conditions rather than a reduction in expression in the no cytokine condition. This ~36-hour time point is also at a stage where T cells have become more dependent on cytokines for their sustained signalling compared to TCR stimulus.

      It is worth noting that PIM kinases are thought to have fairly short mRNA and protein half-lives (~5-20 min for PIM1 in primary cells, ~10 min – 1 hr for PIM2). This is consistent with previous observations that cytotoxic T cells need sustained IL-2/Jak signalling to sustain PIM kinase expression, e.g. in Rollings et al (2018) Sci Signaling, DOI: 10.1126/scisignal.aap8112. We would therefore expect that sustained signalling from some external signalling receptor, whether this is TCR, costimulatory receptors or cytokines, is required to drive Pim1/2 mRNA and protein expression.

      Figure 1D - the CD4 WT and Pim dKO plots are identical - presumably a copying error - please correct.

      We apologise for the copying error and have amended the manuscript to show the correct data. We thank the reviewer for noticing this mistake.

      In Figure 1H - there is one protein found significant - would be nice to mention what this is - for example, if this is a protein that influences TCR levels this could be quite important.

      The protein is Phosphoribosyl Pyrophosphate synthase 1 like 1 (Prps1l1).

      This was a low confidence quantification (based on only 2 peptides) with no known function in T cells. Based on what is known, this gene is predominantly expressed in the testis (though also detected in spleen, lung, liver). A whole-body KO mouse found no difference in male fertility. No further phenotype has been reported in this mouse. See: Wang et al (2018) Mol Reprod Dev, DOI: 10.1002/mrd.23053

      We have added the following text to the legend of Figure 1H to address this protein:

      “Phosphoribosyl Pyrophosphate synthase 1 like 1 (Prps1l1), was found to be higher in Pim dKO CD8 T cells, but was a low confidence quantification (based on only 2 unique peptides) with no known function in T cells.”

      Figure S1 - In your mouse model the reduction in CD4 T cells is quite dramatic in the spleen - is this reduced homing or reduced production of T cells through development?

      Could you quantify the percentage of CD45+ cells that are T cells from blood too? Would be good to have a more thorough analysis of this new mouse model.

      We apologise for the lack of clarity around the Pim dKO mouse phenotype. Something we didn’t mention previously due to a lack of a formal measurement is that the Pim dKO mice were typically smaller than their WT counterparts. This is likely the main reason for total splenocytes being lower in the Pim dKO mice - every organ is smaller. It is not a phenotype reported in Pim1/2 dKO mice on an FVB background, though has been reported in the Pim1/2/3 triple KO mouse before (see Mikkers et al, Mol Cell Biol 2004 doi: 10.1128/MCB.24.13.6104-6115.2004).

      The % cell type composition of the spleen is equivalent between WT and Pim dKO mice and as mentioned above, was controlled for when setting up of our in vitro cultures.

      We have revised the main text and changed the order of the panels in Fig S1 to make this caveat clearer as follows (lines 138-144):

      “There were normal proportions of peripheral T cells in spleens of Pim dKO mice (Fig S1A) similar to what has been reported previously in Pim dKO mice on an FVB/N genetic background (Mikkers et al., 2004), though the total number of T cells and splenocytes was lower than in age/sex matched wild-type (WT) mouse spleens (Fig S1B-C). This was not attributable to any one cell type (Fig S1A)(James et al., 2021) but was instead likely the result of these mice being smaller in size, a phenotype that has previously been reported in Pim1/2/3 triple KO mice (Mikkers et al., 2004).”

      Figure S1C - why are only 10-15% of the cells alive? Please refer to this experiment in the main text if you are going to include it in the supplementary figure.

      With regard to what was previously Fig S1C (now Fig S1A), we apologise for our confusing labelling. We were quoting these numbers as the percentage of live splenocytes (i.e. % of live cells). Typically ~80-90% of the total splenocytes were alive by the time we had processed, stained and analysed them by flow cytometry direct ex vivo. Of these, CD4 and CD8 T cells made up ~10-15% of the total live splenocytes (with most of the rest of the live cells being B cells).

      We have modified the axis to say “% of splenocytes” to make it clearer that this is what we are plotting.

      Figure S1 - Would be good to show that the T cells are truly deficient in PIM1/2 in your mice to be absolutely sure. You could just make a supplementary plot from your mass spec data.

      This is a good suggestion and we have now included this data as supplementary figure 2.

      To note, due to the Pim1 knockout mouse design this is not as simple as showing presence or absence of total PIM1 protein detection in this instance.

      To elaborate: the Pim1/Pim2 whole body KO mice used in this study were originally made by Prof Anton Berns’ lab (Pim1 KO = Laird et al Nucleic Acids Res, 1993, doi: 10.1093/nar/21.20.4750, with more detail on deletion construct in te Riele, H. et al, Nature,1990, DOI: 10.1038/348649a0; Pim2 KO = Mikkers et al, Mol Cell Biol, 2004, DOI: 10.1128/MCB.24.13.6104-6115.2004). They were given to Prof Victor Tybulewicz on an FVB/N background. He then backcrossed them onto the C57BL/6 background for > 10 generations then gave them to us to intercross into Pim1/2 dKO mice on a C57BL/6 background.

      The strategy for Pim1 deletion was as follows:

      A neomycin cassette was recombined into the Pim1 gene in exon 4 deleting 296 Pim1 nucleotides. More specifically, the 98th pim-1 codon (counted from the ATG start site = the translational starting point for the 34 kDa isoform of PIM1) was fused in frame by two extra codons (Ser, Leu) to the 5th neo codon (pKM109-90 was used). The 3'-end of neo included a polyadenylation signal. The cassette also contains the PyF101 enhancer (from piiMo +PyF101) to ensure expression of neo on homologous recombination in ES cells.

      Collectively this means that the PIM1 polypeptide is made prior to amino acid 98 of the 34 kDa isoform but not after this point. This deletes functional kinase activity in both the 34 kDa and 44 kDa PIM1 isoforms. Ablation of PIM1 kinase function using this KO was verified via kinase activity assay in Laird et al. Nucleic Acids Res 1993.

      The strategy to delete Pim2 was as follows:

      “For the Pim2 targeting construct, genomic BamHI fragments encompassing Pim2 exons 1, 2, and 3 were replaced with the hygromycin resistance gene (Pgp) controlled by the human PGK promoter.” (Mikkers et al Mol Cell Biol, 2004)

      The DDA mass spectrometry data collected in Fig 1 G-H and supplementary table 1 confirmed we do not detect peptides from after amino acid residue 98 in PIM1 (though we do detect peptides prior to this deletion point) and we do not detect peptides from the PIM2 protein in the Pim dKO mice. This confirms that no catalytically active PIM1/PIM2 proteins were made in these mice.

      We have added a supplementary figure S2 showing this and the following text (Lines 155-156):

      “Proteomics analysis confirmed that no catalytically active PIM1 and PIM2 protein were made in Pim dKO mice (Fig S2).”

      Figure 2A - I found the multiple arrows a little confusing - would just use arrows to indicate predicted MW of protein and stars to indicate non-specific. Why are there 3 bands/arrows for PIM2?  

      The arrows have now been removed. We now mention the PIM1 and PIM2 isoform sizes in the figure legend and have left the ladder markings on the blots to give an indication of protein sizes. There are 2 isoforms for PIM1 (34 and 44 kDa) in addition to the nonspecific band and 3 isoforms of PIM2 (40, 37, 34 kDa, though two of these isoform bands are fairly faint in this instance). These are all created via ribosome use of different translational start sites from a single Pim1 or Pim2 mRNA transcript.

      The following text has been added to the legend of Fig 2A:

      “Western blots of PIM1 (two isoforms of 44 and 34 kDa, non-specific band indicated by *), PIM2 (three isoforms of 40, 37 and 34 kDa) or pSTAT5 Y694 expression.”

      Figure 2A - why are the bands so faint for PIM1/2 (almost non-existent for PIM2 under no cytokine stim) here yet the protein expression seems abundant in Figure 1B upon stim without cytokines? Is this a sensitivity issue with WB vs proteomics? My apologies if I have missed something in the methods but please explain this discrepancy if not.

      There is differing sensitivity of western blotting versus proteomics, but this is not the reason for the discrepancy between the data in Fig 1B versus 2A. These differences reflect that Fig 1B and Fig 2A contrast PIM levels in two different sets of conditions and that, while proteomics allows for an estimate of ‘absolute abundance’, Western blotting only shows relative expression between the conditions assessed.

      To expand on this… Fig 1B proteomics looks at naïve versus 24 hr aCD3/aCD28 TCR activated T cells. The western blot data in Fig 2A looks at T cells activated for 1.5 days with SIINFEKL peptide and then washed free of the media containing the TCR stimulus and cultured with no stimulus for 4 or 24 hours, and contrasts this with cells cultured with IL-2 or IL-15 for 4 or 24 hours. All Fig 2A can tell us is that cytokine stimulation increases and/or sustains PIM1 and PIM2 protein above the level seen in TCR activated cells which have not been cultured with cytokine for a given time period. Overexposure of the blot does reveal detectable PIM1 and PIM2 protein in the no cytokine condition after 4 hrs. Whether this is equivalent to the PIM level in the 24 hr TCR activated cells in Fig 1B is not resolvable from this experiment as we have not included a sample from a naïve or 24 hr TCR activated T cell to act as a point of reference.

      Figure 4F - Your proteomics data shows substantial downregulation in proteomics data for granzymes and ifny- possibly from normalization to maximise the differences in the graph - and yet your flow suggests there are only modest differences. Can you explain why a discrepancy in proteomics and flow data - perhaps presenting in a more representative manner (e.g., protein counts)?

      The heatmaps are scaled from ‘row min’ to ‘row max’ for copy number comparison on a linear scale and do indeed visually maximise differences in expression between conditions. This feature of these heatmaps is also what makes the lack of difference in GzmB and GzmA in the mRNA heatmap in Fig 5C quite notable.

      We have now included bar graphs of Granzymes A and B and IFNg protein copy number in Figure 4 (see new Fig 4G-H) to make clearer the magnitude of the effect on the major effector proteins involved in CTL killing function. It is worth noting that flow cytometry histograms from what was formerly Fig 4G (now Fig 4I) are on a log-scale so the shift in fluorescence does generally correspond well with the ~1.7-2.75-fold reduction in protein expression observed.
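      To make the scaling argument concrete, the following minimal sketch shows why row min-max scaling exaggerates differences visually, and how large a shift the quoted ~1.7-2.75-fold reduction would produce on a log10 axis. The copy numbers and the helper names `row_min_max_scale` and `log10_shift` are hypothetical and are not taken from the study or from any plotting package.

```python
import math

def row_min_max_scale(values):
    """Linearly rescale one heatmap row so its min maps to 0 and max to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat row carries no contrast; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical copy numbers for one protein in [WT, dKO]; illustrative only.
two_fold_row = [1000000.0, 500000.0]
scaled_two_fold = row_min_max_scale(two_fold_row)

# A much smaller (1.1-fold) difference spans the same full colour range.
small_diff_row = [110.0, 100.0]
scaled_small = row_min_max_scale(small_diff_row)

def log10_shift(fold_reduction):
    """Size of the shift, in decades, on a log10 fluorescence axis."""
    return math.log10(fold_reduction)

shift_low = log10_shift(1.7)    # roughly a quarter of a decade
shift_high = log10_shift(2.75)  # a little under half a decade
```

      Both toy rows scale to the full colour range regardless of effect size, whereas a 1.7-2.75-fold reduction corresponds to a shift of well under one decade on a log-scale histogram, which is why the two displays can look so different for the same underlying change.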

      Figure 4G - did you use isotype controls for this flow experiment? Would help convince labelling has worked - particularly for low levels of IFNy production.

      We did not use isotype controls in these experiments, but we are using a well validated interferon gamma antibody and very careful colour panel/compensation controls to minimise background staining. The only way to be 100% confident that an antibody is selective is to use an interferon gamma null T cell, which we do not have. We do however know that the antibody we use gives flow cytometry data consistent with other orthogonal approaches to measure interferon gamma, e.g. ELISA and mass spectrometry.

      Figure 5M - why perform this with just the PIM kinase inhibitors? Can you do this readout for the WT vs. PIM1/2KO cells too? This would really support your claims for the paper about PIM influencing translation given the off-target effects of SMIs.

      Regrettably we have not done this particular experiment with the Pim dKO T cells. As mentioned above, due to this work being performed predominantly during the COVID19 pandemic we ultimately had to make the difficult decision to cease colony maintenance. When work restrictions were lifted we could not ethically or economically justify resurrecting a mouse colony for what was effectively one experiment, which is why we chose to test this key biological question with small molecule inhibitors instead.

      We appreciate that SMIs have off-target effects and this is why we used multiple panPIM kinase inhibitors for our SMI validation experiments. While the use of 2 different inhibitors still doesn’t completely negate the concern about possible off-target effects, our conclusions re: PIM kinases and impact on protein synthesis are not solely based on the inhibitor work but also on the decreased protein content of the PIM1/2 dKO T cells in the IL-2 CTL, and the data quantifying reductions in levels of many proteins but not their coding mRNA in PIM1/2 dKO T cells compared to controls.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We thank the reviewers for their positive and constructive evaluations. Based upon the reviewers’ helpful comments, we have performed complementary experiments. In particular, we additionally show that:

      • a complete analysis of CXCR1/2 binding chemokines in the secretions of tissular CD8+ T cells reinforces the key role of CXCL8 in CD8+ T cell-induced fibrocyte chemotaxis (new panel D in Figure 2)

      • a direct contact between fibrocytes and CD8+ T cells triggers CD8+ T cell cytotoxicity against primary basal bronchial epithelial cells (new Figure 6)

      • the interaction between CD8+ T cells and fibrocytes is bidirectional, with CD8+ T cells triggering the development of fibrocyte immune properties (new Figure 7)

      • the characteristic time to reach a stationary state reminiscent of a resolution of the COPD condition was estimated to be about 2.5 years using the simulations. Interfering with chemotaxis and adhesion processes by inhibiting CXCR1/2 and CD54, respectively was not sufficient to reverse the COPD condition, as predicted by the mathematical model (new Figure 9)

      • the massive proliferation effect induced by fibrocytes is specific to CD8+ T cells and not CD4+ T cells (new Figure 3-figure supplement 2), and that fibrocytes moderately promote the death of unactivated CD8+ T cells in direct co-culture (new Figure 3-figure supplement 3)

      We have graphically summarized our findings (new Figure 10), suggesting the existence of a positive feedback loop playing a role in the vicious cycle that promotes COPD. A new table describing patient characteristics for basal bronchial epithelial cell purification has also been added (new Supplementary File 9), and Supplementary Files 7 and 8 have been updated to take into account the new experiments.

      The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository with the dataset identifier PXD041402.  

      Reviewer #1 (Recommendations For The Authors):

      The experimental approaches are all rationally designed and the data clearly presented, with appropriate analyses and sample sizes. I could find no technical or interpretative concerns. The interrelationship between the observational data (histology) with the quantitative live cell imaging and the follow-on functional investigations is especially laudable. The data nicely unifies several years of accumulated data regarding the (separate) participation of CD8 T cells and fibrocytes in COPD.

      We thank the reviewer for his/her comments.

      I have only minor comments:

      1) Line 79: The observation that T cells may influence fibrocyte differentiation/function was initially made some years earlier by Abe et al (J Immunol 2001; 7556), and should be cited in addition to the follow-on work of Niedermeyer.

      This reference has been added to acknowledge this seminal work.

      2) Line 632: Corticosteroids originate from the cortex of the adrenal gland. Budesonide and fluticasone are glucocorticoids, not corticosteroids.

      This mistake has been corrected in the discussion of the revised manuscript (see line 802 in the revised manuscript).

      3) Given the state of T cell immunotherapies, cytokine/chemokine antagonists, and emerging fibrocyte-targeted drugs, can the authors possibly speculate as to desired pathways to target therapeutically?

      Chemokine receptor-based therapies, such as CXCR4 blockade, could be used to inhibit fibrocyte recruitment into the lungs. We have very recently shown that the CXCR4 antagonist plerixafor alleviates bronchial obstruction and reduces peri-bronchial fibrocyte density (Dupin et al., 2023). Because CXCR4 expression in human fibrocytes is dependent on mTOR signaling and is inhibited by rapamycin in vitro (Mehrad et al., 2009), alternative strategies consisting of targeting fibrocytes via mTOR have been proposed. This target has proven effective in bronchiolitis obliterans, idiopathic pulmonary fibrosis, and thyroid-associated ophthalmopathy, using rapamycin (Gillen et al., 2013; Mehrad et al., 2009), sirolimus (Manjarres et al., 2023) or an insulin-like growth factor-1 (IGF-I) receptor blocking antibody (Douglas et al., 2020; Smith et al., 2017). Inhibiting mTOR is also expected to have effects on CD8+ T cells, ranging from an immunostimulatory effect by activation of memory CD8+ T-cell formation, to an immunosuppressive effect by inhibition of T cell proliferation (Araki et al., 2010). Lastly, chemokine receptor-based therapies could also include strategies to inhibit the CD8+ T cell-induced fibrocyte chemotaxis, such as dual CXCR1-CXCR2 blockade. We were able to test this latter strategy in our mathematical model (see response to point 6 of reviewer 2).

      Immunotherapies directly targeting the interaction between fibrocytes and CD8+ T cells could also be considered, such as CD86 or CD54 blockade. The use of abatacept and belatacept, that interfere with T cell co-stimulation, is effective in patients with rheumatoid arthritis (Pombo-Suarez & Gomez-Reino, 2019) and in kidney-transplant recipients (Vincenti et al., 2016), respectively. Targeting the IGF-I receptor by teprotumumab in the context of thyroid-associated ophthalmopathy also improved disease outcomes, possibly by altering fibrocyte-T cell interactions (Bucala, 2022; Fernando et al., 2021).

      We also tested this CD86 and CD54 blocking strategy for COPD treatment by simulations, see response to point 6 of reviewer 2.

      However, such therapies should be used with caution as they may favour adverse events such as infections, particularly in the COPD population (Rozelle & Genovese, 2007). Additionally, the fibrocyte-lymphocyte interaction has recently been shown to promote anti-tumoral immunity via the PD1-PDL1 immunological synapse (Afroj et al., 2021; Mitsuhashi et al., 2023). Therefore, care should be taken in the selection of patients to be treated and/or the timing of treatment administration, given the increased risk of lung cancer in COPD patients.

      The discussion section has been altered accordingly.

      4) The authors may want to consider mentioning (and citing) recent insight into the immune-mediated fibrosis in thyroid-associated ophthalmopathy

      These important publications are now cited in a dedicated paragraph about the possible therapeutic interventions (see answer to point 3, and discussion in the revised manuscript).

      Reviewer #2 (Recommendations For The Authors):

      Specific comments

      1) The rationale for the selection of chemokines overexpressed by CD8+ T cells in COPD is based on literature data of n=2 patients per group. This is limited and risky. I am less concerned about false positives given the selection of chemokines and the available literature but am worried about the possibility that many chemokines may not have been selected based on insufficient power to do meaningful stats on this comparison. For example, many other CXCR1/2 binding CXCL chemokines exist and these could contribute to the migration effect in Fig 2C as well. Given the currently available single-cell resources it should be possible to extend these observations and to investigate CXCL chemokine expression in COPD CD8 T cells to the benefit of Fig 2A in full detail.

      We agree with the reviewer that the rationale for the selection of chemokines of interest could be reinforced by the analysis of supplementary single-cell resources. We used data from the COPD cell atlas (Gene Expression Omnibus GSE136831 (Sauler et al., 2022)) to perform such an analysis of chemokine expression by CD8+ CD103+ and CD8+ CD103- T cells. However, the expression level of all chemokines was globally very low, and was not different between control and COPD patients (see Author response image 1).

      Author response image 1.

      Expression of CXC chemokines in lung CD8+ CD103+ and CD8+ CD103- T cells from patients with COPD (n=18 independent samples) in comparison with healthy control subjects (n=29 independent samples) under resting conditions by Single-Cell RNA sequencing analysis (GEO accession GSE136831). The heatmaps show the normalized expression of genes (horizontal axes) encoding CXC chemokines. PF4=CXCL4, PPBP= CXCL7.

      These results are at odds with those from the transcriptomic analysis of microarray data obtained on purified lung CD8+ CD103+ and CD8+ CD103- T cells, which showed a significant level of chemokine expression (Hombrink et al., 2016) and differential expression of CCL2, CCL26, CXCL2, CXCL8 and CCL3L1 between CD8+ T lymphocytes of control and COPD patients (Figure 2A in the revised manuscript). The reason for these differences is unclear. They could be attributed to biological differences (samples obtained from different patients) or, more likely, to differences in sample processing (cell sorting by flow cytometry for the microarray analysis, which could minimally activate CD8+ cells) and/or methodological differences (differences in sensitivity between microarray and scRNA-seq).

      Nevertheless, the microarray data regarding CXCL8 expression are in good agreement with our in vitro experiments, which show enhanced CXCL8 expression by CD8+ T cells purified from COPD lungs in comparison with that of control subjects. In addition, the CXCL8 blocking antibody fully abrogates the increase in migration induced by the secretions of COPD CD8+ T cells, to the same extent as the blocking of CXCR1/2 by reparixin. This suggests that this additional chemotaxis is mainly due to CXCL8 and not to other CXCR1/2-binding CXCL chemokines, and it links the CXCL8 measurements to the functional experiments. This clarification has now been added to the results section of the revised version.

      2) Equally, it would strengthen the work if multiplex ELISA assays could be provided on the supernatants used in Fig 2D to provide a more comprehensive view of CXCR1/2 binding chemokines.

      To obtain a complete view of CXCR1/2-binding chemokines, we have now performed additional ELISA assays to measure the concentrations of CXCL1, 3, 5, 6 and 7, in addition to the measurements of CXCL2 and CXCL8 already presented in the previous version of the manuscript (Figure 2D). The results of these new assays are now presented in the revised version of Figure 2. Concentrations of CXCL1, 3, 5, 6 and 7 were unchanged between the control and COPD conditions.

      3) In the functional analyses, I missed information on the activation of the fibrocytes. Equally, the focus on CD8 T cells was mainly on proliferation in the functional work. RNAseq analyses on the cells, comparing CD8 T cells and fibrocytes, alone and in co-culture to each other would help to identify interaction patterns in comprehensive detail. Such an experiment would bolster the significance of the studies by providing impact analysis not only on the T cells beyond proliferation but by expanding on the effect of the interaction on the fibrocyte as well.

      Regarding the activation state of fibrocytes, we apologize if this was not clear: in our in vitro co-culture experiments, we chose not to activate the fibrocytes. This setting is in agreement with previous findings, demonstrating an antigen-independent T cell proliferation effect driven by fibrocytes (Nemzek et al., 2013), and it is now explicitly written in the results of the revised manuscript.

      Regarding the focus of the functional analyses:

      First, we have extended the analysis of the consequences of the interaction beyond CD8+ T cell proliferation. In particular, having shown that fibrocytes promote CD8+ T cell expression of cytotoxic molecules such as granzyme B, we decided to investigate the cytotoxic capacity of CD8+ T cells against primary basal bronchial epithelial cells (see new Supplementary File 9 in the revised manuscript for patient characteristics).

      Direct co-culture with fibrocytes increased total and membrane expression of the cytotoxic degranulation marker CD107a, which was only significant in non-activated CD8+ T cells (see new Figure 6A-E in the revised manuscript). A parallel increase of cytotoxicity against primary epithelial cells was observed in the same condition (see new Figure 6F-H in the revised manuscript). This demonstrates that following direct interaction with fibrocytes, CD8+ T cells have the ability to kill target cells such as bronchial epithelial cells. This is now included in the results section of the revised manuscript.

      Second, we have now performed proteomic analyses on fibrocytes, alone or in co-culture for 6 days with either non-activated or activated CD8+ T cells (see new Figure 7A in the revised manuscript). Of the top ten pathways that were most significantly activated in co-cultured vs mono-cultured fibrocytes, the most strongly upregulated were the dendritic cell maturation pathway, the multiple sclerosis signaling pathway, the neuroinflammation signaling pathway and the macrophage classical signaling pathway, irrespective of the activation state of CD8+ T cells (see new Figure 7B in the revised manuscript). The changes were broadly similar in the two conditions of CD8+ T cell activation, with some upregulations more pronounced in the activated condition. They were mostly driven by up-regulation of a core set of Major Histocompatibility Complex class I (HLA-B, C, F) and II (HLA-DMB, DPA1, DPB1, DRA, DRB1, DRB3) molecules, and of co-stimulatory and adhesion molecules (CD40, CD86 and CD54). Another notable proteomic signature was the increased expression of the IFN signaling mediators IKBE and STAT1, and of the IFN-responsive genes GBP2, GBP4 and RNF213. We also observed a strong downregulation of CD14, suggesting fibrocyte differentiation, and an upregulation of the matrix metalloproteinase-9 (MMP9) in the non-activated condition only. Altogether, these changes suggest that the interaction between CD8+ T cells and fibrocytes promotes the development of fibrocyte immune properties, which could subsequently impact the activation of CD4+ T cells.

      Up-regulated pathways identified in proteomic profile of fibrocytes co-cultured with CD8+ T cells are very consistent with a shift towards a proinflammatory phenotype rather than towards a reparative role. The activation of IFN-γ signaling could be triggered by CD8+ T cell secretion of IFN upon fibrocyte interaction, suggesting the existence of a positive feedback loop (see new Figure 10). Additionally, the priming of fibrocytes by CD8+ T cells could also induce CD4+ T cell activation.

      4) I suggest rewording the abstract to capture the main storyline and wording more. The abstract is good, but I see so many novelties in the paper that are not well sold in the abstract, particularly the modelling aspects.

      As suggested by the reviewer, we revised the abstract, as shown below and in the revised manuscript. The changes are indicated in red:

      Revised abstract:

      Bronchi of chronic obstructive pulmonary disease (COPD) are the site of extensive cell infiltration, allowing persistent contacts between resident cells and immune cells. Tissue fibrocytes interaction with CD8+ T cells and its consequences were investigated using a combination of in situ, in vitro experiments and mathematical modeling. We show that fibrocytes and CD8+ T cells are found in vicinity in distal airways and that potential interactions are more frequent in tissues from COPD patients compared to those of control subjects. Increased proximity and clusterization between CD8+ T cells and fibrocytes are associated with altered lung function. Tissular CD8+ T cells from COPD patients promote fibrocyte chemotaxis via the CXCL8-CXCR1/2 axis. Live imaging shows that CD8+ T cells establish short-term interactions with fibrocytes, that trigger CD8+ T cell proliferation in a CD54- and CD86-dependent manner, pro-inflammatory cytokines production, CD8+ T cell cytotoxic activity against bronchial epithelial cells and fibrocyte immunomodulatory properties. We defined a computational model describing these intercellular interactions and calibrated the parameters based on our experimental measurements. We show the model’s ability to reproduce histological ex vivo characteristics, and observe an important contribution of fibrocyte-mediated CD8+ T cell proliferation in COPD development. Using the model to test therapeutic scenarios, we predict a recovery time of several years, and the failure of targeting chemotaxis or interacting processes. Altogether, our study reveals that local interactions between fibrocytes and CD8+ T cells could jeopardize the balance between protective immunity and chronic inflammation in bronchi of COPD patients.

      5) The probabilistic model appears to suggest that reduced CD8 T cell death may also explain the increase in the pathology in COPD. Did the authors find that fibrocytes reduce cell death of the CD8 T cells?

      Taking advantage of the staining of CD8+ T cells with the death marker Zombie NIR™, we have quantified CD8+ T cell death in our co-culture assay. The presence of fibrocytes in the indirect co-culture assay did not affect CD8+ T cell death (see new Figure 3-figure supplement 3A-B in the revised manuscript). In direct co-culture, the death of CD8+ T cells was significantly increased in the non-activated condition but not in the activated condition (see new Figure 3-figure supplement 3C-D in the revised manuscript). Of note, these results are in agreement with a recent study showing the existence of CD8+ T cell-population-intrinsic mechanisms regulating cellular behavior, with induction of apoptosis to avoid an excessive increase in T cell population (Zenke et al., 2020). This is taken into account in our mathematical model by an increased probability p_(dC+) of dying when a CD8+ T cell is surrounded by many other T cells in its neighborhood. It also suggests that the reduced CD8+ T cell death evidenced in tissues from patients with COPD (Siena et al., 2011) might not be due to the specific interplay between fibrocyte and CD8+ T cells, but rather to a global pro-survival environment in COPD lungs.
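      The crowding-dependent death rule mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the square grid, the 8-cell neighbourhood, the function names, the crowding threshold and both probability values are assumptions chosen only to show the idea of a death probability that increases when a CD8+ T cell has many T cell neighbours.

```python
# Illustrative sketch only: grid layout, threshold and probabilities are
# hypothetical and are NOT taken from the manuscript's model.

def count_neighbors(cell, t_cells):
    """Count T cells in the 8 grid positions surrounding `cell`."""
    x, y = cell
    return sum(
        1
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0) and (x + dx, y + dy) in t_cells
    )


def death_probability(cell, t_cells, p_basal=0.01, p_crowded=0.05,
                      crowding_threshold=3):
    """Per-step death probability of a CD8+ T cell.

    Returns the increased probability `p_crowded` when the cell has at
    least `crowding_threshold` T cell neighbours, else `p_basal`.
    """
    n = count_neighbors(cell, t_cells)
    return p_crowded if n >= crowding_threshold else p_basal


if __name__ == "__main__":
    t_cells = {(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)}
    print(death_probability((0, 0), t_cells))  # crowded cell
    print(death_probability((5, 5), t_cells))  # isolated cell
```

      In this toy setting, a cell inside a cluster gets the higher probability and an isolated cell gets the basal one, which is the qualitative behaviour described for p_(dC+).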

      These new data have been described in the results section.

      6) Following the modeling in Figure 6, curiosity came to mind, which is how long it would take for the pathology to disappear if a drug would be applied to the patient. How much should the interactions be reduced and how long would it take to reach clinical benefit? Could such predictions be made? I understand that this may be outside the main message of the manuscript but perhaps this could be included in the discussion.

      This is a very interesting question, which we have addressed by performing additional simulations to investigate the outcomes of possible therapeutic interventions. First, we applied COPD dynamics for 20 years to generate the COPD state, which provides the basis for treatment implementation. Then, we applied COPD dynamics for 7 years, mimicking the placebo condition (see new Figure 9A in the revised manuscript, and below), and compared this with control dynamics (“Total inhibition”), mimicking an ideal treatment able to restore all cellular processes. As expected, the populations of fibrocytes and CD8+ T cells, as well as the density of mixed clusters, decreased. These numbers reached levels similar to those of healthy subjects after approximately 2.5 years, and this time point can therefore be considered as the steady state (Figure 9B-E).

      Monitoring of the different processes revealed that these effects were mainly due to a reduction in fibrocyte-induced CD8+ T cell duplication, and a transient or more prolonged increase in basal fibrocyte and CD8+ T cell death (Figure 9C-D).

      Then, three possible realistic treatments were considered (Figure 9A). We tested the effect of directly inhibiting the interaction between fibrocytes and CD8+ T cells by blocking CD54. This was implemented in the model by altering the increased probability of a CD8+ T cell to divide when a fibrocyte is in its neighbourhood, as shown by the co-culture results (Figure 4). We also chose to reflect the effect of a dual CXCR1/2 inhibition by setting the displacement function of fibrocytes to be similar to that of control dynamics, in agreement with the in vitro experiments (Figure 2E). Blocking CD54 only slightly reduced the density of CD8+ T cells compared to the placebo condition, and had no effect on fibrocyte and mixed cluster densities (Figure 9B). CXCR1/2 inhibition was slightly more potent than CD54 inhibition in reducing CD8+ T cells, and it also significantly decreased the density of mixed clusters (Figure 9B). As expected, this occurred through a reduction of fibrocyte-induced duplication, which was affected more strongly by CXCR1/2 blockade than by CD54 blockade (Figure 9C-E). Combining both therapies (CD54 and CXCR1/2 inhibition) did not strongly enhance the effects (Figure 9B-E). In all the conditions tested, the size of the fibrocyte population remained unchanged, suggesting that other processes such as fibrocyte death or infiltration should be targeted to expect broader effects.
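      To make the treatment implementation more concrete, the CD54 blockade can be pictured as scaling down the extra division probability that a CD8+ T cell gains when a fibrocyte is nearby. The sketch below is a hedged illustration, not the published model: the function names, the `cd54_block` parameter and every numerical value are hypothetical.

```python
# Illustrative sketch only: all names and numbers are assumptions, not the
# parameters calibrated in the manuscript.

def division_probability(fibrocyte_nearby, p_basal=0.02, p_boost=0.10,
                         cd54_block=0.0):
    """Per-step division probability of a CD8+ T cell.

    `cd54_block` in [0, 1] is the fraction of the fibrocyte-induced boost
    removed by the (hypothetical) treatment; 0 mimics placebo.
    """
    if not 0.0 <= cd54_block <= 1.0:
        raise ValueError("cd54_block must lie in [0, 1]")
    boost = p_boost * (1.0 - cd54_block) if fibrocyte_nearby else 0.0
    return p_basal + boost


def expected_net_growth(frac_near_fibrocyte, p_death, **kwargs):
    """Expected per-step net growth rate (division minus death) of the
    CD8+ T cell population, given the fraction of cells near a fibrocyte."""
    p_div = (frac_near_fibrocyte * division_probability(True, **kwargs)
             + (1.0 - frac_near_fibrocyte) * division_probability(False, **kwargs))
    return p_div - p_death


if __name__ == "__main__":
    print(expected_net_growth(0.5, p_death=0.03))                  # placebo
    print(expected_net_growth(0.5, p_death=0.03, cd54_block=1.0))  # full block
```

      With these made-up values, the population grows on average under placebo and shrinks under a complete block; how strongly a real treatment shifts this balance depends on the calibrated parameters, which are not reproduced here.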

      The results section has been altered accordingly.

      Using the simulations, we were also able to estimate the characteristic time to reach a stationary state reminiscent of a resolution of the COPD condition. This time of approximately 2.5 years could not have been predicted from in vitro experiments, and indicates that a treatment aiming at restoring these cellular processes should be continued for several years to obtain significant changes.

      We have also investigated the outcomes of more realistic treatments, modifying specifically processes such as chemotaxis or targeting directly the intercellular interactions. The modification of parameters controlling these processes only slightly affected the final state, suggesting that such treatments may be more effective when used in combination with other drugs e.g. those affecting fibrocyte infiltration and/or death.

      The discussion section has been altered accordingly.

      Reviewer #3 (Recommendations For The Authors):

      1) Broader assessment of cell types in the lung: Staining for other cell types such as dendritic cells, CD4 cells, and interstitial macrophages, and comparing their proximity to fibrocytes with that of CD8 cells would better justify the CD8 focus.

      We agree with the reviewer that multiple stainings would have better justified the focus on CD8+ T cells. However, it is difficult to distinguish fibrocytes, dendritic cells and interstitial macrophages on the basis of immunohistochemistry, as we and others previously showed (Dupin et al., 2019; Mitsuhashi et al., 2015; Pilling et al., 2009). On the other hand, the study of Afroj et al. indicated a possible interaction between fibrocytes and CD8+ T cells in the context of cancer, with the induction of CD8+ T cell proliferation (Afroj et al., 2021). This T cell-costimulatory function of fibrocytes towards CD8+ T cells was further confirmed in a very recent study, together with the antitumor effects of PD-L1 and VEGF blockade (Mitsuhashi et al., 2023). These data, along with the specific implication of CD8+ T cells in COPD, which relies mainly on their abundance in COPD bronchi (O’Shaughnessy et al., 1997), their overactivation state (Roos-Engstrand et al., 2009), their cytotoxic phenotype (Freeman et al., 2010; Wang et al., 2020) and the protection against lung inflammation and emphysema induced by their depletion (Maeno et al., 2007), justified the CD8 focus.

      To further justify this focus, we have now performed co-culture between fibrocytes and CD4+ T cells, indicating that the massive fibrocyte-mediated proliferation was specific to CD8+ T cells (see answer to comment 3 below). This is in agreement with the results obtained with the simulations, showing that considering fibrocytes and CD8+ T cells only was sufficient to reproduce the spatial patterns in the bronchi of healthy and COPD patients. Altogether, we think that focusing on the CD8+ T cell-fibrocyte interplay was pertinent in the context of COPD. It obviously does not exclude the possibility of other interactions, which could be the focus of other studies.

      2) Transcriptomic analysis: Using n=2 and only showing the chemokines as well as selected adhesion receptor data narrows the focus but does not provide broader insights into the interactions. Using a more robust sample size and performing a comprehensive pathway analysis would represent an unbiased analysis to determine the most dysregulated pathways. Importantly, the authors could use a single-cell RNA-seq dataset to broadly assess the transcriptomes of several cell types in the lung (such as the data from (Sauler et al, Characterization of the COPD alveolar niche using single-cell RNA sequencing).

      This very pertinent suggestion has also been raised by reviewer 2, see our answer to comment 1 of reviewer 2, and below:

      We agree with the reviewer that the rationale for the selection of chemokines of interest could be reinforced by the analysis of supplementary single-cell resources. We used data from the COPD cell atlas (Gene Expression Omnibus GSE136831 (Sauler et al., 2022)) to perform such an analysis of chemokine expression by CD8+ CD103+ and CD8+ CD103- T cells. However, the expression level of all chemokines was globally very low, and was not different between control and COPD patients (see Author response image 1, in the answer to comment 1 of reviewer 2).

      These results are at odds with those from the transcriptomic analysis of microarray data obtained on purified lung CD8+ CD103+ and CD8+ CD103- T cells, which showed a significant level of chemokine expression (Hombrink et al., 2016) and differential expression of CCL2, CCL26, CXCL2, CXCL8 and CCL3L1 between CD8+ T lymphocytes of control and COPD patients (Figure 2A in the revised manuscript). The reason for these differences is unclear. They could be attributed to biological differences (samples obtained from different patients) or, more likely, to differences in sample processing (cell sorting by flow cytometry for the microarray analysis, which could minimally activate CD8+ cells) and/or methodological differences (differences in sensitivity between microarray and scRNA-seq).

      Nevertheless, the microarray data regarding CXCL8 expression are in good agreement with our in vitro experiments, which show enhanced CXCL8 expression by CD8+ T cells purified from COPD lungs in comparison with that of control subjects. In addition, the CXCL8 blocking antibody fully abrogates the increase in migration induced by the secretions of COPD CD8+ T cells, to the same extent as the blocking of CXCR1/2 by reparixin. This suggests that this additional chemotaxis is mainly due to CXCL8 and not to other CXCR1/2-binding CXCL chemokines, and it links the CXCL8 measurements to the functional experiments. This clarification has now been added to the text of the revised version.

      3) Inclusion of control/comparison cell types in co-culture studies would help establish that CD8 cells are more relevant for interactions with fibrocytes than for example CD4 cells.

      We have now performed co-cultures between fibrocytes and CD4+ T cells, with the same settings as for CD8+ T cells. The results from these experiments show that fibrocytes did not have any significant effect on CD4+ T cell death, regardless of their activation state (see new Figure 3-figure supplement 2A-C in the revised manuscript, and below). Fibrocytes were able to promote CD4+ T cell proliferation in the activated condition but not in the non-activated condition (see new Figure 3-figure supplement 2A-D in the revised manuscript). Altogether, this indicates that although the fibrocyte-mediated effect on proliferation is not specific to CD8+ T cells, the amplitude of the effect is much larger on CD8+ T cells than on CD4+ T cells.

      These new data have been added in the results section.

      4) In vitro analysis of cells from non-COPD patients would also help assess whether the circulating cells from COPD patients have a level of baseline activation which promotes the vicious cycle but may not exist in healthy cells.

      Regarding circulating cells, the present study relies on the COBRA cohort (COhort of BRonchial obstruction and Asthma), which includes only asthma and COPD patients, and therefore does not grant access to healthy subjects’ blood samples (Pretolani et al., 2017). Unfortunately, we have no other ongoing study with healthy subjects that would allow us to retrieve blood for research, and fibrocytes can only be grown from freshly drawn blood samples. We agree with the reviewer that it is a limitation of our study, which is now acknowledged at the end of the discussion section.  

      References

      Afroj, T., Mitsuhashi, A., Ogino, H., Saijo, A., Otsuka, K., Yoneda, H., Tobiume, M., Nguyen, N. T., Goto, H., Koyama, K., Sugimoto, M., Kondoh, O., Nokihara, H., & Nishioka, Y. (2021). Blockade of PD-1/PD-L1 Pathway Enhances the Antigen-Presenting Capacity of Fibrocytes. The Journal of Immunology, 206(6), 1204‑1214. https://doi.org/10.4049/jimmunol.2000909

      Araki, K., Youngblood, B., & Ahmed, R. (2010). The role of mTOR in memory CD8+ T-cell differentiation. Immunological reviews, 235(1), 234‑243. https://doi.org/10.1111/j.0105-2896.2010.00898.x

      Bucala, R. J. (2022). Targeting fibrocytes in autoimmunity. Proceedings of the National Academy of Sciences, 119(5), e2121739119. https://doi.org/10.1073/pnas.2121739119

      Douglas, R. S., Kahaly, G. J., Patel, A., Sile, S., Thompson, E. H. Z., Perdok, R., Fleming, J. C., Fowler, B. T., Marcocci, C., Marinò, M., Antonelli, A., Dailey, R., Harris, G. J., Eckstein, A., Schiffman, J., Tang, R., Nelson, C., Salvi, M., Wester, S., … Smith, T. J. (2020). Teprotumumab for the Treatment of Active Thyroid Eye Disease. The New England Journal of Medicine, 382(4), 341‑352. https://doi.org/10.1056/NEJMoa1910434

      Dupin, I., Henrot, P., Maurat, E., Abohalaka, R., Chaigne, S., Hamrani, D. E., Eyraud, E., Prevel, R., Esteves, P., Campagnac, M., Dubreuil, M., Cardouat, G., Bouchet, C., Ousova, O., Dupuy, J.-W., Trian, T., Thumerel, M., Begueret, H., Girodet, P.-O., … Berger, P. (2023). CXCR4 blockade alleviates pulmonary and cardiac outcomes in early COPD (p. 2023.03.10.529743). bioRxiv. https://doi.org/10.1101/2023.03.10.529743

      Dupin, I., Thumerel, M., Maurat, E., Coste, F., Eyraud, E., Begueret, H., Trian, T., Montaudon, M., Marthan, R., Girodet, P.-O., & Berger, P. (2019). Fibrocyte accumulation in the airway walls of COPD patients. The European Respiratory Journal, 54(3). https://doi.org/10.1183/13993003.02173-2018

      Fernando, R., Caldera, O., & Smith, T. J. (2021). Therapeutic IGF-I receptor inhibition alters fibrocyte immune phenotype in thyroid-associated ophthalmopathy. Proceedings of the National Academy of Sciences, 118(52), e2114244118. https://doi.org/10.1073/pnas.2114244118

      Freeman, C. M., Han, M. K., Martinez, F. J., Murray, S., Liu, L. X., Chensue, S. W., Polak, T. J., Sonstein, J., Todt, J. C., Ames, T. M., Arenberg, D. A., Meldrum, C. A., Getty, C., McCloskey, L., & Curtis, J. L. (2010). Cytotoxic potential of lung CD8+ T cells increases with COPD severity and with in vitro stimulation by IL-18 or IL-15. Journal of Immunology (Baltimore, Md.: 1950), 184(11), 6504‑6513. https://doi.org/10.4049/jimmunol.1000006

      Gillen, J. R., Zhao, Y., Harris, D. A., LaPar, D. J., Stone, M. L., Fernandez, L. G., Kron, I. L., & Lau, C. L. (2013). Rapamycin Blocks Fibrocyte Migration and Attenuates Bronchiolitis Obliterans in a Murine Model. The Annals of thoracic surgery, 95(5), 1768‑1775. https://doi.org/10.1016/j.athoracsur.2013.02.021

      Hombrink, P., Helbig, C., Backer, R. A., Piet, B., Oja, A. E., Stark, R., Brasser, G., Jongejan, A., Jonkers, R. E., Nota, B., Basak, O., Clevers, H. C., Moerland, P. D., Amsen, D., & van Lier, R. A. W. (2016). Programs for the persistence, vigilance and control of human CD8+ lung-resident memory T cells. Nature Immunology, 17(12). https://doi.org/10.1038/ni.3589

      Maeno, T., Houghton, A. M., Quintero, P. A., Grumelli, S., Owen, C. A., & Shapiro, S. D. (2007). CD8+ T Cells are required for inflammation and destruction in cigarette smoke-induced emphysema in mice. Journal of Immunology (Baltimore, Md.: 1950), 178(12), 8090‑8096. https://doi.org/10.4049/jimmunol.178.12.8090

      Manjarres, D. C. G., Axell-House, D. B., Patel, D. C., Odackal, J., Yu, V., Burdick, M. D., & Mehrad, B. (2023). Sirolimus suppresses circulating fibrocytes in idiopathic pulmonary fibrosis in a randomized controlled crossover trial. JCI Insight. https://doi.org/10.1172/jci.insight.166901

      Mehrad, B., Burdick, M. D., & Strieter, R. M. (2009). Fibrocyte CXCR4 regulation as a therapeutic target in pulmonary fibrosis. The International Journal of Biochemistry & Cell Biology, 41(8‑9), 1708‑1718. https://doi.org/10.1016/j.biocel.2009.02.020

      Mitsuhashi, A., Goto, H., Saijo, A., Trung, V. T., Aono, Y., Ogino, H., Kuramoto, T., Tabata, S., Uehara, H., Izumi, K., Yoshida, M., Kobayashi, H., Takahashi, H., Gotoh, M., Kakiuchi, S., Hanibuchi, M., Yano, S., Yokomise, H., Sakiyama, S., & Nishioka, Y. (2015). Fibrocyte-like cells mediate acquired resistance to anti-angiogenic therapy with bevacizumab. Nature Communications, 6(1). https://doi.org/10.1038/ncomms9792

      Mitsuhashi, A., Koyama, K., Ogino, H., Afroj, T., Nguyen, N. T., Yoneda, H., Otsuka, K., Sugimoto, M., Kondoh, O., Nokihara, H., Hanibuchi, M., Takizawa, H., Shinohara, T., & Nishioka, Y. (2023). Identification of fibrocyte cluster in tumors reveals the role in antitumor immunity by PD-L1 blockade. Cell Reports, 112162. https://doi.org/10.1016/j.celrep.2023.112162

      Nemzek, J. A., Fry, C., & Moore, B. B. (2013). Adoptive transfer of fibrocytes enhances splenic T-cell numbers and survival in septic peritonitis. Shock (Augusta, Ga.), 40(2), 106‑114. https://doi.org/10.1097/SHK.0b013e31829c3c68

      O’Shaughnessy, T. C., Ansari, T. W., Barnes, N. C., & Jeffery, P. K. (1997). Inflammation in bronchial biopsies of subjects with chronic bronchitis: Inverse relationship of CD8+ T lymphocytes with FEV1. American Journal of Respiratory and Critical Care Medicine, 155(3), 852‑857. https://doi.org/10.1164/ajrccm.155.3.9117016

      Pilling, D., Fan, T., Huang, D., Kaul, B., & Gomer, R. H. (2009). Identification of markers that distinguish monocyte-derived fibrocytes from monocytes, macrophages, and fibroblasts. PloS One, 4(10), e7475. https://doi.org/10.1371/journal.pone.0007475

      Pombo-Suarez, M., & Gomez-Reino, J. J. (2019). Abatacept for the treatment of rheumatoid arthritis. Expert Review of Clinical Immunology, 15(4), 319‑326. https://doi.org/10.1080/1744666X.2019.1579642

      Pretolani, M., Soussan, D., Poirier, I., Thabut, G., Aubier, M., COBRA Study Group, & COBRA cohort Study Group. (2017). Clinical and biological characteristics of the French COBRA cohort of adult subjects with asthma. The European Respiratory Journal, 50(2), 1700019. https://doi.org/10.1183/13993003.00019-2017

      Roos-Engstrand, E., Ekstrand-Hammarström, B., Pourazar, J., Behndig, A. F., Bucht, A., & Blomberg, A. (2009). Influence of smoking cessation on airway T lymphocyte subsets in COPD. COPD, 6(2), 112‑120. https://doi.org/10.1080/15412550902755358

      Rozelle, A. L., & Genovese, M. C. (2007). Efficacy results from pivotal clinical trials with abatacept. Clinical and Experimental Rheumatology, 25(5 Suppl 46), S30-34.

      Sauler, M., McDonough, J. E., Adams, T. S., Kothapalli, N., Barnthaler, T., Werder, R. B., Schupp, J. C., Nouws, J., Robertson, M. J., Coarfa, C., Yang, T., Chioccioli, M., Omote, N., Cosme, C., Poli, S., Ayaub, E. A., Chu, S. G., Jensen, K. H., Gomez, J. L., … Rosas, I. O. (2022). Characterization of the COPD alveolar niche using single-cell RNA sequencing. Nature Communications, 13(1). https://doi.org/10.1038/s41467-022-28062-9

      Siena, L., Gjomarkaj, M., Elliot, J., Pace, E., Bruno, A., Baraldo, S., Saetta, M., Bonsignore, M. R., & James, A. (2011). Reduced apoptosis of CD8+ T-lymphocytes in the airways of smokers with mild/moderate COPD. Respiratory Medicine, 105(10), 1491‑1500. https://doi.org/10.1016/j.rmed.2011.04.014

      Smith, T. J., Kahaly, G. J., Ezra, D. G., Fleming, J. C., Dailey, R. A., Tang, R. A., Harris, G. J., Antonelli, A., Salvi, M., Goldberg, R. A., Gigantelli, J. W., Couch, S. M., Shriver, E. M., Hayek, B. R., Hink, E. M., Woodward, R. M., Gabriel, K., Magni, G., & Douglas, R. S. (2017). Teprotumumab for Thyroid-Associated Ophthalmopathy. The New England Journal of Medicine, 376(18), 1748‑1761. https://doi.org/10.1056/NEJMoa1614949

      Vincenti, F., Rostaing, L., Grinyo, J., Rice, K., Steinberg, S., Gaite, L., Moal, M.-C., Mondragon-Ramirez, G. A., Kothari, J., Polinsky, M. S., Meier-Kriesche, H.-U., Munier, S., & Larsen, C. P. (2016). Belatacept and Long-Term Outcomes in Kidney Transplantation. The New England Journal of Medicine, 374(4), 333‑343. https://doi.org/10.1056/NEJMoa1506027

      Wang, X., Zhang, D., Higham, A., Wolosianka, S., Gai, X., Zhou, L., Petersen, H., Pinto-Plata, V., Divo, M., Silverman, E. K., Celli, B., Singh, D., Sun, Y., & Owen, C. A. (2020). ADAM15 expression is increased in lung CD8+ T cells, macrophages, and bronchial epithelial cells in patients with COPD and is inversely related to airflow obstruction. Respiratory Research, 21(1), 188. https://doi.org/10.1186/s12931-020-01446-5

      Zenke, S., Palm, M. M., Braun, J., Gavrilov, A., Meiser, P., Böttcher, J. P., Beyersdorf, N., Ehl, S., Gerard, A., Lämmermann, T., Schumacher, T. N., Beltman, J. B., & Rohr, J. C. (2020). Quorum Regulation via Nested Antagonistic Feedback Circuits Mediated by the Receptors CD28 and CTLA-4 Confers Robustness to T Cell Population Dynamics. Immunity, 52(2), 313-327.e7. https://doi.org/10.1016/j.immuni.2020.01.018

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This important study investigated the role of oxytocin (OT) neurons in the paraventricular nucleus (PVN) and their projections to the medial prefrontal cortex (mPFC) in regulating pup care and infanticide behaviors in mandarin voles. The researchers used techniques like immunofluorescence, optogenetics, OT sensors, and peripheral OT administration. Activating OT neurons in the PVN reduced the time it took pup-caring male voles to approach and retrieve pups, facilitating pup-care behavior. However, this activation had no effect on females. Interestingly, this same PVN OT neuron activation also reduced the time for both male and female infanticidal voles to approach and attack pups, suggesting PVN OT neuron activity can promote pup care while inhibiting infanticide behavior. Inhibition of these neurons promoted infanticide. Stimulating PVN->mPFC OT projections facilitated pup care in males and in infanticide-prone voles, activation of these terminals prolonged latency to approach and attack. Inhibition of PVN->mPFC OT projections promoted infanticide. Peripheral OT administration increased pup care in males and reduced infanticide in both sexes. However, some results differed in females, suggesting other mechanisms may regulate female pup care.

      Strengths:

      This multi-faceted approach provides converging evidence, strengthens the conclusions drawn from the study, and makes them very convincing. Additionally, the study examines both pup care and infanticide behaviors, offering insights into the mechanisms underlying these contrasting behaviors. The inclusion of both male and female voles allows for the exploration of potential sex differences in the regulation of pup-directed behaviors. The peripheral OT administration experiments also provide valuable information for potential clinical applications and wildlife management strategies.

      Weaknesses:

      While the study presents exciting findings, there are several weaknesses that should be addressed. The sample sizes used in some experiments, such as the Fos study and optogenetic manipulations, appear to be small, which may limit the statistical power and generalizability of the results. Effect sizes are not reported, making it difficult to evaluate the practical significance of the findings. The imaging parameters and analysis details for the Fos study are not clearly described, hindering the interpretation of these results (i.e., was the entire PVN counted?). Also, does the Fos colocalization align with previous studies that look at PVN Fos and maternal/ paternal care? Additionally, the study lacks electrophysiological data to support the optogenetic findings, which could provide insights into the neural mechanisms underlying the observed behaviors. 

      In some previous studies (He et al., 2019; Mei, Yan, Yin, Sullivan, & Lin, 2023), the sample sizes in morphological experiments were similarly small and may still be representative. We agree with the reviewer that results from larger samples would be more statistically powerful and generalizable, and we will pay attention to this issue in future studies. As the reviewer suggested, we have added effect sizes (d, η² and odds ratio) to both the source data and the main text. We have added the objective magnification to the figure legend. The imaging parameters and analysis details for the Fos study have also been added to the revised manuscript. Brain slices 40 µm thick were collected consecutively across 4 slides; each slide held 6 brain slices spaced 160 µm apart. The PVN area was determined based on the Allen Mouse Brain Atlas and our previous study, and Fos-positive, OT-positive and merged (double-labeled) neurons were counted. Our results on Fos and OT colocalization are consistent with previous studies. In a previous study on virgin male prairie voles, OT and Fos colabeled neurons in the PVN increased after exposure to conspecific pups and experiencing paternal care (Kenkel et al., 2012). In another study of prairie voles, OT and c-fos colabeled neurons in the PVN significantly increased after the animals became parents, which may be due to the shift from virgin to parent (Kelly, Hiura, Saunders, & Ophir, 2017). To support the optogenetic findings, we used c-Fos expression as a marker of neuronal activity and found significant increases/decreases in c-Fos-positive neurons induced by optogenetic activation/inhibition (Supplementary Data Fig. 1). Additionally, using OT1.0 sensors, we found that optogenetic inhibition of OT neurons reduced levels of OT release. Based on these two experiments, we verified that the optogenetic manipulation in the present study is valid and that the results of the optogenetic experiments are reliable (Supplementary Data Fig. 5).

      The study has several limitations that warrant further discussion. Firstly, the potential effects of manipulating OT neurons on the release of other neurotransmitters (or the influence of other neurochemicals or brain regions) on pup-directed behaviors, especially in females, are not fully explored. Additionally, it is unclear whether back-propagation of action potentials during optogenetic manipulations causes the same behavioral effect as direct stimulation of PVN OT cells. Moreover, the authors do not address whether the observed changes in behavior could be explained by overall increases or decreases in locomotor activity.

      We agree with the reviewer that several limitations should be discussed. Although we used a viral strategy to specifically activate or inhibit PVN OT neurons, other neurochemicals may also be released during optogenetic manipulation, because OT neurons may co-release other neurochemicals. In one of our previous studies, activation of OT neuron projections from the PVN to the VTA, as well as to the NAc, also altered pup-directed behaviors, which may be accompanied by dopamine release (He et al., 2021). In addition, backpropagation of action potentials during optogenetic manipulation may cause the same behavioral effect as direct stimulation of PVN OT cells. These effects on pup-directed behaviors should be investigated further in future studies. For the optogenetic experiments, we followed previous research (Mei et al., 2023; Murugan et al., 2017), and we also verified the reliability of the methods in our own study. To exclude effects of locomotor activity on pup-directed behaviors, we also examined the effect of optogenetic manipulation on the locomotor activity of the experimental animals and found that it did not change levels of locomotor activity (Supplementary Data Fig. 6).

      The authors do not specify the percentage of PVN->mPFC neurons labeled that were OT-positive, nor do they directly compare the sexes in their behavioral analysis (or if they did, it is not clear statistically). While the authors propose that the sex difference in pup-directed behaviors is due to females having greater OT expression, they do not provide evidence to support this claim from their labeling data. It is also uncertain whether more OT neurons were manipulated in females compared to males. The study could benefit from a more comprehensive discussion of other factors that could influence the neural circuit under investigation, especially in females.

      The AAV11-Ef1a-EGFP virus can infect fibers and travel retrogradely to the cell body, so it can be used to retrogradely trace neurons. We injected this virus (green, AAV11-Ef1a-EGFP) into the mPFC and observed virus-infected, OT-positive (red) neurons in the PVN (yellow). We also counted the OT neurons that project from the PVN to the mPFC and found that approximately 45.16% and 40.79% of cells projecting from the PVN to the mPFC were OT-positive, and approximately 18.48% and 18.89% of OT cells in the PVN projected to the mPFC, in females and males, respectively (Supplementary Data Fig. 4). In addition, as the reviewers suggested, we compared the numbers of OT neurons, activated OT neurons (OT and Fos double-labeled neurons) and levels of OT release between males and females. We found that females had more activated OT neurons (Figure 1 d, g) and released higher levels of OT into the mPFC (Figure 4 d, e) than males. This has been added to the Results and Discussion. We did not analyze whether more OT neurons were manipulated in females than in males, which is indeed a limitation of this study that requires our attention.

      As the reviewers suggested, we also discussed other factors that could influence the neural circuit under investigation. In addition to OT neurons, OTR neurons may also regulate behavioral responses to pups. In a study of virgin female mice, pup exposure was found to activate oxytocin and oxytocin receptor-expressing neurons (Okabe et al., 2017). Other brain regions, such as the preoptic area (POA), may also be involved in parental behaviors. For example, virgin female mice repeatedly exposed to pups showed shorter retrieval latencies and greater c-Fos expression in the POA; concentrations of OT in the POA were also significantly increased, and the facilitation of alloparental behavior by repeated exposure to pups occurred through the organization of the OT system (Okabe et al., 2017). A recent study suggests that OT from the PVN is involved in the care of pups by male voles (He et al., 2021). That study suggests that PVN to ventral tegmental area (VTA) OT projections, as well as VTA to nucleus accumbens (NAc) DA projections, are involved in pup care by male voles. Inhibition of OT projections from the PVN to the VTA reduces DA release in the NAc during licking and grooming of pups (He et al., 2021). The effects of these factors on pup-directed responses should also be considered in future studies.

      Reviewer #2 (Public Review):

      Summary:

      This series of experiments studied the involvement of PVN OT neurons and their projection to the mPFC in pup-care and attack behavior in virgin male and female Mandarin voles. Using Fos visualization, optogenetics, fiber photometry, and IP injection of OT the results converge on OT regulating caregiving and attacks on pups. Some sex differences were found in the effects of the manipulations.

      Strengths:

      Major strengths are the modern multi-method approaches and involving both sexes of Mandarin vole in every experiment.

      Weaknesses:

      Weaknesses include the lack of some specific details in the methods that would help readers interpret the results. These include:

      (1) No description of diffusion of centrally injected agents.

      Thank you for your careful consideration. Only individuals with appropriate viral expression and optical fiber placement were included in the statistical analysis; all others were excluded. For optogenetic experiments, the virus (AAV2/9-mOXT-hCHR2(H134R)–mCherry-ER2-WPRE-pA or rAAV-mOXT-eNpHR3.0-mCherry-WPRE-hGH-pA) was designed and constructed to infect only OT neurons, which limited the spread of viral expression. For fiber photometry experiments, expression of the OT1.0 sensor was largely restricted to the mPFC, and individuals with incorrect optical fiber placement were not included in the statistical analysis. The diffusion of the centrally injected optogenetic viruses and OT1.0 sensors is shown in the supplemental figure (Supplementary Data Fig. 7).

      (2) Whether all central targets were consistent across animals included in the data analyses. This includes that is not stated if the medial prelimbic mPFC target was in all optogenetic study animals as shown in Figure 4 and if that is the case, there is no discussion of that subregion's function compared to other mPFC subregions.

      As shown in Figure 4 and in the schematic diagram of the optogenetic experiment, the central targets of virus infection and fiber location were consistent across the animals included in the data analysis; animals that did not meet this criterion were excluded. In the present study, viruses were injected into the prelimbic (PrL) cortex. The PrL and infralimbic (IL) regions of the mPFC play different roles in different social interaction contexts (Bravo-Rivera, Roman-Ortiz, Brignoni-Perez, Sotres-Bayon, & Quirk, 2014; Moscarello & LeDoux, 2013). A study has shown that the PrL region of the mPFC contributes to active avoidance in situations where conflict needs to be mitigated, but also contributes to the retention of conflict responses for reward (Capuzzo & Floresco, 2020). This may indicate that the suppression of infanticide by PVN to mPFC OT projections is a behavioral consequence of active conflict avoidance. In a study on pain in rats, OT neuron projections from the PVN to the PrL were found to increase the responsiveness of cell populations in the PrL, suggesting that OT may act by altering the local excitation-inhibition (E/I) balance in the PrL (Liu et al., 2023). A study on anxiety-related behaviors in male rats suggests that the anxiolytic effects of OT in the mPFC are specific to the PrL, not the infralimbic or anterior cingulate regions, and that this is achieved primarily through the engagement of GABAergic neurons, which ultimately modulate downstream anxiety-related brain regions, including the amygdala (Sabihi, Dong, Maurer, Post, & Leuner, 2017). This finding may suggest possible downstream pathways for further research.

      (3) How groups of pup-care and infanticidal animals were created since there was no obvious pretest mentioned so perhaps there was the testing of a large number of animals until getting enough subjects in each group.  

      Before the experiments, we exposed the animals to pups, and subjects may exhibit pup care, infanticide, or neglect; we grouped subjects according to their behavioral responses to pups, and individuals who neglected pups were excluded.

      (4) The apparent use of a 20-minute baseline data collection period for photometry that started right after the animals were stressed from handling and placement in the novel testing chamber.

      In fiber photometric experiments, all experimental animals were required to acclimatize to the environment for at least 20 minutes prior to the experiment as described in the Methods section. The time 0 in Fig. 4 represents the point in time when a behavior or a segment of behavior started and is not the actual time 0 at which the test was started.

      (5) A weakness in the results reporting is that it's unclear what statistics are reported (2 x 2 ANOVA main effect of interaction results, t-test results) and that the degrees of freedom expected for the 2 X 2 ANOVAs in some cases don't appear to match the numbers of subjects shown in the graphs; including sample sizes in each group would be helpful because the graph panels are very small and data points overlap.

      Thank you for your suggestion. We now state the statistical analysis methods and the sample sizes for each group of experiments in the figure legends.

      The additional context that could help readers of this study is that the authors overlook some important mPFC and pup caregiving and infanticide studies in the introduction which would help put this work in better context in terms of what is known about the mPFC and these behaviors. These previous studies include Febo et al., 2010; Febo 2012; Peirera and Morrell, 2011 and 2020; and a very relevant study by Alsina-Llanes and Olazábal, 2021 on mPFC lesions and infanticide in virgin male and female mice. The introduction states that nothing is known about the mPFC and infanticide. In the introduction and discussion, stating the species and sex of the animals tested in all the previous studies mentioned would be useful. The authors also discuss PVN OT cell stimulation findings seen in other rodents, so the work seems less conceptually novel. Overall, the findings add to the knowledge about OT regulation of pup-directed behavior in male and female rodents, especially the PVN-mPFC OT projection.

      We very much appreciate the valuable references you provided, and we have cited them in the Introduction and Discussion. We agree with the reviewer that our statement that nothing is known about the mPFC and infanticide was incorrect. The statement should be that it remains unclear whether mPFC OT projections are involved in paternal care and infanticide. A study in mother rats indicated that inactivation or inhibition of neuronal activity in the mPFC largely reduced pup retrieval and grouping (Febo, Felix-Ortiz, & Johnson, 2010). A subsequent study on firing patterns in the mPFC of mother rats suggested that sensory-motor processing occurs in the mPFC and may affect decision-making about maternal care of pups (Febo, 2012). A study of new mother rats examining different regions of the mPFC (anterior cingulate (Cg1), PrL, IL) identified an involvement of the IL cortex in biased preference decision-making in favour of the offspring (Pereira & Morrell, 2020). A study on maternal motivation in rats suggests that in the early postpartum period, the IL and Cg1 subregions of the mPFC are the motivating circuits for pup-specific biases, while the PrL subregion is recruited and contributes to the expression of maternal behaviors in the late postpartum period (Pereira & Morrell, 2011).

      Reviewer #3 (Public Review):

      Summary:

      Here Li et al. examine pup-directed behavior in virgin Mandarin voles. Some males and females tend towards infanticide, others tend towards pup care. c-Fos staining showed more oxytocin cells activated in the paraventricular nucleus (PVN) of the hypothalamus in animals expressing pup care behaviors than in infanticidal animals. Optogenetic stimulation of PVN oxytocin neurons (with an oxytocin-specific virus to express the opsin transgene) increased pup-care, or in infanticidal voles increased latency towards approach and attack.

      Suppressing the activity of PVN oxytocin neurons promoted infanticide. The use of a recent oxytocin GRAB sensor (OT1.0) showed changes in medial prefrontal cortex (mPFC) signals as measured with photometry in both sexes. Activating mPFC oxytocin projections increased latency to approach and attack in infanticidal females and males (similar to the effects of peripheral oxytocin injections), whereas in pup-caring animals only males showed a decrease in approach. Inhibiting these projections increased infanticidal behaviors in both females and males and had no effect on pup caretaking.

      Strengths:

      Adopting these methods for Mandarin voles is an impressive accomplishment, especially the valuable data provided by the oxytocin GRAB sensor. This is a major achievement and helps promote systems neuroscience in voles.

      Weaknesses:

      The study would be strengthened by an initial figure summarizing the behavioral phenotypes of voles expressing pup care vs infanticide: the percentages and behavioral scores of individual male and female nulliparous animals for the behaviors examined here. Do the authors have data about the housing or life history/experiences of these animals? How bimodal and robust are these behavioral tendencies in the population?

      As in our response to Reviewer 2, animals generally exhibit three types of behavioral responses toward pups; data on the percentages of these behavioral types in the population will be included in another study from our lab. The reviewer's suggestion of scoring the behaviors is an inspiring idea that will help us parse these behaviors more fully. Mandarin voles were captured from the wild in Henan, China. The experimental subjects were F2 generation voles reared in the Experimental Animal Centre of Shaanxi Normal University. In our observations, pup care and infanticide behaviors were conserved across several pup exposures, especially pup care behaviors; for infanticide behaviors, we did not conduct further pup exposures in order to protect the pups.

      Optogenetics with the oxytocin promoter virus is a nice advance here. More details about their preparation and methods should be in the main text, and not simply relegated to the methods section. For optogenetic stimulation in Figure 2, how were the stimulation parameters chosen? There is a worry that oxytocin neurons can co-release other factors- are the authors sure that oxytocin is being released by optogenetic stimulation as opposed to other transmitters or peptides, and acting through the oxytocin receptor (as opposed to a vasopressin receptor)?

      As the reviewer suggested, more detailed information about virus construction and the choice of optogenetic stimulation parameters has been added to the revised manuscript. For details about the construction of the CHR2 and mCherry viruses used in optogenetic manipulation, we refer to a previous study that constructed an rAAV expressing Venus from a 2.6 kb region upstream of OT exon 1, which is conserved in mammalian species (Knobloch et al., 2012). For the eNpHR3.0 virus, expression of the vector is driven by the mouse OXT promoter, a 1 kb promoter upstream of exon 1 of the OXT gene, which has been shown to induce cell type-specific expression in OXT cells (Peñagarikano et al., 2015). Details about the construction of the OT1.0 sensor can be found in the work of Professor Li's group (Qian et al., 2023). The maps of the viral vectors and the OT1.0 sensor are shown below.

      The optogenetic stimulation parameters were based on a previous study (He et al., 2021). However, our description of the parameters was not sufficiently detailed, so further information about the optogenetic stimulation parameters has been added to the Methods. In the pup-directed pup care behavioral test, light stimulation lasted for 11 min. The parameters used for optogenetic manipulation of PVN OT neurons were ~3 mW, 20 Hz, 20 ms, 8 s ON and 2 s OFF, and the parameters used for optogenetic manipulation of PVN OT neurons projecting to the mPFC were ~10 mW, 20 Hz, 20 ms, 8 s ON and 2 s OFF, to cover the entire interaction. We performed fiber photometry experiments to determine the role that OT plays in behavior, and these results and the optogenetic results support each other. In addition, we further confirmed the effect of optogenetic manipulation on OT release by combining optogenetic inhibition with OT1.0 sensors (Supplementary Data Fig. 2). It has been previously shown that OT is able to act specifically on OTR in the mPFC-PL (Sabihi et al., 2017). Our study focuses on oxytocin neurons and oxytocin release, and more research is needed to construct a more complex and complete network regarding the involvement of the OTR and other factors in the mPFC in these behaviors.
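
      As a rough illustration of what these stimulation settings imply, the sketch below works out the number of pulses and the duty cycle for the stated 20 Hz, 20 ms, 8 s ON / 2 s OFF pattern over an 11 min session. It assumes that the pattern repeats without interruption for the whole session; the function name `stimulation_summary` and its output fields are hypothetical and are not part of the study's software.

```python
def stimulation_summary(freq_hz, pulse_ms, on_s, off_s, total_s):
    """Summarize a pulsed ON/OFF light protocol (illustrative helper)."""
    cycle_s = on_s + off_s
    n_cycles = total_s // cycle_s              # complete ON/OFF cycles
    pulses_per_on = freq_hz * on_s             # pulses in one ON epoch
    duty_within_on = freq_hz * pulse_ms / 1000.0
    overall_duty = duty_within_on * on_s / cycle_s
    return {
        "cycles": n_cycles,
        "pulses_per_on_epoch": pulses_per_on,
        "total_pulses": n_cycles * pulses_per_on,
        "duty_within_on": duty_within_on,
        "overall_duty": overall_duty,
    }

# Parameters as stated in the response; an uninterrupted 11 min session is assumed.
summary = stimulation_summary(freq_hz=20, pulse_ms=20, on_s=8, off_s=2, total_s=11 * 60)
print(summary["cycles"])         # 66
print(summary["total_pulses"])   # 10560
```

      Under these assumptions the light is on for roughly 32% of the session, so the time-averaged power would be about a third of the nominal ~3 mW or ~10 mW if those values refer to power during a pulse, which the response does not specify.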

      Author response image 1.

      Author response image 2.

       

      Given that they are studying changes in latency to approach/attack, having some controls for motion when oxytocin neurons are activated or suppressed might be nice. Oxytocin is reported to be an anxiolytic and a sedative at high levels.

      As in our response to Reviewer 1, to exclude effects of locomotor activity on pup-directed behaviors, we also examined the effect of optogenetic manipulation on the locomotor activity of the experimental animals and found that it did not change levels of locomotor activity (Supplementary Data Fig. 6).

      The OT1.0 sensor is also amazing, these data are quite remarkable. However, photometry is known to be susceptive to motion artifacts and I didn't see much in the methods about controls or correction for this. It's also surprising to see such dramatic, sudden, and large-scale suppression of oxytocin signaling in the mPFC in the infanticidal animals - does this mean there is a substantial tonic level of oxytocin release in the cortex under baseline conditions?

      The optical fiber recording system used in the present study can automatically exclude the effects of motion artifacts by simultaneously recording signals excited by a 405 nm light source. As shown in the formula below, the z-score data were calculated and presented, and the increase or decline of the OT signal is a trend relative to the baseline. For a smooth baseline, a decreasing signal is generally amplified by this calculation. In our experiments combining optogenetic inhibition and OT1.0 sensors, we found a certain level of OT release at baseline, which leaves room for a decrease in the signal recorded by the OT1.0 sensor.
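
      The formula itself is not reproduced in this text. As a rough illustration only, one commonly used way to combine a 405 nm reference channel with a baseline z-score can be sketched as follows. This is an assumption about the general approach, not the recording system's actual algorithm: the reference trace is scaled to the signal by a least-squares line, dF/F is taken relative to that fit, and the result is z-scored against a baseline window. The function names and the synthetic traces are hypothetical.

```python
from statistics import mean, stdev

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y as a function of x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def motion_corrected_zscore(signal, reference, n_baseline):
    """Scale the 405 nm reference to the signal, then z-score dF/F to baseline."""
    slope, intercept = fit_line(reference, signal)
    fitted = [slope * r + intercept for r in reference]
    dff = [(s - f) / f for s, f in zip(signal, fitted)]
    baseline = dff[:n_baseline]
    mu, sd = mean(baseline), stdev(baseline)
    return [(v - mu) / sd for v in dff]

# Synthetic traces for illustration only; not data from the study.
reference = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1, 1.0, 1.2]
# Small fluctuations in the first six samples, then a sustained drop of 0.3.
offset = [0.01, -0.01, 0.01, -0.01, 0.01, -0.01, -0.3, -0.3, -0.3, -0.3]
signal = [2.0 * r + 1.0 + e for r, e in zip(reference, offset)]
z = motion_corrected_zscore(signal, reference, n_baseline=6)
print([round(v, 1) for v in z])
```

      Because the z-score divides by the standard deviation of the baseline window, a smooth baseline (small standard deviation) turns even a modest drop in dF/F into a large negative z-score.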

      Figure 5 is difficult to parse as-is, and relates to an important consideration for this study: how extensive is the oxytocin neuron projection from PVN to mPFC?

      The AAV11-Ef1a-EGFP virus can infect fibers and travel retrogradely to the cell body, so it can be used to retrogradely trace neurons. We injected this virus (green, AAV11-Ef1a-EGFP) into the mPFC and observed virus-infected, OT-positive (red) neurons in the PVN (yellow). We also counted the OT neurons that project from the PVN to the mPFC and found that approximately 45.16% and 40.79% of cells projecting from the PVN to the mPFC were OT-positive, and approximately 18.48% and 18.89% of OT cells in the PVN projected to the mPFC, in females and males, respectively (Supplementary Data Fig. 4).

      In Figures 6 and 7, the authors use the phrase 'projection terminals'; however, to my knowledge, there have not been terminals (i.e., presynaptic formations opposed to a target postsynaptic site) observed in oxytocin neuron projections into target central regions.

      Following your suggestion, we have replaced ‘terminals’ with ‘fibers’ to describe these projections more accurately.

      Projection-based inhibition as in Figure 7 remains a controversial issue, as it is unclear if the opsin activation can be fast enough to reduce the fast axonal/terminal action potential. Do the authors have confirmation that this works, perhaps with the oxytocin GRAB OT sensor?

      Thank you for your suggestion. We measured OT release using OT1.0 sensors while the OT neuron projections in the mPFC were optogenetically inhibited. The results showed that optogenetic inhibition of OT neuron fibers in the mPFC significantly reduced OT release, which validates the projection-based inhibition method (Supplementary Data Fig. 5).

      As females and males had similar GRAB OT1.0 responses in mPFC, why would the behavioral effects of increasing activity be different between the sexes?

      In the present study, females released higher levels of OT into the mPFC (Figure 4 d, e) than males when the different behaviors occurred. In addition, females already approached and retrieved pups more rapidly than males before optogenetic activation, which may be why this manipulation had no effect in females.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Check for spelling and grammar errors throughout.

      We thank the reviewer for the suggestion; we have checked and revised the article.

      (2) Report effect sizes for all significant findings to allow evaluation of practical significance.

      As the reviewer suggested, we have added effect sizes (d, η² and odds ratio) to both the source data and the main text.
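
      For readers less familiar with these three measures, a minimal sketch of their textbook definitions follows. It assumes that d denotes Cohen's d with a pooled standard deviation and that η² is the ratio of effect to total sums of squares; the response does not state which variants were used, and the numbers below are hypothetical, not data from the study.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d for two independent groups, using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def eta_squared(ss_effect, ss_total):
    """Eta squared: share of the total sum of squares attributed to an effect."""
    return ss_effect / ss_total

def odds_ratio(a_yes, a_no, b_yes, b_no):
    """Odds ratio for a 2 x 2 table comparing group A with group B."""
    return (a_yes / a_no) / (b_yes / b_no)

# Hypothetical values for illustration only.
control_latency = [10.0, 12.0, 14.0]
stimulated_latency = [6.0, 8.0, 10.0]
d = cohens_d(control_latency, stimulated_latency)
eta2 = eta_squared(ss_effect=24.0, ss_total=40.0)
oratio = odds_ratio(a_yes=8, a_no=2, b_yes=3, b_no=7)
print(round(d, 2), round(eta2, 2), round(oratio, 2))   # 2.0 0.6 9.33
```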

      (3) Provide detailed information on the imaging parameters and analysis methods used in the Fos study.

      The imaging parameters and analysis details for the Fos study have also been added to the revised manuscript. Brain slices 40 µm thick were collected consecutively across 4 slides; each slide held 6 brain slices spaced 160 µm apart. The PVN area was determined based on the Allen Mouse Brain Atlas and our previous study, and Fos-positive, OT-positive and merged (double-labeled) neurons were counted.
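
      The 160 µm spacing follows from the collection scheme if consecutive 40 µm sections are distributed across the 4 slides in rotation. The sketch below makes that arithmetic explicit; the rotation order is an assumption about how the sections were distributed, and `slide_layout` is a hypothetical helper.

```python
SECTION_THICKNESS_UM = 40   # stated in the response
N_SLIDES = 4                # stated in the response
SECTIONS_PER_SLIDE = 6      # stated in the response

def slide_layout(thickness_um, n_slides, per_slide):
    """Map each slide to the depths (in µm) of its sections, assuming
    consecutive sections are placed on the slides in rotation."""
    layout = {slide: [] for slide in range(n_slides)}
    for i in range(n_slides * per_slide):
        layout[i % n_slides].append(i * thickness_um)
    return layout

layout = slide_layout(SECTION_THICKNESS_UM, N_SLIDES, SECTIONS_PER_SLIDE)
spacing = layout[0][1] - layout[0][0]
total_span = N_SLIDES * SECTIONS_PER_SLIDE * SECTION_THICKNESS_UM
print(spacing)      # 160
print(total_span)   # 960
```

      Under this reading, each slide samples the same 960 µm of tissue at 160 µm intervals, so counting a single slide covers the full sampled extent at one quarter of the section density.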

      (4) Compare the Fos colocalization results with previous studies examining PVN Fos and maternal/paternal care to contextualize the findings.

      Our results on Fos and OT colocalization are consistent with previous studies. In a previous study on virgin male prairie voles, OT and Fos colabeled neurons in the PVN increased after exposure to conspecific pups and experiencing paternal care (Kenkel et al., 2012). In another study of prairie voles, OT and c-fos colabeled neurons in the PVN significantly increased after the animals became parents, which may be due to the shift from virgin to parent (Kelly et al., 2017).

      (5) Discuss the limitations of the study, such as the potential effects of manipulating OT neurons on the release of other transmitters or the influence of other neurochemicals or brain regions on pup-directed behaviors, especially in females.

      We agree with the reviewer that several limitations should be discussed. Although we used a viral strategy to specifically activate or inhibit PVN OT neurons, other neurochemicals may also be released during optogenetic manipulation, because OT neurons may co-release other neurochemicals. In one of our previous studies, activation of OT neuron projections from the PVN to the VTA, as well as to the NAc, also altered pup-directed behaviors, which may be accompanied by dopamine release (He et al., 2021). In addition, backpropagation of action potentials during optogenetic manipulation may cause the same behavioral effect as direct stimulation of PVN OT cells. These effects on pup-directed behaviors should be investigated further in future studies.

      (6) Address the possibility of back-propagation of action potentials in the optogenetic manipulations causing the same behavioral effects as PVN OT cell stimulation.

      We agree with the reviewer that optogenetic manipulation may induce back-propagation of action potentials, which may result in the same behavioral effects as OT cell stimulation. We will pay attention to this issue in future studies.

      (7) Investigate whether changes in locomotor behavior could explain the observed effects on pup-directed behaviors.

      To exclude effects of locomotor activity on pup-directed behaviors, we also examined the effect of optogenetic manipulation on the locomotor activity of the experimental animals and found that it did not change levels of locomotor activity (Supplementary Data Fig. 6).

      (8) Report the percentage of PVN->mPFC neurons labeled that were OT-positive.

      The AAV11-Ef1a-EGFP virus can infect fibers and travel retrogradely to the cell body, so it can be used to retrogradely trace neurons. We injected this virus (green, AAV11-Ef1a-EGFP) into the mPFC and observed virus-infected, OT-positive (red) neurons in the PVN (yellow). We also counted the OT neurons that project from the PVN to the mPFC and found that approximately 45.16% and 40.79% of cells projecting from the PVN to the mPFC were OT-positive, and approximately 18.48% and 18.89% of OT cells in the PVN projected to the mPFC, in females and males, respectively (Supplementary Data Fig. 4).
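
      The two percentages reported for each sex share the same numerator (double-labeled cells) but use different denominators. A minimal sketch of that calculation follows; the function name and the cell counts are hypothetical and chosen only for illustration, since the underlying counts are not given in this response.

```python
def colocalization_percentages(n_egfp, n_ot, n_double):
    """Return (percent of retrogradely labeled cells that are OT-positive,
    percent of OT-positive cells that are retrogradely labeled)."""
    pct_projecting_that_are_ot = 100.0 * n_double / n_egfp
    pct_ot_that_project = 100.0 * n_double / n_ot
    return pct_projecting_that_are_ot, pct_ot_that_project

# Hypothetical counts for illustration only; not data from the study.
pct_of_projecting, pct_of_ot = colocalization_percentages(n_egfp=50, n_ot=120, n_double=20)
print(round(pct_of_projecting, 2), round(pct_of_ot, 2))   # 40.0 16.67
```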

      (9)  Directly compare the sexes in the behavioral analysis and discuss any potential sex differences.

      We agree with the reviewer's suggestion and have added comparisons between the two sexes, together with a discussion of the relevant results.

      (10) If available, report and discuss the OT expression levels and the number of OT neurons manipulated in each sex.

      In the present study, we counted the number of OT cells but did not measure the level of OT expression using WB or qPCR. In addition, the percentages of total OT-positive neurons infected by the CHR2(H134R) and eNpHR3.0 viruses are presented (Supplementary Data Fig. 7), but we do not know how many cells were actually manipulated during the optogenetic experiments.

      (11) Expand the discussion to include what could be regulating or interacting with the OT circuit under investigation, particularly in females where the effects were less pronounced.

      As the reviewers suggested, we have also added the relevant discussion. In addition to OT neurons, OTR neurons may also regulate behavioral responses to pups. In a study of virgin female mice, pup exposure was found to activate oxytocin and oxytocin receptor-expressing neurons (Okabe et al., 2017). Other brain regions, such as the preoptic area (POA), may also be involved in parental behaviors. For example, virgin female mice repeatedly exposed to pups showed shorter retrieval latencies and greater c-Fos expression in the POA; concentrations of OT in the POA were also significantly increased, and the facilitation of alloparental behavior by repeated exposure to pups occurred through the organization of the OT system (Okabe et al., 2017). A recent study suggests that OT from the PVN is involved in the care of pups by male voles (He et al., 2021). That study suggests that PVN to ventral tegmental area (VTA) OT projections, as well as VTA to nucleus accumbens (NAc) DA projections, are involved in pup care by male voles. Inhibition of OT projections from the PVN to the VTA reduces DA release in the NAc during licking and grooming of pups (He et al., 2021).

      Reviewer #2 (Recommendations For The Authors):

      A few additional things the authors may want to consider:

      (1) I don't understand the subject numbers in the peripheral OT study data shown in Figure 8. Panels p and q have 69 females shown and 50 males. Was there a second, much larger, IP injection study conducted that was different than the subjects shown in panels a-o that had ~5 subjects per treatment group per sex?

      We apologize for the confusion. More animals were used to test the effects of OT on infanticide behaviors in our pre-test. These data, combined with data from the formal pharmacological experiment, are presented in Fig. 8p, q. After OT treatment, changes in detailed, specific behaviors were recorded in only a subset of animals. We have clarified this in the revised manuscript.

      (2) The authors suggest higher baseline OT release in the female mPFC, which makes sense and helps explain some of their results. It seems that the data in Figure 1 show what is probably no sex difference in OT cell numbers in the PVN of Mandarin voles, which is unlike the old studies in mice or rats. If readers look at the data in Figure 1 showing what seems to be no sex difference in OT cell number, the authors' argument in the discussion about mPFC OT release levels higher in females would be inconsistent with their own data shown. The authors have the brain sections they need to help support or undermine this argument in the discussion, so maybe it would be useful to analyze the OT cell numbers across the PVN and report it in this paper or briefly mention it in the discussion.

      We compared the numbers of OT neurons, the numbers of activated OT neurons (OT and Fos double-labeled neurons), and the level of OT release between males and females. We found that females had more activated OT neurons (Figure 1d, g) and released higher levels of OT into the mPFC (Figure 4d, e) than males. This has been added to the Results and Discussion. The inconsistency in OT cell numbers with previous studies may be due to the cell-counting method, as we did not count all slides consecutively.

      (3) The discussion suggests visual cues are involved in mPFC OT release relevant for pup care or infanticide, but this is a very odd claim for nocturnal animals that live and nest with their pups in underground burrows.

      Sorry for the confusion. Here, we cited the finding in mice that activation of PVN OT neurons induced by visual stimulation promoted pup care in order to support our finding that the activity of PVN OT cells is involved in pup care, not to illustrate a role of visual stimulation in voles. We have clarified this in the revised manuscript.

      (4) The lack of decrease in mPFC OT release in the 2nd and 3rd approaches to pups is probably because the release was so high after the 1st approach that it didn't have time to drop before the subsequent approaches. The authors don't state how long those between-approach intervals were on average to help readers interpret this result.

      As described in our Methods, we left about 60 s between behavioral tests to allow the signal to return to baseline.

      (5) Do PVN-mPFC OT somata collateralize to other brain sites? Could mPFC terminal stimulation activate entire PVN cells and every site they project to? A caveat could be mentioned in the discussion if there's support for this from other optogenetic and PVN OT cell projection studies.

      We verified the OT projections from the PVN to the mPFC to validate the optogenetic manipulation of this pathway, but we did not investigate whether the OT neurons projecting from the PVN to the mPFC also send collaterals to other brain regions. It is suggested that mPFC terminal stimulation activates only the PVN OT cells projecting to the mPFC; whether other OT neurons were also activated remains unclear.

      (6) I don't see an ethics statement related to the experiments obviously having to involve pup injury or death. Nothing is said in methods about what happened after adult subjects attacked pups. I assumed the tests were quickly terminated and pups euthanized.

      If pups were attacked, we removed them immediately to avoid unnecessary injury, and injured pups were euthanized.

      (7) The authors could be more specific about what psychological diseases they refer to in the abstract and elsewhere that are relevant to this study. Depression? Rare cases of psychosis? Even within the already rare parental psychosis, infanticide is tragic but rare.

      Infanticide is caused by a variety of factors; among them, mental illness, especially depression and psychosis, is often a very high risk factor (Milia & Noonan, 2022; Naviaux, Janne, & Gourdin, 2020). In humans, the term infanticide has been used to refer to the killing, neglect, or abuse of newborn babies and older children (Jackson, 2006). We therefore believe that research on the neural mechanisms of infanticide can also contribute to the understanding and treatment of attacks on children, physical and verbal abuse, and the direct killing of babies.

      (8) Figure 8 - in one case the "*" is a chi-square result , correct?

      Thank you for the careful check. In Figure 8p, q, we applied the chi-square test and have added this to the legend.
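
      For readers unfamiliar with the test, a minimal sketch of a chi-square test on a 2x2 table of counts follows. The counts and the function name `chi_square_2x2` are hypothetical and are not the data behind Figure 8p, q; the sketch applies no continuity correction and assumes all expected counts are non-zero.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test for a 2x2 contingency table (df = 1).

    `table` is [[a, b], [c, d]], e.g. rows = treatment groups and
    columns = (showed infanticide, did not). No Yates correction is applied.
    """
    (a, b), (c, d) = table
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # For one degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for illustration only (not taken from the study).
hypothetical_counts = [[10, 20], [20, 10]]
chi2, p_value = chi_square_2x2(hypothetical_counts)
```

      With small expected counts, an exact test is usually preferred over this approximation.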

      Reviewer #3 (Recommendations For The Authors):

      The only other thing is a typo on line 135: the authors mean 'stimulation' instead of 'simulation'.

      Corrected.

      References

      Bravo-Rivera, C., Roman-Ortiz, C., Brignoni-Perez, E., Sotres-Bayon, F., & Quirk, G. J. (2014). Neural structures mediating expression and extinction of platform-mediated avoidance. J Neurosci, 34(29), 9736-9742. doi:10.1523/jneurosci.0191-14.2014

      Capuzzo, G., & Floresco, S. B. (2020). Prelimbic and Infralimbic Prefrontal Regulation of Active and Inhibitory Avoidance and Reward-Seeking. J Neurosci, 40(24), 4773-4787. doi:10.1523/jneurosci.0414-20.2020

      Febo, M. (2012). Firing patterns of maternal rat prelimbic neurons during spontaneous contact with pups. Brain Res Bull, 88(5), 534-542. doi:10.1016/j.brainresbull.2012.05.012

      Febo, M., Felix-Ortiz, A. C., & Johnson, T. R. (2010). Inactivation or inhibition of neuronal activity in the medial prefrontal cortex largely reduces pup retrieval and grouping in maternal rats. Brain Res, 1325, 77-88. doi:10.1016/j.brainres.2010.02.027

      He, Z., Young, L., Ma, X. M., Guo, Q., Wang, L., Yang, Y., . . . Tai, F. (2019). Increased anxiety and decreased sociability induced by paternal deprivation involve the PVN-PrL OTergic pathway. Elife, 8. doi:10.7554/eLife.44026

      He, Z., Zhang, L., Hou, W., Zhang, X., Young, L. J., Li, L., . . . Tai, F. (2021). Paraventricular Nucleus Oxytocin Subsystems Promote Active Paternal Behaviors in Mandarin Voles. J Neurosci, 41(31), 6699-6713. doi:10.1523/jneurosci.2864-20.2021

      Jackson, M. (2006). Infanticide. The Lancet, 367(9513), 809. doi:10.1016/S0140-6736(06)68323-2

      Kelly, A. M., Hiura, L. C., Saunders, A. G., & Ophir, A. G. (2017). Oxytocin Neurons Exhibit Extensive Functional Plasticity Due To Offspring Age in Mothers and Fathers. Integr Comp Biol, 57(3), 603-618. doi:10.1093/icb/icx036

      Kenkel, W. M., Paredes, J., Yee, J. R., Pournajafi-Nazarloo, H., Bales, K. L., & Carter, C. S. (2012). Neuroendocrine and behavioural responses to exposure to an infant in male prairie voles. J Neuroendocrinol, 24(6), 874-886. doi:10.1111/j.1365-2826.2012.02301.x

      Knobloch, H. S., Charlet, A., Hoffmann, L. C., Eliava, M., Khrulev, S., Cetin, A. H., . . . Grinevich, V. (2012). Evoked axonal oxytocin release in the central amygdala attenuates fear response. Neuron, 73(3), 553-566. doi:10.1016/j.neuron.2011.11.030

      Liu, Y., Li, A., Bair-Marshall, C., Xu, H., Jee, H. J., Zhu, E., . . . Wang, J. (2023). Oxytocin promotes prefrontal population activity via the PVN-PFC pathway to regulate pain. Neuron, 111(11), 1795-1811.e1797. doi:10.1016/j.neuron.2023.03.014

      Mei, L., Yan, R., Yin, L., Sullivan, R. M., & Lin, D. (2023). Antagonistic circuits mediating infanticide and maternal care in female mice. Nature, 618(7967), 1006-1016. doi:10.1038/s41586-023-06147-9

      Milia, G., & Noonan, M. (2022). Experiences and perspectives of women who have committed neonaticide, infanticide and filicide: A systematic review and qualitative evidence synthesis. J Psychiatr Ment Health Nurs, 29(6), 813-828. doi:10.1111/jpm.12828

      Moscarello, J. M., & LeDoux, J. E. (2013). Active avoidance learning requires prefrontal suppression of amygdala-mediated defensive reactions. J Neurosci, 33(9), 3815-3823. doi:10.1523/jneurosci.2596-12.2013

      Murugan, M., Jang, H. J., Park, M., Miller, E. M., Cox, J., Taliaferro, J. P., . . . Witten, I. B. (2017). Combined Social and Spatial Coding in a Descending Projection from the Prefrontal Cortex. Cell, 171(7), 1663-1677.e1616. doi:10.1016/j.cell.2017.11.002

      Naviaux, A. F., Janne, P., & Gourdin, M. (2020). Psychiatric Considerations on Infanticide: Throwing the Baby out with the Bathwater. Psychiatr Danub, 32(Suppl 1), 24-28. 

      Okabe, S., Tsuneoka, Y., Takahashi, A., Ooyama, R., Watarai, A., Maeda, S., . . . Kikusui, T. (2017). Pup exposure facilitates retrieving behavior via the oxytocin neural system in female mice. Psychoneuroendocrinology, 79, 20-30. doi:10.1016/j.psyneuen.2017.01.036

      Peñagarikano, O., Lázaro, M. T., Lu, X. H., Gordon, A., Dong, H., Lam, H. A., . . . Geschwind, D. H. (2015). Exogenous and evoked oxytocin restores social behavior in the Cntnap2 mouse model of autism. Sci Transl Med, 7(271), 271ra278. doi:10.1126/scitranslmed.3010257

      Pereira, M., & Morrell, J. I. (2011). Functional mapping of the neural circuitry of rat maternal motivation: effects of site-specific transient neural inactivation. J Neuroendocrinol, 23(11), 1020-1035. doi:10.1111/j.1365-2826.2011.02200.x

      Pereira, M., & Morrell, J. I. (2020). Infralimbic Cortex Biases Preference Decision Making for Offspring over Competing Cocaine-Associated Stimuli in New Mother Rats. eNeuro, 7(4). doi:10.1523/eneuro.0460-19.2020

      Qian, T., Wang, H., Wang, P., Geng, L., Mei, L., Osakada, T., . . . Li, Y. (2023). A genetically encoded sensor measures temporal oxytocin release from different neuronal compartments. Nat Biotechnol, 41(7), 944-957. doi:10.1038/s41587-022-01561-2

      Sabihi, S., Dong, S. M., Maurer, S. D., Post, C., & Leuner, B. (2017). Oxytocin in the medial prefrontal cortex attenuates anxiety: Anatomical and receptor specificity and mechanism of action. Neuropharmacology, 125, 1-12. doi:10.1016/j.neuropharm.2017.06.024

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):  

      Summary:

      In this manuscript, Shao et al. investigate the contribution of different cortical areas to working memory maintenance and control processes, an important topic involving different ideas about how the human brain represents and uses information when it is no longer available to sensory systems. In two fMRI experiments, they demonstrate that the human frontal cortex (area sPCS) represents stimulus (orientation) information both during typical maintenance, but even more so when a categorical response demand is present. That is, when participants have to apply an added level of decision control to the WM stimulus, sPCS areas encode stimulus information more than conditions without this added demand. These effects are then expanded upon using multi-area neural network models, recapitulating the empirical gradient of memory vs control effects from visual to parietal and frontal cortices. In general, the experiments and analyses provide solid support for the authors' conclusions, and control experiments and analyses are provided to help interpret and isolate the frontal cortex effect of interest. However, I suggest some alternative explanations and important additional analyses that would help ensure an even stronger level of support for these results and interpretations.

      Strengths:

      -  The authors use an interesting and clever task design across two fMRI experiments that is able to parse out contributions of WM maintenance alone along with categorical, rule-based decisions. Importantly, the second experiment only uses one fixed rule, providing both an internal replication of Experiment 1's effects and extending them to a different situation when rule-switching effects are not involved across mini-blocks.

      - The reported analyses using both inverted encoding models (IEM) and decoders (SVM) demonstrate the stimulus reconstruction effects across different methods, which may be sensitive to different aspects of the relationship between patterns of brain activity and the experimental stimuli.

      - Linking the multivariate activity patterns to memory behavior is critical in thinking about the potential differential roles of cortical areas in sub-serving successful working memory. Figure 3 nicely shows a similar interaction to that of Figure 2 in the role of sPCS in the categorization vs. maintenance tasks.

      - The cross-decoding analysis in Figure 4 is a clever and interesting way to parse out how stimulus and rule/category information may be intertwined, which would have been one of the foremost potential questions or analyses requested by careful readers. However, I think additional text in the Methods and Results to lay out the exact logic of this abstract category metric will help readers better interpret the potential importance of this analysis and result.

      We thank the reviewer for the positive assessment of our manuscript. Please see lines 366-372, 885-894 in the revised manuscript for a detailed description of the abstract category index, and see below for a detailed point-by-point response.

      Weaknesses:

      - Selection and presentation of regions of interest: I appreciate the authors' care in separating the sPCS region as "frontal cortex", which is not necessarily part of the prefrontal cortex, on which many ideas of working memory maintenance activity are based. However, to help myself and readers interpret these findings, at a minimum the boundaries of each ROI should be provided as part of the main text or extended data figures. Relatedly, the authors use a probabilistic visual atlas to define ROIs in the visual, parietal, and frontal cortices. But other regions of both lateral frontal and parietal cortices show retinotopic responses (Mackey and Curtis, eLife, 2017: https://elifesciences.org/articles/22974) and are perhaps worth considering. Do the inferior PCS regions or inferior frontal sulcus show a similar pattern of effects across tasks? And what about the middle frontal gyrus areas of the prefrontal cortex, which are most analogous to the findings in NHP studies that the authors mention in their discussion, but do not show retinotopic responses? Reporting the effects (or lack thereof) in other areas of the frontal cortex will be critical for readers to interpret the role of the frontal cortex in guiding WM behavior and supporting the strongly worded conclusions of broad frontal cortex functioning in the paper. For example, to what extent can sPCS results be explained by visual retinotopic responses? (Mackey and Curtis, eLife, 2017: https://elifesciences.org/articles/22974).

      We thank the reviewer for the suggestions. We have added Supplemental Figure 1 to better illustrate the anatomical locations of the ROIs.

      Following the reviewer’s suggestion, we defined three additional subregions in the frontal cortex based on the HCP atlas [1], including the inferior precentral sulcus (iPCS, generated by merging 6v, 6r, and PEF), inferior frontal sulcus (IFS, generated by merging IFJp, IFJa, IFSp, IFSa, and p47r), and middle frontal gyrus (MFG, generated by merging 9-46d, 46, a9-46v, and p9-46v). We then performed the same analyses as in the main text using both mixed-model and within-condition IEMs. Overall, we found that none of the ROIs demonstrated significant orientation representation in Experiment 1, for either IEM analysis (Author response image 1A and 1C). In Experiment 2, however, the IFS and MFG (but not iPCS) demonstrated a similar pattern to sPCS for orientation representation, though these results did not persist in the within-condition IEM with lower SNR (Author response image 1B and 1D). Moreover, when we performed the abstract category decoding analysis in the three ROIs, only the MFG in Experiment 2 showed significant abstract category decoding results, with no significant difference between experiments (Author response image 1E). To summarize, the orientation and category results observed in sPCS in the original manuscript were largely absent in other frontal regions. There was some indication that the MFG might share some results for orientation representation and category decoding, although this pattern was weaker and was only observed in some analyses in Experiment 2. Therefore, although we did not perform retinotopic mapping and cannot obtain a direct measure of retinotopic responses in the frontal cortex, these results suggest that our findings are unlikely to be explained by visual retinotopic responses: the iPCS, which is another retinotopic region, did not show the observed pattern in any of the analyses. 
      Notably, the iPCS results are consistent with our previous work demonstrating that orientation information cannot be decoded from iPCS during the working memory delay [2]. We have included these results on lines 395-403, 563-572 in the revised manuscript to provide a more comprehensive understanding of the current findings.
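
      The parcel merging described above can be summarized as a simple lookup. The sketch below is illustrative only: the parcel names are those listed in the response, while `ROI_PARCELS`, `merge_parcels`, and the toy label list are hypothetical names and data, and loading a real atlas file is not shown.

```python
# ROI definitions as listed in the response (HCP atlas parcel names).
ROI_PARCELS = {
    "iPCS": ["6v", "6r", "PEF"],
    "IFS": ["IFJp", "IFJa", "IFSp", "IFSa", "p47r"],
    "MFG": ["9-46d", "46", "a9-46v", "p9-46v"],
}


def merge_parcels(vertex_labels, parcels):
    """Return a boolean mask marking vertices that fall in any listed parcel.

    `vertex_labels` is assumed to hold one atlas parcel name per vertex.
    """
    wanted = set(parcels)
    return [label in wanted for label in vertex_labels]


# Toy per-vertex labels for illustration only (not real atlas data).
toy_labels = ["6v", "46", "V1", "PEF", "IFJa", "6r"]
ipcs_mask = merge_parcels(toy_labels, ROI_PARCELS["iPCS"])
mfg_mask = merge_parcels(toy_labels, ROI_PARCELS["MFG"])
```

      Each merged mask would then be passed to the same IEM and decoding analyses used for the ROIs in the main text.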

      Author response image 1.

      Orientation reconstruction and abstract category decoding results in iPCS, IFS, and MFG.

      - When looking at the time course of effects in Figure 2, for example, the sPCS maintenance vs categorization effects occur very late into the WM delay period. More information is needed to help separate this potential effect from that of the response period and potential premotor/motor-related influences. For example, are the timecourses shifted to account for hemodynamic lag, and if so, by how much? Do the sPCS effects blend into the response period? This is critical, too, for a task that does not use a jittered delay period, and potential response timing and planning can be conducted by participants near the end of the WM delay. For example, the authors say that "significant stimulus representation in EVC even when memoranda had been transformed into a motor format (24)". But, I *think* this paper shows the exact opposite interpretation - EVC stimulus information is only detectable when a motor response *cannot* be planned (https://elifesciences.org/articles/75688). Regardless, parsing out the timing and relationship to response planning is important, and an ROI for M1 or premotor cortex could also help as a control comparison point, as in reference (24).

      We thank the reviewer for raising this point. We agree that examining the contribution of response-related activity in our study is crucial, as we detail below:

      First, the time course results in the manuscript are presented without time shifting. The difference in orientation representation in Figure 2 emerged at around 7 s after task cue onset and 1 s before probe onset. Assuming a hemodynamic response lag of 4-6 s, the underlying neural difference should have occurred around 1-3 s after task cue onset and 5-7 s prior to probe onset. This suggests that a substantial portion of the effect likely occurred during the delay rather than the response period.
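
      The lag arithmetic in this paragraph can be checked with a few lines. This is a sketch under the numbers stated above; the probe onset of 8 s after the task cue is inferred from "7 s after task cue onset and 1 s before probe onset", and the function name `neural_window` is hypothetical.

```python
def neural_window(observed_s, lag_range_s):
    """Shift an observed BOLD effect time back by a range of hemodynamic lags.

    Returns (earliest, latest) plausible onset of the underlying neural
    effect, in the same time base as `observed_s`.
    """
    shortest_lag, longest_lag = lag_range_s
    return observed_s - longest_lag, observed_s - shortest_lag


observed_after_cue_s = 7.0   # effect emerges about 7 s after task cue onset
probe_after_cue_s = 8.0      # inferred: the effect is about 1 s before the probe
lag_range_s = (4.0, 6.0)     # hemodynamic lag assumed in the response

earliest, latest = neural_window(observed_after_cue_s, lag_range_s)

# Express the same window relative to probe onset (seconds before the probe).
before_probe = (probe_after_cue_s - latest, probe_after_cue_s - earliest)
```

      Under these assumptions the neural effect falls 1-3 s after the cue, that is, 5-7 s before the probe, matching the figures quoted in the response.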

      Second, our experimental design makes it unlikely that response planning would have influenced our results, as participants were unable to plan their motor responses in advance due to randomized response mapping at the probe stage on a trial-by-trial basis. Moreover, even if response planning had impacted the results in sPCS, it would have affected both conditions similarly, which again, would not explain the observed differences between conditions.

      Third, following the reviewer’s suggestion, we defined an additional ROI (the primary motor cortex, M1) using the HCP atlas and repeated the IEM analysis. No significant orientation representation was observed in either condition in M1, even during the response period (Figure S3), further suggesting that our results are unlikely to be explained by motor responses or motor planning.

      Based on the evidence above, we believe motor responses or planning are unlikely to account for our current findings. We have included these results on lines 264-267 to further clarify this issue.

      Lastly, upon re-reading the Henderson et al. paper [3], we confirmed that stimulus information was still decodable in EVC when a motor response could be planned (Figure 2 of Henderson et al.). In fact, the authors also discussed this result in paragraph 5 of their discussion. This finding, together with our results in EVC, indicates that EVC maintains stimulus information in working memory even when the information is no longer task-relevant, the functional relevance of which warrants further investigation in future research.

      - Interpreting effect sizes of IEM and decoding analysis in different ROIs. Here, the authors are interested in the interaction effects across maintenance and categorization tasks (bar plots in Figure 2), but the effect sizes in even the categorization task (y-axes) are always larger in EVC and IPS than in the sPCS region... To what extent do the authors think this representational fidelity result can or cannot be compared across regions? For example, a reader may wonder how much the sPCS representation matters for the task, perhaps, if memory access is always there in EVC and IPS? Or perhaps late sPCS representations are borrowing/accessing these earlier representations? Giving the reader some more intuition for the effect sizes of representational fidelity will be important. Even in Figure 3 for the behavior, all effects are also seen in IPS as well. More detail or context at minimum is needed about the representational fidelity metric, which is cited in ref (35) but not given in detail. These considerations are important given the claims of the frontal cortex serving such an important role for flexible control, here.

      We thank the reviewer for raising this point. We agree that the effect sizes are always larger in EVC and IPS. This is because the specific decoding method we adopted, IEM, is based on the concept of population-level feature-selective responses, so decoding results are most robust in regions with strong feature-tuning responses, such as EVC and parts of IPS. Therefore, to minimize the impact of effect size on our results, we avoided direct comparisons of representational strength across ROIs and focused instead on differences in representational strength between conditions within the same ROI. With this approach, we found that EVC and IPS showed high representational fidelity throughout the trial, but only in sPCS did we observe significantly higher fidelity in the categorization condition, where orientation was not itself a behavioral goal but was manipulated in working memory to achieve the goal. Moreover, although representational fidelity was highest in EVC, its behavioral predictability decreased during the delay period, unlike in sPCS. These results suggest that the magnitude of fidelity alone is not the determining factor for the observed categorization vs. maintenance effect or for behavioral performance. We have included further discussion on this issue on lines 208-211 of the revised manuscript.

      The reviewer also raised a good point that IPS showed similar behavioral correlation results as sPCS. In the original manuscript, we discussed the functional similarities and distinctions between IPS and sPCS in the discussion. We have expanded on this point on lines 610-627 in the revised manuscript:

      “While many previous WM studies have focused on the functional distinction between sensory and frontoparietal cortex, it has remained less clear how frontal and parietal cortex might differ in terms of WM functions. Some studies have reported stimulus representations with similar functionality in frontal and parietal cortex [4, 5], while others have observed differential patterns [6-8]. We interpret the differential patterns as reflecting a difference in the potential origin of the corresponding cognitive functions. For example, in our study, sPCS demonstrated the most prominent effect for enhanced stimulus representation during categorization as well as the tradeoff between stimulus difference and category representation, suggesting that sPCS might serve as the source region for such effects. On the other hand, IPS did show visually similar patterns to sPCS in some analyses. For instance, stimulus representation in IPS was visually but not statistically higher in the categorization task. Additionally, stimulus representation in IPS also predicted behavioral performance in the categorization task. These results together support the view that our findings in sPCS do not occur in isolation, but rather reflect a dynamic reconfiguration of functional gradients along the cortical hierarchy from early visual to parietal and then to frontal cortex.”

      Lastly, following the reviewer’s suggestion, we have included more details on the representational fidelity metric on lines 201-206, 856-863 in the revised manuscript for clarity.
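
      For intuition, one common formulation of representational fidelity in the IEM literature projects the reconstructed channel response profile onto the remembered orientation. The sketch below assumes that formulation; it is not taken from the manuscript, whose own definition is on the lines cited above, and the function name and toy inputs are hypothetical.

```python
import math


def representational_fidelity(reconstruction, channel_centers_deg, target_deg):
    """Project a reconstructed channel profile onto the target orientation.

    Orientation is periodic over 180 degrees, so angular offsets are doubled
    before taking the cosine. A profile peaked at the target gives a positive
    value; a flat profile over evenly spaced channels gives zero.
    """
    total = 0.0
    for response, center in zip(reconstruction, channel_centers_deg):
        offset_rad = math.radians(2.0 * (center - target_deg))
        total += response * math.cos(offset_rad)
    return total / len(reconstruction)


# Toy example: six evenly spaced channels and a remembered orientation of 60 deg.
centers = [0.0, 30.0, 60.0, 90.0, 120.0, 150.0]
target = 60.0
peaked = [math.cos(math.radians(2.0 * (c - target))) for c in centers]
flat = [1.0] * len(centers)

fidelity_peaked = representational_fidelity(peaked, centers, target)
fidelity_flat = representational_fidelity(flat, centers, target)
```

      Because the value scales with the amplitude of the reconstruction, it is naturally larger in regions with strong feature tuning, which is one reason to compare conditions within an ROI instead of comparing across ROIs.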

      Recommendations:

      Figure 3 layout - this result is very interesting and compelling, but I think could be presented to have the effect demonstrated more simply for readers. The scatter plots in the second and third rows take up a lot of space, and perhaps having a barplot as in Figure 2 showing the effects of brain-behavior correlations collapsed across the WM delay period timing would make the effect stand out more.

      We thank the reviewer for the suggestion. We have added a subplot (C) to Figure 3 to demonstrate the brain-behavior correlation collapsed across the late task epoch.

      When discussing the link between sPCS representations and behavior, I think this paper should likely be cited (https://www.jneurosci.org/content/24/16/3944), which shows univariate relationships between sPCS delay activity and memory-guided saccade performance.

      We thank the reviewer for the suggestion and have included this citation on lines 278-279 in the revised manuscript.

      Interpretation of "control" versus categorization - the authors interpret that "It would be of interest to further investigate whether this active control in the frontal cortex could be generalized to tasks that require other types of WM control such as mental rotation." I think more discussion on the relationship between categorization and "control" is needed, especially given the claim of "flexible control" throughout. Is stimulus categorization a form of cognitive control, and if so, how?  

      We thank the reviewer for raising this point. Cognitive control is generally defined as the process by which behavior is flexibly adapted based on task context and goals, and most theories agree that this process occurs within working memory [9, 10]. Under this definition, we consider stimulus categorization to be a form of cognitive control, because participants needed to transform the stimulus in working memory according to the categorization rule for subsequent category judgements. With two categorization rules, the demand for flexibility in cognitive control increased, because participants needed to switch between the two rules multiple times throughout the experiment instead of staying with one fixed rule. We now clarify these two types of control on lines 112-116 in the introduction.

      However, we agree that the latter form of control could be more related to rule switching, which might not be specific to categorization per se. For instance, if participants perform rule switching in another type of WM task that requires WM control, such as mental rotation, it remains to be tested whether similar results would be observed and/or whether the same brain regions would be recruited. We have included further information on this issue on lines 572-575 in the revised manuscript.

      Reviewer #2 (Public Review):

      Summary:

      The authors provide evidence that helps resolve long-standing questions about the differential involvement of the frontal and posterior cortex in working memory. They show that whereas the early visual cortex shows stronger decoding of memory content in a memorization task vs a more complex categorization task, the frontal cortex shows stronger decoding during categorization tasks than memorization tasks. They find that task-optimized RNNs trained to reproduce the memorized orientations show some similarities in neural decoding to people. Together, this paper presents interesting evidence for differential responsibilities of brain areas in working memory.

      Strengths:

      This paper was strong overall. It had a well-designed task, best-practice decoding methods, and careful control analyses. The neural network modelling adds additional insight into the potential computational roles of different regions.

      We thank the reviewer for the positive assessment of our manuscript.

      Weaknesses:

      While the RNN model matches some of the properties of the task and decoding, its ability to reproduce the detailed findings of the paper was limited. Overall, the RNN model was not as well-motivated as the fMRI analyses.

      We are grateful for the reviewer’s suggestions on improving our RNN results. Please see below for a detailed point-by-point response.

      Recommendations:

      Overall, I thought that this paper was excellent. I have some conceptual concerns about the RNN model, and minor recommendations for visualization.

      (1) I think that the RNN modelling was certainly interesting and well-executed. However, it was not clear how much it contributed to the results. On the one hand, it wasn't clear why reproducing the stimulus was a critical objective of the task (ie could be more strongly motivated on biological grounds). On the other hand, the agreement between the model and the fMRI results is not that strong. The model does not reproduce stronger decoding in 'EVC' for maintenance vs categorization. Also, the pattern of abstract decoding is very different from the fMRI (eg the RNN has stronger categorical encoding in 'EVC' than 'PFC' and larger differences between fixed and flexible rules in earlier areas than is evident in the fMRI). Together, the RNN modelling comes across as a little ad hoc, without really nailing the performance.

      We thank the reviewer for prompting us to further elaborate on the rationale for our RNN analysis. In our fMRI results, we observed a tradeoff between maintaining stimulus information in more flexible tasks (Experiment 1) and maintaining abstract category information in less flexible tasks (Experiment 2). This led to the hypothesis that participants might have employed different coding strategies in the two experiments. Specifically, in flexible environments, stimulus information might be preserved in its original identity in the higher-order cortex, potentially reducing processing demands in each task and thereby facilitating efficiency and flexibility; whereas in less flexible tasks, participants might generate more abstract category representations based on task rules to facilitate learning. To directly test this idea, we examined whether explicitly requiring the RNN to preserve the stimulus representation, by including stimulus information as an output, would recapitulate our fMRI findings in frontal cortex, in comparison to a model with no such requirement. Meanwhile, we agree with the reviewer that there are alternative ways to implement this objective in the model. For instance, one could change the network encoding weights (lazy vs. rich regime) so that feedforward neural networks produce either high-dimensional stimulus representations or low-dimensional category representations [11]. However, we feel that exploring these alternatives may fall outside the scope of the current study.

      Regarding the alignment between the fMRI and RNN results: for the stimulus decoding results in EVC, we found that with an alternative decoding method (IEM), a similar maintenance > categorization pattern was observed in the EVC-equivalent module, suggesting that our RNN was capable of reproducing the EVC results, albeit more weakly (please see our response to the reviewer’s next point). For the category decoding results, we would like to clarify that the category decoding results in EVC were not necessarily better than those in sPCS. Although category decoding accuracy was numerically higher in EVC, it was more variable than in IPS and sPCS. To illustrate this point, we calculated the Bayes factor for the category decoding results of RNN2 in Figure 6C, and found that the evidence for category decoding, as well as for the decoding difference between RNNs, was strong in the IPS and sPCS modules, whereas the evidence in the EVC module was insufficient (Author response table 1).

      Author response table 1.

      Bayes factors for category decoding and decoding differences in Figure 6C lower panel.

      Nevertheless, we agree with the reviewer that all three modules demonstrated the category decoding difference between experiments, which differs from our fMRI results. This discrepancy may be partially due to differences in signal sensitivity. RNN signals typically have a higher SNR compared to fMRI signals, as fMRI aggregates signals from multiple neurons and single-neuron tuning effects can be reduced. We have acknowledged this point on lines 633-636 in the revised manuscript. Nonetheless, the current RNNs effectively captured our key fMRI findings, including increased stimulus representation in frontal cortex as well as the tradeoff in category representation with varying levels of flexible control. We believe the RNN results remain valuable in this regard.

      Honestly, I think the paper would have a very similar impact without the modelling results, but I appreciate that you put a lot of work into the modeling, and this is an interesting direction for future research. I have a few suggestions, but nothing that I feel too strongly about.

      - It might be informative to use IEM to better understand the RNN representations (and how similar they are to fMRI). For example, this could show whether any of the modules just encode categorical information. 

      - You could try providing the task and/or retro cue directly to the PFC units. This is a little unrealistic, but may encourage a stronger role for PFC.

      - You might adjust the ratio of feedforward/feedback connections, if you can find good anatomical guidance on what these should be.

      Obviously, I don't have much - it's a tricky problem!

      We thank the reviewer for the suggestions. To better align the fMRI and RNN results, we first performed the same IEM analyses used in the fMRI analyses on the RNN data. We found that with IEM, the orientation representation in the EVC module demonstrated a pattern similar to that in the fMRI data, showing a negative trend for the difference between categorization and maintenance, although the trend did not reach statistical significance (Author response image 2A). Meanwhile, the difference between categorization and maintenance remained a positive trend in the sPCS module.

      Second, following the reviewer’s suggestion, we adjusted the ratio of feedforward/feedback connections between modules to 1:2, such that between Modules 1 and 2 and between Modules 2 and 3, there were always more feedback than feedforward connections, consistent with recent theoretical proposals [12]. We found that this change preserved the positive trend for orientation differences in the sPCS module, but also made the orientation difference in the EVC and IPS modules more positive (Author response image 2B).
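      As a minimal sketch of the connectivity change described above, the following builds a boolean connection mask for a three-module network in which each pair of adjacent modules has twice as many feedback as feedforward connections. The function name `modular_mask`, the module size, the feedforward density `p_ff`, the fully connected within-module blocks, and the absence of connections between non-adjacent modules are all illustrative assumptions rather than details reported in the manuscript.

```python
import numpy as np


def modular_mask(n_per_module=20, n_modules=3, p_ff=0.1, fb_ratio=2.0, seed=0):
    """Boolean mask where mask[i, j] is True if unit j projects to unit i.

    All parameter values are hypothetical placeholders.
    """
    rng = np.random.default_rng(seed)
    n_units = n_per_module * n_modules
    mask = np.zeros((n_units, n_units), dtype=bool)
    blocks = [slice(m * n_per_module, (m + 1) * n_per_module)
              for m in range(n_modules)]

    # Assumption: units within a module are fully connected.
    for block in blocks:
        mask[block, block] = True

    block_size = n_per_module * n_per_module
    n_ff = int(round(p_ff * block_size))   # feedforward connections per module pair
    n_fb = int(round(fb_ratio * n_ff))     # feedback connections per module pair
    if n_fb > block_size:
        raise ValueError("feedback count exceeds the number of possible connections")

    def random_block(n_connections):
        # Place n_connections at random positions within one between-module block.
        flat = np.zeros(block_size, dtype=bool)
        flat[rng.choice(block_size, size=n_connections, replace=False)] = True
        return flat.reshape(n_per_module, n_per_module)

    # Only adjacent modules are connected (1 <-> 2 and 2 <-> 3).
    for lower, higher in zip(blocks[:-1], blocks[1:]):
        mask[higher, lower] = random_block(n_ff)  # feedforward: lower -> higher
        mask[lower, higher] = random_block(n_fb)  # feedback: higher -> lower
    return mask


mask = modular_mask()
n_ff = int(mask[20:40, 0:20].sum())  # Module 1 -> Module 2
n_fb = int(mask[0:20, 20:40].sum())  # Module 2 -> Module 1
```

      Such a mask would typically be multiplied element-wise with the recurrent weight matrix during training so that absent connections stay at zero; whether the ratio is applied to connection counts (as here) or to connection probabilities is a design choice this sketch does not settle.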

      To summarize, we found that the positive difference between categorization and maintenance in the sPCS module was robust across different RNNs and analytical approaches, further supporting that RNNs with stimulus outputs can replicate our key fMRI findings in the frontal cortex. By contrast, the negative difference between categorization and maintenance in EVC was much weaker. It was weakly present using some analytical methods (i.e., the IEM) but not others (i.e., SVMs), and increasing the feedback ratio of the entire network further weakened this difference. We believe this could be because the positive difference was mainly caused by top-down feedback modulations from higher cortex during categorization, such that increasing the feedback connections strengthens this pattern across modules. We speculate that enhancing the negative difference in the EVC module might require additional modules or inputs to strengthen fine-grained stimulus representation in EVC, a mechanism that might be of interest to future research. We have added a paragraph to the discussion on the limitations of the RNN results on lines 629-644.

      Author response image 2.

      Stimulus difference across RNN modules.  (A). Results using IEM (p-values from Module 1 to 3: 0.10, 0.48, 0.01). (B). Results using modified RNN2 with changed connection ratio (p-values from Module 1 to 3: 0.12, 0.22, 0.08). All p-values remain uncorrected.

      (2) Can you rule out that during the categorization task, the orientation encoding in PFC isn't just category coding? You had good controls for category coding, but it would be nice to see something for orientation coding. e.g., fit your orientation encoding model after residualizing category encoding, or show that category encoding has worse CV prediction than orientation encoding.

      We thank the reviewer for raising this point. To decouple orientation and category representations, we performed representational similarity analysis (RSA) in combination with linear mixed-effects modeling (LMEM) on the fMRI data. Specifically, we constructed three hypothesized representational dissimilarity matrices (RDMs), one for graded stimulus (increasing distance between orientations as they move farther apart, corresponding to graded feature tuning responses), one for abstract category (0 for all orientations within the same category and 1 for different categories), and another for discrete stimulus (indicating equidistant orientation representations). We then fit the three model RDMs together using LMEM with subject as the random effect (Author response image 3A). This approach is intended to minimize the influence of collinearity between RDMs on the results [13].
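      The three hypothesized RDMs described above can be sketched as follows. This is a minimal illustration, not the analysis code used for the fMRI data: the function names, the example orientation set, the category boundary, and the normalization of the graded model to a maximum of 1 are all assumptions made for the example. The mixed-effects fit itself is not shown; the upper-triangle vectors would serve as the predictors.

```python
import numpy as np


def model_rdms(orientations_deg, boundary_deg):
    """Return (graded, category, discrete) model RDMs for a set of orientations.

    boundary_deg is a hypothetical category boundary: orientations within
    90 degrees clockwise of it form one category, the rest the other.
    """
    theta = np.asarray(orientations_deg, dtype=float)

    # Graded stimulus model: circular distance in 180-degree orientation space,
    # scaled so that the largest distance equals 1.
    diff = np.abs(theta[:, None] - theta[None, :]) % 180.0
    graded = np.minimum(diff, 180.0 - diff)
    graded = graded / graded.max()

    # Abstract category model: 0 within a category, 1 between categories.
    labels = ((theta - boundary_deg) % 180.0) < 90.0
    category = (labels[:, None] != labels[None, :]).astype(float)

    # Discrete stimulus model: every pair of different orientations is equidistant.
    discrete = 1.0 - np.eye(theta.size)
    return graded, category, discrete


def upper_triangle(rdm):
    """Vectorize the off-diagonal upper triangle of a symmetric RDM."""
    rows, cols = np.triu_indices(rdm.shape[0], k=1)
    return rdm[rows, cols]


# Example orientation set and boundary (placeholders, not the study's stimuli).
graded, category, discrete = model_rdms([0, 30, 60, 90, 120, 150], boundary_deg=0)
predictors = np.column_stack(
    [upper_triangle(graded), upper_triangle(category), upper_triangle(discrete)]
)
```

      Because the three predictors are entered together, each coefficient reflects the contribution of one model RDM after accounting for the other two, which is the purpose of fitting them jointly.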

      Overall, the LMEM results (Author response image 3B-D) replicated the decoding results in the main text, with significant stimulus but not category representation in sPCS in Experiment 1, and marginally significant category representation in the same brain region in Experiment 2. These results further support the validity of our main findings and emphasize the contribution of stimulus representation independent of category representation.

      Author response image 3.

      Delineating stimulus and category effects using LMEM.  (A) Schematic illustration of this method. (B) Results for late epoch in Experiment 1, showing the fit of each model RDM. (C) Results for early epoch in Experiment 2. (D) Results for late epoch in Experiment 2.

      (3) Is it possible that this region of PFC is involved in categorization in particular and not 'control-demanding working memory'? 

      We thank the reviewer for raising this possibility. Cognitive control is generally defined as the process by which behavior is flexibly adapted based on task context and goals, and most theories agree that this process occurs within working memory [9, 10]. With this definition, we consider stimulus categorization to be a form of cognitive control, because participants need to adapt the stimulus based on the categorization rule in working memory for subsequent category judgements.  However, in the current study we only used one type of control-demanding working memory task (categorization) to test our hypothesis, and therefore it remains unclear whether the current results in sPCS can generalize to other types of WM control tasks.

      We have included a discussion on this issue on lines 572-575 in the revised manuscript.

      (4) Some of the figures could be refined to make them more clear:

      a.  Figure 4 b/c should have informative titles and y-axis labels.

      b.  Figure 5, the flexible vs fixed rule isn't used a ton up to this point - it would help to (also include? Replace?) with something like exp1/exp2 in the legend. It would also help to show the true & orthogonal rule encoding in these different regions (in C, or in a separate panel), especially to the extent that this is a proxy for stimulus encoding.

      c.  Figure 6: B and C are very hard to parse right now. (i) The y-axis on B could use a better label. (ii) It would be useful to include an inset of the relevant data panel from fMRI that you are reproducing. (iii) Why aren't there fixed rules for RNN1?

      We thank the reviewer for the suggestions and have updated the figures accordingly.

      Overall I think this is excellent - my feedback is mostly on interpretation and presentation. I think the work itself is really well done, congrats!

      References

      (1) Glasser, M.F., et al., A multi-modal parcellation of human cerebral cortex. Nature, 2016. 536(7615): p. 171-178.

      (2) Yu, Q. and Shim, W.M., Occipital, parietal, and frontal cortices selectively maintain task-relevant features of multi-feature objects in visual working memory. Neuroimage, 2017. 157: p. 97-107.

      (3) Henderson, M.M., Rademaker, R.L., and Serences, J.T., Flexible utilization of spatial- and motor-based codes for the storage of visuo-spatial information. Elife, 2022. 11.

      (4) Christophel, T.B., et al., Cortical specialization for attended versus unattended working memory. Nat Neurosci, 2018. 21(4): p. 494-496.

      (5) Yu, Q. and Shim, W.M., Temporal-Order-Based Attentional Priority Modulates Mnemonic Representations in Parietal and Frontal Cortices. Cereb Cortex, 2019. 29(7): p. 3182-3192.

      (6) Li, S., et al., Neural Representations in Visual and Parietal Cortex Differentiate between Imagined, Perceived, and Illusory Experiences. J Neurosci, 2023. 43(38): p. 6508-6524.

      (7) Hu, Y. and Yu, Q., Spatiotemporal dynamics of self-generated imagery reveal a reverse cortical hierarchy from cue-induced imagery. Cell Rep, 2023. 42(10): p. 113242.

      (8) Lee, S.H., Kravitz, D.J., and Baker, C.I., Goal-dependent dissociation of visual and prefrontal cortices during working memory. Nat Neurosci, 2013. 16(8): p. 997-9.

      (9) Miller, E.K. and Cohen, J.D., An integrative theory of prefrontal cortex function. Annu Rev Neurosci, 2001. 24: p. 167-202.

      (10) Badre, D., et al., The dimensionality of neural representations for control. Curr Opin Behav Sci, 2021. 38: p. 20-28.

      (11) Flesch, T., et al., Orthogonal representations for robust context-dependent task performance in brains and neural networks. Neuron, 2022. 110(7): p. 1258-1270 e11.

      (12) Wang, X.J., Theory of the Multiregional Neocortex: Large-Scale Neural Dynamics and Distributed Cognition. Annu Rev Neurosci, 2022. 45: p. 533-560.

      (13) Bellmund, J.L.S., et al., Mnemonic construction and representation of temporal structure in the hippocampal formation. Nat Commun, 2022. 13(1): p. 3395.

    1. Author response:

      The following is the authors’ response to the original reviews.

      The reviewers suggest a number of experiments and re-analyses to strengthen their claims and enhance the impact of the study. While a number of these are longer term, below is a summary of experiments and analyses recommended by the reviewers that can be accomplished in the shorter term:

      (1) Clarification of statistical approaches, quantification, data presentation and description of cerebellar anatomical nomenclature (e.g., detailed statistical methods for the GEO dataset analysis, FDR correction, quantification in Figs 2-4)

      The revised manuscript will provide detailed statistical methods, including FDR correction, for GEO dataset analyses and quantification. Please see specific responses to GEO dataset analyses below.

      (2) Improved quality of images for select immunostains and in situ hybridization

      The revised manuscript will address the quality of the images as indicated by the reviewers.

      (3) Include a control group of hGFAP-Cre mice with loxP sites but without Sufu deletion to assess the effects of Cre-induced double-strand breaks on phosphorylated H2AX-DSB signaling.

      The breeding scheme we used to generate homozygous SUFU conditional mutants will not generate pups carrying only hGFAP-Cre. Thus, we are unable to compare gH2AX expression in littermates that do not carry loxP sites. The reviewer is correct in pointing out the possibility that Cre recombinase activity induces double-strand breaks on its own. However, it is likely that any hGFAP-Cre-induced double-strand breaks are not sufficient to cause the phenotypes we observed in homozygous mutant (Sufu-cKO) mice, because the cerebella of mice carrying heterozygous SUFU mutations (hGFAP-Cre;Sufu-fl/+) do not display the profound cerebellar phenotypes observed in Sufu-cKO mice. We cannot rule out, however, undetected abnormalities that may require further analyses.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      SUFU modulates Sonic hedgehog (SHH) signaling and is frequently mutated in the B-subtype of SHH-driven medulloblastoma. The B-subtype occurs mostly in infants, is often metastatic, and lacks specific treatment. Yabut et al. found that Fgf5 was highly expressed in the B-subtype of SHH-driven medulloblastoma by examining a published microarray expression dataset. They then investigated how Fgf5 functions in the cerebellum of mice that have embryonic Sufu loss of function. This loss was induced using the hGFAP-cre transgene, which is expressed in multiple cell types in the developing cerebellum, including granule neuron precursors (GNPs) derived from the rhombic lip. By measuring the area of Pax6+ cells in the external granule cell layer (EGL) of Sufu-cKO mice at postnatal day 0, they find Pax6+ cells occupy a larger area in the posterior lobe adjacent to the secondary fissure, which is poorly defined. They show that Fgf5 RNA and phosphoErk1/2 immunostaining are also higher in the same disrupted region. Some of the phosphoErk1/2+ cells are proliferative in the Sufu-cKO. Western blot analysis of Gli proteins that modulate SHH signaling found reduced expression and absence of Gli1 activity in the region of cerebellar dysgenesis in Sufu-cKO mice. This suggests the GNP expansion in this region is independent of SHH signaling. Amazingly, intraventricular injection of the FGFR1-2 antagonist AZD4547 from P0-4, followed by histological examination at P7, restored cytoarchitecture in the cerebella of Sufu-cKO mice. This is further supported by NeuN immunostaining in the internal granule cell layer, which labels mature, non-dividing neurons, and KI67 immunostaining, which indicates dividing cells that are primarily found in the EGL. The mice were treated beginning at a timepoint when cerebellar cytoarchitecture was shown to be disrupted, and it is indistinguishable from control following treatment. Figure 3 presents the most convincing and exciting data in this manuscript.

      Sufu-cKO mice do not readily develop cerebellar tumors. The authors detected phosphorylated H2AX immunostaining, which labels double-strand breaks, in some cells in the EGL in regions of cerebellar dysgenesis in the Sufu-cKO, as well as cleaved Caspase 3, a marker of apoptosis. P53 protein, which acts downstream of the double-strand break pathway, was reduced in the Sufu-cKO cerebellum. Genetically removing p53 from the Sufu-cKO cerebellum resulted in cerebellar tumors in 2-month-old mice. The Sufu;p53-dKO cerebella at P0 lacked clear foliation and the secondary fissure, even more so than the Sufu-cKO. Fgf5 RNA and signaling (pERK1/2) were also expressed ectopically.

      The conclusions of the paper are largely supported by the data, but some data analyses need to be clarified and extended.

      (1) The rationale for examining Fgf5 in medulloblastoma is not sufficiently convincing. The authors previously reported that Fgf15 was upregulated in neocortical progenitors of mice with conditional loss of Sufu (PMID: 32737167). In Figure 1, the authors report FGF5 expression is higher in SHH-type medulloblastoma, especially the beta and gamma subtypes mostly found in infants. These data were derived from a genome-wide dataset and are shown without correction for multiple testing, including other Fgfs. Showing the expression of other Fgfs with FDR correction would better substantiate their choice or moving this figure to later in the manuscript as support for their mouse investigations would be more convincing.

      To assess FGF5 (ENSG00000138675) expression in MB tissues, we used GEO2R (Barrett et al., 2013) to analyze published human MB subtype expression arrays from accession no. GSE85217 (Cavalli et al., 2017). GEO2R is an interactive web tool that compares expression levels of genes of interest (GOI) between sample groups in the GEO series using original submitter-supplied processed data tables. We entered the GOI Ensembl ID and organized data sets according to age and MB subgroup or MB<sup>SHH</sup> subtype classifications. GEO2R results presented gene expression levels as a table ordered by FDR-adjusted (Benjamini & Hochberg) p-values, with significance level cut-off at 0.05, processed by GEO2R’s built-in limma statistical test. Resulting data were subsequently exported into Prism (GraphPad). We generated scatter plots presenting FGF5 expression levels across all MB subgroups (Figure 1A) and MB<sup>SHH</sup> subtypes (Figure 1D). We performed additional statistical analyses to compare FGF5 expression levels between MB subgroups and MB<sup>SHH</sup> subtypes and graphed these data as violin plots (Figure 1B, 1C, and 1E). For these analyses, we used one-way ANOVA with Holm-Sidak’s multiple comparisons test, single pooled variance. P value ≤0.05 was considered statistically significant. Graphs display the mean ± standard error of the mean (SEM).
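      For readers unfamiliar with the two multiplicity corrections named above, the arithmetic behind them can be sketched as follows. This is not the code run by GEO2R or Prism; the function names are hypothetical, the p-values passed in are assumed to have been computed already (for example from per-gene tests or pairwise comparisons), and the Holm-Sidak function implements the standard step-down Šidák adjustment.

```python
import numpy as np


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR-adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p-value by n / k.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


def holm_sidak(pvalues):
    """Step-down Sidak (Holm-Sidak) adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The k-th smallest p-value is corrected for the m - k + 1 remaining tests.
    exponents = m - np.arange(m)
    stepwise = 1.0 - (1.0 - p[order]) ** exponents
    # Enforce monotonicity, working up from the smallest p-value.
    adjusted_sorted = np.maximum.accumulate(stepwise)
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted
```

      The first controls the false discovery rate across many genes, whereas the second controls the family-wise error rate across a small set of planned comparisons, which is why the two appear at different stages of the analysis.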

      Author response image 1.

      Comparative expression of FGF ligands, FGF5, FGF10, FGF12, and FGF19, across all MB subgroups. FGF12 expression is not significantly different, while FGF5, FGF10, and FGF19 show distinct upregulation in the MB<sup>SHH</sup> subgroup (MB<sup>WNT</sup> n=70, MB<sup>SHH</sup> n=224, MB<sup>GR3</sup> n=143, MB<sup>GR4</sup> n=326).

      Expression of the 21 known FGF ligands was also analyzed. Many FGFs did not exhibit differential expression levels in MB<sup>SHH</sup> compared to other MB subgroups, such as with FGF12 in Figure 1. FGF5, FGF10, and FGF19 (the human orthologue of mouse FGF15) all showed specific upregulation in MB<sup>SHH</sup> compared to other MB subgroups (Author response image 1), supporting our previous observations that FGF15 is a downstream target of SHH signaling (Yabut et al., 2020), as the reviewer pointed out. However, further stratification of MB<sup>SHH</sup> patient data revealed that only FGF5 specifically showed upregulation in infants with MB<sup>SHH</sup> (MB<sup>SHHβ</sup> and MB<sup>SHHγ</sup>; Author response image 2), indicating a more prominent role for FGF5 in the developing cerebellum and as a driver of MB<sup>SHH</sup> tumorigenesis in this dynamic environment.

      Author response image 2.

      Comparative expression of FGF5, FGF10, and FGF19 in different MB<sup>SHH</sup> subtypes. FGF5 specifically shows relative mRNA levels above 6 in 81% of MB<sup>SHH</sup> infant patient tumors (n=80 MB<sup>SHHβ</sup> and MB<sup>SHHγ</sup> tumors), unlike 35% of MB<sup>SHHα</sup> (n=65) or 0% of MB<sup>SHHδ</sup> (n=75) tumors.

      (2) The Sufu-cKO cerebellum lacks a clear anchor point at the secondary fissure and foliation is disrupted in the central and posterior lobes. It would be helpful for the authors to review Sudarov & Joyner (PMID: 18053187) for nomenclature specific to the developing cerebellum.

      The reviewers are correct that the cerebellar foliation is severely disrupted in the central and posterior lobes, as per Sudarov and Joyner (Neural Development 2007). This nomenclature may be used to describe the regions referred to in this manuscript.

      (3) The metrics used to quantify cerebellar perimeter and immunostaining are not sufficiently described. It is unclear whether the individual points in the bar graph represent a single section from independent mice, or multiple sections from the same mice. For example, in Figures 2B-D. This also applies to Figure 3C-D.

      All quantifications were performed on 2-3 20 µm cerebellar sections from 3-6 independent mice per genotype. Individual points in the bar graphs represent the average cell number (quantified from 2-3 sections) from each mouse. Figure 2B shows data points from n=4 mice per genotype. Figure 2C shows data from n=3 mice per genotype. Figure 2D shows data from n=6 mice per genotype. Figure 3C-D show data from n=3 mice per genotype.
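      To make the unit of analysis explicit, the averaging step described above can be sketched as follows: counts from individual sections are first averaged within each mouse, so that every plotted point corresponds to one animal rather than one section. The function name, the record layout, and the numbers in the usage example are placeholders for illustration only and are not data from this study.

```python
from collections import defaultdict
from statistics import mean


def per_mouse_means(section_counts):
    """Average per-section counts within each mouse.

    section_counts: iterable of (genotype, mouse_id, count) tuples,
    one tuple per quantified section.
    Returns {genotype: {mouse_id: mean count across that mouse's sections}}.
    """
    counts_by_mouse = defaultdict(list)
    for genotype, mouse_id, count in section_counts:
        counts_by_mouse[(genotype, mouse_id)].append(count)

    means_by_genotype = defaultdict(dict)
    for (genotype, mouse_id), counts in counts_by_mouse.items():
        means_by_genotype[genotype][mouse_id] = mean(counts)
    return {genotype: dict(mice) for genotype, mice in means_by_genotype.items()}


# Placeholder values, not measurements from the manuscript.
example = [
    ("WT", "mouse_a", 10), ("WT", "mouse_a", 14),
    ("Sufu-cKO", "mouse_b", 30), ("Sufu-cKO", "mouse_b", 20), ("Sufu-cKO", "mouse_b", 25),
]
summary = per_mouse_means(example)
```

      Group statistics would then be computed on the per-mouse values, so that n refers to the number of mice rather than the number of sections.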

      (4) The data on Fgf5 RNA expression presented in Figure 2E are not sufficiently convincing. The perimeter and cytoarchitecture of the cerebellum are difficult to see and the higher magnification shown in 2F should be indicated in 2E.

      The lack of foliation in the Sufu-cKO cerebellum is clear, particularly when visualizing the perimeter via DAPI labeling (Figure 2E). The expression area of FGF5 is also visibly larger, given that all images in Figure 2E are presented at the same scale (scale bars = 500 µm).

      (5) The data presented in Figure 3 are not sufficiently convincing. The number of cells double positive for pErk and KI67 (Figure 3B) are difficult to see and appear to be few, suggesting the quantification may be unreliable.

      We used KI67+ expression to provide a molecular marker of regions to be quantified in both WT and Sufu-cKO sections. Quantification of labeled cells was performed on images obtained by confocal microscopy, enabling imaging of 1-2 µm optical slices, since Ki67 and pERK expression might not localize within the same cellular compartments. We relied on continuous DAPI nuclear staining to distinguish individual cells in each optical slice and to assess the colocalization of Ki67 and pERK. All quantifications were performed on 2-3 20 µm cerebellar sections from 3-6 independent mice per genotype. Individual points in the bar graphs represent the average cell number (quantified from 2-3 sections) from each mouse.

      (6) The data presented in Figure 4F-J would be more convincing with quantification. The Sufu;p53-dKO appears to have a thickened EGL across the entire vermis perimeter, and very little foliation, relative to control and single cKO cerebella. This is a more widespread effect than the more localized foliation disruption in the Sufu-cKO. 

      We agree with the reviewers that quantification of these phenotypes would provide a solid measure of the defects. The phenotypes of the Sufu;p53-dKO cerebellum are so profound that they require in-depth characterization, which will be the focus of future studies.

      (7) Figure 5 does not convincingly summarize the results. Blue and purple cells in sagittal cartoon are not defined. Which cells express Fgf5 (or other Fgfs) has not been determined. The yellow cells are not defined in relation to the initial cartoon on the left.

      The revised manuscript will address this confusion by clearly labeling the cells and their roles in the schematic diagram.

      Reviewer #2 (Public Review):

      Summary:

      Mutations in SUFU are implicated in SHH medulloblastoma (MB). SUFU modulates Shh signaling in a context-dependent manner, making its role in MB pathology complex and not fully understood. This study reports that elevated FGF5 levels are associated with a specific subtype of SHH MB, particularly in pediatric cases. The authors demonstrate that Sufu deletion in a mouse model leads to abnormal proliferation of granule cell precursors (GCPs) at the secondary fissure (region B), correlating with increased Fgf5 expression. Notably, pharmacological inhibition of FGFR restores normal cerebellar development in Sufu mutant mice.

      Strengths:

      The identification of increased FGF5 in subsets of MB is novel and a key strength of the paper.

      Weaknesses:

      The study appears incomplete despite the potential significance of these findings. The current paper does not fully establish the causal relationship between Fgf5 and abnormal cerebellar development, nor does it clarify its connection to SUFU-related MB. Some conclusions seem overstated, and the central question of whether FGFR inhibition can prevent tumor formation remains untested.

      Reviewer #3 (Public Review):

      Summary:

      The interaction between FGF signaling and SHH-mediated GNP expansion in MB, particularly in the context of Sufu LoF, has just begun to be understood. The manuscript by Yabut et al. establishes a connection between ectopic FGF5 expression and GNP over-expansion in a late-stage embryonic Sufu LoF model. The data provided link a region-specific interaction between aberrant FGF5 signaling and the SHH subtype of medulloblastoma. New data from Yabut et al. suggest that ectopic FGF5 expression correlates with GNP expansion near the secondary fissure in Sufu LoF cerebella. Furthermore, pharmacological blockade of FGF signaling inhibits GNP proliferation. Interestingly, the data indicate that the timing of conditional Sufu deletion (E13.5 using the hGFAP-Cre line) results in different outcomes compared to later deletion (using Math1-cre line, Jiwani et al., 2020). This study provides significant insights into the molecular mechanisms driving GNP expansion in SHH subgroup MB, particularly in the context of Sufu LoF. It highlights the potential of targeting FGF5 signaling as a therapeutic strategy. Additionally, the research offers a model for better understanding MB subtypes and developing targeted treatments.

      Strengths:

      One notable strength of this study is the extraction and analysis of ectopic FGF5 expression from a subset of MB patient tumor samples. This translational aspect of the study enhances its relevance to human disease. By correlating findings from mouse models with patient data, the authors strengthen the validity of their conclusions and highlight the potential clinical implications of targeting FGF5 in MB therapy.

      The data convincingly show that FGFR signaling activation drives GNP proliferation in Sufu conditional knockout models. This finding is supported by robust experimental evidence, including pharmacological blockade of FGF signaling, which effectively inhibits GNP proliferation. The clear demonstration of a functional link between FGFR signaling and GNP expansion underscores the potential of FGFR as a therapeutic target in SHH subgroup medulloblastoma.

      Previous studies have demonstrated the inhibitory effect of FGF2 on tumor cell proliferation in certain MB types, such as the ptc mutant (Fogarty et al., 2006)(Yaguchi et al., 2009). Findings in this manuscript provide additional support suggesting multiple roles for FGF signaling in cerebellar patterning and development.

      Weaknesses:

      In the GEO dataset analysis, where FGF5 expression is extracted, the reporting of the P-value lacks detail on the statistical methods used, such as whether an ANOVA or t-test was employed. Providing comprehensive statistical methodologies is crucial for assessing the rigor and reproducibility of the results. The absence of this information raises concerns about the robustness of the statistical analysis.

      The revised manuscript will include the following detailed explanation of the statistical analyses of the GEO dataset:

      For the analysis of expression values of FGF5 (ENSG00000138675), we obtained these values using GEO2R (Barrett et al., 2013), which directly analyzes published human MB subtype expression arrays from accession no. GSE85217 (Cavalli et al., 2017). GEO2R is an interactive web tool that compares expression levels of genes of interest (GOI) between sample groups in the GEO series using original submitter-supplied processed data tables. We simply entered the GOI Ensembl ID and organized data sets according to age and MB subgroup or MB<sup>SHH</sup> subtype classifications. GEO2R results presented gene expression levels as a table ordered by FDR-adjusted (Benjamini & Hochberg) p-values, with significance level cut-off at 0.05, processed by GEO2R’s built-in limma statistical test. Resulting data were subsequently exported into Prism (GraphPad). We generated scatter plots presenting FGF5 expression levels across all MB subgroups (Figure 1A) and MB<sup>SHH</sup> subtypes (Figure 1D). We performed additional statistical analyses to compare FGF5 expression levels between MB subgroups and MB<sup>SHH</sup> subtypes and graphed these data as violin plots (Figure 1B, 1C, and 1E). For these analyses, we used one-way ANOVA with Holm-Sidak’s multiple comparisons test, single pooled variance. P value ≤0.05 was considered statistically significant. Graphs display the mean ± standard error of the mean (SEM). Sample sizes were:

      Author response table 1.

      Another concern is related to the controls used in the study. Cre recombinase induces double-strand DNA breaks within the loxP sites, and the control mice did not carry the Cre transgene (as stated in the Method section), while Sufu-cKO mice did. This discrepancy necessitates an additional control group to evaluate the effects of Cre-induced double-strand breaks on phosphorylated H2AX-DSB signaling. Including this control would strengthen the validity of the findings by ensuring that observed effects are not artifacts of Cre recombinase activity.

      The breeding scheme we used to generate homozygous SUFU conditional mutants will not generate pups carrying only hGFAP-Cre. Thus, we are unable to compare gH2AX expression in littermates that do not carry loxP sites. The reviewer is correct in pointing out the possibility that Cre recombinase activity induces double-strand breaks on its own. However, it is likely that any hGFAP-Cre-induced double-strand breaks are not sufficient to cause the phenotypes we observed in homozygous mutant (Sufu-cKO) mice, because the cerebella of mice carrying heterozygous SUFU mutations (hGFAP-Cre;Sufu-fl/+) do not display the profound cerebellar phenotypes observed in Sufu-cKO mice. We cannot rule out, however, undetected abnormalities that may require further analyses.

      Although the use of the hGFAP-Cre line allows genetic access to the late embryonic stage, this also targets multiple cell types, including both GNPs and cerebellar glial cells. However, the authors focus primarily on GNPs without fully addressing the potential contributions of neuron-glial interaction. This oversight could limit the understanding of the broader cellular context in which FGF signaling influences tumor development.

      The reviewer is correct in that hGFAP-Cre also targets other cell types, such as cerebellar glial cells, which are generated when Cre expression has begun. It is possible that cerebellar glial cell development is also compromised in Sufu-cKO mice and may disrupt neuron-glial interaction, due to or independently of FGF signaling. In-depth studies are required to interrogate how loss of SUFU specifically affects the development of cerebellar glial cells and influences their cellular interactions in the developing cerebellum.

      Recommendations for the authors:

      Editorial Comments:

      The reviewers suggest a number of steps to improve the manuscript that include additional experiments and deeper analyses and re-evaluation of existing data. Short of significant new experiments, there appear to be a number of straightforward analyses that can improve the study:

      (1) Reanalyses of statistical and quantitative approaches used (e.g., FDR correction, cerebellar deficits, GEO analyses).

      The revised manuscript will include detailed information on the statistical and quantitative approaches as addressed in our response to the reviewer’s comments.

      (2) More clear presentation of qualitative labeling approaches (immunohistochemistry and in situ hybridization).

      A detailed description of the protocols used will be included in the methods section for labeling methods in the revised manuscript.

      Reviewer #1 (Recommendations For The Authors):

      AZD4547 treatment of the dKO mice would provide more convincing evidence that FGF-targeted treatments could curtail tumor growth in these mice or refute the suggestion that FGF-targeted treatment could prevent tumor growth.

      We agree that performing AZD4547 treatment on Sufu-dKO mice would strengthen these studies. However, we are unable to address this point because these mice are no longer available. We hope that future studies will address it.

      Atoh1 is referred to as Math1 (older nomenclature) and should be corrected.

      The revised manuscript will include this change in nomenclature.

      Check verb tense throughout the manuscript.

      We will edit the manuscript further to check verb tenses prior to submission of the revised manuscript.

      Reviewer #2 (Recommendations For The Authors):

      Specific Comments:

      (1) The identification of increased FGF5 in subsets of MB is novel and a key strength of the paper. However, the causal relationship between FGF5 and MB remains unestablished. Based on the current data, FGF5 can only be considered a biomarker for stratifying MB.

      We agree with the reviewer that our studies do not provide direct evidence that FGF5 causes MB. Future investigations that determine whether FGF5 inhibition leads to phenotypic rescue would more firmly establish the relationship between FGF5 and MB. The reviewer is also correct that our studies reveal that FGF5 acts as a potential biomarker, as we mentioned in the Discussion section.

      (2) The upregulation of Fgf5 in Sufu-deficient cerebella is crucial to this study, yet the presented data are unconvincing to support this conclusion. In comparing Fgf5 expression between WT and Sufu mutants (Figures 2E, F and 4I), the cerebellar sections differ significantly, with mutant sections seemingly from a more lateral position. The authors should provide images of mutant sections from more comparable positions to accurately assess the effect of Sufu deficiency on Fgf5 expression. Additionally, the signals in Figure 2F resemble non-specific backgrounds rather than specific RNAscope signals.

      The WT and mutant sections analyzed were carefully selected from comparable levels. The abnormal foliation in Sufu-cKO mice makes the mutant sections look as though they are from the lateral cerebellum.

      Figure 2F (enlarged regions) points to punctate RNAScope signals, which are characteristic of this labeling method (see RBFOX3 or GFAP labeling in DAPI-labeled cells in the mouse brain at https://acdbio.com/science/applications/research-areas/neuroscience). The higher number of punctate signals in some, but not all, DAPI-labeled cells in Figure 2F indicates that the FGF5 RNAScope signal is specific.

      (3) Jiwani et al. (2020) reported that Fgf8, which is also expressed in region B of the EGL, is upregulated in Sufu-deficient cerebella and is necessary and sufficient for Sufu mutant GCP proliferation. The current study does not distinguish whether the FGFR inhibitor AZD4547 blocks Fgf5 or Fgf8 function in restoring cerebellar histology in Sufu mutants.

      AZD4547 potently inhibits FGFR1, FGFR2, and FGFR3 autophosphorylation (Gavine et al., Cancer Research, 2012). FGF8 is reported to bind to these receptors (Ornitz and Itoh, 2015). Thus, the reviewer is correct that the studies will not distinguish between FGF5 or FGF8 activity. Further investigation on FGF8 expression and the effects of its inhibition in the Sufu-cKO neonatal cerebellum will determine whether tumorigenic processes are driven by either FGF5 or FGF8. Nevertheless, we postulate that FGF5 is exerting a greater effect in activating FGF signaling in the developing cerebellum given that it is highly expressed along the external granule layers of the developing cerebellum (Author response image 3).

      Author response image 3.

      Expression of FGF5 and FGF8 in the P4 mouse cerebellum (Allen Brain Atlas, https://developingmouse.brain-map.org).

      (4) The authors should show whether AZD4547 treatment restores normal Fgf5 expression. Importantly, they need to test whether AZD4547 rescues the proliferation defect observed in Sufu;p53 double mutants.

      We agree that performing AZD4547 treatment on Sufu-dKO mice would strengthen these studies. However, we are unable to address this point because these mice are no longer available. We hope that future studies will address it.

      (5) Jiwani et al. (2020) showed that deleting Sufu with Atoh1-Cre promotes Gli3R and suppresses Gli2 levels, leading to increased cell proliferation and delayed cell cycle exit in the central lobe. The findings of the current study (Supplementary Figure 1) seem to differ from this previous report, yet both studies conclude that Sufu-KO disrupts differentiation. The authors should provide an explanation for this discrepancy.

      Our results align completely with the findings of Jiwani et al. (2020). Both studies showed reduced levels of Gli3R, by nearly 50%, when Sufu is deleted (see Figure 4A-4D in Jiwani et al., 2020).

      (6) The hGFAP-Cre mouse line is used to delete Sufu from the cerebellum, but it is not commonly used for GCP-specific deletion. The authors need to provide a reference or more details on the temporal and spatial activity of the Cre line, as the cited paper describes its generation but offers little information on its activity in the developing cerebellum.

      We appreciate the reviewer’s reminder to include the reference for the Schuller et al. 2008 paper. This study characterized the hGFAP-Cre temporal and spatial expression in the developing cerebellum, including granule cell precursors. We will include this reference in the revised manuscript.

      (7) Based on the provided data, it is difficult to determine which cell types express Fgf5. Given that hGFAP-cre may delete Sufu in other cerebellar cell types, the authors should demonstrate that Fgf5 is expressed in granule cells or granule cell precursors.

      Future studies will focus on further characterization of the role of FGF5 in cerebellar development, including the identity of the cells expressing FGF5. The reviewer is correct in that hGFAP-Cre also targets other cell types and that Sufu deletion in these cells may have induced ectopic FGF5 expression.

      (8) The provided data show an increase in pERK+ cells in GCPs at the secondary fissure. This increase may simply reflect an accumulation of GCPs. It is unconvincing that there is an increase in pERK due to the loss of Sufu.

      The reviewer is correct that the increase in GCPs will also increase the number of pERK+ cells. To control for this, our quantification reflects the number of pERK+ cells per unit area within Ki67+ regions. With these parameters, we found an increased density of pERK+ cells in a given Ki67+ region. All quantifications were performed on 2-3 20 µm cerebellar sections from 3-6 independent mice per genotype analyzed. Individual points in the bar graphs represent the average cell number (quantified from 2-3 sections) for each mouse.
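      The normalization described in this response can be illustrated with a minimal sketch. It assumes each section yields a pERK+ cell count and the area of the Ki67+ region in which the cells were counted; the function name, units, and the numbers in the usage note are hypothetical and are not data from the study.

```python
from statistics import mean


def density_per_mouse(sections):
    """Average pERK+ cell density for one mouse.

    `sections` is a list of (perk_positive_count, ki67_area) tuples,
    one per cerebellar section (2-3 sections per mouse in the study).
    Dividing by the Ki67+ area controls for a larger GCP population.
    """
    densities = [count / area for count, area in sections]
    return mean(densities)


def densities_by_genotype(mice):
    """Map each genotype to a list of per-mouse densities (one bar-graph point each)."""
    return {
        genotype: [density_per_mouse(sections) for sections in animals]
        for genotype, animals in mice.items()
    }
```

      For example, `density_per_mouse([(10, 0.5), (30, 1.0)])` averages section densities of 20 and 30 cells per unit area, so a mouse with more GCPs but the same proportion of pERK+ cells would not score higher.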

      (9) No data are provided on MB formation in Sufu-cKO; p53- mutants, and it is unknown whether FGFR inhibitors block tumor formation.

      We agree that performing AZD4547 treatment on Sufu-dKO mice would strengthen these studies. However, we are unable to address this point because these mice are no longer available. We hope that future studies will address it.

      (10) The authors frequently mention "preneoplastic lesions" of GCPs in Sufu mutant mice. What evidence supports this claim?

      Preneoplastic lesions are defined as cells carrying genetic and phenotypic alterations that show higher risk of malignancy (such as MB) but lack the capacity to grow autonomously in the absence of a secondary factor (Feo et al., 2011). In Sufu-cKO mice, we see abnormally proliferating and behaving granule precursor cells that do not grow autonomously, in the absence of a p53 LOF. The combined deletion of Sufu and p53 transforms these cells to become neoplastic.

      (11) Fgf5 is normally expressed in region B. What is its potential function? Does AZD4547 affect normal development? 

      Future studies will focus on further characterization of the role of FGF5 in cerebellar development, including the identity of the cells expressing FGF5. Regarding AZD4547, we did not observe any obvious difference between AZD4547-treated and vehicle-treated cerebella. This indicates that AZD4547 inhibition of FGFRs under physiologic conditions does not significantly disrupt normal cerebellar development.

      (12) Figure 3G: It is unclear which specimens were treated with AZD4547. The authors mention treatment in line 281 but contradict themselves in the figure legend.

      We thank the reviewer for pointing out this typo. Cerebellar tissues shown in Figure 3G were all treated with AZD4547. The figure legend will be corrected in the revised manuscript.

      (13) Figure 4J: The higher magnification images of the pERK/Ki67 staining appear identical in the control and Sufu;p53-dKO. The authors need to correct the mistake.

      We thank the reviewer for pointing this out. We will correct this figure in the revised manuscript.

      Minor Comments:

      (1) Whenever possible, images comparing WT and mutants should be presented at the same scale within a figure. For example, readers might easily conclude that mutant brains are smaller than controls in Figure 4G.

      Unfortunately, because the cerebella of Sufu;p53-dKO mice are significantly bigger, we are unable to show the whole cerebellum at the same scale in Figure 4G. We wanted to emphasize the significant and abnormal cerebellar growth in this figure.

      (2) The figure legend for Supplementary Figure 2 is missing.

      Thank you for pointing this out. We will add a legend for this supplementary figure in the revised manuscript.

      (3) The authors state that the expansion of Pax6+ GNPs in the newborn Sufu-cKO cerebellum (Figure 2) occurs in similar anatomical subregions where infantile MB tumors typically arise (Tan et al., 2018). The cited paper describes more abundant SHH MB in the cerebellar hemisphere. The authors need to elaborate on their statement to clarify this point.

      The reviewer is correct in that Tan et al., 2018 observed tumors arising from the cerebellar hemisphere. More specifically, these tumors arise in the posterior/ventral regions of the cerebellar hemispheres (Figure 2 in Tan et al., 2018). Similarly, Sufu-cKO mice have more severe defects in the posterior/ventral regions of the cerebellar hemisphere (Figures 2A and 3F) and therefore corroborate the findings by Tan et al., that abnormal SHH signaling in these regions results in increased sensitivity to MB formation.

      Reviewer #3 (Recommendations For The Authors):

      Figure 1 [Upregulated FGF5 expression in MB<sup>SHH</sup> tumors]

      - The statistical analysis of the GEO expression dataset is not described in enough detail. At the least, the authors should mention whether they made any adjustments to the default settings and how they extracted and plotted the FGF5 expression values (Figure 1BCE).

      We obtained FGF5 (ENSG00000138675) expression values using GEO2R (Barrett et al., 2013), which directly analyzes the published human MB subtype expression arrays from accession no. GSE85217 (Cavalli et al., 2017). GEO2R is an interactive web tool that compares expression levels of genes of interest (GOI) between sample groups in a GEO series using the original submitter-supplied processed data tables. We simply entered the GOI Ensembl ID and organized the data sets according to age and MB subgroup or MB<sup>SHH</sup> subtype classifications. GEO2R presented gene expression levels as a table ordered by FDR-adjusted (Benjamini & Hochberg) p-values, with a significance cut-off of 0.05, processed by GEO2R’s built-in limma statistical test. The resulting data were subsequently exported into Prism (GraphPad). We generated scatter plots presenting FGF5 expression levels across all MB subgroups (Figure 1A) and MB<sup>SHH</sup> subtypes (Figure 1D). We performed additional statistical analyses to compare FGF5 expression levels between MB subgroups and MB<sup>SHH</sup> subtypes and graphed these data as violin plots (Figure 1B, 1C, and 1E). For these analyses, we used one-way ANOVA with Holm-Sidak’s multiple comparisons test, single pooled variance. A P value ≤0.05 was considered statistically significant. Graphs display the mean ± standard error of the mean (SEM). See Author response table 1 for sample sizes.
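      For readers unfamiliar with the Benjamini & Hochberg adjustment mentioned above, the following is a minimal, generic sketch of the procedure. It is not GEO2R’s or limma’s code, and the function name is hypothetical; it only shows how raw p-values are converted to FDR-adjusted values before comparison with a 0.05 cut-off.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg FDR-adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down so that each adjusted value
    # never exceeds the adjusted value of a larger raw p-value.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def significant(pvalues, alpha=0.05):
    """Flag tests whose adjusted p-value is at or below alpha."""
    return [q <= alpha for q in benjamini_hochberg(pvalues)]
```

      Each raw p-value is scaled by the number of tests divided by its rank, so the smallest p-values are penalized most strongly and the expected fraction of false positives among the flagged tests is controlled.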

      Figure 3 [Ectopic activation of FGF signaling in the EGL of P0 Sufu-cKO cerebellum]

      - The Gli1-lz mice reference is wrong. The correct reference is Bai CB, et al. 2002.

      - Generation of Sufu-cKO;Gli1-LacZ triple transgenic mice not described 

      - Veh vs. treated not labelled (Figure 3F)

      We will address these minor text changes in the revised manuscript. A more detailed description of the generation of Sufu-cKO;Gli1-LacZ triple transgenic will also be included in the Methods section.

      Figure 5 [Proposed model]

      - In the text, Figure 5 is mistaken for Figure 8. 

      We will address these minor text changes in the revised manuscript.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This valuable study reports comprehensive multi-omic data on the changes induced in young and aged male mouse tail fibroblasts after treatment with chemical reprogramming factors. The authors claim that chemical reprogramming factors induce changes consistent with a reduction of cellular 'biological' age (e.g., correlations with established aging markers in whole tissues). However, the study relies on previously identified aging markers (instead of aging in the tail fibroblast system itself), and thus, at this stage, the evidence in support of the observed molecular changes truly reflecting changes in biological age in the study system is still incomplete.

      Essential revisions

      After discussion with reviewers, we believe that the conclusions of the manuscript would be significantly strengthened with the following revisions:

      (1) Rather than basing the analysis of age-related markers on public tissue data, it is recommended that authors use their own data on pre-reprogramming fibroblasts to define molecular aging-related markers/signatures specifically for male tail fibroblasts at 4 vs 20 months. This should also always be included in figures as reference points.

      We appreciate these helpful comments. Please refer to our responses to Reviewers #1 and #2 concerning these suggestions and the corresponding changes we have made in the revised manuscript.

      (2) In general, the methods as written lack the details necessary to fully understand the study/reproduce it independently, notably in terms of data analysis choices (e.g. use of FWER/FDR type correction for multiple testing, use of raw vs normalized RNA counts for PCA, etc).

      Thank you for this feedback. We have modified our text to address this issue. Please refer to our responses to Reviewer #1 for the specific changes we have made.

      (3) More generally, the authors should better outline the limitations/caveats of their experimental design in the discussion and/or abstract, including the specific cell type and the choice of using only male data (since aging itself is very sex-dimorphic, and the impact of partial reprogramming on aging phenotypes may also be sex-dimorphic).

      Thank you for this important feedback. We have now added a section to our Discussion in which we directly address potential limitations of our study concerning sex-specific differences and the cell type used.

      Public Reviews:

      Reviewer #1:

      Summary:

      The investigators employed a multi-omics approach to show the functional impact of partial chemical reprogramming in fibroblasts from young and aged mice.

      Strengths:

      Multi-omics data was collected, including epigenome, transcriptome, proteome, phosphoproteome, and metabolome. Different analyses were conducted accordingly, including differential expression analysis, gene set enrichment analysis, transcriptomic and epigenetic clock-based analyses. The impact of partial chemical reprogramming on aging was supported by these multi-source results.

      We appreciate the reviewer noting the strength and comprehensiveness of our approach.

      Weaknesses:

      More experimental data may be needed to further validate current findings.

      We thank the reviewer for this suggestion. To further validate our findings, we have proceeded as follows: (1) First, we have investigated the role of Prkaca activation during partial chemical reprogramming with 7c (see updated Fig. 5C, Fig. 5 – figure supplement 1B). By confocal microscopy, we show that partial chemical reprogramming with 7c does not cause Prkaca to localize to mitochondria; rather, its cellular distribution is altered to favor nuclear localization. We also used RNAi to knock down Prkaca and found that Prkaca is not necessary for mediating the increase in mitochondrial membrane potential upon partial chemical reprogramming with 7c.

      (2) We have determined the effect of partial chemical reprogramming with 7c on apoptosis using Annexin V assay (see updated Fig. 5 – figure supplement 1C). We show that during the course of partial chemical reprogramming, the proportion of apoptotic cells steadily increases to about 20 percent.

      (3) We have re-analyzed our multi-omics data to determine the molecular differences (e.g. at the epigenome, transcriptome, proteome, and metabolome levels) between fibroblasts isolated from young and old mice (see updated Fig. 2 – figure supplement 1, Fig. 6 – figure supplement 1, and Fig. 7 – figure supplement 2). Additionally, we have updated Fig. 7A to include statistical comparisons of transcriptomic age of 4-month-old and 20-month-old fibroblasts. Finally, we have updated Fig. 3D to include functional enrichment of gene and protein expression levels of aged fibroblasts.

      (4) We have more thoroughly characterized the effects of partial chemical reprogramming on the epigenome (see Fig. 7 – figure supplement 3).

      (5) Julie Y. Chen was added on as an additional co-author for producing the analyses shown in Fig. 7 – figure supplement 2, and Fig. 7 – figure supplement 3.

      Reviewer #2:

      The short-term administration of reprogramming factors to partially reprogram cells has gained traction in recent years as a potential strategy to reverse aging in cells and organisms. Early studies used Yamanaka factors in transgenic mice to reverse aging phenotypes, but chemical cocktails could present a more feasible approach for in vivo delivery. In this study, Mitchell et al sought to determine the effects that short-term administration of chemical reprogramming cocktails have on biological age and function. To address this question, they treated young and old mouse fibroblasts with chemical reprogramming cocktails and performed transcriptome, proteome, metabolome, and DNA methylation profiling pre- and post-treatment. For each of these datasets, they identified changes associated with treatment, showing downregulation of some previously identified molecular signatures of aging in both young and old cells. From these data, the authors conclude that partial chemical reprogramming can rejuvenate both young and old fibroblasts.

      The main strength of this study is the comprehensive profiling of cells pre- and post-treatment with the reprogramming cocktails, which will be a valuable resource for better understanding the molecular changes induced by chemical reprogramming. The authors highlighted consistent changes across the different datasets that are thought to be associated with aging phenotypes, showing reduction of age-associated signatures previously identified in various tissues. However, from the findings, it remains unclear which changes are functionally relevant in the specific fibroblast system being used. Specifically:

      (1) The 4 month and 20 month mouse fibroblasts are designated "young" vs "old" in this study. An important analysis that was not shown for each of the profiled modalities was a comparison of untreated young vs old fibroblasts to determine age-associated molecular changes in this specific model of aging. Then, rather than using aging signatures defined in other tissues, it would be more appropriate to determine whether the chemical cocktails reverted old fibroblasts to a younger state based on the age-associated changes identified in this comparison.

      In our study, we have used 4 biological samples per group for young and old untreated fibroblasts, and these samples have been used to calculate the effect of 7c and 2c cocktails on gene expression in each age group. Therefore, the correlation between logFC induced by 7c/2c treatment and logFC between young and old fibroblasts would be biased, since the same untreated samples would be used in both calculations: estimates B-A and C-B will be, on average, negatively correlated even if A, B and C are independent random variables. For this reason, to investigate the effect of cocktails on biological age, we utilized gene expression signatures of aging, estimated based on more than 2,600 samples of different ages from 25 data sources (PMID: 37269831). Notably, our multi-tissue signatures of aging were identified based on data from 17 tissues, including skin. Therefore, these biomarkers seem to represent more reliable and universal molecular mechanisms of aging. Since they have been identified using independent data, the signatures also don’t introduce the statistical bias described above. For these reasons, we think that they are more applicable for the current analysis. To demonstrate that the utilized aging signatures are overall consistent with the changes observed in studied fibroblasts, we performed GSEA-based analysis, testing association between logFC in aged fibroblasts and various signatures of aging and reprogramming (similar to our analysis in Fig. 2E). We found that the changes in aged fibroblasts from the current study demonstrated positive association with the majority of aging signatures (kidney, liver and multi-tissue signatures in mouse and rat) (Fig. 2 – figure supplement 1A) and were negatively associated with signatures of reprogramming. 
      In addition, we characterized functional changes perturbed in untreated aged fibroblasts at the level of gene expression and protein concentrations and observed multiple changes consistent with the aging signatures, such as upregulation of genes and proteins involved in inflammatory response and interferon signaling (Fig. 3D, Fig. 2 – figure supplement 1C). Therefore, changes observed in untreated aged fibroblasts seem to agree with age-related molecular changes identified across mammalian tissues in our previous studies.
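      The statistical bias described in this response (estimates B-A and C-B being negatively correlated even when A, B and C are independent) can be checked with a small simulation. This is a minimal sketch under assumed conditions: A, B and C are drawn as independent standard normal values per gene, which is a simplification and not the study's data. With equal variances, Cov(B-A, C-B) = -Var(B) and each difference has variance 2·Var(B), so the expected correlation is -0.5.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def simulate_shared_baseline(n_genes=20000, seed=0):
    """Correlate 'aging' (B-A) with 'treatment' (C-B) when B is shared."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(n_genes)]  # young, untreated
    b = [rng.gauss(0, 1) for _ in range(n_genes)]  # old, untreated (used twice)
    c = [rng.gauss(0, 1) for _ in range(n_genes)]  # old, treated
    aging_effect = [bi - ai for ai, bi in zip(a, b)]
    treatment_effect = [ci - bi for bi, ci in zip(b, c)]
    return pearson(aging_effect, treatment_effect)


if __name__ == "__main__":
    print(f"correlation with shared baseline: {simulate_shared_baseline():.3f}")
```

      The simulated correlation is strongly negative although no real relationship exists between the three groups, which is why an apparent "reversal" of aging computed from a shared set of untreated samples would be uninformative and an independent signature is preferable.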

      We would also like to mention that the epigenetic clocks used in this study consistently show that the fibroblasts from 20-month-old mice are significantly older than the fibroblasts from 4-month-old mice (Fig. 7B). Moreover, we have revised the manuscript to show that these epigenetic differences between young and old untreated fibroblasts are not due to overall changes in mean DNA methylation (Fig. 7 – figure supplement 2). In contrast, in the revised manuscript, we observe that 7c treatment reduces the epigenetic age of cells by decreasing mean DNA methylation levels (Fig. 7 – figure supplement 3).

      (2) Across all datasets, it appears that the global profiles of young vs old mouse fibroblasts are fairly similar compared to treated fibroblasts, suggesting that the chemical cocktails are not reverting the fibroblasts to a younger state but instead driving them to a different cell state. Similarly, in most cases where specific age-related processes/genes are being compared across untreated and treated samples, no significant differences are observed between young and old fibroblasts.

      We agree that our data show that partial chemical reprogramming seems to induce a similar effect on young and old fibroblasts. In Fig. 2 – figure supplement 1B, the Spearman correlation coefficients for the effects on gene expression in young and old fibroblasts are 0.80 and 0.85 for 2c and 7c, respectively. It is important to note that the effect of partial chemical reprogramming is an order of magnitude greater (for example, in terms of the number of differentially expressed genes) than the effect of aging in the untreated fibroblasts. We believe that partial chemical reprogramming with 7c pushes the cells to a younger state as a byproduct of producing a different cellular metabolic state with a strong increase in OXPHOS capacity.

      (3) Functional validation experiments to confirm that specific changes observed after partial reprogramming are indeed reducing biological age are limited.

      Functional validation of rejuvenating interventions is limited in vitro, as cells do not completely maintain their “aged” phenotype once isolated and cultured, and pursuing partial chemical reprogramming in vivo in naturally aged mice was beyond the scope of the study. Among the best reporters of biological age that are preserved in primary cells in vitro are epigenetic and transcriptomic clocks, both of which were used in this manuscript to show that 7c treatment, but not 2c, reduces biological age. We show that splicing-related damage is marginally elevated in old fibroblasts compared to young, and that 7c reduces splicing damage by reducing intron retention. Moreover, the epigenetic clocks used in this study show that the 20-month-old fibroblasts are significantly older than the 4-month-old fibroblasts, indicating that the “aged” phenotype is at least partially preserved. Furthermore, according to previous studies (PMIDs: 37269831, 31353263), one of the strongest functional biomarkers of aging is downregulation of mitochondrial function and energy metabolism, including oxidative phosphorylation, while upregulation of these functions is usually associated with extended lifespan in mice. For this reason, we have focused on these pathways in our study and assessed them with functional assays.

      (4) Partial reprogramming appears to substantially reduce biological age of the young (4 month) fibroblasts based on the aging signatures used. It is unclear how this result should be interpreted.

      This is a caveat of all reprogramming strategies and “anti-aging” interventions developed and tested to date. Currently, there are no genetic or pharmacological methods that target only the “aged” state and not the “young” state as well (i.e., an intervention that would only cause a change in old cells and revert them to a younger state). However, “young” cells in our study and many other studies are still cells of an intermediate age, as aging appears to begin early during development. Therefore, perhaps unsurprisingly, partial chemical reprogramming seemed to have similar effects on fibroblasts isolated from young and old mice, which is in line with OSK/OSKM reprogramming. These results should be interpreted as follows: partial chemical reprogramming does not depend on the epigenetic state (biological age) of adult cells to induce rejuvenation. We have updated the discussion section of our manuscript accordingly.

      Recommendations for the authors:

      Reviewer #1:

      (1) How was the PCA conducted for RNA-seq data? Were the raw or normalized counts used for PCA?

      Normalized counts were used for PCA of the RNA-seq data.

      (2) Supplementary Fig 3c, why was the correlation between the red rows and red columns low? Was the color of group messed up? Why was the Pearson correlation used instead of Spearman correlation? Most of the correlation analyses in the manuscript used Spearman correlation.

      We thank the reviewer for noticing this mistake. The colors of the groups have now been corrected. Furthermore, to be consistent with the rest of the manuscript, we have performed a Spearman correlation analysis on the normalized proteomics data to evaluate sample-to-sample similarities and updated Fig. 3 – figure supplement 1 accordingly. Overall, the results are similar to those obtained by Pearson correlation.
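      The difference between the two correlation measures discussed here can be shown with a short sketch. Spearman correlation is the Pearson correlation computed on ranks, so it captures any monotonic relationship, whereas Pearson correlation measures linear association and is more sensitive to extreme values. The helper names below are hypothetical and the implementation is a plain-Python illustration, not the code used for the manuscript.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

      For two samples whose protein abundances are related monotonically but not linearly, `spearman` returns a value near 1 while `pearson` is lower, which is one reason the two methods can give slightly different sample-to-sample similarity matrices.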

      (3) Were the significant metabolites tested by one-way ANOVA adjusted for family-wise type I error rate? It is surprising that over 50% of metabolites were significant.

      Yes, the significant metabolites were adjusted for family-wise type I error rate (with a 5% significance threshold) in Fig. 6B.

      (4) Missing full names of several abbreviations, such as NIA, RLE, PSI, etc.

      Thank you for noticing the missing abbreviations. We have corrected this by writing out the full term in the first instance in which each abbreviation appears.

      (5) Methods section may be too long. Some paragraphs could be moved to supplementary text.

      eLife does not limit the number of figures or the amount of text. Therefore, we have kept the Methods section largely unaltered, as we feel that it will be helpful to the scientific community.

      Reviewer #2:

      (1) As discussed in the public review, I would recommend first establishing what differences exist between 4 month and 20 month fibroblasts to identify potential age-related changes in these fibroblasts.

      We thank the reviewer for this suggestion. We have now thoroughly characterized the molecular differences between fibroblasts taken from young and old mice at the epigenome, transcriptome, proteome, and metabolome levels. Please refer to previous responses for more specific details.

      We have also attempted to establish aging-related differences at the phosphoproteome level, particularly with regard to mitochondrial processes (see figure below), but only GOcc: mitochondrion and GObp: mitochondrial transport come close to being statistically significant (raw p-values of 0.05 and 0.08, respectively) in the control comparison.

      Author response image 1.

      (2) While the global changes currently highlighted in the study are informative and should remain in the revised manuscript, additional analyses to show which age-related changes identified in point 1 are reverted upon 2c or 7c treatment would better address the question of whether these cocktails revert age-related changes seen in fibroblasts. These analyses should be performed for each dataset (i.e., transcriptomic, proteomic, epigenomic, metabolomic) generated.

      Thank you for this comment. We have now evaluated the effects of partial chemical reprogramming on the specific molecular differences between fibroblasts isolated from young and old mice (see updated Fig. 2 – figure supplement 1, Fig. 6 – figure supplement 1, Fig. 7 – figure supplement 2, and Fig. 7 – figure supplement 3). For functional enrichment of aged fibroblasts at the gene and protein level, please refer to updated Fig. 3D.

      (3) Comparisons between partial reprogramming and OSKM reprogramming signatures are repeatedly made in the paper, but it is not clear from the text whether similarity to OSKM reprogramming signatures is a desired or undesired feature. Since there are likely both rejuvenating and oncogenic aspects of the OSKM signatures, it is unclear what conclusions can be made from these comparisons.

      Two central questions of this study were (1) whether partial chemical reprogramming could induce cellular rejuvenation, and (2) if so, whether it would do so merely by chemically activating expression of Yamanaka factors. In this study, we find that 7c, the cocktail that demonstrated the most profound effect on biological age, only slightly upregulates Klf4, downregulates c-Myc, and has no effect on Sox2 or Oct4 expression. Thus, partial chemical reprogramming seems to operate through a mechanism independent of upregulating OSK/OSKM gene expression. This is crucial as it suggests that there are other transcription factors outside of OSKM that can be targeted to induce cellular rejuvenation and reversal of biological age. However, the direct transcriptional targets of partial chemical reprogramming are currently unknown and require further investigation.

      Partial reprogramming with OSK/OSKM has several limitations, including low efficiency, oncogenic risk, and differences in the speed of reprogramming according to cell/tissue type. These risks could be inherently tied to the transcription factors OSKM themselves; thus, partial chemical reprogramming, by avoiding strong activation of these genes, could potentially avoid these risks and provide a safer means for reversing biological age in vivo. However, extensive follow-up studies beyond the scope of this manuscript are certainly required to determine this.

      We have addressed this comment by modifying the discussion to include these points.

      (4) When analyzing the phospho-proteomics data, results are discussed as general changes in phosphorylation of proteins involved in different cellular processes. However, phosphorylation can either activate or inhibit a specific protein, and can depend on the specific residue in a protein that is modified. Different proteins in a cellular process can also respond in opposite directions to phosphorylation. Treating activating and inactivating phosphorylation events separately in describing these results would be more informative.

      We agree that an analysis that considers for each specific phosphosite whether it activates or inactivates a particular pathway would in principle be preferable over our current enrichment analysis that only accounts for the increase or decrease in phosphorylation of each site without knowing its biological meaning. However, unfortunately, we think it is currently practically not possible to conduct such an analysis. The proposed analysis would require a database with information on which residues are (de-)phosphorylated when a certain pathway is activated. However, as far as we know, there are currently no databases that link activation or inactivation of specific phosphosites to pathways in repositories like KEGG, HALLMARK, GObp, GOcc, GOmf, Reactome, etc.

      Some databases link phosphosites to drugs, diseases and kinases (e.g. PTMsigDB (PMID: 30563849)). However, these authors explicitly state: “We note that we do not capture functional annotations of PTM sites in PTMsigDB, such as activating or inactivating effect on the modified protein.” Furthermore, even in these databases, for the vast majority of the registered phosphosites, the responsible kinases are unknown, especially in mice. In our work, we made use of PhosphoSitePlus for kinase substrate enrichment analysis (see Fig. 5B). Such analyses, where kinase activity is inferred based on activated phosphosites are indeed commonly performed (see PMIDs: 34663829, 37269289, 37585503).

      In the absence of a repository that assigns activity to phosphosites, if enrichment analysis is being done for biological pathways, it is standard practice to do so without accounting for whether phosphosites are activating or inactivating (see PMID: 34663829), as we have done in our manuscript (Fig. 5A).

      Despite the drawbacks, we believe our analysis is relevant, as it demonstrates important biological activity in these pathways upon 2c/7c treatments as compared to controls. For example, the observed increase in abundance in mitochondrial OXPHOS complexes (Fig. 3E) combined with an increase in general phosphorylation of mitochondrial proteins (Fig. 5A) likely points to an increase in mitochondrial activity, although one cannot exclude that some individual phosphorylation events might have inhibitory effects on certain mitochondrial proteins, while others might indicate increases in activity.

      (5) For the transcriptomic and epigenetic aging clocks used in Fig 7, significance tests need to be included for untreated 4 month vs 20 month fibroblasts. Particularly for the transcriptional clock, the differences are small and suggest that it may not be a strong aging signature.

      We have updated our clock analysis with the most recent versions of the clocks and added statistical significance between 4-month-old and 20-month-old untreated fibroblasts there (Fig. 7A). The difference is statistically significant for the chronological clock. However, when the lifespan-adjusted clock was applied, no statistical significance was observed, suggesting that 20-month-old fibroblasts do not exhibit substantial changes in gene expression associated with decreased healthspan and increased mortality.

      (6) For heatmaps shown in Figure 3D and Figure 4, please include untreated 4 month and 20 month fibroblasts as well to determine if pathways being compared are different between young and old fibroblasts.

      We have updated Figure 3D with functional enrichment results for aged fibroblasts at gene and protein expression levels, as requested. As for Fig. 4, we explained in our reply to point 1 of Reviewer #2 in the public review why addition of aged fibroblasts there would be biased. Instead, we have performed GSEA-based association analysis for changes observed in aged fibroblasts and signatures of aging (Fig. 2 – figure supplement 1), confirming that our signatures are overall consistent with patterns of 20-month-old fibroblasts from the current study.

    1. Author response

      The following is the authors’ response to the current reviews.

      We thank the editor for the eLife assessment and reviewers for their remaining comments. We will address them in this response.

      First, we thank eLife for the positive assessment. Regarding the point of visual acuity that is mentioned in this assessment, we understand that this comment is made. It is not an uncommon comment when rodent vision is discussed. However, we emphasize that we took the lower visual acuity of rats and the higher visual acuity of humans into account when designing the human study, by using a fast and eccentric stimulus presentation for humans. As a result, we do not expect a higher discriminability of stimuli in humans. We have described this in detail in our Methods section when describing the procedure in the human experiment:

      “We used this fast and eccentric stimulus presentation with a mask to resemble the stimulus perception more closely to that of rats. Vermaercke & Op de Beeck (2012) have found that human visual acuity in these fast and eccentric presentations is not significantly better than the reported visual acuity of rats. By using this approach we avoid that differences in strategies between humans and rats would be explained by such a difference in acuity”

      Second, regarding the remaining comment of Reviewer #2 about our use of AlexNet:

      While it is indeed relevant to further look into different computational architectures, we chose to not do this within the current study. First, it is a central characteristic of the study procedure that the computational approach and network are chosen early on, as they are used to generate the experimental design with which the animals are tested. We cannot decide after data collection to use a different network to select the stimuli with which these data were collected. Second, as mentioned in our first response, using AlexNet is not a random choice. It has been used in many previously published vision studies that were relatively positive about the correspondence with biological vision (Cadieu et al., 2014; Groen et al., 2018; Kalfas et al., 2018; Nayebi et al., 2023; Zeman et al., 2020). Third, our aim was not to find a best DNN model for rat vision, but instead to examine the visual features that play a role in our complex discrimination task with a model that was hopefully a good enough starting point. The fact that the designs based upon AlexNet resulted in differential and interpretable effects in rats as well as in humans suggests that this computational model was a good start. Comparing the outcomes of different networks would be an interesting next step, and we expect that our approach could work even better when using a network that is more specifically tailored to mimic rat visual processing.

      Finally, regarding the choice to specifically choose alignment and concavity as baseline properties, this choice is probably not crucial for the current study. We have no reason to expect rats to have an explicit notion about how a shape is built up in terms of a part-based structure, where alignment relates to the relative position of the parts and concavity is a property of the main base. For human vision it might be different, but we did not focus on such questions in this study.


      The following is the authors’ response to the original reviews.

      We would like to thank you for giving us the opportunity to submit a revised draft of our manuscript. We appreciate the time and effort that you dedicated to providing insightful feedback on our manuscript and are grateful for the valuable comments and improvements on our paper. It helped us to improve our manuscript. We have carefully considered the comments and tried our best to address every one of them. We have added clarifications in the Discussion concerning the type of neural network that we used, about which visual features might play a role in our results as well as clarified the experimental setup and protocol in the Methods section as these two sections were lacking key information points.

      Below we provide a response to the public comments and concerns of the reviewers.

      Several key points were addressed by at least two reviewers, and we will respond to them first.

      A first point concerns the type of network we used. In our study, we used AlexNet to simulate the ventral visual stream and to further examine rat and human performance. While other, more complex neural networks might lead to other results, we chose to work with AlexNet because it has been used in many other vision studies that are published in high impact journals (Cadieu et al., 2014; Groen et al., 2018; Kalfas et al., 2018; Nayebi et al., 2023; Zeman et al., 2020). We did not try to find a best DNN model for rat vision but instead, we were looking for an explanation of which visual features play a role in our complex discrimination task. We added a consideration to our Discussion addressing why we worked with AlexNet. Since our data will be published on OSF, we encourage researchers to use our data with other, more complex neural networks and to further investigate this issue.

      A second point that was addressed by multiple reviewers concerns the visual acuity of the animals and its impact on their performance. The position of the rat was not monitored in the setup. In a previous study in our lab (Crijns & Op de Beeck, 2019), we investigated the visual acuity of rats in the touchscreen setups by presenting gratings with different cycles per screen to see how it affects their performance in orientation discrimination. With the results from this study and general knowledge about rat visual acuity, we derived that the decision distance of rats lies around 12.5cm from the screen. We have added this paragraph to the Discussion.

      A third key point that needs to be addressed as a general point involves which visual features could explain rat and human performance. We reported marked differences between rat and human data in how performance varied across image trials, and we concluded through our computationally informed tests and analyses that rat performance was explained better by lower levels of processing. Yet, we did not investigate which exact features might underlie rat performance. As a starter, we have focused on taking a closer look at pixel similarity and brightness and calculating the correlation between rat/human performance and these two visual features.

      We calculated the correlation between the rat performances and image brightness of the transformations. We did this by calculating the difference in brightness of the base pair (brightness base target – brightness base distractor), and subtracting the difference in brightness of every test target-distractor pair for each test protocol (brightness test target – brightness test distractor for each test pair). We then correlated these 287 brightness values (1 for each test image pair) with the average rat performance for each test image pair. This resulted in a correlation of 0.39, suggesting that there is an influence of brightness in the test protocols. If we perform the same correlation with the human performances, we get a correlation of -0.12, suggesting a negative influence of brightness in the human study.
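
      As an illustration only, the brightness measure described above can be sketched as follows. The definition of brightness as the mean pixel intensity of an image, and the function names, are assumptions made for this sketch and are not taken from the manuscript.

```python
def mean_brightness(image):
    """Mean pixel intensity of a 2D image given as a list of rows.

    Assumption: 'brightness' is taken to be the mean pixel value.
    """
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def brightness_predictor(base_target, base_distractor,
                         test_target, test_distractor):
    """Brightness difference of the base pair minus that of a test pair.

    Follows the verbal description: (base target - base distractor)
    minus (test target - test distractor).
    """
    base_diff = mean_brightness(base_target) - mean_brightness(base_distractor)
    test_diff = mean_brightness(test_target) - mean_brightness(test_distractor)
    return base_diff - test_diff
```

      Applying such a function to every test image pair would give one value per pair, which can then be correlated with the average performance on that pair.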

      We calculated the correlation between pixel similarity of the test stimuli in relation to the base stimuli with the average performance of the animals on all nine test protocols. We did this by calculating the pixel similarity between the base target with every other testing distractor (A), the pixel similarity between the base target with every other testing target (B), the pixel similarity between the base distractor with every other testing distractor (C) and the pixel similarity between the base distractor with every other testing target (D). For each test image pair, we then calculated the average of (A) and (D), and subtracted the average of (C) and (B) from it. We correlated these 287 values (one for each image pair) with the average rat performance on all test image pairs, which resulted in a correlation of 0.34, suggesting an influence of pixel similarity in rat behaviour. Performing the same correlation analysis with the human performances results in a correlation of 0.12.
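
      The combination of the four similarity terms (A) to (D) can be sketched as below. The similarity metric itself is not specified in this response, so the one used here (one minus the mean absolute pixel difference, for same-sized images with values in [0, 1]) is a hypothetical stand-in, as are the function names.

```python
def pixel_similarity(img_a, img_b):
    """Hypothetical similarity metric: 1 - mean absolute pixel difference.

    Assumes both images have the same size and values in [0, 1].
    """
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("images must have the same number of pixels")
    return 1.0 - sum(abs(a - b) for a, b in zip(flat_a, flat_b)) / len(flat_a)


def similarity_predictor(base_target, base_distractor,
                         test_target, test_distractor):
    """Average of (A) and (D) minus average of (C) and (B), as described."""
    a = pixel_similarity(base_target, test_distractor)      # (A)
    b = pixel_similarity(base_target, test_target)          # (B)
    c = pixel_similarity(base_distractor, test_distractor)  # (C)
    d = pixel_similarity(base_distractor, test_target)      # (D)
    return (a + d) / 2 - (c + b) / 2
```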

      We have also addressed this in the Discussion of the revised manuscript. Note that the reliability of the rat data was 0.58, clearly higher than the correlations with brightness and pixel similarity, thus these features capture only part of the strategies used by rats.

      We have also responded to all other insightful suggestions and comments of the reviewers, and a point-by-point response to the more major comments will follow now.  

      Reviewer #1, general comments:

      The authors should also discuss the potential reason for the human-rat differences too, and importantly discuss whether these differences are coming from the rather unusual approach of training used in rats (i.e. to identify one item among a single pair of images), or perhaps due to the visual differences in the stimuli used (what were the image sizes used in rats and humans?). Can they address whether rats trained on more generic visual tasks (e.g. same-different, or category matching tasks) would show similar performance as humans?

      The task that we used is typically referred to as a two-alternative forced choice (2AFC). This is a simple task to learn. A same-different task is cognitively much more demanding, also for artificial neural networks (see e.g. Puebla & Bowers, 2022, J. Vision). A one-stimulus choice task (probably what the reviewer refers to with category matching) is known to be more difficult compared to 2AFC, with a sensitivity that is predicted to be a factor of sqrt(2) lower according to signal detection theory (MacMillan & Creelman, 1991). We confirmed this prediction empirically in our lab (unpublished observations). Thus, we predict that rats would perform less well in the suggested alternatives, potentially even (in case of same-different) resulting in a wider performance gap with humans.
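
      The sqrt(2) relation can be made concrete with a short sketch under the standard equal-variance Gaussian model of signal detection theory, assuming an unbiased observer. The function names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def proportion_correct_one_stimulus(d_prime):
    """P(correct) in a one-stimulus (yes/no) task with an unbiased criterion.

    d_prime is the separation between the two stimulus distributions.
    """
    return _STANDARD_NORMAL.cdf(d_prime / 2)


def proportion_correct_2afc(d_prime):
    """P(correct) in 2AFC for the same underlying d_prime.

    The decision is based on the difference of two independent
    observations, which has mean d_prime and standard deviation sqrt(2).
    """
    return _STANDARD_NORMAL.cdf(d_prime / sqrt(2))
```

      For any positive d_prime, the 2AFC proportion correct exceeds the one-stimulus proportion correct, which is the sense in which the one-stimulus task is predicted to be harder.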

      I also found that a lot of essential information is not conveyed clearly in the manuscript. Perhaps it is there in earlier studies but it is very tedious for a reader to go back to some other studies to understand this one. For instance, the exact number of image pairs used for training and testing for rats and humans was either missing or hard to find out. The task used on rats was also extremely difficult to understand. An image of the experimental setup or a timeline graphic showing the entire trial with screenshots would have helped greatly.

      All the image pairs used for training and testing for rats and humans are depicted in Figure 1 (for rats) and Supplemental Figure 6 (for humans). For the first training protocol (Training), only one image pair was shown, with the target being the concave object with horizontal alignment of the spheres. For the second training protocol (Dimension learning), three image pairs were shown, consisting of the base pair, a pair which differs only in concavity, and a pair which differs only in alignment. For the third training protocol (Transformations) and all testing protocols, all combinations of targets and distractors were presented. For example, in the Rotation X protocol, the stimuli consisted of 6 targets and 6 distractors, resulting in a total of 36 image pairs for this protocol. The task used on rats is exactly as shown in Figure 1. A trial started with two blank screens. Once the animal initiated a trial by sticking its head in the reward tray, one stimulus was presented on each screen. There was no time limit and so the stimuli remained on the screen until the animal made a decision. If the animal touched the target, it received a sugar pellet as reward and an ITI of 20s started. If the animal touched the distractor, it did not receive a sugar pellet and a time-out of 5s started in addition to the 20s ITI.

      We have clarified this in the manuscript.

      The authors state that the rats received random reward on 80% of the trials, but is that on 80% of the correctly responded trials or on 80% of trials regardless of the correctness of the response? If these are free choice experiments, then the task demands are quite different. This needs to be clarified. Similarly, the authors mention that 1/3 of the trials in a given test block contained the old base pair - are these included in the accuracy calculations?

      The animals receive random reward on 80% of all testing trials with new stimuli, regardless of the correctness of the response. This was done to ensure that we can measure true generalization based upon learning in the training phase, and that the animals do not learn/are not trained in these testing stimuli. For the trials with the old stimuli (base pair), the animals always received real reward (reward when correct; no reward in case of error).
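
      The reward rule described above can be summarised in a minimal sketch. The function and constant names are hypothetical, and the use of a seeded pseudo-random generator is an assumption made only so that the example is reproducible.

```python
import random

REWARD_PROBABILITY = 0.8  # random reward rate on trials with new test stimuli


def reward_delivered(is_base_pair, correct, rng):
    """Return True if a reward is given on this trial.

    Base-pair (old stimulus) trials: reward only when the response is correct.
    New test stimuli: reward with probability 0.8, regardless of correctness.
    """
    if is_base_pair:
        return correct
    return rng.random() < REWARD_PROBABILITY
```

      For example, `reward_delivered(False, False, random.Random(0))` draws a random reward for an incorrect response to a new test pair, whereas `reward_delivered(True, False, random.Random(0))` is always False.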

      The 1/3rd trials with old stimuli are not included in the accuracy calculations but were used as a quality check/control to investigate which sessions have to be excluded and to assure that the rats were still doing the task properly. We have added this in the manuscript.

      The authors were injecting noise with stimuli to cDNN to match its accuracy to rat. However, that noise can potentially interact with the signal in the cDNN and further influence the results. That could generate a hidden confound in the results. Can they acknowledge/discuss this possibility?

      Yes, adding noise can potentially interact with the signal and further influence the results. Without noise, the average training data of the network would lie around 100% which would be unrealistic, given the performances of the animals. To match the training performance of the neural networks with that of the rats, we added noise 100 times and averaged over these iterations (cfr. (Schnell et al., 2023; Vinken & Op de Beeck, 2021)).  
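
      The idea of repeating a noisy evaluation 100 times and averaging can be sketched as follows. This is not the procedure of the cited studies: where the noise enters the network is not specified in this response, so the sketch assumes, purely for illustration, Gaussian noise added to a signed decision variable (such as a distance to a classifier hyperplane). All names and the noise model are assumptions.

```python
import random


def noisy_accuracy(scores, labels, noise_sd, n_iterations=100, seed=0):
    """Average accuracy over repeated noisy evaluations.

    scores: signed decision values, one per trial (assumed noise-free)
    labels: +1 or -1, the correct sign for each trial
    noise_sd: standard deviation of the added Gaussian noise (assumption)
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_iterations):
        correct = 0
        for score, label in zip(scores, labels):
            noisy_score = score + rng.gauss(0.0, noise_sd)
            if (noisy_score > 0) == (label > 0):
                correct += 1
        accuracies.append(correct / len(scores))
    # Averaging over iterations reduces the influence of any single noise draw.
    return sum(accuracies) / n_iterations
```

      In such a scheme, the noise level would be tuned until the averaged training accuracy matches that of the animals.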

      Reviewer #2, weaknesses:

      1) There are a few inconsistencies in the number of subjects reported. Sometimes 45 humans are mentioned and sometimes 50. Probably they are just typos, but it's unclear.

      Thank you for your feedback. We have double-checked this and changed the number of subjects where necessary. We collected data from 50 human participants, but had to exclude 5 of them due to low performance during the quality check (Dimension learning) protocols. Similarly, we collected data from 12 rats but had to exclude one animal because of health issues. All these data exclusion steps were mentioned in the Methods section of the original version of the manuscript, but the subject numbers were not always properly adjusted in the description in the Results section. This is now corrected.

      2) A few aspects mentioned in the introduction and results are only defined in the Methods thus making the manuscript a bit hard to follow (e.g. the alignment dimension), thus I had to jump often from the main text to the methods to get a sense of their meaning.

      Thank you for your feedback. We have clarified some aspects in the Introduction, such as the alignment dimension.

      4) Many important aspects of the task are not fully described in the Methods (e.g. size of the stimuli, reaction times and basic statistics on the responses).

      We have added the size of the stimuli to the Methods section and clarified that the stimuli remained on the screen until the animals made a choice. Reaction time in our task would not be interpretable given that stimuli come on the screen when the animal initiates a trial with its back to the screen. Therefore we do not have this kind of information.

      Reviewer #1

      • Can the authors show all the high vs zero and zero vs high stimulus pairs either in the main or supplementary figures? It would be instructive to know if some other simple property covaried between these two sets.

      In Figure 1, all images of all protocols are shown. For the High vs. Zero and Zero vs. High protocols, we used a deep neural network to select a total of 7 targets and 7 distractors. This results in 49 image pairs (every combination of target-distractor).

      • Are there individual differences across animals? It would be useful for the authors to show individual accuracy for each animal where possible.

      We now added individual rat data for all test protocols – 1 colour per rat, black circle = average. We have added this picture to the Supplementary material (Supplementary Figure 1).

      • Figure 1 - it was not truly clear to me how many image pairs were used in the actual experiment. Also, it was very confusing to me what was the target for the test trials. Additionally, authors reported their task as a categorisation task, but it is a discrimination task.

      Figure 1 shows all the images that were used in this study. Every combination of every target-distractor in each protocol (except for Dimension learning) was presented to the animals. For example in Rotation X, the test stimuli as shown in Fig. 1 consisted of 6 targets and 6 distractors, resulting in a total of 36 image pairs for this test protocol.

      In each test protocol, the target corresponded to the concave object with horizontally attached spheres, or the object from the pair that in the stimulus space was closest to this object. We have added this clarification in the Introduction: “We started by training the animals in a base stimulus pair, with the target being the concave object with horizontally aligned spheres. Once the animals were trained in this base stimulus pair, we used the identity-preserving transformations to test for generalization.” as well as in the caption of Figure 1. We have changed the term “categorisation task” to “discrimination task” throughout the manuscript.

      • Figure 2 - what are the red and black lines? How many new pairs are being tested here? Panel labels are missing (a/b/c etc)

      We have changed this figure by adding panel labels, and clarifying the missing information in the caption. All images that were shown to the animals are presented in this figure. For Dimension Learning, only three image pairs were shown (base pair, concavity pair, alignment pair) and for the Transformations protocol, every combination of every target and distractor was shown, i.e. 25 image pairs in total.

      • Figure 3 - last panel: the 1st and 2nd distractor look identical.

      We understand your concern as these two distractors indeed look quite similar. They are different however in terms of how they are rotated along the x, y and z axes (see Author response image 1 for a bigger image of these two distractors). The similarity is due to the existence of near-symmetry in the object shape which causes high self-similarity for some large rotations.

      Author response image 1.

      • Line 542 – authors say they have ‘concatenated’ the performance of the animals, but do they mean they are taking the average across animals?

      It is both. In this specific analysis we calculated the performance of the animals, which was indeed averaged across animals, per test protocol, per stimulus pair. This resulted in 9 arrays (one for each test protocol) of several performances (1 for each stimulus pair). These 9 arrays were concatenated by linking them together in one big array (i.e. placing them one after the other). We did the same concatenation with the distance to hyperplane of the network on all nine test protocols. These two concatenated arrays with 287 values each (one with the animal performance and one with the DNN performance) were correlated.
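
      The concatenate-then-correlate step can be sketched as follows, assuming the per-protocol values are stored in dictionaries keyed by protocol name. The data layout, function names, and the use of a Pearson correlation are assumptions for this sketch.

```python
from math import sqrt


def concatenate(per_protocol, protocol_order):
    """Place the per-protocol arrays one after the other in a fixed order."""
    combined = []
    for name in protocol_order:
        combined.extend(per_protocol[name])
    return combined


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlate_across_protocols(animal_perf, network_dist, protocol_order):
    """Correlate animal performance with network distance-to-hyperplane.

    Both inputs map protocol name -> list with one value per stimulus pair;
    the same protocol order is used for both so that entries stay aligned.
    """
    x = concatenate(animal_perf, protocol_order)
    y = concatenate(network_dist, protocol_order)
    return pearson(x, y)
```

      Using the same protocol order for both arrays is what keeps each animal value paired with the network value for the same stimulus pair.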

      • Line 164 - What are these 287 image pairs - this is not clear.

      The 287 image pairs correspond to all image pairs of all 9 test protocols: 36 (Rotation X) + 36 (Rotation Y) + 36 (Rotation Z) + 4 (Size) + 25 (Position) + 16 (Light location) + 36 (Combination Rotation) + 49 (Zero vs. high) + 49 (High vs. zero) = 287 image pairs in total. We have clarified this in the manuscript.
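
      The total can be checked with a few lines of code, using only the per-protocol counts listed above. The variable names are illustrative.

```python
# Number of image pairs per test protocol, as listed in the response.
PAIRS_PER_PROTOCOL = {
    "Rotation X": 36,
    "Rotation Y": 36,
    "Rotation Z": 36,
    "Size": 4,
    "Position": 25,
    "Light location": 16,
    "Combination Rotation": 36,
    "Zero vs. high": 49,
    "High vs. zero": 49,
}

# Summing over the nine protocols gives the length of the concatenated arrays.
total_pairs = sum(PAIRS_PER_PROTOCOL.values())
```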

      • Line 215 - Human rat correlation (0.18) was comparable to the best cDNN layer correlation. What does this mean?

      The human rat correlation (0.18) was closest to the best cDNN layer - rat correlation (about 0.15). In the manuscript we emphasize that rat performance is not well captured by individual cDNN layers.  

      Reviewer #2

      Major comments

      • In l.23 (and in the methods) the authors mention 50 humans, but in l.87 they are 45. Also, both in l.95 and in the Methods the authors mention "twelve animals" but they wrote 11 elsewhere (e.g. abstract and first paragraph of the results).

      In our human study design, we introduced several Dimension learning protocols. These were later used as a quality check to indicate which participants were outliers, using outlier detection in R. This resulted in 5 outlying human participants, and thus we ended with a pool of 45 human participants that were included in the analyses. This information was given in the Methods section of the original manuscript, but we did not mention the correct numbers everywhere. We have corrected this in the manuscript. We also changed the number of participants (humans and rats) to the correct one throughout the entire manuscript.

      • At l.95 when I first met the "4x4 stimulus grid" I had to guess its meaning. It would be really useful to see the stimulus grid as a panel in Figure 1 (in general Figures S1 and S4 could be integrated as panels of Figure 1). Also, even if the description of the stimulus generation in the Methods is probably clear enough, the authors might want to consider adding a simple schematic in Figure 1 as well (e.g. show the base, either concave or convex, and then how the 3 spheres are added to control alignment).

      We have added the 4x4 stimulus grid in the main text.

      • There is also another important point related to the choice of the network. As I wrote, I find the overall approach very interesting and powerful, but I'm actually worried that AlexNet might not be a good choice. I have experience trying to model neuronal responses from IT in monkeys, and there even the higher layers of AlexNet aren't that helpful. I need to use much deeper networks (e.g. ResNet or GoogleNet) to get decent fits. So I'm afraid that what is deemed as "high" in AlexNet might not be as high as the authors think. It would be helpful, as a sanity check, to see if the authors get the same sort of stimulus categories when using a different, deeper network.

      We added a consideration to the manuscript about which network to use (see the Discussion): “We chose to work with Alexnet, as this is a network that has been used as a benchmark in many previous studies (e.g. (Cadieu et al., 2014; Groen et al., 2018; Kalfas et al., 2018; Nayebi et al., 2023; Zeman et al., 2020)), including studies that used more complex stimuli than the stimulus space in our current study. […] . It is in line with the literature that a typical deep neural network, AlexNet and also more complex ones, can explain human and animal behaviour to a certain extent but not fully. The explained variance might differ among DNNs, and there might be DNNs that can explain a higher proportion of rat or human behaviour. Most relevant for our current study is that DNNs tend to agree in terms of how representations change from lower to higher hierarchical layers, because this is the transformation that we have targeted in the Zero vs. high and High vs. zero testing protocols. (Pinto et al., 2008) already revealed that a simple V1-like model can sometimes result in surprisingly good object recognition performance. This aspect of our findings is also in line with the observation of Vinken & Op de Beeck (2021) that the performance of rats in many previous tasks might not be indicative of highly complex representations. Nevertheless, there is still a relative difference in complexity between lower and higher levels in the hierarchy. That is what we capitalize upon with the Zero vs. high and High vs. zero protocols. Thus, it might be more fruitful to explicitly contrast different levels of processing in a relative way rather than trying to pinpoint behaviour to specific levels of processing.”

      • The task description needs way more detail. For how long were the stimuli presented? What was their size? Were the positions of the stimuli randomized? Was it a reaction time task? Was the time-out used as a negative feedback? In case, when (e.g. mistakes or slow responses)? Also, it is important to report some statistics about the basic responses. What was the average response time, what was the performance of individual animals (over days)? Did they show any bias for a particular dimension (either the 2 baseline dimensions or the identity preserving ones) or side of response? Was there a correlation within animals between performance on the baseline task and performance on the more complex tasks?

      Thank you for your feedback. We have added more details to the task description in the manuscript.

      The stimuli were presented on the screens until the animals reacted to one of the two screens. The size of the stimuli was 100 x 100 pixels. The position of the stimuli was always centred/full screen on the touchscreens. It was not a reaction time task and we also did not measure reaction time.

      • Related to my previous comment, I wonder if the relative size/position of the stimulus with respect to the position of the animal in the setup might have had an impact on the performance, also given the impact of size shown in Figure 2. Was the position of the rat in the setup monitored (e.g. with DeepLabCut)? I guess that on average any effect of the animal position might be averaged away, but was this actually checked and/or controlled for?

      The position of the rat was not monitored in the setup. In a previous study from our lab (Crijns & Op de Beeck, 2019), we investigated the visual acuity of rats in the touchscreen setups by presenting gratings with different cycles per screen to see how it affects their performance in orientation discrimination. With the results from this study and general knowledge about rat visual acuity, we derived that the decision distance of rats lies around 12.5cm from the screen. We have added this to the discussion.

      Minor comments

      • l.33 The sentence mentions humans, but the references are about monkeys. I believe that this concept is universal enough not to require any citation to support it.

      Thank you for your feedback. We have removed the citations.

      • This is very minor and totally negligible. The acronym cDNN is not that common for convnets (and it's kind of similar to cuDNN), it might help clarity to stick to a more popular acronym, e.g. CNN or ANN. Also, given that the "high" layers used for stimulus selection were not convolutional layers after all (if I'm not mistaken).

      Thank you for your feedback. We have changed the acronym to ‘CNN’ in the entire manuscript.

      • In l.107-109 the authors identified a few potential biases in their stimuli, and they claim these biases cannot explain the results. However, the explanation is given only in the next pages. It might help to mention that before or to move that paragraph later, as I was just wondering about it until I finally got to the part on the brightness bias.

      We expanded the analysis of these dimensions (e.g. brightness) throughout the manuscript.

      • It would help a lot the readability to put also a label close to each dimension in Figures 2 and 3. I had to go and look at Figure S4 to figure that out.

      Figures 2 and 3 have been updated, also including changes related to other comments.

      • In Figure 2A, please specify what the red dashed line means.

      We have edited the caption of Figure 2: “Figure 2 (a) Results of the Dimension learning training protocol. The black dashed horizontal line indicates chance level performance and the red dashed line represents the 80% performance threshold. The blue circles on top of each bar represent individual rat performances. The three bars represent the average performance of all animals on the old pair (Old), the pair that differs only in concavity (Conc) and on the pair that differs only in alignment (Align). (b) Results of the Transformations training protocol. Each cell of the matrix indicates the average performance per stimulus pair, pooled over all animals. The columns represent the distractors, whereas the rows separate the targets. The colour bar indicates the performance correct.”

      • Related to that, why performing a binomial test on 80%? It sounds arbitrary.

      We performed the binomial test on 80% as 80% is our performance threshold for the animals.
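
      A minimal sketch of such a test, assuming a one-sided exact binomial test of whether performance falls below the 80% threshold. The function name and the trial counts in the usage note are hypothetical and are not data from the study.

```python
from math import comb


def binom_lower_tail_p(k: int, n: int, p0: float = 0.8) -> float:
    """Exact one-sided p-value P(X <= k) for X ~ Binomial(n, p0).

    Small values indicate that k correct trials out of n is unlikely
    if the true performance were at the threshold p0.
    """
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k + 1))


def significantly_below(k: int, n: int, p0: float = 0.8, alpha: float = 0.05) -> bool:
    """True if performance is significantly below the threshold p0."""
    return binom_lower_tail_p(k, n, p0) < alpha
```

      For example, `significantly_below(140, 200)` asks whether 140 correct trials out of 200 (made-up counts) fall significantly below the 80% threshold.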

      • The way the cDNN methods are introduced makes it sound like the authors actually fine-tuned the weights of AlexNet, while (if I'm not mistaken), they trained a classifier on the activations of a pre-trained AlexNet with frozen weights. It might be a bit confusing to readers. The rest of the paragraph instead is very clear and easy to follow.

      We think the most confusing sentence was “Figure 7 shows the performance of the network after training the network on our training stimuli for all test protocols.” We changed this sentence to “Figure 8 shows the performance of the network for each of the test protocols after training classifiers on the training stimuli using the different DNN layers.”

      Reviewer #3

      Main recommendations:

      Although it may not fully explain the entire pattern of visual behavior, it is important to discuss rat visual acuity and its impact on the perception of visual features in the stimulus set.

      We have added a paragraph to the Discussion that discusses the visual acuity of rats and its impact on perceiving the visual features of the stimuli.

      The authors observed a potential influence of image brightness on behavior during the dimension learning protocol. Was there a correlation between image brightness and the subsequent image transformations?

      We have added this to the Discussion: “To further investigate to which visual features the rat performance and human performance correlates best with, we calculated the correlation between rat performance and pixel similarity of the test image pairs, as well as the correlation between rat performance and brightness in the test image pairs. Here we found a correlation of 0.34 for pixel similarity and 0.39 for brightness, suggesting that these two visual features partly explain our results when compared to the full-set reliability of rat performance (0.58). If we perform the same correlation with the human performances, we get a correlation of 0.12 for pixel similarity and -0.12 for brightness. With the full-set reliability of 0.58 (rats) and 0.63 (humans) in mind, this suggests that even pixel similarity and brightness only partly explain the performances of rats and humans.”

      Did the rats rely on consistent visual features to perform the tasks? I assume the split-half analysis was on data pooled across rats. What was the average correlation between rats? Were rats more internally consistent (split-half within rat) than consistent with other rats?

      The split-half analysis was indeed performed on data pooled across rats. We checked whether rats are more internally consistent by comparing the split-half within correlations with the split-half between correlations. For the split-half within correlations, we split the data for each rat in two subsets and calculated the performance vectors (performance across all image pairs). We then calculated the correlation between these two vectors for each animal. To get the split-half between correlation, we calculated the correlation between the performance vector of every subset data of every rat with every other subset data from the other rats. Finally, we compared for each animal its split-half within correlation with the split-half between correlations involving that animal. The result of this paired t-test (p = 0.93, 95%CI [-0.09; 0.08]) suggests that rats were not internally more consistent.
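
      The comparison described above could be sketched as follows. This is a minimal sketch under stated assumptions: each rat's trials have already been split into two halves and summarised as performance vectors over image pairs, the between-rat correlations are averaged per animal before the paired test, and all names and numbers are hypothetical placeholders rather than data from the study.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def within_between(halves):
    """halves: dict rat_id -> (vec_a, vec_b), the two split-half performance vectors.

    Returns dict rat_id -> (within_r, mean_between_r), where the between
    correlations pair each half of this rat with each half of every other rat.
    """
    result = {}
    for rat, (a, b) in halves.items():
        between = [
            pearson(own, other)
            for other_rat, other_halves in halves.items()
            if other_rat != rat
            for own, other in product((a, b), other_halves)
        ]
        result[rat] = (pearson(a, b), mean(between))
    return result


def paired_t(diffs):
    """t statistic of a paired t-test on within-minus-between differences."""
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical proportions correct for three rats on four image pairs.
halves = {
    "rat1": ([0.9, 0.7, 0.6, 0.8], [0.8, 0.7, 0.5, 0.9]),
    "rat2": ([0.8, 0.6, 0.7, 0.9], [0.9, 0.6, 0.6, 0.8]),
    "rat3": ([0.7, 0.8, 0.5, 0.9], [0.9, 0.5, 0.6, 0.7]),
}
results = within_between(halves)
t_stat = paired_t([within - between for within, between in results.values()])
```

      A t statistic near zero would correspond to the reported outcome that rats were not more consistent with themselves than with other rats.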

      Discussion of the cDNN performance and its relation to rat behavior could be expanded and clarified in several ways:

      • The paper would benefit from further discussion regarding the low correlations between rat behavior and cDNN layers. Is the main message that cDNNs are not a suitable model for rat vision? Or can we conclude that the peak in mid layers indicates that rat behavior reflects mid-level visual processing? It would be valuable to explore what we currently know about the organization of the rat visual cortex and how applicable these models are to their visual system in terms of architecture and hierarchy.

      We added a consideration to the manuscript about which network to use (see Discussion).

      • The cDNN exhibited above chance performance in various early layers for several test protocols (e.g., rotations, light location, combination rotation). Does this limit the interpretation of the complexity of visual behavior required to perform these tasks?

      This is not uncommon to find. Pinto et al. (2008) already revealed that a simple V1-like model can sometimes result in surprisingly good object recognition performance. This aspect of our findings is also in line with the observation of Vinken & Op de Beeck (2021) that the performance of rats in many previous tasks might not be indicative of highly complex representations. Nevertheless, there is still a relative difference in complexity between lower and higher levels in the hierarchy. That is what we capitalize upon with the High vs zero and the Zero vs high protocols. Thus, it might be more fruitful to explicitly contrast different levels of processing in a relative way rather than trying to pinpoint behavior to specific levels of processing. This argumentation is added to the Discussion section.

      • How representative is the correlation profile between cDNN layers and behavior across protocols? Pooling stimuli across protocols may be necessary to obtain stable correlations due to relatively modest sample numbers. However, the authors could address how much each individual protocol influences the overall correlations in leave-one-out analyses. Are there protocols where rat behavior correlates more strongly with higher layers (e.g., when excluding zero vs. high)?

      We prefer to base our conclusions mostly on the pooled analyses rather than individual protocols. As the reviewer also mentions, we can expect that the pooled analyses will provide the most stable results. For information, we included leave-one-out analyses in the supplemental material. Excluding the Zero vs. High protocol did not result in a stronger correlation with the higher layers. It was rare to see correlations with higher layers, and in the one case that we did (when excluding High versus zero) the correlations were still higher in several mid-level layers.
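
      A leave-one-out analysis of this kind could be sketched as below, assuming per-image-pair performance values for the rats and for one network layer, grouped by protocol. The function names, protocol labels, and numbers are hypothetical placeholders, and the sketch would be repeated per layer.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def leave_one_protocol_out(rat_perf, layer_perf):
    """rat_perf / layer_perf: dict protocol -> list of per-image-pair performances
    (same protocols and same image-pair order in both).

    Returns dict left_out_protocol -> correlation pooled over the remaining protocols.
    """
    out = {}
    for left_out in rat_perf:
        rat = [v for p, vals in rat_perf.items() if p != left_out for v in vals]
        layer = [v for p, vals in layer_perf.items() if p != left_out for v in vals]
        out[left_out] = pearson(rat, layer)
    return out


# Made-up performances for three protocols and one layer.
rat_perf = {"rotation": [0.70, 0.65], "size": [0.80, 0.75], "zero_vs_high": [0.55, 0.60]}
layer_perf = {"rotation": [0.90, 0.70], "size": [0.95, 0.85], "zero_vs_high": [0.80, 0.50]}
loo = leave_one_protocol_out(rat_perf, layer_perf)
```

      Comparing the entries of `loo` with the correlation over all protocols indicates how strongly any single protocol drives the pooled result.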

      Author response image 2.

      • The authors hypothesize that the cDNN results indicate that rats rely on visual features such as contrast. Can this link be established more firmly? e.g., what are the receptive fields in the layers that correlate with rat behavior sensitive to?

      This hypothesis was made based on previous in-lab research (Schnell et al., 2023) where we found rats indeed rely on contrast features. In this study, we performed a face categorization task, parameterized on contrast features, and we investigated to what extent rats use contrast features to perform in a face categorization task. As in the current study, we used a DNN that was trained and tested on the same stimuli as the animals to investigate the representations of the animals. There, we found that the animals use contrast features to some extent and that this correlated best with the lower layers of the network. Hence, we would say that the lower layers correlate best with rat behaviour that is sensitive to contrast. Earlier layers of the network include local filters that simulate V1-like receptive fields. Higher layers of the network, on the other hand, are used for object selectivity.

      • There seems to be a disconnect between rat behavior and the selection of stimuli for the high (zero) vs. zero (high) protocols. Specifically, rat behavior correlated best with mid layers, whereas the image selection process relied on earlier layers. What is the interpretation when rat behavior correlates with higher layers than those used to select the stimuli?

      We agree that it is difficult to pinpoint a particular level of processing, and it might be better to use relative terms: lower/higher than. This is addressed in the manuscript by the edit in response to three comments back.

      • To what extent can we attribute the performance below the ceiling for many protocols to sensory/perceptual limitations as opposed to other factors such as task structure, motivation, or distractibility?

      We agree that these factors play a role in the overall performance difference. In Figure 5, the rightmost bar shows the percentage correct of all animals (light blue) vs all humans (dark blue) on the old pair that was presented during the testing protocol. Even here, the performance of the animals was lower than that of humans, and this pattern extended to the testing protocols as well. This was most likely due to motivation and/or distractibility, which we know can happen in both humans and rats but affects the rat results more with our methodology.

      Minor recommendations:

      • What was the trial-to-trial variability in the distance and position of the rat's head relative to the stimuli displayed on the screen? Can this variability be taken into account in the size and position protocols? How meaningful is the cDNN modelling of these protocols considering that the training and testing of the model does not incorporate this trial-to-trial variability?

      We have no information on this trial-to-trial variability. We do, however, have information on what rats typically do overall from an earlier paper that was mentioned in response to an earlier comment (Crijns & Op de Beeck, 2019).

      We have added a disclaimer in the Discussion on our lack of information on trial-to-trial variability.

      • Several of the protocols varied a visual feature dimension (e.g., concavity & alignment) relative to the base pair. Did rat performance correlate with these manipulations? How did rat behavior relate to pixel dissimilarity, either between target and distractor or in relation to the trained base pair?

      We have added this to the Discussion. See also our general comments in the Public responses.

      • What could be the underlying factor(s) contributing to the difference in accuracy between the "small transformations" depicted in Figure 2 and some of the transformations displayed in Figure 3? In particular, it seems that the variability of targets and distractors is greater for the "small transformations" in Figure 2 compared to the rotation along the y-axis shown in Figure 3.

      There are several differences between these protocols. Before considering the stimulus properties, we should take into account other factors. The Transformations protocol was a training protocol, meaning that the animals underwent several sessions in this protocol, always receiving real reward during the trials, and only stopping once a high enough performance was reached. For the protocols in Figure 3, the animals were also placed in these protocols for multiple sessions in order to obtain enough trials. The difference is that they did not receive real reward, and testing was also stopped if performance was still low.

      • In Figure 3, it is unclear which pairwise transformation accuracies were above chance. It would be helpful if the authors could indicate significant cells with an asterisk. The scale for percentage correct is cut off at 50%. Were there any instances where the behaviors were below 50%? Specifically, did the rats consistently choose the wrong option for any of the pairs? It would be helpful to add "old pair", "concavity" and "alignment" to x-axis labels in Fig 2A .

      We have added “old”, “conc” and “align” to the x-axis labels in Figure 2A.

      • Considering the overall performance across protocols, it seems overstated to claim that the rats were able to "master the task."

      When talking about “mastering the task”, we talk about the training protocols where we aimed that the animals would perform at 80% and not significantly less. We checked this throughout the testing protocols as well, where we also presented the old pair as quality control, and their performance was never significantly lower than our 80% performance threshold on this pair, suggesting that they mastered the task in which they were trained. To avoid discussion on semantics, we also rephrased “master the task” into “learn the task”.

      • What are the criteria for the claim that the "animal model of choice for vision studies has become the rodent model"? It is likely that researchers in primate vision may hold a different viewpoint, and data such as yearly total publication counts might not align with this claim.

      Primate vision is important for investigating complex visual aspects. With the advancements in experimental techniques for rodent vision, e.g. genetics and imaging techniques as well as behavioural tasks, the rodent model has become an important model as well. It is not necessarily an “either” or “or” question (primates or rodents), but more a complementary issue: using both primates and rodents to unravel the full picture of vision.

      We have changed this part in the introduction to “Lately, the rodent model has become an important model in vision studies, motivated by the applicability of molecular and genetic tools rather than by the visual capabilities of rodents”.

      • The correspondence between the list of layers in Supplementary Tables 8 and 9 and the layers shown in Figures 4 and 6 could be clarified.

      We have clarified this in the caption of Figure 7.

      • The titles in Figures 4 and 6 could be updated from "DNN" to "cDNN" to ensure consistency with the rest of the manuscript.

      Thank you for your feedback. We have changed the titles in Figures 4 and 6 such that they are consistent with the rest of the manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      (1) Potential bleed-over across frequencies in the spectral domain is a major concern for all of the results in this paper. The fact that alpha power, 36Hz and 40Hz frequency-tagged amplitude and 4Hz intermodulation frequency power is generally correlated with one another amplifies this concern. The authors are attaching specific meaning to each of these frequencies, but perhaps there is simply a broadband increase in neural activity when anticipating an auditory target compared to a visual target?

      We appreciate the reviewer’s insightful comment regarding the potential bleed-over across frequencies in the spectral domain. We fully acknowledge that the trade-off between temporal and frequency resolution is a challenge, particularly given the proximity of the frequencies we are examining.

      To address this concern, we performed additional analyses to investigate whether there is indeed a broadband increase in neural activity when anticipating an auditory target as compared to a visual target, as opposed to distinct frequency-specific effects. Our results show that the bleed-over between frequencies is minimal and does not significantly affect our findings. Specifically, we repeated the analyses using the same filter and processing steps for the 44 Hz frequency. At this frequency, we did not observe any significant differences between conditions.

      These findings suggest that the effects we report are indeed specific to the 40 Hz frequency band and not due to a general broadband increase in neural activity. We hope this addresses the reviewer’s concern and strengthens the validity of our frequency-specific results. We have now added this analysis to the methods section of our manuscript.

      Line 730: To confirm that 4 Hz is a sufficient distance between tagging frequencies, we repeated the analysis for 43.5 to 44.5 Hz. We found no indication of frequency bleed-over, as the effects observed at 40 Hz were not present at 44 Hz (see SUPPL Fig. 11).
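
      The logic of this control can be illustrated on synthetic data. The sketch below is not the filter and Hilbert transform pipeline used in the manuscript: a single-frequency Fourier projection stands in for it, and the sampling rate, window length, and amplitudes are hypothetical. Because the window contains whole cycles of every frequency, it shows the idealised case in which energy at 36 and 40 Hz does not appear at 44 Hz.

```python
from math import cos, hypot, pi, sin


def amplitude_at(signal, freq, fs):
    """Amplitude of the sinusoidal component at `freq` (Hz), estimated by
    projecting the signal onto a cosine and a sine at that frequency."""
    n = len(signal)
    re = sum(s * cos(2 * pi * freq * i / fs) for i, s in enumerate(signal))
    im = sum(s * sin(2 * pi * freq * i / fs) for i, s in enumerate(signal))
    return 2 * hypot(re, im) / n


fs = 1000        # hypothetical sampling rate in Hz
duration = 2.0   # hypothetical analysis window in seconds
t = [i / fs for i in range(int(fs * duration))]

# Synthetic signal: a 36 Hz "visual" tag plus a weaker 40 Hz "auditory" tag.
signal = [1.0 * sin(2 * pi * 36 * x) + 0.5 * sin(2 * pi * 40 * x) for x in t]

# Amplitude at the two tagging frequencies and at the 44 Hz control frequency.
amps = {f: amplitude_at(signal, f, fs) for f in (36, 40, 44)}
```

      In this idealised setting the estimate at 44 Hz is essentially zero, which is the pattern a control frequency should show when neighbouring tags do not leak into it.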

      We do not, however, specifically argue against the possibility of a broadband increase in sensory processing when anticipating an auditory compared to a visual target. But even a broadband increase would directly contradict the alpha inhibition hypothesis, which posits that an increase in alpha completely disengages the whole cortex. We have now made this clearer in the text.

      Line 491: As auditory targets were significantly more difficult than visual targets in our first study and of comparable difficulty in our second study, these results strongly speak to a vigilance increase of sensory processing independent of modality and an inability to selectively disengage one sensory modality in anticipation of a demanding task. This view is consistent with previous work in which visual SSEPs elicited by irrelevant background stimulation increased with task load in an auditory discrimination task (Jacoby et al., 2012).

      (2) Moreover, 36Hz visual and 40Hz auditory signals are expected to be filtered in the neocortex. Applying standard filters and Hilbert transform to estimate sensory evoked potentials appears to rely on huge assumptions that are not fully substantiated in this paper. In Figure 4, 36Hz "visual" and 40Hz "auditory" signals seem largely indistinguishable from one another, suggesting that the analysis failed to fully demix these signals.

      We appreciate the reviewer’s insightful concern regarding the filtering and demixing of the 36 Hz visual and 40 Hz auditory signals, and we share the same reservations about the reliance on standard filters and the Hilbert transform method.

      To address this, we would like to draw attention to SUPPL Fig. 11, which demonstrates that a 4 Hz difference is sufficient to effectively demix the signals using our chosen filtering and Hilbert transform approach. We argue that the reason the 36 Hz visual and 40 Hz auditory signals show similar topographies lies not in incomplete demixing but rather in the possibility that this condition difference reflects sensory integration, rather than signal contamination.

      This interpretation is further supported by our findings with the intermodulation frequency at 4 Hz, which also suggests cross-modal integration. Furthermore, source localization analysis revealed that the strongest condition differences were observed in the precuneus, an area frequently associated with sensory integration processes. We have now expanded on this in the discussion section to better clarify this point.

      Line 578: Previous research has shown that simultaneous frequency-tagging at multiple frequencies can evoke a response at the intermodulation frequency (f1 – f2), which in multimodal settings is thought to reflect cross-modal integration (Drijvers et al., 2021). This concept aligns closely with our findings, where increased vigilance in the sensory system, prompted by anticipation of a difficult auditory target, resulted in an increase in the intermodulation frequency. Similarly, our data shows that visual signal enhancement was localized in the precuneus, further supporting the role of this region in sensory integration (Al-Ramadhani et al., 2021; Xie et al., 2019).
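
      Why an intermodulation component points to an interaction between the two tagged signals can be illustrated numerically. The sketch below assumes a purely multiplicative interaction between a 36 Hz and a 40 Hz sinusoid; this is an illustrative choice of nonlinearity, not a claim about the neural mechanism, and the sampling rate and window are hypothetical. A linear sum of the two signals has no energy at 4 Hz, whereas their product does, since sin(a)·sin(b) = ½[cos(a − b) − cos(a + b)].

```python
from math import cos, hypot, pi, sin


def amplitude_at(signal, freq, fs):
    """Amplitude of the sinusoidal component at `freq` (Hz), estimated by
    projecting the signal onto a cosine and a sine at that frequency."""
    n = len(signal)
    re = sum(s * cos(2 * pi * freq * i / fs) for i, s in enumerate(signal))
    im = sum(s * sin(2 * pi * freq * i / fs) for i, s in enumerate(signal))
    return 2 * hypot(re, im) / n


fs = 1000  # hypothetical sampling rate in Hz
t = [i / fs for i in range(2 * fs)]  # 2 s window

visual = [sin(2 * pi * 36 * x) for x in t]    # 36 Hz tag
auditory = [sin(2 * pi * 40 * x) for x in t]  # 40 Hz tag

# Linear superposition: the two signals never interact.
linear = [v + a for v, a in zip(visual, auditory)]
# Superposition plus a multiplicative interaction term (assumed nonlinearity).
nonlinear = [v + a + v * a for v, a in zip(visual, auditory)]

im_linear = amplitude_at(linear, 4, fs)        # intermodulation, f2 - f1 = 4 Hz
im_nonlinear = amplitude_at(nonlinear, 4, fs)
```

      Only the signal containing the interaction term carries a 4 Hz component, which is why a response at f1 – f2 is read as evidence that the two inputs were combined rather than processed separately.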

      (3) The asymmetric results in the visual and auditory modalities preclude a modality-general conclusion about the function of alpha. However, much of the language seems to generalize across sensory modalities (e.g., use of the term 'sensory' rather than 'visual').

      We agree that in some cases we have not made a sufficient distinction between visual and sensory. We have now made sure that when using ‘sensory’, we either describe overall theories, which are not visual-exclusive, or refer to the possibility of a broad sensory increase. However, when directly discussing our results and the interpretation thereof, we now use ‘visual’.

      (4) In this vein, some of the conclusions would be far more convincing if there was at least a trend towards symmetry in source-localized analyses of MEG signals. For example, how does alpha power in primary auditory cortex (A1) compare when anticipating auditory vs visual target? What do the frequency tagged visual and auditory responses look like when just looking at primary visual cortex (V1) or A1?

      We thank the reviewer for this important suggestion and have added a virtual channel analysis. We were, however, not interested in alpha power in primary auditory cortex, as we were specifically interested in the posterior alpha, which is usually increased when expecting an auditory compared to a visual target (and used to be interpreted as a blanket inhibition of the visual cortex). We have now improved the clarity concerning this point in the manuscript.

      We have, however, followed the reviewer’s suggestion of a virtual channel analysis, showing that the condition differences are not observable in primary visual cortex for the 36 Hz visual signal and in primary auditory cortex for the 40 Hz auditory signal. Our data clearly show that there is an alpha condition difference in V1, while there is no condition difference for 36 Hz in V1 and for 40 Hz in Heschl’s gyrus.

      Line 356: Additionally, we replicated this effect with a virtual channel analysis in V1 (see SUPPL Fig. 12).

      Line 403: Furthermore, a virtual channel analysis in V1 and Heschl’s gyrus confirmed that there were no condition differences in primary visual and auditory areas (see SUPPL Fig. 12).

      (5) Blinking would have a huge impact on the subject's ability to ignore the visual distractor. The best thing to do would be to exclude from analysis all trials where the subjects blinked during the cue-to-target interval. The authors mention that in the MEG experiment, "To remove blinks, trials with very large eye-movements (> 10 degrees of visual angle) were removed from the data (See supplement Fig. 5)." This sentence needs to be clarified, since eye-movements cannot be measured during blinking. In addition, it seems possible to remove putative blink trials from EEG experiments as well, since blinks can be detected in the EEG signals.

      We agree with the reviewer that this point has been phrased in a confusing way. From the MEG data, we removed eyeblinks using ICA. For the supplementary Fig. 5 analysis, we used the eye-tracking data to make sure that participants were in fact fixating the centre of the screen. For this analysis, we removed trials with blinks (which can be seen in the eye-tracker as huge amplitude movements or as large eye-movements in degrees of visual angle; the figure below shows a blink in the MEG data and the corresponding eye-tracker data in degrees of visual angle). We have now clarified this in the methods section.

      As for the concern that participants could close their eyes to ignore visual distractors, in both experiments we observe a highly significant distractor cost in accuracy for visual distractors, which we hope will convince the reviewer that our visual distractors were working as intended.

      Author response image 1.

      Illustration of eye-tracker data for a trial without and a trial with a blink. All data points recorded during this trial are plotted. A, ICA component 1, which reflects blinks, and its corresponding data trace in a trial. No blink is visible. B, eye-tracker data transformed into degrees of visual angle for the trial depicted in A. C, ICA component 1, which reflects blinks, and its corresponding data trace in a trial. A clear blink is visible. D, eye-tracker data transformed into degrees of visual angle for the trial depicted in C.

      Line 676: To confirm that participants had focused on the fixation cross during the cue-to-target interval, we incorporated eye-tracking into our MEG-experiment (EyeLink 1000 Plus). Correct trials of the second block were analysed for vertical and horizontal eye-movements. To exclude blinks from this analysis, trials with very large eye-movements (> 10 degrees of visual angle) were removed from the eye-tracking data (See suppl Fig. 5).
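
      The exclusion rule quoted above could be sketched as a simple trial filter. This is a minimal sketch, assuming gaze samples are already expressed in degrees of visual angle relative to the fixation cross and that the 10 degree criterion is applied to the horizontal and vertical components separately; the function name and data layout are hypothetical.

```python
def exclude_blink_trials(trials, max_deg=10.0):
    """Keep only trials whose gaze never deviates more than `max_deg`
    degrees of visual angle from fixation, horizontally or vertically.

    trials: list of trials, each a list of (x_deg, y_deg) gaze samples.
    """
    kept = []
    for samples in trials:
        # A blink shows up as a very large apparent eye movement.
        if all(abs(x) <= max_deg and abs(y) <= max_deg for x, y in samples):
            kept.append(samples)
    return kept
```

      Trials that pass the filter can then be analysed for residual horizontal and vertical eye movements during the cue-to-target interval.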

      (6) It would be interesting to examine the neutral cue trials in this task. For example, comparing auditory vs visual vs neutral cue conditions would be indicative of whether alpha was actively recruited or actively suppressed. In addition, comparing spectral activity during cue-to-target period on neutral-cue auditory correct vs incorrect trials should mimic the comparison of auditory-cue vs visual-cue trials. Likewise, neutral-cue visual correct vs incorrect trials should mimic the attention-related differences in visual-cue vs auditory-cue trials.

      We have analysed the neutral cue trials in the EEG dataset (see suppl. Fig. 1). There were no significant differences to auditory or visual cues, but descriptively alpha power was higher for neutral cues compared to visual cues and lower for neutral cues compared to auditory cues. While this may suggest that for visual trials alpha is actively suppressed and for auditory trials actively recruited, we do not feel comfortable making this claim, as the neutral condition may not reflect a completely neutral state. The neutral task can still be difficult, especially because of the uncertainty of the target modality.

      As for the analysis of incorrect versus correct trials, we appreciate the idea, but unfortunately the accuracy rate was quite high so that the number of incorrect trials is insufficient to perform a reliable analysis.

      (7) In the abstract, the authors state that "This implies that alpha modulation does not solely regulate 'gain control' in early sensory areas but rather orchestrates signal transmission to later stages of the processing stream." However, I don't see any supporting evidence for the latter claim, that alpha orchestrates signal transmission to later stages of the processing stream. If the authors are claiming an alternative function to alpha, this claim should be strongly substantiated.

      We thank the reviewer for pointing out that we have not sufficiently explained our case. The first point refers to gain control as elucidated by the alpha inhibition hypothesis, which claims that increases in alpha disengage an entire cortical area. Since we have confirmed through source analysis that the alpha increase in our data originates from primary visual cortex, this should lead to decreased visual processing. The increase in 36 Hz visual processing therefore directly contradicts the alpha inhibition hypothesis. We propose an alternative explanation for the functionality of alpha activity in this task. Through pulsed inhibition, information packages of relevant visual information could be transmitted down the processing stream, thereby enhancing relevant visual signal transmission. We argue that our claim is supported by the fact that the enhanced visual 36 Hz signal correlated with visual alpha power on a trial-by-trial basis and originated not from primary visual cortex but from areas known for sensory integration.

      We have now tried to make this point clearer by rephrasing our manuscript. Additionally, we have also now further clarified this point in our discussion.

      Line 527: Our data provides evidence in favour of this view, as we can show that early sensory alpha activity covaries over trials with SSEP magnitude in higher order sensory areas. If alpha activity exerted gain control in early visual regions, increased alpha activity would have to lead to a decrease in SSEP responses. In contrast, we observe that increased alpha activity originating from early visual cortex is related to enhanced visual processing. Source localization confirmed that this enhancement was not originating from early visual areas, but from areas associated with later stages of the processing stream such as the precuneus, which has been connected to sensory integration (Al-Ramadhani et al., 2021; Xie et al., 2019). While we cannot completely rule out alternative explanations, it seems plausible to assume that inhibition of other task-irrelevant communication pathways leads to prioritised and thereby enhanced processing over relevant pathways. In line with previous literature (Morrow et al., 2023; Peylo et al., 2021; Zhigalov & Jensen, 2020b), we therefore suggest that alpha activity limits task-irrelevant feedforward communication, thereby enhancing processing capabilities in relevant downstream areas (see Fig. 1A).

      Reviewer #1 (Recommendations for the authors):

      Minor Concerns:

      (1) I suggest adding more details about the task in the Results and/or Figure 1 legend. Specifically, when describing the task, I think it would help the readers if the authors specified what the participants had to do to get a trial correct (e.g., press left / down / right arrow if the tone pitch was low (500Hz) / medium (1000Hz) / high (2000Hz).)

      (2) Please clarify whether the Gabor patch was drifting.

      (3) Figure 2C-D: I suggest clarifying in the X-tick labels that + and - trials are in separate blocks (e.g., put 'Block1 visual-' instead of 'visual-').

      We followed the suggestions of the reviewer detailed in points 1-3, which indeed greatly improve the clarity and readability of these parts.

      (4) "Interestingly, auditory distractors reduced reaction times to visual targets, which could be explained by a generally faster processing of auditory targets (Jain et al., 2015), possibly probing faster responses in visual tasks (Naue et al., 2011)." - Please elaborate on how faster processing of auditory targets could lead to the probing of faster responses in visual tasks. Further, if I understand correctly, this should result in a speed-accuracy trade-off, which is not observed in the MEG experiments. If there is a learning effect due to the blocked structure in the MEG experiments, why is it not observed on auditory trials?

      We thank the reviewer for suggesting clarifying this paragraph. We have now rephrased this part and added additional information.

      Concerning the reviewer’s theory, intersensory facilitation can occur in the absence of a speed-accuracy trade-off, as it can affect the motor execution after a decision has been made. Nevertheless, learning effects could also have led to this result in the MEG experiment. Our difficulty calibration did not lead to comparable accuracies in block 1, where auditory targets wetre now less difficult than visual targets. Whith the addition of distractors in block 2, accuracy for auditory targets decreased, while it increased for visual targets. Indeed, one interpretation could be that there was a learning effect for visual targets, which was not prevalent for auditory targets. However, the speed increase when visual targets are coupled with auditory distractors is prevalent in both experiments. Accordingly, we find the intersensory facilitation account more likely.

      Line 148: Interestingly, auditory distractors reduced reaction times to visual targets, which could be explained by a generally faster processing of auditory targets (Jain et al., 2015). As such, the auditory distractor possibly caused intersensory facilitation (Nickerson, 1973), whereby reaction times to a target can be facilitated when accompanied by stimuli of other sensory modalities, even if they are irrelevant or distracting.

      (5) Please briefly describe the cluster permutation analysis in the results section.

      We have now added a brief description of the cluster permutation analysis we performed in the results section.

      Line 166: We then applied cluster permutation analysis, whereby real condition differences were tested against coincidental findings by randomly permuting the condition labels of the data and testing for condition differences 1000 times (Maris & Oostenveld, 2007).
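
      The permutation logic described above can be illustrated with a minimal sketch. It is not the analysis pipeline used in the manuscript: it assumes a paired design reduced to one difference time course per participant, handles adjacency in time only (no sensor neighbourhood), and uses an arbitrary cluster-forming threshold of |t| > 2. For paired data, randomly flipping the sign of a participant's difference is equivalent to swapping that participant's condition labels. All function names are hypothetical.

```python
import math
import random


def t_values(diffs):
    """One-sample t-value per time point for paired differences (participants x time)."""
    n = len(diffs)
    out = []
    for j in range(len(diffs[0])):
        col = [row[j] for row in diffs]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        out.append(mean / math.sqrt(var / n) if var > 0 else 0.0)
    return out


def max_cluster_mass(tvals, threshold):
    """Largest absolute summed t-value over runs of adjacent supra-threshold points of equal sign."""
    best, mass, sign = 0.0, 0.0, 0
    for t in tvals:
        s = (t > threshold) - (t < -threshold)  # +1, -1, or 0
        if s != 0 and s == sign:
            mass += t          # extend the current cluster
        else:
            mass = t if s != 0 else 0.0  # start a new cluster or reset
        sign = s
        best = max(best, abs(mass))
    return best


def cluster_permutation_test(diffs, threshold=2.0, n_perm=1000, seed=0):
    """Return the observed maximum cluster mass and its permutation p-value."""
    rng = random.Random(seed)
    observed = max_cluster_mass(t_values(diffs), threshold)
    exceed = 0
    for _ in range(n_perm):
        # Sign flip per participant = swapping the two condition labels.
        flipped = []
        for row in diffs:
            sgn = rng.choice((-1.0, 1.0))
            flipped.append([sgn * v for v in row])
        if max_cluster_mass(t_values(flipped), threshold) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Simulated data (assumption): 20 participants, 50 time points,
    # with a condition difference at time points 20-29.
    rng = random.Random(1)
    data = [[rng.gauss(0, 1) + (1.5 if 20 <= j < 30 else 0.0) for j in range(50)]
            for _ in range(20)]
    print(cluster_permutation_test(data))
```

      Comparing the observed cluster against the maximum cluster of each permutation is what controls for the many time points tested.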

      (6) Figure 4A legend: "auditory steady-state evoked potential (ASSEP) averaged over 6 central electrodes displaying the highest 40 Hz power (Fz, FC1, FC2, F11, F2, FCz)." - I suggest marking these 6 electrodes in the scalp map on the figure panel.

      We have followed the suggestion of the reviewer and marked the electrodes/sensors used to illustrate the steady-state responses.

      (7) Lines 281-283: "It was highly significant for the visual 36 Hz response (Fig. 5A, middle columns, p = .033; t(19) = 2.29; BF(10) = 1.91) but did not reach significance for the visual 40 Hz response (Fig. 5B, middle column; p = 0.20; t(19) = 1.32; BF(10) = 0.49)." - Was "visual 40Hz response" a typo? I believe 40Hz pertains to auditory, not visual?

      We thank the reviewer for pointing out this error and agree that the phrasing was sometimes confusing. We have now used the terms VSSEP and ASSEP to make things clearer throughout the manuscript.

      L. 224-229: The median split was highly significant for the 36 Hz VSSEP response (Fig. 5A, middle columns, p = .033; t<sub>(19)</sub> = 2.29; BF<sub>(10)</sub> = 1.91) but did not reach significance for the 40 Hz ASSEP response (Fig. 5B, middle column; p = 0.20; t<sub>(19)</sub> = 1.32; BF<sub>(10)</sub> = 0.49).

      Reviewer #2 (Public review):

      Brickwedde et al. investigate the role of alpha oscillations in allocating intermodal attention. A first EEG study is followed up with an MEG study that largely replicates the pattern of results (with small to be expected differences). They conclude that a brief increase in the amplitude of auditory and visual stimulus-driven continuous (steady-state) brain responses prior to the presentation of an auditory - but not visual - target speaks to the modulating role of alpha that leads them to revise a prevalent model of gating-by-inhibition.

      Overall, this is an interesting study on a timely question, conducted with methods and analysis that are state-of-the-art. I am particularly impressed by the authors' decision to replicate the earlier EEG experiment in MEG following the reviewer's comments on the original submission. Evidently, great care was taken to accommodate the reviewers' suggestions.

      We thank the reviewer for the positive feedback and expression of interest in the topic of our manuscript.

      Nevertheless, I am struggling with the report for two main reasons: It is difficult to follow the rationale of the study, due to structural issues with the narrative and missing information or justifications for design and analysis decisions, and I am not convinced that the evidence is strong, or even relevant enough for revising the mentioned alpha inhibition theory. Both points are detailed further below.

      We have now revised major parts of the introduction and results in line with the reviewer’s suggestions, hoping that our rationale is now easier to follow and that our evidence will now be more convincing. We have separated our results section into the first study (EEG) and the second study (MEG), to enhance the rationale of our design choices and readability. We have clarified all mentioned ambiguous parts in our methods section. Additionally, we have revised the introduction to now explain more clearly what results to expect under the alpha inhibition theory in contrast to our alternative account.

      Strength/relevance of evidence for model revision: The main argument rests on 1) a rather sustained alpha effect following the modality cue, 2) a rather transient effect on steady-state responses just before the expected presentation of a stimulus, and 3) a correlation between those two. Wouldn't the authors expect a sustained effect on sensory processing, as measured by steady-state amplitude irrespective of which of the scenarios described in Figure 1A (original vs revised alpha inhibition theory) applies? Also, doesn't this speak to the role of expectation effects due to consistent stimulus timing? An alternative explanation for the results may look like this: Modality-general increased steady-state responses prior to the expected audio stimulus onset are due to increased attention/vigilance. This effect may be exclusive (or more pronounced) in the attend-audio condition due to higher precision in temporal processing in the auditory sense or, vice versa, too smeared in time due to the inferior temporal resolution of visual processing for the attend-vision condition to be picked up consistently. As expectation effects will build up over the course of the experiment, i.e., while the participant is learning about the consistent stimulus timing, the correlation with alpha power may then be explained by a similar but potentially unrelated increase in alpha power over time.

      We thank the reviewer for raising these insightful questions and suggestions.

      It is true that our argument rests on a rather sustained alpha effect and a rather transient effect on steady-state responses, and a correlation between the two. However, this connection would not be expected under the alpha inhibition hypothesis, which states that alpha activity would inhibit a whole cortical area (when irrelevant to the task), exerting “gain control”. This notion directly contradicts our results of the “irrelevant” visual information a) being transmitted at all and b) increasing.

      However, it has been shown in various reports (see for instance Dugué et al., 2011; Haegens et al., 2011; Spaak et al., 2012) that alpha activity exerts pulsed inhibition, so we proposed an alternative theory of an involvement in signal transmission. In this case, the cyclic inhibition would serve as an ordering system, which only allows for high-priority information to pass, resulting in higher signal-to-noise ratio. We do not make a claim about how fast or when these signals are transmitted in relation to alpha power. For instance, it could be that alpha power increases as a preparatory state even before a signal is actually transmitted. Zhigalov & Jensen (2020, Hum. Brain Mapp.) have shown that in V1, frequency-tagging responses were up- and down-regulated with attention – independent of alpha activity.

      However, we do believe that the trial-by-trial correlation between visual alpha power and visual 36 Hz frequency-tagging increases (see Fig. 5 and 10 in our manuscript), a relationship which has not been found in V1 by us and others (see Suppl. Fig. 12 and Zhigalov & Jensen, 2020, Hum. Brain Mapp.), suggests a strong connection. Furthermore, the fact that the alpha modulation originates from early visual areas and occurs prior to any frequency-tagging changes, while the increase in frequency-tagging can be observed in areas which are later in the processing stream (such as the precuneus), is strongly indicative of an involvement of alpha power in the transmission of this signal. We cannot fully exclude alternative accounts and mechanisms which affect both alpha power and frequency-tagging responses.

      The alternative account described by the reviewer does not contradict our theory, as we argue that the alpha power modulation reflects an expectation effect (and the idea that it could be related to the resolution of auditory versus visual processing is very interesting!). It is also possible that this expectation is, as the reviewer suggests, related to attention/vigilance and might result in a modality-general signal increase. By way of support, we observed an increase in the frequency-tagging response in sensory integration areas. Accordingly, we argue that the alternative explanation provided by the reviewer contradicts the alpha inhibition hypothesis, but not necessarily our alternative theory.

      We have now revised the discussion and are confident our case is now stronger and easier to follow. Additionally, we mentioned the possibility of alternative explanations as well as the possibility that alpha networks fulfil different roles in different locations/task environments.

      Line 523: Here we propose that alpha activity, rather than modulating early primary sensory processing, exhibits its inhibitory effects at later stages of the processing stream (Antonov et al., 2020; Gundlach et al., 2020; Zhigalov & Jensen, 2020a; Zumer et al., 2014), gating feedforward or feedback communication between sensory areas (Bauer et al., 2020; Haegens et al., 2015; Uemura et al., 2021). Our data provides evidence in favour of this view, as we can show that early sensory alpha activity covaries over trials with SSEP magnitude in higher order sensory areas. If alpha activity exerted gain control in early visual regions, increased alpha activity would have to lead to a decrease in SSEP responses. In contrast, we observe that increased alpha activity originating from early visual cortex is related to enhanced visual processing. Source localization confirmed that this enhancement was not originating from early visual areas, but from areas associated with later stages of the processing stream such as the precuneus, which has been connected to sensory integration (Al-Ramadhani et al., 2021; Xie et al., 2019). While we cannot completely rule out alternative explanations, it seems plausible to assume that inhibition of other task-irrelevant communication pathways leads to prioritised and thereby enhanced processing over relevant pathways. In line with previous literature (Morrow et al., 2023; Peylo et al., 2021; Zhigalov & Jensen, 2020b), we therefore suggest that alpha activity limits task-irrelevant feedforward communication, thereby enhancing processing capabilities in relevant downstream areas (see Fig. 1A).

      References:

      Dugué, L., Marque, P., & VanRullen, R. (2011). The phase of ongoing oscillations mediates the causal relation between brain excitation and visual perception. Journal of Neuroscience, 31(33), 11889–11893. https://doi.org/10.1523/JNEUROSCI.1161-11.2011

      Haegens, S., Nácher, V., Luna, R., Romo, R., & Jensen, O. (2011). α-Oscillations in the monkey sensorimotor network influence discrimination performance by rhythmical inhibition of neuronal spiking. Proceedings of the National Academy of Sciences, 108(48), 19377–19382. https://doi.org/10.1073/PNAS.1117190108

      Spaak, E., Bonnefond, M., Maier, A., Leopold, D. A., & Jensen, O. (2012). Layer-Specific Entrainment of Gamma-Band Neural Activity by the Alpha Rhythm in Monkey Visual Cortex. Current Biology, 22(24), 2313–2318. https://doi.org/10.1016/J.CUB.2012.10.020

      Zhigalov, A., & Jensen, O. (2020). Alpha oscillations do not implement gain control in early visual cortex but rather gating in parieto-occipital regions. Human Brain Mapping, 41(18), 5176–5186. https://doi.org/10.1002/hbm.25183

      Structural issues with the narrative and missing information: Here, I am mostly concerned with how this makes the research difficult to access for the reader. I list some major points, followed by more specific points below:

      In the introduction the authors pit the original idea about alpha's role in gating against some recent contradictory results. If it's the aim of the study to provide evidence for either/or, predictions for the results from each perspective are missing. Also, it remains unclear how this relates to the distinction between original vs revised alpha inhibition theory (Fig. 1A). Relatedly, if this revision is an outcome rather than a postulation for this study, it shouldn't be featured in the first figure.

      We agree with the reviewer that we have not sufficiently clarified our goal as well as how different functionalities of alpha oscillations would lead to different outcomes. We have revised the introduction and restructured the results part and hope that it is now easier to follow. The results part now follows study 1 (EEG) and study 2 (MEG) chronologically, so that results can more easily be differentiated and our design choices for the second study can be explained better.

      Line 50: Recent evidence challenged a direct connection between alpha activity and visual information processing in early visual cortex. As such, both visual steady-state responses and alpha power were modulated by attention, but did not covary when investigating individual trials (Zhigalov & Jensen, 2020). Unfortunately, very few studies have investigated direct connections between alpha activity, attention and sensory signals, especially over trials. Furthermore, results seem to depend on timing of alpha activity in relation to sensory responses as well as stimulus type and outcome measure (Morrow et al., 2023).

      Accordingly, the objective of the current study is to test the alpha inhibition hypothesis compared to an alternative theory. Based on the alpha inhibition hypothesis, alpha modulation is connected to ‘gain control’ in early visual areas through modulation of excitability (Foxe & Snyder, 2011; Jensen & Mazaheri, 2010; Van Diepen et al., 2019).  In contrast, we propose that inhibitory effects of alpha modulation are exhibited at later stages of the processing stream (Peylo et al., 2021; Yang et al., 2023; Zhigalov & Jensen, 2020a; Zumer et al., 2014), gating feedforward or feedback communication between sensory areas (see Fig. 1B; Bauer et al., 2020; Haegens et al., 2015; Uemura et al., 2021).

      Line 80: The aim of our study was to directly test the alpha inhibition hypothesis by investigating if cue-induced modulation of alpha activity coincides with the suppression of frequency-tagging responses in task-irrelevant modalities.

      Line 99: In brief, while we observed the expected cue-induced early-visual alpha modulation, the amplitude of auditory and visual SSEP/SSEFs as well as their intermodulation frequency increased just prior to the onset of the auditory target, contradicting the alpha inhibition hypothesis. The difference between conditions of visual SSEP/SSEFs originated from sensory integration areas and correlated with early sensory alpha activity on a trial-by-trial basis, speaking to an effect of alpha modulation on signal transmission rather than inhibition of early visual areas.

      The analysis of the intermodulation frequency makes a surprise entrance at the end of the Results section without an introduction as to its relevance for the study. This is provided only in the discussion, but with reference to multisensory integration, whereas the main focus of the study is focussed attention on one sense. (Relatedly, the reference to "theta oscillations" in this section seems unclear without a reference to the overlapping frequency range, and potentially more explanation.) Overall, if there's no immediate relevance to this analysis, I would suggest removing it.

      We thank the reviewer for pointing this out and have now added information about this frequency to the introduction. We believe that the intermodulation frequency analysis is important, as it potentially supports the notion that condition differences in the visual-frequency tagging response are related to downstream processing rather than overall visual information processing in V1. We would therefore prefer to leave this analysis in the manuscript.

      Line 75: Furthermore, when applying two different frequencies for two different sensory modalities, their intermodulation frequency (f1-f2) has been suggested to reflect cross-modal integration (Drijvers et al., 2021). Due to distinct responses, localisation and attention-dependence, frequency-tagging provides an optimal tool to study sensory signal processing and integration over time.
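
      To make the f1-f2 arithmetic concrete: with the 36 Hz visual and 40 Hz auditory tagging frequencies used here, the intermodulation frequency is 4 Hz. The sketch below is purely illustrative and is not the authors' analysis. It assumes a 1000 Hz sampling rate and models the interaction between the two signals as a simple product term, which is only one of many possible nonlinearities. It shows that a linear sum of the two tagged signals carries no 4 Hz component, whereas the product term introduces one.

```python
import cmath
import math

FS = 1000          # sampling rate in Hz (assumed for illustration)
F_VISUAL = 36.0    # visual tagging frequency (Hz)
F_AUDITORY = 40.0  # auditory tagging frequency (Hz)


def amplitude_at(signal, freq, fs=FS):
    """Amplitude of the sinusoidal component at `freq`, via a single-bin DFT."""
    n = len(signal)
    acc = sum(x * cmath.exp(-2j * math.pi * freq * k / fs)
              for k, x in enumerate(signal))
    return 2.0 * abs(acc) / n


t = [k / FS for k in range(FS)]  # one second of samples
visual = [math.sin(2 * math.pi * F_VISUAL * s) for s in t]
auditory = [math.sin(2 * math.pi * F_AUDITORY * s) for s in t]

# Independent processing: the two responses simply add up.
linear = [v + a for v, a in zip(visual, auditory)]
# Hypothetical integration: an additional multiplicative interaction term.
interacting = [v + a + v * a for v, a in zip(visual, auditory)]

f_im = F_AUDITORY - F_VISUAL  # intermodulation frequency, 4 Hz
print("4 Hz amplitude, linear sum:      ", amplitude_at(linear, f_im))
print("4 Hz amplitude, with interaction:", amplitude_at(interacting, f_im))
```

      The first value is numerically zero and the second is close to 0.5, because the product of two sines splits into components at the difference (4 Hz) and sum (76 Hz) frequencies.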

      Reviewer #2 (Recommendations for the authors):

      As detailed in several points below, I found that I didn't get the information I needed to fully understand design/analysis decisions. In some cases, this may just be a case of re-organising the manuscript, in others crucial info should be added:

      Specific issues:

      Page 2, line 51: How does recent evidence contradict this? Please explain.

      We have added a section that describes the results contradicting the alpha inhibition hypothesis.

      Line 50: Recent evidence challenged a direct connection between alpha activity and visual information processing in early visual cortex. As such, both visual steady-state responses and alpha power were modulated by attention, but did not covary when investigating individual trials (Zhigalov & Jensen, 2020).

      Page 3, line 78-80: "... also interested in relationships [...] on a trial-by-trial basis" - why? Please motivate.

      We thank the reviewer for highlighting this section, which we feel was not very well phrased. We have rewritten this whole paragraph and hope that our motivation for this study is now clear.

      Line 50: Recent evidence challenged a direct connection between alpha activity and visual information processing in early visual cortex. As such, both visual steady-state responses and alpha power were modulated by attention, but did not covary when investigating individual trials (Zhigalov & Jensen, 2020). Unfortunately, very few studies have investigated direct connections between alpha activity, attention and sensory signals, especially over trials. Furthermore, results seem to depend on timing of alpha activity in relation to sensory responses as well as stimulus type and outcome measure (Morrow et al., 2023).

      Page 4, line 88-92: "... implementing a blocked design" - unclear why? This is explained to some extent in the next few lines but remains unclear without knowing outcomes of the EEG experiment with more detail. Overall, it seems like this methodological detail may be better suited for a narrative in the Results section, that follows a more chronological order from the findings of the EEG experiment to the design of the MEG study.

      More generally, and maybe I missed it, I couldn't find a full account of why a block design was chosen and what the added value was. I believe that re-organising the Results section would allow precisely stating how that was an improvement over the EEG experiment.

      In line with the reviewer’s suggestion, we have now restructured the results section. The first section of the study 2 results now explains our design choices with direct reference to the results of the EEG experiment.

      Line 298: To test the robustness of our results and to employ additional control analyses, we replicated our experiment using MEG (see Fig. 7A). While an increase in visual information processing parallel to an increase in alpha modulation already contradicts the notion of alpha inhibition exerting “gain control”, affecting the whole visual cortex, our claim that alpha modulation instead affects visual information at later processing stages still required further validation. As such, our goal was to perform source analyses showing that alpha modulation originating from primary visual areas affected visual information at later processing stages (i.e. not in primary visual cortex). Additionally, to exclude that the uncertainty over possible distractors affected our results, we employed a block design, where block 1 consisted only of trials without distractors and in block 2 targets were always accompanied by a distractor. Furthermore, we aligned the visual and auditory task to be more similar, both of them now featuring frequency-discrimination, which related to sound pitch (frequency) in the auditory condition and stripe-frequency of the Gabor patch in the visual condition. Lastly, to make sure our effects were driven by sensory modality-differences rather than task-difficulty differences, we included a short calibration phase. Prior to the experiment, the difficulty of pitch sounds and Gabor patch frequency was calibrated for each individual, ensuring a success rate between 55% and 75%.

      The point above also applies to lines 95-97 where it's unclear what "aligning the visual with the auditory task" means. Also, what would be the predictions for "more nuanced interactions [...]"

      We agree that this phrasing was more than confusing and in the process of restructuring our results section, we have now revised this passage (see cited text from our manuscript to the point just above).

      Page 9, line 207-209: One of the few mentions of the "ambivalent" condition (attention to audio+vision?). To what end was that condition added to the experiment originally? The explanation that this condition was dropped from analysis because it did not show significant results does not seem methodologically sound.

      We thank the reviewer for pointing this out, as we had changed the name from ambivalent to non-specific, but this word had slipped our attention. The condition was added to the experiment as a control, which enables us to verify that our cues as well as our distractors work as intended. While interesting to analyse (and we did not drop it completely, the condition comparisons are in the supplementary material), we felt that further analysis of this condition would not contribute to addressing our research question. To be specific, the prerequisite to analysing the effect of alpha modulation is a significant effect of alpha modulation in the first place. We have now clarified the rationale for this condition, as well as our reasoning for omitting it from correlation and source analysis.

      Line 173: When presenting unspecified cues, alpha power changes were not significant, but descriptively larger compared to visual target conditions and lower compared to auditory target conditions (see suppl Fig. 2). However, as significant alpha modulation was a prerequisite to test our hypotheses, we excluded this condition from further analysis.

      Page 9, line 209-212: "condition differences in alpha were only significant in block 2 [...] therefore we performed the [...] analysis [...] only for the second half of the experiment." This sounds like double-dipping. Maybe just an issue of phrasing?

      We thank the reviewer for pointing out that it may appear like ‘double dipping’. The reasoning was the same as in the point above: we require a significant alpha modulation to test the effect of alpha modulation on further processing. We have revised this part to be clearer.

      Line 345: In line with previous studies (van Diepen & Mazaheri, 2017), condition differences in alpha activity were only significant in block 2, where distractors were present. As alpha modulation was a prerequisite to test our hypotheses, we performed the following analyses solely with data from block 2 (see Fig. 8).

      Page 12, line 281: Bayes factors are used here (and elsewhere), in addition to NHST. May be worthwhile to mention that briefly before use and give an intro sentence on its use, value and interpretation, and why these are added sometimes but not for all tests reported.

      We agree that we did not introduce this at all and have now added a section, which explains the inclusion as well as the interpretation of the Bayes factor.

      Line 218: To estimate the robustness of these results, we additionally conducted median split analyses between trials with high and low alpha power for each participant, as well as averaged the correlation coefficient of each participant and calculated a one-sample t-test against 0. For each analysis we provided the Bayes Factor, which estimates the strength of support for or against the null hypothesis (BF > 3.2 is considered as substantial evidence and BF > 10 is considered as strong evidence; Kass & Raftery, 1995).
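
      The two robustness checks described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the manuscript: the variable names (`alpha`, `ssep`) and helper functions are hypothetical, each participant is assumed to contribute one alpha power value and one steady-state amplitude per trial, and the Bayes factor computation is omitted because it requires additional modelling choices not specified here.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def median_split_difference(alpha, ssep):
    """Mean SSEP amplitude on high-alpha trials minus that on low-alpha trials."""
    med = statistics.median(alpha)
    high = [s for a, s in zip(alpha, ssep) if a > med]
    low = [s for a, s in zip(alpha, ssep) if a <= med]
    return statistics.fmean(high) - statistics.fmean(low)


def one_sample_t(values):
    """t statistic and degrees of freedom for a test of the mean against 0."""
    n = len(values)
    t = statistics.fmean(values) / (statistics.stdev(values) / math.sqrt(n))
    return t, n - 1


def group_statistics(participants):
    """participants: list of (alpha, ssep) trial-wise sequences, one pair per person.

    Returns the group-level t-tests for the median-split differences and for
    the per-participant correlation coefficients.
    """
    splits = [median_split_difference(a, s) for a, s in participants]
    rs = [pearson_r(a, s) for a, s in participants]
    return one_sample_t(splits), one_sample_t(rs)
```

      With 20 participants, both tests have 19 degrees of freedom, matching the t(19) values reported in the response.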

      Throughout the Results section, it's not always clear which results are from the EEG or from the MEG study. Adopting the recommendation in point c) may help with that.

      According to the reviewer’s recommendation, we have restructured our results section and first present the EEG study and afterwards the MEG study.

      Similarly, it seems pivotal to add "visual" and "auditory" when mentioning the 36/40-Hz steady-state responses (or stimulation) to help the reader.

      We agree that phrasing such as “visual/auditory 36 Hz / 40 Hz frequency-tagging responses when expecting a visual/auditory target” becomes lengthy and confusing very quickly. We therefore decided to introduce the abbreviations visual steady-state evoked potentials/fields (VSSEP/VSSEF) and auditory steady-state evoked potentials/fields (ASSEP/ASSEF).

      Figure 5 - showing the same cluster as "early" and "late" in the margin for the MEG data is potentially confusing.

      We thank the reviewer for pointing this out and have now adapted the figure to just show one cluster, as we only found this one cluster in our MEG analysis.

      Reviewer #3 (Public review):

      This paper seems very strong, particularly given that the follow-up MEG study both (a) clarifies the task design and separates the effect of distractor stimuli into other experimental blocks, and (b) provides source-localization data to more concretely address whether alpha inhibition is occurring at or after the level of sensory processing, and (c) replicates most of the EEG study's key findings.

      We thank the reviewer for their positive feedback and evaluation of our work.

      There are some points that would be helpful to address to bolster the paper. First, the introduction would benefit from a somewhat deeper review of the literature, not just reviewing when the effects of alpha seem to occur, but also addressing how the effect can change depending on task and stimulus design (see review by Morrow, Elias & Samaha, 2023).

      We thank the reviewer for this suggestion and agree. We have now added a paragraph to the introduction that refers to missing correlation studies and the impact of task design.

      Line 53: Unfortunately, very few studies have investigated direct connections between alpha activity, attention and sensory signals, especially over trials. Furthermore, results seem to depend on timing of alpha activity in relation to sensory responses as well as stimulus type and outcome measure (Morrow et al., 2023).

      Additionally, the discussion could benefit from more cautionary language around the revision of the alpha inhibition account. For example, it would be helpful to address some of the possible discrepancies between alpha and SSEP measures in terms of temporal specificity, SNR, etc. (see Peylo, Hilla, & Sauseng, 2021). The authors do a good job speculating as to why they found differing results from previous cross-modal attention studies, but I'm also curious whether the authors think that alpha inhibition/modulation of sensory signals would have been different had the distractors been within the same modality or whether the cues indicated target location, rather than just modality, as has been the case in so much prior work?

      We thank the reviewer for suggesting these interesting discussion points and have included a paragraph in our discussion that clarifies these issues.

      Line 543: It should be noted that the comparison between modulation in alpha activity and in SSEP/SSEFs is difficult, especially concerning timing. This is largely owed to differences in signal-to-noise due to trial averaging in the frequency versus the time domain and temporal and frequency lag in the estimation of alpha activity (Peylo et al., 2021). It is further noteworthy that the majority of evidence for the alpha inhibition hypothesis focused on the effect of pre-target alpha modulation on behaviour and target-related potentials (Morrow et al., 2023). However, in our data alpha modulation occurs clearly ahead of SSEP/SSEF modulation on a scale that could not be simply explained by temporal or frequency smearing. Additionally, significant trial-by-trial correlations, which occur in the frequency domain for both signal types, underline the strong relationship between both measurements.

      Interestingly, we could show that the magnitude of the correlation between alpha power and visual information processing varied between conditions, suggesting a dynamic and adaptive regime. This notion supports the view that alpha oscillations represent a mechanism rather than a specific function, which can fulfil different roles depending on task demand and network location, which has been confirmed in a recent study revealing functionally distinct alpha networks (Clausner et al., 2024). As such, it is conceivable that alpha oscillations can in some cases inhibit local processing, while in other cases, depending on network location, connectivity and demand, alpha oscillation can facilitate signal transmission. In different contexts, utilizing unimodal targets and distractors, spatial cueing, or covert attention, different functional processes could be involved (Morrow et al., 2023). Future research should intensify efforts to disentangle these effects, investigating localized alpha networks intracranially or through combinations of fMRI, EEG and MEG, to clearly measure their effects on sensory processing and behaviour.

      Overall, the analyses and discussion are quite comprehensive, and I believe this paper to be an excellent contribution to the alpha-inhibition literature.

      Reviewer #3 (Recommendations for the authors):

      Overall, the paper is well-written, and the analyses and interpretations are strong. I think that the end of the introduction would feel more complete and read more easily if you outlined all of your main hypotheses (not just trials signaling an auditory stimulus, but visual trials too, and what about distractor trials? This could help justify changes to task design in the MEG study), and then the key findings that motivated the follow-up design, which you then discuss (as opposed to introducing a new aim in this paragraph).

      We thank the reviewer for this positive evaluation. Based on feedback and suggestions from all reviewers, we have revised the structure of the manuscript. The introduction now states more clearly which results would be expected under the alpha inhibition theory and how our results contradict this. The results section has now been divided into two studies, which will make the rationale for our follow-up design easier to follow.

      Line 80: The aim of our study was to directly test the alpha inhibition hypothesis by investigating if cue-induced modulation of alpha activity coincides with the suppression of frequency-tagging responses in task-irrelevant modalities.

      Line 96: In brief, while we observed the expected cue-induced early-visual alpha modulation, the amplitude of auditory and visual SSEP/SSEFs as well as their intermodulation frequency increased just prior to the onset of the auditory target, contradicting the alpha inhibition hypothesis. The difference between conditions of visual SSEP/SSEFs originated from sensory integration areas and correlated with early sensory alpha activity on a trial-by-trial basis, speaking to an effect of alpha modulation on signal transmission rather than inhibition of early visual areas.

      Minor issues:

      L84 - "is" should be "was"

      L93 - "allows" should be "allowed"

      L113 - I think "changed" would suffice

      Fig 1A (text within figure on top) - "erea" should be "area" and caption title should include "of" (Illustration of the...)

      L213 - time window could be clarified

      Fig 4 -captions inconsistently capitalize words and use ) and , following the caption letters

      L253-255 - given you are looking at condition differences, do you mean the response was larger before an auditory target than before a visual target? It currently reads as if you mean that it was larger in that window right before the target as opposed to other time windows

      L368 - "behaviorally" should be "behavioral"

      L407-408 - I think auditory SSEP/SSVEFs should be auditory or visual SSEP/SSEFs, unless you are specifically only talking about auditory SSEPs and visual SSEFs

      L411 - also uses SSVEFs

      L413 - "frequently, or in the case of..."

      L555 - "predicting" should be predicted? Or do you mean only cues that correctly predicted the target?

      We are very grateful to the reviewer for pointing out these mistakes, all of which we have remedied in our manuscript.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study presents potentially valuable results on glutamine-rich motifs in relation to protein expression and alternative genetic codes. The author's interpretation of the results is so far only supported by incomplete evidence, due to a lack of acknowledgment of alternative explanations, missing controls and statistical analysis and writing unclear to non experts in the field. These shortcomings could be at least partially overcome by additional experiments, thorough rewriting, or both.

      We thank both the Reviewing Editor and Senior Editor for handling this manuscript.

      Based on your suggestions, we have provided controls, performed statistical analysis, and rewritten our manuscript. The revised manuscript is significantly improved and more accessible to non-experts in the field.

      Reviewer #1 (Public Review):

      Summary

      This work contains 3 sections. The first section describes how protein domains with SQ motifs can increase the abundance of a lacZ reporter in yeast. The authors call this phenomenon autonomous protein expression-enhancing activity, and this finding is well supported. The authors show evidence that this increase in protein abundance and enzymatic activity is not due to changes in plasmid copy number or mRNA abundance, and that this phenomenon is not affected by mutants in translational quality control. It was not completely clear whether the increased protein abundance is due to increased translation or to increased protein stability.

      In section 2, the authors performed mutagenesis of three N-terminal domains to study how protein sequence changes protein stability and enzymatic activity of the fusions. These data are very interesting, but this section needs more interpretation. It is not clear if the effect is due to the number of S/T/Q/N amino acids or due to the number of phosphorylation sites.

      In section 3, the authors undertake an extensive computational analysis of amino acid runs in 27 species. Many aspects of this section are fascinating to an expert reader. They identify regions with poly-X tracks. These data were not normalized correctly: I think that a null expectation for how often poly-X tracks occur should be built for each species based on the underlying prevalence of amino acids in that species. As a result, I believe that the claim is not well supported by the data.

      Strengths

      This work is about an interesting topic and contains stimulating bioinformatics analysis. The first two sections, where the authors investigate how S/T/Q/N abundance modulates protein expression level, are well supported by the data. The bioinformatics analysis of Q abundance in ciliate proteomes is fascinating. There are some ciliates that have repurposed stop codons to code for Q. The authors find that in these proteomes, Q-runs are greatly expanded. They offer interesting speculations on how this expansion might impact protein function.

      Weakness

      At this time, the manuscript is disorganized and difficult to read. An expert in the field, who will not be distracted by the disorganization, will find some very interesting results included. In particular, the order of the introduction does not match the rest of the paper.

      In the first and second sections, where the authors investigate how S/T/Q/N abundance modulates protein expression levels, it is unclear if the effect is due to the number of phosphorylation sites or the number of S/T/Q/N residues.

      There are three reasons why the number of phosphorylation sites in the Q-rich motifs is not relevant to their autonomous protein expression-enhancing (PEE) activities:

      First, we have reported previously that phosphorylation-defective Rad51-NTD (Rad51-3SA) and wild-type Rad51-NTD exhibit similar autonomous PEE activity. Mec1/Tel1-dependent phosphorylation of Rad51-NTD antagonizes the proteasomal degradation pathway, increasing the half-life of Rad51 from ∼30 min to ≥180 min (1). (page 1, lines 11-14)

      Second, in our preprint manuscript, we have already shown that phosphorylation-defective Rad53-SCD1 (Rad53-SCD1-5STA) also exhibits autonomous PEE activity similar to that of wild-type Rad53-SCD1 (Figure 2D, Figure 4A and Figure 4C). We have highlighted this point in our revised manuscript (page 9, lines 19-21).

      Third, as revealed by the results of Figure 4, it is the percentages, and not the numbers, of S/T/Q/N residues that are correlated with the PEE activities of Q-rich motifs.

      The authors also do not discuss if the N-end rule for protein stability applies to the lacZ reporter or the fusion proteins.

      The autonomous PEE function of S/T/Q-rich NTDs is unlikely to be relevant to the N-end rule. The N-end rule links the in vivo half-life of a protein to the identity of its N-terminal residues. In S. cerevisiae, the N-end rule operates as part of the ubiquitin system and comprises two pathways. First, the Arg/N-end rule pathway, involving a single N-terminal amidohydrolase Nta1, mediates deamidation of N-terminal asparagine (N) and glutamine (Q) into aspartate (D) and glutamate (E), which in turn are arginylated by a single Ate1 R-transferase, generating the Arg/N degron. N-terminal R and other primary degrons are recognized by a single N-recognin Ubr1 in concert with ubiquitin-conjugating Ubc2/Rad6. Ubr1 can also recognize several other N-terminal residues, including lysine (K), histidine (H), phenylalanine (F), tryptophan (W), leucine (L) and isoleucine (I) (68-70). Second, the Ac/N-end rule pathway targets proteins containing N-terminally acetylated (Ac) residues. Prior to acetylation, the first amino acid methionine (M) is catalytically removed by Met-aminopeptidases (MetAPs), unless a residue at position 2 is non-permissive (too large) for MetAPs. If a retained N-terminal M or otherwise a valine (V), cysteine (C), alanine (A), serine (S) or threonine (T) residue is followed by residues that allow N-terminal acetylation, the proteins containing these AcN degrons are targeted for ubiquitylation and proteasome-mediated degradation by the Doa10 E3 ligase (71).

      The PEE activities of these S/T/Q-rich domains are unlikely to arise from counteracting the N-end rule for two reasons. First, the first two amino acid residues of Rad51-NTD, Hop1-SCD, Rad53-SCD1, Sup35-PND, Rad51-ΔN, and LacZ-NVH are MS, ME, ME, MS, ME, and MI, respectively, where M is methionine, S is serine, E is glutamic acid and I is isoleucine. Second, Sml1-NTD behaves similarly to these N-terminal fusion tags, despite its methionine and glutamine (MQ) amino acid signature at the N-terminus. (Page 12, line 3 to page 13, line 2)

      The most interesting part of the paper is an exploration of S/T/Q/N-rich regions and other repetitive AA runs in 27 proteomes, particularly ciliates. However, this analysis is missing a critical control that makes it nearly impossible to evaluate the importance of the findings. The authors find the abundance of different amino acid runs in various proteomes. They also report the background abundance of each amino acid. They do not use this background abundance to normalize the runs of amino acids to create a null expectation from each proteome. For example, it has been clear for some time (Ruff, 2017; Ruff et al., 2016) that Drosophila contains a very high background of Q's in the proteome and it is necessary to control for this background abundance when finding runs of Q's.

      We apologize for not explaining sufficiently well the topic eliciting this reviewer’s concern in our preprint manuscript. In the second paragraph of page 14, we cite six references to highlight that SCDs are overrepresented in yeast and human proteins involved in several biological processes (5, 43) and that polyX prevalence differs among species (79-82).

      We will cite a reference by Kiersten M. Ruff in our revised manuscript (38).

      K. M. Ruff, J. B. Warner, A. Posey and P. S. Tan (2017) Polyglutamine length dependent structural properties and phase behavior of huntingtin exon1. Biophysical Journal 112, 511a.

      The authors could easily address this problem with the data and analysis they have already collected. However, at this time, without this normalization, I am hesitant to trust the lists of proteins with long runs of amino acid and the ensuing GO enrichment analysis. Ruff KM. 2017. Washington University in St.

      Ruff KM, Holehouse AS, Richardson MGO, Pappu RV. 2016. Proteomic and Biophysical Analysis of Polar Tracts. Biophys J 110:556a.

      We thank Reviewer #1 for this helpful suggestion and now address this issue by means of a different approach described below.

      Based on a previous study (43), we applied seven different thresholds to seek both short and long, as well as pure and impure, polyX strings in 20 different representative near-complete proteomes, including 4X (4/4), 5X (4/5-5/5), 6X (4/6-6/6), 7X (4/7-7/7), 8-10X (≥50%X), 11-20X (≥50%X) and ≥21X (≥50%X).

      To normalize the runs of amino acids and create a null expectation from each proteome, we determined the ratios of the overall number of X residues for each of the seven polyX motifs relative to those in the entire proteome of each species, respectively. The results of four different polyX motifs are shown in our revised manuscript, i.e., polyQ (Figure 7), polyN (Figure 8), polyS (Figure 9) and polyT (Figure 10). Thus, polyX prevalence differs among species and the overall X contents of polyX motifs often but not always correlate with the X usage frequency in entire proteomes (43).
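
      The ratio described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used for Figures 7-10: it assumes a proteome is supplied as a list of amino-acid strings, it counts only pure runs of at least four identical residues (impure windows such as 4/5 or ≥50%X are not handled), and the function name `polyx_fraction` and the toy sequences are hypothetical.

```python
import re


def polyx_fraction(proteome, residue, min_run=4):
    """Fraction of all `residue` occurrences that lie inside pure runs
    of at least `min_run` consecutive copies of that residue.

    Assumption: only uninterrupted runs are counted; the impure polyX
    windows mentioned in the text are outside the scope of this sketch.
    """
    pattern = re.compile(re.escape(residue) + "{" + str(min_run) + ",}")
    # Total number of this residue across the whole proteome.
    total = sum(seq.count(residue) for seq in proteome)
    # Number of this residue that sits inside qualifying runs.
    in_runs = sum(
        len(match.group())
        for seq in proteome
        for match in pattern.finditer(seq)
    )
    return in_runs / total if total else 0.0


# Hypothetical toy proteome, for illustration only.
toy_proteome = ["MSQQQQQAQ", "MQKQQQQT", "MAAAK"]
q_fraction = polyx_fraction(toy_proteome, "Q")
```

      In the toy input, 9 of the 11 Q residues sit in runs of four or more, so `q_fraction` equals 9/11. Because the denominator is the proteome-wide count of the same residue, a species with a high overall Q usage does not automatically score high.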

      Most importantly, our results reveal that, compared to Stentor coeruleus or several non-ciliate eukaryotic organisms (e.g., Plasmodium falciparum, Caenorhabditis elegans, Danio rerio, Mus musculus and Homo sapiens), the five ciliates with reassigned TAAQ and TAGQ codons not only have higher Q usage frequencies, but also more polyQ motifs in their proteomes (Figure 7). In contrast, polyQ motifs prevail in Candida albicans, Candida tropicalis, Dictyostelium discoideum, Chlamydomonas reinhardtii, Drosophila melanogaster and Aedes aegypti, though the Q usage frequencies in their entire proteomes are not significantly higher than those of other eukaryotes (Figure 1). Due to their higher N usage frequencies, Dictyostelium discoideum, Plasmodium falciparum and Pseudocohnilembus persalinus have more polyN motifs than the other 23 eukaryotes we examined here (Figure 8). Generally speaking, all 26 eukaryotes we assessed have similar S usage frequencies and percentages of S contents in polyS motifs (Figure 9). Among these 26 eukaryotes, Dictyostelium discoideum possesses many more polyT motifs, though its T usage frequency is similar to that of the other 25 eukaryotes (Figure 10).

      In conclusion, these new normalized results confirm that the reassignment of stop codons to Q indeed results in both higher Q usage frequencies and more polyQ motifs in ciliates.  

      Reviewer #2 (Public Review):

      Summary:

      This study seeks to understand the connection between protein sequence and function in disordered regions enriched in polar amino acids (specifically Q, N, S and T). While the authors suggest that specific motifs facilitate protein-enhancing activities, their findings are correlative, and the evidence is incomplete. Similarly, the authors propose that the re-assignment of stop codons to glutamine-encoding codons underlies the greater use of glutamine in a subset of ciliates, but again, the conclusions here are, at best, correlative. The authors perform extensive bioinformatic analysis, with detailed (albeit somewhat ad hoc) discussion on a number of proteins. Overall, the results presented here are interesting, but are unable to exclude competing hypotheses.

      Strengths:

      Following up on previous work, the authors wish to uncover a mechanism associated with poly-Q and SCD motifs explaining proposed protein expression-enhancing activities. They note that these motifs often occur in IDRs and hypothesize that structural plasticity could be capitalized upon as a mechanism of diversification in evolution. To investigate this further, they employ bioinformatics to investigate the sequence features of proteomes of 27 eukaryotes. They deepen their sequence space exploration, uncovering sub-phylum-specific features associated with species in which a stop-codon substitution has occurred. The authors propose this stop-codon substitution underlies an expansion of poly-Q repeats and increased glutamine distribution.

      Weaknesses:

      The preprint provides extensive, detailed, and entirely unnecessary background information throughout, hampering reading and making it difficult to understand the ideas being proposed.

      The introduction provides a large amount of detailed background that appears entirely irrelevant for the paper. In many places, detailed discussions of specific proteins that are likely of interest to the authors appear, yet without context they do not enhance the paper for the reader.

      The paper uses many unnecessary, new, or redefined acronyms which makes reading difficult. As examples:

      1) Prion forming domains (PFDs). Do the authors mean prion-like domains (PLDs), an established term with an empirical definition from the PLAAC algorithm? If yes, they should say this. If not, they must define what a prion-forming domain is formally.

      The N-terminal domain (1-123 amino acids) of S. cerevisiae Sup35 was already referred to as a “prion forming domain (PFD)” in 2006 (48). Since then, PFD has also been employed as an acronym in other yeast prion papers (Cox, B.S. et al. 2007; Toombs, T. et al. 2011).

      B. S. Cox, L. Byrne, M. F. Tuite, Protein Stability. Prion 1, 170-178 (2007).

      J. A. Toombs, N. M. Liss, K. R. Cobble, Z. Ben-Musa, E. D. Ross, [PSI+] maintenance is dependent on the composition, not primary sequence, of the oligopeptide repeat domain. PLoS One 6, e21953 (2011).

      2) SCD is already an acronym in the IDP field (meaning sequence charge decoration) - the authors should avoid this as their chosen acronym for Serine(S) / threonine (T)-glutamine (Q) cluster domains. Moreover, do we really need another acronym here (we do not).

      SCD was first used in 2005 as an acronym for the Serine (S)/threonine (T)-glutamine (Q) cluster domain in the DNA damage checkpoint field (4). Almost a decade later, SCD became an acronym for “sequence charge decoration” (Sawle, L. et al. 2015; Firman, T. et al. 2018).

      L. Sawle and K. Ghosh, A theoretical method to compute sequence dependent configurational properties in charged polymers and proteins. J. Chem Phys. 143, 085101 (2015).

      T. Firman and K. Ghosh, Sequence charge decoration dictates coil-globule transition in intrinsically disordered proteins. J. Chem Phys. 148, 123305 (2018).

      3) Protein expression-enhancing (PEE) - just say expression-enhancing, there is no need for an acronym here.

      Thank you. Since we have shown that the addition of Q-rich motifs to LacZ affects protein expression rather than transcription, we think it is better to use the “PEE” acronym.

      The results suggest autonomous protein expression-enhancing activities of regions of multiple proteins containing Q-rich and SCD motifs. Their definition of expression-enhancing activities is vague and the evidence they provide to support the claim is weak. While their previous work may support their claim with more evidence, it should be explained in more detail. The assay they choose is a fusion reporter measuring beta-galactosidase activity and tracking expression levels. Given the presented data they have shown that they can drive the expression of their reporters and that beta gal remains active, in addition to the increase in expression of fusion reporter during the stress response. They have not detailed what their control and mock treatment is, which makes complete understanding of their experimental approach difficult. Furthermore, their nuclear localization signal on the tag could be influencing the degradation kinetics or sequestering the reporter, leading to its accumulation and the appearance of enhanced expression. Their evidence refuting ubiquitin-mediated degradation does not have a convincing control.

      Although this reviewer’s concern regarding our use of a nuclear localization signal on the tag is understandable, we are confident that this signal does not bias our findings for two reasons. First, the negative control LacZ-NV also possesses the same nuclear localization signal (Figure 1A, lane 2). Second, another fusion target, Rad51-ΔN, does not harbor the NVH tag (Figure 1D, lanes 3-4). Compared to wild-type Rad51, Rad51-ΔN is highly labile. In our previous study, removal of the NTD from Rad51 reduced by ~97% the protein levels of corresponding Rad51-ΔN proteins relative to wild-type (1).

      Based on the experimental results, the authors then go on to perform bioinformatic analysis of SCD proteins and polyX proteins. Unfortunately, there is no clear hypothesis for what is being tested; there is a vague sense of investigating polyX/SCD regions, but I did not find the connection between the first and second sections compelling (especially given polar-rich regions have been shown to engage in many different functions). As such, this bioinformatic analysis largely presents as many lists of percentages without any meaningful interpretation. The bioinformatics analysis lacks any kind of rigorous statistical tests, making it difficult to evaluate the conclusions drawn. The methods section is severely lacking. Specifically, many of the methods require the reader to read many other papers. While referencing prior work is, of course, important, the authors should ensure the methods in this paper provide the details needed to allow a reader to evaluate the work being presented. As it stands, this is not the case.

      Thank you. As described in detail below, we have now performed rigorous statistical testing using the GOfuncR package (Figure 11, Figure 12 and DS7-DS32).

      Overall, my major concern with this work is that the authors make two central claims in this paper (as per the Discussion). The authors claim that Q-rich motifs enhance protein expression. The implication here is that Q-rich motif IDRs are special, but this is not tested. As such, they cannot exclude the competing hypothesis ("N-terminal disordered regions enhance expression").

      In fact, “N-terminal disordered regions enhance expression” exactly summarizes our hypothesis.

      On pages 12-13 and Figure 4 of our preprint manuscript, we explained our hypothesis in the paragraph entitled “The relationship between PEE function, amino acid contents, and structural flexibility”.

      The authors also do not explore the possibility that this effect is in part/entirely driven by mRNA-level effects (see Verma, Nat Comms 2019).

      As pointed out by the first reviewer, we present evidence that the increase in protein abundance and enzymatic activity is not due to changes in plasmid copy number or mRNA abundance (Figure 2), and that this phenomenon is not affected in translational quality control mutants (Figure 3).

      As such, while these observations are interesting, they feel preliminary and, in my opinion, cannot be used to draw hard conclusions on how N-terminal IDR sequence features influence protein expression. This does not mean the authors are necessarily wrong, but from the data presented here, I do not believe strong conclusions can be drawn. The second claim is that re-assignment of stop codons to Q increases proteome-wide Q usage. I was unable to understand what result led the authors to this conclusion.

      My reading of the results is that a subset of ciliates has re-assigned UAA and UAG from the stop codon to Q. Those ciliates have more polyQ-containing proteins. However, they also have more polyN-containing proteins and proteins enriched in S/T-Q clusters. Surely if this were a stop-codon-dependent effect, we'd ONLY see an enhancement in Q-richness, not a corresponding enhancement in all polar-rich IDR frequencies? It seems the better working hypothesis is that free-floating ciliate proteomes are enriched in polar amino acids compared to sessile ciliates.

      We thank this reviewer for raising this point; however, her/his comments are not supported by the results in Figure 7.

      Regardless, the absence of any kind of statistical analysis makes it hard to draw strong conclusions here.

      We apologize for not explaining more clearly the results of Tables 5-7 in our preprint manuscript.

      To address the concerns about our GO enrichment analysis by both reviewers, we have now performed rigorous statistical testing for SCD and polyQ protein overrepresentation using the GOfuncR package (https://bioconductor.org/packages/release/bioc/html/GOfuncR.html). GOfuncR is an R package that conducts standard candidate vs. background enrichment analysis by means of the hypergeometric test. We then adjusted the raw p-values according to the family-wise error rate (FWER). The same method had been applied to GO enrichment analysis of human genomes (89).
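
      For readers unfamiliar with this kind of test, the core calculation can be sketched as follows. This is an illustrative Python re-implementation of a one-sided hypergeometric over-representation test, not GOfuncR code and not our analysis script. The Bonferroni step is a simple stand-in that controls the FWER; we do not claim it matches how GOfuncR derives its FWER values. All function names and numbers are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, annotated, drawn):
    """P(X >= k) when `drawn` genes are sampled without replacement from
    `population` genes, of which `annotated` carry the GO term of interest."""
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, drawn)


def bonferroni(pvalues):
    """Bonferroni adjustment: a conservative way to control the FWER
    (assumption: used here only as a simple illustrative correction)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Hypothetical example: 10 background genes, 3 annotated with a term,
# 4 candidate (e.g. polyQ-containing) genes, all 3 annotated genes among them.
p_raw = hypergeom_sf(3, population=10, annotated=3, drawn=4)
p_adj = bonferroni([p_raw, 0.5])
```

      In this toy case the raw p-value is 7/210 = 1/30, because only 7 of the 210 possible candidate sets of size 4 contain all three annotated genes.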

      The results presented in Figure 11 and Figure 12 (DS7-DS32) support our hypothesis that Q-rich motifs prevail in proteins involved in specialized biological processes, including Saccharomyces cerevisiae RNA-mediated transposition, Candida albicans filamentous growth, peptidyl-glutamic acid modification in ciliates with reassigned stop codons (TAAQ and TAGQ), Tetrahymena thermophila xylan catabolism, Dictyostelium discoideum sexual reproduction, Plasmodium falciparum infection, as well as the nervous systems of Drosophila melanogaster, Mus musculus, and Homo sapiens (78). In contrast, peptidyl-glutamic acid modification and microtubule-based movement are not overrepresented with Q-rich proteins in Stentor coeruleus, a ciliate with standard stop codons.

      Recommendations for the authors:

      Please note that you control which revisions to undertake from the public reviews and recommendations for the authors.

      Reviewer #1 (Recommendations For The Authors):

      The order of paragraphs in the introduction was very difficult to follow. Each paragraph was clear and easy to understand, but the order of paragraphs did not make sense to this reader. The order of events in the abstract matches the order of events in the results section. However, the order of paragraphs in the introduction is completely different and this was very confusing. This disordered list of facts might make sense to an expert reader but makes it hard for a non-expert reader to understand.

      Apologies. We endeavored to improve the flow of our revised manuscript to make it more readable.

      The section beginning on pg 12 focused on figures 4 and 5 was very interesting and highly promising. However, it was initially hard for me to tell from the main text what the experiment was. Please add to the text an explanation of the experiment, because it is hard to figure out what was going on from the figures alone. Figure 4 is fantastic, but would be improved by adding error bars and scaling the x-axis to be the same in panels B,C,D.

      Thank you for this recommendation. We have now scaled both the x-axis and y-axis equivalently in panels B, C and D of Figure 4. Error bars are too small to be included.

      It is hard to tell if the key variable is the number of S/T/Q/N residues or the number of phosphosites. I think a good control would be to add a regression against the number of putative phosphosites. The sequences are well designed. I loved this part but as a reader, I need more interpretation about why it matters and how it explains the PEE.

      As described above, we have shown that the number of phosphorylation sites in the Q-rich motifs is not relevant to their autonomous protein expression-enhancing (PEE) activities.

      I believe that the prevalence of polyX runs is not meaningful without normalizing for the background abundance of each amino acid. The proteome-wide abundance and the assumption that amino acids occur independently can be used to form a baseline expectation for which runs are longer than expected by chance. I think Figures 6 and 7 should go into the supplement and be replaced in the main text with a figure where Figure 6 is normalized by Figure 7. For example in P. falciparum, there are many N-runs (Figure 6), but the proteome has the highest fraction of N’s (Figure 7).

      Thank you for these suggestions. The three figures in our preprint manuscript (Figures 6-8) have been moved into the supplementary information (Figures S1-S3). For normalization, we have provided four new figures (Figures 7-10) in our revised manuscript.
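
      The baseline suggested by the reviewer (independent residues drawn at the proteome-wide frequency) can also be written down explicitly. The sketch below is a hypothetical illustration and is not part of our analysis: for a protein of length L and residue frequency p, the expected number of maximal runs of at least k identical residues is p^k · (1 + (L − k)(1 − p)), which is then compared with the observed count. Function names and the toy sequences are assumptions made for the example.

```python
import re


def expected_runs(lengths, p, k):
    """Expected number of maximal runs of length >= k under an i.i.d.
    model with residue frequency p, summed over proteins of the given lengths.

    A run can start at position 0 (probability p**k) or at any later
    position preceded by a different residue (probability (1 - p) * p**k).
    """
    return sum(p**k * (1 + (n - k) * (1 - p)) for n in lengths if n >= k)


def observed_runs(proteome, residue, k):
    """Count maximal pure runs of `residue` with length >= k."""
    pattern = re.compile(re.escape(residue) + "{" + str(k) + ",}")
    return sum(len(pattern.findall(seq)) for seq in proteome)


def run_enrichment(proteome, residue, k=4):
    """Return (observed, expected, observed/expected) for one residue."""
    total = sum(len(seq) for seq in proteome)
    p = sum(seq.count(residue) for seq in proteome) / total
    exp = expected_runs([len(seq) for seq in proteome], p, k)
    obs = observed_runs(proteome, residue, k)
    ratio = obs / exp if exp > 0 else None
    return obs, exp, ratio


# Hypothetical toy proteome, for illustration only.
obs, exp, ratio = run_enrichment(["QQQQA", "AQAQA"], "Q", k=4)
```

      A ratio well above 1 would indicate more runs than composition alone predicts. The independence assumption is deliberately crude, so such a ratio is a screening statistic rather than a formal test.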

      The analysis of ciliate proteomes was fascinating. I am particularly interested in the GO enrichment for “peptidyl-glutamic acid modification” (pg 20) because these enzymes might be modifying some of Q’s in the Q-runs. I might be wrong about this idea or confused about the chemistry. Do these ciliates live in Q-rich environments? Or nitrogen rich environments?

      Polymeric modifications (polymodifications) are a hallmark of C-terminal tubulin tails, whereas secondary peptide chains of glutamic acids (polyglutamylation) and glycines (polyglycylation) are catalyzed from the γ-carboxyl group of primary chain glutamic acids. It is not clear if these enzymes can modify some of the Q’s in the Q-runs.

      To our knowledge, ciliates are abundant in almost every liquid water environment, i.e., oceans/seas, marine sediments, lakes, ponds, and rivers, and even soils.

      I think you should include more discussion about how the codons that code for Q’s are prone to slippage during DNA replication, and thus many Q-runs are unstable and expand (e.g. Huntington’s Disease). The end of pg 24 or pg 25 would be good places.

      We thank the reviewer for these comments.

      PolyQ motifs have a particular length-dependent codon usage that relates to strand slippage in CAG/CTG trinucleotide repeat regions during DNA replication. In most organisms having the standard genetic code, Q is encoded by CAGQ and CAAQ. Here, we have determined and compared proteome-wide Q contents, as well as the CAGQ usage frequencies (i.e., the ratio between CAGQ and the sum of CAGQ, CAAQ, TAAQ, and TAGQ).
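
      The usage frequency defined above amounts to a simple codon tally. The following Python sketch is illustrative only and is not the script used to produce the table: it assumes each coding sequence is in frame and ends with its stop codon (which is dropped), and the caller states which codons encode Q in the organism at hand. The names `q_codon_usage`, `STANDARD_Q` and `REASSIGNED_Q` are hypothetical.

```python
from collections import Counter

# Codons read as glutamine under the standard code, and under a code in
# which TAA and TAG have been reassigned to Q (as in some ciliates).
STANDARD_Q = ("CAA", "CAG")
REASSIGNED_Q = ("CAA", "CAG", "TAA", "TAG")


def q_codon_usage(cds_list, q_codons=STANDARD_Q):
    """Return the share of each Q codon among all Q codons in the CDS set.

    Assumption: every CDS is in frame and its last codon is the
    terminating stop, which is excluded from the tally.
    """
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3
        codons = [cds[i:i + 3] for i in range(0, usable, 3)]
        for codon in codons[:-1]:  # drop the terminating stop codon
            if codon in q_codons:
                counts[codon] += 1
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in q_codons}


# Hypothetical toy sequences, for illustration only.
standard = q_codon_usage(["ATGCAGCAGCAATAA"])
reassigned = q_codon_usage(["ATGCAGTAACAATAGTGA"], REASSIGNED_Q)
```

      For the first toy sequence the CAG share is 2/3. For the second, in which TAA and TAG are read as Q and TGA terminates translation, the CAG share is 1/4.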

      Our results reveal that the likelihood of forming long CAG/CTG trinucleotide repeats is higher in five eukaryotes due to their higher CAGQ usage frequencies, including Drosophila melanogaster (86.6% Q), Danio rerio (74.0% Q), Mus musculus (74.0% Q), Homo sapiens (73.5% Q), and Chlamydomonas reinhardtii (87.3% Q) (orange background, Table 2). In contrast, another five eukaryotes that possess high numbers of polyQ motifs (i.e., Dictyostelium discoideum, Candida albicans, Candida tropicalis, Plasmodium falciparum and Stentor coeruleus) (Figure 1) utilize more CAAQ (96.2%, 84.6%, 84.5%, 86.7% and 75.7%) than CAGQ (3.8%, 15.4%, 15.5%, 13.3% and 24.3%), respectively, to avoid the formation of long CAG/CTG trinucleotide repeats (green background, Table 2). Similarly, all five ciliates with reassigned stop codons (TAAQ and TAGQ) have low CAGQ usage frequencies (i.e., from 3.8% Q in Pseudocohnilembus persalinus to 12.6% Q in Oxytricha trifallax) (red font, Table 2). Accordingly, the CAG-slippage mechanism might operate more frequently in Chlamydomonas reinhardtii, Drosophila melanogaster, Danio rerio, Mus musculus and Homo sapiens than in Dictyostelium discoideum, Candida albicans, Candida tropicalis, Plasmodium falciparum, Stentor coeruleus and the five ciliates with reassigned stop codons (TAAQ and TAGQ).

      Author response table 1.

      Usage frequencies of TAA, TAG, TAAQ, TAGQ, CAAQ and CAGQ codons in the entire proteomes of 20 different organisms.

      Pg 7, paragraph 2 has no direction. Please add the conclusion of the paragraph to the first sentence.

      This paragraph has been moved to the “Introduction” section of the revised manuscript.

      Pg 8, I suggest only mentioning the PFDs used in the experiments. The rest are distracting.

      We have addressed this concern above.

      Pg 12. Please revise the "The relationship...." text to explain the experiment.

      We apologize for not explaining this topic sufficiently well in our preprint manuscript.

      SCDs are often structurally flexible sequences (4) or even IDRs. Using IUPred2A (https://iupred2a.elte.hu/plot_new), a web-server for identifying disordered protein regions (88), we found that Rad51-NTD (1-66 a.a.) (1), Rad53-SCD1 (1-29 a.a.) and Sup35-NPD (1-39 a.a.) are highly structurally flexible. Since a high content of serine (S), threonine (T), glutamine (Q) and asparagine (N) is a common feature of IDRs (17-20), we applied an alanine scanning mutagenesis approach to reduce the percentages of S, T, Q or N in Rad51-NTD, Rad53-SCD1 or Sup35-NPD, respectively. As shown in Figure 4 and Figure 5, there is a very strong positive relationship between STQ and STQN amino acid percentages and β-galactosidase activities. (Page 13, lines 5-10)

      Pg 13, first full paragraph, "Functionally, IDRs..." I think this paragraph belongs in the Discussion.

      This paragraph is now in the “Introduction” section (Page 5, Lines 11-15).

      Pg. 15, I think the order of paragraphs should be swapped.

      These paragraphs have been removed or rewritten in the “Introduction section” of our revised manuscript.

      Pg 17 (and other parts) I found the lists of numbers and percentages hard to read and I think you should refer readers to the tables.

      Thank you. In the revised manuscript, we have avoided using lists of numbers and percentages, unless we feel they are absolutely essential.

      Pg. 19 please add more interpretation to the last paragraph. It is very cool but I need help understanding the result. Are these proteins diverging rapidly? Perhaps this is a place to include the idea of codon slippage during DNA replication.

      Thank you. The new results in Table 2 indicate that the CAG-slippage mechanism is unlikely to operate in ciliates with reassigned stop codons (TAAQ and TAGQ).

      Pg 24. "Based on our findings from this study, we suggest that Q-rich motifs are useful toolkits for generating novel diversity during protein evolution, including by enabling greater protein expression, protein-protein interactions, posttranslational modifications, increased solubility, and tunable stability, among other important traits." This idea needs to be cited. Keith Dunker has written extensively about this idea as have others. Perhaps also discuss why Poly Q rich regions are different from other IDRs and different from other IDRs that phase-separate.

      Agreed, we have cited two of Keith Dunker’s papers in our revised manuscript (73, 74).

      Minor notes:

      Please define Borg genomes (pg 25).

      Borgs are long extrachromosomal DNA sequences in methane-oxidizing Methanoperedens archaea, which display the potential to augment methane oxidation (101). They are now described in our revised manuscript. (Page 15, lines 12-14)

      Reviewer #2 (Recommendations For The Authors):

      The authors dance around disorder but never really quantify or show data. This seems like a strange blindspot.

      We apologize for not explaining this topic sufficiently well in our preprint manuscript. We have endeavored to do so in our revised manuscript.

      The authors claim the expression enhancement is "autonomous," but they have not ruled things out that would make it not autonomous.

      Evidence of the “autonomous” nature of expression enhancement is presented in Figure 1, Figure 4, and Figure 5 of the preprint manuscript.

      Recommendations for improving the writing and presentation.

      The title does not recapitulate the entire body of work. The first 5 figures are not represented by the title in any way, and indeed, I have serious misgivings as to whether the conclusion stated in the title is supported by the work. I would strongly suggest the authors change the title.

      Figure 2 could be supplemental.

      Thank you. We think it is important to keep Figure 2 in the text.

      Figures 4 and 5 are not discussed much or particularly well.

      This reviewer’s opinion of Figure 4 and Figure 5 is in stark contrast to those of the first reviewer.

      The introduction, while very thorough, takes away from the main findings of the paper. It is more suited to a review and not a tailored set of minimal information necessary to set up the question and findings of the paper. The question that the authors are after is also not very clear.

      Thank you. The entire “Introduction” section has been extensively rewritten in the revised manuscript.

      Schematics of their fusion constructs and changes to the sequence would be nice, even if supplemental.

      Schematics of the fusion constructs are provided in Figure 1A.

      The methods section should be substantially expanded.

      The Methods section in the revised manuscript has been rewritten and expanded. The six JavaScript programs used in this work are listed in Table S4.

      The text is not always suited to the general audience and readership of eLife.

      We have now rewritten parts of our manuscript to make it more accessible to the broad readership of eLife.

      In some cases, section headers really don't match what is presented, or there is no evidence to back the claim.

      The section headers in the revised manuscript have been corrected.

      A lot of the listed results in the back half of the paper could be a supplemental table, listing %s in a paragraph (several of them in a row) is never nice

      Acknowledged. In the revised manuscript, we have removed almost all sentences listing %s.

      Minor corrections to the text and figures.

      There is a reference to table 1 multiple times, and it seems that there is a missing table. The current table 1 does not seem to be the same table referred to in some places throughout the text.

      Apologies for this mistake, which we have now corrected in our revised manuscript.

      In some places it's not clear where new work is and where previous work is mentioned. It would help if the authors clearly stated "In previous work...."

      Acknowledged. We have corrected this oversight in our revised manuscript.

      Not all strains are listed in the strain table (KO's in figure 3 are not included)

      Apologies, we have now corrected Table S2, as suggested by this reviewer.

      Author response table 2.

      S. cerevisiae strains used in this study

    1. Author Response

      The following is the authors’ response to the original reviews.

      On behalf of my co-authors, we thank you very much for giving us the opportunity to revise our manuscript entitled “A positive feedback loop between ZEB2 and ACSL4 regulates lipid metabolism to promote breast cancer metastasis” (manuscript number: eLife-RP-RA-2023-87510).

      We would like to convey our appreciation to you and the expert reviewers for your valuable time and effort in reviewing and improving our work. We are grateful for the constructive comments raised by the six expert reviewers. We have studied the reviewers’ comments carefully and have accordingly conducted additional experiments as recommended. We have made the following revisions point by point. We found that our work was substantially strengthened by addressing these points.

      Reviewer #1 (Public Review):

      In this study, Jiamin Lin et al. investigated the potential positive feedback loop between ZEB2 and ACSL4, which regulates lipid metabolism and breast cancer metastasis. They reported a correlation between high expression of ZEB2 and ACSL4 and poor survival of breast cancer patients, and showed that depletion of ZEB2 or ACSL4 significantly reduced lipid droplets abundance and cell migration in vitro. The authors also claimed that ZEB2 activated ACSL4 expression by directly binding to its promoter, while ACSL4 in turn stabilized ZEB2 by blocking its ubiquitination. While the topic is interesting, there are several major concerns with the study and its conclusions are not convincing.

      1) Figure 1A, the clinical relevance or biological significance of drug-resistant luminal breast cancer cell lines with metastatic cancer is questionable. Additionally, the RNA-seq analysis lacked multiple test correction for differential gene expression analysis, and no fold-change cut-off was used, leading to incorrect thresholds and wrongly identified significant signals.

      We appreciate the reviewer’s valuable questions, which helped improve our manuscript. We found that many EMT-related transcription factors, such as ZEB2, SNAIL, and TWIST, were up-regulated in drug-resistant cells, so we hypothesized that drug-resistant cells may have undergone EMT and acquired metastatic capability. The drug-resistant cells used in this study have already been established and characterized in previous studies from our research team, as follows:

      (1) Zheng FM, Long ZJ, Hou ZJ et al., A novel small molecule aurora kinase inhibitor attenuates breast tumor-initiating cells and overcomes drug resistance. Mol Cancer Ther. 2014 Aug;13(8):1991-2003.

      (2) Yang N, Wang C, Wang Z, et al., FOXM1 recruits nuclear Aurora kinase A to participate in a positive feedback loop essential for the self-renewal of breast cancer stem cells. Oncogene. 2017 Jun 15;36(24):3428-3440.

      For the second question, we did use a fold-change cut-off in the RNA-seq analysis: the fold change was over 1.5 and the adjusted P value was less than 0.05. To make this clearer, we have reset the cut-off to a |log2FC| of 2 and p<0.05 and generated a volcano plot of the differentially expressed genes using R 4.3.0, as shown in Author response image 1. The results showed 3217 and 3035 up-regulated genes in TAXOL-resistant and EPI-resistant cells, respectively, along with 2427 (TAXOL) and 2901 (EPI) down-regulated genes. Both ACSL4 and ZEB2 were up-regulated in both cell lines. We have put the figure in the new supplementary Fig. S2.

      Author response image 1.
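
      The thresholding described above (multiple-testing adjustment followed by a fold-change and p-value cut-off) can be sketched as follows. This is a minimal illustration, not the R code used for the analysis: the gene names and values are hypothetical, the Benjamini-Hochberg procedure is assumed as the adjustment method, and the cut-off is assumed to be inclusive (|log2FC| at or above 2).

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def classify_gene(log2fc: float, padj: float,
                  lfc_cutoff: float = 2.0, p_cutoff: float = 0.05) -> str:
    """Label one gene as 'up', 'down', or 'ns' (not significant).

    Assumption: the fold-change cut-off is inclusive.
    """
    if padj < p_cutoff and log2fc >= lfc_cutoff:
        return "up"
    if padj < p_cutoff and log2fc <= -lfc_cutoff:
        return "down"
    return "ns"


# Hypothetical genes: (log2 fold change, raw p-value).
toy = {
    "geneA": (3.1, 0.0001),
    "geneB": (-2.4, 0.004),
    "geneC": (0.8, 0.03),
    "geneD": (2.5, 0.2),
}
names = list(toy)
padj = benjamini_hochberg([toy[g][1] for g in names])
labels = {g: classify_gene(toy[g][0], p) for g, p in zip(names, padj)}
print(labels)
```

      Genes labelled "up" or "down" correspond to the coloured points of a volcano plot; everything else falls in the non-significant region.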

      2) Figure 1D-E, the clinical associations between ACSL4 and ZEB2 overexpression and poor patient survival are not justified. The authors used an old web tool, the Kaplan-Meier plotter database, based on microarray data, to perform the analysis. The reviewer repeated the analysis and found that multiple microarray probes for ZEB2 were available, leading to opposite results when different probes were selected. The reviewer also repeated the analysis using more reliable TCGA RNA-seq data and found no correlation between ACSL4 or ZEB2 expression and post-progression survival.

      We appreciate the reviewer’s thoughtful questions. The Kaplan-Meier plotter database (http://kmplot.com/analysis/) we used is handled by a PostgreSQL server, which integrates gene expression and clinical data simultaneously, including GEO, EGA and TCGA data. We used the auto-select best cutoff option for the Kaplan-Meier analysis. Because the web tool is old, we repeated the Kaplan-Meier survival analysis using R 4.3.0 and split the patients in the TCGA database according to the third quartile of expression (new Fig. 1D-F). The results also show that patients with high expression of ACSL4 and/or ZEB2 have a relatively worse prognosis, as shown in Author response image 2 (p<0.01):

      Author response image 2.
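
      For readers unfamiliar with the procedure, the two steps described above (splitting patients at the third quartile of expression and estimating a Kaplan-Meier curve per group) can be sketched as below. This is an illustrative sketch with made-up numbers, not the R analysis reported in the response; the function names are hypothetical, and the exact quartile definition used by the authors is not stated, so Python's default is assumed.

```python
from statistics import quantiles
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    events[i] is 1 if the event was observed and 0 if the subject was censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share this time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def split_by_third_quartile(expression: List[float]) -> List[bool]:
    """Flag samples whose expression exceeds the third quartile (Q3)."""
    q3 = quantiles(expression, n=4)[2]
    return [value > q3 for value in expression]


# Made-up follow-up times (months) and event indicators.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
high = split_by_third_quartile([1, 2, 3, 4, 5, 6, 7, 8])
print(curve)
print(high)
```

      Comparing the curves of the high and low groups (for example with a log-rank test) would then give the kind of p-value quoted above; that test is not implemented here.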

      3) Figure 1I relied on IHC to support the negative correlation between ACSL4 and ERα expression, but the small sample size limits the power to establish the relationship and the results are not definitive without further replication or biological investigation. The authors should provide a more detailed and comprehensive analysis, including appropriate statistical tests, to ensure the findings are robust and reliable.

      We appreciate the reviewer’s suggestion. To better understand the positive correlation between ACSL4 and ZEB2 expression, we increased the number of breast cancer cases to 45 for IHC analysis; the correlation is shown in Author response image 3 (new Fig. 1H):

      Author response image 3.

      4) Figure 3B-C lacks justification of the differences by showing only one field without any internal control for exposure. The reviewer suggests showing additional fields where both efficiently and inefficiently knocked-down cells are present, to justify the robustness of the results. This can also be achieved by mixing control and knockdown cells.

      We fully understand the reviewer's concern. Thank you for pointing out this problem. A lower-magnification field of view, which includes both efficiently and inefficiently knocked-down cells, is shown below. We have changed Fig. 3B and C as shown in Author response image 4:

      Author response image 4.

      5) Figure 4A-D, oleate-induced cell migration is a well-documented feature across different cancer types. To make it more relevant to the current study, the authors should examine multiple cell lines with high and low ZEB2/ACSL4 expression to determine the underlying relevance.

      We appreciate the reviewer’s comments and performed the suggested experiments. To better determine the role of oleic acid and ACSL4 in cell migration, we used the MCF-7 cell line, which has low ZEB2/ACSL4 expression, to test the influence of oleate on cell migration. Transwell and wound-healing assays revealed that oleic acid-treated MCF-7 cells also exhibited enhanced invasive and metastatic capacities compared with control cells. This result indicates that oleate may induce cell migration in MCF-7 cells via mechanisms other than ACSL4. We have added the results to the new Supplementary Fig. 8, as shown in Author response image 5.

      Author response image 5.

      6) Figure 4E, it is difficult to conclude that cancer cells utilize stored lipids during migration to fuel metastasis based on current data. Do you see any evidence of lipid signal decreasing in the leading edge of the scratch wound-healing migration assay? The authors should also compare signals between unmigrated and migrated cells in the transwell assay.

      We appreciate the reviewer’s constructive suggestion. We performed the wound-healing migration assay and observed that the lipid signal was clearly decreased at the leading edge of the scratch, as shown in Author response image 6 (new Fig. 4E). In the transwell experiment, the cells that migrated to the lower side of the chamber after 24 hours showed decreased lipid signals (Fig. 4F). All these results indicate that lipids are utilized during migration.

      Author response image 6.

      7) Figure 6 warrants a genome-wide ChIP-seq to justify direct regulation of the ACSL4 promoter by ZEB2. The reviewer’s analysis of publicly available ZEB2 ChIP-seq in multiple cell types detected no ZEB2 binding signal within ±5 kb of the ACSL4 promoter.

      We thank the reviewer for raising this concern. We found that breast cancer cells are not included in some databases, such as the Cistrome Data Browser, which is a resource of human and mouse cis-regulatory information derived from ChIP-seq, DNase-seq and ATAC-seq chromatin profiling assays. Because different cell types may have entirely different mechanisms, ZEB2 binding signal may not be detected at the ACSL4 promoter in some cells.

      We searched the JASPAR database (https://jaspar.genereg.net/), an open-access database of non-redundant transcription factor (TF) binding profiles, and found that consensus binding sequences (CACCT) of the ZEB (zinc finger E-box binding homeobox) transcription factor family lie within 2 kb of the ACSL4 promoter, as shown in Author response image 7.

      Author response image 7.
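
      A consensus-sequence search of this kind can be illustrated with a simple exact-match scan on both strands. The sketch below is only a stand-in for a JASPAR profile search, which scores position weight matrices rather than exact strings; the example sequence is invented and is not the ACSL4 promoter.

```python
from typing import List, Tuple

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(sequence: str, motif: str = "CACCT") -> List[Tuple[int, str]]:
    """Return (0-based start, strand) for every exact motif hit on either strand."""
    sequence = sequence.upper()
    rc = reverse_complement(motif)
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        if window == motif:
            hits.append((start, "+"))
        elif window == rc:
            hits.append((start, "-"))
    return hits


# Invented sequence containing one forward and one reverse-strand match.
toy_promoter = "TTCACCTGGAAGGTGCC"
hits = find_motif(toy_promoter)
print(hits)
```

      As the reviewer's comment implies, a motif match only shows that binding is possible at that position; it does not by itself demonstrate occupancy in a given cell type.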

      8) Figure 7 presents a series of self-contradictory results. In Figure 7C, why was no significant change in ZEB2-MYC expression observed in the presence of ACSL4 and/or HA-Ubi? In Figure 7E and G, why is robust ACSL4 expression present in the control group in E but not in G? Additionally, why is there no degradation of the ZEB2 baseline level over time in the shACSL4 group in E? These raise severe concerns about the data quality.

      We thank the reviewer for pointing out these problems.

      Response to question 1: In Fig. 7C, we used 293T cells, which are not breast cancer cells, for the ubiquitination assay. The efficiency of over-expression differs between ZEB2 and ACSL4 in 293T cells.

      Response to question 2: This is because the expression of ACSL4 is low in MCF-7 cells and high in MDA-MB-231 cells. In Figure 7E (new Fig. 7G), we used MDA-MB-231 cells for the control and ACSL4 knockdown cells. In Figure 7G (new Fig. 7I), we used MCF-7 cells for the control and ACSL4 over-expressing cells. We have also revised the figure legend of Fig. 7 as follows:

      I, The stability of ZEB2 protein was detected by CHX treatment assay in control or ACSL4 over-expressed MCF-7 cells. GAPDH was used as the internal loading control.

      Response to question 3: Because knockdown of ACSL4 also significantly decreased the mRNA level of ZEB2 (new Fig. 7A), the baseline levels of ZEB2 in the shACSL4 group (new Fig. 7G) were very low and degradation is not obvious.

      9) Figure 7D, the IP result of ACSL4 is not justified as there is no enrichment of ACSL4 in the IP compared to input. With the current data, it is hard to justify that there is any direct interaction. Moreover, based on IF data in Figure 3B-C, ACSL4 is exclusively localized in the cytoplasm, while ZEB2 is exclusively localized in the nucleus. It is hard to believe there is any direct interaction and mutual regulation.

      We appreciate the reviewer’s thoughtful questions. We have repeated the IP assay and observed enrichment of ACSL4 in the IP; this result has been added as new Fig. 7E and is shown in Author response image 8. We also repeated the immunofluorescence assay in MDA-MB-231 cells. We observed that ZEB2 can also be found in the cytoplasm, where it co-localized with ACSL4 in certain regions, as shown in Author response image 9 (Supplementary Fig. S11):

      Author response image 8.

      Author response image 9.

      Reviewer #2 (Public Review):

      In this study, the authors validated a positive feedback loop between ZEB2 and ACSL4 in breast cancer, which regulates lipid metabolism to promote metastasis.

      Overall, the study is original, well structured, and easy to read. Despite the reliability of the data discussed in this article, there are still some deficiencies that need to be addressed through further explanation.

      Major issues:

      1) The authors demonstrated that ACSL4 regulates ZEB2 not only via a post-transcriptional mechanism but also via a transcriptional mechanism. The authors have not provided a comprehensive explanation of the specific mechanism in this paper. Therefore, it is recommended that the author delve into the potential mechanisms in the discussion section. For example, related mechanisms affecting ZEB2 ubiquitination degradation, as well as factors affecting ZEB2 upstream transcriptional regulation, etc.

      We appreciate the positive comments and constructive suggestion from the reviewer. We have added the following text to the second paragraph of the discussion section:

      Interestingly, our RNA-seq data revealed that some ubiquitin E3 ligases, such as FBXO4, UBE3C, NEDD4, and RBX1, were significantly reduced in ACSL4 knockdown cells (Fig. S12). This result indicated that ACSL4 may reduce the ubiquitination of ZEB2 by down-regulating ubiquitin E3 ligases. Additionally, we found that ACSL4 promoted ZEB2 transcription, as the mRNA level of ZEB2 was significantly reduced after ACSL4 knockdown. A recent study reported that LD-derived lipolysis provides acetyl-CoA for the epigenetic regulation of gene transcription. We observed that ACSL4 can also promote FAO, which generates acetyl-CoA for epigenetic regulation. It is likely that ACSL4 regulates the ZEB2 mRNA level via a lipid-epigenetic reprogramming mechanism, which is worth studying in the future.

      2) To further clarify the interaction of ZEB2 and ACSL4, it is best to perform in vitro glutathione-S-transferase (GST) pulldown assay and immunofluorescence assay.

      We appreciate the reviewer’s suggestion. We performed a GST pull-down assay to examine whether ZEB2 and ACSL4 form a complex. The assay confirmed the interaction between ZEB2 and ACSL4, as shown in Author response image 10 (Supplementary Fig. S10). We also performed an immunofluorescence assay and found that ZEB2 co-localized with ACSL4 in certain regions of the cytoplasm, as shown in Author response image 11 (Supplementary Fig. S11):

      Author response image 10.

      Author response image 11.

      3) In Figure 7B, the protein level of ZEB2 seems not to be altered in BT549 BCSC cell line after the depletion of ACSL4.

      We thank the reviewer for pointing out this problem. ZEB2 protein is not as abundant in BT549 BCSC cells as in MDA-MB-231 cells. We repeated the experiment and found that ZEB2 was reduced after the depletion of ACSL4 in BT549 cells. We have replaced Fig. 7B as shown in Author response image 12:

      Author response image 12.

      4) EMT is characterized by changes in cell morphology, so the staining of cytoskeletons with Phalloidin is needed.

      We appreciate the reviewer’s suggestion and performed the staining. The results show that the ACSL4 knockdown cells had a significantly smaller length-to-width ratio than the control group (p<0.05), which indicates a reversal of the EMT process. We have put the results in Supplementary Fig. S4, as shown in Author response image 13:

      Author response image 13.

      5) Additional breast cancer cases or cohorts (such as TMA) should be used to validate the positive correlation between ACSL4 and ZEB2 expression through IHC analysis.

      We thank the reviewer for the suggestion. To better understand the positive correlation between ACSL4 and ZEB2 expression, we increased the number of breast cancer cases to 45 for IHC analysis and validated the positive correlation between ACSL4 and ZEB2. We have put the results into Fig. 1H and I, as shown in Author response image 14:

      Author response image 14.

      Reviewer #3 (Public Review):

      The manuscript by Lin et al. reveals a novel positive regulatory loop between ZEB2 and ACSL4, which promotes lipid droplet storage to meet the energy needs of breast cancer metastasis. It is of interest; however, some concerns should be addressed to strengthen the findings.

      Major concerns:

      1) The effect of ZEB2 overexpression is not fully demonstrated in the whole study. This point should be addressed.

      We appreciate the positive comments and constructive suggestion from the reviewer. We have generated a ZEB2-overexpressing MCF7 cell line. Over-expression of ZEB2 significantly enhanced the metastatic and invasive capacities of MCF7 cells (Supplementary Fig. S5A and S5B).

      Author response image 15.

      2) Does the addition of oleate restore the ability of migration or invasion in ACSL4 knockdown cells?

      We thank the reviewer for the question. To address this point, oleate was added to the culture medium of ACSL4 knockdown cells. As expected, the addition of oleate clearly restored the invasive and metastatic capacities of ACSL4 knockdown cells, by 33.12% and 18.61%, respectively. We have added the results to the new Fig. 4D, as shown in Author response image 16:

      Author response image 16.

      3) Which cellular compartment does ACSL4 localize in and interact with ZEB2 to stabilize ZEB2?

      We thank the reviewer for the question. We have repeated the immunofluorescence assay in MDA-MB-231 cells. We observed that ZEB2 can also be found in the cytoplasm, where it co-localized with ACSL4 in certain regions (Supplementary Fig. S11).

      4) The ubiquitination assay and Co-IP assay are just performed in HEK293T cells. This result should be confirmed in MDA-MB-231 cells or Taxol-resistant MCF-7 cells.

      We appreciate the reviewer’s suggestion. We performed the ubiquitination assay and IP assay in MDA-MB-231 cells, as shown below. The results confirm that knockdown of ACSL4 clearly enhanced the ubiquitination of ZEB2. We have put the results into Fig. 7D and 7F, as shown in Author response image 17:

      Author response image 17.

      5) How does ACSL4 regulate ZEB2 at the mRNA level? Please verify.

      We thank the reviewer for the thoughtful question. A recent study reported that LD-derived lipolysis provides acetyl-CoA for the epigenetic regulation of gene transcription. We observed that ACSL4 can promote FAO, which generates acetyl-CoA for epigenetic regulation. It is likely that ACSL4 regulates the ZEB2 mRNA level via a lipid-epigenetic reprogramming mechanism, which is worth studying in the future. We have added the following sentences to the second paragraph of the discussion section:

      Additionally, we found that ACSL4 promoted ZEB2 transcription, as the mRNA level of ZEB2 was significantly reduced after ACSL4 knockdown. A recent study reported that LD-derived lipolysis provides acetyl-CoA for the epigenetic regulation of gene transcription. We observed that ACSL4 can also promote FAO, which can generate acetyl-CoA for epigenetic regulation. It is likely that ACSL4 regulates the ZEB2 mRNA level via a lipid-epigenetic reprogramming mechanism, which is worth studying in the future.

      6) In Fig. 2F, the silencing efficiency for ACSL4 and ZEB2 should be shown. In addition, the protein level of ZEB2 or ACSL4 in shZEB2 and shZEB2+ACSL4 groups should also be addressed.

      We appreciate the reviewer's suggestions. We have added the protein levels to Fig. 2F and 2H, as shown in Author response image 18:

      Author response image 18.

      7) What is the survival status of patients with both high expression of ACSL4 and ZEB2 in TCGA? In addition, more survival data from databases, especially for patients with both high expression of ACSL4 and ZEB2, need to be analyzed to support the finding.

      We thank the reviewer for the constructive suggestion. We repeated the Kaplan-Meier survival analysis of TCGA RNA-seq data using R 4.3.0. The survival data show that patients with high expression of both ACSL4 and ZEB2 have the worst prognosis among the four groups (P<0.05) (new Fig. 1D-F).

      Reviewer #1 (Recommendations For The Authors):

      10) Only one siRNA/shRNA was used for knockdown in one cell line. Different siRNAs/shRNAs and multiple cell lines should be used to rule out off-target effects.

      We appreciate the reviewer’s suggestion. We tested three siRNAs and shRNAs for knockdown efficiency (the negative control siRNA and the ACSL4 and ZEB2 siRNAs were from GenePharma) and chose the sequence with the best knockdown effect.

      Author response image 19.

      11) Western blot data are required to justify the overexpression or knockdown efficiency of ACSL4 in cells in Figure 2 A-C.

      We thank the reviewer for the suggestion. We have added the following western blot data to Figure 2:

      Author response image 20.

      12) In Figure 1G, there is a huge variation of the protein input, which makes the results not justified. The authors should repeat the experiments to ensure consistency and reproducibility of the results.

      We thank the reviewer for pointing out this problem. These are tissue samples from breast cancer patients. The results are affected by differences in tumor tissue composition between patient samples, and it is difficult to obtain fresh tissues. In our paper, paraffin specimens were used for IHC staining, and the results confirmed that ACSL4 and ZEB2 are positively correlated. We have put the results in the supplementary data.

      Reviewer #2 (Recommendations For The Authors):

      1) Data from Figure 1A showed the EMT transcription factor SNAIL was also among the top upregulated genes. Please explain why the association between ACSL4 and ZEB2 was studied instead of ACSL4 and SNAIL.

      We appreciate the reviewer’s question. We calculated the correlation between ACSL4 and SNAIL using Pearson’s correlation test. The correlation between ACSL4 and SNAIL is 0.33, lower than that between ACSL4 and ZEB2. Besides, the binding motif analysis reveals that the consensus sequence of the ZEB transcription factor family is within the ACSL4 promoter. Thus, we investigated the relationship between ACSL4 and ZEB2 in breast cancer cells.

      Author response image 21
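
      For reference, the Pearson coefficient mentioned above is the covariance of the two expression vectors divided by the product of their standard deviations. A minimal sketch, using invented values rather than the study's expression data, is:

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Return the Pearson correlation coefficient between x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / sqrt(sxx * syy)


# Invented expression values for two genes across four samples.
gene_1 = [1.0, 2.0, 3.0, 4.0]
gene_2 = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(gene_1, gene_2))
print(pearson_r(gene_1, list(reversed(gene_2))))
```

      A coefficient near 1 or -1 indicates a strong linear relationship, while a value such as 0.33 indicates a weak one.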

      2) What is the limitation of your study? Please add some relevant description in the part of discussion.

      We appreciate the reviewer’s suggestion. We have added the description of limitation of our study in the last paragraph of discussion section as follows:

      The limitation of this study is that only 45 clinical samples were included. Future studies should expand the number of clinical samples and cases to provide more clinical evidence for the crucial role of ACSL4 in breast cancer metastasis.

      3) In the Figure 3 legend, the authors used the word "knockout", which is a description error.

      We appreciate the reviewer’s advice. We have corrected "knockout" to "knockdown".

      Reviewer #3 (Recommendations For The Authors):

      Minor concerns:

      1) In line 352-353, the statement about whether the high or low expression of ACSL4 and ZEB2 or the advanced breast cancer affects prognosis is inaccurate.

      We thank the reviewer for pointing out this problem. We have corrected the statement to “We found that patients with higher ACSL4 or ZEB2 expression, especially those with simultaneous high expression, had worse prognosis than those with lower expression”.

      2) The title of the seventh part of your results contains a logical error that is opposite to the experimental conclusion.

      We sincerely thank the reviewer for pointing out this problem. We have changed the title of the seventh part of the results to “ACSL4 regulates ZEB2 mRNA expression and protein stabilization”.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This study by Sokač et al. entitled "GENIUS: GEnome traNsformatIon and spatial representation of mUltiomicS data" presents an integrative multi-omics approach which maps several genomic data sources onto an image structure on which established deep-learning methods are trained with the purpose of classifying samples by their metastatic disease progression signatures. Using published samples from the Cancer Genome Atlas the authors characterize the classification performance of their method which only seems to yield results when mapped onto one out of four tested image-layouts.

      Major recommendations:

      • In its current form, GENIUS analysis is neither computationally reproducible nor are the presented scripts on GitHub generic enough for varied applications with other data. The GENIUS GitHub repository provides a collection of analysis scripts and not a finished software solution (e.g. command line tool or other user interface) (the presented scripts do not even suffice for a software prototype). In detail, the README on their GitHub repository is largely incomplete and reads analogous to an incomplete and poorly documented analysis script and is far from serving as a manual for a generic software solution (this claim was made in the manuscript).

      We apologize for this oversight, and we have now invested considerable resources into making the documentation more detailed and accurate. We have created a new GitHub repository (https://github.com/mxs3203/GENIUS) that contains a small set of example data and all the necessary scripts to run GENIUS. The README file guides the user through each step of the GENIUS framework, and a bash script is also provided that runs all the steps at once. Users who would like to run GENIUS on their own data need to replace the input data with their own data in the same format as the example input data. This is now fully documented in the README file. All scripts have arguments that can be used to point to custom data. The entire pipeline can be run on the example data using the run_genius.sh script. This script will produce CSV files and PNG files inside the ExtractWithIG folder containing attribution scores for every cancer type tested.

      The authors should invest substantially into adding more details on how data can be retrieved (with example code) from the cited databases and how such data should then be curated alongside the input genome to generically create the "genomic image".

      Data for analysis can be sourced from multiple locations; the data we used in our examples and for development came from the TCGA. It can be retrieved from the official TCGA data hub or through Xena Browser (https://xenabrowser.net/). However, the data formats are generic, and similar data types (mutation, expression, methylation, copy number) can be obtained from multiple sources. We have added example data to demonstrate the layout, and we have included a script that creates the layout from standard mutation, expression, methylation and copy number data formats. We have substantially improved the annotations, including detailed descriptions of the data layout along with examples, and, as part of our validation, we have had an independent person test-run the scripts using the TCGA example data we provided on the new GitHub page.

      In addition, when looking at the source code, parameter configurations for training and running various modules of GENIUS were hard-coded into the source code and users would have to manually change them in the source code rather than as command line flags in the software call. Furthermore, file paths to the local machine of the author are hard-coded in the source code, suggesting that images are sourced from a local folder and won't work when other users wish to replicate the analysis with other data. I would strongly recommend building a comprehensive command line tool where parameter and threshold configurations can be generically altered by the user via command line flags.

      Apologies, we have changed the code and removed all hard-coded paths. All paths are now relative to the script using them. Furthermore, we made the config file more visible and easier to use. The example run can be found on the new github repository we linked in the previous comment.

      We also inserted the following text in the manuscript:

      "The GitHub repository contains example data and instructions on how to use the GENIUS framework."

      A comprehensive manual would need to be provided to ensure that users can easily run GENIUS with other types of input data (since this is the claim of the manuscript). Overall, due to the lack of documentation and hard-coded local-machine folder paths it was impossible to computationally reproduce this study or run GENIUS in general.

      Apologies, we have completely reworked the code base and extensively annotated the code. We have also written highly detailed step-by-step instructions that should enable any user to run GENIUS on their own or public data.

      • In the Introduction the authors write: "To correct for such multiple hypothesis testing, drastic adjustments of p-values are often applied which ultimately leads to the rejection of all but the most significant results, likely eliminating a large number of weaker but true associations.". While this is surely true for any method attempting to separate noise from signal, their argument fails to substantiate how their data transformation will solve this issue. Data transformation and projection onto an image for deep-learning processing will only shift the noise-to-signal evaluation process to the postprocessing steps and won't "magically" solve it during training.

      The data transformation does not solve the problem of multiple hypothesis testing but it facilitates the use of computer vision algorithms and frameworks on rich multi-omics data. Importantly, transforming the data into genome images, training the model, and inspecting it with integrated gradients can be interpreted as running a single test on all of the data.

      Analyzing multiomics data using classical statistical methods typically means that we perform extensive filtering of the data, removing genes with poor expression/methylation/mutation scores, and then e.g. perform logistic regression against a desired outcome, or alternatively, perform multiple statistical tests comparing each genomic feature independently against a desired outcome. Either way, information is lost during initial filtering and we must correct the analysis for each statistical test performed. While this increases confidence in whichever observation remains significant, it also undoubtedly means that we discard true positives. Additionally, classical statistical methods such as those mentioned here do not assume a spatial connection between data points, thus any relevant information relating to spatial organization is lost.

      Instead, we propose the use of the GENIUS framework for multiomics analysis. The GENIUS framework is based on deep neural nets and relies on convolutions and their ability to extract interactions between the data points. This explicitly takes spatial information into account, which is not possible with classical statistical methods such as logistic regression, where the closest equivalent would require fitting many models with many interaction terms.

      Furthermore, integrated gradients is a non-parametric approach that simply evaluates the trained model relative to input data and output label, resulting in attribution for each input with respect to the output label. In other words, integrated gradients represent the integral of gradients with respect to inputs along the path from a given baseline to input. The integral is described in Author response image 1:

      Author response image 1.
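
      For reference, the definition given in the original integrated gradients paper can be written as follows. The notation is that of the cited paper and is an assumption about what the image displays: F is the trained model, x the input, x' the baseline, and i the index of an input feature.

```latex
\mathrm{IntegratedGrads}_i(x) = (x_i - x'_i) \int_{0}^{1}
  \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha
\;\approx\; (x_i - x'_i)\,\frac{1}{m} \sum_{k=1}^{m}
  \frac{\partial F\bigl(x' + \tfrac{k}{m}(x - x')\bigr)}{\partial x_i}
```

      The right-hand side is the finite-sum approximation over m steps along the straight line from the baseline to the input, which is how the integral is evaluated in practice.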

      More about integrated gradients can be read on the Captum webpage (https://captum.ai/docs/introduction) or in the original paper (https://arxiv.org/abs/1703.01365).

      Since we transformed the data into a data structure (genome image) that assumes a spatial connection between genes, trained the model using convolutional neural networks and analyzed the model using integrated gradients, we can treat the results without any parametric assumption. As a particular novelty, we can sort the list of genes by attribution score and take the top N genes as candidate biomarkers for the variable of interest, then proceed with downstream analysis or potentially functional validation in an in vitro setting. In this manner, the reviewer is correct that the signal-to-noise evaluation is shifted to the post-processing steps. The particular benefit of the GENIUS framework, however, is that it enables integration of multiple data sources without any filtering and constructs a novel data structure that facilitates investigation of spatial dependency between data points, thus potentially revealing novel genes or biomarkers that would previously have been removed in filtering steps. Further downstream validation of these hits remains critical.
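
      The attribute-then-rank procedure described above can be sketched on a toy example. This is only an illustration under stated assumptions: the three-input logistic "model", the gene names, the all-zero baseline, the finite-difference gradient and the ranking by absolute attribution are all hypothetical choices, not the GENIUS implementation (which uses a trained convolutional network).

```python
import math

def model(x):
    """Toy stand-in for a trained model: a logistic unit over three inputs."""
    weights = [2.0, -1.0, 0.5]  # hypothetical weights
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def integrated_gradients(f, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, gradient(f, point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

def top_n(genes, attributions, n):
    """Rank genes by absolute attribution (an assumption), largest first."""
    ranked = sorted(zip(genes, attributions), key=lambda p: abs(p[1]), reverse=True)
    return ranked[:n]

genes = ["GENE_A", "GENE_B", "GENE_C"]  # hypothetical gene names
x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(model, x, baseline)
candidates = top_n(genes, attributions, 2)
print(candidates)
```

      A useful sanity check is that the attributions sum to the difference between the model output at the input and at the baseline, which holds here up to the numerical error of the approximation.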

      We added the following paragraph to make this clearer:

      "Integrated Gradients is a non-parametric approach that evaluates the trained model relative to input data and output label, resulting in attribution scores for each input with respect to the output label. In other words, Integrated Gradients represent the integral of gradients with respect to inputs along the path from a given baseline. By using Integrated Gradients, we provide an alternative solution to the problem posed by performing multiple independent statistical tests. Here, instead of performing multiple tests, a single analysis is performed by transforming multiomics data into genome images, training a model, and inspecting it with Integrated Gradients. Integrated Gradients will output an attribution score for every gene included in the genome image and those can be ranked in order to retrieve a subset of the most associated genes relative to the output variable."

      In addition, multiple-testing correction is usually done based on one particular data source (e.g. expression data), while their approach claims to integrate five very different genomic data sources with different levels and structures of technical noise. How are these applications comparable and how is the training procedure able to account for these different structures of technical noise? Please provide sufficient evidence for making this claim (especially in the postprocessing steps after classification).

      The reviewer is correct that there will be different technical noise for each data source. However, each data source is already processed by standardized pipelines used for interpreting sequence-level data into gene expression, mutations, copy number alterations and methylation levels. Thus, sequence-level technical noise is not evaluated as part of the GENIUS analysis. Nevertheless, the reviewer is correct that sample-level technical noise, such as low tumor purity or poor-quality sequencing, can undoubtedly affect the GENIUS predictions, as is true for all types of sequence analysis. As part of GENIUS, an initial data preprocessing step (performed automatically as part of the image generation) normalizes each data source within that source and linearly scales it to the range zero to one (min-max scaling). This normalization step means that the impact of different events within and between data sources is comparable, since the largest/smallest value from one data source will be comparable to the largest/smallest value from another data source.
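
      As a minimal sketch of the per-source min-max scaling described above (the function name, the example values and the handling of a constant source are illustrative assumptions, not taken from the GENIUS code):

```python
def min_max_scale(values):
    """Linearly rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant data source carries no contrast, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical per-gene values for three data sources on very different scales.
sources = {
    "expression": [0.0, 5.0, 10.0],
    "methylation": [0.2, 0.4, 0.8],
    "copy_number_gain": [1.0, 3.0, 2.0],
}

# Each source is scaled independently, so its own extremes map to 0 and 1.
scaled = {name: min_max_scale(vals) for name, vals in sources.items()}
print(scaled["expression"])  # [0.0, 0.5, 1.0]
```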

      Additionally, deep neural networks, particularly convolutional networks, have been shown to be very robust to different levels of technical noise (Jang, McCormack, and Tong 2021; Du et al. 2022). In the manuscript we show the attribution scores for different cancer types in figure 3B. Here, the top genes include established cancer genes such as TP53, VHL, PTEN, APC and PIK3CA, indicating that attribution scores based on GENIUS analysis are a valid means of identifying potential genes of interest. Furthermore, when focusing the analysis on predicting metastatic bladder cancer, we were able to show that of the top 10 genes with the highest attribution scores, 7 showed significant association with poor outcome in an independent validation cohort of mostly metastatic patients (shown in figure 4).

      • I didn't find any computational benchmark of GENIUS. What are the computational run times, hardware requirements (e.g. memory usage) etc that a user will have to deal with when running an analogous experiment, but with different input data sources? What kind of hardware is required GPUs/CPUs/Cluster?

      We apologize for not including this information in the manuscript. We added the following section to the manuscript:

      "Computational Requirements

      In order to train the model, we used the following hardware configuration: Nvidia RTX3090 GPU, AMD Ryzen 9 5950X 16 core CPU, and 32Gb of RAM memory. In our study, we used a batch size of 256, which occupied around 60% of GPU memory. Training of the model was dependent on the output variable. For metastatic disease prediction, we trained the model for approximately 4 hours. This could be changed since we used early stopping in order to prevent overfitting. By reducing the batch size to smaller numbers, the technical requirements are reduced making it possible to run GENIUS on most modern laptops."

      • A general comment about the Methods section: Models, training, and validation are very vaguely described and the source code on GitHub is very poorly documented so that parameter choices, model validation, test and validation frameworks and parameter choices are neither clear nor reproducible.

      Apologies, we have updated the methods section with more details on models, training and validation. Additionally, we have moved the section on evaluating model performance from the methods section to the results section, with more details on how training was performed.

      We also agree that the GitHub page was not sufficiently detailed or well structured. To remedy this, we have made a new GitHub page that contains only the code needed for analysis, example input data, example runs, and an environment file with all library versions. The GitHub repository link has also been updated in the manuscript.

      The new GitHub page can be found on: https://github.com/mxs3203/GENIUS

      Please provide a sufficient mathematical definition of the models, thresholds, training and testing frameworks.

      We sincerely apologize, but we do not entirely follow the reviewer’s request in this regard. The mathematical definitions of deep neural networks are extensive and not commonly included in research publications utilizing deep learning. We have used PyTorch, a commonly used platform, to implement the deep neural net; it is now referenced in the methods. The design of the deep learning network used for GENIUS is described in figure 1, and the relevant parameters and hyperparameters are described in the methods section. The hyperparameters are as follows:

      "All models were trained with Adagrad optimizer with the following hyperparameters: starting learning rate = 9.9e-05 (including learning rate scheduler and early stopping), learning rate decay and weight decay = 1e-6, batch size = 256, except for memory-intensive chromosome images where the batch size of 240 was used."

      • In chapter "Latent representation of genome" the authors write: "After successful model training, we extracted the latent representations of each genome and performed the Uniform Manifold Approximation and Projection (UMAP) of the data. The UMAP projected latent representations into two dimensions which could then be visualized. In order to avoid modeling noise, this step was used to address model accuracy and inspect if the model is distinguishing between variables of interest.". In the recent light of criticism when using the first two dimensions of UMAP projections with omics data, what is the evidence in support of the author's claim that model accuracy can be quantified with such a 2D UMAP projection? How is 'model accuracy' objectively quantified in this visual projection?

      We apologize for not clarifying this. The UMAP was done on L, the latent vector, which by assumption should capture the most important information from the “genome image”. In order to confirm this, we plotted the first two dimensions of UMAP transformation and colored the points by the output variable. If the model was capturing noise, there should not be any patterns on the plot (randomized cancer-type panel). Since, in most cases, we do see an association between the first two UMAP dimensions and the output variable, we were confident that the model was not modeling (extracting) noise.

      To clarify this, we changed the sentence in the manuscript to make clear that this is not an estimation of accuracy but only an initial inspection of the models:

      "The UMAP projected latent representations into two dimensions which could then be visualized. In order to avoid modeling noise, this step was used to inspect if the model is distinguishing between variables of interest."

      • In the same paragraph "Latent representation of genome" the authors write: "We observed that all training scenarios successfully utilized genome images to make predictions with the exception of Age and randomized cancer type (negative control), where the model performed poorly (Figure 2B).". Did I understand correctly that all negative controls performed poorly? How can the authors make any claims if the controls fail? In general, I was missing sufficient controls for any of their claims, but openly stating that even the most rudimentary controls fail to deliver sufficient signals raises substantial issues with their approach. A clarification would substantially improve this chapter combined with further controls.

      We apologize for not stating this more clearly. Randomized cancer type was used as a negative control, since we expected that the model would not be able to make sense of the data when predicting randomized cancer type. As expected, the model failed to predict the randomized cancer types. This can be seen in Figure 2C, where UMAP representations (based on the latent representation of the data, the vector L) are made for each output variable. The absence of any pattern in the UMAP shows that, as expected, the model cannot extract useful information from the “genome image” when predicting randomized cancer type (when the labels are randomly shuffled, there is no genomic information to decipher). Similar patterns were observed for Age, indicating that patient age cannot be determined from the multi-omics data. Conversely, when GENIUS was trained against wGII, TP53, metastatic status, and cancer type, we observed that samples clustered according to the output label.

      Reviewer #2 (Public Review):

      In this manuscript, Birkbak and colleagues use a novel approach to transform multi-omics datasets in images and apply Deep Learning methods for image analysis. Interestingly they find that the spatial representation of genes on chromosomes and the order of chromosomes based on 3D contacts leads to best performance. This supports that both 1D proximity and 3D proximity could be important for predicting different phenotypes. I appreciate that the code is made available as a github repository. The authors use their method to investigate different cancers and identify novel genes potentially involved in these cancers. Overall, I found this study important for the field.

      The major points of this manuscript could be grouped in three parts:

      1) While the authors have provided validation for their model, it is not always clear that best approaches have been used.

      a) In the methods there is no mention of a validation dataset. I would like to see the authors training on a cancer from one cohort and predict on the same cancer from a different cohort. This will convince the reader that their model can generalise. They do something along those lines for the bladder cancer, but no performance is reported. At the very least they should withhold a percentage of the data for validation. Maybe train on 100 and validate on the remaining 300 samples. They might have already done something along these lines, but it was not clear from the methods.

      We apologize for not being sufficiently clear in the manuscript. We did indeed validate the performance within the TCGA cohort, using holdout cross-validation. Here, we trained the network on 75% of the cohort samples (N = 3825), and tested on the remaining 25% (N = 1276).
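
      A holdout split of this kind can be sketched as follows. The sample identifiers, the seed and the function name are hypothetical; only the 75%/25% proportions and the resulting counts are taken from the text above.

```python
import random

def holdout_split(sample_ids, train_fraction=0.75, seed=0):
    """Shuffle once, then cut into a training set and a fixed validation set."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = int(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

# 3825 + 1276 = 5101 samples in total, matching the counts reported above.
samples = [f"sample_{i}" for i in range(5101)]  # placeholder sample IDs
train, validation = holdout_split(samples)
print(len(train), len(validation))  # 3825 1276
```

      The validation set is held fixed across epochs, so the reported AUC always refers to the same 25% of samples.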

      To make this clearer, we have rewritten the section “GENIUS classification identifies tumors likely to become metastatic” as follows:

      "The omics data types included somatic mutations, gene expression, methylation, copy number gain and copy number loss. Using holdout type cross-validation, where we split the data into training (75%) and validation (25%), we observed a generally high performance of GENIUS, with a validation AUC of 0.83 for predicting metastatic disease (Figure 2B)."

      We also added the following sentence in the legend of Figure 2:

      "The x-axis represents epochs and y-axis represents AUC score of fixed 25% data we used for accuracy assessment within TCGA cohort."

      The accuracy of GENIUS could not be validated on the other two bladder cohorts since they do not contain all the data for the creation of five-dimensional genome images. However, we were able to investigate if the genes with the highest attribution scores towards metastatic bladder cancer obtained based on the TCGA samples also showed a significant association with poor outcome in the two independent bladder cancer cohorts. Here, we observed that of the top 10 genes with the highest attribution scores, 5 were associated with poor outcome in the early stage bladder cancer cohort, and 7 were associated with poor outcome in the late stage/metastatic bladder cancer cohort.

      b) It was not clear how they used "randomised cancer types as the negative control". Why not use normal tissue data or matched controls?

      In the study, we built six models, one for each variable of interest. One of them was cancer type which performed quite well. In order to assess the model on randomized data, we randomized the labels of cancer type and tried predicting that. This served as “negative control” since we expected the model to perform poorly in this scenario. To make this more clear in the manuscript, we have expanded the description in the main text. We have also added the description of this to each supplementary plot to clarify this further.
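
      The label randomization can be illustrated with a short sketch. The cancer-type codes, the seed and the function name are illustrative assumptions; the point is only that the labels are permuted while the genome images stay untouched, so any genuine link between image and label is destroyed.

```python
import random
from collections import Counter

def randomize_labels(labels, seed=0):
    """Return a shuffled copy of the labels; class frequencies are preserved."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Hypothetical cancer-type labels, one per sample, in sample order.
labels = ["BLCA"] * 4 + ["BRCA"] * 4 + ["LUAD"] * 4
control_labels = randomize_labels(labels)

# Same label distribution as before, but no longer tied to the right samples.
print(Counter(labels) == Counter(control_labels))  # True
```

      A model trained against such labels is expected to perform at chance level, which is what makes it usable as a negative control.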

      While normal tissue and matched controls would have been an optimal solution, unfortunately, such data is not available.

      c) In Figure 2B, the authors claim they have used cross validation. Maybe I missed it, but what sort of cross validation did they use?

      We apologize for not being sufficiently clear. As described above, we used holdout cross-validation to train and evaluate the model. We clarified this in the text:

      "Using holdout type cross-validation, where we split the data into training (75%) and validation (25%), we observed a generally high performance of GENIUS, with a mean validation AUC of 0.83 (Figure 2B)"

      2) Potential improvement to the method

      a) It is very encouraging the use of HiC data, but the authors used a very coarse approach to integrate it (by computing the chromosome order based on interaction score). We know that genes that are located far away on the same chromosome can interact more in 3D space than genes that are relatively close in 1D space. Did the authors consider this aspect? Why not group genes based on them being located in the same TAD?

      We thank the reviewer for this suggestion, and we will start looking into how to use TAD information to create another genome representation. In this study, we tried several genome transformations, which proved to be superior to a flat vector of features (no transformation). We are aware that the square genome transformation might not be optimal, so we designed a network that reconstructs the genome image during training. This way, the genome image is optimized for the output variable of choice by the network itself. However, we note that the order of the genes themselves, while currently based on HiC, can be changed by the user. The order is determined by a simple input file, which can be changed by the user with the argument “all_genes_included”. Thus, different orderings can be tested within the overall square layout. This is now detailed in the instructions on the new GitHub page.

      The convolutional neural network uses a kernel size of 3x3, which captures the patterns of genes positioned close to each other but also genes that are far away from each other (potentially on another chromosome). Once convolutions extract patterns from the image, the captured features are used in a feed-forward neural network that makes a final prediction using all extracted features/patterns regardless of their location in the genome image.
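
      One way to see how a 3x3 kernel can span distant genes is to look at which positions in the linear gene order fall under a single kernel window. The sketch below assumes that genes are written into the square image row by row and uses a hypothetical image width of 140; neither detail is taken from the GENIUS code.

```python
def kernel_neighbors(gene_index, width, n_genes):
    """Gene indices covered by a 3x3 kernel centred on gene_index,
    assuming a row-by-row (row-major) fill of a width-wide image."""
    row, col = divmod(gene_index, width)
    covered = []
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            r, c = row + d_row, col + d_col
            if r >= 0 and 0 <= c < width:
                idx = r * width + c
                if idx < n_genes:  # skip positions past the last gene
                    covered.append(idx)
    return covered

width = 140                 # hypothetical image width
n_genes = width * width     # hypothetical number of genes
print(kernel_neighbors(1000, width, n_genes))
# [859, 860, 861, 999, 1000, 1001, 1139, 1140, 1141]
```

      Under this assumption the window around gene 1000 contains its immediate neighbours in the linear order (999, 1001) as well as genes a full row away (around 860 and 1140), which may lie in a different region or on another chromosome.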

      We also inserted the following sentence in discussion:

      "Given that spatial organization improved the prediction, we recognize that there may exist a more optimal representation of multi-omics data which should be explored further in future work. Potential methods for organizing gene orientation in a 2D image could consider integrating topologically associating domains[39] along with the spatial information from HiC. This is already possible to explore with the current implementation of GENIUS, where gene layout can be set manually by the user."

      b) Authors claim that "given that methylation negatively correlates with gene expression, these were considered together". This is clearly not always the case. See for example https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02728-5. What would happen if they were not considered together?

      We thank the reviewer for this insightful comment. We agree with the reviewer that methylation does not always result in lower expression, although methylation levels should in most cases correlate negatively with RNA expression, albeit with a gene-specific factor. Indeed, tools have been developed that infer RNA expression from methylation using gene-specific correction factors, e.g. MethCORR (Mattesen, Andersen, and Bramsen 2021).

      However, upon reflection we agree with the reviewer that we cannot assume for all genes that methylation equals low expression. Therefore, we have performed an analysis where we compared methylation levels to gene expression levels for all tested genes within bladder cancer. We computed Pearson’s correlation for the 16,456 genes that have both methylation and expression scores. Of these, 8,528 showed a negative correlation. After p-value correction, this resulted in 4,774 genes where methylation was significantly negatively associated with expression. For these genes we performed the subsequent analysis in bladder cancer, where methylation and expression were considered together. This updated analysis has been included in supplementary figure 10, and the results section has been amended to reflect this. Overall, this analysis resulted in 4 of 10 genes being replaced in the downstream analysis. However, we note that the final results did not materially change, nor did the conclusions.
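
      The per-gene correlation and correction steps can be sketched as below. The text does not name the correction method, so the Benjamini-Hochberg procedure used here is an assumption, as are the function names and the toy values; the computation of the p-values themselves is not shown.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy values for one gene across five samples (hypothetical).
methylation = [0.1, 0.3, 0.5, 0.7, 0.9]
expression = [9.0, 7.5, 6.0, 3.0, 1.0]
r = pearson_r(methylation, expression)

# Toy p-values for four genes (hypothetical).
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(r < 0, adjusted)
```

      Genes would then be retained when r is negative and the adjusted p-value falls below the chosen significance threshold.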

      Author response image 2.

      Correlation between gene-level methylation and gene expression in TCGA BLCA cohort

      3) Interesting results that were not explained.

      a) In Figure 3A methylation seems to be the most important omics data, but in 3B, mutations and expression are dominating. The authors need to explain why this is the case.

      We apologize for not explaining this in more detail. Figure 3B shows the attribution scores scaled within each cancer type, whereas Figure 3A shows raw attribution scores for each data source included. The reason is that methylation and expression generally have smaller attribution scores spread over more events, whereas mutation data are often characterized by a single mutation with a large attribution score and the remaining mutations with very small attribution scores. In order to make those numbers comparable and to take into account biological differences between the cancer types, we scaled the scores within each cancer type.

      To make this clearer, we modified the first sentence in the “Interpreting the GENIUS model classifying metastatic cancer biology” section:

      "Analysing raw attribution scores we concluded the most informative data type overall regarding the development of metastatic disease was methylation (Figure 3A). …We also noticed that mutation data often had a single mutation with large attribution score where expression and methylation showed multiple genes with high attribution scores… … The normalization step is crucial to make results comparable as underlying biology is different in each cancer type included in the study."  

      Reviewer #1 (Recommendations For The Authors):

      • While I appreciate the creative acronym of the presented software solution (GENIUS), it may easily be confused with the prominent software Geneious (bioinformatics software for sequence data analysis), which is often employed in molecular life science research. I would suggest renaming the tool.

      We appreciate the comment but prefer to keep the name. Given that the abbreviation is not exactly the same and the utility is different, we are confident that there will be no accidental mixup between these tools.

      • A huge red flag is the evaluation of the input image design which clearly shows that classification power after training is insufficient for three out of four image layouts (and even for the fourth AUC is between 0.70-0.84 depending on the pipeline step and application). Could the authors please clarify why this isn't cherry-picking (we use the one layout that gave some form of results)? In light of the poor transformation capacity of this multi-omics data onto images, why weren't other image layouts tried and their classification performance assessed? Why should a user assume that this image layout that worked for this particular input dataset will also work with other datasets if image transformation is performing poorly in most cases?

      We apologize for not describing this further in the manuscript. As we wrote in the manuscript, it is difficult to know in advance which genome representation is optimal. A flat vector represents a simple (or no) transformation, since we simply take all of the genes from all of the data sources and append them into a single list. The chromosome image and the square image are the two transformations we tried, and we focused on the square image since in our hands it showed superior performance relative to the other transformations.

      Reviewer #2 (Recommendations For The Authors):

      Minor points:

      1) Legends of supplementary Figures are missing.

      We thank the reviewer for this comment and apologize for missing it. All legends have been added now.

      2) For some tests the authors use F1 score while for other AUC, they should be consistent. Report all metrics for all comparisons or report one and justify why that only metric.

      We apologize for not being sufficiently clear. AUC is a standard score used for binary classification, while the F1 score is used for multiclass classification. We have now described this in the methods section, and hope this is now sufficiently clear.

      "When predicting continuous values, the model used the output from the activation function with the mean squared error loss function. When predicting multi-class labels, the performance measure was defined by the F1 score, a standard measure for multiclass classification that combines precision and recall (sensitivity) and is defined as their harmonic mean. To evaluate model performance against the binary outcome, ROC analysis was performed, and the area under the curve (AUC) was used as the performance metric."

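      For readers unfamiliar with the two metrics, both can be computed in a few lines. This is a sketch under stated assumptions: the macro averaging of per-class F1 scores and the pairwise (rank-based) formulation of AUC are illustrative choices, and the function names and toy data are hypothetical rather than taken from the GENIUS code.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, an assumption)."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

multi_f1 = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
binary_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(binary_auc)  # 0.75
```
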
      3) I am not sure how the UMAP representation in Figure 2C helps in understanding the performance.

      Apologies for the poor wording in the results section. The purpose of the UMAP representation was to visually inspect if the model was distinguishing between variables of interest, not to estimate model performance. We have rephrased the text in the methods section to make this clear:

      "After successful model training, we extracted the latent representations of each genome and performed the Uniform Manifold Approximation and Projection (UMAP) of the data for the purpose of visual inspection of a model."

      And

      "In order to avoid modeling noise, this step was used to inspect if the model is distinguishing between variables of interest."

      And also in the results section:

      "In order to visually inspect patterns captured by the model, we extracted the latent representations of each genome and performed the Uniform Manifold Approximation and Projection (UMAP) of the data to project it into two dimensions."

      4) Instead of pie chart in 3A, the authors should plot stacked barplots (to 100%) so it would be easier to compare between the different cancer types.

      We thank the reviewer for the suggestion; however, since we wanted to compare the relative impact of each data source with each other, we used pie charts. Pie charts are often better for describing relative values, whereas bar plots are better for absolute values.

      References

      Du, Ruishan, Wenhao Liu, Xiaofei Fu, Lingdong Meng, and Zhigang Liu. 2022. “Random Noise Attenuation via Convolutional Neural Network in Seismic Datasets.” Alexandria Engineering Journal 61 (12): 9901–9.

      Jang, Hojin, Devin McCormack, and Frank Tong. 2021. “Noise-Trained Deep Neural Networks Effectively Predict Human Vision and Its Neural Responses to Challenging Images.” PLoS Biology 19 (12): e3001418.

      Mattesen, Trine B., Claus L. Andersen, and Jesper B. Bramsen. 2021. “MethCORR Infers Gene Expression from DNA Methylation and Allows Molecular Analysis of Ten Common Cancer Types Using Fresh-Frozen and Formalin-Fixed Paraffin-Embedded Tumor Samples.” Clinical Epigenetics 13 (1): 20.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Review #1:

      (1) It would be helpful to explain the criteria for choosing a given number of clusters and for accepting the final clustering solution more clearly. The quantitative results (silhouette plots, Rand index) in Supplementary Figure 2 should perhaps be included in the main figure to justify the parameter choices and acceptance of specific clustering solutions.

      We revised the text and added labels to the original Supplementary Figure 2 (now main Figure 4) to clarify how we arrived at the best settings for random-seed clustering. 

      (2) It would be helpful to show how the activity profiles in Figure 3 would look like for 3 or 5 (or 6) clusters, to give the reader an impression of how activity profiles recovered using different numbers of clusters would differ.

      We added a new figure (Supplementary Figure 4) that shows 5- and 6-cluster results. Note that the same three subpopulations in Figure 3 were reliably identified as distinct clusters even with alternative settings, corroborating the results in the tSNE space (Supplementary Figure 3). 

      (3) The authors attempt to link the microstimulation effects to the presence of functional neuron clusters at the stimulation site. How can you rule out that there were other, session-specific factors (e.g., related to the animal's motivation) that affected both neuronal activity and behavior? For example, could you incorporate aspects of the monkey's baseline performance (mean reaction time, fixation breaks, error trials) into the analysis?

      We tested the potential influences of monkeys’ motivational states on our observations using two sets of analysis. First, we examined whether motivational state modulated the likelihood of observing a specific type of neural activity in STN. We focused on three measurements of motivational states: the rate of fixation break, the overall error rate, and mean RT. We found that none of these measurements differed significantly among sessions when we encountered different subpopulations (new Supplemental Figure 7), suggesting that motivational state alone cannot explain the differences in activity patterns of the four subpopulations. 

      Second, we examined how motivational state may be reflected in the microstimulation results. To clarify, because we interleaved trials with and without microstimulation, the microstimulation effects cannot be solely explained by session-specific factors. However, it is possible that motivational state can modulate the magnitude of microstimulation effects. We performed correlation analysis between microstimulation effects (difference in each fitted DDM parameter between trials with and without microstimulation) and motivational state (fixation break, error rate, mean RT on trials without microstimulation). We did not find a significant correlation for any combination (Supplemental Table 1). These results suggest that the motivational state of the monkey had little influence on our recording and microstimulation results. However, because our monkeys operated within a narrow range of strong engagement on the task, we cannot rule out the possibility that STN activity or microstimulation effects could change significantly if the monkeys were not as engaged. We have added these results in a new section titled “Heterogeneous activity patterns and microstimulation effects cannot be explained by variations in motivational state”.

      (4) Line 84: What was the rationale for not including both coherence and reaction time in one multiple regression model?

      On the task we used, RT depends strongly on coherence in a nonlinear fashion (e.g., the example behavior in what is now Figure 5). We thus performed regressions using coherence and RT separately. We revised the text in Methods to clarify our rationale (lines 470-473):

      “To quantitatively measure each neuron’s task-related modulation, we performed two multiple linear regressions for each running window, separately for coherence and RT because monkeys’ RT strongly depends on coherence on our task:”

      Review #2:

      The interpretation of the results, and specifically, the degree to which the identified clusters support each model, is largely dependent on whether the artificial vectors used as model-based clustering seeds adequately capture the expected behavior under each theoretical model. The manuscript would benefit from providing further justification for the specific model predictions summarized in Figure 1B.

      We added information on the original figure/equations that were the basis of the artificial vectors we constructed for clustering analysis and their abbreviated summary in Figure 1B (first paragraph in section “STN subpopulations can support previously theorized functions”). These vectors were meant to capture prominent features of the predicted activity patterns, in the forms of choice, time, and motion strength dependencies. We also emphasize that we obtained very similar results using random clustering seeds.

      Further, although each cluster's activity can be described in the context of the discussed models, these same neural dynamics could also reflect other processes not specific to the models. That is, while a model attributing the STN's role to assessing evidence accumulation may predict a ramping up of neural activity, activity ramping is not a selective correlate of evidence accumulation and could be indicative of a number of processes, e.g., uncertainty, the passage of time, etc. This lack of specificity makes it challenging to infer the functional relevance of cluster activity and should be acknowledged in the discussion.

      We thank the reviewer for pointing out the alternative interpretation of these modulation patterns. We have added this caveat in the Discussion (lines 398-401): “It is also possible that the ramping activity reflects alternative roles for the STN in the evaluation of the decision process, the tracking of elapsed time, or both. How these possible roles relate to those of caudate neurons awaits further investigation (Fan et al., 2024)”. 

      Additionally, although the effects of STN microstimulation on behavior provide important causal evidence linking the STN to decision processes, the stimulation results are highly variable and difficult to interpret. The authors provide a reasonable explanation for the variability, showing that neurons from unique clusters are anatomically intermingled such that stimulation likely affects neurons across several clusters. It is worth noting, however, that a substantial body of literature suggests that neural populations in the STN are topographically organized in a manner that is crucial for its role in action selection, providing "channels" that guide action execution. The authors should comment on how the current results, indicative of little anatomical clustering amongst the functional clusters, relate to other reports showing topographical organization.

      We thank the reviewer for raising this important point. We have added the following text in the Discussion:

      “The intermingled subpopulations may appear at odds with the conventional idea of topography in how the STN is organized. For example, the “tripartite model” suggests that STN is segregated by motor, associative, and limbic functions (Parent and Hazrati, 1995); afferents from motor cortices and neurons related to different types of movements are largely somatotopically organized in the STN (DeLong et al., 1985; Nambu et al., 1996); and certain molecular markers are expressed in an orderly pattern in the STN (reviewed in Prasad and Wallén-Mackenzie, 2024). Because we focused on STN neurons that were responsive on a single oculomotor decision task, our sampling was likely biased toward STN subdivisions related to associative function and oculomotor movements. As such, our results do not preclude the presence of topography at a larger scale. Rather, our results underscore the importance of activity pattern-based analysis, in addition to anatomy-based analysis, for understanding the functional organization of the STN.”

      Figure 3 is referenced when describing which cluster activity is choice/coherence dependent, yet it is unclear what specific criteria and measures are being used to determine whether activity is choice/coherence "dependent." Visually, coherence activity seems to largely overlap in panel B (top row). Is there a statistically significant distinction between low and high coherence in this plot? The interpretation of these plots and the methods used to determine choice/coherence "dependence" needs further explanation.

      We added a new figure (Supplementary Figure 3) that shows the summary of choice and coherence modulation, based on multiple linear regression analysis, for each subpopulation separately. We also updated the description of these activity patterns in Results (lines 122-130):

      In general, the association between cluster activity and each model could be more directly tested. At least two of the models assume coordination with other brain regions. Does the current dataset include recordings from any of these regions (e.g., mPFC or GPe) that could be used to bolster claims about the functional relevance of specific subpopulations? For example, one would expect coordinated activity between neural activity in mPFC and Cluster 2 according to the Ratcliff and Frank model.

      We agree completely that simultaneous recordings of STN and its afferent/efferent regions (such as mPFC, GPe, SNr, and GPi) would provide valuable insights into the specific roles of STN and the basal ganglia as a whole. Such recordings are outside the scope of the current study but are in our future plans. 

      Additionally, the reported drift-diffusion model (DDM) results are difficult to interpret as microstimulation appears to have broad and varied effects across almost all the DDM model parameters. The DDM framework could, however, be used to more specifically test the relationships between each neural cluster and specific decision functions described in each model. Several studies have successfully shown that neural activity tracks specific latent decision parameters estimated by the DDM by including neural activity as a predictor in the model. Using this approach, the current study could examine whether each cluster's activity is predictive of specific decision parameters (e.g., evidence accumulation, decision thresholds, etc.). For example, according to the Ratcliff and Frank model, activity in cluster 2 might track decision thresholds.

      We thank the reviewer for the suggested analysis. Because including the neural activity in the model substantially increases model fitting time, we performed a preliminary round of model fitting for 15 neurons (5 neurons closest to each of the cluster centroids). For each neuron, we measured the average firing rates in three windows: 1) a 350 ms window starting from dots onset (“Dots”), 2) a 350 ms window ending at saccade onset (“Presac”), and 3) a variable window starting from dots onset and ending at 100 ms before saccade onset (“Fullview”). For each window, the firing rates were z-scored across trials.  We incorporated the firing rates into two model types. In the “DV” type, the firing rates were assumed to influence three DDM parameters related to evidence accumulation: k, me, and z. In the “Bound” type, the firing rates were assumed to influence three DDM parameters related to decision bound: a, B_alpha, and B_d. In total, we fitted six combinations of firing rates and model types to each neuron. For comparison, we also fitted the standard model without incorporating firing rates. 

      As shown in Author response image 1, firing rates of single STN neurons contributed minimally to the fits. With the exception of one neuron, AIC values were greater for model variants including firing rates than for the standard model (Author response image 1A), indicating that including firing rate did not improve the fits. For all neurons, the fitted coefficients for firing rates were several orders of magnitude smaller than the corresponding DDM parameters (Author response image 1B; note the range of y axis), indicating that the trial-by-trial variation in firing rate had little influence on the evidence accumulation- or decision bound-related parameters. Based on these preliminary fitting results, we believe that a single STN neuron does not have a strong enough influence on the overall evidence accumulation or decision bound to be detected with the model fitting method. We therefore did not expand the fitting analysis to all neurons.

      Author response image 1.

      Firing rates of a single STN neuron did not substantially influence decision-related DDM parameters. A, Differences in AIC between DDM variants that included firing rate-dependent terms and the standard DDM. Red dashed line: difference = -3. Each column represents results from one unit. B, Fitted coefficients for firing rate-related terms were near zero. Note the range of y axis. Values for the top and bottom panels were obtained from "DV"- and "Bound"-type models, respectively. See text for more details.
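      The model comparison described above rests on two simple operations: z-scoring firing rates across trials and comparing AIC values between a firing-rate variant and the standard model. The minimal Python sketch below illustrates both. It is not the authors' fitting code; the function names, the use of the population standard deviation for z-scoring, and the log-likelihood and parameter counts in the example are hypothetical values chosen only to show the arithmetic.

```python
import statistics


def zscore(values):
    """Z-score a list of per-trial firing rates.
    Uses the population SD (an assumption); the SD must be non-zero."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def delta_aic(ll_variant, k_variant, ll_standard, k_standard):
    """AIC of the variant minus AIC of the standard model.
    Positive values mean the extra terms did not pay for themselves."""
    return aic(ll_variant, k_variant) - aic(ll_standard, k_standard)


# Hypothetical example: a variant with three extra firing-rate coefficients
# that improves the log-likelihood by only 0.5.
example_delta = delta_aic(ll_variant=-99.5, k_variant=8,
                          ll_standard=-100.0, k_standard=5)

# Hypothetical per-trial firing rates (spikes/s) in one analysis window.
example_z = zscore([12.0, 15.0, 18.0])
```

      Under these assumed numbers the variant's AIC is larger than that of the standard model, which is the pattern the response reports for all but one neuron.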

      We emphasize, however, that the apparent negative results do not necessarily argue against a causal role of the STN in decision making; rather, these results more likely reflect a methodological limitation: because we used a single task context, the monkeys’ natural trial-by-trial variations in the DDM components may be too small. A better design would be to manipulate task contexts to induce larger changes in evidence accumulation or decision bounds and then test for a correlation between single-neuron firing rates and these changes. We are currently using such a design in a follow-up study.

      The table in Figure 1B nicely outlines the specific neural predictions for each theoretical model but it would help guide the reader if the heading for each column also included a few summary words to remind the reader of the crux of each theory, e.g. "Ratcliff+Frank 2012 (adjusted decision-bounds)"

      We thank the reviewer for this suggestion. We considered implementing this but eventually decided not to add more headings to the column, because the predicted STN functions of the three models cannot all be succinctly summarized. We thus prefer to include more detailed descriptions in the main text, instead of in the figure. 

      The authors frequently refer to contralateral vs. ipsilateral decisions but never explicitly state what this refers to, i.e. contralateral relative to what (visual field, target direction, recording site, etc.)? The reader can eventually deduce that this means contralateral to the recording site but this should be explicitly stated for clarity.

      We added in Methods: 

      Line 483: “Contralateral/ipsilateral choices refer to saccades toward the targets contralateral/ipsilateral to the recording sites, respectively.” 

      Line 535: “Contralateral/ipsilateral choices refer to saccades toward the targets contralateral/ipsilateral to the microstimulation sites, respectively.”

      Again, for clarity, it would be helpful to explicitly define what the authors mean by "sensitive to choice" when referring to Figure 1B as this could be interpreted to mean left/right or ipsilateral/contralateral.

      In the context of Figure 1B, “sensitive to choice” means showing different responses for the two choices in our 2AFC task, regardless of the task geometry. We added explanation in the figure caption.

      Color bar labels would be helpful to include in all figures that include plots with color bars.

      We apologize for omitting the labels. They have been added to Figure 2B and C and to Supplemental Fig. 1.

      The authors should briefly note what a "lapse term" is when describing the logistic function results.

      We revised the text in Results (lines 184-186) and Methods (line 527) to clarify that lapse terms were used to capture errors independent of motion strength.

      Are the 3 example sessions in Figure 4 stimulating the same STN site and/or the same monkey? This information should be noted in the caption or main text.

      We revised the caption: “A-C, Monkey’s choice (top) and RT (bottom) performance for trials with (red) and without (black) microstimulation for three example sessions (A,B: two sites in monkey C; C: monkey F).”

      Figure 3B the authors note that "the last cluster shows little task-related modulation" - what criteria are they using to make this conclusion? By eye, the last cluster and cluster 1 seem to show a similar degree of modulation when locked to motion onset.

      We added a new figure (Supplementary Figure 2) that shows the summary of choice and coherence modulation, based on multiple linear regression analysis, for each subpopulation separately.

      Reviewer #3:

      We have grouped the reviewer’s public and specific comments by content. 

      First, the interpretation of the neural subpopulations' activity patterns in relation to the computational models should be clarified, as the observed patterns may not directly correspond to the specific signals predicted by the models. The authors claim that the first subpopulation of STN neurons reflects the normalization signal predicted by the model of Bogacz and Gurney (2007). However, the observed activity patterns only show choice- and coherence-dependent activity, which may represent the input to the normalization computation rather than its output. The authors should clarify this point and discuss the limitations of their interpretation. 

      We agree with the reviewer that the choice- and coherence-dependent activity pattern does not sufficiently indicate a normalization computation. We interpreted such activity as satisfying a necessary condition for, and therefore consistent with, the theoretical model proposed by Bogacz and Gurney. We have reviewed the text to ensure that we never made the claim that the first subpopulation mediates the normalization.   

      Second, the authors could consider using a supervised learning method to more explicitly model the pattern correlations between the three profiles. The authors used k-means clustering to identify STN subpopulations. Given the clear distinction between the three types of neural firing patterns, a supervised learning method (e.g., a generalized linear model) could be used as a more explicit encoding model to account for the pattern correlations between the three profiles.

      We used two approaches to examine the different response profiles. The “random-seed” approach used non-supervised clustering to probe the functional organization of STN neurons, with no a priori assumption about how many subpopulations may be present. The “model-seed” approach is similar in spirit to what the reviewer suggested: we defined artificial vectors, akin to regressors in a generalized linear model, that showed key modulation features as predicted by previous theoretical models. We then projected the neurons’ activity profiles onto these vectors, akin to performing a regression analysis.   
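      The “model-seed” idea described above, projecting each neuron's activity profile onto artificial template vectors, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity measure (cosine similarity), the template names `ramp` and `transient`, and all numerical values are hypothetical and are not taken from the manuscript.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_template(profile, templates):
    """Name of the template vector most similar to the activity profile."""
    return max(templates, key=lambda name: cosine(profile, templates[name]))


# Hypothetical template vectors over four time bins.
templates = {
    "ramp": [0.0, 0.5, 1.0, 1.5],       # activity grows over time
    "transient": [0.0, 1.0, 1.0, 0.0],  # activity rises, then falls
}

# Hypothetical activity profiles for two neurons.
neuron_a = [0.1, 0.6, 0.9, 1.4]
neuron_b = [0.0, 0.9, 1.1, 0.1]

assignment = {
    "neuron_a": best_template(neuron_a, templates),
    "neuron_b": best_template(neuron_b, templates),
}
```

      In such a scheme the template vectors play the role of regressors, and the similarity scores can serve as starting assignments (seeds) for a subsequent clustering step.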

      Third, a neural population model could be employed to better understand how the STN population jointly contributes to decision-making dynamics. The single-neuron encoding analysis reveals mixed effects from multiple decision-related functions. To better understand how the STN population jointly contributes to the decision-making process, the authors could consider using a neural population model (e.g., Wang et al., 2023) to quantify the population dynamics.

      We agree with the reviewer that a neural population model would be helpful for testing our understanding of the roles of STN. However, we believe that this is premature at the moment because we have no knowledge about how these different subpopulations interact with each other within STN, nor how they interact with other basal ganglia nuclei. We hope our results provide a foundation for future experiments that can provide more specific insights into the roles of each subpopulation, which can then be tested in a neural population model as the reviewer suggested.

      Finally, the added value of the microstimulation experiments should be more directly addressed in the Results section, as the changes in firing patterns compared to the original patterns are not clearly evident. The microstimulation results (Figure 7A) do not show significant changes in firing patterns compared to the original patterns (Figure 3B). As microstimulation is used to identify the hypothetical role of the STN beyond the correlational analysis, the authors should more directly address the added value of these experiments in the Results section.

      We apologize for the confusion. The average firing rates at the top of original Figure 7A (now Figure 8A) were obtained in recordings just before microstimulation, to document which neuron subpopulation was near the stimulation electrode. We were not able to obtain recordings from the same neurons during microstimulation.  

      The ordering of the three hypotheses in the Introduction (1) adjusting decision bounds, (2) computing a normalization signal, (3) implementing a nonlinear computation to improve decision bound adjustment, is inconsistent with the order in which they are addressed in the Results section (2, 1, 3). To improve clarity and readability, the authors should consider presenting the hypotheses and their corresponding results in a consistent order throughout the manuscript.

      We thank the reviewer for this suggestion. We have reordered the text in Introduction to be consistent.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors set out to explore the role of upstream open reading frames (uORFs) in stabilizing protein levels during Drosophila development and evolution. By utilizing a modified ICIER model for ribosome translation simulations and conducting experimental validations in Drosophila species, the study investigates how uORFs buffer translational variability of downstream coding sequences. The findings reveal that uORFs significantly reduce translational variability, which contributes to gene expression stability across different biological contexts and evolutionary timeframes.

      We thank the reviewer for carefully reading our manuscript and providing thoughtful and constructive feedback. We believe the manuscript has been significantly improved by incorporating your suggestions. Please find our detailed responses and corresponding revisions below.

      Strengths:

      (1) The study introduces a sophisticated adaptation of the ICIER model, enabling detailed simulation of ribosomal traffic and its implications for translation efficiency.

      (2) The integration of computational predictions with empirical data through knockout experiments and translatome analysis in Drosophila provides a compelling validation of the model's predictions.

      (3) By demonstrating the evolutionary conservation of uORFs' buffering effects, the study provides insights that are likely applicable to a wide range of eukaryotes.

      We appreciate your positive feedback and thoughtful summary of the strengths of our study.

      Weaknesses:

      (1) Although the study is technically sound, it does not clearly articulate the mechanisms through which uORFs buffer translational variability. A clearer hypothesis detailing the potential molecular interactions or regulatory pathways by which uORFs influence translational stability would enhance the comprehension and impact of the findings.

      Thanks for your constructive comments. In the Discussion section of our previous submission (Original Lines 470-489), we proposed that uORFs function as “molecular dams” to smooth out fluctuations in ribosomal flow toward downstream CDS regions, primarily via mechanisms involving ribosome collision and dissociation. To further address your concern, we have expanded the Discussion and included a new model figure (Fig. 9) to more clearly articulate the potential biological and mechanistic basis by which translating 80S ribosomes may induce the dissociation of 40S ribosomes. The revised section (Lines 540–557) now reads:

      “Ribosome slowdown or stalling on mRNA due to rare codons [56,96-98] or nascent blocking peptides [99-102] frequently triggers ribosome collisions genome-wide [103-105]. Such collisions, especially among elongating 80S ribosomes, often activate ribosome quality control (RQC) pathways that recognize collision interfaces on the 40S subunit, leading to ribosomal subunit dissociation and degradation [106-108]. In mammals, ZNF598 specifically identifies collided ribosomes to initiate ubiquitin-dependent protein and mRNA quality control pathways [109-113]. Analogously, yeast employs Hel2-mediated ubiquitination of uS10, initiating dissociation via the RQC-trigger complex (RQT) [114]. Furthermore, the human RQT (hRQT) complex recognizes ubiquitinated ribosomes and induces subunit dissociation similarly to yeast RQT [115]. However, transient ribosome collisions can evade RQC by promoting resumed elongation through mechanical force provided by trailing ribosomes, thereby mitigating stalling [116]. Beyond 80S collisions, evidence increasingly highlights a distinct collision type involving scanning 40S subunits or pre-initiation (43S) complexes. Recently, an initiation RQC pathway (iRQC) targeting the small ribosomal subunit (40S) has been described, particularly involving collisions between scanning 43S complexes or between stalled 43S and elongating 80S ribosomes (Figure 9B) [117,118]. During iRQC, E3 ubiquitin ligase RNF10 ubiquitinates uS3 and uS5 proteins, resulting in 40S degradation [118]. This mechanism aligns closely with our ICIER model, proposing collision-driven 43S dissociation in the 5' UTRs. Future studies exploring these mechanisms in greater detail will clarify how uORFs modulate translational regulation through buffering effects.”

      (2) The study could be further improved by a discussion regarding the evolutionary selection of uORFs. Specifically, it would be beneficial to explore whether uORFs are favored evolutionarily primarily for their role in reducing translation efficiency or for their capability to stabilize translation variability. Such a discussion would provide deeper insights into the evolutionary dynamics and functional significance of uORFs in genetic regulation.

      Thank you for this insightful suggestion. We agree that understanding whether uORFs are evolutionarily favored for their role in translational repression or for their capacity to buffer translational variability is a compelling and unresolved question. Our study suggests that translational buffering, rather than translational repression alone, can also drive evolutionary selection favoring uORFs, although it remains challenging to empirically disentangle these functions due to their inherent linkage. We have expanded the discussion in the revised manuscript to address this point in more detail (Lines 494-513), which is reproduced as follows:

      “Previous studies have shown that a significant fraction of fixed uORFs in the populations of D. melanogaster and humans were driven by positive Darwinian selection [63,67], suggesting active maintenance through adaptive evolution rather than purely neutral or deleterious processes. While uORFs have traditionally been recognized for their capacity to attenuate translation of downstream CDSs, accumulating evidence now underscores their critical role in stabilizing gene expression under fluctuating cellular and environmental conditions [43,55,56]. Whether the favored evolutionary selection of uORFs acts primarily through their role in translational repression or translational buffering remains a compelling yet unresolved question, as these two functions are inherently linked. Indeed, highly conserved uORFs tend to be translated at higher levels, resulting not only in stronger inhibition of CDS translation [34,45,67] but also in a more pronounced buffering effect, as demonstrated in this study. This buffering capacity of uORFs potentially provides selective advantages by reducing fluctuations in protein synthesis, thus minimizing gene-expression noise and enhancing cellular homeostasis. This suggests that selection may favor uORFs that contribute to translational robustness, a hypothesis supported by findings in yeast and mammals showing that uORFs are significantly enriched in stress-response genes and control the translation of certain master regulators of stress responses [41,42,94,95]. Our study suggests that translational buffering, rather than translational repression alone, can also drive evolutionary selection favoring uORFs, although it remains challenging to empirically disentangle these functions. Future comparative genomic analyses, coupled with experimental approaches such as ribosome profiling and functional mutagenesis, will be crucial in elucidating the precise evolutionary forces driving uORF conservation and adaptation.”

      Reviewer #2 (Public review):

      uORFs, short open reading frames located in the 5' UTR, are pervasive in genomes. However, their roles in maintaining protein abundance are not clear. In this study, the authors propose that uORFs act as a "molecular dam", limiting the fluctuation of the translation of downstream coding sequences. First, they performed in silico simulations using an improved ICIER model, and demonstrated that uORF translation reduces CDS translational variability, with buffering capacity increasing in proportion to uORF efficiency, length, and number. Next, they analyzed the translatome between two related Drosophila species, revealing that genes with uORFs exhibit smaller fluctuations in translation between the two species and across different developmental stages within the same species. Moreover, they identified that bicoid, a critical gene for Drosophila development, contains a uORF with substantial changes in translation efficiency. Deleting this uORF in Drosophila melanogaster significantly affected its gene expression, hatching rates, and survival under stress conditions. Lastly, by leveraging public Ribo-seq data, the authors showed that the buffering effect of uORFs is also evident between primates and within human populations. Collectively, the study advances our understanding of how uORFs regulate the translation of downstream coding sequences at the genome-wide scale, as well as during development and evolution.

      The conclusions of this paper are mostly well supported by data, but some definitions and data analysis need to be clarified and extended.

      We thank the reviewer for the thoughtful and constructive review. Your summary accurately captures the key findings of our study. We have carefully addressed all your concerns in the revised manuscript, and we believe it has been significantly improved based on your valuable input.

      (1) There are two definitions of translation efficiency (TE) in the manuscript: one refers to the number of 80S ribosomes that complete translation at the stop codon of a CDS within a given time interval, while the other is calculated based on Ribo-seq and mRNA-seq data (as described on Page 7, line 209). To avoid potential misunderstandings, please use distinct terms to differentiate these two definitions.

      Thank you for highlighting this important point, and we apologize for the confusion. The two definitions of translation efficiency (TE) in our manuscript arise from methodological differences between simulation and experimental analyses. To clarify, in the revised manuscript, we use “translation rate” in the context of simulations to describe the number of 80S ribosomes completing translation at the CDS stop codon per unit time. We retain the conventional “translation efficiency (TE)” for Ribo-seq–based measurements. 

      In the revised manuscript, we have also added a more detailed explanation of TE (Lines 202–206), which now reads:

      “For each sample, we followed established procedures [62-66] to calculate the translational efficiency (TE) for each feature (CDS or uORF). TE serves as a proxy for the translation rate at which ribosomes translate mRNA into proteins, typically quantified by comparing the density of ribosome-protected mRNA fragment (RPF) to the mRNA abundance for that feature (see Materials and Methods).”
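      To make the distinction between the two quantities concrete, the short Python sketch below contrasts the simulation-side “translation rate” with the Ribo-seq-based TE. It is a simplified illustration, not the authors' pipeline: the function names are hypothetical, and TE is written as a plain ratio of RPF density to mRNA abundance on the assumption that both are expressed in the same normalized units. The exact normalization used in the manuscript's Materials and Methods is not reproduced here.

```python
def translation_rate(completed_ribosomes, interval):
    """Simulation-side quantity: number of 80S ribosomes that complete
    translation at the CDS stop codon per unit time."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return completed_ribosomes / interval


def translational_efficiency(rpf_density, mrna_abundance):
    """Ribo-seq-side quantity: RPF density relative to mRNA abundance for
    one feature (CDS or uORF). Both inputs are assumed to be normalized
    in the same way."""
    if mrna_abundance <= 0:
        raise ValueError("mRNA abundance must be positive")
    return rpf_density / mrna_abundance


# Hypothetical numbers, chosen only to show the arithmetic.
sim_rate = translation_rate(completed_ribosomes=120, interval=60.0)
feature_te = translational_efficiency(rpf_density=30.0, mrna_abundance=60.0)
```

      Keeping the two quantities in separate functions mirrors the terminology change described in the response: one is a count per unit time from the simulation, the other a ratio of two sequencing-derived measurements.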

      (2) Page 7, line 209: "The translational efficiencies (TEs) of the conserved uORFs were highly correlated between the two species across all developmental stages and tissues examined, with Spearman correlation coefficients ranging from 0.478 to 0.573 (Fig. 2A)." However, the authors did not analyze the correlation of translation efficiency of conserved CDSs between the two species or compare this correlation to that of the TEs of uORFs. These analyses will further support the authors' conclusion regarding the role of conserved uORFs in translation regulation.

      In the revised manuscript, we have incorporated a comparison of translational efficiency (TE) correlations for conserved CDSs between the two species. We found that CDSs exhibit significantly higher interspecific TE correlations than uORFs, with Spearman’s rho ranging from 0.588 to 0.806. This suggests that uORFs tend to show greater variability in TE than CDSs, consistent with our model in which uORFs buffer fluctuations in downstream CDS translation. The updated results were included in the revised manuscript (Lines 223-227) as follows:

      “In contrast, TE of CDSs exhibited a significantly higher correlation between the two species in the corresponding samples compared to that of uORFs, with Spearman’s rho ranging from 0.588 to 0.806 (P = 0.002, Wilcoxon signed-rank test; Figure 2A). This observation is consistent with our simulation results, which indicate that uORFs experience greater translational fluctuations than their downstream CDSs.”
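      For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation of the ranked values. The sketch below is a dependency-free illustration with hypothetical TE values for five features; it is not the analysis code used in the study, and it does not implement the Wilcoxon signed-rank test that compares the per-sample coefficients.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical TE values for the same five features in two species.
te_dmel = [0.5, 1.2, 0.8, 2.0, 0.3]
te_dsim = [0.6, 1.0, 1.1, 1.8, 0.2]
rho = spearman_rho(te_dmel, te_dsim)
```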

      (3) Page 8, line 217: "Among genes with multiple uORFs, one uORF generally emerged as dominant, displaying a higher TE than the others within the same gene (Fig. 2C)." The basis for determining dominance among uORFs is not explained and this lack of clarification undermines the interpretation of these findings.

      Thank you for pointing this out. We apologize for the confusion. In our study, a “dominant” uORF is defined as the one with the highest translation efficiency (TE) among all uORFs within the same gene. This designation is based solely on TE, which we consider a key metric for uORF activity, as it directly reflects translational output and potential regulatory impact. We have revised the manuscript to clarify this definition (Lines 232–244), now stating:

      “Among genes with multiple uORFs, we defined the uORF with the highest TE as the dominant uORF for that gene, as TE is one of the most relevant metrics for assessing uORF function [45,67] … These results suggest that genes with multiple uORFs tend to retain the same dominant uORF across developmental stages, indicating that the dominant uORFs may serve as the key translational regulator of the downstream CDS.”
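      The definition of dominance amounts to taking the maximum TE within each gene. A minimal sketch, assuming TE values are already available per uORF; the gene and uORF identifiers and the TE values are hypothetical:

```python
def dominant_uorfs(te_by_gene):
    """Map each multi-uORF gene to the ID of its highest-TE uORF."""
    return {
        gene: max(uorf_te, key=uorf_te.get)
        for gene, uorf_te in te_by_gene.items()
        if len(uorf_te) > 1  # dominance is only defined for genes with several uORFs
    }


# Hypothetical TE values for one sample.
te_by_gene = {
    "geneA": {"uORF1": 0.4, "uORF2": 1.3, "uORF3": 0.9},
    "geneB": {"uORF1": 0.7},
    "geneC": {"uORF1": 2.1, "uORF2": 0.2},
}
dominant = dominant_uorfs(te_by_gene)
```

      Running the same function on each developmental stage and comparing the resulting mappings is one way to check whether a gene keeps the same dominant uORF across stages.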

      (4) According to the simulation, the translation of uORFs should exhibit greater variability than that of CDSs. However, the authors observed significantly fewer uORFs with significant TE changes compared to CDSs. This discrepancy may be due to lower sequencing depth resulting in fewer reads mapped to uORFs. Therefore, the authors may compare this variability specifically among highly expressed genes.

      Thank you for this thoughtful observation. We agree that the lower proportion of uORFs showing significant TE changes compared to CDSs, as reported in Table 1, appears inconsistent with our conclusion that uORFs exhibit greater translational variability. However, this discrepancy is largely attributable to differences in sequencing depth and feature length—uORFs are generally much shorter and more weakly expressed than CDSs, resulting in fewer mapped reads and reduced statistical power (Figure S18A).

      To address this issue, we first followed your suggestion and restricted our analysis to genes with both mRNA and RPF RPKM values above the 50th percentile in D. melanogaster and D. simulans. While this filtering increased the total proportion of features with significant TE changes (due to improved read coverage), the proportion of significant uORFs still remained lower than that of CDSs (Table R1). This suggests that even among highly expressed genes, the disparity in read counts between uORFs and CDSs persists (Figure S18B), and thus the issue is not fully resolved.

      To better capture biological relevance, we compared the absolute values of log2(TE changes) between D. melanogaster and D. simulans for uORFs and their corresponding CDSs. Across all samples, uORFs consistently exhibit larger TE shifts than their downstream CDSs, supporting our model that uORFs act as translational buffers (Figure 3B).
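      The magnitude comparison described here can be sketched as below. This is an illustration only: the TE values are hypothetical, and the paired significance test (Wilcoxon signed-rank in the manuscript) is not implemented.

```python
import math


def abs_log2_fc(te_a, te_b):
    """Absolute log2 fold change of TE between two species."""
    return abs(math.log2(te_a / te_b))


# Hypothetical (D. melanogaster, D. simulans) TE pairs for each uORF and its downstream CDS.
features = [
    {"uorf": (1.0, 4.0), "cds": (1.0, 2.0)},
    {"uorf": (0.5, 0.25), "cds": (2.0, 2.0)},
    {"uorf": (3.0, 3.0), "cds": (1.0, 0.5)},
]

# One (uORF shift, CDS shift) tuple per uORF-CDS pair.
shifts = [(abs_log2_fc(*f["uorf"]), abs_log2_fc(*f["cds"])) for f in features]

# Number of pairs in which the uORF shifts more than its CDS.
n_uorf_larger = sum(1 for uorf_shift, cds_shift in shifts if uorf_shift > cds_shift)
```

      Unlike counting features that pass a significance threshold, this per-pair comparison does not depend on read depth in the same way, which is why it is less affected by the short length of uORFs.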

      We have made relevant changes to report the new analysis in this revised manuscript. Specifically, in our original submission, we stated this observation with the sentence “The smaller number of uORFs showing significant TE changes compared to CDSs between D. melanogaster and D. simulans likely reflects their shorter length and reduced statistical power, rather than indicating that uORFs are less variable in translation than CDSs.” To make this point clearer, in the revised version (Lines 275-284), we rephrased this sentence, which now reads as follows:

      “Note that due to their shorter length and generally lower TE, uORFs had considerably lower read counts than CDSs, limiting the statistical power to detect significant interspecific TE differences for uORFs. This trend consistently holds whether analyzing all expressed uORFs (Figure S18A) or only highly expressed genes (Figure S18B). Thus, the fewer uORFs showing significant TE divergence likely reflects lower read counts and statistical sensitivity rather than reduced translational variability relative to CDSs. In fact, the absolute values of log2(fold change) of TE for uORFs between D. melanogaster and D. simulans were significantly greater than those observed for corresponding CDSs across all samples (P < 0.001, Wilcoxon signed-rank test; Figure 3B), suggesting that the magnitude of TE changes in CDSs is generally smaller than that in uORFs, due to the buffering effect of uORF.”

      Author response table 1.

      Proportion of uORFs and CDSs with significant TE changes before and after selecting HEGs

      (5) If possible, the author may need to use antibodies against bicoid to test the effect of ATG deletion on bicoid expression, particularly under different developmental stages or growth conditions.

      According to the authors' conclusions, the deletion mutant should exhibit greater variability in bicoid protein abundance. This experiment could provide strong support for the proposed mechanisms.

      Thank you for this excellent suggestion. We fully agree that testing Bcd protein levels across developmental stages or stress conditions using antibodies would be a strong validation of our model, which predicts greater variability in Bcd protein abundance upon uORF deletion.

      In fact, we attempted such experiments in both wild-type and mutant backgrounds. However, we encountered substantial difficulties in obtaining a reliable anti-Bcd antibody. Some Bcd antibodies referenced in the published literature were homemade and often shared among research groups as gifts [1-3] and some commercially available antibodies cited in previous studies are no longer supplied by vendors [4-6]. We managed to obtain a custom-made antibody from Professor Feng Liu, but unfortunately, it produced inconsistent and unsatisfactory results. Despite considerable effort—including during the COVID-19 pandemic—we were unable to identify a reagent suitable for robust and reproducible detection of Bcd protein.

      As an alternative, we used sucrose gradient fractionation followed by qPCR to directly measure the translation efficiency of bicoid in vivo. We believe this approach offers a clear and quantitative readout of translational activity, and it avoids potential confounding from protein degradation, which may vary across conditions and developmental stages. Nonetheless, we recognize the value of antibody-based validation and will pursue this direction in future work if reliable antibodies become available. We have added this limitation to the revised Discussion section (Lines 563–568) as follows:

      “We demonstrated that the bcd uORF represses CDS translation using sucrose gradient fractionation followed by qPCR—an approach that directly measures translation efficiency while minimizing confounding from RNA/protein degradation. However, detecting Bcd protein levels with antibodies across developmental stages or conditions in the mutants and wild-type controls would provide an even stronger validation of our model and should be explored in future studies.”

      Recommendations for the authors:  

      Reviewer #1 (Recommendations for the authors):

      (1) The authors should provide a more detailed explanation for the modifications made to the ICIER model. Specifically, an explanation of the biological or mechanistic rationale behind the ability of the 80S ribosome to cause upstream 40S ribosomes to dissociate from mRNA would help clarify this aspect of the model.

      Thank you for this suggestion. In the original submission, we described our modifications to the ICIER model in the section titled “An extended ICIER model for quantifying uORF buffering in CDS translation” (Lines 88-124 of the revised manuscript). 

      To further clarify the biological rationale behind this mechanism, we have now included a conceptual model figure (Figure 9) illustrating mechanistically how uORF translation can buffer downstream translation within a single mRNA molecule. Additionally, we expanded the Discussion to summarize the current understanding of how collisions between translating 80S ribosomes and scanning 40S subunits may lead to dissociation, referencing known initiation ribosome quality control (iRQC) pathways. These revisions provide a clearer mechanistic framework for interpreting the buffering effects modeled in our simulations. The relevant part of the Discussion (Lines 540-557) reads as follows:

      “Ribosome slowdown or stalling on mRNA due to rare codons [56,96-98] or nascent blocking peptides [99-102] frequently triggers ribosome collisions genome-wide [103-105]. Such collisions, especially among elongating 80S ribosomes, often activate ribosome quality control (RQC) pathways that recognize collision interfaces on the 40S subunit, leading to ribosomal subunit dissociation and degradation [106-108]. In mammals, ZNF598 specifically identifies collided ribosomes to initiate ubiquitin-dependent protein and mRNA quality control pathways [109-113]. Analogously, yeast employs Hel2-mediated ubiquitination of uS10, initiating dissociation via the RQC-trigger complex (RQT) [114]. Furthermore, the human RQT (hRQT) complex recognizes ubiquitinated ribosomes and induces subunit dissociation similarly to yeast RQT [115]. However, transient ribosome collisions can evade RQC by promoting resumed elongation through mechanical force provided by trailing ribosomes, thereby mitigating stalling [116]. Beyond 80S collisions, evidence increasingly highlights a distinct collision type involving scanning 40S subunits or pre-initiation (43S) complexes. Recently, an initiation RQC pathway (iRQC) targeting the small ribosomal subunit (40S) has been described, particularly involving collisions between scanning 43S complexes or between stalled 43S and elongating 80S ribosomes (Figure 9B) [117,118]. During iRQC, E3 ubiquitin ligase RNF10 ubiquitinates uS3 and uS5 proteins, resulting in 40S degradation [118]. This mechanism aligns closely with our ICIER model, proposing collision-driven 43S dissociation in the 5' UTRs. Future studies exploring these mechanisms in greater detail will clarify how uORFs modulate translational regulation through buffering effects.”

      (2) The figure legend references Figure 5C; however, this figure appears to be missing from the document.

      We apologize for the oversight. The missing panel previously referred to as Figure 5C has now been incorporated into the revised Figure 6A. The figure and its corresponding legend have been corrected accordingly in the updated manuscript.

      Reviewer #2 (Recommendations for the authors):

      This is an important study that enhances our understanding of the roles of uORFs in translational regulation. In addition to the suggestions provided in the public review, the following minor points should be addressed before publication in eLife:

      (1) Page 7, line 207: "We identified 18,412 canonical uORFs shared between the two species (referred to as conserved uORFs hereafter)." The term "canonical uORFs" requires clarification. Does this refer to uORFs with specific sequence features, conservation, or another defining characteristic?

      Thank you for pointing this out. We apologize for the lack of clarity. In our study, a canonical uORF is defined as an open reading frame (ORF) that initiates with a canonical AUG start codon located in the 5′ untranslated region (UTR) and terminates with a stop codon (UAA, UAG, or UGA) within the same mRNA. Conservation of uORFs is defined solely based on the presence of AUG start codons at orthologous positions in the 5′ UTR across species, regardless of differences in the stop codon.

      To clarify this definition, we have revised the sentence as follows (Lines 213-219): “We focused on canonical uORFs that initiate with an ATG start codon in the 5′ UTR and terminate with a stop codon (TAA, TAG, or TGA). Because the ATG start codon is the defining feature of a canonical uORF and tends to be more conserved than its downstream sequence [67], we defined uORF conservation based on the presence of the ATG start codon in the 5′ UTR of D. melanogaster and its orthologous positions in D. simulans, regardless of differences in the stop codon. Using this criterion, we identified 18,412 canonical uORFs with conserved start codons between the two species.”
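      The definition of a canonical uORF can be illustrated with a short scan over a transcript sequence. This is a sketch under stated assumptions: the function name, the 0-based coordinate convention, and the toy sequence are hypothetical, and the conservation step (checking for an ATG at the orthologous position in D. simulans) is not implemented here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_canonical_uorfs(mrna, cds_start):
    """Return (start, end) for each ATG in the 5' UTR that has an in-frame stop.

    `mrna` is the transcript sequence in the DNA alphabet and `cds_start` is
    the 0-based index of the main CDS start codon. `end` is exclusive and
    includes the stop codon; the stop may lie downstream of `cds_start`.
    """
    mrna = mrna.upper()
    uorfs = []
    # The ATG itself must lie entirely within the 5' UTR.
    for start in range(0, cds_start - 2):
        if mrna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the ATG's reading frame until the first stop.
        for pos in range(start + 3, len(mrna) - 2, 3):
            if mrna[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


# Toy transcript: a 13-nt 5' UTR followed by a 9-nt CDS (ATGCCCTGA).
transcript = "CCATGAAATAGGGATGCCCTGA"
uorfs = find_canonical_uorfs(transcript, cds_start=13)
uorf_sequences = [transcript[s:e] for s, e in uorfs]
```

      An ATG without any in-frame stop codon in the transcript is skipped, which matches the requirement that a canonical uORF terminates within the same mRNA.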

      (2) Page 8, line 227: "Furthermore, the dominant uORFs showed a higher proportion of conserved uATGs than the other translated uORFs." There appears to be a typographical error. Should "other uATGs" instead read "other uORFs"?

      Thank you for pointing this out. As we addressed in response to your previous concern, in this study, we defined uORF conservation primarily based on the presence of their start codon (uATG) both in D. melanogaster and the orthologous sites of D. simulans, as the start codon is the defining feature of a uORF and tends to be more conserved than the remaining sequence, as demonstrated in our previous study [7]. We used the term “conserved uATGs” to reflect this definition and believe it accurately conveys the intended meaning in this context.

      (3) Page 8, line 240: "uORFs exhibited a significant positive correlation with the TE of their downstream CDSs in all samples analyzed (P < 0.001, Spearman's correlation)." A Spearman's rho of 0.11 or 0.21 may not practically represent a "significant" positive correlation. Consider rephrasing this as "a positive correlation."

      Thank you for the suggestion. We have revised the sentence in the manuscript to read (Lines 257-259): “uORFs exhibited a modest, yet statistically significant, positive correlation with the TE of their downstream CDSs across all samples analyzed (P < 0.001, Spearman’s correlation).”

      (4) Page 9, line 269: The analysis of interspecific TE changes between uORFs and their corresponding CDSs is a crucial piece of evidence supporting the authors' conclusions. Presenting this analysis as part of the figures, rather than in "Table 1," would improve clarity and accessibility.

      Thank you for this suggestion. In Table 1, we originally presented the number of uORFs and CDSs that showed significant differences in TE between D. melanogaster and D. simulans during various developmental stages. One key point we aimed to emphasize was that, although TE changes in uORFs and their downstream CDSs are positively correlated, there is a notable difference in the magnitude of these changes. To better convey this, we have summarized the core findings of Table 1 in graphical form.

      In Figure 3B of the revised version, we compared the absolute values of interspecific TE changes between CDS and uORF, showing that CDSs consistently exhibit smaller shifts than their upstream uORFs. This result further supports the translational buffering effect of uORFs on downstream CDS expression. We have included the updated results in the revised manuscript (Lines 281-284) as follows:

      “In fact, the absolute values of log2(fold change) of TE for uORFs between D. melanogaster and D. simulans were significantly greater than those observed for corresponding CDSs across all samples (P < 0.001, Wilcoxon signed-rank test; Figure 3B), suggesting that the magnitude of TE changes in CDSs is generally smaller than that in uORFs, due to the buffering effect of uORF.”

      (5) Page 9, line 279: The phrase "dominantly translated" needs clarification. Does it refer to Figure 2C, where one uORF is dominantly translated within a gene, or does it mean that the uORF's translation is higher than that of its corresponding CDS?

      We apologize for the lack of clarity. The phrase "dominantly translated" refers to the uORF with the highest TE among all uORFs within a gene. We have rephrased the relevant sentence in the revised version (Lines 299-304), which now reads:

      “To investigate how the conservation level and translation patterns of uORFs influence their buffering capacity on CDS translation, we categorized genes expressed in each pair of samples into three classes: Class I, genes with conserved uORFs that are dominantly translated (i.e., exhibiting the highest TE among all uORFs within the same gene) in both Drosophila species; Class II, genes with conserved uORFs that are translated in both species but not dominantly translated in at least one; and Class III, the remaining expressed genes.”
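      The three-way classification can be sketched as a simple decision rule. This is an assumption-laden illustration: the field names are hypothetical, Class I is assumed to take precedence over Class II when a gene carries several conserved uORFs, and how a uORF is called "translated" is left to the upstream analysis.

```python
def classify_gene(conserved_uorfs):
    """Assign an expressed gene to Class I, II, or III.

    `conserved_uorfs` holds one dict per conserved uORF in the gene, with
    boolean flags for translation and dominance in each species.
    """
    # Class I: some conserved uORF is the dominant uORF in both species.
    if any(u["dominant_mel"] and u["dominant_sim"] for u in conserved_uorfs):
        return "I"
    # Class II: some conserved uORF is translated in both species,
    # but none is dominant in both.
    if any(u["translated_mel"] and u["translated_sim"] for u in conserved_uorfs):
        return "II"
    # Class III: all remaining expressed genes.
    return "III"


# Hypothetical genes.
gene_classes = {
    "geneA": classify_gene([
        {"translated_mel": True, "translated_sim": True,
         "dominant_mel": True, "dominant_sim": True},
    ]),
    "geneB": classify_gene([
        {"translated_mel": True, "translated_sim": True,
         "dominant_mel": True, "dominant_sim": False},
    ]),
    "geneC": classify_gene([]),
}
```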

      (6) The sequencing data and analysis code should be made publicly available before publication to ensure transparency and reproducibility.

      Thank you for this suggestion. As described in the Data availability section, all deep-sequencing data generated in this study, including single-ended mRNA-Seq and Ribo-Seq data of 10 developmental stages and tissues of Drosophila simulans and paired-end mRNA-Seq data of 0-2 h, 2-6 h, 6-12 h, and 12-24 h Drosophila melanogaster embryos, were deposited in the China National Genomics Data Center Genome Sequence Archive (GSA) under accession numbers CRA003198, CRA007425, and CRA007426. The mRNA-Seq and Ribo-Seq data for the different developmental stages and tissues of Drosophila melanogaster were published in our previous paper [8] and were deposited in the Sequence Read Archive (SRA) under accession number SRP067542.

      All original code has been deposited on GitHub: https://github.com/lujlab/uORF_buffer; https://github.com/lujlab/Buffer_eLife2025.

      Response reference

      (1) Li, X.Y., MacArthur, S., Bourgon, R., Nix, D., Pollard, D.A., Iyer, V.N., Hechmer, A., Simirenko, L., Stapleton, M., Luengo Hendriks, C.L., et al. (2008). Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm. PLoS Biol 6, e27. 10.1371/journal.pbio.0060027.

      (2) Horner, V.L., Czank, A., Jang, J.K., Singh, N., Williams, B.C., Puro, J., Kubli, E., Hanes, S.D., McKim, K.S., Wolfner, M.F., and Goldberg, M.L. (2006). The Drosophila calcipressin sarah is required for several aspects of egg activation. Curr Biol 16, 1441-1446. 10.1016/j.cub.2006.06.024.

      (3) Lee, K.M., Linskens, A.M., and Doe, C.Q. (2022). Hunchback activates Bicoid in Pair1 neurons to regulate synapse number and locomotor circuit function. Curr Biol 32, 2430-2441 e2433. 10.1016/j.cub.2022.04.025.

      (4) Wharton, T.H., Nomie, K.J., and Wharton, R.P. (2018). No significant regulation of bicoid mRNA by Pumilio or Nanos in the early Drosophila embryo. PLoS One 13, e0194865. 10.1371/journal.pone.0194865.

      (5) Wang, J., Zhang, S., Lu, H., and Xu, H. (2022). Differential regulation of alternative promoters emerges from unified kinetics of enhancer-promoter interaction. Nat Commun 13, 2714. 10.1038/s41467-022-30315-6.

      (6) Xu, H., Sepulveda, L.A., Figard, L., Sokac, A.M., and Golding, I. (2015). Combining protein and mRNA quantification to decipher transcriptional regulation. Nat Methods 12, 739-742. 10.1038/nmeth.3446.

      (7) Zhang, H., Wang, Y., Wu, X., Tang, X., Wu, C., and Lu, J. (2021). Determinants of genome-wide distribution and evolution of uORFs in eukaryotes. Nat Commun 12, 1076. 10.1038/s41467-021-21394-y.

      (8) Zhang, H., Dou, S., He, F., Luo, J., Wei, L., and Lu, J. (2018). Genome-wide maps of ribosomal occupancy provide insights into adaptive evolution and regulatory roles of uORFs during Drosophila development. PLoS Biol 16, e2003903. 10.1371/journal.pbio.2003903.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      This reviewed preprint is a bit of a Frankenstein monster, as it crams together three quite different sets of data. It is essentially three papers combined into one: one paper focused on the role of CIB2/CIB3 in VHCs, one on the role of CIB2/CIB3 in zebrafish, and one on structural modeling of a CIB2/3 and TMC1/2 complex. The authors try to combine the three parts with the overarching theme of demonstrating that CIB2/3 play a functionally conserved role across species and hair cell types, but given the previous work on these proteins, especially Liang et al. (2021) and Wang et al. (2023), this argument doesn't work very well. My sense is that the way the manuscript is written now, the sum is less than the individual parts, and the authors should consider whether the work is better split into three separate papers.

      We appreciate the frank evaluation of our work and point out that combining structural with functional data from mouse and zebrafish offers a comprehensive view of the role played by TMC1/TMC2 and CIB2/3 complexes in hair-cell mechanotransduction. We believe that readers will benefit from this comprehensive analysis.

      The most important shortcoming is the novelty of the work presented here. In line 89 of the introduction the authors state "However, whether CIB2/3 can function and interact with TMC1/2 proteins across sensory organs, hair-cell types, and species is still unclear." They make a similar statement in the first sentence of the discussion and generally use this claim throughout the paper as motivation for why they performed the experiments. Given the data presented in the Liang et al. (2021) and Wang et al. (2023) papers, however, this statement is not well supported. Those papers clearly demonstrate a role for CIB2/CIB3 in auditory and vestibular cells in mice. Moreover, there are also data in the Riazuddin et al. (2012) paper that demonstrate the importance of CIB2 in zebrafish and Drosophila. I think the authors are really stretching to describe the data in the manuscript as novel. Conceptually, it reads more as solidifying knowledge that was already sketched out in the field in past studies.

      We note that work on mouse and fish CIB knockouts in our laboratories started over a decade ago and that our discoveries are contemporary to those recently presented by Liang et al., 2021 and Wang et al., 2023, which we acknowledge, cite, and give credit as appropriate. We also note that work on fish knockouts and on fish Cib3 is completely novel. Nevertheless, the abstract text “Whether these interactions are functionally relevant across mechanosensory organs and vertebrate species is unclear” has been replaced by “These interactions have been proposed to be functionally relevant across mechanosensory organs and vertebrate species.”; and the introduction text “However, whether CIB2/3 can function and interact with TMC1/2 proteins across sensory organs, hair-cell types, and species is still unclear” has been replaced by “However, additional evidence showing that CIB2/3 can function and interact with TMC1/2 proteins across sensory organs, hair-cell types, and species is still needed.”. The work by Wang et al., 2023 is immediately discussed after the first sentence in the discussion section and the work by Liang et al., 2021 is also cited in the same paragraph. We believe that changes in abstract and introduction along with other changes outlined below put our work in proper context.

      There is one exception, however, and that is the last part of the manuscript. Here structural studies (AlphaFold 2 modeling, NMR structure determination, and molecular dynamics simulations) bring us closer to the structure of the mammalian TMCs, alone and in complex with the CIB proteins. Moreover, the structural work supports the assignment of the TMC pore to alpha helices 4-7.

      Thanks for the positive evaluation of this work.

      Reviewer #2 (Public Review):

      The paper 'Complexes of vertebrate TMC1/2 and CIB2/3 proteins form hair-cell mechanotransduction cation channels' by Giese and coworkers is quite an intense reading. The manuscript is packed with data pertaining to very different aspects of MET apparatus function, scales, and events. I have to praise the team that combined molecular genetics, biochemistry, NMR, microscopy, functional physiology, in-vivo tests for vestibulo-ocular reflexes, and other tests for vestibular dysfunction with molecular modeling and simulations. The authors nicely show the way CIBs are associated with TMCs to form functional MET channels. The authors clarify the specificity of associations and elucidate the functional effects of the absence of specific CIBs and their partial redundancy.

      We appreciate the positive evaluation of our work and agree with the reviewer that the combination of data obtained using various techniques in vivo and in silico provides a unique view of the role played by CIB2 and CIB3 in hair-cell mechanotransduction.

      Reviewer #3 (Public Review):

      This study demonstrates that from fish to mammals CIB2/3 is required for hearing, revealing the high degree of conservation of CIB2/3 function in vertebrate sensory hair cells. The modeling data reveal how CIB2/3 may affect the conductance of the TMC1/2 channels that mediate mechanotransduction, which is the process of converting mechanical energy into an electrical signal in sensory receptors. This work will likely impact future studies of how mechanotransduction varies in different hair cell types. 

      One caveat is that the experiments with the mouse mutants are confirmatory in nature with regard to a previous study by Wang et al., and the authors use lower resolution tools in terms of function and morphological changes. Another is that the modeling data is not supported by electrophysiological experiments; however, as mentioned above, future experiments may address this weakness.

      We thank the reviewer for providing positive feedback and for highlighting caveats that can and will be addressed by future experiments.

      Reviewer #1 (Recommendations For The Authors): 

      Lines 100-101. Please temper this statement, as FM1-43 is only a partial proxy for MET. 

      The original text has been modified to: “In contrast to auditory hair cells, we found that the vestibular hair cells in Cib2KO/KO mice apparently have MET. We assessed MET via uptake of FM 1-43 (Figure 1A), a styryl dye that mostly permeates into hair cells through functional MET channels (Meyers et al., 2003), indicating that there may be another CIB protein playing a functionally redundant role.”

      Lines 111-113. These data do not fully match up with the Kawashima et al. (2011) data. Please discuss. 

      We have modified the text to better report the data: “Tmc2 expression increases during development but remains below Tmc1 levels in both type 1 and type 2 hair cells upon maturation (Figure 1C).”

      Lines 125-126. The comparison in 2A-B is not described correctly for the control. The strain displayed is Cib2^+/+;Cib3^KO/KO (not wild-type). Show the Cib2^+/+;Cib3^+/+ if you are going to refer to it (and is this truly Cib2^+/+;Cib3^+/+ from a cross or just the background strain?). 

      Thanks for pointing this out. To avoid confusion, we have revised the sentence as follows: “We first characterized hearing function in Cib3KO/KO and control littermate mice at P16 by measuring auditory-evoked brainstem responses (ABRs). Normal ABR waveforms and thresholds were observed in Cib3KO/KO indicating normal hearing.”

      Lines 137-140. Did you expect anything different? This is a trivial result, given the profound loss of hearing in the Cib2^KO/KO mice. 

      We did not expect anything different and have deleted the sentence: “Furthermore, endogenous CIB3 is unable to compensate for CIB2 loss in the auditory hair cells, perhaps due to extremely low expression level of CIB3 in these cells and the lack of compensatory overexpression of CIB3 in the cochlea of Cib2KO/KO mice (Giese et al., 2017).”

      Lines 194-196. But what about Cib2^KO/KO; Isn't the conclusion that the vestibular system needs either CIB2 or CIB3? 

      Yes, either CIB2 or CIB3 can maintain normal vestibular function. A prior study by Michel et al., 2017, has evaluated and reported intact vestibular function in Cib2KO/KO mice.

      Lines 212-214. Yes. This is a stronger conclusion than the one earlier. 

      We have revised the sentence as follows: “Taken together, these results support compulsory but functionally redundant roles for CIB2 and CIB3 in the vestibular hair cell MET complex.”

      Lines 265-267. I'm not sure that I would state this conclusion here given that you then argue against it in the next paragraph. 

      We have modified this statement to make the conclusions clearer and more consistent between the two paragraphs. The modified text reads: “Thus, taken together the results of our FM 1-43 labeling analysis are consistent with a requirement for both Cib2 and Cib3 to ensure normal MET in all lateral-line hair cells.”

      Line 277. I would be more precise and say something like "and sufficiently fewer hair cells responded to mechanical stimuli and admitted Ca2+..." 

      We have modified the text as requested: “We quantified the number of hair bundles per neuromast with mechanosensitive Ca2+ responses, and found that compared to controls, significantly fewer cells were mechanosensitive in cib2 and cib2;cib3 mutants (Figure 5-figure supplement 2A, control: 92.2 ± 2.5; cib2: 49.9 ± 5.8, cib2;cib3: 19.0 ± 6.6, p < 0.0001).”

      Line 278 and elsewhere. It doesn't make sense to have three significant digits in the error. I would say either "92.2 ± 2.5" or "92 ± 2."

      Edited as requested.

      Lines 357-358. Move the reference to the figure to the previous sentence, leaving "(Liang et al., 2021)" juxtaposed to its reference (crystal structure). Otherwise, the reader will look for crystal structures in Figure 7-figure supplements 1-5.

      Text has been edited as requested: “The intracellular domain linking helices α2 and α3, denoted here as IL1, adopts a helix-loop-helix with the two helices running parallel to each other and differing in length (Figure 7-figure supplements 1-5). This is the same fold observed in its crystal structure in complex with CIB3 (Liang et al., 2021), which validated the modeling approach.”

      Line 450. What other ions were present besides K+? I assume Cl- or some other anion. What about Na+ or Ca2+? It's hard to evaluate this sentence without that information.

      Systems have 150 mM KCl and CIB-bound Ca2+ when indicated (no Na+ or free Ca2+). This is now pointed out when the models are described first: “These models were embedded in either pure POPC or stereocilia-like mixed composition bilayers and solvated (150 mM KCl) to …”. The sentence mentioned by the reviewer has also been modified: “In systems with pure POPC bilayers we observed permeation of K+ in either one or both pores of the TMC1 dimer, with or without CIB2 or CIB3 and with or without bound Ca2+, despite the presence of Cl- (150 mM KCl).”  

      Lines 470-472. These results suggest that the maximum conductance of TMC1 > TMC2. How do these results compare with the Holt and Fettiplace data? 

      Thanks for pointing this out. A comparison would be appropriate and has been added: “We also speculate that this is due to TMC2 having an intrinsic lower single-channel conductance than TMC1, as has been suggested by some experiments (Kim et al., 2013), but not others (Pan et al., 2013). It is also possible that our TMC2 model is not in a fully open conformation, which can only be reached upon mechanical stimulation.”

      Line 563. Yes, the simulations only allow you to say that the interaction is stable for at least microseconds. However, the gel filtration experiments suggest that the interaction is stable for much longer. Please comment. 

      Thank you for pointing this out. We agree with this statement and modified the text accordingly: “Simulations of these models indicate that there is some potential preferential binding of TMC1 and TMC2 to CIB3 over CIB2 (predicted from BSA) and that TMC + CIB interactions are stable and last for microseconds, with biochemical and NMR experiments showing that these interactions are stable at even longer timescales.”  

      Figure 3. Please use consistent (and sufficiently large to be readable) font size. 

      Figure has been updated.

      Figure 4. Magnification is too low to say much about bundle structure.

      The reviewer is right – we cannot evaluate bundle structure with the images shown in Figure 4. Our goal was to determine whether the vestibular hair cells had degenerated in the absence of CIB2/3, and the data in Figure 4A reveal intact hair cells. We changed the text “High-resolution confocal imaging did not reveal any obvious vestibular hair cell loss and hair bundles looked indistinguishable from control in Cib2KO/KO;Cib3KO/KO mice (Figure 4A).” to “High-resolution confocal imaging did not reveal any obvious vestibular hair cell loss in Cib2KO/KO;Cib3KO/KO mice (Figure 4A).” to avoid any confusion.

      Reviewer #2 (Recommendations For The Authors):

      Some datasets presented here can be published separately. Although I understand that the field is developing fast and there is no time to sort and fit the data by category or scale, everything needs to be published together and quickly.

      I have no real questions about the data on the functional association of CIB2 and 3 with TMC 1 and 2 in mouse hair cells as well as association preferences between their homologs in zebrafish. The authors have shown a clear differentiation of association preferences for CIB2 and CIB3 and the ability to substitute for each other in cochlear and vestibular hair cells. The importance of CIB2 for hearing and CIB3 for vestibular function is well documented. The absence of the startle response in cib2/3 negative zebrafish is a slight variation from what was observed in mice where CIB2 is sufficient for hearing. The data look very solid and show an overall structural and functional conservation of these complexes throughout vertebrates. The presented models look plausible, but of course, there is a chance that they will be corrected/improved in the future. 

      Thanks for appreciating the significance of our study.

      Regarding NMR, there is indeed a large number of TROSY peaks of uniformly labeled CIB2 undergoing shifts with sequential additions of the loop and the N-terminal TMC peptides. Something is going on. The authors may consider a special publication on this topic when at least partial peak assignments are established. 

      We are continuing our NMR studies of CIB and TMC interactions and plan to have follow-up studies.

      After reading the manuscript, I may suggest four topics for additional discussion. 

      (1) Maybe it is obvious for people working in the field, but for the general reader, the simulations performed with and without Ca2+ come out of the blue, with no explanation. The authors did not mention clearly that CIB proteins have at least two functional EF-hand (EF-hand-like) motifs that likely bind Ca2+ and thereby modulate the MET channel. 

      This is a good point. We have modified the introductory text to include: “CIB2 belongs to a family of four closely related proteins (CIB1-4) that have partial functional redundancy and similar structural domains, with at least two Ca2+/Mg2+-binding EF-hand motifs that are highly conserved for CIB2/3 (Huang et al., 2012).”

      If the data on affinities for Ca2+, as well as Ca2+-dependent propensity for dimerization and association with TMC exist, they should be mentioned for CIB2 and CIB3 and discussed.

      To address this, we have added the following text to the discussion: “How TMC + CIB interactions depend on Ca2+ concentration may have important functional implications for adaptation and hair cell mechanotransduction. Structures of CIB3 and worm CALM-1, a CIB2 homologue, both bind divalent ions via EF-hand motifs proximal to their C-termini (Jeong et al., 2022; Liang et al., 2021). Reports on CIB2 affinities for Ca2+ are inconsistent, with KD values that range from 14 µM to 0.5 mM (Blazejczyk et al., 2009; Vallone et al., 2018). Although qualitative pull-down assays done in the presence or the absence of 5 mM CaCl2 suggest that the TMC1 and CIB2 interactions are Ca2+-independent (Liang et al., 2021), strength and details of the CIB-TMC-IL1 and CIB-TMC-NT contacts might be Ca2+-dependent, especially considering that Ca2+ induces changes that lead to exposure of hydrophobic residues involved in binding (Blazejczyk et al., 2009).”

      Also, it is not clearly mentioned in the figure legends whether the size-exclusion experiments or TROSY NMR were performed in the presence of (saturating) Ca2+ or not. If the presence of Ca2+ is not important, it must be explained.  

      Size exclusion chromatography and NMR experiments were performed in the presence of 3 mM CaCl2. We have indicated this in appropriate figure captions as requested, and also mentioned it in the discussion text: “Interestingly, the behavior of CIB2 and CIB3 in solution (SEC experiments using 3 mM CaCl2) is different in the absence of TMC1-IL1.” and “Moreover, our NMR data (obtained using 3 mM CaCl2) indicates that TMC1-IL1 + CIB2 is unlikely to directly interact with CIB3.”

      (2) Speaking about the conservation of TMC-CIB structure and function, it would be important to compare it to the C. elegans TMC-CALM-1 structures. Is CALM-1, which binds Ca2+ near its C-terminus, homologous or similar to CIBs? 

      This is an important point. To address it, we have added the following text in the discussion: “Remarkably, the AF2 models are also consistent with the architecture of the nematode TMC-1 and CALM-1 complex (Jeong et al., 2022), despite low sequence identity (36% between human TMC1 and worm TMC-1 and 51% between human CIB2 and worm CALM-1). This suggests that the TMC + CIB functional relationship may extend beyond vertebrates.” We also added: “How TMC + CIB interactions depend on Ca2+ concentration may have important functional implications for adaptation and hair cell mechanotransduction. Structures of CIB3 and worm CALM-1, a CIB2 homologue, both bind divalent ions via EF-hand motifs proximal to their C-termini (Jeong et al., 2022; Liang et al., 2021).” 

      Additionally, superposition of CALM-1 (in blue) from the TMC-1 complex structure (PDB code: 7usx; Jeong et al., 2022) with one of our initial human CIB2 AF2 models (in red) shows similar folds, notably in the EF-hand motifs of CALM-1 and CIB2 (Author response image 1).

      Author response image 1.

      Superposition of CALM-1 structure (blue; Jeong et al., 2022) and AlphaFold 2 model of CIB2 (red). Calcium ions are shown as green spheres.

      (3) Based on simulations, CIBs stabilize the cytoplasmic surfaces of the dimerized TMCs.

      The double CIB2/3 knock-out, on the other hand, clearly destabilizes the morphology of stereocilia and leads to partial degeneration. One question is whether the tip link in the double null forms normally and whether there is a vestige of MET current in the beginning. The second question is whether the stabilization of the TMC's intracellular surface has a functional meaning. I understand that not complete knock-outs, but rather partial loss-of-function mutants may help answer this question. The reader would be impatient to learn what process most critically depends on the presence of CIBs: channel assembly, activation, conduction, or adaptation. Any thoughts about it?

      These are all interesting questions, although further investigations would be needed to understand CIB’s role in channel assembly, activation, conduction, and adaptation. We have added to the discussion text: “Further studies should help provide a comprehensive view into CIB function in channel assembly, activation, and potentially hair-cell adaptation.”

      (4) The authors rely on the permeation of FM dyes as a criterion for normal MET channel formation. What do they know about the permeation path a 600-800 Da hydrophobic dye may travel through? Is it the open (conductive) or non-conductive channel? Do ions and FM dyes permeate simultaneously or can this be a different mode of action for TMCs that relates them to TMEM lipid scramblases? Any insight from simulations?

      We are working on follow-up papers focused on elucidating the permeation mechanisms of aminoglycosides and small molecules (such as FM dyes) through TMCs as well as their potential scramblase activity.

      Reviewer #3 (Recommendations For The Authors):

      Introduction: 

      The rationale and context for determining whether Cib2 and Cib3 proteins are essential for mechanotransduction in zebrafish hair cells is completely lacking in the introduction. All background information about what is known about the MET complex in sensory hair cells focuses on work done with mouse cochlear hair cells without regard to other species. This is especially surprising as the third author uses zebrafish as an animal model and makes major contributions to this study, addressing the primary question posed in the introduction. Instead, the authors relegate this important information to the results section. Moreover, not mentioning the Jeong 2022 study when discussing the Liang 2021 findings is odd considering that the primary question is centered on CIB2 and TMC1/2 in other species. 

      Thank you for pointing this out. We now discuss and reference relevant background on the MET complex in zebrafish hair cells in the introduction. We added: “In zebrafish, Tmcs, Lhfpl5, Tmie, and Pcdh15 are also essential for sensory transduction, suggesting that these molecules form the core MET complex in all vertebrate hair cells (Chen et al., 2020; Erickson et al., 2019, 2017; Ernest et al., 2000; Gleason et al., 2009; Gopal et al., 2015; Maeda et al., 2017, 2014; Pacentine and Nicolson, 2019; Phillips et al., 2011; Seiler et al., 2004; Söllner et al., 2004).” We also added: “In zebrafish, knockdown of Cib2 diminishes both the acoustic startle response and mechanosensitive responses of lateral-line hair cells (Riazuddin et al., 2012).”

      Discussion: 

      The claim that mouse vestibular hair cells in the double KO are structurally normal is not well supported by the images in Fig. 4A and is at odds with the findings by Wang et al., 2023. More discussion about the discrepancy of these results (instead of glossing over it) is warranted. The hair bundles in the image of the zebrafish cib2/3 double knockout also appear abnormal, i.e. somewhat thinner. These results are consistent with Wang et al., 2023. Is it the case that neither image (mouse or fish) is representative? Unfortunately, the neuromast hair bundles in the double mutant are not shown, so it is difficult to draw a conclusion.

      The reviewer is right – we cannot evaluate mouse hair-cell bundle structure with the images shown in Figure 4. Our goal was to determine whether the vestibular hair cells had degenerated in the absence of CIB2/3, and the data in Figure 4A reveal intact hair cells. We changed the text “High-resolution confocal imaging did not reveal any obvious vestibular hair cell loss and hair bundles looked indistinguishable from control in Cib2KO/KO;Cib3KO/KO mice (Figure 4A).” to “High-resolution confocal imaging did not reveal any obvious vestibular hair cell loss in Cib2KO/KO;Cib3KO/KO mice (Figure 4A).” to avoid any confusion. In addition, we have changed the discussion as follows: “We demonstrate that vestibular hair cells in mice and zebrafish lacking CIB2 and CIB3 are not degenerated but have no detectable MET, assessed via FM 1-43 dye uptake, at time points when MET function is well developed in wild-type hair cells.”

      In the discussion, the authors mention that Shi et al showed differential expression with cib2/3 in tall versus short hair cells of zebrafish cristae. However, there is no in situ data in the Shi study for cib2 and cib3. Instead, Shi et al show in situs for zpld1a and cabp5b that mark these cell types in the lateral crista. The text is slightly misleading and should be changed to reflect that UMAP data support this conclusion.

      We have removed reference to cib2/3 zebrafish differential expression from our discussion. It is true that this differential expression has only been inferred by UMAP and not in situ data.

      It should be noted that the acoustic startle reflex is mediated by the saccule in zebrafish, which does not possess layers of short and tall hair cells, but rather only has one layer of hair cells. Whether saccular hair cells can be regarded as strictly 'short' hair cell types remains to be determined. In this paragraph of the discussion, the authors are confounding their interpretation by not being careful about which endorgan they are discussing (line 521). In fact, there is a general error in the manuscript in referring to vestibular organs without specifying what is shown. The cristae in zebrafish do not participate in behavioral reflexes until 25 dpf and they are not known to synapse onto the Mauthner cell, which mediates startle reflexes.

      Thank you for pointing out these issues. We now state in the results that the startle reflex in zebrafish relies primarily on the saccule. In the discussion we now focus mainly on short and tall hair cells of the crista. We also outline again in the discussion that the saccule is required for acoustic startle and the crista are for angular acceleration.

      Minor points: 

      Lines 298-302: The Zhu reference is not correct (wrong Zhu author). The statement on the functional reliance on Tmc2a versus Tmc1/2b should be referenced with Smith et al., 2020 and the correct Zhu 2021 study from the McDermott lab. Otherwise, the basis for the roles of the Tmcs in the cartoon in panel 6E is not clear.

      Thanks for pointing out this oversight. We have updated the reference.

      Line 548 should use numbers to make the multiple points, otherwise, this sentence is long and awkward. 

      The sentence has been re-arranged to make it shorter and to address another point raised by referees: “Structural predictions using AF2 show conserved folds for human and zebrafish proteins, as well as conserved architecture for their protein complexes. Predictions are consistent with previous experimentally validated models for the TMC1 pore (Ballesteros et al., 2018; Pan et al., 2018), with the structure of human CIB3 coupled to mouse TMC1-IL1 (Liang et al., 2021), and with our NMR data validating the interaction between human TMC1 and CIB2/3 proteins. Remarkably, the AF2 models are also consistent with the architecture of the nematode TMC-1 and CALM-1 complex (Jeong et al., 2022), despite low sequence identity (36% between human TMC1 and worm TMC-1 and 51% between human CIB2 and worm CALM-1). This suggests that the TMC + CIB functional relationship may extend beyond vertebrates.”

      Suggested improvements to the figures: 

      In general, some of the panels are so close together that keys or text for one panel look like they might belong to another. Increasing the white space would improve this issue. 

      Figure 3 has been adjusted as requested, Figure 7 has been split into two (Figure 7 and Figure 8) to make them more readable and to move data from the supplement to the main text as requested below.

      Fig1A. The control versus the KO images look so different that this figure fails to make the point that FM labeling is unaffected. The authors should consider substituting a better image for the control. It is not ideal to start off on a weak point in the first panel of the paper. 

      We agree and have updated Figure 1 accordingly.

      Fig1C. It is critical to state the stage here. Also P12? 

      scRNA-seq data are extracted from Matthew Kelley’s work and are a combination of P1, P12 and P100 utricular hair cells as follows: Utricular hair cells were isolated by flow cytometry from 12- and 100-day-old mice. Gene expression was then measured with scRNA-seq using the 10x platform. The data were then combined with a previously published single cell data set (samples from GSE71982) containing utricular hair cells isolated at P1. This dataset shows gene expression in immature vs mature utricular hair cells. The immature hair cells consist of a mixture of type I and type II cells.

      Fig1D. This schematic is confusing. The WT and KO labels are misplaced and the difference between gene and protein diagrams is not apparent. Maybe using a different bar diagram for the protein or at least adding 'aa' to the protein diagrams would be helpful. 

      Sorry for the confusion. We have revised panel 1D to address these concerns.

      Fig1E. Would be good to add 'mRNA' below the graph. 

      Done. We have added an “mRNA fold change” label on the Y-axis.

      Fig2C and D. Why use such a late-stage P18 for the immunohistochemistry? 

      Data presented in panel 2C are from P5 explants kept 2 days in vitro. For panel 2D, P18 is relevant since ABR were performed at P16 and hair cell degeneration in CIB2 mutants as previously described occurs around P18-P21.

      Fig3A. Why isn't the cib2-/- genotype shown? 

      Data on cib2-/- mutant mice have already been published and no vestibular deficits have been found. See Giese et al., 2017 and Michel et al., 2017.

      Fig3F. Does this pertain to the open field testing? It would make sense for this panel to be associated with those first panels. 

      Figure 3 has been updated as requested. 

      Fig4A. Which vestibular end organ? Are these ampullary cells? (Same question for 4B.) The statement in the text about 'indistinguishable' hair bundles is not supported by these panels. There appears to be an obvious difference here--the hair bundles look splayed in the double KO. Either the magnification of the images is not the same or the base of the bundles is wider in the double KO as well. This morphology appears to be at odds with results reported by Wang et al., 2023. 

      The vestibular end organs shown in Figure 4A are ampullae. Magnifications are consistent across all the panels. While the reviewer might be right regarding the hair bundle morphology, SEM data would be the best approach to address this point. Unfortunately, we currently do not have such data and we believe that only vestibular hair cell loss can be addressed using IF images. Thus, we are only commenting on the absence of obvious vestibular hair-cell loss in the double KO mutants.

      Fig4C. To support the claim that extrastriolar hair cells in the Cib3-/- mice are less labeled with FM dye it would be necessary to at least indicate the two zones but also to quantify the fluorescence. One can imagine that labeling is quite variable due to differences in IP injection.

      The two zones have been outlined in Figure 4C as requested.

      Fig5. Strangely the authors dedicate a third of Figure 1 to describing the mouse KO of Cib3, yet no information is given about the zebrafish CRISPR alleles generated for this study. There is nothing in the results text or in this figure. At least one schematic could be added to introduce the fish alleles and another panel of gEAR information about cib2 and cib3 expression to help explain the neuromast data as was done in Fig1C.

      We have added a supplemental figure (Figure 5-figure supplement 1) that outlines where the zebrafish cib2 and cib3 mutations are located. We also state in the results additional information regarding these lesions. In addition, we provide context for examining cib2/3 in zebrafish hair cells by referencing published data from inner ear and lateral line scRNA-seq data in the results section.

      Absolutely nitpicky here, but the arrow in 5H may be confused for a mechanical stimulus.

      The arrow in 5H has been changed to a dashed line.

      Why not include the data from the supplemental figure at the end of this figure? 

      The calcium imaging data in the supplement could be included in the main figure, but it would make for a massive figure. In eLife, supplements can be viewed quite easily online, next to the main figures.

      Fig6. The ampullary hair bundles look thinner in 6I. Is this also the case for double KO neuromast bundles? Such data support the findings of Wang et al., 2023.

      We did not quantify the width of the hair bundles in the crista or neuromast. It is possible that the bundles are indeed thinner, similar to the findings of Wang et al., 2023.

      Fig7A. IL1 should be indicated in this panel. 

      IL1 has been indicated, as suggested.

      Fig7 supp 12. Color coding of the subunits would be appreciated here. 

      Done as requested.

      Fig7. Overall the supplemental data for Figure 7 is quite extensive and the significance of this data is underappreciated. The authors could consider pushing panel C to supplemental as it is a second method to confirm the modeling interactions and instead highlight the dimer models which are more relevant than the monomer structures. Also, I find the additional alpha 0 helix quite interesting because it is not seen in the C. elegans cryoEM structure. Panel G should be given more importance instead of positioned deep into the figure next to the salt bridges in F. Overall, the novelty and significance of the modeling data deserves more importance in the paper. 

      We thank the reviewer for these helpful suggestions. The amphipathic alpha 0 helix is present in the C. elegans cryo-EM structure, although it is named differently in their paper (Jeong et al., 2022). We have now clarified this in the text: “Our new models feature an additional amphipathic helix, which we denote α0, extending almost parallel to the expected plane of the membrane bilayer without crossing towards the extracellular side (as observed for a mostly hydrophobic α0 in OSCA channels and labeled as H3 in the worm TMC-1 structure) …”. In addition, we have modified Figure 7 and highlighted panel G in a separate Figure 8 as requested.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      One enduring mystery involving the evolution of genomes is the remarkable variation they exhibit with respect to size. Much of that variation is due to differences in the number of transposable elements, which often (but not always) correlates with the overall quantity of DNA. Amplification of TEs is nearly always either selectively neutral or negative with respect to host fitness. Given that larger effective population sizes are more efficient at removing these mutations, it has been hypothesized that TE content, and thus overall genome size, may be a function of effective population size. The authors of this manuscript test this hypothesis by using a uniform approach to analysis of several hundred animal genomes, using the ratio of synonymous to nonsynonymous mutations in coding sequence as a measure of the overall strength of purifying selection, which serves as a proxy for effective population size over time. The data convincingly demonstrate that it is unlikely that effective population size has a strong effect on TE content and, by extension, overall genome size (except for birds).

      Strengths:

      Although this ground has been covered before in many other papers, the strength of this analysis is that it is comprehensive and treats all the genomes with the same pipeline, making comparisons more convincing. Although this is a negative result, it is important because it is relatively comprehensive and indicates that there will be no simple, global hypothesis that can explain the observed variation.

      Weaknesses:

      In several places, I think the authors slip between assertions of correlation and assertions of cause-effect relationships not established in the results.

      Several times in the previous version of the manuscript we used the expression “effect of dN/dS on…” which might suggest a causal relationship. We have rephrased these expressions and highlighted the changes in the main text, so that correlation is not mistaken with causation (see also responses to detailed comments below).

      In other places, the arguments end up feeling circular, based, I think, on those inferred causal relationships. It was also puzzling why plants (which show vast differences in DNA content) were ignored altogether.

      The analysis focuses on metazoans for two reasons: one practical and one fundamental.

      The practical reason is computational. Our analysis included TE annotation, phylogenetic estimation and dN/dS estimation, which would have been very difficult with the hundreds, if not thousands, of plant genomes available. If we had included plants, it would have been natural to include fungi as well, to have a complete set of multicellular eukaryotic genomes, adding to the computational burden. The second fundamental reason is that plants show important genome size differences due to more frequent whole genome duplications (polyploidization) than in animals. It is therefore possible that the effect of selection on genome size is different in these two groups, which would have led us to treat them separately, decreasing the interest of this comparison. For these reasons we chose to focus on animals that still provide very wide ranges of genome size and population size well suited to test the impact of genetic drift on the genomic TE content.

      Reviewer #2 (Public review):

      Summary:

      The Mutational Hazard Hypothesis (MHH) is a very influential hypothesis in explaining the origins of genomic and other complexity that seem to entail the fixation of costly elements. Despite its influence, very few tests of the hypothesis have been offered, and most of these come with important caveats. This lack of empirical tests largely reflects the challenges of estimating crucial parameters.

      The authors test the central contention of the MHH, namely that genome size follows effective population size (Ne). They marshal a lot of genomic and comparative data, test the viability of their surrogates for Ne and genome size, and use correct methods (phylogenetically corrected correlation) to test the hypothesis. Strikingly, they not only find that Ne is not THE major determinant of genome size, as is argued by MHH, but that there is not even a marginally significant effect. This is remarkable, making this an important paper.

      Strengths:

      The hypothesis tested is of great importance.

      The negative finding is of great importance for reevaluating the predictive power of the tested hypothesis.

      The test is straightforward and clear.

      The analysis is a technical tour-de-force, convincingly circumventing a number of challenges of mounting a true test of the hypothesis.

      Weaknesses:

      I note no particular weaknesses, but I believe the paper could be further strengthened in three major ways.

      (1) The authors should note that the hypothesis that they are testing is larger than the MHH.

      The MHH hypothesis says that (i) low-Ne species have more junk in their genomes and

      (ii) this is because junk tends to be costly because of increased mutation rate to nulls, relative to competing non/less-junky alleles.

      The current results reject not just the compound (i+ii) MHH hypothesis, but in fact any hypothesis that relies on i. This is notably a (much) more important rejection. Indeed, whereas MHH relies on particular constructions of increased mutation rates of varying plausibility, the more general hypothesis i includes any imaginable or proposed cost to the extra sequence (replication costs, background transcription, costs of transposition, ectopic expression of neighboring genes, recombination between homologous elements, misaligning during meiosis, reduced organismal function from nuclear expansion, the list goes on and on). For those who find the MHH dubious on its merits, focusing this paper on the MHH reduces its impact - the larger hypothesis that the small costs of extra sequence dictate the fates of different organisms' genomes is, in my opinion, a much more important and plausible hypothesis, and thus the current rejection is more important than the authors let on.

      The MHH is arguably the most structured and influential theoretical framework proposed to date based on the null assumption (i), therefore setting the paper up with the MHH is somewhat inevitable. Because of this, we mostly discuss the assumption (ii) (the mutational aspect brought about by junk DNA) and the peculiarities of TE biology that can drive the genome away from the expectations of (i). We however agree that the hazard posed by extra DNA is not limited to the gain of function via the mutation process, but can be linked to many other molecular processes as mentioned above. Moreover, we also agree that our results can be interpreted within the general framework of the nearly-neutral theory. They demonstrate that mutations, whether increasing or decreasing genome size, have a distribution of fitness effects that falls outside the range necessary for selection in larger populations. In the revised manuscript, we made the concept of hazard more comprehensive and further stressed that this applies not only to TEs but any nearly-neutral mutation affecting non-coding DNA (lines 491-496): “Notably, these results not only reject the theory of extra non-coding DNA being costly for its point mutational risk, but also challenge the more general idea of its accumulation depending on other kinds of detrimental effects, such as increased replication, pervasive transcription, or ectopic recombination. Therefore, our results can be considered more general than a mere rejection of the MHH hypothesis, as they do not support any theory predicting that species with low Ne would accumulate more non-coding DNA.”
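
      For readers less familiar with the nearly-neutral framework invoked above, the following is a minimal illustrative sketch and not part of the manuscript's analysis. It assumes Kimura's diffusion approximation for the fixation probability of a new mutation in a diploid population whose census size equals Ne; the function name and the selection coefficient used are hypothetical choices made only for illustration. It shows why hypothesis (i) predicts that a slightly costly insertion behaves almost neutrally at small Ne but is efficiently purged at large Ne.

```python
import math


def relative_fixation_rate(s: float, ne: float) -> float:
    """Fixation probability of a new mutation relative to a neutral one.

    Assumption: Kimura's diffusion approximation for a diploid population
    with census size equal to Ne,
        P = (1 - exp(-2s)) / (1 - exp(-4 * Ne * s)),
    divided by the neutral expectation 1 / (2 * Ne).
    """
    if s == 0:
        return 1.0
    exponent = -4.0 * ne * s
    if exponent > 700:
        # exp() would overflow; the ratio is effectively zero here.
        return 0.0
    # expm1 keeps precision when s is tiny.
    return 2.0 * ne * math.expm1(-2.0 * s) / math.expm1(exponent)


if __name__ == "__main__":
    s = -1e-6  # hypothetical small cost of one extra non-coding insertion
    for ne in (1e4, 1e5, 1e6, 1e7):
        print(f"Ne = {ne:.0e}: relative fixation rate = "
              f"{relative_fixation_rate(s, ne):.3g}")
```

      Under these assumptions the relative rate stays close to 1 while |4·Ne·s| is well below 1 and collapses toward 0 once it is well above 1, which is the scaling with Ne that the correlation analysis fails to detect.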

      (2) In addition to the authors' careful logical and mathematical description of their work, they should take more time to show the intuition that arises from their data. In particular, just by looking at Figure 1b one can see what is wrong with the non-phylogenetically-corrected correlations that MHH's supporters use. That figure shows that mammals, many of which have small Ne, have large genomes regardless of their Ne, which suggests that the coincidence of large genomes and frequently small Ne in this lineage is just that, a coincidence, not a causal relationship. Similarly, insects by and large have large Ne regardless of their genome size, even though many of them have large genomes, again suggesting that the coincidence in this lineage of generally large Ne and smaller genomes is not causal. Given that these two lineages are abundant on earth in addition to being overrepresented among available genomes (and were even more overrepresented when the foundational MHH papers collected available genomes), it begins to emerge how one can easily end up with a spurious non-phylogenetically corrected correlation: grab a few insects, grab a few mammals, and you get a correlation. Notably, the same holds for lineages not included here but that are highly represented in our databases (and all the more so 20 years ago): yeasts related to S. cerevisiae (generally small genomes and large median Ne despite variation) and angiosperms (generally large genomes (compared to most eukaryotes) and small median Ne despite variation). Pointing out these clear points will help non-specialists to understand why the current analysis is not merely a they-said-them-said case, but offers an explanation for why the current authors' conclusions differ from those of the MHH's supporters and, moreover, explains what is wrong with the MHH's supporters' arguments.

      We thank the referee for this perspective. We agree that comparing the dispersion of the points in the non-phylogenetically corrected correlation with the results of the phylogenetic contrasts intuitively emphasizes the importance of accounting for species relatedness. We expanded the discussion to stress the phylogenetic structure present in the data (lines 408-417): “It is important to note how not treating species traits as non-independent leads to artifactual results (Figure 2B-C). For instance, mammals have on average small population sizes and the largest genomes. Conversely, insects tend to have large Ne and overall small genomes. With a high sampling power and phylogenetic inertia being taken into account, our meta-analysis clearly points at a phylogenetic structure in the data: the main clades are each confined to separate genome size ranges regardless of their dN/dS variation. The other way around, variability in genome size can be observed in insects, irrespective of their dN/dS. Relying on non phylogenetically corrected models based on a limited number of species (such as that available at the time of the MHH proposal) can thus result in a spurious positive scaling between genome size and Ne proxies.”
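
      The intuition above can be illustrated with a minimal simulation sketch in Python. It assumes two hypothetical clades (labelled mammal-like and insect-like) whose clade means differ in both dN/dS and log genome size, while the two traits are drawn independently within each clade. All means, standard deviations and sample sizes are illustrative assumptions and are not taken from the dataset; the sketch also ignores the phylogeny within each clade.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_clade(rng, n, mean_dnds, mean_log_size, sd_dnds=0.03, sd_size=0.2):
    """Draw n species; dN/dS and log genome size are independent within the clade."""
    dnds = [rng.gauss(mean_dnds, sd_dnds) for _ in range(n)]
    size = [rng.gauss(mean_log_size, sd_size) for _ in range(n)]
    return dnds, size


rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical clade means (assumptions for illustration only):
# "mammal-like": higher dN/dS (smaller Ne), larger genomes
# "insect-like": lower dN/dS (larger Ne), smaller genomes
mam_dnds, mam_size = simulate_clade(rng, 200, mean_dnds=0.20, mean_log_size=9.4)
ins_dnds, ins_size = simulate_clade(rng, 200, mean_dnds=0.08, mean_log_size=8.5)

# Pooling the clades creates a correlation driven only by the clade means.
r_pooled = pearson(mam_dnds + ins_dnds, mam_size + ins_size)
r_mammals = pearson(mam_dnds, mam_size)
r_insects = pearson(ins_dnds, ins_size)

print(f"pooled r = {r_pooled:.2f}")
print(f"within mammal-like clade r = {r_mammals:.2f}")
print(f"within insect-like clade r = {r_insects:.2f}")
```

      In this toy setting the pooled correlation is strongly positive even though no relationship exists within either clade, which is the pattern a phylogenetically uncorrected analysis of a few mammals and a few insects would pick up.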

      (3) A third way in which the paper is more important than the authors let on is in the striking degree of the failure of MHH here. MHH does not merely claim that Ne is one contributor to genome size among many; it claims that Ne is THE major contributor, which is a much, much stronger claim. That no evidence exists in the current data for even the small claim is a remarkable failure of the actual MHH hypothesis: the possibility is quite remote that Ne is THE major contributor but that one cannot even find a marginally significant correlation in a huge correlation analysis deriving from a lot of challenging bioinformatic work. Thus this is an extremely strong rejection of the MHH. The MHH is extremely influential and yet very challenging to test clearly. Frankly, the authors would be doing the field a disservice if they did not more strongly state the degree of importance of this finding.

      We respectfully disagree with the reviewer that there is currently no evidence for an effect of Ne on genome size evolution. While it is accurate that our large dataset allows us to reject the universality of Ne as the major contributor to genome size variation, this does not exclude the possibility of such an effect in certain contexts. Notably, several studies find support for Ne determining genome size variation and entailing nearly-neutral TE dynamics under certain circumstances, e.g. with particularly strongly contrasted Ne and moderate divergence times (Lefébure et al., 2017 Genome Res 27: 1016-1028; Mérel et al., 2021 Mol Biol Evol 38: 4252-4267; Mérel et al., 2024 bioRxiv: 2024-01; Tollis and Boissinot, 2013 Genome Biol Evol 5: 1754-1768; Ruggiero et al., 2017 Front Genet 8: 44). The strength of such works is that they analyze the short-term dynamics of TEs in response to N<sub>e</sub> within groups of species/populations, where the cost posed by extra DNA is likely to be similar. Indeed, the MHH predicts genome size to vary according to the combination of drift and mutation under the nearly-neutral theory of molecular evolution. Our work demonstrates that this does not hold universally but does not exclude that it could hold locally. Moreover, defence mechanisms against TE proliferation are often complex molecular machineries that might or might not evolve according to different constraints among clades. We have detailed these points in the discussion (lines 503-518).

      Reviewer #3 (Public review):

      Summary

      The Mutational Hazard Hypothesis (MHH) suggests that lineages with smaller effective population sizes should accumulate slightly deleterious transposable elements leading to larger genome sizes. Marino and colleagues tested the MHH using a set of 807 vertebrate, mollusc, and insect species. The authors mined repeats de novo and estimated dN/dS for each genome. Then, they used dN/dS and life history traits as reliable proxies for effective population size and tested for correlations between these proxies and repeat content while accounting for phylogenetic nonindependence. The results suggest that overall, lineages with lower effective population sizes do not exhibit increases in repeat content or genome size. This contrasts with expectations from the MHH. The authors speculate that changes in genome size may be driven by lineage-specific host-TE conflicts rather than effective population size.

      Strengths

      The general conclusions of this paper are supported by a powerful dataset of phylogenetically diverse species. The use of C-values rather than assembly size for many species (when available) helps mitigate the challenges associated with the underrepresentation of repetitive regions in short-read-based genome assemblies. As expected, genome size and repeat content are highly correlated across species. Nonetheless, the authors report divergent relationships between genome size and dN/dS and TE content and dN/dS in multiple clades: Insecta, Actinopteri, Aves, and Mammalia. These discrepancies are interesting but could reflect biases associated with the authors' methodology for repeat detection and quantification rather than the true biology.

      Weaknesses

      The authors used dnaPipeTE for repeat quantification. Although dnaPipeTE is a useful tool for estimating TE content when genome assemblies are not available, it exhibits several biases. One of these is that dnaPipeTE seems to consistently underestimate satellite content (compared to RepeatMasker on assembled genomes; see Goubert et al. 2015). Satellites comprise a significant portion of many animal genomes and are likely significant contributors to differences in genome size. This should have a stronger effect on results in species where satellites comprise a larger proportion of the genome relative to other repeats (e.g. Drosophila virilis, >40% of the genome (Flynn et al. 2020); Triatoma infestans, 25% of the genome (Pita et al. 2017) and many others). For example, the authors report that only 0.46% of the Triatoma infestans genome is "other repeats" (which include simple repeats and satellites). This contrasts with previous reports of ≥25% satellite content in Triatoma infestans (Pita et al. 2017). Similarly, this study's results for "other" repeat content appear to be consistently lower for Drosophila species relative to previous reports (e.g. de Lima & Ruiz-Ruano 2022). The most extreme case of this is for Drosophila albomicans where the authors report 0.06% "other" repeat content when previous reports have suggested that 18% to 38% of the genome is composed of satellites (de Lima & Ruiz-Ruano 2022). It is conceivable that occasional drastic underestimates or overestimates for repeat content in some species could have a large effect on coevol results, but a minimal effect on more general trends (e.g. the overall relationship between repeat content and genome size).

      There are indeed some discrepancies between our estimates of low complexity repeats and those from the literature, owing to the approach used. Hence, occasional underestimates or overestimates of repeat content are possible. As noted, the contribution of “Other” repeats to the overall repeat content is generally very low in our estimates, which points to an underestimation bias. We thank the reviewer for providing this interesting review.

      We emphasized these points in the discussion of our revised manuscript (lines 358-376): “While the remarkable conservation of avian genome sizes has prompted interpretations involving further mechanisms (see discussion below), dnaPipeTE is known to generally underestimate satellite content (Goubert et al. 2015). This bias is more relevant for those species that exhibit large fractions of satellites compared to TEs in their repeatome. For instance, the portions of simple and low complexity repeats estimated with dnaPipeTE are consistently smaller than those reported in previous analyses based on assembly annotation for some species, such as Triatoma infestans (0.46% vs 25%; 7 Mbp vs 400 Mbp), Drosophila eugracilis (1.28% vs 10.89%; 2 Mbp vs 25 Mbp), Drosophila albomicans (0.06% vs 18 to 38%; 0.12 Mbp vs 39 to 85 Mbp) and some other Drosophila species (Pita et al. 2017; de Lima and Ruiz-Ruano 2022; Supplemental Table S2). Although the accuracy of Coevol analyses might occasionally be affected by such underestimations, the effect is likely minimal on the general trends. Inability to detect ancient TE copies is another relevant bias of dnaPipeTE. However, the strong correlation between repeat content and genome size and the consistency of dnaPipeTE and earlGrey results, even in large genomes such as that of Aedes albopictus, indicate that dnaPipeTE method is pertinent for our large-scale analysis. Furthermore, such an approach is especially fitting for the examination of recent TEs, as this specific analysis is not biased by very repetitive new TE families that are problematic to assemble.”

      Not being able to correctly estimate the quantity of satellites might pose a problem for quantifying the total content of junk DNA. However, the overall repeat content mostly composed of TEs correlates very well with genome size, both in the overall dataset and within clades (with the notable exception of birds) so we are confident that this limitation is not the explanation of our negative results. Moreover, while satellite information might be missing, this is not problematic to test our hypothesis, as we focus on TEs, whose proliferation mechanism differs significantly from that of tandem repeats and largely account for genome size variation.

      Another bias of dnaPipeTE is that it does not detect ancient TEs as well as more recently active TEs (Goubert et al., 2015 Genome Biol Evol 7: 1192-1205). Thus, the repeat content used for PIC and Coevol analyses here is inherently biased toward more recently inserted TEs. This bias could significantly impact the inference of long-term evolutionary trends.

      Indeed, dnaPipeTE is not good at detecting old TE copies due to the read-based approach, biasing the outcome towards new elements. We agree that TE content can be underestimated, especially in those genomes that tend to accumulate TEs rather than getting rid of them. However, the sum of old TEs and recent TEs is extremely well correlated to genome size (Pearson’s correlation: r = 0.87, p-value < 2.2e-16; PIC: slope = 0.22, adj-R<sup>2</sup> = 0.42, p-value < 2.2e-16). Our main result therefore does not rely on an accurate estimation of old TEs. In contrast, we hypothesized that recent TEs could be interesting because selection could be more likely to act on TE insertion and dynamics than on non-coding DNA as a whole. Our results demonstrate that this is not the case. It should be noted that, in spite of its limits with old TEs, dnaPipeTE is well-suited for this analysis as it is not biased by highly repetitive new TE families that are challenging to assemble. In the revised manuscript, we now emphasize the limitations of dnaPipeTE and discuss the consequences on our results. See lines 359-374 (reported above) and lines 449-455: “On the other hand, it is conceivable the avian TE diversity to be underappreciated due to the limits of sequencing technologies used so far in resolving complex repeat-rich regions. For instance, employment of long-reads technologies allowed to reveal more extended repeated regions that were previously ignored with short read assemblies (Kapusta and Suh 2017; Benham et al. 2024). Besides, quite large fractions might indeed be satellite sequences constituting relevant fractions of the genome that are challenging to identify with reference- or read-based methods (Edwards et al. 2025).”

      Finally, in preliminary work on dipteran species, we showed that the TE content estimated with dnaPipeTE is generally similar to that estimated from the assembly with earlGrey (Baril et al., 2024 Mol Biol Evol 41: msae068) across a good range of genome sizes going from drosophilid-like to mosquito-like (TE genomic percentage: Pearson’s r = 0.88, p-value = 1.951e-10; TE base pairs: Pearson’s r = 0.90, p-value = 3.573e-11; see also the corrected Supplementary Figure S2 and new Supplementary Figure S3). While the repeat content of these species is probably dominated by recent to moderately recent TEs, Ae. albopictus is an outlier for its genome size, and the estimates from the two methods remain largely consistent. However, estimating TE content with earlGrey required significantly more computation time (a ~300% increase), making it a very costly option (a similar issue applies to other assembly-based annotation pipelines). Given the rationale presented above, we decided to use dnaPipeTE instead of earlGrey.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Since I am not an expert in the field, some of these comments may simply reflect a lack of understanding on my part. However, in those cases, I hope they can help the authors clarify important points. I did have a bunch of comments concerning the complexity of the relationship between TEs and their hosts that would likely affect TE content, but I ended up deleting most of them because they were covered in the discussion. However, I do think that in setting up the paper, particularly given the results, it might have been useful to introduce those issues in the introduction. That is to say, treating TEs as a generic mutagen that will fit into a relatively simple model is unlikely to be correct. What will ultimately be more interesting are the particulars of the ways that the relationships between TEs and their host evolve over time. Finally, given the huge variation in plant genes with respect to genome size and TE content, along with really interesting variation in deletion rates, I'm surprised that they were not included. I get that you have to draw a line somewhere, and this work builds on a bunch of other work in animals, but it seems like a missed opportunity.

      We chose to restrict the introduction to the rationale behind the MHH as it is the starting point and focus of the manuscript. Because the aspects of the complexity of TE-host relationships are only covered in a speculative way, we limited them to the discussion but it is true that introducing them at the very beginning gives a more comprehensive overview. The introduction now includes a few sentences about lineage-specific selective effect of TEs and TE-host evolution (lines 83-86): “On top of that, an alternative TE-host-oriented perspective is that the accumulation of TEs in particular depends on their type of activity and dynamics, as well as on the lineage-specific silencing mechanisms evolved by host genomes (Ågren and Wright 2011).”

      Page 4. "The MHH is highly popular..." Evidence for this? It is fine as is, but it could also be seen as a straw man argument. Perhaps make clear this is an opinion of the authors?

      That MHH is popular and well-known is more a fact than an opinion: the original paper by Lynch and Conery (2003) and “The origins of genome architecture” by Lynch (2007) have respectively 1872 and 1901 citations to the present date (04/03/2025). Besides, the MHH is often invoked in highly cited reviews about TEs, e.g. Bourque et al., 2018 Genome Biol 19:1-12; Wells and Feschotte, 2020 Annu Rev Genet 54: 539-561.

      Page 4. "on phylogenetically very diverse datasets..." Given the fact that even closely related plants can show huge variation in genome size, it's a shame that they weren't included here. There are also numerous examples of closely related plants that are obligate selfers and out-crossers.

      This is true, and some studies have already tested the MHH in specific plant groups (Ågren et al., 2014 BMC Genom 15: 1-9; Hu et al., 2011 Nat Genet 43: 476-481; Wright et al., 2008 Int J Plant Sci 169: 105-118), including selfers vs out-crossers cases (Glémin et al., 2019 Evolutionary genomics: statistical and computational methods: 331-369). Further development in this kingdom would be interesting. However, the boundary was set to metazoans from the very beginning of the analyses to maintain a large phylogenetic span and a manageable computational burden. Furthermore, some of the included animal clades are supposed to display good Ne contrasts according to known LHTs or to previous literature: for instance, the very different Ne of mammals and insects, as well as narrower examples like Drosophilidae and solitary vs eusocial hymenopterans.

      Page 6. "species-poor, deep-branching taxa were excluded" I see why this was done, as these taxa would not provide close as well as distant comparisons, but I would have thought they might have provided some interesting outlying data. As the geneticists say, value the exceptions.

      The reason to exclude them was not only that they would solely provide very distant comparisons. The lack of a rich and balanced sampling would imply calculating nucleotide substitution rates over hundreds of millions of years, which typically leads to saturation of synonymous sites. When synonymous sites are saturated, the synonymous divergence is underestimated, and the dN/dS ratio is therefore no longer a reliable estimate of N<sub>e</sub>. Outside vertebrates and insects, the available genomes in a clade would mostly correspond to a few species from an entire phylum, making it challenging to estimate dN/dS and to correlate present-day genome size with Ne estimated over hundreds of millions of years.
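
      The saturation argument can be made concrete with a short, hedged sketch. It assumes the Jukes-Cantor (JC69) substitution model applied separately to synonymous and non-synonymous sites, which is a deliberate simplification and not the codon model or mapping procedure used in the analysis; the function names and the chosen true dN/dS of 0.1 are hypothetical. Under JC69, the expected fraction of differing sites after a true distance d (substitutions per site) is p = 3/4 (1 - exp(-4d/3)), which plateaus at 0.75.

```python
import math


def jc_observed_fraction(d):
    """Expected fraction of differing sites under JC69 for a true distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def apparent_ratio(true_ds, true_omega):
    """Ratio of observed (uncorrected) differences at non-synonymous vs synonymous sites.

    true_omega is the true dN/dS; dN is therefore true_omega * true_ds.
    """
    true_dn = true_omega * true_ds
    return jc_observed_fraction(true_dn) / jc_observed_fraction(true_ds)


# Illustrative values only: a fixed true dN/dS of 0.1 over increasingly long branches.
for ds in (0.01, 0.1, 1.0, 5.0):
    print(f"true dS = {ds:>4}: apparent dN/dS = {apparent_ratio(ds, 0.1):.3f}")
```

      As the true synonymous distance grows, observed synonymous differences stop increasing while non-synonymous differences continue to accumulate, so the apparent ratio drifts upward from the true value. Real estimators correct for multiple hits, but the correction becomes unstable as the observed fraction approaches the plateau, which is why very long branches were avoided.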

      Figure 1. What are the scaling units for each of these values? I get that dN/dS is between 0 and 1, but what about genome sizes? Are these relative sizes? Are TE content values a percent of the total? This may be mentioned elsewhere, but I think it is worth putting that information here as well.

      Thanks for pointing this out. Both genome sizes and TE contents are in bp; we added this information to the figure legend.

      Page 8. TE content estimates are invariably wrong given the diversity of TEs and, in many genomes, the presence of large numbers of low copy number "dead" elements. If that varies between taxa, this could cause problems. Given that, I would have liked to see the protocols used here be compared to a set of "gold standard" genomes with exceptionally well-annotated TEs (Humans and D. melanogaster, for instance).

      As already mentioned, dnaPipeTE is indeed biased towards young TEs (elements with more than 25-30% divergence are generally not detected). TE content can therefore be underestimated, especially in those genomes that tend to accumulate TEs rather than getting rid of them. A comparison of dnaPipeTE with TE annotations from assemblies is already provided for a subset of species, although most of them do not have “gold-standard” genomes. Some variation can be present - see Supplemental Figure S6 and comments of Reviewer#3 about detection of satellite sequences. However, the subset covers a good range of genome sizes and overall dnaPipeTE emerges as an appropriate tool to characterize the general patterns of repeat content variation.

      Page 11. "close to 1 accounts for more..." I would say "closer" rather than "close".

      Agreed and changed.

      Page 11. "We therefore employed this parameter..." I know you made the point earlier, but maybe reiterate the general point here that selection is lower on average with a lower effective population size. Actually, I'm wondering if we don't need a different term for long-term net effective population size, which dN/dS is measuring.

      We reiterated here the relationship among dN/dS, Ne and magnitude of selection (lines 200-204): “a dN/dS closer to 1 accounts for more frequent accumulation of mildly deleterious mutations over time due to increased genetic drift, while a dN/dS close to zero is associated with a stronger effect of purifying selection. We therefore employed this parameter as a genomic indicator of N<sub>e</sub>, as the two are expected to scale negatively between each other.”

      Page 11. "We estimated dN/dS with a mapping method..." I very much appreciate that the authors are using the same pipeline for the analysis of all of these taxa, but I would also be interested in how these dN/dS values compare with previously obtained values for a subset of intensively studied taxa.

      The original publication of the method demonstrated that dN/dS estimations using mapping are highly similar to those obtained with maximum likelihood methods, such as implemented in CODEML (Romiguier et al., 2014 J Evol Biol 27: 593-603). Below is the comparison for 16 vertebrate species from Figuet et al. (2016 Mol Biol Evol 33: 1517-1527), where dN/dS are reasonably correlated (slope = 0.57, adjusted-R<sup>2</sup> = 0.39, p-value = 0.006). That being said, some noise can be present as the compared genes and the phylogeny used are different. Although we expect values between 0 and 1, some variation is to be expected depending on both the species and the markers used, as substitution rates and/or selection strength might differ. Differences in dN/dS for the same species would not necessarily imply an issue with one of the methods.

      Author response image 1.

      Page 12. "As expected, Bio++ dN/dS scales positively with..." Should this be explicitly referenced earlier? I do see that references mentioning both body mass and longevity are included earlier, but the terms themselves are not.

      We added a list of the expected correlations for dN/dS and LHTs at the beginning of the paragraph (lines 205-208): “In general, dN/dS is expected to scale positively with body length, age at first birth, maximum longevity, age at sexual maturity and mass, and to scale negatively with metabolic rate, population density and depth range.”

      Page 12. "dN/dS estimation on the trimmed phylogeny deprived of short and long branches results in a stronger correlation with LHTs, suggesting that short branches..." and what about the long branches? Trimming them helps because LHTs change over long periods of time?

      Trimming of long branches should avoid saturation in the signal of synonymous substitutions if present (whereby an increase in dN is not paralleled by a corresponding increase in dS due to depletion of all sites). Excluding very long branches was one of the reasons why we excluded taxonomic groups with few species. See lines 131-133: “For reliable estimation of substitution rates, this dataset was further downsized to 807 representative genomes as species-poor, deep-branching taxa were excluded”. Correlating present-day genome size with Ne estimates over long periods of time could weaken a potential correlation. However, exploratory analyses (not included) did not indicate that excluding long branches improved the relationship between Ne and genome size/TE content. The rationale is explained in Materials and Methods but was wrongly formulated. We rephrased it and added a reference (lines 636-638): “Estimation of dN/dS on either very long or short terminal branches might lead to loss of accuracy due to branch saturation (Weber et al. 2014) or to a higher variance of substitution rates, respectively”.

      Table 2. "Expected significant correlations are marked in bold black; significant correlations opposite to the expected trend are marked in bold red." Expected based on the initial hypothesis? Perhaps frame it as a test of the hypothesis?

      As per the comment above, we added a sentence in the main text to clarify the expected correlations for dN/dS and LHTs (lines 205-208): “In general, dN/dS is expected to scale positively with body length, age at first birth, maximum longevity, age at sexual maturity and mass, and to scale negatively with metabolic rate, population density and depth range.” The second expected correlation is that between dN/dS and genome size/TE content, which is stated at the beginning of paragraph 2.5 (lines 244-245): “If increased genetic drift leads to TE expansions, a positive relationship between dN/dS and TE content, and more broadly with genome size, should be observed.”

      Page 14. "Based on the available traits, the two kinds of Ne proxies analyzed here correspond in general..." the two kinds being dN/dS and a selection of LHT?

      We rephrased the sentence as such (lines 233-234): “Based on the available traits, the estimations of dN/dS ratios obtained using two different methods correspond in general to each other”.

      Table 3. Did you explain why there is a distinction between GC3-poor and GC3-rich gene sets?

      No, the explanation is missing, thank you for pointing it out. The choice comes from the observations made by Mérel et al. (2024 bioRxiv: 2024-01), who find a stronger relationship between dN/dS and genome size in Drosophila using the same tool (Coevol) in GC3-poor genes than in GC3-rich ones or in random sets of genes exhibiting heterogeneity in GC3 content. There are several possible explanations for this. First, mixing genes with various base compositions in the same concatenate can alter the calculation of codon frequency and impair the accuracy of the model estimating substitution rates.

      Moreover, base composition and evolutionary rates may not be two independent molecular traits, at the very least in Drosophila, and more generally in species experiencing selection on codon bias. Because optimal codons are enriched in G/C bases at the third position (Duret and Mouchiroud, 1999 PNAS 96: 4482-4487), GC3-rich genes are likely to be more expressed and therefore evolve under stronger purifying selection than GC3-poor genes in Drosophila.

      Accordingly, Mérel and colleagues observed significantly higher dN/dS estimates for GC3-poor genes than for GC3-rich genes. Additionally, selection on codon usage acting on these highly expressed, GC3-rich genes violates the assumed neutrality of dS. This implies that dN/dS estimates based on genes under selection on codon bias are likely less appropriate proxies of Ne than expected.

      Although some of these observations may be specific to Drosophila, this criterion was taken into consideration as taking restricted gene subsets was required for Coevol runs. We added this explanation in materials and methods (lines 723-738).

      Page 16. "Coevol dN/dS scales negatively with genome size across the whole dataset (Slope = -0.287, adjusted-R<sup>2</sup> = 0.004, p-value = 0.039) and within insects" Should I assume that none of the other groups scale negatively on their own, but cumulatively, all of them do?

      Yes, and this is an “insect-effect”: the regression on the whole dataset is negative, but it no longer is when insects are removed (with the model still being far from significant).

      Page 16. "Overall, we find no evidence for a recursive association of dN/dS with genome size and TE content across the analysed animal taxa as an effect of long-term Ne variation." I get the point, but this is starting to feel a bit circular. What you see is a lack of an association between dN/dS and TE content, but what do you mean by "as an effect of..." here? You are using dN/dS as a proxy, so the wording here feels odd.

      See the reply below.

      Page 17. I'm not sure that "effect" here is the word to use. You are looking at associations, not cause-effect relationships. Certainly, dN/dS is not causing anything; it is an effect of variation in purifying selection.

      Agreed, dN/dS is the ratio reflecting the level of purifying selection, not the cause itself. dN/dS is employed here as the independent variable in the correlation with genome size or TE content. dN/dS has an “effect” on the dependent variables in the sense that it can predict their variation, not in the sense that it is causing genome size to vary. We rephrased this and similar sentences to avoid misunderstandings (changes are highlighted in the revised text).

      Page 17. "Instead, mammalian TE content correlates positively with metabolic rate and population density, and negatively with body length, mass, sexual maturity, age at first birth and longevity." I guess I'm getting tripped up by measures of current LHTs and historical LHTs which, I'm assuming, varies considerably over the long periods of time that impact TE content evolution.

      PIC analyses can be considered as correlations on current LHTs as we compare values (or better, contrasts) at the tips of phylogenies. In the case of Coevol, traits are inferred at internal nodes, in such a way that the model should take into account the historical variation of LHTs, too.
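
      For readers unfamiliar with contrasts, the following is a minimal sketch of the standard independent-contrasts algorithm (Felsenstein 1985) on a toy four-species tree. It is an illustration under stated assumptions, not the pipeline used in the analysis: the tree encoding, the function name `contrasts`, the branch lengths and the trait values are all hypothetical.

```python
import math


def contrasts(node, traits):
    """Compute standardized independent contrasts for a binary tree.

    node is either a tip name (str) or a pair ((left, left_len), (right, right_len)).
    Returns (estimated node value, extra branch length, list of contrasts).
    """
    if isinstance(node, str):
        return traits[node], 0.0, []
    (left, v_l), (right, v_r) = node
    x_l, extra_l, c_l = contrasts(left, traits)
    x_r, extra_r, c_r = contrasts(right, traits)
    # Branches leading to internal nodes are lengthened to account for
    # the uncertainty in the estimated node value.
    v_l += extra_l
    v_r += extra_r
    contrast = (x_l - x_r) / math.sqrt(v_l + v_r)
    value = (x_l / v_l + x_r / v_r) / (1.0 / v_l + 1.0 / v_r)
    extra = v_l * v_r / (v_l + v_r)
    return value, extra, c_l + c_r + [contrast]


# Toy tree ((A:1,B:1):2,(C:1,D:1):2) with hypothetical trait values at the tips.
tree = (((("A", 1.0), ("B", 1.0)), 2.0), ((("C", 1.0), ("D", 1.0)), 2.0))
trait_x = {"A": 1.0, "B": 2.0, "C": 10.0, "D": 11.0}
trait_y = {"A": 5.0, "B": 6.0, "C": 1.0, "D": 2.0}

_, _, cx = contrasts(tree, trait_x)
_, _, cy = contrasts(tree, trait_y)

# Contrasts have an expected mean of zero, so the regression is forced through the origin.
slope = sum(a * b for a, b in zip(cx, cy)) / sum(a * a for a in cx)
print("contrasts for x:", [round(c, 3) for c in cx])
print("contrasts for y:", [round(c, 3) for c in cy])
print("slope through origin:", round(slope, 3))
```

      Only tip values enter the calculation, which is why PIC can be read as a correlation on current trait values; a tree with n tips yields n - 1 contrasts, each treated as an independent data point.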

      Page 18. "positive effect of dN/dS on recent TE insertions..." Again, this is not a measure of the effect of dN/dS on TE insertions, it is a measure of correlation. I know it's shorthand, but in this case, I think it really matters that we avoid making cause inferences.

      We have rephrased this as “...very weak positive correlation of dN/dS with recent TE insertions…”.

      Page 18. "are consistent with the scenarios depicted by genome size and overall TE content in the corresponding clades." Maybe be more explicit here at the very end of the results about what those scenarios are.

      Correlating the recent TE content with dN/dS and LHTs basically recapitulates the relationship found using the other genomic traits (genome size and overall TE content). We have rephrased the closing sentence as “Therefore, the coevolution patterns between population size and recent TE content are consistent with the pictures emerging from the comparison of population size proxies with genome size and overall TE content in the corresponding clades” (lines 312-315).

      Page 19. "However, the difficulty in assembling repetitive regions..." I would say the same is true of TE content, which is almost always underestimated for the same reasons.

      “Repetitive regions” is here intended as an umbrella term including all kinds of repeats, from simple ones to transposable elements.

      Page 20. "repeat content has a lower capacity to explain size compared to other clades." Perhaps, but I'm not convinced this is not due to large numbers of low copy number elements, perhaps purged at varying rates. Are we certain that dnaPipeTE would detect these? Have rates of deletion in the various taxa examined been estimated?

      It is possible that low copy number elements are detected differently, according to the rate of decay in different species and depending also on the annotation method (indeed low copy families are less likely to be captured during read sampling by dnaPipeTE). A negative correlation between assembly size and deletion rate was observed in birds (Ji et al., 2022 Sci Adv 8: eabo0099). So we should expect a rate of TE removal inversely proportional to genome size, a positive correlation between TE content and genome size, and a negative relationship between TE content and deletion rate, too. The relationship of TE content with deletion rate and genome size, however, appears more complex than this, even in this paper, which uses assembly-based TE annotations. Moreover, misestimations of repeat content are also potentially due to the limited capacity of dnaPipeTE to detect simple and low complexity repeats (see comments from Reviewer#3), which might be important genomic components in birds (see a few comments below).

      Page 21. "DNA gain, and their evolutionary dynamics appear of prime importance in driving genome size variation." How about DNA loss over time?

      See response to the comment below.

      Page 22. "in the latter case, the pace of sequence erosion could be in the long run independent of drift and lead to different trends of TE retention and degradation in different lineages." Ah, I see my earlier question is addressed here. How about deletion as a driver as well?

      Deletion was not investigated here. However, deletion processes surely differ widely across animals, and their impact merits study within a comparative framework as well. Small-scale deletion events have even been proposed to counteract the increase in genome size by TE expansion (Petrov et al., 2002 Theor Popul Biol 61: 531-544). In fact, their magnitude would not be high enough to effectively counteract processes of genome expansion in most organisms (Gregory, 2004 Gene 324: 15-34). However, larger-scale deletions might play an important role in determining genome size by counterbalancing DNA gain (Kapusta et al., 2017 PNAS 114: E1460-E1469; Ji et al., 2023 Sci Adv 8: eabo0099). For the sake of space we do not delve into this issue in detail, but we do provide some perspectives on the role of deletion (see lines 518-521 and 535-541).

      Page 22. "however not surprising given the higher variation of TE load compared to the restricted genome size range." I admit, I'm struggling with this. If it isn't genes, and it isn't satellites, and it isn't TEs, what is it?

      Most birds have ~1 Gb genomes and display very low TE contents. Other studies annotated TEs in avian genome assemblies and also found only a weak correlation between the amount of TEs and genome size (Ji et al., 2023 Sci Adv 8: eabo0099, Kapusta and Suh, 2016 Ann N Y Acad Sci 1389: 164-185). It is possible that the TE diversity is underappreciated in birds due to the limits of sequencing technologies used so far in resolving complex repeat-rich regions. For instance, the use of long-read technologies revealed more extended repeated regions that were previously ignored with short-read assemblies (Kapusta and Suh, 2016 Ann N Y Acad Sci 1389: 164-185). Besides, quite large fractions of the genome might indeed consist of satellite sequences (Edwards et al., 2025 bioRxiv: 2025-02). We added this perspective in the discussion (lines 446-455): “As previous studies find relatively weak correlations between TE content and genome size in birds (Ji et al. 2022; Kapusta and Suh 2017), it is possible for the very narrow variation of the avian genome sizes to impair the detection of consistent signals. On the other hand, it is conceivable the avian TE diversity to be underappreciated due to the limits of sequencing technologies used so far in resolving complex repeat-rich regions. For instance, employment of long-reads technologies allowed to reveal more extended repeated regions that were previously ignored with short read assemblies (Kapusta and Suh 2017; Benham et al. 2024). Besides, quite large fractions might indeed be satellite sequences constituting relevant fractions of the genome that are challenging to identify with reference- or read-based methods (Edwards et al. 2025).” See also responses to Reviewer#3’s concerns about dnaPipeTE.

      Page 24. "Our findings do not support the quantity of non-coding DNA being driven in..." Many TEs carry genes and are "coding".

      Yes. Non-coding DNA is intended here as the non-coding portion of genomes not directly involved in organisms’ functions and fitness (in other words, sequences not undergoing purifying selection). TEs do have coding parts but are for the most part molecular parasites hijacking hosts’ machinery.

      Page 25. "There is some evidence of selection acting against TEs proliferation." Given that the vast majority of TEs are recognized and epigenetically silenced in most genomes, I'd say the evidence is overwhelming. Here I suspect you mean evidence for success in preventing proliferation. Actually, since we know that systems of TE silencing have a cost, it might be worth considering how the costs and benefits of these systems may have influenced overall TE content.

      We meant selection against TE proliferation in the making, notably visible at the level of genome-wide signatures for relaxed/effective selection. We rephrased it as “Evidence for signatures of negative selection against TE proliferation exist at various degrees.” (line 543).

      Reviewer #3 (Recommendations for the authors):

      Page 14: Please define GC3-rich and GC3-poor gene sets and how they were established, as well as why the analyses were conducted separately on GC3-rich and GC3-poor genes.

      We added a detailed explanation for the choice of GC3-rich and GC3-poor genes (see modified section Methods - Phylogenetic independent contrasts and Coevol reconstruction, lines 723-738).

      “Genes were selected according to their GC content at the third codon position (GC3). Indeed, mixing genes with heterogeneous base composition in the same concatenate might result in an alteration of the calculation of codon frequencies, and consequently impair the accuracy of the model estimating substitution rates (Mérel et al. 2024). Moreover, genes with different GC3 levels can reflect different selective pressures, as highly expressed genes should be enriched in optimal codons as a consequence of selection on codon usage. In Drosophila, where codon usage bias is at play, most optimal codons present G/C bases at the third position (Duret and Mouchiroud, 1999), meaning that genes with high GC3 content should evolve under stronger purifying selection than GC3-poor genes. Accordingly, Mérel et al. (2024) do find a stronger relationship between dN/dS and genome size when using GC3-poor genes, as compared to GC3-rich genes or gene concatenates of random GC3 composition. Finally, dN/dS can be influenced by GC-biased gene conversion (Bolívar et al. 2019; Ratnakumar et al. 2010), and the strength at which such substitution bias acts can be reflected by base composition. For these reasons, two sets of 50 genes with similar GC3 content were defined in order to employ genes undergoing similar evolutionary regimes.”
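      As an illustration of the quantity described above, the following is a minimal sketch of how GC3 could be computed for an in-frame coding sequence and how genes could be ranked by it. The function names, the handling of ambiguous bases, and the rule of taking the n poorest and n richest genes are assumptions made for this example; the response does not specify how the two sets of 50 genes were actually drawn.

```python
def gc3(cds: str) -> float:
    """Fraction of G or C bases at third codon positions.

    Assumes `cds` is an in-frame coding sequence; trailing bases that do not
    complete a codon are ignored, as are ambiguous third-position bases.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    thirds = [seq[i + 2] for i in range(0, usable, 3)]
    called = [base for base in thirds if base in "ACGT"]
    if not called:
        raise ValueError("no unambiguous third-codon positions")
    return sum(base in "GC" for base in called) / len(called)


def extreme_gc3_sets(cds_by_gene: dict, n: int):
    """Return (GC3-poor, GC3-rich) lists of n gene names each.

    Hypothetical selection rule for illustration only: rank genes by GC3
    and take the n lowest and the n highest.
    """
    if 2 * n > len(cds_by_gene):
        raise ValueError("not enough genes for two non-overlapping sets")
    ranked = sorted(cds_by_gene, key=lambda gene: gc3(cds_by_gene[gene]))
    return ranked[:n], ranked[-n:]
```

      Ranking by GC3 before concatenation keeps genes of similar base composition together, which is the stated motivation for analysing the two sets separately.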

      Please add lines between columns and rows in tables. Table 3 is especially difficult to follow due to its size, and lines separating columns and rows would vastly help with readability.

      We added lines delimiting cells in all the main tables.

      Throughout the text and figures, please be consistent with either scientific names or common names for lineages or clades.

      Out of the five groups, for four of them the common name is the same as the scientific one (except Aves/birds).

      Regarding the title for section 3.1, I don't believe "underrate" is the best word here. I find this title confusing.

      We replaced the term “underrate” with “underestimate” in the title.

      The authors report that read type (short vs. long) does not have a significant effect on assembly size relative to C-value. However, the authors (albeit admittedly in the discussion) removed lower-quality assemblies using a minimum N50 cutoff. Thus, this lack of read-type effect could be quite misleading. I strongly recommend the authors either remove this analysis entirely from the manuscript or report results both with and without their minimum N50 cutoff. I expect that read type should have a strong effect on assembly size relative to C-value, especially in mammals where TEs and satellites comprise ~50% of the genome.

      Yes, it's likely that if we took any short-read assembly, we would have a short-read effect. We do not mean to suggest that in general short reads produce the same assembly quality as long reads, but that in this dataset we do not need to account for the read effect in the model to predict C-values. Adding the same test including all assemblies would be very time-consuming because C-values would have to be manually checked, as already done for the species in our dataset. If we removed this test, readers might wonder whether our genome size predictions are distorted by a short-read effect. We now make it clear that this quality filter likely affects our observations: “This suggests that the assemblies selected for our dataset can mostly provide a reliable measurement of genome size, and thus a quasi-exhaustive view of the genome architecture.” (lines 333-335).

      There seem to be some confusing inconsistencies between Supplementary Table S2 and Supplementary Figure S2. In Supplementary Table S2, the authors report ~24% of the Drosophila pectinifera genome as unknown repeats. This is not consistent with the stacked bar plot for D. pectinifera in Supplementary Figure S2.

      True, the figure is wrong, thank you for spotting the error. The plot of Supplemental Figure S2 was remade with the correct repeat proportions as in Supplementary Tables S2 and S4. Because the reference genome sizes on which TE proportions are calculated are different for the two methods, we added another supplemental figure showing the same comparison in Kbp (now Supplemental Figure S3).

      At the bottom of page 20: "many species with a high duplication score in our dataset correspond to documented duplication" How many?

      Salmoniformes (9), Acipenseriformes (1), Cypriniformes (3) out of 23 species with high duplication score. It’s detailed in the results (lines 193-196): “Of the 24 species with more than 30% of duplicated BUSCO genes, 13 include sturgeon, salmonids and cyprinids, known to have undergone whole genome duplication (Du et al. 2020; Li and Guo 2020; Lien et al. 2016), and five are dipteran species, where gene duplications are common (Ruzzante et al. 2019).”

      Top of page 21: "However, the contribution of duplicated genes to genome size is minimal compared to the one of TEs, and removing species with high duplication scores does not affect our results: this implies that duplication does not impact genome size strongly enough to explain the lack of correlation with dN/dS." This sentence is confusing and needs rewording.

      We reworded the sentence (lines 383-384): “this implies that duplication is unlikely to be the factor causing the relationship between genome size and dN/dS to deviate from the pattern expected from the MHH”.

      Beginning of section 3.3: "Our dN/dS calculation included several filtering steps by branch length and topology: indeed, selecting markers by such criteria appears to be an essential step to reconcile estimations with different methodologies" A personal communication is cited here. Are there really no peer-reviewed sources supporting this claim?

      This mainly comes from a comparison of dN/dS calculation with different methods (notably the ML method of bpp vs the Coevol Bayesian framework) on a set of Zoonomia species. We observed that estimations with different methods appeared correlated but with some noise: filtering out genes with deviant topologies (by a combination of PhylteR and of an unpublished Bayesian shrinkage model) further reconciled the estimations obtained from different methods. Results are not shown here, but the description of an analogous procedure is presented in Bastian, M. (2024). Génomique des populations intégrative: de la phylogénie à la génétique des populations (Doctoral dissertation, Université Lyon 1), which we added to the references.

      Figure 2 needs to be cropped to remove the vertical gray line on the right of the figure as well as the portion of visible (partly cropped) text at the top. What is the "Tree scale" in Figure 1?

      The quality of Figure 2 in the main text was adjusted. The tree scale is in amino acid substitutions; we added this to the legend of the figure.

      It is also unclear whether the authors used TE content or overall repeat content for their analyses.

      The overall repeat content includes both TEs and other kinds of repeats (simple repeats, low complexity repeats, satellites). The contribution of such other repeats to the total content is generally quite low for most species compared to that of TEs (only 13 genomes in the whole dataset have more than 3% of “Other” repeats). Conversely, the “other” repeats were not included in the recent content since the divergence of a copy from its consensus sequence is pertinent only for TEs.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      The manuscript aims to elucidate the impact of a prophage within the genome of Shewanella fidelis on its interaction with the marine tunicate Ciona robusta. The authors made a deletion mutant of S. fidelis that lacks one of its two prophages. This mutant exhibited an enhanced biofilm phenotype, as assessed through crystal violet staining, and showed reduced motility. The authors examined the effect of prophage deletion on several genes that could modulate cyclic-diGMP levels. While no significant changes were observed under in vitro conditions, the gene for one protein potentially involved in cyclic-diGMP hydrolysis was overexpressed during microbe-host interactions. The mutant was retained more effectively within a one-hour timeframe, whereas the wild-type (WT) strain became more abundant after 24 hours. Fluorescence microscopy was used to visualize the localization patterns of the two strains, which appeared to differ. Additionally, a significant difference in the expression of one immune protein was noted after one hour, but this difference was not evident after 23 hours. An effect of VCBP-C addition on the expression of one prophage gene was also observed.

      Strengths:

      I appreciate how the authors integrate diverse expertise and methods to address questions regarding the impact of prophages on gut microbiome-host interactions. The chosen model system is appropriate, as it allows for high-throughput experimentation and the application of simple imaging techniques.

      Weaknesses:

      My primary concern is that the manuscript primarily describes observations without providing insight into the molecular mechanisms underlying the observed differences. It is particularly unclear how the presence of the prophage leads to the phenotypic changes related to bacterial physiology and host-microbe interactions.

      We appreciate the reviewer’s enthusiastic overall feedback. The current manuscript presents experimental evidence of the biological impact of the deletion of a stably integrated prophage in the genome of Shewanella fidelis 3313. The molecular mechanisms responsible for these biological effects are currently unknown, but based on the limited genetic insight of some predicted gene regions, we can speculate on prophage-mediated influences impacting swimming behaviors. Below, we address additional concerns raised by the reviewer.

      Which specific prophage genes are critical, or is the insertion at a specific site in the bacterial genome the key factor?  While significant effects on bacterial physiology are reported under in vitro conditions, there is no clear attribution to particular enzymes or proteins.

      In this particular case, it is not entirely clear, as most ORFs within the prophage region have unknown functions, i.e., predicted as hypothetical proteins. In addition, the original insertion site does not appear to interrupt any specific gene but may impact adjacent genes/pathways (Fig 1b). Enhanced annotations, along with future targeted deletion methods for distinct prophage segments, will help us better investigate which predicted gene regions influence the observed traits. This will deepen our understanding of the mechanisms that regulate prophage influence on these traits.

      In contrast, when the system is expanded to include the tunicate, differences in the expression of a cyclic-diGMP hydrolase become apparent. Why do we not observe such differences under in vitro conditions, despite noting variations in biofilm formation and motility? Furthermore, given that the bacterial strain possesses two prophages, I am curious as to why the authors chose to target only one and not both.

      Differences in expression patterns of c-di-GMP regulators were also noted in vitro, but they just missed the statistical significance threshold when rho was used as a bacterial reference gene. The expression pattern of pdeB was consistent among each biological replicate, however. In full transparency, pdeB qPCR was originally performed with recA as a reference standard (bioRxiv preprint, ver 1). Here, significant changes in pdeB expression were observed in the in vitro assays comparing WT and ΔSfPat. These results prompted us to study changes in pdeB expression during in vivo colonization experiments, which also revealed significant changes. However, there was a concern that a potential SOS response would also activate recA, despite our preliminary data suggesting SOS was not involved. As a precaution, we repeated the experiments with rho as a reference gene after it was identified as a stable reference. However, with rho as a reference gene, statistically significant responses were noted during in vivo colonization, but not in the in vitro assays.

      In the current manuscript, one prophage was targeted based on preliminary findings indicating that the SfPat prophage region influences behaviors likely to impact colonization of the Ciona robusta gut. A separate genetic segment was also previously targeted for deletion as a misidentified prophage-like region, but that strain is not included in the current description. The currently presented data indicate that the observed phenomena can be attributed to the SfPat prophage.

      Regarding the microbe-host interaction, it is not clear why the increased retention ability of the prophage deletion strain did not lead to greater cell retention after 24 hours, especially since no differences in the immune response were observed at that time point.

      A predominantly adherent (non-motile) phenotype would likely facilitate elimination within fecal strings. There is substantial evidence from multiple model systems that strong swimming ability enhances the exploration and colonization of mucosal surfaces. Swimming helps with the penetration of mucus layers, chemotaxis toward epithelial surfaces, and overall “decision-making” in terms of shifting from a free-swimming (planktonic) state in the lumen within dietary material to a more sessile, adherent phenotype at the mucosal surface.

      Concerning the methodological approach, I am puzzled as to why the authors opted for qPCR instead of transcriptomics or proteomics. The latter approaches could have provided a broader understanding of the prophage's impact on both the microbe and the host.

      We agree with the reviewer that a transcriptomics approach would provide a broader understanding of the prophage’s impact on the microbe and animal host. Future studies will include a full multi-omic evaluation of this interaction. 

      Reviewer #1 (Recommendations for the authors):

      Besides my above mentioned issues, I have a few more mini things:

      (A) What makes S. fidelis a persistent member of the host microbiome? Please elaborate more on quantitative studies in this respect.

      Shewanella species are stable members of the Ciona gut, and previous efforts (Dishaw et al, 2016) revealed that chitin and/or secreted host effectors could influence biofilm formation. The Ciona gut produces copious amounts of endogenous chitin-rich mucus, and a variety of bacteria have been identified that thrive under these conditions. In addition, versatile bacteria like Shewanella sp. likely expand the metabolic potential of filter-feeders like Ciona. Thus, our subsequent studies began to focus on these and other microbes isolated from the Ciona gut that appear to be stable residents. Identical strains have been recovered numerous times (since 2011) from this wild population of Ciona robusta.  

      (B) The authors use the word inter kingdom and refer to phage, bacterium and animal. As phages are not part of the three kingdoms of life I believe the terminology is wrong.

      Thank you for bringing this to our attention. In this context, we were referring to bacteria+phage as a unit and their interkingdom interaction with the animal host. But we recognize that this term can be misleading. Another, more appropriate term is ‘tripartite,’ and we have changed interkingdom to tripartite as appropriate, e.g., the abstract.

      (C) I like lines 55-61 and was expecting to see in the manuscript what of those things would be true for the chosen prophage.

      We looked at the coding region annotations within the prophage and the adjacent regions. The prophage coding regions are mostly annotated as unknown or predicted proteins, and a few as known phage-related components. We intend to reanalyze future and improved annotations and conduct deletion experiments targeting specific open reading frames (ORFs).

      (D) In line 76 the authors mention a Gödecke reference for Pseudomonas. I believe that this paper only deals with S. oneidensis.

      The inadvertent Gödecke reference has been removed.

      (E) All figures: The captions are too short to understand what the figures are showing and everything is too small and hard to read or see. Along these lines it is often unclear what the many datapoints show. Biological replicates, technical replicates....Overall figure 1 does not seem to contain much information.

      Figures and captions have been improved as suggested. Thank you for bringing this to our attention.

      (F) Figure 3 what are a and b showing?

      Figure and descriptive legend have been improved.

      (G) Figure 4: Why did the author check expression only for one gene after 1 h but several genes after 24 h?

      Since we observed that in vitro VCBP-C alters biofilms of S. fidelis 3313 (Dishaw et al 2016), we hypothesized that the bacteria may alter host VCBP-C expression and that the influence of integrated prophages may further modulate gene expression. Since VCBP-C is endogenously expressed in the gut of Ciona, we expected that early exposure/colonization (one hour) would be crucial for the bacterial-VCBP interactions. Hence, the VCBP-C was our primary target. We then tested multiple immune response genes at 24 hours to get a more detailed understanding of the maturing immune responses. Future studies will expand our efforts using global transcriptomics to understand better the immune response during bacterial exposure and colonization events.

      (H) Do the authors mean stationary or localised?

      We are not sure about the context of the reviewer’s question here but we think our modifications have addressed these concerns. 

      Reviewer #2 (Public review):

      Summary:

      In the manuscript, "Prophage regulation of Shewanella fidelis 3313 motility and biofilm formation: implications for gut colonization dynamics in Ciona robusta", the authors are experimentally investigating the idea that integrated viruses (prophages) within a bacterial colonizer of the host Ciona robusta affect both the colonizer and the host. They found a prophage within the Ciona robusta colonizing bacterium Shewanella fidelis 3313, which affected both the bacteria and host. This prophage does so by regulating the phosphodiesterase gene pdeB in the bacterium when the bacterium has colonized the host. The prophage also regulates the activity of the host immune gene VCBP-C during early bacterial colonization. Prophage effects on both these genes affect the precise localization of the colonizing bacterium, motility of the bacterium, and bacterial biofilm formation on the host. Interestingly, VCBP-C expression also suppressed a prophage structural protein, creating a tripartite feedback loop in this symbiosis. This is exciting research that adds to the emerging body of evidence that prophages can have beneficial effects not only on their host bacteria but also on how that bacteria interacts in its environment. This study establishes the evolutionary conservation of this concept with intriguing implications of prophage effects on tripartite interactions.

      Strengths:

      This research effectively shows that a prophage within a bacterium colonizing a model ascidian affects both the bacterium and the host in vivo. These data establish the prophage effects on bacterial activity and expand these effects to the natural interactions within the host animal. The effects of the prophage through deletion on a suite of host genes are a strength, as shown by striking microscopy.

      Weaknesses:

      Unfortunately, there are abundant negative data that cast some limitations on the interpretation of the data. That is, examining specific gene expression has its limitations, which could be avoided by global transcriptomics of the bacteria and the host during colonization by the prophage-containing and prophage-deleted bacteria (1 hour and 24 hours). In this way, the tripartite interactions leading to mechanism could be better established.

      We thank the reviewer for their comments and recognize this important limitation. As a follow-up to the current study, we plan to perform more comprehensive global meta-transcriptomics analyses to better understand differentially expressed genes across both the host and microbe during colonization.

      Impact:

      The authors are correct to speculate that this research can have a significant impact on many animal microbiome studies, since bacterial lysogens are prevalent in most microbiomes. Screening for prophages, determining whether they are active, and "curing" the host bacteria of active prophages are effective tools for understanding the effects these mobile elements have on microbiomes. There are many potential effects of these elements in vivo, both positive and negative; this research is a good example of why this research should be explored.

      Context:

      The research area of prophage effects on host bacteria in vitro has been studied for decades, while these interactions in combination with animal hosts in vivo have been recent. The significance of this research shows that there could be divergent effects based on whether the study is conducted in vitro or in vivo. The in vivo results were striking. This is particularly so with the microscopy images. The benefit of using Ciona is that it has a translucent body which allows for following microbial localization. This is in contrast to mammalian studies where following microbial localization would either be difficult or near impossible.

      Reviewer #2 (Recommendations for the authors):

      In general, I found that the research shown in this manuscript is solid, and the manuscript is well-written. I have no specific comments about the writing of the manuscript that would be of benefit.

      Figure 1 would benefit from the shrinking of white space between panels a and b. Also, in panel b, it is very difficult to read the x-axis, the number of basepairs. It is suggested to increase this font size.

      Figure 1 has been improved as suggested.

      Figure 2 is fine, however, what do three asterisks (***) in panel a signify? It is not described in the legend. One minor point that affects data understanding as presented, the wildtype (WT) change in expression is normalized to itself, therefore always equaling 1.0. This method of presentation muddies the variation in gene expression in the presence of the prophage. This is not an issue in Figure 2, but does have an effect on understanding Figure 2 - figure supplement 1.

      Figure 2 - figure supplement 1, as stated above, the normalization of the WT change in gene expression to 1.0 makes it difficult to understand the results. Why is pilZ change in gene expression not significant in panel s1a? It seems the median change is 50%, or whatever averaging is done, it's unclear whether this is the median and whether the error bars are standard deviation or some other metric.

      These should be defined in the statistical analysis section of the methods or in the legend itself. Further, in panel s1b, why is the reduction in gene expression of pdeB statistically significant, while a similar reduction in gene expression of pleD is not statistically significant?

      RQ values were calculated from 2^(-ΔΔCt). The error bars in the figures were calculated by adding or subtracting the standard error from RQ. Since WT was used as a reference value for qPCR, the RQ value was normalized as 1 for all replicates and nonparametric tests were used to calculate the statistical significance. The values for pilZ were very close to significant; a value of 0.063 was derived via the Wilcoxon test. Only the changes in expression of pdeB were determined to be statistically significant, via the Wilcoxon test.
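      The relative quantification described above can be sketched as follows. This is a minimal illustration of the standard 2^(-ΔΔCt) calculation, assuming one Ct value each for the target and reference gene in the sample and in the calibrator (here WT); the function and parameter names are hypothetical, and the standard error is taken as an input because the response does not state how it was computed.

```python
def relative_quantity(ct_target_sample: float, ct_ref_sample: float,
                      ct_target_calibrator: float, ct_ref_calibrator: float) -> float:
    """Relative quantity (RQ) by the 2^(-ddCt) method.

    dCt  = Ct(target) - Ct(reference gene), computed per condition
    ddCt = dCt(sample) - dCt(calibrator)
    RQ   = 2 ** (-ddCt)
    """
    dct_sample = ct_target_sample - ct_ref_sample
    dct_calibrator = ct_target_calibrator - ct_ref_calibrator
    ddct = dct_sample - dct_calibrator
    return 2.0 ** (-ddct)


def error_bar(rq: float, standard_error: float) -> tuple:
    """Lower and upper error-bar limits, taken as RQ minus/plus the standard error."""
    return rq - standard_error, rq + standard_error
```

      When the calibrator is compared with itself, ΔΔCt is zero and RQ equals 1, which is why the WT bars sit at 1.0 in every replicate.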

      Figure 3 panels a and b would be helped by having the same y-axis for each. It is impressive the amount of WT bacterial colonization takes place in 24 hours, particularly in the absence of the prophage, but it does not appear as impressive when the axes are changed between panels. Similar axes should be considered for every comparative graph.

      Figure 3 - figure supplement 1 legend would benefit from the same description of the animal's digestive locations as in the legend in Figure 3.

      We appreciate these suggestions and have made these changes accordingly. We have remade and combined Figure 3a and 3b.

      Figure 4, while it is unfortunate that none of the immune genes evaluated had a response to the deletion of the SfPat prophage in S. fidelis 3313 at 24 hours, did any of these genes have an effect at 1 hour of evaluation as VCBP-C did?

      The expression of this expanded gene set was not evaluated at one hour. This time point will, however, be included in our global evaluation of gene expression in our future transcriptome sequencing effort.

      Figure 5, the only question I have with these data is whether or not there is a dose-dependent effect of VCBP-C on SfPat P5 expression?

      Prior studies have found VCBP-C can impact biofilm formation in Shewanella sp. in a dose-dependent manner (some of the data appears in Dishaw et al, 2016). However, we have not yet considered whether VCBP-C impacts the expression of SfPat P5 (a phage capsid component) in a dose-dependent manner. We will consider this in future experimental designs.

      It is mentioned in the introduction (and data shown in the preprint) that there is more than one active prophage in Shewanella fidelis 3313. The preprint data shows that the Mu prophages had little effect on the studies. It may be worth discussing the presence and lack of effects of these Mu prophages. It also may lead to some discussion about the complexities of polylysogeny (as discussed by Silpe, et al, Nature, 2023).

      A full-length, inducible, Mu-like prophage region has been identified in the genome that has not been targeted for deletion, but will be included in follow-up studies. An earlier incomplete genome assembly contributed to the incorrect targeting and deletion of a prior Mu-like region, which was discussed in an earlier preprint version. Discussion and references to that strain have been removed from the more recent preprint versions. For clarity, the current manuscript describes strains that remain focused on the SfPat prophage, noting its contribution to the observed behavioral changes / traits.

      Is there any spontaneous induction of SfPat in vitro or in vivo with temperature change (prophages have been induced with heat stress), excessive UV exposure, or mitomycin C treatment?

      Preliminary induction studies using UV, mitomycin C, and temperature have been completed, but remain inconclusive with SfPat due to inconsistent induction patterns.

      Could you speculate, or perhaps do the experiment, as to whether the addition of VCBP-C to S. fidelis 3313 cultures affects biofilm production? The deletion of SfPat leads to greater biofilm production in vitro, while exogenously added VCBP-C represses SfPat P5 expression, would VCBP-C addition lead to greater biofilm production? Lastly, and this may be a failure of my understanding, is VCBP-C able to bind to S. fidelis? If so, does the prophage alter the bacteria and, consequently, the ability of VCBP-C to bind to the bacteria?

      Our lab is actively working to better understand the physical interactions of VCBP-C and bacteria, particularly lysogenic bacteria. Deletion mutants are helping us better understand the potential influence of the bacterial accessory genome on interactions with host immune mediators. Biofilm assays have been done in the context of VCBP-C (Dishaw et al, 2016). Subsequently, we tested the influence of 50 µg/ml VCBP-C on WT and prophage KO-strains, which include SfPat KO along with neutral (control) regions of the genome. We found that the presence of VCBP-C reduced biofilm formation in WT and phage KO variants at 4 hrs and 24 hrs. However, at 12 hrs, VCBP-C treatment appears to increase biofilm formation in the phage-KO strain. While the role (if any) of SfMu remains unclear, these preliminary data imply the existence of a feedback circuit (influenced by time) where immune effector binding and prophage influence on host gene expression together shape retention outcomes in the gut microbiome. This hypothesis remains to be tested further.

      Author response image 1.

      WT S. fidelis 3313 was exposed in vitro to 50 µg/ml VCBP-C in stationary cultures. Biofilms were observed for 24 hrs. At 12 hrs, the presence of VCBP-C increased the amount of biofilms, whereas reduced biofilms were observed at 4 and 24 hrs. Our findings (manuscript Fig 2a) reveal that SfPat contributes to biofilm formation, exposure to SfPat deletion mutants increases host VCBP-C expression (manuscript Fig. 4a), and VCBP-C binding to WT S. fidelis 3313 reduces the expression of SfPat P5 capsid protein (manuscript Fig. 5). These findings suggest that in vivo exposure/colonization assays benefit from detailed time-course observations to be further explored in follow-up, future experiments.

      Reviewer #3 (Public review):

      In this manuscript, Natarajan and colleagues report on the role of a prophage, termed SfPat, in the regulation of motility and biofilm formation by the marine bacterium Shewanella fidelis. The authors investigate the in vivo relevance of prophage carriage by studying the gut occupation patterns of Shewanella fidelis wild-type and an isogenic SfPat- mutant derivative in a model organism, juveniles of the marine tunicate Ciona robusta. The role of bacterial prophages in regulating bacterial lifestyle adaptation and niche occupation is a relatively underexplored field, and efforts in this direction are appreciated.

      While the research question is interesting, the work presented lacks clarity in its support for several major claims, and, at times, the authors do not adequately explain their data.

      Major concerns:

      (1) Prophage deletion renders the SfPat- mutant derivative substantially less motile and with a higher biofilm formation capacity than the WT (Fig. 2a-b). The authors claim the mutant is otherwise isogenic to the WT strain upon sequence comparison of draft genome sequences (I'll take the opportunity to comment here that GenBank accessions are preferable to BioSample accessions in Table 1). Even in the absence of secondary mutations, complementation is needed to validate functional associations (i.e., phenotype restoration). A strategy for this could be phage reintegration into the mutant strain (PMID: 19005496).

      We are currently investigating complementation strategies. However, there have been some challenges in re-infecting and/or reintegrating the prophage into the genome. A preferred integration site may be damaged due to the deletion approach. While the SfPat prophage has mostly predicted genes of unknown function or significance, we have begun prioritizing the deletion of distinct segments to help identify functional relevance.

      (2) The authors claim that the downshift in motility (concomitant with an upshift in biofilm formation) is likely mediated by the activity of c-di-GMP turnover proteins. Specifically, the authors point to the c-di-GMP-specific phosphodiesterase PdeB as a key mediator, after finding lower transcript levels for its coding gene in vivo (lines 148-151, Fig. 2c), and suggesting higher activity of this protein in live animals (!)(line 229). I have several concerns here:

      (2.1) Findings shown in Fig. 2a-b are in vitro, yet no altered transcript levels for pdeB were recorded (Fig. 2c). Why do the authors base their inferences only on in vivo data?

      (2.2) Somewhat altered transcript levels alone are insufficient for making associations, let alone solid statements. Often, the activity of c-di-GMP turnover proteins is local and/or depends on the activation of specific sensory modules - in the case of PdeB, a PAS domain and a periplasmic sensor domain (PMID: 35501424). This has not been explored in the manuscript, i.e., specific activation vs. global alterations of cellular c-di-GMP pools (or involvement of other proteins, please see below). Additional experiments are needed to confirm the involvement of PdeB. Gaining such mechanistic insights would greatly enhance the impact of this study.

      (2.3) What is the rationale behind selecting only four genes to probe the influence of the prophage on Ciona gut colonization by determining their transcript levels in vitro and in vivo? If the authors attribute the distinct behavior of the mutant to altered c-di-GMP homeostasis, as may be plausible, why did the authors choose those four genes specifically and not, for example, the many other c-di-GMP turnover protein-coding genes or c-di-GMP effectors present in the S. fidelis genome? This methodological approach seems inadequate to me, and the conclusions on the potential implication of PdeB are premature.

      We chose to study genes that were shown previously to influence biofilms and motility in a cyclic-di-GMP dependent manner in Shewanella spp. (Chao et al 2013, S Rakshe 2011). Future transcriptomic efforts and targeted deletion approaches will further define the specific influence of prophages.

      (3) The behavior of the WT strain and the prophage deletion mutant is insufficiently characterized. For instance, how do the authors know that the higher retention capacity reported for the WT strain with respect to the mutant (Fig. 3b) is not merely a consequence of, e.g., a higher growth rate? It would be worth investigating this further, ideally under conditions reflecting the host environment.

      To clarify the method, in vitro growth curves did not suggest any significant difference in growth rate between the WT and the deletion mutant strains. Subsequently, for the in vivo experiments, bacterial cultures were pelleted and resuspended in sterile, nutrient-free artificial seawater. This limits growth until the bacterial strains are introduced to the animals.

      (4) Related to the above, sometimes the authors refer to "retention" (e.g., line 162) and at other instances to "colonization" (e.g., line 161), or even adhesion (line 225). These are distinct processes. The authors have only tracked the presence of bacteria by fluorescence labeling; adhesion or colonization has not been assessed or demonstrated in vivo. Please revise.

      We thank the reviewer for this feedback; the manuscript has been revised accordingly. While we refer to our assays as ‘colonization assays,’ we report results of ‘retention’ of various bacterial strains in the ‘exposed’ animals. Furthermore, when fluorescent staining is utilized, we report retention in defined niches. Since colonization is likely a two-step process, i.e., 1) retention and 2) colonization or long-term establishment of these microbial communities, using these terms correctly is warranted. In separate (unpublished) surveys of adult animals taken from the field, identical strains have been recovered numerous times over a twelve-year period.

      (5) The higher CFU numbers for the WT after 24 h (line 161) might also indicate a role of motility for successful niche occupation or dissemination in vivo. The authors could test this hypothesis by examining the behavior of, e.g., flagellar mutants in their in vivo model.

      Interestingly, we find numerous flagellar/motility-associated protein coding genes like Flg, Fli and Fle present within the S. fidelis genome possessing an EAL domain, implicating them in the regulation of cyclic-di-GMP. Hence, a future global transcriptomic approach will help improve our understanding of the roles of these regulatory pathways.

      (6) The endpoint of experiments with a mixed WT-mutant inoculum (assumedly 1:1? Please specify) was set to 1 h, I assume because of the differences observed in CFU counts after 24 h. In vivo findings shown in Fig. 3c-e are, prima facie, somewhat contradictory. The authors report preferential occupation of the esophagus by the WT (line 223), which seems proficient from evidence shown in Fig. S3. Yet, there is marginal presence of the WT in the esophagus in experiments with a mixed inoculum (Fig. 3d) or none at all (Fig. 3e). Likewise, the authors claim preferential "adhesion to stomach folds" by the mutant strain (line 225), but this is not evident from Fig. 3e. In fact, the occupation patterns by the WT and mutant strain in the stomach in panel 3e appear to differ from what is shown in panel 3d. The same holds true for the claimed "preferential localization of the WT in the pyloric cecum," with Fig. 3d showing a yellow signal that indicates the coexistence of WT and mutant.

      The results section is reworded to improve clarity. The WT and KO are mixed 1:1 to achieve the 10<sup>7</sup> cfu count.

      (7) In general, and especially for in vivo data, there is considerable variability that precludes drawing conclusions beyond mere trends. One could attribute such variability in vivo to the employed model organism (which is not germ-free), differences between individuals, and other factors. This should be discussed more openly in the main text and presented as a limitation of the study.

      Yes, a salient feature of this model is that we can leverage genetic diversity in our experimental design, but it can introduce experimental variability.

      Even with such intrinsic factors affecting in vivo measurements, certain in vitro experiments, which are expected, in principle, to yield more reproducible results, also show high variability (e.g., Fig. 5). What do the authors attribute this variability to?

      For experiments involving VCBP-C protein, we can use affinity-purified protein recovered from live animals, or recombinant protein that we synthesize in-house (Dishaw et al 2011, 2016). In the latter, we often observe slight lot-to-lot variation in affinity for the target (the bacterial surface). To account for this variation and to ensure the observations are robust despite it, production lots can be mixed in additional biological replicates. As such, slight variability in the in vitro assays can be due to this batch effect.

      (8) Line 198-199: Why not look for potential prophage excision directly rather than relying on indirect, presumptive evidence based on qPCR?

      The decision to rely on qPCR of prophage structural genes was based on preliminary data, in particular among lysogens possessing more than one prophage. Neither the plaque assay nor SYBR Gold staining could distinguish among the particles, and TEM imaging was not sufficiently qualitative. Since these prophages do not exclusively produce particles when induced, qPCR targeting structural proteins was found to be most informative.

      Reviewer #3 (Recommendations for the authors):

      Other major comments:

      Line 137 (and Fig. 2 legend): The authors did not test chemotaxis towards any specific chemoeffector, only motility. Please correct and see below my comments about motility assays.

      The reviewer is correct; we have modified our descriptors.

      Lines 142-144: The authors conflate quorum sensing with c-di-GMP metabolism. If the authors measured the expression of genes "regulating cyclic di-GMP," it is likely because c-di-GMP is known to regulate the switch between planktonic and sessile lifestyles. However, whether this is mediated by quorum sensing is a separate issue that was not explored in this work. Please revise.

      Thank you; these changes were made accordingly.

      Line 150: c-di-GMP is not a quorum sensing signal; please correct.

      Yes, we corrected the inadvertent yet misleading statement.

      Line 193: Please clarify "RNA was extracted from the biofilms." If S. fidelis was grown on "MA [Marine Agar] for 24 h in the presence or absence of 50 µg/ml VCBP-C" (lines 192-193), was RNA isolated from colonies growing on the plates? Was VCBP-C added to the agar? This is also unclear in the Methods section (lines 381-384), where it seems the authors conducted this experiment using broth cultures in multiwell plates, removing the supernatant, and extracting RNA from the biofilms (i.e., cells adhered to the walls and bottom of the wells?). Why only biofilm cells?

      Thank you for bringing this to our attention. We have rewritten the appropriate sections and methods to improve clarity. Following our initial studies, which revealed differential bacterial phenotypes (biofilm formation and motility assays), we decided to target and investigate gene expression in the biofilms. This way, the sessile cells that were not part of the biofilm do not obfuscate the data.

      Lines 204-205: The authors should refer to the behavior of the mutant, since they did not test what happens upon prophage integration, but after prophage deletion.

      The wording has been changed accordingly.

      Lines 206-207: Please explain why the authors state that "these different bacterial phenotypes" (referring to altered biofilm formation and motility) "influence host immune responses in a manner consistent with influences on gut colonization dynamics". What specific relationship are the authors suggesting between these processes, and in what way is this "consistent"?

      We previously demonstrated (Dishaw et al 2016) that copious amounts of VCBP-C protein are present under normal conditions in the gut and mostly found tethered to chitin-rich mucus lining the gut epithelium. The up-regulation of VCBP-C within one hour of exposure to the SfPat mutant relative to the WT S. fidelis is consistent with a role for VCBP-C in modulating bacterial settlement dynamics (Dishaw et al 2016). The mutant phenotype of reduced swimming and increased biofilm production is a likely trigger for the increased production of this secreted immune effector that may influence the retention of this bacterial variant, relative to the WT.

      Line 229: Apart from what I noted above about the authors' claim regarding PdeB activity, I believe the figure referred to here should be Fig. 2, not Fig. 5.

      Thank you for catching that oversight. It has been corrected.

      Figure 1: Was hypothetical protein 2 included in the deletion?

      Yes, the hypothetical protein 2 was included in the deletion.

      Figure 3a-b: It is challenging to interpret data on plots using so many colors - including what appears to be a white circle (?) in Fig. 3a. How many replicates are represented here? Is it indeed n=3 in Fig. 3a and n=6 in Fig. 3b?  

      Figure 3a is a bee swarm plot. Each color represents biological replicates, and the smaller circles represent technical replicates. It facilitates showing ALL the data, including the spread of the data. Regarding the number of replicates, 3a and 3b are different experiments, with 3a representing a biofilm assay with three biological replicates and 3b a motility assay with six biological replicates.

      Figure 3: An explanation for the abbreviation "FP" is missing.

      Thank you for catching this oversight. The abbreviation has been defined.

      Figure S3: FP, which is proficiently occupied by the WT strain (Fig. S3a), is not labeled in the images provided for the mutant (Fig. S3c-d). It would be helpful to show it for comparison.

      Those other images did not have fecal pellets to label; however, Figure 3c does show a fecal pellet for an animal exposed to both WT and the SfPat mutant.

      Questions and comments regarding methods:

      Lines 290-291, 307: Please indicate an approximate range for "room temperature."

      The information has been added to the revised manuscript.

      Lines 292, 302: Why use hybrid LB/MB broth and agar? And strictly speaking, which LB formula (Lennox/Luria/Miller)?

      The hybrid broth reduces the concentration of salts that can interfere in some assays. The LB formula was Luria, and it is now included in the manuscript.

      Lines 300-302: The conjugation procedure is poorly described. It seems the authors conducted conjugal transfer by biparental mating in broth culture by inoculating a single colony of S. fidelis 3313 into an already grown culture of the E. coli donor strain?

      The biparental mating was done on plates; the manuscript has been clarified.

      Motility assay concerns:

      Swimming motility is generally assayed in soft agar (0.25-0.3% w/v). Why did the authors use 0.5% low-melt agarose? Usually, agar is employed instead of agarose, and such a high concentration of solidifying agent typically prevents proper swimming (see e.g. Kearns 2010).

      Our laboratory uses low-melt agarose for phage propagation and other assays. We continued using it because we observed robust and reproducible results in the swarming and swimming motility assays. In addition, 0.5% agarose is less dense than 0.5% agar, and its consistency is similar to that of the lower percentage soft agar.

      Lines 316-317: Please clarify: what is the "overlay motility assay" that was carried out "overnight at RT and then inoculated onto the center of soft agar"? Was this a two-step experiment? How were bacteria inoculated (stabbed, injected)? If injected, what volume and cell density were used?

      Thank you for bringing this to our attention. The methods section has been revised for clarity.

      Line 319: Each variable tested in duplicate? From what I understand, the only variable measured in this test is the diameter of the swimming halos. Do the authors mean they used two biological replicates? If so, please indicate the number of technical replicates as well.

      Multiple biological replicates were performed, each time with two technical replicates. Two perpendicular measurements (of diameter) for each technical replicate were recorded to avoid bias. The methods section has been edited to improve clarity.
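
      As a minimal sketch of how such a measurement scheme could be aggregated (an illustration only, not the authors' stated procedure): each technical replicate contributes the mean of its two perpendicular diameters, and each biological replicate is summarized as the mean over its technical replicates. The function names and the diameters in millimetres are hypothetical.

```python
from statistics import mean


def halo_diameter(d1_mm, d2_mm):
    """Average two perpendicular diameter readings of one swimming halo."""
    return (d1_mm + d2_mm) / 2.0


def biological_replicate_mean(technical_replicates):
    """Summarize one biological replicate.

    technical_replicates: list of (d1_mm, d2_mm) tuples, one per technical
    replicate (two per biological replicate in the scheme described above).
    """
    return mean(halo_diameter(d1, d2) for d1, d2 in technical_replicates)


# Hypothetical readings for one biological replicate with two technical replicates.
example = [(20.0, 22.0), (24.0, 26.0)]
summary_mm = biological_replicate_mean(example)  # (21.0 + 25.0) / 2
```

      Averaging the two perpendicular readings first means a slightly asymmetric halo does not bias the value toward its longer or shorter axis.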

      Line 320: Were the swimming halos asymmetrical, hence the need to take two perpendicular measurements? If that was the case, it could indicate an excessive amount of solidifying agent.

      The halos were sometimes asymmetric, but to avoid variation across datasets, it became standard practice to measure perpendicular distances as stated above. 

      Regarding qPCR experiments:

      Please clarify how normalization of transcript levels was performed.

      It seems the authors conducted a double normalization, first with respect to the calibrator (rho), and again using the wild-type as a baseline reference for fold-change calculations (absence of error bars for WT data). If so, please specify on the vertical axes of the figures and in the Methods/figure legends.

      Since, in addition to rho, the authors assessed the expression stability of the "housekeeping" genes gyrB and recA, please also include the primers used for these genes.

      The appropriate manuscript sections have been updated for clarity. The bacterial qPCR was normalized to an internal standard, and then relative expression differences between SfPat and the WT were determined. The missing primer sequences have also been added.
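
      For readers unfamiliar with this kind of double normalization, one widely used calculation is the 2^-ΔΔCt approach: the target gene is first normalized to the internal reference gene within each sample, and the result is then expressed relative to the baseline (here, the WT). The response does not state which formula was used, so the sketch below is an assumption; it also assumes roughly 100% amplification efficiency, and all Ct values are hypothetical.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_baseline, ct_ref_baseline):
    """Fold change of a target transcript by the 2^-ddCt calculation.

    Step 1: normalize the target to the reference gene in each condition.
    Step 2: express the sample relative to the baseline condition.
    Assumes about 100% amplification efficiency for both primer pairs.
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_baseline = ct_target_baseline - ct_ref_baseline
    delta_delta = delta_sample - delta_baseline
    return 2.0 ** (-delta_delta)


# Hypothetical Ct values: target gene vs reference gene (e.g. rho),
# deletion mutant as the sample and WT as the baseline.
fold_mutant = relative_expression(26.0, 18.0, 24.0, 18.0)

# The baseline compared with itself is 1 by construction, which is why
# WT bars in such plots carry no error bars.
fold_wt = relative_expression(24.0, 18.0, 24.0, 18.0)
```

      In this hypothetical case the target crosses threshold two cycles later in the mutant at equal reference Ct, which corresponds to one quarter of the WT transcript level.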

      Observations:

      Figure 2a-b: It is intriguing that the remarkable reduction in motility of the mutant is not associated with a comparably significant increase in biofilm formation.

      A statistically significant increase in biofilm was observed, along with a decrease in motility. As is common in crystal violet assays, some of the tertiary structures were not very stable and likely washed out during processing.

      Additionally, it is noteworthy that data for the mutant in panel 2a exhibit minimal variability, with all OD570 recordings being around 3.0. Did the authors dilute the crystal violet elution solution after adding acetic acid, or might they have reached the saturation limit of the spectrophotometer?

      The eluted acetic acid was not diluted further, and significant changes were observed. If the solution had been further diluted, the observed changes might have been more pronounced. 

      Minor comments and recommendations:

      All the suggested changes below have been incorporated.

      • Line 55: "Antibiotic resistance determinants" might be preferable to "genes" to avoid using "genes" twice in the same sentence.

      • Line 75-76: Italicize Pseudomonas aeruginosa.

      • Line 134: Instead of "at least," specify the average fold-change.

      • Line 141: In the heading, refer to the influence of the "prophage" (singular) rather than "prophages" (plural).

      • Discussion (style): Consider using past tense for phrases like "we utilize..." (line 202); "we find..." (line 204), etc.

      • Line 365 and elsewhere: Consider "mRNA levels" or "transcript levels" instead of "gene expression".

      • Table 3: UQ950 is a strain, not a plasmid. I assume the plasmid carried by UQ950 is pSMV3.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Point-by-point responses to the reviewers' comments:

      All three reviewers found our analysis of focal adhesion-associated oncogenic pathways (Figs 3 and S3) to be inconsistent (Reviewer 1), not convincing/consistent (Reviewer 2, #2), and too variable and not well supported (Reviewer 3, #2). This was probably the basis for the eLife assessment, which stated: “However, the study is incomplete because the downstream molecular activities of PLECTIN that mediate the cancer phenotypes were not fully evaluated.” We agree with the reviewers that the degree of attenuation of the FAK, MAP/Erk, and PI3K/AKT signaling pathways differs depending on the cell line used (Huh7 and SNU-475) and the mode of inactivation (CRISPR/Cas9-generated plectin KO, functional KO (∆IFBD), and organoruthenium-based inhibitor plecstatin-1). However, we do not share the reviewers' skepticism about the unconvincing nature of the data presented.

      Several previous studies have shown that plectin inactivation invariably leads to dysregulation of cell adhesions and associated signaling pathways in various cell systems. The molecular mechanisms driving these changes are not fully understood, but the most convincingly supported scenarios are uncoupling of keratin filaments (hemidesmosomes; (Koster et al., 2004)) and vimentin filaments (focal adhesions; (Burgstaller et al., 2010; Gregor et al., 2014)) from adhesion sites in conjunction with altered actomyosin contractility (Osmanagic-Myers et al., 2015; Prechova et al., 2022; Wang et al., 2020). This results in altered morphometry (Wang et al., 2020), dynamics (Gregor et al., 2014), and adhesion strength (Bonakdar et al., 2015) of adhesions. These changes are accompanied by reduced mechanotransduction capacity and attenuation of downstream signaling such as FAK, Src, Erk1/2, and p38 in dermal fibroblasts (Gregor et al., 2014); decrease in pFAK, pSrc, and pPI3K levels in prostate cancer cells (Wenta et al., 2022); increase in pErk and pSrc in keratinocytes (Osmanagic-Myers et al., 2006); decrease in pERK1/2 in HCC cells (Xu et al., 2022) and head and neck squamous carcinoma cells (Katada et al., 2012).  

      Consistent with these published findings, we show that upon plectin inactivation, the HCC cell line SNU-475 exhibits aberrant cytoskeletal organization (vimentin and actin; Figs 4A-D, S4A-F), altered number, topography and morphometry of focal adhesions (Figs 4A, E-G, S4H,I), and ineffective transmission of traction forces (Fig 4H,I). Similar, although not quantified, phenotypes are present in Huh7 with inactivated plectin (data not shown). It is worth noting that even robust cytoskeletal (e.g. #ventral stress fibers, Fig 4A,D and vimentin architecture, Fig S4A-C) and focal adhesion (%central FA, Fig 4A,E) phenotypes differ significantly between different modes of plectin inactivation and would certainly do so if compared between cell lines. These phenotypes are heterogeneous but not inconsistent. Interestingly, both SNU-475 and Huh7 plectin-inactivated cells show similar functional consequences such as prominent decrease in migration speed (Fig 5B). This suggests that while specific aspects of cytoarchitecture are differentially affected in different cell lines, the functional consequences of plectin inactivation are shared between HCC cell lines.

      It is therefore not surprising that the activation status of downstream effectors, resulting from different degrees of cytoskeletal and focal adhesion reconfiguration, is not identical (or even comparable) between cell lines and treatment conditions. Furthermore, we compare highly epithelial (keratin- and almost no vimentin-expressing) Huh7 cells with highly dedifferentiated (low keratin- and high vimentin-expressing) SNU-475 cells, which differ significantly in their cytoskeleton, adhesions, and signaling networks. Alternative approaches to plectin inactivation are not expected to result in the same degree of dysregulation of specific signaling pathways. Effects of adaptation (CRISPR/Cas9-generated KOs and ∆IFBDs), engagement of different binding domains (CRISPR/Cas9-generated ∆IFBDs), and pleiotropic modes of action (plecstatin-1) are expected.

      In our study, we provide the reader with an unprecedented complex comparison of adhesion-associated signaling between WT and plectin-inactivated HCC cell lines. First, we compared the proteomes of WT, KO and PST-treated WT SNU-475 cells using MS-based shotgun proteomics and phosphoproteomics (Fig 3A-C). Second, we extensively and quantitatively immunoblotted the major molecular denominators of MS-identified dysregulated pathways (such as “FAK signaling”, “ILK signaling”, and “Integrin signaling”) with the following results. Data (shown in Figs 3D and S3C) are expressed as a percentage of untreated WT, with downregulated values highlighted in red:

      Author response table 1.

      In addition, we show dysregulated expression (mostly downregulation) of focal adhesion constituents ITGβ1 and αv, talin, vinculin, and paxillin, which nicely complements fewer and larger focal adhesions in plectin-inactivated HCC cells. In light of these results, we believe that our statement that “Although these alterations were not found systematically in both cell lines and conditions (reflecting thus presumably their distinct differentiation grade and plectin inactivation efficacy), collectively these data confirmed plectin-dependent adhesome remodeling together with attenuation of oncogenic FAK, MAPK/Erk, and PI3K/Akt pathways upon plectin inactivation” (see pages 8-9) is fully supported. Furthermore, in support of the results of MS-based (phospho)proteomic and immunoblot analyses we show strong correlation between plectin expression and the signatures of “Integrin pathway” (R<sup>2</sup>=0.15, p= 2x10<sup>-45</sup>), “FAK pathway” (R<sup>2</sup>=0.11, p= 2x10<sup>-34</sup>), “PI3K Akt/mTOR signaling” (R<sup>2</sup>=0.06, p= 2x10<sup>-20</sup>) or “Erk pathway” (R<sup>2</sup>=0.10, p= 6x10<sup>-30</sup>) in HCC samples from 1268 patients (Fig S7-2C and S7-3).

      In conclusion, we show that plectin is required for proper/physiological adhesion-associated signaling pathways in HCC cells. The HCC adhesome and associated pathways are dysregulated upon plectin inactivation and we show context-dependent varying degrees of attenuation of the FAK, MAPK/Erk, and PI3K/Akt pathways. In our view, presenting context-dependent variability in expression/activation of pathway molecular denominators is a trade-off for our intention to address this aspect of plectin inactivation in the complexity of different cell lines, tissues, and modes of inactivation. We prefer this complex approach to presenting “more convincing” black-and-white data assessed in a single cell line (Qi et al., 2022) or upon plectin inactivation by a single approach (compare with otherwise excellent studies such as (Xu et al., 2022) or (Buckup et al., 2021)). In fact, unlike the reviewers, we consider this complexity (and the resulting heterogeneity of the data) to be a strength rather than a weakness of our study.

      Reviewer 1:

      (1) The authors suggest that plectin controls oncogenic FAK, MAPK/Erk, and PI3K/Akt signaling in HCC cells, representing the mechanisms by which plectin promotes HCC formation and progression. However, the effect of plectin inactivation on these signaling was inconsistent in Huh7 and SNU-475 cells (Figure 3D), despite similar cell growth inhibition in both cell lines (Figure 2G). For example, pAKT and pERK were only reduced by plectin inhibition in SNU-475 cells but not in Huh7 cells.

      We agree with the reviewer that plectin inactivation yields varying degrees of attenuation of the FAK, MAPK/Erk, and PI3K/Akt pathways depending on the cell type (Huh7 vs SNU-475 cells) and mode of plectin inactivation (CRISPR/Cas9-generated plectin KO vs functional KO (∆IFBD) vs organoruthenium-based inhibitor plecstatin-1). This context-dependent heterogeneity in the expression/activation of molecular denominators of signaling pathways reflects different degrees of cytoskeletal (e.g. #ventral stress fibers, Fig 4A,D and vimentin architecture, Fig S4A-C) and focal adhesion (e.g. %central FA, Fig 4A,E) phenotypes under different conditions. We expect that functional consequences (such as reduced migration and anchorage-independent proliferation) arise from a combination of changes in individual pathways. The sum of often subtle changes will result in comparable effects not only on cell growth, but also on migration or transmission of traction forces. For a more detailed comment, please see our response to all Reviewers on the first three pages of this letter.

      We believe that our data show that both pAkt and pErk are attenuated upon plectin inactivation in both Huh7 and SNU-475 cells. The following data (shown in Figs 3D and S3C) are expressed as a percentage of untreated WT, with downregulated values highlighted in red:

      Author response table 2.

      (2) In addition, pFAK was not changed by plectin inhibition in both cells, and the ratio of pFAK/FAK was increased in both cells.

      We agree with the reviewer that pFAK/FAK levels are either comparable or slightly higher upon plectin inactivation. However, we believe that our data convincingly show that FAK expression is downregulated in both Huh7 and SNU-475 cells. In our opinion, this results in an overall attenuation of the FAK signaling (see percentage for Normalized pFAK × Normalized FAK), which is expectedly more pronounced in migratory SNU-475 cells. The following data (shown in Figs 3D and S3C) are expressed as a percentage of untreated WT, with downregulated values highlighted in red:

      Author response table 3.

      Given these results, we feel that our statement that “inhibition of plectin attenuates FAK signaling” (pages 8-9) is well supported.
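
      To illustrate the arithmetic behind the composite "Normalized pFAK × Normalized FAK" value, the sketch below multiplies a phosphorylation ratio by a total-FAK level, both expressed as a percentage of untreated WT. Reading "Normalized pFAK" as the pFAK/FAK ratio is an assumption, and the input values are hypothetical, not taken from Author response table 3.

```python
def composite_fak_signal(pfak_ratio_pct_of_wt, fak_pct_of_wt):
    """Combine relative phosphorylation and relative abundance.

    Both inputs are percentages of the untreated WT value; the product is
    returned on the same percentage scale (WT = 100).
    """
    return pfak_ratio_pct_of_wt * fak_pct_of_wt / 100.0


# Hypothetical case: pFAK/FAK slightly raised (110% of WT) while total FAK
# is reduced (60% of WT).
net_signal_pct = composite_fak_signal(110.0, 60.0)
```

      With these hypothetical inputs, a modest rise in the pFAK/FAK ratio is outweighed by the drop in total FAK, so the composite value falls below the WT level of 100.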

      (3) Thus, it is hard to convince me that plectin promotes HCC formation and progression by regulating these signalings.

      Previous studies have shown that dysregulation of cell adhesions and attenuation of adhesion-associated FAK, MAPK/Erk, and PI3K/Akt signaling has inhibitory effects on HCC formation and progression. We show that plectin is required for the proper/physiological functioning of adhesion-associated signaling pathways in selected HCC cells. The HCC adhesome and associated pathways are dysregulated upon plectin inactivation and we show context-dependent varying degrees of attenuation of the FAK, MAPK/Erk, and PI3K/Akt pathways. We support these conclusions by providing the reader with proteomic and phosphoproteomic comparisons of adhesion-associated signaling between WT and plectin-inactivated HCC cell lines (Figs 3B,C and S3A,B). We further validate our findings by extensive and quantitative immunoblotting analysis (Figs 3D and S3C). In addition, we show a strong correlation between plectin expression and the signatures of “Integrin pathway” (R<sup>2</sup>=0.15, p= 2x10<sup>-45</sup>), “FAK pathway” (R<sup>2</sup>=0.11, p= 2x10<sup>-34</sup>), “PI3K Akt/mTOR signaling” (R<sup>2</sup>=0.06, p= 2x10<sup>-20</sup>) or “Erk pathway” (R<sup>2</sup>=0.10, p= 6x10<sup>-30</sup>) in HCC samples from 1268 patients (Fig S7E).

      Our data and conclusions are fully consistent with previously published studies in HCC cells. For instance, even a mild decrease in FAK levels leads to a significant reduction in colony size (see effects of KD (Gnani et al., 2017), effects of FAK inhibitor and sorafenib in xenografts (Romito et al., 2021), or effects of inhibitors in soft agars and xenografts (Wang et al., 2016)). Similar effects were observed upon partial Akt inhibition (compare with Akt inhibitors in soft agars (Cuconati et al., 2013; Liu et al., 2020)). Of course, we cannot rule out synergistic plectin-dependent effects mediated via adhesion-independent mechanisms. Identifying these mechanisms and distinguishing the contributions of the various consequences of cytoskeletal dysregulation to the phenotypes described in this manuscript would be experimentally challenging, and we feel that these studies go beyond the scope of our current study.

      As we feel that the adhesion-independent mechanisms were not sufficiently discussed in the original manuscript, we have removed the original sentence “Given the well-established oncogenic activation of these pathways in human cancer(33), our study identifies a new set of potential therapeutic targets.” (page 15) from the Discussion and added the following text: “However, it is conceivable that dysregulated cytoskeletal crosstalk could affect HCC through multiple mechanisms independent from FA-associated signaling. Indeed, we and others (Jirouskova et al., 2018; Xu et al., 2022) have shown that upon plectin inactivation, liver cells acquire epithelial characteristics that promote increased intercellular cohesion and reduced migration. Further studies will be required to identify and investigate synergistic adhesion-independent effects of plectin inactivation on HCC growth and metastasis.” (page 15). See also our response to Reviewer 2, #4 and Reviewer 3, #3 and #4.

      (4) The authors claimed that Plectin inactivation inhibits HCC invasion and metastasis using in vitro and in vivo models. However, the results from in vivo models were not as compelling as the in vitro data. The lung colonization assay is not an ideal in vivo model for studying HCC metastasis and invasion, especially when Plectin inhibition suppresses HCC cell growth and survival. Using an orthotopic model that can metastasize into the lung or spleen could be much more convincing for an essential claim.

      We agree with the reviewer that the orthotopic in vivo model would be an ideal setting to address HCC metastasis experimentally. There are several published models of HCC extrahepatic metastasis, including an orthotopic model of lung metastasis (Fan et al., 2012; Voisin et al., 2024; You et al., 2016), but to our knowledge, none of these orthotopic models are commonly used in the field. In contrast, the administration of tumor cells via the tail vein of mice is a standard, well-established approach of first choice for modelling lung metastasis in a variety of tumor types (e.g. (Hiratsuka et al., 2011; Jakab et al., 2024; Lu et al., 2020)), including HCC (Jin et al., 2017; Lu et al., 2020; Tao et al., 2015; Zhao et al., 2020). 

      Furthermore, we do not believe that the use of an orthotopic model would provide a comparable advantage in terms of plectin-mediated effects on metastatic growth compared to tail vein delivery of tumor cells. Importantly, the lung colonization model used in our study allows for the injection of a defined number of HCC cells into the bloodstream, thus eliminating the effect of the primary tumor size on the number of metastasizing cells. To distinguish between effects of plectin inhibition on HCC cell growth/survival and dissemination, we carefully evaluated both the number and volume of lung metastases (Figs 6I and S6C-F). The observed reduction in the number of metastases (Figs 6I and S6D) reflects the initiation/early phase of metastasis formation, which is strongly influenced by the adhesion, migration, and invasion properties of the HCC cells and corresponds well with the phenotypes described after plectin inactivation in vitro (Figs 4H,I; 5; 6A-E; S5; and S6A,B). The reduction in the volume of metastases (Figs 6I and S6E) reflects the effects of plectin inhibition on HCC cell growth and metastatic outgrowth and corresponds well with the in vitro data shown in Figs 2G,H and S2F,G.

      (5) Also, in Figure 6H, histology images of lungs from this experiment need to be shown to understand plectin's effect on metastasis better.

      We are grateful to the reviewer for bringing our attention to the lung colonization assay results presented. The description of the experiments in the text of the original manuscript was incorrect. The animals monitored by in vivo bioluminescence imaging (shown in Fig 6H) are the same as the mice from which cleared whole lung lobes were analyzed by lattice light sheet fluorescence microscopy (shown in Fig. 6I). The corrected description is now provided in the revised manuscript as follows: “To identify early phase of metastasis formation, we next monitored the HCC cell retention in the lungs using in vivo bioluminescence imaging (Fig. 6H). This experimental cohort was expanded for WT-injected mice which were administered PST…” (page 11).

      Therefore, lungs from all animals shown in Fig 6H,I were CUBIC-cleared and analyzed by lattice light sheet fluorescence microscopy. As requested by Reviewer 2, Recommendation #1, we provide in the revised manuscript (Fig S6F) “whole slide scan results for all the groups”, which could help to understand plectin's effect on metastasis better. To address the reviewer's concern, we also post-processed cleared and visualized lungs for hematoxylin staining and immunolabeled them for HNF4α. A representative image is shown as panel A in Author response image 1. Post-processing of CUBIC-cleared and immunolabeled lung lobes resulted in partial tissue destruction and some samples were lost. In addition, as the entire experimental setup was designed for the early phase of metastasis formation, only small Huh7 foci were formed (compared to the larger metastases that developed within 13 weeks after inoculation shown in panel B). As the IHC for HNF4α provides significantly lower sensitivity compared to the immunofluorescence images provided in the manuscript, we were only able to identify a few HNF4α-positive foci. Overall, we consider our immunofluorescence images to be qualitatively and quantitatively superior to IHC sections. However, if the reviewer or the editor considers it beneficial, we are prepared to show our current data as part of the manuscript.

      Author response image 1.

      (A) HNF4α staining of CUBIC-cleared lung tissue from mice inoculated with WT Huh7 cells, collected at the BLI timepoint at which a positive signal was first detected in the chest area. This timepoint was then selected for the comparison of the initial stages of lung colonization. (B) H&E and HNF4α staining of lung tissue from mice inoculated with WT Huh7 cells in the survival experiment. Scale bars, 50 µm.

      (6) Figure 6G, it is unclear how many mice were used for this experiment. Did these mice die due to the tumor burdens in the lungs?

      The number of animals is given in the legend to Fig 6G (page 34; N = 14 (WT), 13 (KO)). Large Huh7 metastases were identified in the lungs of animals that could be analyzed post-mortem by IHC (see panel B in the figure above). No large metastases were found in other organs examined, such as the liver, kidney and brain. It is therefore highly likely that these mice died as a result of the tumor burden in the lungs. A similar conclusion was drawn from the results of the lung colonization model in the previous studies (Jin et al., 2017; Zhao et al., 2020).

      (7) The whole paper used inhibition strategies to understand the function of plectin. However, the expression of plectin in Huh7 cells is low (Figure 1D). It might be more appropriate to overexpress plectin in this cell line or others with low plectin expression to examine the effect on HCC cell growth and migration.

      For this study, we selected two model HCC cell lines – Huh7 and SNU-475. Our intention was to investigate the role of plectin in “well-differentiated” (Huh7) and “poorly differentiated” (SNU-475) HCC cells, thus including early and advanced stages of HCC development (as categorized before (Boyault et al., 2007; Yuzugullu et al., 2009a); see also our description and rationale on page 6). As anticipated, less migratory “epithelial-like” Huh7 cells are characterized by relatively high E-cadherin, low vimentin, and low plectin expression levels (Fig 1D). In contrast, migratory “mesenchymal-like” SNU-475 cells are characterized by relatively low E-cadherin, high vimentin, and high plectin expression levels (Fig 1D). Therefore, the majority of analyses were performed in both relatively low plectin-expressing Huh7 and high plectin-expressing SNU-475 cells. It is noteworthy that inactivation of plectin had similar (although less pronounced) inhibitory effects on growth and migration in both Huh7 and SNU-475 cells.

      We agree with the reviewer that “It might be more appropriate to overexpress plectin in this cell line or others with low plectin expression to examine the effect on HCC cell growth and migration”. In fact, we have received similar suggestions since we started publishing our studies on plectin. There are two reasons that preclude successful overexpression experiments. First, there are about 14 known isoforms of plectin (Prechova et al., 2023). Although previous studies have analyzed the phenotypic rescue potential of some plectin isoforms using transient transfection (e.g. (Burgstaller et al., 2010; Osmanagic-Myers et al., 2015; Prechova et al., 2022)), the isoform variability precludes rescue/overexpression experiments if the causative isoform is not known. Second, plectin is a giant cytoskeletal crosslinker protein of more than 4,500 amino acids with binding sites for intermediate filaments, F-actin, and microtubules. Overexpression of the approximately 500 kDa-large crosslinker invariably leads to the collapse of cytoskeletal networks in every cell type we have tested so far. See also our response to Reviewer 3, #2.

      Reviewer 2:

      (1) The annotation of mouse numbers is confusing. In Figures 2A B D E F, it should be the same experiment, but the N numbers in A are 6 and 5. In E and F they are 8 and 3. Similarly, in Figure 2H, in the tumor size curve, the N values are 4,4,5,6. In the table, N values are 8,8,10,11 (the authors showed 8,7,8,7 tumors that formed in the picture). 

      We are grateful to the reviewer for bringing our attention to the inconsistency in the number of animals in DEN-induced hepatocarcinogenesis. Results from two independent cohorts are presented in the manuscript. The first cohort was used for MRI screening (Fig 2A-C); at the second screening timepoint of 44 weeks, approximately 75% of the animals died during anesthesia. Therefore, the second cohort of Ple<sup>ΔAlb</sup> and Ple<sup>fl/fl</sup> mice was used for macroscopic confirmation and histology (Figs 2D-F and S2A). We agree with the reviewer that the original presentation of the data may be misleading; therefore, we have rephrased the sentence describing macroscopic confirmation and histology (Figs 2D-F and S2A) as follows: “Decreased tumor burden in the second cohort of Ple<sup>ΔAlb</sup> mice was confirmed macroscopically…” (page 7).

      For the experiments shown in Fig 2H, mice were injected in both hind flanks. We have added this information to the figure legend along with the correct number of tumors.

      (2) In Figure 3D and Figure S3C, the changes in most of the proteins/phosphorylation sites are not convincing/consistent. These data are not essential for the conclusion of the paper and WB is semi-quantitative. Maybe including more plots of the proteins from proteomic data could strengthen their detailed conclusions about the link between Plectin and the FAK, MAPK/Erk, PI3K/Akt pathways as shown in 3E.

      We agree with the reviewer that plectin inactivation yields varying degrees of attenuation of the FAK, MAPK/Erk, and PI3K/Akt pathways depending on the cell type (Huh7 vs SNU-475 cells) and mode of plectin inactivation (CRISPR/Cas9-generated plectin KO vs functional KO (∆IFBD) vs organoruthenium-based inhibitor plecstatin-1). This context-dependent heterogeneity in the expression/activation of pathway molecular denominators reflects different degrees of cytoskeletal (e.g. #ventral stress fibers, Fig 4A,D and vimentin architecture, Fig S4A-C) and focal adhesion (e.g. %central FA, Fig 4A,E) phenotypes under different conditions. See also the detailed response to all reviewers (on the first three pages of this letter) and the responses to Reviewer 1, #1 and #2, Reviewer 3, #4.

      Our immunoblot analysis is based on NIR fluorescent secondary antibodies which were detected and quantified using an Odyssey imaging system (LI-COR Biosciences). This approach allows a wider linear detection range than chemiluminescence without a signal loss and is considered to provide quantitative immunoblot detection (Mathews et al., 2009; Pillai-Kastoori et al., 2020) (see also manufacturer's website: https://www.licor.com/bio/applications/quantitative-western-blots/).

      Following the reviewer's recommendation, we have carefully reviewed our proteomic and phosphoproteomic data. There are no further MS-based data (other than those already presented in the manuscript) to support the association of plectin with the FAK, MAPK/Erk, PI3K/Akt pathways.

      (3) Figure S7A and B, The pictures do not show any tumor, which is different from Figure 7A and B (and from the quantification in S7A lower right). Is it just because male mice were used in Figure 7 and female mice were used in Figure S7? Is there literature supporting the sex difference for the Myc-sgP53 model?

      As indicated in the Figure legends and in the corresponding text in the Results section (page 12), Fig 7A,B shows Myc;sgTp53-driven hepatocarcinogenesis in male mice, whereas Fig S7C,D shows results from the female cohort. In general, the HDTVI-induced HCC onset and progression differ considerably between individual experiments, and it is therefore crucial to compare data within an experimental cohort (as we have done for Ple<sup>ΔAlb</sup> and Ple<sup>fl/fl</sup> mice). Nevertheless, we cannot exclude the influence of sexual dimorphism on the results presented. The existence of sexual dimorphism in liver cancer is supported by a substantial body of evidence derived from various studies (e.g. (Bigsby and Caperell-Grant, 2011; Bray et al., 2024)). To date, no reports have specifically addressed sexual dimorphism in Myc;sgTp53 HDTVI-induced liver cancer. This is likely due to the fact that the vast majority of studies using this model have only presented data for one sex. However, a study using an HDTVI-administered combination of c-MET and mutated beta-catenin oncogenes to induce HCC in mice observed elevated levels of alpha-fetoprotein (AFP) in males when compared to females (Bernal et al., 2024). The study suggests that estrogen may have a protective effect in female mice, as ovariectomized females had AFP levels comparable to those observed in males. Our data suggest that female hormones may have a similar effect in the Myc;sgTp53 HDTVI-induced liver cancer model.

      (4) Figure 2F, S2A, Ple<sup>ΔAlb</sup> mice more frequently formed larger tumors, as reflected by overall tumor size increase. The interpretation of the authors is "possibly implying reduced migration or increased cohesion of plectin-depleted cells". It is quite arbitrary to make this suggestion in the absence of substantial data or literature to support this theory.

      We agree with the reviewer that our statement “Notably, Ple<sup>ΔAlb</sup> mice more frequently formed larger tumors, as reflected by overall tumor size increase (Fig. 2F; Figure 2—figure supplement 1A), possibly implying reduced migration or increased cohesion of plectin-depleted cells(25).” (page 7) is rather speculative. As we did not address the formation of larger tumors in Ple<sup>ΔAlb</sup> mice further in the current study, we wanted to provide the readers with some, even speculative, hypotheses. In support of our hypothesis, we cite our own publication (#26; Jirouskova et al., J Hepatol., 2018), where we show that plectin inactivation in Ple<sup>ΔAlb</sup> livers results in upregulation of the epithelial marker E-cadherin. Previous studies have shown that a similar increase in E-cadherin expression levels reflects mesenchymal-to-epithelial transition (e.g. (Adhikary et al., 2014; Auersperg et al., 1999; Wendt et al., 2011)) and is often associated with reduced cancer cell migration/invasion. This is consistent with our finding that “migrating plectin-disabled SNU-475 cells exhibited more cohesive, epithelial-like features while progressing collectively. By contrast, WT SNU-475 leader cells were more polarized and found to migrate into scratch areas more frequently than their plectin-deficient counterparts (Figure 5—figure supplement 1B). Consistent with this observation, individually seeded SNU-475 cells less frequently assumed a polarized, mesenchymal-like shape upon plectin inactivation in both 2D and 3D environments (Fig. 5C). Moreover, plectin-inactivated SNU-475 cells exhibited a decrease in N-cadherin and vimentin levels when compared to WT counterparts (Figure 5—figure supplement 1C).” (page 10).

      In conclusion, we have shown that plectin-deficient hepatocytes express higher levels of E-cadherin and hepatocyte-derived SNU-475 cells express less N-cadherin and vimentin. In addition, we show that SNU-475 cells exhibited more cohesive, epithelial-like features in scratch-wound experiments. To address the reviewer's concern and to further support our statement about the increased cohesiveness of plectin-deficient HCC cells, we have included the citation of the recent study #27 (Xu et al., 2022). Using the MHCC97H and MHCC97L HCC cell lines, this study shows that plectin downregulation “inhibits HCC cell migration and epithelial mesenchymal transformation”, which is fully consistent with our hypothesis. To mitigate the impression of an unsubstantiated statement, we also discuss adhesion-independent plectin-mediated mechanisms in the revised Discussion section as follows: “However, it is conceivable that dysregulated cytoskeletal crosstalk could affect HCC through multiple mechanisms independent from FA-associated signaling. Indeed, we and others (Jirouskova et al., 2018; Xu et al., 2022) have shown that upon plectin inactivation, liver cells acquire epithelial characteristics that promote increased intercellular cohesion and reduced migration. Further studies will be required to identify and investigate synergistic adhesion-independent effects of plectin inactivation on HCC growth and metastasis.” (page 15).

      (5) Mutation or KO PLEC has been shown to cause severe diseases in humans and mice, including skin blistering, muscular dystrophy, and progressive familial intrahepatic cholestasis. Please elaborate on the potential side effects of targeting Plectin to treat HCC.

      Indeed, mutation or ablation of plectin has been implicated in many diseases (collectively known as plectinopathies). These multisystem disorders include an autosomal dominant form of epidermolysis bullosa simplex (EBS), limb-girdle muscular dystrophy, aplasia cutis congenita, and an autosomal recessive form of EBS that may be associated with muscular dystrophy, pyloric atresia, and/or congenital myasthenic syndrome. Several mutations have also been associated with cardiomyopathy and malignant arrhythmias. Progressive familial intrahepatic cholestasis has also been reported. In genetic mouse models, loss of plectin leads to skin fragility, extensive intestinal lesions, instability of the biliary epithelium, and progressive muscle wasting (for more details see (Vahidnezhad et al., 2022)). 

      It is therefore important to evaluate potential side effects; in this respect, plectin inactivation presents challenges comparable to those of other anti-HCC targets. For instance, sorafenib, the most widely used chemotherapy in recent decades, targets numerous serine/threonine and tyrosine kinases (RAF1, BRAF, VEGFR 1, 2, 3, PDGFR, KIT, FLT3, FGFR1, and RET) that are critical for proper non-pathological functions (Strumberg et al., 2007; Wilhelm et al., 2006; Wilhelm et al., 2004). The combinatorial therapy of atezolizumab and bevacizumab also targets PD-L1 in conjunction with VEGF, which plays an essential role in bone formation (Gerber et al., 1999), hematopoiesis (Ferrara et al., 1996), or wound healing (Chintalgattu et al., 2003). To allow readers to read a comprehensive account of the pathological consequences of plectin inactivation, we included two additional citations (Prechova et al., 2023; Vahidnezhad et al., 2022) and rephrased the Introduction section as follows: “…multiple reports have linked plectin with tumor malignancy(12) and other pathologies (Prechova et al., 2023; Vahidnezhad et al., 2022), mechanistic insights…” (pages 4-5).

      Reviewer 3:

      (1) The rationale for using Huh7 cells in the manuscript is not well explained as it has the lowest Plectin expression levels.

      For this study, we selected two model HCC cell lines – Huh7 and SNU-475. Our intention was to address the role of plectin in “well-differentiated” (Huh7) and “poorly differentiated” (SNU-475) HCC cells, thus including early and advanced stages of HCC development (as categorized before (Boyault et al., 2007; Yuzugullu et al., 2009b); see also our description and reasoning on page 6). The Huh7 cell line is also a well-established and widely used model suitable for both in vitro and in vivo settings (e.g. (Du et al., 2024; Fu et al., 2018; Si et al., 2023; Zheng et al., 2018)).

      As anticipated, less migratory “epithelial-like” Huh7 cells are characterized by relatively high E-cadherin, low vimentin, and low plectin expression levels (Fig 1D). In contrast, migratory “mesenchymal-like” SNU-475 cells are characterized by relatively low E-cadherin, high vimentin, and high plectin expression levels (Fig 1D). Therefore, the majority of analyses were performed in both relatively low plectin-expressing Huh7 and high plectin-expressing SNU-475 cells. It is noteworthy that inactivation of plectin had similar (although less pronounced) inhibitory effects on the phenotypes in both Huh7 and SNU-475 cells. We believe that these findings highlight the importance of plectin in HCC growth and metastasis, as plectin inactivation has inhibitory effects on both early (low plectin) and advanced (high plectin) stages of HCC.

      (2) The KO cell experiments should be supplemented with overexpression experiments.

      We agree with the reviewer that it would be helpful to complement our plectin inactivation experiments by overexpressing plectin in the HCC cell lines used in this study. In fact, we have received similar suggestions since we started to publish our studies on plectin. There are two reasons that preclude successful overexpression experiments. First, there are about 14 known isoforms of plectin (Prechova et al., 2023). Although previous studies have analyzed the phenotypic rescue potential of some plectin isoforms using transient transfection (e.g. (Burgstaller et al., 2010; Osmanagic-Myers et al., 2015; Prechova et al., 2022)), the isoform variability precludes rescue/overexpression experiments if the causative isoform is not known. Second, plectin is a giant cytoskeletal crosslinker protein of more than 4,500 amino acids with binding sites for intermediate filaments, F-actin, and microtubules. Overexpression of the approximately 500 kDa-large crosslinker invariably leads to the collapse of cytoskeletal networks in every cell type we have tested so far. See also our response to Reviewer 1, #7.

      (3) There is significant concern that while ablation of Ple led to reduced tumor number, these mice had larger tumors. The data indicate that Plectin may have distinct roles in HCC initiation versus progression. The data are not well explained and do not fully support that Plectin promotes hepatocarcinogenesis.

      In the DEN-induced HCC model, MRI screening revealed fewer tumors, and tumor volume was also reduced at 32 and 44 weeks post-induction (Fig 2A-C). The larger tumors formed in Ple<sup>ΔAlb</sup> compared to Ple<sup>fl/fl</sup> livers (Figs 2F and S2A) refer only to a subset of macroscopic tumors visually identified at necropsy. Larger Ple<sup>ΔAlb</sup> tumors were not observed in the Myc;sgTp53 HDTVI-induced HCC model (data not shown). In contrast, plectin deficiency reduced the size of xenografts formed in NSG mice (Fig 2H), and agar colonies grown from Huh7 and SNU-475 cells with inactivated plectin were also smaller (Fig S2F). In all in vivo and in vitro approaches presented in the manuscript, plectin inactivation reduced the number of colonies/xenografts/tumors. As hepatocarcinogenesis is a multistep process including initiation, promotion, and progression (Pitot, 2001), we feel confident in concluding that plectin inactivation inhibits hepatocarcinogenesis and we consider this conclusion to be fully supported by the data presented in the manuscript.

      However, we agree with the reviewer that larger macroscopic Ple<sup>ΔAlb</sup> tumors in the DEN-induced HCC model are intriguing. As we do not see similar effects (or even trends) in other approaches used in this study, we cannot exclude the contribution of the plectin-deficient environment in Ple<sup>ΔAlb</sup> livers during long-term (44 weeks) tumor formation and growth. In our previous study (Jirouskova et al., 2018), we showed that plectin deficiency in Ple<sup>ΔAlb</sup> livers leads to biliary tree malformations, collapse of bile ducts and ductules, and mild ductular reaction. We could speculate that Ple<sup>ΔAlb</sup> livers suffer from continuous bile leakage into the parenchyma, which would exacerbate all models of long-term pathology.

      As we did not address the formation of larger tumors in Ple<sup>ΔAlb</sup> mice further in the current study, we offered the reader a hypothesis for the larger tumors: “…possibly implying reduced migration or increased cohesion of plectin-depleted cells(25).” In support of our hypothesis, we cite our own publication (#26; Jirouskova et al., J Hepatol., 2018), where we show that plectin inactivation in Ple<sup>ΔAlb</sup> livers results in upregulation of the epithelial marker E-cadherin. Previous studies have shown that a similar increase in E-cadherin expression levels reflects mesenchymal-to-epithelial transition (e.g. (Adhikary et al., 2014; Auersperg et al., 1999; Wendt et al., 2011)) and is often associated with reduced cancer cell migration/invasion. This is consistent with our finding that “migrating plectin-disabled SNU-475 cells exhibited more cohesive, epithelial-like features while progressing collectively. By contrast, WT SNU-475 leader cells were more polarized and found to migrate into scratch areas more frequently than their plectin-deficient counterparts (Figure 5—figure supplement 1B). Consistent with this observation, individually seeded SNU-475 cells less frequently assumed a polarized, mesenchymal-like shape upon plectin inactivation in both 2D and 3D environments (Fig. 5C). Moreover, plectin-inactivated SNU-475 cells exhibited a decrease in N-cadherin and vimentin levels when compared to WT counterparts (Figure 5—figure supplement 1C).” (page 10).

      In conclusion, we have shown that plectin-deficient hepatocytes express higher levels of E-cadherin and hepatocyte-derived SNU-475 cells express less N-cadherin and vimentin. In addition, we show that SNU-475 cells exhibited more cohesive, epithelial-like features in scratch-wound experiments. To address the reviewer's concern and to further support our claim of increased cohesiveness of plectin-deficient HCC cells, we included the citation of the recent study(27). Using the MHCC97H and MHCC97L HCC cell lines, this study shows that plectin downregulation “inhibits HCC cell migration and epithelial mesenchymal transformation” and is therefore fully consistent with our hypothesis. To mitigate the impression of an unsubstantiated statement, we also discuss adhesion-independent plectin-mediated mechanisms in the revised Discussion section as follows: “However, it is conceivable that dysregulated cytoskeletal crosstalk could affect HCC through multiple mechanisms independent from FA-associated signaling. Indeed, we and others (Jirouskova et al., 2018; Xu et al., 2022) have shown that upon plectin inactivation, liver cells acquire epithelial characteristics that promote increased intercellular cohesion and reduced migration. Further studies will be required to identify and investigate synergistic adhesion-independent effects of plectin inactivation on HCC growth and metastasis.” (page 15).

      (4) Figure 3 showed that Plectin does not regulate p-FAK/FAK expression. Therefore, the statement that Plectin regulates the FAK pathway is not valid. Furthermore, there are too many variables in turns of p-AKT and p-ERK expression, making the conclusion not well supported.

      We agree with the reviewer that pFAK/FAK levels are either comparable or slightly higher upon plectin inactivation. However, we believe that our data convincingly show that FAK expression is downregulated in both Huh7 and SNU-475 cells. In our opinion, this results in an overall attenuation of FAK signaling (see percentage for Normalized pFAK × Normalized FAK), which is expectedly more pronounced in migratory SNU-475 cells. The following data (shown in Figs 3D and S3C) are expressed as a percentage of untreated WT, with downregulated values highlighted in red:

      Author response table 4.

      Given these results, we believe that our statement that “inhibition of plectin attenuates FAK signaling” (pages 8-9) is well supported.

      We believe that our data show that both pAkt and pErk are attenuated upon plectin inactivation in both Huh7 and SNU-475 cells. The following data (presented in Figs 3D and S3C) are shown as a percentage of untreated WT, with downregulated values highlighted in red:

      Author response table 5.

      We agree with the reviewer that plectin inactivation yields varying degrees of attenuation of the FAK, MAPK/Erk, and PI3K/Akt pathways depending on the cell type (Huh7 vs SNU-475 cells) and mode of plectin inactivation (CRISPR/Cas9-generated plectin KO vs functional KO (∆IFBD) vs organoruthenium-based inhibitor plecstatin-1). This context-dependent heterogeneity in the expression/activation of pathway molecular denominators reflects different degrees of cytoskeletal (e.g. #ventral stress fibers, Fig 4A,D and vimentin architecture, Fig S4A-C) and focal adhesion (e.g. %central FA, Fig 4A,E) phenotypes under different conditions. See also the detailed response to all Reviewers (on the first three pages of this letter) and the responses to Reviewer 1, #1 and #2 and Reviewer 2, #4.

      (5) The studies of plecstatin-1 in HCC should be expanded to a panel of human HCC cells with various Plectin expression levels in turns of cell growth and cell migration. The IC50 values should be determined and correlate with Plectin expression.

      Following the reviewer's suggestion, we have included graphs showing IC50 values for Huh7 (low plectin) and SNU-475 (high plectin) cells as Fig S2E. As expected, the IC50 values are higher for SNU-475 cells. Corresponding parts of the Figure legends have been changed. We refer to new data in the Results section as follows: “If not stated otherwise, we applied PST in the final concentration of 8 µM, which corresponds to the 25% of IC50 for Huh7 cells (Figure 2—figure supplement 1E).” (page 7). We also provide details of the IC50 determination in the revised Supplement Materials and methods section (pages 5-6).
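
      The relationship between the working concentration and the IC50 quoted above can be illustrated with a short sketch. The first function only restates the arithmetic in the text (8 µM being 25% of the Huh7 IC50 implies an IC50 of about 32 µM). The second is a generic log-linear interpolation of a dose-response series; it is an assumption for illustration, not the procedure described in the Supplement, and the dose-response values are hypothetical.

```python
import math


def implied_ic50(working_conc_um: float, fraction_of_ic50: float) -> float:
    """IC50 implied by a working concentration given as a fraction of IC50."""
    return working_conc_um / fraction_of_ic50


def interpolate_ic50(concs_um, viability_pct):
    """Estimate the concentration giving 50% viability.

    Interpolates linearly in log10(concentration) between the two
    neighbouring data points that bracket 50% viability.
    Concentrations must be positive and sorted in increasing order.
    """
    pairs = list(zip(concs_um, viability_pct))
    for (c_lo, v_lo), (c_hi, v_hi) in zip(pairs, pairs[1:]):
        if v_lo >= 50.0 >= v_hi and v_lo != v_hi:
            frac = (v_lo - 50.0) / (v_lo - v_hi)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    raise ValueError("50% viability is not bracketed by the data")


# Arithmetic taken from the text: 8 µM is 25% of the Huh7 IC50.
print(implied_ic50(8.0, 0.25))

# Hypothetical dose-response series (µM, % viability of untreated control).
print(interpolate_ic50([4.0, 16.0, 64.0], [95.0, 75.0, 25.0]))
```

      A fitted four-parameter logistic curve would normally be preferred over interpolation when enough concentrations are available; the interpolation is used here only because it needs no external libraries.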

      (6) One of the major issues is the mechanistic studies focusing on Plectin regulating HCC migration/metastasis, whereas the in vivo mouse studies focus on HCC formation (Figures 3 and 7). These are distinct processes and should not be mixed.

      In our study, we investigated the role of plectin in the development and dissemination of HCC. Using DEN- and Myc;sgTp53 HDTVI-induced HCC models (Figs 2A-F, S2A, 7A-C, and S7A-D), we show the effects of plectin inactivation on HCC formation in vivo. These studies are complemented by xenografts (Figs 2H and S2G) and in vitro colony formation assay (Figs 2G and S2F). Using an in vivo lung colonization assay (Figs 6G-I and S6C-F), we show the effects of plectin inactivation on the metastatic potential of HCC cells. In complementary in vitro studies, we show how plectin deficiency affects migration (Figs 5 and S5) and invasion (Figs 6A-E and S6A,B). 

      Our mechanistic studies show that plectin inactivation leads to dysregulation of cytoskeletal networks, adhesions, and adhesion-associated signaling. We believe that we have provided substantial experimental data suggesting that the proposed mechanisms play a role in plectin-mediated inhibition of both HCC development and dissemination. Of course, we cannot rule out additional, adhesion-independent mechanisms for HCC formation. To clarify this, we have revised the Discussion section as follows: “However, it is conceivable that dysregulated cytoskeletal crosstalk could affect HCC through multiple mechanisms independent from FA-associated signaling. Indeed, we and others (Jirouskova et al., 2018; Xu et al., 2022) have shown that upon plectin inactivation, liver cells acquire epithelial characteristics that promote increased intercellular cohesion and reduced migration. Further studies will be required to identify and investigate synergistic adhesion-independent effects of plectin inactivation on HCC growth and metastasis.” (page 15).

      (7) Figure 7B showed that Ple KO mice were treated with PST, but the data are not presented in the manuscript. Tumor cell proliferation and apoptosis rates should be analyzed as well.

      We do not show any effects of PST in Ple<sup>ΔAlb</sup> mice. As stated in the Fig 7B legend: “Myc;sgTp53 HCC was induced in Ple<sup>fl/fl</sup>, Ple<sup>ΔAlb</sup>, and PST-treated Ple<sup>fl/fl</sup> (Ple<sup>fl/fl</sup>+PST) male mice as in (A). Shown are representative images of Ple<sup>fl/fl</sup>, Ple<sup>ΔAlb</sup>, and Ple<sup>fl/fl</sup>+PST livers from mice with fully developed multifocal HCC sacrificed 6 weeks post-induction.”.

      Following the reviewer's recommendation, we include the analysis of proliferation and apoptosis rates as revised Fig S7A,B. Please note that no differences in apoptosis and proliferation rates were found between experimental conditions. Due to additional data, the original Fig S7 – 1 has been split into revised Fig S7 – 1 and Fig S7 – 2.

      (8) The status of FAK, AKT, and ERK pathway activation was not analyzed in mouse liver samples. In Figure 7D, most of the adjusted p-values are not significant.

      We are aware that the majority of FDR-corrected p-values shown in Fig 7D are not significant. In fact, we deliberated with our colleagues from the laboratory of Prof. Samuel Meier-Menches (Department of Analytical Chemistry, University of Vienna), who conducted all the proteomic studies presented in this manuscript, on whether to present such "weak" data. Following a lengthy discussion, we decided to include them despite anticipating criticism from the reviewers. The rationale for including these data is that, despite the lack of statistical significance, the findings are consistent with those of MS/immunoblot analyses of HCC cells (Figs 3 and S3) and patient data (Figs 7E, S7-2). The lack of statistical significance is a consequence of the limited number of animals included in the Ple<sup>fl/fl</sup>, Ple<sup>ΔAlb</sup>, and PST-treated Ple<sup>fl/fl</sup> cohorts, which has resulted in a high degree of variability in the MS results. We agree with the reviewer that the inclusion of immunoblot analysis would provide further support for our conclusions. However, we do not have any remaining liver tissue that could be analyzed.

      (9) There is no evidence to support that PST is capable of overcoming therapy resistance in HCC. For example, no comparison with the current standard care was provided in the preclinical studies.

      We are grateful to the reviewer for bringing our attention to the incorrect statement in the Abstract: “…we show that plectin inhibitor plecstatin-1 (PST) is well-tolerated and capable of overcoming therapy resistance in HCC”. To address the reviewer's concern, we rephrased the Abstract as follows: “…we show that plectin inhibitor plecstatin-1 (PST) is well-tolerated and potently inhibits HCC progression”.

      Recommendations for the authors: 

      Reviewer 2 (Recommendations for the authors):

      (1) In Figures 6I and S6C, it would be better to show the whole slide scan result for all the groups.

      Following the reviewer's recommendation, we include the whole slide scan result for all the groups as revised Fig S6F.

      (2) In Figures S7C and D, what do the highlighted/colored dots represent? They are not mentioned in the figure legend or the results.

      Following the reviewer's recommendation, we include the explanation in the revised Figure legends (page 30).

      (3) In Figure 2H, the experiment schedule showed "6w Huh7 t.v.i.", but should it be subcutaneous injection?

      We are grateful to the reviewer for bringing our attention to the incorrect description of the experiment. The schematic has been corrected. We have also noticed an error in the table summarizing the number of tumors formed (N) and have corrected the values for the WT+PST and KO conditions.

      (4) Supplemental Materials and Methods, Xenograft tumorigenesis, Error: 2.5×10<sup>6</sup> Huh7 cells in 250 ml PBS mice were administered subcutaneously in the left and right hind flanks. It probably should be "250ul".

      We are grateful to the reviewer for bringing our attention to the incorrect description of the experiment. The corresponding part of the Materials and Methods section has been corrected (page 2).

      (5) In Figure legend Supplementary Figure 6 C,D,E : "Representative magnified images from lung lobes with GFP-positive WT, KO, and WT+PST SNU-475 nodules". There is no picture for the WT+PST SNU-475 group.

      We are grateful to the reviewer for bringing our attention to the incorrect description of the experiment. The corresponding part of the Figure legend (“WT+PST SNU-475”) has been deleted (page 27).

      (6) In the Figure legend for Figure 6H, "Representative BLI images of WT, KO, and PST-treated WT (WT+PST) SNU-475 cells-bearing mice are shown". Should it be Huh7, not SNU-475?

      We are grateful to the reviewer for bringing our attention to the incorrect description of the experiment. The description of the cell line has been corrected (page 34).

      (7) The statement that current therapies rely on multikinase inhibitors is no longer correct.

      We are grateful to the reviewer for bringing our attention to the incorrect statement. To address the reviewer's concern, we rephrased the original part of Discussion section: “Current therapies for HCC rely on multikinase inhibitors (such as sorafenib) that provide only moderate survival benefit(60,61) due to primary resistance and the plasticity of signaling networks(62)” as follows: “Current systemic therapies for advanced HCC rely on a combination of multikinase inhibitor (such as sorafenib) or anti-VEGF /VEGF inhibitor (such as bevacizumab) treatment with immunotherapy(59). Multikinase inhibitors provide only moderate survival benefit(60,61) due to primary resistance and the plasticity of signaling networks(62), and only a subset of patients benefits from addition of immunotherapy in HCC treatment(63)” (page 15).

      References

      Adhikary, A., S. Chakraborty, M. Mazumdar, S. Ghosh, S. Mukherjee, A. Manna, S. Mohanty, K.K. Nakka, S. Joshi, A. De, S. Chattopadhyay, G. Sa, and T. Das. 2014. Inhibition of epithelial to mesenchymal transition by E-cadherin up-regulation via repression of slug transcription and inhibition of E-cadherin degradation: dual role of scaffold/matrix attachment region-binding protein 1 (SMAR1) in breast cancer cells. The Journal of biological chemistry. 289:25431-25444.

      Auersperg, N., J. Pan, B.D. Grove, T. Peterson, J. Fisher, S. Maines-Bandiera, A. Somasiri, and C.D. Roskelley. 1999. E-cadherin induces mesenchymal-to-epithelial transition in human ovarian surface epithelium. Proc Natl Acad Sci U S A. 96:6249-6254.

      Bernal, A., M. McLaughlin, A. Tiwari, F. Cigarroa, and L. Sun. 2024. Abstract 772: Investigation of gender disparity in liver tumor formation using a hydrodynamic tail vein injection mouse model. Cancer Research. 84:772-772.

      Bigsby, R.M., and A. Caperell-Grant. 2011. The role for estrogen receptor-alpha and prolactin receptor in sex-dependent DEN-induced liver tumorigenesis. Carcinogenesis. 32:1162-1166.

      Bonakdar, N., A. Schilling, M. Sporrer, P. Lennert, A. Mainka, L. Winter, G. Walko, G. Wiche, B. Fabry, and W.H. Goldmann. 2015. Determining the mechanical properties of plectin in mouse myoblasts and keratinocytes. Exp Cell Res. 331:331-337.

      Boyault, S., D.S. Rickman, A. de Reynies, C. Balabaud, S. Rebouissou, E. Jeannot, A. Herault, J. Saric, J. Belghiti, D. Franco, P. Bioulac-Sage, P. Laurent-Puig, and J. Zucman-Rossi. 2007. Transcriptome classification of HCC is related to gene alterations and to new therapeutic targets. Hepatology. 45:42-52.

      Bray, F., M. Laversanne, H. Sung, J. Ferlay, R.L. Siegel, I. Soerjomataram, and A. Jemal. 2024. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 74:229-263.

      Buckup, M., M.A. Rice, E.C. Hsu, F. Garcia-Marques, S. Liu, M. Aslan, A. Bermudez, J. Huang, S.J. Pitteri, and T. Stoyanova. 2021. Plectin is a regulator of prostate cancer growth and metastasis. Oncogene. 40:663-676.

      Burgstaller, G., M. Gregor, L. Winter, and G. Wiche. 2010. Keeping the vimentin network under control: cell-matrix adhesion-associated plectin 1f affects cell shape and polarity of fibroblasts. Mol Biol Cell. 21:3362-3375.

      Chintalgattu, V., D.M. Nair, and L.C. Katwa. 2003. Cardiac myofibroblasts: a novel source of vascular endothelial growth factor (VEGF) and its receptors Flt-1 and KDR. J Mol Cell Cardiol. 35:277-286.

      Cuconati, A., C. Mills, C. Goddard, X. Zhang, W. Yu, H. Guo, X. Xu, and T.M. Block. 2013. Suppression of AKT anti-apoptotic signaling by a novel drug candidate results in growth arrest and apoptosis of hepatocellular carcinoma cells. PLoS One. 8:e54595.

      Du, Y.Q., B. Yuan, Y.X. Ye, F.L. Zhou, H. Liu, J.J. Huang, and Y.F. Wei. 2024. Plumbagin Regulates Snail to Inhibit Hepatocellular Carcinoma Epithelial-Mesenchymal Transition in vivo and in vitro. J Hepatocell Carcinoma. 11:565-580.

      Fan, Z.C., J. Yan, G.D. Liu, X.Y. Tan, X.F. Weng, W.Z. Wu, J. Zhou, and X.B. Wei. 2012. Real-time monitoring of rare circulating hepatocellular carcinoma cells in an orthotopic model by in vivo flow cytometry assesses resection on metastasis. Cancer Res. 72:2683-2691.

      Ferrara, N., K. Carver-Moore, H. Chen, M. Dowd, L. Lu, K.S. O'Shea, L. Powell-Braxton, K.J. Hillan, and M.W. Moore. 1996. Heterozygous embryonic lethality induced by targeted inactivation of the VEGF gene. Nature. 380:439-442.

      Fu, Q., Q. Zhang, Y. Lou, J. Yang, G. Nie, Q. Chen, Y. Chen, J. Zhang, J. Wang, T. Wei, H. Qin, X. Dang, X. Bai, and T. Liang. 2018. Primary tumor-derived exosomes facilitate metastasis by regulating adhesion of circulating tumor cells via SMAD3 in liver cancer. Oncogene. 37:6105-6118.

      Gerber, H.P., T.H. Vu, A.M. Ryan, J. Kowalski, Z. Werb, and N. Ferrara. 1999. VEGF couples hypertrophic cartilage remodeling, ossification and angiogenesis during endochondral bone formation. Nat Med. 5:623-628.

      Gnani, D., I. Romito, S. Artuso, M. Chierici, C. De Stefanis, N. Panera, A. Crudele, S. Ceccarelli, E. Carcarino, V. D'Oria, M. Porru, E. Giorda, K. Ferrari, L. Miele, E. Villa, C. Balsano, D. Pasini, C. Furlanello, F. Locatelli, V. Nobili, R. Rota, C. Leonetti, and A. Alisi. 2017. Focal adhesion kinase depletion reduces human hepatocellular carcinoma growth by repressing enhancer of zeste homolog 2. Cell Death Differ. 24:889-902.

      Gregor, M., S. Osmanagic-Myers, G. Burgstaller, M. Wolfram, I. Fischer, G. Walko, G.P. Resch, A. Jorgl, H. Herrmann, and G. Wiche. 2014. Mechanosensing through focal adhesion-anchored intermediate filaments. FASEB J. 28:715-729.

      Hiratsuka, S., S. Goel, W.S. Kamoun, Y. Maru, D. Fukumura, D.G. Duda, and R.K. Jain. 2011. Endothelial focal adhesion kinase mediates cancer cell homing to discrete regions of the lungs via E-selectin up-regulation. Proc Natl Acad Sci U S A. 108:3725-3730.

      Jakab, M., K.H. Lee, A. Uvarovskii, S. Ovchinnikova, S.R. Kulkarni, S. Jakab, T. Rostalski, C. Spegg, S. Anders, and H.G. Augustin. 2024. Lung endothelium exploits susceptible tumor cell states to instruct metastatic latency. Nat Cancer. 5:716-730.

      Jin, H., C. Wang, G. Jin, H. Ruan, D. Gu, L. Wei, H. Wang, N. Wang, E. Arunachalam, Y. Zhang, X. Deng, C. Yang, Y. Xiong, H. Feng, M. Yao, J. Fang, J. Gu, W. Cong, and W. Qin. 2017. Regulator of Calcineurin 1 Gene Isoform 4, Down-regulated in Hepatocellular Carcinoma, Prevents Proliferation, Migration, and Invasive Activity of Cancer Cells and Metastasis of Orthotopic Tumors by Inhibiting Nuclear Translocation of NFAT1. Gastroenterology. 153:799-811 e733.

      Jirouskova, M., K. Nepomucka, G. Oyman-Eyrilmez, A. Kalendova, H. Havelkova, L. Sarnova, K. Chalupsky, B. Schuster, O. Benada, P. Miksatkova, M. Kuchar, O. Fabian, R. Sedlacek, G. Wiche, and M. Gregor. 2018. Plectin controls biliary tree architecture and stability in cholestasis. J Hepatol. 68:1006-1017.

      Katada, K., T. Tomonaga, M. Satoh, K. Matsushita, Y. Tonoike, Y. Kodera, T. Hanazawa, F. Nomura, and Y. Okamoto. 2012. Plectin promotes migration and invasion of cancer cells and is a novel prognostic marker for head and neck squamous cell carcinoma. J Proteomics. 75:1803-1815.

      Koster, J., S. van Wilpe, I. Kuikman, S.H. Litjens, and A. Sonnenberg. 2004. Role of binding of plectin to the integrin beta4 subunit in the assembly of hemidesmosomes. Mol Biol Cell. 15:1211-1223.

      Liu, H., Q. Chen, D. Lu, X. Pang, S. Yin, K. Wang, R. Wang, S. Yang, Y. Zhang, Y. Qiu, T. Wang, and H. Yu. 2020. HTBPI, an active phenanthroindolizidine alkaloid, inhibits liver tumorigenesis by targeting Akt. FASEB J. 34:12255-12268.

      Lu, H.H., S.Y. Lin, R.R. Weng, Y.H. Juan, Y.W. Chen, H.H. Hou, Z.C. Hung, G.A. Oswita, Y.J. Huang, S.Y. Guu, K.H. Khoo, J.Y. Shih, C.J. Yu, and H.C. Tsai. 2020. Fucosyltransferase 4 shapes oncogenic glycoproteome to drive metastasis of lung adenocarcinoma. EBioMedicine. 57:102846.

      Mathews, S.T., E.P. Plaisance, and T. Kim. 2009. Imaging systems for westerns: chemiluminescence vs. infrared detection. Methods in molecular biology (Clifton, N.J.). 536:499-513.

      Osmanagic-Myers, S., M. Gregor, G. Walko, G. Burgstaller, S. Reipert, and G. Wiche. 2006. Plectin-controlled keratin cytoarchitecture affects MAP kinases involved in cellular stress response and migration. J Cell Biol. 174:557-568.

      Osmanagic-Myers, S., S. Rus, M. Wolfram, D. Brunner, W.H. Goldmann, N. Bonakdar, I. Fischer, S. Reipert, A. Zuzuarregui, G. Walko, and G. Wiche. 2015. Plectin reinforces vascular integrity by mediating crosstalk between the vimentin and the actin networks. J Cell Sci. 128:4138-4150.

      Pillai-Kastoori, L., A.R. Schutz-Geschwender, and J.A. Harford. 2020. A systematic approach to quantitative Western blot analysis. Analytical biochemistry. 593:113608.

      Pitot, H.C. 2001. Pathways of progression in hepatocarcinogenesis. Lancet (London, England). 358:859-860.

      Prechova, M., Z. Adamova, A.L. Schweizer, M. Maninova, A. Bauer, D. Kah, S.M. Meier-Menches, G. Wiche, B. Fabry, and M. Gregor. 2022. Plectin-mediated cytoskeletal crosstalk controls cell tension and cohesion in epithelial sheets. J Cell Biol. 221.

      Prechova, M., K. Korelova, and M. Gregor. 2023. Plectin. Curr Biol. 33:R128-R130.

      Qi, L., T. Knifley, M. Chen, and K.L. O'Connor. 2022. Integrin alpha6beta4 requires plectin and vimentin for adhesion complex distribution and invasive growth. J Cell Sci. 135.

      Romito, I., M. Porru, M.R. Braghini, L. Pompili, N. Panera, A. Crudele, D. Gnani, C. De Stefanis, M. Scarsella, S. Pomella, S. Levi Mortera, E. de Billy, A.L. Conti, V. Marzano, L. Putignani, M. Vinciguerra, C. Balsano, A. Pastore, R. Rota, M. Tartaglia, C. Leonetti, and A. Alisi. 2021. Focal adhesion kinase inhibitor TAE226 combined with Sorafenib slows down hepatocellular carcinoma by multiple epigenetic effects. J Exp Clin Cancer Res. 40:364.

      Si, T., L. Huang, T. Liang, P. Huang, H. Zhang, M. Zhang, and X. Zhou. 2023. Ruangan Lidan decoction inhibits the growth and metastasis of liver cancer by downregulating miR-9-5p and upregulating PDK4. Cancer Biol Ther. 24:2246198.

      Strumberg, D., J.W. Clark, A. Awada, M.J. Moore, H. Richly, A. Hendlisz, H.W. Hirte, J.P. Eder, H.J. Lenz, and B. Schwartz. 2007. Safety, pharmacokinetics, and preliminary antitumor activity of sorafenib: a review of four phase I trials in patients with advanced refractory solid tumors. Oncologist. 12:426-437.

      Tao, Q.F., S.X. Yuan, F. Yang, S. Yang, Y. Yang, J.H. Yuan, Z.G. Wang, Q.G. Xu, K.Y. Lin, J. Cai, J. Yu, W.L. Huang, X.L. Teng, C.C. Zhou, F. Wang, S.H. Sun, and W.P. Zhou. 2015. Aldolase B inhibits metastasis through Ten-Eleven Translocation 1 and serves as a prognostic biomarker in hepatocellular carcinoma. Mol Cancer. 14:170.

      Vahidnezhad, H., L. Youssefian, N. Harvey, A.R. Tavasoli, A.H. Saeidian, S. Sotoudeh, A. Varghaei, H. Mahmoudi, P. Mansouri, N. Mozafari, O. Zargari, S. Zeinali, and J. Uitto. 2022. Mutation update: The spectra of PLEC sequence variants and related plectinopathies. Human mutation. 43:1706-1731.

      Voisin, L., M. Lapouge, M.K. Saba-El-Leil, M. Gombos, J. Javary, V.Q. Trinh, and S. Meloche. 2024. Syngeneic mouse model of YES-driven metastatic and proliferative hepatocellular carcinoma. Dis Model Mech. 17.

      Wang, D.D., Y. Chen, Z.B. Chen, F.J. Yan, X.Y. Dai, M.D. Ying, J. Cao, J. Ma, P.H. Luo, Y.X. Han, Y. Peng, Y.H. Sun, H. Zhang, Q.J. He, B. Yang, and H. Zhu. 2016. CT-707, a Novel FAK Inhibitor, Synergizes with Cabozantinib to Suppress Hepatocellular Carcinoma by Blocking Cabozantinib-Induced FAK Activation. Mol Cancer Ther. 15:2916-2925.

      Wang, W., A. Zuidema, L. Te Molder, L. Nahidiazar, L. Hoekman, T. Schmidt, S. Coppola, and A. Sonnenberg. 2020. Hemidesmosomes modulate force generation via focal adhesions. J Cell Biol. 219.

      Wendt, M.K., M.A. Taylor, B.J. Schiemann, and W.P. Schiemann. 2011. Down-regulation of epithelial cadherin is required to initiate metastatic outgrowth of breast cancer. Mol Biol Cell. 22:2423-2435.

      Wenta, T., A. Schmidt, Q. Zhang, R. Devarajan, P. Singh, X. Yang, A. Ahtikoski, M. Vaarala, G.H. Wei, and A. Manninen. 2022. Disassembly of alpha6beta4-mediated hemidesmosomal adhesions promotes tumorigenesis in PTEN-negative prostate cancer by targeting plectin to focal adhesions. Oncogene. 41:3804-3820.

      Wilhelm, S., C. Carter, M. Lynch, T. Lowinger, J. Dumas, R.A. Smith, B. Schwartz, R. Simantov, and S. Kelley. 2006. Discovery and development of sorafenib: a multikinase inhibitor for treating cancer. Nat Rev Drug Discov. 5:835-844.

      Wilhelm, S.M., C. Carter, L. Tang, D. Wilkie, A. McNabola, H. Rong, C. Chen, X. Zhang, P. Vincent, M. McHugh, Y. Cao, J. Shujath, S. Gawlak, D. Eveleigh, B. Rowley, L. Liu, L. Adnane, M. Lynch, D. Auclair, I. Taylor, R. Gedrich, A. Voznesensky, B. Riedl, L.E. Post, G. Bollag, and P.A. Trail. 2004. BAY 43-9006 exhibits broad spectrum oral antitumor activity and targets the RAF/MEK/ERK pathway and receptor tyrosine kinases involved in tumor progression and angiogenesis. Cancer Res. 64:7099-7109.

      Xu, R., S. He, D. Ma, R. Liang, Q. Luo, and G. Song. 2022. Plectin Downregulation Inhibits Migration and Suppresses Epithelial Mesenchymal Transformation of Hepatocellular Carcinoma Cells via ERK1/2 Signaling. Int J Mol Sci. 24.

      You, A., M. Cao, Z. Guo, B. Zuo, J. Gao, H. Zhou, H. Li, Y. Cui, F. Fang, W. Zhang, T. Song, Q. Li, X. Zhu, H. Yin, H. Sun, and T. Zhang. 2016. Metformin sensitizes sorafenib to inhibit postoperative recurrence and metastasis of hepatocellular carcinoma in orthotopic mouse models. J Hematol Oncol. 9:20.

      Yuzugullu, H., K. Benhaj, N. Ozturk, S. Senturk, E. Celik, A. Toylu, N. Tasdemir, M. Yilmaz, E. Erdal, K.C. Akcali, N. Atabey, and M. Ozturk. 2009a. Canonical Wnt signaling is antagonized by noncanonical Wnt5a in hepatocellular carcinoma cells. Molecular Cancer. 8:90.

      Yuzugullu, H., K. Benhaj, N. Ozturk, S. Senturk, E. Celik, A. Toylu, N. Tasdemir, M. Yilmaz, E. Erdal, K.C. Akcali, N. Atabey, and M. Ozturk. 2009b. Canonical Wnt signaling is antagonized by noncanonical Wnt5a in hepatocellular carcinoma cells. Mol Cancer. 8:90.

      Zhao, J., Y. Hou, C. Yin, J. Hu, T. Gao, X. Huang, X. Zhang, J. Xing, J. An, S. Wan, and J. Li. 2020. Upregulation of histamine receptor H1 promotes tumor progression and contributes to poor prognosis in hepatocellular carcinoma. Oncogene. 39:1724-1738.

      Zheng, H., Y. Yang, C. Ye, P.P. Li, Z.G. Wang, H. Xing, H. Ren, and W.P. Zhou. 2018. Lamp2 inhibits epithelial-mesenchymal transition by suppressing Snail expression in HCC. Oncotarget. 9:30240-30252.

    1. Author Response

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public Review):

      This manuscript by Leibinger et al describes their results from testing an interesting hypothesis that microtubule detyrosination inhibits axon regeneration and its inhibitor parthenolide could facilitate axon regeneration and perhaps functional recovery. Overall, the results from in vitro studies are largely well performed. However, the in vivo data are less convincing.

      Interpretation of the findings in this study are limited by several gaps:

      1) It is unclear whether microtubule detyrosination is a primary effect of hIL-6 and PTEN deletion or secondary to the increased axon growth.

      This point is based on a misunderstanding. As shown by Western blot in Fig. 2, detyrosination was increased in optic nerves after intravitreal injection of AAV2-hIL-6. These optic nerves were uninjured! This indicates that the increased detyrosination is an effect of the treatment itself and does not occur due to axonal regeneration.

      hIL-6 and PTEN ko nevertheless increase axonal regeneration because their positive effect on other signaling pathways, such as JAK/STAT3 and mTOR, ultimately predominates. Consequently, we show, for both PTEN ko and hIL-6, that we can further enhance these positive effects by neutralizing the negative aspect of increased detyrosination using DMAPT.

      2) Is there any direct evidence for Akt and/or JAK/Stat3 to promote microtubule detyrosination?

      Regarding the AKT/GSK3 signaling pathway, it has been well described that GSK3 activity leads to phosphorylation of microtubule-associated protein 1B, which results in enhanced tubulin detyrosination (Lucas et al., 1998, Goold et al., 1999, Owen and Gordon-Weeks 2003). As shown in our previous and cited work, hIL-6 promotes the activation of AKT, which in turn inhibits GSK3 (Leibinger et al. 2016). In Fig. 2, we have also shown that intravitreal hIL-6 treatment in the optic nerve leads to increased inhibitory phosphorylation of GSK3 at the target site of AKT, and that tubulin detyrosination is increased. The same was also shown for PTEN ko: In a previous publication, we showed that PTEN ko increases AKT activity, inhibiting GSK3 phosphorylation (Leibinger et al. 2019). In Fig. 3 of the present study, we show that PTEN ko results in enhanced tubulin detyrosination. In conclusion, treatments activating the AKT/GSK3 signaling enhance tubulin detyrosination.

      On the other hand, JAK/STAT3 has no direct effect on detyrosination. This was demonstrated in experiments using the CNTF application, which reportedly activates the JAK/STAT3 pathway without affecting AKT/GSK3 (Leibinger et al, 2009, 2016, 2017).

      In cell culture, we have shown that activation of the JAK/STAT3 pathway by CNTF does not change tubulin detyrosination in neurites (Fig. 1 H, I, M, N). Moreover, DMAPT does not affect the phosphorylation of STAT3 and S6 in RGC cell bodies, and thus has no measurable effect on the JAK/STAT3 or mTOR pathways.

      3) What is the impact of parthenolide on cell soma of neurons and other cell types?

      Parthenolide and DMAPT show a regenerative effect in the nanomolar range (cell culture) and a bell-shaped concentration-response curve. We show a close correlation between detyrosinated microtubules and regeneration (with and without hIL-6 or PTEN-KO), which is, in our opinion, convincing. Moreover, we would like to address a likely misunderstanding in this comment and provide further clarification. The detyrosination of alpha-tubulin occurs after its attachment to microtubules through the action of the tubulin carboxy peptidase vasohibin 1 and 2 (Vash 1, 2). Consequently, tubulin is already present in the detyrosinated form within existing microtubules, and the administration of DMAPT does not affect these pre-existing microtubules. However, DMAPT does play a crucial role in preventing the detyrosination of newly attached tubulin dimers in the growth cones of developing axons. This explains why we can detect detyrosinated tubulin specifically in those regions and why our immunohistochemical analyses in the cell culture experiments focused solely on axon tips.

      It is important to note that when used at low concentrations, which promote axon growth, DMAPT does not measurably affect detyrosination in other neuronal compartments, such as the RGCs' somata. We might observe a decrease in detyrosination only at much higher concentrations. However, this outcome would be inconsequential to our findings.

      Whether additional effects of DMAPT contribute to improved regeneration is not excluded, although unlikely. If so, their investigation would be beyond the scope of the current paper.

      4) Direct evidence that parthenolide augments PTEN deletion in optic nerve or spinal cord is not provided.

      Our research paper primarily investigates the combination of DMAPT with hIL-6. We chose to combine DMAPT with hIL-6 because, unlike PTEN-KO, only hIL-6 has been demonstrated to facilitate functional recovery following a complete spinal cord crush injury (Leibinger et al., 2021). Therefore, it is unclear why in vivo experiments with PTEN-KO, which cannot be used therapeutically, would be necessary. Since we have shown the beneficial effects of DMAPT on hIL-6 in two different in vivo models (optic nerve and spinal cord) anatomically and functionally, we feel that the repetition of these experiments with PTEN ko, which has no therapeutic implication, would not justify the sacrifice of additional animals. This would contradict the principles of reduction, refinement, and replacement, aiming to minimize the use of animals in our research.

      In contrast, the PTEN experiments primarily serve to support the underlying mechanism and demonstrate that DMAPT generally counteracts the negative effect on MT detyrosination, even in conjunction with other procedures that activate the PI3K/AKT pathway. These findings were mechanistically elucidated through cell culture experiments utilizing immunohistochemical analysis, which the editors highlighted as strengths of our paper.

      5) Serotonergic neurotoxin DHT ablates both regenerating and non-regenerating serotonergic axons, which makes the spinal cord findings difficult to interpret.

      The impact of unregenerated serotonergic axons on stereotypic hind leg movements, as assessed through BMS analysis, appears to be minimal, as demonstrated in our previous study (Leibinger et al., 2021). Specifically, our findings revealed that depleting serotonergic neurons using DHT did not significantly affect the BMS score in uninjured animals (Leibinger et al., 2021). Furthermore, even in the control group comprising animals with spinal cord lesions where anatomical regeneration of the RpST did not occur, the administration of DHT had no discernible effect (Fig. 7 K, L).

      To address this concern, we included the following information in the revised manuscript: "It might be considered plausible that the depletion of non-regenerated serotonergic axons could have contributed to these results. However, we can largely dismiss this possibility, as DHT did not influence the non-regenerated vehicle control group. Additionally, in a previous publication, we have demonstrated that the general depletion of serotonergic neurons in uninjured animals also does not significantly impact open field locomotion, as measured by the BMS score and subscore (Leibinger et al., 2021)."

      6) DMAPT was given by i.p. injection. What happens to microtubule detyrosination in other cells within and outside of CNS?

      This question is the same as the one raised under point 3; please see our response there.

      Reviewer #2 (Public Review):

      In the current study, Fischer and colleagues extensively examined the role of parthenolide in inhibiting microtubule detyrosination and making the mechanistic link for the compound to facilitate the role of IL6 and PTEN/KO in promoting neurite outgrowth and axon regeneration. The in vitro and mechanistic work laid the foundation for the authors to reach several key predictions that such detyrosination can be applied for in vivo applications. Thus the authors extended the work to optic nerve regeneration and spinal cord recovery. The in vivo compound that the authors utilized is DMAPT, which plays a synergistic role with existing pro-regeneration therapies, such as Il6 treatment.

      The major strength of the work is the first half of the mechanistic inquiries, where the authors combined cell biology and biochemistry approaches to dissect the mechanistic link from parthenolide to microtubule dynamics. The shortcoming is that the in vivo data is limited, and the effects might be considered mild, especially by benchmarking with other established and effective strategies.

      The work is solid and prepares a basis for others to test the role of DMAPT in other settings, especially in the setting of other effective pro-regenerative approaches. With the goal of comprehensive and functional recovery in vivo, the impact of the work and the utilities of the methods remain to be tested broadly in other models in vivo.

      Reviewer #3 (Public Review):

      The primary goal of this paper is to examine microtubule detyrosination as a potential therapeutic target for axon regeneration. Using dimethylamino-parthenolide (DMAPT), this study extensively examines mechanistic links between microtubule detyrosination, interleukin-6 (IL-6), and PTEN in neurite outgrowth in retinal ganglion cells in vitro. These findings provide convincing evidence that parthenolide has a synergistic effect on IL-6- and PTEN-related mechanisms of neurite outgrowth in vitro. The potential efficacy of systemic DMAPT treatment to promote axon regeneration in mouse models of optic nerve crush and spinal cord injury was also examined.

      Strengths

      1) The examination of synergistic activities between parthenolide, hyperIL-6, and PTEN knockout is leveraged not only for potential therapeutic value, but also to validate and delineate mechanism of action.

      2) The in vitro studies, including primary human retinal ganglion cells, utilize a multi-level approach to dissect the mechanistic link from parthenolide to microtubule dynamics.

      3) The studies provide a basis for others to test the role of DMAPT in other settings, particularly in the context of other effective pro-regenerative approaches.

      Weaknesses

      1) In vivo studies are limited to select outcomes of recovery and do not validate or address mechanism of action in vivo.

      Reviewer #1 (Recommendations For The Authors):

      Overall, it doesn't seem like the authors bought into or addressed any issues raised during the previous review. In testing their central hypothesis, a critical experiment was to assess the outcome of PTEN knockout in combination with their novel treatment (parthenolide or DMAPT). Unfortunately, this and other issues have not been addressed in this revision.

      PTEN is not part of our central hypothesis. Our paper primarily investigates the combination of DMAPT with hIL-6. We chose to combine DMAPT with hIL-6 because hIL-6, unlike PTEN-KO, has been demonstrated to facilitate functional recovery following a complete spinal cord crush injury (Leibinger et al., 2021). It is therefore unclear why in vivo experiments with PTEN-KO, which cannot be used therapeutically, would be necessary. Since we have shown the beneficial effects of DMAPT on hIL-6 anatomically and functionally in two different in vivo models (optic nerve and spinal cord), we feel that repeating these experiments with PTEN-KO, which has no therapeutic implication, would not justify the sacrifice of additional animals. It would contradict the principles of reduction, refinement, and replacement, which aim to minimize the use of animals in research.

      In contrast, the PTEN experiments primarily serve to support the underlying mechanism and demonstrate that DMAPT generally counteracts the negative effect on MT detyrosination, even in conjunction with other procedures that activate the PI3K/AKT pathway. These findings were mechanistically elucidated through cell culture experiments utilizing immunohistochemical analysis, which the editors highlighted as strengths of our paper.

      Reviewer #2 (Recommendations For The Authors):

      The response and revision provided here did not improve the manuscript - the authors chose to focus on re-organizing the methods but did not provide any new experimental data. Thus my recommendations remain the same as the previous round. In brief, the in vivo evidence was rather weak, especially if no further evidence was offered to respond to these points below.

      To possibly improve the manuscript, the authors could consider enhancing the in vivo parts in the following manner:

      1) possibly detyrosination staining in the optic nerve vertical section - it would be interesting to see how the detyrosination assays may work for regenerating conditions, or as an alternative, the authors may consider retina tissue biochemistry (with & without IL6, with & without DMAPT) repeating the biochemical assays as established in Fig. 2B.

      The detyrosination of alpha-tubulin occurs after its attachment to microtubules through the action of the tubulin carboxy peptidase vasohibin 1 and 2 (Vash 1, 2). Consequently, tubulin is already present in the detyrosinated form within existing microtubules, and the administration of DMAPT does not affect these pre-existing microtubules. However, DMAPT does play a crucial role in preventing the detyrosination of newly attached tubulin dimers in the growth cones of developing axons. This explains why we can detect detyrosinated tubulin specifically in those regions and why our immunohistochemical analyses in the cell culture experiments focused solely on axon tips.

      It is important to note that when used at low concentrations, which promote axon growth, DMAPT does not measurably affect detyrosination in other neuronal compartments, such as the RGCs' somata. We might observe a decrease in detyrosination only at much higher concentrations. For these reasons, we could not clearly identify and stain axon tips in 14 µm thick optic nerve sections.

      2) How do the authors benchmark the DMAPT retreatment in the setting of PTEN (aav2-cre injection for cKO) and /or PTEN/SOCS3/CNTF dKO? Which are the best approaches to promote optic nerve regeneration? Would the authors expect DMAPT retreatment to be synergetic with PTENcKO?

      Based on our previous findings, we anticipate that DMAPT would exhibit a synergistic effect when combined with PTEN-KO, as demonstrated in our in vitro studies with cultured neurons. Additionally, synergistic effects between DMAPT and PTEN/SOCS3 dKO + CNTF are possible. While these hypotheses hold promise, our current paper primarily focuses on combining DMAPT with hIL-6, which has consistently shown remarkable efficacy as a standalone treatment in optic nerve regeneration.

      3) Regarding the DMAPT treatment, one notable issue was that the RGC survival subject to ONC was very poor, which may limit the effects of DMAPT daily injection. The authors may consider further combining DMAPT with the DLK/LZK inhibitors to examine the synergistic effects.

      As DMAPT itself is not neuroprotective and does not affect retinal ganglion cells' (RGCs) regenerative state by inducing the expression of regeneration-associated genes, a combination with a neuroprotective and regenerative treatment would show stronger effects. This is exactly what we found when combining DMAPT with neuroprotective hIL-6 (Leibinger et al. 2016) in the current paper.

      Moreover, in the raphespinal tract, where respective neurons do not undergo apoptotic cell death after axotomy, the DMAPT effect on anatomic axon regeneration was stronger than in the optic nerve, even without combination with hIL-6, with some axons reaching distances of up to 7 mm distal to the lesion. So, DMAPT can induce long-distance regeneration in neuronal populations unaffected by cell death. Therefore, additional experiments with DLK/LZK inhibitors, as suggested by this reviewer, would not provide an additional benefit to our paper and would not justify the additional sacrifice of animal lives.

      4) Overall, the phenotypes in Figs 5-8 were rather weak after DMAPT treatment, which are universal challenges to spinal cord regeneration. The authors may present this section of the data with further clarification on the selection standards in the methods, such as how the animals and treatment were selected and how a double-blinded experimental design may help further evaluate the effects of DMAPT treatment. I found little relevant information in the current manuscript.

      In the anatomic and functional regeneration analysis presented in Fig. 5-8, we only included animals with a BMS score of 0 one day after the spinal cord crush, indicating a complete absence of hind leg movement. Furthermore, we employed immunohistochemical staining to ensure that no serotonergic axons were detected at 8-10 mm from the lesion site in any of the animals, thus confirming the thoroughness of the lesion (Supplementary Fig. 4). Both the evaluation of the BMS score and the assessment of anatomical regeneration were conducted in a double-blinded manner, ensuring unbiased and objective observations. To address this concern, we will add the following paragraph to the Materials and Methods section:

      “Blinding procedure for in vivo experiments: Before the start of the experiment, individual vials containing DMAPT or vehicle (DMSO) stock solution were prepared for each particular experimental animal. The vials were randomized by a person who was neither involved in the implementation nor in the evaluation of the experiments. These numbers were randomly distributed to mice of the same age and sex in different cages. This was carried out independently by another person who was neither involved in the data evaluation nor the randomization of the samples. This was followed by the execution of the experiments and the evaluation by scientists who were not involved in any of the randomization processes and did not know the identity of the injected samples. After completion of the data collection, values from mice with signs of spared axons were first removed from the data set for reasons of quality assurance. The criteria for this were a BMS score of a maximum of 0-1 on the first day after the lesion and the absence of uninjured serotonergic axons in spinal cord cross-sections >8-10 mm distal to the lesion site. Finally, the data points were assigned to the respective experimental groups by the person who initially blinded the vials.”

      Reviewer #3 (Recommendations For The Authors):

      Addition of supporting data, revision of discussion, and inclusion of references for parthenolide activities improved the manuscript and adequately addressed concerns.


      The following is the authors’ response to the original reviews.

      We feel that the use of human RGCs should be considered a highlight and strength of our paper because, as far as we know, our study is the first to utilize human primary cultures of RGCs to confirm the effectiveness of drugs on human cells. Therefore, this might be of interest to colleagues in our field. Moreover, we have added additional data as suppl. Fig. proving that these cells are living RGCs so this concern has been addressed. In addition, we provide further explanations why other activities of DMAPT beyond microtubule detyrosination, such as oxidative stress and NFkB inhibition, are not considered in experimental examinations or in the interpretation of findings. Therefore, we strongly recommend that this point should not be considered a weakness.

      Strengths:

      1) The examination of synergistic activities between parthenolide, hyper-IL-6, and PTEN knockout is leveraged not only for potential therapeutic value, but also to validate and delineate mechanism of action.

      2) The in vitro studies utilize a multi-level approach that combines cell biology and biochemistry approaches to dissect the mechanistic link from parthenolide to microtubule dynamics.

      3) The studies provide a basis for others to test the role of DMAPT in other settings, particularly in the context of other effective pro-regenerative approaches.

      Weaknesses:

      1) In vivo studies are limited to select outcomes of recovery and do not validate or address mechanism of action in vivo.

      2) Known activities of DMAPT beyond microtubule detyrosination, such as oxidative stress, mitochondrial function and NFkB inhibition, are not considered in experimental examinations or in the interpretation of findings.

      Our research indicates that parthenolide exhibits a regenerative effect within a nanomolar range and with a bell-shaped concentration-response curve in culture. Moreover, we demonstrate a close correlation between the inhibition of microtubule detyrosination and regeneration and consider the effects of hIL-6 or PTEN-KO on detyrosination in mouse and human RGCs. We thus offer a coherent and satisfactory mechanistic explanation for the effects of parthenolide. We therefore feel that the request to experimentally explore additional, somewhat speculative possibilities is not reasonable or helpful, and that this issue should not be considered a weakness. Moreover, to the best of our knowledge, no evidence suggests profound antioxidative effects of DMAPT or parthenolide within these low concentration ranges, or that such effects would affect axon regeneration. Antioxidative effects would also not explain the observed bell-shaped curve. Furthermore, we have already considered the effect of NF-κB in our previous work (Gobrecht et al., 2016) and shown that NF-κB remains unaffected by low concentrations of parthenolide. Hence, conducting additional experiments addressing oxidative stress or other speculative causes would not strengthen our findings and would not justify the additional sacrifice of animal lives.

      Nevertheless, we added the following sentence in our manuscript to address this issue: “Although we cannot exclude the possibility that other known activities of parthenolide/DMAPT, such as oxidative stress or NF-kB inhibition, could have contributed to the observed effects, this is rather unlikely because such effects have only been reported at much higher micromolar concentrations (Bork et al., 1997; Saadane et al., 2007; Carlisi et al., 2016; Gobrecht et al., 2016).”

      Editorial Comments:

      The reviewers' consensus is that this manuscript, although containing an impressive amount of data, lacks cohesion.

      The mechanistic studies in vitro are of a distinctly different caliber than the in vivo studies. Additional data is needed to demonstrate that the mechanisms delineated in vitro are related to the outcomes in vivo. As is, this reads as a comprehensive in vitro study with premature in vivo data tacked on the end.

      The manuscript should contain the necessary background and contextual information needed to fully understand the work. Clarity of rationale and context for experimental method/design (why one reagent or insult is selected over another), result interpretation (what does this data tell you and not tell you), and implications for results (what does this mean in the context of current knowledge) should be improved throughout.

      Technical:

      1) There is no validation of human RGC cultures. If this data is to remain in the manuscript, proper verification data should be provided to demonstrate that these are indeed RGCs and that they are viable.

      The retinal ganglion cells (RGCs) were identified by applying the same criteria as for murine and rat RGCs, encompassing morphological and immunohistochemical criteria. The staining of a piece of human retina (see Author response image 1) shows βIII-tubulin-positive cells in the ganglion cell layer that form axonal bundles in the fiber layer. These are RGCs, which confirms that the βIII-tubulin antibody stains human RGCs (Author response image 1A). In addition, the somata of these human RGCs in the retina have a similar diameter (somewhat larger than murine RGCs; Author response image 1A, B) and a similar morphology to the cultured βIII-tubulin-positive cells (RGCs). Finally, these regenerating neurons are positive for GAP43, a regeneration-associated protein (Author response image 1C). Thus, these data prove that the cultured cells were human RGCs. These data were included as suppl. Fig. 1.

      The viability of the neurons was confirmed, as evidenced by their ability to grow neurites - a clear indication of their vitality. We also verified the viability by calcein staining.

      As far as we know, our study is the first to utilize human primary cultures of RGCs to confirm the effectiveness of CNTF and parthenolide on human cells. Therefore, we would have expected this accomplishment to be emphasized as a strength of our paper.

      Author response image 1.

      A) Retinal flat mounts from human (left) and mouse (right) stained for βIII-tubulin. Scale bar: 50 μm. B) Human (left) and mouse (right) RGCs cultured for 4 days and stained for βIII-tubulin. Scale bar: 25 μm. C) Human βIII-tubulin-positive RGCs with regenerating neurites are also GAP43-positive. Scale bar: 50 μm.

      2) For graphs depicting means and errors, it is advised that the authors evaluate their use of SEM. Standard deviation should be used when illustrating the distribution of measurements/individuals within a population. Standard error should be used for determining accuracy of the calculated mean, i.e. how close are individuals to the calculated mean? Since standard error is a measure of accuracy rather than distribution, it moves towards zero as the population size increases, regardless of the distribution. Thus, error bars intended to show the range of an effect (i.e. how much functional recovery with treatment?), should be depicted as standard deviation, which illustrates the actual range of data.

      To provide the best possible transparency, we incorporated each individual data point within our graphs, thus offering a detailed depiction of the complete range of effects. We firmly believe that this approach provides enhanced clarity compared to a standard deviation and grants a more comprehensive understanding of the data. Additionally presenting the standard error provides supplementary information regarding the accuracy of the calculated mean.

      Thus, we firmly stand by our chosen method of data presentation, as we believe it furnishes readers with more valuable insights. However, if there are additional compelling arguments to display the standard deviation instead of the standard error, we are more than willing to consider them.
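
      The distinction at issue in this exchange can be made concrete with a small numerical sketch. The values below are invented purely for illustration and are not data from the study, and the function names are hypothetical. The sketch computes the sample standard deviation (the spread of individual values) and the standard error of the mean (the standard deviation divided by the square root of n), and shows that the latter shrinks as the group grows even when the spread of the individual values stays similar.

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    return statistics.stdev(values)


def standard_error(values):
    """Standard error of the mean: SD / sqrt(n)."""
    return sample_sd(values) / math.sqrt(len(values))


# Hypothetical scores, invented for illustration only (not study data).
small_group = [2.0, 4.0, 6.0, 8.0]
# Same values repeated four times: similar spread, four times the n.
large_group = small_group * 4

for name, group in (("n=4", small_group), ("n=16", large_group)):
    print(name, "SD:", round(sample_sd(group), 3),
          "SEM:", round(standard_error(group), 3))
```

      Plotting every individual data point, as the authors describe, conveys the spread directly; the choice between the two error bars then mainly determines whether the bars summarize that spread or the precision of the mean.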

      3) One notable issue was that the RGC survival subject to ONC was very poor, which may limit the effects of DMAPT daily injection. The authors may consider further combining DMAPT with the DLK/LZK inhibitors to examine the synergistic effects.

      As DMAPT itself is not neuroprotective and does not affect retinal ganglion cells' (RGCs) regenerative state by inducing the expression of regeneration-associated genes, a combination with a neuroprotective and regenerative treatment would show stronger effects. This is exactly what we found when combining DMAPT with neuroprotective hIL-6 (Leibinger et al. 2016) in the current paper.

      Moreover, in the raphespinal tract, where respective neurons do not undergo apoptotic cell death after axotomy, the DMAPT effect on anatomic axon regeneration was stronger than in the optic nerve, even without combination with hIL-6, with some axons reaching distances of up to 7 mm distal to the lesion. So, DMAPT can induce long-distance regeneration in neuronal populations unaffected by cell death. Therefore, we feel that additional experiments with DLK/LZK inhibitors, as suggested by this reviewer, would not provide an additional benefit to our paper and would not justify the additional sacrifice of animal lives.

      To address this issue, we added the following paragraph: “Expectedly, DMAPT was not able to protect RGCs from axotomy-induced cell death (Fig. 4 F, G) since it solely accelerates microtubule polymerization in axonal growth cones without affecting neuroprotective signaling pathways in the cell body (Fig. 1 F, G; supplementary Fig. 2). We then repeated these experiments in combination with intravitreally applied AAV2-hIL-6, which reportedly has a significant neuroprotective effect (Leibinger et al., 2016) (Fig. 4 H).”

      4) The serotonergic neurotoxin DHT ablates both regenerating and non-regenerating serotonergic axons in the spinal cord injury model, which makes interpretation of the results difficult. This should be addressed directly in interpretation and discussion.

      The impact of unregenerated serotonergic axons on stereotypic hind leg movements, as assessed through BMS analysis, appears to be minimal, as demonstrated in our previous study (Leibinger et al., 2021). Specifically, our findings revealed that depleting serotonergic neurons using DHT did not significantly affect the BMS score in uninjured animals (Leibinger et al., 2021). Furthermore, even in the control group comprising animals with spinal cord lesions where anatomical regeneration of the RpST did not occur, the administration of DHT had no discernible effect (Fig. 7 K, L).

      To address this concern, we propose including the following information in the revised manuscript: “It might appear conceivable that the depletion of non-regenerated serotonergic axons may have contributed to these results. However, we can rule this out since DHT did not influence the non-regenerated vehicle control group. Furthermore, we have shown in a previous publication that the general depletion of serotonergic neurons in uninjured animals also has no significant influence on open-field locomotion as measured in the BMS score and subscore (Leibinger et al., 2021).”

      5) Overall, the phenotypes in Figs 5-8 were rather weak after DMAPT treatment, which are universal challenges to spinal cord regeneration. The authors may present this section of the data with further clarification on the selection standards in the methods, such as how the animals and treatment were selected and how a double-blinded experimental design may help further evaluate the effects of DMAPT treatment. I found little relevant information in the current manuscript.

      In the anatomic and functional regeneration analysis presented in Figures 5-8, we only included animals with a BMS score of 0 one day after the spinal cord crush, indicating a complete absence of hind leg movement. Furthermore, we employed immunohistochemical staining to ensure that no serotonergic axons were detected at 8-10 mm from the lesion site in any of the animals, thus confirming the thoroughness of the lesion (Supplementary Fig. 4). Both the evaluation of the BMS score and the assessment of anatomical regeneration were conducted in a double-blinded manner, ensuring unbiased and objective observations. To address this concern, we will add the following paragraph to the Materials and Methods section:

      “Blinding procedure for in vivo experiments: Before the start of the experiment, individual vials containing DMAPT or vehicle (DMSO) stock solution were prepared for each experimental animal. The vials were randomized by a person who was neither involved in the implementation nor in the evaluation of the experiments. These numbers were randomly distributed to mice of the same age and sex in different cages. This was carried out independently by another person who was neither involved in the data evaluation nor the randomization of the samples. This was followed by the execution of the experiments and the evaluation by scientists who were not involved in any randomization processes and did not know the identity of the injected samples. After completion of the data collection, values from mice with signs of spared axons were first removed from the data set for quality assurance. The criteria for this were a BMS score of a maximum of 0-1 on the first day after the lesion and the absence of uninjured serotonergic axons in spinal cord cross-sections >9-10 mm distal to the lesion site. Finally, the data points were assigned to the respective experimental groups by the person who initially blinded the vials.”
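
      The allocation step of a blinding procedure like the one quoted above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' actual workflow: the function name `make_blinding_key`, the balanced two-group split, the vial numbering scheme, the group size, and the fixed seed are all hypothetical. The key that maps vial codes to treatments would be held by the person who prepared the vials and consulted only after data collection.

```python
import random


def make_blinding_key(n_animals, treatments=("DMAPT", "vehicle"), seed=None):
    """Return a dict mapping coded vial numbers to treatments.

    Hypothetical helper: builds a balanced list of treatments, shuffles it,
    and labels the vials 1..n so the label carries no group information.
    """
    if n_animals % len(treatments) != 0:
        raise ValueError("n_animals must be divisible by the number of treatments")
    per_group = n_animals // len(treatments)
    allocation = [t for t in treatments for _ in range(per_group)]
    rng = random.Random(seed)  # local generator; the seed is only for reproducibility
    rng.shuffle(allocation)
    return {vial: treatment for vial, treatment in enumerate(allocation, start=1)}


# Example with an assumed group size; experimenters see only the vial codes.
key = make_blinding_key(8, seed=1)
print(sorted(key))  # vial codes handed to the experimenters
```

      Keeping the key separate from the people who inject and score the animals is what makes the design blinded; the shuffle itself only guarantees that the vial number reveals nothing about the treatment.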

      6) Several supplemental figures are discussed as critical elements of the studies performed. The authors are encouraged to include figures discussed as primary data as primary figures in the manuscript and provide the necessary information regarding experimental design and methods, including "n".

      Thank you for the suggestion.

      7) While the "n" is clear for some subsets of figures (as noted in the rebuttal), it is not clear for all outcomes/figure subsets. For example, it appears that some outcomes were performed in only a subset of the total experimental population and not in the context of a statistically significant result. A good example of this is the figure for in vivo suboptimal dosing. The experimental design suggests n=7-10, but the group considered suboptimal due to statistical insignificance is listed as n=4. Is this an entirely separate cohort? If so, is n=4 sufficient and was it considered statistically in the context of the higher-powered cohorts? The lack of clarity regarding experimental design should be addressed.

      To ensure transparency we have provided all n-numbers for each outcome and figure subset. Additionally, the precise n-numbers can be inferred by observing the number of individual points depicted in the graphs. All statistical data are appropriately indicated in the figure legends for reference.

      The data presented in suppl. Fig. 3 represent a preliminary experiment to find effective doses of DMAPT in vivo. In this initial phase, we tested three different doses of DMAPT (0.2, 2, 20 µg/kg) in a reduced group size of only four animals per group. This reduction in animal numbers aligns with the principles of reduction, refinement, and replacement, which aim to minimize the use of animals in research. Subsequently, the group demonstrating the most robust effect (2 µg/kg) was expanded by including additional animals to meet the a priori calculated sample size and validate the results. These additional animal data are presented in Figure 4 A-C. In the case of suppl. Fig. 3 A, B, the statistical analysis indicated a significant effect in A with n=4. As a result, there was no need to utilize additional animals for this particular experiment.

      Gaps:

      1) By in vitro studies, the authors showed that hIL-6 treatment or PTEN knockout elevated microtubule detyrosination. But when does this occur? In other words, is this a primary effect of these treatments or secondary to the increased axon growth? How does this fit with the observations that these interventions promote axon regeneration both in vitro and in vivo?

      This point also seems to be based on a misunderstanding. As shown by Western blot in Figure 2, detyrosination was increased in optic nerves after intravitreal injection of AAV2-hIL-6. These optic nerves were uninjured! This indicates that the increased detyrosination is an effect of the treatment itself and does not occur due to axonal regeneration.

      hIL-6 and PTEN-KO nevertheless increase axonal regeneration because their positive effect on other signaling pathways, such as JAK/STAT3 and mTOR, ultimately predominates. Consequently, we show, for both PTEN-KO and hIL-6, that we can further enhance these positive effects by neutralizing the negative aspect of increased detyrosination using DMAPT.

      2) Is there any direct evidence for Akt and/or JAK/Stat3 to promote microtubule detyrosination?

      As described in our previous and cited work, hIL-6, in contrast to CNTF, promotes the activation of AKT (Leibinger et al. 2016). In Fig. 2, we have also shown that intravitreal hIL-6 treatment in the optic nerve leads to increased phosphorylation of GSK3, a substrate of AKT, and that tubulin detyrosination is increased.

      As far as we know, JAK/STAT3 has no direct effect on detyrosination.

      In cell culture, we have shown that activation of the JAK/STAT3 pathway by CNTF application does not change tubulin detyrosination in neurites (Fig. 1 H, I, M, N).

      In RGC cell bodies, DMAPT does not affect the phosphorylation of STAT3 and S6, and thus has no measurable effect on the JAK/STAT3 or mTOR pathways. Moreover, tubulin detyrosination in neuronal cell bodies is not affected by DMAPT.

      3) Empirical data linking in vivo regeneration with mechanisms delineated in in vitro studies is limited. The addition of such data (i.e. biochemical assays, relevant histology) would better enable interpretation of in vivo studies and improve cohesiveness of the work as a whole.

      The mechanistic links between hIL-6/PTEN signaling and tubulin detyrosination and the abrogation of the adverse effects by DMAPT have been extensively addressed in vitro, which has been positively highlighted here in several places. Indeed, the in vivo data were intended mainly to confirm that the mechanisms elaborated in vitro are relevant to axonal regeneration and functional restoration in vivo. Most importantly, our data demonstrate that systemic DMAPT application promotes axon regeneration in the CNS and improves functional recovery after a complete spinal cord injury. From a clinical point of view, this is important.

      4) DMAPT activities are not limited to microtubule detyrosination. These alternate activities should be considered, particularly in in vivo studies. Empirical evidence of the potential impact for these mechanisms in the retina, optic nerve, and systemically is strongly encouraged. In vitro studies or studies of a specific neuronal population are insufficient to extrapolate activities in an intact system.

      Parthenolide and DMAPT show a regenerative effect in the nanomolar range (cell culture) and a bell-shaped concentration-response curve. We show a close correlation between detyrosinated microtubules and regeneration (with and without hIL-6 or PTEN-KO), which is, in our opinion, convincing. A contribution of additional DMAPT effects to improved regeneration cannot be excluded, although it is unlikely. If such effects exist, their investigation would be beyond the scope of the current paper.

      5) How do the authors benchmark the DMAPT retreatment in the setting of PTEN (aav2-cre injection for cKO) and /or PTEN/SOCS3/CNTF dKO? Which are the best approaches to promote optic nerve regeneration? Would the authors expect DMAPT retreatment to be synergetic with PTENcKO?

      Based on our previous findings, we anticipate that DMAPT would exhibit a synergistic effect when combined with PTEN-KO, as demonstrated in our in vitro studies with cultured neurons. Additionally, synergistic effects between DMAPT and PTEN/SOCS3 dKO + CNTF are possible. While these hypotheses hold promise, our current paper primarily focuses on combining DMAPT with hIL-6, which has consistently shown remarkable efficacy as a standalone treatment in optic nerve regeneration.

      Furthermore, our rationale for combining DMAPT with hIL-6 rather than PTEN-KO stems from the fact that, unlike PTEN-KO, hIL-6 has been proven to enable functional recovery following complete spinal cord crush injuries (Leibinger et al., 2021).

      6) A cohesive discussion of findings would be beneficial. What can and cannot be elucidated from in vitro and in vivo studies? How does the in vivo effect compare to existing strategies? What are the limitations of the studies performed? Are there alternative explanations for the findings in vitro or in vivo?

      We appreciate these suggestions.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This important study advances our understanding of why diabetes is a risk factor for more severe Covid-19 disease. The authors offer solid evidence that cathepsin L is more active in diabetic individuals, that this higher activity is recapitulated at the cellular level in the presence of high glucose, and that high glucose leads to higher cathepsin L maturation. While not all aspects of the relationship between diabetes and cathepsin L (e.g., effects of metabolic acidosis) have been investigated, the work should be of interest to researchers in diabetes, virology, and immunology.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The study by He et al. investigates the increased susceptibility of diabetes patients to COVID-19. The paper raises the possibility that hyperglycemia-induced cathepsin L maturation could be one of the driving forces in this pathology, suggesting that an increased activity of CTSL leads to accelerated virus infection rates due to an elevated processing of the SARS-CoV-2 spike protein.

      In a clinical case-control study, the team found that the severity of COVID-19 infections was higher in diabetic patients, and their CTSL levels correlated well with the progression of the disease. They further showed an increase in CTSL activity in long-term as well as acute hyperglycemia. SARS-CoV-2 increasingly infected cells that were cultured in serum from diabetic patients; the same was observed using high glucose medium. No effect was observed in the medium with increased concentrations of insulin. CTSL knockout abolished the glucose-dependent increase in infection.

      Increased glucose levels did not correlate with an increase in CTSL transcription. Rather He et al. could show that high glucose levels led to CTSL translocation from the ER into the lysosome. It was the glucose-dependent processing of the protease to its active form which promoted infection.

      Strengths:

      It is a complete study starting from a clinical observation and ending on the molecular mechanism. A strength is certainly the wide selection of experiments. The clinical study to investigate the effect of glucose on CTSL concentrations in healthy individuals sets the stage for experiments in cell culture, animal models, and human tissue. The effect of CTSL knockout cell lines on glucose-induced SARS-CoV2 infection rates is convincing. Finally, the team used a combination of Western blots and confocal microscopy to identify the underlying molecular mechanisms. The authors manage to keep the diabetic condition at the center of their study and therefore extend on previous knowledge of glucose-induced CTSL activation and their consequences for COVID-19 infections. By doing so, they create a novel connection between CTSL involvement in SARS-CoV2 infections and diabetes.

      Weaknesses:

      (1) The authors suggest that hyperglycemia as a symptom of diabetes leads to an increased infection rate in those patients. Throughout their study, the team focuses on two select symptoms of a diabetic condition, hyperglycemia and hyperinsulinemia. The team acknowledges in the discussion that there could be various other reasons. Hyperglycemia can lead to metabolic acidosis and a shift in blood pH. As CTSL activity is highly dependent on pH, it would have been crucial to include this parameter in the study.

      We sincerely appreciate your valuable comment. We agree that hyperglycemia can lead to metabolic acidosis and alter blood pH. However, the normal range for blood pH in humans is relatively narrow, typically ranging from 7.35 to 7.45. In our study, we ensured that blood pH remained within this normal range for both diabetic and healthy control samples. To address your concern, we conducted experiments to investigate CTSL activity in response to pH fluctuations within this physiological range. The updated Fig. 4a now presents these findings, demonstrating consistent CTSL activity despite pH variations. Statistical analysis was performed using one-way ANOVA with Tukey’s post hoc test to ensure robustness. We have also amended the figure legend and provided corresponding descriptions in the final edition manuscript (line 15-18, page 7).

      Author response image 1.

      (2) The study rarely differentiates between cellular and extracellular CTSL activity. A more detailed explanation for the connection between the intracellular CTSL and serum CTSL in diabetic individuals, presumably via lysosomal exocytosis, could be helpful with regard to the final model to give a more complete picture.

      Thank you for your insightful comments. Previous studies have elucidated the process by which lysosomal CTSL is transported via vesicles and subsequently secreted from the cell membrane through exocytosis (references 1-5). To provide a more comprehensive understanding, we have incorporated this information on Fig. 6h, page 32 of the final edition manuscript. This addition aims to enhance clarity regarding the connection between intracellular and serum CTSL activity in diabetic individuals, particularly through lysosomal exocytosis.

      Author response image 2.

      References:

      (1) Reddy A et al. Plasma membrane repair is mediated by Ca(2+)-regulated exocytosis of lysosomes. Cell. 2001 Jul 27;106(2):157-69. doi: 10.1016/s0092-8674(01)00421-4. PMID: 11511344.

      (2) Hasanagic M et al. Different Pathways to the Lysosome: Sorting out Alternatives. Int Rev Cell Mol Biol. 2015;320:75-101. doi: 10.1016/bs.ircmb.2015.07.008. Epub 2015 Aug 19. PMID: 26614872.

      (3) Reiser J et al. Specialized roles for cysteine cathepsins in health and disease. J Clin Invest. 2010 Oct;120(10):3421-31. doi: 10.1172/JCI42918. Epub 2010 Oct 1. PMID: 20921628; PMCID: PMC2947230.

      (4) Jaiswal JK et al. Membrane proximal lysosomes are the major vesicles responsible for calcium-dependent exocytosis in nonsecretory cells. J Cell Biol. 2002 Nov 25;159(4):625-35. doi: 10.1083/jcb.200208154. Epub 2002 Nov 18. PMID: 12438417; PMCID: PMC2173094.

      (5) Coutinho MF et al. Mannose-6-phosphate pathway: a review on its role in lysosomal function and dysfunction. Mol Genet Metab. 2012 Apr;105(4):542-50. doi: 10.1016/j.ymgme.2011.12.012. Epub 2011 Dec 23. PMID: 22266136.

      (3) In the early result section, an effect of hyperglycemia on total CTSL concentrations is described, but the data is not very convincing. Over the course of the manuscript, the hypothesis shifts increasingly towards an increase in protease trans-localization and processing to the active form rather than a change in total protease amounts. The overall importance of CTSL concentrations remains questionable.

      Thank you for your insightful feedback. We have addressed your concerns regarding the impact of hyperglycemia on CTSL concentrations. Fig. 2h-j illustrate the effect of acute hyperglycemia on both CTSL concentration and activity in 15 healthy male volunteers over a 160-minute period. During this short timeframe, CTSL concentration remained stable, as evidenced by consistent RNA results from cells exposed to varying glucose levels (Supplementary Fig. 1). However, there was a significant increase in CTSL activity, indicating that glucose elevation rapidly triggers CTSL maturation through propeptide cleavage. This activation process occurs more rapidly than CTSL protein synthesis. In summary, acute hyperglycemia specifically elevates CTSL activity, while chronic hyperglycemia may impact both CTSL activity and concentration (Fig. 2a-d). Additionally, Tournu C, et al. (1998) (reference 1) and Shi Q, et al. (2022) (reference 2) have reported that increased glucose metabolism promotes the maturation and secretion of CTSL and other proteases. These findings align with our evidence that hyperglycemia drives CTSL maturation, as discussed at lines 10-25, page 12 in the final edition manuscript.

      References:

      (1) Tournu C et al. Glucose controls cathepsin expression in Ras-transformed fibroblasts. Arch Biochem Biophys. 1998 Dec 1;360(1):15-24. doi: 10.1006/abbi.1998.0916. PMID: 9826424.

      (2) Shi Q et al. Increased glucose metabolism in TAMs fuels O-GlcNAcylation of lysosomal Cathepsin B to promote cancer metastasis and chemoresistance. Cancer Cell. 2022 Oct 10;40(10):1207-1222.e10. doi: 10.1016/j.ccell.2022.08.012. Epub 2022 Sep 8. PMID: 36084651.

      Reviewer #2 (Public Review):

      Summary:

      In this study, the authors hypothesized that individuals with diabetes have elevated blood CTSL levels, which facilitates SARS-CoV-2 infection. The authors conducted in vitro experiments, revealing that elevated glucose levels promote SARS-CoV-2 infection in wild-type cells. In contrast, CTSL knockout cells show reduced susceptibility to high glucose-promoted effects. Additionally, the authors utilized lung tissue samples obtained from both diabetic and non-diabetic patients, along with db/db diabetic and control mice. Their findings indicate that diabetic conditions lead to an elevation in CTSL activity in both humans and mice.

      Strengths:

      The authors have effectively met their research objectives, and their conclusions are supported by the data presented. Their findings suggest that high glucose levels promote CTSL maturation and translocation from the endoplasmic reticulum to the lysosome, potentially contributing to diabetic comorbidities and complications.

      Weaknesses:

      (1) In Figure 1e, the authors measured plasma levels of COVID-19 related proteins, including ACE2, CTSL, and CTSB, in both diabetic and non-diabetic COVID-19 patients. Notably, only CTSL levels exhibited a significant increase in diabetic patients compared to non-diabetic patients, and these levels varied throughout the course of COVID-19. Given that the diabetes groups encompass both male and female patients, it is essential to ascertain whether the authors considered the potential impact of gender on CTSL levels. The diabetes groups comprised a higher percentage of male patients (61.3%) compared to the non-diabetes group, where males constituted only 38.7%.

      Thank you for your insightful feedback. In response to your concerns regarding the potential impact of gender on CTSL levels in diabetic and non-diabetic COVID-19 patients, we conducted analyses to address this issue. While our initial study involved 62 COVID-19 patients, with 31 having diabetes and 31 without, matching based on gender and age, we acknowledged the challenge of obtaining balanced gender distribution in both groups due to the difficulty of collecting blood samples from COVID-19 patients. To mitigate potential gender bias resulting from small sample sizes, we conducted a supplementary clinical study involving 122 non-COVID-19 volunteers, including 61 individuals with diabetes and 61 without. The percentage of males in the diabetes group was 50.8%, while in the healthy group, males constituted 44.3% (P value = 0.468), indicating no significant gender bias. We have incorporated this information into the discussion section on line 4-13, page 11 in the final edition manuscript, to provide clarity on this aspect of our study.
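
      As an illustrative check only, and not the authors' stated analysis: the reported percentages are consistent with 31 of 61 males in the diabetes group and 27 of 61 in the healthy group. Assuming these back-calculated counts and an uncorrected Pearson chi-square test on the 2x2 table (the response does not name the test used), the minimal Python sketch below gives a P value close to the reported 0.468. The function name is hypothetical.

```python
import math


def pearson_chi2_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (chi2, p). With 1 degree of freedom the upper-tail probability
    equals erfc(sqrt(chi2 / 2)), so no statistics library is required.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Assumed counts, back-calculated from the reported percentages:
# diabetes group 31 male / 30 female (50.8%),
# healthy group  27 male / 34 female (44.3%).
chi2, p = pearson_chi2_2x2(31, 30, 27, 34)
print(f"chi2 = {chi2:.3f}, P = {p:.3f}")
```

      A continuity-corrected or exact test would give a different P value, so agreement with the reported figure only suggests, and does not establish, which test was applied.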

      (2) Lines 145-149: "The results showed that WT Huh7 cell cultured in high glucose medium exhibited a much higher infective rate than those in low glucose medium. However, CTSL KO Huh7 cells maintained a low infective rate of SARS-CoV-2 regardless of glucose or insulin levels (Fig. 3f-h). Therefore, hyperglycemia enhanced SARS-CoV-2 infection dependent on CTSL." However, this evidence may be insufficient to support the claim that hyperglycemia enhances SARS-CoV-2 infection dependent on CTSL. The human hepatoma cell line Huh7 might not be an ideal model to validate the authors' hypothesis regarding high blood glucose promoting SARS-CoV-2 infection through CTSL.

      Thank you for your valuable feedback. We have addressed the concerns regarding the sufficiency of evidence supporting the claim that hyperglycemia enhances SARS-CoV-2 infection dependent on CTSL. Specifically, we have revised the expression to state, “Therefore, hyperglycemia enhanced SARS-CoV-2 infection through CTSL.” as suggested, in line 9, page 7 in the final edition manuscript. Additionally, we acknowledge the potential involvement of other bioactive factors, such as 1,5-anhydro-D-glucitol (1,5-AG), in mediating SARS-CoV-2 infection in patients with diabetes, as outlined in the discussion section from line 13-21, page 13 in the final edition manuscript.

      Regarding the choice of the human hepatoma cell line Huh7 as a model for investigating hyperglycemia-induced CTSL maturation and SARS-CoV-2 infection, we recognize the importance of tissue specificity and the liver’s significance as a target organ for COVID-19. Despite potential limitations, such as generalization of liver function abnormalities and lack of tissue specificity in SARS-CoV-2 impact, Huh7 cells offer practical advantages as a mature cell model for studying SARS-CoV-2 infection, including accessibility, susceptibility to infection, and stable proliferation (reference 1-3). We have elaborated on these considerations in the discussion section at line 19-23, page 11 in the final edition manuscript, to provide context for our choice of experimental model.

      References:

      (1) Gupta A et al. Extrapulmonary manifestations of COVID-19. Nat Med. 2020 Jul;26(7):1017-1032. doi: 10.1038/s41591-020-0968-3. Epub 2020 Jul 10. PMID: 32651579.

      (2) Nie X et al. Multi-organ proteomic landscape of COVID-19 autopsies. Cell. 2021 Feb 4;184(3):775-791.e14. doi: 10.1016/j.cell.2021.01.004. Epub 2021 Jan 9. PMID: 33503446; PMCID: PMC7794601.

      (3) Ciotti M et al. The COVID-19 pandemic. Crit Rev Clin Lab Sci. 2020 Sep;57(6):365-388. doi: 10.1080/10408363.2020.1783198. Epub 2020 Jul 9. PMID: 32645276.

      (3) The Abstract and Introduction sections lack effective organization.

      Thank you for your valuable comments. We have rewritten the Abstract and Introduction sections and incorporated the updated descriptions in the final edition manuscript.

      Reviewer #1 (Recommendations For The Authors):

      (1) When referring to diabetes, does this exclusively include diabetes type 2?

      Thank you for your inquiry. In our study, the term “diabetes” encompasses the condition of hyperglycemia in a broad sense, rather than specifically indicating type 1 diabetes (T1DM) or type 2 diabetes (T2DM). This broader definition aligns with the scope of our research objectives and findings, particularly observed in the cell experiments conducted. We have clarified this point in the revised discussion section, from line 6-9, page 12 in the final edition manuscript, to provide additional context for readers.

      (2) The titles of the individual paragraphs are not very strong and descriptive. More precise titles help to structure the paper better for the reader.

      Thank you for your valuable comments. We have rewritten the title of each section to make it more precise for readers and incorporated the updated descriptions in the manuscript.

      (3) Fig.3c, adding a 0 nM insulin control would be nice.

      Thank you for your suggestion. We have revised Fig. 3c according to your advice. The revised figure is located on page 29 of the final edition manuscript. The corresponding figure legend has also been revised.

      Author response image 3.

      (4) Fig.3e non-infection control would be nice.

      Thank you for your suggestion. We have incorporated your feedback by adding a non-infection control in Fig. 3e. In this revised figure, we included a measurement of SARS-CoV-2 pseudovirus infection assessed through the fluorescence captured by a reader. Cells infected by the pseudovirus exhibited activation of the firefly luciferase, resulting in the release of fluorescence. Conversely, non-infected control cells showed no fluorescence, with the reader recording a value of zero. The updated figure can now be found on page 29 in the final edition manuscript, and we have adjusted the corresponding figure legend accordingly.

      Author response image 4.

      (5) In Figure 5, the processing of CTSL in cells (b-c) strongly differs from processing in tissue (d-e) focusing on amounts of dc-mCTSL. Do you have an explanation for this? Overall, blots are hard to judge by eye and it would be nice to include blots with shorter exposure.

      Thank you for your insightful feedback. The differences observed in the processing of CTSL between cells (Fig. 5b) and tissues (Fig. 5d-e) may be attributed to the complexities inherent in tissue samples, which can impact the clarity of the images. Furthermore, in human tissue samples, it is pertinent to consider that patients in the diabetes group had their blood glucose levels controlled within or near the normal range prior to lung surgery. As a result, the evidence supporting CTSL maturation in human lung tissue blotting images may be less compelling. We have addressed this aspect in the revised results section (lines 10-13, page 9). Additionally, we will consider including blots with shorter exposure to enhance visual clarity in future studies.

      (6) Considering Fig2B and Figure S1, the evidence of an effect of hyperglycemia or high glucose medium on total CTSL protein concentration is not very strong. In my opinion, this claim in the results section for Fig2 should be revisited.

      Thank you for your valuable suggestion. We have revisited the section in question and made appropriate revisions. The original sentence has been modified to accurately reflect the findings: "We found that plasma CTSL activity was strongly positively correlated with chronic hyperglycemia indicated by HbA1c and was significantly higher in diabetic patients than in euglycemic individuals (Fig. 2a, c). Additionally, plasma CTSL concentration showed a positive trend with chronic hyperglycemia indicated by HbA1c (Fig. 2b, d)". These changes have been incorporated into the revised results section (lines 12-16, page 5).

      (7) Overall, data hinting to increased CTSL activity is stronger than protein amount. This being said, in hyperglycemia, blood pH can be affected (metabolic acidosis). As CTSL has higher activity at low pH, could the increase in activity be caused by a drop in pH? Can you include this aspect in your manuscript? For example, is there a pH difference in serum of nondiabetic vs diabetic patients?

      Thank you for your valuable input. We have already addressed the potential impact of pH changes on CTSL activity in our response to Weakness No. 1. As indicated, although hyperglycemia can lead to metabolic acidosis and changes in blood pH, the pH levels observed in our study remained within the normal range (7.35 to 7.45). Therefore, we conducted experiments to investigate CTSL activity in response to changes in pH, which showed consistent activity levels within this range. This information has been included in our revised manuscript (line 15-18, page 7).

      Reviewer #2 (Recommendations For The Authors):

      (1) The Abstract and Introduction sections lack effective organization. The manuscript's style resembles that of Cell Journal rather than aligning with the customary format of eLife.

      Thank you for your valuable comments. The Abstract and Introduction sections have been reorganized to be more precise for readers, and the changes have been included in our revised manuscript. Additionally, we have meticulously updated the manuscript's style to align with the standard format of eLife, especially the key resources table in the Materials and Methods section.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We would like to thank all the reviewers for their positive evaluation of our paper, as described in the Strengths section. We are also grateful for their helpful comments and suggestions, which we have addressed below. We believe that the manuscript has been significantly improved as a result of these suggestions. In addition to these changes, we also corrected some inconsistencies (statistical values in the last sentence of a Figure 5 caption) and sentences in the main text (lines 155, 452, 522) (these corrections did not affect the results).

      Fig. 5e: R=0.599, P<0.001 -> R=0.601, P=0.007

      L150: "the angle of stick tilt angle" -> "the angle of stick tilt"

      L437: "no such" -> "such"

      L522: "?" -> "."

      Reviewer #1 (Public Review):

      Summary/Strengths:

      This manuscript describes a stimulating contribution to the field of human motor control. The complexity of control and learning is studied with a new task offering a myriad of possible coordination patterns. Findings are original and exemplify how baseline relationships determine learning.

      Weaknesses:

      A new task is presented: it is a thoughtful one, but because it is a new one, the manuscript section is filled with relatively new terms and acronyms that are not necessarily easy to rapidly understand.

      First, some more thoughts may be devoted to the take-home message. In the title, I am not sure manipulating a stick with both hands is a key piece of information. Also, the authors appear to insist on the term ‘implicit’, and I wonder if it is a big deal in this manuscript and if all the necessary evidence appears in this study that control and adaptation are exclusively implicit. As there is no clear comparison between gradual and abrupt sessions, the authors may consider removing at least from the title and abstract the words ‘implicit’ and ‘implicitly’. Most importantly, the authors may consider modifying the last sentence of the abstract to clearly provide the most substantial theoretical advance from this study.

      Thank you for your positive comment on our paper. We agree with the reviewer that our paper used a lot of acronyms that might confuse the readers. As we have addressed below (in the rebuttal to the Results section), we have reduced the number of acronyms.

      Regarding the comment on the use of the word “implicit” in the title and the abstract, we believe that its use in this paper is very important and indispensable. One of our main findings was that the pattern of adaptation between the tip-movement direction and the stick-tilt angle largely followed that in the baseline condition when aiming at different target directions. This adaptation was largely implicit because participants were not aware of the presence of the perturbation as the amount of perturbation was gradually increased. This implicitness suggests that the adaptation pattern of how the movement should be corrected is embedded in the motor learning system. On the other hand, if this adaptation pattern was achieved on the basis of the explicit strategy of changing the direction of the tip-movement, the adaptation pattern that follows the baseline pattern is not at all surprising. For these reasons, we will continue to use the word "implicit".

      It seems that a substantial finding is the ‘constraint’ imposed by baseline control laws on sensorimotor adaptation. This seems to echo and extend previous work of Wu, Smith et al. (Nat Neurosci, 2014): their findings, which were not necessarily always replicated, suggested that the more participants were variable in baseline, the better they adapted to a systematic perturbation. The authors may study whether residual errors are smaller or adaptation is faster for individuals with larger motor variability in baseline. Unfortunately, the authors do not present the classic time course of sensorimotor adaptation in any experiment. The adaptation is not described as typically done: the authors should thus show the changes in tip movement direction and stick-tilt angle across trials, and highlight any significant difference between baseline, early adaptation, and late adaptation, for instance. I also wonder why the authors did not include a few no-perturbation trials after the exposure phase to study after-effects in the study design: it looks like a missed opportunity here. Overall, I think that showing the time course of adaptation is necessary for the present study to provide a more comprehensive understanding of that new task, and to re-explore the role of motor variability during baseline for sensorimotor adaptation.

      We appreciate the reviewer for raising these important issues.

      Regarding the learning curve, because the amount of perturbation was gradually increased except for Exp.1B, we were not able to obtain typical learning curves (i.e., the curve showing errors decaying exponentially with trials). However, it may still be useful to show how the movement changed with trials during adaptation. Therefore, following the reviewer's suggestion, we have added the figures of the time course of adaptation in the supplementary data (Figures S1, S2, S4, and S5).

      There are two reasons why our experiments did not include aftereffect quantification trials (i.e., probe trials). First, in the case of adaptation to a visual perturbation (e.g., visual rotation), probe trials are not necessary because the degree of adaptation can be easily quantified by the amount of compensation in the perturbation trials (however, in the case of dynamic perturbations such as force fields, the use of probe trials is necessary). Second, the inclusion of probe trials allows participants to be aware of the presence of the perturbation, which we would like to avoid.

      We also appreciate the interesting additional questions regarding the relevance of our work to the relationship between baseline motor variability and adaptation performance. As this topic, although interesting, is outside the scope of this paper, we concluded that we would not address it in the manuscript. In fact, the experiments were not ideal for quantifying motor variability in the baseline phase because participants had to aim at different targets, which could change the characteristics of motor variability. In addition, we gradually increased the size of the perturbation except for Exp.1B (see Author response image 1, upper panel), which could make it difficult to assess the speed of adaptation. Nevertheless, we think it is worth mentioning this point in this rebuttal. Specifically, we examined the correlation between baseline motor variability when aiming at the 0 deg target (tip-movement direction or stick-tilt angle) and adaptation speed in Exp 1A and Exp 1B (Author response image 1 and Author response image 2). To assess adaptation speed in Exp.1A, we quantified the slope of the tip-movement direction to a gradually increasing perturbation (Author response image 1, upper panel). The adaptation speed in Exp.1B was obtained by fitting the exponential function to the data (Author response image 2, upper panel). Although the statistical results were not completely consistent, we found that participants with greater motor variability at baseline tended to show faster adaptation, as shown in a previous study (Wu et al., Nat Neurosci, 2014).

      Author response image 1.

      Correlation between the baseline variability and learning speed (Experiment 1A). In Exp 1A, the rotation of the tip-movement direction was gradually increased by 1 degree per trial up to 30 degrees. The learning speed was quantified by calculating how quickly the direction of movement followed the perturbation (upper panel). The lower left panel shows the variability of the tip-movement direction versus learning speed, while the lower right panel shows the variability of the stick-tilt angle versus learning speed. Baseline variability was calculated as a standard deviation across trials (trials in which a target appeared in a 0-degree direction).

      Author response image 2.

      Correlation between the baseline variability and learning speed (Experiment 1B). In Exp 1B, the rotation of the tip-movement direction was abruptly applied from the first trial (30 degrees). The learning speed was calculated as a time constant obtained by exponential curve fitting. The lower left panel shows the variability of the tip-movement direction versus learning speed, while the lower right panel shows the variability of the stick-tilt angle versus learning speed. Baseline variability was calculated as a standard deviation across trials (trials in which a target appeared in a 0-degree direction).
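
      The three quantities described in these captions (baseline variability as a standard deviation across trials, a slope against a gradually increasing perturbation, and a time constant from an exponential fit) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' analysis code: all function names are hypothetical, the sample standard deviation is an assumption, and a log-linear least-squares fit stands in for whatever nonlinear curve-fitting routine was actually used.

```python
import math
import statistics


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def baseline_variability(values_deg):
    """Sample standard deviation across baseline trials (assumed estimator)."""
    return statistics.stdev(values_deg)


def gradual_adaptation_slope(perturbation_deg, compensation_deg):
    """Exp 1A-style measure: how closely compensation follows the ramped perturbation."""
    return ols_slope(perturbation_deg, compensation_deg)


def abrupt_time_constant(residual_error_deg):
    """Exp 1B-style measure: time constant (in trials) of an exponential decay.

    Fits log(error) = log(A) - t / tau by least squares, so all errors must be
    positive. This is a simplification of a full nonlinear exponential fit.
    """
    trials = list(range(len(residual_error_deg)))
    logs = [math.log(e) for e in residual_error_deg]
    return -1.0 / ols_slope(trials, logs)


# Synthetic examples (made-up numbers, for illustration only).
ramp = [float(t) for t in range(1, 31)]           # 1 deg per trial up to 30 deg
compensation = [0.8 * p for p in ramp]            # participant compensates 80%
errors = [30.0 * math.exp(-t / 5.0) for t in range(20)]  # abrupt 30 deg step

slope = gradual_adaptation_slope(ramp, compensation)
tau = abrupt_time_constant(errors)
sd = baseline_variability([1.2, -0.4, 0.9, -1.5, 0.3])
print(slope, tau, sd)
```

      Each participant would contribute one variability value and one speed value, and the across-participant correlation between the two is what the lower panels describe.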

      The distance between hands was fixed at 15 cm with the Kinarm instead of a mechanical constraint. I wonder how much this distance varied and more importantly whether from that analysis or a force analysis, the authors could determine whether one hand led the other one in the adaptation.

      Thank you very much for this important comment. Since the distance between the two hands was maintained by the stiff virtual spring (2000 N/m), it was kept almost constant throughout the experiments as shown in Author response image 3 (the averaged distance during a movement). The distance was also maintained during reaching movements (Author response image 4).
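
      To give a sense of scale for the 2000 N/m virtual spring, the sketch below assumes an ideal linear spring with a 15 cm rest length acting along the line between the two hands. The actual Kinarm implementation is not described in this response, so the function name and the spring model are assumptions for illustration.

```python
import math


def virtual_spring_force(left_xy, right_xy, stiffness=2000.0, rest_length=0.15):
    """Force (N) applied to the left hand by an ideal linear spring.

    Positions are in metres. A positive stretch pulls the left hand toward
    the right hand; the right hand receives the equal and opposite force.
    """
    dx = right_xy[0] - left_xy[0]
    dy = right_xy[1] - left_xy[1]
    dist = math.hypot(dx, dy)
    magnitude = stiffness * (dist - rest_length)
    return (magnitude * dx / dist, magnitude * dy / dist)


# A 1 mm stretch beyond the 15 cm rest length (hypothetical hand positions).
fx, fy = virtual_spring_force((0.0, 0.0), (0.151, 0.0))
print(fx, fy)
```

      Under this assumed model, a millimetre of deviation already produces a restoring force of about 2 N, which is in line with the statement that the distance stayed almost constant.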

      We also thank the reviewer for the suggestion regarding the force analysis. As shown in Author response image 5, we did not find a role for a specific hand for motor adaptation from the handle force data. Specifically, Author response image 5 shows the force applied to each handle along and orthogonal to the stick. If one hand led the other in adaptation, we should have observed a phase shift as adaptation progressed. However, no such hand specific phase shift was observed. It should be noted, however, that it was theoretically difficult to know from the force sensors which hand produced the force first, because the force exerted by the right handle was transmitted to the left handle and vice versa due to the connection by the stiff spring. 

      Author response image 3.

      The distance between hands during the task. We show the average distance between hands for each trial. The shaded area indicates the standard deviation across participants.

      Author response image 4.

      Time course changes in the distance between hands during the movement. The color indicates the trial epoch shown in the legend on the right.

      Author response image 5.

      The force profile during the movement (Exp 1A). We decomposed the force of each handle into the component along (upper panels) and orthogonal to the stick (lower panels). Changes in the force profiles in the adaptation phase are shown (left: left hand force, right: right hand force). The colors (magenta to cyan) indicate the trial epoch shown in the legend on the right.
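
      The decomposition described in this caption amounts to projecting each handle's force vector onto the stick direction and onto its perpendicular. A minimal sketch follows, assuming a planar force, a stick angle measured counterclockwise from the x-axis, and an orthogonal axis rotated 90 degrees counterclockwise from the stick. These sign conventions and the function name are assumptions, not taken from the manuscript.

```python
import math


def decompose_force(force_xy, stick_angle_rad):
    """Split a planar force into components along and orthogonal to the stick.

    Returns (along, orthogonal), both in the units of the input force.
    """
    fx, fy = force_xy
    ux, uy = math.cos(stick_angle_rad), math.sin(stick_angle_rad)  # stick unit vector
    along = fx * ux + fy * uy        # projection onto the stick direction
    orthogonal = -fx * uy + fy * ux  # projection onto the perpendicular direction
    return along, orthogonal


# Hypothetical example: a 2 N force along +y with the stick tilted 90 degrees.
along, orth = decompose_force((0.0, 2.0), math.pi / 2)
print(along, orth)
```

      Applying this to the force trace of each handle at every time sample would give the two families of profiles shown in the upper and lower panels.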

      I understand the distinction between task- and end-effector irrelevant perturbation, and at the same time results show that the nervous system reacts to both types of perturbation, indicating that they both seem relevant or important. In line 32, the errors mentioned at the end of the sentence suggest that adaptation is in fact maladaptive. I think the authors may extend the Discussion on why adaptation was found in the experiments with end-effector irrelevant and especially how an internal (forward) model or a pair of internal (forward) models may be used to predict both the visual and the somatosensory consequences of the motor commands.

      Thank you very much for your comment. As we already described in the discussion of the original manuscript (Lines 519-538 in the revised manuscript), two potential explanations exist for the motor system’s response to the end-effector irrelevant perturbation (i.e., stick rotation). First, the motor system predicts the sensory information associated with the action and attempts to correct any discrepancies between the prediction and the actual sensory consequences, regardless of whether the error information is end-effector relevant or end-effector irrelevant. Second, given the close coupling between the tip-movement direction and stick-tilt angle, the motor system can estimate the presence of end-effector relevant error (i.e., tip-movement direction) by the presence of end-effector irrelevant error (i.e., stick-tilt angle). This estimation should lead to the change in the tip-movement direction. As the reviewer pointed out, the mismatch between visual and proprioceptive information is another possibility; we have added a description of this point in the Discussion (Lines 523-526).

      Reviewer #1 (Recommendations For The Authors):

      Minor

      Line 16: “it remains poorly understood” is quite subjective and I would suggest reformulating this statement.

      We have reformulated this statement as “This limitation prevents the study of how….”  (Line 16).

      Introduction

      Line 49: the authors may be more specific than just saying ‘this task’. In particular, they need to clarify that there is no redundancy in studies where the shoulder is fixed and all movement is limited to a plane ... which turns out to truly happen in a limited set of experimental setups (for example: Kinarm exoskeleton, but not endpoint; Kinereach system...).

      We have changed this to “such a planar arm-reaching task” (Line 49).

      Line 61: large, not infinite because of biomechanical constraints.

      We have changed “an infinite” to “a large” (Line 61) and “infinite” to “a large number of” (legend in Fig. 1f).

      Lines 67-69: consider clarifying.

      We have tried to clarify the sentence (Lines 67-69).

      Results

      TMD and STA, and TMD-STA plane, are new terms with new acronyms that are not easy to immediately understand. Consider avoiding acronyms.

      We have reduced the use of these acronyms as much as possible. 

      “visual TMD–STA plane” -> “plane representing visual movement patterns” (Lines 179-180)

      “TMD axis” -> “x-axis” (Line 181, Line 190)

      “physical TMD–STA plane” -> “plane representing physical movement patterns” (Lines 182-187)

      “physical TMD–STA plane” -> “physical plane” (Line 191, Line 201, Lines 216-217, Line 254, Line 301, Line 315, Line 422, Line 511, and captions of Figures 4-9, S3)

      “visual TMD–STA plane” -> “visual plane” (Line 193, Line 241, Line 248, Line 300, Lines 313-314, and captions of Figures 4-9, S3)

      “STA axis” -> “y-axis” (Line 241)

      Line 169: please clarify the mismatch(es) that are created when the tip-movement direction is visually rotated in the CCW direction around the starting position (tip perturbation), whereas the stick-tilt angle remains unchanged.

      Thank you for pointing this out. We have clarified that the stick-tilt angle remains identical to the tilt of both hands (Lines 171-172).

      Discussion

      I understand the physical constraint imposed between the 2 hands with the robotic device, but I am not sure I understand the physical constraint imposed by the TMD-STA relationship.

      The phrase “physical constraint” meant the constraint on the movement in physical space. However, as the reviewer pointed out, this phrase could be confused with the constraint between the two hands. Therefore, we have avoided using the phrase “physical constraint” throughout the manuscript.

      Some work looking at 3-D movements should be used for Discussion (e.g. Lacquaniti & Soechting 1982; work by d’Avella A or Jarrasse N).

      Thank you for sharing this important information. We have cited these studies in Discussion (Lines 380-382). 

      Reviewer #2 (Public Review):

      Summary:

      The authors have developed a novel bimanual task that allows them to study how the sensorimotor control system deals with redundancy within our body. Specifically, the two hands control two robot handles that control the position and orientation of a virtual stick, where the end of the stick is moved into a target. This task has infinite solutions to any movement, where the two hands influence both tip-movement direction and stick-tilt angle. When moving to different targets in the baseline phase, participants change the tilt angle of the stick in a specific pattern that produces close to the minimum movement of the two hands to produce the task. In a series of experiments, the authors then apply perturbations to the stick angle and stick movement direction to examine how either tip-movement (task-relevant) or stick-angle (task-irrelevant) perturbations affect adaptation. Both types of perturbations affect adaptation, but this adaptation follows the baseline pattern of tip-movement and stick angle relation such that even task-irrelevant perturbations drive adaptation in a manner that results in task-relevant errors. Overall, the authors suggest that these baseline relations affect how we adapt to changes in our tasks. This work provides an important demonstration that underlying solutions/relations can affect the manner in which we adapt. I think one major contribution of this work will also be the task itself, which provides a very fruitful and important framework for studying more complex motor control tasks.

      Strengths:

      Overall, I find this a very interesting and well-written paper. Beyond providing a new motor task that could be influential in the field, I think it also contributes to studying a very important question - how we can solve redundancy in the sensorimotor control system, as there are many possible mechanisms or methods that could be used - each of which produces different solutions and might affect the manner in which we adapt.

      Weaknesses:

      I would like to see further discussion of what the particular chosen solution implies in terms of optimality.

      The underlying baseline strategy used by the participants appears to match the path of minimum movement of the two hands. This suggests that participants are simultaneously optimizing accuracy and minimizing some metabolic cost or effort to solve the redundancy problem. However, once the perturbations are applied, participants still use this strategy for driving adaptation. I assume that this means that the solution that participants end up with after adaptation actually produces larger movements of the two hands than required. That is - they no longer fall onto the minimum hand movement strategy - which was used to solve the problem. Can the authors demonstrate that this is either the case or not clearly? These two possibilities produce very different implications in terms of the results.

      If my interpretation is correct, such a result (using a previously found solution that no longer is optimal) reminds me of the work of Selinger et al., 2015 (Current Biology), where participants continue to walk at a non-optimal speed after perturbations unless they get trained on multiple conditions to learn the new landscape of solutions. Perhaps the authors could discuss their work within this kind of interpretation. Do the authors predict that this relation would change with extensive practice either within the current conditions or with further exploration of the new task landscape? For example, if more than one target was used in the adaptation phase of the experiment?

      On the other hand, if the adaptation follows the solution of minimum hand movement and therefore potentially effort, this provides a completely different interpretation.

      Overall, I would find the results even more compelling if the same perturbations applied to movements to all of the targets and produced similar adaptation profiles. The question is to what degree the results derive from only providing a small subset of the environment to explore.

      Thank you very much for pointing out this significant issue. As the reviewer correctly interprets, the physical movement patterns deviated from the baseline relationship as exemplified in Exp.2. However, this deviation is not surprising for the following reason. Under the perturbation that creates the dissociation between the hands and the stick, the motor system cannot simultaneously return both the visual stick motion and physical hands motion to the original motions: When the motor system tries to return the visual stick motion to the original visual motion, then the physical hands motion inevitably deviates from the original physical hands motion, and vice versa.  

      Our interpretation of this result is that the motor system corrects the movement to reduce the visual dissociation of the visual stick motion from the baseline motion (i.e., sensory prediction error), but this movement correction is biased by the baseline physical hands motion. In other words, the motor system attempts to balance the minimization of sensory prediction error and the minimization of motor cost. Thus, our results do not indicate that the final adaptation pattern is non-optimal, but rather reflect the attempts for optimization.
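
One way to picture this trade-off, offered only as an illustrative sketch and not as a model fitted in the study, is a cost with two terms. All symbols are assumptions introduced here for illustration: $\mathbf{u}$ is the physical hand motion, $\mathbf{u}_0$ is the baseline physical hand motion, $\mathbf{e}_{\mathrm{vis}}(\mathbf{u})$ is the deviation of the visual stick motion from the baseline visual motion (the sensory prediction error), and $\lambda \ge 0$ is a hypothetical weight on motor cost.

```latex
% Illustrative only: the quadratic form and the weight \lambda are assumptions.
J(\mathbf{u}) \;=\;
  \underbrace{\lVert \mathbf{e}_{\mathrm{vis}}(\mathbf{u}) \rVert^{2}}_{\text{sensory prediction error}}
  \;+\;
  \lambda\,
  \underbrace{\lVert \mathbf{u} - \mathbf{u}_{0} \rVert^{2}}_{\text{deviation from baseline hand motion}}
```

Under a perturbation that dissociates the hands from the stick, the motion that nulls the first term differs from $\mathbf{u}_0$, so for $\lambda > 0$ a minimizer of such a cost would lie between the two, which is the kind of compromise described above.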

      In the revised manuscript, we have added the description of this interpretation (Lines 515-517).

      Reviewer #2 (Recommendations For The Authors):

      The authors have suggested that the only study (line 472) that has also examined an end-effector irrelevant perturbation is the bimanual study of Omrani et al., 2013, which only examined reflex activity rather than adaptation. To clarify this issue - exactly what is considered end-effector irrelevant perturbations - I was wondering about the bimanual perturbations in Dimitriou et al., 2012 (J Neurophysiol) and the simultaneous equal perturbations in Franklin et al., 2016 (J Neurosci), as well as other recent papers studying task-irrelevant disturbances which aren’t discussed. I would consider these both to also be end-effector irrelevant perturbations, although again they only used these to study reflex activity and not adaptation as in the current paper. Regardless, further explanation of exactly what is the difference between task-irrelevant and end-effector irrelevant would be useful to clarify the exact difference between the current manuscript and previous work.

      Thank you for your helpful comments. We have included as references the studies by Dimitriou et al. (Line 490) and Franklin et al. (Lines 486-487), which use an end-effector irrelevant perturbation and the task-irrelevant perturbation condition, respectively. We have also added further explanation of the difference between task-irrelevant and end-effector irrelevant (Lines 344-352).

      Line 575: I assume that you mean peak movement speed

      We have added “peak”. (Line 597).

      Reviewer #3 (Public Review):

      Summary:

      This study explored how the motor system adapts to new environments by modifying redundant body movements. Using a novel bimanual stick manipulation task, participants manipulated a virtual stick to reach targets, focusing on how tip-movement direction perturbations affected both tip movement and stick-tilt adaptation. The findings indicated a consistent strategy among participants who flexibly adjusted the tilt angle of the stick in response to errors. The adaptation patterns are influenced by physical space relationships, guiding the motor system’s choice of movement patterns. Overall, this study highlights the adaptability of the motor system through changes in redundant body movement patterns.

      Strengths:

      This paper introduces a novel bimanual stick manipulation task to investigate how the motor system adapts to novel environments by altering the movement patterns of our redundant body.

      Weaknesses:

      The generalizability of the findings is quite limited. It would have been interesting to see if the same relationships were held for different stick lengths (i.e., the hands positioned at different start locations along the virtual stick) or when reaching targets to the left and right of a start position, not just at varying angles along one side. Alternatively, this study would have benefited from a more thorough investigation of the existing literature on redundant systems instead of primarily focusing on the lack of redundancy in endpoint-reaching tasks. Although the novel task expands the use of endpoint robots in motor control studies, the utility of this task for exploring motor control and learning may be limited.

      Thank you very much for the important comment. Given that there are many parameters (e.g., stick length, locations of hands, target position etc), one may wonder how the findings obtained from only one combination can be generalized to other configurations. In the revised manuscript, we have explicitly described this point (Lines 356-359). 

      Thus, the generalizability needs to be investigated in future studies, but we believe that the main results also apply to other configurations. Regarding the baseline stick movement pattern, the control with tilting the stick was observed regardless of the stick-tip positions (Author response image 6). Regarding the finding that the adapted stick movement patterns follow the baseline movement patterns, we confirmed the same results even when the other targets were used as the target for the adaptation (Author response image 7). 

      Author response image 6.

      Stick-tip manipulation patterns when the length of the stick varied. Top: 10 naïve participants moved sticks of different lengths. A target appeared in one of five directions, each represented by the color of the corresponding tip position. Regardless of the length of the stick and laterality, a similar relationship between tip-movement direction and stick-tilt angle was observed (middle: at peak velocity; bottom: at movement offset).

      Author response image 7.

      Patterns of adaptation when using the other targets. In the baseline phase, 40 naïve participants moved a stick tip to a peripheral target (24 directions). They showed a stereotypical relationship between the tip-movement direction and the stick-tilt angle (a bold gray curve). In the adaptation phase, participants were divided into four groups, each with a different target training direction (lower left, lower right, upper right, or upper left), and visual rotation was gradually imposed on the tip-movement direction. Irrespective of the target direction, the adaptation pattern of the tip-movement and stick-tilt followed the baseline relationship.

      We also thank you for your comment about studying the existing redundant systems. We understand the reviewer's concern about the usefulness of our task, but we believe that we have proposed a novel framework for motor adaptation in the redundant system. Future studies will be able to clarify how the knowledge gained from our task can be generally applied to understand the control and learning of the redundant system.

      Reviewer #3 (Recommendations For The Authors):

      Line 49: replace “uniquely” with primarily. A number of features of the task setup could affect the joint angles, from if/how the arm is supported, whether the wrist is fixed, alignment of the target in relation to the midline of the participant, duration of the task, and whether fatigue is an issue, etc. Your statement relates to fixed limb lengths of a participant, rather than standard reaching tasks as a whole. Not to mention the degree of inter- and intra-subject variability that does exist in point-to-point reaching tasks.

      Thank you for your helpful point. We have replaced “uniquely” with “primarily”. (Line 49).

      Line 72: the cursor is not an end-effector - it represents the end-effector.

      We have changed the expression to “the perturbation to the cursor representing the position of the end-effector” (Line 72).

      Lines 73 – 78: it would benefit the authors to consider the role of intersegmental dynamics.

      Thank you for your suggestion. We are not sure if we understand this suggestion correctly, but we interpret it to mean that the end-effector perturbation can be implemented by using a perturbation that considers the intersegmental dynamics. However, the implementation is not so straightforward, and the panels in Figure 1j,k are only conceptual for the end-effector irrelevant perturbation. Therefore, we have not described the contribution of intersegmental dynamics here.

      Lines 90 – 92: “cannot” should be “did not”, as the studies being referenced are already completed. This statement should be further unpacked to explain what they did do, and how that does not meet the requirement of redundancy in movement patterns.

      We have changed “cannot” to “did not” (Line 91). We have also added the description of what the previous studies had demonstrated (Line 88-90).

      Figure text could be enlarged for easier viewing.

      We have enlarged texts in all figures. 

      Lines 41 - 47: Interesting selection of supporting references. For the introduction of a novel environment, I would recommend adding the support of Shadmehr and Mussa-Ivaldi 1994.

      Thank you for your suggestion. We have added Shadmehr and Mussa-Ivaldi 1994 as a reference (Line 45).

      Line 49: “this task” is vague - the above references relate to a number of different tasks. For example, the authors could replace it with a reaching task involving an end-point robot.

      Thank you very much for your suggestion. As per the suggestion by Reviewer #1, we have changed this to “such a planar arm-reaching task” (Line 49).

      Line 60: “hypothetical limb with three joints” - in Figure 1a, the human subject, holding the handle of a robotic manipulandum does have flexibility around the wrist.

      Previous studies using planar arm-reaching tasks have constrained the wrist joint (e.g., Flash & Hogan, 1985; Gordon et al., 1994; Nozaki et al., 2006). We tried to emphasize this point as “participants manipulate a visual cursor with their hands primarily by moving their shoulder and elbow joints” (Line 42). In the revised manuscript, we have also emphasized this point in the legend of Figure 1a.

      Lines 93-108: this paragraph could be cleaned up more clearly stating that while the use of task-irrelevant perturbations has been used in the domain of reaching tasks, the focus of these tasks has not been specifically to address “In our task, we aim to exploit this feature by doing”

      Thank you very much for your helpful comments. To make this paragraph clear, we have modified some sentences (Line 100-104).

      Line 109: “coordinates to adapt” is redundant.

      We have changed this to “adapts” (Line 110).

      Lines 109-112: these sentences could be combined to have better flow.

      Thank you very much for your valuable suggestion. We have combined these two sentences for better flow (Lines 110-112).

      Line 113-114: consider rewording - “This is a redundant task because ...” to something like “Redundancy in the task is achieved by acknowledging that ....“.

      We have changed the expression according to the reviewer’s suggestion (Line 114).

      Line 118: Consider changing “changes” to “makes use of”.

      We have changed the expression (Line 119).

      Lines 346 - 348: grammar and clarity - “This redundant motor task enables the investigation of adaptation patterns in the redundant system following the introduction of perturbations that are either end-effector relevant, end-effector irrelevant, or both.“.

      Thank you very much again for your helpful suggestion of English expression. We have adopted the sentence you suggested (Line 354-356).

    1. Author response:

      The following is the authors’ response to the original reviews.

      We deeply appreciate the reviewer comments on our manuscript. We have proceeded with all the minor changes mentioned. We also want to emphasize three major points:

      (1) Reversine has been shown to have several off-target effects, including inducing apoptosis (Chen et al. J Bone Oncol. 2024).

      (2) Hypoxia varies from 2% to 6%. Our definition of hypoxia is an oxygen concentration of 5% with a CO<sub>2</sub> concentration of 5%, taking into consideration the standard levels of oxygen in IVF clinics. Physiological oxygen in the mouse varies from ~1.5% to 8%.

      (3) Natale et al. 2004 (Dev Bio) and Sozen et al. 2015 (Mech of Dev) described that inhibition of p38 deeply affects the development of pre-implantation embryos after the 8-cell stage. For this reason, comprehensively dissecting the interaction between p53, HIF1A and p38 during aneuploid stress is challenging. We do not rule out a dual function of p38 during lineage specification and in response to DNA damage.

      Reviewer #2 (Recommendations for the authors):

      (1) Line 69: Please add the species used in your cited publications (murine).

      Fixed

      (2) Line 72: Consider changing "Because" to "As".

      Fixed

      (3) Line 88: "from the nuclei" - please refer to where the reader may find the example provided (Figure S1A).

      Fixed

      (4) Line 89: This should be Figure S1B as no quantification is presented in S1A. S1A only contains examples of micronuclei.

      Fixed

      (5) Line 91: Refer to Figure S1A.

      Fixed

      (6) Line 91-93: Are these numbers correct? The query arises from the numbers presented in Figure S1B. Please define how the median was calculated; is it micronuclei CREST+ plus micronuclei CREST-?

      Fixed. We did not differentiate by the presence of CREST in these percentages.

      (7) Line 95: extra/missing bracket?

      Fixed

      (8) Line 88-91:

      [a] Regarding the number of cells with micronuclei in this text, please clarify your sample size and how the percentages were calculated as they currently do not align (e.g., are these the total number of embryos from a single experimental replicate?).

      [b] Also, different numbers are found here and in the figure legend: (DMSO-22/256 cells from 32 embryos; Rev-82/144 cells from 18 embryos; AZ-182/304 cells from 38 embryos) vs. Fig S1 legend (DMSO-n=128 cells; Rev-72 cells; AZ-152 cells).

      [c] Is the median calculated using the numbers presented above? If yes, then the numbers do not tally, please check (DMSO-22/256 cells=8.6%; Rev-82/144 cells=56.9%; AZ-182/304 cells =59.9%) vs. Line 91-93: DMSO=12.5%, Rev=75%; AZ=62.5% blastomeres had micronuclei.

      The percentage represents the average of aneuploidy per embryo after normalization.

      See table for DMSO. This number represents the average proportion of aneuploid cells that each aneuploid embryo has. Notice that some embryos are fully diploid. Some have more than 12.5% (e.g., 25%). Most of the aneuploid embryos have 12.5% aneuploidy. The value is therefore not simply a count of how many aneuploid cells there are in the sample, but a measure of how aneuploid the aneuploid embryos in each sample are.

      Author response image 1.
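
A minimal sketch of why a pooled percentage and a per-embryo summary can disagree. The per-embryo counts below are hypothetical, chosen only so that they sum to the 22/256 cells from 32 embryos quoted above for DMSO; they are not the real data. Taking the median over affected embryos is likewise an assumption about the normalization, not a confirmed description of the procedure used.

```python
from statistics import median

CELLS_PER_EMBRYO = 8  # 256 cells / 32 embryos, as quoted for the DMSO group

# HYPOTHETICAL per-embryo counts of affected cells (illustration only):
# 14 embryos with none, 14 embryos with one, 4 embryos with two.
counts = [0] * 14 + [1] * 14 + [2] * 4

# Pooled percentage: affected cells over all cells in the sample.
total_cells = CELLS_PER_EMBRYO * len(counts)
pooled_pct = 100 * sum(counts) / total_cells

# Per-embryo percentage, restricted to embryos with at least one affected cell.
affected_pcts = [100 * c / CELLS_PER_EMBRYO for c in counts if c > 0]
median_affected_pct = median(affected_pcts)

print(f"pooled: {pooled_pct:.1f}%")                        # pooled: 8.6%
print(f"median per affected embryo: {median_affected_pct}%")  # 12.5%
```

With these made-up counts the pooled value matches 22/256 = 8.6%, while the per-embryo summary is 12.5% (one affected cell out of eight), which shows how both figures can describe the same sample.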

      (9) Line 108:

      [a] "n=28 per treatment" please clarify whether this refers to the number of embryos or cells and also add how many independent replicate experiments this data is representative of. As the text only refers to Figure 1C, you can remove the P-values for ** and *.

      Number of embryos. Fixed

      (10) Line 111: Suggest citing Figure 1C at the end of the sentence.

      Fixed

      (11) Line 118-119: the reference to figures require updating to ensure they refer to the appropriate figure; ...decidua (Figure S1C)...viable E9.5 embryos (Figure S1D).

      Fixed

      (12) Line 126: A description of the data in Figures 1D and 1E is missing. Also, consider describing the DNA damage observed in the DMSO control group. Visually, it appears that DNA damage reduces from the 8-cell to the morula stage (Figure 1E) but increases at the blastocyst stage (Figure S2A)? Point for discussion for a normal rate of DNA damage?

      Agreed, there is some DNA damage in the TE at the blastocyst stage.

      (13) Line 134: 8 EPI and 4 PE cells in what group?

      Fixed: DMSO-treated embryos

      (14) Line 137: Could this also suggest that AZ and reversine induce DNA damage through a different mechanism/pathway, resulting in the differential impact observed? Despite both being inhibitors of Mps1.

      This is a possibility.

      (15) Line 153: the legend for Figure 2A says the Welch t-test was performed, but the Mann-Whitney U-test was stated here. Which is correct?

      Welch’s t-test

      (16) Line 155: ...at the blastocyst stage. Compared to what?

      DMSO-treated embryos

      (17) Line 160: Data in Figure 2B requires the definition of P-values for , , . Please add one for and remove the one for **.

      Fixed

      (18) Line 173-174: Data in Fig. 4 requires the definition of the P-values for ****. Please remove the others.

      Fixed

      (19) Line 180: Instead of jumping across figures, this section would benefit from stating the numbers directly to allow for an accurate comparison, e.g. 64 and 7 in Figure 2D vs. X and Y in Figure 1C.

      (20) Line 187: Hif1a should be italicised.

      Fixed

      (21) Line 197: Based on the description here, I believe you are missing a reference to Figure 1A.

      Fixed

      (22) Line 203: Instead of jumping across figures, this section would benefit from stating the numbers directly to allow for accurate comparison, "particularly in the TE and PE" (67 vs 54; and 11 vs 6, respectively).

      (23) Line 209-210:

      [a] "...lowered the number of yH2AX foci..." is this a visual observation as quantification was performed for yH2AX intensity, not quantification of foci?

      A description for PARP1 levels in morula stage embryos was presented ("...relatively low in morula), but not for yH2AX levels at this stage of development. Missing description?

      Fixed

      (24) Line 235: This sentence would benefit from being specific about the environmental conditions...eg "Under normoxia, DMSO/AZ3146-treated...",

      (25) Line 238: The sentence should reference Figure 4F not 4G.

      Fixed

      (26) Line 242-243:

      [a] "slightly increased... in the TE (49.06%) and PE (50%) but, strikingly, reduced... EPI (33.3%)" compared to what and in which figure?

      Assuming you are comparing normoxia (4F) to hypoxia (4G), the numbers change for the TE (46.75% to 49.06%, respectively), EPI (42.88% to 33.3%, respectively), and PE (28.57% to 50%, respectively); yet these data were described as "strikingly different" for EPI (9.58 decrease) but only "slightly increased" for PE (21.42 increase). Suggest using appropriate adjectives to describe the results.

      Fixed

      (27) Line 256: It is stated in line 255 that treatment was performed at the zygote stage, yet this sentence says reversine treatment occurred at the 2-cell stage? Which is correct? Please amend appropriately. Refer to the comment below regarding adding a schematic to aid readers

      Fixed

      (28) Line 259: "n>27 per treatment" please clarify whether this refers to the number of embryos or cells and also add how many independent replicate experiments this data is representative of. Data in Figures S5A-B requires a definition of P-values for , . Please remove for *, *.

      Fixed

      (29) Line 261: AZ3146/reversine stated here, the figure shows Reversine/AZ3146. Please consider being consistent.

      Fixed

      (30) Line 263: "... normal morphology and cavitation (Figure S5D); however the image presented for Rev/DMSO and Rev/AZ3146 chimeras appear smaller with a distorted/weird shape when compared to DMSO/AZ. I believe the description does not match the images presented.

      Fixed

      (31) Line 267: "...similar results as 8-cell stage derived chimeras"; however, there is only a reference to Fig S5E which depicts 2-cell/zygote stage (see comment above for line 256 regarding required clarification of stage of treatment) derived chimeras. There is also a missing reference to Figure 4B, D, and/or F?

      Fixed

      (32) Line 271: add a reference to Figure S5E.

      Fixed

      (33) Line 283: "AZ3146/reversine" should be "Reversine/AZ3146" to match the figure.

      Fixed

      (34) Line 284: Figures 5E-F show both morphology and cavitation; the text should reflect this.

      Fixed

      (35) Line 281-285: I think this text requires editing to improve clarity. It is difficult for this reader to understand the authors' interpretation of the results....inhibiting HIF1A reduces morphology and cavitation. That's correct. However, this also diminished the contribution of AZ3146-treated cells to all 3 cell lineages; this is not quite accurate. AZ3146-treated cells were significantly reduced in total cell numbers because TE was significantly reduced. It is not appropriate to generalise this result to all 3 lineages, as EPI and TE appear to increase AZ's contribution following IDF treatment, albeit non-statistically significant.

      Fixed

      (36) Line 320: citation? ....reversine-treated embryos. Is this referring to your previous publication...Bolton 2016?

      Fixed

      (37) Line 344: missing space between 7.5 and IU.

      Fixed

      (38) Line 358: animal ethics approval number/code missing.

      Fixed

      (39) Line 397: missing space between "...previously" and "(Bermejo...".

      Fixed

      (40) Line 417: missing space between "...control" and "(Gu et...".

      Fixed

      (41) Line 421: missing space between "protocol" and "(Eakin...".

      Fixed

      (42) Line 427-429: Medium-grade mosaic chimeras were referred to as DMSO:AZ:Rev (3:3:2) here; but Figure 4 and associated legend says otherwise. Please amend appropriately. Were all medium mosaics generated in this manner? As I could only find Rev/AZ chimeras; my understanding of the Rev/AZ chimeras is 1:1 Rev:AZ instead of 3:2:3 DMSO:Rev:AZ.

      Fixed

      (43) Line 428: "reversine-treaded: please correct spelling.

      Fixed

      (44) Line 593: "n=28 per treatment" Please clarify whether this refers to the number of embryos or cells and also add how many independent replicate experiments this data is representative of.

      Fixed

      (45) Line 597: "through morula stage" when compared to what group?

      DMSO-treated embryos

      (46) Line 598: Data in Figure S5A-B requires the definition of P-values for , , **. Please remove for . Please define the error bars. SEM/95% confidence interval?

      Fixed

      (47) Line 604-607: Regarding 2B, no statistical test is stated yet Mann-Whitney was stated in Line 160 of the results section. Please confirm which test was used and include it in both sections for consistency.

      Fixed

      (48) Line 608: "Chemical downregulation of HIF1A"... this is not described in the results/methods section or shown in the figure. Please amend all sections for accuracy.

      Fixed

      (49) Line 613: please change "effect in" to "effect on".

      Fixed

      (50) Line 614: Please clarify the number of embryos or cells and also add how many independent replicate experiments this data is representative of. Data in Figure 2 also requires a definition of P-value for ****.

      Fixed

      (51) Line 625: Please clarify the number of embryos or cells and also add how many independent replicate experiments this data is representative of. Data in Figure 3 also requires a definition of P-value for ****.

      Fixed

      (52) Line 627: description requires editing to improve accuracy "...is only slightly increased at the 8-cell stage after exposure to reversine and AZ3146". However, the results show significantly higher DNA damage with Reversine treatment, but not with AZ when compared to DMSO. Please amend accordingly.

      Fixed

      (53) Line 629: Please define the error bars. SEM/95% confidence interval?

      Fixed

      (54) Line 634-635: it is written here that chimeras were made from 1:1 DMSO/AZ3146 and Reversine/DMSO; but Figure 4A shows 1:1 DMSO(grey):AZ3146(blue), and Reversine(red):AZ3146(blue), which contradicts the legend + method section; see comments for Line 427-429. Please amend these sections accordingly.

      Fixed

      (55) Line 648: reversine/AZ3146 chimeras? Refer to comments above.

      Fixed

      (56) Line 649-650: ...AZ-treated blastomeres contribute similarly to reversine-blastomeres to the TE and EPI but significantly increase contribution to the EPI? Please add the appropriate comparison group.

      Fixed

      (57) Line 652: Please clarify the number of embryos or cells and also add how many independent replicate experiments this data is representative of.

      Fixed

      (58) Line 664: Please clarify the number of embryos or cells and also add how many independent replicate experiments this data is representative of.

      Fixed

      (59) Line 675-677: FigS1B legend requires a definition of P-value for * and ****, can omit **

      Fixed

      (60) Line 678-680: FigS1C and S1D legend: sample size and replicates? Only mentioned in Lines 117-120, which requires back calculation.

      Fixed

      (61) Line 682-694: (1) Fig. S2B legend: missing P-value description for *** and ***; statistical test not stated, please add. Also, Figure S2E, only requires the definition for , and can omit others.

      Fixed

      (62) Line 702: FigS3B: missing description for ****, omit others.

      Fixed

      (63) Line 704-705: missing description for Rev/AZ group and hypoxia vs. normoxia conditions.

      Fixed

      (64) Line 712-713: "n>27 per treatment" Please clarify whether this refers to the number of embryos or cells and also add how many independent replicate experiments this data is representative of. Data in Figure S5 requires the definition of P-values for , . Please remove for *, *.

      Fixed

      (65) Line 713-715: could benefit from a description of which were marked from mTmG; e.g. why is DMSO, Rev, Rev in Green for [D]; does this mean 2-cell stage chimeras were only made with embryos treated with DMSO and Reversine? Has it been tested if you did this with AZ3146, do the proportions remain the same? This would be interesting to know.

      DMSO and reversine are shown in green because those are the cells marked with green in the chimeras. We also made chimeras with AZ3146. We hope this clarifies the point.

      (66) Line 719-721: why is there a difference between the proportion of aneuploid cells for the different chimeras? AZ in D/AZ, and R/AZ groups; while only R in D/R group? Is this because you only count those that were marked with mTmG (e.g. based on [Fig S5D])?

      (67) Line 724: low- and medium-grade chimeras would indicate quality, recommend adding low/medium grade aneuploid/mosaic chimeras.

      Fixed

      (68) Line 725-729: it may be my mistake, but I think the results description is not found within the Results section, but only here in the legend? Please include this detail also in the Results section.

      Fixed

      (69) Line 729: which is AZ or Rev cells?

      (70) References - Page number missing for some references; abbreviated version vs. non abbreviated version of journal titles used. Please be consistent/meet journal requirements.

      Fixed

      (71) Figures

      Figure 1: [C] both AZ-NANOG and DMSO-SOX17 have mean/median(?) of 11 cells (described in results), yet in this figure (on the same axis) these groups are not level. Are the numbers correct? This is also the case for Rev-SOX17 which is described in the results as having 8 cells yet appears to be above the 8 mark in the graphs; AZ-CDX2, which has 64 cells yet appears to be below the 60 mark; AZ-total, which has 82 cells yet appears to be below the 80 mark. In [E] the label orientation, "ns" has both horizontal and vertical orientation. Please make appropriate changes throughout to reflect accuracy.

      Figure 3: [C] As for Figure 1, DMSO-NANOG, which is described in results as having 14 cells, yet appears to be below the 13 mark in the graph; DMSO-SOX17, which has 6 cells yet appears to be above the 7 mark.

      This is due to averaging.

      Figure 4: [D and E] random numerals appear in the bars on the graph. 9,10 and 7, 14? Are these sample size numbers? If they are, they should appear in all bars/groups or in the legend.

      Yes, these are sample sizes

      Figure 5: [D and G] same comment as for Fig 4 above, random numbers in the graph.

      Yes, these are sample sizes

      (72) Supplementary figures. Figure S2 [A] No quantification? This is important to add as representative images are only a 2D plane, which can be easily misinterpreted. [E] Should the y-axis label be written as "Number of cells normalised to DMSO group", or similar? Or is there a figure missing to depict the ratio of cells in each cell lineage normalised to the DMSO group, which is the description written in the legend? But I don't see a figure showing the ratio, just the absolute number of cells. Is this a missing figure or a mislabelled axis?

      Quantification at the blastocyst stage is misleading due to high cellular heterogeneity.

      Reviewer #3 (Recommendations for the authors):

      (1) The statement in the abstract: "embryos with a low proportion of aneuploid cells have a similar likelihood of developing to term as fully euploid embryos" Line 48-50 Capalbo does not really answer as the biopsy may not be reflective of ICM.

      This is a great point. Trophectoderm biopsies may not reflect the real proportion of aneuploidy in the ICM. We emphasize this in the Discussion and Fig. S4.

      (2) Line 69/70, at least 50% Singla et al/Bolton. It would be helpful to elaborate a bit more on this study. How can this be assessed when analysis results in destruction?

      (3) Differences in the developmental potential of reversine versus AZ-treated embryos. It is not entirely clear why. The differences in non-dividing cells if any are small, and the -crest cells are rather minor also. Could these drugs have other effects that are not evaluated in the study?

      Yes. Specifically, reversine has been shown to have several off-target effects, including inducing apoptosis (Chen et al 2024).

      (4) Lines 45-46 understanding of reduction of aneuploidy should mention/discuss the paper of attrition/selection, of the kind by the Brivanlou lab for instance, or others. As well as allocation to specific lineages, including the authors' work.

      Dr. Brivanlou's experiments in gastruloids do not represent the same developmental stage as pre-implantation embryos. Comparison between the two models is debatable.

      (5) Line 53: human experiments are more limited due to access to samples. What does 'not allowed' mean? By who, where?

      The NIH does not allow experiments with human embryos for ethical reasons.

      (6) The figure callouts to S1A in lines 93,97. What is a non-dividing nucleus? For how long is it observed?

      A non-dividing nucleus is a round accumulation of DNA without a defined separation of the chromosomes and their specific kinetochores (CREST antibody). The presence of a non-dividing nucleus during the 4-to-8 cell stage can indicate activation of the spindle assembly checkpoint during prometaphase. An example of a non-dividing nucleus can be seen in Fig. S1B.

      (7) Line 108 A relatively minor effect on cell number and quality of blastocysts is observed. It is not surprising that thereafter, developmental potential is also high. At that stage, what are the individual cell karyotypes?

      Due to technical limitations, we can’t determine the specific karyotypes of these cells.

      (8) Line 153. The p53 increase of 1.3 fold is not dramatic.

      At the morula stage, p53 levels differ 7-fold. In contrast, at the blastocyst stage, a 1.3-fold change is indeed less dramatic. This may result from the elimination of aneuploid cells or from a mechanism that counters activation of the p53 pathway, such as overexpression of the Hif1a pathway.

      (9) Line 155. Is there a more direct way to test for p38 activation?

      Natale et al 2004 (Dev Biol) and Sozen et al 2015 (Mech of Dev) reported that inhibition of p38 strongly affects the development of pre-implantation embryos after the 8-cell stage. For this reason, comprehensively dissecting the interaction between p53, HIF1A and p38 during aneuploid stress is challenging. We do not rule out a dual function of p38 in lineage specification and in the response to DNA damage.

      (10) Line 191/192 Low oxygen conditions, is this equal to hypoxia? What is the definition of hypoxia here? The next sentence says physiological. Is that the same or different?

      Low oxygen can be defined as hypoxia, which ranges from 2% to 6% oxygen. Our definition of hypoxia is 5% oxygen with 5% CO<sub>2</sub>, taking into consideration the standard oxygen levels used in IVF clinics. Physiological oxygen in the mouse ranges from ~1.5% to 8%.

      (11) The question is whether there is something specific about HIF1 and aneuploidy, or whether another added stress would have similar effects on the competitiveness of treated cells.

      That would be a great follow-up to our work.

      (12) Line 300. Is p21 unregulated at the protein level or mRNA level? Please indicate.

      mRNA level.

      (13) Figure 1D/E H2Ax intensity is cell cycle phase-dependent. It might be meaningful to count foci by the nucleus and show both ways of analysis.

      (14) Check the spelling of phalloidin.

      Fixed in text and figures!

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Chang and colleagues used tetrode recordings in behaving rats to study how learning an audiovisual discrimination task shapes multisensory interactions in the auditory cortex. They found that a significant fraction of neurons in the auditory cortex responded to visual (crossmodal) and audiovisual stimuli. Both auditory-responsive and visually-responsive neurons preferentially responded to the cue signaling the contralateral choice in the two-alternative forced choice task. Importantly, multisensory interactions were similarly specific for the congruent audiovisual pairing for the contralateral side.

      Strengths:

      The experiments were conducted in a rigorous manner. Particularly thorough are the comparisons across cohorts of rats trained in a control task, in a unisensory auditory discrimination task, and the multisensory task, while also varying the recording hemisphere and behavioral state (engaged vs. anesthesia). The resulting contrasts strengthen the authors' findings and rule out important alternative explanations. Through the comparisons, they show that the enhancements of multisensory responses in the auditory cortex are specific to the paired audiovisual stimulus and specific to contralateral choices in correct trials and thus dependent on learned associations in a task-engaged state.

      We thank Reviewer #1 for the thorough review and valuable feedback.

      Weaknesses:

      The main result is that multisensory interactions are specific for contralateral paired audiovisual stimuli, which is consistent across experiments and interpretable as a learned task-dependent effect. However, the alternative interpretation of behavioral signals is crucial to rule out, which would also be specific to contralateral, correct trials in trained animals. Although the authors focus on the first 150 ms after cue onset, some of the temporal profiles of activity suggest that choice-related activity could confound some of the results.

      We thank the reviewer for raising this important point regarding the potential influence of choice-related activity on our results. In our experimental setup, it is challenging to completely disentangle the effects of behavioral choice from multisensory interaction. However, we conducted relevant analyses to examine the influence of choice-related components on multisensory interaction.

      First, we analyzed neural responses during incorrect trials and found a significant reduction in multisensory enhancement for the A<sup>10k</sup>-V<sup>vt</sup> pairing (Fig. 4). In contrast, for the A<sup>3k</sup>-V<sup>hz</sup> pairing, there was no strong multisensory interaction during either correct (right direction) or incorrect (left direction) choices. This finding suggests that the observed multisensory interactions are strongly associated with specific cue combinations during correct task performance.

      Second, we conducted experiments with unisensory training, in which animals were trained separately on auditory and visual discriminations without explicit multisensory associations. The results demonstrated that unisensory training did not lead to the development of selective multisensory enhancement or congruent auditory-visual preferences, as observed in the multisensory training group. This indicates that the observed multisensory interactions in the auditory cortex are specific to multisensory training and cannot be attributed solely to behavioral signals or choice-related effects.

      Finally, we specifically focused on the early 0-150 ms time window after cue onset in our main analyses to minimize contributions from motor-related or decision-related activity, which typically emerge later. This time window allowed us to capture early sensory processing while reducing potential confounds.

      Together, these findings strongly suggest that the observed choice-dependent multisensory enhancement is a learned, task-dependent phenomenon that is specific to multisensory training.

      The auditory stimuli appear to be encoded by short transient activity (in line with much of what we know about the auditory system), likely with onset latencies (not reported) of 15-30 ms. Stimulus identity can be decoded (Figure 2j) apparently with an onset latency around 50-75 ms (only the difference between A and AV groups is reported) and can be decoded near perfectly for an extended time window, without a dip in decoding performance that is observed in the mean activity Figure 2e. The dynamics of the response of the example neurons presented in Figures 2c and d and the average in 2e therefore do not entirely match the population decoding profile in 2j. Population decoding uses the population activity distribution, rather than the mean, so this is not inherently problematic. It suggests however that the stimulus identity can be decoded from later (choice-related?) activity. The dynamics of the population decoding accuracy are in line with the dynamics one could expect based on choice-related activity. Also the results in Figures S2e,f suggest differences between the two learned stimuli can be in the late phase of the response, not in the early phase.

      We appreciate the reviewer’s detailed observations and questions regarding the dynamics of auditory responses and decoding profiles in our study. In our experiment, primary auditory cortex (A1) neurons exhibited short response latencies that meet the established criteria for auditory responses in A1, consistent with findings from many other studies conducted in both anesthetized and task-engaged animals. While the major responses typically occurred during the early period (0-150ms) after cue onset (see population response in Fig. 2e), individual neuronal responses in the whole population were generally dynamic, as illustrated in Figures 2c, 2d, and 3a–c. As the reviewer correctly noted, population decoding leverages the distribution of activity across neurons rather than the mean activity, which explains why the dynamics of population decoding accuracy align well with choice-related activity. This also accounts for the extended decoding window observed in Figure 2j, which does not entirely match the early population response profiles in Figure 2e.

      To address the reviewer’s suggestion that differences between the two learned stimuli might arise in the late phase of the response, we conducted a cue selectivity analysis during the 151–300 ms period after cue onset. The results, shown below, indicate that neurons maintained cue selectivity in this late phase for each modality (Supplementary Fig. 5), though the selectivity was lower than in the early phase. However, interpreting this late-phase activity remains challenging. Since A<sup>3k</sup>, V<sup>hz</sup>, and A<sup>3k</sup>-V<sup>hz</sup> were associated with the right choice, and A<sup>10k</sup>, V<sup>vt</sup>, and A<sup>10k</sup>-V<sup>vt</sup> with the left choice, it is difficult to disentangle whether the responses reflect choice, sensory features, or a combination of both.

      To further investigate, we examined multisensory interactions during the late phase, controlling for choice effects by calculating unisensory and multisensory responses within the same choice context. Our analysis revealed no evident multisensory enhancement for any auditory-visual pairing, nor significant differences between pairings, unlike the robust effects observed in the early phase (Supplementary Fig. 5). We hypothesize that early responses are predominantly sensory-driven and exhibit strong multisensory integration, whereas late responses likely reflect task-related, choice-related, or combined sensory-choice activity, where sensory-driven multisensory enhancement is less prominent. As the focus of this manuscript is on multisensory integration and cue selectivity, we prioritized a detailed analysis of the early phase, where these effects are most prominent. However, the complexity of interpreting late-phase activity remains a challenge and warrants further investigation. We cited Supplementary Fig. 5 in the revised manuscript as follows:

      “This resulted in a significantly higher mean MSI for the A<sup>10k</sup>-V<sup>vt</sup> pairing compared to the A<sup>3k</sup>-V<sup>hz</sup> pairing (0.047 ± 0.124 vs. 0.003 ± 0.096; paired t-test, p < 0.001). Among audiovisual neurons, this biasing is even more pronounced (enhanced vs. inhibited: 62 vs. 2 in A<sup>10k</sup>-V<sup>vt</sup> pairing, 6 vs. 13 in A<sup>3k</sup>-V<sup>hz</sup> pairing; mean MSI: 0.119±0.105 in A<sup>10k</sup>-V<sup>vt</sup> pairing vs. 0.020±0.083 A<sup>3k</sup>-V<sup>hz</sup> pairing, paired t-test, p<0.00001) (Fig. 3f). Unlike the early period (0-150ms after cue onset), no significant differences in multisensory integration were observed during the late period (151-300ms after cue onset) (Supplementary Fig. 5).”

      First, it would help to have the same time axis across panels 2,c,d,e,j,k. Second, a careful temporal dissociation of when the central result of multisensory enhancements occurs in time would discriminate better early sensory processing-related effects versus later decision-related modulations.

      Thank you for this valuable feedback. Regarding the first point, we used a shorter time axis in Fig. 2j-k to highlight how the presence of visual cues accelerates the decoding process. This visualization choice was intended to emphasize the early differences in processing speed. For the second point, we have carefully analyzed multisensory integration across different temporal windows. The results presented in the Supplementary Fig. 5 (also see above) already address the late phase, where our data show no evidence of multisensory enhancement for any auditory-visual pairings. This distinction helps clarify that the observed multisensory effects are primarily related to early sensory processing rather than later decision-related modulations. We hope this addresses the concerns raised and appreciate the opportunity to clarify these points.

      In the abstract, the authors mention "a unique integration model", "selective multisensory enhancement for specific auditory-visual pairings", and "using this distinct integrative mechanisms". I would strongly recommend that the authors try to phrase their results more concretely, which I believe would benefit many readers, i.e. selective how (which neurons) and specific for which pairings?

      We appreciate the reviewer’s suggestion to clarify our phrasing for better accessibility. To address this, we have revised the relevant sentence in the abstract as follows:

      "This model employed selective multisensory enhancement for the auditory-visual pairing guiding the contralateral choice, which correlated with improved multisensory discrimination."

      Reviewer #2 (Public review):

      Summary

      In this study, rats were trained to discriminate auditory frequency and visual form/orientation for both unisensory and coherently presented AV stimuli. Recordings were made in the auditory cortex during behaviour and compared to those obtained in various control animals/conditions. The central finding is that AC neurons preferentially represent the contralateral-conditioned stimulus - for the main animal cohort this was a 10k tone and a vertically oriented bar. Over 1/3rd of neurons in AC were either AV/V/A+V and while a variety of multisensory neurons were recorded, the dominant response was excitation by the correctly oriented visual stimulus (interestingly this preference was absent in the visual-only neurons). Animals performing a simple version of the task in which responses were contingent on the presence of a stimulus rather than its identity showed a smaller proportion of AV stimuli and did not exhibit a preference for contralateral conditioned stimuli. The contralateral conditioned dominance was substantially less under anesthesia in the trained animals and was present in a cohort of animals trained with the reverse left/right contingency. Population decoding showed that visual cues did not increase the performance of the decoder but accelerated the rate at which it saturated. Rats trained on auditory and then visual stimuli (rather than simultaneously with A/V/AV) showed many fewer integrative neurons.

      Strengths

      There is a lot that I like about this paper - the study is well-powered with multiple groups (free choice, reversed contingency, unisensory trained, anesthesia) which provides a lot of strength to their conclusions and there are many interesting details within the paper itself. Surprisingly few studies have attempted to address whether multisensory responses in the unisensory cortex contribute to behaviour - and the main one that attempted to address this question (Lemus et al., 2010, uncited by this study) showed that while present in AC, somatosensory responses did not appear to contribute to perception. The present manuscript suggests otherwise and critically does so in the context of a task in which animals exhibit a multisensory advantage (this was lacking in Lemus et al.,). The behaviour is robust, with AV stimuli eliciting superior performance to either auditory or visual unisensory stimuli (visual were slightly worse than auditory but both were well above chance).

      We thank the reviewer for their positive evaluation of our study.

      Weaknesses

      I have a number of points that in my opinion require clarification and I have suggestions for ways in which the paper could be strengthened. In addition to these points, I admit to being slightly baffled by the response latencies; while I am not an expert in the rat, usually in the early sensory cortex auditory responses are significantly faster than visual ones (mirroring the relative first spike latencies of A1 and V1 and the different transduction mechanisms in the cochlea and retina). Yet here, the latencies look identical - if I draw a line down the pdf on the population level responses the peak of the visual and auditory is indistinguishable. This makes me wonder whether these are not sensory responses - yet, they look sensory (very tightly stimulus-locked). Are these latencies a consequence of this being AuD and not A1, or ... ? Have the authors performed movement-triggered analysis to illustrate that these responses are not related to movement out of the central port, or is it possible that both sounds and visual stimuli elicit characteristic whisking movements? Lastly, has the latency of the signals been measured (i.e. you generate and play them out synchronously, but is it possible that there is a delay on the audio channel introduced by the amp, which in turn makes it appear as if the neural signals are synchronous? If the latter were the case I wouldn't see it as a problem as many studies use a temporal offset in order to give the best chance of aligning signals in the brain, but this is such an obvious difference from what we would expect in other species that it requires some sort of explanation.

      Thank you for your insightful comments. I appreciate the opportunity to clarify these points and strengthen our manuscript. Below, I address your concerns in detail:

      We agree that auditory responses are typically faster than visual responses due to the distinct transduction mechanisms. However, in our experiment, we intentionally designed the stimulus setup to elicit auditory and visual responses within a similar time window to maximize the potential for multisensory integration. Specifically, we used pure tone sounds with a 15 ms ramp and visual stimuli generated by an LED array, which produce faster responses than the more commonly used light bars shown on a screen (see Supplementary Fig. 2a). The long ramp of the auditory stimulus slightly delayed auditory response onset, while the LED-generated bar (compared to a bar shown on a screen) elicited visual responses more quickly. This alignment likely facilitated the observed overlap in response latencies.

      The strong spontaneous activity of neurons in freely moving animals complicates the measurement of first spike latencies. Nevertheless, we can still infer the latency from robust cue-evoked responses. Supplementary Fig. 2b illustrates responses from an exemplar neuron (the same neuron as shown in Fig. 2c), where the auditory response begins 9 ms earlier than the visual response. The 28 ms auditory response latency observed here with a 15 ms-ramp auditory stimulus is consistent with prior studies in the primary auditory cortex, which usually use 5 ms-ramp pure tones and report latencies typically ranging from 7 to 28 ms. Across the population (n=559), auditory responses consistently reached 0.5 of the mean Z-scored response 15 ms earlier than visual responses (Supplementary Fig. 2c). The use of Gaussian smoothing in PSTHs supports the reliability of using the 0.5 threshold as an onset latency marker. We cited Supplementary Fig. 2 in the revised manuscript within the Results section as follows:

      “This suggests multisensory discrimination training enhances visual representation in the auditory cortex. To optimize the alignment of auditory and visual responses and reveal the greatest potential for multisensory integration, we used long-ramp pure tone auditory stimuli and quick LED-array-elicited visual stimuli (Supplementary Fig. 2). While auditory responses were still slightly earlier than visual responses, the temporal alignment was sufficient to support robust integration.”
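      As an illustration of the onset-latency criterion described in this response (the first time a Gaussian-smoothed, Z-scored response reaches 0.5 of its level), the following is a minimal sketch and not the analysis code used for Supplementary Fig. 2. It assumes that "0.5" means half of the post-cue peak of the smoothed trace, that bins are 1 ms wide, and that the input has already been Z-scored; the function names, the kernel width, and the step-shaped test traces are all hypothetical.

```python
import numpy as np

def gaussian_smooth(x, sigma_bins=2.0):
    """Smooth a 1-D trace with a Gaussian kernel (edge-padded to avoid edge dips)."""
    half = int(np.ceil(3 * sigma_bins))
    kernel = np.exp(-0.5 * (np.arange(-half, half + 1) / sigma_bins) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(np.asarray(x, dtype=float), half, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

def onset_latency(z_response, t_ms, fraction=0.5, sigma_bins=2.0):
    """First post-cue time at which the smoothed trace reaches fraction * peak.

    Assumption: the threshold is defined relative to the post-cue peak.
    Returns None when there is no positive post-cue response.
    """
    t_ms = np.asarray(t_ms)
    smoothed = gaussian_smooth(z_response, sigma_bins)
    post_cue = t_ms >= 0
    peak = smoothed[post_cue].max()
    if peak <= 0:
        return None
    crossed = np.flatnonzero(post_cue & (smoothed >= fraction * peak))
    return int(t_ms[crossed[0]])

# Synthetic step responses (illustrative only, not recorded data).
t = np.arange(-50, 150)                  # time relative to cue onset, in ms
auditory = np.where(t >= 30, 2.0, 0.0)   # hypothetical auditory trace
visual = np.where(t >= 45, 2.0, 0.0)     # hypothetical visual trace
lag = onset_latency(visual, t) - onset_latency(auditory, t)
```

      Because the kernel is symmetric, smoothing a sharp step leaves its half-height crossing in place, which is one reason a half-peak criterion is a convenient latency marker for smoothed PSTHs.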

      We measured the time at which rats left the central port and confirmed that these times occur significantly later than the neuronal responses analyzed (see Fig. 1c-d). While we acknowledge the potential influence of movements such as whisking, facial movements, head direction changes, or body movements on neuronal responses, precise monitoring of these behaviors in freely moving animals remains a technical challenge. However, given the tightly stimulus-locked nature of the neuronal responses observed, we believe they primarily reflect sensory processing rather than movement-related activity.

      To ensure accurate synchronization of auditory and visual stimuli, we verified the latencies of our signals. The auditory and visual stimuli were generated and played out synchronously with no intentional delay introduced. The auditory amplifier used in our setup introduces minimal latency, and any such delay would have been accounted for during calibration. Importantly, even if a small delay existed, it would not undermine our findings, as many studies intentionally use temporal offsets to facilitate alignment of neural signals. Nonetheless, the temporal overlap observed here is primarily a result of our experimental design aimed at promoting multisensory integration.

      We hope these clarifications address your concerns and highlight the robustness of our findings.

      Reaction times were faster in the AV condition - it would be of interest to know whether this acceleration is sufficient to violate a race model, given the arbitrary pairing of these stimuli. This would give some insight into whether the animals are really integrating the sensory information. It would also be good to clarify whether the reaction time is the time taken to leave the center port or respond at the peripheral one.

      We appreciate your request for clarification. In our analysis, reaction time (RT) is defined as the time taken for the animal to leave the center port after cue onset. This measure was chosen because it reflects the initial decision-making process and the integration of sensory information leading to action initiation. The time taken to respond at the peripheral port, commonly referred to as movement time, was not included in our RT measure. However, movement time data is available in our dataset, and we are open to further analysis if deemed necessary.

      To determine whether the observed acceleration in RTs in the audiovisual (AV) condition reflects true multisensory integration rather than statistical facilitation, we tested for violations of the race model inequality (Miller, 1982). This approach establishes an upper bound on the probability of a response occurring within a given time interval under the assumption that the auditory (A) and visual (V) modalities operate independently. Specifically, we calculated cumulative distribution functions (CDFs) for the RTs in the A, V, and AV conditions (please see Author response image 1). In some rats, the CDF of AV RTs exceeded the race model prediction at multiple time points, suggesting that the observed acceleration is not merely due to statistical facilitation but reflects true multisensory integration. Examples of these violations are shown in Author response image 1a-b. However, in other rats, the CDF of AV RTs did not exceed the race model prediction, as illustrated in Author response image 1c-d.

      This variability may be attributed to task-specific factors in our experimental design. For instance, the rats were not under time pressure to respond immediately after cue onset, as the task emphasized accuracy over speed. This lack of urgency may have influenced their behavioral responses and movement patterns. The race model is typically applied to assess multisensory integration in tasks where rapid responses are critical, often under conditions that incentivize speed (e.g., time-restricted tasks). In our study, the absence of strict temporal constraints may have reduced the likelihood of observing consistent violations of the race model. Furthermore, in our multisensory discrimination task, the requirement that animals discriminate multiple cues and make a behavioral choice may have introduced additional variability in the degree of integration observed across individual animals. Additionally, factors such as a decline in thirst levels and physical performance as the task progressed may have contributed to the variability in our results. These considerations are important for contextualizing the race model findings and interpreting the data within the framework of our experimental design.

      Author response image 1.

      Reaction time cumulative distribution functions (CDFs) and race model evaluation. (a) CDFs of reaction times (RTs) for auditory (blue), visual (green), and audiovisual stimuli (red) during the multisensory discrimination task. The summed CDF of the auditory and visual conditions (dashed purple, CDF_Miller) represents the race model prediction under independent sensory processing. The dashed yellow line represents the CDF of reaction times predicted by the race model. According to the race model inequality, the CDF for audiovisual stimuli (CDF_AV) should always lie below or to the right of the sum of CDF_A and CDF_V. In this example, the inequality is violated at nearly t = 200 ms, where CDF_AV is above CDF_Miller. (b) Data from another animal, showing similar results. (c, d) CDFs of reaction times for two other animals. In these cases, the CDFs follow the race model inequality, with CDF_AV consistently lying below or to the right of CDF_A + CDF_V.
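      The race model inequality test described in this response (Miller, 1982) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce Author response image 1: the function names and the time grid are hypothetical, the reaction times are synthetic, and the bound is the sum of the unisensory CDFs capped at 1.

```python
import numpy as np

def ecdf(rts, t_grid):
    """Empirical CDF: fraction of reaction times <= each time in t_grid."""
    rts = np.sort(np.asarray(rts, dtype=float))
    return np.searchsorted(rts, t_grid, side="right") / rts.size

def race_model_test(rt_a, rt_v, rt_av, t_grid):
    """Compare the audiovisual CDF with the Miller bound CDF_A + CDF_V.

    Returns the bound, the audiovisual CDF, and a boolean array that is True
    wherever CDF_AV exceeds the bound (a violation of the inequality).
    """
    t_grid = np.asarray(t_grid, dtype=float)
    bound = np.minimum(ecdf(rt_a, t_grid) + ecdf(rt_v, t_grid), 1.0)
    cdf_av = ecdf(rt_av, t_grid)
    return bound, cdf_av, cdf_av > bound

# Synthetic reaction times in ms (illustrative only, not data from the study).
t_grid = np.arange(150, 500, 10)
rt_a = [300, 320, 340, 360]
rt_v = [310, 330, 350, 370]
_, _, violated_fast = race_model_test(rt_a, rt_v, [200, 210, 220, 230], t_grid)
_, _, violated_slow = race_model_test(rt_a, rt_v, [400, 410, 420, 430], t_grid)
```

      In practice, a per-animal analysis would usually also ask whether violations are statistically reliable across the fast quantiles, since the empirical CDFs of small samples can cross the bound by chance.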

      The manuscript is very vague about the origin or responses - are these in AuD, A1, AuV... ? Some attempts to separate out responses if possible by laminar depth and certainly by field are necessary. It is known from other species that multisensory responses are more numerous, and show greater behavioural modulation in non-primary areas (e.g. Atilgan et al., 2018).

      Thank you for highlighting the importance of specifying the origin of the recorded responses. In the manuscript, we have detailed the implantation process in both the Methods and Results sections, indicating that the tetrode array was targeted to the primary auditory cortex. Using a micromanipulator (RWD, Shenzhen, China), the tetrode array was precisely positioned at stereotaxic coordinates 3.5–5.5 mm posterior to bregma and 6.4 mm lateral to the midline, and advanced to a depth of approximately 2–2.8 mm from the brain surface, corresponding to the primary auditory cortex. Although our recordings were aimed at A1, it is likely that some neurons from AuD and/or AuV were also included due to the anatomical proximity.

      In fact, in our unpublished data collected from AuD, we observed that over 50% of neurons responded to or were modulated by visual cues, consistent with findings from many other studies. This suggests that visual representations are more pronounced in AuD compared to A1. However, as noted in the manuscript, our primary focus was on A1, where we observed relatively fewer visual or audiovisual modulations in untrained rats.

      Regarding laminar depth, we regret that we were unable to determine the specific laminar layers of the recorded neurons in this study, a limitation primarily due to the constraints of our recording setup.

      Reviewer #3 (Public review):

      Summary:

      The manuscript by Chang et al. aims to investigate how the behavioral relevance of auditory and visual stimuli influences the way in which the primary auditory cortex encodes auditory, visual, and audiovisual information. The main result is that behavioral training induces an increase in the encoding of auditory and visual information and in multisensory enhancement that is mainly related to the choice located contralaterally with respect to the recorded hemisphere.

      Strengths:

      The manuscript reports the results of an elegant and well-planned experiment meant to investigate if the auditory cortex encodes visual information and how learning shapes visual responsiveness in the auditory cortex. Analyses are typically well done and properly address the questions raised.

      We sincerely thank the reviewer for their thoughtful and positive evaluation of our study.

      Weaknesses:

      Major

      (1) The authors apparently primarily focus their analyses of sensory-evoked responses in approximately the first 100 ms following stimulus onset. Even if I could not find an indication of which precise temporal range the authors used for analysis in the manuscript, this is the range where sensory-evoked responses are shown to occur in the manuscript figures. While this is a reasonable range for auditory evoked responses, the same cannot be said for visual responses, which commonly peak around 100-120 ms, in V1. In fact, the latency and overall shape of visual responses are quite different from typical visual responses, that are commonly shown to display a delay of up to 100 ms with respect to auditory responses. All traces that the authors show, instead, display visual responses strikingly overlapping with auditory ones, which is not in line with what one would expect based on our physiological understanding of cortical visually-evoked responses. Similarly, the fact that the onset of decoding accuracy (Figure 2j) anticipates during multisensory compared to auditory-only trials is hard to reconcile with the fact that visual responses have a later onset latency compared to auditory ones. The authors thus need to provide unequivocal evidence that the results they observe are truly visual in origin. This is especially important in view of the ever-growing literature showing that sensory cortices encode signals representing spontaneous motor actions, but also other forms of non-sensory information that can be taken prima facie to be of sensory origin. This is a problem that only now we realize has affected a lot of early literature, especially - but not only - in the field of multisensory processing. It is thus imperative that the authors provide evidence supporting the true visual nature of the activity reported during auditory and multisensory conditions, in both trained, free-choice, and anesthetized conditions. This could for example be achieved causally (e.g. via optogenetics) to provide the strongest evidence about the visual nature of the reported results, but it's up to the authors to identify a viable solution. This also applies to the enhancement of matched stimuli, that could potentially be explained in terms of spontaneous motor activity and/or pre-motor influences. In the absence of this evidence, I would discourage the author from drawing any conclusion about the visual nature of the observed activity in the auditory cortex.

      We thank the reviewer for highlighting the critical issue of validating the sensory origin of the reported responses, particularly regarding the timing of visual responses and the potential confound of motor-related activity.

      We analyzed neural responses within the first 150 ms following cue onset, as stated in the manuscript. This temporal window encompasses the peak of visual responses. The responses to visual stimuli occur predominantly within the first 100 ms after cue onset, preceding the initiation of body movements in behavioral tasks. This temporal dissociation aligns with previous studies, which demonstrate that motor-related activity in sensory cortices generally emerges later and is often associated with auditory rather than visual stimuli.

      We acknowledge that auditory responses are typically faster than visual responses due to distinct transduction mechanisms. However, in our experiment, we intentionally designed the stimulus setup to elicit auditory and visual responses within a similar time window to maximize the potential for multisensory integration. Specifically, we used pure tone sounds with a 15 ms ramp and visual stimuli generated by an LED array, which produce faster responses than the commonly used light bars shown on a screen. The long ramp of the auditory stimulus slightly delayed auditory response onset, while the LED-generated bar elicited visual responses more quickly (Supplementary Fig. 2). This alignment facilitated the observed overlap in response latencies. In neurons with robust visual responses, the first-spike latency was approximately 40 ms, as exemplified by a neuron with a low spontaneous firing rate and a strong, stimulus-evoked response (Supplementary Fig. 4). Across the population (n = 559 neurons), auditory responses reached 0.5 of the mean Z-scored response 15 ms earlier than visual responses on average (Supplementary Fig. 2). We cited Supplementary Fig. 4 in the Results section as follows:

      “Regarding the visual modality, 41% (80/196) of visually-responsive neurons showed a significant visual preference (Fig. 2f). The visual responses observed within the 0–150 ms window after cue onset were consistent and unlikely to result from visually evoked movement-related activity. This conclusion is supported by the early timing of the response (Fig. 2e) and exemplified by a neuron with a low spontaneous firing rate and a robust, stimulus-evoked response (Supplementary Fig. 4).”

      We acknowledge the growing body of literature suggesting that sensory cortices can encode signals related to motor actions or non-sensory factors. To address this concern, we emphasize that visual responses were present not only during behavioral tasks but also in anesthetized conditions, where motor-related signals are absent. Additionally, movement-evoked responses tend to be stereotyped and non-discriminative. In contrast, the visual responses observed in our study were highly consistent and selective to visual cue properties, further supporting their sensory origin.

      In summary, the combination of anesthetized and behavioral recordings, the temporal profile of responses, and their discriminative nature strongly support the sensory (visual) origin of the observed activity within the early response period. While the current study provides strong temporal and experimental evidence for the sensory origin of the visual responses, we agree that causal approaches, such as optogenetic silencing of visual input, could provide even stronger validation. Future work will explore these methods to further dissect the visual contributions to auditory cortical activity.

      (2) The finding that AC neurons in trained mice preferentially respond - and enhance - auditory and visual responses pertaining to the contralateral choice is interesting, but the study does not show evidence for the functional relevance of this phenomenon. As has become more and more evident over the past few years (see e.g. the literature on mouse PPC), correlated neural activity is not an indication of functional role. Therefore, in the absence of causal evidence, the functional role of the reported AC correlates should not be overstated by the authors. My opinion is that, starting from the title, the authors need to much more carefully discuss the implications of their findings.

      We fully agree that correlational data alone cannot establish causality. In light of your suggestion, we will revise the manuscript to more carefully discuss the implications of our findings, acknowledging that the preferred responses observed in AC neurons, particularly in relation to the contralateral choice, are correlational. We have updated several sentences in the manuscript to avoid overstating the functional relevance of these observations. Below are the revisions we have made:

      Abstract section

      "Importantly, many audiovisual neurons in the AC exhibited experience-dependent associations between their visual and auditory preferences, displaying a unique integration model. This model employed selective multisensory enhancement for the auditory-visual pairing guiding the contralateral choice, which correlated with improved multisensory discrimination."

      (Page 8, fourth paragraph in Results Section)

      "This aligns with findings that neurons in the AC and medial prefrontal cortex selectively preferred the tone associated with the behavioral choice contralateral to the recorded cortices during sound discrimination tasks, potentially reflecting the formation of sound-to-action associations. However, this preference represents a neural correlate, and further work is required to establish its causal link to behavioral choices."

      (rewrite 3rd paragraph in Discussion Section)

      "Consistent with prior research(10,31), most AC neurons exhibited a selective preference for cues associated with contralateral choices, regardless of the sensory modality. This suggests that AC neurons may contribute to linking sensory inputs with decision-making, although their causal role remains to be examined. "

      "These results indicate that multisensory training could drive the formation of specialized neural circuits within the auditory cortex, facilitating integrated processing of related auditory and visual information. However, further causal studies are required to confirm this hypothesis and to determine whether the auditory cortex is the primary site of these circuit modifications."

      MINOR:

      (1) The manuscript is lacking what pertains to the revised interpretation of most studies about audiovisual interactions in primary sensory cortices following the recent studies revealing that most of what was considered to be crossmodal actually reflects motor aspects. In particular, recent evidence suggests that sensory-induced spontaneous motor responses may have a surprisingly fast latency (within 40 ms; Clayton et al. 2024). Such responses might also underlie the contralaterally-tuned responses observed by the authors if one assumes that mice learn a stereotypical response that is primed by the upcoming goal-directed, learned response. Given that a full exploration of this issue would require high-speed tracking of orofacial and body motions, the authors should at least revise the discussion and the possible interpretation of their results not just on the basis of the literature, but after carefully revising the literature in view of the most recent findings, that challenge earlier interpretations of experimental results.

      Thank you for pointing out this important consideration. We have revised the discussion (paragraph 8-9) as follows:

      “There is ongoing debate about whether cross-sensory responses in sensory cortices predominantly reflect sensory inputs or are influenced by behavioral factors, such as cue-induced body movements. A recent study shows that sound-clip evoked activity in visual cortex has a behavioral rather than sensory origin and is related to stereotyped movements(48). Several studies have demonstrated that sensory neurons can encode signals associated with whisking(49), running(50), pupil dilation(51) and other movements(52). In our study, the responses to visual stimuli in the auditory cortex occurred primarily within a 100 ms window following cue onset. This early timing suggests that the observed responses likely reflect direct sensory inputs, rather than being modulated by visually-evoked body or orofacial movements, which typically occur with a delay relative to sensory cue onset(53).

      A recent study by Clayton et al. (2024) demonstrated that sensory stimuli can evoke rapid motor responses, such as facial twitches, within 50 ms, mediated by subcortical pathways and modulated by descending corticofugal input(56). These motor responses provide a sensitive behavioral index of auditory processing. Although Clayton et al. did not observe visually evoked facial movements, it is plausible that visually driven motor activity occurs more frequently in freely moving animals compared to head-fixed conditions. In goal-directed tasks, such rapid motor responses might contribute to the contralaterally tuned responses observed in our study, potentially reflecting preparatory motor behaviors associated with learned responses. Consequently, some of the audiovisual integration observed in the auditory cortex may represent a combination of multisensory processing and preparatory motor activity. Comprehensive investigation of these motor influences would require high-speed tracking of orofacial and body movements. Therefore, our findings should be interpreted with this consideration in mind. Future studies should aim to systematically monitor and control eye, orofacial, and body movements to disentangle sensory-driven responses from motor-related contributions, enhancing our understanding of motor planning’s role in multisensory integration.”

      (2) The methods section is a bit lacking in details. For instance, information about the temporal window of analysis for sensory-evoked responses is lacking. Another example: for the spike sorting procedure, limited details are given about inclusion/exclusion criteria. This makes it hard to navigate the manuscript and fully understand the experimental paradigm. I would recommend critically revising and expanding the methods section.

      Thank you for raising this point. We clarified the temporal window by including additional details in the methods section, even though this information was already mentioned in the results section. Specifically, we now state:

      (Neural recordings and Analysis in methods section)

      “...These neural signals, along with trace signals representing the stimuli and session performance information, were transmitted to a PC for online observation and data storage. Neural responses were analyzed within a 0–150 ms temporal window after cue onset, as this period was identified as containing the main cue-evoked responses for most neurons. This time window was selected based on the consistent and robust neural activity observed during this period.”

      We appreciate your concern regarding the spike sorting procedure. To address this, we have expanded the Methods section to provide more detailed information about the quality of our single-unit recordings, as shown below (Analysis of electrophysiological data in the Methods section):

      “Initially, the recorded raw neural signals were band-pass filtered in the range of 300-6000 Hz to eliminate field potentials. A threshold criterion, set at no less than three times the standard deviation (SD) above the background noise, was applied to automatically identify spike peaks. The detected spike waveforms were then subjected to clustering using template-matching and a built-in principal component analysis tool in a three-dimensional feature space. Manual curation was conducted to refine the sorting process. Each putative single unit was evaluated based on its waveform and firing patterns over time. Waveforms with inter-spike intervals of less than 2.0 ms were excluded from further analysis. Spike trains corresponding to an individual unit were aligned to the onset of the stimulus and grouped based on different cue and choice conditions. Units were included in further analysis only if their presence was stable throughout the session, and their mean firing rate exceeded 2 Hz. The reliability of auditory and visual responses for each unit was assessed, with well-isolated units typically showing the highest response reliability.”
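
      The numerical criteria in the passage above (a 3 SD detection threshold, a 2.0 ms minimum inter-spike interval, and a 2 Hz minimum mean firing rate) can be illustrated with a short sketch. This is not the sorting software used in the study: the function names and toy values are hypothetical, band-pass filtering and clustering are omitted, and the inter-spike-interval rule is implemented under one possible reading, namely that a spike following the previously kept spike by less than 2.0 ms is dropped.

```python
def detect_threshold_crossings(signal, noise_sd, k=3.0):
    """Return sample indices where the signal first rises above k * noise_sd."""
    threshold = k * noise_sd
    crossings = []
    above = False
    for i, x in enumerate(signal):
        if x > threshold and not above:
            crossings.append(i)  # record only the upward crossing
        above = x > threshold
    return crossings


def drop_short_isi(spike_times_s, min_isi_s=0.002):
    """Drop spikes that follow the previously kept spike by less than min_isi_s."""
    kept = []
    for t in sorted(spike_times_s):
        if not kept or t - kept[-1] >= min_isi_s:
            kept.append(t)
    return kept


def passes_rate_criterion(spike_times_s, session_duration_s, min_rate_hz=2.0):
    """True if the mean firing rate over the session exceeds min_rate_hz."""
    return len(spike_times_s) / session_duration_s > min_rate_hz


# Toy values for illustration only
crossings = detect_threshold_crossings([0, 1, 4, 5, 1, 0, 3.5, 0], noise_sd=1.0)
kept = drop_short_isi([0.000, 0.001, 0.010, 0.0115, 0.050])
included = passes_rate_criterion(kept, session_duration_s=1.0)
print(crossings, kept, included)
```

      The stability of a unit across the session and the manual curation step described in the passage are judgment-based and are not captured by this sketch.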

      Reviewer #1 (Recommendations for the authors):

      (1) Some of the ordering of content in the introduction could be improved. E.g. line 49 reflects statements about the importance of sensory experience, which is the topic of the subsequent paragraph. In the discussion, line 436, there is a discussion of the same findings as line 442. These two paragraphs in general appear to discuss similar content. Similarly, the paragraph starting at line 424 and at line 451 both discuss the plasticity of multisensory responses through audiovisual experience, as well as the paragraph starting at line 475 (but now audiovisual pairing is dubbed semantic). In the discussion of how congruency/experience shapes multisensory interactions, the authors should relate their findings to those of Meijer et al. 2017 and Garner and Keller 2022 (visual cortex) about enhanced and suppressed responses and their potential role (as well as other literature such as Banks et al. 2011 in AC).

      We thank the reviewer for their detailed observations and valuable recommendations to improve the manuscript's organization. Below, we address each point:

      We deleted the sentence, "Sensory experience has been shown to shape cross-modal presentations in sensory cortices" (Line 49), as the subsequent paragraph discusses sensory experience in detail.

      To avoid repetition, we removed the sentence, "This suggests that multisensory training enhances AC's ability to process visual information" (Lines 442–443).

      Regarding the paragraph starting at Line 475, we believe its current form is appropriate, as it focuses on the influence of semantic congruence on multisensory integration, which differs from the topics discussed in the other paragraphs.

      We have cited the three papers suggested by the reviewer in the appropriate sections of the manuscript.

      (Paragraph 6 in discussion section)

      “…A study conducted on the gustatory cortex of alert rats has shown that cross-modal associative learning was linked to a dramatic increase in the prevalence of neurons responding to nongustatory stimuli (24). Moreover, in the primary visual cortex, experience-dependent interactions can arise from learned sequential associations between auditory and visual stimuli, mediated by corticocortical connections rather than simultaneous audiovisual presentations (26).”

      (Paragraph 2 in discussion section)

      “...Meijer et al. reported that congruent audiovisual stimuli evoke balanced enhancement and suppression in V1, while incongruent stimuli predominantly lead to suppression(6), mirroring our findings in AC, where multisensory integration was dependent on stimulus feature…”

      (Paragraph 2 in introduction section)

      “...Anatomical investigations reveal reciprocal nerve projections between auditory and visual cortices(4,11-15), highlighting the interconnected nature of these sensory systems. Moreover, two-photon calcium imaging in awake mice has shown that audiovisual encoding in the primary visual cortex depends on the temporal congruency of stimuli, with temporally congruent audiovisual stimuli eliciting balanced enhancement and suppression, whereas incongruent stimuli predominantly result in suppression(6).”

      (2) The finding of purely visually responsive neurons in the auditory cortex that moreover discriminate the stimuli is surprising given previous results (Iurilli et al. 2012, Morrill and Hasenstaub 2018 (only L6), Oude Lohuis et al. 2024, Atilgan et al. 2018, Chou et al. 2020). Reporting the latency of this response is interesting information about the potential pathways by which this information could reach the auditory system. Furthermore, spike isolation quality and histological verification are described in little detail. It is crucial for statements about the auditory, visual, or audiovisual response of individual neurons to substantiate the confidence level about the quality of single-unit recordings and where they were recorded. Do the authors have data to support that visual and audiovisual responses were not restricted to posteromedial tetrodes or clusters with poor quality? A discussion of finding V-responsive units in AC with respect to literature is warranted. Furthermore, the finding that also in visual trials behaviorally relevant information about the visual cue (with a bias for the contralateral choice cue) is sent to the AC is pivotal in the interpretation of the results, which as far as I note not really considered that much.

      We appreciate the reviewer’s thoughtful comments and have addressed them as follows:

      Discussion of finding choice-related V-responsive units in AC with respect to literature and potential pathways

      3rd paragraph in the Discussion section

      “Consistent with prior research(10,31), most AC neurons exhibited a selective preference for cues associated with contralateral choices, regardless of the sensory modality. This suggests that AC neurons may contribute to linking sensory inputs with decision-making, although their causal role remains to be examined. Associative learning may drive the formation of new connections between sensory and motor areas of the brain, such as cortico-cortical pathways(35). Notably, this cue-preference biasing was absent in the free-choice group. A similar bias was also reported in a previous study, where auditory discrimination learning selectively potentiated corticostriatal synapses from neurons representing either high or low frequencies associated with contralateral choices(32)…”

      6th paragraph in the Discussion section

      “Our results extend prior findings(4,47), showing that visual input not only reaches the AC but can also drive discriminative responses, particularly during task engagement. This task-specific plasticity enhances cross-modal integration, as demonstrated in other sensory systems. For example, calcium imaging studies in mice showed that a subset of multimodal neurons in visual cortex develops enhanced auditory responses to the paired auditory stimulus following coincident auditory–visual experience(25)…”

      8th paragraph in the Discussion section

      “…In our study, the responses to visual stimuli in the auditory cortex occurred primarily within a 100 ms window following cue onset, suggesting that visual information reaches the AC through rapid pathways. Potential candidates include direct or fast cross-modal inputs, such as pulvinar-mediated pathways(8) or corticocortical connections(5,54), rather than slower associative mechanisms. This early timing indicates that the observed responses were less likely modulated by visually-evoked body or orofacial movements, which typically occur with a delay relative to sensory cue onset(55).”

      Response Latency

      Regarding the latency of visually driven responses, we have included this information in our response to the second reviewer’s first weakness (please see above). Briefly, we analyzed neural responses within a 0–150 ms temporal window after cue onset, as this period captures the most consistent and robust cue-evoked responses across neurons.

      Purely Visually Responsive Neurons in A1

      We agree that the finding of visually responsive neurons in the auditory cortex may initially seem surprising. However, these neurons might not have been sensitive to target auditory cues in our task but could still respond to other sound types. Cortical neurons are known to exhibit significant plasticity during cue discrimination tasks, as well as during passive sensory exposure. Thus, the presence of visually responsive neurons is not inconsistent with prior findings but highlights task-specific sensory tuning. We confirm that responses were not restricted to posteromedial tetrodes or low-quality clusters (see an example of a robust visually responsive neuron in Supplementary Fig. 4). Histological analysis verified electrode placements across the auditory cortex.

      For spike sorting, we have added detailed information in the text, as shown below:

      “Initially, the recorded raw neural signals were band-pass filtered in the range of 300-6000 Hz to eliminate field potentials. A threshold criterion, set at no less than three times the standard deviation (SD) above the background noise, was applied to automatically identify spike peaks. The detected spike waveforms were then subjected to clustering using template-matching and a built-in principal component analysis tool in a three-dimensional feature space. Manual curation was conducted to refine the sorting process. Each putative single unit was evaluated based on its waveform and firing patterns over time. Waveforms with inter-spike intervals of less than 2.0 ms were excluded from further analysis. Spike trains corresponding to an individual unit were aligned to the onset of the stimulus and grouped based on different cue and choice conditions. Units were included in further analysis only if their presence was stable throughout the session, and their mean firing rate exceeded 2 Hz. The reliability of auditory and visual responses for each unit was assessed, with well-isolated units typically showing the highest response reliability.”

      (3) In the abstract it seems that in "Additionally, AC neurons..." the connective word 'additionally' is misleading as it is mainly a rephrasing of the previous statement.

      Replaced "Additionally" with "Furthermore" to better signal elaboration and continuity.

      (4) The experiments included multisensory conflict trials - incongruent audiovisual stimuli. What was the behavior for these trials given multiple interesting studies on the neural correlates of sensory dominance (Song et al. 2017, Coen et al. 2023, Oude Lohuis et al. 2024).

      We appreciate your feedback and have addressed it by including a new figure (supplemental Fig. 8) that illustrates choice selection during incongruent audiovisual stimuli. Panel (a) shows that rats displayed confusion when exposed to mismatched stimuli, resulting in choice patterns that differed from those observed in panel (b), where consistent audiovisual stimuli were presented. To provide clarity and integrate this new figure effectively into the manuscript, we updated the results section as follows:

      “...Rats received water rewards with a 50% chance in either port when an unmatched multisensory cue was triggered. Behavioral analysis revealed that rats displayed notable confusion in response to unmatched multisensory cues, as evidenced by their inconsistent choice patterns (Supplementary Fig. 8).”

      (5) Line 47: The AC does not 'perceive' sound frequency, individual brain regions are not thought to perceive.

      We appreciate the reviewer’s observation and have revised the sentence to ensure scientific accuracy. The updated sentence in the second paragraph of the Introduction now reads:

      “Even irrelevant visual cues can affect sound discrimination in AC<sup>10</sup>.”

      (6) Line 59-63: The three questions are not completely clear to me. Both what they mean exactly and how they are different. E.g. Line 60: without specification, it is hard to understand which 'strategies' are meant by the "same or different strategies"? And Line 61: What is meant by the quotation marks for match and mismatch? I assume this is referring to learned congruency and incongruency, which appears almost the same question as number 3 (how learning affects the cortical representation).

      We have revised the three questions for improved clarity and distinction as follows:

      “This limits our understanding of multisensory integration in sensory cortices, particularly regarding: (1) Do neurons in sensory cortices adopt consistent integration strategies across different audiovisual pairings, or do these strategies vary depending on the pairing? (2) How does multisensory perceptual learning reshape cortical representations of audiovisual objects? (3) How does the congruence between auditory and visual features—whether they "match" or "mismatch" based on learned associations—impact neural integration?”

      (7) Is the data in Figures 1c and d only hits?

      Only correct trials are included. We have added this information to the Fig. 1 legend, which now reads:

      “c Cumulative frequency distribution of reaction time (time from cue onset to leaving the central port) for one representative rat in auditory, visual and multisensory trials (correct only). d Comparison of average reaction times across rats in auditory, visual, and multisensory trials (correct only).”

      (8) Figure S1b: Preferred frequency is binned in non-equidistant bins, neither linear nor logarithmic. It is unclear what the reason is.

      The edges of the bins for the preferred frequency were determined based on a 0.5-octave increment, starting from the smallest boundary of 8 kHz. Specifically, the bin edges were calculated as follows:

      8×2<sup>0.5</sup>=11.3 kHz;

      8×2<sup>1</sup>=16 kHz;

      8×2<sup>1.5</sup>=22.6 kHz;

      8×2<sup>2</sup>=32 kHz;

      This approach reflects the common practice of using changes in octaves to define differences between pure tone frequencies, as it aligns with the logarithmic perception of sound frequency in auditory neuroscience.
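
      The bin edges listed above follow directly from the 0.5-octave rule. As a minimal sketch (the function name and argument names are hypothetical and not taken from the analysis code), they can be reproduced as follows:

```python
def octave_bin_edges(f_min_khz, step_octaves, n_steps):
    """Return bin edges spaced by a fixed octave increment, starting at f_min_khz."""
    return [f_min_khz * 2 ** (step_octaves * k) for k in range(n_steps + 1)]


# Start at 8 kHz and take four 0.5-octave steps, as in the response above
edges = octave_bin_edges(8.0, 0.5, 4)
rounded = [round(e, 1) for e in edges]
print(rounded)
```

      Each edge is the previous edge multiplied by 2<sup>0.5</sup>, so the bins are equidistant on a logarithmic (octave) axis even though they look uneven on a linear kHz axis.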

      (9) Figure S1d: why are the responses of almost all neurons very strongly correlated given the frequency tuning of A1 neurons? Further, the mean normalized response presented in Figure S2e does seem to indicate a stronger response for 10kHz tones than 3kHz, in conflict with the data from anesthetized rats presented in Figure S2e.

      There is no discrepancy in the data. In Figure S1d, we compared neuronal responses to 10 kHz and 3 kHz tones, demonstrating that most neurons responded well to both frequencies. This panel does not aim to illustrate frequency selectivity but rather the overall responsiveness of neurons to these tones. For detailed information on sound selectivity, readers can refer to Figures S3a-b, which show that while more neurons preferred 10 kHz tones, the proportion is lower than in neurons recorded during the multisensory discrimination task. This distinction explains the observed differences and aligns with the results presented.

      (10) Line 79: For clarity, it can be added that the multisensory trials presented are congruent trials (jointly indicated rewarded port), and perhaps that incongruent trials are discussed later in the paper.

      We believe additional clarification is unnecessary, as the designations "A<sup>3k</sup>V<sup>hz</sup>" and "A<sup>10k</sup>V<sup>vt</sup>" clearly indicate the specific combinations of auditory and visual cues presented during congruent trials. Additionally, the discussion of incongruent trials is provided later in the manuscript, as noted by the reviewer.

      (11) Line 111: the description leaves unclear that the 35% reflects the combination of units responsive to visual only and responsive to auditory or visual.

      The information is clearly presented in Figure 2b, which shows the proportions of neurons responding to auditory-only (A), visual-only (V), both auditory and visual (A, V), and audiovisual-only (VA) stimuli in a pie chart. Readers can refer to this figure for a detailed breakdown of the neuronal response categories.

      (12) Figure 2h: consider a colormap with diverging palette and equal positive and negative maximum (e.g. -0.6 to 0.6) and perhaps reiterate in the color bar legend which stimulus is preferred for which selectivity index.

      We appreciate the suggestion; however, we believe that the current colormap effectively conveys the data and the intended interpretation. The existing color bar legend already provides clear information about the selectivity index, and the stimulus preference is adequately explained in the figure caption. As such, further adjustments are not necessary.

      (13) Line 160: "a ratio of 60:20 for V<sup>vt</sup> preferred vs. V<sup>hz</sup> preferred neurons." Is this supposed to add up to 100, or is this a ratio of 3:1?

      We have rewritten the sentence. Please see below:

      “Similar to the auditory selectivity observed, a greater proportion of neurons favored the visual stimulus (V<sup>vt</sup>) associated with the contralateral choice, with a 3:1 ratio of V<sup>vt</sup>-preferred to V<sup>hz</sup>-preferred neurons.”

      (14) The statement in Figure 2g and line 166/167 could be supported by a statistical test (chi-square?).

      Thank you for the suggestion. However, we believe that a statistical test is not required in this case, as the patterns observed are clearly represented in Figure 2g. The qualitative differences between the groups are evident and sufficiently supported by the data.

      (15) Line 168, it is unclear in what sense 'dominant' is meant. Is audition perceived as a dominant sensory modality in a behavioral sense (e.g. Song et al. 2017), or are auditory signals the dominant sensory signal locally in the auditory cortex?

      Thank you for the question. By "dominant," we are referring to the fact that auditory inputs are the most prominent and influential among the sensory signals feeding into the auditory cortex. This reflects the local dominance of auditory signals within the auditory cortex, rather than a behavioral dominance of auditory perception. We have revised the sentence as follows:

      “We propose that the auditory input, which dominates within the auditory cortex, acts as a 'teaching signal' that shapes visual processing through the selective reinforcement of specific visual pathways during associative learning.”

      (16) Line 180: "we discriminated between auditory, visual, and multisensory cues." This phrasing indicated that the SVMs were trained to discriminate sensory modalities (as is done later in the manuscript), rather than what was done: discriminate stimuli within different categories of trials.

      Thank you for your comment. We have revised the sentence for clarity. Please see the updated version below:

      “Using cross-validated support vector machine (SVM) classifiers, we examined how this pseudo-population discriminates stimulus identity within the same modality (e.g., A<sup>3k</sup> vs. A<sup>10k</sup> for auditory stimuli, V<sup>hz</sup> vs. V<sup>vt</sup> for visual stimuli, A<sup>3k</sup>V<sup>hz</sup> vs. A<sup>10k</sup>V<sup>vt</sup> for multisensory stimuli).”

      (17) Line 185: "a deeply accurate incorporation of visual processing in the auditory cortex." the phrasing is a bit excessive for a binary classification performance.

      Thank you for pointing this out. We have revised the sentence to better reflect the findings without overstating them:

      “Interestingly, AC neurons could discriminate between two visual targets with around 80% accuracy (Fig. 2j), demonstrating a meaningful incorporation of visual information into auditory cortical processing.”

      (18) Figure 3, title. An article is missing (a,an/the).

      Done. Please see below:

      Fig. 3 Auditory and visual integration in the multisensory discrimination task

      (19) Line 209, typo pvalue: p<-0.00001.

      Done (p<0.00001).

      (20) Line 209, the pattern is not weaker. The pattern is the same, but more weakly expressed.

      Thank you for your valuable feedback. We appreciate your clarification and agree that our phrasing could be improved for accuracy. The observed pattern under anesthesia is indeed the same but less strongly expressed compared to the task engagement. We have revised the sentence to better reflect this distinction:

      “A similar pattern, albeit less strongly expressed, was observed under anesthesia (Supplementary Fig. 3c-3f), suggesting that multisensory perceptual learning may induce plastic changes in AC.”

      (21) Line 211: choice-free group → free-choice group.

      Done.

      (22) Line 261: wrong → incorrect (to maintain consistent terminology).

      Done.

      (23) Line 265: why 'likely'? Are incorrect choices on the A<sup>3k</sup>-V<sup>hz</sup> trials not by definition contralateral and vice versa? Or are there other ways to have incorrect trials?

      We have deleted the word ‘likely’. Please see below:

      “…, correct choices here correspond to ipsilateral behavioral selection, while incorrect choices correspond to contralateral behavioral selection.”

      (24) Typo legend Fig 3a-c (tasks → task). (only one task performed).

      Done.

      (25) Line 400: typo: Like → like.

      Done.

      (26) Line 405: What is meant by a cohesive visual stimulus? Congruent? Rephrase.

      Done. Please see below:

      “…layer 2/3 neurons of the primary visual cortex(7), and a congruent visual stimulus can enhance sound representation…”

      (27) Line 412: Very general statement and obviously true: depending on the task, different sensory elements need to be combined to guide adaptive behavior.

      We appreciate the reviewer's comment and have used this sentence (see the second paragraph of the Discussion section).

      (28) Line 428: within → between (?).

      Done.

      (29) Figure 3L is not referenced in the main text. By going through the figures and legends my understanding is that this shows that most neurons have a multisensory response that lies between 2 z-scores of the predicted response in the case of 83% of the sum of the auditory and the visual response. However, how was the 0.83 found? Empirically? Figure S3 shows a neuron that does follow a 100% summation. Perhaps the authors could quantitatively support their estimate of 83% of the A + V sum, by varying the fraction of the sum (80%, 90%, 100% etc.) and showing the distribution of the preferred fraction of the sum across neurons, or by showing the percentage of neurons that fall within 2 z-scores for each of the fractions of the sum.

      Thank you for your detailed feedback and suggestions regarding Figure 3L and the 83% multiplier.

      (1) Referencing Figure 3L:

      Figure 3L is referenced in the text. To enhance clarity, we have revised the text to explicitly highlight its relevance:

      “Specifically, as illustrated in Fig. 3k, the observed multisensory response approximated 83% of the sum of the auditory and visual responses in most cases, as quantified in Fig. 3L.”

      (2) Determination of the 0.83 Multiplier:

      The 0.83 multiplier was determined empirically by comparing observed audiovisual responses with the predicted additive responses (i.e., the sum of auditory and visual responses). For each neuron, we calculated the auditory, visual, and audiovisual responses. We then compared the observed audiovisual response with scaled sums of auditory and visual responses (Fig. 3k), expressed as fractions of the additive prediction (e.g., 0.8, 0.83, 0.9, etc.). We found that when the scaling factor was 0.83, the population-wide difference between predicted and observed multisensory responses, expressed as z-scores, was minimized. Specifically, at this value, the mean z-score across the population was approximately zero (-0.0001±1.617), indicating the smallest deviation between predicted and observed responses.
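
      The scan over scaling factors described above can be sketched as follows. This is a minimal illustration on synthetic data, not the analysis code used in the study: the per-neuron z-score definition (observed audiovisual response minus the scaled additive prediction, divided by the trial-to-trial standard deviation of the audiovisual response), the function name `best_scaling_factor`, and the grid of candidate factors are all assumptions. The synthetic responses are generated with a true factor of 0.83 purely so that the scan has something to recover.

```python
import numpy as np

def best_scaling_factor(resp_a, resp_v, resp_av, sd_av, factors):
    """Find the factor k for which the population mean z-score is closest to zero.

    resp_a, resp_v, resp_av: per-neuron mean responses (1-D arrays).
    sd_av: per-neuron SD of the audiovisual response (assumed normaliser).
    factors: candidate fractions of the additive prediction A + V.
    """
    additive = resp_a + resp_v
    # For each candidate k, z-score the deviation of observed AV from k * (A + V).
    mean_z = np.array([np.mean((resp_av - k * additive) / sd_av) for k in factors])
    best = int(np.argmin(np.abs(mean_z)))
    return float(factors[best]), mean_z

# Hypothetical synthetic population (not data from the study).
rng = np.random.default_rng(0)
n_neurons = 500
resp_a = rng.uniform(2.0, 10.0, n_neurons)
resp_v = rng.uniform(1.0, 5.0, n_neurons)
sd_av = rng.uniform(1.0, 3.0, n_neurons)
resp_av = 0.83 * (resp_a + resp_v) + rng.normal(0.0, 0.05, n_neurons)

factors = np.linspace(0.70, 1.00, 31)  # 0.70, 0.71, ..., 1.00
best_k, mean_z = best_scaling_factor(resp_a, resp_v, resp_av, sd_av, factors)
print(f"best factor: {best_k:.2f}")
```

      Plotting `mean_z` against `factors` would also show how quickly the population deviates from zero on either side of the best factor, which addresses the reviewer's request to compare several fractions of the sum.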

      (30) Figure 5e: how come the diagonal has 0.5 decoding accuracy within a category? Shouldn't this be high within-category accuracy? If these conditions were untested and it is an issue of the image display it would be informative to test the cross-validated performance within the category as well as a benchmark to compare the across-category performance to. Aside, it is unclear which conventions from Figure 2 are meant by the statement that conventions were the same.

      The diagonal values (~0.5 decoding accuracy) within each category reflect chance-level performance. This occurs because the decoder was trained and tested on the same category conditions in a cross-validated manner, and within-category stimulus discrimination was not the primary focus of our analysis. Specifically, the stimuli within a category shared overlapping features, leading to reduced discriminability for the decoder when distinguishing between them. Our primary objective was to assess cross-category performance rather than within-category accuracy, which may explain the observed pattern in the diagonal values.

      Regarding the reference to Figure 2, we appreciate the reviewer pointing out the ambiguity. To avoid any confusion, we have removed the sentence referencing "conventions from Figure 2" in the legend for Figure 5e, as it does not contribute meaningfully to the understanding of the results.

      (31) Line 473: "movement evoked response", what is meant by this?

      We thank the reviewer for highlighting this point. To clarify, by "movement-evoked response," we are referring to neural activity that is driven by the animal's movements, rather than by sensory inputs. This type of response is typically stereotyped, meaning that it has a consistent, repetitive pattern associated with specific movements, such as whisking, running, or other body or facial movements.

      In our study, we propose that the visually-evoked responses observed within the 150 ms time window after cue onset primarily reflect sensory inputs from the visual stimulus rather than movement-related activity. This interpretation is supported by the response timing: visual-evoked activity occurs within 100 ms of the light flash onset, a timeframe too rapid to be attributed to body or orofacial movements. Additionally, unlike stereotyped movement-evoked responses, the visual responses we observed are discriminative, varying based on specific visual features—a hallmark of sensory processing rather than motor-driven activity.

      We have revised the manuscript as follows (eighth paragraph in discussion section):

      “There is ongoing debate about whether cross-sensory responses in sensory cortices predominantly reflect sensory inputs or are influenced by behavioral factors, such as cue-induced body movements. A recent study shows that sound-clip evoked activity in visual cortex has a behavioral rather than sensory origin and is related to stereotyped movements(49). Several studies have demonstrated that sensory neurons can encode signals associated with whisking(50), running(51), pupil dilation(52) and other movements(53). In our study, the responses to visual stimuli in the auditory cortex occurred primarily within a 100 ms window following cue onset, which suggests that visual information reaches the AC through rapid pathways. Potential candidates include direct or fast cross-modal inputs, such as pulvinar-mediated pathways(8) or corticocortical connections(5,54), rather than slower associative mechanisms. This early timing suggests that the observed responses were less likely modulated by visually-evoked body or orofacial movements, which typically occur with a delay relative to sensory cue onset(55).”

      (32) Line 638-642: It is stated that a two-tailed permutation test is done. The cue selectivity can be significantly positive and negative, relative to a shuffle distribution. This is excellent. But then it is stated that if the observed ROC value exceeds the top 5% of the distribution it is deemed significant, which corresponds to a one-tailed test. How were significantly negative ROC values detected with p<0.05?

      Thank you for pointing this out. We confirm that a two-tailed permutation test was indeed used to evaluate cue selectivity. In this approach, significance is determined by comparing the observed ROC value to both tails of the shuffle distribution. Specifically, if the observed ROC value exceeds the top 2.5% or falls below the bottom 2.5% of the distribution, it is considered significant at p< 0.05. This two-tailed test ensures that both significantly positive and significantly negative cue selectivity values are identified.

      To clarify this in the manuscript, we have revised the text as follows:

      “This generated a distribution of values from which we calculated the probability of our observed result. If the observed ROC value exceeds the top 2.5% of the distribution or falls below the bottom 2.5%, it was deemed significant (i.e., p < 0.05).”
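
      The two-tailed criterion described above can be sketched as follows. This is a hedged illustration only: the function names, the number of shuffles, and the synthetic firing rates are assumptions, and the ROC value is computed here as the probability that a response to one cue exceeds a response to the other.

```python
import numpy as np

def auroc(x, y):
    """Probability that a sample from x exceeds a sample from y (ties count 0.5)."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    return float(np.mean((x > y) + 0.5 * (x == y)))

def two_tailed_permutation_test(x, y, n_shuffles=1000, alpha=0.05, seed=0):
    """Compare the observed ROC value with both tails of a label-shuffled null."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    observed = auroc(x, y)
    pooled = np.concatenate([x, y])
    null = np.empty(n_shuffles)
    for i in range(n_shuffles):
        perm = rng.permutation(pooled)  # shuffle cue labels across trials
        null[i] = auroc(perm[: len(x)], perm[len(x):])
    # Bottom and top alpha/2 cut-offs (2.5% each for alpha = 0.05).
    lower, upper = np.percentile(null, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    significant = bool(observed < lower or observed > upper)
    return observed, (float(lower), float(upper)), significant

# Hypothetical single-neuron responses to two cues (synthetic).
rng = np.random.default_rng(42)
cue_1 = rng.normal(3.0, 1.0, 40)
cue_2 = rng.normal(0.0, 1.0, 40)
roc_pos, bounds_pos, sig_pos = two_tailed_permutation_test(cue_1, cue_2)
roc_neg, bounds_neg, sig_neg = two_tailed_permutation_test(cue_2, cue_1)
```

      Swapping the two cues flips the ROC value around 0.5, so the same neuron is flagged through the lower tail, which is how significantly negative selectivity is detected.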

      (33) Line 472: the cited paper (reference 52) actually claims that motor-related activity in the visual cortex has an onset before 100ms and thus does not support your claim that the time window precludes any confound of behaviorally mediated activity. Furthermore, that study and reference 47 show that sensory stimuli could be discriminated based on the cue-evoked body movements and are discriminative. A stronger counterargument would be that both studies show very fast auditory-evoked body movements, but only later visually-evoked body movements.

      We appreciate the reviewer’s comments. As Lohuis et al. (reference 55) demonstrated, activity in the visual cortex (V1) can reflect distinct visual, auditory, and motor-related responses, with the latter often dissociable in timing. In their findings, visually-evoked movement-related activity arises substantially later than the sensory visual response, generally beginning around 200 ms post-stimulus onset. In contrast, auditory-evoked activity in A1 occurs relatively early.

      We have revised the manuscript as follows (eighth paragraph in discussion section):

      “A recent study shows that sound-clip evoked activity in visual cortex has a behavioral rather than sensory origin and is related to stereotyped movements(49). ...This early timing suggests that the observed responses were less likely modulated by visually-evoked body or orofacial movements, which typically occur with a delay relative to sensory cue onset(55).”

      (34) The training order (multisensory cue first) is important to briefly mention in the main text.

      We appreciate the reviewer’s suggestion and have added this information to the main text. The revised text now reads:

      “The training proceeded in two stages. In the first stage, which typically lasted 3-5 weeks, rats were trained to discriminate between two audiovisual cues. In the second stage, an additional four unisensory cues were introduced, training the rats to discriminate a total of six cues.”

      (35) Line 542: As I understand the multisensory rats were trained using the multisensory cue first, so different from the training procedure in the unisensory task rats where auditory trials were learned first.

      Thank you for pointing this out. You are correct that, in the unisensory task, rats were first trained to discriminate auditory cues, followed by visual cues. To improve clarity and avoid any confusion, we have removed the sentence "Similar to the multisensory discrimination task" from the revised text.

      (36) Line 546: Can you note on how the rats were motivated to choose both ports, or whether they did so spontaneously?

      Thank you for your insightful comment. The rats' port choice was spontaneous in this task, as there was no explicit motivation required for choosing between the ports. We have clarified this point in the text to address your concern. The revised sentence now reads:

      “They received a water reward at either port following the onset of the cue, and their port choice was spontaneous.”

      (37) It is important to mention in the main text that the population decoding is actually pseudopopulation decoding. The interpretation is sufficiently important for interpreting the results.

      Thank you for this valuable suggestion. We have revised the text to specify "pseudo-population" instead of "population" to clarify the nature of our decoding analysis. The revised text now reads:

      “Our multichannel recordings enabled us to decode sensory information from a pseudo-population of AC neurons on a single-trial basis. Using cross-validated support vector machine (SVM) classifiers, we examined how this pseudo-population discriminates between stimuli.”
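
      To make the distinction between population and pseudo-population decoding concrete, the sketch below assembles pseudo-trials by sampling each neuron's trials independently (as is needed when neurons were not recorded simultaneously) and evaluates decoding on held-out trials. It is an assumption-laden illustration, not the pipeline used in the study: the function names and synthetic responses are hypothetical, and a nearest-centroid classifier stands in for the SVM so that the example depends only on NumPy.

```python
import numpy as np

def pseudo_trials(trials_per_neuron, n_pseudo, rng):
    """Sample each neuron's trials independently and stack them as columns."""
    return np.column_stack(
        [rng.choice(t, size=n_pseudo, replace=True) for t in trials_per_neuron]
    )

def holdout_accuracy(resp_a, resp_b, n_pseudo, rng):
    """One hold-out split: decode stimulus a vs. b from pseudo-population vectors.

    resp_a, resp_b: lists with one 1-D array of single-trial responses per neuron.
    Real trials are split into train/test halves BEFORE pseudo-trials are built,
    so no real trial contributes to both sets.
    """
    train_X, train_y, test_X, test_y = [], [], [], []
    for label, resp in enumerate((resp_a, resp_b)):
        train_half, test_half = [], []
        for t in resp:
            shuffled = rng.permutation(t)
            half = len(shuffled) // 2
            train_half.append(shuffled[:half])
            test_half.append(shuffled[half:])
        train_X.append(pseudo_trials(train_half, n_pseudo, rng))
        test_X.append(pseudo_trials(test_half, n_pseudo, rng))
        train_y.append(np.full(n_pseudo, label))
        test_y.append(np.full(n_pseudo, label))
    train_X, test_X = np.vstack(train_X), np.vstack(test_X)
    train_y, test_y = np.concatenate(train_y), np.concatenate(test_y)
    # Stand-in classifier (nearest centroid); an SVM would be fitted here instead.
    centroids = np.stack([train_X[train_y == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(test_X[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(np.argmin(dists, axis=1) == test_y))

# Hypothetical neurons with different trial counts (synthetic, two stimuli).
rng = np.random.default_rng(1)
n_neurons = 30
pref = rng.normal(0.0, 1.0, n_neurons)  # per-neuron stimulus preference
resp_a = [rng.normal(5.0 + d, 1.0, rng.integers(20, 40)) for d in pref]
resp_b = [rng.normal(5.0 - d, 1.0, rng.integers(20, 40)) for d in pref]
accuracy = float(np.mean([holdout_accuracy(resp_a, resp_b, 100, rng) for _ in range(10)]))
```

      Because trials are sampled independently per neuron, trial-to-trial correlations between neurons are absent from the pseudo-population, which is the interpretive caveat the reviewer raises.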

      (38) The term modality selectivity for the description of the multisensory interaction is somewhat confusing. Modality selectivity suggests different responses to the visual or auditory trials. The authors could consider a different terminology emphasizing the multisensory interaction effect.

      Thank you for your insightful comment. We have replaced "modality selectivity" with "multisensory interactive index" (MSI). This term more accurately conveys a tendency for neurons to favor multisensory stimuli over individual sensory modalities (visual or auditory alone).

      (39) In Figures 3 e and g the color code is different from adjacent panels b and c and is to be deciphered from the legend. Consider changing the color coding, or highlight to the reader that the coloring in Figures 3b and c is different from the color code in panels 3 e and g.

      We appreciate the reviewer’s observation. However, we believe that a change in the color coding is not necessary. Figures 3e and 3g differentiate symbols by both shape and color, ensuring accessibility and clarity. This is clearly explained in the figure legend to guide readers effectively.

      (40) Figure S2b: was significance tested here?

      Yes, significance was tested here.

      (41) Figure S2d: test used?

      Yes, a statistical test was used.

      (42) Line 676: "as appropriate", was a normality test performed prior to statistical test selection?

      Thank you for pointing this out. We confirm that a normality test was performed prior to the selection of the statistical test. Specifically, we used the Shapiro-Wilk test to assess whether the data distributions met the assumption of normality. Based on this assessment, we applied the parametric paired t-test for normally distributed data and the non-parametric Wilcoxon signed-rank test for non-normal data.

      To ensure clarity, we have updated the "Statistical Analysis" section of the manuscript with the following revised text:

      “For behavioral data, such as mean reaction time differences between unisensory and multisensory trials, cue selectivity and mean modality selectivity across different auditory-visual conditions, comparisons were performed using either the paired t-test or the Wilcoxon signed-rank test. The Shapiro-Wilk test was conducted to assess normality, with the paired t-test used for normally distributed data and the Wilcoxon signed-rank test for non-normal data.”
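
      The test-selection rule described above can be sketched as follows, assuming SciPy is available. The function name `paired_comparison`, the 0.05 threshold for the normality check, and the choice to assess normality on the paired differences are assumptions made for illustration; the text does not specify these details.

```python
import numpy as np
from scipy import stats

def paired_comparison(x, y, alpha=0.05):
    """Choose between a paired t-test and a Wilcoxon signed-rank test."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: normality is assessed on the paired differences.
    _, p_normal = stats.shapiro(x - y)
    if p_normal > alpha:
        name = "paired t-test"
        _, p_value = stats.ttest_rel(x, y)
    else:
        name = "Wilcoxon signed-rank test"
        _, p_value = stats.wilcoxon(x, y)
    return name, float(p_value)

# Hypothetical paired measurements with one extreme difference (synthetic).
baseline = np.array([1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 1.4, 1.1, 0.9])
diffs = np.array([0.10, 0.20, 0.15, 0.12, 0.18, 0.11, 0.13, 0.16, 0.14, 25.0])
test_name, p_value = paired_comparison(baseline + diffs, baseline)
print(test_name, p_value)
```

      With small samples the Shapiro-Wilk test has limited power, so a non-significant result does not guarantee normality; this is a general property of the procedure rather than a statement about the data in the study.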

      (43) Line 679: incorrect, most data is actually represented as mean +- SEM.

      Thank you for pointing this out. In the Results section, we report data as mean ± SD for descriptive statistics, while in the figures, the error bars typically represent the standard error of the mean (SEM) to visually indicate variability around the mean. We have specified in each figure legend whether the error bars represent SD or SEM.
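
      Because the two conventions are easy to confuse, the relationship between them is shown below for a hypothetical set of values (not data from the study): the SEM is the sample SD divided by the square root of the number of observations, so it is always the smaller of the two for n > 1.

```python
import math
import statistics

def mean_sd_sem(values):
    """Return the mean, sample SD (n - 1 denominator), and SEM of a sample."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    sem = sd / math.sqrt(n)
    return mean, sd, sem

# Hypothetical measurements, for illustration only.
mean, sd, sem = mean_sd_sem([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(f"mean = {mean:.2f}, SD = {sd:.2f}, SEM = {sem:.2f}")
```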

      Reviewer #2 (Recommendations for the authors):

      (1) Line 182 - here it sounds like you mean your classifier was trained to decode the modality of the stimulus, when I think what you mean is that you decoded the stimulus contingencies using A/V/AV cues?

      Thank you for pointing out this potential misunderstanding. We would like to clarify that the classifier was trained to decode the stimulus identity (e.g., A<sup>3k</sup> vs. A<sup>10k</sup> for auditory stimuli, V<sup>hz</sup> vs. V<sup>vt</sup> for visual stimuli, and A<sup>3k</sup>V<sup>hz</sup> vs. A<sup>10k</sup>V<sup>vt</sup> for multisensory stimuli) rather than the modality of the stimulus. The goal of the analysis was to determine how well the pseudo-population of AC neurons could distinguish between individual stimuli within the same modality. We have revised the relevant text in the revised manuscript to ensure this distinction is clear. Please see the following:

      “Our multichannel recordings enabled us to decode sensory information from a pseudo-population of AC neurons on a single-trial basis. Using cross-validated support vector machine (SVM) classifiers, we examined how this pseudo-population discriminates stimulus identity (e.g., A<sup>3k</sup> vs. A<sup>10k</sup> for auditory stimuli, V<sup>hz</sup> vs. V<sup>vt</sup> for visual stimuli, A<sup>3k</sup>V<sup>hz</sup> vs. A<sup>10k</sup>V<sup>vt</sup> for multisensory stimuli).”

      (2) Lines 256 - here the authors look to see whether incorrect trials diminish audiovisual integration. I would probably seek to turn the causal direction around and ask are AV neurons critical for behaviour - nevertheless, since this is only correlational the causal direction cannot be unpicked. However, the finding that contralateral responses per se do not result in enhancement is a key control. Showing that multisensory enhancement is less on error trials is a good first step to linking neural activity and perception, but I wonder if the authors could take this further however by seeking to decode choice probabilities as well as stimulus features in an attempt to get a little closer to addressing the question of whether the animals are using these responses for behaviour.

      Thank you for your comment and for highlighting the importance of understanding whether audiovisual (AV) neurons are critical for behavior. As you noted, the causal relationship between AV neural activity and behavioral outcomes cannot be directly determined in our current study due to its correlational nature. We agree that this is an important topic for future exploration. In our study, we examined how incorrect trials influence multisensory enhancement. Our findings show that multisensory enhancement is less pronounced during error trials, providing an initial link between neural activity and behavioral performance. To address your suggestion, we conducted an additional analysis comparing auditory and multisensory selectivity between correct and incorrect choice trials. As shown in Supplementary Fig. 7, both auditory and multisensory selectivity were significantly lower during incorrect trials. This result highlights the potential role of these neural responses in decision-making, suggesting they may extend beyond sensory processing to influence choice selection. We have cited this figure in the Results section as follows (the paragraph regarding the impact of incorrect choices on audiovisual integration):

      “Overall, these findings suggest that the multisensory perception reflected by behavioral choices (correct vs. incorrect) might be shaped by the underlying integration strength. Furthermore, our analysis revealed that incorrect choices were associated with a decline in cue selectivity, as shown in Supplementary Fig. 7.”

      We acknowledge your suggestion to decode choice probabilities alongside stimulus features as a more direct approach to exploring whether animals actively use these neural responses for behavior. Unfortunately, in the current study, the low number of incorrect trials limited our ability to perform such analyses reliably. Nonetheless, we are committed to pursuing this direction in subsequent work. We plan to use techniques such as optogenetics in future studies to causally test the role of AV neurons in driving behavior.

      (3) Figure 5E - the purple and red are indistinguishable - could you make one a solid line and keep one dashed?

      We thank the reviewer for pointing out that the purple and red lines in Figure 5E were difficult to distinguish. To address this concern, we modified the figure by making two lines solid and changing the color of one square, as suggested. These adjustments enhance visual clarity and improve the distinction between them.

      (4) The unisensory control training is a really nice addition. I'm interested to know whether behaviourally these animals experienced an advantage for audiovisual stimuli in the testing phase? This is important information to include as if they don't it is one step closer to linking audiovisual responses in AC to improved behavioural performance (and if they do, we must be suitably cautious in interpretation!).

      Thank you for raising this important point. To address this, we have plotted the behavioral results for each animal (see Author response image 2). The data indicate that performance with multisensory cues is slightly better than with the corresponding unisensory cues. However, given the small sample size (n=3) and the considerable variation in behavioral performance across individuals, we remain cautious about drawing definitive conclusions on this matter. We recognize the need for further investigation to establish a robust link between audiovisual responses in the auditory cortex and improved behavioral performance. In future studies, we plan to include a larger number of animals and more thoroughly explore this relationship to provide a comprehensive understanding.

      Author response image 2.

      (5) Line 339 - I don't think you can say this leads to binding with your current behaviour or neural responses. I would agree there is a memory trace established and a preferential linking in AC neurons.

      We thank the reviewer for raising this important point. In the revised manuscript, we have clarified that our data suggest the formation of a memory trace and preferential linking in AC neurons. The text has been updated to emphasize this distinction. Please see the revised section below (first paragraph in Discussion section).

      “Interestingly, a subset of auditory neurons not only developed visual responses but also exhibited congruence between auditory and visual selectivity. These findings suggest that multisensory perceptual training establishes a memory trace of the trained audiovisual experiences within the AC and enhances the preferential linking of auditory and visual inputs. Sensory cortices, like AC, may act as a vital bridge for communicating sensory information across different modalities.”

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This valuable manuscript attempts to identify the brain regions and cell types involved in habituation to dark flash stimuli in larval zebrafish. Habituation being a form of learning widespread in the animal kingdom, the investigation of neural mechanisms underlying it is an important endeavor. The authors use a combination of behavioral analysis, neural activity imaging, and pharmacological manipulation to investigate brain-wide mechanisms of habituation. However, the data presented are incomplete and do not show a convincing causative link between pharmacological manipulations, neural activity patterns, and behavioral outcomes.

      We thank the reviewers and editors for their careful reading and reviews of our work. We are grateful that they appreciate the value in our experimental approach and results. We acknowledge what we interpret as the major criticism, that in our original manuscript we focused too heavily on the hypothesized role of GABAergic neurons in driving habituation. This hypothesis will remain only indirectly supported until we can identify a GABAergic population of neurons that drives habituation. Therefore, we have revised our manuscript, decreasing the focus on GABA, and rather emphasizing the following three points:

      1) By performing the first Ca2+ imaging experiments during dark flash habituation, we identify multiple distinct functional classes of neurons which have different adaptation profiles, including non-adapting and potentiating classes. These neurons are spread throughout the brain, indicating that habituation is a complex and distributed process.

      2) By performing a pharmacological screen for dark flash habituation modifiers, we confirm habituation behaviour manifests from multiple distinct molecular mechanisms that independently modulate different behavioural outputs. We also implicate multiple novel pathways in habituation plasticity, some of which we have validated through dose-response studies.

      3) By combining pharmacology and Ca2+ imaging, we did not observe a simple relationship between the behavioural effects of a drug treatment and functional alterations in neurons. This observation further supports our model that habituation is a multidimensional process, for which a simple circuit model will be insufficient.

      We would like to point out that, in our opinion, there appears to be a factual error in the final sentence of the eLife assessment:

      “However, the data presented are incomplete and do not show a convincing causative link between pharmacological manipulations, neural activity patterns, and behavioral outcomes”.

      We believe that a “convincing causative link” between pharmacological manipulations and behavioural outcomes has been clearly demonstrated for PTX, Melatonin, Estradiol and Hexestrol through our dose response experiments. Similarly, a link between pharmacology and neural activity patterns has also been directly demonstrated. As mentioned in (3), we acknowledge that our data linking neural activity and behaviour is more tenuous, as will be more explicitly reflected in our revised manuscript.

      Nevertheless, we maintain that one of the primary strengths of our study is our attempt to integrate analyses that span the behavioural, pharmacological, and neural activity-levels.

      In our revised manuscript, we have substantially altered the Abstract and Discussion, removed the Model figure (previously Figure 8), and changed the title from:

      “Inhibition drives habituation of a larval zebrafish visual response”

      to:

      “Functional and pharmacological analyses of visual habituation learning in larval zebrafish”

      Text changes from the initial version are visible as track changes in the Word document: “LamireEtAl_2022_eLifeRevisions.docx”

      Reviewer #1 (Public Review):

      This manuscript addresses the important and understudied issue of circuit-level mechanisms supporting habituation, particularly in pursuit of the possible role of increases in the activity of inhibitory neurons in suppressing behavioral output during long-term habituation. The authors make use of many of the striking advantages of the larval zebrafish to perform whole brain, single neuronal calcium imaging during repeated sensory exposure, and high throughput screening of pharmacological agents in freely moving, habituating larvae. Notably, several blockers/antagonists of GABAA(C) receptors completely suppress habituation of the O-bend escape response to dark flashes, suggesting a key role for GABAergic transmission in this form of habituation. Other substances are identified that strikingly enhance habituation, including melatonin, although here the suggested mechanistic insight is less specific. To add to these findings, a number of functional clusters of neurons are identified in the larval brain that have divergent activity through habituation, with many clusters exhibiting suppression of different degrees, in line with adaptive filtration during habituation, and a single cluster that potentiates during habituation. Further assessment reveals that all of these clusters include GABAergic inhibitory neurons and excitatory neurons, so we cannot take away the simple interpretation that the potentiating cluster of neurons is inhibitory and therefore exerts an influence on the other adapting (depressing) clusters to produce habituation. Rather, a variety of interpretations remain in play.

      Overall, there is great potential in the approach that has been used here to gain insight into circuit-level mechanisms of habituation. There are many experiments performed by the authors that cannot be achieved currently in other vertebrate systems, so the manuscript serves as a potential methodological platform that can be used to support a rich array of future work. While there are several key observations that one can take away from this manuscript, a clear interpretation of the role of GABAergic inhibitory neurons in habituation has not been established. This potential feature of habituation is emphasized throughout, particularly in the introduction and discussion sections, meaning that one is obliged as a reader to interrogate whether the results as they currently stand really do demonstrate a role for GABAergic inhibition in habituation. Currently, the key piece of evidence that may support this conclusion is that picrotoxin, which acts to block some classes of GABA receptors, prevents habituation. However, there are interpretations of this finding that do not specifically require a role for modified GABAergic inhibition. For instance, by lowering GABAergic inhibition, an overall increase in neural activity will occur within the brain, in this case below a level that could cause a seizure. That increase in activity may simply prevent learning by massively increasing neural noise and therefore either preventing synaptic plasticity or, more likely, causing indiscriminate synaptic strengthening and weakening that occludes information storage. Sensory processing itself could also be disrupted, for instance by altering the selectivity of receptive fields. Alternatively, it could be that the increase in neural activity produced by the blockade of inhibition simply drives more behavioral output, meaning that more excitatory synaptic adaptation is required to suppress that output. 
      The authors propose two specific working models of the ways in which GABAergic inhibition could be implemented in habituation. An alternative model, in which GABAergic neurons are not themselves modified but act as a key intermediary between Hebbian assemblies of excitatory neurons that are modified to support memory and output neurons, is not explored. As yet, these or other models in which inhibition is not required for habituation, have not been fully tested.

      This manuscript describes a really substantial body of work that provides evidence of functional clusters of neurons with divergent responses to repeated sensory input and an array of pharmacological agents that can influence the rate of a fundamentally important form of learning.

      We thank the reviewer for their careful consideration of our work, and we agree that multiple models of how habituation occurs remain plausible. As discussed above and below in more detail, we have revised our manuscript to better reflect this. We hope the reviewer will agree that this has improved the manuscript.

      Reviewer #2 (Public Review):

      In this study, Lamire et al. use a calcium imaging approach, behavioural tests, and pharmacological manipulations to identify the molecular mechanisms behind visual habituation. Overall, the manuscript is well-written but difficult to follow at times. They show a valuable new drug screen paradigm to assess the impact of pharmacological compounds on the behaviour of larval zebrafish, the results are convincing, but the description of the work is sometimes confusing and lacking details.

      We thank the reviewer for identifying areas where our description lacked details. We apologize for these omissions and have attempted to add relevant details as described below. We note that all of the analysis code is available online, though we appreciate that navigating and extracting data from these files is not straightforward.

      The volumetric calcium imaging of habituation to dark flashes is valuable, but the mix of responses to visual cues that are not relevant to the dark flash escape, such as the slow increase back to baseline luminosity, lowers the clarity of the results. The link between the calcium imaging results and free-swimming behaviour is not especially convincing, however, that is a common issue of head-restrained imaging with larval zebrafish.

      We agree with the reviewer that the design of our stimulus, and specifically the slow increase back to baseline luminosity, is perhaps confusing for the interpretation of some of the response profiles of neurons. We originally chose this stimulus type (rather than a square wave of 1 s of darkness, for example) in order to better highlight the responses of the larvae to the onset of darkness (rather than the response to abruptly returning to full brightness). We therefore believe that the slow return to baseline is an important feature of the stimulus, which better separates activity related to the fast offset from activity related to light onset. Since both the foundational behavioural data (Randlett et al., Current Biology 2019) and the pharmacological data used this stimulus type, we did not change it for the Ca2+ imaging experiments. Our use of relatively slow nuclear-targeted GCaMP indicators also means that the temporal resolution of our imaging experiments is relatively poor, and therefore we felt that using a stimulus that highlighted light offset might be best.

      We also fully acknowledge in the Results section that the behaviour of the head embedded fish is not the same as that of free-swimming fish, and that therefore establishing a direct link between these types of experiments is complicated. This is an unavoidable caveat in the head-embedded style experiments. To further emphasize this, we have also added a paragraph to the discussion where this is acknowledged explicitly.

      “We also found that the same pharmacological treatments that result in strong alterations to habituation behaviour in freely swimming larvae ([fig:5]), resulted in relatively subtle and complex functional alterations in the circuit ([fig:6]). Making direct comparisons between freely-swimming behaviour and head-fixed Ca2+ imaging is always challenging due to the differences in behaviour observed in the two contexts, and therefore our failure to identify a clear logic in these experiments may have technical explanations that will require approaches to measure neural activity from unrestrained and freely-behaving animals to resolve. Alternatively, these results are again consistent with the idea that habituation is a multidimensional and perhaps highly non-linear phenomenon in the circuit, which cannot be captured by a simple model.”

      The strong focus on GABA seems unwarranted based on the pharmacological results, as only Picrotoxinin gives clear results, but the other antagonists do not give consistent results. On the other hand, the melatonin receptor agonists, and oestrogen receptor agonists give more consistent results, including more convincing dose effects.

      We agree that our manuscript focused too strongly on GABA and have toned this down. We are currently performing genetic experiments aimed at identifying the Melatonin, Estrogen and GABA receptors that function during habituation, which we think will be necessary to move beyond pharmacology and the necessary caveats that such experiments bring.

      The pharmacological manipulation of the habituation circuits mapped in the first part does not arrive at any satisfying conclusion, which is acknowledged by the authors. These results do reinforce the disconnect between the calcium imaging and the behavioural experiments and undercut somewhat the proposed circuit-level model.

      We agree with this criticism and have toned down the focus on GABA specifically in the circuit, and have removed the speculative model previously in Figure 8.

      Overall, the authors did identify interesting new molecular pathways that may be involved in habituation to dark flashes. Their screening approach, while not novel, will be a powerful way to interrogate other behavioural profiles. The authors identified circuit loci apparently involved in habituation to dark flashes, and the potentiation and no adaptation clusters have not been previously observed as far as I know.

      The data will be useful to guide follow-up experiments by the community on the new pathway candidates that this screen has uncovered, including behaviours beyond dark flash habituation.

      We again thank the reviewer both for their support of our approach and for pointing out where our conclusions were not well supported by our data.

      Reviewer #3 (Public Review):

      To analyze the circuit mechanisms leading to the habituation of the O-bend responses upon repeated dark flashes (DFs), the authors performed 2-photon Ca2+ imaging in larvae expressing nuclear-targeted GCaMP7f pan-neuronally, spanning the majority of the midbrain, hindbrain, pretectum, and thalamus. They found that while the majority of neurons across the brain depress their responsiveness during habituation, a smaller population of neurons in the dorsal regions of the brain, including the torus longitudinalis, cerebellum, and dorsal hindbrain, showed the opposite pattern, suggesting that motor-related brain regions contain non-depressed signals, and therefore likely contribute to habituation plasticity.

      Further analysis using affinity propagation clustering identified 12 clusters that differed both in their adaptation to repeated DFs, as well as the shape of their response to the DF.

      Next by the pharmacological screening of 1953 small molecule compounds with known targets in conjunction with the high-throughput assay, they found that 176 compounds significantly altered some aspects of measured behavior. Among them, they sought to identify the compounds that 1) have minimal effects on the naive response to DFs, but strong effects during the training and/or memory retention periods, 2) have minimal effects on other aspects of behaviors, 3) show similar behavioral effects to other compounds tested in the same molecular pathway, and identified the GABAA/C Receptor antagonists Bicuculline, Amoxapine, and Picrotoxinin (PTX). As partial antagonism of GABAAR and/or GABACR is sufficient to strongly suppress habituation but not generalized behavioral excitability, they concluded that GABA plays a very prominent role in habituation. They also identified multiple agonists of both Melatonin and Estrogen receptors, indicating that hormonal signaling may also play a prominent role in habituation response.

      To integrate the results of the Ca2+ imaging experiments with the pharmacological screening results, the authors compared the Ca2+ activity patterns after treatment with vehicle, PTX, or Melatonin in the tethered larvae. The behavioral effects of PTX and Melatonin were much smaller compared with the very strong behavioral effects in freely-swimming animals, but the authors assumed that the difference was significant enough to continue further experiments. Based on the hypothesis that Melatonin and GABA cooperate during habituation, they expected PTX and Melatonin to have opposite effects. This was not the case in their results: for example, the size of the 12(Pot, M) neuron population was increased by both PTX and Melatonin, suggesting that pharmacological manipulations that affect habituation behavior manifest in complex functional alterations in the circuit, making capturing these effects by a simple model difficult.

      Since the 12(𝑃𝑜𝑡, 𝑀) neurons potentiate their responses and thus could act to progressively depress the responses of other neuronal classes, they examined the identity of these neurons with GABA neurons. However, GABAergic neurons in the habituating circuit are not characterized by their Adaptation Profile, suggesting that global manipulations of GABAergic signaling through PTX have complex manifestations in the functional properties of neurons.

      Overall, the authors have performed an admirably large amount of work both in whole-brain neural activity imaging and pharmacological screening. However, they are not successful in integrating the results of both experiments into an acceptably consistent interpretation due to the incongruency of the results of different experiments. Although the authors present some models for interpretation, it is not easy for me to believe that this model would help the readers of this journal to deepen the understanding of the mechanisms for habituation in DF responses at the neural circuit level.

      This reviewer would rather recommend the authors divide this manuscript into two and publish two papers by adding some more strengthening data for each part such as cellular manipulations, e.g. ablation to prove the critical involvement of 12(Pot, M) neurons in habituation.

      We thank the reviewer for their careful consideration of our manuscript, and we agree that our emphasis on a particular model of DF habituation, namely the potentiation of GABAergic synapses, was overly speculative. We hope they will agree that our revised manuscript better reflects the results from our experiments. We have also tried to emphasize more specifically the incongruency between our behavioural and Ca2+ imaging data after pharmacological treatment, which we agree shows that a simple model is insufficient to capture both of these sets of observations.

      We have opted not to split the paper into two, since we feel that the collective message of this paper and its approach combining molecular and functional analysis will be of interest. Moreover, we feel that the molecular and functional analyses feed off of each other and provide a level of complementarity that would be lost if the manuscript were split, even if the message in this particular case is rather complex.

      Reviewer #1 (Recommendations For The Authors):

      There is much to commend about this manuscript. The advantages of studying habituation in the zebrafish larva are very clearly demonstrated, including the wonderful calcium imaging across the brain and the relatively high throughput screening of large numbers of different pharmacological agents. The habituation to dark flashes in freely moving larvae is also striking and the very large effect size serves the screening beautifully. Thus, if we take the really substantial amount of work of a very high standard that has been done here, there is clearly potential for an important new contribution to the literature. However, as you will see from my public review, I am of the opinion that a specific role for the modification of GABAergic inhibitory systems has not yet been established through this work. While the potential role for GABAergic inhibitory neurons in habituation, either as the key modifiable element or as an intermediary between memory and motor output, is an attractive theory with many strengths, your study as it currently stands does not categorically demonstrate that one of those two options holds. For instance, the more traditional view, that adaptive filtration is mediated by weakened synaptic connectivity between excitatory sensory systems and excitatory motor output or reduced intrinsic excitability in those same neurons, could still be in operation here. By lowering GABAergic influence over post-synaptic targets with picrotoxin, it is possible that motor output remains highly active, and even lower activity or synaptic drive from those excitatory sensory systems that feed into the output may still reliably produce behavioral output. Alternatively, it could be the formation of a memory of the familiar stimulus is disrupted by reduced inhibition that alters sensory coding either by introducing noise or reducing the selectivity of receptive fields. I believe that there are several options to address these concerns:

      1) You could change the emphasis of the manuscript so that it is less focused on inhibition and instead emphasizes the categorization of clusters of neurons that have divergent responses during habituation, including either strong suppression to potentiation. To this, you add a high throughput screening system with a wide range of different agents being tested, several of which produce a significant effect on habituation in either direction. These observations in themselves provide powerful building blocks for future work.

      2) If GABAergic neurons play a key role in habituation in this paradigm, then picrotoxin is having its effect by blocking receptors on excitatory neurons. Thus, it seems that selectively imaging GABAergic neurons before and after the application of these drugs is not likely to reveal the contribution of GABAergic synaptic influence on excitatory targets. More important is to get a stronger sense of how the GABAergic neurons change their activity throughout habituation and then influence the downstream target neurons of those GABAergic neurons (some of which may themselves be inhibitory and participating in disinhibition). For instance, you could interrogate whether anti-correlations in activity levels exist between presynaptic inhibitory neurons and putative post-synaptic targets. This analysis could be further bolstered by removing that relationship in the presence of Picrotoxin, thereby demonstrating a direct influence of inhibition from a GABAergic presynaptic partner on a postsynaptic target. While this would constitute a lot more work, it is likely to yield greater insight into a specific role for GABAergic neurons in habituation, and I suspect much of that information is in the existing datasets.

      3) To really reveal causal roles for inhibition in this form of habituation, it seems to me that there needs to be some selective intervention in GABAergic neuronal activity, ideally bidirectionally, to transiently interrupt or enhance habituation. Optogenetic or chemogenetic stimulation/inactivation is one option in this regard, which I imagine would be challenging to implement and certainly involves a lot of further work, particularly if you are then going to target specific subpopulations of GABAergic neurons. I appreciate that this option seems way beyond the scope of a review process and would probably constitute a follow-up study.

      We agree with the reviewer that we have not “categorically demonstrated” that GABAergic inhibitory neurons drive habituation by increasing their influence on the circuit, and appreciate the suggestions for how to reformulate our manuscript to better reflect this. We have opted to follow suggestion (1), and have considerably changed the focus of the manuscript.

      The additional analysis suggested in (2) is very interesting, but since we cannot identify which cells are inhibitory in our imaging experiments with picrotoxinin treatment, nor which are pre- or postsynaptic, we feel that this analysis would be very unconstrained. Also, if GABA is acting as an inhibitory neurotransmitter, it is expected to drive anticorrelations between pre- and postsynaptic neurons through inhibition. Therefore, blockade of GABA signaling through PTX would be expected to result in increased correlations, regardless of our hypothesized role of neurons during habituation. Our current efforts are aimed at identifying critical neurons driving habituation plasticity, and we will perform such analysis once we have mechanisms for identifying these neurons.

      Finally, we agree that (3) is the obvious and only way to demonstrate causation here, and this is what we are working towards. However, since we currently have no means of genetically targeting these neurons, we are not able to perform these suggested experiments today.

      I have some additional concerns that I would really appreciate you addressing:

      1) The behavioral habituation is striking in the freely moving larvae, but very hard to monitor in the larvae that are immobilized for calcium imaging. Are there steps that could be taken in the long run to improve direct observation of the habituation effect in these semi-stationary fish? For instance, is it possible to observe eye movements or some more subtle behavioral readout than the O-bend reflex? I apologize if this is a naïve question, but I am not entirely familiar with this specific experimental paradigm.

      In the Dark Flash paradigm, we do not have readouts beyond the “O-bend” response itself, which is characterized by a large-angle bend of the tail and turning maneuver. We have not observed other, more subtle behavioural responses, such as eye or fin movements. If we were able to identify alternative behavioural outputs that were more robustly performed in head-embedded preparations, this would indeed be an advantage, allowing us to more directly interpret the Ca2+ imaging results with respect to behaviour.

      2) The dark flash as a stimulus to which the larvae habituate is obviously used as a powerful and ethologically relevant stimulus. However, it does leave an element of traditional habituation paradigms out, which is a novel stimulus that can be used to immediately re-instate the habituated response (otherwise known as dishabituation). Is there a way that you can imagine implementing that with zebrafish larvae, for instance through systematically altering a visual feature, such as spatial frequency or orientation? This would be a powerful development in my view as it would not only allow you to rule out motor or sensory fatigue as an underlying cause of reduced behavior but also it would provide an extra feature that strengthens your assessment of neuronal response profiles in candidate populations of inhibitory and excitatory neurons.

      We agree that identifying a dishabituating stimulus would be very powerful for our experiments. For short-term habituation of the acoustic startle response, Wolman et al. demonstrated that dishabituation occurs after a touch stimulus (Wolman et al., PNAS, 2011; https://doi.org/10.1073/pnas.1107156108). We attempted to dishabituate the O-bend response with tap and touch stimuli, and this unfortunately did not occur. Our understanding of dishabituation is that this generally requires a second stimulus that elicits the same behaviour as the habituated stimulus (e.g. both acoustic and touch stimuli elicit the Mauthner-dependent C-bend response). In zebrafish the only stimulus that has been identified that elicits the O-bend is a dark flash. This lack of an appropriate alternative stimulus is perhaps why we have been unsuccessful in identifying a dishabituating stimulus.

      3) You have written about the concept of 'short' and 'long' response shapes when using calcium imaging as a proxy for neural activity, surmising that the short response shape may reflect transient bursting. Although calcium imaging obviously has many advantages, this feature reveals one notable limitation of calcium imaging in contrast to electrophysiology, in that the time course of the signal is considerably longer and does not allow you with confidence to fully detect the response profile of neurons. Is there some kind of further deconvolution process that you could implement to improve the fidelity of your calcium imaging to the occurrence of action potentials? The burstiness of neurons is obviously important as it can indicate a particular type of neuron (for instance fast-spiking inhibitory neurons) or it might reveal a changing influence on post-synaptic neurons. For instance, bursting can be a response to inhibition due to the triggering of T-type calcium channels in response to hyperpolarization.

      One of the major limitations of Ca2+ imaging is the lack of temporal resolution. Our particular approach, using nuclear-targeted H2B-GCaMP indicators, further reduces our temporal resolution. Deconvolution approaches can be used in some instances to approximate spike rate, since the rise-time of Ca2+ indicators can be relatively fast. However, we chose to image larger volumes at the expense of scan rate, so our imaging is performed at only 2 Hz. Therefore, deconvolution and spike-rate estimation are not appropriate. Considering these limitations, we would argue that the fact that we can observe differences between the 'short' and 'long' response shapes indicates that these neurons likely have very different response kinetics, which we hope to confirm by electrophysiology once we have established ways of targeting these neurons for recordings.

      4) I note that among the many substances you screened with is MK801. An obvious candidate mechanism in habituation is the NMDA receptor, given the importance of this receptor for so many forms of learning and bidirectional synaptic plasticity. If I am to understand correctly, this NMDA receptor blocker actually enhances habituation in the zebrafish larvae, similar to melatonin. That is a very surprising observation, which is worth looking into further or at least discussed in the manuscript. The finding would, at least, be consistent with the idea that plasticity is not occurring at excitatory synapses and could potentially bolster the argument that plasticity of inhibitory synapses is at play in this particular form of habituation.

      This is a very important point. We were also particularly interested in MK801, which has been shown to inhibit other forms of habituation, like short-term acoustic habituation (Wolman et al., PNAS, 2011; https://doi.org/10.1073/pnas.1107156108). In our experiments we did see that fish become even less responsive to dark flashes when treated with MK-801 (SSMD fingerprint data: Prob-Train = -0.39, Prob-Test = -1.58) which would indicate that MK-801 promotes dark flash habituation, similar to Melatonin. However, we also observed that MK-801 caused a decrease in the performance in the other visual assay we tested: the optomotor response (OMR-Perf = -0.93), indicating that MK-801 causes a generalized decrease in visual responses, perhaps by acting on circuits within the retina. Therefore, based on these experiments with global drug applications, we cannot determine if MK-801 influences the plasticity process in dark-flash habituation, and this is why we did not pursue it further in this project.
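
      For readers unfamiliar with the metric, the values above can be read as standardized effect sizes relative to vehicle controls. The following is only an illustrative sketch: it assumes the common independent-groups form of the strictly standardized mean difference (SSMD), and the function name `ssmd` and the example numbers are hypothetical, not taken from our dataset or analysis code.

```python
import numpy as np


def ssmd(treated, control):
    """Strictly standardized mean difference for two independent groups.

    Assumed form: (mean_treated - mean_control) / sqrt(var_treated + var_control).
    Negative values mean the treated group scores lower than controls.
    """
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    mean_diff = treated.mean() - control.mean()
    pooled_sd = np.sqrt(treated.var(ddof=1) + control.var(ddof=1))
    return mean_diff / pooled_sd


# Hypothetical response probabilities per larva (illustration only)
treated_prob = [0.2, 0.3, 0.4]
control_prob = [0.6, 0.7, 0.8]
effect = ssmd(treated_prob, control_prob)
```

      Under this convention, a negative score for a response-probability measure corresponds to treated larvae responding less than controls, which is how the Prob-Train and Prob-Test values above should be read.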

      Anyway, I hope that you take these suggestions as constructive and, in the spirit that they are intended, as possible routes for improving an already very interesting manuscript.

      We are very grateful for your suggestions, which we feel have helped us to improve our manuscript substantially.

      Reviewer #2 (Recommendations For The Authors):

      Overall, the manuscript is well-written, but confusing at times. The results are not always presented in a consistent way, and I found myself having to dig in the raw data or code to find answers. There is a certain disconnect between the free-swimming results, and the calcium imaging, which is somewhat inevitable based on other published work. But I am unsure of what they each bring to the other, as the results from Fig.6 do not match at all the changes observed in the behavioural assays, it almost feels like two separate studies and the inconsistencies make the model appear unlikely.

      We agree that there is a disconnect at the behavioural level in our free-swimming and head-embedded imaging experiments. However, this does not necessarily mean that the activity we observe during the imaging experiments cannot be informative about processes that are also occurring in freely-swimming fish. For example, it is possible that the dark-flash circuit is responding and habituating similarly in the head-embedded and freely-swimming preparations, but that in the head-embedded context there is an additional blockade on motor output that massively decreases the propensity of the fish to initiate any movements. In such a case, the “disconnect between the free-swimming results, and the calcium imaging” would indicate that the relationship between neural activity and habituation behaviour is rather complex.

      Without a method to record activity from freely swimming fish at our disposal, we cannot determine this one way or the other.

      We hope that we now acknowledge these concerns appropriately in the discussion:

      “We also found that the same pharmacological treatments that result in strong alterations to habituation behaviour in freely swimming larvae ([fig:5]), resulted in relatively subtle and complex functional alterations in the circuit ([fig:6]). Making direct comparisons between freely-swimming behaviour and head-fixed Ca2+ imaging is always challenging due to the differences in behaviour observed in the two contexts, and therefore our failure to identify a clear logic in these experiments may have technical explanations that will require approaches to measure neural activity from unrestrained and freely-behaving animals to resolve. Alternatively, these results are again consistent with the idea that habituation is a multidimensional and perhaps highly non-linear phenomenon in the circuit, which cannot be captured by a simple model.”

      I am not convinced by the results surrounding GABA, from the inconsistent GABA receptor antagonist profile to the post hoc identification of GABAergic neurons as it is currently done in the manuscript. I think that the current focus on GABA does a disservice to the manuscript. However, the novel findings surrounding the potential role of Melatonin, and Estrogen, in habituation are quite interesting.

      We agree that we focused too heavily on our hypothesized role for GABA in our original manuscript, and we hope that the reviewer agrees that our updated manuscript is an improvement. We also thank the reviewer for their interest in our Melatonin and Estrogen results, for which follow up studies are ongoing to characterize the effects of these hormones and their receptors on habituation.

      There is an assumption that all the adaptation profiles are related to the DF (although that is somewhat alleviated in the discussions of the ON responses) and not to the luminosity changes. But there is no easy way to deconvolve those two in the current experiments. I would like the timing of the fluorescence rise to be quantified compared to the dark flash stimulus onset, potentially spike inference methods could help with giving a better idea of the timing of those responses. Based on the behavioural responses that were <500ms in Randlet O et al, eLife, 2019; we would expect only the fastest DF responses to be linked to the behaviour.

      We agree that we are unable to disambiguate responses to the dark flash that initiate the O-bend response, and those that are related to only changes in luminosity. As discussed above, our Ca2+ imaging approach is severely limited in temporal resolution and therefore spike inference methods are not appropriate.

      Major comments

      Fig.1: There seems to be a very variable lag between the motor events and DF responses, furthermore, it does not seem that the motor responses follow a similar habituation rate as in 1Bi. Although this only shows the smoothed 'movement cluster' from the rastermap, it could hide individual variability. It would be important to know what the 'escape' rate was in the embedded experiment, as Fig.1 sup.1 seems to indicate there was little to no habituation. It would also be needed to know which motor events are considered linked to the DF stimulus, and how that was decided. Was there a movement intensity threshold and lag limit in the response?

      We interpret this concern as relating to the data presented in Figure 6A, where we quantify the habituation rate in the head-embedded experiments. As we have discussed, both above and in the manuscript, we saw very strongly muted responses to DFs in the head-embedded preparation, but we neglected to describe our method of quantifying the responses. We have added the following description to the methods:

      “To quantify responses to the dark flash stimuli we used motion artifacts in the imaging data to identify frames associated with movements ([fig:1]-[fig:S1]). Motion artifact was quantified using the “corrXY” parameter from suite2p, which reflects the peak of phase correlation comparing each acquired frame and reference image used for motion correction. The “motion power” was quantified as the standard deviation of a 3-frame rolling window, which was smoothed in time using a Savitzky-Golay filter (window length = 15 frames, polyorder = 2). A response to a dark flash was defined as a “motion power” signal greater than 3 (z-score) occurring within 10 seconds of the dark-flash onset, and was used to quantify habituation in the head-embedded preparation ([fig:6]A).”
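
      The quantification described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names, the synthetic trace, the 2 Hz frame rate, and the choice to z-score over the whole recording are all assumptions made for the example. Only the 3-frame window, the Savitzky-Golay settings, the threshold of 3, and the 10-second limit come from the text.

```python
import numpy as np
from scipy.signal import savgol_filter


def motion_power(corr_xy, window=3, sg_window=15, sg_polyorder=2):
    """Rolling SD of a corrXY-like trace, smoothed and z-scored over the recording."""
    corr_xy = np.asarray(corr_xy, dtype=float)
    # Standard deviation of each 3-frame window, edge-padded back to one value per frame.
    windows = np.lib.stride_tricks.sliding_window_view(corr_xy, window)
    left = window // 2
    rolling_sd = np.pad(windows.std(axis=1), (left, window - 1 - left), mode="edge")
    smoothed = savgol_filter(rolling_sd, window_length=sg_window, polyorder=sg_polyorder)
    # Assumption: the z-score is computed over the whole recording.
    return (smoothed - smoothed.mean()) / smoothed.std()


def responded(power_z, onset_frames, frame_rate_hz, threshold=3.0, max_lag_s=10.0):
    """True for each flash followed by supra-threshold motion power within max_lag_s."""
    n_lag = int(round(max_lag_s * frame_rate_hz))
    return [bool(np.any(power_z[f:f + n_lag + 1] > threshold)) for f in onset_frames]


# Synthetic example: 600 frames at a hypothetical 2 Hz, one movement after the first flash.
rng = np.random.default_rng(0)
corr_xy = 0.9 + 0.001 * rng.standard_normal(600)
corr_xy[105:111:2] -= 0.3  # brief drop in frame-to-reference correlation
power_z = motion_power(corr_xy)
flags = responded(power_z, onset_frames=[100, 400], frame_rate_hz=2.0)
```

      In this toy trace only the first flash is followed by a movement, so only the first entry of `flags` is expected to be true.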

      Line 94: This seems to be a strong claim based on the sparse presence of non-habituating, or potentiating, neurons in downstream regions. However, these neurons appear to be extremely rare, and as mentioned in my comment above, the behavioural habituation appears minimal. These neurons could encode the luminosity and be part of other responses, such as light-seeking in Karpenko S et al, eLife, 2020 or escape directionality in Heap et al, Neuron, 2018. Furthermore, dimming information has been shown to have parallel processing pathways in Robles E et al, JCN, 2020; so it would make sense that not all the observed responses in this manuscript would be involved in behavioural habituation to dark flashes.

      We agree that without functional interventions, we do not know which of the neurons we have categorized are specifically involved in the dark flash response habituation. It is possible that the non-adapting and potentiating neurons are involved in other behaviours. We have therefore removed this statement.

      Line 103: It appears that several of those responses are to the changes in luminosity and not the DF itself, especially the ON and sustained responses. Based on the previous DF habituation study from Randlet O et al, eLife, 2019; the latency of the response is below 0.5s. So the behaviour-relevant responses must only include the shortest latency one, as discussed above.

      We appreciate the point that the reviewer is making here, but we are less clear about what the difference between “changes in luminosity” and a “dark flash” response is, since a dark flash consists of a change in luminosity. We take it that the reviewer means the difference between a luminance stimulus that elicits an O-bend and one that does not. To disambiguate the two, one would likely need to use stimuli in which the luminosity changes but that do not elicit O-bends.

      Perhaps due to the limited temporal resolution of our Ca2+ imaging data, we do not see a clear difference in the onset of the stimulus response for any of the functional clusters that would help us to determine which neurons are more relevant to the acute DF response.

      Fig.2B. It is very difficult to make out the actual average z-scored fluorescence, a supplementary figure would help by making these bigger. A plot to quantify the maximum response would also be useful to judge how it changes between the first few and few last DF. Another plot to give the time between the onset of the responses and the onset of the DF stimulus is also needed to judge which cluster may be relevant to the DF escapes observed in the free-swimming experiments.

      We agree with the reviewer that interpreting these datasets is challenging. We did include the actual average z-scored fluorescence in Figure 6—figure supplement 1, panel D. This figure also includes a comparison with the predicted Ca2+ response to the dark flash (the stimulus convolved with the approximate GCaMP response kernel), which shows that all OFF-responding neuronal classes show very similar rise-time kinetics, and thus this analysis does not help to judge whether a cluster is more or less relevant to O-bend responses in the free-swimming experiments. We appreciate that there are differences in opinion about the best way to present the data, but we have opted to leave our original presentation.
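
      For readers unfamiliar with this kind of prediction, the sketch below convolves a binary stimulus trace with an exponentially decaying kernel. It is only an illustration: the kernel shape, the 2-second decay constant, the 2 Hz frame rate, and all names are assumptions, and the kernel actually used for the figure may differ.

```python
import numpy as np


def predicted_response(stimulus, frame_rate_hz, decay_tau_s=2.0, kernel_len_s=10.0):
    """Causal convolution of a stimulus trace with an exponential decay kernel."""
    t = np.arange(0.0, kernel_len_s, 1.0 / frame_rate_hz)
    kernel = np.exp(-t / decay_tau_s)  # hypothetical indicator decay
    kernel /= kernel.sum()             # normalise so a sustained stimulus plateaus at 1
    # "full" convolution truncated to the stimulus length keeps the filter causal.
    return np.convolve(stimulus, kernel, mode="full")[: len(stimulus)]


# A 5-second dark flash (frames 20-29) sampled at a hypothetical 2 Hz.
stimulus = np.zeros(100)
stimulus[20:30] = 1.0
prediction = predicted_response(stimulus, frame_rate_hz=2.0)
```

      Because the kernel is causal, the prediction is zero before stimulus onset and rises only while the stimulus is on, which is why the rise time of a slow indicator says little about small latency differences between neurons.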

      Line 130: Is a correlation below 0.1 meaningful or significant? It does not seem like this cluster would be a motor or decision cluster.

      Our goal with this correlational analysis was to identify whether certain clusters of DF-responsive neurons were more associated with motor output, and therefore might lie further downstream in the sensori-motor cascade. Cluster 4 showed the highest median correlation across the population of cells. Whether a median correlation of ~0.1 is “meaningful” is impossible for us to answer, but it is highly “significant” in the statistical sense, as is evident from the 99.99999% confidence intervals plotted. We note that these cells were not selected based on their correlation to the motor stimulus, but only to the dark flash stimulus. There are “motor” clusters that show much higher correlations to the motor signals, as is evident in Figure 1G.

      Line 165: Did the changes observed for Pimozide fall below the significance threshold, were lethal, or were the results not repeated? It does not appear in source data 2.

      Pimozide was lethal in our screen and therefore does not appear in the source data file. Indeed, in our previous experiments with Pimozide we had already established that a 10 µM dose is lethal, and that the maximal effective dose we tried was 1 µM, as reported in (Randlett et al., Current Biology, 2019).

      We have clarified this in the text:

      “While the false negative rate is difficult to determine since so little is known about the pharmacology of the system, we note that of the three small molecules we previously established to alter dark flash habituation that were included in the screen, Clozapine, Haloperidol and Pimozide, the first two were identified among our hits while Pimozide was lethal at the 10 µM screening concentration.”

      Fig.1B and Fig.3B are the same data, which is awkward and should be explicitly stated. But the legends do not match in terms of the rest period. Which is correct? It is also important to note the other behavioural assays in the 'rest' period.

      We thank the reviewer for pointing out this discrepancy in the legend. We have corrected the typo in the figure legend of Figure 3B:

      “Habituation results in a progressive decrease in responsiveness to dark flashes repeated at 1-minute intervals, delivered in 4 training blocks of 60 stimuli, separated by 1hr of rest (from 0:00-7:00).”

      We have also added a statement that the data is the same as that in Figure 1B.
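
      The timing stated in the corrected legend can be checked with simple arithmetic. The sketch below assumes that each block lasts 60 minutes (60 stimuli at 1-minute intervals) and that the three rest periods fall between the four blocks; the constant and function names are illustrative only.

```python
N_BLOCKS = 4
STIMULI_PER_BLOCK = 60
INTER_STIMULUS_MIN = 1
REST_MIN = 60


def block_windows():
    """Return (start, end) times in minutes for each training block."""
    windows = []
    start = 0
    for _ in range(N_BLOCKS):
        end = start + STIMULI_PER_BLOCK * INTER_STIMULUS_MIN
        windows.append((start, end))
        start = end + REST_MIN  # rest period before the next block
    return windows


def as_clock(minutes):
    """Format minutes as h:mm."""
    return f"{minutes // 60}:{minutes % 60:02d}"


windows = block_windows()
schedule = [(as_clock(s), as_clock(e)) for s, e in windows]
```

      Under these assumptions the last block ends at 7:00, consistent with the 0:00-7:00 span given in the legend.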

      Figure 3-4: SSMD fingerprint, there is no description of the different behavioural parameters. What they represent is left to the reader's inference. There is no mention of SpontDisp in the GitHub for example, so it is hard to know how these different parameters were measured. Even referring to the previous manuscript on habituation (Randlet O et al, eLife, 2019) does not shed light on most of them, for example, I suppose TwoMvmt represents the 'double responses' from the previous manuscript. Furthermore, there are inconsistencies between 3C and 4B, some minor (SpontDisp becomes SpntDisp), but Curve-Tap has disappeared for example, and I suspect became BendAmp-Tap. A more thorough description of these measures, and making the naming scheme consistent, are essential for readers to know what they are looking at.

      We again thank the reviewer for their careful assessment of our data, and we apologize for this sloppiness. We have gone through and made the naming of these parameters consistent in both figures, and have added another supplementary table that describes in more detail what each parameter is, and how it relates to the analysis code (Figure3_sourcedata3_SSMDFingerprintParameters.xls). This was an essential missing piece of information from our original manuscript.
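
      As background for the fingerprint values themselves, the strictly standardized mean difference (SSMD) between two independent groups is commonly defined as the difference in means divided by the square root of the summed variances. The sketch below shows that standard definition; it is not taken from the authors' analysis code, the function name and sample values are hypothetical, and the use of sample variances is an assumption.

```python
import math
import statistics


def ssmd(treated, control):
    """SSMD for independent groups: (mean_t - mean_c) / sqrt(var_t + var_c)."""
    diff = statistics.mean(treated) - statistics.mean(control)
    # statistics.variance returns the sample variance (n - 1 denominator).
    spread = math.sqrt(statistics.variance(treated) + statistics.variance(control))
    return diff / spread


# Hypothetical values of one behavioural parameter for treated and control larvae.
example = ssmd([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

      A positive value indicates that the parameter is larger in the treated group than in controls, and a negative value that it is smaller.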

      Line 206: While this prioritization makes sense, how was it implemented, how was the threshold decided and which were they? A table, or supplementary figure, would help to clarify the reason behind the choices. Fig.4C being cropped only around the response probability makes it impossible to judge if the criteria were respected, as the main heatmap is too small. For example, the choice of GABA receptor antagonists is somewhat puzzling, as besides PTX it does not seem that the other compounds had strong effects, with Amoxapine for example having seemingly as much effect on Naive and Train, with little in Test. And Bicuculline gave negative SSMD for prob in the three cases. The dose-response for PTX does lend credence to its effect, but I would have liked the other compounds, especially bicuculline. The melatonin results, for example, are much more convincing and interesting in our opinion.

      While in hindsight it may have been possible to do the hit prioritization in a systematic way using thresholding and ranking, we did this manually by inspecting the clustered fingerprints. We have clarified this in the text: “This manual prioritization led to the identification of the GABAA/C Receptor antagonists…”

      While we agree that it is not possible to judge how well we performed this prioritization based on the images presented, we note that we do provide the full fingerprint data in the supplementary data, from which the reader is welcome to draw their own conclusions.

      We have not performed further experiments with amoxapine, so we cannot comment further on this. We did perform additional experiments with bicuculline, for which we did see effects similar to those of PTX, where habituation was inhibited. However, the effects are weaker and more variable than what we observe with PTX, and bicuculline also inhibits the initial responses of the larvae, causing their Naive response to be lower. Therefore we did not include it in our manuscript. We include these data here in Author response image 1 to reassure the Reviewer that picrotoxinin is not the only GABA Receptor antagonist for which we see inhibitory effects on habituation.

      Author response image 1.

      Fig.6: Why was the melatonin concentration used only 1um instead of 10um on the screen?

      Based on dose-response experiments (Figure 5B, and others not shown), we found that the effect of Melatonin on habituation saturates at about 1 µM, and therefore we used this dose.

      Line 277: As the correlation with motor output is marginal at best, and the authors recognize the lack of behaviour in tethered animals, I would be careful about such speculation. Especially since the other changes are complex and go in all directions.

      While we appreciate the reviewer's caution, we feel that our statement is appropriately hedged using “might be”. We have also removed the statement “and thus is most closely associated with behavioural initiation”.

      We now state:

      “However, opposite effects of PTX and Melatonin were observed for 4_L^{strgD} neurons ([fig:6]C), which we found to be most strongly correlated with motor output ([fig:2]F). Therefore, this class might be most critical for habituation of response Probability.”

      Fig.7: I am not sure how convincing these results are. 7F may have been more convincing, but to be thorough the authors would need to register the Gad1b identity to the calcium imaging and use their outline to extract the neuron's fluorescence. As it is, in the tectum, it is hard to be sure that all the identified neurons are indeed Gad1b positive, as that population is intermingled with other neuronal populations. The authors should consider the approach of Lovett-Barron M et al, Nat Neuro, 2020. Alternatively, the authors can tone down the language used in this section to match the confidence level of the association they propose.

      Figure 7A-E are what can be considered “virtual colocalization” analyses, where we are comparing the localization of data acquired in different experiments using image registration to common atlas coordinates. We agree that these results alone will never be very strong evidence for the identification of individual cells. The MultiMAP approach of Lovett-Barron is a powerful approach, though it makes the assumption that registration accuracy will be subcellular, which in practice may often not be the case. We believe that a better approach is to label the cells of interest during the Ca2+ imaging experiment itself, as we did in Figure 7F and G. The challenge in this experiment is binarizing the ROIs and thus deciding what is and is not a Gad1b-positive cell. In our opinion, the fact that these two independent experiments came to the same conclusion regarding Clusters 10 and 11 is good evidence that these cell types are likely predominantly GABAergic.

      As discussed above, we have re-written the manuscript to tone down our claims about the role of GABA and GABAergic neurons in habituation, which we hope the reviewer will agree better reflects the limitations of the data in Figure 6 and 7.

      Line 317: Based on the somewhat inconsistent results of the other GABA antagonists, I would be careful. Picrotoxin has been reported to antagonize other receptors besides GABA, see Das P et al, Neuropharma, 2003. So the results may be explained by a complex set of effects on multiple pathways with PTX.

      Off-target effects are an important concern with any pharmacological experiment, and perhaps especially in zebrafish where receptors and targets can be quite divergent from those in mammals where most drug targets have been characterized. We have added this sentiment to the discussion:

      “We cannot rule out the possibility that off-targets of PTX, or subtle non-specific changes in excitatory/inhibitory balance alter habituation behaviour.”

      Line 400-403, 430: There are some conflicting statements regarding the potential role of clusters 1 and 2 in DF habituation. Do the authors think they play a role in the behaviour measured in this manuscript? Could they clarify what they mean?

      We see how our original statement in line 429 about the presence of cluster 1 and 2 neurons in the TL implied a role in dark flash habituation. This was not our intent, and we have removed “which also contains high concentrations of on-responding neurons”.

      Our thoughts on these neurons are now stated in the discussion as:

      “We also observed classes exhibiting an On-response profile ( and ). These neurons fire at the ramping increase in luminance after the DF, making it unlikely that they play a role in aspects of acute DF behaviour we measured here. These neurons exist in both non-adapting and depressing forms suggesting a yet unidentified role in behavioural adaptation to repeated DFs.”

      Minor comments

      Line 73 (and elsewhere): Why use adaptation instead of habituation (also in the adaptation profile)? Do you suspect your observations do not reflect habituation, but a sensory adaptation mechanism?

      We have used the convention that “habituation” refers to observations at the behavioural level, while “depression” and “potentiation” refer to observations at the neuronal level. We use the term “adaptation” to refer to neuronal adaptations of either sign (depression or potentiation), as in line 73.

      We believe that our observations reflect neuronal adaptations that underlie habituation behaviour.

      Line 71: It is debatable that the strongest learning happens in the first block, the difference between the first and last response seems to grow larger with each successive block. What do the authors mean by 'strongest'

      We agree that “strongest” was ambiguous. We have changed this to “initial”:

      “We focused on a single training block of 60 DFs to identify neuronal adaptations that occur during the initial phase of learning.”

      Fig.1F: there is no rastermap call in the GitHub repository, was the embedding done in the GUI? If so, it should also be shared for reproducibility's sake.

      Yes, Fig.1F was created using the suite2p GUI, as we have now clarified in the methods:

      “The clustered heatmap image of neural activity ([fig:3]F) was generated using the suite2p GUI using the “Visualize selected cells” function, and sorting the neurons using the rastermap algorithm.”

      The image is available in the “Figure1 - Ca2Imaging.svg” file available here: https://github.com/owenrandlett/lamire_2022/tree/main/LamireEtAl_2022

      Line 101: while true that AffinityPropagation does not require input on the number of clusters, preference can influence the number of clusters. It seems that at least two values were tested in the search for the clusters, can the authors comment on how many clusters the other preference value converged (or failed to converge) on?

      Indeed, as with any clustering approach, the resultant clusters are highly dependent on the input parameters, in this case the “preference”, as well as “damping” and the choice of affinity metric. By varying these parameters one can arrive at anywhere between 2 and hundreds of clusters.

      It is for this reason that we feel that the anatomical analyses of these clusters are very important, making the assumption that neurons of differing functional types will have different localizations in the brain, as we explained in the Results:

      “While these results indicate the presence of a dozen functionally distinct neuron types, such clustering analyses will force categories upon the data irrespective of if such categories actually exist. To determine if our cluster analyses identified genuine neuron types, we analyzed their anatomical localization ([fig:2]C-E). Since our clustering was based purely on functional responses, we reasoned that anatomical segregation of these clusters would be consistent with the presence of truly distinct types of neurons.”

      We also acknowledge in the Results that the clustering approach has limitations:

      “These results highlight a diversity of functional neuronal classes active during DF habituation. Whether there are indeed 12 classes of neurons, or if this is an over- or under-estimate, awaits a full molecular characterization. Independent of the precise number of neuronal classes, we proceed under the hypothesis that these clusters define neurons that play distinct roles in the DF response and/or its modulation during habituation learning.”

      Fig.2. My understanding is that the cluster numbers are arbitrary unless there is a meaning to them, which then should be explained. I would recommend grouping the clusters per functional category as in Fig.6 to make it easier for the reader.

      Cluster number reflects the ordering in the hierarchical clustering tree shown in Figure 2B. We feel that this is the most logical representation of their functional similarity. We have clarified this in the Methods:

      “We then used the Affinity Propagation clustering from scikit-learn, with “affinity” computed as the Pearson product-moment correlation coefficients (corrcoef in NumPy), preference=-9, and damping=0.9, and clustered using Hierarchical clustering (cluster.hierarchy in SciPy). Cluster number was assigned based on the ordering of the hierarchical clustering tree.”
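
      The two steps in this description can be sketched as follows. This is an illustration under stated assumptions, not the code in the authors' repository: the function names, the synthetic traces, the use of cluster-mean traces, and the "average" linkage with a correlation metric are all assumptions. Only the precomputed Pearson affinity, preference=-9, and damping=0.9 come from the text.

```python
import numpy as np
from scipy.cluster import hierarchy


def affinity_propagation_labels(traces, preference=-9, damping=0.9, seed=0):
    """Cluster the rows of `traces` using Pearson correlation as the affinity."""
    # Imported here so the renumbering step below can run without scikit-learn.
    from sklearn.cluster import AffinityPropagation

    similarity = np.corrcoef(traces)
    model = AffinityPropagation(affinity="precomputed", preference=preference,
                                damping=damping, random_state=seed)
    return model.fit(similarity).labels_


def renumber_by_tree(traces, labels):
    """Renumber clusters 1..K following the leaf order of a hierarchical tree."""
    ids = np.unique(labels)
    means = np.vstack([traces[labels == i].mean(axis=0) for i in ids])
    # Assumption: average linkage on correlation distance between cluster means.
    tree = hierarchy.linkage(means, method="average", metric="correlation")
    order = hierarchy.leaves_list(tree)
    new_id = {int(ids[old]): rank + 1 for rank, old in enumerate(order)}
    return np.array([new_id[int(l)] for l in labels])


# Synthetic example with arbitrary starting labels 7, 2 and 5 (four traces each).
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 50)
shapes = {7: np.sin(t), 2: np.sin(t + 0.2), 5: -np.sin(t)}
labels = np.repeat([7, 2, 5], 4)
traces = np.vstack([shapes[int(l)] + 0.05 * rng.standard_normal(t.size) for l in labels])
new_labels = renumber_by_tree(traces, labels)
```

      In the synthetic example, the two clusters with similar response shapes merge first in the tree and therefore receive adjacent numbers, which is the sense in which cluster number reflects functional similarity.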

      Fig.3 SSMD fingerprint, it would be much easier for the readers if the list of parameters was clearer and rotated 90 degrees. Maybe in a supplementary figure to show what each represents.

      We agree that the SSMD fingerprint is very difficult to interpret. As discussed above, we have now included a supplementary table (Figure3_sourcedata2_SSMDFingerprintParameters.xlsx) where we have clarified what each parameter represents.

      Fig.4: The use of the same colours across the clustering methods is confusing, especially after the use of colours for the SSMD fingerprint in Fig.3. and at the bottom of 4A. Fig.4A for example could have been colour coded according to the most affected behaviour in the fingerprint at the bottom.

      Fig.4B the coloured text is difficult to read, especially for the lighter colours.

      We agree that our use of color is not perfect, but we have attempted to use them consistently: for example when referring to a functional cluster, or a drug manipulation. We don’t think that there is a sufficient number of distinguishable colors for us to never use the same color twice.

      Fig.4C if the goal is to show similarity, the relevant drugs could be placed adjacent to each other. One could also report the Euclidean distance, or compute how correlated the different fingerprints are within one pharmacological target space.

      The goal of Fig 4C is to highlight where Bicuculline, Amoxapine, Picrotoxinin, Melatonin, Ethinyl Estradiol and Hexestrol lie within the clustered heatmap of the behavioural fingerprints (Fig 4A), and demonstrate how the probability of response to dark flashes is modulated by these drugs. In our analyses, “similarity” is a function of the clustering distance.

      Fig.6D 'Same data as M, ...' I assume should be 'Same data as C,...'

      Indeed, thank you for pointing out this error that we have corrected.

      Fig. 7 How many GCaMP6s double transgenic larvae were imaged?

      Six fish were imaged, as stated in the legend to Fig 7G.

      Line 407: all is repeated.

      We apologize, but we do not see what is repeated at line 407. Can you please clarify?

      Line 481: Would testing spontaneous activity after training for 7h be unbiased, could there be fatigue effects?

      We tested for fatigue effects in our previous study, comparing larvae that received the training for 7hrs and those that did not, and we saw no deficits in spontaneous activity, tap response, or OMR performance (Figure S1, Randlett et al., Current Biology, 2019).

      Line 610: There are some inconsistencies between the authors' contributions in the manuscript and the one provided to eLife.

      Thank you, we will double check this in the resubmission forms. The authors' contributions in the manuscript are correct.

      Reviewer #3 (Recommendations For The Authors):

      I would rather recommend the authors divide this manuscript into two and publish two papers by adding some more strengthening data for each part such as cellular manipulations, e.g. ablation to prove the critical involvement of 12(Pot, M) neurons in habituation.

      We thank the reviewer for their suggestion, but have opted not to split the paper into two. We feel that the collective message of this paper and approach combining molecular and functional analysis will be of interest, and we believe the incongruencies in our results reflect the complexity inherent within the system.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment:

      The study answers the important question of whether the conformational dynamics of proteins are slaved by the motion of solvent water or are intrinsic to the polypeptide. The results from neutron scattering experiments, involving isotopic labelling, carried out on a set of four structurally different proteins are convincing, showing that protein motions are not coupled to the solvent. A strength of this work is the study of a set of proteins using spectroscopy covering a range of resolutions, however, it suffers from some scholarly shortcomings and limited discussion of results. The work is of broad interest to researchers in the fields of protein biophysics and biochemistry.

      Reply 1: We thank the editors and reviewers for the positive and encouraging comments.

      Reviewer #1 (Public Review):

      Summary:

      Zheng et al. study the 'glass' transitions that occur in proteins at ca. 200K using neutron diffraction and differential isotopic labeling (hydrogen/deuterium) of the protein and solvent. To overcome limitations in previous studies, this work is conducted in parallel with 4 proteins (myoglobin, cytochrome P450, lysozyme, and green fluorescent protein) and experiments were performed at a range of instrument time resolutions (1ns - 10ps). The author's data looks compelling, and suggests that transitions in the protein and solvent behavior are not coupled and contrary to some previous reports, the apparent water transition temperature is a 'resolution effect'; i.e. instrument response is limited. This is likely to be important in the field, as a reassessment of solvent 'slaving' and the role of the hydration shell on protein dynamics should be reassessed in light of these findings.

      Strengths:

      The use of multiple proteins and instruments with a range of energy resolutions/timescales.

      Reply 2: We thank the reviewer for highlighting our key findings.

      Weaknesses:

      The paper could be organised to better allow the comparison of the complete dataset collected. The extent of hydration clearly influences the protein transition temperature. The authors suggest that "water can be considered here as lubricant or plasticizer which facilitates the motion of the biomolecule." This may be the case, but the extent of hydration may also alter the protein structure.

      Reply 3: Following the reviewer’s suggestion, we studied the secondary structure content and tertiary structure of CYP protein at different hydration levels (h = 0.2 and 0.4) through molecular dynamics simulation. As shown in Table S2 and Figure S6, the extent of hydration does not alter the protein secondary structure content and overall packing. Thus, this result also suggests that water molecules have more influence on protein dynamics than on protein structure.

      Reviewer #2 (Public Review):

      Summary:

      The manuscript entitled "Decoupling of the Onset of Anharmonicity between a Protein and Its Surface Water around 200 K" by Zheng et al. presents a neutron scattering study trying to elucidate if at the dynamical transition temperature water and protein motions are coupled. The origin of the dynamical transition temperature has been highly debated for decades, specifically its relation to hydration.

      Strengths:

      The study is rather well conducted, with a lot of effort to acquire the perdeuterated proteins, and some results are interesting.

      Reply 4: We thank the reviewer for highlighting our key findings.

      Weaknesses:

      The present work could certainly contribute some arguments, but I have the feeling that not all known facts are properly discussed.

      The points the authors should carefully discuss are the following:

      (1) Daniel et al. (10.1016/S0006-3495(98)77694-5) have shown that enzymes can be functional below the dynamical transition temperature which is at odds with some of the claims of the authors.

      Reply 5: Following the reviewer’s suggestion, we added the following paragraph to the Introduction of the revised main text.

      “Although exceptions have been reported (Biophys. J. 1998, 75, 2504), the dynamical transition has been linked to the thermal onset of function in a number of proteins, e.g., myoglobin (Biochemistry, 1975, 14, 5355-5373), ribonuclease (Nature, 1992, 357, 423-424), elastase (Biochemistry, 1994, 33, 9285-9293) and bacteriorhodopsin (PNAS, 1993, 90, 9668-9672), all of which become inactive below the dynamical transition temperature.”

      (2) It is not as easy to say that protonated proteins in D2O reflect protein dynamics while perdeuterated proteins in H2O reflect water dynamics. A recent study by Nidriche et al. (PRX LIFE 2, 013005 (2024)) reveals that H <-> D exchange is much faster than usually assumed and has important consequences for such studies.

      Reply 6: For the sample preparation, all the H-proteins were dissolved in D2O to allow full deuterium exchange of all exchangeable hydrogen atoms and then lyophilized for 12 hours to obtain the dry sample. The lyophilized H-protein is then put into a desiccator with D2O, placed in the glove box purged with nitrogen gas, to absorb D2O till the desired hydration level, h (gram water/gram protein). In contrast, the preparation of the deuterated proteins was conducted in the opposite way. The D-proteins were dissolved in H2O to allow full hydrogen exchange of all exchangeable deuterium atoms and then lyophilized for 12 hours to obtain the dry sample. The lyophilized D-protein is then put into a desiccator with H2O to absorb H2O till the desired h. This procedure can avoid H-D exchange during experiments. We added the above methods into the revised SI.

      (3) A publication by Jasnin et al. (10.1039/b923878f) on heparin sulfate shows a resolution effect.

      Reply 7: Based on the data from Jasnin et al. (10.1039/b923878f), we found that the dynamical transition of heparan sulfate did not exhibit a strong resolution effect. Estimating the dynamical transition from the mean square displacement (MSD) for nanosecond motions in all heparan sulfate samples is challenging because no data are available on the nanosecond motion of HS-dry.

      (4) The authors should discuss the impact of the chosen q-range on their findings (see Phys. Chem. Chem. Phys., 2012, 14, 4927-4934, where the authors see a huge effect!).

      Reply 8: Following the reviewer's suggestion, we calculated Ton of H-protein in D2O in the q-ranges 0.45-0.9 Å⁻¹ and 1.1-1.75 Å⁻¹. The results are summarized in Tables S2 and S3 and show that the choice of q-range does not alter the Ton of the proteins. We added these results to the revised SI.

      (5) The authors underline that the dynamical transition is intrinsic to the protein. However, Cupane et al. (ref 12) have shown that it can also be found in a mixture of amino acids without any protein backbone.

      Reply 9: Following the reviewer’s suggestion, we added the following discussion into the revised main text.

      “Unfreezing of the protein structural relaxation might facilitate these conformational jumps, turning on its functionality. However, as revealed by Ref (Journal of biological physics, 2010, 36, 291-297.), the denatured form of lysozyme also exhibits a dynamical transition, similar to that seen in its folded native form. Additionally, the dynamical transition also can be found in the mixture of amino acids (Physical Review Letters, 2012, 109, 128102.). Hence, one can argue that the activation of the structural relaxation of the biomolecule above the dynamical transition temperature is a necessary but insufficient condition for the protein to function, as the latter also requires the biomolecule assuming the correctly folded 3-dimensional structure.”

      (6) The authors say that they find similar dependences from MSD. They should explain that the MSD is inversely proportional to the summed intensities squared.

      Reply 10: Following the reviewer’s suggestion, we added the estimation of mean-squared atomic displacement (MSD) in the revised SI.

      “The mean-squared atomic displacement was estimated by performing Gaussian approximation, where . The values of q used for Gaussian fitting range from 0.45 to 0.9 Å⁻¹ (Biophys. J. 2006, 91, 2573).”
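
      To illustrate how an MSD estimate of this kind is typically obtained, the sketch below fits the logarithm of the elastic intensity against q² over the stated q-range. The exact expression used by the authors is not reproduced in the text above, so the form assumed here, S_el(q) = exp(-q²⟨x²⟩/6), is one common convention (others use a factor of 1/3), and the function name and synthetic data are hypothetical.

```python
import numpy as np


def msd_from_elastic(q, s_el, q_min=0.45, q_max=0.9):
    """Estimate <x^2> from ln S_el(q) = -q^2 <x^2> / 6 (assumed convention)."""
    q = np.asarray(q, dtype=float)
    s_el = np.asarray(s_el, dtype=float)
    mask = (q >= q_min) & (q <= q_max)  # restrict the fit to the chosen q-range
    slope, _intercept = np.polyfit(q[mask] ** 2, np.log(s_el[mask]), 1)
    return -6.0 * slope


# Synthetic elastic intensities generated with a known MSD of 0.5 (units of Å^2).
q_values = np.linspace(0.3, 1.8, 16)  # in Å^-1
true_msd = 0.5
intensities = np.exp(-(q_values ** 2) * true_msd / 6.0)
estimated_msd = msd_from_elastic(q_values, intensities)
```

      Changing `q_min` and `q_max` is one way to check whether an estimate depends on the chosen q-range, which is the concern raised in comment (4).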

      (7) A decoupling between water dynamics and membrane dynamics has already been discussed by K. Wood, G. Zaccai et al.

      Reply 11: Following the reviewer’s suggestion, we added the discussion in revised main text. “The results from the neutron scattering experiments suggest that the dynamical transition in proteins is an intrinsic property of the biomolecule and strongly depends on the amount of water surrounding it. Such an intrinsic transition can result either from a critical phase transition, e.g., water to ice (PNAS 2007, 104, 18049-18054.; JPCB, 1999, 103, 8036-8050), or from freezing of the structural relaxation of the system beyond the equilibrium time (~100-1000 s) of the experiment, in analogy to the glass transition in polymers from rubbery state to the glass form (Philosophical Magazine, 2004, 84, 1341-1353.; Science, 1995, 267, 1939-1945.; Colloid and Polymer Science, 1995, 273, 413-420.).”

      (8) The fact that transition temperature in lipid membranes is higher when the membrane is dry is also well known (A.V. Popova, D.K. Hincha, BMC Biophys. 4, 11 (2011)).

      Reply 12: We agree with the reviewer that it is well known that the transition temperature in lipid membranes is higher when the membrane is dry. We have cited this work as a reference.

      (9) The authors should mention the slope (K/min) they used for DSC and discuss the impact of it on the results.

      Reply 13: Following the reviewer’s suggestion, we added the details of the DSC measurements to the revised SI. “DSC measurements were performed using the METTLER DSC3+ instrument. The sample was sealed in an aluminum pan. An empty pan was used as a reference. All the experiments were carried out in the temperature range from 150 to 300 K with a heating rate of 1 K/min. The heating rate of the DSC measurements is the same as that of the neutron experiments.”

      (10) In the introduction, the authors should present the different explanations forwarded for the dynamical transition.

      Reply 14: Following the reviewer’s suggestion, we added different explanations forwarded for the dynamical transition in the Introduction in revised main text.

      “The dynamical transition of protein represents a significant change in the internal mobility of proteins, which has garnered various explanations. One theory suggests it's due to the behavior of water in the hydration shell, transitioning from rigid to fluid at certain temperatures, thus influencing protein flexibility. Another theory considers the transition as an inherent property of the protein, where thermal energy allows the protein to access a wider range of conformations.”

      Reviewer #1 (Recommendations For The Authors):

      A major strength of the work is the parallel experiments performed on each of the 4 proteins. To allow better comparison of these it would be helpful to present these combined data in relevant figures to make a side-by-side comparison easier. A summary table of Ton (and potentially TDSC) values would also be helpful.

      Reply 15: Following the reviewer’s suggestion, we summarized the Ton of proteins in Table S5 and Table S6.

      The effect of hydration on protein structure should be considered. Alterations in protein secondary and tertiary structure would be expected to alter dynamics and thus could be seen as a change in Ton.

      Reply 16: The detailed analysis and discussion are presented in Reply 3.

      No uncertainty (error) in Ton values is presented. Could these be estimated from e.g. a comparison of protein Ton values measured under identical sample conditions with different spectrometers?

      Reply 17: It would be hard to compare Ton of proteins measured with different spectrometers because different spectrometers have different energy resolutions. For example, the energy resolutions are 1 μeV for HFBS, 13 μeV for DNA, and 25.4 μeV and 100 μeV for OSIRIS.

      More detail is needed to correctly describe/define the proteins used for the study - e.g. P450 is a family of enzymes, so which one was used?

      Reply 18: We used P450 from Pseudomonas putida for the study. The PDB ID is 2ZAX. We added this information in the revised SI.

      P450 and myoglobin also have heme cofactors. Were these deuterated as part of the protein preparation?

      Reply 19: The heme cofactors were deuterated as part of the protein preparation. For D-protein, the entire E. coli cell culture is deuterated.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors investigated how partial loss of SynGap1 affects inhibitory neurons derived from the MGE in the auditory cortex, focusing on their synaptic inputs and excitability. While haploinsufficiency of SynGap1 is known to lead to intellectual disabilities, the underlying mechanisms remain unclear.

      Strengths:

      The questions are novel.

      Weaknesses:

      Despite the interesting and novel questions, there are significant issues regarding the experimental design and potential misinterpretations of key findings. Consequently, the manuscript contributes little to our understanding of SynGap1 loss mechanisms.

      Major issues in the second version of the manuscript:

      In the review of the first version, there were major issues and contradictions with the sEPSC and mEPSC data. These were not resolved after the revision, and the new control experiments rather confirmed the contradiction.

      In the original review I stated: "One major concern is the inconsistency and confusion in the intermediate conclusions drawn from the results. For instance, while the sEPSC data indicates decreased amplitude in PV+ and SOM+ cells in cHet animals, the frequency of events remains unchanged. In contrast, the mEPSC data shows no change in amplitudes in PV+ cells, but a significant decrease in event frequency. The authors conclude that the former observation implies decreased excitability. However, traditionally, such observations on mEPSC parameters are considered indicative of presynaptic mechanisms rather than changes of network activity. The subsequent synapse counting experiments align more closely with the traditional conclusions. This issue can be resolved by rephrasing the text. However, it would remain unexplained why the sEPSC frequency shows no significant difference. If the majority of sEPSC events were indeed mediated by spiking (which is blocked by TTX), the average amplitudes and frequency of mEPSCs should be substantially lower than those of sEPSCs. Yet, they fall within a very similar range, suggesting that most sEPSCs may actually be independent of action potentials. But if that was indeed the case, the changes of purported sEPSC and mEPSC results should have been similar."

      Contradictions remained after the revision of the manuscript. On one hand, the authors claimed in the revised version that "We found no difference in mEPSC amplitude between the two genotypes (Fig. 1g), indicating that the observed difference in sEPSC amplitude (Figure 1b) could arise from decreased network excitability". On the other hand, later they show "no significative difference in either amplitude or inter-event intervals between sEPSC and mEPSC, suggesting that in acute slices from adult A1, most sEPSCs may actually be AP independent." The latter means that sEPSCs and mEPSCs are the same type of events, which should have the same sensitivity to manipulations.

      We understand that the data are confusing. Our results suggest a diverse population of PV+ cells, with varying reliance on action potential-dependent and -independent release. Several PV+ cells indeed show TTX sensitivity (reduced EPSC event amplitudes following TTX application: see Fig.1c-f, at the end of this document), but their individual responses are diluted when all cells are pooled together. To account for this variability, we are currently recording sEPSCs followed by mEPSCs from more mice of both genotypes. We will rephrase the text to reflect the updated data accordingly, in keeping with the editors’ and reviewers’ suggestions.

      Concerns about the quality of the synapse counting experiments were addressed by showing additional images in a different and explaining quantification. However, the admitted restriction of the analysis of excitatory synapses to the somatic region represents a limitation, as these synapses include only a small fraction of the total excitation, even if the slightly larger amplitudes of their EPSPs are considered.

      We agree with the reviewer that restricting the anatomical analysis of excitatory synapses to the PV cell somatic region is a limitation, which we have already highlighted in the discussion of the revised manuscript. Recent studies, based on serial block-face scanning electron microscopy, suggest that cortical PV+ interneurons receive more robust excitatory inputs to their perisomatic region as compared to pyramidal neurons (see for example, Hwang et al. 2021, Cerebral Cortex, http://doi.org/10.1093/cercor/bhaa378). It is thus possible that putative glutamatergic synapses, analysed by vGlut1/PSD95 colocalisation around PV+ cell somata, may represent a major excitatory input population. Similar immunolabeling and quantification approaches coupled with mEPSC analysis have been reported in several publications by other labs (for example Bernard et al 2022, Science 378, doi: 10.1126/science.abm7466; Exposito-Alonso et al, 2020 eLife, doi: 10.7554/eLife.57000). Since analysing putative excitatory synapses onto PV+ dendrites would be difficult and require a much longer time, we will rephrase the text to more clearly highlight the rationale and limitation of this approach.

      New experiments using paired-pulse stimulation provided an answer to issues 3 and 4. Note that the numbering of the Figures in the responses and manuscript are not consistent.

      We are glad that the reviewer found that the new paired-pulse experiments answered previously raised concerns. We will correct the discrepancy in figure numbers in the manuscript.

      I agree that the low sampling rate of the APs does not change the observed large differences in AP threshold; however, the phase plots are still inconsistent in the sense that there appears to be an offset, as all values are shifted to more depolarized membrane potentials, including threshold, AP peak, and AHP peak. This consistent shift may be due to non-biological differences in the two sets of recordings, and, importantly, it may negate the interpretation of the I/f curve results (Fig. 5e).

      We agree with the reviewers that a higher sampling rate would allow a more accurate assessment of different parameters, such as AP height, half-width, rise time, etc., while it would not affect the large differences in AP threshold we observed between control and mutant mice. Since the phase plots do not add to our analysis of the results, we will remove them. The offset shown in Fig.5 was due to the unfortunate choice of two random neurons; this offset is not present in the different examples shown in Fig.7. We apologize for the confusion.

      Additional issues:

      The first paragraph of the Results mentioned that the recorded cells were identified by immunolabelling and axonal localization. However, neither the Results nor the Methods mention the criteria and levels of measurements of axonal arborization.

      As suggested, we will add this information in the revised manuscript.

      The other issues of the first review were adequately addressed by the Authors and the manuscript improved by these changes.

      Reviewer #3 (Public review):

      This paper compares the synaptic and membrane properties of two main subtypes of interneurons (PV+, SST+) in the auditory cortex of control mice vs mutants with Syngap1 haploinsufficiency. The authors find differences between control and mutants in both interneuron populations, although they claim a predominance in PV+ cells. These results suggest that altered PV-interneuron functions in the auditory cortex may contribute to the network dysfunctions observed in Syngap1 haploinsufficiency-related intellectual disability.

      The subject of the work is interesting, and most of the approach is rather direct and straightforward, which are strengths. There are also some methodological weaknesses and interpretative issues that reduce the impact of the paper.

      (1) Supplementary Figure 3: recording and data analysis. The data of Supplementary Figure 3 show no differences either in the frequency or amplitude of synaptic events recorded from the same cell in control (sEPSCs) vs TTX (mEPSCs). This suggests that, under the experimental conditions of the paper, sEPSCs are AP-independent quantal events. However, I am concerned by the high variability of the individual results included in the Figure. Indeed, several datapoints show dramatically different frequencies in control vs TTX, which may be explained by unstable recording conditions. It would be important to present these data as time course plots, so that stability can be evaluated. Also, the claim of lack of effect of TTX should be corroborated by positive control experiments verifying that TTX is working (block of action potentials, for example). Lastly, it is not clear whether the application of TTX was consistent in time and duration in all the experiments and the paper does not clarify what time window was used for quantification.

      We understand the reviewer’s concern about high variability. To account for this variability, we are currently recording sEPSC followed by mEPSC from more mice of both genotypes.

      Indeed, we confirmed that TTX was working several times through the time course of this study, in different aliquots prepared from the same TTX vial used for all experiments. The results of the last test we performed, showing that TTX application blocks action potentials (2 recordings, one from a SST+ and one from a PV+ interneuron), are shown in Fig.1a,b at the end of this document. TTX was applied using the same protocol for all recorded neurons. In particular, sEPSCs were first sampled over a 2 min period. TTX (1μM; Alomone Labs) was then perfused into the recording chamber at a flow rate of 2 mL/min. We then waited for 5 min before sampling mEPSCs over a 2 min period. We will add this information in the revised manuscript methods. Finally, Fig.1g-j shows series resistance (Rs) over time for 4 different PV+ interneurons, indicating recording stability. These results are representative of the entire population of recorded neurons, which we have meticulously analysed one by one.

      (2) Figure 1 and Supplementary Figure 3: apparent inconsistency. If, as the authors claim, TTX does not affect sEPSCs (either in the control or mutant genotype, Supplementary Figure 3 and point 1 above), then comparing sEPSC and mEPSC in control vs mutants should yield identical results. In contrast, Figure 1 reports a _selective_ reduction of sEPSCs amplitude (not in mEPSCs) in mutants, which is difficult to understand. The proposed explanation relying on different pools of synaptic vesicles mediating sEPSCs and mEPSCs does not clarify things. If this was the case, wouldn't it also imply a decrease of event frequency following TTX addition? However, this is not observed in Supplementary Figure 3. My understanding is that, according to this explanation, recordings in control solution would reflect the impact of two separate pools of vesicles, whereas, in the presence of TTX, only one pool would be available for release. Therefore, TTX should cause a decrease in the frequency of the recorded events, which is not what is observed in Supplementary Figure 3.

      Our results suggest a diverse population of PV+ cells, with varying reliance on action potential-dependent and -independent release. Several PV+ cells indeed show TTX sensitivity (reduced EPSC event amplitudes following TTX application: See Fig.1c-f, at the end of this document), but their individual responses are diluted when all cells are pooled together. As mentioned above, we are currently recording sEPSCs followed by mEPSCs from more mice of both genotypes, to account for the large variability. We will rephrase the text in the revised manuscript according to the updated data and reviewers’ suggestions.

      (3) Figure 1: statistical analysis. Although I do appreciate the efforts of the authors to illustrate both cumulative distributions and plunger plots with individual data, I am confused by how the cumulative distributions of Figure 1b (sEPSC amplitude) may support statistically significant differences between genotypes, but this is not the case for the cumulative distributions of Figure 1g (inter mEPSC interval), where the curves appear even more separated. A difference in mEPSC frequency would also be consistent with the data of Supplementary Fig 2b, which otherwise are difficult to reconcile. I would encourage the authors to use the Kolmogorov-Smirnov test rather than a t-test for the comparison of cumulative distributions.

      We thank the reviewer for this suggestion. We used both cumulative distributions and plunger plots with individual data because they convey two different kinds of information. Cumulative distributions highlight where the differences lie (the deltas between the groups), while plunger plots with individual data show the variability between data points. In histogram 1g, the variability is greater than in 1b (due to the smaller sample size in 1g), which leads to larger error bars and directly impacts the statistical outcome. So, while the delta is larger in 1g, the variability is also greater. In contrast, the delta in 1b is smaller, as is the variability, which in turn affects the statistical outcome. To address this issue, we are currently increasing the number of recordings.
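      The comparison of two cumulative distributions by the Kolmogorov-Smirnov approach can be sketched as follows. This is a minimal illustration, not the analysis used in the manuscript: the function name and the amplitude values are hypothetical, and only the test statistic D (the largest vertical gap between the two empirical cumulative distributions) is computed.

```python
def ks_two_sample_statistic(sample_a, sample_b):
    """Return D = max |F_a(x) - F_b(x)| over all observed values x."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    i = j = 0
    d_max = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both empirical CDFs past every observation equal to x,
        # so that tied values are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d_max = max(d_max, abs(i / len(a) - j / len(b)))
    return d_max

# Hypothetical event amplitudes (pA) from two groups.
control_amplitudes = [1.0, 2.0, 3.0, 4.0]
mutant_amplitudes = [3.0, 4.0, 5.0, 6.0]
d_statistic = ks_two_sample_statistic(control_amplitudes, mutant_amplitudes)
```

      If SciPy is available, `scipy.stats.ks_2samp` returns the same statistic together with a p-value. Note that the test treats every event as an independent observation, so pooling many events from few cells or animals can overstate significance.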

      We will include Kolmogorov-Smirnov analysis in the revision, as suggested; nevertheless, we will base our conclusions on statistical results generated by the linear mixed model (LMM), modelling animal as a random effect and genotype as the fixed effect. We used this statistical analysis since we considered the number of mice as independent replicates and the number of cells in each mouse as repeated/correlated measures. We decided to use LMM for our statistical analyses because of the growing concern over reproducibility in biomedical research and the ongoing discussion on how data are analysed (see for example, Yu et al. (2022), Neuron 110:21-35, https://doi.org/10.1016/j.neuron.2021.10.030; Aarts et al. (2014), Nat Neurosci 17, 491–496, https://doi.org/10.1038/nn.3648). We acknowledge that patch-clamp data have historically been analysed using t-tests and analysis of variance (ANOVA), or equivalent non-parametric tests. However, these tests assume that individual observations (recorded neurons in this case) are independent of each other. Whether neurons from the same mouse are independent or correlated variables is an unresolved question, but independence does not appear likely from a biological point of view. Statisticians have developed effective methods to analyze correlated data, including LMM. In parallel, we also tested the data by using the standard parametric and non-parametric analyses and reported these results as well (Tables 1-9, and S1-S2).
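      The nesting described above (cells within mice) can be illustrated with a deliberately simpler stand-in for an LMM: collapsing cells to one mean per animal and comparing animals rather than cells. This sketch is not the linear mixed model used in the manuscript; it only shows why the animal, not the cell, is treated as the independent replicate. All names and values are hypothetical.

```python
import math
from statistics import mean, stdev

def per_animal_means(records):
    """Collapse (animal_id, genotype, value) cell records to one mean per animal."""
    by_animal = {}
    for animal_id, genotype, value in records:
        by_animal.setdefault((animal_id, genotype), []).append(value)
    means = {}
    for (_animal_id, genotype), values in by_animal.items():
        means.setdefault(genotype, []).append(mean(values))
    return means

def welch_t(group_a, group_b):
    """Welch's t statistic for two groups of independent observations."""
    var_a = stdev(group_a) ** 2 / len(group_a)
    var_b = stdev(group_b) ** 2 / len(group_b)
    return (mean(group_a) - mean(group_b)) / math.sqrt(var_a + var_b)

# Hypothetical AP thresholds (mV); several cells are recorded per mouse.
cells = [
    ("m1", "control", -43.0), ("m1", "control", -42.0),
    ("m2", "control", -44.0), ("m2", "control", -42.0),
    ("m3", "control", -41.5),
    ("m4", "cHet", -36.0), ("m4", "cHet", -35.0),
    ("m5", "cHet", -34.0), ("m5", "cHet", -36.0),
    ("m6", "cHet", -36.5),
]
animal_means = per_animal_means(cells)
t_stat = welch_t(animal_means["control"], animal_means["cHet"])
```

      Averaging per animal discards within-animal information that a mixed model retains; a full LMM with a random intercept per animal would typically be fitted with a dedicated statistics package.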

      (4) Methods. I still maintain that a threshold at around -20/-15 mV for the first action potential of a train seems too depolarized (see some datapoints of Fig 5c and Fig 7c) for a healthy spike. This suggests that some cells were either in precarious conditions or that the capacitance of the electrode was not compensated properly.

      As suggested by the reviewer, we will exclude the neurons with threshold at -20/-15 mV. In addition, we performed statistical analysis with and without these cells (data reported below) and found that whether these cells are included or excluded, the statistical significance of the results does not change.

      Fig.5c: including the 2 outliers from the cHet group with values of -16.5 and -20.6 mV: -42.6±1.01 mV in control, n=33 cells from 15 mice vs -35.3±1.2 mV in cHet, n=40 cells from 17 mice, ***p<0.001, LMM; excluding the 2 outliers from the cHet group: -42.6±1.01 mV in control, n=33 cells from 15 mice vs -36.2±1.1 mV in cHet, n=38 cells from 17 mice, ***p<0.001, LMM.

      Fig.7c: including the 2 outliers from the cHet group with values of -16.5 and -20.6 mV: -43.4±1.6 mV in control, n=12 cells from 9 mice vs -33.9±1.8 mV in cHet, n=24 cells from 13 mice, **p=0.002, LMM; excluding the 2 outliers from the cHet group: -43.4±1.6 mV in control, n=12 cells from 9 mice vs -35.4±1.7 mV in cHet, n=22 cells from 13 mice, *p=0.037, LMM.

      (5) The authors claim that "cHet SST+ cells showed no significant changes in active and passive membrane properties (Figure 8d,e); however, their evoked firing properties were affected with fewer AP generated in response to the same depolarizing current injection".

      This sentence is intrinsically contradictory. Action potentials triggered by current injections are dependent on the integration of passive and active properties. If the curves of Figure 8f are different between genotypes, then some passive and/or active property MUST have changed. It is an unescapable conclusion. The general _blanket_ statement of the authors that there are no significant changes in active and passive properties is in direct contradiction with the current/#AP plot.

      We shall rephrase the text according to the reviewer’s suggestion to better represent the data. As discussed in the first revision, it's possible that other intrinsic factors, not assessed in this study, may have contributed to the effect shown in the current/#AP plot.

      (6) The phase plots of Figs 5c, 7c, and 7h suggest that the frequency of acquisition/filtering of current-clamp signals was not appropriate for fast waveforms such as spikes. The first two papers indicated by the authors in their rebuttal (Golomb et al., 2007; Stevens et al., 2021) did not perform a phase plot analysis (like those included in the manuscript). The last work quoted in the rebuttal (Zhang et al., 2023) did perform phase plot analysis, but data were digitized at a frequency of 20 kHz (not 10 kHz as incorrectly indicated by the authors) and filtered at 10 kHz (not 2-3 kHz as by the authors in the manuscript). To me, this remains a concern.

      We agree with the reviewer that a higher sampling rate would allow a more accurate assessment of different AP parameters, such as AP height, half-width, rise time, etc. The papers were cited in the context of determining AP threshold, not performing phase plot analysis. We apologize for the confusion and error. Further, as mentioned above, we will remove the phase plots since they do not add relevant information.
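      To make the sampling-rate point concrete, a phase plot (dV/dt against V) and a derivative-based threshold can be sketched as below. This is only an illustration: the 10 mV/ms criterion is an assumed, commonly used value and not necessarily the one applied in the manuscript, and the trace and function names are hypothetical.

```python
def phase_plot(voltage_mv, sampling_rate_hz):
    """Return (V, dV/dt) pairs, with dV/dt in mV/ms from forward differences."""
    dt_ms = 1000.0 / sampling_rate_hz   # time between samples in ms
    return [
        (voltage_mv[k], (voltage_mv[k + 1] - voltage_mv[k]) / dt_ms)
        for k in range(len(voltage_mv) - 1)
    ]

def ap_threshold(voltage_mv, sampling_rate_hz, criterion_mv_per_ms=10.0):
    """Voltage at the first sample whose dV/dt reaches the criterion, else None."""
    for v, dvdt in phase_plot(voltage_mv, sampling_rate_hz):
        if dvdt >= criterion_mv_per_ms:
            return v
    return None

# Synthetic trace sampled at 10 kHz (0.1 ms per sample):
# a slow depolarizing ramp followed by a fast upstroke.
trace = [-60.0, -59.9, -59.8, -59.7, -58.0, -50.0, -30.0, 0.0, 20.0]
threshold = ap_threshold(trace, 10_000)
```

      Because the derivative is taken between consecutive samples, the sampling interval sets the resolution of the upstroke: at 10 kHz a spike rising within a few hundred microseconds is described by only a handful of points, which limits the precision of peak dV/dt more than that of a large shift in threshold.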

      (7) The general logical flow of the manuscript could be improved. For example, Fig 4 seems to indicate no morphological differences in the dendritic trees of control vs mutant PV cells, but this conclusion is then rejected by Fig 6. Maybe Fig 4 is not necessary. Regarding Fig 6, did the authors check the integrity of the entire dendritic structure of the cells analyzed (i.e. no dendrites were cut in the slice)? This is critical as the dendritic geometry may affect the firing properties of neurons (Mainen and Sejnowski, Nature, 1996).

      As suggested by the reviewer, we will remove Fig.4. All the reconstructions used for dendritic analysis contained intact cells with no evidently cut dendrites.

      Author response image 1.

      (a, b) Representative voltage responses of a SST+ cell (a) and a PV+ cell (b) in the absence (left) and presence (right) of TTX in response to depolarizing current injections corresponding to threshold current and 2x threshold current. (c-f) Cumulative histograms of sEPSC/mEPSC amplitude (bin width 0.5 pA) and frequency (bin width 10 ms) recorded from four PV+ cells. sEPSCs were recorded for 2 minutes, then TTX (1μM; Alomone Labs) was perfused into the recording chamber. After 5 minutes, mEPSCs were recorded for 2 minutes. (g, h, i, j) Time course plots of series resistance (Rs) of the four representative PV+ cells shown in c-f before (sEPSC) and during the application of TTX (mEPSC).


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The study is designed to assess the role of Syngap1 in regulating the physiology of the MGE-derived PV+ and SST+ interneurons. Syngap1 is associated with some mental health disorders, and PV+ and SST+ cells are the focus of many previous and likely future reports from studies of interneuron biology, highlighting the translational and basic neuroscience relevance of the authors' work.

      Strengths of the study are the use of well-established electrophysiology methods and the highly controlled conditions of ex vivo brain slice experiments combined with a novel intersectional mouse line, to assess the role of Syngap1 in regulating PV+ and SST+ cell properties. The findings revealed that in the mature auditory cortex, Syngap1 haploinsufficiency decreases both the intrinsic excitability and the excitatory synaptic drive onto PV+ neurons from Layer 4. In contrast, SST+ interneurons were mostly unaffected by Syngap1 haploinsufficiency. Pharmacologically manipulating the activity of voltage-gated potassium channels of the Kv1 family suggested that these channels contributed to the decreased PV+ neuron excitability caused by Syngap1 insufficiency. These results therefore suggest that normal Syngap1 expression levels are necessary to produce normal PV+ cell intrinsic properties and excitatory synaptic drive, albeit, perhaps surprisingly, inhibitory synaptic transmission was not affected by Syngap1 haploinsufficiency.

      Since the electrophysiology experiments were performed in the adult auditory cortex, while Syngap1 expression was potentially affected since embryonic stages in the MGE, future studies should address two important points that were not tackled in the present study. First, what is the developmental time window in which Syngap1 insufficiency disrupted PV+ neuron properties? Albeit the embryonic Syngap1 deletion most likely affected PV+ neuron maturation, the properties of Syngap-insufficient PV+ neurons do not resemble those of immature PV+ neurons. Second, whereas the observation that Syngap1 haploinsufficiency affected PV+ neurons in auditory cortex layer 4 suggests auditory processing alterations, MGE-derived PV+ neurons populate every cortical area. Therefore, without information on whether Syngap1 expression levels are cortical area-specific, the data in this study would predict that by regulating PV+ neuron electrophysiology, Syngap1 normally controls circuit function in a wide range of cortical areas, and therefore a range of sensory, motor and cognitive functions. These are relatively minor weaknesses regarding interpretation of the data in the present study that the authors could discuss.

      We agree with the reviewer on the proposed open questions, which we now discuss in the revised manuscript. We do have experimental evidence suggesting that Syngap1 mRNA is expressed by PV+ and SST+ neurons in different cortical areas, during early postnatal development and in adulthood (Jadhav et al., 2024); therefore, we agree that it will be important, in future experiments, to tackle the question of when the observed phenotypes arise.

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, the authors investigated how partial loss of SynGap1 affects inhibitory neurons derived from the MGE in the auditory cortex, focusing on their synaptic inputs and excitability. While haploinsufficiency of SynGap1 is known to lead to intellectual disabilities, the underlying mechanisms remain unclear.

      Strengths:

      The questions are novel.

      Weaknesses:

      Despite the interesting and novel questions, there are significant concerns regarding the experimental design and data quality, as well as potential misinterpretations of key findings. Consequently, the current manuscript fails to contribute substantially to our understanding of SynGap1 loss mechanisms and may even provoke unnecessary controversies.

      Major issues:

      (1) One major concern is the inconsistency and confusion in the intermediate conclusions drawn from the results. For instance, while the sEPSC data indicates decreased amplitude in PV+ and SOM+ cells in cHet animals, the frequency of events remains unchanged. In contrast, the mEPSC data shows no change in amplitudes in PV+ cells, but a significant decrease in event frequency. The authors conclude that the former observation implies decreased excitability. However, traditionally, such observations on mEPSC parameters are considered indicative of presynaptic mechanisms rather than changes of network activity. The subsequent synapse counting experiments align more closely with the traditional conclusions. This issue can be resolved by rephrasing the text. However, it would remain unexplained why the sEPSC frequency shows no significant difference. If the majority of sEPSC events were indeed mediated by spiking (which is blocked by TTX), the average amplitudes and frequency of mEPSCs should be substantially lower than those of sEPSCs. Yet, they fall within a very similar range, suggesting that most sEPSCs may actually be independent of action potentials. But if that was indeed the case, the changes of purported sEPSC and mEPSC results should have been similar.

      We understand the reviewer’s perspective; indeed, we asked ourselves the very same question regarding why the sEPSC and mEPSC frequencies fall within a similar range when we analysed neuron means (bar graphs). We thus recorded sEPSCs followed by mEPSCs from several PV neurons (control and cHet) and included these data in the revised version of the manuscript (new Supplementary Figure 3). We found that the average amplitudes and frequency of mEPSCs, together with their respective cumulative probability curves, were not significantly different from those of sEPSCs. We rephrased the manuscript to present potential interpretations of the data.

      We hope that we have correctly interpreted the reviewer's concern. If the question is why we do not observe a significant difference in the average frequency when comparing sEPSC and mEPSC in control mice, this could be explained by the fact that increased mean amplitude of sEPSCs was primarily driven by alterations in large sEPSCs (>9-10pA, as shown in cumulative probability in Fig. 1b right), with smaller ones being relatively unaffected. Consequently, a reduction in sEPSC amplitude may not necessarily result in a significant decrease in frequency since their values likely remain above the detection threshold of 3 pA. 

      If the question is whether we should see the same parameters affected by the genetic manipulation in both sEPSC and mEPSC, then another critical consideration is the involvement of the releasable pool in mEPSCs versus sEPSCs. Current knowledge suggests that activity-dependent and -independent release may not necessarily engage the same pool of vesicles or target the same postsynaptic sites. This concept has been extensively explored (Sara et al., 2005; Sara et al., 2011; reviewed in Ramirez and Kavalali, 2011; Kavalali, 2015). Consequently, while we may have traditionally interpreted activity-dependent and -independent data assuming they utilize the same pool, this is no longer accurate. The current discussion in the field revolves around understanding the mechanisms underlying such phenomena. Therefore, comparisons between sEPSCs and mEPSCs may not yield conclusive data but rather speculative interpretations.

      (2) Another significant concern is the quality of synapse counting experiments. The authors attempted to colocalize pre- and postsynaptic markers Vglut1 and PSD95 with PV labelling. However, several issues arise. Firstly, the PV labelling seems confined to soma regions, with no visible dendrites. Given that the perisomatic region only receives a minor fraction of excitatory synapses, this labeling might not accurately represent the input coverage of PV cells. Secondly, the resolution of the images is insufficient to support clear colocalization of the synaptic markers. Thirdly, the staining patterns are peculiar, with PSD95 puncta appearing within regions clearly identified as somas by Vglut1, hinting at possible intracellular signals. Furthermore, PSD95 seems to delineate potential apical dendrites of pyramidal cells passing through the region, yet Vglut1+ partners are absent in these segments, which are expected to be the marker of these synapses here. Additionally, the cumulative density of Vglut2 and Vglut1 puncta exceeds expectations, and it's surprising that subcortical fibers labeled by Vglut2 are comparable in number to intracortical Vglut1+ axon terminals. Ideally, N(Vglut1)+N(Vglut2) should be equal or less than N(PSD95), but this is not the case here. Consequently, these results cannot be considered reliable due to these issues.

      We apologize, as it appears that the images we provided in the first submission have caused confusion. The selected images represent a single focal plane of a confocal stack, which was visually centered on the PV cell somata. We chose just one confocal plane because we thought it showed more clearly the apposition of presynaptic and postsynaptic immunolabeling around the somata. In the revised version of the manuscript, we now provide higher magnification images, which will clearly show how we identified and selected the region of interest for the quantification of colocalized synaptic markers (Supplemental Figure 2). In our confocal stacks, we can also identify PV immunolabeled dendrites and colocalized vGlut1/PSD95 or vGlut2/PSD95 puncta on them; but these do not appear in the selected images because, as explained, only one focal plane, centered on the PV cell somata, was shown. 

      We acknowledge the reviewer's point that in PV+ cells the majority of excitatory inputs are formed onto dendrites; however, we focused on the somatic excitatory inputs to PV cells because, despite their lower number, they produce much stronger depolarization in PV neurons than dendritic excitatory inputs (Hu et al., 2010; Norenberg et al., 2010). Further, quantification of perisomatic putative excitatory synapses is more reliable: with PV immunostaining we can visualize the soma and larger primary dendrites, but smaller, higher-order dendrites are not always detectable. Of note, PV-positive somata receive more excitatory synapses than SST-positive and pyramidal neuron somata, as found by electron microscopy studies in the visual cortex (Hwang et al., 2021; Elabbady et al., 2024).

      Regarding the comment on the density of vGlut1 and vGlut2 puncta, the numbers appear high and similar between the two markers because we present normalized data (cHet normalized to their control values for each set of immunolabelling) to clearly represent the differences between genotypes. We now provide a more detailed explanation of our methods in the revised manuscript. Briefly, immunostained sections were imaged using a Leica SP8-STED confocal microscope, with a 63x oil-immersion objective (NA 1.4) at 1024 × 1024, z-step = 0.3 μm, stack size of ~15 μm. Images were acquired from the auditory cortex from at least 3 coronal sections per animal. All the confocal parameters were kept constant throughout the acquisition of an experiment. All images shown in the figures are from a single confocal plane. To quantify the number of vGlut1/PSD95 or vGlut2/PSD95 putative synapses, images were exported as TIFF files and analyzed using Fiji (ImageJ) software. We first manually outlined the profile of each PV cell soma (identified by PV immunolabeling). At least 4 innervated somata were selected in each confocal stack. We then used a series of custom-made macros in Fiji as previously described (Chehrazi et al., 2023). After applying background subtraction (rolling value = 10) and Gaussian blur (σ value = 2) filters, the stacks were binarized and vGlut1/PSD95 or vGlut2/PSD95 puncta were independently identified around the perimeter of a targeted soma in the focal plane with the highest soma circumference. Puncta were quantified after filtering particles for size (between 0 and 2 μm²) and circularity (between 0 and 1). Data quantification was done by investigators blind to the genotype, and data are presented normalized to control values for each experiment.
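
      The size and circularity filtering and the normalization to control values described above can be sketched as follows. This is only an illustrative Python sketch, not the custom Fiji macros used in the study: the function names, the dictionary keys, and the example values are hypothetical, and circularity is assumed to follow the usual definition 4π·area/perimeter², capped at 1.

      ```python
      import math

      def circularity(area, perimeter):
          """Usual shape descriptor: 4*pi*area / perimeter**2, capped at 1.0 (perfect circle)."""
          if perimeter <= 0:
              return 0.0
          return min(1.0, 4.0 * math.pi * area / perimeter ** 2)

      def filter_puncta(puncta, max_area_um2=2.0, min_circ=0.0, max_circ=1.0):
          """Keep particles whose area (0-2 um^2) and circularity (0-1) are in range."""
          kept = []
          for p in puncta:
              c = circularity(p["area_um2"], p["perimeter_um"])
              if 0.0 < p["area_um2"] <= max_area_um2 and min_circ <= c <= max_circ:
                  kept.append(p)
          return kept

      def normalize_to_control(counts, control_counts):
          """Express per-soma puncta counts relative to the mean of the control group."""
          control_mean = sum(control_counts) / len(control_counts)
          return [c / control_mean for c in counts]

      # Hypothetical particles detected around one soma (not data from the study).
      puncta = [
          {"area_um2": 0.5, "perimeter_um": 2.6},
          {"area_um2": 3.1, "perimeter_um": 7.0},  # larger than 2 um^2, rejected
          {"area_um2": 1.2, "perimeter_um": 4.5},
      ]
      kept = filter_puncta(puncta)

      # Hypothetical per-soma counts for cHet, normalized to hypothetical control counts.
      normalized = normalize_to_control([8, 10, 6], control_counts=[10, 12, 8])
      ```

      Because each value is divided by the control mean of its own immunolabelling set, vGlut1 and vGlut2 results end up on the same relative scale regardless of their absolute densities.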

      (3) One observation from the minimal stimulation experiment was concluded by an unsupported statement. Namely, the change in the onset delay cannot be attributed to a deficit in the recruitment of PV+ cells, but it may suggest a change in the excitability of TC axons.

      We agree with the reviewer; please see our response to the next point.

      (4) The conclusions drawn from the stimulation experiments are also disconnected from the actual data. To make conclusions about TC release, the authors should have tested release probability using established methods, such as paired-pulse changes. Instead, the only observation here is a change in the AMPA components, which remained unexplained.

      As suggested, we performed additional paired-pulse ratio experiments at different intervals. We found that, in contrast to control mice, evoked excitatory inputs to layer IV PV+ cells showed paired-pulse facilitation in cHet mice (Figure 3g, h), suggesting that thalamocortical presynaptic sites likely have decreased release probability in mutant compared to control mice. We rephrased the text according to the data obtained from this new experiment.
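
      For readers less familiar with the measure, a paired-pulse ratio (PPR) can be computed as in the minimal sketch below. It assumes the PPR is taken as the mean amplitude of the second EPSC divided by the mean amplitude of the first, which is one common convention and not necessarily the one used in the manuscript; the function names and amplitudes are hypothetical.

      ```python
      def paired_pulse_ratio(first_amplitudes_pa, second_amplitudes_pa):
          """PPR = mean |EPSC2| / mean |EPSC1| across sweeps (assumed convention)."""
          mean_first = sum(abs(a) for a in first_amplitudes_pa) / len(first_amplitudes_pa)
          mean_second = sum(abs(a) for a in second_amplitudes_pa) / len(second_amplitudes_pa)
          return mean_second / mean_first

      def describe(ppr):
          """PPR > 1 indicates facilitation, PPR < 1 indicates depression."""
          if ppr > 1.0:
              return "facilitation"
          if ppr < 1.0:
              return "depression"
          return "no change"

      # Hypothetical inward EPSC amplitudes (pA) from three sweeps of one cell.
      ppr = paired_pulse_ratio([-100.0, -120.0, -80.0], [-130.0, -150.0, -110.0])
      label = describe(ppr)
      ```

      Facilitation is classically associated with a lower initial release probability, which is the reasoning behind the interpretation given above.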

      (5) The sampling rate of CC recordings is insufficient to resolve the temporal properties of the APs. Therefore, the phase-plots cannot be interpreted (e.g. axonal and somatic AP components are not clearly separated), raising questions about how AP threshold and peak were measured. The low sampling rate also masks the real derivative of the AP signals, making them apparently faster.

      We acknowledge that a higher sampling rate would provide a more detailed and smoother phase-plot. However, in the context of action potential parameters analysis here, it is acceptable to use sampling rates ranging from 10 kHz to 20 kHz (Golomb et al., 2007; Stevens et al., 2021; Zhang et al., 2023), which are considered adequate in the context of the present study. Indeed, our study aims to evaluate "relative" differences in the electrophysiological phenotype when comparing groups following a specific genetic manipulation. A sampling rate of 10 kHz is commonly employed in similar studies, including those conducted by our collaborator and co-author S. Kourrich (e.g., Kourrich and Thomas 2009, Kourrich et al., 2013), as well as others (Russo et al., 2013; Ünal et al., 2020; Chamberland et al., 2023). Despite being acquired at a lower sampling rate than potentially preferred by the reviewer, our data clearly demonstrate significant differences between the experimental groups, especially for parameters that are negligibly or not affected by the sampling rate used here (e.g., #spikes/input, RMP, Rin, Cm, Tm, AP amplitude, AP latency, AP rheobase).

      Regarding the phase-plots, a higher sampling rate would indeed have resulted in smoother curves. However, the differences were sufficiently pronounced to discern the relative variations in action potential waveforms between the experimental groups.
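
      To make the sampling-rate point concrete, the sketch below builds a phase plot (dV/dt against V) from a trace sampled at 10 kHz, where consecutive points are 0.1 ms apart. It is an illustration only: the central-difference derivative, the 10 mV/ms threshold criterion, and the synthetic trace are assumptions, not the analysis settings of the study.

      ```python
      def dvdt_mv_per_ms(voltage_mv, sampling_rate_hz):
          """Central-difference derivative of the membrane potential, in mV/ms."""
          dt_ms = 1000.0 / sampling_rate_hz  # 0.1 ms per sample at 10 kHz
          n = len(voltage_mv)
          derivative = []
          for i in range(n):
              lo = max(i - 1, 0)
              hi = min(i + 1, n - 1)
              derivative.append((voltage_mv[hi] - voltage_mv[lo]) / ((hi - lo) * dt_ms))
          return derivative

      def phase_plot(voltage_mv, sampling_rate_hz):
          """Pairs (V, dV/dt) that trace the action potential in phase space."""
          return list(zip(voltage_mv, dvdt_mv_per_ms(voltage_mv, sampling_rate_hz)))

      def threshold_mv(voltage_mv, sampling_rate_hz, criterion_mv_per_ms=10.0):
          """First voltage at which dV/dt reaches the criterion (assumed definition)."""
          for v, d in phase_plot(voltage_mv, sampling_rate_hz):
              if d >= criterion_mv_per_ms:
                  return v
          return None

      # Synthetic spike-like trace (mV), for illustration only.
      trace = [-65.0, -65.0, -64.8, -64.0, -60.0, -40.0, 0.0, 20.0, 0.0, -40.0, -70.0]
      threshold = threshold_mv(trace, sampling_rate_hz=10_000)
      ```

      A higher sampling rate would add more points along the same curve and smooth it, while relative comparisons between groups recorded under identical conditions remain possible.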

      A related issue is that the Methods section lacks essential details about the recording conditions, such as bridge balance and capacitance neutralization.

      We indeed performed bridge balance and neutralized the capacitance before starting every recording. We added the information in the methods.

      (6) Interpretation issue: One of the most fundamental measures of cellular excitability, the rheobase, was differentially affected by cHet in BCshort and BCbroad. Yet, the authors concluded that the cHet-induced changes in the two subpopulations are common.

      We are uncertain if we have correctly interpreted the reviewer's comment. While we observed distinct impacts on the rheobase (Fig. 7d and 7i), there seems to be a common effect on the AP threshold (Fig. 7c and 7h), as interpreted and indicated in the final sentence of the results section for Figure 7. If our response does not address the reviewer's comment adequately, we would greatly appreciate it if the reviewer could rephrase their feedback.

      (7) Design issue:

      The Kv1 blockade experiments are disconnected from the main manuscript. There is no experiment that shows the causal relationship between changes in DTX and cHet cells. It is only an interesting observation on AP halfwidth and threshold. However, how they affect rheobase, EPSCs, and other topics of the manuscript are not addressed in DTX experiments.

      Furthermore, Kv1 currents were never measured in this work, nor was the channel density tested. Thus, the DTX effects are not necessarily related to changes in PV cells, which can potentially generate controversies.

      While we acknowledge the reviewer's point that Kv1 currents and density weren't specifically tested, an important insight provided by Fig. 5 is the prolonged action potential latency. This delay is significantly influenced by slowly inactivating subthreshold potassium currents, namely the D-type K+ current. It's worth noting that D-type current is primarily mediated by members of the Kv1 family. The literature supports a role for Kv1.1-containing channels in modulating responses to near-threshold stimuli in PV cells (Wang et al., 1994; Goldberg et al., 2008; Zurita et al., 2018). However, we recognize that besides the Kv1 family, other families may also contribute to the observed changes.

      To address this concern, we revised the manuscript by referring to the more accurate term "D-type K+ current", and rephrased the discussion to clarify the limits of our approach. It is not our intention to open unnecessary controversy, but to present the data we obtained. We believe this approach and the rephrased discussion will prevent unnecessary controversy and instead foster fruitful discussions.

      (8) Writing issues:

      Abstract:

      The auditory system is not mentioned in the abstract.

      One statement in the abstract is unclear. What is meant by "targeting Kv1 family of voltage-gated potassium channels was sufficient..."? "Targeting" could refer to altered subcellular targeting of the channels, simple overexpression/deletion in the target cell population, or targeted mutation of the channel, etc. Only the final part of the Results revealed that none of the above, but these channels were blocked selectively.

      We agree with the reviewer and we will rephrase the abstract accordingly.

      Introduction:

      There is a contradiction in the introduction. The second paragraph describes in detail the distinct contribution of PV and SST neurons to auditory processing. But at the end, the authors state that "relatively few reports on PV+ and SST+ cell-intrinsic and synaptic properties in adult auditory cortex". Please be more specific about the unknown properties.

      We agree with the reviewer and we will rephrase more specifically.

      (9) The introduction emphasizes the heterogeneity of PV neurons, which certainly influences the interpretation of the results of the current manuscript. However, the initial experiments did not consider this and handled all PV cell data as a pooled population.

      In the initial experiments, we handled all PV cell data together because we wanted to be rigorous and not make assumptions on the different PV cells, which in later experiments we distinguished based on the intrinsic properties alone. Nevertheless, based on this and other reviewers’ comments, we completely rewrote the introduction in the revised manuscript to increase both focus and clarity.

      (10) The interpretation of the results strongly depends on unpublished work, which potentially provide the physiological and behavioral contexts about the role of GABAergic neurons in SynGap-haploinsufficiency. The authors cite their own unpublished work, without explaining the specific findings and relation to this manuscript.

      We agree with the reviewer and provided more information and updated references in the revised version of this manuscript. Our work is now in press in Journal of Neuroscience.

      (11) The introduction of Sholl analysis experiments mentions SOM staining, however, there is no such data about this cell type in the manuscript.

      We thank the reviewer for noticing the error; we replaced SOM with SST (SOM and SST are two commonly used acronyms for somatostatin-expressing interneurons).

      Reviewer #3 (Public Review):

      This paper compares the synaptic and membrane properties of two main subtypes of interneurons (PV+, SST+) in the auditory cortex of control mice vs mutants with Syngap1 haploinsufficiency. The authors find differences at both levels, although predominantly in PV+ cells. These results suggest that altered PV-interneuron functions in the auditory cortex may contribute to the network dysfunction observed in Syngap1 haploinsufficiency-related intellectual disability. The subject of the work is interesting, and most of the approach is direct and quantitative, which are major strengths. There are also some weaknesses that reduce its impact for a broader field.

      (1) The choice of mice with conditional (rather than global) haploinsufficiency makes the link between the findings and Syngap1 relatively easy to interpret, which is a strength. However, it also remains unclear whether an entire network with the same mutation at a global level (affecting also excitatory neurons) would react similarly.

      We agree with the reviewer and now discuss this important caveat in the revised manuscript.

      (2) There are some (apparent?) inconsistencies between the text and the figures. Although the authors appear to have used a sophisticated statistical analysis, some datasets in the illustrations do not seem to match the statistical results. For example, neither Fig 1g nor Fig 3f (eNMDA) reach significance despite large differences. 

      We respectfully disagree; we do not think the text and figures are inconsistent. In the cited examples, the large apparent difference in mean values does not reach significance because of the large variability in the data; further, we did not exclude any data points, because we wanted to be rigorous. In particular, for Fig. 1g, statistical analysis shows a significant increase in the inter-mEPSC interval (*p=0.027, LMM) when all events are considered (cumulative probability plots), while there is no significant difference in the inter-mEPSC interval for the inter-cell mean comparison (inset, p=0.354, LMM). The inter-cell mean comparison does not show a difference with the Mann-Whitney test either (p=0.101; the data are not normally distributed, hence the choice of the Mann-Whitney test). For Fig. 3f (eNMDA), the higher mean value for cHet versus control is driven by two particularly high data points, while the other data points overlap with the control values. The Mann-Whitney test also shows no statistical difference (p=0.174).

      In the manuscript, discussion of the data is based on the results of the LMM analysis, which takes into account both the number of cells and the number of mice from which these cells were recorded. We chose this statistical approach because it does not rely on the assumption that cells recorded from the same mouse are independent variables. In the supplemental tables, we provide, for each data set, the results of the statistical analysis done with both LMM and the more commonly used Mann-Whitney test (for non-normally distributed data) or t-test (for normally distributed data).

      Also, the legend to Fig 9 indicates the presence of "a significant decrease in AP half-width from cHet in absence or presence of a-DTX", but the bar graph does not seem to show that.

      We apologize for our lack of clarity. In legend 9, we reported the statistical comparisons between 1) vehicle-treated cHET vs control PV+ cells and 2) a-DTX-treated cHET vs control PV+ cells. We rephrased the legend of the figure to avoid confusion.

      (3) The authors mention that the lack of differences in synaptic current kinetics is evidence against a change in subunit composition. However, in some Figures, for example, 3a, the kinetics of the recorded currents appear dramatically different. It would be important to know and compare the values of the series resistance between control and mutant animals.

      We agree with the reviewer that there appears to be a qualitative difference in eNMDA decay between conditions, although quantified eNMDA decay itself is similar between groups. We used a cutoff of 15 % for the series resistance (Rs), which is considerably more stringent than the cutoffs typically used in electrophysiology, the vast majority of which lie between 20 and 30%. To address this concern, we re-examined Rs, compared it between groups, and found no difference for eAMPA (Control mice: 13.2±0.5, n=16 cells from 7 mice vs cHet mice: 13.7±0.3, n=14 cells from 7 mice; LMM, p=0.432) or eNMDA (Control mice: 12.7±0.7, n=6 cells from 3 mice vs cHet mice: 13.8±0.7, n=6 cells from 5 mice; LMM, p=0.231). Thus, the apparent qualitative difference in eNMDA decay stems from inter-cell variability rather than inter-group differences. Notably, this discrepancy between the trace (Fig. 3a) and the data (Fig. 3f, right) is largely due to inter-cell variability, particularly in eNMDA, where a higher but non-significant decay rate is driven by a couple of very high values (Fig. 3f, right). In the revised manuscript, we now show traces that better represent our findings.

      (4) A significant unexplained variability is present in several datasets. For example, the AP threshold for PV+ includes points between -50 and -40 mV, but also values at around -20/-15 mV, which seems too depolarized to generate healthy APs (Fig 5c, Fig 7c).

      We acknowledge the variability in AP threshold data, with some APs appearing too depolarized to generate healthy spikes. However, we meticulously examined each AP that spiked at these depolarized thresholds and found that other intrinsic properties (such as Rin, Vrest, AP overshoot, etc.) all indicate that these cells are healthy. Therefore, to maintain objectivity and provide unbiased data to the community, we opted to include them in our analysis. It's worth noting that similar variability has been observed in other studies (Bengtsson Gonzales et al., 2020; Bertero et al., 2020).

      Further, we conducted a significance test on AP threshold excluding these potentially unhealthy cells and found that the significant differences persist. After removing two outliers from the cHet group with values of -16.5 and 20.6 mV, we obtain: -42.6±1.01 mV in control, n=33, 15 mice vs -36.2±1.1 mV in cHet, n=38 cells, 17 mice (LMM, ***p<0.001). Thus, whether these cells are included or excluded, our interpretations and conclusions remain unchanged.

      We would like to clarify that these data have not been corrected for the junction potential, as described in the revised version.

      (5) I am unclear as to how the authors quantified colocalization between VGluts and PSD95 at the low magnification shown in Supplementary Figure 2.

      We apologize for our lack of clarity. Although the analysis was done at high resolution, the figures were focused on showing multiple PV somata receiving excitatory inputs. We added higher magnification figures and more detailed information in the methods of the revised version. Please also see our response to reviewer #2.

      (6) The authors claim that "cHet SST+ cells showed no significant changes in active and passive membrane properties", but this claim would seem to be directly refuted by the data of Fig 8f. In the absence of changes in either active or passive membrane properties shouldn't the current/#AP plot remain unchanged?

      While we acknowledge the theoretical expectation that changes in intrinsic parameters should correlate with alterations in neuronal firing, the absence of differences in the parameters analyzed in this study is not incompatible with the clear and significant decrease in firing rate observed in cHet SST+ cells. It's indeed possible that other intrinsic factors, not assessed in this study, may have contributed to this effect. However, exploring these mechanisms is beyond the scope of our current investigation. We rephrased the discussion and added this limitation of our study in the revised version.

      (7) The plots used for the determination of AP threshold (Figs 5c, 7c, and 7h) suggest that the frequency of acquisition of current-clamp signals may not have been sufficient; this value is not included in the Methods section.

      This study utilized a sampling rate of 10 kHz, which is a standard rate for action potential analysis in the present context. While we acknowledge that a higher sampling rate could have enhanced the clarity of the phase plot, our recording conditions, as detailed in our response to Rev#2/comment#5, were suitable for the objectives of this study.

      Reference list

      Bengtsson Gonzales C, Hunt S, Munoz-Manchado AB, McBain CJ, Hjerling-Leffler J (2020) Intrinsic electrophysiological properties predict variability in morphology and connectivity among striatal Parvalbumin-expressing Pthlh-cells Scientific Reports 10: 15680 https://doi.org/10.1038/s41598-020-72588-1

      Bertero A, Zurita H, Normandin M, Apicella AJ (2020) Auditory long-range parvalbumin cortico-striatal neurons. Frontiers in Neural Circuits 14:45 http://doi.org/10.3389/fncir.2020.00045

      Chamberland S, Nebet ER, Valero M, Hanani M, Egger R, Larsen SB, Eyring KW, Buzsáki G, Tsien RW (2023) Brief synaptic inhibition persistently interrupts firing of fast-spiking interneurons Neuron 111:1264–1281 http://doi.org/10.1016/j.neuron.2023.01.017

      Chehrazi P, Lee KKY, Lavertu-Jolin M, Abbasnejad Z, Carreño-Muñoz MI, Chattopadhyaya B, Di Cristo G (2023) The p75 neurotrophin receptor in preadolescent prefrontal parvalbumin interneurons promotes cognitive flexibility in adult mice Biological Psychiatry 94:310-321 https://doi.org/10.1016/j.biopsych.2023.04.019

      Elabbady L, Seshamani S, Mu S, Mahalingam G, Schneider-Mizell C, Bodor AL, Bae JA, Brittain D, Buchanan J, Bumbarger DJ, Castro MA, Dorkenwald S, Halageri A, Jia Z, Jordan C, Kapner D, Kemnitz N, Kinn S, Lee K, Li K, Lu R, Macrina T, Mitchell E, Mondal SS, Popovych S, Silversmith W, Takeno M, Torres R, Turner NL, Wong W, Wu J, Yin W, Yu SC, The MICrONS Consortium, Seung S, Reid C, Da Costa NM, Collman F (2024) Perisomatic features enable efficient and dataset wide cell-type classifications across large-scale electron microscopy volumes bioRxiv, https://doi.org/10.1101/2022.07.20.499976

      Goldberg EM, Clark BD, Zagha E, Nahmani M, Erisir A, Rudy B (2008) K+ Channels at the axon initial segment dampen near-threshold excitability of neocortical fast-spiking GABAergic interneurons. Neuron 58:387–400 https://doi.org/10.1016/j.neuron.2008.03.003

      Golomb D, Donner K, Shacham L, Shlosberg D, Amitai Y, Hansel D (2007) Mechanisms of firing patterns in fast-spiking cortical interneurons PLoS Computational Biology 3(8):e156 http://doi.org/10.1371/journal.pcbi.0030156

      Hu H, Martina M, Jonas P (2010). Dendritic mechanisms underlying rapid synaptic activation of fast-spiking hippocampal interneurons. Science 327:52–58. http://doi.org/10.1126/science.1177876

      Hwang YS, Maclachlan C, Blanc J, Dubois A, Petersen CH, Knott G, Lee SH (2021). 3D ultrastructure of synaptic inputs to distinct gabaergic neurons in the mouse primary visual cortex. Cerebral Cortex 31:2610–2624 http://doi.org/10.1093/cercor/bhaa378

      Jadhav V, Carreno-Munoz MI, Chehrazi P, Michaud JL, Chattopadhyaya B, Di Cristo G (2024) Developmental Syngap1 haploinsufficiency in medial ganglionic eminence-derived interneurons impairs auditory cortex activity, social behavior and extinction of fear memory The Journal of Neuroscience in press.

      Kavalali E (2015) The mechanisms and functions of spontaneous neurotransmitter release Nature Reviews Neuroscience 16:5–16. https://doi.org/10.1038/nrn3875

      Kourrich S, Thomas MJ (2009) Similar neurons, opposite adaptations: psychostimulant experience differentially alters firing properties in accumbens core versus shell Journal of Neuroscience 29:12275-12283 http://doi.org/10.1523/JNEUROSCI.3028-09.2009

      Kourrich S, Hayashi T, Chuang JY, Tsai SY, Su TP, Bonci A (2013) Dynamic interaction between sigma-1 receptor and Kv1.2 shapes neuronal and behavioral responses to cocaine Cell 152:236–247. http://doi.org/10.1016/j.cell.2012.12.004 

      Norenberg A, Hu H, Vida I, Bartos M, Jonas P (2010) Distinct nonuniform cable properties optimize rapid and efficient activation of fast-spiking GABAergic interneurons Proceedings of the National Academy of Sciences 107:894–9. http://doi.org/10.1073/pnas.0910716107

      Ramirez DM, Kavalali ET (2011) Differential regulation of spontaneous and evoked neurotransmitter release at central synapses Current Opinion in Neurobiology 21:275-282 https://doi.org/10.1016/j.conb.2011.01.007

      Russo G, Nieus TR, Maggi S, Taverna S (2013) Dynamics of action potential firing in electrically connected striatal fast-spiking interneurons Frontiers in Cellular Neuroscience 7:209 https://doi.org/10.3389/fncel.2013.00209

      Sara Y, Virmani T, Deák F, Liu X, Kavalali ET (2005) An isolated pool of vesicles recycles at rest and drives spontaneous neurotransmission Neuron 45:563-573 https://doi.org/10.1016/j.neuron.2004.12.056

      Sara Y, Bal M, Adachi M, Monteggia LM, Kavalali ET (2011) Use-dependent AMPA receptor block reveals segregation of spontaneous and evoked glutamatergic neurotransmission Journal of Neuroscience 31(14):5378-5382 https://doi.org/10.1523/JNEUROSCI.5234-10.2011

      Stevens SR, Longley CM, Ogawa Y, Teliska LH, Arumanayagam AS, Nair S, Oses-Prieto JA, Burlingame AL, Cykowski MD, Xue M, Rasband MN (2021) Ankyrin-R regulates fast-spiking interneuron excitability through perineuronal nets and Kv3.1b K+ channels eLife 10:e66491 http://doi.org/10.7554/eLife.66491  

      Ünal CT, Ünal B, Bolton MM (2020) Low-threshold spiking interneurons perform feedback inhibition in the lateral amygdala Brain Structure and Function 225:909–923. http://doi.org/10.1007/s00429-020-02051-4

      Wang H, Kunkel DD, Schwartzkroin PA, Tempel BL (1994) Localization of Kv1.1 and Kv1.2, two K channel proteins, to synaptic terminals, somata, and dendrites in the mouse brain. The Journal of Neuroscience 14:4588-4599. https://doi.org/10.1523/JNEUROSCI.14-08-04588.1994

      Zhang YZ, Sapantzi S, Lin A, Doelfel SR, Connors BW, Theyel BB (2023) Activity-dependent ectopic action potentials in regular-spiking neurons of the neocortex. Frontiers in Cellular Neuroscience 17 https://doi.org/10.3389/fncel.2023.1267687

      Zurita H, Feyen PLC, Apicella AJ (2018) Layer 5 callosal parvalbumin-expressing neurons: a distinct functional group of GABAergic neurons. Frontiers in Cellular Neuroscience 12:53 https://doi.org/10.3389/fncel.2018.00053

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Major points:

      (1) The introduction nicely summarizes multiple aspects of cortical auditory physiology and auditory stimulus processing, but the experiments in this study are performed ex vivo in acute slices. I wonder if it would be beneficial to shorten the initial parts of the introduction and consider a more focused approach highlighting, for example, to what extent Syngap1 expression levels change during development and/or vary across cortical areas. What cortical cell types express Syngap1 in addition to PV+ and SST+ cells? If multiple cell types normally express Syngap1, the introduction could clarify that the present study investigated Syngap1 insufficiency by isolating its effects in PV+ and SST+ neurons, a condition that may not reflect the situation in mental health disorders, but that would allow to better understand the global effects of Syngap1 deficiency.

      We thank the reviewer for this very helpful suggestion. We have changed the introduction as suggested.

      (2) Because mEPSCs are not affected in Syngap+/- interneurons, the authors conclude that the lower sEPSC amplitude is due to decreased network activity. However, it is likely that the absence of significant difference (Fig 1g), is due to lack of statistical power (control: 18 cells from 7 mice, cHet: 8 cells from 4 mice). By contrast, the number of experiments recording sIPSCs and mIPSCs (Fig 2) is much larger. Hence, it seems that adding mEPSC data would allow the authors to more convincingly support their conclusions. To more directly test whether Syngap insufficiency affects excitatory inputs by reducing network activity, ideally the authors would want to record sEPSCs followed by mEPSCs from each PV+ neuron (control or cHet). Spontaneous event frequency and amplitude should be higher for sEPSCs than mEPSCs, and Syngap1 deficiency should affect only sEPSCs, since network activity is abolished following tetrodotoxin application for mEPSC recordings.

      We agreed with the reviewer’s suggestion and recorded sEPSCs followed by mEPSCs from PV+ neurons in control and cHet mice (Figure supplement 3). In both genotypes, we found no significant difference in either amplitude or inter-event intervals between sEPSCs and mEPSCs, suggesting that in acute slices from adult A1, most sEPSCs may actually be action potential-independent. While perhaps surprising at first glance, this result can be explained by recently published work suggesting that action potential-dependent (sEPSC) and -independent (mEPSC) release may not necessarily engage the same pool of vesicles or target the same postsynaptic sites (Sara et al., 2005; Sara et al., 2011; reviewed in Ramirez and Kavalali, 2011; Kavalali, 2015). Consequently, while we may have traditionally interpreted activity-dependent and -independent data assuming they utilize the same pool, this is no longer accurate; and indeed, the current discussion in the field revolves around understanding the mechanisms underlying such phenomena.

      Therefore, comparisons between sEPSCs and mEPSCs may not yield conclusive data but rather speculative interpretations. We have added this caveat in the result section.

      (3) The interpretation of the data of experiments studying thalamic inputs and single synapses should be clarified and/or rewritten. First, it is not clear why the authors assume they are selectively activating thalamic fibers with electrical stimulation. Presumably the authors applied electrical stimulation to the white matter, but the methods are not clearly explained. Furthermore, the authors could clarify how stimulation of a single axon was verified and how they could distinguish release failures from stimulation failures, since the latter are inherent to using minimal stimulation conditions. Interpretations of changes in potency, quantal content, failure rate, etc, depend on the ability to distinguish release failures from stimulation failures. In addition, can the authors provide information on how many synapses a thalamic axon does establish with each postsynaptic PV+ cell from control or Syngap-deficient mice? Even if stimulating a single thalamic axon would be possible, if the connections from single thalamic axons onto single PV+ or SST+ cells are multisynaptic, this would make the interpretation of minimal stimulation experiments in terms of single synapses very difficult or unfeasible. In the end, changes in EPSCs evoked by electrical stimulation may support the idea that Syngap1 insufficiency decreases action potential evoked release, that in part mediates sEPSC, but without indicating the anatomical identity of the stimulated inputs (thalamic, other subcortical or cortico-cortical?).

      We agree with the reviewer: our protocol does not allow the stimulation of single synapses/axons, but rather bulk stimulation of multiple axons. We thank the reviewer for bringing up this important point. In our experiment, we reduced the stimulus intensity until no EPSC was observed, then increased it until we reached the minimum intensity at which we could observe an EPSC. We now explain this approach more clearly in the methods and changed the results section by removing any reference to “minimal” stimulation.

      Electrical stimulation of the thalamic radiation could indeed activate not only monosynaptic thalamic fibers but also polysynaptic (corticothalamic and/or corticocortical) EPSC components. To identify monosynaptic thalamocortical connections, we used as criteria the onset latency of the EPSC and its jitter, defined as the standard deviation of onset latencies, as previously published by other studies (Richardson et al., 2009; Blundon et al., 2011; Chun et al., 2013). Onset latencies were defined as the time interval between the beginning of the stimulation artifact and the onset of the EPSC. Monosynaptic connections are characterized by short onset latencies and low jitter (Richardson et al., 2009; Blundon et al., 2011; Chun et al., 2013). In our experiments, the initial slopes of EPSCs evoked by white matter stimulation had short onset latencies (mean onset latency, 4.27 ± 0.11 ms, N=16 neurons in controls, and 5.07 ± 0.07 ms, N=14 neurons in cHet mice) and low onset latency jitter (0.24 ± 0.03 ms in controls vs 0.31 ± 0.03 ms in cHet mice), suggestive of activation of monosynaptic thalamocortical connections (Richardson et al., 2009; Blundon et al., 2011; Chun et al., 2013). Of note, a previous study in adult mice (Krause et al., 2014) showed that local field potentials evoked by electrical stimulation of the medial geniculate nucleus or the thalamic radiation were comparable. This information is included in the methods section of the revised manuscript.
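
      The latency and jitter criteria can be illustrated with the short Python sketch below. It assumes that jitter is the sample standard deviation of per-sweep onset latencies for one cell; the function names, the example times, and the cutoffs passed to the helper are hypothetical and are not values taken from the manuscript or the cited studies.

      ```python
      import statistics

      def onset_latencies_ms(stim_times_ms, epsc_onset_times_ms):
          """Latency per sweep: EPSC onset minus start of the stimulation artifact."""
          return [onset - stim for stim, onset in zip(stim_times_ms, epsc_onset_times_ms)]

      def latency_and_jitter(latencies_ms):
          """Mean onset latency and jitter (sample SD of onset latencies) for one cell."""
          return statistics.mean(latencies_ms), statistics.stdev(latencies_ms)

      def looks_monosynaptic(mean_latency_ms, jitter_ms, max_latency_ms, max_jitter_ms):
          """Short latency and low jitter; both cutoffs must be chosen by the user."""
          return mean_latency_ms <= max_latency_ms and jitter_ms <= max_jitter_ms

      # Hypothetical sweeps from one cell (times in ms).
      stim_times = [10.0, 110.0, 210.0, 310.0]
      onset_times = [14.0, 114.2, 214.4, 314.2]

      latencies = onset_latencies_ms(stim_times, onset_times)
      mean_latency, jitter = latency_and_jitter(latencies)
      # Cutoffs below are placeholders for illustration only.
      accepted = looks_monosynaptic(mean_latency, jitter, max_latency_ms=6.0, max_jitter_ms=0.5)
      ```

      Polysynaptic components would be expected to show longer and more variable latencies, which is why both quantities are examined together.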

      (4) The data presentation in Fig 6 is a bit confusing and could be clarified. First, in cluster analysis (Fig 6a), the authors may want to clarify why a correlation between Fmax and half width is indicative of the presence of subgroups. Second, performing cluster analysis based on two variables alone (Fmax and half-width) might not be very informative, but perhaps the authors could better explain why they chose two variables and particularly these two variables? For reference, see the study by Helm et al. 2013 (cited by the authors) using multivariate cluster analysis. Additionally, the authors may want to clarify, for non-expert readers, whether or not finding correlations between variables (heatmap in the left panel of Fig 6b) is a necessary condition to perform PCA (Fig 6b right panel).

      We apologize for the confusion and thank the reviewer for the comment. The choice of Fmax and half-width to cluster PV+ subtypes was based on past observations of atypical PV+ cells characterized by a slower AP half-width and lower maximal AP firing frequency (Nassar et al., 2015; Bengtsson Gonzales et al., 2020; Ekins et al., 2020; Helm et al., 2013). Based on these previous studies, we performed hierarchical clustering of AP half-width and Fmax-initial values based on Euclidean distance. However, in our case some control PV+ cells showed no correlation between these parameters (as it appears in Fig 6a left, right, and 6b left), requiring the use of 11 additional parameters to perform Principal Component Analysis (PCA). PCA takes a large data set with many variables per observation and reduces them to a smaller set of summary indices (Murtagh and Heck 1987). We chose in total 13 parameters that are largely unrelated, while excluding others that are highly correlated and represent similar features of membrane properties (e.g., AP rise time and AP half-width). PCA applies a multiexponential fit to the data, and each new uncorrelated variable [principal component (PC)] can describe more than one original parameter (Helm et al., 2013). We added this information to the Methods section, as suggested.
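
      As an illustration of the two-variable clustering step described above, the following is a minimal sketch of agglomerative (hierarchical) clustering on Euclidean distance. The response specifies only the distance metric, so the z-scoring of each variable, the average-linkage rule, the function names, and the example cell values are all assumptions made for this sketch and may differ from the analysis actually performed.

```python
from math import dist
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each variable so that ms and Hz contribute comparably."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against zero variance
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]


def agglomerative_clusters(points, n_clusters=2):
    """Merge the closest pair of clusters until n_clusters remain.

    Distance between clusters is the mean pairwise Euclidean distance
    (average linkage, chosen here only for illustration).
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        return mean(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Hypothetical cells: (AP half-width in ms, Fmax-initial in Hz).
    cells = [(0.30, 250), (0.32, 240), (0.28, 260),
             (0.55, 120), (0.60, 110), (0.58, 130)]
    print(agglomerative_clusters(zscore_columns(cells), n_clusters=2))
```

      With only two variables the grouping depends entirely on how well they co-vary, which is why uncorrelated cells motivate bringing in further parameters.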

      Minor points:

      (1) In Fig 3a, the traces illustrating the effects of syngap haplo-insufficiency on AMPA and NMDA EPSCs do not seem to be the best examples? For instance, the EPSCs in syngap-deficient neurons show quite different kinetics compared with control EPSCs, however Fig 3f suggests similar kinetics.

      We changed the traces as suggested.

      (2) In the first paragraph of results, it would be helpful to clarify that the experiments are performed in acute brain slices and state the age of animals.

      Done as suggested.

      (3) The following two sentences are partly redundant and could be synthesized or merged to shorten the text: "Recorded MGE-derived interneurons, identified by GFP expression, were filled with biocytin, followed by posthoc immunolabeling with anti-PV and anti-SST antibodies. PV+ and SST+ interneuron identity was confirmed using neurochemical marker (PV or SST) expression and anatomical properties (axonal arborisation location, presence of dendritic spines)."

      We rewrote the paragraph to avoid redundancy, as suggested.

      (4) In the following sentence, the mention of dendritic spines is not sufficiently clear, does it mean that spine density or spine morphology differ between PV and SST neurons?: "PV+ and SST+ interneuron identity was confirmed using neurochemical marker (PV or SST) expression and anatomical properties (axonal arborisation location, presence of dendritic spines)."

      We meant absence or presence of spines. PV+ cells typically do not have spines, while SST+ interneurons do. We corrected the sentence to improve clarity.

      (5) The first sentence of the discussion might be a bit of an overinterpretation of the data? Dissecting the circuit mechanisms of abnormal auditory function with Syngap insufficiency requires experiments very different from those reported in this paper. Moreover, that PV+ neurons from auditory cortex are particularly vulnerable to Syngap deficiency is possible, but this question is not addressed directly in this study because the effects on auditory cortex PV+ neurons were not thoroughly compared with those on PV+ cells from other cortical areas.

      We agree with the reviewer and have changed this sentence accordingly.

      Reviewer #2 (Recommendations For The Authors):

      Minor issues:

      "glutamatergic synaptic inputs to Nkx2.1+ interneurons from adult layer IV (LIV) auditory cortex" it would be more correct if this sentence used "in adult layer IV" instead of "from".

      We made the suggested changes.

      It would be useful information to provide whether the slice quality and cellular health were affected in the cHet animals.

      We did not observe any difference between control and cHet mice in terms of slice quality, success rate of recordings, and cellular health. We added this sentence to the Methods.

      Were BCshort and BCbroad observed within the same slice, same animals? This information is important to exclude the possibility of an experimental origin of the distinct AP width.

      We have indeed found both types of BCs in the same animal, and often in the same slice.

      Reviewer #3 (Recommendations For The Authors):

      (1) The introduction is rather diffuse but should be more focused on Syngap1, cellular mechanisms and interneurons. For example, the authors do not even define what Syngap1 is.

      We thank the reviewer for this very helpful suggestion. We have changed the introduction as suggested.

      (2) Some of the figures appear very busy with small fonts that are difficult to read. Also, it is very hard to appreciate the individual datapoints in the blue bars. Could a lighter color please be used?

      We thank the reviewer for this helpful suggestion. We made the suggested changes.

      (3) The strength/limit of using a conditional knockout should be discussed.

      Done as suggested, in the revised Discussion.

      (4) Statistical Methods should be described more in depth and probably some references should be added. Also, do (apparent?) inconsistencies between the text and the figures depend on the analysis used? For example, neither Fig 1g nor Fig 3f (eNMDA) reach significance despite large differences in the illustration. Maybe the authors could acknowledge this trend and discuss potential reasons for not reaching significance. Also, the legend to Fig 9 indicates the presence of "a significant decrease in AP half-width from cHet in absence or presence of a-DTX", but the bar graph does not show that.

      The interpretation of the data is based on the results of the LMM analysis, which takes into account both the number of cells and the number of mice from which these cells were recorded. We chose this statistical approach because it does not rely on the assumption that cells recorded from the same mouse are independent. We further provide detailed information about the statistical analysis in the tables associated with each figure, where we show, for each data set, both the LMM and the more commonly used Mann-Whitney test (for data that are not normally distributed) or t-test (for normally distributed data). As suggested, we added a reference about LMM to the Methods section.

      (5) Were overall control and mutant mice of the same average postnatal age? Is there a reason for the use of very young animals? Was any measured parameter correlated with age?

      Control and mutant mice were of the same postnatal age. In particular, the mean age was 75.5 ± 1.8 postnatal days for the control group and 72.1 ± 1.7 postnatal days for the cHet group (mean ± S.E.M.). We did not use any young mice. We have added this information to the Methods.

      (6) Figure 6. First, was the dendritic arborization of all cells fully intact? Second, if Figure 7 uses the same data of Figure 5 after a reclassification of PV+ cells into the two defined subpopulations, then Figure 5 should probably be eliminated as redundant. Also, if the observed changes impact predominantly one PV+ subpopulation, maybe one could argue that the synaptic changes could be (at least partially) explained by the more limited dendritic surface of BC-short (higher proportion in mutant animals) rather than only cellular mechanisms.

      All the reconstructions used for dendritic analysis contained intact cells with no evidently cut dendrites. We added this information in the methods section.

      Regarding Figure 5, we recognize the reviewer’s point of view; however, we think both figures are informative. In particular, Figure 5 shows the full data set, avoiding assumptions about the classification of PV cell subtypes, and can be more readily compared with several previously published studies.

      We apologize for our lack of clarity, which may have led to a misunderstanding. In Figure 6i our data show that BC-short from cHet mice have a larger dendritic surface and a higher number of branching points compared to BC-short from control mice. 

      (7) I am rather surprised by the AP threshold of ~-20/-15 mV observed in the datapoints of some figures. Did the authors use capacitance neutralization for their current-clamp recordings? What was the sampling rate used? Some of the phase plots (Vm vs dV/dT) suggest that it may have been too low.

      See responses to public review.

      (8) Please add the values of the series resistance of the recordings and a comparison between control and mutant animals.

      As suggested, we re-examined the series resistance (Rs) values and compared them between groups. We found no difference in Rs for eAMPA (Control mice: 13.2 ± 0.5, n=16 cells from 7 mice; cHet mice: 13.7 ± 0.3, n=14 cells from 7 mice; LMM, p=0.432) or eNMDA (Control mice: 12.7 ± 0.7, n=6 cells from 3 mice; cHet mice: 13.8 ± 0.7, n=6 cells from 5 mice; LMM, p=0.231).

      (9) I am unclear as to how the authors quantified colocalization between VGluts and PSD95 at the low magnification shown in Supplementary Figure 2. Could they please show images at higher magnification?

      Quantification was done on high-resolution images. Immunostained sections were imaged using a Leica SP8-STED confocal microscope with a 63x oil-immersion objective (NA 1.4) at 1024 x 1024, zoom=1, z-step=0.3 μm, stack size of ~15 μm. As suggested by the reviewer, we changed the figure to include images at higher magnification.

      (10) The authors claim that "cHet SST+ cells showed no significant changes in active and passive membrane properties", but this claim would seem to be directly refuted by the data of Fig 8f. In the absence of changes in either active or passive membrane properties, shouldn't the current/#AP plot remain unchanged?

      The reduction in intrinsic excitability observed in SST+ cells from cHet mice could be due to intrinsic factors not assessed in this study. However, exploring these mechanisms is beyond the scope of our current investigation. We rephrased the discussion and added this limitation of our study in the revised version.

      (11) Please check references as some are missing from the list.

      Thank you for noticing this issue, which is now corrected.

      References  

      Bengtsson Gonzales C, Hunt S, Munoz-Manchado AB, McBain CJ, Hjerling-Leffler J (2020) Intrinsic electrophysiological properties predict variability in morphology and connectivity among striatal Parvalbumin-expressing Pthlh-cells Scientific Reports 10:15680 https://doi.org/10.1038/s41598-020-72588-1

      Blundon JA, Bayazitov IT, Zakharenko SS (2011) Presynaptic gating of postsynaptically expressed plasticity at mature thalamocortical synapses The Journal of Neuroscience 31:16012-25 https://doi.org/10.1523/JNEUROSCI.3281-11.2011

      Chun S, Bayazitov IT, Blundon JA, Zakharenko SS (2013) Thalamocortical long-term potentiation becomes gated after the early critical period in the auditory cortex The Journal of Neuroscience 33:7345-57 https://doi.org/10.1523/JNEUROSCI.4500-12.2013

      Ekins TG, Mahadevan V, Zhang Y, D’Amour JA, Akgül G, Petros TJ, McBain CJ (2020) Emergence of non-canonical parvalbumin-containing interneurons in hippocampus of a murine model of type I lissencephaly eLife 9:e62373 https://doi.org/10.7554/eLife.62373

      Helm J, Akgul G, Wollmuth LP (2013) Subgroups of parvalbumin-expressing interneurons in layers 2/3 of the visual cortex Journal of Neurophysiology 109:1600–1613 https://doi.org/10.1152/jn.00782.2012

      Kavalali E (2015) The mechanisms and functions of spontaneous neurotransmitter release Nature Reviews Neuroscience 16:5–16 https://doi.org/10.1038/nrn3875

      Krause BM, Raz A, Uhlrich DJ, Smith PH, Banks MI (2014) Spiking in auditory cortex following thalamic stimulation is dominated by cortical network activity Frontiers in Systems Neuroscience 8:170 https://doi.org/10.3389/fnsys.2014.00170

      Murtagh F, Heck A (1987) Multivariate Data Analysis. Dordrecht, The Netherlands: Kluwer Academic.

      Nassar M, Simonnet J, Lofredi R, Cohen I, Savary E, Yanagawa Y, Miles R, Fricker D (2015) Diversity and overlap of Parvalbumin and Somatostatin expressing interneurons in mouse presubiculum Frontiers in Neural Circuits 9:20. https://doi.org/10.3389/fncir.2015.00020

      Ramirez DM, Kavalali ET (2011) Differential regulation of spontaneous and evoked neurotransmitter release at central synapses Current Opinion in Neurobiology 21:275-282 https://doi.org/10.1016/j.conb.2011.01.007

      Richardson RJ, Blundon JA, Bayazitov IT, Zakharenko SS (2009) Connectivity patterns revealed by mapping of active inputs on dendrites of thalamorecipient neurons in the auditory cortex The Journal of Neuroscience 29:6406-17 https://doi.org/10.1523/JNEUROSCI.3028-09.2009

      Sara Y, Virmani T, Deák F, Liu X, Kavalali ET (2005) An isolated pool of vesicles recycles at rest and drives spontaneous neurotransmission Neuron 45:563-573 https://doi.org/10.1016/j.neuron.2004.12.056

      Sara Y, Bal M, Adachi M, Monteggia LM, Kavalali ET (2011) Use-dependent AMPA receptor block reveals segregation of spontaneous and evoked glutamatergic neurotransmission The Journal of Neuroscience 31:5378-5382 https://doi.org/10.1523/JNEUROSCI.5234-10.2011

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      In this study, the authors examined the role of IBTK, a substrate-binding adaptor of the CRL3 ubiquitin ligase complex, in modulating the activity of the eIF4F translation initiation complex. They find that IBTK mediates the non-degradative ubiquitination of eIF4A1, promotes cap-dependent translational initiation, nascent protein synthesis, oncogene expression, and tumor cell growth. Correspondingly, phosphorylation of IBTK by mTORC1/S6K1 increases eIF4A1 ubiquitination and sustains oncogenic translation.

      Strengths:

      This study utilizes multiple biochemical, proteomic, functional, and cell biology assays to substantiate its results. Importantly, the work nominates IBTK as a unique substrate of mTORC1, and further validates eIF4A1 (a crucial subunit of the eIF4F complex) as a promising therapeutic target in cancer. Since IBTK interacts broadly with multiple members of the translation initiation complex, it will be interesting to examine its role in eIF2alpha-mediated ER stress as well as eIF3-mediated translation. Additionally, since IBTK exerts pro-survival effects in multiple cell types, it will be of relevance to characterize the role of IBTK in mediating increased mTORC1-mediated translation in other tumor types, thus potentially impacting their treatment with eIF4F inhibitors.

      Limitations/Weaknesses:

      The findings are mostly well supported by data, but some areas need clarification and could potentially be enhanced with further experiments:

      (1) Since eIF4A1 appears to function downstream of IBTK, can the effects of IBTK KO/KD in reducing puromycin incorporation (in Fig 3A), cap-dependent luciferase reporter activity (Fig 3G), reduced oncogene expression (Fig 4A) or 2D growth/invasion assays (Fig 4) be overcome or bypassed by overexpressing eIF4A1? These could potentially be tested in future studies.

      We appreciate the reviewer for bringing up this crucial point. As per the reviewer's suggestion, we conducted experiments in which we overexpressed Myc-eIF4A1 in IBTK-KO SiHa cells. Our findings indicate that increasing eIF4A1 levels through ectopic overexpression was unable to reverse the decrease in puromycin incorporation (Fig. S3C) or in the protein expression of eIF4A1 targets caused by IBTK ablation (Fig. S4E). These results clearly demonstrate that the eIF4A1 dysfunction induced by IBTK ablation cannot be rescued by simply elevating eIF4A1 protein levels. Given that these results were negative, the impact of eIF4A1 overexpression on the 2D growth/invasion capacities of IBTK-KO cells was not examined further. We sincerely appreciate the reviewer's understanding regarding this matter.

      (2) The decrease in nascent protein synthesis in puromycin incorporation assays in Figure 3A suggests that the effects of IBTK KO are comparable to and additive with silvestrol. It would be of interest to examine whether silvestrol decreases nascent protein synthesis or increases stress granules in the IBTK KO cells stably expressing IBTK as well.

      We appreciate the reviewer for bringing up this crucial point. We have shown that silvestrol treatment still decreased nascent protein synthesis in IBTK-KO cells overexpressing FLAG-IBTK (Fig. S3B).

      (3) The data presented in Figure 5 regarding the role of mTORC1 in IBTK-mediated eIF4A1 ubiquitination need further clarification on several points:

      • It is not clear if the experiments in Figure 5F with Phos-tag gels are using the FLAG-IBTK deletion mutant or the peptide containing the mTOR sites as it is mentioned on line 517, page 19 "To do so, we generated an IBTK deletion mutant (900-1150 aa) spanning the potential mTORC1-regulated phosphorylation sites" This needs further clarification.

      We appreciate the reviewer for bringing up this crucial point. The IBTK deletion mutant used in Fig. 5F is FLAG-IBTK900-1150aa. We have annotated it with smaller font size in the panel (red box) in Author response image 1.

      Author response image 1.

      • It may be of benefit to repeat the Phos-tag experiments with full-length FLAG-IBTK and/or endogenous IBTK with molecular weight markers indicating the size of migrated bands.

      We appreciate the reviewer for bringing up this crucial point. We attempted to perform Phos-tag assays to detect overexpressed full-length FLAG-IBTK or endogenous IBTK. However, we encountered difficulties in transferring full-length FLAG-IBTK or endogenous IBTK onto the nitrocellulose membrane during Phos-tag WB analysis. This is likely due to the limitations of this technique. In our experience, Phos-tag gels are less efficient at detecting mobility shifts of proteins with large molecular weights. As the molecular weight of IBTK is approximately 160 kDa, it falls within this category. Considering these technical constraints, we did not include Phos-tag assay results for full-length IBTK in our study. We sincerely appreciate the reviewer's understanding regarding this matter.

      The binding of Phos-tag to phosphorylated proteins induces a mobility shift during gel electrophoresis or other protein separation techniques. This shift allows phosphorylated proteins to be visualized and quantified separately from non-phosphorylated proteins. It is important to note that these mobility shifts indicate phosphorylation status rather than actual molecular weights. Pre-stained protein markers are typically used as a reference to assess the efficiency of protein transfer onto the membrane [Ref: 1]. For these reasons, we did not add molecular weights to the WB images.

      Reference [1]. FUJIFILM Wako Pure Chemical Corporation, https://www.wako-chemicals.de/media/pdf/c7/5e/20/FUJIFILM-Wako_Phos-tag-R.pdf

      • Additionally, torin or Lambda phosphatase treatment may be used to confirm the specificity of the band in separate experiments.

      We appreciate the reviewer for bringing up this crucial point. Torin1 is a synthetic mTOR inhibitor that prevents the binding of ATP to mTOR, leading to the inactivation of both mTORC1 and mTORC2, whereas rapamycin primarily targets mTORC1 activity and may inhibit mTORC2 in certain cell types after prolonged treatment. We have identified the mTORC1/S6K1 complex as the predominant mediator of IBTK phosphorylation. Therefore, in this context, we think that rapamycin is sufficient to inactivate the mTORC1/S6K1 pathway. As shown in Fig. 5F, the phosphorylated IBTK900-1150aa was markedly decreased while the non-phosphorylated form was simultaneously increased in rapamycin-treated cells. As per the reviewer's suggestion, we treated cells overexpressing FLAG-IBTK900-1150aa with lambda phosphatase. As shown in Fig. 5G, lambda phosphatase treatment completely abolished the mobility shifts of phosphorylated FLAG-IBTK900-1150aa. Additionally, the lowest band displayed an abundant accumulation of the non-phosphorylated form of FLAG-IBTK900-1150aa. These findings confirm that the mobility shifts observed in WB analysis correspond to the phosphorylated forms of FLAG-IBTK900-1150aa.

      • Phos-tag gels with the IBTK CRISPR KO line would also help confirm that the non-phosphorylated band is indeed IBTK.

      We appreciate the reviewer for bringing up this crucial point. As stated above, we performed Phos-tag assays to detect the mobility shifts of phosphorylated FLAG-IBTK900-1150aa. An anti-FLAG antibody, rather than an anti-IBTK antibody, was used for WB detection, and this antibody does not cross-react with endogenous IBTK.

      • It is unclear why the lower, phosphorylated bands seem to be increasing (rather than decreasing) with AA starvation/ Rapa in Fig 5H.

      We appreciate the reviewer for bringing up this crucial point. We think the panel the reviewer mentioned is Fig. 5F. According to the principle of Phos-tag assays, proteins with higher phosphorylation levels have slower migration rates on SDS-PAGE, while proteins with lower phosphorylation levels have faster migration rates.

      As shown in Author response image 2, the green box indicates the most phosphorylated forms of FLAG-IBTK900-1150aa, the red box indicates the moderately phosphorylated forms of FLAG-IBTK900-1150aa, and the yellow box indicates the non-phosphorylated forms of FLAG-IBTK900-1150aa. AA starvation or Rapamycin treatment reduced the hyperphosphorylated forms of FLAG-IBTK900-1150aa (green box), while simultaneously increasing the hypophosphorylated (red box) and non-phosphorylated (yellow box) forms of FLAG-IBTK900-1150aa. Thus, we conclude that AA starvation or Rapamycin treatment leads to a marked decrease in the phosphorylation levels of FLAG-IBTK900-1150aa.

      Author response image 2.

      Reviewer #2 (Public Review):

      Summary:

      This study by Sun et al. identifies a novel role for IBTK in promoting cancer protein translation, through regulation of the translational helicase eIF4A1. Using a multifaceted approach, the authors demonstrate that IBTK interacts with and ubiquitinates eIF4A1 in a non-degradative manner, enhancing its activation downstream of mTORC1/S6K1 signaling. This represents a significant advance in elucidating the complex layers of dysregulated translational control in cancer.

      Strengths:

      A major strength of this work is the convincing biochemical evidence for a direct regulatory relationship between IBTK and eIF4A1. The authors utilize affinity purification and proximity labeling methods to comprehensively map the IBTK interactome, identifying eIF4A1 as a top hit. Importantly, they validate this interaction and the specificity for eIF4A1 over other eIF4 isoforms by co-immunoprecipitation in multiple cell lines. Building on this, they demonstrate that IBTK catalyzes non-degradative ubiquitination of eIF4A1 both in cells and in vitro through the E3 ligase activity of the CRL3-IBTK complex. Mapping IBTK phosphorylation sites and showing mTORC1/S6K1-dependent regulation provides mechanistic insight. The reduction in global translation and eIF4A1-dependent oncoproteins upon IBTK loss, along with clinical data linking IBTK to poor prognosis, support the functional importance.

      Weaknesses:

      While these data compellingly establish IBTK as a binding partner and modifier of eIF4A1, a remaining weakness is the lack of direct measurements showing IBTK regulates eIF4A1 helicase activity and translation of target mRNAs. While the effects of IBTK knockout/overexpression on bulk protein synthesis are shown, the expression of multiple eIF4A1 target oncogenes remains unchanged.

      Summary:

      Overall, this study significantly advances our understanding of how aberrant mTORC1/S6K1 signaling promotes cancer pathogenic translation via IBTK and eIF4A1. The proteomic, biochemical, and phosphorylation mapping approaches established here provide a blueprint for interrogating IBTK function. These data should galvanize future efforts to target the mTORC1/S6K1-IBTK-eIF4A1 axis as an avenue for cancer therapy, particularly in combination with eIF4A inhibitors.

      Reviewer #1 (Recommendations For The Authors):

      (1) Certain references should be provided for clarity. For example: Page 15, line 418 "The C-terminal glycine-glycine (GG) amino acid residues are essential for Ub conjugation to targeted proteins".

      We appreciate the reviewer for bringing up this crucial point. We now cite two fundamental review papers on the ubiquitin system (PMID: 22524316, 9759494) as references for this sentence.

      (2) Please describe the properties of the ΔBTB mutant on page 15 when first describing it. What motifs does it lack and has it been described before in functional studies?

      We appreciate the reviewer for bringing up this crucial point. We added a sentence to describe the properties of the ΔBTB mutant. This mutant lacks the BTB1 and BTB2 domains (deletion of aa 554–871), which have been previously demonstrated to be essential for binding to CUL3. The original reference has been added to the revised manuscript.

      (3) In Figure 2G how do the authors explain the fact that co-expression of the Ub K-ALLR mutant, which is unable to form polyubiquitin chains, formed only a moderate reduction in IBTK-mediated eIF4A1 ubiquitination?

      We appreciate the reviewer for bringing up this crucial point. The Ub K-ALLR mutant can indeed be conjugated to substrate proteins, but it cannot form chains because it lacks lysine residues, resulting in mono-ubiquitination. Multi-mono-ubiquitination refers to the attachment of single ubiquitin molecules to multiple lysine residues on a substrate protein. It is worth noting that a poly-ubiquitinated protein and a multi-mono-ubiquitinated protein appear strikingly similar in Western blots. Our findings demonstrated that co-expression of the Ub K-ALLR mutant resulted in only a modest reduction in IBTK-mediated eIF4A1 ubiquitination (Fig. 2G), and that eIF4A1 was ubiquitinated at twelve lysine residues when co-expressed with IBTK (Fig. S2F). As such, we conclude that the CRL3-IBTK complex primarily catalyzes multi-mono-ubiquitination of eIF4A1.

      (4) In Figure 5, The identity of the seven sites in the IBTK 7ST A mutants should be specified.

      We appreciate the reviewer for bringing up this crucial point. We have specified the seven mutation sites in the IBTK-7ST A mutant (Fig. 6A).

      (5) In Figure 5, the rationale for generating antibodies only to S990/992/993, as opposed to the other mTORC1/S6K motifs should be specified.

      We appreciate the reviewer for bringing up this crucial point. After demonstrating that IBTK can be phosphorylated, with evidence from positive Phos-tag and in vitro phosphorylation assays, we sought to directly detect changes in phosphorylation levels using an antibody specific to phosphorylated IBTK. However, generating a separate phosphorylation-specific antibody for each of the seven sites would be costly. Recognizing that S990/992/993 are three adjacent sites, we deemed it appropriate to generate a single antibody that recognizes the phospho-S990/992/993 epitope. Moreover, of the seven phosphorylation sites, S992 perfectly matches the consensus motif for S6K1 phosphorylation (RXRXXS). Using this antibody, we observed a substantial decrease in the phosphorylation levels of these three adjacent Ser residues in IBTK following either AA deprivation or Rapamycin treatment (Fig. 5L). We have specified these points in the manuscript.
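
      To make the motif argument concrete, the following is a minimal sketch of how candidate serines matching the RXRXXS consensus could be located in a protein sequence. The function name and the example sequence are hypothetical; the sequence is invented for illustration and is not the IBTK sequence.

```python
import re

# RXRXXS: Arg, any residue, Arg, any two residues, then the phospho-acceptor Ser.
# The lookahead lets overlapping motif occurrences be reported as well.
RXRXXS = re.compile(r"(?=(R.R..S))")


def find_rxrxxs_serines(sequence, first_residue=1):
    """Return residue numbers of the Ser that ends each RXRXXS match.

    first_residue is the residue number of sequence[0], which is useful
    when scanning a fragment rather than a full-length protein.
    """
    seq = sequence.upper()
    return [m.start() + 5 + first_residue for m in RXRXXS.finditer(seq)]


if __name__ == "__main__":
    toy_sequence = "MKRARSGSAAPRLRTLSPEE"  # made-up peptide, for illustration only
    print(find_rxrxxs_serines(toy_sequence))
    # The same peptide, numbered as if it began at residue 900 of a larger protein.
    print(find_rxrxxs_serines(toy_sequence, first_residue=900))
```

      A scan of this kind only nominates candidate sites; whether a given serine is actually phosphorylated still has to be shown experimentally, as with the phospho-specific antibody described above.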

      Reviewer #2 (Recommendations For The Authors):

      The following suggestions would strengthen the study:

      (1) Directly examine the effects of IBTK modulation (knockdown/knockout/overexpression) on eIF4A1 helicase activity.

      We appreciate the reviewer for bringing up this crucial point. We agree with the reviewer's suggestion that directly evaluating IBTK's influence on eIF4A1 helicase activity would strengthen our conclusion. However, the current eIF4A1 helicase assays, as described in previous publications [Ref: 1, 2], can only be conducted in vitro using purified recombinant proteins. For instance, it is feasible to assess the varying levels of helicase activity exhibited by recombinant wild-type or mutant eIF4A1 proteins [Ref: 2]. Importantly, there is currently no reported methodology for evaluating the helicase activity of eIF4A1 in vivo, that is, in the gene knockdown, knockout, or overexpression cellular contexts mentioned by the reviewer. Therefore, we have not performed these assays, and we sincerely appreciate the reviewer's understanding regarding this matter.

      Reference:

      [1] Chu J, Galicia-Vázquez G, Cencic R, Mills JR, Katigbak A, Porco JA, Pelletier J. CRISPR-mediated drug-target validation reveals selective pharmacological inhibition of the RNA helicase, eIF4A. Cell reports. 2016 Jun 14;15(11):2340-7.

      [2] Chu J, Galicia-Vázquez G, Cencic R, Mills JR, Katigbak A, Porco JA, Pelletier J. CRISPR-mediated drug-target validation reveals selective pharmacological inhibition of the RNA helicase, eIF4A. Cell reports. 2016 Jun 14;15(11):2340-7.

      (2) Justify why the expression of some but not all eIF4A1 target oncogenes is affected in IBTK-depleted/overexpressing cells. This is important if IBTK should be considered as a therapeutic target. The authors should consider which of the eIF4A1 targets are most impacted by IBTK KO. This would provide a more focused therapeutic approach in the future.

      We appreciate the reviewer for bringing up this crucial point. As the reviewer has pointed out, we assessed the protein levels of ten reported eIF4A1 target genes across three cancer cell lines (Fig. 4, Fig. S4A, C). We observed that IBTK depletion led to a substantial reduction in the protein levels of most eIF4A1-regulated oncogenes, although there were some exceptions. For instance, IBTK KO in H1299 cells had minimal influence on the protein levels of ROCK1 (Fig. S4A). Several possible explanations might account for this observation. First, given that our list of eIF4A1 target genes was collected from previous studies conducted in distinct cell lines, it is not unexpected for different lines to exhibit subtle differences in the regulation of eIF4A1 target genes. Second, as a CRL3 adaptor, IBTK potentially performs other biological functions via ubiquitination of specific substrates; dysregulation of these could buffer the impact of IBTK KO on the protein expression of some eIF4A1 target genes. We added these comments to the Discussion section of the revised manuscript.

      (3) Expand mTOR manipulation experiments (inhibition, Raptor knockout, activation) and evaluate impacts on IBTK phosphorylation, eIF4A1 ubiquitination, and translation.

      The mTORC1 signaling pathway is constitutively active under normal culture conditions. To inhibit mTORC1 activation, we employed several approaches, including AA starvation, rapamycin treatment, and Raptor knockout. Our results demonstrated that both AA starvation and rapamycin treatment led to a reduction in eIF4A1 ubiquitination (Fig. 5M). Moreover, we have included new findings in the revised manuscript showing that Raptor knockout specifically decreases eIF4A1 ubiquitination (Fig. 5N). The impacts of mTOR inhibition or activation on protein translation have been extensively investigated and documented in numerous studies. Therefore, we did not feel it necessary to examine these treatments further in our study.

      (4) Although not absolutely necessary, it would be nice to see if some of these findings are true in other cancer cell types.

      We thank the reviewer for raising this point. We agree that including data from other cancer cell types would strengthen our conclusion. While the majority of our data is derived from two cervical cancer cell lines, we have corroborated certain key findings, such as the impact of IBTK on eIF4A1 and its target gene expression, in H1299 cells (human lung cancer) (Fig. 2C, Fig. S4A, B) and in CT26 cells (murine colon adenocarcinoma) (Fig. S4C, D). Additionally, we demonstrated that IBTK promotes IFN-γ-induced PD-L1 expression and tumor immune escape in both H1299 and CT26 cells (Fig. S6A-K).

    1. Author Response:

      The following is the authors’ response to the original reviews.

      General response

      (1) Evaluation of mitochondrial activity in mox-YG overexpression cells

      To determine whether the observed “mitochondrial development” seen in transcriptomic, proteomic, and microscopic analyses corresponds to an actual phenotypic shift toward respiration, we measured oxygen consumption in mox-YG overexpression cells. The results showed that oxygen consumption rates were indeed elevated in these cells, suggesting a metabolic shift from fermentation toward respiration. These findings have been incorporated into the revised manuscript as new Figure 4E and Figure 4—figure supplement 9, along with the corresponding descriptions in the Results section.

      (2) Evaluation of TORC1 Pathway Inactivation in mox-YG Overexpression Cells

      While the proteomic response in mox-YG overexpression cells overlapped with known responses to TORC1 pathway inactivation, we had not obtained direct evidence that TORC1 activity was indeed reduced. To address this, we assessed TORC1 activity by testing the effect of rapamycin, a TORC1 inhibitor, and by attempting to detect the phosphorylation state of known TORC1 targets. Our results showed that mox-YG overexpressing cells exhibited reduced sensitivity to rapamycin compared to vector control cells, supporting the idea that TORC1 is already inactivated in the mox-YG overexpression condition.

      In parallel, we attempted to detect phosphorylation of TORC1 targets Sch9 and Atg13 by Western blotting. Specifically, we tested several approaches: detecting phospho-Sch9 using a phospho-specific antibody, assessing the band shift of HA-tagged Sch9, and monitoring Atg13 band shift using an anti-Atg13 antibody. While we were unable to detect Sch9 phosphorylation, likely due to technical limitations, we finally succeeded in detecting Atg13 with the help of our new co-author, Dr. Kamada. However, we observed a marked reduction in Atg13 protein levels in mox-YG overexpression cells, making it difficult to interpret the biological significance of any apparent decrease in phosphorylation. Therefore, we decided not to pursue further experiments on TORC1 phosphorylation within the current revision period.

      These findings have been summarized in new Figure 4—figure supplement 7, and the relevant description has been added to the Results section.

      (3) Phenotypes of Gpm1-CCmut

      We focused our initial analysis on the phenotypes of cells overexpressing mox-YG, the protein with the lowest Neutrality Index (NI) in our dataset, as a model of protein burden. However, it remained unclear to what extent the phenotypes observed in mox-YG overexpression cells are generalizable to protein burden as a whole. We agree with the reviewers’ suggestion that it is important to examine whether similar phenotypes are also observed in cells overexpressing Gpm1-CCmut, which was newly identified in this study as having a similarly low NI. We therefore performed validation experiments using Gpm1-CCmut overexpression cells to assess whether they exhibit the characteristic phenotypes observed in mox-YG overexpression cells. These phenotypes included: transcriptional responses, mitochondrial development, metabolic shift toward respiration, and nucleolar shrinkage.

      As a result, mitochondrial development and nucleolar shrinkage were also observed in Gpm1-CCmut overexpression cells, consistent with mox-YG. In contrast, the transcriptional response associated with amino acid starvation and the metabolic shift toward respiration were not observed. Furthermore, an abnormal rounding of cell morphology—absent in mox-YG overexpression cells—was uniquely observed in Gpm1-CCmut cells. These results suggest that the phenotypes observed under mox-YG overexpression may comprise both general effects of protein burden and effects specific to the mox-YG protein. Alternatively, it is possible that Gpm1-CCmut imposes a different kind of constraint or toxicity not shared with mox-YG. In any case, these findings highlight that the full range of phenotypes associated with protein burden cannot yet be clearly defined and underscore the need for future analyses using a variety of “non-toxic” proteins.

      Given that these results form a coherent set, we have relocated original Figure 3, which presented the NI values of Gpm1 and Tdh3 in the original version, to new Figure 6, which now includes all related phenotypic analyses. Correspondingly, we have added new Figure 6—figure supplement 1 through Figure 6—figure supplement 7. The associated results have been incorporated into the Results section, and we have expanded the Discussion to address this point.

      As a result of these revisions, the order of figures has changed from the original version. The correspondence between the original and revised versions is as follows:

      Original → Revised

      Figure 1 → Figure 1
      Figure 2 → Figure 2
      Figure 3 → Figure 6
      Figure 4 → Figure 3
      Figure 5 → Figure 4
      Figure 6 → Figure 5

      Public Reviews:

      Reviewer #1 (Public Review):

      Weaknesses:

      While the introduction of the neutrality index seems useful to differentiate between cytotoxicity and protein burden, the biological relevance of the effects of overexpression of the model proteins is unclear.

      Thank you for your comment. This point is in fact the core message we wished to convey in this study. We believe that every protein possesses some degree of what can be described as “cytotoxicity,” and that this should be defined by the expression limit—specifically, the threshold level at which growth inhibition occurs. This index corresponds to what we term the neutrality index. We further argue that protein cytotoxicity arises from a variety of constraints inherent to each protein. These constraints act in a stepwise manner to determine the expression limit (i.e., the neutrality) of a given protein (Figure 1A). To demonstrate the real existence of such constraints, there are two complementary approaches: an inductive one that involves large-scale, systematic investigation of naturally occurring proteins, and a deductive one that tests hypotheses using selected model proteins. Our current study follows the latter approach. In addition, we define protein burden as a phenomenon that can only be elicited by proteins that are ultimately harmless (Figure 1B). We assume that such burden results in a shared physiological state, such as depletion of cellular resources. Through continued efforts to identify a protein suitable for investigating this phenomenon, we eventually arrived at mox-YG. As the reviewer rightly pointed out, examining only mox-YG does not reveal the full picture of protein burden. In fact, in response to the reviewer’s suggestion, we investigated the physiological consequences of overexpressing a mutant glycolytic protein, Gpm1-CCmut (General Response 3). We found that the resulting phenotype was notably different from that observed in cells overexpressing mox-YG. Going forward, we believe that our study provides a foundation for further systematic exploration of “harmless proteins” and the cellular impacts of their overexpression.

      Reviewer #2 (Public Review):

      Weaknesses:

      The authors concluded from their RNA-seq and proteomics results that cells with excess mox-YG expression showed increased respiration and TORC1 inactivation. I think it will be more convincing if the authors can show some characterization of mitochondrial respiration/membrane potential and the TOR responses to further verify their -omic results.

      These points are addressed in General Response 1 and 2.

      In addition, the authors only investigated how overexpression of mox-YG affects cells. It would be interesting to see whether overexpressing other non-toxic proteins causes similar effects, or if there are protein-specific effects. It would be good if the authors could at least discuss this point considering the workload of doing another RNA-seq or mass-spectrum analysis might be too heavy.

      These points are addressed in General Response 3.

      Reviewer #3 (Public Review):

      Weaknesses:

      The data are generally convincing, however in order to back up the major claim of this work - that the observed changes are due to general protein burden and not to the specific protein or condition - a broader analysis of different conditions would be highly beneficial.

      These points are addressed in General Response 3.

      Major points:

      (1) The authors identify several proteins with high neutrality scores but only analyze the effects of mox/mox-YG overexpression in depth. Hence, it remains unclear which molecular phenotypes they observe are general effects of protein burden or more specific effects of these specific proteins. To address this point, a proteome (and/or transcriptome) of at least a Gpm1-CCmut expressing strain should be obtained and compared to the mox-YG proteome. Ideally, this analysis should be done simultaneously on all strains to achieve a good comparability of samples, e.g. using TMT multiplexing (for a proteome) or multiplexed sequencing (for a transcriptome). If feasible, the more strains that can be included in this comparison, the more powerful this analysis will be and can be prioritized over depth of sequencing/proteome coverage.

      This comment has been addressed in General Response 3. Gpm1-CCmut overexpression cells exhibited both phenotypes that were shared with, and distinct from, those observed in mox-YG overexpression cells. To define a unified set of phenotypes associated with "protein burden," we believe that extensive omics analyses targeting multiple "non-toxic" protein overexpression strains will be necessary. However, such an effort goes beyond the scope of the current study, and we would like to leave it as an important subject for future investigation.

      (2) The genetic tug-of-war system is elegant but comes at the cost of requiring specific media conditions (synthetic minimal media lacking uracil and leucine), which could be a potential confound, given that metabolic rewiring, and especially nitrogen starvation are among the observed phenotypes. I wonder if some of the changes might be specific to these conditions. The authors should corroborate their findings under different conditions. Ideally, this would be done using an orthogonal expression system that does not rely on auxotrophy (e.g. using antibiotic resistance instead) and can be used in rich, complex mediums like YPD. Minimally, using different conditions (media with excess or more limited nitrogen source, amino acids, different carbon source, etc.) would be useful to test the robustness of the findings towards changes in media composition.

      We appreciate the reviewer’s clear understanding of both the advantages and limitations of the gTOW system. As rightly pointed out, since our system relies on leucine depletion, it is essential to carefully consider the potential impact this may have on cellular metabolism. Another limitation—though it also serves as one of the strengths—of the gTOW system is its reliance on copy number variation to achieve protein overexpression. This feature limits the possibility of observing rapid responses, as immediate induction is not feasible. To address this issue, we have recently developed a strong and inducible promoter that minimizes effects on other metabolic systems (Higuchi et al., 2024), and we believe this tool will be essential in future experiments.

      In response to the reviewer’s comments, we conducted two additional sets of experiments. First, we established a new overexpression system in nutrient-rich conditions (YPD medium) that is conceptually similar to gTOW but uses aureobasidin A and the AUR1d resistance gene to promote gene amplification (new Figure 4—figure supplement 2). Using this system, we observed that non-fluorescent YG mutants led to increased expression of mox. Total protein levels appeared to rise correspondingly, suggesting that the overall synthetic capacity of cells might be higher in YPD compared to SC medium. However, the degree of overexpression achieved in this system was insufficient to strongly inhibit growth, meaning we could not replicate the stress conditions observed with the original gTOW system. Further studies will be needed to determine whether stronger induction under these nutrient-rich conditions will yield comparable responses.

      Second, we performed a control experiment to examine whether the amino acid starvation response observed in mox-YG overexpressing cells could be attributed to leucine depletion from the medium (new Figure 3—figure supplement 3). By titrating leucine concentrations in SC medium, we confirmed that lower leucine levels reduced the growth rate of vector control cells, indicating leucine limitation. However, GAP1 induction was not observed under these conditions. In contrast, mox-YG overexpression led to strong GAP1 induction under similar growth-inhibitory conditions, suggesting that the amino acid starvation response is not simply due to environmental leucine depletion, but rather a consequence of the cellular burden imposed by mox-YG overexpression.

      These findings have been incorporated into the manuscript, along with the corresponding figures (new Figure 4—figure supplement 2, Figure 3—figure supplement 3), and relevant descriptions have been added to the Results and Discussion sections.

      (3) The authors suggest that the TORC1 pathway is involved in regulating some of the changes they observed. This is likely true, but it would be great if the hypothesis could be directly tested using an established TORC1 assay.

      This comment has been addressed in General Response 2. We assessed the rapamycin sensitivity of mox-YG overexpression cells—which was found to be reduced—and attempted to detect phosphorylation of the TORC1 target Atg13, although the latter was only partially successful. These findings have been incorporated into the Results section.

      (4) The finding that the nucleolus appears to be virtually missing in mox-YG-expressing cells (Figure 6B) is surprising and interesting. The authors suggest possible mechanisms to explain this and partially rescue the phenotype by a reduction-of-function mutation in an exosome subunit. I wonder if this is specific to the mox-YG protein or a general protein burden effect, which the experiments suggested in point 1 should address. Additionally, could a mox-YG variant with a nuclear export signal be expressed that stays exclusively in the cytosol to rule out that mox-YG itself interferes with phase separation in the nucleus?

      As also described in our General Response 3, we observed nucleolar shrinkage upon Gpm1-CCmut overexpression as well (new Figure 6E and 6—figure supplement 7), suggesting that this phenomenon may represent a general feature of protein burden. The reviewer’s suggestion to test whether this effect persists when mox-YG is excluded from the nucleus is indeed intriguing. However, based on our previous work, we have shown that overexpression of NES-tagged proteins (e.g., NES-EGFP) causes severe growth inhibition due to depletion of nuclear export factors (Kintaka et al., 2020). Unfortunately, this technical limitation makes it difficult for us to carry out the proposed experiment as suggested.

      Minor points:

      (5) It would be great if the authors could directly compare the changes they observed at the transcriptome and proteome levels. This can help distinguish between changes that are transcriptionally regulated versus more downstream processes (like protein degradation, as proposed for ribosome components).

      We also considered this point to be important, and therefore compared the transcriptomic and proteomic changes associated with mox-YG overexpression. Somewhat unexpectedly, we found little correlation between these two layers of response. As shown in new Figures 3 and 4 (original Figures 4 and 5), while genes related to oxidative phosphorylation were consistently upregulated at both the mRNA and protein levels in mox-YG overexpressing cells, ribosomal proteins showed a discordant pattern: their mRNA levels were significantly increased, whereas their protein levels were significantly decreased.

      Several factors may explain this discrepancy: (1) differences in analytical methods between transcriptomics and proteomics; (2) temporal mismatches arising from the dynamic changes in mRNA and protein expression during batch culture; and (3) the possibility that, under protein burden conditions, specific regulatory mechanisms may govern the selective translation or targeted degradation of certain proteins. However, at this point, we were unable to clearly determine which of these factors accounts for the observed differences.

      For this reason, we did not originally include a global transcriptome–proteome comparison in the manuscript. In response to the reviewer’s comment, however, we have now included the comparison data (new Figure 4—figure supplement 3D).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Major points:

      (1) While the study provides a detailed description of physiological changes, the underlying mechanisms remain speculative. For example, the exact reasons for nitrogen source depletion or increased respiration are unclear. The transcriptomic and proteomic data should be complemented by basic growth assay tests on rapamycin or glycerol to strengthen these observations.

      This comment has been addressed in General Responses 1 and 2. We conducted oxygen consumption assays and growth assays in the presence of rapamycin, and incorporated these results into the revised version of the manuscript.

      We also performed culture experiments using glycerol as a carbon source. However, both the vector control and mox-YG overexpression cells showed extremely poor growth. Although there was a slight difference between the two, we judged that it would be difficult to draw any meaningful conclusions from these results. Therefore, we have chosen not to include them in the main text (the data are attached below for reference).

      Author response image 1.

      (2) The study mainly focuses on two proteins, mox-YG/FP proteins and Gpm1-CCmut. Did the authors also look at a broader range of proteins with varying degrees of cytotoxicity, such as known cytotoxic proteins, to validate the neutrality index and generalize their findings?

      In our calculation of the Neutrality Index (NI), we use two parameters: the maximum growth rate (expressed as %MGR relative to the control) and the protein expression level. For the latter, we measure the abundance of the overexpressed protein as a percentage of total cellular protein, based on the assumption that the protein is expressed at a sufficiently high level to be detectable by SDS-PAGE. In our view, proteins typically regarded as “cytotoxic” cannot be overexpressed to levels detectable by SDS-PAGE without the use of more sensitive techniques such as Western blotting. This limitation in expression itself is an indication of their high cytotoxicity. Consequently, for such proteins, NI is determined solely by the MGR value, and will inherently fall below 100.

      To test whether this interpretation is valid, we re-evaluated a group of EGFP variants previously reported by us to exhibit higher cytotoxicity than EGFP (Kintaka et al., 2016), due to overloading of specific cellular transport pathways. These include EGFPs tagged with localization signals. At the time of the original study, we had not calculated their NI values. Upon re-analysis, we found that all of these localization-tagged EGFP variants indeed have NI values below 100.

      This result has been included as a new Figure 2—figure supplement 3, and the relevant descriptions have been added to the Results section.

      (3) The partial rescue of ribosomal biosynthesis defects by a mutation in the nuclear exosome is intriguing but not fully explored. The specific role of the nuclear exosome in managing protein burden remains unclear. This result could be supported by alternative experiments. For example, would tom1 deletion or proteasome inhibition (degradation of ribosomal proteins in the nucleus) partially rescue the nuclear formation?

      As described in the main text, our interest in exosome mutants was prompted by our previous SGA (Synthetic Genetic Array) analysis, in which these mutants exhibited positive genetic interactions with GFP overexpression—namely, they acted in a rescuing manner (Kintaka et al., 2020). In contrast, proteasome mutants did not show such positive interactions in the same screening. On the contrary, proteasome mutants that displayed negative genetic interactions have been identified, such as the pre7ts mutant. Furthermore, the proteasome is involved in various aspects of proteostasis beyond just orphan ribosomal proteins, making the interpretation of its effects potentially quite complex.

      Regarding the TOM1 mutant raised by the reviewer, we attempted to observe nucleolar morphology using the NSR1-mScarlet-I marker in the tom1Δ deletion strain. However, we were unsuccessful in constructing the strain. This failure may be due to the strong detrimental effects of this perturbation in the tom1Δ background. As we were unable to complete this experiment within the revision period, we would like to address this issue in future work.

      Minor comments:

      (1) It would be interesting to include long-term cellular and evolutionary responses to protein overexpression to understand how cells adapt to chronic protein burden.

      Thank you for the suggestion. We are currently conducting experiments related to these points. However, as they fall outside the scope of the present study, we would like to refrain from including the data in this manuscript.

      (2) The microscopy of Nsr1 in Figure 6G does not clearly demonstrate the restored formation of the nucleolus in the mtr4-1 mutant. Electron microscopy images would be a better demonstration.

      The restoration of nucleolar size in the mtr4-1 mutant, as shown in Figure 5—figure supplement 5 (original Figure 6_S5), is statistically significant. However, as described in the main text, the degree of rescue by the mutation is partial, and, as the reviewer notes, not clearly distinguishable by eye. It becomes apparent only when analyzing a large number of cells, allowing for detection as a statistically significant difference. Given that electron microscopy images are inherently limited in the number of cells that can be analyzed and pose challenges for statistical evaluation, we believe it would be difficult to detect such a subtle difference using this method. Therefore, we respectfully ask for your understanding that we will not include additional EM experiments in this revision.

      (3) On page 24, line 451, the text refers to "the 84 ribosomal proteins". The latest reviews and structures describe 79 ribosomal proteins in budding yeast, the majority of which are incorporated into pre-ribosomal particles in the nucleolus. We could not find this information in the provided reference. Please align with the literature.

      Thank you for the comment. In S. cerevisiae, many ribosomal protein genes are present as duplicated pairs, resulting in a total of 136 ribosomal proteins (http://ribosome.med.miyazaki-u.ac.jp/rpg.cgi?mode=genetable). However, not all of them are duplicated, and among the duplicated pairs, some can be distinguished by proteomic analysis based on differences in amino acid sequence, while others cannot. As a result, we report that 84 ribosomal proteins were “detected” in our proteomic analysis. To avoid confusion, we have added the following explanation to the legend of Figure 5—figure supplement 1 (original Figure 6_S1).

      “Note that when the amino acid sequences of paralogs are identical, they cannot be distinguished by proteomic analysis, and the protein abundance of both members of the paralog pair is represented under the name of only one.”

      Reviewer #2 (Recommendations for the authors):

      (1) The authors mentioned that based on their proteomics results, overexpressing mox-YG appears to increase respiration. I think it is worth doing some quick verification, such as oxygen consumption experiments or mitochondrial membrane potential staining to provide some verification on that.

      This comment has been addressed in General Response 1. We measured oxygen consumption in mox-YG overexpression cells and found that it was indeed elevated, suggesting a metabolic shift from fermentation toward aerobic respiration.

      (2) Similar to point 1, the authors concluded from their proteomics data that the mox-YG overexpression induced responses that are similar to TORC1 inactivation. It might be worth testing whether there is any actual TORC1 inactivation, e.g. by detecting whether there is reduced Sch9 phosphorylation by western blot.

      This comment has been addressed in General Response 2. We assessed the rapamycin sensitivity of mox-YG overexpression cells—which was found to be reduced—and attempted to detect phosphorylation of the TORC1 target Atg13, although the latter was only partially successful. These findings have been incorporated into the Results section.

      (3) The authors showed that overexpressing excess mox-YG caused downregulated glycolysis pathways. It is worth discussing whether overexpressing glycolysis-related non-toxic proteins such as Gpm1-CCmut will also lead to similar results.

      This comment has been addressed in General Response 3. Gpm1-CCmut overexpression cells exhibited both phenotypes shared with mox-YG overexpression and distinct ones. These findings suggest that a unified set of phenotypes associated with "protein burden" has yet to be clearly defined, and further investigation will be necessary to elucidate this.

      Reviewer #3 (Recommendations for the authors):

      (1) The authors identify several proteins with high neutrality scores but only analyze the effects of mox/mox-YG overexpression in depth. Hence, it remains unclear which molecular phenotypes they observe are general effects of protein burden or more specific effects of these specific proteins. To address this point, a proteome (and/or transcriptome) of at least a Gpm1-CCmut expressing strain should be obtained and compared to the mox-YG proteome. Ideally, this analysis should be done simultaneously on all strains to achieve a good comparability of samples, e.g. using TMT multiplexing (for a proteome) or multiplexed sequencing (for a transcriptome). If feasible, the more strains that can be included in this comparison, the more powerful this analysis will be and can be prioritized over depth of sequencing/proteome coverage.

      This comment has been addressed in General Response 3. Gpm1-CCmut overexpression cells exhibited both phenotypes that were shared with, and distinct from, those observed in mox-YG overexpression cells. To define a unified set of phenotypes associated with "protein burden," we believe that extensive omics analyses targeting multiple "non-toxic" protein overexpression strains will be necessary. However, such an effort goes beyond the scope of the current study, and we would like to leave it as an important subject for future investigation.

      (2) The genetic tug-of-war system is elegant but comes at the cost of requiring specific media conditions (synthetic minimal media lacking uracil and leucine), which could be a potential confound, given that metabolic rewiring, and especially nitrogen starvation are among the observed phenotypes. I wonder if some of the changes might be specific to these conditions. The authors should corroborate their findings under different conditions. Ideally, this would be done using an orthogonal expression system that does not rely on auxotrophy (e.g. using antibiotic resistance instead) and can be used in rich, complex mediums like YPD. Minimally, using different conditions (media with excess or more limited nitrogen source, amino acids, different carbon source, etc.) would be useful to test the robustness of the findings towards changes in media composition.

      We appreciate the reviewer’s clear understanding of both the advantages and limitations of the gTOW system. As rightly pointed out, since our system relies on leucine depletion, it is essential to carefully consider the potential impact this may have on cellular metabolism. Another limitation—though it also serves as one of the strengths—of the gTOW system is its reliance on copy number variation to achieve protein overexpression. This feature limits the possibility of observing rapid responses, as immediate induction is not feasible. To address this issue, we have recently developed a strong and inducible promoter that minimizes effects on other metabolic systems (Higuchi et al., 2024), and we believe this tool will be essential in future experiments.

      In response to the reviewer’s comments, we conducted two additional sets of experiments. First, we established a new overexpression system in nutrient-rich conditions (YPD medium) that is conceptually similar to gTOW but uses aureobasidin A and the AUR1d resistance gene to promote gene amplification (new Figure 4—figure supplement 2). Using this system, we observed that non-fluorescent YG mutants led to increased expression of mox. Total protein levels appeared to rise correspondingly, suggesting that the overall synthetic capacity of cells might be higher in YPD compared to SC medium. However, the degree of overexpression achieved in this system was insufficient to strongly inhibit growth, meaning we could not replicate the stress conditions observed with the original gTOW system. Further studies will be needed to determine whether stronger induction under these nutrient-rich conditions will yield comparable responses.

      Second, we performed a control experiment to examine whether the amino acid starvation response observed in mox-YG overexpressing cells could be attributed to leucine depletion from the medium (new Figure 3—figure supplement 3). By titrating leucine concentrations in SC medium, we confirmed that lower leucine levels reduced the growth rate of vector control cells, indicating leucine limitation. However, GAP1 induction was not observed under these conditions. In contrast, mox-YG overexpression led to strong GAP1 induction under similar growth-inhibitory conditions, suggesting that the amino acid starvation response is not simply due to environmental leucine depletion, but rather a consequence of the cellular burden imposed by mox-YG overexpression.

      These findings have been incorporated into the manuscript, along with the corresponding figures (new Figure 4—figure supplement 2, Figure 3—figure supplement 3), and relevant descriptions have been added to the Results and Discussion sections.

      (3) The authors suggest that the TORC1 pathway is involved in regulating some of the changes they observed. This is likely true, but it would be great if the hypothesis could be directly tested using an established TORC1 assay.

      This comment has been addressed in General Response 2. We assessed the rapamycin sensitivity of mox-YG overexpression cells—which was found to be reduced—and attempted to detect phosphorylation of the TORC1 target Atg13, although the latter was only partially successful. These findings have been incorporated into the Results section.

      (4) The finding that the nucleolus appears to be virtually missing in mox-YG-expressing cells (Figure 6B) is surprising and interesting. The authors suggest possible mechanisms to explain this and partially rescue the phenotype by a reduction-of-function mutation in an exosome subunit. I wonder if this is specific to the mox-YG protein or a general protein burden effect, which the experiments suggested in point 1 should address. Additionally, could a mox-YG variant with a nuclear export signal be expressed that stays exclusively in the cytosol to rule out that mox-YG itself interferes with phase separation in the nucleus?

      As also described in our General Response 3, we observed nucleolar shrinkage upon Gpm1-CCmut overexpression as well (new Figure 6E and 6—figure supplement 7), suggesting that this phenomenon may represent a general feature of protein burden. The reviewer’s suggestion to test whether this effect persists when mox-YG is excluded from the nucleus is indeed intriguing. However, based on our previous work, we have shown that overexpression of NES-tagged proteins (e.g., NES-EGFP) causes severe growth inhibition due to depletion of nuclear export factors (Kintaka et al., 2020). Unfortunately, this technical limitation makes it difficult for us to carry out the proposed experiment as suggested.

      (5) It would be great if the authors could directly compare the changes they observed at the transcriptome and proteome levels. This can help distinguish between changes that are transcriptionally regulated versus more downstream processes (like protein degradation, as proposed for ribosome components).

      We also considered this point to be important, and therefore compared the transcriptomic and proteomic changes associated with mox-YG overexpression. However, somewhat unexpectedly, we found little correlation between these two layers of response. As shown in new Figures 3 and 4 (original Figures 4 and 5), while genes related to oxidative phosphorylation were consistently upregulated at both the mRNA and protein levels in mox-YG overexpressing cells, ribosomal proteins showed a discordant pattern: their mRNA levels were significantly increased, whereas their protein levels were significantly decreased.

      Several factors may explain this discrepancy: (1) differences in analytical methods between transcriptomics and proteomics; (2) temporal mismatches arising from the dynamic changes in mRNA and protein expression during batch culture; and (3) the possibility that, under protein burden conditions, specific regulatory mechanisms may govern the selective translation or targeted degradation of certain proteins. However, at this point, we were unable to clearly determine which of these factors accounts for the observed differences.

      For this reason, we did not originally include a global transcriptome–proteome comparison in the manuscript. In response to the reviewer’s comment, however, we have now included the comparison data (new Figure 4—figure supplement 3D).
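
      For readers who wish to run a similar comparison on their own data, the sketch below shows one simple way to relate the two layers: a Pearson correlation of per-gene log2 fold changes, plus a per-gene label of "concordant" or "discordant". It is an illustration only, not the analysis used in the manuscript. The function names, the threshold of 1.0, and the example values are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def classify(mrna_lfc, protein_lfc, threshold=1.0):
    """Label one gene by the direction of its mRNA and protein changes.

    `threshold` is an assumed cutoff on |log2 fold change|; it is not a
    value taken from the manuscript.
    """
    if abs(mrna_lfc) < threshold or abs(protein_lfc) < threshold:
        return "unchanged or ambiguous"
    same_direction = (mrna_lfc > 0) == (protein_lfc > 0)
    return "concordant" if same_direction else "discordant"


if __name__ == "__main__":
    # Hypothetical log2 fold changes (mRNA, protein); placeholders only.
    example = {
        "oxphos_gene_A": (1.5, 1.2),
        "ribosomal_gene_B": (1.2, -1.4),
        "other_gene_C": (0.1, 0.3),
    }
    for gene, (mrna, protein) in example.items():
        print(gene, classify(mrna, protein))
    mrna_values = [v[0] for v in example.values()]
    protein_values = [v[1] for v in example.values()]
    print("Pearson r:", pearson(mrna_values, protein_values))
```

      Under this labeling, the pattern described above would appear as "concordant" for oxidative phosphorylation genes and "discordant" for ribosomal proteins.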

      Minor points:

      (1) The authors repeatedly state that 'mitochondrial function' is increased. This is inaccurate in two ways: first, mitochondria have multiple functions, and it should be specified which one is referred to (probably mitochondrial respiration); second, the claim is based solely on the abundance of transcripts/proteins, which may or may not reflect increased activity.

      The authors should either perform functional tests (e.g. measure oxygen consumption or extracellular acidification), or change their wording to more accurately reflect the findings.

      To more directly reflect our findings, we revised two instances of the phrase “mitochondrial function” to “mitochondrial proteins” in the manuscript. Furthermore, as described in General Response 1, we confirmed that oxygen consumption is elevated in mox-YG overexpression cells. This observation suggests that mitochondrial respiratory activity is indeed enhanced under these conditions.

      (2) Similarly, the authors state that FPs are 'not localized' (e.g. line 137). This should be specified (e.g. 'not actively sorted into cellular compartments other than the cytosol').

      As pointed out by the reviewer, we have revised the relevant sections accordingly.

      (3) In Figure 4D, some of the reporter assays don't fully recapitulate the RNAseq findings (e.g. for PHO84 and ZPS1, where mox-FS and mox-YG behave differently in the reporter assay, but not in the RNAseq data). This may stem from technical limitations given that the reporter assay relies on RFP expression which could generally be affected by protein overexpression (cf. ACT1pro in mox-FS), but it should be mentioned in the text.

      We apologize for the confusion caused by our insufficient explanation of "moxFS" in new Figure 3D (original Figure 4D). As clarified here, "moxFS" refers to a frameshift mutant in which the mRNA is transcribed but the protein is not translated due to an early frameshift mutation. This is not a functional mox protein. The behavior of this mutant is nearly identical to that of the vector control, indicating that the transcriptional response observed in this assay is not triggered by mRNA expression itself, but rather by events occurring after protein synthesis begins. Importantly, the transcriptional responses identified by RNA-seq in mox-YG overexpression cells are largely recapitulated by this reporter assay, supporting the reliability of our experimental design.

      We appreciate the reviewer’s comment, which helped us recognize the lack of clarity in our original description. In response, we have added an explanation of the FS mutation to the figure legend (new Figure 3D), and we have also expanded the description of the moxFS experimental results in the Results section.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, Arimura et al describe MagIC-Cryo-EM, an innovative method for immune-selective concentration of native molecules and macromolecular complexes for Cryo-EM imaging and single-particle analysis. Typically, Cryo-EM imaging requires much larger concentrations of biomolecules than are feasible to achieve by conventional biochemical fractionation. Overall, this manuscript is meticulously and clearly written and may become a great asset to other electron microscopists and chromatin researchers.

      Strengths:

      Previously, Arimura et al. (Mol. Cell 2021) isolated from Xenopus extract and resolved by Cryo-EM a sub-class of native nucleosomes containing histone H1.8 at the on-dyad position, similar to that previously observed by other researchers with reconstituted nucleosomes. Here they sought to analyze immuno-selected nucleosomes aiming to observe specific modes of H1.8 positioning (e.g. on-dyad and off-dyad) and potentially reveal structural motifs responsible for the decreased affinity of H1.8 for the interphase chromatin compared to metaphase chromosomes. The main strength of this work is a clever and novel methodological design, in particular the engineered protein spacers to separate captured nucleosomes from streptavidin beads for clear imaging. The authors provide a detailed step-by-step description of the MagIC-Cryo-EM procedure including nucleosome isolation, preparation of GFP nanobody attached magnetic beads, optimization of the spacer length, concentration of the nucleosomes on graphene grids, data collection and analysis, including their new DuSTER method to filter out low signal particles. This tour de force methodology should encourage other electron microscopists to consider MagIC-Cryo-EM, especially for analysis of native nucleosome complexes.

      In pursuit of biologically important new structures, the immune-selected H1.8-containing nucleosomes were solved at about 4 Å resolution; their structure appears to be very similar to the previously determined structure of H1.8-reconstituted nucleosomes. There were no apparent differences between the metaphase and interphase complexes, suggesting that the on-dyad and off-dyad positioning does not explain the differences in H1.8-nucleosome binding. However, they were able to identify and solve complexes of H1.8-GFP with histone chaperone NPM2 in a closed and open conformation, providing mechanistic insights for H1-NPM2 binding and the reduced affinity of H1.8 to interphase chromatin as compared to metaphase chromosomes.

      Weaknesses:

      Still, I feel that there are certain limitations and potential artifacts resulting from formaldehyde fixation, use of bacterial-expressed recombinant H1.8-GFP, and potential effects of magnetic beads and/or spacer on protein structure, that should be more explicitly discussed. 

      We thank the reviewer for recognizing the significance of our methods and for constructive comments. To respond to the reviewer's criticism, we revised the “Limitation of the study” section (page 12, line 420) as indicated by the underlines below.

      “While MagIC-cryo-EM is envisioned as a versatile approach suitable for various biomolecules from diverse sources, including cultured cells and tissues, it has thus far been tested only with H1.8-bound nucleosome and H1.8-bound NPM2, both using anti-GFP nanobodies to isolate GFP-tagged H1.8 from chromosomes assembled in Xenopus egg extracts after pre-fractionation of chromatin. To apply MagIC-cryo-EM for the other targets, the following factors must be considered: 1) Pre-fractionation. This step (e.g., density gradient or gel filtration) may be necessary to enrich the target protein in a specific complex from other diverse forms (such as monomeric forms, subcomplexes, and protein aggregates). 2) Avoiding bead aggregation. Beads may be clustered by targets (if the target complex contains multiple affinity tags or is aggregated), nonspecific binders, and the target capture modules. To directly apply antibodies that recognize the native targets and specific modifications, optimization to avoid bead aggregation will be important. 3) Stabilizing complexes. The target complexes must be stable during the sample preparation. Crosslink was necessary for the H1.8-GFP-bound nucleosome. 4) Loading the optimum number of targets on the bead. The optimal number of particles per bead differs depending on target sizes, as larger targets are more likely to overlap. For H1.8-GFP-bound nucleosomes, 500 to 2,000 particles per bead were optimal. We expect that fewer particles should be coated for larger targets.”

      We would like to note that while the use of bacterially expressed GFP-tagged H1.8 and MagIC-cryo-EM may potentially influence the structure of the H1.8-bound nucleosome, the structures of GFP-tagged H1.8-bound nucleosomes isolated from chromosomes assembled in Xenopus egg extract are essentially identical to the endogenous H1.8-bound nucleosome structure we previously determined. In addition, we have shown that GFP-H1.8 was able to replace the function of endogenous H1.8 to support the proper mitotic chromosome length (Fig. S3), which is based on the capacity of H1.8 to compete with condensin as we have previously demonstrated (PMID 34406118). Therefore, we believe the effects of GFP-tagging to be minimal. This point has been incorporated into the main result section (page 6, line 215) to read as “The structures of GFP-tagged H1.8-bound nucleosomes isolated from Xenopus egg extract chromosomes are essentially identical to the endogenous H1.8-bound nucleosome structure we previously determined. Therefore, although the usage of GFP-tagged H1.8 and MagIC-cryo-EM potentially influence the structure of the H1.8-bound nucleosome, we consider these influences to be minimal.”

      Also, the GFP-pulled down H1.8 nucleosomes should be better characterized biochemically to determine the actual linker DNA lengths (which are known to have a strong effect on linker histone affinity) and presence or absence of other factors such as HMG proteins that may compete with linker histones and cause the multiplicity of nucleosome structural classes (such as shown on Fig. 3F) for which the association with H1.8 is uncertain.

      We addressed the concerns raised by the reviewer as follows:

      (1) DNA length

      As the reviewer correctly pointed out, linker DNA length is critical for linker histone binding, and conventional ChIP protocols often result in DNA over-digestion to lengths of 140–150 bp. To minimize DNA over-digestion and structural damage, we have optimized a gentle chromosomal nucleosome purification protocol that enabled the cryo-EM analysis of chromosomal nucleosomes (PMID: 34478647). This protocol involves DNA digestion with a minimal amount of MNase at 4 °C, producing nucleosomal DNA fragments of 180–200 bp. Additionally, before each chromatin extraction, we performed small-scale MNase assays to ensure that the DNA lengths consistently fell within the 180–200 bp range (Fig. S4B). These DNA lengths are sufficient for linker histone H1 binding, in agreement with previous findings indicating that >170 bp is adequate for linker histone association (PMID: 26212454).

      This information has been incorporated into the main text and Methods section.

      On page 5, line 178, the sentence was added to read, “To prevent dissociation of H1.8 from nucleosomes during DNA fragmentation, the MNase concentration and the reaction time were optimized to generate DNA fragment lengths with 180–200 bp (Fig. S4B), which is adequate for linker histone association (PMID 26212454).”

      On page 32, line 1192, the sentence was added to read, “To digest chromatin, MNase concentration and reaction time were tested on a small scale and optimized to the condition that produces 180-200 bp DNA fragments.”

      (2) Co-associated proteins with H1-GFP nucleosome.

      We now include mass spectrometry (MS) data for the proteins in the sucrose density gradient fraction 5 used for MagIC-cryo-EM analysis of GFP-H1.8-bound chromatin proteins as well as MS of proteins isolated with the corresponding MagIC-cryo-EM beads (Table S2 and updated Table S5). As the reviewer expected, HMG proteins (hmga2.L and hmga2.S in Table S2) were present in interphase sucrose gradient fraction 5, but their levels were less than 2% of H1.8. Accordingly, none of the known chromatin proteins besides histones and the nucleoplasmin were detected by MS in the GFP-nanobody MagIC-cryo-EM beads, including the FACT complex and PCNA, whose levels in the sucrose fraction were comparable to H1.8 (Table S2), suggesting that our MagIC-cryo-EM analysis was not meaningfully affected by HMG proteins and other chromatin proteins. Consistent with our interpretation, the structural features of H1.8-bound nucleosomes isolated from interphase and metaphase chromosomes were essentially identical.

      Reviewer #2 (Public review):

      Summary:

      The authors present a straightforward and convincing demonstration of a reagent and workflow that they collectively term "MagIC-cryo-EM", in which magnetic nanobeads combined with affinity linkers are used to specifically immobilize and locally concentrate complexes that contain a protein-of-interest. As a proof of concept, they localize, image, and reconstruct H1.8-bound nucleosomes reconstructed from frog egg extracts. The authors additionally devised an image-processing workflow termed "DuSTER", which increases the true positive detections of the partially ordered NPM2 complex. The analysis of the NPM2 complex ± H1.8 was challenging because only ~60 kDa of protein mass was ordered. Overall, single-particle cryo-EM practitioners should find this study useful.

      Strengths:

      The rationale is very logical and the data are convincing.

      Weaknesses:

      I have seen an earlier version of this study at a conference. The conference presentation was much easier to follow than the current manuscript. It is as if this manuscript had undergone review at another journal and includes additional experiments to satisfy previous reviewers. Specifically, the NPM2 results don't seem to add much to the main story (MagIC-cryo-EM), and read more like an addendum. The authors could probably publish the NPM2 results separately, which would make the core MagIC results (sans DuSTER) easier to read.

      We thank the reviewer for constructive comments. We regret to realize that the last portion of the result section, where we have described a detailed analysis of NPM2 structures, was erroneously omitted from the submission due to MS Word's formatting error. We hope that the inclusion of this section will justify the inclusion of the NPM2 analysis. Specifically, we decided to include NPM2 structures to demonstrate that our method successfully determined a structure that had never been reported. Conformational changes in the NPM family have been proposed in previous studies using techniques such as NMR, negative stain EM, and simulations, and these changes are thought to play a critical role in regulating NPM function (PMID: 25772360, 36220893, 38571760), but there has been confusion in the literature, for example, on the substrate binding site and on whether NPM2 recognizes the substrate as a pentamer or decamer. Despite their low resolution, our new cryo-EM structures of NPM2 suggest that NPM2 recognizes the substrate as a pentamer, identify potential substrate-binding sites, and indicate the mechanisms underlying NPM2 conformational changes. We believe that publishing these results will provide valuable insights into the NPM research field and help guide and inspire further investigations.

      Reviewer #3 (Public review):

      Summary:

      In this paper, Arimura et al report a new method, termed MagIC-Cryo-EM, which uses magnetic beads to capture specific proteins out of a lysate via immunoprecipitation, followed by deposition on EM grids. The so-enriched proteins can be analyzed structurally. Importantly, the nanoparticles are further functionalized with protein-based spacers, to avoid a distorted halo around the particles. This is a very elegant approach and allows the resolution of the structure of small amounts of native proteins at atomistic resolution.

      Here, the authors apply this method to study the chromatosome formation from nucleosomes and the oocyte-specific linker histone H1.8. This allows them to resolve H1.8-containing chromatosomes from oocyte extract in both interphase and metaphase conditions at 4.3 Å resolution, which reveal a common structure with H1 placed right at the dyad and contacting both entry and exit linker DNA.

      They then investigate the origin of H1.8 loss during interphase. They identify a non-nucleosomal H1.8-containing complex from interphase preparations. To resolve its structure, the authors develop a protocol (DuSTER) to exclude particles with an ambiguous center, revealing particles with five-fold symmetry that match the chaperone NPM2. MS and WB confirm that the protein is present in interphase samples but not metaphase. The authors further separate two isoforms, an open and closed form that coexist. Additional densities in the open form suggest that this might be bound H1.8.

      Strengths:

      Together this is an important addition to the suite of cryoEM methods, with broad applications. The authors demonstrate the method using interesting applications, showing that the methods work and they can get high resolution structures from nucleosomes in complex with H1 from native environments.

      Weaknesses:

      The structures of the NPM2 chaperone are less well resolved, and some of the interpretation in this part seems only weakly justified.

      We thank the reviewer for recognizing the significance of our methods and for constructive comments. We regret to realize that the last portion of the result section where we have described detailed analysis of NPM2 structures was erroneously omitted from the submission due to MS Word's formatting error. We hope that inclusion of this section will justify the inclusion of NPM2 analysis. Specifically, we agree that our NPM2 structures are low-resolution and that our interpretations may be revised as higher-resolution structures become available, although we believe that publishing these results will provide valuable insights into the NPM research field and also will illustrate the power of MagIC-cryo-EM and DuSTER. To respond to this criticism, the revised manuscript now clearly describes the limitations of our NPM2 structures while highlighting the key insights. On page 12, line 452, the sentence was added to read, “While DuSTER enables the structural analysis of NPM2 co-isolated with H1.8-GFP, the resulting map quality is modest, and the reported numerical resolution may be overestimated. Furthermore, only partial density for H1.8 is observed. Although structural analysis of small proteins is inherently challenging, it is possible that halo-like scattering further hinder high-resolution structural determination by reducing the S/N ratio. More detailed structural analyses of the NPM2-substrate complex will be addressed in future studies.”

      Reviewer #1 (Recommendations for the authors): 

      (1) To assess the advantage provided by the new technique for imaging of isolated pure or enriched fractions of native chromatin, the nucleosome structure analysis should be matched by a proper biochemical characterization of the isolated nucleosomes. Nucleosome DNA size is known to greatly affect linker histone affinity and additional proteins like HMG may compete with linker histone for binding. SDS-PAGE of the sucrose gradient fractions (Fig. 3E) shows many nonhistone proteins where H1-GFP appears to be a minor component. However, the gradient fractions contain both bound and unbound proteins. I would suggest that a larger-scale pull-down using the same GFP antibodies and streptavidin beads should be conducted and the captured nucleosome DNA and proteins characterized. 

      We addressed the concerns raised by the reviewer as follows:

      (1) DNA length

      As the reviewer correctly pointed out, linker DNA length is critical for linker histone binding, and conventional ChIP protocols often result in DNA over-digestion to lengths of 140–150 bp. To minimize DNA over-digestion and structural damage, we have optimized a gentle chromosomal nucleosome purification protocol that enabled the cryo-EM analysis of chromosomal nucleosomes (PMID: 34478647). This protocol involves DNA digestion with a minimal amount of MNase at 4 °C, producing nucleosomal DNA fragments of 180–200 bp. Additionally, before each chromatin extraction, we performed small-scale MNase assays to ensure that the DNA lengths consistently fell within the 180–200 bp range (Fig. S4B). These DNA lengths are sufficient for linker histone H1 binding, in agreement with previous findings indicating that >170 bp is adequate for linker histone association (PMID: 26212454).

      This information has been incorporated into the main text and Methods section. 

      On page 5, line 178, the sentence was added to read, “To prevent dissociation of H1.8 from nucleosomes during DNA fragmentation, the MNase concentration and the reaction time were optimized to generate DNA fragment lengths with 180–200 bp (Fig. S4B), which is adequate for linker histone association (PMID 26212454).”

      On page 32, line 1192, the sentence was added to read, “To digest chromatin, MNase concentration and reaction time were tested on a small scale and optimized to the condition that produces 180-200 bp DNA fragments.”

      (2) Co-associated proteins with H1-GFP nucleosome.

      We now include mass spectrometry (MS) data for the proteins in the sucrose density gradient fraction 5 used for MagIC-cryo-EM analysis of GFP-H1.8-bound chromatin proteins as well as MS of proteins isolated with the corresponding MagIC-cryo-EM beads (Table S2 and updated Table S5). As the reviewer expected, HMG proteins (hmga2.L and hmga2.S in Table S2) were present in interphase sucrose gradient fraction 5, but their levels were less than 2% of H1.8. Accordingly, none of the known chromatin proteins besides histones and the nucleoplasmin were detected by MS in the GFP-nanobody MagIC-cryo-EM beads, including the FACT complex and PCNA, whose levels in the sucrose fraction were comparable to H1.8 (Table S2), suggesting that our MagIC-cryo-EM analysis was not meaningfully affected by HMG proteins and other chromatin proteins. Consistent with our interpretation, the structural features of H1.8-bound nucleosomes isolated from interphase and metaphase chromosomes were essentially identical.

      (2) A similar pull-down analysis with quantitation of NPM2 and GFP (in addition to analysis of sucrose gradient fractions) should be conducted to show whether the immune-selected particles do indeed contain a stoichiometric complex of H1.8 with NPM2.

      Proteins isolated using MagIC-cryo-EM beads were identified through mass spectrometry (Fig. 4D). The MS signal suggests that the molar ratio of NPM2 is higher than that of H1.8 or sfGFP. This observation is consistent with the idea that an NPM2 pentamer can bind between one and five H1.8-GFP molecules.
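
      The arithmetic behind this statement can be made explicit. If one NPM2 pentamer (five protomers) carries k H1.8-GFP molecules, with k between 1 and 5, the expected NPM2:H1.8 molar ratio is 5/k, which never falls below 1. The sketch below simply enumerates these ratios; the function name is hypothetical and the calculation assumes every pentamer carries the same number of H1.8-GFP molecules, which the MS data do not establish.

```python
from fractions import Fraction


def npm2_to_h18_ratio(h18_per_pentamer, protomers=5):
    """Expected NPM2:H1.8 molar ratio for one NPM2 oligomer.

    Assumes a uniform population in which every pentamer binds the same
    number of H1.8-GFP molecules (an idealization for illustration).
    """
    if not 1 <= h18_per_pentamer <= protomers:
        raise ValueError("H1.8 per pentamer must be between 1 and the protomer count")
    return Fraction(protomers, h18_per_pentamer)


if __name__ == "__main__":
    for k in range(1, 6):
        print(f"{k} H1.8-GFP per pentamer -> NPM2:H1.8 = {npm2_to_h18_ratio(k)}")
```

      Any occupancy below five H1.8-GFP per pentamer therefore gives an excess of NPM2 over H1.8, in line with the MS signal.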

      (3) The use of recombinant, bacterially produced H1.8-GFP and just one type of antibody (GFP) are certain limitations of this work. These limitations as well as future steps needed to use antibodies specific for native antigens, such as histone variants and epigenetic modifications, should be discussed.

      We clarified these points in the “Limitation of the study” section (page 12, line 420). The revised sections are indicated by the underlines below.

      “While MagIC-cryo-EM is envisioned as a versatile approach suitable for various biomolecules from diverse sources, including cultured cells and tissues, it has thus far been tested only with H1.8-bound nucleosome and H1.8-bound NPM2, both using anti-GFP nanobodies to isolate GFP-tagged H1.8 from chromosomes assembled in Xenopus egg extracts after pre-fractionation of chromatin. To apply MagIC-cryo-EM for the other targets, the following factors must be considered: 1) Pre-fractionation. This step (e.g., density gradient or gel filtration) may be necessary to enrich the target protein in a specific complex from other diverse forms (such as monomeric forms, subcomplexes, and protein aggregates). 2) Avoiding bead aggregation. Beads may be clustered by targets (if the target complex contains multiple affinity tags or is aggregated), nonspecific binders, and the target capture modules. To directly apply antibodies that recognize the native targets and specific modifications, optimization to avoid bead aggregation will be important. 3) Stabilizing complexes. The target complexes must be stable during the sample preparation. Crosslink was necessary for the H1.8-GFP-bound nucleosome. 4) Loading the optimum number of targets on the bead. The optimal number of particles per bead differs depending on target sizes, as larger targets are more likely to overlap. For H1.8-GFP-bound nucleosomes, 500 to 2,000 particles per bead were optimal. We expect that fewer particles should be coated for larger targets.”

      Reviewer #2 (Recommendations for the authors):  

      General: 

      Figures: Most of the figures have tiny text and schematic items (like Fig. 2B). To save readers from having to enlarge the paper on their computer screen, consider enlarging the smallest text & figure panels. 

      We enlarged the text in the main figures.

      Is it possible that the MagIC method also keeps more particles "submerged", i.e., away from the air:water interface? Does MagIC change the orientation distribution?  

      In theory, the preferred orientation bias should be reduced in MagIC-cryo-EM, as particles are submerged, and the bias is thought to arise from particle accumulation at the air-water interface. However, while the preferred orientation appears to be mitigated, the issue is not completely resolved, as demonstrated in Author response image 1.

      Author response image 1.

      A possible explanation for the remaining preferred orientation bias in MagIC-cryo-EM data is that many particles are localized on graphene-water interfaces.

      Consider adding a safety note to warn about possible pinching injuries when handling neodymium magnets. 

      This is a good idea. We added a sentence in the method section (page 24, line 878), “The two pieces of strong neodymium magnets have to be handled carefully as magnets can leap and slam together from several feet apart.”

      In the methods section, the authors state that the grids were incubated on magnets, followed by blotting and plunge freezing in the Vitrobot. Presumably, the blotting was performed in the absence of magnets. The authors may want to clarify this in the text. If so, can the authors speculate how the magnet-treated beads are better retained on the grids during blotting? Is it due to the induced aggregation and/or deposition of the nanobeads on the grid surface? 

      In the limitation section (page 12, line 446), the sentence was added to read:

      “The efficiency of magnetic bead capture can be further improved. In the current MagIC-cryo-EM workflow, the cryo-EM grid is incubated on a magnet before being transferred to the Vitrobot for vitrification. However, since the Vitrobot cannot accommodate a strong magnet, the vitrification step occurs without the magnetic force, potentially resulting in bead loss. This limitation could be addressed by developing a new plunge freezer capable of maintaining magnetic force during vitrification.”

      In the method section (page 27 line 993), the sentence was modified. The revised sections are indicated by underlines.

      “The grid was then incubated on the 40 x 20 mm N52 neodymium disc magnets for 5 min within an in-house high-humidity chamber to facilitate magnetic bead capture. Once the capture was complete, the tweezers anchoring the grid were transferred and attached to the Vitrobot Mark IV (FEI), and the grid was vitrified by employing a 2-second blotting time at room temperature under conditions of 100% humidity.”

      Do you see an extra density corresponding to the GFP in your averages?  

      Since GFP is connected to H1.8 via a flexible linker, the GFP structure was observed in complex with the anti-GFP nanobody, separate from the H1.8-nucleosome and H1.8-NPM2 complexes, as shown in Fig. S10.

      Fig. 5 & Fig. S11: The reported resolutions for NPM2 averages were ~5Å but the densities appear - to my eyes - to resemble lower-resolution averages.

      Although DuSTER enables the 3D structural determination of NPM2 co-isolated with H1-GFP, we recognize that the quality of the NPM2 map falls short of the standard expected for a typical 5 Å-resolution map. To appropriately convey the quality of the NPM2 maps, we have included the 3D FSC and local resolution map of the NPM2 structure (new Fig. S12). Furthermore, we have revised the manuscript to deemphasize the resolution of the NPM2 structure to avoid any potential misinterpretation.

      Fig. 5D: The cartoon says: "less H1.8 on interphase nucleosome" and "more H1.8 on metaphase nucleosome". Please help the readers understand this conclusion with the gel in Fig. 3C and the population histograms in Fig. 3F. 

      As depicted in Fig. 3A, we previously identified the preferential binding of H1.8 to metaphase nucleosomes (PMID: 34478647). In this study, to obtain sufficient H1.8-bound nucleosomes for MagIC-cryo-EM, we used 2.5 times more starting material for interphase samples compared to M-phase samples. This discrepancy complicates the comparison of H1-GFP binding ratios in western blots. However, in GelCode<sup>TM</sup> Blue staining (Fig. S4A), where both H1-GFP and histone bands are visible, the preferential binding of H1.8 to metaphase nucleosomes can be observed (see fraction 11 in interphase and metaphase).

      Abstract - that removes low signal-to-noise ratio particles -> to exclude low signal-to-noise ratio particles; The term "exclude" is more accurate and is in the DuSTER acronym itself.

      We edited it accordingly. 

      P1 - to reduce sample volume/concentration -> to lower the sample volume/concentration needed 

      We edited it accordingly.

      P1 - Flow from 1st to 2nd paragraph could be improved. It's abrupt. Maybe say that some forms of nucleoprotein complexes are rare, with one example being H1.8-bound nucleosomes in interphase chromatin? 

      We have revised the text to address the challenges involved in the structural characterization of native chromatin-associated protein complexes. The revised text reads, “Structural characterization of native chromatin-associated protein complexes is particularly challenging due to their heterogeneity and scarcity: more than 300 proteins directly bind to the histone core surface, while each of these proteins is targeted to only a fraction of nucleosomes in chromatin.”

      P2 - interacts both sides of the linker DNA -> interacts with both the entry and exit linker DNA 

      We have edited it accordingly.

      P2 - "from the chromatin sample isolated from metaphase chromosomes but not from interphase chromosomes" - meaning that the interphase nucleosomes don't have H1.8 densities at all, or that they do, but the H1.8 only interacts with one of the two linker DNAs? 

      In our original attempt to analyze nucleosome structures assembled in Xenopus egg extracts without MagIC-cryo-EM, we were not able to detect the density confidently assigned to H1.8 in interphase chromatin samples. To avoid potential confusion, the revised text reads, “We were able to resolve the 3D structure of the H1.8-bound nucleosome isolated from metaphase chromosomes but not from interphase chromosomes(3). The resolved structure indicated that H1.8 in metaphase is most stably bound to the nucleosome at the on-dyad position, in which H1 interacts with both the entry and exit linker DNAs(21–24). This stable H1 association to the nucleosome in metaphase likely reflects its role in controlling the size and the shape of mitotic chromosomes through limiting chromatin accessibility of condensins(25), but it remains unclear why H1.8 binding to the nucleosome in interphase is less stable. Since the low abundance of H1.8-bound nucleosomes in interphase chromatin might have prevented us from determining their structure, we sought to solve this issue by enriching H1.8-bound nucleoprotein complexes through adapting ChIP-based methods.”

      P1, P2 - The logical leap from "by adapting ChIP-based methods" to MagIC is not clear. 

      We addressed this point by revising the text as shown above.

      P2 - "Intense halo-like noise" - This is an awkward term. These are probably the Fresnel fringes that arise from underfocus. I wouldn't call this phenomenon "noise". https://www.jeol.com/words/emterms/20121023.093457.php  

      We re-phrased it as “halo-like scattering”.

      P3 - It may help readers to explain how cryo-EM structures of the H1.8-associated interphase nucleosomes would differentiate from the two models in Fig. 3A.

      We have revised the introduction section (lines 43~75), including the corresponding paragraph to address the comments above, highlighting the motivation behind determining the structures of interphase and metaphase H1.8-associated nucleosomes. We hope the revisions are now clear.

      P6 - "they were masked by background noise from the ice, graphene". I thought that graphene would contribute minimal noise because it is only one-carbon-layer thick?

      That is a valid point. We have removed the term ‘graphene’ from the sentence.

      P6 - What was the rationale to focus on particles with 60 - 80Å dimensions? 

      We observed that 60–80 Å particles were captured by MagIC-cryo-EM beads, as numerous particles of this size were clearly visible in the motion-corrected micrographs surrounding the beads. To clarify this, we revised the sentence to read: 'Topaz successfully picked most of the 60–80 Å particles visible in the motion-corrected micrographs and enriched around the MagIC-cryo-EM beads (Figure S6A).'

      P7 - Please explain a technical detail about DuSTER: do independent runs of Topaz picks give particle centers that differ by up to ~40Å or is it that 2D classification gives particle centers that differ by up to ~40Å? Is it possible to distinguish these two possibilities by initializing CryoSPARC on two independent 2D classification jobs on the same set of Topaz picks?

      Due to the small particle size of NPM2, the former type is predominantly generated when Topaz fails to pick the particles reproducibly. The first cycle of DuSTER removes both former-type particles (irreproducibly picked particles) and latter-type particles (irreproducibly centered particles), while subsequent cycles specifically target and remove the latter type. We have added the following sentence to clarify this (page 7, line 249). The revised sections are indicated by underlines below: “To assess the reproducibility of the particle recentering during 2D classification, two independent particle pickings were conducted by Topaz so that each particle on the grid has up to two picked points (Figure 4A, second left panel). Some particles that only have one picked point will be removed in a later step. These picked points were independently subjected to 2D classification. After recentering the picked points by 2D classification, distances (D) between recentered points from the first picking process and other recentered points from the second picking process were measured. DuSTER keeps recentered points whose D are shorter than a threshold distance (D<sub>TH</sub>). By setting D<sub>TH</sub> = 20 Å, 2D classification results were dramatically improved in this sample; a five-petal flower-shaped 2D class was reconstructed (Figure 4B). This step also removes the particles that only have one picked point.”
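
      The distance criterion quoted above can be illustrated with a short sketch. This is not the DuSTER implementation: the function name `duster_filter`, the 2D coordinate tuples, and the example coordinates are hypothetical, and a nearest-neighbour search is assumed as the way to pair points from the two picking runs. Only the 20 Å threshold comes from the text.

```python
import math

def duster_filter(points_a, points_b, d_th=20.0):
    """Keep recentered points from run A that have a partner in run B
    closer than d_th (in Angstrom). Points with no partner are dropped."""
    kept = []
    for ax, ay in points_a:
        # Distance D to the nearest recentered point from the second run;
        # infinity if the second run produced no points at all.
        nearest = min(
            (math.hypot(ax - bx, ay - by) for bx, by in points_b),
            default=math.inf,
        )
        if nearest < d_th:
            kept.append((ax, ay))
    return kept

# Hypothetical recentered coordinates (Angstrom) from two independent runs.
run_a = [(100.0, 100.0), (300.0, 250.0), (500.0, 80.0)]
run_b = [(108.0, 94.0), (345.0, 250.0)]

kept = duster_filter(run_a, run_b)
print(kept)
```

      In this toy input, the first point has a partner 10 Å away and is kept, the second has its closest partner 45 Å away and is dropped, and the third was picked in only one run and is dropped as well.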

      P8 - NPM2 was introduced rather abruptly (it was used as an initial model for 3D refinement). I see NPM2 appear in the supplemental figures cited before the text in P8, but the significance of NPM2 was not discussed there. The authors seem to have made a logical leap that is not explained. 

      We have removed the term NPM2 in P8.

      P9 - "extra cryo-EM densities, which likely represent H1." This statement would be better supported if the resolution of the reconstruction was high enough to resolve H1-specific amino acids in the "extra densities" protruding from the petals.

      We concurred and softened the statement to read “extra cryo-EM densities, which may represent H1.8,”

      P9 - "Notably, extra cryo-EM densities, which likely represent H1.8, are clearly observed in the open form but much less in the closed form near the acidic surface regions proximal to the N terminus of beta-1 and the C terminus of beta-8 (Fig. 5A and 5B)."  It would be helpful to point out where the "extra densities" are in the figure for the open and closed form. Some readers may not be able to extrapolate from the single red arrow to the other extra densities. 

      Thank you for your comment. We have pointed out the density in the Fig 5A as well.

      P9 - "Supporting this idea, the acidic tract A1 (aa 36-40) and A2 (aa 120-140) are both implicated in the recognition of basic substrates such as core histones..."  Did this sentence get cut off in the next column?  

      We apologize for our oversight on this error. Due to an MS Word formatting error, the sentences (lines 316–343) were hidden beneath a figure. We have retrieved the missing sentences:

      “Supporting this idea, the acidic tract A1 (aa 36-40) and A2 (aa 120-140), which are both implicated in recognition of basic substrates such as core histones(43,50), respectively interact with and are adjacent to the putative H1.8 density (Figure 5B). In addition, the NPM2 surface that is in direct contact with the putative H1.8 density is accessible in the open form while it is internalized in the closed form (Figure 5C). This structural change of NPM2 may support more rigid binding of H1.8 to the open NPM2, whereas H1.8 binding to the closed form is less stable and likely occurs through interactions with the C-terminal A2 and A3 tracts, which are not visible in our cryo-EM structures.

      In the aforementioned NPM2-H1.8 structures, for which we applied C5 symmetry during the 3D structure reconstruction, only a partial H1.8 density could be seen (Figure 5B). One possibility is that H1.8 structure in NPM2-H1.8 does not follow C5 symmetry. As the size of the NPM2-H1.8 complex estimated from sucrose gradient elution volume is consistent with pentameric NPM2 binding to a single H1.8 (Figure 3C and Table S3), applying C5 symmetry during structural reconstruction likely blurred the density of the monomeric H1.8 that binds to the NPM2 pentamer. The structural determination of NPM2-H1.8 without applying C5 symmetry lowered the overall resolution but visualized multiple structural variants of the NPM2 protomer with different degrees of openness coexisting within a NPM2-H1.8 complex (Figure S14), raising a possibility that opening of a portion of the NPM2 pentamer may affect modes of H1.8 binding. Although more detailed structural analyses of the NPM2-substrate complex are the subject of future studies, MagIC-cryo-EM and DuSTER revealed structural changes of NPM2 that was co-isolated with H1.8 on interphase chromosomes.

      Discussion 

      MagIC-cryo-EM offers sub-nanometer resolution structural determination using a heterogeneous sample that contains the target molecule at 1~2 nM, which is approximately 100 to 1000 times lower than the concentration required for conventional cryo-EM methods, including affinity grid approach(9–11).”

      Reviewer #3 (Recommendations for the authors):  

      All with regards to the NPM2 part: 

      It would be great if the authors could provide micrographs where the particles are visible, in addition to the classes. 

      The particles on the motion-corrected micrographs are available in Fig S9.

      Also, the angular distribution in the SI looks very uniform. 

      I also wonder, if the authors could indicate the local resolution for all structures. 

      Could the authors provide the 3D FSC for NPM2?  

      Although DuSTER enables the 3D structural determination of NPM2 co-isolated with H1-GFP, we recognize that the quality of the NPM2 map falls short of the standard expected for a typical 5 Å resolution map. To appropriately convey the quality of the NPM2 maps, we have included the 3D FSC and local resolution map of the NPM2 structure (new Fig. S12).

      I really cannot see a difference between the open and closed forms. Looking at the models, I am skeptical that the authors can differentiate the two forms with the available resolution. Could they provide statistics that support their assignments? 

      To better highlight the structural differences between the two forms, we added a new figure to compare the maps between open and closed forms (Fig S12J-K).

      Also, the 'additional density' representing H1.8 in the NPM2 structures - I cannot see it. 

      We pointed out the density with the red arrow in the revised Fig 5A.

      Minor comments: 

      Something is missing at the end of Results, just before the beginning of the Discussion.  The figure legend for Fig. S12 is truncated, so it is unclear what is going on 

      We apologize for our oversight on this error. Due to an MS Word formatting error, the sentences (lines 316–343) were hidden beneath a figure. We have retrieved the missing sentences:

      “Supporting this idea, the acidic tract A1 (aa 36-40) and A2 (aa 120-140), which are both implicated in recognition of basic substrates such as core histones(43,50), respectively interact with and are adjacent to the putative H1.8 density (Figure 5B). In addition, the NPM2 surface that is in direct contact with the putative H1.8 density is accessible in the open form while it is internalized in the closed form (Figure 5C). This structural change of NPM2 may support more rigid binding of H1.8 to the open NPM2, whereas H1.8 binding to the closed form is less stable and likely occurs through interactions with the C-terminal A2 and A3 tracts, which are not visible in our cryo-EM structures.

      In the aforementioned NPM2-H1.8 structures, for which we applied C5 symmetry during the 3D structure reconstruction, only a partial H1.8 density could be seen (Figure 5B). One possibility is that H1.8 structure in NPM2-H1.8 does not follow C5 symmetry. As the size of the NPM2-H1.8 complex estimated from sucrose gradient elution volume is consistent with pentameric NPM2 binding to a single H1.8 (Figure 3C and Table S2), applying C5 symmetry during structural reconstruction likely blurred the density of the monomeric H1.8 that binds to the NPM2 pentamer. The structural determination of NPM2-H1.8 without applying C5 symmetry lowered the overall resolution but visualized multiple structural variants of the NPM2 protomer with different degrees of openness coexisting within a NPM2-H1.8 complex (Figure S14), raising a possibility that opening of a portion of the NPM2 pentamer may affect modes of H1.8 binding. Although more detailed structural analyses of the NPM2-substrate complex are the subject of future studies, MagIC-cryo-EM and DuSTER revealed structural changes of NPM2 that was co-isolated with H1.8 on interphase chromosomes.

      Discussion 

      MagIC-cryo-EM offers sub-nanometer resolution structural determination using a heterogeneous sample that contains the target molecule at 1~2 nM, which is approximately 100 to 1000 times lower than the concentration required for conventional cryo-EM methods, including affinity grid approach(9–11).”

      Figure S13: I am not sure how robust these assignments are at this low resolution. Are these real structures or classification artifacts? It feels very optimistic to interpret these structures  

      We agree that our NPM2 structures are low-resolution and that our interpretations may be revised as higher-resolution structures become available. Nevertheless, we believe that publishing these results will provide valuable insights into the NPM research field and illustrate the power of MagIC-cryo-EM and DuSTER. Conformational changes in the NPM family have been proposed in previous studies using techniques such as NMR, negative stain EM, and simulations, and these changes are thought to play a critical role in regulating NPM function (PMID: 25772360, 36220893, 38571760). However, there has been confusion in the literature, for example, about the substrate-binding site and whether NPM2 recognizes the substrate as a pentamer or a decamer. Despite their low resolution, our new cryo-EM structures of NPM2 suggest that NPM2 recognizes the substrate as a pentamer, identify potential substrate-binding sites, and indicate the mechanisms underlying NPM2 conformational changes. We hope these findings will help guide and inspire further investigations.

      To respond to this criticism, we have revised the manuscript to clearly describe the limitations of our NPM2 structures while highlighting the key insights. On page 12, line 452, the sentence was added to read, “While DuSTER enables the structural analysis of NPM2 co-isolated with H1.8-GFP, the resulting map quality is modest, and the reported numerical resolution may be overestimated. Furthermore, only partial density for H1.8 is observed. Although structural analysis of small proteins is inherently challenging, it is possible that halo-like scattering further hinders high-resolution structural determination by reducing the S/N ratio. More detailed structural analyses of the NPM2-substrate complex will be addressed in future studies.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Response to Reviews

      All reviewers were positive about the rigor and impact of our work and offered a number of very helpful suggestions. We have done a number of suggested experiments, whose results have been added to the revision. We have also used their suggestions to improve the clarity and precision with which we describe and interpret our results.

      Reviewer 1 found the paper to be clearly written, with novel results, and the conclusions relevant and solid. This review offered many insights and thoughtful suggestions, which we have adopted to greatly improve the manuscript. The referee’s points are listed below with our responses.

      The study chooses to examine growth only in the prospective wing blade (the "pouch") rather than the wing disc as a whole. This can create biases, as fat and ds manipulations often cause stronger effects on growth, and on Hippo signaling targets, in the adjacent hinge regions of the disc. So I am curious about this choice. 

      Actually, several experiments described in the manuscript measured growth in regions of the wing disc that did not include the pouch (Fig 1 supplement 4). We found that in the second phase of allometric growth, growth of the pouch was greater than growth of the hinge-notum (Fig. 1G and Fig 1 supplement 4). We also looked at the effect of Ds and Fat on growth of the hinge-notum (Fig 4 supplement 1 and Fig 5 supplement 2). Loss of Ds or Fat also affected allometric growth of the pouch differently from their effects on allometric growth of the hinge-notum. We therefore treated analysis of each region independently. Greater focus was given to wing pouch growth because it was in this region that we detected the interesting gradient properties in Fat and Ds expression.

      The limitation to the wing region also creates some problems for the measurements themselves. The division between wing and pouch is not a strict lineage boundary, and thus cells can join or leave this region, creating two different reasons for changes in wing pouch size; growth of cells already in the region, or recruitment of cells into or out of the region. The authors do not discuss the second mechanism.

      We agree with this assessment that pouch growth can occur via lineage-restricted growth or by recruitment of cells into the region. This has now been clarified in the Introduction and the Discussion with discussion of the second mechanism.

      It is not at all clear that the markers for the pouch used by the authors are stable during development. One of these is Vg expression, or the Vg quadrant enhancer. But the Vg-expressing region is thought to increase by recruitment over late second and third instar through a feed-forward mechanism by which Vg-expressing cells induce Vg expression in adjacent cells. In fact, this process is thought to be driven in part by Fat and Ds (Zecca et al 2010). So when the authors manipulate Fat and Ds are they increasing growth or simply increasing Vg recruitment? I would prefer that this limitation be addressed.

      There is the possibility that the feedforward recruitment of disc cells to express Vg leads to some expansion of the measured pouch domain. However, we argue that the recruitment mechanism may not be contributing significantly to the phenomena we measured in this study. 1) We limited our analysis of pouch growth to the third instar stage. In Fig.2, Zecca and Struhl (2007 doi 10.1242/dev.006411) found that recruitment was much stronger in clones induced at first instar rather than third instar, and so they limited their clonal analysis throughout the paper to first instar induced clones. Thus, it is unclear how much the feedforward recruitment mechanism contributes to pouch growth in the mid-to-late third instar. 2) We detected an effect of Ds and Fat on how rapidly the cell cycle slows down over time in pouch cells. The effect is entirely consistent with it having a causal effect on wing pouch growth. For example, nub>Ds(RNAi) causes the average third instar pouch cell to divide ~25% more rapidly than normal, when comparing the slopes in Figure 6. Note that at the beginning of the third instar, the average pouch cell has a similar doubling time whether lacking Ds or not (Figure 6). When we measured the final size of the wing pouch at the end of the third instar, nub>Ds(RNAi) caused the pouch to be ~30% larger than normal (Figure 5). This effect is quite comparable to the effect of Ds RNAi on cell doubling.

      To provide more rigorous evidence that the effect of Fat and Ds on cell cycle dynamics is primarily responsible for their effects on wing growth that we measured, we have adapted the simple growth modeling framework from Wartlick et al (2011) and fit our cell cycle measurements made for different genotypes. These fits give us estimates for instantaneous cell growth rates over time, and using these estimates, we simulated the theoretical growth trajectory of the entire wing pouch for wildtype and ds / fat RNAi animals. When we compare these model predictions of wing growth to our pouch volume measurements over time, they agree very well with one another. These analyses and results are now discussed in the Results and presented in Fig. 6 supplement 2. Overall, they support a model in which Fat and Ds regulate cell cycle dynamics in the wing pouch during third instar and this effect is primarily responsible for Fat and Ds’s effect on overall wing pouch growth in that timeframe. This does not rule out that Fat and Ds might also affect Vg recruitment at third instar, but such effects must be small relative to the primary effect on the cell cycle. It is feasible that Fat and Ds work via the feedforward mechanism at earlier larval stages. We have now discussed all this in detail in the Discussion considering the limitation of recruitment.
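
      The step from cell cycle measurements to a predicted growth trajectory can be illustrated with a minimal sketch. It is not the fitted model from the manuscript or the Wartlick et al framework itself. It assumes only that a doubling time T(t) corresponds to an instantaneous growth rate ln 2 / T(t) and that pouch volume V obeys dV/dt = (ln 2 / T(t)) V. The function name `simulate_volume` and both doubling-time curves are hypothetical.

```python
import math

def simulate_volume(v0, doubling_time, t_end, dt=0.01):
    """Euler-integrate dV/dt = (ln 2 / T(t)) * V from t = 0 to t_end.

    v0            -- starting pouch volume (arbitrary units)
    doubling_time -- function t -> T(t), doubling time in hours (hypothetical)
    t_end, dt     -- simulated duration and step size, in hours
    """
    v = v0
    steps = round(t_end / dt)
    for i in range(steps):
        t = i * dt
        rate = math.log(2) / doubling_time(t)  # instantaneous growth rate (1/h)
        v += rate * v * dt
    return v

# Hypothetical doubling-time curves, for illustration only (not fitted values).
constant = simulate_volume(1.0, lambda t: 10.0, t_end=10.0)
slowing = simulate_volume(1.0, lambda t: 10.0 + 0.5 * t, t_end=10.0)
print(constant, slowing)
```

      With a constant 10 h doubling time the volume roughly doubles in 10 h, whereas a cell cycle that lengthens over time yields less growth over the same interval. Comparing such simulated trajectories with measured volumes is the kind of check described above.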

      The second pouch marker the authors use is epithelial folding, but this also has problems, as Fat and Ds manipulations change folding. Even in wild type, the folding patterns are complex. For instance, to make folding fit the Vg-QE pattern at late third the authors appear to be jumping in the dorsal pouch between two different sets of folds (Fig 1S2A). The authors also do not show how they use folding patterns in younger, less folded discs, nor provide evidence that the location of the folds are the same and do not shift relative to the cells. They also do not explain how they use folds and measure at later wpp and bpp stages, as the discs unfold and evert, exposing cells that were previously hidden in the folds.

      The primary marker we used for the pouch boundary was the folds. We agree with the reviewer that our original description of how we defined the pouch boundary using the folds was inadequate. We now have substantially expanded the Methods section describing how we defined the boundary at all stages using the folds, including a supplementary figure (Fig 1 supplement 2). Importantly, in our measurements, we did not exclude the pouch regions within the folds but included them (see also the next point). Our microscopy detected fluorescence in the folds, and surface rendering allowed us to visualize fold structure and its contents. In younger discs with less folding, we defined the boundary by the location of the Wg inner ring. The folds were more prominent in older L3 larval discs and in the WPP and later stages since the wings had not fully everted yet. Therefore, we used accepted morphological definitions of the pouch boundary from the literature to define the boundaries. We were able to do so even though, as the reviewer notes, the fold architecture evolves as the larvae age. We agree with the reviewer that defining a boundary based on morphology could be error-prone, especially prone to systematic error based on age. It is the main reason we directly compared the morphologically defined boundaries to boundaries defined by the Vg quadrant expression domain for many wing discs across all ages. As seen in Fig 1 supplement 3C, the two methods are in strong agreement with one another for discs of all ages. There is a slight overestimate of the pouch boundary using the morphological method, but the error is small (2.5%) and independent of disc size.

      Finally, the authors limit their measurements to cells with exposed apical faces and thus a measurable area but apparently ignore the cells inside the folds. At late third, however, a substantial amount of the prospective wing blade is found within the folds, especially where they are deepest near the A/P compartment boundary. Using the third vein sensory organ precursors as markers, the L3-2 sensillum is found just distal to the fold, the L3-1 and the ACV sensilla are within the fold, and the GSR of the distal hinge is found just proximal to the fold. That puts the proximal half of the central wing blade in the fold, and apparently uncounted in their assays. These cells will however be exposed at wpp and especially bpp stages. How are the authors adjusting for this? 

      We apologize for not describing the methods of measurement thoroughly in the original submission. In fact, we did make measurements of cells located within the folds of the wing pouch at all stages. Z stacks of optical sections were collected that traversed the disc, including the folds. Using surface detection algorithms, we could make spatial measurements (xyz distances and areas) of the material within the folds enveloping the apical pouch. Therefore, we could measure the surface area and volume of the wing pouch that included the folds. This was indeed what we did and reported in the original submission. A much more complete description of the process has now been added to the Methods.

      On the other hand, we could not reliably measure Fat-GFP or Ds-GFP fluorescence intensity in cells deep in the folds due to light scattering. Therefore, we did not assay the entire gradient across the pouch. Of the cells we did measure, we know their relative distance to the center of the pouch, defined as the intersection of the AP and DV boundaries. Therefore, fluorescence intensities could be directly compared across stages since they were calibrated by the centerpoint of the pouch. We have added text to the Methods to clarify this.

      Stabilizing and destabilizing interactions between Fat and Ds- The authors describe a distal accumulation of Fat protein in the wing, and show that this is unlikely to be through Fat transcription. They further try to test whether the distal accumulation depends on destabilization of proximal Fat by proximal Ds by looking at Fat in ds mutant discs. However, the authors do not describe how they take into account the stabilizing effects of heterophilic binding between the extracellular domains (ECDs) of Fat and Ds; without one, the junctional levels and stability of the other are reduced (Ma et al., 2003; Hale et al. 2015). So when they show that the A-P gradient of Fat is reduced in a ds mutant, is this because of the loss of a destabilizing effect of Ds on Fat, as they assume, or is it because all junctional Fat has been destabilized by loss of extracellular binding to Ds? The description of the Fat gradient in Ds mutants is also confusing (see note 6 below), making this section difficult for the reader to follow.

      We did not intend to imply that Ds actively inhibits Fat. We now describe the implications of the result more clearly in the Results and Discussion with reference to the prior Hale and Ma studies of heterophilic stabilization. It is worth noting that Ma et al 2003 saw elevated junctional Fat in ds mutant cells if they were surrounded by other ds mutant cells. This is consistent with our results. We also apologize for the confusion in describing the Fat gradient and have reworded the section in the Results to make it clearer.

      The authors do not propose or test a mechanism for the proposed destabilization. Fat and Ds bind not only through their ECDs, but binding has now also been demonstrated through their ICDs (Fulford et al. 2023)

      We now discuss possible mechanisms in the Discussion and include the Fulford reference in the Results.

      Ds gradient scales by volume, rather than cell number - This is an intriguing result, but the authors do not discuss possible mechanisms.

      We have now added discussion of possible mechanisms in the Discussion.

      Fat and Ds are already known to have autonomous effects on growth and Hippo signaling from clonal analyses and localized knockdowns. One novelty here is showing that localized knockdown does not delay pupariation in the way that whole animal knockdown does, although the mechanism is not investigated. Another novelty is that the authors find stronger wing pouch overgrowth after localized ds RNAi or whole disc loss of fat than after localized fat RNAi, the latter being only 11% larger. The fat RNAi result would have been strengthened by testing different fat RNAi stocks, which vary in their strength and are commonly weaker than null mutations, or stronger drivers such as the ap-gal4 they used for some of their ds-RNAi experiments or use of UAS-dcr2. Another reason for caution is that Garoia (2005) found much stronger overgrowth in fat mutant clones, which were about 75% larger than control clones.

      We thank the reviewer for this suggestion. Indeed, the weak effect of Fat RNAi had been due to the specific RNAi driver. We followed the reviewer’s suggestion and tested other RNAi stocks. We had in hand an RNAi driver against GFP that we had found in unrelated studies to be a very potent repressor of GFP expression. Since we had been using a knock-in allele of GFP inserted in frame to Fat throughout this study, we applied nub>Gal4 UAS-GFP RNAi to knock down homozygous Fat-GFP. The effect of the knockdown was very strong, as measured by residual 488nm fluorescence above background autofluorescence after knockdown. Correcting for background autofluorescence, we estimate that only 4.5% of Fat-GFP remained under RNAi conditions (Figure 5 - figure supplement 3). 
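
      The background correction behind a residual-fluorescence estimate of this kind can be written out as a small sketch. The function name `residual_fraction` and the intensity values are hypothetical and are not the measured data. The sketch assumes that the same autofluorescence background is subtracted from both the knockdown and the control signal before taking the ratio.

```python
def residual_fraction(knockdown, control, background):
    """Fraction of tagged protein remaining after knockdown, after
    subtracting autofluorescence background from both measurements."""
    control_signal = control - background
    if control_signal <= 0:
        raise ValueError("control intensity must exceed background")
    return (knockdown - background) / control_signal

# Hypothetical mean 488 nm intensities (arbitrary units), for illustration only.
frac = residual_fraction(knockdown=150.0, control=1100.0, background=100.0)
print(f"{frac:.1%} of the signal remains")
```

      Without the background subtraction, the same hypothetical numbers would suggest that about 14% of the signal remains (150 / 1100), which shows why the correction matters when the residual signal is close to autofluorescence.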

      Using the more potent RNAi reagent, we repeated the various experiments related to Fat. We observed a 42% increase in wing pouch growth, which is similar to that of Ds RNAi. We also observed an effect of Fat RNAi on the average cell cycle time of wing pouch cells. There was still a linear coupling between the cell cycle duration and wing pouch size, but the slope of the coupling was smaller with Fat RNAi. This was very similar to what Ds RNAi does to the cell cycle. Therefore, we have replaced the data from the original Fat RNAi experiments with the new data and modified the text throughout the manuscript to describe the new results.

      Flattening of Ds gradient does not slow growth. One model suggests that the flattening of the Ds gradient, and thus polarized Ds-Fat binding, account for slowed growth in older discs. The difficulty in the past has been that two ways of flattening the Ds gradient, either removing Ds or overexpressing Ds uniformly, give opposite results; the first increases growth, while the latter slows it. Both experiments have the problem of not just flattening the gradient, but also altering overall levels of Ds-Fat binding, which will likely alter growth independent of the gradients. Here, the authors instead use overexpression to create a strong Ds gradient (albeit a reversely oriented one) that does not flatten, and show that this does not prevent growth from slowing and arresting.

      To make sure that this is not some effect caused by using a reverse gradient, one might instead induce a more permanent normally oriented Ds gradient and see if this also does not alter growth; there is a ds Trojan gal4 line available that might work for this, and several other proximal drivers.

      Again, we thank the reviewer for this suggestion. We followed the reviewer’s suggestion and generated Trojan-Gal4 mediated overexpression of Ds. The Ds protein gradient was strongly amplified by Trojan-Gal4 but remained normally oriented. However, it only caused a modest (12%) increase in wing pouch volume. It did not significantly alter Fat expression dynamics nor the dynamics of cell cycle duration. This new data has been added to the Results (Fig. 7 and Fig 7 supplement 2) and discussed at length in the text.

      Another possible problem is that, unlike previous studies, the authors have not blocked the Four-jointed gradient; Fj alters Fat-Ds binding and might regulate polarity independently of Ds expression. A definitive test would be to perform the tests above in four-jointed mutant discs.

      We examined a fj null mutant (fjp1/d1) and found that it did not alter final wing pouch size (Fig. 2 - figure supplement 3E). Moreover, neither Fat nor Ds expression was altered in the fj mutant (Fig. 2 - figure supplement 3C,D).

      The Discussion of these data should be improved. The authors state in the Discussion "The significance of these dynamics is unclear, but the flattening of the Fat gradient is not a trigger for growth cessation." While the Discussion mentions the effects of Ds on Fat distribution in some detail, this is the only phrase that discusses growth, which is surprising given how often the gradient model of growth control is mentioned elsewhere. The reader would be helped if details are given about what experiment supports this conclusion, the effect on not only growth cessation but cell cycle time, and why the result differs from those of Rogulja 2008 and Willecke 2008 using Ds and Fj overexpression.

      We have rewritten the Discussion to better reflect the results and incorporate the reviewer’s criticisms.

      The authors spend much of the discussion speculating on the possibility that Fat and Ds control growth by changing the wing's sensitivity to the BMP Dpp. As the manuscript contains no new data on Dpp, this is somewhat surprising. The discussion also ignores Schwank (2011), who argues that Fat and Dpp are relatively independent. There have also been studies showing genetic interactions between Fat and signaling pathways such as Wg (Cho and Irvine 2004) and EGF (Garoia 2005).

      We have modified the discussion to be more inclusive of mechanisms connecting Fat and other signaling pathways, and we deleted some of the speculation about Dpp. However, since Dpp is the only known growth factor whose local concentration linearly scales with average cell doubling time (the process we found Ds/Fat regulates), there is a logical connection that readers deserve to know about. Therefore, we have retained some discussion of the hypothesis that the two might be linked through cell cycle duration. It is for future studies to test that hypothesis as it is beyond the scope of this paper.

      That said, some studies dispute Wartlick’s Dpp model (e.g., Schwank et al 2012), arguing that Dpp regulates growth permissively by limiting an anti-growth factor, Brinker. We have added this reference and the others to the Discussion to cover alternative models in which Fat/Ds act in parallel to Dpp.

      Wpp and Bpp- First, the charts treat wpp as if it is a fixed number of hours after 5 day larvae, but this will not be true in fat and ds mutants with extended larval life. This should be mentioned.

      We have clarified this distinction in the figure legends.

      How are the authors limiting bpp to 1 hr from wpp? Prepupa are brown and lack air bubbles, but that spans 5 hours of disc changes from barely everted to fully wing-like.

      We deliberately chose 1 hour post WPP because we wanted to measure final wing volume with minimal eversion. We agree with the reviewer’s concerns about calling this BPP, and we now call it WPP+1.

      "However, growth of the wing pouch ceased at the larva-pupa molt and its size remained constant".

      The transition from late third to wpp shown in the figure is not the pupal molt. Unlike in most insects, in Drosophila the larval cuticle is not molted away, it is remodeled during pupariation into the prepupal case. The pupal cuticle is not formed until 6 hr APF, which is why the initial stages are termed pre-pupal. Also, there is at least one more set of cell divisions that occur in later pupal stages (for instance, see recent work from the Buttitta lab).

      We have changed the reference of pupal molt to larva-prepupal transition throughout the manuscript.

      "In contrast, the notum-hinge exhibited simpler linear-like positive allometric growth (Fig. 1 - figure supplement 3C)."

      This oversimplifies, as there is still a strong inflection after the third time point, albeit not as large as with the wing because there is less notal growth.

      We have reworded the text as suggested. 

      "whereas at the WPP stage, dividing cells were only found in a narrow zone where sensory organ precursor cells undergo two divisions to generate future sensory organs (Fig. 1 - figure supplement 4C-E)."

      While there are more dividing cells at the anterior D/V, which will form sensory bristles, there are also dividing cells elsewhere, including in the posterior and scattered through the pouch, where there are no sensory precursors. Sensory organs are limited to the wing margin and the very few campaniform sensilla found on the prospective third vein. The Sens-GFP shown here, meant to identify sensory precursors, does not look much like the Sens expression in Nolo et al 2000. Anterior is on the left in 1S4A-D, but on the right in E.

      We thank the reviewer for this observation. Indeed, the Sens-GFP signal in the figure is too broad. This was owing to bleed-through of the PHH3 signal. Since the pattern of dividing cells at the WPP stage has been so well characterized in the literature, as has the pattern of Sens+ cells at that stage (ie, Nolo et al 2000), we have removed these panels and now simply cite the relevant literature.  

      "The gradient was asymmetric along the AP axis, being lower at the A margin than the P margin."

      The use of "margin" here is a bit confusing, as the term is usually used to describe the wing margin; that is, the D/V compartment boundary in the disc that forms the edge of the wing. Can the authors use a different term? It would also be helpful to point out that the A and P extremes are also, because of the geometry of the disc, the prospective proximal portions of the wing margin, and the hinge, especially since the authors are including the regions proximal to the most distal fold.

      We have reworded it as suggested.   

      The graphed loss of the Fat A-P gradient between day 5 third and wpp is dramatic. Given that the changes in folding at wpp might alter which cells are being graphed, can the authors show a photo?

      We have now included a photo of Fat-GFP at WPP in Fig 2 - figure supplement 2E.

      "Since Ds levels are highest and most steep near the margins, perhaps Ds inhibits Fat expression in a dose- or gradient-dependent manner. We also followed Fat-GFP dynamics in the ds mutant. We did not observe the progressive flattening of the Fat-GFP profile to the WPP wing (Fig. 2 - figure supplement 3A). Instead, the Fat-GFP profile was graded at the WPP stage and flattened somewhat more by the BPP stage (Fig. 2 - figure supplement 3B)."

      This description does not tell the reader if there is any less grading of Fat in the ds mutant compared with wild type; instead, it sounds like it is more graded, as gradation continues at wpp. This would then contradict the hypothesis that proximal Ds is required to create the distal Fat gradient.

      The Fat signals for the two genotypes are directly comparable because the samples were imaged together with the same microscope settings. Fig 2M shows that Fat is less graded in the ds mutant than in the wildtype. What changes at WPP is how long the graded expression persists, not the level of gradation: the graded profile persists longer into WPP in the mutant. The reason for this is not understood. We have reworded the text to make this clearer.

      The figure, on the other hand, looks like Fat is less graded, although as noted above this could instead be caused by loss of the stable Ds-bound Fat normally found at junctions. 

      Fig 2M shows an increase in Fat levels at the proximal regions of the ds mutant pouch, where Ds is normally most concentrated. This makes the overall profile look less graded. 

      Confusingly, in the Discussion the authors state: "Loss of Ds affects the Fat gradient such that distribution of Fat is uniformly upregulated to peak levels." There is no mention of "peak levels" in the Results, and no mention of "graded" expression in the Discussion. I am unclear on how the absolute levels are being determined and would be surprised if there were peak levels after loss of Ds-bound Fat from junctions.

      The absolute levels between the genotypes were determined by carefully calibrated fluorescence of Fat-GFP from samples imaged at the same time with the same settings. We used the word peak to refer to the highest level of Fat-GFP within a given gradient profile. Clearly, the description is confusing and so we have deleted the word and modified the text to clarify the meaning.

      "Interestingly, the reversed Ds gradient caused a change in the Fat gradient (Fig. 7E). Its peak also became skewed to the anterior and did not normally flatten at the WPP stage."

      This result contradicts the authors' earlier model that proximal Ds destabilizes Fat. Instead, the result fits the stabilization of Fat caused by binding to endogenous or overexpressed Ds or Ds ECD (Ma et al. 2003; Matakatsu and Blair, 2004; 2006; Hale et al. 2015).

      We agree that the reversed Ds affects Fat differently than the loss-of-function ds phenotype. We were not intending to propose a model based on the ds mutant, but a simple interpretation of the result. The reversed Ds experiment generates on its own a simple interpretation that is not consistent with the other. This speaks to the complexity of the system. We have changed the text in the Results to make this less confusing.

      Reviewer 2 found the paper to provide insights into normal growth of the wing and useful tools for measurement of growth features. This review offered many insights and thoughtful suggestions, which we have adopted to greatly improve the manuscript. The referee’s points are listed below with our responses.

      Although the approach used to measure volume is new to this study, the basic finding that imaginal disc growth slows at the mid-third instar stage has been known for some time from studies that counted disc cell number during larval development (Fain and Stevens, 1982; Graves and Schubiger, 1982). Although these studies did not directly measure disc volume, because cell size in the disc is not known to change during larval development, cell number is an accurate measure of tissue volume. However, it is worth noting that the approach used here does potentially allow for differential growth of different regions of the disc.

      We had cited the older literature in reference to our results. We have now noted the approach’s usefulness in measuring different disc regions such as the pouch.

      Related to point 1, a main conclusion of this study, that cell cycle length scales with growth of the wing, is based on a developmentally limited analysis that is restricted to the mid-third instar larval stage and later (early third instar begins at 72 hr - the authors' analysis started at 84 hr). The previous studies cited above made measurements from the beginning of the 3rd instar and combined them with previous histological analyses of cell numbers starting at the beginning of the 2nd instar. Interestingly, both studies found that cell number increases exponentially from the start of the 2nd instar until mid-third instar, and only after that point does the cell cycle slow resulting in the linear growth reported here. The current study states that growth is linear due to scaling of cell cycle with disc size as though this is a general principle, but from the earlier studies, this is not the case earlier in disc development and instead applies only to the last day of larval life.

      We apologize for not making this distinction clearer in the original manuscript. Indeed, growth is initially exponential and shifts to a more linear-like regime in the mid third instar. Our focus in the manuscript is primarily this latter phase. We have now rewritten the text in the Introduction, Results and Discussion to make this very clear. 

      While cell number and pouch volume increase exponentially from the start of the 2nd instar, the cell cycle already begins to slow down during the 2nd instar, as found with mitotic index measurements done by Wartlick et al 2011. Using their data to model cell cycle duration as a function of pouch area, we find that during the 2nd instar, cell cycle duration also increases as the size of the wing pouch increases. This is shown in the figure (panel C) below. Note that this relationship appears nonlinear and is quantitatively distinct from the relationship for third instar wing growth.

      Author response image 1.

      The analysis of the roles of Fat and Dachsous presented here has weaknesses that should be addressed. It is very curious that the authors found that depletion of Fat by RNAi in the wing blade had essentially no effect on growth while depletion of Dachsous did, given that the loss of function overgrowth phenotype of null mutations in fat is more severe than that of null mutations in dachsous (Matakatsu and Blair, 2006). An obvious possibility is that the Fat RNAi transgene employed in these experiments is not very efficient. The authors tried to address this by doubling the dose of the transgene, but it is not clear to me that this approach is known to be effective. The authors should test other RNAi transgenes and additionally include an analysis of growth of discs from animals homozygous for null alleles, which as they note survive to the late larval stages.

      We thank the reviewer for this suggestion. Indeed, the weak effect of Fat RNAi had been due to the specific RNAi driver. We followed the reviewer’s suggestion and tested other RNAi stocks. We had in hand an RNAi driver against GFP that we had found in unrelated studies to be a very potent repressor of GFP expression. Since we had been using a knock-in allele of GFP inserted in frame to Fat throughout this study, we applied nub>Gal4 UAS-GFP RNAi to knock down homozygous Fat-GFP. The effect of the knockdown was very strong, as measured by remaining 488nm fluorescence above background fluorescence after knockdown. Correcting for background fluorescence, we estimated that only 4.5% of Fat-GFP remained under RNAi conditions (Figure 5 - figure supplement 3). 

      Using the more potent RNAi reagent, we repeated the various experiments related to Fat. We observed a 42% increase in wing pouch growth, which is similar to that of Ds RNAi. We also observed an effect of Fat RNAi on the average cell cycle time of wing pouch cells. There was still a linear coupling between the cell cycle duration and wing pouch size, but the slope of the coupling was smaller with Fat RNAi. This was very similar to what Ds RNAi does to the cell cycle. Therefore, we have replaced the data from the original Fat RNAi experiments with the new data and modified the text throughout the manuscript to describe the new results.

      It is surprising that the authors detect a gradient of Fat expression that has not been seen previously given that this protein has been extensively studied. It is also surprising that they find that expression of Nubbin Gal4 is graded across the wing blade given that previous studies indicate that it is uniform (ie. Martín et al. 2004). These two surprising findings raise the possibility that the quantification of fluorescence could be inaccurate. The curvature of the wing blade makes it a challenging tissue to image, particularly for quantitative measurements.

      Non-uniform Fat protein expression has been observed before but not carefully quantified (see Mao et al., 2009, Strutt and Strutt 2002). Martin et al. 2004 (doi 10.1242/dev.013) claimed that Nub-Gal4 is uniform without actually measuring it. Please consult Fig 1A and 2A in their paper, which clearly show stronger expression in the center/distal region of the pouch.

      Regarding systematic errors in quantification, we took great pains to minimize them. We carefully divided the complex folded disc’s z stack into an apical region of interest (ROI) that included the distal domain of the wing pouch and a basal ROI that included the folds encompassing the pouch. We then used a published and widely used surface detection algorithm (ImSAnE) that captures a 3D ROI that can be curved and complex in shape (in z space) because the user creates a surface spline of the ROI. The resulting output treats the ROI as a virtual 2D object. This obviates the need to perform max projections of confocal stacks, which often create the artifacts that the reviewer speaks of. ImSAnE avoids such artifacts, and it is the gold standard for image processing of ROIs with 3D curvature.
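
      The difference between a max projection and sampling along a curved surface can be illustrated with a toy example. The sketch below is not ImSAnE and does not use its API; it is a simplified NumPy illustration in which the surface is assumed to be given as an integer height map, and the function name and synthetic stack are hypothetical.

```python
import numpy as np


def surface_projection(stack: np.ndarray, surface_z: np.ndarray, half_width: int = 0) -> np.ndarray:
    """Average intensity in a thin shell around the surface z = surface_z[y, x].

    stack has shape (nz, ny, nx); surface_z is an integer array of shape (ny, nx).
    """
    nz = stack.shape[0]
    offsets = np.arange(-half_width, half_width + 1)
    # z indices of the shell at every (y, x), clipped to stay inside the stack
    z_idx = np.clip(surface_z[None, :, :] + offsets[:, None, None], 0, nz - 1)
    shell = np.take_along_axis(stack, z_idx, axis=0)
    return shell.mean(axis=0)


# Synthetic stack: uniform signal (1.0) on a tilted surface, plus one bright
# off-surface voxel standing in for signal from a neighbouring fold.
stack = np.zeros((5, 4, 4))
surface_z = np.tile(np.arange(4), (4, 1))  # surface height rises with x
np.put_along_axis(stack, surface_z[None, :, :], 1.0, axis=0)
stack[4, 0, 0] = 10.0

max_proj = stack.max(axis=0)                      # picks up the off-surface voxel
surf_proj = surface_projection(stack, surface_z)  # follows the surface only

print(max_proj[0, 0], surf_proj[0, 0])
```

      In this toy case the max projection reports the off-surface voxel at one position, whereas sampling along the surface recovers the uniform signal everywhere.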

      Moreover, our pipeline does detect uniform expression if it is there. We used a da-Gal4 driver in Fig. 2K,L - this driver is widely acknowledged to be uniformly expressed in tissues of the fly. When it drives a control fluorescent marker (Bazooka-mCherry), our analysis pipeline detects a uniform expression pattern across the wing pouch (Fig. 2L). When the same Gal4 transgene drives Fat-HA in the same tissue, our pipeline detects a graded expression pattern of Fat-HA (Fig. 2L). In fact, this experiment co-expressed both Fat-HA and the control marker in the same disc. Thus, we feel confident that our analysis is not inaccurate.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations For The Authors):

      All comments made in the public section.

      We would like to thank the reviewer for their assessment of our study and for suggestions for additional experiments to follow up our studies.

      Reviewer #2 (Recommendations For The Authors):

      ‐ Preparation of spike proteins and VLPs. Although Triton‐X114 extraction was done to remove endotoxin from the recombinant spike protein preparations, its removal efficiency depends on the levels of endotoxin in the samples. Therefore, the residual endotoxin levels in each of the test samples and batches should be measured. Even very low but varying levels of residual endotoxin would substantially impact the reported results, as they create inconsistent data that are not interpretable.

      Certainly, endotoxin contamination in instilled materials is always an issue. Established protocols for inducing acute inflammatory responses using endotoxin outline specific ranges of endotoxin levels in the instillation materials. To induce acute lung inflammation in mice at least 2 µg of endotoxin must be instilled. We have endeavored to reduce the possibility of endotoxin contamination in our recombinant proteins by using a mammalian expression system; careful aseptic culture and protein purification techniques; and a final Triton-X114 partitioning protocol. We assessed the possibility of endotoxin contamination using the Pierce™ Chromogenic Endotoxin Quant Kit, which is based on the amebocyte lysate assay. Our analysis revealed that the endotoxin level in the purified recombinant protein preparation is below 1.0 EU/ml, which closely aligns with the levels specified for recombinant proteins. An endotoxin concentration of 1.0 EU/ml is equivalent to approximately 0.1 ng/ml. Throughout all mouse nasal instillation experiments, the total volume of recombinant protein administered did not exceed 6 µl. The amount of contaminant endotoxin instilled did not exceed 1 pg (50 µl of 0.02 ng/ml of endotoxin). Consequently, we can confirm that the extent of endotoxin contamination is at trace levels. Moreover, our study reveals multiple results indicating that the level of endotoxin contamination in the recombinant protein was inadequate to independently induce neutrophil recruitment in the cremaster muscle, lymph nodes, and liver. For further insights, refer to Figure 5.
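
      The endotoxin bound above is simple unit arithmetic, sketched here in Python. The function name and the choice to evaluate two scenarios are illustrative assumptions: the first uses the figures given in the parenthetical (50 µl at 0.02 ng/ml), and the second pairs the stated upper bound of about 0.1 ng/ml (1.0 EU/ml) with the stated 6 µl maximum volume. Both are compared with the 2 µg quoted as the minimum needed to induce acute lung inflammation.

```python
def endotoxin_mass_pg(conc_ng_per_ml: float, volume_ul: float) -> float:
    """Endotoxin mass in picograms for a given concentration and volume.

    ng/ml * µl = ng/ml * 1e-3 ml = 1e-3 ng = 1 pg, so the product is already in pg.
    """
    return conc_ng_per_ml * volume_ul


# Threshold quoted in the response: at least 2 µg to induce acute lung inflammation.
THRESHOLD_PG = 2.0 * 1e6  # 2 µg expressed in pg

# Scenario 1: figures from the parenthetical (50 µl of 0.02 ng/ml).
parenthetical_pg = endotoxin_mass_pg(0.02, 50.0)

# Scenario 2 (assumed pairing): 0.1 ng/ml upper bound with the 6 µl maximum volume.
upper_bound_pg = endotoxin_mass_pg(0.1, 6.0)

for label, mass in [("parenthetical", parenthetical_pg), ("upper bound", upper_bound_pg)]:
    print(f"{label}: {mass:.2f} pg, {THRESHOLD_PG / mass:.1e}-fold below threshold")
```

      Under either reading the instilled mass is on the order of 1 pg, roughly six orders of magnitude below the 2 µg threshold.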

      ‐ Doses of spike and VLPs: The amount of spike protein incorporated into HIV Gag‐based VLPs should be determined and compared to that found in the native SARS‐CoV‐2 virus particles. This should provide more physiologic doses (or dose ranges/titration) of spike than the arbitrary doses (3 ug or 5 ug) used in the mouse experiments.

      To visualize the acquisition of spike protein and track cells that have acquired the spike protein, we conducted a series of tests and optimizations using different concentrations of Alexa 488 labeled spike protein, ranging from 0.5 to 5 µg. During the processing of lung tissue for microscopic imaging, it was of utmost importance to preserve the integrity of the labeled spike protein in the tissue samples. We determined that instillation of 3 µg of Alexa 488 labeled spike protein yielded the optimal signal strength across the lung sections. Notably, in many mouse models employing intra-nasal instillation protocols for SARS-CoV-2 spike protein or RBD domain-only recombinant proteins, a dosage of approximately 3 µg or higher was commonly used. Regarding the titer of spike-incorporated VLPs, it is important to highlight that we did not directly compare the quantity of spike protein present in NL4.3 VLPs to that of the native SARS-CoV-2 virus. HIV-1 and SARS-CoV-2 viruses typically carry around 70 gp120 spikes and 30 spikes, respectively. We estimated that SARS-CoV-2 spike-incorporated NL4.3 VLPs may display twice the number of spikes compared to native SARS-CoV-2. Notably, our measurements of SARS-CoV-2 spike on NL4.3 VLPs demonstrated similar behavior to SARS-CoV-2 in terms of specific binding to ACE2-expressing 293T cells, indicating their functional similarity in this context.

      Author response image 1.

      Spike protein-incorporated NL4.3 VLPs test with human ACE2-transfected HEK293 cells. The wild-type spike protein-incorporated VLPs and delta envelope NL4.3 VLPs were analyzed using human ACE2-transfected HEK293 cells. The first plot shows ACE2 expression levels in HEK293 cells. The second plot displays the binding pattern of Delta Env NL4.3 VLPs on ACE2-expressing HEK293 cells. The third plot illustrates the binding pattern of wild-type spike protein-incorporated NL4.3 VLPs on ACE2-expressing HEK293 cells. The histogram provides a comparison of VLP binding strength to ACE2-expressing HEK293 cells.

      ‐ The PNGase F‐treated protein was not studied in Fig 1. In Fig 2, glycan‐removal by PNGaseF has little effects on cell uptake and cell recruitment in the lung. If binding to one of the Siglec lectins is a critical initial step, experiments should be designed to evaluate this aspect of the spike‐cell interaction in a greater depth.

      As the reviewer states, results with the PNGase F-treated protein were not shown in Fig. 1, although we showed results in Figs. 2 & 3. See the discussion below about our preparation of the PNGase F-treated protein. Perhaps because we elected to use a purified fraction that retained ACE2 binding, the protein we used likely retained some complex glycans. As the reviewer notes, the PNGase F-treated protein had similar overall cellular recruitment and uptake profiles compared to the untreated spike protein. The PNGase F-treated fraction we used no longer bound Siglec-F in the flow-based assay, shown in Fig. 7. This argues that the initial uptake and cellular recruitment following intranasal instillation of the Spike protein did not depend upon the engagement of Siglec-F. While Siglec-F on the murine alveolar macrophage can likely efficiently capture the spike proteins, other cellular receptors contribute, and the overall impact of the spike protein on alveolar macrophages likely reflects its engagement of multiple receptors.

      • Enzymatic removal of sialic acids from spike may be one parameter to explore. The efficiency of enzymatic removal should also be verified prior to experiments. Finally, the authors need to assess whether the proteins remained functional, folded properly, and did not aggregate.

      To obtain the de-glycosylated form of the SARS-CoV-2 spike protein, we employed PNGase F enzymatic digestion to remove glycans. Subsequently, the spike protein was purified using a size exclusion column. During this purification process, the PNGase F-treated spike protein segregated into two distinct fractions, specifically fraction 6 to 8 and fraction 9 to 11 (see revised Figure 1- figure supplement 1).

      Author response image 2.

      Size exclusion chromatography. The peak lines represent the absorbance at 280 nm. PNGase F-treated spike proteins were loaded onto a Superdex 26/60 column, resolved at a flow rate of 1.0 ml/min, and collected in 1 ml fractions.

      The Coomassie blue staining of an SDS-PAGE gel revealed that fractions 6 to 8 likely underwent a more pronounced de-glycosylation by PNGase F compared to fractions 9 to 11. Additionally, during the size column purification, we noticed that fractions 6 to 8 exhibited a faster mobility than the untreated spike protein, implying a potentially substantial modification of the protein's conformation. To probe the functional characteristics of the de-glycosylated spike protein in fractions 6 to 8, we conducted binding tests with human ACE2. Strikingly, the spike protein in fractions 6 to 8 completely lost its ability to bind ACE2. Conversely, the protein in fractions 9 to 11 showed partial de-glycosylation but still retained its original ability to bind ACE2 and its antibody.

      Author response image 3.

      FACS analysis of various spike protein-bound beads. Protein bound beads were detected with labeled spike antibody, recombinant human ACE2, and recombinant mouse Siglec-F.

      Based on these results, we concluded that fractions 9 to 11 would be the most suitable choice for further studies as the de-glycosylated spike protein, because this material retained the functional properties relevant for ligating ACE2 and antibody motifs yet had lost Siglec-F binding. In the revised manuscript we have described in more detail the purification of the PNGase F-treated Trimer and its functional assessment.

      ‐ Increases in macrophages and alveolar macrophages by Kifunensine Tx spike in Fig 2A suggest effects that are not related to Siglec lectins. These effects are not seen with the wild type or D614 spike trimers, so the relevance of high‐ mannose spike is unclear. On the other hand, there were clear differences between Wuhan and D614 trimers seen in Fig 2A and 2B, but there was no verification to ascertain whether these differences were indeed due to strain differences and not due to batch‐to‐batch variability of the recombinant protein production. The overall glycan contents of the Wuhan and D614 spike protein samples should be measured. If Siglec interaction is the main interest in this study, the terminal sialic acid contents should be determined and compared to those in the corresponding strains in the context of native SARS‐CoV‐2 virions.

      Our initial observation that Siglec-F positive alveolar macrophages (AMs) avidly acquired spike proteins, followed by a rapid leukocyte recruitment, provided the rationale for us to examine the impact of modifying the glycosylation pattern on the spike protein (de-glycosylated and spike variants) on their binding tropism and their cellular recruitment profiles in the lung. In this context, we examined the influence of several glycan modifications on spike proteins, hypothesizing that these modifications would alter the acquisition of the spike protein by mouse AMs compared to the wild-type trimer. While we did not conduct an in-depth analysis of the glycan composition and terminal sialic acid contents of the SARS-CoV-2 spike proteins we used, we did verify that the different proteins behaved as expected. Most of the biochemical studies were performed in Jim Arthos’ laboratory, which has a long interest in the glycosylation of the HIV envelope protein. On SDS-PAGE the SARS-CoV-2 spike protein purified from the Kifunensine-treated CHO cells exhibited a 12 kDa reduction. It bound much better to L-Sign, DC-Sign, and maltose binding lectin, and poorly to Siglec-F. In the cellular studies it bound less well to most of the cellular subsets examined, including murine alveolar macrophages. In studies with human blood leukocytes, it relied on cations for binding. However, it retained its toxicity directed at mouse and human neutrophils, and it elicited a similar cytokine profile when added to human macrophages. The D614G mutation increased the spike protein binding to P-Selectin, CD163, and snowdrop lectin (mannose binding), suggesting that the mutation had altered the glycan content of the protein. We used the D614G spike protein in a limited number of experiments as it behaved like the wild-type protein except for a slightly altered cellular retention pattern 18 hrs after intranasal instillation.
      In the revised manuscript we have included its binding to peripheral blood leukocytes. The D614G mutation conferred stronger binding to human monocytes than the original Spike protein. As discussed above, we recovered two fractions following the PNGase F treatment, one with a 40 kDa reduction on SDS-PAGE and the other with a 60 kDa decrease, and we chose to evaluate the fraction with a 40 kDa reduction in subsequent experiments. Consistent with a loss of N-linked glycans, the PNGase F treatment reduced the binding to the lectin PHA, which recognizes complex carbohydrates, and it resulted in a sharp reduction in Siglec-F binding. The lower molecular weight fraction recovered after PNGase F treatment no longer bound ACE2. While our studies showed that alveolar macrophages likely employ Siglec-F as a capturing receptor, they possess other receptors that also can capture the spike protein. The downstream consequences of engaging Siglec-F and other Siglecs by the SARS-CoV-2 spike protein will require additional studies.

      While we acknowledge the possibility of some batch-to-batch variation in recombinant protein preparation, we do not think this was a major issue. We have noted some batch-to-batch variation in yield efficiency; however, the purified proteins consistently gave similar results in the various experiments.

      ‐ Fig 3: The same concern described above applies to the hCoV‐HKU1 spike protein. In Panel D, the PNGase and Kifunensine treatment did not appear to abrogate the neutrophil recruitment. Panel A did not include PNGase and Kif Tx spike proteins. Quantification of images in panel D is missing and should be done on many randomly selected areas.

      We quantified the neutrophil counts in the images in panel D, and the results are now presented (Figure 3-figure supplement 1C). The Kifunensine treatment reduced the neutrophil recruitment at 3 hours, while the PNGase F treated Spike protein recruited as many or slightly more neutrophils. The hCoV-HKU1 S1 domain did not differ much from the saline control.

      ‐ Fig 4: Kifunensine Tx spike caused more increase in neutrophil damage after intrascrotal injections. PNGase Tx spike was not tested. Connection between Siglec‐spike binding and neutrophil recruitment/damage is lacking.

      Exteriorized cremaster muscle imaging serves as a model system for monitoring the behavior of neutrophils recruited by spike proteins within the local tissue, which is distinct from lung tissue with its resident Siglec F-positive alveolar macrophages. Hence, our primary focus was not on investigating the Siglec/Spike protein interaction. Consequently, we did not use PNGase F-treated spike protein in these experiments. To clarify this issue, we added a sentence to the main text: ‘Although this model lacks Siglec F-positive macrophages, it is worth monitoring the effect of the SARS-CoV-2 Spike protein on neutrophils recruited in the inflammatory local tissue.’

      ‐ Fig 5. Neutrophil injury was also seen after inhalation (intranasal) of spike protein in mice and in vitro with human neutrophils. Panel B shows no titrating effects of spike (from 0.1 to 2) on Netosis of murine neutrophils. Panel C: Netosis was seen with human neutrophils at 1 but not 0.1. Is this species difference important?

      Given the observation of neutrophil NETosis in the mouse imaging experiment, our objective was to characterize the direct impact of the spike protein on human and murine neutrophils. The origins of the neutrophils are different, as the murine neutrophils were purified from mouse bone marrow while the human neutrophils were purified from human blood. Both purification protocols yielded greater than 98% neutrophils. However, the murine neutrophils contain many more immature cells (50-60%) because the bone marrow served as their source. Furthermore, the murine neutrophils are from 6-8-week-old mice while the human neutrophils are from 30-50-year-old humans. More work would be needed to sort out whether there is any difference between human and mouse neutrophils in their propensity to undergo NETosis in response to Spike protein.

      ‐ Kifunensine Tx again did not cause any reduction, indicating the lack of involvement of sialic acid. How was this related to Siglec participation directly or indirectly? There was no quantification for Panel D.

      We do not think that Siglecs play a role in the induction of neutrophil NETosis, as the Spike proteins lacking Siglec interactions induced similar levels of NETosis. Other neutrophil receptors are likely important. As noted in the text, "human neutrophils express several C-type lectin receptors including CLEC5A, which has been implicated in SARS-CoV-2 triggered neutrophil NETosis." Our goal with the data in Panel D was to visualize human neutrophil NETosis on trimer-bearing A549 cells; we relied on the flow cytometry assays for quantification.

      ‐ The rationale for testing cation dependence is unclear and should be described. What is the significance of "cations enhanced leukocyte binding particularly so with the high mannose protein"? Are there cation‐dependent receptors for spike independent of glycans and huACE‐2? If so, how is this relevant to the main topic of this paper?

      It is well known that glycan binding by many C-type lectins is calcium-dependent, involving specific amino acid residues that coordinate with calcium ions and bind to the hydroxyl groups of sugars. As discussed in our previous draft, the C-type lectin receptor L-SIGN has been suggested as a calcium-dependent receptor for SARS-CoV-2, specifically interacting with high-mannose-type N-glycans on the SARS-CoV-2 spike protein. Therefore, it was worthwhile to investigate the calcium dependence of spike protein binding to various types of immune cells. We added some data to this figure, which now includes the binding profile of the D614G protein. In addition, we corrected the binding data by subtracting the fluorescent signal from the unstained control cells.

      ‐ Fig 7: human Siglec 5 and 8 were studied in comparison with mouse Siglec F. Recombinant protein data are not congruent with transfected 293 cell data. Panel A, the best binding to hSiglec 5 and 8 are the PNGase F Tx spike protein; how to interpret these data? Panel B: only the WT and D614G spike proteins binding to Siglec 5 and 8 on transfected cells. It made sense that kif Tx (high‐mannose) and PNGaseF Tx (no glycan) spike would not bind to the Siglecs, but they did not bind to ACE2 either, indicative of nonfunctional spike proteins.

      We discussed this as follows: ‘The closest human paralog of mouse Siglec-F is hSiglec-8 (reference 40). While expressed on human eosinophils and mast cells, human AMs apparently lack it. In contrast, human AMs do express Siglec-5 (reference 37). Along with its paired receptor, hSiglec-14, Siglec-5 can modulate innate immune responses (reference 41). When tested in a bead binding assay, in contrast to Siglec-F, neither hSiglec-5 nor -8 bound the recombinant spike protein, yet their expression in a cellular context allowed binding.’ The in vitro bead binding assay we established demonstrated the specific binding of the bait molecule to target molecules. However, it does have limitations in replicating the complexities of the actual cellular environment. As discussed previously, the PNGase Tx fraction we used in these experiments retained ACE2 binding but lost binding to Siglec-F in the bead assay. In a Biacore assay, not shown, the PNGase Tx fraction bound L-Sign and DC-Sign better than the untreated trimer, and it retained human ACE2 binding, although it bound less well than the wild-type trimer. Why the PNGase Tx fractions bound poorly to the human ACE2 transfected HEK293 cells is unclear. A higher density of recombinant ACE2 on the beads compared to that expressed on the surface of HEK293 cells may explain the difference. Alternatively, in the bead assay we used a recombinant human ACE2-Fc fragment fusion protein purified from HEK293 cells, while in the transfection assay we expressed full-length human ACE2. The Biacore, the bead binding, and the functional assays we performed all suggest that we had used intact recombinant proteins.

      ‐ Fig 8: This last set of experiment was to measure cytokine release by different types of macrophage cultures treated with spike from different cells with vs without Kifunensine Tx. The connection of these experiments to the rest is tenuous and is not explained. This is one of the examples where bits of data are presented without tying them together.

      Dysregulated cytokine production significantly contributes to the pathogenesis of severe COVID-19 infection. Since we had observed strong binding of the spike protein to human monocytes and murine alveolar macrophages, we tested whether the spike protein altered cytokine production by human monocyte-derived macrophages. Depending on the culture conditions, human monocytes can be differentiated into M0, M1, or M2 phenotypes. Each type of macrophage responds differently to stimulants, often leading to distinct patterns of cytokine secretion. These patterns offer valuable insights into the immune response. The cytokine profiling conducted in this study shows how distinct macrophage types react to the spike protein.

      ‐ Discussion section did not describe how the various experiments and data are tied together. The authors explained the interactions of spike with different cell types in each paragraph separately, leaving this reviewer really confused as to what the authors want to convey as the main message of the paper.

      We have modified the Discussion to address this issue.

      Reviewer #3 (Recommendations For The Authors):

      ‐ The authors may want to refer to "intranasal instillation" to distinguish it from inhalation of an aerosolised liquid. How was the dose of the spike protein selected? There is some dose information in different settings, but usually between 0.1‐1 µg/ml or 0.1 µg‐5 µg range for in vivo injection, but the rationale for these ranges should be discussed. Is this mimicking a real situation during infections or a condition that might be used for vaccines?

      While inhalation of aerosolized liquid closely mimics the natural route of human exposure to respiratory infectious materials, intranasal instillation with a liquid inoculum remains a widely accepted standard approach for virus or vaccine inoculation across various laboratory species. To clearly define our mouse model, we have changed the term 'inhalation' to 'instillation'. As we previously answered Reviewer #2: To visualize the acquisition of spike protein and track cells that have acquired the spike protein, we conducted a series of tests and optimizations using different amounts of Alexa Fluor 488 labeled spike protein, ranging from 0.5 to 5 µg. During the processing of lung tissue for microscopic imaging, it was of utmost importance to preserve the integrity of the labeled spike protein on the tissue samples. Through our investigations, we determined that an instillation of 3 µg of Alexa Fluor 488 labeled spike protein yielded the best signal strength across the lung sections. Notably, in many mouse models employing intranasal instillation protocols for SARS-CoV-2 spike protein or RBD domain-only recombinant proteins, a dosage of approximately 3 µg or higher was commonly used. Hence, based on these references and our preliminary studies, we selected 3 µg as the optimal amount of instilled spike protein per mouse.

      ‐ Controls are not evenly applied. In some cases, the control for the large and complex SARS‐CoV2 spiker trimer is PBS. This seems insufficient to control against effects of injecting such complex proteins that can undergo significant conformational changes after uptake by a cell. In some cases, human coronavirus spike proteins from different viruses are used, but not much is said about these proteins and the different glycoforms are not explored. Are these prepared in the same way and do they have similar glycoforms. For example, if the Siglecs bind sialic acid on N‐linked glycans, then why do the purified Siglecs or Siglecs expressed in cells not bind the HKU‐1 spike, which would have such sialic acids if expressed in the same way as the CoV2 spike?

      We carefully considered the choice of an appropriate control material for these experiments. Initially, we opted to use saline or PBS for intranasal instillation as a vehicle control, a choice aligned with the approach taken in numerous previous studies involving lung inflammation mouse models. However, we share the reviewer's concern about achieving more meaningful and comparable control materials, particularly considering the size and complexity of the recombinant protein. For this reason, we introduced glycan-modified spike proteins and the HCoV-HKU1 S1 subunit. Figure 3 illustrates our evaluation of various spike proteins in terms of their impact on neutrophil recruitment. The diversity of sialic acid structures observed on recombinant proteins expressed within the same cell emerges from the interplay of multiple factors within the cellular glycosylation machinery. This complex enzymatic process allows cells to finely modulate glycan structures and sialic acid patterns, tailoring them to the diverse biological functions of distinct proteins. Despite structural similarities between the HCoV-HKU1 and SARS-CoV-2 spike proteins, their glycan modifications vary, thereby leading to distinct binding properties with various Siglec subtypes. All recombinant proteins used in this study except for the S1 subunits were generated within our laboratory. These include the wild-type spike protein, the D614G Spike protein, the Kifunensine-treated high mannose spike proteins, and the PNGase F-treated deglycosylated spike proteins. All the proteins were produced using the same protocol in CHO cells or, on occasion, HEK293F cells. We have indicated in the manuscript where we used HEK293F cells for the protein production; otherwise, the proteins were produced in CHO cells.

      ‐ Figure 1 F‐I, there should be a control for VLP without SARS‐CoV2 spike as the VLP will contain other components that may be active in the system.

      We tested the delta Env VLP for alveolar macrophage acquisition and neutrophil recruitment. We found a similar alveolar macrophage acquisition of the VLPs, but significantly less neutrophil recruitment compared to the free Spike protein. Since the uptake pattern with the VLPs matched that of the spike protein, we did not consider adding a non-spike bearing VLP as a control. The rapid clearance of the VLPs into the lymphatics shortly after instillation may account for the reduced neutrophil recruitment following their instillation (Figure 1 figure supplement 2B, C).

      ‐ In Figure 1H, that do they mean by autofluorescence? Is this the cyan signal?

      Is the green signal also autofluorescence as this is identified as the VLP?

      We appreciate the reviewer pointing out the typo regarding autofluorescence in the figure image. To provide clarity regarding the background in all lung section images, we have included additional supplemental data. During the fixation process of lung tissue, various endogenous elements in the tissue sample contribute to autofluorescence when exposed to lasers in the confocal microscope. Specifically, collagen and elastin present in the lung vasculature, including airways and blood vessels, are dominant structures that generate autofluorescence. To address this issue, we have implemented optimizations to distinguish between real signals and the noise caused by autofluorescence. We inadvertently failed to indicate the source of the strong cyan signal. The signal is due to Evans Blue dye delineating lung airway structures, which contain collagen and elastin, known binding materials for Evans Blue dye. This explains the strong fluorescence signals observed in the airways. We conjugated the recombinant spike protein with Alexa Fluor 488, and virus-like particles (VLPs) were visualized with gag-GFP (Figure 1 figure supplement 2A, D).

      ‐ The control for SARS‐CoV2 spike trimer is PBS, but how can the authors distinguish patterns specific to the spike trimer from any other protein delivered by intranasal instillation. Could they use another channel with a control glycoprotein to determine if there is anything unique about the pattern for spike trimer?

      Alveolar macrophages employ numerous receptors to capture glycoproteins that have mannose, N-acetylglucosamine, or glucose exposed. Galactose-terminal glycoproteins are typically not bound. We do not think that the Spike protein is unique in its propensity to target alveolar macrophages.

      ‐ What is the parameter measured in Figure S2B?

      The parameter is the percentage of each cell type that retained the instilled Spike protein at the three-hour time point.

      ‐ The Spike trimer with high mannose oligosaccharides may gain binding to the mannose receptor. It may be helpful to state the distribution of this receptor and comment is it could be responsible for this having the largest effect size for some cell types.

      We agree that the spike trimer with high mannose should target cells bearing the mannose receptor. We have modified the discussion to address this point and have mentioned some of the cell types likely to bind the high mannose bearing spike protein.

      ‐ A key experiment is the Evans Blue measure of lung injury in Figure 3A. A control with the HKU‐1 spike is also performed, but more details on the matching of this proteins production to the SARS‐CoV2 spike trimer and the quantification of these comparative result should be provided. To show that the SARS‐CoV2 spike trimer can cause tissue injury on its own seems like a very important result, but the impact is currently reduced by the inconsistent application of controls and quantification of key results. Furthermore, if these results can be repeated in the B6 and B6 K18‐hACE2 mouse model it might further increase the impact by demonstrating whether or not hACE2 contributes to this effect.

      We repeated the lung permeability assay using the S1 subunit from the original SARS-CoV-2 and the S1 subunit from HCoV-HKU1. Both proteins were made by the same company using a similar expression system and purification protocol. Consistent with our original data, the instillation of the SARS-CoV-2 S1 subunit led to an increase in lung vasculature permeability, whereas the HCoV-HKU1 S1 subunit had a minimal impact (Figure 3 figure supplement 1A). This experiment suggests that it is the S1 subunit that leads to the increase in vascular permeability. To address the contribution of hACE2 to this phenomenon, we conducted a lung permeability assay using K18-hACE2 transgenic mice. The K18-hACE2 transgenic mice exhibited a slight increase in lung vasculature permeability upon SARS-CoV-2 trimer instillation compared to the non-transgenic mice. This suggests that the hACE2-Spike protein interaction may contribute to an increase in lung vascular permeability during SARS-CoV-2 lung infection (Figure 3 figure supplement 1B).

      ‐ For Figure 4A, could they provide quantification. The neutrophil extravasation with Trimer appears quite robust, but the authors seem to down‐play this and it's not clear without quantification.

      To address this issue, we counted and graphed the neutrophil numbers in each image. Injection of the trimer along with IL-1β significantly increased neutrophil infiltration (Figure 4 figure supplement 1).

      ‐ In Figure 4B, there are no neutrophils at all in the BSA condition. Is this correct? Intravascular neutrophils were detected with PBS injection in Figure 4A.

      We demonstrated that the neutrophil behaviors occur within the infiltrated tissue rather than within the blood vessels. Even when examining the blood vessels in all other images, it is challenging to identify neutrophils adhering to the endothelium of the blood vessels. Neutrophils observed in the PBS 3-hour control group are likely acute responders to the local injection, as a smaller number of neutrophils were observed in the 6-hour image.

      ‐ In Figure 5A the observation of neutrophil response in lung slices seems to be presented an anecdotal account. The neutrophil appears to polarize, but is this a consistent observation? How many such observations were made?

      We have consistent observations across three different experiments. In addition, highly polarized and fragmented neutrophils were consistently observed in the fixed lung section images.

      ‐ The statement: "human Siglec‐5 and Siglec‐8 bound poorly despite being the structural and functional equivalents of Siglec F, respectively (37)". How can one Siglec be the structural and the other the functional equivalent of Siglec‐F? It might help to provide a little more detail as to how these should be seen.

      Mouse Siglec-F has two distinct counterparts in the human Siglec system, both in terms of structure and function. In the context of domain structure, human Siglec-5 serves as the counterpart to mouse Siglec-F. However, it's important to note that while human Siglec-8 is not a genetic ortholog of mouse Siglec-F, it is expressed on similar cellular populations and functions as a functional paralog.

      ‐ The assay using purified proteins and proteins expressed in cells don't fully agree. For example, it's very surprising that recombinant Siglec 5 and 8 bind better to the non‐glycosylated form than to the glycosylated trimer. It appears from Figure S1 that the PNGaseF treated Spike contains at least partly glycosylated monomers and it also appears that the Kifunesine effect may be partial. PNGaseF may have a hard time removing some glycans from a native protein.

      We were also surprised by the results using the PNGase F treated Spike protein, in that it lost binding to Siglec-F but retained binding to human Siglec-5 and -8 in the bead assay shown in Figure 7A. As explained above, we used a purified fraction of the PNGase F treated protein that retained some functional activity as assessed in the ACE2 binding assay and in Biacore assays not shown. The persistent binding of Siglec-5 and Siglec-8 suggests that removal of some of the complex glycans had revealed sites capable of binding Siglec-5 and -8. We agree with the reviewer that the PNGase treatment we used only removed some of the glycans from the native protein. In data not shown, the high mannose spike protein behaved as predicted in Biacore assays, binding better to DC-SIGN and maltose binding lectin, but less well to PHA and to ACE2. The high mannose trimer also bound less well to the HEK293 cells expressing ACE2, Siglec-5, or Siglec-8, as well as to peripheral blood leukocytes.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, Kume et al examined the role of the protein Semaphorin 4a in steady-state skin homeostasis and how this relates to skin changes seen in human psoriasis and imiquimod-induced psoriasis-like disease in mice. The authors found that human psoriatic skin has reduced expression of Sema4a in the epidermis. While Sema4a has been shown to drive inflammatory activation in different immune populations, this finding suggested Sema4a might be important for negatively regulating Th17 inflammation in the skin. The authors go on to show that Sema4a knockout mice have skin changes in key keratinocyte genes, increased gdT cells, and increased IL-17 similar to differences seen in non-lesional psoriatic skin, and that bone marrow chimera mice with WT immune cells and Sema4a KO stromal cells develop worse IMQ-induced psoriasis-like disease, further linking expression of Sema4a in the skin to maintaining skin homeostasis. The authors next studied downstream pathways that might mediate the homeostatic effects of Sema4a, focusing on mTOR given its known role in keratinocyte function. As with the immune phenotypes, Sema4a KO mice had increased mTOR activation in the epidermis in a similar pattern to mTOR activation noted in non-lesional psoriatic skin. The authors next targeted the mTOR pathway and showed rapamycin could reverse some of the psoriasis-like skin changes in Sema4a KO mice, confirming the role of increased mTOR in contributing to the observed skin phenotype.

      Strengths:

      The most interesting finding is the tissue-specific role for Sema4a, where it has previously been considered to play a mostly pro-inflammatory role in immune cells, this study shows that when expressed by keratinocytes, Sema4a plays a homeostatic role that when missing leads to the development of psoriasis-like skin changes. This has important implications in terms of targeting Sema4a pharmacologically. It also may yield a novel mouse model to study mechanisms of psoriasis development in mice separate from the commonly used IMQ model. The included experiments are well-controlled and executed rigorously.

      Weaknesses:

      A weakness of the study is the lack of tissue-specific Sema4a knockout mice (e.g. in keratinocytes only). The authors did use bone marrow chimeras, but only in one experiment. This work implies that psoriasis may represent a Sema4a-deficient state in the epidermal cells, while the same might not be true for immune cells. Indeed, in their analysis of non-lesional psoriasis skin, Sema4a was not significantly decreased compared to control skin, possibly due to compensatory increased Sema4a from other cell types. Unbiased RNA-seq of Sema4a KO mouse skin for comparison to non-lesional skin might identify other similarities besides mTOR signaling. Indeed, targeting mTOR with rapamycin reveres some of the skin changes in Sema4a KO mice, but not skin thickness, so other pathways impacted by Sema4a may be better targets if they could be identified. Utilizing WT→KO chimeras in addition to global KO mice in the experiments in Figures 6-8 would more strongly implicate the separate role of Sema4a in skin vs immune cell populations and might more closely mimic non-lesional psoriasis skin.

      We sincerely appreciate your summary and your assessment of the strengths and weaknesses of our study. Although we were unfortunately unable to perform all these experiments due to limitations in our resources, we fully agree with the importance of studying tissue-specific Sema4A KO mice. As an alternative, we used flow cytometry to compare the IL-17A-producing potential of skin T cells between WT→KO mice and KO→KO mice following 4 consecutive days of IMQ treatment. The results were comparable between the two groups. Additionally, we performed RNA-seq on the epidermis of WT and Sema4A KO mice. While we did not find similarities between Sema4A KO skin and non-lesional psoriasis except for S100a8 expression, we will continue to investigate the mechanisms by which Sema4A KO skin mimics non-lesional psoriasis skin as a future project.

      Although targeting mTOR with rapamycin did not reverse the epidermal thickness in Sema4A KO mice, rapamycin was effective in reducing epidermal thickness in a murine psoriasis model induced by IMQ in Sema4A KO mice. These results suggest potential clinical relevance for treating active, lesional psoriatic skin changes, which would be of interest to clinicians. Thank you once again for your valuable insights.

      Reviewer #2 (Public Review):

      Summary:

      Kume et al. found for the first time that Semaphorin 4A (Sema4A) was downregulated in both mRNA and protein levels in L and NL keratinocytes of psoriasis patients compared to control keratinocytes. In peripheral blood, they found that Sema4A is not only expressed in keratinocytes but is also upregulated in hematopoietic cells such as lymphocytes and monocytes in the blood of psoriasis patients. They investigated how the down-regulation of Sema4A expression in psoriatic epidermal cells affects the immunological inflammation of psoriasis by using a psoriasis mice model in which Sema4A KO mice were treated with IMQ. Kume et al. hypothesized that down-regulation of Sema4A expression in keratinocytes might be responsible for the augmentation of psoriasis inflammation. Using bone marrow chimeric mice, Kume et al. showed that KO of Sema4A in non-hematopoietic cells was responsible for the enhanced inflammation in psoriasis. The expression of CCL20, TNF, IL-17, and mTOR was upregulated in the Sema4AKO epidermis compared to the WT epidermis, and the infiltration of IL-17-producing T cells was also enhanced.

      Strengths:

      Decreased Sema4A expression may be involved in psoriasis exacerbation through epidermal proliferation and enhanced infiltration of Th17 cells, which helps understand psoriasis immunopathogenesis.

      Weaknesses:

      The mechanism by which decreased Sema4A expression may exacerbate psoriasis is unclear as yet.

      We greatly appreciate your summary and thoughtful feedback on the strengths and weaknesses of our study. In response, we have included the results of additional experiments on IL-23-mediated psoriasis-like dermatitis, which showed that epidermal thickness was significantly greater in KO mice compared to WT mice. When we analyzed the T cells infiltrating the ears using flow cytometry, the proportion of IL-17A-producing Vγ2 and DNγδ T cells within the CD3 fraction of the epidermis was significantly higher in Sema4A KO mice, consistent with the results from IMQ-induced psoriasis-like dermatitis. Furthermore, we examined STAT3 expression in the epidermis of WT and Sema4A KO mice using Western blot analysis, and the results were comparable between the two groups. However, the mechanism by which decreased Sema4A expression may exacerbate psoriasis remains unclear. We have added some explanations and hypotheses to the limitations section. Thank you once again for your valuable insights.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      Figure 1C

      What statistics were used? The supplemental notes adjusted the P value, what correction for multiple comparisons was utilized? Could the authors instead show logFC for the DEGs between Ctl and L in each cluster? This might be best demonstrated with a volcano plot, highlighting SEMA4A, and other genes known to be DE in psoriasis.

      We apologize for not including the detailed analysis methods in the original manuscript submission. We analyzed the scRNA-seq data using Cellxgene VIP with Welch’s t-test. Multiple comparisons were corrected using the Benjamini-Hochberg procedure, setting the false discovery rate (FDR) at 0.05. These details are now explained in the MATERIALS AND METHODS section of the resubmitted manuscript. We also added a volcano plot (log2FC against log10 p-value) for the DEGs in keratinocytes between Ctl and L as Figure 1-figure supplement 1D. The log2FC values for SEMA4A in keratinocytes, dendritic cells, and macrophages were -0.07, 0.00, and -0.05, respectively. Although the log2FC is low in keratinocytes, the adjusted p-value (padj) for SEMA4A is 2.83×10⁻³⁹, indicating a statistically significant difference.

      Page 8 Line 111 in the resubmitted manuscript:

      “The adjusted p-value (padj) for SEMA4A in keratinocytes between Ctl and L was 2.83×10⁻³⁹, indicating a statistically significant difference despite not being visually prominent in the volcano plot, which shows comprehensive differential gene expression in keratinocytes (Figure 1C; Figure 1-figure supplement 1D).”

      Page 54: In the Figure legend of Figure 1-figure supplement 1D in the resubmitted manuscript:

      “(D) The volcano plot displays changes in gene expression in psoriatic L compared to Ctl.”

      Page 30 Line 481 in the resubmitted manuscript: In the “Data processing of single-cell RNA-sequencing and bulk RNA-sequencing” section.

      “The data was integrated into an h5ad file, which can be visualized in Cellxgene VIP (K. Li et al., 2022). We then performed differential analysis between two groups of cells to identify differential expressed genes using Welch’s t-test. Multiple comparisons were controlled using the Benjamini-Hochberg procedure, with the false discovery rate set at 0.05 and significance defined as padj < 0.05.”
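
      For readers unfamiliar with the correction step quoted above, the following is a minimal sketch of the Benjamini-Hochberg adjustment applied to a list of raw p-values. It is not the Cellxgene VIP implementation; the function names and the example p-values are hypothetical and chosen only for illustration, and the raw p-values are assumed to come from a per-gene test such as Welch's t-test.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant(pvalues, fdr=0.05):
    """Flag tests whose adjusted p-value falls below the chosen FDR."""
    return [padj < fdr for padj in benjamini_hochberg(pvalues)]


# Hypothetical raw p-values for four genes (illustrative only, not study data).
raw = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(raw)
flags = significant(raw)
```

      With significance defined as padj < 0.05, a gene with a very small log2FC can still be flagged when its raw p-value is extremely small, which is the situation described for SEMA4A in keratinocytes.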

      Figure 2B

      The results narrative notes WT->WT is comparable to KO->WT. No statistics are given for this comparison. It appears the difference is less than the other comparisons, but still may be significant. Also, in the supplemental for Figure 2B, there appear to be missing columns for the 4 BM chimera groups (columns for WT and KO, but not 4 columns for each donor: recipient pair).

      We sincerely apologize for any confusion. We presented the results of the chimeric mice in Figure 3, and Figure 3-source data 1 shows the 4 BM chimera groups. In Figure 3B, the p-value for the comparison between WT->WT mice and KO->WT mice was 0.7988, as indicated in Figure 3-source data 1.

      Figure 3B

      While ear skin is not easily obtainable at day 0 for comparison, why not also include back skin at Wk 8? If the back skin epidermis is thicker like the ear skin, it supports the ear skin conclusion and adds a more consistent comparison. If the back skin epidermis is not thicker, what would be the authors' explanation as to why only ear skin epidermis is thicker in KO mice at 8 weeks?

      We appreciate and completely agree with the reviewer’s insightful comment. We have added images and dot plots of the back skin at Week 8 in Figure 4B. Since the back skin epidermis is thicker, similar to the ear skin, these results support the conclusion drawn from the ear skin data. Regarding Figure 4C, which shows the expression of Sema4a in the epidermis and dermis of 8-week-old WT mouse ear, we have modified the sentence in the manuscript to ‘the epidermis of WT ear at Week 8’ for clarification.

      Page 12 Line 180 in the resubmitted manuscript:

      “While epidermal thickness of back skin was comparable at birth (Figure 4B), on week 8, epidermis of Sema4AKO back and ear skin was notably thicker than that of WT mice (Figure 4B), suggesting that acanthosis in Sema4AKO mice is accentuated post-birth.”

      Page 47: In the Figure legend of Figure 4B in the resubmitted manuscript:

      “(B) Left: representative Hematoxylin and eosin staining of Day 0 back and Wk 8 back and ear. Scale bar = 50 μm. Right: Epi and Derm thickness in Day 0 back (n = 5) and Wk 8 back (n = 5) and ear (n = 8).”

      Figures 3C&D, Figures 4 D-F

      The figures might be easier to read if some of the data is moved to supplemental, especially in Figure 4, which has 36 panels just in D-F. Conversely, the dLN data is important in establishing the skin microenvironment as important in the accumulation of γδ cells and IL-17 production in the setting of Sema4a KO, so this might be more impactful if moved to the main figure.

      We appreciate and agree with your comments. As recommended, we have moved data from Figure 3C and 4D-F to the supplemental section. The dLN data have been moved to the main figure as Figure 4E. This has improved the readability of the figures.

      Figure 5 and Figure 6 might work better if combined. The differences in keratinocytes in psoriasis are well-known, so the novelty is how Sema4a KO skin appears to share similar differences. This would be easier to see if compared side-by-side in the same figure. Also, there is an opportunity to show this more rigorously by performing RNA-seq on WT vs Sema4a KO skin. Showing a larger set of DEGs that trend similarly between Ctl/NL psoriasis and WT/Sema4a KO skin in a heatmap would bolster the conclusion that Sema4a deficiency contributes to a psoriasis-like skin defect.

      We appreciate your valuable suggestion. Following your recommendation, we have combined Figures 5 and 6 to facilitate a side-by-side comparison. This highlights the similarities between Sema4AKO skin and psoriasis, making it easier to observe differences in keratinocytes. Additionally, we performed RNA-seq on WT and Sema4AKO epidermis (n = 3 per group). We analyzed the raw count data using iDEP 2.0 (Ge S.X., BMC Bioinformatics, 2018), setting the minimal counts per million to 0.5 in at least one library. Differential gene expression analysis was conducted using DESeq2, with an FDR cutoff of 0.1 and a minimum fold change of 2. As a result, we identified 46 upregulated and 70 downregulated genes in Sema4AKO mice compared to WT mice (see the volcano plot and heat map). However, except for S100a8, we did not observe significant expression changes in non-lesional psoriasis-related genes between WT and Sema4AKO mice. In the future, we aim to identify subtle stimuli that could cause gene expression changes between these groups, and we would like to perform additional RNA-seq experiments.
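
      The filtering and thresholding rules stated above can be illustrated with a minimal sketch. It does not reproduce the DESeq2 model or the iDEP pipeline; it only encodes the three stated thresholds (0.5 counts per million in at least one library, FDR cutoff of 0.1, minimum fold change of 2). All function names, library sizes, and counts are hypothetical, and the assumption that a gene is called only when padj is strictly below the cutoff is ours.

```python
import math


def counts_per_million(count, library_size):
    """Convert a raw read count to counts per million (CPM)."""
    return count * 1_000_000 / library_size


def passes_cpm_filter(gene_counts, library_sizes, min_cpm=0.5, min_libraries=1):
    """Keep a gene if its CPM reaches min_cpm in at least min_libraries libraries."""
    n_pass = sum(
        counts_per_million(c, size) >= min_cpm
        for c, size in zip(gene_counts, library_sizes)
    )
    return n_pass >= min_libraries


def classify_gene(log2_fold_change, padj, fdr_cutoff=0.1, min_fold_change=2.0):
    """Label a gene 'up', 'down', or 'unchanged' from its log2FC and padj."""
    too_small = abs(log2_fold_change) < math.log2(min_fold_change)
    if padj >= fdr_cutoff or too_small:
        return "unchanged"
    return "up" if log2_fold_change > 0 else "down"


# Hypothetical data: six libraries (3 WT, 3 KO) of 20 million reads each.
library_sizes = [20_000_000] * 6
kept = passes_cpm_filter([3, 0, 12, 5, 0, 1], library_sizes)     # max CPM = 0.6
dropped = passes_cpm_filter([1, 2, 0, 3, 4, 5], library_sizes)   # max CPM = 0.25
labels = [
    classify_gene(1.5, 0.03),    # passes both thresholds, positive change
    classify_gene(-1.2, 0.01),   # passes both thresholds, negative change
    classify_gene(0.8, 0.001),   # fold change below 2
    classify_gene(-2.0, 0.2),    # padj above the FDR cutoff
]
```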

      Author response image 1.

      Author response image 2.

      Page 48: The Figure title of Figure 5 in the resubmitted manuscript:

      “Figure 5: Sema4AKO skin shares the features of human psoriatic NL.”

      SEMA4A is not significantly DE between Ctl and NL in the psoriasis RNA-seq data. If a lower expression of SEMA4A in psoriasis skin is a driving part of the phenotype, why is this not observed in the RNA-seq data? Presumably, this could be explained by infiltration of immune cells with increased SEMA4A expression, like in the scRNA-seq data in Figure 1. If so, might it be useful to analyze WT->KO chimera mice similarly to global KO mice in Figures 6-8? This might more accurately reflect what is happening in psoriasis, if epidermal SEMA4A expression is low, but immune expression is not. The KO data on their own nicely show a skin phenotype, but these additional experiments might more closely mimic psoriatic disease and increase the rigor and impact of the study.

      We really appreciate your insightful comments. Due to the limitations of the animal experimentation facility, we regret that we are unable to create additional chimeric mice. Although our analysis is limited, we compared IL-17A production from T cells of WT→KO mice and KO→KO mice following 4 consecutive days of IMQ treatment using flow cytometry (see Author response image 3 below; n = 6 for WT→KO, n = 4 for KO→KO). This comparison revealed that IL-17A production from T cells was comparable, regardless of whether they were derived from WT or Sema4AKO mice, when the skin constituent cells were derived from Sema4AKO. We appreciate the value of your advice, and agree that investigating keratinocyte differentiation and mTOR signaling in the epidermis, using either WT→KO chimeric mice or keratinocyte-specific Sema4A-deficient mice, is a crucial next step in our research.

      Author response image 3.

      Figure 8

      Rapamycin was able to partially reverse the psoriasis-like skin phenotype in Sema4a KO mice. Would rapamycin also be effective in the more severe disease induced by IMQ in Sema4a KO mice? While partially reducing the effect of Sema4a KO on steady-state skin with rapamycin strengthens the link to mTOR dysregulation, it did not change skin thickness. It's unclear if this would be useful clinically for patients with well-controlled psoriasis (NL skin). Would it be useful to reverse active, lesional psoriatic skin changes? Testing this might yield results more relevant to clinicians and patients.

      We are grateful for your valuable feedback. Rapamycin was effective in reducing epidermal thickness in a murine psoriasis model induced by IMQ in Sema4AKO mice. Rapamycin treatment also downregulated the expression of Krt10, Krt14, and Krt16. We added these results to Figure 7-figure supplement 2. These results suggest potential clinical relevance for treating active, lesional psoriatic skin changes and may be of interest to clinicians and patients.

      Page 17 Line 269 in the resubmitted manuscript:

      “Next, we investigated whether intraperitoneal rapamycin treatment effectively downregulates inflammation in the IMQ-induced murine model of psoriasis in Sema4AKO mice (Figure 7-figure supplement 2A). Rapamycin significantly reduced epidermal thickness compared to vehicle treatment (Figure 7-figure supplement 2B). Additionally, rapamycin treatment downregulated the expression of Krt10, Krt14, and Krt16 (Figure 7-figure supplement 2C). While the upregulation of Il17a in the Sema4AKO epidermis in IMQ model was not clearly modified by rapamycin (Figure 7-figure supplement 2C), immunofluorescence revealed a decrease in the number of CD3 T cells in Sema4AKO epidermis by rapamycin (Figure 7-figure supplement 2D). In the naive states, mTORC1 primarily regulates keratinocyte proliferation, whereas mTORC2 mainly involved in the keratinocyte differentiation through Sema4A-related signaling pathways. Conversely, in the psoriatic dermatitis state, rapamycin downregulated both keratinocyte differentiation and proliferation markers. The observed similarities in Il17a expression following treatment with rapamycin and JR-AB2-011, regardless of additional IMQ treatment, suggest that Il17a production is not significantly dependent on Sema4A-related mTOR signaling.”

      Page 29 Line 461 in the resubmitted manuscript: In the “Inhibition of mTOR” section.

      “To analyze the preventive effectiveness of rapamycin in an IMQ-induced murine model of psoriatic dermatitis, Sema4AKO mice were administered either vehicle or rapamycin intraperitoneally from Day 0 to Day 17, and IMQ was topically applied to both ears for 4 days starting on Day 14. Then, on Day 18, ears were collected for further analysis.”

      Page 71: Figure 7-figure supplement 2 in the resubmitted manuscript:

      “Figure 7-figure supplement 2: Rapamycin treatment reduced the epidermal swelling observed in IMQ-treated Sema4AKO mice.

      (A) Experimental scheme. (B) The Epi thickness on Day 18. (n = 10 for Ctl, n = 12 for Rapamycin). (C) Relative expression of keratinocyte differentiation markers and Il17a in Sema4AKO Epi (n = 10 for Ctl, n = 12 for Rapamycin). (D) The number of T cells in the Epi (left) and Derm (right), under Ctl or rapamycin and IMQ treatments (n = 10 for Ctl, n = 12 for Rapamycin). Each dot represents the sum of numbers from 10 unit areas across 3 specimens. A-C: *p < 0.05, **p < 0.01. NS, not significant.”

      Reviewer #2 (Recommendations For The Authors):

      (1) To know whether the decrease of Sema4A in the epidermis of psoriasis patients is a result or a cause of psoriasis, it is necessary to show how the expression of Sema4A in epidermal cells is regulated. Shouldn't the degree of change in the expression of essential molecules (which is the cause of psoriasis) be more pronounced in L than in NL?

      We surveyed transcription factors of human Sema4A using GeneCards and found that NF-κB is the transcription factor most frequently associated with psoriasis. Wang et al. (Arthritis Res Ther. 2015) indicated NF-κB-dependent modulation of Sema4A expression in synovial fibroblasts of rheumatoid arthritis. However, since NF-κB expression is reportedly upregulated in psoriasis lesions, other transcription factors may function as key modulators of Sema4A expression in the epidermis.

      Although the molecules causing psoriasis remain to be elucidated, we investigated the correlation between the expression of psoriasis-related essential molecules in keratinocytes—such as S100A7A, S100A7, S100A8, S100A9, and S100A12—and SEMA4A expression in L and NL samples using qRT-PCR. We could not identify a correlation between these molecules and SEMA4A expression. We added a note to the limitations section to acknowledge that we were not able to reveal how Sema4A expression is regulated and that we could not determine the relationships between Sema4A expression and the essential molecules upregulated in psoriatic keratinocytes.

      Page 21 Line 328 in the resubmitted manuscript:

      “We were not able to reveal how Sema4A expression is regulated. Although we showed that downregulation of Sema4A is related to the abnormal cytokeratin expression observed in psoriasis, we could not determine the relationships between Sema4A expression and the essential molecules upregulated in psoriatic keratinocytes.”

      (2) Using bone marrow chimeric mice, it has already been reported that hematopoietic cells contain keratinocyte stem cells. Therefore, their interpretation is not supported by the results of their bone marrow chimeric mice experiment, and it is essential to generate keratinocyte-specific Sema4A knockout mice and perform similar experiments to support their interpretation.

      We value the reviewer’s insightful comment. We have assessed the expression of Sema4a in the epidermis of WT→KO chimeric mice using qRT-PCR. Our findings indicate that Sema4a expression levels in the epidermis of these mice are minimal (cycle threshold values of Sema4a ranged from 31.9 to not detected in WT→KO chimeric mice, whereas they ranged from 24.5 to 26.2 in WT→WT mice). Consequently, we believe that the impact of keratinocyte stem cells derived from WT-hematopoietic cells is limited in this model. We appreciate this opportunity to clarify our results and will consider the generation of keratinocyte-specific Sema4A knockout mice for future experiments to further substantiate our interpretation.

      Page 11 Line 159 in the resubmitted manuscript:

      “Since it has already been reported that bone marrow cells contain keratinocyte stem cells (Harris et al., 2004; Wu, Zhao, & Tredget, 2010), we confirmed that epidermis of mice deficient in non-hematopoietic Sema4A (WT→KO) showed no obvious detection of Sema4a, thereby ruling out the impact of donor-derived keratinocyte stem cells infiltrating the host epidermis (Figure 3-figure supplement 1A).”

      Page 60: In the Figure legend of Figure 3-figure supplement 1A in the resubmitted manuscript:

      “(A) Sema4a expression in the Epi of WT→ WT mice and WT→ KO mice (n = 8 for WT→ WT, n = 7 for WT→ KO).”

      (3) Since Sema4A KO mice already have immunological and epidermal cell characteristics similar to psoriasis, albeit weak, it is possible that the nonspecific stimulus of simply topical IMQ may have appeared to exacerbate psoriasis. It is advisable to confirm whether a more psoriasis-specific stimulus, IL-23 administration, would produce similar results.

      Thank you for your suggestion. Following your advice, we have analyzed IL-23-mediated psoriasis-like dermatitis. To induce the model, 20 μl of phosphate-buffered saline containing 500 ng of recombinant mouse IL-23 was injected intradermally into both ears for 4 consecutive days. Unlike with the application of IMQ, there was no significant difference in ear thickness. However, H&E staining revealed that the epidermal thickness was significantly greater in KO mice compared to WT mice. Although a longer period of IL-23 induction might result in more pronounced ear swelling, we conducted this experiment over the same duration as the IMQ application experiment to maintain consistency. When we analyzed the T cells infiltrating the ears using flow cytometry, the proportion of IL-17A producing Vγ2 and DNγδ T cells in CD3 fraction in the epidermis was significantly higher in Sema4A KO mice, consistent with the results from IMQ-induced psoriasis-like dermatitis.

      The lack of significant difference in ear thickness changes with IL-23 administration might be due to IL-23 administration not reflecting upstream events of IL-23 production.

      We consider that in psoriasis, the expression of Sema4A in keratinocytes is likely more important than in T cells. Therefore, it makes sense that the phenotype difference was more pronounced with IMQ, which likely has a greater effect on keratinocytes compared to IL-23.

      Page 9 Line 137 in the resubmitted manuscript:

      “Though the imiquimod model is well-established and valuable murine psoriatic model (van der Fits et al., 2009), the vehicle of imiquimod cream can activate skin inflammation that is independent of toll-like receptor 7, such as inflammasome activation, keratinocyte death and interleukin-1 production (Walter et al., 2013). This suggests that the imiquimod model involves complex pathway. Therefore, we subsequently induced IL-23-mediated psoriasis-like dermatitis (Figure2-figure supplement 2A), a much simpler murine psoriatic model, because IL-23 is thought to play a central role in psoriasis pathogenesis (Krueger et al., 2007; Lee et al., 2004). Although ear swelling on day 4 was comparable between WT mice and Sema4AKO mice (Figure2-figure supplement 2B), the epidermis, but not the dermis, was significantly thicker in Sema4AKO mice compared to WT mice (Figure2-figure supplement 2C). We found that the proportion of CD4 T cells among T cells was significantly higher in Sema4A KO mice compared to WT mice, while the proportion of Vγ2 and DNγδ T cells among T cells was comparable between them (Figure 2-figure supplement 2D). On the other hand, focusing on IL-17A-producing cells, the proportion of IL-17A-producing Vγ2 and DNγδ T cells in CD3 fraction in the epidermis was significantly higher in Sema4A KO mice, consistent with the results from imiquimod-induced psoriasis-like dermatitis. (Figure 2-figure supplement 2E).”

      Page 24 Line 363 in the resubmitted manuscript: In the “Mice” section.

      “To induce IL-23-mediated psoriasis-like dermatitis, 20 μl of phosphate-buffered saline containing 500 ng of recombinant mouse IL-23 (BioLegend, San Diego, CA) was injected intradermally into both ears of anesthetized mice using a 29-gauge needle for 4 consecutive days.”

      Page 58: In the Figure legend of Figure 2-figure supplement 2 in the resubmitted manuscript:

      “IL-23-mediated psoriasis-like dermatitis is augmented in Sema4AKO mice.

      (A) An experimental scheme involved intradermally injecting 20 μl of phosphate-buffered saline containing 500 ng of recombinant mouse IL-23 into both ears of WT mice and KO mice for 4 consecutive days. Samples for following analysis were collected on Day 4. (B and C) Ear thickness (B) and Epi and Derm thickness (C) of WT mice and KO mice on Day 4 (n = 12 per group). (D and E) The percentages of Vγ3, Vγ2, DNγδ, CD4, and CD8 T cells (D) and those with IL-17A production (E) in CD3 fraction in the Epi (top) and Derm (bottom) of WT and KO ears (n = 5 per group). Each dot represents the average of 4 ear specimens. B-E: *p < 0.05, **p < 0.01. NS, not significant.”

      (4) How is STAT3 expression in the epidermis crucial in the pathogenesis of psoriasis in Sema4AKO mice?

      We appreciate your insightful comment. In our study, given the established role of activated STAT3 in psoriasis, we investigated both total STAT3 and phosphorylated STAT3 (p-STAT3) levels in the naive epidermis of WT and Sema4AKO mice (See the figure below). Our findings indicate that STAT3 activation does not occur in the epidermis of Sema4AKO mice. Therefore, we speculated that the hyperkeratosis observed in Sema4AKO mice is due to aberrant mTOR signaling rather than STAT3 activation. STAT3 may be relevant to other pathways independent of Sema4A signaling, or it may function as a complex with other molecules in the Sema4A signaling.

      Author response image 4.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      There is a long-standing idea that choices influence evaluation: options we choose are re-evaluated to be better than they were before the choice. There has been some debate about this finding, and the authors developed several novel methods for detecting these re-evaluations in task designs where options are repeatedly presented against several alternatives. Using these novel methods the authors clearly demonstrate this re-evaluation phenomenon in several existing datasets.

      Strengths:

      The paper is well-written and the figures are clear. The authors provided evidence for the behaviour effect using several techniques and generated surrogate data (where the ground truth is known) to demonstrate the robustness of their methods.

      Weaknesses:

      The description of the results of the fMRI analysis in the text is not complete: weakening the claim that their re-evaluation algorithm better reveals neural valuation processes.

      We appreciate the reviewer’s comment regarding the incomplete account of the fMRI results. In response, we implemented Reviewer #2's suggestion to run additional GLM models for a clearer interpretation of our findings. We also took this opportunity to apply updated preprocessing to the fMRI data and revise the GLM models, making them both simpler and more comprehensive. The results section is thus substantially revised, now including a new main figure and several supplemental figures that more clearly present our fMRI findings. Additionally, we have uploaded the statistical maps to NeuroVault, allowing readers to explore the full maps interactively rather than relying solely on the static images in the paper. The new analyses strengthen our original conclusion: dynamic values (previously called revalued values and renamed following the reviewer’s suggestion) better explain BOLD activity in the ventromedial prefrontal cortex, a region consistently associated with valuation, than static values (values reported prior to the choice phase in the auction procedure).

      Reviewer #2 (Public Review):

      Summary:

      Zylberberg and colleagues show that food choice outcomes and BOLD signal in the vmPFC are better explained by algorithms that update subjective values during the sequence of choices compared to algorithms based on static values acquired before the decision phase. This study presents a valuable means of reducing the apparent stochasticity of choices in common laboratory experiment designs. The evidence supporting the claims of the authors is solid, although currently limited to choices between food items because no other goods were examined. The work will be of interest to researchers examining decision-making across various social and biological sciences.

      Strengths:

      The paper analyses multiple food choice datasets to check the robustness of its findings in that domain.

      The paper presents simulations and robustness checks to back up its core claims.

      Weaknesses:

      To avoid potential misunderstandings of their work, I think it would be useful for the authors to clarify their statements and implications regarding the utility of item ratings/bids (e-values) in explaining choice behavior. Currently, the paper emphasizes that e-values have limited power to predict choices without explicitly stating the likely reason for this limitation given its own results or pointing out that this limitation is not unique to e-values and would apply to choice outcomes or any other preference elicitation measure too. The core of the paper rests on the argument that the subjective values of the food items are not stored as a relatively constant value, but instead are constructed at the time of choice based on the individual's current state. That is, a food's subjective value is a dynamic creation, and any measure of subjective value will become less accurate with time or new inputs (see Figure 3 regarding choice outcomes, for example). The e-values will change with time, choice deliberation, or other experiences to reflect the change in subjective value. Indeed, most previous studies of choice-induced preference change, including those cited in this manuscript, use multiple elicitations of e-values to detect these changes. It is important to clearly state that this paper provides no data on whether e-values are more or less limited than any other measure of eliciting subjective value. Rather, the paper shows that a static estimate of a food's subjective value at a single point in time has limited power to predict future choices. Thus, a more accurate label for the e-values would be static values because stationarity is the key assumption rather than the means by which the values are elicited or inferred.

      Thank you for this helpful comment. We changed the terminology following the reviewer’s suggestion. The “explicit” values (e-values or ve) are now called “static” values (s-values or vs). Accordingly, we also changed the “Reval” values (r-values or vr) to “dynamic” values (d-values or vd).

      We also address the reviewer's more general point about the utility of item ratings/bids (s-values) and whether our results are likely to hold with other ways of eliciting subjective values. We added a new sub-section in Discussion addressing this and other limitations of our study. To address the reviewer’s point, we write:

      “One limitation of our study is that we only examined tasks in which static values were elicited from explicit reports of the value of food items. It remains to be determined if other ways of eliciting subjective values (e.g., Jensen and Miller, 2010) would lead to similar results. We think so, as the analysis of trials with identical item pairs (Fig. 3) and the difference between forward and backward Reval (Fig. 7) are inconsistent with the notion that values are static, regardless of their precise value. It also remains to be determined if our results will generalize to non-food items whose value is less sensitive to satiety and other dynamic bodily states. Perceptual decisions also exhibit sequential dependencies, and it remains to be explored whether these can be explained as a process of value construction, similar to what we propose here for the food-choice task (Gupta et al., 2024; Cho et al., 2002; Zylberberg et al., 2018; Abrahamyan et al., 2016).”

      There is a puzzling discrepancy between the fits of a DDM using e-values in Figure 1 versus Figure 5. In Figure 1, the DDM using e-values provides a rather good fit to the empirical data, while in Figure 5 its match to the same empirical data appears to be substantially worse. I suspect that this is because the value difference on the x-axis in Figure 1 is based on the e-values, while in Figure 5 it is based on the r-values from the Reval algorithm. However, the computation of the value difference measure on the two x-axes is not explicitly described in the figures or methods section and these details should be added to the manuscript. If my guess is correct, then I think it is misleading to plot the DDM fit to e-values against choice and RT curves derived from r-values. Comparing Figures 1 and 5, it seems that changing the axes creates an artificial impression that the DDM using e-values is much worse than the one fit using r-values.

      We agree with the reviewer that this way of presenting the DDM fits could be misleading. In the previous version of the manuscript, we included the two fits in the same figure panel to make it clear that the sensitivity (slope) of the choice function is greater when we fit the data using the r-values (now d-values) than when we fit them using the e-values (now s-values). In the revised version of Figure 5, we include the data points already shown in Figure 1, so that each DDM fit is shown with its corresponding data points. Thus we avoid giving the false impression that the DDM fit using the s-values is much worse than the one fit using the d-values. That said, the fit is indeed worse, as we now show with the formal model comparison suggested by the reviewer (next comment).

      Relatedly, do model comparison metrics favor a DDM using r-values over one using e-values in any of the datasets tested? Such tests, which use the full distribution of response times without dividing the continuum of decision difficulty into arbitrary hard and easy bins, would be more convincing than the tests of RT differences between the categorical divisions of hard versus easy.

      We now include the model comparison suggested by the reviewer. The comparison shows that the DDM model using dynamic values explains the choice and response time data better than one using static values. One potential caveat of this comparison, which explains why we did not include it in the original version of the manuscript, is that the d-values are obtained from a fit to the choice data, which could bias the subsequent DDM comparison. We control for this in three ways: (1) by calculating the difference in Bayesian Information Criterion (BIC) between the models, penalizing the DDM model that uses the d-values for the additional parameter (δ); (2) by comparing the difference in BIC against simulations of a model in which the choice and RT data were obtained assuming static values; this analysis shows that if values were static, the DDM using static values would be favored in the comparison despite having one fewer parameter; (3) ignoring the DDM fit to the choices in the model comparison, and just comparing how well the two models explain the RTs; this comparison is unbiased because the δ values are fit only to the choice data, not the RTs. These analyses are now included in Figure 5 and Figure 5–Figure supplement 2.
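
      The BIC penalty for the extra parameter δ, as used in control (1) above, can be illustrated with a short sketch. It uses the standard definition BIC = k·ln(n) − 2·ln(L); the log-likelihoods, parameter counts, and trial count below are hypothetical and are not taken from the study.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def delta_bic(ll_static, k_static, ll_dynamic, k_dynamic, n_obs):
    """BIC(static) - BIC(dynamic); positive values favor the dynamic-value model."""
    return bic(ll_static, k_static, n_obs) - bic(ll_dynamic, k_dynamic, n_obs)


# Hypothetical fits for one participant with 210 trials.
# The dynamic-value DDM carries one extra parameter (delta).
n_trials = 210
favoring_dynamic = delta_bic(
    ll_static=-520.0, k_static=5, ll_dynamic=-500.0, k_dynamic=6, n_obs=n_trials
)
# With identical likelihoods, the extra parameter alone penalizes the dynamic model.
penalty_only = delta_bic(
    ll_static=-520.0, k_static=5, ll_dynamic=-520.0, k_dynamic=6, n_obs=n_trials
)
```

      In the second call the difference equals −ln(210), which shows that the dynamic-value model must improve the likelihood by more than the penalty to be favored.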

      Revaluation and reduction in the imprecision of subjective value representations during (or after) a choice are not mutually exclusive. The fact that applying Reval in the forward trial order leads to lower deviance than applying it in the backwards order (Figure 7) suggests that revaluation does occur. It doesn't tell us if there is also a reduction in imprecision. A comparison of backwards Reval versus no Reval would indicate whether there is a reduction in imprecision in addition to revaluation. Model comparison metrics and plots of the deviance from the logistic regression fit using e-values against backward and forward Reval models would be useful to show the relative improvement for both forms of Reval.

      We agree with the reviewer that the occurrence of revaluation does not preclude other factors from affecting valuation. Following the reviewer’s suggestion we added a panel to Figure 6 (new panel B), in which we show the change in the deviance from the logistic regression fits between Reval (forward direction) and no-Reval. The figure clearly shows that the difference in deviance for the data is much larger than that obtained from simulations of choice data generated from the logistic fits to the static values (shown in red).

      Interestingly, we also observe that the deviance obtained after applying Reval in the backward direction is lower than that obtained using the s-values. We added a panel to Figure 7 showing this (Fig. 7B). This observation, however, does not imply that there are factors affecting valuation besides revaluation (e.g., “reduction in imprecision”). Indeed, as we now show in a new panel in Figure 11 (panel F), the same effect (lower deviance for backward Reval than no-Reval) is observed in simulations of the ceDDM.

      Besides the new figure panels (Fig. 6B, 7B, 11F), we mention in Discussion (new subsection, “Limitations...”, paragraph #2) the possibility that there are other non-dynamic contributions to the reduction in deviance for Backward Reval compared to no-Reval:

      “Another limitation of our study is that, in one of the datasets we analyzed (Sepulveda et al. 2020), applying Reval in the forward direction was no better than applying it in the backward direction (Fig. 10). We speculate that this failure is related to idiosyncrasies of the experimental design, in particular, the use of alternating blocks of trials with different instructions (select preferred vs. select non-preferred). More importantly, Reval applied in the backward direction led to a significant reduction in deviance relative to that obtained using the static values. This reduction was also observed in the ceDDM, suggesting that the effect may be explained by the changes in valuation during deliberation. However, we cannot discard a contribution from other, non-dynamic changes in valuation between the rating and choice phase including contextual effects (Lichtenstein and Slovic, 2006), stochastic variability in explicit value reporting (Polania et al., 2019), and the limited range of numerical scales used to report value.”
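
      For readers unfamiliar with the deviance measure used in these comparisons, the following minimal sketch computes the deviance of a logistic choice model for given coefficients. It is not the fitting code used in the study: the coefficients are supplied by hand rather than estimated, and the choices and value differences are hypothetical.

```python
import math


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def binomial_deviance(choices, value_differences, beta0, beta1):
    """Deviance (-2 * log-likelihood) of a logistic choice model.

    choices: 1 if the first item was chosen, 0 otherwise.
    value_differences: value of the first item minus value of the second.
    """
    log_lik = 0.0
    for y, dv in zip(choices, value_differences):
        p = logistic(beta0 + beta1 * dv)
        log_lik += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -2.0 * log_lik


# Hypothetical data: four trials.
choices = [1, 0, 1, 1]
value_differences = [1.0, -0.5, 0.2, 2.0]

# Values carrying no information about choice (slope 0) versus informative values.
deviance_flat = binomial_deviance(choices, value_differences, beta0=0.0, beta1=0.0)
deviance_informative = binomial_deviance(choices, value_differences, beta0=0.0, beta1=2.0)
```

      A lower deviance means that the values entering the model account for the observed choices better, which is the sense in which forward Reval, backward Reval, and static values are compared.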

      Did the analyses of BOLD activity shown in Figure 9 orthogonalize between the various e-value- and r-value-based regressors? I assume they were not because the idea was to let the two types of regressors compete for variance, but orthogonalization is common in fMRI analyses so it would be good to clarify that this was not used in this case. Assuming no orthogonalization, the unique variance for the r-value of the chosen option in a model that also includes the e-value of the chosen option is the delta term that distinguishes the r and e-values. The delta term is a scaled count of how often the food item was chosen and rejected in previous trials. It would be useful to know if the vmPFC BOLD activity correlates directly with this count or the entire r-value (e-value + delta). That is easily tested using two additional models that include only the r-value or only the delta term for each trial.

      We did not orthogonalize the static value and dynamic value regressors. We have included this detail in the revised methods. We thank the reviewer for the suggestion to run additional models to improve our ability to interpret our findings. We have substantially revised all fMRI-related sections of the paper. We took this opportunity to apply standardized and reproducible preprocessing steps implemented in fmriprep, present whole-brain corrected maps on a reconstructed surface of a template brain, and include links to the full statistical maps so that the reader can navigate the full map rather than rely on the static image in the figures. We implemented four models in total: model 1 includes both static value (Vs) obtained during the auction procedure prior to the choice phase and dynamic value (Vd) output by the revaluation algorithm (similar to the model presented in the first submission); model 2 includes only delta = Vd - Vs; model 3 includes only Vs; model 4 includes only Vd. All models included the same confound and nuisance regressors. We found that Vd was positively related to BOLD in vmPFC when accounting for Vs, correcting for familywise error rate at the whole brain level. Interestingly, the relationship between delta and vmPFC BOLD did not survive whole-brain correction. In addition, the effect size of the relationship between Vd and vmPFC BOLD in model 4 was larger than that of the relationship between Vs and vmPFC BOLD in model 3, and the effect in model 4 survived correction at the whole brain level, encompassing more of the vmPFC. Together, these findings bolster our claim that Vd better accounts for BOLD variability in vmPFC, a brain region reliably linked to valuation.

      Please confirm that the correlation coefficients shown in Figure 11 B are autocorrelations in the MCMC chains at various lags. If this interpretation is incorrect, please give more detail on how these coefficients were computed and what they represent.

      We added a paragraph in Methods explaining how we compute the correlations in Figure 11B (last paragraph of the sub-section “Correlated-evidence DDM” in Methods):

      “The correlations in Fig. 11B were generated using the best-fitting parameters for each participant to simulate 100,000 Markov chains. We generate Markov chain samples independently for the left and right items over a 1-second period. To illustrate noise correlations, the simulations assume that the static value of both the left and right items is zero. We then calculate the difference in dynamic value (𝑥) between the left and right items at each time (𝑡) and for each of the Markov chains (𝑖). Pearson's correlation is computed between these differences at time zero, 𝑥𝑖(𝑡 = 0), and at time 𝑥𝑖(𝑡 = τ), for different time lags τ. Correlations were calculated independently for each participant. Each trace in Fig. 11B represents a different participant.”
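
      As a rough illustration of the lag-correlation computation described in this paragraph, the following Python sketch simulates pairs of chains and correlates the left-minus-right difference at time zero with the difference at later lags. It is not the ceDDM: a first-order autoregressive process with a hypothetical persistence parameter `a` stands in for the MCMC value process, and the names `simulate_chain`, `pearson`, and `lag_corr`, as well as the number of chains and steps, are assumptions made only for this example.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def simulate_chain(n_steps, a, rng):
    """Stand-in value process (assumption): stationary AR(1) with unit variance."""
    x = rng.gauss(0.0, 1.0)
    samples = [x]
    for _ in range(n_steps):
        x = a * x + math.sqrt(1.0 - a * a) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


rng = random.Random(1)
n_chains, n_steps, a = 5000, 20, 0.9  # illustrative values, not fitted parameters

# For each chain i, x_i(t) is the difference between independently simulated
# left and right item values; the static value of both items is taken as zero.
diffs = []
for _ in range(n_chains):
    left = simulate_chain(n_steps, a, rng)
    right = simulate_chain(n_steps, a, rng)
    diffs.append([l - r for l, r in zip(left, right)])

# Correlate x_i(t = 0) with x_i(t = lag) across chains, for each lag.
x0 = [d[0] for d in diffs]
lag_corr = [pearson(x0, [d[lag] for d in diffs]) for lag in range(n_steps + 1)]
```

      For this stand-in process the correlation at lag k is expected to decay as `a ** k`; the shape of the traces in Fig. 11B depends on the fitted ceDDM parameters and is not reproduced by this sketch.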

      The paper presents the ceDDM as a proof-of-principle type model that can reproduce certain features of the empirical data. There are other plausible modifications to bounded evidence accumulation (BEA) models that may also reproduce these features as well or better than the ceDDM. For example, a DDM in which the starting point bias is a function of how often the two items were chosen or rejected in previous trials. My point is not that I think other BEA models would be better than the ceDDM, but rather that we don't know because the tests have not been run. Naturally, no paper can test all potential models and I am not suggesting that this paper should compare the ceDDM to other BEA processes. However, it should clearly state what we can and cannot conclude from the results it presents.

      Indeed, the ceDDM should be interpreted as a proof-of-principle model, which shows that drifting values can explain many of our results. It is definitely wrong in the details, and we are open to the possibility that a different way of introducing sequential dependencies between decisions may lead to a better match to the experimental data. We now mention this in a new subsection of Discussion, “Limitations...” paragraph #3:

      “Finally, we emphasize that the ceDDM should be interpreted as a proof-of-principle model used to illustrate how stochastic fluctuations in item desirability can explain many of our results. We chose to model value changes following an MCMC process. However, other stochastic processes or other ways of introducing sequential dependencies (e.g., variability in the starting point of evidence accumulation) may also explain the behavioral observations. Furthermore, there likely are other ways to induce changes in the value of items other than through past decisions. For example, attentional manipulations or other experiences (e.g., actual food consumption) may change one's preference for an item. The current version of the ceDDM does not allow for these influences on value, but we see no fundamental limitation to incorporating them in future instantiations of the model.”

      This work has important practical implications for many studies in the decision sciences that seek to understand how various factors influence choice outcomes. By better accounting for the context-specific nature of value construction, studies can gain more precise estimates of the effects of treatments of interest on decision processes.

      Thank you!

      That said, there are limitations to the generalizability of these findings that should be noted.

      These limitations stem from the fact that the paper only analyzes choices between food items and the outcomes of the choices are not realized until the end of the study (i.e., participants do not eat the chosen item before making the next choice). This creates at least two important limitations. First, preferences over food items may be particularly sensitive to mindsets/bodily states. We don't yet know how large the choice deltas may be for other types of goods whose value is less sensitive to satiety and other dynamic bodily states. Second, the somewhat artificial situation of making numerous choices between different pairs of items without receiving or consuming anything may eliminate potential decreases in the preference for the chosen item that would occur in the wild outside the lab setting. It seems quite probable that in many real-world decisions, the value of a chosen good is reduced in future choices because the individual does not need or want multiples of that item. Naturally, this depends on the durability of the good and the time between choices. A decrease in the value of chosen goods is still an example of dynamic value construction, but I don't see how such a decrease could be produced by the ceDDM.

      These are all great points. The question of how generalizable our results are to other domains is wide open. We do have preliminary evidence suggesting that in a perceptual decision-making task with two relevant dimensions (motion and color; Kang, Loffler et al. eLife 2021), the dimension that was most informative to resolve preference in the past is prioritized in future decisions. We believe that a similar process underlies the apparent change in value in value-based decisions. We decided not to include this experiment in the manuscript, as it would make the paper much longer and the experimental designs are very different. Exploring the question of generality is a matter for future studies.

      We also agree that food consumption is likely to change the value of the items. For example, after eating something salty we are likely to want something to drink. We mention in the revised manuscript that time, choice deliberation, attentional allocation and other experiences (including food consumption) are likely to change the value of the alternatives and thus affect future choices and valuations.

      The ceDDM captures only sequential dependencies that can be attributed to values that undergo diffusion-type changes during deliberation. While the ceDDM captures many of the experimental observations, the value of an item may change for reasons not captured by the ceDDM. For example, food consumption is likely to change the value of items (e.g., wanting something to drink after eating something salty). The reviewer is correct that the current version of ceDDM could not account for these changes in value. However, we see no fundamental limitation to extending the ceDDM to account for them.

      We discuss these issues in a new subsection in Discussion (“Limitations...” paragraph #3).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Summary

      The authors address assumptions of bounded accumulation of evidence for value-based decision-making. They provide convincing evidence that subjects drift in their subjective preferences across time and demonstrate valuable methods to detect these drifts in certain task designs.

      My specific comments are intended to assist the authors with making the paper as clear as possible. My only major concern is with the reporting of the fMRI results.

      Thank you, please see our responses above for a description of the changes we made to the fMRI analyses.

      Specific comments

      - In the intro, I would ask the authors to consider the idea that things like slow drift in vigilance/motivation or faster drifts in spatial attention could also generate serial dependencies in perceptual tasks. I think the argument that these effects are larger in value-based tasks is reasonable, but the authors go a bit too far (in my opinion) arguing that similar effects do not exist *at all* in perceptual decision-making.

      We added a sentence in the Discussion (new section on Limitations, paragraph #1) mentioning some of the literature on sequential dependencies in perceptual tasks and asking whether there might be a common explanation for such dependencies for perceptual and value-based decisions. We tried including this in the Introduction, but we thought it disrupted the flow too much.

      - Figure 1: would it not be more clear to swap the order of panels A and B? Since B comes first in the task?

      We agree; we have swapped the order of panels A and B.

      - Figure 2: the label 'simulations' might be better as 'e-value simulations'

      Yes, we changed the label ‘simulations’ to ‘simulations with s-values’ (we changed the term explicit value to static value, following a suggestion by Reviewer #2).

      - For the results related to Figure 2, some citations related to gaps between "stated versus revealed preferences" seem appropriate.

      We added a few relevant citations where we explain the results related to Figure 2.

      - Figure 3: in addition to a decrease in match preferences over the session, it would be nice to look at other features of the task which might have varied over the session. e.g. were earlier trials more likely to be predicted by e-value?

      We do see a trend in this direction, but the effect is not significant. The following figure shows the consistency of the choices with the stated values, as a function of the |∆value|, for the first half (blue) and the second half (red) of the trials. The x-axis discretizes the absolute value of the difference in static value between the left and right items, binned in 17 bins of approximately equal number of trials.

      Author response image 1.

      The slope is shallower for the second half, but a logistic regression model revealed that the difference is not significant:

      ,

      where Ilate is an indicator variable that takes a value of 1 for the second half of the trials and zero otherwise.

      As expected from the figure, β2 was negative (-0.15), but the effect was not significant (p-value = 0.32, likelihood ratio test).
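
      For readers who want to run a comparison of this kind, the following Python sketch shows one way a likelihood ratio test for an interaction term could be set up. It assumes a logistic regression of choice consistency on |∆value| plus an interaction |∆value| × Ilate, with the interaction coefficient playing the role of β2; this functional form, the helper names `solve` and `fit_logistic`, and the simulated data (generated with arbitrary parameters and no true late-session effect) are all assumptions for illustration and are not taken from our analysis code.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_logistic(X, y, n_iter=30):
    """Logistic regression by Newton's method; returns (coefficients, log-likelihood)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    hess[a][c] += p * (1.0 - p) * xi[a] * xi[c]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    loglik = 0.0
    for xi, yi in zip(X, y):
        z = sum(b * v for b, v in zip(beta, xi))
        loglik += yi * z - math.log1p(math.exp(z))
    return beta, loglik


# Simulated trials (assumption): consistency rises with |dv|, no late effect.
random.seed(0)
n = 2000
rows, y = [], []
for i in range(n):
    dv = random.uniform(0.0, 3.0)
    late = 1.0 if i >= n // 2 else 0.0
    p_consistent = 1.0 / (1.0 + math.exp(-(0.2 + 1.0 * dv)))
    rows.append((dv, late))
    y.append(1.0 if random.random() < p_consistent else 0.0)

X_full = [[1.0, dv, dv * late] for dv, late in rows]   # with interaction term
X_reduced = [[1.0, dv] for dv, _ in rows]              # without interaction term
beta_full, ll_full = fit_logistic(X_full, y)
_, ll_reduced = fit_logistic(X_reduced, y)

# Likelihood ratio statistic; with one extra parameter it is compared to a
# chi-square distribution with 1 degree of freedom, whose upper tail
# probability equals erfc(sqrt(stat / 2)).
lr_stat = 2.0 * (ll_full - ll_reduced)
p_value = math.erfc(math.sqrt(max(lr_stat, 0.0) / 2.0))
```

      The p-value obtained from this simulation has no bearing on the value reported above; the sketch only shows the structure of a nested-model comparison.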

      We feel we do not have much to say about this result, which may be due to lack of statistical power, so we would rather not include this analysis in the revised manuscript.

      It is worth noting that if we repeat the analysis using the dynamic values obtained from Reval instead of the static values, the consistency is overall much greater and little difference is observed between the first and second halves of the experiment:

      Author response image 2.

      - The e-value DDM fit in Figure 1C/D goes through the points pretty well, but the e-value fits in 5A do not because of a mismatch with the axis. The x-axis needs to say whether the value difference is the e-value or the r-value. Also, it seems only fair to plot the DDM for the r-value on a plot with the x-axis being the e-value.

      Thank you for this comment, we have now changed Figure 5A, such that both sets of data points are shown (data grouped by both e-values and by r-values). We agree that the previous version made it seem as if the fits were worse for the DDM fit to the e-values. The fits are indeed worse, as revealed by a new DDM model comparison (Figure 5–Figure supplement 2), but the effect is more subtle than the previous version of the figure implied.

      - How is Figure 5B "model free" empirical support? The fact that the r-value model gives better separation of the RTs on easy and hard trials doesn't seem "model-free" and also it isn't clear how this directly relates to being a better model. It seems that just showing a box-plot of the R2 for the RT of the two models would be better?

      We agree that “model free” may not be the best expression, since the r-values (now d-values) are derived from a model (Reval). Our intention was to make clear that because Reval only depends on the choices, the relationship between RT and ∆vdynamic is a prediction. We no longer use the term, model free, in the caption. We tried to clarify the point in Results, where we explain this figure panel. We have also included a new model comparison (Figure 5–Figure supplement 2), showing that the DDM model fit to the d-values explains choice and RT better than one fit to the s-values.

      This said, we do consider the separation in RTs between easy and hard trials to be a valid metric to compare the accuracy of the static and dynamic values. The key assumption is that there is a monotonically decreasing relationship between value difference, ∆v, and response time. The monotonic relationship does not need to hold for individual trials (due to the noisiness of the RTs) but should hold if one were to average a large enough number of trials for each value of ∆v.

      Under this assumption, the more truthful a value representation is (i.e., the closer the value we infer is to the true subjective value of the item on a given trial, assuming one exists), the greater the difference in RTs between trials judged to be difficult and those considered easy. To illustrate this with an extreme case, if an experimenter’s valuation of the items is very inaccurate (e.g., done randomly), then on average there will be no difference between easy and difficult RTs as determined by this scoring.
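
      The argument above can be illustrated with a small simulation. The Python sketch below assumes a hypothetical monotonically decreasing relationship between |∆v| and mean RT, and compares the RT separation obtained when trials are classified as easy or hard using an accurate value estimate versus a random one. The function `rt_separation`, the RT function, and all noise levels are assumptions chosen for illustration; they are not fitted to our data.

```python
import random
import statistics


def rt_separation(abs_dv_est, rts):
    """Mean RT of trials judged hard minus mean RT of trials judged easy.

    Trials are split at the median of the estimated |dv|.
    """
    cut = statistics.median(abs_dv_est)
    hard = [rt for d, rt in zip(abs_dv_est, rts) if d <= cut]
    easy = [rt for d, rt in zip(abs_dv_est, rts) if d > cut]
    return statistics.mean(hard) - statistics.mean(easy)


rng = random.Random(0)
n_trials = 4000

# True (unobserved) value differences and RTs in arbitrary units. The mean RT
# decreases monotonically with |dv|; single trials are noisy.
true_dv = [rng.gauss(0.0, 1.0) for _ in range(n_trials)]
rts = [0.5 + 1.0 / (1.0 + abs(d)) + rng.gauss(0.0, 0.2) for d in true_dv]

# An accurate valuation (true dv plus small noise) versus a random valuation
# that is unrelated to the true dv.
accurate_est = [abs(d + rng.gauss(0.0, 0.2)) for d in true_dv]
random_est = [abs(rng.gauss(0.0, 1.0)) for _ in range(n_trials)]

sep_accurate = rt_separation(accurate_est, rts)
sep_random = rt_separation(random_est, rts)
```

      Under these assumptions, `sep_accurate` is clearly positive whereas `sep_random` is close to zero, which is the sense in which a larger RT separation indicates a more accurate value representation.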

      - Line 189: Are the stats associated with Eq 7, was the model fit subject by subject? Combining subjects? A mixed-effects model? Why not show a scatter plot of the coefficients of Δvₑ and Δvᵣ (1 point/subject).

      The model was not fit separately for each subject. Instead, we concatenated trials from all subjects, allowing each subject to have a different bias term (β0,i ).

      We have now replaced it with the analysis suggested by the reviewer. We fit the logistic regression model independently for each participant. The scatter plot suggested by the reviewer is shown in Figure 5–Figure supplement 1. Error bars indicate the s.e. of the regression coefficients:

      It can be seen that the result is consistent with what we reported before: βd is significantly positive for all participants, while βs is not.

      - I think Figure S1 should be a main figure.

      Thank you for this suggestion, we have now included the former Figure S1 as an additional panel in Figure 5.

      - Fig 9 figure and text (line 259) don't exactly match. In the text it says that the BOLD correlated with vᵣ and not vₑ, but the caption says there were correlations with vᵣ after controlling for vₑ. Is there really nothing in the brain that correlated with vₑ? This seems hard to believe given how correlated the two estimates are. In the methods, 8 regressors are described. A more detailed description of the results is needed.

      Thank you for pointing out the inconsistency in our portrayal of the results in the main text and in the figure caption. We have substantially revised all fMRI methods, rerun the fMRI data preprocessing, and implemented new, simpler, and more comprehensive GLM models following Reviewer #2's suggestion. Consequently, we have replaced Figure 9, added Figure 9 — Figure Supplement 1, and uploaded all maps to NeuroVault. These new models and maps allow for a clearer interpretation of our findings. More details about the fMRI analyses in the methods and results are included in the revision. We took care to use similar language in the main text and in the figure captions to convey the results and interpretation. The new analyses strengthen our original conclusion: dynamic values better explain BOLD activity in the ventromedial prefrontal cortex, a region consistently associated with valuation, than static values.

      - It's great that the authors reanalyzed existing datasets (fig 10). I think the ΔRT plots are the least clear way to show that _reval_ is better. Why not a figure like Figure 6a and Figure 7 for the existing datasets?

      We agree with the reviewer. We have replaced Fig. 10 with a more detailed version. For each dataset, we show the ΔRT plots, but we also show figures equivalent to Fig. 6a, Fig. 7a, and the new Fig. 6b (Deviance with and without Reval).

      Reviewer #2 (Recommendations For The Authors):

      I assume that the data and analysis code will be made publicly and openly available once the version of record is established.

      Yes, the data and analysis code are now available at: https://github.com/arielzylberberg/Reval_eLife_2024

      We added a Data Availability statement to the manuscript.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1

      Major comments:

      1) The authors conclude that the bone growth defects are chondrocyte-specific, highlighting no changes in the IGF pathway. However, other bone cells such as mesenchymal progenitors, osteoblasts, osteocytes, and marrow stromal cells are also lateral plate mesoderm derived and likely have roles in the bone growth phenotypes (a). Additionally, while the size decrease of the proliferative zone was stated, no actual proliferation assays such as BrdU were conducted (b). With the elements being of such small size in the mutants, the defects are likely to be found at the earliest stages of limb development at E11.5-E13.5 and may be due to mesenchymal to chondrocyte transitions or defects in osteoblast lineage development (c). Overall, the skeletal characterization is not rigorous and does not identify even a likely cellular mechanism. Further, a molecular mechanism by which SMN functions in mesenchymal progenitors, chondrocytes, or osteoblast lineage cells has not been assessed (d).

      (a, c) As the reviewer commented, it is important to evaluate whether there is any problem in embryonic development from the time of mesenchymal cell condensation in the limb bud to the formation of the primary ossification center. However, when Hensel et al. evaluated bone growth at P3 in severe SMA mice, the growth defect was not very large, with a control femur length of 3.5 mm and a mutant femur length of 3.2 mm. It therefore seems that, even when SMN is deficient, there is no major problem with endochondral bone formation in the embryonic period (Hensel et al., 2020).

      In this study, quantification of SMN protein showed that the SMN2 1-copy mutant with the bone growth defect has a reduction in SMN protein similar to that of the severe SMA mouse model. When Hensel et al. performed an in vitro ossification test on primary osteoblasts from the other severe SMA mouse model (Taiwanese severe SMA), they found no significant difference compared to controls. In femurs at P3 from severe SMA mice, they found no difference in bone voxel density and bone thickness (Hensel et al., 2020). In our data, bone thickness was not different in Figure 1 and Figure 1 – figure supplement 2, and BMD was actually greater. Thus, we believe that osteoblast and osteocyte function is not impaired by the absence of SMN. When we looked at cortical osteoblasts in our new Figure 1-figure supplement 2, there did not appear to be a significant difference in density.

      Furthermore, it is unlikely that BMSCs contributed to the bone growth we observed up to 2 weeks of age. The Lepr+Cxcl12+ BMSC population, which constitutes 94% ± 4% of CFU-F colonies formed by bone marrow cells (Zhou et al., 2014), is Prrx1-positive, and is known to be capable of osteogenesis in vivo, was only shown to differentiate into osteoblasts and form new bone in adults over 8 weeks of age. In the Lepr-cre; tdTomato; Col2.3-GFP mouse model, few cells expressing the osteoblast marker Col2.3-GFP are found before 2 months, and only about 3% of femur trabecular and cortical osteocytes express tdTomato at 2 months (Zhou et al., 2014). In the Cxcl12-CreER; tdTomato; Col2.3-GFP mouse model, the researchers did not find tomato positivity in osteoblasts and osteocytes even after administration of tamoxifen at P3 and analysis 1 year later (Matsushita et al., 2020).

      We, therefore, concluded that the bone growth abnormalities observed in SMN2 1-copy mutants are due to problems in endochondral ossification caused by chondrocyte defects and not due to other Prrx1-lineage skeletal cells.

      (b) According to the reviewer's suggestion, we evaluated cell proliferation in the new Figure 1J-L by performing immunostaining for the Ki67 proliferation marker in growth plates.

      (d) As the reviewer pointed out, we enhanced the mechanism study and found a reduction in chondrocyte-derived IGF signaling and hypertrophic marker expression in new Figure 2. We evaluated the density of osteoblasts and osteoclasts, which can affect bone mineralization. We highlighted the limited impact of BMSCs on bone growth in the first two weeks of life. In a previous study, SMN-deleted osteoblasts did not show any issues with ossification (Hensel et al., 2020). In fact, osteoblast density in the SMN2 1-copy mutant was not different from the control, indicating that the skeletal abnormalities can largely be attributed to deficiencies in endochondral ossification caused by chondrocytes. Since chondrocytes are the local source of IGF and our mutants exhibit phenotypes similar to mouse models with reduced IGF, such as downregulated expression of Igf1 and Igfbp3, downregulated IGF-induced hypertrophic gene expression, reduced AKT phosphorylation, proliferation, and growth plate zone length, SMN-deleted chondrocytes probably showed these phenotypes due to decreased IGF secretion. We have now added new Figure 2A-C and E.

      2) Is the liver the only organ/tissue that supplied IGF to the chondrocytes or are other lateral plate mesoderm-derived cells potential suppliers? It's not possible to pin SMN deletion in chondrocytes as intrinsic ignoring the other bone cell types that it is depleted from in the Prrx1Cre genetic model.

      Recently, Oichi et al. reported that the local IGF source in the growth plate is chondrocytes by in situ hybridization and p-AKT staining (Oichi et al., 2023). When we measured IGF in chondrocytes isolated from articular cartilage, the expression of Igf1 and Igfbp3 was markedly reduced in chondrocytes with SMN deletion compared to controls (New Figure 2E), suggesting that intrinsic SMN expression in chondrocytes plays an important role in the growth plate.

      3) Why is SMN protein being isolated from FAPs to assess levels in the null/SMN2 single copy/double copy mutants when the bone defects are supposed to be a chondrocyte-specific phenotype? This protein expression needs to be confirmed in chondrocytes themselves, and or other Prrx1Cre lineaged skeletal cells.

      According to the reviewer’s suggestion, we attempted to evaluate the protein levels in chondrocytes of the SMN2 1-copy mutant. However, we were unable to obtain sufficient numbers of chondrocytes because mutant chondrocytes proliferate poorly in culture compared to controls. We could obtain ~10^4 viable cells from 1 mouse of SMN2 1-copy mutant. Therefore, our only options for confirming SMN deletion in chondrocytes were DNA and RNA work. Because the amount of SMN protein in Prrx1-lineage FAPs correlates with the expression level of full-length SMN mRNA (Figure 2H-J), we expect that the SMN protein in chondrocytes would be fully depleted due to poor full-length SMN mRNA expression (Figure 2H).

      4) Figure 2E should have example images of each type of NMJ characterization.

      We revised our figure by adding the example images in new Figure 3E.

      5) What are the overall NMJ numbers in the normal formation period? Are these constant into the juvenile period when the authors say the deterioration occurs?

      We appreciate the reviewer's constructive comments, and it would be interesting to see if we could see a difference in the total number of NMJs. However, there is one NMJ in every myofiber, and each muscle has hundreds to thousands of myofibers. The technical difficulty of confocal imaging an entire muscle, which can be several millimeters across, precludes experiments that count every NMJ and show a difference. It may be possible to do so by combining clearing and confocal line scanning techniques. In our analysis of the NMJ, the formation of the NMJ in the mutant appears to be normal. Additionally, the number of myofibers seems to be the same, and there may be no difference in the total NMJ number.

      6) For transplantation experiments the authors sorted YFP or TOMATO+ cells from the Prrx1Cre mice muscles, but refer to them as FAPs. It is known that other cells including tenocyte-like cells, pericytes, and vascular smooth muscle cells are identified by this reporter line. Staining for TOMATO colocalization with PDGFRA would help to clarify this.

      As described in the Methods section ‘Hindlimb fibro-adipogenic progenitors isolation’, the sorted 7AAD–Lin–Vcam–Sca1+ population is what we refer to as FAPs. For FAPs transplantation, we also used YFP or TOMATO+ FAPs (7AAD–Lin–Vcam–Sca1+). The ‘FAPs transplantation’ method section did not specify the FAPs population in detail; this has been fixed in the revised Methods. Sca1 (Ly6a) is an effective marker for identifying FAPs within Prrx1-lineage cells, as is Pdgfra (Leinroth et al., 2022).

      7) The authors only compare the SMN2 single copy mutant transplantation to contralateral to show rescue, but how does this compare to overall wt morphology?

      According to the reviewer’s constructive comment, we compared them with wild-type morphology (new Figure 7A-D).

      8) The asterisks of TOMATO+ in Figure 6A are confusing. FAPs do not usually clump together to form such large plaques and are normally much thinner tendrils. What is the reason for this?

      As the reviewer states, FAPs have a fibroblast-like morphology with elongated, thin tendrils. The image in Figure 6A shows a Z-slice through the cell body of a FAP, where the nucleus is located, so the cell appears blunt. We attach images of Tomato+ FAPs in which the cell bodies appear plaque-like.

      Author response image 1.

      Tomato+ FAPs in muscle

      9) Would transplantation of healthy FAPs after NMJ maturation in SMN mutants still rescue the phenotype? Assessment of this is key for therapy intervention timelines moving forward.

      It will be very interesting to see whether transplantation of healthy FAPs after NMJ maturation improves the phenotype, but this is a technically difficult experiment because we found that FAPs do not engraft effectively when injected into naive adult muscle. Transplantation into adult muscle is feasible if accompanied by an injury, but injury eventually leads to the formation of new NMJs. Thus, it seems impossible to perform the transplantation experiment after NMJ maturation using general methods. If we discover a method to efficiently rescue SMN in FAPs or identify a factor that mediates the influence of FAPs on the NMJ, then we may be able to conduct this experiment.

      Reference

      Hensel, N., Brickwedde, H., Tsaknakis, K., Grages, A., Braunschweig, L., Lüders, K. A., Lorenz, H. M., Lippross, S., Walter, L. M., Tavassol, F., Lienenklaus, S., Neunaber, C., Claus, P., & Hell, A. K. (2020). Altered bone development with impaired cartilage formation precedes neuromuscular symptoms in spinal muscular atrophy. Human Molecular Genetics, 29(16), 2662–2673. https://doi.org/10.1093/hmg/ddaa145

      Leinroth, A. P., Mirando, A. J., Rouse, D., Kobayahsi, Y., Tata, P. R., Rueckert, H. E., Liao, Y., Long, J. T., Chakkalakal, J. V., & Hilton, M. J. (2022). Identification of distinct non-myogenic skeletal-muscle-resident mesenchymal cell populations. Cell Reports, 39(6), 110785. https://doi.org/10.1016/j.celrep.2022.110785

      Matsushita, Y., Nagata, M., Kozloff, K. M., Welch, J. D., Mizuhashi, K., Tokavanich, N., Hallett, S. A., Link, D. C., Nagasawa, T., Ono, W., & Ono, N. (2020). A Wnt-mediated transformation of the bone marrow stromal cell identity orchestrates skeletal regeneration. Nature Communications, 11(1). https://doi.org/10.1038/s41467-019-14029-w

      Oichi, T., Kodama, J., Wilson, K., Tian, H., Imamura Kawasawa, Y., Usami, Y., Oshima, Y., Saito, T., Tanaka, S., Iwamoto, M., Otsuru, S., & Enomoto-Iwamoto, M. (2023). Nutrient-regulated dynamics of chondroprogenitors in the postnatal murine growth plate. Bone Research, 11(1). https://doi.org/10.1038/s41413-023-00258-9

      Zhou, B. O., Yue, R., Murphy, M. M., Peyer, J. G., & Morrison, S. J. (2014). Leptin-receptor-expressing mesenchymal stromal cells represent the main source of bone formed by adult bone marrow. Cell Stem Cell, 15(2), 154–168. https://doi.org/10.1016/j.stem.2014.06.008

      Reviewer #2

      Major comments:

      1) Regarding bone deficits - CT analysis of bones should be more comprehensive than Figure 1A shows. How about cross-sections? (a) Are bone phenotypes also age-dependent? (b) PCR was done only for SMA and related proteins (such as IGF). IGF protein in the blood and relevant organs should be studied. Why not include biomarkers of osteoblasts or/and osteoclasts and their regulators? (c)

      (a) We appreciate the reviewer’s constructive comment. We added longitudinal section views in new Figure 1A and a description of trabecular bone volume and secondary ossification center in the main text.

      (b) Age-dependent evaluation is an important point. By adulthood, the difference between the SMN2 1-copy mutant and the control is much larger, and even at birth there is a slight difference, although not as large as at 2 weeks of age. We focused our phenotyping on bone growth at 2 weeks of age, a time when new bone formation by BMSCs is less influential, when bone growth is primarily driven by endochondral ossification of chondrocytes, and before the defect in the NMJ is primarily manifested.

      (c) As the reviewer comments, it is important that IGF is evaluated in tissues other than the liver. However, the liver is most likely the source of systemic IGF, as shown by the liver-specific deletion of Igf1 and knockout of Igfals, a protein that forms the IGF ternary complex and is predominantly expressed in the liver. This resulted in a 90% drop in serum IGF levels and a phenotype of shortened femur length and growth plates in the double KO mice (Yakar et al., 2002).

      The local IGF source in the growth plate is chondrocytes, as confirmed by Igf1 in situ hybridization and p-AKT staining (Oichi et al., 2023). The in situ hybridization data show that bone marrow and bone do not express Igf1 at all, and that only the perichondrium and chondrocytes in the resting zone express Igf1 mRNA. Therefore, the only supplier of IGF among LPM-derived cells is chondrocytes, and in new Figure 2 we measured IGF pathway expression and AKT phosphorylation in chondrocytes. We have confirmed that the expression of Igf1/Igfbp3 is reduced in chondrocytes with SMN deletion.

      We could not assess serum IGF levels during the revision period because of the administrative procedures required to purchase new apparatus and the limits of our research funds. However, as previously stated, there is no difference in the expression of Igf1 and Igfals in the liver, which accounts for 90% of serum IGF levels. Therefore, we did not anticipate significant variations in serum IGF levels.

      Evaluation of osteoblasts and osteoclasts was done by section staining because of sampling difficulties for PCR. We assessed the state of osteoblasts and osteoclasts in new Figure 1-figure supplement 2.

      2) What is the relationship between deficits of bone deficits and muscle deficits or even NMJ deficits? Are they inter-related? Is skeletal muscle development also defective in Smn∆MPC mice? Can NMJ deficits result from bone deficits? Or vice versa?

      Unfortunately, the reviewer's questions are very difficult to address in our study using the Prrx1-cre model. In skeletal muscle development, the myofiber number was not significantly different in our mouse models. A study has shown that inactivating noggin, a BMP antagonist expressed in condensed cartilage and immature chondrocytes, results in severe skeletal defects without affecting the early stages of muscle differentiation (Tylzanowski et al., 2006). Therefore, bone may not have a significant impact on the early development of muscle, but later in postnatal development it may affect motor performance. The relationship between bone and the NMJ has not been studied. The impact of bone defects on motor skill may result in muscle weakness and NMJ problems. In our study, we showed that the NMJ deficit was rescued by transplantation of FAPs and that IGF was decreased in chondrocytes, a key source of local IGF. This suggests that the functions of FAPs in the NMJ deficit and of chondrocytes in the bone deficit are the crucial factors, rather than an influence of one deficit on the other.

      3) Regarding the rescue experiment, the interpretation of the data should be careful. Evidently, healthy FAPs (td-Tomato positive) were transplanted into TA muscles of 10 days-old SMN2 1-copy SmnΔMPC mice, and NMJs were looked at P56. The control was contralateral TA that was injected with the vehicle. As described above, the data had huge SEM and were difficult to interpret or believe. The control perhaps was wrong if FAPs act by releasing "chemicals" because FAPs from one leg may go to other muscles via blood. Second, if FAPs act via contact, the data shown did not support this. Two red FAPs were shown in Figure 6, one of which was superimposed with a nerve track to one of the three NMJs. This NMJ however did not show any difference to the other two, which did not support a contact mechanism. These rescue data were not convincing.

      We appreciate the reviewer’s critical comment, but the reviewer appears to have confused the minimum and maximum range bars in the box-and-whisker plots with the SEM error bars of a bar graph. We apologize for the insufficient description in the figure legends, which we have revised. New Figure 7C, which is a bar graph, has a sufficiently short SEM error bar. In contrast, box-and-whisker plots B and D depict the minimum and maximum range instead of the SEM, and the groups are significantly different with a p-value of less than 0.001. If FAPs affect the NMJ via a paracrine factor or ECM with a short range of action, they may rescue the NMJ defect in a non-contact-dependent manner without affecting the contralateral muscle. Also, FAPs are heterogeneous, so if only a certain subpopulation mediates the rescue, the tomato+ FAP in the figure may not be one of the rescuing cells.
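
      The distinction between the two kinds of bars can be illustrated with a minimal sketch. The values below are hypothetical and chosen only for illustration; they are not data from the study. The sketch computes the SEM (sample standard deviation divided by the square root of n) and the minimum-to-maximum span that whiskers of this type would cover, for the same set of numbers.

```python
import math
import statistics


def sem(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def whisker_range(values):
    """Full min-to-max span, as drawn by minimum/maximum whiskers."""
    return max(values) - min(values)


# Hypothetical measurements, for illustration only (not data from the study).
values = [10.0, 12.0, 14.0, 16.0, 18.0]

print(f"SEM           = {sem(values):.2f}")
print(f"min-max range = {whisker_range(values):.2f}")
```

      For any sample with more than one distinct value, the min-max span is wider than the SEM bar, so whiskers of this type will always look larger than SEM error bars drawn from the same data.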

      4) For most experiments, the "n" numbers were too small. 3-5 mice were used for bone characterization. For the NMJ, most experiments were done with 3 mice. It was unclear how many NMJs were looked at. Perhaps due to small n numbers, the SEM values were enormous (for example, in Figure 6).

      As with the response to the previous comment, this is due to confusion between box-and-whisker plots and bar graphs, and our data was determined to be significant using the appropriate statistical method.

      5) Also for experimental design, some experiments included four genotypes of mice (Fig. 1 J,K) whereas some had only three (Fig.1 A, B, C, D and Fig.3) and others had two (many other figures).

      In the first experiments to confirm the phenotypes, we tested the 2-copy mutant, but it was not significantly different from the wild type, so in subsequent experiments we mainly tested only the 1-copy mutant.

      6) What was the reason why mixed muscles were used for NMJ characterization (TA versus EDL)? Why not pick a type I-fiber muscle and a type II-fiber muscle?

      We appreciate the constructive comment from the reviewer. We first conducted a phenotype analysis on the TA muscle. For electrophysiological recording, the EDL muscle should be used for technical reasons, because it allows a muscle preparation with an intact nerve. Additionally, for TEM imaging, the EDL was a suitable muscle for locating NMJ positions before TEM processing. The TA and EDL muscles are adjacent and have similar fiber-type compositions. It would be important to examine muscles of different fiber types, but when we first identified the phenotype, various types of limb muscles showed similar defects, so we focused on specific muscles.

      7) The description of mouse strains was confusing. SMN2 transgenic mice (with different copies) were not described in the methods.

      We apologize for the insufficient description in the Methods section. By crossing mice homozygous for the SMN2 allele (SMN2+/+), we obtained SMN2 1-copy mice carrying only one SMN2 allele (SMN2+/0) and SMN2 2-copy mice homozygous for SMN2 (SMN2+/+). We revised the ‘Animals’ section of the Methods.

      Reference

      Oichi, T., Kodama, J., Wilson, K., Tian, H., Imamura Kawasawa, Y., Usami, Y., Oshima, Y., Saito, T., Tanaka, S., Iwamoto, M., Otsuru, S., & Enomoto-Iwamoto, M. (2023). Nutrient-regulated dynamics of chondroprogenitors in the postnatal murine growth plate. Bone Research, 11(1). https://doi.org/10.1038/s41413-023-00258-9

      Tylzanowski, P., Mebis, L., and Luyten, F. P. (2006). The noggin null mouse phenotype is strain dependent and haploinsufficiency leads to skeletal defects. Dev. Dyn. 235, 1599–1607. doi: 10.1002/dvdy.20782

      Yakar, S., Rosen, C. J., Beamer, W. G., Ackert-Bicknell, C. L., Wu, Y., Liu, J. L., Ooi, G. T., Setser, J., Frystyk, J., Boisclair, Y. R., & LeRoith, D. (2002). Circulating levels of IGF-1 directly regulate bone growth and density. Journal of Clinical Investigation, 110(6), 771–781. https://doi.org/10.1172/JCI0215463

      Reviewer #3

      1) The authors used Prrx1Cre mouse with floxed Smn exon7(Smnf7) mouse carrying multiple (one or two) copies of the human SMN2 gene. Is it expressed both in chondrocytes and mesenchymal progenitors in the limb?

      We appreciate the reviewer's comment. We analyzed the Cre-mediated deletion of Smn in chondrocytes and FAPs using genomic PCR and qRT-PCR, as depicted in new Figure 2. The SMN2 allele, which is expressed throughout the body, can rescue Smn knockout mouse lethality (Monani et al., 2000). Indeed, the short limb length and lethality observed in SMN2 0-copy mutants were mitigated by the presence of multiple copies of SMN2. Therefore, both chondrocytes and FAPs may express SMN2 transcripts from the transgenic SMN2 allele.

      2) Page 10 regarding Fig.2E, please show pretzel-like structure. In Figure 2E, plaque, perforated, open, and branched are shown; however, the pretzel is not shown. The same issue is for the Fig. 3D explanation in the text on page 12.

      We appreciate the reviewer's constructive feedback. We included illustrative figures of all types of NMJ characterization, and the branched type is identical to the pretzel type. Therefore, we have replaced ‘branched’ with ‘pretzel’ in our text and revised Figure 3E by incorporating the example images.

      3) The explanation of the electrophysiology for Fig.4 in the text on pages 12 and 15 (RRP) is not so convincing for the readers. It is advisable to add TEM data for transplantation if it is not technically difficult.

      We appreciate the reviewer's critical feedback. Because we did not measure the RRP directly, we removed the speculation about a possible RRP difference. Although observing the active zone and docked synaptic vesicles with TEM would help quantify the RRP, it is technically difficult to obtain images of sufficient quality to distinguish the active zones with our current TEM imaging technique.

      4) The authors used the word FAP for 7AAD(-)Lin(-)Vcam(-)Sca1(+). It is recommended to show the expression of PDGFR alpha. Furthermore, as the authors stated in the text, mesenchymal progenitors (FAPs) are heterogeneous. Please discuss this point further. Other reports show at least 6 subpopulations using single-cell analyses (Cell Rep. 2022).

      In that report, Ly6a (Sca1) is a good marker for FAPs, as is Pdgfra (Leinroth et al., 2022). All 6 subpopulations expressed Ly6a. One of the subpopulations was found to be associated with the NMJ. This population expresses Hsd11b1, Gfra1, and Ret, is located adjacent to the NMJ, and responds to denervation, indicating an increased likelihood of interaction with NMJ organization. In future studies, we aim to determine which subpopulations are crucial for NMJ maturation by transplanting them into mutants for rescue.

      5) How do authors determine the number of FAP cells for transplantation?

      The FAP transplantation was performed according to our previously reported study (Kim et al., 2022).

      Reference

      Kim, J. H., Kang, J. S., Yoo, K., Jeong, J., Park, I., Park, J. H., Rhee, J., Jeon, S., Jo, Y. W., Hann, S. H., Seo, M., Moon, S., Um, S. J., Seong, R. H., & Kong, Y. Y. (2022). Bap1/SMN axis in Dpp4+ skeletal muscle mesenchymal cells regulates the neuromuscular system. JCI Insight, 7(10). https://doi.org/10.1172/jci.insight.158380

      Leinroth, A. P., Mirando, A. J., Rouse, D., Kobayahsi, Y., Tata, P. R., Rueckert, H. E., Liao, Y., Long, J. T., Chakkalakal, J. V., & Hilton, M. J. (2022). Identification of distinct non-myogenic skeletal-muscle-resident mesenchymal cell populations. Cell Reports, 39(6), 110785. https://doi.org/10.1016/j.celrep.2022.110785

      Monani, U. R., Sendtner, M., Coovert, D. D., Parsons, D. W., Andreassi, C., Le, T. T., Jablonka, S., Schrank, B., Rossol, W., Prior, T. W., Morris, G. E., & Burghes, A. H. M. (2000). The human centromeric survival motor neuron gene (SMN2) rescues embryonic lethality in Smn(-/-) mice and results in a mouse with spinal muscular atrophy. Human Molecular Genetics, 9(3), 333–339. https://doi.org/10.1093/hmg/9.3.333

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Previous studies have used a randomly induced label to estimate the number of hematopoietic precursors that contribute to hematopoiesis. In particular, the McKinney-Freeman lab established a measurable range of precursors of 50-2500 cells using random induction of one of the 4 fluorescent proteins (FPs) of a Confetti reporter in the fetal liver to show that hundreds of precursors establish lifelong hematopoiesis. In the presented work, Liu and colleagues aim to extend the measurable range of precursor numbers previously established and enable measurement in a variety of contexts beyond embryonic development. To this end, the authors investigated whether the random induction of a given Confetti FP follows the principles of binomial distribution such that the variance inversely correlates with the precursor number. They tested their hypothesis using a simplified 2-color in vitro system, paying particular attention to minimizing sources of experimental error (elimination of outliers, sample size, events recorded, etc.) that may obscure the measurement of variance. As a result, the data generated are robust and show that the measurable range of precursors can be extended up to 10^5 cells. They use tamoxifen-inducible Scl-CreER, which is active in hematopoietic stem and progenitor cells (HSPCs) to induce Confetti labeling, and investigated whether they could extend their model to cell numbers below 50 with in vivo transplantation of high versus low numbers of Confetti total bone marrow (BM) cells. The premise of binomial distribution requires that the number of precursors remains constant within a group of mice. The rare frequency of HSPCs in the BM means that the experimentally generated "low" number recipient animals showed some small variability of seeding number, which does not follow the requirement for binomial distribution. While variance due to differences in precursor numbers still dominates, it is unclear how accurate estimated numbers are when precursor numbers are low (<10).

      According to our simulation, the differences between estimated numbers and the corresponding expected numbers are more pronounced at numbers below 10, but they are still relatively small. Since Figure S4A is in log scale, it might be difficult for readers to appreciate the magnitude of the difference from the graph. We plan to add a linear-scale panel to Figure S4A for better visualization of the absolute differences (left). We also plan to provide an additional graph quantifying the differences between estimated and expected values for numbers below 15 (right). In both graphs, the maximum difference between estimated n and expected n occurs at a precursor number of 10 (estimated as 7.6). We acknowledge that these numbers are not numerically identical, and some minor correction of the formula may be needed if a very accurate absolute number is warranted. However, we also want to emphasize that (1) most estimated n values are within 25% of the expected n; and (2) despite the minor discrepancy, the estimated n is still highly correlated with the expected n, so the comparison between different precursor numbers is not affected.

      Author response image 1.
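
      As a rough illustration of the binomial reasoning discussed above, the following sketch simulates replicates and recovers the precursor number from the variance of the labeled fraction. It is not the formula or simulation used in the manuscript. It assumes that every replicate has exactly n precursors, that each precursor carries a given FP independently with probability p, and that all precursors contribute equally, so that Var(fraction) = p(1-p)/n and n can be estimated as p(1-p)/Var. The function names, the labeling probability of 0.3, and the replicate count are hypothetical choices.

```python
import random
import statistics


def simulate_fractions(n_precursors, p_label, n_replicates, rng):
    """Simulate the labeled fraction observed in each replicate.

    Assumption: each precursor independently carries the FP with
    probability p_label and contributes equally to the progeny.
    """
    fractions = []
    for _ in range(n_replicates):
        labeled = sum(rng.random() < p_label for _ in range(n_precursors))
        fractions.append(labeled / n_precursors)
    return fractions


def estimate_precursors(fractions):
    """Binomial estimate n = p(1-p) / Var(p_hat) from replicate fractions."""
    p_hat = statistics.mean(fractions)
    var = statistics.variance(fractions)  # sample variance across replicates
    return p_hat * (1.0 - p_hat) / var


rng = random.Random(0)  # fixed seed so the run is repeatable
for true_n in (10, 100, 1000):
    fractions = simulate_fractions(true_n, 0.3, 2000, rng)
    print(true_n, round(estimate_precursors(fractions), 1))
```

      Because this sketch holds n fixed within a group, it does not model the variable seeding number raised by the reviewer, and it is not expected to reproduce the discrepancy at low n reported in the simulation above.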

      The authors then apply their model to estimate the number of hematopoietic precursors that contribute to hematopoiesis in a variety of contexts including adult steady state, fetal liver, following myeloablation, and a genetic model of Fanconi anemia. Their modeling shows:

      - thousands of precursors (~2400-2600) contribute to adult myelopoiesis, which is in line with results from a previous study (Sun et al, 2014).

      - myeloablation (single dose 5-FU), while reducing precursor numbers of myeloid progenitors and HSPCs, was not associated with a reduction in precursor numbers of LT-HSCs.

      - no major expansion of precursor number in the fetal liver derived from labeling at E11.5 versus E14.5, consistent with recent findings from Ganuza et al, 2022.

      - normal precursor numbers in Fancc-/- mice at steady state and from competitive transplantation of young Fancc-/- BM cells, suggesting that reduced Fancc-/- cell proliferation may underlie the reduced chimerism upon transplantation.

      - reduced number of lymphoid precursors following transplantation of BM cells from 9-month-old Fancc-/- animals (beyond this age animals have decreased survival).

      Although this system does not permit the tracing of individual clones, the modeling presented allows measurements of clonal activity covering nearly the entire HSPC population (as recently estimated by Cosgrove et al, 2021) and can be applied to a wide range of in vivo contexts with relative ease. The conclusions are generally sound and based on high-quality data. Nevertheless, some results could benefit from further explanation or discussion:

      - The estimated number of LT-HSCs that contribute to myelopoiesis is not specifically provided, but from the text, it would be calculated to be 1958/5 = ~391. Data from Busch et al, 2015 suggest that the number of differentiation-active HSCs is 5.2 x 10^3, which is considered the maximum limit. There is nevertheless a more than 10-fold difference between these two estimates, and it is unclear how this discrepancy arises.

      First, we would like to clarify a sentence in the manuscript. 

      “The average myeloid precursor number at the time of BM analysis (1958) matched the average precursor number calculated from BM myeloid progenitors (MP, Lin-Sca-1-cKit+) and HSPCs (1773 and 1917), but it was five-fold higher than that of LT-HSC (Figure 3E).”

      In this sentence, we compared the number of precursors calculated from peripheral blood myeloid cells to those calculated from BM myeloid progenitors, HSPCs and LT-HSCs. However, we did not intend to imply that the precursor numbers calculated from HSPCs and LT-HSCs specifically contribute to myelopoiesis. To avoid misunderstanding, we propose to change this sentence to read:

      “The average precursor number calculated from PB myeloid cells at the time of BM analysis (1958) matched those calculated from BM myeloid progenitors (MP, Lin-Sca-1-cKit+) and HSPCs (1773 and 1917), but it was fivefold higher than that of LT-HSC (Figure 3E).”

      Nonetheless, we appreciate the reviewer’s comment on the gap between the precursor numbers of LT-HSC and the number of differentiation-active HSCs reported in Busch et al, 2015. We propose the following explanation:

      First, precursor numbers reflect LT-HSC self-renewal by symmetric division and maintenance by asymmetric division, but not differentiation. For comparison with the number of differentiation-active LT-HSCs, precursor numbers measured from differentiated progeny (progenitors) are a better choice. As our system does not distinguish the origin of a precursor, measuring the precursor number of differentiation-active LT-HSCs is difficult, since progenitors may also derive from other long-lived MPPs. However, if we assume that most divisions of LT-HSCs are asymmetric, generating one LT-HSC and one progenitor, then we can approximate the number of differentiation-active HSCs with the precursor numbers of LT-HSC.

      Second, when Busch et al, 2015 calculated the number of differentiation-active HSCs, they measured the cumulative activity of stem cells by following the mice up to 36 weeks post-induction. Our method measures the recent rather than the cumulative activity of HSCs, so the number of differentiation-active HSCs in Busch et al 2015 is predicted to be higher.

      Third, Busch et al, 2015 used Tie2MCM Cre to trace HSC. It has been shown that Tie2+ HSC have a higher reconstitution capacity (Ito et al 2016, Science), but no one has compared the in situ activity of Tie2+ and Tie2- HSC in a native environment. Since the behavior of HSCs in situ may be very different from their behavior in a transplantation setting, it is possible that Tie2+ HSC are more prone to differentiation than Tie2- HSC in a native environment, leading to an overestimation of differentiation-active HSC in the HSC pool. 

      - Similarly, in Figure 3E, the estimated number of precursors is highest in MPP4, a population typically associated with lymphoid potential and transient myeloid potential, whereas the numbers of MPP3, traditionally associated with myeloid potential, tend to be higher but are not significantly different than those found in HSCs.

      We believe this question results from a similar confusion over the nomenclature of myeloid precursors as in the previous question. As explained previously, the precursors quantified reflect a variety of possible differentiation routes, not just myelopoiesis. Thus, Figure 3E does not suggest that the lymphoid-biased MPP4 population has more myeloid precursors than LT-HSC. Instead, it simply means that more precursors contribute to the MPP4 population than to the LT-HSC pool. We apologize for the confusion.

      - The requirement for estimating precursor numbers at stable levels of Confetti labeling is not well explained. As a result, it is unclear how accurate the estimates of B cell precursors upon transplantation of Fancc-/- cells are. In previous experiments on normal Confetti mice (Figure 3B), the authors do not estimate precursors of lymphopoiesis because Confetti labeling of B cells is not saturated, and this appears to be the case in Fanc-/- animals as well (Fig. 5B).

      We appreciate the request for clarification. Our approach required the labeling level to be stable in peripheral blood because we calculate the total number of precursors by normalizing precursor numbers in Confetti+ population with the labeling level (precursor numbers in Confetti+ population divided by labeling efficiency). If the labeling level is not saturated, then the calculation of total precursors will be overestimated. This requirement is more important in native hematopoiesis, since it takes a long time for the mature population, especially the lymphoid population, to be fully replaced by the progenies from the labeled HSPC population (as suggested by Busch et al 2015 and Säwen et al 2018). In transplantation, since lethal irradiation was performed, mature blood cells were rapidly generated by HSPCs, thus saturation of labeling level is not a major concern for precursor quantification. We plan to add Author response image 2 as evidence that Confetti labeling level was stable in mice transplanted with Fancc-/- cells.  

      Author response image 2.
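
      The normalization described above (precursor numbers in the Confetti+ population divided by labeling efficiency) can be written out as a short sketch. The function name and the numbers are hypothetical and serve only to show the direction of the effect: if the labeling level measured in a population has not yet reached its plateau, the smaller denominator yields a larger total.

```python
def total_precursors(n_in_confetti_pos, labeling_fraction):
    """Scale the precursor number measured in Confetti+ cells to the
    whole population by dividing by the labeling efficiency (0-1]."""
    if not 0.0 < labeling_fraction <= 1.0:
        raise ValueError("labeling_fraction must be in (0, 1]")
    return n_in_confetti_pos / labeling_fraction


# Hypothetical values, for illustration only (not data from the study).
at_plateau = total_precursors(500, 0.25)      # labeling has stabilized
before_plateau = total_precursors(500, 0.20)  # labeling still rising

print(at_plateau, before_plateau)
```

      This sketch varies only the denominator; it assumes the precursor number measured in the Confetti+ population is the same in both cases.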

      - Do 9-month-old Fanc-/- animals have reduced lymphoid precursors as well?

      Because of the non-saturated labeling in peripheral blood B cells and extra-HSPC induction of Confetti in T cells, we cannot accurately measure lymphoid precursor numbers in 9-month-old Fancc-/- animals. As an alternative, the precursor numbers of the lymphoid-biased MPP4 population were comparable between Fancc+/+ and Fancc-/- animals (Figure 5D). We plan to add a supplementary figure showing the frequency of common lymphoid progenitors (CLPs, defined by Lin-IL-7Ra+Sca-1midcKitmid) in these two genotypes.

      Author response image 3.

      Reviewer #2 (Public Review):

      Summary:

      This manuscript by Liu et al. uses Confetti labeling of hematopoietic stem and progenitor cells in situ to infer the clonal dynamics of adult hematopoiesis. The authors apply a new mathematical framework to analyze the data, allowing them to increase the range of applicability of this tool up to tens of thousands of precursors. With this tool, they (1) provide evidence for the large polyclonality of adult hematopoiesis, (2) offer insights on the expansion dynamics in the fetal liver stage, (3) assess the clonal dynamics in a Fanconi anemia model (Fancc), which has engraftment defects during transplantation.

      Strengths:

      The manuscript is well written, with beautiful and clear figures, and both methods and mathematical models are clear and easy to understand.

      Since 2017, Mikel Ganuza and Shannon McKinney-Freeman have been using these Confetti approaches that rely on calculating the variance across independent biological replicates as a way to infer clonal dynamics. This is a powerful tool and it is a pleasure to see it being implemented in more labs around the world. One of the cool novelties of the current manuscript is using a mathematical model (based on a binomial distribution) to avoid directly regressing the Confetti labeling variance with the number of clones (which only has linearity for a small range of clone numbers). As a result, this current manuscript of Liu et al. methodologically extends the usability of the Confetti approach, allowing them more precise and robust quantification.

      They then use this model to revisit some questions from various Ganuza et al. papers, validating most of their conclusions. The application to the clonal dynamics of hematopoiesis in a model of Fanconi anemia (Fancc mice) is very much another novel aspect, and shows the surprising result that clonal dynamics are remarkably similar to the wild-type (in spite of the defect that these Fancc HSCs have during engraftment).

      Overall, the manuscript succeeds at what it proposes to do, stretching out the possibilities of this Confetti model, which I believe will be useful for the entire community of stem cell biologists, and possibly make these assays available to other stem cell regenerating systems.

      Weaknesses:

      My main concern with this work is the choice of CreER driver line, which then relates to some of the conclusions made. Scl-CreER succeeds at being as homogenous as possible in labeling HSC/MPPs... however it is clear that it also labels a subcompartment of HSC clones that become dominant with time... This is seen as the percentage of Confetti-recombined cells never ceases to increase during the 9-month chase of labeled cells, suggesting that non-labeled cells are being replaced by labeled cells. The reason why this is important is that then one cannot really make conclusions about the clonal dynamics of the unlabeled cells (e.g. for estimating the total number of clones, etc.).

      We appreciate the reviewer’s comments. We also agree that this is especially a concern for measuring B cell precursors in native hematopoiesis. For myeloid cells, the increase was much less pronounced (0.5% per month) after month four post-induction. One way to investigate the dynamics of unlabeled cells is to induce different groups of mice with different doses of tamoxifen so that labeling efficiency varies among groups. With 14 days of tamoxifen treatment, a maximum of 60% of HSPCs can be labeled (RFP+CFP+YFP). If the unlabeled cells behave similarly to labeled cells, then varying the labeling efficiency should not affect the total number of precursors calculated (excluding the potential effect of longer tamoxifen treatment on HSCs). While we have not performed such lengthy experiments extensively, we performed one measurement (5 mice) with 14 days of tamoxifen treatment and found that peripheral blood myeloid precursor numbers calculated from this experiment were comparable to those from Figure 3 (2-day tamoxifen).

      Author response image 4.

      It's possible that those HSPC that are never labeled with Confetti even during longer tamoxifen treatment could behave differently. In this case, a different Cre driver may provide insight into the total precursor numbers.

      I am not sure about the claims that the data shows little precursor expansion from E11 to E14. First, these experiments are done with fewer than 5 replicates, and thus they have much higher error, which is particularly concerning for distinguishing differences of such a small number of clones. Second, the authors do see a ~0.5-1 log difference between E11 and E14 (when looking at months 2-3). When looking at months 5+, there is already a clear decline in the total number of clones in both adult-labeled and embryonic-labeled, so these time points are not as good for estimating the embryonic expansion. In any case, the number of precursors at E11 (which in the end defines the degree of expansion) is always overestimated (and thus, the expansion underestimated) due to the effects of lingering tamoxifen after injection (which continues to cause Confetti allele recombination as stem cell divide). Thus, I think these results are still compatible with expansion in the fetal liver (the degree of which still remains uncertain to me).

      We agree that adding additional replicates would reduce error and boost confidence in our conclusions. The dilemma in comparing fetal- and adult-labeled cohorts is that HSPC activities cannot be synchronized among different developmental stages. At the fetal to neonatal stage, HSPCs proliferate faster to generate new blood cells and support developmental needs, while at the adult stage HSPCs proliferate much more slowly. Thus, it takes a long time for the mature myeloid cells in the adult-labeled cohort to reach stable Confetti labeling and provide an accurate quantification of precursors. While we agree that it might be better to compare precursor numbers in earlier months, we preferred to compare precursor numbers at later time points for the aforementioned reasons. The other option is to compare the number of HSPC precursors in the BM at earlier time points, as no equilibration of labeling level is required in HSPCs, but this requires earlier sacrifice, compromising long-term assessment.

      We did not revisit questions about the lingering effect of tamoxifen, as this has been studied by Ganuza et al 2017. They showed that tamoxifen was not able to induce additional Confetti recombination if given one day ahead, suggesting the effective window for tamoxifen is less than 24h.

      Based on our data, the expansion of lifelong precursors ranges anywhere from 1.4 to 7.0 (Figure 4G). It is possible that we would observe a higher level of expansion if the comparison were done at earlier time points. Nonetheless, the observation that the expansion of lifelong HSPCs is not as pronounced as that evidenced by transplantation emphasizes the value of analyzing HSPC activity in situ.

      Reviewer #3 (Public Review):

      Summary:  

      Liu et al. focus on a mathematical method to quantify active hematopoietic precursors in mice using Confetti reporter mice combined with Cre-lox technology. The paper explores the hematopoietic dynamics in various scenarios, including homeostasis, myeloablation with 5-fluorouracil, Fanconi anemia (FA), and post-transplant environments. The key findings and strengths of the paper include (1) precursor quantification: The study develops a method based on the binomial distribution of fluorescent protein expression to estimate precursor numbers. This method is validated across a wide dynamic range, proving more reliable than previous approaches that suffered from limited range and high variance outside this range; (2) dynamic response analysis: The paper examines how hematopoietic precursors respond to myeloablation and transplantation; (3) application in disease models: The method is applied to the FA mouse model, revealing that these mice maintain normal precursor numbers under steady-state conditions and post-transplantation, which challenges some assumptions about FA pathology. Despite the normal precursor count, a diminished repopulation capability suggests other factors at play, possibly related to cell proliferation or other cellular dysfunctions. In addition, the FA mouse model showed a reduction in active lymphoid precursors post-transplantation, contributing to decreased repopulation capacity as the mice aged. The authors are aware of the limitation of the assumption of uniform expansion. The paper assumes a uniform expansion from active precursor to progenies for quantifying precursor numbers. This assumption may not hold in all biological scenarios, especially in disease states where hematopoietic dynamics can be significantly altered. If non-uniformity is high, this could affect the accuracy of the quantification. Overall, the study underscores the importance of precise quantification of hematopoietic precursors in understanding both normal and pathological states in hematopoiesis, presenting a robust tool that could significantly enhance research in hematopoietic disorders and therapy development. The following concerns should be addressed.

      Major Points:

      • The authors have shown a wide range of seeded cells (1 to 1e5) (Figure 1D) that follow the linear binomial rule. As the standard deviation converges eventually with more seeded cells, the authors need to address this limitation by seeding the number of cells at which the assumption fails.

      While a range above 10⁵ cells is not required for our measurement of hematopoietic precursors in mice, we agree that it will be valuable to understand the upper limit of experimental measurement. We plan to seed 10⁶–10⁷ cells per replicate to address the reviewer’s comment.
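
      The convergence the reviewer describes can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not the authors’ analysis code: it assumes each seeded precursor is labelled independently with probability p, so that the variance of the labelled fraction across replicates is p(1 - p)/n, and it inverts that relation to estimate n. The function names, the labelling probability of 0.5, and the replicate count are hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_labelled_fractions(n_precursors, p_label, n_replicates, rng):
    """Simulate the labelled fraction observed in each replicate.

    Assumption: every precursor is labelled independently with
    probability p_label (a simple binomial model).
    """
    fractions = []
    for _ in range(n_replicates):
        labelled = sum(rng.random() < p_label for _ in range(n_precursors))
        fractions.append(labelled / n_precursors)
    return fractions


def estimate_precursor_number(fractions):
    """Invert Var(fraction) = p(1 - p) / n to estimate n."""
    p_hat = statistics.fmean(fractions)
    var_hat = statistics.variance(fractions)  # sample variance across replicates
    return p_hat * (1.0 - p_hat) / var_hat


rng = random.Random(0)  # fixed seed so the sketch is reproducible
fractions = simulate_labelled_fractions(100, 0.5, 2000, rng)
n_hat = estimate_precursor_number(fractions)
expected_sd = (0.5 * 0.5 / 100) ** 0.5  # theoretical SD of the fraction for n = 100
print(f"estimated n: {n_hat:.1f}, expected SD of fraction: {expected_sd:.3f}")
```

      Under this model the standard deviation of the labelled fraction shrinks as 1/sqrt(n). With p = 0.5 it would be about 0.0005 at 10⁶ seeded cells and about 0.00016 at 10⁷, which is where measurement noise could plausibly start to dominate.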

      • Line 276: This suggests myelopoiesis is preferred when very few precursors are available after irradiation-mediated injury. Did the authors see more myeloid progenitors at 1 month post-transplantation with low precursor number? The authors need to show this data in a supplement.

      While we appreciate the concern, we did not generate this dataset because it would require sacrificing a substantial number of animals at one month post-transplantation.

      Minor Points:

      • Please cite a reference for line 40: a rare case where a single HSPC clone supports hematopoiesis.

      • Lines 262-263: "This discrepancy may reflect uneven seeding of precursors to the BM throughout the body after transplantation and the fact that we only sampled a part of the BM (femur, tibia, and pelvis)." Consider citing this paper (https://doi.org/10.1016/j.cell.2023.09.019), which explores HSPC migration across different bones.

      • Lines 299 and 304. Misspellings of RFP.

      We appreciate the reviewer’s suggestions and will modify the manuscript accordingly.

      • The title is misleading as the paper's main focus is the precursor number estimator using the binomial nature of fluorescent tagging. Using a single-copy cassette of Confetti mice cannot be used to measure clonality.

      We appreciate the reviewer’s suggestion and plan to modify the title of the manuscript to read: “Dynamic Tracking of Native Precursors in Adult Mice”.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews: 

      Reviewer #1 (Public Review):

      In the article by Dearlove et al., the authors present evidence in strong support of nucleotide ubiquitylation by DTX3L, suggesting it is a promiscuous E3 ligase with capacity to ubiquitylate ADP ribose and nucleotides. The authors include data to identify the likely site of attachment and the requirements for nucleotide modification. 

      While this discovery potentially reveals a whole new mechanism by which nucleotide function can be regulated in cells, there are some weaknesses that should be considered. Is there any evidence of nucleotide ubiquitylation occurring in cells? It seems possible, but evidence in support of this would strengthen the manuscript. The NMR data could also be strengthened, as the binding interface is not reported or mapped onto the structure/model; this seems of considerable interest given that highly related proteins do have the same activity.

      The paper is for the most part well written and is potentially highly significant.

      Comments on revised version: 

      The revised manuscript has addressed many of the concerns raised and clarified a number of points. As a result the manuscript is improved. 

      The primary concern that remains is the absence of biological function for Ub-ssDNA/RNA and the inability to detect it in cells. Despite this the manuscript will be of interest to those in the ubiquitin field and will likely provoke further studies and the development of tools to better assess the cellular relevance. As a result this manuscript is important. 

      We agree with the reviewer’s assessment.

      Minor issue: 

      Figure 1A - the authors have now included the constructs used but it would be more informative if the authors lined up the various constructs under the relevant domains in the full-length protein. 

      Figure 1 will be fixed in the Version of Record.

      Reviewer #2 (Public Review):

      The manuscript by Dearlove et al. entitled "DTX3L ubiquitin ligase ubiquitinates single-stranded nucleic acids" reports a novel activity of a DELTEX E3 ligase family member, DTX3L, which can conjugate ubiquitin to the 3' hydroxyl of single-stranded oligonucleotides via an ester linkage. The findings that unmodified oligonucleotides can act as substrates for direct ubiquitylation and the identification of DTX3 as the enzyme capable of performing such oligonucleotide modification are novel, intriguing, and impactful because they represent a significant expansion of our view of the ubiquitin biology. The authors perform a detailed and diligent biochemical characterization of this novel activity, and key claims made in the article are well supported by experimental data. However, the studies leave room for some healthy skepticism about the physiological significance of the unique activity of DTX3 and DTX3L described by the authors because DTX3/DTX3L can also robustly attach ubiquitin to the ADP ribose moiety of NAD or ADP-ribosylated substrates. The study could be strengthened by a more direct and quantitative comparison between ubiquitylation of unmodified oligonucleotides by DTX3/DTX3L with the ubiquitylation of ADP-ribose, the activity that DTX3 and DTX3L share with the other members of the DELTEX family.

      Comment on revised version:

      In my opinion, reviewers' comments are constructively addressed by the authors in the revised manuscript, which further strengthens the revised submission and makes it an important contribution to the field. Specifically, the authors perform a direct quantitative comparison of two distinct ubiquitylation substrates, unmodified oligonucleotides and fluorescently labeled NADH and report that kcat/Km is 5-fold higher for unmodified oligos compared to NADH. This observation suggests that ubiquitylation of unmodified oligos is not a minor artifactual side reaction in vitro and that unmodified oligonucleotides may very well turn out to be the physiological substrates of the enzyme. However, the true identity of the physiological substrates and the functionally relevant modification site(s) remain to be established in further studies. 

      We agree with the reviewer’s assessment.


      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      In the article by Dearlove et al., the authors present evidence in strong support of nucleotide ubiquitylation by DTX3L, suggesting it is a promiscuous E3 ligase with capacity to ubiquitylate ADP ribose and nucleotides. The authors include data to identify the likely site of attachment and the requirements for nucleotide modification. 

      While this discovery potentially reveals a whole new mechanism by which nucleotide function can be regulated in cells, there are some weaknesses that should be considered. Is there any evidence of nucleotide ubiquitylation occurring in cells? It seems possible, but evidence in support of this would strengthen the manuscript. The NMR data could also be strengthened, as the binding interface is not reported or mapped onto the structure/model; this seems of considerable interest given that highly related proteins do have the same activity.

      The paper is for the most part well written and is potentially highly significant, but it could be strengthened as follows:

      (1) The authors start out by showing DTX3L binding to nucleotides and ubiquitylation of ssRNA/DNA. While ubiquitylation is subsequently dissected and ascribed to the RD domains, the binding data is not followed up. Does the RD protein alone bind to the nucleotides? Further analysis of nucleotide binding is also relevant to the Discussion where the role of the KH domains is considered, but the binding properties of these alone have not been analysed. 

      We thank the reviewer for the suggestion. We have tested DTX3L RD for ssDNA binding using NMR (see Figure 4A and Figure S2), which showed that DTX3L RD binds ssDNA. We have now tested the DTX3L KH domains for RNA/ssDNA binding using an FP experiment. However, the FP experiment did not show significant changes upon titrating RNA/ssDNA, suggesting that the KH domains alone are not sufficient to bind RNA/ssDNA. We have opted to put this data in the response-to-review as future investigation will be required to examine whether other regions of DTX3L cooperate with RD to bind RNA/ssDNA. We have revised the Discussion on the KH domains. We now state that “Our findings show the DTX3L DTC domain binds nucleic acids but whether the KHL domains contribute to nucleic acid binding requires further investigation.”

      Author response image 1.

      Fold change of fluorescence polarisation of 6-FAM-labelled ssDNA D4 upon titrating with DTX3L variants. DTX3L KH domain fragments were expressed with an N-terminal His-MBP tag to increase the molecular weight and thereby enhance the signal.

      (2) With regard to the E3 ligase activity, can the authors account for the apparent decreased ubiquitylation activity of the 232-C protein in Figure 1/S1 compared to FL and RD? 

      We found that the 232-C protein batch used in the assay was not pure and have subsequently re-purified the protein. We have repeated the ubiquitination of ssDNA and RNA (Fig. 1H and 1I), and 232-C exhibited activity similar to that of WT. Furthermore, we performed autoubiquitination (Fig. S1G) and E2~Ub discharge assays (Fig. S1H) to compare the activity. 232-C was slower in autoubiquitination (Fig. S1G), but showed activity similar to that of WT in the E2~Ub discharge assay. These findings suggest that the RING domain in 232-C is functional and that 232-C likely lacks ubiquitination site(s) present in the 1-231 region that are necessary for autoubiquitination.

      (3) Was it possible to positively identify the link between Ub and ssDNA/RNA using mass spectrometry? This would overcome issues associated with labels blocking binding rather than modification. 

      We have tried to use mass spectrometry to detect the linkage between Ub and ssDNA/RNA, but were unable to do so. We suspect that the oxyester linkage might be labile, posing a challenge for mass spectrometry techniques. Similarly, a recent preprint from the Ahel lab, which utilises LC-MS, detects the Ub-NMP product rather than the linkage (https://www.biorxiv.org/content/10.1101/2024.04.19.590267v1.full.pdf).

      (4) Furthermore, can a targeted MS approach be used to show that nucleotides are ubiquitylated in cells? 

      This will require future development and improvement of the MS approach, specifically the isolation of labile oxyester-linked products from cells and the optimisation of the MS detection method.

      (5) Do the authors have the assignments (even partial?) for DTX3L RD? In Figure 4 it would be helpful to identify the peaks that correspond to the residues at the proposed binding site. Also do the shifts map to a defined surface or do they suggest an extended site, particularly for the ssDNA.

      We only collected HSQC spectra, which were insufficient for assignments. We have performed a competition experiment using ADPr and labelled ssDNA, showing that ADPr competes against the ubiquitination of ssDNA (Figure 4D). We have also provided an additional experiment showing that ssDNA with a blocked 3’-OH can compete against ubiquitination of ADPr (Figure 4E). These data, together with our NMR analysis, further strengthen the evidence that ssDNA and ADPr compete for the same binding pocket in DTX3L RD. Understanding how DTX3L RD binds ssDNA/RNA is the subject of ongoing research in the lab.

      (6) Does sequence analysis help explain the specificity of activity for the family of proteins? 

      We have performed sequence alignment and structure comparison of DTX proteins using both RING and DTC domains (Fig. S3). These analyses showed that DTX3 and DTX3L RING domains lack an N-terminal helix and two loop insertions compared to DTX1, DTX2 and DTX4. These additions make the DTX1, DTX2 and DTX4 RING domains larger than those of DTX3L and DTX3. It is not clear how these would influence the orientation of the recruited E2~Ub. Comparison of the DTC domain showed that DTX1, DTX2 and DTX4 contain an Ala-Arg motif, which causes a bulge at one end of the DTC pocket. In the absence of the Ala-Arg motif, the DTC pockets of DTX3 and DTX3L contain an extended groove which might accommodate one or more of the nucleotides 5' to the targeted terminal nucleotide. It seems that both features of the RING and DTC domains might contribute to the specificity of DTX3L and DTX3. We have included these comparisons in the discussion and suggested that future structural characterization is necessary to unveil the specificity.

      (7) While including a summary mechanism (Figure 5I) is helpful, the schematic included does not necessarily make it easier for the reader to appreciate the key findings of the manuscript or to account for the specificity of activity observed. While this figure could be modified, it might also be helpful to highlight the range of substrates that DTX3L can modify - nucleotide, ADPr, ADPr on nucleotides etc. 

      We have modified this Figure to include the range of substrates.

      Reviewer #2 (Public Review): 

      Summary: 

      The manuscript by Dearlove et al. entitled "DTX3L ubiquitin ligase ubiquitinates single-stranded nucleic acids" reports a novel activity of a DELTEX E3 ligase family member, DTX3L, which can conjugate ubiquitin to the 3' hydroxyl of single-stranded oligonucleotides via an ester linkage. The findings that unmodified oligonucleotides can act as substrates for direct ubiquitylation and the identification of DTX3 as the enzyme capable of performing such oligonucleotide modification are novel, intriguing, and impactful because they represent a significant expansion of our view of the ubiquitin biology. The authors perform a detailed and diligent biochemical characterization of this novel activity, and key claims made in the article are well supported by experimental data. However, the studies leave room for some healthy skepticism about the physiological significance of the unique activity of DTX3 and DTX3L described by the authors because DTX3/DTX3L can also robustly attach ubiquitin to the ADP ribose moiety of NAD or ADP-ribosylated substrates. The study could be strengthened by a more direct and quantitative comparison between ubiquitylation of unmodified oligonucleotides by DTX3/DTX3L with the ubiquitylation of ADP-ribose, the activity that DTX3 and DTX3L share with the other members of the DELTEX family. 

      Strengths: 

      The manuscript reports a novel and exciting observation that ubiquitin can be directly attached to the 3' hydroxyl of unmodified, single-stranded oligonucleotides by DTX3L. The study builds on the extensive expertise and the impactful previous studies by the Huang laboratory of the DELTEX family of E3 ubiquitin ligases. The authors perform a detailed and diligent biochemical characterization of this novel activity, and all claims made in the article are well supported by experimental data. The manuscript is clearly written and easy to read, which further elevates the overall quality of submitted work. The findings are impactful and will help illuminate multiple avenues for future follow-up investigations that may help establish how this novel biochemical activity observed in vitro may contribute to the biological function of DTX3L. The authors demonstrate that the activity is unique to the DTX3/DTX3L members of the DELTEX family and show that the enzyme requires at least two single-stranded nucleotides at the 3' end of the oligonucleotide substrate and that the adenine nucleotide is preferred in the 3' position. Most notably, the authors describe a chimeric construct containing the RING domain of DTX3L fused to the DTC domain of DTX2, which displays robust NAD ubiquitylation, but lacks the ability to ubiquitylate unmodified oligonucleotides. This construct will be invaluable in future cell-based studies of DTX3L biology that may help establish the physiological relevance of 3' ubiquitylation of nucleic acids.

      Weaknesses: 

      The main weakness of the study is in the lack of direct evidence that the ubiquitylation of unmodified oligonucleotides reported by the authors plays any role in the biological function of DTX3L. The study leaves plenty of room for natural skepticism regarding the physiological relevance of the reported activity, because, akin to other DELTEX family members, DTX3 and DTX3L can also catalyze attachment of ubiquitin to NAD, ADP ribose and ADP-ribosylated substrates. Unfortunately, the study does not offer any quantitative comparison of the two distinct activities of the enzyme, which leaves plenty of room for doubt. One is left wondering, whether ubiquitylation of unmodified oligonucleotides is just a minor and artifactual side activity owing to the high concentration of the oligonucleotide substrates and E2~Ub conjugates present in the in-vitro conditions and the somewhat lower specificity of the DTX3 and DTX3L DTC domains (compared to DTX2 and other DELTEX family members) for ADP ribose over other adenine-containing substrates such as unmodified oligonucleotides, ADP/ATP/dADP/dATP, etc. The intriguing coincidence that DTX3L, which is the only DTX protein capable of ubiquitylating unmodified oligonucleotides, is also the only family member that contains nucleic acid interacting domains in the N-terminus, is suggestive but not compelling. A recently published DTX3L study by a competing laboratory (PMID: 38000390), which is not cited in the manuscript, suggests that ADP-ribose-modified nucleic acids could be the physiologically relevant substrates of DTX3L. That competing hypothesis appears more convincing than ubiquitylation of unmodified oligonucleotides because experiments in that study demonstrate that ubiquitylation of ADP-ribosylated oligos is quite robust in comparison to ubiquitylation of unmodified oligos, which is undetectable. 
      It is possible that the unmodified oligonucleotides in the competing study did not have adenine in the 3' position, which may explain the apparent discrepancy between the two studies. In summary, a quantitative comparison of ubiquitylation of ADP ribose vs. unmodified oligonucleotides could strengthen the study.

      We thank the reviewer for the constructive feedback. We agree that evidence for the biological function is lacking. While we have tried to detect Ub-ssDNA/RNA from cells, we found that isolating and detecting labile oxyester-linked Ub-ssDNA/RNA products remains challenging due to (1) low levels of Ub-ssDNA/RNA products, (2) the presence of DUBs and nucleases that rapidly remove the products during the experiments, and (3) our lack of a suitable MS approach to detect the product. For these reasons, we feel that discovering the biological function will require future effort and expertise and is beyond the scope of our current manuscript.

      In the manuscript (PMID: 38000390), the authors used PARP10 to catalyse ADP-ribosylation onto 5’-phosphorylated ssDNA/RNA. They used the following sequences, which lack a 3’-adenosine; this could explain the lack of ubiquitination.

      E15_5′P_RNA [Phos]GUGGCGCGGAGACUU

      E15_5′P_DNA [Phos]GTGGCGCGGAGACTT

      We have performed the experiment using this sequence to verify this (see Author response image 2 below). We have cited this manuscript, but for some reason PubMed has updated its publication date from mid-2023 to January 2024. We have updated the EndNote reference in the revised manuscript.

      Author response image 2.

      Fluorescently detected SDS-PAGE gel of in vitro ubiquitination catalysed by DTX3L-RD in the presence of ubiquitination components and 6-FAM-labelled ssDNA D4 or D31.

      We agree that it is crucial to compare ubiquitination of oligonucleotides and ADPr by DTX3L to find its preferred substrate. We have challenged oligonucleotide ubiquitination by adding excess ADPr and found that ADPr efficiently competes with oligonucleotide (Figure 4D). We have also performed an experiment showing that ssDNA with a blocked 3’-OH can compete against ubiquitination of ADPr (Figure 4E). These data support that ADPr and ssDNA compete for the same binding site on DTX3L.

      We also performed kinetic analysis of ubiquitination of fluorescently labelled ssDNA (D4) and NAD+ by DTX3L-RD (Fig. 4F and Fig. S2D–G) to assess substrate preferences. Here, we used fluorescently labelled NAD+ (F-NAD+) in place of ADPr, as labelled NAD+ is commercially available. With known concentrations of fluorescently labelled ssDNA and NAD+ as standards, we could estimate the rate of ubiquitinated product formation across different substrate concentrations. We have included this finding in the main text: “DTX3L-RD displayed a _k_cat value of 0.0358 ± 0.0034 min⁻¹ and a _K_m value of 6.56 ± 1.80 µM for Ub-D4 formation, whereas the Michaelis-Menten curve did not reach saturation for Ub-F-NAD+ formation (Fig. 4F and fig. S2, D-G). Comparison of the estimated catalytic efficiency (_k_cat/_K_m = 5457 M⁻¹ min⁻¹ for D4 and estimated _k_cat/_K_m = 1190 M⁻¹ min⁻¹ for F-NAD+; Fig. 4F) suggested that DTX3L-RD exhibited 4.5-fold higher catalytic efficiency for D4 than F-NAD+. This difference primarily results from a better _K_m value for D4 compared to F-NAD+. Although DTX3L-RD showed weak _K_m for F-NAD+, it displays a higher rate for converting F-NAD+ to Ub-F-NAD+ at higher substrate concentration (Fig. 4F). Thus, substrate concentration will play a role in determining the preference.”
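
      The quoted catalytic efficiencies can be checked with a short calculation. The sketch below assumes that the _K_m for Ub-D4 formation is expressed in micromolar, since that is the unit that reproduces the quoted _k_cat/_K_m of 5457 M⁻¹ min⁻¹ from the quoted _k_cat. The variable and function names are illustrative only.

```python
# Values quoted in the response for Ub-D4 formation by DTX3L-RD.
K_CAT_D4_PER_MIN = 0.0358   # k_cat, in min^-1
K_M_D4_MICROMOLAR = 6.56    # K_m, assumed here to be in micromolar

# Estimated k_cat/K_m quoted for F-NAD+ (its curve did not reach saturation).
EFFICIENCY_F_NAD = 1190.0   # in M^-1 min^-1


def catalytic_efficiency(k_cat_per_min, k_m_micromolar):
    """Return k_cat/K_m in M^-1 min^-1, converting K_m from micromolar to molar."""
    return k_cat_per_min / (k_m_micromolar * 1e-6)


efficiency_d4 = catalytic_efficiency(K_CAT_D4_PER_MIN, K_M_D4_MICROMOLAR)
fold_difference = efficiency_d4 / EFFICIENCY_F_NAD

print(f"k_cat/K_m (D4): {efficiency_d4:.0f} M^-1 min^-1")
print(f"D4 vs F-NAD+: {fold_difference:.1f}-fold")
```

      With these inputs the D4 efficiency matches the quoted 5457 M⁻¹ min⁻¹, and the ratio to F-NAD+ comes out close to the roughly 4.5-fold difference reported.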

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      Writing/technical points: 

      (1) The introduction is relatively complex and the last paragraph, which reviews the discoveries on the paper, is long. It may be helpful to highlight the significance and frame the experiments as what they have addressed, rather than detailing each set of experiments completed. 

      We have modified the last paragraph in the introduction to highlight the major discovery of our work.

      (2) Line 24, Abstract. 'Its N-terminal region' is not obvious 

      We have changed “Its N-terminal region” to “the N-terminal region of DTX3L”.

      (3) Line 44 - split sentence to emphasize E3 ligase point? 

      We have modified the sentence as suggested.

      (4) Figures 1B and 1C could be larger - currently they are not very helpful. Also atoms (ADPr?) are shown, but not indicated in the legend or labelled on the panel. 

      We have enlarged Figures 1B and 1C and indicated RNA on the structure.

      (5) The structure of the D2 domain of DTX3L has recently been reported (Vela-Rodriguez et al). It might be helpful to comment on this manuscript. 

      We have now commented on the D2 domain in the results section and in the discussion.

      (6) It would be helpful to indicate the DTX3L constructs used in Figure 1a. 

      We have included all DTX3L constructs used in Figure 1a.

      (7) Interpretation of Figure 4A is difficult, the authors may wish to consider other ways to visualize the data. 

      We have now removed the black arrow in Figure 4A as it was confusing. Instead, we drew a black box on the cross-peak where the close-up views are shown in Figures 4B and 4C.

      (8) Figure 4A. Please indicate which binding partner is highlighted by red/black arrows. 

      We have removed the black arrow. The red arrows indicate cross-peaks that undergo chemical shift perturbation when DTX3L-RD was titrated with ssDNA or ADPr, indicating that their binding sites on DTX3L-RD overlap.

      (9) Line 284 - please indicate the bulge in Figure S3. 

      We have indicated the bulge on Figure S3.

      (10) Aspects of the discussion are speculative, given that evidence of Ub conjugated to nucleotides in cells is yet to be obtained and the functional consequences of modification are uncertain. 

      We understand that the discussion on the potential roles of ubiquitination of ssNAs is speculative. We have now modified it to: “Based on the known functions of the DTX3L/PARP9 complex and the findings of this study, we propose several hypotheses for future research”, so that readers will understand that these are speculative.

      (11) Line 295 onwards - this paragraph discusses the role of the KH domains in nucleotide binding, but it is not clear that the authors have directly demonstrated that the KH domains bind nucleotides as all constructs used in the binding experiments in Figure 1/S1 include the RING-DTC domains. 

      We found that the KH domains alone did not bind ssDNA or RNA. We have modified line 295. This section now reads: “Typically, KH domains contain a GXXG motif within the loop between the first and second α helix (22). However, analysis of the sequence of the KHL domains in DTX3L shows these domains lack this motif. Multiple studies have shown that mutation in this motif abolishes binding to nucleic acids (23-26). Our findings show the DTX3L DTC domain binds nucleic acids but whether the KHL domains contribute to nucleic acid binding requires further investigation. Additionally, the structure of the first KHL domain was recently reported and shown to form a tetrameric assembly (20). Our analysis with DTX3L 232-C, which lacks the first KHL domain and RRM, indicates that it can still bind ssDNA and ssRNA. Despite this, a more detailed analysis will be required to determine whether oligomerization plays a role in nucleic acid binding and ubiquitination.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      Summary:

      In this study, Nishi et al. claim that the ratio of long-term hematopoietic stem cell (LT-HSC) versus short-term HSC (ST-HSC) determines the lineage output of HSCs and reduced ratio of ST-HSC in aged mice causes myeloid-biased hematopoiesis. The authors used Hoxb5 reporter mice to isolate LT-HSC and ST-HSC and performed molecular analyses and transplantation assays to support their arguments. How the hematopoietic system becomes myeloid-biased upon aging is an important question with many implications in the disease context as well. However, their study is descriptive with remaining questions.

      Weaknesses:

      Comment #1-1: The authors may need conceptual re-framing of their main argument because whether the ST-HSCs used in this study are functionally indeed short-term "HSCs" is questionable. The data presented in this study and their immunophenotypic definition of ST-HSCs (Lineage negative/Sca-1+/c-Kit+/Flk2-/CD34-/CD150+/Hoxb5-) suggest that authors may find hematopoietic stem cell-like lymphoid progenitors as previously shown for megakaryocyte lineage (Haas et al., Cell stem cell. 2015) or, as the authors briefly mentioned in the discussion, Hoxb5- HSCs could be lymphoid-biased HSCs.

      The authors disputed the idea that Hoxb5- HSCs are lymphoid-biased HSCs based on their previous 4 weeks post-transplantation data (Chen et al., 2016). However, they overlooked the possibility of myeloid reprogramming of the lymphoid-biased population during regenerative conditions (Pietras et al., Cell stem cell., 2015). In other words, early post-transplant ST-HSCs (Hoxb5- HSCs) can be seen as lacking the phenotypic lymphoid-biased HSCs.

      Thinking of their ST-HSCs as hematopoietic stem cell-like lymphoid progenitors or lymphoid-biased HSCs makes more sense conceptually as well.

      Response #1-1: We appreciate this important suggestion and recognize the significance of the debate on whether Hoxb5- HSCs are ST-HSCs or lymphoid-biased HSCs.

      HSCs are defined by their ability to retain hematopoietic potential after a secondary transplantation (1, 2). If Hoxb5- HSCs were indeed lymphoid-biased HSCs, they would exhibit predominantly lymphoid hematopoiesis even after secondary transplantation. However, functional experiments demonstrate that these cells lose their hematopoietic output after secondary transplantation (3) (see Fig. 2 in this paper). Based on the established definition of HSCs in this field, it is appropriate to classify Hoxb5- HSCs as ST-HSCs rather than lymphoid-biased HSCs.

      Additionally, it has been reported that myeloid reprogramming may occur in the early post-transplant period, around 2-4 weeks after transplantation, even in lymphoid-biased populations within the MPP fraction, due to high inflammatory conditions (4). However, when considering the post-transplant hematopoiesis of Hoxb5- HSC fractions as ST-HSCs, they exhibit almost the same myeloid hematopoietic potential as LT-HSCs not only during the early 4 weeks after transplantation but also at 8 weeks post-transplantation (3), when the acute inflammatory response has largely subsided. Therefore, it is difficult to attribute the myeloid production by ST-HSCs post-transplant solely to myeloid reprogramming.

      References

      (1) Morrison, S. J. & Weissman, I. L. The long-term repopulating subset of hematopoietic stem cells is deterministic and isolatable by phenotype. Immunity 1, 661–673 (1994).

      (2) Challen, G. A., Boles, N., Lin, K. K. Y. & Goodell, M. A. Mouse hematopoietic stem cell identification and analysis. Cytom. Part A 75, 14–24 (2009).

      (3) Chen, J. Y. et al. Hoxb5 marks long-term haematopoietic stem cells and reveals a homogenous perivascular niche. Nature 530, 223–227 (2016).

      (4) Pietras, E. M. et al. Functionally Distinct Subsets of Lineage-Biased Multipotent Progenitors Control Blood Production in Normal and Regenerative Conditions. Cell Stem Cell 17, 35–46 (2015).

      Comment #1-2: ST-HSCs come from LT-HSCs and further differentiate into lineage-biased multipotent progenitor (MPP) populations including myeloid-biased MPP2 and MPP3. Based on the authors' claim, LT-HSCs (Hoxb5+ HSCs) have no lineage bias even in aged mice. Then these LT-HSCs make ST-HSCs, which produce mostly memory T cells. These memory T cell-producing ST-HSCs then produce MPPs including myeloid-biased MPP2 and MPP3.

      This differentiation trajectory is hard to accept. If we think of Hoxb5- HSCs (ST-HSCs by authors) as a sub-population of immunophenotypic HSCs with lymphoid lineage bias or hematopoietic stem cell-like lymphoid progenitors, the differentiation trajectory has no flaw.

      Response #1-2: Thank you for this comment, and we apologize for the misunderstanding regarding the predominance of memory T cells in ST-HSCs after transplantation. 

      Our data show that ST-HSCs are not biased HSCs that predominantly produce memory T cells, but rather, ST-HSCs are multipotent hematopoietic cells. ST-HSCs lose their ability to self-renew within a short period, resulting in the cessation of ST-HSC-derived hematopoiesis. As a result, myeloid lineage with a short half-life disappears from the peripheral blood, and memory lymphocytes with a long half-life remain (see Figure 5 in this paper). 

      Comment #1-3: Authors' experimental designs have some caveats to support their claims. Authors claimed that aged LT-HSCs have no myeloid-biased clone expansion using transplantation assays. In these experiments, authors used 10 HSCs and young mice as recipients. Given the huge expansion of old HSC by number and known heterogeneity in immunophenotypically defined HSC populations, it is questionable how 10 out of so many old HSCs can faithfully represent the old HSC population. The Hoxb5+ old HSC primary and secondary recipient mice data (Figure 2C and D) support this concern. In addition, they only used young recipients. Considering the importance of the inflammatory aged niche in the myeloid-biased lineage output, transplanting young vs old LT-HSCs into aged mice will complete the whole picture.

      Response #1-3: We appreciate the reviewer for the comments. We acknowledge that using ten HSCs may not capture the heterogeneity of aging HSCs.

      However, although most of our experiments used a small number of transplanted cells (e.g., 10 cells), we conducted functional experiments across Figures 2, 3, 5, 6, S3, and S6, totaling n = 126, equivalent to over 1260 cells. Previous studies have reported that myeloid-biased HSCs constitute more than 50% of the aged HSC population (1, 2). If myeloid-biased HSCs increase with age, they should be detectable in our experiments. Our functional experiments have consistently shown that Hoxb5+ HSCs exhibit unchanged lineage output throughout life. In contrast, the data presented in this paper indicate that changes in the ratio of LT-HSCs to ST-HSCs may contribute to myeloid-biased hematopoiesis.
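
      The sampling argument above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each transplanted cell is drawn independently from the aged HSC pool (a simple binomial model that the response itself does not state) and uses the 50% lower bound cited above. The function name is hypothetical.

```python
def prob_no_biased_cell(fraction_biased: float, cells_per_graft: int) -> float:
    """Probability that a graft contains zero myeloid-biased cells.

    Assumes cells are sampled independently (an illustrative assumption,
    not a statement about how the grafts were actually prepared).
    """
    if not 0.0 <= fraction_biased <= 1.0:
        raise ValueError("fraction_biased must be between 0 and 1")
    return (1.0 - fraction_biased) ** cells_per_graft


# Values quoted in the response: 10 cells per graft, n = 126 in total,
# and myeloid-biased HSCs making up at least 50% of aged HSCs.
cells_per_graft = 10
total_n = 126
fraction_biased = 0.5

total_cells = cells_per_graft * total_n
p_missed = prob_no_biased_cell(fraction_biased, cells_per_graft)

print(f"total transplanted cells: {total_cells}")
print(f"P(no biased cell in one 10-cell graft) = {p_missed:.6f}")
```

      Under these assumptions a single 10-cell graft would rarely miss a population that makes up half of the pool. The sketch says nothing about whether such cells would engraft or be detected functionally.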

      We believe that transplanting aged HSCs into aged recipient mice is crucial to analyzing not only the differentiation potential of aged HSCs but also the changes in their engraftment and self-renewal abilities. We aim to clarify further findings through these experiments in the future.

      References

      (1) Dykstra B, Olthof S, Schreuder J, Ritsema M, de Haan G. Clonal analysis reveals multiple functional defects of aged murine hematopoietic stem cells. J Exp Med. 2011 Dec 19;208(13):2691–703.

      (2) Yamamoto R, Wilkinson AC, Ooehara J, Lan X, Lai CY, Nakauchi Y, et al. Large-Scale Clonal Analysis Resolves Aging of the Mouse Hematopoietic Stem Cell Compartment. Cell Stem Cell [Internet]. 2018;22(4):600-607.e4. Available from: https://doi.org/10.1016/j.stem.2018.03.013

      Comment #1-4: The authors' molecular data analyses need more rigor with unbiased approaches. They claimed that neither aged LT-HSCs nor aged ST-HSCs exhibited myeloid or lymphoid gene set enrichment but aged bulk HSCs, which are just a sum of LT-HSCs and ST-HSCs by their gating scheme (Figure 4A), showed the "tendency" of enrichment of myeloid-related genes based on the selected gene set (Figure 4D). Although the proportion of ST-HSCs is reduced in bulk HSCs upon aging, since ST-HSCs do not exhibit lymphoid gene set enrichment based on their data, it is hard to understand how aged bulk HSCs have more myeloid gene set enrichment compared to young bulk HSCs. This bulk HSC data rather suggests that there could be a trend toward certain lineage bias (although not significant) in aged LT-HSCs or ST-HSCs. The authors need to verify the molecular lineage priming of LT-HSCs and ST-HSCs using another comprehensive dataset.

      Response #1-4: Thank you for pointing out that neither aged LT-HSCs nor aged ST-HSCs exhibited myeloid or lymphoid gene set enrichment, although aged bulk HSCs showed a tendency towards enrichment of myeloid-related genes.

      The actual GSEA result had an FDR > 0.05. Therefore, we cannot claim that bulk HSCs showed significant enrichment of myeloid-related genes with age. Consequently, we have revised the following sentences:

      [P11, L251] Neither aged LT-HSCs nor aged ST-HSCs exhibited myeloid/lymphoid gene set enrichment, while shared myeloid-related genes tended to be enriched in aged bulk-HSCs, although this enrichment was not statistically significant (Fig. 4, F and G).

      In addition to the above, we also found that the GSEA results differ among myeloid gene sets (Fig. 4, D-F; Fig. 4S, C-D). These findings suggest that discussing lineage bias in HSCs using GSEA is challenging. We believe that functional experimental data are crucial. In our functional experiments, when the ratio of LT-HSCs to ST-HSCs was reconstituted to match the ratio in young bulk-HSCs (LT:ST = 2:8) or aged bulk-HSCs (LT:ST = 5:5), myeloid-biased hematopoiesis was observed with the aged bulk-HSC ratio. Based on these data, we concluded that age-related changes in the ratio between LT-HSCs and ST-HSCs in bulk-HSCs cause myeloid-biased hematopoiesis, rather than an increase in myeloid gene expression in the aged bulk-HSCs.

      Comment #1-5: Some data are too weak to fully support their claims. The authors claimed that age-associated extramedullary changes are the main driver of myeloid-biased hematopoiesis based on no major differences in progenitor populations upon transplantation of 10 young HSCs into young or old recipient mice (Figure 7F) and relatively low donor-derived cells in thymus and spleen in aged recipient mice (Figure 7G-J). However, they used selected mice to calculate the progenitor populations in recipient mice (8 out of 17 from young recipients denoted by * and 8 out of 10 from aged recipients denoted by * in Figure 7C). In addition, they calculated the progenitor populations as frequency in c-kit positive cells. Given that they transplanted 10 LT-HSCs into "sub-lethally" irradiated mice and 8.7 Gy irradiation can have different effects on bone marrow clearance in young vs old mice, it is not clear whether this data is reliable enough to support their claims. The same concern applies to the data Figure 7G-J. Authors need to provide alternative data to support their claims.

      Response #1-5: Thank you for these useful comments. Our claim regarding Fig. 7 is that age-associated extramedullary changes are merely additional drivers of myeloid-biased hematopoiesis, not the main drivers. Nevertheless, we address the issues raised below.

      Regarding the reason for analyzing the asterisk mice

      We performed two independent experiments for Fig. 7. In the first experiment, we planned to analyze the BM of recipients 16 weeks after transplantation. However, as shown in Fig. 7B, many of the aged mice died before 16 weeks. Therefore, we decided to examine the BM of the recipient mice at 12 weeks in the second experiment. Below are the peripheral blood results 11-12 weeks after transplantation for the mice used in the second experiment.

      Author response image 1.

      For the second experiment, we analyzed the BM of all eight aged recipients. Then, we selected the same number of young recipients for analysis to ensure that the donor myeloid output would be comparable to that of the entire young group. Indeed, the donor myeloid lineage output of the selected mice was 28.1 ± 22.9%, closely matching the 23.5 ± 23.3% (p = 0.68) observed in the entire young recipient population.

      That being said, as the reviewer pointed out, it is a notable limitation that the BM, thymus, and spleen were not analyzed in all mice. Hence, we have added the following sentences:

      [P14, L327] We performed BM analysis for the mice denoted by † in Figure 7C because many of the aged mice had died before the analysis.

      [P15, L338] The thymus and spleen analyses were also performed on the mice denoted by † in Figure 7C.

      Regarding the reason for 8.7 Gy.

      Thank you for your question about whether 8.7 Gy is myeloablative. In our previous report (1), we demonstrated that none of the mice subjected to pre-treatment with 8.7 Gy could survive when non-LKS cells were transplanted, suggesting that 8.7 Gy is sufficiently myeloablative with the radiation equipment at our facility.

      Author response image 2.

      Reference

      (1)  Nishi K, Sakamaki T, Sadaoka K, Fujii M, Takaori-Kondo A, Chen JY, et al. Identification of the minimum requirements for successful haematopoietic stem cell transplantation. Br J Haematol. 2022;196(3):711–23. 

      Regarding the normalization of c-Kit in Figure 7F.  

      Firstly, as shown in Supplemental Figures S1B and S1C, we analyzed the upstream (HSC, MPP, Flk2+) and downstream (CLP, MEP, CMP, GMP) fractions in different panels. Therefore, normalization is required to assess the differentiation of HSCs from upstream to downstream. Additionally, we normalized by c-Kit+ cells because the bone marrow analysis was performed after enrichment using the anti-c-Kit antibody for both upstream and downstream fractions. Based on this, we calculated the progenitor populations as a frequency within the c-Kit positive cells.

      Next, the results of normalizing to whole bone marrow cells (live cells) are shown below.

      Author response image 3.

      As with normalization to c-Kit+ cells, myeloid progenitors remained largely unchanged, apart from a statistically significant decrease in CMP in aged mice. Additionally, there were no significant differences in CLP. In conclusion, we obtained similar results with normalization to c-Kit+ cells and normalization to whole bone marrow cells (live cells).

      However, as the reviewer pointed out, it is necessary to explain the reason for normalization with c-Kit. Therefore, we will add the following description.

      [P21, L502] For the combined analysis of the upstream (HSC, MPP, Flk2+) and downstream (CLP, MEP, CMP, GMP) fractions in Figures 1B and 7F, we normalized by c-Kit+ cells because we performed a c-Kit enrichment for the bone marrow analysis.

      Reviewer #2:

      Summary:  

      Nishi et al. investigate the well-known and previously described phenomenon of age-associated myeloid-biased hematopoiesis. Using a previously established HoxB5-mCherry mouse model, they used HoxB5+ and HoxB5- HSCs to discriminate cells with long-term (LT-HSCs) and short-term (ST-HSCs) reconstitution potential and compared these populations to immunophenotypically defined 'bulk HSCs' that consists of a mixture of LT-HSC and ST-HSCs. They then isolated these HSC populations from young and aged mice to test their function and myeloid bias in non-competitive and competitive transplants into young and aged recipients. Based on quantification of hematopoietic cell frequencies in the bone marrow, peripheral blood, and in some experiments the spleen and thymus, the authors argue against the currently held belief that myeloid-biased HSCs expand with age.

      Comment #2-1: While aspects of their work are fascinating and might have merit, several issues weaken the overall strength of the arguments and interpretation. Multiple experiments were done with a very low number of recipient mice, showed very large standard deviations, and had no statistically detectable difference between experimental groups. While the authors conclude that these experimental groups are not different, the displayed results seem too variable to conclude anything with certainty. The sensitivity of the performed experiments (e.g. Figure 3; Figure 6C, D) is too low to detect even reasonably strong differences between experimental groups and is thus inadequate to support the author's claims. This weakness of the study is not acknowledged in the text and is also not discussed. To support their conclusions the authors need to provide higher n-numbers and provide a detailed power analysis of the transplants in the methods section.

      Response #2-1: Thank you for your important remarks. The power analysis for this experiment shows that power = 0.319, suggesting that a larger sample size may be needed. That said, we determined the sample size for Figure 3 as follows:

      (1) First, we checked whether a myeloid-biased change is detected in the bulk-HSC fraction (Figure S3). The results showed that the difference in myeloid output at 16 weeks after transplantation was statistically significant (young vs. aged = 7.2 ± 8.9 vs. 42.1 ± 35.5%, p = 0.01), even though n = 10.

      (2) Next, myeloid-biased HSCs have been reported to be a fraction with high self-renewal ability (2004, Blood). If myeloid-biased HSCs increase with aging, this increase should be detected with higher sensitivity in the LT-HSC fraction than in the bulk-HSC fraction used in Figure S3.

      (3) However, not only was the p-value non-significant, but the means themselves were nearly identical (young vs. aged = 51.4 ± 31.5% vs. 47.4 ± 39.0%, p = 0.82; n = 8 in Figure 3). Since the means barely differ, it is highly likely that no difference would be detected even if n were increased further.
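
      The reasoning in steps (1) to (3) can be sketched numerically. The following is a minimal illustration, not the power analysis used for the manuscript: it uses a normal approximation to a two-sided comparison of two means rather than an exact t-test, it assumes n is the per-group size, and it does not attempt to reproduce the reported power of 0.319, whose inputs are not given here. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(mean_a, sd_a, mean_b, sd_b, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means.

    Uses a normal approximation (not an exact t-test), so it is slightly
    optimistic for small groups.
    """
    z = NormalDist()
    # Standard error of the difference between the two group means.
    se = sqrt(sd_a ** 2 / n_per_group + sd_b ** 2 / n_per_group)
    shift = abs(mean_a - mean_b) / se
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Means and SDs quoted for Figure 3 (young vs. aged LT-HSCs).
# n = 8 per group is an assumption about how "n = 8" should be read.
for n in (8, 50, 500):
    print(n, round(approx_power(51.4, 31.5, 47.4, 39.0, n), 3))
```

      With an observed difference this small relative to the spread, the approximate power rises only slowly with group size, which is the point made in step (3). It also shows the reviewer's concern from the other side: at n = 8, a true difference of this size would almost never be detected.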

      Regarding Figure 6, we obtained a statistically significant difference and consider the sample size to be sufficient. 

      In addition, we have performed various functional experiments (Figures 2, 5, 6 and S6) and have obtained consistent results showing that expansion of myeloid-biased HSCs does not occur with aging in the Hoxb5+ HSC fraction. Based on the above, we conclude that the LT-HSC fraction does not differ in myeloid differentiation potential with aging.

      Comment #2-2: As the authors attempt to challenge the current model of the age-associated expansion of myeloid-biased HSCs (which has been observed and reproduced by many different groups), ideally additional strong evidence in the form of single-cell transplants is provided.

      Response #2-2: Thank you for the comments. As the reviewer pointed out, we hope to reconfirm our results using single-cell technology in the future.

      On the other hand, we have reported that the ratio of myeloid to lymphoid cells in the peripheral blood changes when the number of HSCs transplanted, or the number of supporting cells transplanted with HSCs, is varied (1, 2). Therefore, single-cell transplant data need to be interpreted very carefully to determine differentiation potential.

      From this viewpoint, future experiments will combine the Hoxb5 reporter system with a lineage tracing system that can track HSCs at the single-cell level over time. This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells. We have reflected this comment by adding the following sentences in the manuscript.

      [P19, L451] In contrast, our findings should be considered in light of some limitations. In this report, we primarily performed ten to twenty cell transplantation assays. Therefore, the current theory should be revalidated using single-cell technology with a lineage tracing system (3, 4). This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells.

      References

      (1) Nishi K, Sakamaki T, Sadaoka K, Fujii M, Takaori-Kondo A, Chen JY, et al. Identification of the minimum requirements for successful haematopoietic stem cell transplantation. Br J Haematol. 2022;196(3):711–23. 

      (2) Sakamaki T, Kao KS, Nishi K, Chen JY, Sadaoka K, Fujii M, et al. Hoxb5 defines the heterogeneity of self-renewal capacity in the hematopoietic stem cell compartment. Biochem Biophys Res Commun [Internet]. 2021;539:34–41. Available from: https://doi.org/10.1016/j.bbrc.2020.12.077

      (3) Yamamoto R, Wilkinson AC, Ooehara J, Lan X, Lai CY, Nakauchi Y, et al. Large-Scale Clonal Analysis Resolves Aging of the Mouse Hematopoietic Stem Cell Compartment. Cell Stem Cell [Internet]. 2018;22(4):600-607.e4. Available from: https://doi.org/10.1016/j.stem.2018.03.013

      (4) Rodriguez-Fraticelli AE, Weinreb C, Wang SW, Migueles RP, Jankovic M, Usart M, et al. Single-cell lineage tracing unveils a role for TCF15 in haematopoiesis. Nature [Internet]. 2020;583(7817):585–9. Available from: http://dx.doi.org/10.1038/s41586-020-2503-6

      Comment #2-3: It is also unclear why the authors believe that the observed reduction of ST-HSCs relative to LT-HSCs explains the myeloid-biased phenotype observed in the peripheral blood. This point seems counterintuitive and requires further explanation.

      Response #2-3: Thank you for your comment. We apologize for the insufficient explanation. Our data, as shown in Figures 3 and 4, demonstrate that the differentiation potential of LT-HSCs remains unchanged with age. Therefore, rather than attributing myeloid-biased hematopoiesis to an increase in LT-HSCs with a consistent differentiation capacity, it seems more accurate to say that the relative decrease in the proportion of ST-HSCs, whose progeny persist in peripheral blood as lymphocytes, leads to a relative increase in myeloid cells in peripheral blood and thus causes myeloid-biased hematopoiesis.

      However, if we focus on the increase in the ratio of LT-HSCs, it is also plausible to explain that “with aging, the proportion of LT-HSCs capable of long-term myeloid hematopoiesis increases. As a result, from 16 weeks after transplantation, the influence of LT-HSCs maintaining the long-term ability to produce myeloid cells becomes relatively more significant, leading to an increase in the ratio of myeloid cells in the peripheral blood and causing myeloid-biased hematopoiesis.”

      Comment #2-4: Based on my understanding of the presented data, the authors argue that myeloid-biased HSCs do not exist, as a) they detect no difference between young/aged HSCs after transplant (mind low n-numbers and large std!); b) myeloid progenitors downstream of HSCs only show minor or no changes in frequency and c) aged LT-HSCs do not outperform young LT-HSC in myeloid output LT-HSCs in competitive transplants (mind low n-numbers and large std!).

      Response #2-4: We appreciate the comments. As mentioned above, we will correct the manuscript regarding the sample size.

      Regarding the interpretation of the lack of increase in the percentage of myeloid progenitor cells in the bone marrow with age, it is possible that various confounding factors, such as differentiation shortcuts or changes in the microenvironment, are involved.

      However, even when aged LT-HSCs and young LT-HSCs are transplanted into the same recipient mice, the timing of the appearance of different cell fractions in peripheral blood is similar (Figure 3 of this paper). Therefore, we have not obtained data suggesting that clear shortcuts exist in the differentiation process of aged HSCs into neutrophils or monocytes. Additionally, it is currently generally accepted that myeloid cells, including neutrophils and monocytes, differentiate from GMPs (1). Since there is no change in the proportion of GMPs in the bone marrow with age, we concluded that the differentiation potential into myeloid cells remains consistent with aging.

      Reference

      (1) Akashi K, et al. A clonogenic common myeloid progenitor that gives rise to all myeloid lineages. Nature. 2000;404(6774):193–7.

      Strengths: 

      The authors present an interesting observation and offer an alternative explanation of the origins of age-associated myeloid-biased hematopoiesis. Their data regarding the role of the microenvironment in the spleen and thymus appear to be convincing.

      Weaknesses: 

      Comment #2-5: "Then, we found that the myeloid lineage proportions from young and aged LT-HSCs were nearly comparable during the observation period after transplantation (Figure 3, B and C)." Given the large standard deviation and low n-numbers, the power of the analysis to detect differences between experimental groups is very low. Experimental groups with too large standard deviations (as displayed here) are difficult to interpret and might be inconclusive. The absence of clearly detectable differences between young and aged transplanted HSCs could thus simply be a false-negative result. The shown experimental results hence do not provide strong evidence for the authors' interpretation of the data. The authors should add additional transplants and include a detailed power analysis to be able to detect differences between experimental groups with reasonable sensitivity.

      Response #2-5: Thank you for providing these insights. Regarding the sample size, we have addressed this in Response #2-1.

      Comment #2-6: Line 293: "Based on these findings, we concluded that myeloid-biased hematopoiesis observed following transplantation of aged HSCs was caused by a relative decrease in ST-HSC in the bulk-HSC compartment in aged mice rather than the selective expansion of myeloid-biased HSC clones." Couldn't that also be explained by an increase in myeloid-biased HSCs, as repeatedly reported and seen in the expansion of CD150+ HSCs? It is not intuitively clear why a reduction of ST-HSC clones would lead to a myeloid bias. The authors should try to explain more clearly where they believe the increased number of myeloid cells comes from. What is the source of myeloid cells if the authors believe they are not derived from the expanded population of myeloid-biased HSCs?

      Response #2-6: Thank you for pointing this out. We apologize for the insufficient explanation. We will explain using Figure 8 from the paper.

      First, our data show that LT-HSCs maintain their differentiation capacity with age, while ST-HSCs lose their self-renewal capacity earlier, so that only long-lived memory lymphocytes remain in the peripheral blood after the loss of self-renewal capacity in ST-HSCs (Figure 8, upper panel). In mouse bone marrow, the proportion of LT-HSCs increases with age, while the proportion of ST-HSCs relatively decreases (Figure 8, lower panel and Figure S5).

      Our data show that merely reproducing the ratio of LT-HSCs to ST-HSCs observed in aged mice using young LT-HSCs and ST-HSCs can replicate myeloid-biased hematopoiesis. This suggests that the increase in LT-HSC and the relative decrease in ST-HSC within the HSC compartment with aging are likely to contribute to myeloid-biased hematopoiesis.
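
      The ratio argument can be made concrete with a toy calculation. The sketch below is a deliberately simplified model, not the authors' analysis: it assumes that, at late time points, each LT-HSC contributes both myeloid cells and lymphocytes while each ST-HSC contributes only persisting lymphocytes. The per-cell outputs are hypothetical arbitrary units, and the two LT:ST ratios are those quoted in Response #1-4.

```python
def myeloid_fraction(lt_share: float, myeloid_per_lt: float,
                     lymphoid_per_lt: float, lymphoid_per_st: float) -> float:
    """Late-time myeloid fraction of donor blood for a mixed LT/ST graft.

    ST-HSCs are modeled as contributing only persisting lymphocytes at late
    time points, because their short-lived myeloid output has disappeared.
    """
    st_share = 1.0 - lt_share
    myeloid = lt_share * myeloid_per_lt
    lymphoid = lt_share * lymphoid_per_lt + st_share * lymphoid_per_st
    return myeloid / (myeloid + lymphoid)


# Hypothetical per-cell outputs (arbitrary units), identical for both grafts,
# so any difference below comes from the LT:ST ratio alone.
params = dict(myeloid_per_lt=1.0, lymphoid_per_lt=1.0, lymphoid_per_st=1.0)

young_like = myeloid_fraction(lt_share=0.2, **params)  # LT:ST = 2:8
aged_like = myeloid_fraction(lt_share=0.5, **params)   # LT:ST = 5:5

print(f"young-like ratio: {young_like:.2f}, aged-like ratio: {aged_like:.2f}")
```

      In this toy model the myeloid fraction rises when the LT-HSC share rises, even though the output of each individual cell is held fixed. This matches the qualitative claim, although the real magnitudes depend on parameters not modeled here.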

      As mentioned earlier, since the differentiation capacity of LT-HSCs remains unchanged with age, it seems more accurate to say that the relative decrease in the proportion of ST-HSCs, whose progeny persist as long-lived memory lymphocytes in peripheral blood, leads to a relative increase in myeloid cells in peripheral blood and thus causes myeloid-biased hematopoiesis.

      However, focusing on the increase in the proportion of LT-HSCs, it is also possible to explain that “with aging, the proportion of LT-HSCs capable of long-term myeloid hematopoiesis increases. As a result, from 16 weeks after transplantation, the influence of LT-HSCs maintaining the long-term ability to produce myeloid cells becomes relatively more significant, leading to an increase in the ratio of myeloid cells in the peripheral blood and causing myeloid-biased hematopoiesis.”

      Reviewer #3:

      Summary:

      In this manuscript, Nishi et al. propose a new model to explain the previously reported myeloid-biased hematopoiesis associated with aging. Traditionally, this phenotype has been explained by the expansion of myeloid-biased hematopoietic stem cell (HSC) clones during aging. Here, the authors question this idea and show how their Hoxb5 reporter model can discriminate long-term (LT) and short-term (ST) HSC and characterized their lineage output after transplant. From these analyses, the authors conclude that changes during aging in the LT/ST HSC proportion explain the myeloid bias observed. 

      Although the topic is appropriate and the new model provides a new way to think about lineage-biased output observed in multiple hematopoietic contexts, some of the experimental design choices, as well as some of the conclusions drawn from the results could be substantially improved. Also, they do not propose any potential mechanism to explain this process, which reduces the potential impact and novelty of the study. Specific concerns are outlined below. 

      Major 

      Comment #3-1: As a general comment, there are experimental details that are either missing or not clear. The main one is related to transplantation assays. What is the irradiation dose? The Methods sections indicates "recipient mice were lethally irradiated with single doses of 8.7 or 9.1 Gy". The only experimental schematic indicating the irradiation dose is Figure 7A, which uses 8.7 Gy. Also, although there is not a "standard", 11 Gy split in two doses is typically considered lethal irradiation, while 9.5 Gy is considered sublethal.

      Response #3-1: We agree with the reviewer's concern regarding whether 8.7 Gy is myeloablative. To confirm this, it would typically be necessary to irradiate mice with different doses and observe whether they survive. However, such an experiment is not ethically permissible at our facility. Instead, in our previous report (1), we demonstrated that none of the mice subjected to pretreatment with 8.7 Gy could survive when non-LKS cells were transplanted, suggesting that 8.7 Gy is sufficiently myeloablative with the radiation equipment at our facility.

      Reference

      (1) Nishi K, Sakamaki T, Sadaoka K, Fujii M, Takaori-Kondo A, Chen JY, et al. Identification of the minimum requirements for successful haematopoietic stem cell transplantation. Br J Haematol. 2022;196(3):711–23. 

      Comment #3-2:  Is there any reason for these lower doses? Same question for giving a single dose and for performing irradiation a day before transplant. 

      Response #3-2: We thank the reviewer for these important comments. Although the 8.7 Gy dose used at our facility is lower than in other reports, we selected this dose to maintain consistency with our previous experiments. For the same reason, we used a single dose rather than a split dose. Regarding the timing of irradiation, the Methods section specifies that irradiation is performed 12-24 hours prior to transplantation. In most experiments, irradiation was performed 12 hours beforehand. However, depending on how the experiment progressed, there were occasional instances where nearly 24 hours elapsed between irradiation and transplantation. We provide this information to ensure accuracy.

      Comment #3-3: The manuscript would benefit from the inclusion of references to recent studies discussing hematopoietic biases and differentiation dynamics at a single-cell level (e.g., Yamamoto et. al 2018; Rodriguez-Fraticelli et al., 2020). Also, when discussing the discrepancy between studies claiming different biases within the HSC pool, the authors mentioned that Montecino-Rodriguez et al. 2019 showed preserved lymphoid potential with age. It would be good to acknowledge that this study used busulfan as the conditioning method instead of irradiation.

      Response #3-3: We agree with this comment and have incorporated this suggestion into the manuscript:

      [P19, L451] In contrast, our findings should be considered in light of some limitations. In this report, we primarily performed ten to twenty cell transplantation assays. Therefore, the current theory should be revalidated using single-cell technology with a lineage tracing system (1, 2). This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells. Additionally, in this report we purified LT-HSCs using the Hoxb5 reporter system. In contrast, various LT-HSC markers have been previously reported (2, 3). Therefore, it would be ideal to validate our findings using other LT-HSC markers.

      [P16, L368] Other studies suggest that blockage of lymphoid hematopoiesis in aged mice results in myeloid-skewed hematopoiesis through alternative mechanisms. However, this result should be interpreted carefully, since busulfan was used for myeloablative treatment in this study (4).

      References

      (1) Yamamoto R, Wilkinson AC, Ooehara J, Lan X, Lai CY, Nakauchi Y, et al. Large-Scale Clonal Analysis Resolves Aging of the Mouse Hematopoietic Stem Cell Compartment. Cell Stem Cell [Internet]. 2018;22(4):600-607.e4. Available from: https://doi.org/10.1016/j.stem.2018.03.013

      (2) Rodriguez-Fraticelli AE, Weinreb C, Wang SW, Migueles RP, Jankovic M, Usart M, et al. Single-cell lineage tracing unveils a role for TCF15 in haematopoiesis. Nature [Internet]. 2020;583(7817):585–9. Available from: http://dx.doi.org/10.1038/s41586-020-2503-6

      (3) Sanjuan-Pla A, Macaulay IC, Jensen CT, Woll PS, Luis TC, Mead A, et al. Platelet-biased stem cells reside at the apex of the haematopoietic stem-cell hierarchy. Nature. 2013;502(7470):232–6.

      (4) Montecino-Rodriguez E, Kong Y, Casero D, Rouault A, Dorshkind K, Pioli PD. Lymphoid-Biased Hematopoietic Stem Cells Are Maintained with Age and Efficiently Generate Lymphoid Progeny. Stem Cell Reports. 2019 Mar 5;12(3):584–96. 

      Comment #3-4: When representing the contribution to PB from transplanted cells, the authors show the % of each lineage within the donor-derived cells (Figures 3B-C, 5B, 6B-D, 7C-E, and S3 B-C). To have a better picture of total donor contribution, total PB and BM chimerism should be included for each transplantation assay. Also, for Figures 2C-D and Figures S2A-B, do the graphs represent 100% of the PB cells? Are there any radioresistant cells?

      Response #3-4: Thank you for highlighting this point. Indeed, donor contribution to total peripheral blood (PB) is important information. We have included the donor contribution data for each of the figures mentioned above.

      Author response image 4.

      In Figure 2C-D and Figure S2A-B, the percentage of donor chimerism in PB was defined as the percentage of CD45.1-CD45.2+ cells among total CD45.1-CD45.2+ and CD45.1+CD45.2+ cells, as described in the Methods section.
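
      As a minimal sketch of this definition, the calculation can be written as a short function. The function and parameter names are hypothetical, the event counts are made up for illustration, and treating the CD45.1+CD45.2+ events as the non-donor compartment is an assumption based only on the sentence above.

```python
def donor_chimerism_percent(donor_events: int, double_positive_events: int) -> float:
    """Donor chimerism as defined in the text.

    donor_events:           events gated as CD45.1-CD45.2+
    double_positive_events: events gated as CD45.1+CD45.2+
                            (assumed here to be the non-donor compartment)
    """
    total = donor_events + double_positive_events
    if total == 0:
        raise ValueError("no gated events to compute chimerism from")
    return 100.0 * donor_events / total


# Hypothetical event counts, for illustration only.
example = donor_chimerism_percent(donor_events=300, double_positive_events=700)
print(example)
```

      Because the denominator contains only these two gated populations, the two percentages always sum to 100% by construction. Cells outside both gates are not counted.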

      Comment #3-5: For BM progenitor frequencies, the authors present the data as the frequency of cKit+ cells. This normalization might be misleading as changes in the proportion of cKit+ between the different experimental conditions could mask differences in these BM subpopulations. Representing this data as the frequency of BM single cells or as absolute numbers (e.g., per femur) would be valuable.

      Response #3-5: We appreciate the reviewer's comment on this point. 

      Firstly, as shown in Supplemental Figures S1B and S1C, we analyzed the upstream (HSC, MPP, Flk2+) and downstream (CLP, MEP, CMP, GMP) fractions in different panels. Therefore, normalization is required to assess the differentiation of HSCs from upstream to downstream. Additionally, we normalized by c-Kit+ cells because the bone marrow analysis was performed after enrichment using the anti-c-Kit antibody for both upstream and downstream fractions. Based on this, we calculated the progenitor populations as a frequency within the c-Kit positive cells. Next, the results of normalizing to whole bone marrow cells (live cells) are shown in Author response image 3.

      As with normalization to c-Kit+ cells, myeloid progenitors remained largely unchanged, apart from a statistically significant decrease in CMP in aged mice. Additionally, there were no significant differences in CLP. In conclusion, similar results were obtained with normalization to c-Kit+ cells and normalization to whole bone marrow cells (live cells).
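
      The difference between the two normalizations can be illustrated with a small sketch. All counts below are hypothetical and chosen only to show how the choice of denominator can matter; they are not data from the study, and the function name is an assumption.

```python
def frequency_percent(subset_cells: int, parent_cells: int) -> float:
    """Frequency of a subset as a percentage of the chosen parent population."""
    if parent_cells <= 0:
        raise ValueError("parent_cells must be positive")
    return 100.0 * subset_cells / parent_cells


# Hypothetical counts: the second sample has twice as many c-Kit+ cells and
# twice as many GMPs as the first, with the same number of live cells.
samples = {
    "sample_a": {"live": 1_000_000, "ckit_pos": 50_000, "GMP": 5_000},
    "sample_b": {"live": 1_000_000, "ckit_pos": 100_000, "GMP": 10_000},
}

results = {}
for name, counts in samples.items():
    results[name] = {
        "gmp_of_ckit": frequency_percent(counts["GMP"], counts["ckit_pos"]),
        "gmp_of_live": frequency_percent(counts["GMP"], counts["live"]),
    }
    print(name, results[name])
```

      In this made-up case the GMP frequency within c-Kit+ cells is identical in both samples while the frequency within live cells differs twofold. This is the kind of masking the reviewer describes, and it is why showing both normalizations is informative.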

      However, as the reviewer pointed out, it is necessary to explain the reason for normalization with c-Kit. Therefore, we will add the following description.

      [P21, L502] For the combined analysis of the upstream (HSC, MPP, Flk2+) and downstream (CLP, MEP, CMP, GMP) fractions in Figures 1B and 7F, we normalized by c-Kit+ cells because we performed a c-Kit enrichment for the bone marrow analysis.

      Comment #3-6: Regarding Figure 1B, the authors argue that if myeloid-biased HSC clones increase with age, they should see increased frequency of all components of the myeloid differentiation pathway (CMP, GMP, MEP). This would imply that their results (no changes or reduction in these myeloid subpopulations) suggest the absence of myeloid-biased HSC clones expansion with age. This reviewer believes that differentiation dynamics within the hematopoietic hierarchy can be more complex than a cascade of sequential and compartmentalized events (e.g., accelerated differentiation at the CMP level could cause exhaustion of this compartment and explain its reduction with age and why GMP and MEP are unchanged) and these conclusions should be considered more carefully.

      Response #3-6: We wish to thank the reviewer for this comment. We agree with that the differentiation pathway may not be a cascade of sequential events but could be influenced by various factors such as extrinsic factors.

      In Figure 1B, we hypothesized that there may be other mechanisms causing myeloidbiased hematopoiesis besides the age-related increase in myeloid-biased HSCs, given that the percentage of myeloid progenitor cells in the bone marrow did not change with age. However, we do not discuss the presence or absence of myeloid-biased HSCs based on the data in Figure 1B. 

      Our newly proposed theories (that the differentiation capacity of LT-HSCs remains unchanged with age and that age-related myeloid-biased hematopoiesis is due to changes in the ratio of LT-HSCs to ST-HSCs) are based on functional experiment results. As the reviewer pointed out, to discuss the presence or absence of myeloid-biased HSCs based on the data in Figure 1B, it is necessary to apply a system that can track HSC differentiation at the single-cell level. Such technology would clarify changes in the self-renewal capacity of individual HSCs and their differentiation into progenitor cells and peripheral blood cells. We believe that these single-cell technologies will be beneficial in understanding the differentiation of HSCs. Based on the above, the following statement has been added to the text.

      [P19, L451] In contrast, our findings should be considered in light of some limitations. In this report, we primarily performed ten to twenty cell transplantation assays. Therefore, the current theory should be revalidated using single-cell technology with lineage tracing system (1, 2). This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells.

      References

      (1) Yamamoto R, Wilkinson AC, Ooehara J, Lan X, Lai CY, Nakauchi Y, et al. Large-Scale Clonal Analysis Resolves Aging of the Mouse Hematopoietic Stem Cell Compartment. Cell Stem Cell [Internet]. 2018;22(4):600-607.e4. Available from: https://doi.org/10.1016/j.stem.2018.03.013

      (2) Rodriguez-Fraticelli AE, Weinreb C, Wang SW, Migueles RP, Jankovic M, Usart M, et al. Single-cell lineage tracing unveils a role for TCF15 in haematopoiesis. Nature [Internet]. 2020;583(7817):585–9. Available from: http://dx.doi.org/10.1038/s41586-020-2503-6

      Comment #3-7: Within the few recipients showing good donor engraftment in Figure 2C, there is a big proportion of T cells that are "amplified" upon secondary transplantation (Figure 2D). Is this expected?

      Response #3-7: We wish to express our deep appreciation to the reviewer for the insightful comment on this point. As the reviewer pointed out, in Figure 2D, a few recipients show a very high percentage of T cells. We had the same question and considered this phenomenon as follows:

      (1) One reason for the very high percentage of T cells is that we used 1 x 10^7 whole bone marrow cells in the secondary transplantation. Consequently, the donor cells in the secondary transplantation contained more T-cell progenitor cells, leading to a greater increase in T cells compared to the primary transplantation.

      (2) We also consider that this phenomenon may be influenced by the reduced self-renewal capacity of aged LT-HSCs, resulting in decreased sustained production of myeloid cells in the secondary recipient mice. As a result, long-lived memory-type lymphocytes may preferentially remain in the peripheral blood, increasing the percentage of T cells in the secondary recipient mice.

      We have discussed our hypothesis regarding this interesting phenomenon. To further clarify the characteristics of the increased T-cell count in the secondary recipient mice, we will analyze TCR clonality and diversity in the future.

      Comment #3-8: Do the authors have any explanation for the high level of variability within the recipients of Hoxb5+ cells in Figure 2C?

      Response #3-8: We appreciate the reviewer's comment on this point. As noted in our previous report, transplantation of a sufficient number of HSCs results in stable donor chimerism, whereas a small number of HSCs leads to increased variability in donor chimerism (1). Additionally, other studies have observed high variability when fewer than 10 HSCs are transplanted (2, 3). Based on this evidence, we consider that the transplantation of a small number of cells (10 cells) is the primary cause of the high level of variability observed.

      References

      (1) Nishi K, Sakamaki T, Sadaoka K, Fujii M, Takaori-Kondo A, Chen JY, et al. Identification of the minimum requirements for successful haematopoietic stem cell transplantation. Br J Haematol. 2022;196(3):711–23. 

      (2) Dykstra B, Olthof S, Schreuder J, Ritsema M, Haan G De. Clonal analysis reveals multiple functional defects of aged murine hematopoietic stem cells. J Exp Med. 2011 Dec 19;208(13):2691–703. 

      (3) Yamamoto R, Wilkinson AC, Ooehara J, Lan X, Lai CY, Nakauchi Y, et al. Large-Scale Clonal Analysis Resolves Aging of the Mouse Hematopoietic Stem Cell Compartment. Cell Stem Cell [Internet]. 2018;22(4):600-607.e4. Available from: https://doi.org/10.1016/j.stem.2018.03.013

      Comment #3-9: Can the results from Figure 2E be interpreted as Hoxb5+ cells having a myeloid bias? (differences are more obvious/significant in neutrophils and monocytes).

      Response #3-9: Thank you for your insightful comments. Firstly, we have not so far obtained any data indicating that young LT-HSCs are myeloid-biased HSCs. Therefore, we classify young LT-HSCs as balanced HSCs (1). Secondly, our current data demonstrate no significant difference in differentiation capacity between young and aged LT-HSCs (see Figure 3 in this paper). Based on these findings, we interpret that aged LT-HSCs are balanced HSCs, similar to young LT-HSCs.

      Reference

      (1)  Chen JY, Miyanishi M, Wang SK, Yamazaki S, Sinha R, Kao KS, et al. Hoxb5 marks long-term haematopoietic stem cells and reveals a homogenous perivascular niche. Nature. 2016 Feb 10;530(7589):223–7. 

      Comment #3-10: Is Figure 2G considering all primary recipients or only the ones that were used for secondary transplants? The second option would be a fairer comparison.

      Response #3-10: We appreciate the reviewer's comment on this point. We considered all primary recipients in Figure 2G to ensure a fair comparison, given the influence of various factors such as the radiosensitivity of individual recipient mice (1). Comparing only the primary recipients used in the secondary transplantation would result in n = 3 (primary recipient) vs. n = 12 (secondary recipient). Including all primary recipients yields n = 11 vs. n = 12, providing a more balanced comparison. Therefore, we analyzed all primary recipient mice to ensure the reliability of our results.

      Reference

      (1) Duran-Struuck R, Dysko RC. Principles of bone marrow transplantation (BMT): providing optimal veterinary and husbandry care to irradiated mice in BMT studies. J Am Assoc Lab Anim Sci. 2009; 48:11–22

      Comment #3-11: When discussing the transcriptional profile of young and aged HSCs, the authors claim that genes linked to myeloid differentiation remain unchanged in the LT-HSC fraction while there are significant changes in the ST-HSCs. However, 2 out of the 4 genes shown in Figure S4B show ratios higher than 1 in LT-HSCs.

      Response #3-11: Thank you for highlighting this important point. As the reviewer pointed out, when we analyze the expression of myeloid-related genes, some genes are elevated in aged LT-HSCs compared to young LT-HSCs. However, the GSEA analysis using myeloid-related gene sets, which include several hundred genes, shows no significant difference between young and aged LT-HSCs (see Figure S4C in this paper). Furthermore, functional experiments using the co-transplantation system show no difference in differentiation capacity between young and aged LT-HSCs (see Figure 3 in this paper). Based on these results, we conclude that LT-HSCs do not exhibit any change in differentiation capacity with aging.

      Comment #3-12: When determining the lymphoid bias in ST-HSCs, the authors focus on the T-cell subtype, not considering any other lymphoid population. Could the authors explain this?

      Response #3-12: We thank the reviewer for this comment. We conducted the experiments in Figure 5 to demonstrate that the hematopoiesis observed 16 weeks post-transplantation, when ST-HSCs are believed to lose their self-renewal capacity, is not due to de novo production of T cells from ST-HSCs. Instead, it is attributed to long-lived memory cells, which can persistently remain in the peripheral blood.

      As noted by the reviewer, various memory cell types are present in peripheral blood. Our analysis focused on memory T cells due to the broad consensus on memory T cell markers (1).

      Our findings show that transplanted Hoxb5- HSCs do not continuously produce lymphoid cells, unlike lymphoid-biased HSCs. Rather, the loss of self-renewal capacity in Hoxb5- HSCs makes the presence of long-lived memory cells in the peripheral blood more apparent.

      Reference

      (1)  Yenyuwadee S, Sanchez-Trincado Lopez JL, Shah R, Rosato PC, Boussiotis VA. The evolving role of tissue-resident memory T cells in infections and cancer. Sci Adv. 2022;8(33). 

      Comment #3-13: Based on the reduced frequency of donor cells in the spleen and thymus, the authors conclude "the process of lymphoid lineage differentiation was impaired in the spleens and thymi of aged mice compared to young mice". An alternative explanation could be that differentiated cells do not successfully migrate from the bone marrow to these secondary lymphoid organs. Please consider this possibility when discussing the data.

      Response #3-13: We strongly appreciate the reviewer's comment on this point. In accordance with the reviewer's comment, we have incorporated this suggestion into our manuscript.

      [P15, L343] These results indicate that the process of lymphoid lineage differentiation is impaired in the spleens and thymi of aged mice compared to young mice, or that differentiating cells in the bone marrow do not successfully migrate into these secondary lymphoid organs. These factors contribute to the enhanced myeloid-biased hematopoiesis in peripheral blood due to a decrease in de novo lymphocyte production.

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      Recommendation #2-1: To support their conclusions the authors need to provide higher n-numbers and provide a detailed power analysis of the transplants in the methods section.

      Response to Recommendation #2-1: Thank you for your important remarks. The power analysis for this experiment gives power = 0.319, suggesting that a larger sample size may be needed. On the other hand, our method for determining the sample size in Figure 3 is as follows:

      (1) First, we checked whether a myeloid-biased change is detected in the bulk-HSC fraction (Figure S3). The results showed that the difference in myeloid output at 16 weeks after transplantation was statistically significant (young vs. aged = 7.2 ± 8.9 vs. 42.1 ± 35.5%, p = 0.01), even though n = 10.

      (2) Next, myeloid-biased HSCs have been reported to be a fraction with high self-renewal ability (2004, Blood). If myeloid-biased HSCs increase with aging, the increase in myeloid-biased HSCs in the LT-HSC fraction would be detected with higher sensitivity than in the bulk-HSC fraction used in Figure S3.

      (3) However, in Figure 3 (n = 8), not only was the p-value non-significant, but the means themselves were similar (young vs. aged = 51.4 ± 31.5% vs. 47.4 ± 39.0%, p = 0.82). Since there was no difference in the mean itself, it is highly likely that no difference would be detected even if n were further increased.

      Regarding Figures S3, 5, 6, S6, and 7, we obtained statistically significant differences and consider the sample sizes to be sufficient.
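
      As a rough illustration of the sample-size argument above, and not the analysis actually run for the manuscript, the reported summary statistics can be converted into an effect size and an approximate required group size. The sketch below assumes that n is the number of animals per group, that the groups are of equal size, and that a two-sided alpha of 0.05, a target power of 0.80, and a normal approximation are acceptable; the function names are hypothetical. It does not attempt to reproduce the reported power of 0.319, because the inputs to that calculation are not stated.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Standardized mean difference, assuming two equal-sized groups."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return abs(mean_a - mean_b) / pooled_sd


def welch_t(mean_a, sd_a, mean_b, sd_b, n):
    """Welch t statistic from summary statistics (n animals per group, assumed)."""
    return (mean_a - mean_b) / sqrt(sd_a ** 2 / n + sd_b ** 2 / n)


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-sample test (normal approximation)."""
    z = NormalDist()
    return ceil(2 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2 / d ** 2)


# Figure 3 values quoted above: young vs. aged LT-HSCs, mean ± SD (%).
d_fig3 = cohens_d(51.4, 31.5, 47.4, 39.0)
t_fig3 = welch_t(51.4, 31.5, 47.4, 39.0, n=8)
n_fig3 = n_per_group(d_fig3)

# Figure S3 values quoted above: young vs. aged bulk-HSCs, mean ± SD (%).
d_figs3 = cohens_d(7.2, 8.9, 42.1, 35.5)
n_figs3 = n_per_group(d_figs3)

print(f"Figure 3:  d = {d_fig3:.2f}, t = {t_fig3:.2f}, approx. n per group = {n_fig3}")
print(f"Figure S3: d = {d_figs3:.2f}, approx. n per group = {n_figs3}")
```

      Under these assumptions, the Figure 3 effect size is small enough that the approximate group size needed to detect it exceeds a thousand animals, whereas the Figure S3 effect is large enough to be detectable with roughly ten animals per group. This is consistent with the reasoning in points (1) and (3), but it remains an approximation that depends on the stated assumptions.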

      Recommendation #2-2: As the authors attempt to challenge the current model of the age-associated expansion of myeloid-biased HSCs (which has been observed and reproduced by many different groups), ideally additional strong evidence in the form of single-cell transplants is provided.

      Response to Recommendation #2-2: Thank you for the comments. As the reviewer pointed out, we hope to reconfirm our results using single-cell-level technology in the future.

      On the other hand, we have reported that the ratio of myeloid to lymphoid cells in the peripheral blood changes when the number of HSCs transplanted, or the number of supporting cells transplanted with HSCs, is varied (1, 2). Therefore, single-cell transplant data need to be interpreted very carefully to determine differentiation potential.

      From this viewpoint, future experiments will combine the Hoxb5 reporter system with a lineage tracing system that can track HSCs at the single-cell level over time. This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells. We have reflected this comment by adding the following sentences in the manuscript.

      [P19, L451] In contrast, our findings should be considered in light of some limitations. In this report, we primarily performed ten to twenty transplantation assays. Therefore, the current theory should be revalidated using single-cell technology. This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells.

      References

      (1) Nishi K, Sakamaki T, Sadaoka K, Fujii M, Takaori-Kondo A, Chen JY, et al. Identification of the minimum requirements for successful haematopoietic stem cell transplantation. Br J Haematol. 2022;196(3):711–23. 

      (2) Sakamaki T, Kao KS, Nishi K, Chen JY, Sadaoka K, Fujii M, et al. Hoxb5 defines the heterogeneity of self-renewal capacity in the hematopoietic stem cell compartment. Biochem Biophys Res Commun [Internet]. 2021;539:34–41. Available from: https://doi.org/10.1016/j.bbrc.2020.12.077

      Minor points:

      Recommendation #2-3: Figure 1: "Comprehensive analysis of hematopoietic alternations with age shows a discrepancy of age-associated changes between peripheral blood and bone marrow"

      [Comment to the authors]: For clarity, the nature of the discrepancy should be stated clearly.

      Response to Recommendation #2-3: Thank you for this important comment. Following the reviewer’s recommendation, we have revised the manuscript as follows.

      [P7, L139] Our analysis of hematopoietic alternations with age revealed that age-associated transition patterns of immunophenotypically defined HSC and CMP in BM were not paralleled with myeloid cell in PB (Fig. 1 C).

      Recommendation #2-4: Figure 1B "(B) Average frequency of immunophenotypically defined HSC and progenitor cells in BM of 2-3-month mice (n = 6), 6-month mice (n = 6), 12-13-month mice (n = 6), ≥ 23-month mice (n = 7)."

      [Comment to the authors]: It should be stated in the figure and legend that the values are normalized to the 2-3-month-old mice.

      Response to Recommendation #2-4: Thank you for this comment. Figure 1B presents the actual measured values of each fraction in c-Kit positive cells in the bone marrow, without any normalization.

      Recommendation #2-5: "We found that the frequency of immunophenotypically defined HSC in BM rapidly increased up to the age of 12 months. After the age, they remained plateaued throughout the observation period (Fig. 1 B)."

      [Comment to the authors]: The evidence for a 'plateau', where HSC numbers don't change after 12 months, is weak. It appears that the numbers increase continuously (although less steeply) after 12 months. I thus recommend adjusting the wording to better reflect the data.

      Response to Recommendation #2-5: We thank the reviewer for the comments above and have incorporated these suggestions in our revision as follows. 

      [P6, L126] We found that the frequency of immunophenotypically defined HSC in BM rapidly increased up to the age of 12 months. After the age, the rate of increase in their frequency appeared to slow down.

      Recommendation #2-6: Figure 2G: [Comment to the authors]: Please add the required statistics, please check carefully all figures for missing statistical tests.

      Response to Recommendation #2-6: Thank you for these important comments. In response, we have added the results of the significance tests for Figures 1A, 1C, 4C, and S5.

      Recommendation #2-7: "If bulk-HSCs isolated from aged mice are already enriched by myeloid-biased HSC clones, we should see more myeloid-biased phenotypes 16 weeks after primary and the secondary transplantation. However, we found that kinetics of the proportion of myeloid cells in PB were similar across primary and the secondary transplantation and that the proportion of myeloid cells gradually decreased over time (Fig. 2 G). These results suggest the following two possibilities: either myeloid-biased HSCs do not expand in the LT-HSC fraction, or the expansion of myeloid-biased clones in 2-year-old mice has already peaked."

      [Comment to the authors]: Other possible explanations include that the observed reduction in myeloid reconstitution over 16 weeks reflects the time required to return to homeostasis. In other words, it takes time until the blood system approaches a balanced output.

      Response to Recommendation #2-7: We agree with the reviewer's comment. As the reviewer pointed out, the gradual decrease in the proportion of myeloid cells over time is not related to our two hypotheses in this part of the manuscript but rather to the hematopoietic system's process of returning to a homeostatic state after transplantation. Therefore, the original sentence could be misleading, as it is part of the section discussing whether age-associated expansion of myeloid-biased HSCs is observed. Based on the above, we have revised the sentence as follows.

      [P8, L179] However, we found that kinetics of the proportion of myeloid cells in PB were similar across the primary and the secondary transplantation (Fig. 2 G). These results suggest the following two possibilities: either myeloid-biased HSCs do not expand in the LT-HSC fraction, or the expansion of myeloid-biased clones in 2-year-old mice has already peaked.

      Recommendation #2-8: It is also important to consider that the transplant results are highly variable (see large standard deviation), therefore the sensitivity to detect smaller but relevant changes is low in the shown experiments. As the statistical analysis of these experiments is missing and the power seems low these results should be interpreted with caution. For instance, it appears that the secondary transplants on average produce more myeloid cells as expected and predicted by the classical clonal expansion model.

      Regarding "expansion of myeloid-biased clones in 2-year-old mice has already peaked". This is what the author suggested above. It might thus not be surprising that HSCs from 2-year-old mice show little to no increased myeloid expansion.

      Response to Recommendation #2-8: Thank you for providing these insights. The primary findings of our study are based on functional experiments presented in Figures 2, 3, 5, 6, and 7. In Figure 3, there was no significant difference between young and aged LT-HSCs, with mean values of 51.4±31.5% and 47.4±39.0%, respectively (p = 0.82). Given the lack of difference in the mean values, it is unlikely that increasing the sample size would reveal a significant change. For ethical reasons, to minimize the use of additional animals, we conclude that LT-HSCs exhibit no change in lineage output throughout life based on the data in Figure 3. Statistically significant differences observed in Figures 2, 5, 6, and 7 further support our conclusions.

      Additionally, because whole bone marrow cells were transplanted in the secondary transplantation, there may be various confounding factors beyond the differentiation potential of HSCs. Therefore, we consider that caution is necessary when evaluating the differentiation capacity of HSCs in the context of the second transplantation.

      Recommendation #2-9: Figure 7C: [Comment to the authors]: The star * indicates with analyzed BM. As stars are typically used as indicators of significance, this can be confusing for the reader. I thus suggest using another symbol.

      Response to Recommendation #2-9: We appreciate the reviewer for this comment and have incorporated the suggestion in the revised manuscript. We have decided to use † instead of the star*.

      Reviewer #3 (Recommendations For The Authors):

      Recommendation #3.1: In Figure 1A, the authors show the frequency of PB lineages (lymphoid vs myeloid) in mice of different ages. It would be great if they could show the same data for each subpopulation including these two main categories individually (granulocytes, monocytes, B cells, T cells...).

      Response to Recommendation #3-1: We thank the reviewer for this suggestion. Below, we provide the frequency of PB lineages (granulocytes, monocytes, B cells, T cells, and NK cells) in mice of different ages.

      Author response image 5.

      Average frequency of neutrophils, monocytes, B cells, T cells, and NK cells in PB analyzed in Figure 1A. Dots show all individual mice. *P < 0.05. **P < 0.01. Data and error bars represent means ± standard deviation. 

      Recommendation #3.2: It would be great if data from young mice could be shown in parallel to the graphs in Figure 2A.

      Response to Recommendation #3-2: We thank the reviewer for the comments above and have incorporated these suggestions in Figure 2A. 

      [P34, L916] (A) Hoxb5 reporter expression in bulk-HSC, MPP, Flk2+, and Lin-Sca1-c-Kit+ populations in the 2-year-old Hoxb5-tri-mCherry mice (Upper panel) and 3-month-old Hoxb5-tri-mCherry mice (Lower panel). Values indicate the percentage of mCherry+ cells ± standard deviation in each fraction (n = 3).

      Recommendation #3.3: Do the authors have any explanation for the high level of variability within the recipients of Hoxb5+ cells in Figure 2C?

      Response to Recommendation #3-3: Thank you for providing these insights. As noted in our previous report, transplantation of a sufficient number of HSCs results in stable donor chimerism, whereas a small number of HSCs leads to increased variability in donor chimerism (1). Additionally, other studies have observed high variability when fewer than 10 HSCs are transplanted (2, 3). Based on this evidence, we consider that the transplantation of a small number of cells (10 cells) is the primary cause of the high level of variability observed.

      References

      (1) Nishi K, Sakamaki T, Sadaoka K, Fujii M, Takaori-Kondo A, Chen JY, et al. Identification of the minimum requirements for successful haematopoietic stem cell transplantation. Br J Haematol. 2022;196(3):711–23. 

      (2) Dykstra B, Olthof S, Schreuder J, Ritsema M, Haan G De. Clonal analysis reveals multiple functional defects of aged murine hematopoietic stem cells. J Exp Med. 2011 Dec 19;208(13):2691–703. 

      (3) Yamamoto R, Wilkinson AC, Ooehara J, Lan X, Lai CY, Nakauchi Y, et al. Large-Scale Clonal Analysis Resolves Aging of the Mouse Hematopoietic Stem Cell Compartment. Cell Stem Cell [Internet]. 2018;22(4):600-607.e4. Available from: https://doi.org/10.1016/j.stem.2018.03.013

      Recommendation #3.4: Are the differences in Figure 3D statistically significant? If yes, please add statistics. Same for Figure 4C.

      Response to Recommendation #3-4: Thank you for providing these insights. For Figure 3D, we performed an ANOVA analysis for each fraction; however, the results were not statistically significant. In contrast, for Figure 4C, we have added the results of significance tests for comparisons between Young LT-HSC vs. Young Bulk-HSC.

      Recommendation #3.5: As a general comment, although the results in this study are interesting, the use of a Hoxb5 lineage tracing mouse model would be more valuable for this purpose than the Hoxb5 reporter used here. The lineage tracing model would allow for the assessment of lineage bias without the caveats introduced by the transplantation assays.

      Response to Recommendation #3-5: We thank the reviewer for these important comments. Following the reviewer’s recommendation, we have revised the manuscript as follows.

      [P19, L451] In contrast, our findings should be considered in light of some limitations. In this report, we primarily performed ten to twenty transplantation assays. Therefore, the current theory should be revalidated using single-cell technology with lineage tracing system (1, 2). This approach will investigate changes in the self-renewal capacity of individual HSCs and their subsequent differentiation into progenitor cells and peripheral blood cells.

      References

      (1) Yamamoto R, Wilkinson AC, Ooehara J, Lan X, Lai CY, Nakauchi Y, et al. Large-Scale Clonal Analysis Resolves Aging of the Mouse Hematopoietic Stem Cell Compartment. Cell Stem Cell [Internet]. 2018;22(4):600-607.e4. Available from: https://doi.org/10.1016/j.stem.2018.03.013

      (2) Rodriguez-Fraticelli AE, Weinreb C, Wang SW, Migueles RP, Jankovic M, Usart M, et al. Single-cell lineage tracing unveils a role for TCF15 in haematopoiesis. Nature [Internet]. 2020;583(7817):585–9. Available from: http://dx.doi.org/10.1038/s41586-020-2503-6

    1. Author response:

      The following is the authors’ response to the original reviews.

      We would like to thank the reviewers and editors for their careful assessment and review of our article. The many detailed comments, questions and suggestions were very helpful in improving our analyses and presentation of data. In particular, our Discussion benefited enormously from the comments. 

      Below we respond in detail to every point raised. 

      We especially note that Reviewer #3’s small query on “trial where learning is defined to have occurred, we were not given the quantitative criterion operationalizing "learning" - please provide” led to deeper analyses and insights and a lengthy response.

      This analysis prompted the addition of a sentence (red) to the Abstract. 

      “Animals navigate by learning the spatial layout of their environment. We investigated spatial learning of mice in an open maze where food was hidden in one of a hundred holes. Mice leaving from a stable entrance learned to efficiently navigate to the food without the need for landmarks. We developed a quantitative framework to reveal how the mice estimate the food location based on analyses of trajectories and active hole checks. After learning, the computed “target estimation vector” (TEV) closely approximated the mice’s route and its hole check distribution. The TEV required learning both the direction and distance of the start to food vector, and our data suggests that different learning dynamics underlie these estimates. We propose that the TEV can be precisely connected to the properties of hippocampal place cells. Finally, we provide the first demonstration that, after learning the location of two food sites, the mice took a shortcut between the sites, demonstrating that they had generated a cognitive map. ”

      Note: we added, at the end of the manuscript, the legends for the Shortcut video (Video 1) and the main text figure legends; these are with a larger font and so easier to read. 

      Reviewer #1 (Public Review):

      Assessment:

      This important work advances our understanding of navigation and path integration in mammals by using a clever behavioral paradigm. The paper provides compelling evidence that mice are able to create and use a cognitive map to find "short cuts" in an environment, using only the location of rewards relative to the point of entry to the environment and path integration, and need not rely on visual landmarks.

      Thank you.

      Summary:

      The authors have designed a novel experimental apparatus called the 'Hidden Food Maze (HFM)' and a beautiful suite of behavioral experiments using this apparatus to investigate the interplay between allothetic and idiothetic cues in navigation. The results presented provide a clear demonstration of the central claim of the paper, namely that mice only need a fixed start location and path integration to develop a cognitive map. The experiments and analyses conducted to test the main claim of the paper -- that the animals have formed a cognitive map -- are conclusive. While I think the results are quite interesting and sound, one issue that needs to be addressed is the framing of how landmarks are used (or not), as discussed below, although I believe this will be a straightforward issue for the authors to address.

      We have now added detailed discussion on this important point. See below.

      Strengths:

      The 90-degree rotationally symmetric design and use of 4 distal landmarks and 4 quadrants with their corresponding rotationally equivalent locations (REL) lends itself to teasing apart the influence of path integration and landmark-based navigation in a clever way. The authors use a really complete set of experiments and associated controls to show that mice can use a start location and path integration to develop a cognitive map and generate shortcut routes to new locations.

      Weaknesses:

      I have two comments. The second comment is perhaps major and would require rephrasing multiple sentences/paragraphs throughout the paper.

      (1) The data clearly indicate that in the hidden food maze (HFM) task mice did not use external visual "cue cards" to navigate, as this is clearly shown in the errors mice make when they start trials from a different start location when trained in the static entrance condition. The absence of visual landmark-guided behavior is indeed surprising, given the previous literature showing the use of distal landmarks to navigate and neural correlates of visual landmarks in hippocampal formation. While the authors briefly mention that the mice might not be using distal landmarks because of their pretraining procedure - I think it is worth highlighting this point (about the importance of landmark stability and citing relevant papers) and elaborating on it in greater detail. It is very likely that mice do not use the distal visual landmarks in this task because the pretraining of animals leads to them not identifying them as stable landmarks. For example, if they thought that each time they were introduced to the arena, it was "through the same door", then the landmarks would appear to be in arbitrary locations compared to the last time. In the same way, we as humans wouldn't use clouds or the location of people or other animate objects as trusted navigational beacons. In addition, the animals are introduced to the environment without any extra-maze landmarks that could help them resolve this ambiguity. Previous work (and what we see in our dome experiments) has shown that in environments with 'unreliable' landmarks, place cells are not controlled by landmarks - https://www.sciencedirect.com/science/article/pii/S0028390898000537, https://pubmed.ncbi.nlm.nih.gov/7891125/. This makes it likely that the absence of these distal visual landmarks when the animal first entered the maze ensured that the animal does not 'trust' these visual features as landmarks.

      Thank you. We have added many references and discussion on exactly this point, including both direct behavioral experiments and discussion of the effects of landmark (in)stability on place cell encoding of “place”. See Page 18, third paragraph.

      “An alternate factor might be the lack of reliability of distal spatial cues in predicting the food location. The mice, during pretraining trials, learned to find multiple food locations without landmarks. In the random trials, the continuous change of relative landmark location may lead the mice to not identifying them as “stable landmarks”. This view is supported by behavioral experiments that showed the importance of landmark stability for spatial learning (32-34) and that place cells are not controlled by “unreliable landmarks” (35-38). Control experiments without landmarks (Fig. S6A,B) or in the dark (Fig. S6C-F) confirmed that the mice did not need landmarks for spatial learning of the food location.”

      (2) I don't agree with the statement that 'Exogenous cues are not required for learning the food location'. There are many cues that the animal is likely using to help reduce errors in path integration. For example, the start location of the rat could act as a landmark/exogenous cue in the sense of partially correcting path integration errors. The maze has four identical entrances (90-degree rotationally symmetric). Despite this, it is entirely plausible that the animal can correct path integration errors by identifying the correct start entrance for a given trial, and indeed the distance/bearing to the others would also help triangulate one's location. Further, the overall arena geometry could help reduce PI error. For example, with a food source learned to be "near the middle" of the arena, the animal would surely not estimate the position to be near the far wall (and an interesting follow-on experiment would be to have two different-sized, but otherwise nearly identical arenas). As the rat travels away from the start location, small path integration errors are bound to accumulate; these errors could be at least partially corrected based on entrance and distal wall locations. If this process of periodically checking the location of the entrance to correct path integration errors is done every few seconds, path integration would be aided 'exogenously' to build a cognitive map. While the original claim of the paper still stands, i.e. mice can learn the location of a hidden food site when their starting point in the environment remains constant across trials, I would advise rewording portions of the paper, including the discussion throughout the paper that states claims such as "Exogenous cues are not required for learning the food location" to account for the possibility that the start and the overall arena geometry could be used as helpful exogenous cues to correct for path integration errors.

      We agree with the referee that our claim was ill-phrased. Surely the behavior of the mouse must be constrained by the arena size to some extent. To minimize potential geometric cues from the arena, we carefully analyzed many preliminary experiments (each with a unique batch of 4 mice) having the target positioned at different locations. We added a paragraph to the section “Further controls” where we explain our choice for the target position. Page 12 last paragraph; Page 13 “Arena geometry” paragraph.

      Also, following the suggestion from the reviewer, we probed whether the hole checks accumulated near the center of the arena for the random entrance mice, as a potential sign that some spatial learning is going on. In fact, neither the density of hole checks nor the distance of the hole checks to the center of the arena changes with learning: panel A below shows the probability density of finding a hole check at a given distance from the center of the arena; trial 1 and trial 14 have very similar profiles. Panel B shows the density of hole checks near (<20cm) and far (>20cm) from the arena’s center.

      Author response image 1.

      Panel B also shows no significant differences between trials 1 and 14.

      So even though there is some trend (in panel A, the peak goes from 60cm to a double peak, one at 30cm away from the center, and the other still at 60cm), the distance from the center is still far too large compared to the mouse’s body size and to the average inter-hole distance (<10cm). These panels are now in Supplementary Figure S8B.
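
      The near/far comparison described above can be sketched as follows. This is only an illustration, not our analysis code: the function name, the coordinates and the choice of the arena center as origin are assumptions made for the example, and the 20 cm radius is the threshold quoted above.

```python
import numpy as np

def radial_check_summary(checks_xy, center_xy=(0.0, 0.0), near_radius_cm=20.0):
    """Return each hole check's distance from the arena center and the
    fraction of checks closer to the center than near_radius_cm."""
    checks = np.asarray(checks_xy, dtype=float)
    center = np.asarray(center_xy, dtype=float)
    distances = np.linalg.norm(checks - center, axis=1)
    near_fraction = float(np.mean(distances < near_radius_cm))
    return distances, near_fraction

# Hypothetical hole-check coordinates in cm, with the arena center at the origin.
checks = [(5.0, 0.0), (0.0, 30.0), (-60.0, 0.0), (12.0, 16.0)]
distances, near_fraction = radial_check_summary(checks)
```

      A histogram of `distances` for trial 1 versus trial 14 would correspond to panel A, and `near_fraction` to the comparison in panel B.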

      Finally, we enhanced the wording in our claim. We now have a new section entitled: “What cues are required for learning the food location?”. There, we systematically cover all possible cues and how they might be affected by their stability under the perturbation of maze floor rotation. 

      Reviewer #2 (Public Review):

      Summary:

      This manuscript reports interesting findings about the navigational behavior of mice. The authors have dissected this behavior in various components using a sophisticated behavioral maze and statistical analysis of the data.

      Strengths:

      The results are solid and they support the main conclusions, which will be of considerable value to many scientists.

      Thank you.

      Weaknesses:

      Figure 1: In some trials the mice seem to be doing thigmotaxis, walking along the perimeter of the maze. This is perhaps due to the fear of the open arena. But, these paths along the perimeter would significantly influence all metrics of navigation, e.g. the distance or time to reward.

      Perhaps analysis can be done that treats such behavior separately and then factors it out from the paths that are away from the perimeter.

      On Page 4, we added a small section entitled “Pretraining trials”. Our reference was suggested by Reviewer #3 (noted as “Golani” with first author “Fonio”). Our preliminary experiments used naïve mice, and they typically took more than 2 days before they ventured into the arena center and found the single filled hole. This added unacceptable delays, and the Pretraining trials greatly diminished the extensive thigmotaxis (not quantified). The “near the walls” trajectories did continue in the first learning trial (Fig. 2A, 3A) but then diminished in subsequent trials. We found no evidence that thigmotaxis (trajectories adjacent to the wall) was a separate category of trajectory.

      Figure 1c: the color axis seems unusual. Red colors indicate less frequently visited regions (less than 25%) and white corresponds to more frequently visited places (>25%)? Why use such a binary measure instead of a graded map as commonly done?

      Thank you; you are completely correct. We have completely changed the color coding. 

      Some figures use linear scale and others use logarithmic scale. Is there a scientific justification? For example, average latency is on a log scale and average speed is on a linear scale, but both quantify the same behavior. The y-axis in panel 1-I is much wider than the data. Is there a reason for this? Or can the authors zoom into the y-axis so that the reader can discern any pattern?

      We use logarithmic scale with the purpose of displaying variables that have a wide range of variation (mainly, distance, latency, and number of hole checks, since it linearly and positively correlates with both distance and latency – see new Fig. S4B,C). For example, Latency goes from hundreds of seconds (trial 1) to just a few seconds (trial 14). Similarly, the total distance goes from hundreds of centimeters (trial 1, sometimes more than 1000cm, see answer about the 10-fold variation of distance below) to just the start-target distance (which is ~100cm). These variables vary over a few orders of magnitude. We display speed in a linear axis because it does not increase for more than one order of magnitude.

      Moreover, fitting the wide-ranged data (distance, latency, nchecks) yields a smaller error in log scale [i.e., fitting log(y) vs. trial, instead of y vs. trial]. In these cases, the log scale also helps visualize how well the curve fits the data. Thus, presenting wide-ranged data in linear scale could be misleading regarding goodness of fit.
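
      As an illustration of fitting log(y) vs. trial, here is a minimal sketch that assumes an exponential of the form y = B*exp(-trial/K), the form quoted elsewhere in this response. The function name and the synthetic, noise-free data are hypothetical, and an ordinary least-squares line fit is used for simplicity; it is not necessarily the fitting routine used for the figures.

```python
import numpy as np

def fit_exponential_log_space(trials, y):
    """Fit y = B * exp(-trial / K) by linear regression of log(y) on trial.

    In log space the model is a straight line:
        log(y) = log(B) - trial / K
    so the slope gives -1/K and the intercept gives log(B).
    """
    trials = np.asarray(trials, dtype=float)
    log_y = np.log(np.asarray(y, dtype=float))
    slope, intercept = np.polyfit(trials, log_y, 1)
    return float(np.exp(intercept)), float(-1.0 / slope)  # B, K

# Synthetic data spanning more than one order of magnitude (illustrative only).
trials = np.arange(1, 15)
y = 900.0 * np.exp(-trials / 5.0)
B, K = fit_exponential_log_space(trials, y)
```

      Because the residuals are measured in log units, the early trials with very large values do not dominate the fit, which is the motivation given above.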

      We now zoomed into the Y axis scale in Panels I of Fig. 2 and Fig. 3. We kept it in log-scale, but linear Y scale produces Author response image 2 for Figs. 3I and 2I, respectively.

      Author response image 2.

      Thus, we believe that the loglog-scale in these panels won’t compromise the interpretation of the phenomenon. In fact, the loglog of the static case suggests that the probability of hole checking distance increases according to a power law as the mouse approaches the target (however, we did not check this thoroughly, so we did not include this point in the discussion). Power law behavior is observed in other animals (e.g., ants: DOI: 10.1371/journal.pone.0009621) and is sometimes associated with a stochastic process with memory.

      1F shows no significant reduction in distance to reward. Does that mean there is no improvement with experience and all the improvement in the latency is due to increasing running speed with experience?

      Correct and in the section “Random Entrance experiments” under “Results” (Page 5) we explicitly note this point.

      “We hypothesize that the mice did not significantly reduce their distance travelled (Fig. 2A,B,F) because they had not learned the food location - the decrease in latency (Fig. 2D) was due to its increased running speed and familiarity with non-spatial task parameters.”

      Figure 3: The distance traveled was reduced by nearly 10-fold and speed increased by about 3-fold. So, the time to reach the reward should decrease by only 3-fold (t=d/v) but that too reduced by 10-fold. How does one reconcile the 3-fold difference between the expected and observed values?

      The traveled distance is obtained by linearly interpolating the sampled trajectory points. In other words, the software samples a discrete set of positions, $\mathbf{r}_i$, one for each recorded instant $t_i$. The total distance is

      $$d_{total} = \sum_i \left| \mathbf{r}_{i+1} - \mathbf{r}_i \right|$$

      where $\left| \mathbf{r}_{i+1} - \mathbf{r}_i \right|$ is the Euclidean distance between two consecutively sampled points. However, the same result (within a fraction of cm error) can be obtained by integrating the sampled speed $v_i$ over time using the Simpson method

      $$d_{total} = \int v(t)\,dt$$

      Since Latency varies by 10-fold, it is just expected that, given $d = vt$, the total distance will also vary by 10-fold (since $v$ is constant in each time interval $\Delta t$; replacing $v_i$ in the integral yields the discrete sum above).

      The correctness of our kinetic measurements can be simply verified by multiplying the data from the Latency panel with the data from the Velocity panel. If this results in the Distance plot, then there is no discrepancy. 
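
      The equivalence between the two ways of computing distance can be sketched numerically. This is a hedged illustration: the function names and sample points are hypothetical, and the speed is treated as constant within each sampling interval (a simple rectangle rule) rather than integrated with the Simpson method mentioned above.

```python
import numpy as np

def path_length(xy):
    """Total distance as the sum of Euclidean steps between consecutive samples."""
    steps = np.diff(np.asarray(xy, dtype=float), axis=0)
    return float(np.sum(np.linalg.norm(steps, axis=1)))

def distance_from_speed(xy, t):
    """Total distance as the sum of (speed in each interval) * (interval duration)."""
    xy = np.asarray(xy, dtype=float)
    dt = np.diff(np.asarray(t, dtype=float))
    speed = np.linalg.norm(np.diff(xy, axis=0), axis=1) / dt
    return float(np.sum(speed * dt))

# Hypothetical trajectory samples (cm) and their time stamps (s).
xy = [(0, 0), (3, 4), (3, 10), (11, 16)]
t = [0.0, 0.5, 1.0, 1.5]
d_sum = path_length(xy)
d_speed = distance_from_speed(xy, t)
```

      With these made-up samples the steps are 5, 6 and 10 cm, so both functions return 21 cm.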

      In Author response image 3, we show the actual measured distance, $d_{total}$, for both conditions (random and static entrance), calculated with the discrete sum above (black filled circles).

      Author response image 3.

      We compare this with two quantities: (a) average speed multiplied by average latency (red squares); and (b) average of the product of speed by latency (blue inverted triangles). The averages are taken over mice. Notice that if the multiplication is taken before the average (as it should be done), then the product $\langle vt \rangle_{mice}$ is indistinguishable from the total distance obtained by linear interpolation. Even taking the averages prior to the multiplication (which is physically incorrect, since speed and latency are properties of each individual mouse) yields almost exactly the same result (well within 1 standard deviation).

      The only thing to keep in mind here is that the Distance panel in the paper presents the normalized distance according to the target distance to the starting point. This is necessary because in the random entrance experiments, each mouse can go to 1 of 4 possible targets (each of which has a different distance to the starting point).
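
      The two averaging orders compared above can be written out explicitly. The sketch below uses made-up per-mouse values and a hypothetical function name; with these numbers the two quantities differ noticeably, whereas in the data shown in Author response image 3 they agree to well within 1 standard deviation.

```python
import numpy as np

def compare_averaging_orders(speed, latency):
    """Return (mean of per-mouse products, product of the per-mouse means)."""
    speed = np.asarray(speed, dtype=float)
    latency = np.asarray(latency, dtype=float)
    mean_of_product = float(np.mean(speed * latency))            # multiply, then average
    product_of_means = float(np.mean(speed) * np.mean(latency))  # average, then multiply
    return mean_of_product, product_of_means

# Hypothetical per-mouse mean speed (cm/s) and latency (s) for a single trial.
speed = [10.0, 20.0, 30.0]
latency = [30.0, 10.0, 5.0]
mean_of_product, product_of_means = compare_averaging_orders(speed, latency)
```

      The two orders coincide exactly only when speed and latency are uncorrelated across mice, which is why the per-mouse product is the physically meaningful quantity.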

      Figure 4: The reader is confused about the use of a binary color scheme here for the checking behavior: gray for a large amount of checking, and pink for small. But, there is a large ellipse that is gray and there are smaller circles that are also gray, but these two gray areas mean very different things as far as the reader can tell. Is that so? Why not show the entire graded colormap of checking probability instead of such a seemingly arbitrary binary depiction?

      Thank you. Our coloring scheme was indeed poorly thought out and we have changed it. Hopefully the reviewer now finds it easier to interpret. The frequency of hole checks is now encoded into only filled circles of varying sizes and shades of pink. Small empty circles represent the arena holes (empty because they have no food); The large transparent gray ellipse is the variance of the unrestricted spatial distribution of hole checks.

      Figure 4C: What would explain the large amount of checking behavior at the perimeter? Does that occur predominantly during thigmotaxis?

      Yes. As mentioned above, thigmotaxis still occurs in the first trial of training. The point to note is that the hole checking shown in Fig. 4C is pooled over all the mice so that, per mouse, it does not appear so overwhelming.

      Was there a correlation between the amount of time spent by the animals in a part of the maze and the amount of reward checking? Previous studies have shown that the two behaviors are often positively correlated, e.g. reference 20 in the manuscript. How does this fit with the path integration hypothesis?

      We thank the reviewer for pointing this out. Indeed, the time spent searching and the hole checking behavior are correlated. We added a new panel C to Fig. S4 showing a raw correlation plot between Latency and number of checks.

      Also, in the last paragraph of the “Revealing the mouse estimate of target position from behavior” section under “Results”, we have now added a sentence relating the findings in Fig. 4H and 4K (spatial distribution of hole checks, and density of checks near the target, respectively) to note that these findings are in agreement with Fig. 3C (time spent searching in each quadrant).

      “The mean position of hole checks near (20cm) the target is interpreted as the mouse estimated target (Fig. 4C,D,G,H; green + sign=mean position; green ellipses = covariance of spatial hole check distribution restricted to 20cm near the target). This finding together with the displacement and spatial hole check maps (Figs. 4F and 4H, respectively) corroborates the heatmap of time spent in the target quadrant (Fig. 3C), suggesting a positive correlation between hole checks and time searching (see also Fig. S4C).”
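
      The estimate described in this passage (mean position and covariance of the hole checks restricted to 20 cm around the target) can be sketched as follows. The function name, the coordinates and the inclusive 20 cm boundary are assumptions made for illustration; this is not the analysis code used for Fig. 4.

```python
import numpy as np

def estimate_target(checks_xy, target_xy, radius_cm=20.0):
    """Mean position and 2x2 covariance of hole checks within radius_cm of the target."""
    checks = np.asarray(checks_xy, dtype=float)
    target = np.asarray(target_xy, dtype=float)
    near = checks[np.linalg.norm(checks - target, axis=1) <= radius_cm]
    if len(near) < 2:
        raise ValueError("need at least two hole checks near the target")
    mean_position = near.mean(axis=0)    # plotted as the green + sign
    covariance = np.cov(near, rowvar=False)  # defines the green ellipse
    return mean_position, covariance

# Hypothetical hole checks (cm); the last one lies far from the target and is excluded.
target = (50.0, 0.0)
checks = [(48.0, 2.0), (52.0, -2.0), (50.0, 4.0), (50.0, -4.0), (0.0, 0.0)]
mean_position, covariance = estimate_target(checks, target)
```

      The eigenvectors and eigenvalues of `covariance` give the orientation and axis lengths of the ellipse.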

      "Scratches and odor trails were eliminated by washing and rotating the maze floor between trials." Can one eliminate scratches by just washing the maze floor? Rotation of the maze floor between trials can make these cues unreliable or variable but will not eliminate them. Ditto for odor cues.

      The upper arena floor is rotated between trials so that any scratches will not be stable cues. We clarified this in the Discussion about potential cues. 

      See “What cues are required for learning the food location?”

      "Possible odor gradient cues were eliminated by experiments where such gradients were prevented with vacuum fans (Fig. S6E)" What tests were done to ensure that these were *eliminated* versus just diminished?

      "Probe trials of fully trained mice resulted in trajectories and initial hole checking identical to that of regular trials thereby demonstrating that local odor cues are not essential for spatial learning." As far as the reader can tell, probe trials only eliminated the food odor cues but did not eliminate all other odors. If so, this conclusion can be modified accordingly.

      We were most worried about odor cues guiding the mice and as now described at great length, we tried to mitigate this problem in many ways. As the reviewer notes, it is not possible to have absolute certainty that there are no odor cues remaining. The most difficult odor to eliminate was the potential odor gradient emanating from the mouse’s home cage. However, the 2 vacuum fans per cage were very powerful in first evacuating the cage air (150x in 5 minutes) and then drawing air from the arena, through the cage and out its top for the duration of each trial. We believe that we did at least vastly reduce any odor cues and perhaps completely eliminated them.

      The interpretation of direction selectivity is a bit tricky. At different places in this manuscript, this is interpreted as a path integration signal that encodes goal location, including the Consync cells. However, studies show that (e.g. Acharya et al. 2016) direction selectivity in virtual reality is comparable to that during natural mazes, despite large differences in vestibular cues and spatial selectivity. How would one reconcile these observations with path integration interpretation?

      Thank you. We had not been serious enough in considering the VR studies and their implications for optic flow as a cue for spatial learning. We now have a section (Optic flow cues) in the Discussion that acknowledges the potential role of such cues in spatial learning in our maze. 

      However, spatial learning in our maze can also occur in the dark. The next small section (Vestibular and proprioceptive cues) addresses this point. We cannot be certain about the precise cues used by the mouse to effectively learn to locate food in our maze, but it will take further behavioral and electrophysiological studies to go deeper into these questions. 

      An extended discussion is found in the sections entitled “What cues are required for learning the food location” and “A fixed start location and self-motion cues are required for spatial learning”.  We may have missed some references or ideas regarding VR maze learning with optic flow signals – the Acharya et al reference was an excellent starting point, and we would be grateful for additional pointers that would improve our discussion of this point.

      The manuscript would be improved if the speculations about place cells, grid cells, BTSP, etc. were pared down. I could easily imagine the outcome of these speculations to go the other way and some claims are not supported by data. "We note that the cited experiments were done with virtual movement constrained to 1D and in the presence of landmarks. It remains to be shown whether similar results are obtained in our unconstrained 2D maze and with only self-motion cues available." There are many studies that have measured the evolution of place cells in non- virtual mazes, look up papers from the 1990s. Reference 43 reports such results in a 2D virtual maze.

      We understand the reviewer’s concerns with the length of the manuscript. However, both the first and third reviewer did find this extensive section useful. We did not add the many papers on the evolution of place fields in real world mazes simply to prevent even greater expansion of the discussion, but relied on the very thorough review of Knierim and Hamilton instead. 

      Reviewer #3 (Public Review):

      Summary:

      How is it that animals find learned food locations in their daily life? Do they use landmarks to home in on these learned locations or do they learn a path based on self-motion (turn left, take ten steps forward, turn right, etc.). This study carefully examines this question in a well-designed behavioral apparatus. A key finding is that to support the observed behavior in the hidden food arena, mice appear to not use the distal cues that are present in the environment for performing this task. Removal of such cues did not change the learning rate, for example. In a clever analysis of whether the resulting cognitive map based on self-motion cues could allow a mouse to take a shortcut, it was found that indeed they are. The work nicely shows the evolution of the rodent's learning of the task, and the role of active sensing in the targeted reduction of uncertainty of food location proximal to its expected location.

      Strengths:

      A convincing demonstration that mice can synthesize a cognitive map for the finding of a static reward using body frame-based cues. This shows that the uncertainty of the final target location is resolved by an active sensing process of probing holes proximal to the expected location. Showing that changing the position of entry into the arena rotates the anticipated location of the reward in a manner consistent with failure to use distal cues.

      Thank you.

      Weaknesses:

      The task is low stakes, and thus the failure to use distal cues at most costs the animal a delay in finding the food; this delay is likely unimportant to the animal. Thus, it is unclear whether this result would generalize to a situation where the animal may be under some time pressure, urgency due to food (or water) restriction, or due to predatory threat. In such cases, the use of distal cues to make locating the reward robust to changing start locations may be more likely to be observed.

      We have added a section, “Combining trajectory direction and hole check locations yields a Target Estimation Vector”, that summarizes our main hypotheses. This section notes exactly this point and includes the reference to the excellent MacIver paper on “robot aggression”.

      The main point here follows the Knierim and Hamilton review and assumes that learning “heading direction” and “distance from start to food” require different cues and extraction mechanisms. “Here we follow a review by Knierim and Hamilton (12) suggesting independent mechanisms for extraction of target direction versus target distance information. Averaging across trajectories gave a mean displacement direction, an estimate of the average heading direction as the mouse ran from start to food. The heading direction must be continuously updated as the mouse runs towards the food, given that the mean displacement direction remains straight despite the variation across individual trajectories. Heading direction might be extracted from optic flow and/or vestibular system and be encoded by head direction cells. However, the distance from home to food is not encoded by head direction signals.”

      And

      “We hypothesize that path integration over trajectories is used to estimate the distance from start to food. The stimuli used for integration might include proprioception or acceleration (vestibular) signals as neither depends on visual input. Our conclusion is in accord with a literature survey that concluded that the distance of a target from a start location was based on path integration and separate from the coding of target heading direction (12). Our “in the dark” experiments reveal the minimal stimuli required for spatial learning – an anchoring starting point and directional information based on vestibular and perhaps proprioceptive signals. This view is in accord with recent studies using VR (47, 48). Under more naturalistic conditions, animals have many additional cues available that can be used for flexible control of navigation under time or predation pressure (51).”.

      Furthermore, we added panel G to Fig. S4, where we show the evolution of the heading angle along the trajectory, plotted as a function of the trials. We see that the mouse only steers towards the target in the last segment of the trajectory, consistent with the head direction being continuously updated along the path to the food.

      Recommendations for the authors:

      Reviewing Editor (Recommendations For The Authors):

      All three reviewers agreed during the consultation that the context in which distal cues are described in the manuscript would benefit significantly from refinement. The distal cues may be made completely useless from an ethological perspective e.g. if they are seen as "moving" relative to the entrance point (i.e. if the animal were to think it were entering the same location), then the cues would appear as unstable in the random entrance. As such, they may be so unlike natural experiences as to be potentially confusing to the animal. Moreover, as reported in some of the reviews, the animals may be using the entrances and boundaries as cues to help refine path integration. The results are still very interesting, but more refinement in the text on the interpretation of cues would greatly improve the manuscript. Thus, we recommend that you revise your manuscript to address the reviews.

      Thank you. We agree with this recommendation of the reviewers and have greatly expanded our discussion on cue stability as already indicated above.

      Should you choose to revise your manuscript, please ensure the manuscript includes full statistical reporting including exact p-values wherever possible alongside the summary statistics (test statistic and df) and 95% confidence intervals. These should be reported for all key questions and not only when the p-value is less than 0.05.

      Done

      Lastly, I want to personally apologize for the long delay in editing this manuscript. All three reviews were unfortunately quite delayed, including my own review. I want to thank you for submitting your work to eLife and hope that we can be more efficient in editing your work in the future.

      It was a long review process, but we also appreciate that our article was dense and difficult to read. We tried to be comprehensive in our controls and analyses and we appreciate the considerable effort it must have taken to carefully review our paper.

      Reviewer #3 (Recommendations For The Authors):

      I quite enjoyed this paper and have some suggestions for further improvement.

      First, while I appreciate that the format of the journal has Methods at the end, there are some key details that need to be moved forward in the study for proper appreciation of the results. These include:

      (1) Location and size of distal cues.

      Done

      (2) Use of floor washing between mice.  

      Done

      (3) Use of food across the subfloor to provide some masking of the location of the food reward.

      Done

      (4) A scale bar on one of the early figures showing the apparatus would be beneficial.

      Done for Figure 1 where we also provide arena diameter and area.

      (5) Motivational state of the mouse with respect to the food reward (in this case, not food restricted, correct?).

      Done

      Although we are told the trial where learning is defined to have occurred, we were not given the quantitative criterion operationalizing "learning" - please provide (unless I missed it!).

      Thank you. This question turned out to be of importance and led to more detailed analyses and related Discussion. We therefore answer in depth.

      We now realize that learning the distance to food versus learning the direction to food must be analyzed separately.

      On Page 5 second paragraph we provide a definition of “learning distance to food”.

      “Fitting the function dtotal = B*exp(-Trial/K) reveals the characteristic timescale of learning, K, in trial units (Fig. 2F). We obtained K = 26±24 giving a coefficient of variation (CV) of 0.92. The mean, K = 26, is therefore very uncertain and far greater than the actual number of trials. Thus, we hypothesize that the mice did not significantly reduce their distance travelled (Fig. 2A,B,F) because they had not learned the food location – the decrease in latency (Fig. 2D) was due to its increased running speed and familiarity with non-spatial task parameters.”

      On Page 7 second paragraph the same analysis gives:

      “Now the fitting of the function dtotal = B*exp(-Trial/K) yielded K = 5.6±0.5 with a CV = 0.08; the mean is therefore a reliable estimate of total distance travelled. We interpret this to indicate that it takes a minimum number of K = 6 trials for learning the distance to the target (see also Fig. S4D,E,F,G).

      Learning is still not complete because it takes 14 trials before the trajectories become near optimal.”

      Learning of distance to food is evident by Trial 6 but is not complete.

      On Page 9 third paragraph we give a very precise answer to time taken to learn the direction from start to food. This was already very clear from Fig. 4I but we had missed the significance of this result. 

      “We compared the deviation between the TEV and the true target vector (that points from start directly to the food hole; Fig. 4I). While the random entrance mice had a persistent deviation between TEV and target of more than 70°, the static entrance mice were able to learn the direction of the target almost perfectly by trial 6 (TEV-target deviation in first trial mean±SD = 57.27° ± 41.61°; last trial mean±SD = 5.16° ± 0.20°; P=0.0166). A minimum of 6 trials is sufficient for learning both the direction and distance to food (Fig. 4I) (Fig. 3F) (see Discussion). The kinetics of learning direction to food are clearly different from learning distance to food since the direction to food remains stable after Trial 6 while the distance to food continues to approach the optimal value.”

      The direction from start to food is completely learned by Trial 6.
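
      For readers who want to reproduce the TEV-target deviation, the angle between two vectors can be computed as in the sketch below. The function name and example vectors are hypothetical; the sketch only assumes that both the TEV and the true target vector are 2D vectors originating at the start location.

```python
import numpy as np

def deviation_deg(tev, target_vec):
    """Unsigned angle in degrees between the TEV and the true target vector."""
    tev = np.asarray(tev, dtype=float)
    target_vec = np.asarray(target_vec, dtype=float)
    cos_angle = np.dot(tev, target_vec) / (
        np.linalg.norm(tev) * np.linalg.norm(target_vec)
    )
    # Clip guards against rounding slightly outside [-1, 1] before arccos.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

# Illustrative vectors only (arbitrary units).
orthogonal = deviation_deg((1.0, 0.0), (0.0, 1.0))
diagonal = deviation_deg((1.0, 0.0), (1.0, 1.0))
aligned = deviation_deg((2.0, 0.0), (5.0, 0.0))
```

      The result depends only on direction, not on vector length, so a TEV that is shorter or longer than the true start-to-food distance can still have zero angular deviation.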

      These analyses led to an addition to the Discussion on Page 20 (following the Heading).

      “Here we follow a review by Knierim and Hamilton (12) that hypothesized independent mechanisms for extraction of target direction versus target distance information. Our data strongly supports their hypothesis. Target direction is nearly perfectly estimated at trial 6 (Fig. 4I and Results). The deviation of the TEV from the start to food vector is rapidly reduced to its minimal value (5.16°) and with minimal variability (SD=0.20°). Learning the distance from start to food is also evident at trial 6 but only reaches an asymptotic near optimal value at trial 14 (Fig. 3F). The learning dynamics are therefore very different for target direction versus target distance. As noted below, the food direction is likely estimated from the activity of head direction cells. The neural mechanisms by which distance from start to food is estimated are not known (but see (49)).”

      We believe that this small addition summarizes the complicated answer to the reviewer’s question and is helpful in better connecting the Knierim and Hamilton paper to our data. However, if the reviewers and editors feel that we have gone too far or that this discussion is not clear, we can remove or alter the extra sentences as per any comments. 

      Reference #49 is to a review paper on spatial learning in weakly electric fish in the dark (https://doi.org/10.1016/j.conb.2021.07.002). The review summarizes data on a neural “time stamp” mechanism for estimating distance from start to food. In this review article, we explicitly hypothesized that rodents might utilize such a time stamp mechanism for finding food. We did not include this in the discussion because it was too distracting and would likely confuse readers but put in the reference in case some readers did want to access the “time stamp” hypothesis for spatial learning in the dark. 

      Second, the discussion was thoughtful and rich. I particularly enjoyed the segment describing the likely computations of the hippocampus. There are a few thoughts I have for the authors to think about that might be useful to potentially add to the discussion:

      "The remaining one, mouse 34, went from B to the start location and then, to A."

      This out-and-back pattern has been seen in the literature, such as multiple papers by Golani (here's one: https://www.pnas.org/doi/full/10.1073/pnas.0812513106). Would the authors speculate, given their suggested algorithm, what the significance of out and back may be? Is there something about the cell's encoding of direction and distance that requires a return to the start location, and would this be different if representation is based on self-motion versus based on distal cues in an allocentric representation?

      We do discuss this for pretraining trials but have no idea what this mouse is doing in this case.

      In a low-stakes task environment, for an animal that has a low acuity visual system, where the penalty for not using distal cues is at most some additional (likely enriching in itself to these mice who live a fairly unenriched life in small cages) search/learning/exploration time, perhaps it is not so surprising that body-frame cues are used. Considering the ethology of the animal, if it had multiple exits of an underground burrow, it might need to use distal cues to avoid confusion. The scenario you provide to the animal is essentially a deceptive one where it has no way of telling it is coming out to the arena from a different burrow hole, modulo some small landmarks on an otherwise uniform cylinder of space. This might be asking too much of an animal where the space it would enter normally would not be a uniform cylinder.

      What happens with a higher-stakes case? This is clearly a different study, but you may find some recent work with a mobile predatory robot of interest (https://www.sciencedirect.com/science/article/pii/S2211124723016820). Visual cues are crucial in the avoidance of threats in this case. Re-routing, as shown by multiple videos of that study, is after a brief pause, and seemingly takes into account the likely future position of the threat.

      Done. A fascinating paper that illustrates the unexpected “high level” behavior a rodent is capable of when placed in more naturalistic situations. I think our “two food location” experiments are along the same lines: unexpectedly rich behavior when the mice are challenged.

      Connected to the low-stakes vs high-stakes point, it might be nice for the paper to discuss situations in which cognitive-map-based spatial problem solutions make sense versus not.

      Here is an example of such a discussion, around page 496:

      https://www.dropbox.com/scl/fi/ayoo5w4jgnkblgfu7mpad/MacI09a_situated_cog.pdf?rlkey=2qhh89ii7jbkavt6ivevarvdk&dl=0.

      Right, a very relevant discussion by MacIver. However, when I tried to write it in, it took nearly half a page of dense writing to connect it to the themes of our article. I figured that the already long discussion would try the patience of most readers and so decided not to include this extra discussion.

      Minor points/ queries

      Why the increase in sample density at about the 1/4 radius of arena distance? Static, trial 14, Figure 3I, shown also maybe Figure 4 H.

      We were also puzzled when this occurred but have no explanation. And there are, in our figures, many other examples of the mice hole checking near their exit site. See next answer.

      Why was the hole proximal to start so often probed in 7B?

      We were also puzzled when this occurred but have no explanation.

      Check Video 1 to see exactly this behavior. The mouse exits its home and immediately checks a nearby hole. It proceeds to Site B (empty) and then Site A (empty) with many hole checks along the way. After leaving Site A, the mouse proceeds to the wall located far from an entrance and does another hole check. The holes near the wall that are checked are in no way remarkable: a) they have never contained food; b) they are rotated between trials, and we wash the floor carefully, so the mice cannot “smell” any particular hole; c) the food on the lower level floor is in no way “clumped” under that hole, etc.

      We have discussed this phenomenon quite a lot and LM was able to come up with only one hypothesis for this behavior. In analogy to the electric fish work (responses of diencephalic neurons to “leaving or encountering a landmark”), the “near the entrance” hole check might be an active sensing probe to “time stamp” the exit from home while finding food would “time stamp” the end of a successful trajectory. Path integration between time stamps would then provide the estimate for time/distance from start to food – exactly our hypothesis for weakly electric fish spatial learning in the dark. This hypothesis is exceedingly speculative and so we do not want to include it.  

      Normally I would cite a line number. Since I do not see line numbers, I will leave it to you to do a search:

      "A than the expected by chance" -> "than expected"

      Done. I apologize for the lack of line numbers. I have, so far, been unable to get Word to confine line numbers to selected text and not run over onto the Figure Legends. I have put in page numbers and hope this helps.

      RW, VR, MWM, etc - please expand the acronym on first use.

      Done

      It might be interesting to see differences in demand/reliance on active sensing in the individuals who learn the task less well than the animals who learn the task well. If the point is to expunge uncertainty, then does the need for such expunging increase with the poverty of internal representation resolution / fewer decimal places on the internal TEV calculation?

      We do have variation in the mice learning time but the numbers are not sufficient for this interesting extension. This is just one of many follow up studies we hope to carry out.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The crystal structure of the Sld3CBD-Cdc45 complex presented by Li et al. is a novel contribution that significantly advances our understanding of CMG formation during the rate-limiting step of DNA replication initiation. This structure provides insights into the intermediate steps of CMG formation. The study builds upon previously known structures of Sld3 and Cdc45 and offers new perspectives into how Cdc45 is loaded onto MCM DH through Sld3-Sld7. The most notable finding is the structural difference in Sld3CBD when bound to Cdc45, particularly the arrangement of the α8-helix, which is essential for Cdc45 binding and may also pertain to its metazoan counterpart, Treslin. Additionally, the conformational shift in the DHHA1 domain of Cdc45 suggests a possible mechanism for its binding to MCM2NTD.

      Strengths:

      The manuscript is generally well-written, with a precise structural analysis and a solid methodological section that will significantly advance future studies in the field. The predictions based on structural alignments are intriguing and provide a new direction for exploring CMG formation, potentially shaping the future of DNA replication research.

      Weaknesses:

      The main weakness of the manuscript lies in the lack of experimental validation for the proposed Sld3-Sld7-Cdc45 model. Specifically, the claim that Sld3 binding to Cdc45-MCM does not inhibit GINS binding, a finding that contradicts previous research, is not sufficiently substantiated with experimental evidence. To strengthen their model, the authors must provide additional experimental data to support this mechanism. Also, the authors have not compared the recently published Cryo-EM structures of the metazoan CMG helicases with their predicted models to see if Sld3/Treslin does not cause any clash with the GINS when bound to the CMG. Still, the work holds great potential in its current form but requires further experiments to confirm the authors' conclusions.

      We appreciate the reviewers’ careful reading and the comments.

      Our structural analysis of Sld3CBD-Cdc45 showed the detailed interaction map between Sld3CBD and Cdc45 at 2.6 Å resolution. The Sld3-, MCM- and GINS-binding sites of Cdc45 are completely different, suggesting that Sld3CBD, Cdc45 and GINS could bind to MCM together. The SCMG-DNA model confirmed such a binding manner, although our study does not show how this binding manner affects GINS loading by other initiation factors (Dpb11, Sld2, etc.). Regarding the previous studies reporting competition between Sld3 and GINS for binding to Cdc45 or Cdc45-MCM (Bruck et al.), this competition may be caused by the conformational change of Cdc45 DHHA1 between Sld3CBD-Cdc45 and CMG. We modified our manuscript and discussed this point (P7/L168-173, and P10/L282-286). Following the comment, we compared the recently published cryo-EM structure of the metazoan CMG helicase (PDB ID: 8Q6O) with our predicted models (P7/L198-P8/L202) and added the Cdc45 mutation experiments to confirm our conclusion ([Recommendations for the authors] Q18).

      Reviewer #2 (Public review):

      Summary

      The manuscript presents valuable findings, particularly in the crystal structure of the Sld3CBD-Cdc45 interaction and the identification of additional sequences involved in their binding. The modeling of the Sld7-Sld3CBD-CDC45 subcomplex is novel, and the results provide insights into potential conformational changes that occur upon interaction. However, the work remains incomplete as several main claims are only partially supported by experimental data, particularly the proposed model for Sld3 interaction with GINS on the CMG. Additionally, the single-stranded DNA binding data from different species do not convincingly advance the manuscript's central arguments.

      Strengths

      (1) The Sld3CBD-Cdc45 structure is a novel contribution, revealing critical residues involved in the interaction.

      (2) The model structures generated from the crystal data are well presented and provide valuable insights into the interaction sequences between Sld3 and Cdc45.

      (3) The experiments testing the requirements for interaction sequences are thorough and conducted well, with clear figures supporting the conclusions.

      (4) The conformational changes observed in Sld3 and Cdc45 upon binding are interesting and enhance our understanding of the interaction.

      (5) The modeling of the Sld7-Sld3CBD-CDC45 subcomplex is a new and valuable addition to the field.

      Weaknesses

      (1) The proposed model for Sld3 interacting with GINS on the CMG needs more experimental validation and conflicts with published findings. These discrepancies need more detailed discussion and exploration.

      Our structural analysis of Sld3CBD-Cdc45 showed the detailed interaction information between Sld3CBD and Cdc45 at 2.6 Å resolution. The Sld3CBD-binding site of Cdc45 is completely different from the GINS- and MCM-binding sites of Cdc45, suggesting that Sld3CBD, Cdc45, and GINS could bind to MCM together. The SCMG-DNA model confirmed such a binding manner. Following the comment, we added a Cdc45 mutant analysis, disrupting the binding to MCM and GINS but not affecting the Sld3CBD binding (Supplementary Figure 9). Our model is consistent with the GINS-loading requirement (the phosphorylation of Sld3 on Cdc45-MCM) and has no discrepancies with the stepwise loading fashion (please see the responses to [Recommendations for the authors] Reviewer #1, Q14-15). Regarding the previous studies showing, by in vitro binding experiments, competition between Sld3 and GINS for binding to Cdc45 or Cdc45-MCM (Bruck et al.), please see the responses to [Recommendations for the authors] Q6.

      (2) The section on the binding of Sld3 complexes to origin single-stranded DNA needs significant improvement. The comparisons between Sld3-CBD, Sld3CBD-Cdc45, and Sld7-Sld3CBD-Cdc45 involve complexes from different species, limiting the comparisons' value.

      As suggested, we tried to improve the ssDNA-binding section (please see the responses to [Recommendations for the authors]: Q4 and Q5). We used Sld7-Sld3CBD-Cdc45 from different sources due to limitations in protein expression. These two sources belong to the same family, and the proteins Sld7, Sld3 and Cdc45 show sequence conservation with similar structures predicted by AlphaFold3 (RMSD = 0.356, 1.392, and 0.891 for Cα atoms of Sld7CTD, Sld7NTD-Sld3NTD, and Sld3CBD-Cdc45). Such similarity at the source and protein levels allows us to make the comparison.
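      For readers unfamiliar with the metric, a Cα RMSD of the kind quoted here is usually computed after optimally superposing the two sets of matched atoms. The following is a minimal sketch of that calculation using the Kabsch algorithm, not the procedure actually used in this work; the function name `kabsch_rmsd` and the demo coordinates are hypothetical and chosen only for illustration.

```python
import numpy as np

def kabsch_rmsd(coords_a, coords_b):
    """RMSD between two (N, 3) arrays of matched atoms after optimal superposition."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (N, 3)")
    # Remove translation by centring both coordinate sets.
    a_c = a - a.mean(axis=0)
    b_c = b - b.mean(axis=0)
    # Kabsch: the SVD of the covariance matrix gives the optimal rotation.
    u, _, vt = np.linalg.svd(a_c.T @ b_c)
    # Flip one axis if needed so the result is a proper rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    diff = a_c @ rot - b_c
    return float(np.sqrt((diff ** 2).sum() / len(a)))

# Hypothetical demo coordinates (not taken from any structure in this work).
rng = np.random.default_rng(0)
model_a = rng.normal(size=(50, 3))
angle = 0.4
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0,            0.0,           1.0]])
model_b = model_a @ rot_z + np.array([5.0, -3.0, 2.0])  # rigidly moved copy
print(f"RMSD after superposition: {kabsch_rmsd(model_a, model_b):.3f}")
```

      The sketch assumes the two atom lists are already matched one-to-one; in practice, the residue correspondence would come from a sequence or structural alignment.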

      (3) The authors' model proposing the release of Sld3 from CMG based on its binding to single-stranded DNA is unclear and needs more elaboration.

      Considering that ssDNA (ssARS1) is produced by CMG, the ssDNA binding of Sld3 should happen after an active CMG has formed. Therefore, the results of the ssDNA-binding experiments imply that Sld3 release could be coupled to its binding to the ssDNA produced by CMG. We tried to elaborate on this in the revised version (please see the responses to [Recommendations for the authors] Q4, Q5).

      Reviewer #3 (Public review):

      Summary:

      The paper by Li et al. describes the crystal structure of a complex of Sld3-Cdc45-binding domain (CBD) with Cdc45 and a model of the dimer of an Sld3-binding protein, Sld7, with two Sld3-CBD-Cdc45 for the tethering. In addition, the authors showed the genetic analysis of the amino acid substitution of residues of Sld3 in the interface with Cdc45 and biochemical analysis of the protein interaction between Sld3 and Cdc45 as well as DNA binding activity of Sld3 to the single-strand DNAs of the ARS sequence.

      Strengths:

      The authors provided a nice model of an intermediate step in the assembly of an active Cdc45-MCM-GINS (CMG) double hexamers at the replication origin, which is mediated by the Sld3-Sld7 complex. The dimer of the Sld3-Sld7 complexes tethers two MCM hexamers together for the recruitment of GINS-Pol epsilon on the replication origin.

      Weaknesses:

      The biochemical analysis should be carefully evaluated with more quantitative ways to strengthen the authors' conclusion.

      We thank you for your positive assessment. We provided more quantitative information and tried to quantify the experiments as suggested (please see the responses to [Recommendations for the authors]).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I have several concerns that I will outline below, accompanied by my suggestions.

      (1) The title of the paper, "Structural and functional insights into Cdc45 recruitment by Sld7-Sld3 for CMG complex Formation," appears misleading because it appears that authors present a structure of Sld3-Sld7 in complex with Cdc45, which is not the case here. If authors can provide additional structures proving the function of this complex, then this title justifies it. Otherwise, I recommend making a title that justifies the presented work in its current form.

      Following the comment, we changed the title to “Sld3CBD-Cdc45 structural insights into Cdc45 recruitment for CMG complex formation”.

      (2) In lines 70-72, where the authors mention the known structures of different proteins, intermediates, and complexes, I recommend including PDB IDs of the described structures and reference citations. This will help the readers to analyze what is missing in the pathway and why this structure is essential.

      Following the comment, we added PDB IDs and references (P3/L72-74).

      (3) The representation of Figure 1A is unclear and looks clumsy. If the structure were rotated in another orientation, where α8 and α9 would be displayed on the forward side, it would be more helpful to understand the complex forming regions by looking at the structure. Also, I recommend highlighting the α8 and α9 in a contrasting color to be easily visible and attract readers' attention. Similarly, it would also be helpful if DHHA1 would be shown in a different color.

      Following the comment, we modified Figure 1 to clearly show α8 and α9 of Sld3CBD and DHHA1 of Cdc45 in the revised version.

      (4) Can authors add a supplementary figure showing the probability of disorderness of the α8 helix region in the Sld3? Also, highlight what region became ordered in their structure.

      Yes, we have shown the disordered α8 helix region and highlighted the ordered α8 of Sld3 in Figure S4A.

      (5) Can you compare the Cdc45 long distorted helix (Supplementary Figure 3B) in the Sld3-Cdc45 complex with the Xenopus and Drosophila Cdc45 from their CMG structures? Also, can the authors explain why this helix is destabilized in their structure but is relatively stable in another Cdc45 structure (in CMG and HuCdc45)?

      We have checked all Cdc45 from published cryo-EM CMG structures, including Xenopus CMG-donson (8Q6O) and Drosophila CMG (6RAW); in all of them the long helix is ordered in the CMG complex, whereas this long helix was disordered in the crystal structures of Sld3CBD-Cdc45 and Entamoeba histolytica Cdc45. The crystal packing around the long helix showed that it appears to be stabilized by crystal packing only in huCdc45; therefore, we suggested that this long helix is unstable during crystallization.

      (6) I recommend adding the following parameters to Supplementary Table 2: 1. Rmerge values, 2. Wilson B factor, 3. Average B factor, and 4. Total number of molecules in ASU.

      We apologize for the mistake regarding Rmerge in Table 2 and have corrected it. We added the Wilson B factor, the average B factor, and the total number of Sld3CBD-Cdc45 molecules in the ASU.

      (7) Can authors provide the B factor values of the α8 helix of Sld3?

      We checked the B factor values of the helix α8CTP of Sld3 in Sld3CBD-Cdc45. Since this helix binds to Cdc45 stably, the average B factor of the main chain is 45 Å² less than that of the whole structure. We added the average B factor of helix α8CTP into the Supplementary Figure 4A legend.

      (8) Can authors explain why higher Ramachandran outliers exist in their structure? Can it be reduced below 1% during refinement?

      There are 13 outliers (1.67%) in different places: four are close to the disordered regions (poor electron density map), four are in a loop with a poor map, and the remainder are in turns or a loop. For the residues with poor electron density maps, we could not move them into the allowed Ramachandran region while keeping a low Rfree value, so we could not reduce the outliers to below 1% during refinement while keeping the current Rfree value.

      (9) In Supplementary Figure 8, please show the CD spectra of the Sld3WT. Why is the Sld3-3S peak relatively flat? Was the sample precipitating while doing the measurements, or does it have less concentration than others?

      To check the folding of the mutants, we did CD experiments with the estimated secondary structure elements. Because WT Sld3CBD was prepared in a complex with Cdc45, while the mutants of Sld3CBD existed alone, we calculated the elements of secondary structure from the crystal structure of Sld3CBD-Cdc45. The concentration of samples was controlled to the same level for CD measurement. The relatively flat Sld3-3S peak may be caused by precipitation during the measurement.

      (10) Can authors generate the AlphaFold 3 models of the Sld3CBD-Cdc45-MCM-dsDNA and SCMG-dsDNA and compare them with the models they have generated?

      We tried to predict the Sld3CBD-Cdc45-MCM-dsDNA and SCMG-dsDNA structures using AlphaFold3. Although the results showed structures similar to our models, many parts were disordered, so we did not use the predicted structures.

      (11) The authors say that the overall molecular mass of the Sld7-Sld3ΔC-Cdc45 was >400kDa on the SEC column. However, the column used for purifying this complex and the standards that were run on it for molecular weight calculations have not been written anywhere. If the Superdex 200 column was used, then the sample of more than 400kDa should not elute at the position shown in Supplementary Figure 2B. I recommend showing the standard MW plot and where the elution volume of the Sld7-Sld3ΔC-Cdc45 lies on the standard curve. Also, add how molecular weight calculations were done and the calculated molecular mass.

      Following the comment, we added a measurement on a Superdex 200 16/60 column (SEC) using a standard sample kit to Supplementary Figure 2 to show that the molecular weight of the peak at that position was estimated to be > 400 kDa.

      (12) I also recommend using at least one of the techniques, either SEC-MALS or AUC, to calculate the actual molecular mass of the Sld7-Sld3ΔC-Cdc45 complex and to find its oligomeric state. If the authors want to prove their hypothesis that a dimer of this complex binds to MCMDH, it is essential to show that it exists as a dimer. Based on the current SEC profile, it appears as a monomer peak if the S200 SEC column is being used.

      As in our response to (11), we added the standard MW plot (measured using a Superdex 200 16/60 column) obtained with a standard sample kit. The molecular weight at the peak elution position of Sld7-Sld3ΔC-Cdc45 was estimated to be 429 kDa. Considering that the Sld7-Sld3ΔC-Cdc45 dimer should be a flexible, long-shaped molecule, the elution position could correspond to a larger molecular weight than the real one (158 × 2 kDa). We also tried to confirm the particle size using SEC-SAXS, as described in our response to the next question (13).
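      To make the SEC estimate concrete, an apparent molecular weight is commonly read off a calibration line of log10(MW) against elution volume. The sketch below illustrates that general approach and is not the calculation performed in this study: the molecular weights are those of commonly used gel-filtration standards, while the elution volumes, the peak position, and the function names are all hypothetical.

```python
import numpy as np

# Hypothetical calibration data: the molecular weights are those of commonly
# used gel-filtration standards, but the elution volumes are invented for
# illustration and are NOT the values measured in this study.
standards_kda = np.array([669.0, 440.0, 158.0, 75.0, 44.0])
elution_ml = np.array([50.0, 56.0, 68.0, 77.0, 83.0])

def fit_calibration(volumes_ml, weights_kda):
    """Least-squares line through log10(MW) versus elution volume."""
    slope, intercept = np.polyfit(volumes_ml, np.log10(weights_kda), 1)
    return slope, intercept

def estimate_mw_kda(volume_ml, slope, intercept):
    """Apparent molecular weight (kDa) read off the calibration line."""
    return 10 ** (slope * volume_ml + intercept)

slope, intercept = fit_calibration(elution_ml, standards_kda)
apparent = estimate_mw_kda(56.5, slope, intercept)  # hypothetical peak position
print(f"apparent MW ~ {apparent:.0f} kDa")
```

      Because such a calibration assumes compact, globular standards, a flexible, elongated complex tends to elute earlier and therefore appears heavier than its true mass, which is the effect invoked in the response above.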

      (13) Dynamic light scattering is not the most accurate method for calculating intermolecular distance. I recommend using another technique that calculates the accurate molecular distances between two Cdc45 if Sld7-Sld3ΔC-Cdc45 is forming a dimer. Techniques such as FRET could be used. Otherwise, some complementary methods, such as SAXS, could also be used to generate a low-resolution envelope and fit the speculated dimer model inside, or authors could try negative staining the purified Sld7-Sld3ΔC-Cdc45 and generate 2D class averages and low-resolution ab initio models to see how the structure of this complex appears and whether it satisfies the speculated model of the dimeric complex.

      We have tried both negative-staining TEM and SEC-SAXS experiments. We could not obtain negative-staining TEM images good enough to generate 2D class averages and low-resolution ab initio models. The results of SEC-SAXS provided a molecular weight of 370 - 420 kDa and an Rg > 85 Å, which are consistent with our conclusion from the SEC and DLS results but have a large error due to the measurement temperature of 10-15°C (a limitation of the measuring equipment). The SEC-SAXS peak under the measurement conditions was not as sharp as that during purification at 4°C, and the SAXS data are not good enough to build a molecular model, so we did not add them to our manuscript.

      (14) Authors mentioned in the introduction section (lines 72-73) that based on the single-molecule experiments, Cdc45 is recruited in a stepwise manner to MCMDH. If this is true and if Sld7-Sld3ΔC-Cdc45 forms a dimer, this is also true, then for stepwise recruitment, the dimer will have to break into monomers, and this will be an energy-expensive process for the cell. So, would such a process occur physiologically? Can the authors explain how this would physiologically happen inside the cell?

      Sld7-Sld3-Cdc45 consists of domains linked by long loops, so the dimer Cdc45-Sld3-[Sld7]2-Sld3-Cdc45 is flexible and long-shaped. Such a flexible dimer does not mean that two Cdc45 molecules must bind to MCM DH simultaneously; they may bind to MCM DH in a stepwise manner. The dimer formation of Sld7-Sld3-Cdc45 is advantageous for recruiting efficiently and saving energy. Moreover, our proposal of Cdc45-Sld3-[Sld7]2-Sld3-Cdc45 on MCM DH could be a stage during CMG formation in the cell. Following the comment, we added such descriptions (P7/L194, and P10/L276-279).

      (15) Can authors show experimentally that a dimer of Sld7-Sld3ΔC-Cdc45 is binding to MCMDH and not a monomer in a stepwise fashion?

      In our study, we provided particle-size experiments to show the dimer of Sld7-Sld3-Cdc45 off MCM DH and a model of SCMG to indicate the dimer of Sld7-Sld3ΔC-Cdc45 on MCM DH. This question should be addressed in the future by cryo-EM of Sld7-Sld3-Cdc45-MCM DH or Sld7-Sld3-CMG. As in our response to Q14, the flexible dimer of Sld7-Sld3ΔC-Cdc45 binding on MCM DH does not contradict the stepwise-loading fashion. The dimer of Sld7-Sld3ΔC-Cdc45 binding on MCM DH represents one stage of the process.

      (16) Can authors highlight where Sld7 will lie on their model shown in Figures 3A and 3C, considering their model shown in 3B is true?

      We predict that the Sld7-Sld3-Cdc45 should be in a dimer form of Cdc45-Sld3-[Sld7]2-Sld3-Cdc45 based on the structures and the particle size analysis. The Sld7 dimer could be across MCM DH on the top of Figure 3A right and 3C right. However, we could not add the Sld7 molecule to the models because there is no interaction data between Sld7 and MCM.

      (17) In Supplementary Figure 10, can authors show the residues between the loop region highlighted in the dotted circle to show that there is no steric clash between the residues in that region of their predicted model?

      Following the comment, we added the residues in Supplementary Figure 10 (Supplementary Figure 11 in the revised version) to show no steric clash in our predicted model.

      (18) It is essential to show experimentally that Sld3CBD neighbors MCM2 and binds Cdc45 on the opposite side of the GINS binding site. I recommend that the authors design an experiment that proves this statement. Mutagenesis experiments for the predicted residues that could be involved in interaction with proper controls might help to prove this point. Since this is the overall crux of the paper, it has to be demonstrated experimentally.

      We thank the reviewer for the recommendation. Our structural analysis experiment shows the interaction information between Sld3CBD and Cdc45 at 2.6 Å resolution. The Sld3CBD-binding site, GINS-binding site, and MCM-binding site of Cdc45 are completely different, indicating that Sld3CBD, Cdc45 and GINS could bind to MCM together. The SCMG model confirmed such a binding manner. Following the recommendation, we added mutant analysis of Cdc45 G367D and W481R, which were reported to disrupt the binding to MCM and GINS, respectively. Neither mutant affects the binding to Sld3CBD, as we predicted (Supplementary Figure 9B). We modified our manuscript and discussed this point more clearly (P7/L170-173).

      (19) I recommend rewriting the sentence in lines 208-210. During EMSA experiments, new bands do not appear; instead, there is no shift at lower ratios, so you see a band similar to the control for Sld3CBD-Cdc45. So, re-write the sentence correctly to avoid confusion when interpreting the result.

      Following the comment, we rewrote this sentence to “The ssDNA band remained (Figure 4B) and new bands corresponding to the ssDNA–protein complex appeared in CBB staining PAGE (Supplementary Figures 13) when the Sld3CBD–Cdc45 complex was mixed with ssDNA at the same ratio, indicating that the binding affinity of Sld3CBD–Cdc45 for ssDNA was lower than that of Sld3CBD alone” (P8/L226-229).

      (20) Since CDK-mediated phosphorylation of Sld3 is known to be required for GINS loading, the ssDNA binding affinity of phosphorylated Sld3 remains the same. I wonder what would happen if phosphorylated Sld3 were used for the experiment shown in Figure 4B.

      The CDK phosphorylation site is located at Sld3CTD and our ssDNA-binding experiment did not include the Sld3CTD, so phosphorylated Sld3 does not affect the results shown in Figure 4B.

      (21) Sld3CBD-Cdc45 has a reduced binding affinity for ss DNA, and Sld7-Sld3ΔC-Cdc45 and Sl7-Sld3ΔC have a similar binding affinity to Sld3CBD based on figure 4B. It appears that Sld3CBD reduces the DNA binding affinity of CDC45 or vice versa. Is it correct to say so?

      Our opinion is “vice versa”: Cdc45 reduces the ssDNA-binding affinity of Sld3CBD. Although we could not pinpoint the ssDNA-binding sites of Sld3CBD, the surface charge of Sld3CBD implies that α8CTP could contribute to ssDNA binding (Supplementary Figure 15).

      (22) Cdc45 binds to the ssDNA by itself, but in the case of Sld3CBD-Cdc45, the binding affinity is reduced for Sld3CBD and Cdc45. Based on their structure, can authors explain what leads to this complex's reduced binding affinity to the ssDNA? Including a figure showing how Sld7-Sld3CBD-Cdc45 interacts with the DNA would be a nice idea.

      Previous studies showed that Cdc45 binds more tightly to long ssDNA (> 60 bases) and that the C-terminus of Cdc45 is responsible for the ssDNA-binding activity. The structure of Sld3CBD-Cdc45 shows that the C-terminal domain DHHA1 of Cdc45 binds to Sld3CBD, which may reduce the ssDNA-binding affinity of Cdc45 in the Sld3CBD-Cdc45 complex. We agree that a figure showing how Sld7-Sld3CBD-Cdc45 interacts with ssDNA would be a nice idea. However, there is no detailed interaction information between Sld7-Sld3Δ-Cdc45 and ssDNA, so we could not provide a figure showing the ssDNA-binding manner. We added a figure to show the surface charges of Sld3CBD of Sld3CBD-Cdc45, and Sld3NTD-Sld7NTD, respectively (Supplemental Figure 15).

      (23) Based on the predicted model of Sld7-Sld3 and Cdc45 complex, can authors explain how Sld7 would restore the DNA binding ability of the Sld3CBD?

      It can be considered that Sld7 and Sld3NTD could bind ssDNA. Although we did not perform the ssDNA-binding assay of Sld7, the Sld3NTD-Sld7NTD surface shows a large positive charge area which may contribute to ssDNA-binding (Supplemental Figure 15). We added the explanation (P9/L245-248).

      (24) It would be important to show binding measurements and Kd values of all the different complexes shown in Figure 4B with ssDNA to explain the dissociation of Cdc45 from Sld7-Sld3 after the CMG formation. I also recommend describing the statement from lines 224-227 more clearly how Sld7-Sld3-Cdc45 is loading Cdc45 on CMG.

      As the reviewer mentioned, the binding measurements and Kd values of all the different complexes are important for explaining the dissociation of Sld7-Sld3 from CMG. The pull-down assay using chromatography may be affected by the balance between binding affinity and chromatography conditions. Therefore, we used EMSA with native PAGE, which is closest to the natural state. However, the disadvantage is that the Kd values could not be estimated. For lines 224-227, the ssARS1-binding affinity of Sld3 and its complexes should relate to the dissociation of Sld7–Sld3 from the CMG complex but not to Cdc45 loading, because ssARS1 is unwound from dsDNA by the CMG complex after Cdc45 and GINS loading. We modified the description (P9/L248-251).

      (25) Can authors explain why SDS-PAGE was used to assess the ssDNA (See line 420)?

      We apologize for this mistake and have corrected it to “polyacrylamide gel electrophoresis”.

      (26) In line 421, can the authors elaborate on a TMK buffer?

      We apologize for this omission and have added the composition of the TMK buffer (P16/L453).

      (27) I am curious to know if the authors also attempted to Crystallize the Sld7-Sld3CBD-Cdc45 complex. This complex structure would support the authors' hypothesis in this article.

      We tried to crystallize Sld7-Sld3Δ-Cdc45 but could not get crystals. We also tried using cryo-EM but failed to obtain data.

      Reviewer #2 (Recommendations for the authors):

      (1) The manuscript would be strengthened if the authors acknowledged in greater detail how their work agrees with or disagrees with Itou et al. (PMID: 25126958 DOI: 10.1016/j.str.2014.07.001). The introduction insufficiently described the findings of that previous work in lines 63-64.

      We compared Sld3CBD in Sld3CBD-Cdc45 to the monomer reported by Itou et al. (PMID: 25126958 DOI: 10.1016/j.str.2014.07.001) in the section [The overall structure of Sld3CBD-Cdc45] and pointed out the structural similarities and differences (P5/L105-106), especially the conformational change of Sld3CBD α8 for binding to Cdc45, which agrees with the mutant experiments of Itou et al. (P3/L126-127). Another Cdc45-binding site of Sld3CBD in the Sld3CBD-Cdc45 complex is α9, not the residues predicted in previous studies.

      (2) Figure 2. Could you please perform and present data from multiple biological replicates (e.g., at least two independent experiments) for each mutant strain? This would help ensure that the observed pull-downs (2A-B) and growth patterns (2C) are consistent and reproducible.

      We have performed the pull-downs three times, from co-expression through purification to the pull-down assay. We added descriptions to the method of [Mutant analysis of Sld3 and Cdc45]. The growth experiments in Figure 2C were performed twice.

      (3) Figure 3B. The match between the predicted complex length and particle size measured by dynamic light scattering (DLS) is striking. Did the authors run the analysis with vehicle controls and particle size standards? There is no mention of these controls.

      Following the comment, we added the control data of buffer and standard protein lysozyme, and the descriptions to the method of [Dynamic light scattering].

      (4) Figure 4. In lines 216-217, the authors write that the binding of the K. marxianus complex "demonstrates that the presence of Sld7 could restore the single-stranded DNA binding capacity of Sld3." Another explanation is that complexes from each species bind differently. If the authors want to make a strong claim, they should compare the binding of complexes containing the same proteins.

      We agree with the comment: using samples from the same source would be better for making a strong claim. Due to limitations in protein overexpression, we used Sld7-Sld3ΔC-Cdc45 from a different source. The two sources belong to the same family (Saccharomycetaceae), and the proteins Sld7, Sld3 and Cdc45 show sequence conservation and similar structures (RMSD = 0.356, 1.392, and 0.891 for Cα atoms of Sld7CTD, Sld7NTD-Sld3NTD, and Sld3CBD-Cdc45) as predicted by AlphaFold3. Such similarity at the source and protein levels allows us to make the comparison. Moreover, we modified the description to “indicates that the presence of Sld7 and Sld3NTD could increase the ssDNA-binding affinity to a level comparable to that of Sld3CBD.”

      (5) The logic of the following is unclear: "Considering that ssDNA is unwound from dsDNA by the helicase CMG complex, Sld7-Sld3ΔC-Cdc45, and Sld7-Sld3C having a stronger ssDNA-binding capacity than Sld3CBD-Cdc45 may imply a relationship between the dissociation of Sld7-Sld3 from the CMG complex and binding to ssDNA unwound by CMG." (Lines 224-227). How do the authors imagine that the binding affinity difference due to Sld7 contributes to the release of Sld3? Please explain.

      Considering that ssARS1 is unwound from dsARS1 by the activated helicase CMG complex formed after loading Cdc45 and GINS, Sld3–Sld7 having a stronger ssARS1-binding affinity may provide an advantage for the dissociation of Sld7–Sld3 from the CMG complex. We modified the sentence of Lines 224-227 (P9/L248-251).

      (6) The authors suggest that the release of Sld3 from the helicase is related to its association with single-stranded ARS1 DNA. They refer to the work of Bruck et al. (doi: 10.1074/jbc.M111.226332), which demonstrates that single-stranded origin DNA inhibits the interaction between Sld3 and MCM2-7 in vitro. The authors selectively choose data from this previous work, only including data that supports their model while disregarding other data. This approach hinders progress in the field. Specifically, Bruck proposed a model in which the association of Sld3 and GINS with MCM2-7 is mutually exclusive, explaining how Sld3 is released upon CMG assembly. In Figure 3 of the authors' model, they suggest that Sld3 can associate with MCM2-7 through CDC45, even when GINS is bound. Furthermore, Bruck's work showed that ssARS1-2 does not disrupt the Sld3-Cdc45 interaction. Instead, Bruck's data demonstrated that ssARS1-2 disrupts the interaction between MCM2-7 and Sld3 without Cdc45. While we do not expect the authors to consider all data in the literature when formulating a model, we urge them to acknowledge and discuss other critical data that challenges their model. Additionally, it would be beneficial for the field if the authors include both modes of Sld3 interaction with MCM2-7 (i.e., directly with MCM or through CDC45) when proposing a model for how CMG assembly and Sld3 release occurs.

      In our discussion, we referred to the data of Bruck et al. (doi: 10.1074/jbc.M111.226332) but did not discuss them further because we did not perform similar experiments in vitro, and we do not think that the absence of such discussion hinders progress in the field. To promote research progress, new experiments should provide new proposals and updated knowledge. Although we do not know the exact positional relationship between Sld3 and Dpb11-Sld2 on MCM during GINS recruitment, the Sld3CBD-Cdc45 structure clearly shows that the Sld3CBD-binding site of Cdc45 is completely different from the sites at which GINS and MCM bind to Cdc45. The SCMG model confirmed this binding mode: Sld3, Cdc45 and GINS can bind together. The competition between Sld3 and GINS for binding to Cdc45 or Cdc45-MCM reported by Bruck et al. may be caused by the conformational change of Cdc45 DHHA1 between Sld3CBD-Cdc45 and CMG, or by the absence of other initiation factors (CMG formation is regulated by the initiation factors). We modified the discussion (P10/L282-286). Regarding ssARS1 binding, we did not discuss Bruck's finding that ssARS1-2 does not disrupt the Sld3-Cdc45 interaction because these data do not conflict with our proposal, although they do not strengthen it either. We propose that the release of Sld3 and Sld7 from CMG could be associated with the binding of ssARS1 unwound by CMG, but the dissociation of Sld3-Sld7 does not depend only on ssARS1 binding. The exposure of unwound ssARS1 causes a conformational change of CMG, which may be another event contributing to Sld3-Sld7 dissociation. However, we do not have further experiments to confirm this, and Bruck’s ssDNA-binding experiment did not include all of Sld3, Cdc45 and MCM, so we do not discuss Bruck’s data further in the revised version (P11/L303-305).

      Reviewer #3 (Recommendations for the authors):

      Major points:

      (1) Figure 1, Sld3CBD-Cdc45 complex: Please indicate the number of critical residues and those of alpha-helixes and beta-sheets in this Figure or Supplemental Figure to confirm the authors' claim.

      Following the comment, we added the numbering of the α-helices and β-sheets, with residue numbers, to Figure 1 and Supplemental Figures 4 and 5. We also added a topology diagram (Supplemental Figure 3).

      (2) Figure 2A and B: Please quantify the interaction here with a proper statistical comparison.

      In the experiments of Figures 2A and 2B, we used a co-expression system to co-purify the complexes and check their binding. To support quantification, we added the concentrations of the samples used to the Methods section [Mutant analysis of Sld3 and Cdc45].

      (3) Figure 3B, EMSA: If these are from the EMSA assay, at least free DNAs and protein-bound DNAs are present on the gel. However, the authors showed one band, which seems to be free DNA in Figure 3B and separately the smear band of the protein complex in Supplementary Figure 12, and judged the DNA binding by the disappearance of the band (line 207). Interestingly, in the case of Sld3CBD, there are few smear bands (Supplementary Figure 12). Where is DNA in this case? The disappearance could be due to the contaminated nucleases (need a control non-specific DNA). Without showing the Sld3CBD-DNA complex in the gel, the conclusion that the DNA binding activity of Sld3CBD-Cdc45 to DNA is lower than Sld3CBD alone (line 210) is very much speculative. The same is true for Sld7-Sld3dC-Cdc45.

      Please explain the method (EMSA) briefly in the main text and show a whole gel in both Figures. If the authors insist that the Sld3 DNA-binding activity is altered with Cdc45 (and MCM), it is better to perform a more quantitative DNA binding assay such as Biacore (surface plasmon resonance), etc.

      In the EMSA, we used SYBR (Figure 4B) and CBB (Supplementary Figure 13) staining to show the bands of ssDNA and protein, respectively. As the reviewer mentioned, the disappearance of the bands could be due to contaminating nucleases, so we performed control experiments with non-specific ssDNA using the same proteins, as shown in Supplementary Figure 14. We are therefore convinced that the ssDNA bands disappear when the ssDNA binds to protein and remain when it does not. We added these explanations to the text (P9/L242-244). As we mentioned in the legend of Supplementary Figure 13, Sld3CBD could not enter the gel, even when bound to ssDNA, because its pI value exceeded the pH of the running buffer.

      Following the reviewer's comments, we attempted a pull-down experiment using a His-tag (the C-terminal His-tag of Sld3CBD/Sld3ΔC). Unfortunately, we had difficulty finding conditions that balanced binding against the requirements of chromatography.

      (4) Figure 3B: Please quantify the DNA binding here with a proper statistical comparison with triplicate.

      For EMSA (Figure 3B), we used samples with ssDNA:protein molar ratios of 1:0, 1:1, 1:2, 1:4 and 0:1, with 10 pM as one unit. We added the concentrations of the samples to the Methods section [Electrophoretic mobility shift assay for ssDNA binding].
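
      As a rough illustration of the titration series described above, the following sketch converts each ssDNA:protein ratio into concentrations, taking 10 pM as one unit as stated in the response. The function and variable names are hypothetical and are not taken from the manuscript.

```python
# Hypothetical helper: convert ssDNA:protein molar ratios into concentrations.
UNIT_PM = 10  # one ratio unit, in pM, as stated in the response

# (ssDNA, protein) ratio pairs for the five lanes
RATIOS = [(1, 0), (1, 1), (1, 2), (1, 4), (0, 1)]


def lane_concentrations(ratios, unit_pm=UNIT_PM):
    """Return a list of (ssDNA_pM, protein_pM) tuples, one per lane."""
    return [(dna * unit_pm, protein * unit_pm) for dna, protein in ratios]


lanes = lane_concentrations(RATIOS)
for dna_pm, protein_pm in lanes:
    print(f"ssDNA {dna_pm} pM, protein {protein_pm} pM")
```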

      Following the comment, we tried to quantify the binding strength by integrating the grayscale of the bands in the gel photos. However, we are concerned that this grayscale-based calculation does not represent the results accurately. The many sample groups cannot all be run on one gel, so differences in parameters between gels cause large errors in the calculation, as shown in Author response image 1. Although the calculated integral grayscale chart is consistent with our conclusion, we do not want to add it to our manuscript.
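
      To illustrate why comparisons across gels are error-prone, the sketch below normalizes each free-ssDNA band to the ssDNA-only lane of the same gel before estimating the bound fraction. This is a generic densitometry approach assumed for illustration, not the calculation behind Author response image 1; the intensity values and the function name are hypothetical.

```python
def fraction_bound(free_band_intensities, dna_only_intensity):
    """Estimate the bound fraction per lane as 1 - (free band / ssDNA-only band).

    Both arguments must come from the same gel so that differences in
    staining and exposure between gels do not enter the ratio.
    """
    if dna_only_intensity <= 0:
        raise ValueError("ssDNA-only reference intensity must be positive")
    return [1 - intensity / dna_only_intensity for intensity in free_band_intensities]


# Hypothetical integrated grayscale values from one gel (arbitrary units).
reference = 1000.0                      # ssDNA-only lane (1:0)
free_bands = [1000.0, 750.0, 500.0, 250.0]

bound = fraction_bound(free_bands, reference)
print(bound)
```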

      Author response image 1.

      (5) Because of poor writing, the authors need to ask for English editing.

      We are very sorry for the language. We asked a company (Editage, https://www.editage.jp) to have a native speaker revise the manuscript and used AI to recheck the English.

      Minor points:

      (1) Lines 47-58, Supplementary Figure 1: Although the sentences describe well how CMG assembles on the replication origin, the figure does not reflect what is written, but rather shows a simple schematic figure related to the work. However, for the general readers, it is very useful to see a general model of the CMG assembly. Then, the authors need to emphasize the steps focused in this study.

      Thank you for your thoughtful comments. We optimized Figure 1 and hope it will be more understandable to general readers.

      (2) Line 50, DDK[6F0L](superscript): what is 6F0L?

      We are sorry for this mistake; 6F0L is the PDB ID of the DDK structure. We deleted it.

      (3) Lines 68 and 69, ssDNA and dsDNA: should be "single-stranded DNA (ssDNA)" and "double-stranded DNA (dsDNA)" when these words appear for the first time.

      Following the comment, we modified it to “single-stranded DNA (ssDNA)” and “double-stranded DNA (dsDNA)” (P3/L68,70).

      (4) Line 84, Cdc45s: What does "s" mean here?

      We are sorry for this mistake; we modified it to “Cdc45”.

      (5) Line 87, Sld3deltaC: What is Sld3deltaC? This is the deletion of either the Cdc45-binding domain or the C-terminal domain.

      Sld3ΔC is a deletion of the C-terminal domain of Sld3. We added the residue range and explanation (P4/L91).

      (6) Line 103: Although the authors mentioned beta-sheets 1-14 in the text, there is no indication in Figures. It is impossible to see the authors' conclusion.

      The secondary structure elements of Sld3CBD-Cdc45 are shown in Supplementary Figures 4 and 5. Following the comment, we added a topology diagram of Sld3CBD and Cdc45 in the Sld3CBD-Cdc45 complex as Supplementary Figure 3 and added citations when describing structural elements.

      (7) Line 106, huCdc45: Does this mean human Cdc45? If so, it should be "human CDC45 (huCDC45)". CMG form is from budding yeast? Please specify the species.

      Yes, huCdc45 is human Cdc45. We modified it to “human CDC45 (huCdc45)”.

      (8) Line 107, Supplemental Figure 3B, black ovals: Please add "alpha7" in the Figure.

      Following the comment, we added a label of Cdc45 α7 to Supplemental Figure 3B and 3C (Supplemental Figure 4B and 4C in revised version).

      (9) Line 128, DHHA1: What is this? Please explain it in the text.

      Following the comment, we added the information on DHHA1 (P3/L75-77).

      (10) Line 130, beta13, and beta14: If the authors would like to point out these structures, please indicate where these sheets are in Figures.

      We added a topology diagram as Supplementary Figure 3 to show the β-sheet in DHH and added a citation in the text.

      (11) Line 133: Please add (Figure 1B) after the a8CTP.

      Following the comment, we added “(Figure 1C)” (1B is 1C in revised version) after the α8CTP (P6/L133).

      (12) Line 140: After DHHA1, please add (Figure 1C).

      Following the comment, we added the figure citation after the DHHA1 (P6/L140).

      (13) Line 142: After DHHA1, please add (Figure 1D).

      Following the comment, we added the figure citation after the DHHA1 (P6/L142).

      (14) Line 149, Sld3-Y seemed to retain a faint interaction with Cdc45. The Cdc45 band is too faint here. Moreover, as shown above, without the quantification with proper statistics, it is hard to draw this kind of conclusion.

      We agree that the Cdc45 band corresponding to Sld3-Y in the pull-down assay was very faint, so we performed an in vivo experiment (Figure 2C) to confirm this result.

      (15) Line 149, Figure 2A and B: What kind of interaction assay was used here? Simple pull-down. It seems to eluate from the column. If so, how do the authors evaluate the presence of the proteins in different fractions? Please explain the method briefly in the main text.

      Figure 2 shows a co-expression pull-down binding assay. To describe the co-expression pull-down experiments clearly, we added more explanation to the Methods section [Mutant analysis of Sld3 and Cdc45].

      (16) Line 154-155: Please show the quantification to see if the reduced binding is statistically significant.

      Here, we explain why Cdc45-A retained its ability to bind Sld3CBD. Although the Cdc45-A mutant loses three hydrogen bonds with D344 of Sld3CBD, the remaining hydrogen-bond network maintains the contact between Sld3CBD and Cdc45.

      (17) Line 158, cell death: "No growth" does not mean cell death. Please rephrase here.

      Following the comment, we modified it to “no growth” (P6/L158).

      (18) Line 166: After CMG dimer, please add "respectively".

      Following the comment, we added the word “, respectively” after CMG dimer (P7/L178).

      (19) Line 194-195: I can not catch the meaning. Please rephrase here to clarify the claim. What are ssARS1-2 and ARS1-5?

      Following the comment, we added more information about ssDNA fragments at the beginning of this section (P8/L210-214).

      (20) Figure 4A and Supplemental Figure 12 top, schematic figure of ARS region. It is hard to catch. More explanation of the nature of the DNA substrates and much better schematic presentations would be appreciated.

      Following the comment, we added more information about ARS1 to the figure legend.

      (21) Figure 1A, dotted ovals should be dotted squares as shown in the enlarged images on the bottom.

      Following the comment, we modified Figure 1A and the legend to change the dotted ovals into dotted squares.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Contractile Injection Systems (CIS) are versatile machines that can form pores in membranes or deliver effectors. They can act extra or intracellularly. When intracellular they are positioned to face the exterior of the cell and hence should be anchored to the cell envelope. The authors previously reported the characterization of a CIS in Streptomyces coelicolor, including significant information on the architecture of the apparatus. However, how the tubular structure is attached to the envelope was not investigated. Here they provide a wealth of evidence to demonstrate that a specific gene within the CIS gene cluster, cisA, encodes a membrane protein that anchors the CIS to the envelope. More specifically, they show that:

      - CisA is not required for assembly of the structure but is important for proper contraction and CIS-mediated cell death

      - CisA is associated to the membrane (fluorescence microscopy, cell fractionation) through a transmembrane segment (lacZ-phoA topology fusions in E. coli)

      - Structural prediction of interaction between CisA and a CIS baseplate component

      - In addition they provide a high-resolution model structure of the >750-polypeptide Streptomyces CIS in its extended conformation, revealing new details of this fascinating machine, notably in the baseplate and cap complexes.

      All the experiments are well controlled, including trans-complementation of all tested phenotypes.

      One important piece of information that is missing is the oligomeric state of CisA.

      Thank you for this suggestion. We now provide information on the potential oligomeric state of CisA. We performed further AlphaFold3 modelling of CisA using an increasing number of CisA protomers (1 to 8). We ran predictions for the configuration using the sequence of the well-folded C-terminal CisA domain (amino acids 285-468), which includes the transmembrane domain and the conserved domain that shares similarities to carbohydrate-degrading domains. The obtained confidence scores (mean values for pTM=0.73, ipTM=0.7, n=5) indicate that CisA can assemble into a pentamer and that this oligomerization is mediated through the interaction of the C-terminal solute-binding like superfamily domain.

      We have added this information to the revised manuscript (Fig. 3b/c) and further discuss the possible implications of CisA oligomerization for its proposed mode of action.

      While it would have been great to test the interaction between CisA and Cis11, to perform cryo-electron microscopy assays of detergent-extracted CIS structures to maintain the interaction with CisA, I believe that the toxicity of CisA upon overexpression or upon expression in E. coli render these studies difficult and will require a significant amount of time and optimization to be performed. It is worth mentioning that this study is of significant novelty in the CIS field because, except for Type VI secretion systems, very few membrane proteins or complexes responsible for CIS attachment have been identified and studied.

      We thank this reviewer for their highly supportive and positive comments on our manuscript and we are grateful for their recognition of the novelty of our study, particularly in the context of membrane proteins and complexes involved in CIS attachment.

      We agree that further experimental evidence on direct interaction between CisA and Cis11 would have strengthened our model on CisA function. However, as noted by this reviewer, this additional work is technically challenging and currently beyond the scope of this study.

      Reviewer #2 (Public review):

      Summary:

      The overall question that is addressed in this study is how the S. coelicolor contractile injection system (CISSc) works and affects both cell viability and differentiation, which it has been implicated to do in previous work from this group and others. The CISSc system has been enigmatic in the sense that it is free-floating in the cytoplasm in an extended form and is seen in contracted conformation (i.e. after having been triggered) mainly in dead and partially lysed cells, suggesting involvement in some kind of regulated cell death. So, how do the structure and function of the CISSc system compare to those of related CIS from other bacteria, does it interact with the cytoplasmic membrane, how does it do that, and is the membrane interaction involved in the suggested role in stress-induced, regulated cell death? The authors address these questions by investigating the role of a membrane protein, CisA, that is encoded by a gene in the CIS gene cluster in S. coelicolor. Further, they analyse the structure of the assembled CISSc, purified from the cytoplasm of S. coelicolor, using single-particle cryo-electron microscopy.

      Strengths:

      The beautiful visualisation of the CIS system both by cryo-electron tomography of intact bacterial cells and by single-particle electron microscopy of purified CIS assemblies are clearly the strengths of the paper, both in terms of methods and results. Further, the paper provides genetic evidence that the membrane protein CisA is required for the contraction of the CISSc assemblies that are seen in partially lysed or ghost cells of the wild type. The conclusion that CisA is a transmembrane protein and the inferred membrane topology are well supported by experimental data. The cryo-EM data suggest that CisA is not a stable part of the extended form of the CISSc assemblies. These findings raise the question of what CisA does.

      We thank Reviewer #2 for the overall positive evaluation of our manuscript and the constructive criticism.

      Weaknesses:

      The investigations of the role of CisA in function, membrane interaction, and triggering of contraction of CIS assemblies, are important parts of the paper and are highlighted in the title. However, the experimental data provided to answer these questions appear partially incomplete and not as conclusive as one would expect.

      We acknowledge that some aspects of our work remain unanswered. We are currently unable to conduct additional experiments because the two leading postdoctoral researchers on this project have moved on to new positions. We currently don’t have the extra manpower with a similar skill set to pick up the project.

      The stress-induced loss of viability is only monitored with one method: an in vivo assay where cytoplasmic sfGFP signal is compared to FM5-95 membrane stain. Addition of a sublethal level of nisin led to loss of sfGFP signal in individual hyphae in the WT, but not in the cisA mutant (similarly to what was previously reported for a CIS-negative mutant). Technically, this experiment and the example images that are shown give rise to some concern. Only individual hyphal fragments are shown that do not look like healthy and growing S. coelicolor hyphae. Under the stated growth conditions, S. coelicolor strains would normally have grown as dense hyphal pellets. It is therefore surprising that only these unbranched hyphal fragments are shown in Fig. 4ab.

      We thank this Reviewer for their thoughtful criticism regarding the viability assays and the data presented in Figure 4. We acknowledge the importance of ensuring that the presented images reflect the physiological state of S. coelicolor under the stated growth conditions and recognize that hyphal fragments shown in Figure 4 do not fully capture the typical morphology of S. coelicolor. As pointed out by this reviewer, S. coelicolor grows in large hyphal clumps when cultured in liquid media, making the quantification of fluorescence intensities in hyphae expressing cytoplasmic GFP or stained with the membrane dye FM5-95 particularly challenging. To improve the image analysis and quantification of GFP and FM5-95-fluorescent intensities across the three S. coelicolor strains (wildtype, cisA deletion mutant and the complemented cisA mutant), we vortexed the cell samples before imaging to break up hyphal clumps, increasing hyphal fragments. The hyphae shown in our images were selected as representative examples across three biological replicates.

      Further, S. coelicolor would likely be in a stationary phase when grown 48 h in the rich medium that is stated, giving rise to concern about the physiological state of the hyphae that were used for the viability assay. It would be valuable to know whether actively growing mycelium is affected in the same way by the nisin treatment, and also whether the cell death effect could be detected by other methods.

      The reasoning behind growing S. coelicolor for 48 h before performing the fluorescence-based viability assay was that we (DOI: 10.1038/s41564-023-01341-x ) and others (e.g.: DOI: 10.1038/s41467-023-37087-7 ) previously showed that the levels of CIS particles peak at the transition from vegetative to reproductive/stationary growth, thus indicating that CIS activity is highest during this growth stage. The obtained results in this manuscript are consistent with previous results, in which we showed a similar effect on the viability of wildtype versus cis-deficient S. coelicolor strains (DOI: 10.1038/s41564-023-01341-x ) using nisin, the protonophore CCCP and UV radiation. The results presented in this study and our previous study are based on biological triplicate experiments and appropriate controls. Furthermore, our results are in agreement with the findings reported in a complementary study by Vladimirov et al. (DOI: 10.1038/s41467-023-37087-7 ) that used a different approach (SYTO9/PI staining of hyphal pellets) to demonstrate that CIS-deficient mutants exhibit decreased hyphal death.

      Taken together, we believe that the results obtained from our fluorescence-based viability assay provide strong experimental evidence that functional CIS mediate hyphal cell death in response to exogenous stress.

      The model presented in Fig. 5 suggests that stress leads to a CisA-dependent attachment of CIS assemblies to the cytoplasmic membrane, and then triggering of contraction, leading to cell death. This model makes testable predictions that have not been challenged experimentally. Given that sublethal doses of nisin seem to trigger cell death, there appear to be possibilities to monitor whether activation of the system (via CisA?) indeed leads to at least temporally increased interaction of CIS with the membrane.

      We thank this reviewer for their suggestions on how to test our model further. This is a challenging experiment because we do not know the exact dynamics of how nisin stress is perceived and transmitted to CisA and CIS particles.

      In an attempt to address this point, we have performed co-immunoprecipitation experiments using S. coelicolor cells that produced CisA-FLAG as bait, and which were treated with a sub-lethal nisin concentration for 0/15/45 min.  Mass spectrometry analysis of co-eluted peptides did not show the presence of CIS-associated peptides at the analyzed timepoints. While we cannot exclude the possibility that our experimental assay requires further optimization to successfully demonstrate a CisA-CIS interaction (e.g. optimization of the use of detergents to improve the solubilization of CisA from Streptomyces membrane, which is currently not an established method), an alternative and equally valid hypothesis is that the interaction between CIS particles and CisA is transient and therefore difficult to capture. We would like to mention, however, that we did detect CisA peptides in crude purifications of CIS particles from nisin-stressed cells (Supplementary Table 2, manuscript: line 301/302), supporting our proposed model that CisA can associate with CIS particles in vivo.

      Further, would not the model predict that stress leads to an increased number of contracted CIS assemblies in the cytoplasm? No clear difference in length of the isolated assemblies in Fig. S7 is seen between untreated and nisin-exposed cells, and also no difference between assemblies from WT and cisA mutant hyphae.

      The reviewer is correct that there is no clear difference in length in the isolated CIS particles shown in Figure S7. This is in line with our results, which show that CisA is not required for the correct assembly of CIS particles and their ability to contract in the presence and absence of nisin treatment. The purpose of Figure S7 was to support this statement. We would like to note that the particles shown in Figure S7 were purified from cell lysates using a crude sheath preparation protocol, during which CIS particles generally contract irrespective of the presence or absence of CisA. Thus, we cannot comment on whether there is an increased number of contracted CIS assemblies in the cytoplasm of nisin-exposed cells. To answer this point, we would need to acquire additional cryo-electron tomograms (cryoET) of the different strains treated with nisin. CryoET is an extremely time- and labor-intensive task, and given that we currently don’t know the exact dynamics of the CIS-CisA interaction following exogenous stress, we believe this experiment is beyond the scope of this work.

      The interaction of CisA with the CIS assembly is critical for the model but is only supported by Alphafold modelling, predicting interaction between cytoplasmic parts of CisA and Cis11 protein in the baseplate wedge. An experimental demonstration of this interaction would have strengthened the conclusions.

      We agree that direct experimental evidence of this interaction would have further strengthened the conclusions of our study, and we have extensively tried to provide additional experimental evidence. Unfortunately, because of the toxicity of cisA expression in E. coli and the possibly transient nature of the interaction under the experimental conditions used, we were unable to confirm this interaction by biochemical or biophysical techniques, such as co-purification or bacterial two-hybrid assays. Despite these technical challenges, we believe that the AlphaFold predictions provided a valuable hypothesis about the role of CisA in firing and the function of CIS particles in S. coelicolor.

      The cisA mutant showed a similarly accelerated sporulation as was previously reported for CIS-negative strains, which supports the conclusion that CisA is required for function of CISSc. But the results do not add any new insights into how CIS/CisA affects the progression of the developmental life cycle and whether this effect has anything to do with the regulated cell death that is caused by CIS. The same applies to the effect on secondary metabolite production, with no further mechanistic insights added, except reporting similar effects of CIS and CisA inactivations.

      Thank you for your feedback on this aspect of the manuscript. We would like to note that the main focus of this study was to provide further insight into how CIS contraction and firing are mediated in Streptomyces. We used the analysis of accelerated sporulation and secondary metabolite production as a readout to directly assess the functionality of CIS in the presence or absence of CisA and to complement the in situ cryoET data. In summary, our data significantly expand our knowledge of CIS function and firing in Streptomyces and suggest a model in which CisA plays an essential role in mediating the interaction of CIS particles with the membrane, which is required for CIS-mediated cell death. We discuss this model in more detail in the revised manuscript (Line 274-283).

      We agree that we still don’t fully understand the full nature of the signals that trigger CIS contraction, but we do know that the production of CIS is an integral part of the Streptomyces multicellular life cycle as demonstrated by two independent previous studies by us and others (DOI: 10.1038/s41564-023-01341-x and DOI: 10.1038/s41467-023-37087-7 ).

      We further speculate that the assembly and CisA-dependent firing of Streptomyces CIS particles could present a molecular mechanism to dismantle part of the vegetative mycelium. This form of “regulated cell death” could provide two key benefits: (1) to prevent the spread of local cellular damage to the rest of mycelium and (2) to provide additional nutrients for the rest of the mycelium to delay the terminal differentiation into spores, which in turn also affects the production of secondary metabolites.

      Concluding remarks:

      The work will be of interest to anyone interested in contractile injection systems, T6SS, or similar machineries, as well as for people working on the biology of streptomycetes. There is also a potential impact of the work in the understanding of how such molecular machineries could have been co-opted during evolution to become a mechanism for regulated cell death. However, this latter aspect remains still poorly understood. Even though this paper adds excellent new structural insights and identifies a putative membrane anchor, it remains elusive how the Streptomyces CIS may lead to cell death. It is also unclear what the advantage would be to trigger death of hyphal compartments in response to stress, as well as how such cell death may impact (or accelerate) the developmental progression. Finally, it is inescapable to wonder whether the Streptomyces CIS could have any role in protection against phage infection.

      We thank Reviewer #2 for the overall supportive assessment of our work. We will briefly discuss functional CIS's impact on Streptomyces development in the revised manuscript. We previously tested if Streptomyces could defend against phages but have not found any experimental evidence to support this idea (unpublished data). The analysis of phage defense mechanisms is an underdeveloped area in Streptomyces research, partly due to the currently limited availability of a diverse phage panel.

      Reviewer #3 (Public review):

      Summary:

      In this work, Casu et al. have reported the characterization of a previously uncharacterized membrane protein, CisA, encoded in a non-canonical contractile injection system of Streptomyces coelicolor, CISSc, which is a cytosolic CIS significantly distinct from both intracellular membrane-anchored T6SSs and extracellular CISs. The authors have presented the first high-resolution structure of the extended CISSc, which revealed important structural insights into this conformational state. To further explore how CISSc interacts with the cytoplasmic membrane, they set out to investigate CisA, which was previously hypothesized to be the membrane adaptor. However, the structure revealed that it was not associated with CISSc. Using fluorescence microscopy and a cell fractionation assay, the authors verified that CisA is indeed a membrane-associated protein. They further determined experimentally that CisA has a cytosolic N-terminal domain and a periplasmic C-terminus. The functional analysis of the cisA mutant revealed that CisA is not required for CISSc assembly but is essential for contraction; as a result, the deletion significantly affects CISSc-mediated cell death upon stress, timely differentiation, and secondary metabolite production. Although the work did not resolve the mechanistic detail of how CisA interacts with the CISSc structure, it provides solid data and a strong foundation for future investigation toward understanding the mechanism of CISSc contraction and, potentially, the relation between the membrane association of CISSc, the sheath contraction, and the cell death.

      Strengths:

      The paper is well-structured, and the conclusion of the study is supported by solid data and careful data interpretation was presented. The authors provided strong evidence on (1) the high-resolution structure of extended CISSc determined by cryo-EM, and the subsequent comparison with known eCIS structures, which sheds light on both its similarity and different features from other subtypes of eCISs in detail; (2) the topological features of CisA using fluorescence microscopic analysis, cell fractionation and PhoA-LacZα reporter assays, (3) functions of CisA in CISSc-mediated cell death and secondary metabolite production, likely via the regulation of sheath contraction.

      Weaknesses:

      (1) The data presented are not sufficient to provide mechanistic details of CisA-mediated CISSc contraction, as authors are not able to experimentally demonstrate the direct interaction between CisA with baseplate complex of CISSc (hypothesized to be via Cis11 by structural modeling), since they could not express cisA in E. coli due to its potential toxicity. Therefore, there is a lack of biochemical analysis of direct interaction between CisA and baseplate wedge. In addition, there is no direct evidence showing that CisA is responsible for tethering CISSc to the membrane upon stress, and the spatial and temporal relation between membrane association and contraction remains unclear. Further investigation will be needed to address these questions in future.

      We thank Reviewer #3 for the supportive evaluation of our study and the constructive feedback in the non-public review. We appreciate the recognition of the technical limitations of experimentally demonstrating a direct interaction between CisA and the CIS baseplate complex, and we agree that further investigations in the future will hopefully provide a full mechanistic understanding of the spatiotemporal interaction of CisA and CIS particles and the subsequent CIS firing.

      To further improve the manuscript, we will revise the text and clarify figures and figure legends as suggested in the non-public review.

      Discussion:

      Overall, the work provides a valuable contribution to our understanding on the structure of a much less understood subtype of CISs, which is unique compared to both membrane-anchored T6SSs and host-membrane targeting eCISs. Importantly, the work serves as a good foundation to further investigate how the sheath contraction works here. The work contributes to expanding our understanding of the diverse CIS superfamilies.

      Thank you.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      - Magnification of the potential CisA-Cis11 model, with side chains at the interface, should be shown in Supplementary Figures 9/10 to help the reader appreciate the interaction between the two subunits.

      Done. A zoomed-in view of the relevant side chains at the CisA-Cis11 interface has been added to Supplementary Figure 9e. For clarity, we decided not to highlight these residues in Supplementary Figure 10 because they are identical to those in Supplementary Figure 9e.

      - A model where CisA is positioned onto the baseplate (by merging the CisA-Cis11 model and the baseplate structure) will also be informative for the reader.

      We agree that such a presentation would be helpful to visualize the proposed CisA-Cis11 interaction. However, the Cis11 residues predicted to bind CisA are buried in our cryoEM single-particle structure of the elongated Streptomyces CIS. This is not surprising, as the structure is based on a previously established non-contractile CIS mutant variant (PMCID: PMC10066040), which means we were only able to capture one specific configuration of the baseplate complex in the current work. This baseplate configuration is most likely structurally distinct from the baseplate configuration in contracted CIS particles. A similar observation was also reported for the baseplate complex of eCIS particles from Algoriphagus machipongonensis (PMCID: PMC8894135).

      We speculate that in Streptomyces, initial non-specific contacts between CisA and cytoplasmic CIS particles induce a rearrangement of baseplate components, resulting in the exposure of the relevant Cis11 residues, which in turn facilitates a transient interaction between CisA and Cis11. This interaction then leads to additional conformational changes within the baseplate complex, triggering sheath contraction and CIS firing.

      We believe that a transient binding step is a crucial part of the activation process, contributing to the dynamic nature of the system.

      - Providing information on the oligomeric state of CisA will strengthen the manuscript. Authors may consider having blue-native gel analysis of CisA-3xFLAG extracted from Streptomyces or E. coli membranes, or in vivo chemical cross-linking coupled to SDS-PAGE analyses. In case these quite straightforward experiments are not possible, the authors may consider providing AF3 models of various CisA multimers.

      Thank you for these suggestions. Unfortunately, we currently don’t have the capability to conduct additional experiments. However, we have performed additional AF3 modelling to explore potential different configurations of CisA. The results of these analyses suggest that CisA can assemble into a pentamer (see also Response to reviewer 1). We speculate that CisA may exist in different oligomeric states and that membrane-localized CisA monomers oligomerize into a larger protein complex in response to a cellular or extracellular (e.g. nisin) signal, which could then directly or indirectly interact with CIS particles in the cytoplasm to facilitate their recruitment to the membrane and CIS firing. Such a stress-dependent conformational change of CisA could also be a safety mechanism to prevent accidental interaction of CisA with CIS particles and CIS firing.

      We now show the AF model for the predicted CisA pentamer in Figure 3b/c and discuss the potential implications of the different CisA configurations in the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      - The quantification of contracted versus extended CIS assemblies in the cytoplasm is only presented for the tomograms from the cisA mutant (graph in Fig. S2d). However, there are no data for the WT and complemented mutant to compare with. It would help to add such data, or at least refer to the previous quantification done for the WT in the previous paper. Further, would it be possible to illustrate the difference by measuring lengths of CIS assemblies and plot length distributions (assuming the extended ones are long and contracted are short)?

      Thank you for your suggestions. We have included the results from our previous quantification of CIS assembly states observed in the WT in the revised manuscript (lines 106–110).

      In the acquired tomograms of CIS particles observed in intact and dead hyphae, we consistently observed only two CIS conformations: the fully extended state (average length of 233 nm, diameter of 18 nm) and the fully contracted state (average length of 124 nm, diameter of 23 nm). We have added this information to the revised manuscript (lines 112-114).

      - The Western blot in Fig. 3d, top panel, contains additional bands that are not mentioned. Are they non-specific bands? Absent in the cisA mutant? It would help if it was clarified in the legend what they are.

      Correct, these additional bands are non-specific bands, which are also visible in the lysate and soluble fraction of the wild-type sample (negative control, no FLAG-tagged protein). We have now labelled these bands in the figure and clarified the figure legend.

      - Fig. S8a needs improvement. It was not possible to clearly see the stated effect of cisA deletion on secondary metabolite production in these photos.

      We agree and have removed figure panel S8a from the manuscript. The quantification of total actinorhodin production shown in Figure S8b convincingly shows a significant reduction of actinorhodin production in the cisA deletion mutant compared to the wild type and the complemented mutant.

      - It is not an important point, but the paragraph in lines 109-116 appears more like a re-iteration of the Introduction than Results.

      We agree. We have removed the highlighted text from the Results section and added some of the information to the introduction.

      - Line 206 appears to have a typo. Should it not be WT instead of WT cisA?

      Correct. This is a typo which has been fixed. Thank you.

      - At the end of the Discussion, it is suggested that a stepwise mechanism of recruiting CIS to the membrane and then triggering firing would prevent unwanted activation and self-inflicted death. Since both steps appear to be dependent on CisA, it would be good to more clearly spell out how such a stepwise mechanism would work and how it could prevent spontaneous and erroneous firing of the system.

      Thank you for this suggestion. We have revised the text to clarify the proposed stepwise mechanism. Based on additional structural modeling, we propose that the conserved extra-cytoplasmic domain of CisA may play a role in sensing stress signals. Binding of a ‘stress-associated molecule’ could induce a conformational change in CisA, a hypothesis supported by: (1) Foldseek protein structure searches, which suggest that the conserved C-terminal CisA domain resembles substrate/solute-binding proteins, and (2) AlphaFold3 models predicting that CisA can form a pentamer via its putative substrate-binding domain. This suggests that a transition from CisA monomers to pentamers in response to stress may serve as a key checkpoint, activating CisA and facilitating the recruitment of CIS assemblies to the membrane, either directly or indirectly. Conversely, in the absence of a stress signal, CisA is likely to remain in its monomeric (resting) form, incapable of triggering CIS firing. We have revised the discussion to explain the proposed model in more detail.

      We recognize that this model poses many testable hypotheses that we currently cannot test but aim to address in the future.

      Reviewer #3 (Recommendations for the authors):

      There are a few concerns potentially worth addressing to strengthen the study or for future investigation.

      (1) It would be worth considering moving the first part of the result ('CisA is required for CISSc contraction in situ') after presenting the structure of extended CISSc, and combining it with the last part of the result section ('CisA is essential for the cellular function of CISSc'), as both parts describe the functional characterization of CisA.

      We appreciate the reviewer’s suggestion but have chosen to retain the current order of the results. As this manuscript focuses on the role of CisA, we believe that first establishing a functional link between CisA and CIS contraction provides essential context and motivation for the study.

      (2) Line 169: it is not clear to me if the fusion of CisA with mCherry is functional (if it complements the native CisA). Moreover, it was not shown if its localization changes under nisin stress or in the strain with non-contractile CISSc.

      We have not tested if the CisA-mCherry fusion is fully functional. While we cannot exclude the possibility that the activity of this protein fusion is compromised in vivo, we believe that the described accumulation of CisA-mCherry at the membrane is accurate. This conclusion is further supported by the results obtained from protein fractionation experiments and the membrane topology assay (Figure 3).

      We did not examine if the localization of CisA-mCherry changes in CIS mutant strains under nisin-stress, but this is something we will follow up on in the future.

      (3) In ref 18, the previous work from the same team presented a functional fluorescent fusion of Cis2 (sheath), thus, it will be interesting to see if (i) Cis2 localization and dynamics is affected by the absence of CisA under normal and stressed conditions; (ii) if Cis2 shows any co-localization with CisA under normal and especially stressed conditions, and potentially, its timing correlation to ghost cell formation by time-lapse imaging of both fusions.

      We thank this reviewer for the suggestions, and we plan to address these questions in the future.

      (4) Line 261: it was hypothesized by authors that the cytosolic portion of CisA was required for interacting with Cis11. While it was not possible to verify the direct interaction at current state, a S. coelicolor mutant lacking this cytosolic domain may be of help to indirectly test the hypothesis. Moreover, it would be interesting to see if the cytosolic region alone is enough to induce the contraction upon stress (by removing the TM-C region). If so, whether it leads to cell death, or if it is insufficient to cause cell death without membrane association despite the sheath contraction. If not, it would suggest that membrane association occurs before contraction.

      These are really great suggestions and if we had the manpower and resources, we would have performed these experiments. We plan to follow up on these questions in the future.

      However, additional structural modelling of CisA indicates that CisA may exist in different configurations (see response to Reviewer #1 and #2), a monomeric and/or a pentameric configuration. In these structural models (revised Figure 3), CisA oligomerization is mediated by the annotated periplasmic solute-binding domain. It is conceivable that CisA oligomerization (e.g. in response to a stress signal) presents a critical checkpoint that results in a conformational change within CisA monomers that subsequently drives CisA oligomerization into a configuration primed to interact with CIS particles. We would therefore speculate that the expression of just the cytoplasmic CisA domain may not be sufficient for CIS contraction and cell death.

      (5) Line 263: as it was not possible to express full-length cisA in E. coli, making it difficult to assess the interaction between CisA and Cis11, it may be worth considering expressing the cytosolic portion of CisA (ΔTM-C) instead of full-length CisA, or alternatively performing a co-immunoprecipitation assay of CisA (i.e., with an affinity tag) from S. coelicolor cultures under stressed conditions. However, I am aware that these may be beyond the scope of this work but can be considered for future investigation in general.

      Thank you for your suggestions and your understanding that some of this work is beyond the scope of this work. We have performed CisA-FLAG co-immunoprecipitation experiments from S. coelicolor cultures that were treated with nisin for 0/15/45 min. However, mass spectrometry analysis of co-eluted peptides did not show the presence of CIS-associated peptides at the analysed timepoints. While we cannot exclude technical issues with our assays that resulted in an inefficient solubilization of CisA from Streptomyces membranes, an alternative hypothesis is that the interaction between CIS particles and CisA is very transient and therefore difficult to capture. We would like to mention, however, that we did detect CisA peptides in crude purifications of CIS particles from nisin-stressed cells (Supplementary Table 2, manuscript: line 301/302), supporting our proposed model that CisA can associate with CIS particles in vivo.

      Minor points:

      (1) I would suggest moving Supplementary Fig 2d, together with control quantification of the WT strain and complementation strain (similar to Fig 3g from ref 18), to the main Fig 1, as this would give a quantitative representation and allow better comparison without going back and forth to ref 18.

      Thank you for your suggestion. Instead of moving Supplementary Fig. 2d to the main figure, we have added additional information in lines 106–110 to discuss the previous quantification of CIS assembly states in the WT, as described in our earlier work. We believe this approach allows readers to easily reference our established quantification without compromising the flow of the main figures.

      (2) Line 52/785: as work of Ref 12 has recently been published DOI: 10.1126/sciadv.adp7088, the reference should be updated accordingly.

      This reference has been updated. Thank you.

      (3) A brief description of key differences between contracted (ref 18) and extended sheath structure will be a good addition for a broader audience.

      Thank you for this suggestion. We have added more information on lines 178–180.

      (4) Fig 3d: it is not clear how well the samples from different fractions were normalized in amount (volume and cell density), but there was an inconsistency in the amount of CisA-Flag in lysate, vs. soluble and membrane fractions (total protein amount combined from soluble fraction and membrane fraction together seemed to be more than in the lysate, while in theory it should be more or less equal; and the amount of WhiA from WT seemed to be less than from the CisA-Flag strain). In the method section, it was mentioned that 'The final pellet was dissolved in 1/10 of the initial volume with wash buffer (no urea). Equi-volume amounts of fractions were mixed with 2x SDS sample buffer and analyzed by immunoblotting.' But it is still not clear whether equivalent amounts (normalized to the same OD for example) were used and if we could directly compare. A brief clarification in the legend of how samples were prepared is needed.

      The samples were normalized by first using the same volume of starting material (similar culture density and incubation period for each strain) and by loading equal volumes of each fraction for analysis. After fractionation, equi-volume amounts of the soluble and membrane protein fractions were mixed with 2× SDS sample buffer and subjected to immunoblotting, ensuring a consistent basis for comparison between samples. We have revised the figure legend and Material and Method sections to make this clear.

      We agree that the amount of CisA-3xFLAG appears slightly lower in the “Lysate” fraction compared to the “Membrane” fraction in Figure 3d (now Fig. 3f). However, this does not affect the overall conclusion of this experiment, showing that CisA-3xFLAG is clearly enriched in the membrane fraction.

      For reference, please find below the uncropped version of this Western blot image. Based on the signal of the unspecific bands, we would like to argue that equal amounts of samples obtained from the WT control strain (no FLAG epitope present) and a strain producing CisA-3xFLAG were loaded for each of the fractions. When we revisited this data, we noted that the protein size marker was wrong. This has been fixed.

      Author response image 1.

      (5) Fig. 4f: statistical analysis is missing.

      The missing statistical analysis has been added to this figure and figure legend.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this work, the authors investigate the functional difference between the most commonly expressed form of PTH, and a novel point mutation in PTH identified in a patient with chronic hypocalcemia and hyperphosphatemia. The value of this mutant form of PTH as a potential anabolic agent for bone is investigated alongside PTH(1-84), which is a previously used anabolic therapy. The authors have achieved the aims of the study. Their conclusion, however, that this suggests a "new path of therapeutic PTH analog development" seems unfounded; the benefit of this PTH variant is not clear, but the work is still interesting.

      The work does not identify why the patient with this mutation has hypocalcemia and hyperphosphatemia; this was not the goal of the study, but the data are useful for helping to understand that.

      Strengths:

      The work is novel, as it describes the function of a novel, naturally occurring, variant of PTH in terms of its ability to dimerise, to lead to cAMP activation, to increase serum calcium, and its pharmacological action compared to normal PTH.

      Weaknesses:

      (1) The use of very young, 8-10 week old, mice as a model of postmenopausal osteoporosis is a major limitation of this study. At 8 weeks, the effect of ovariectomy leads to lack of new trabecular bone formation, rather than trabecular bone loss due to a defect in bone remodelling. Although the findings here provide a comparison between two forms of PTH, it is unlikely to be of direct relevance to the patient population. For example, the authors find an inhibitory effect of PTH on osteoclast surface, which is very unusual. Adding to this concern is that the authors have not described the regions used for histomorphometry, and from their figures (particularly the TRAP stain), it seems that the primary spongiosa (which is a region of growth) has been used for histomorphometry, rather than the secondary spongiosa (which more accurately reflects bone remodelling). Much further detail is needed to justify the use of this very young model, and a section on the limitations of this model is needed. Please provide that section in the revised manuscript.

      Thank you for your crucial comment. We obtained 8-week-old female mice and allowed them to acclimate in our facility for 2 weeks. We then performed OVX on 10-week-old mice and determined the effects of dimeric <sup>R25C</sup>PTH(1-34) on bone 8 weeks later, after 4 weeks of recovery followed by 4 weeks of treatment with PTH(1-34) or <sup>R25C</sup>PTH(1-34). The mice were therefore sacrificed at 18 weeks of age. We revised the Methods section (page 18, lines 436-441 and lines 442-448) as follows.

      - ‘Eight-week-old C57BL/6N female mice were purchased from KOATECH (Gyeonggi-do, Republic of Korea), and stabilized mice for 2 weeks. All animal care and experimental procedures were conducted under the guidelines set by the Institutional Animal Care and Use Committees of Kyungpook National University (KNU-2021-0101). The mice were housed in a specific pathogen-free environment, with 4-5 mice per cage, under a 12-h light cycle at 22 ± 2°C. They were provided with standard rodent chow and water ad libitum.’

      - ‘An ovariectomized (OVX) mouse model was established using 10-week-old C57BL/6N female mice. Following surgery, mice were divided into the following four groups (n = 6 mice/group) as follows: sham, OVX control group, OVX + PTH (1–34) treated group (40 µg/kg/day), and OVX + dimeric <sup>R25C</sup>PTH treated group (40-80 µg/kg/day). OVX mice were allowed to recover for 4 weeks after surgery. Afterward, PTH (1–34) or <sup>R25C</sup>PTH was injected subcutaneously 5 times a week for 4 weeks. Micro-computed tomography (μ-CT) and histological analyses were performed on 4 groups at 18 weeks of age.’

      We also appreciate the reviewer's helpful comment on histology analysis. We agree with the reviewer’s comment that the primary spongiosa does not fully reflect bone remodeling. For histomorphometry analysis in young or male mice, we commonly use the secondary spongiosa, which more accurately reflects bone remodeling. However, in aged or OVX-induced osteoporosis mouse models, we use the primary and secondary spongiosa for histomorphometry analysis because of the barely detectable bone in the secondary spongiosa. In the TRAP staining, we observed an inhibitory effect of PTH on the osteoclast surface/bone surface, which was due to an increased bone surface in the PTH treatment group and less bone in the OVX-vehicle group. Serum CTX1 levels showed no significant difference between the OVX+vehicle and OVX+PTH(1-34) groups. We revised the Materials and Methods (page 21, line 502) and Discussion (page 14, line 330) sections as follows.

      - ‘In the histomorphometry analysis for TRAP staining, we used the secondary and primary spongiosa for the trabecular ROI because of the barely detectable in the secondary spongiosa of OVX model.’

      - ‘This study has several limitations. First, it is urgently necessary to determine whether dimeric <sup>R25C</sup>PTH is present in human patient serum. Second, TRAP staining showed an inhibitory effect of PTH treatment on the primary spongiosa area. However, the secondary spongiosa, which more accurately reflects bone remodeling (55), was not examined due to the barely detectable bone in this area in OVX-induced osteoporosis mouse models. Third, it is unclear whether similar bone phenotypes exist between human <sup>R25C</sup>PTH patients and dimeric <sup>R25C</sup>PTH-treated mice, particularly regarding low bone strength. Although the dimeric <sup>R25C</sup>PTH-treated group showed higher cortical BMD compared to WT-Sham or PTH groups, there was no difference in bone strength compared to the osteoporotic mouse model. Fourth, our study showed that PTH or <sup>R25C</sup>PTH treatment decreased circumferential length; it is uncertain if this phenotype is also present in PTH-treated or <sup>R25C</sup>PTH patients. Finally, we did not analyze the <sup>R25C</sup>PTH mutant mouse model, which would allow us to compare phenotypes that most closely resemble those of human patients.’

      (2) It is also somewhat concerning that the age range is from 8-10 weeks, increasing the variability within the model. Did the age of mice differ between the groups analysed?

      We utilized mice of the same age (10 weeks) across all experiments involving the surgically induced ovariectomy (OVX) model, as described above.

      (3) Methods are not sufficiently detailed. For example, the regions used for histomorphometry are not described, there is no information on micro-CT thresholds, no detail on the force used for mechanical testing. Please address this request.

      Thank you for your comment. Let me address your points step by step.

      (1) Thresholds for analysis were determined manually based on grayscale values for each experimental group as follows: trabecular bone: 3000; cortical bone: 5000 for all samples. We utilized an HA (calcium hydroxyapatite) phantom with HA content ranging from 0 to 1200 mg CaHA/cm³ to measure the grayscale values via µ-CT. These measurements were then used to generate a standard curve.

      Author response image 1.

      (2) Bone parameters and density were analyzed in the region between 0.3–1.755 mm (voxel size: 9.7 µm, 150 slices) from the bottom of the growth plate. Analysis of bone structure was performed using adaptive thresholding in a CT Analyser.

      Author response image 2.

      (3) Three-point bending test: the left femur of the mouse was immersed in 0.9% NaCl solution, wrapped in gauze, and stored at −20°C until ready for a three-point bending test. In this test, we positioned the mouse femurs horizontally with the anterior surface facing upwards, centered on the supports, and the compressive force was applied vertically to the mid-shaft. The pressure sensor was positioned at a distance that allowed for the maximum allowable pressure (200 N) without interfering with the test (20.0 mm for the femur). A miniature material testing machine (Instron, MA, USA) was used for this test. The crosshead speed was decreased to 1 mm/min until failure. During the test, force-displacement data were collected to determine the maximum load and slope of the bones.

      (4) As the reviewer suggested, we revised the Methods (page 20, line 477 and lines 482-486) as follows.

      - ‘Bone parameters and density were analyzed in the region between 0.3–1.755 mm (150 slices) from the bottom of the growth plate. Analysis of bone structure was performed using adaptive thresholding in a µ-CT Analyser. Thresholds for analysis were determined manually based on grayscale values for each experimental group: trabecular bone: 3000; cortical bone: 5000 for all samples.’

      -  ‘The left femur of the mouse was immersed in 0.9 % NaCl solution, wrapped in gauze, and stored at −20°C until ready for a three-point bending test. In this test, we placed the mouse femurs horizontally with the anterior surface facing upwards, centered on the supports, and the compressive force was applied vertically to the mid-shaft. The pressure sensor was positioned at a distance that allowed maximum allowable pressure (1000N) without interfering with the test (20.0 mm for the femur). A miniature material testing machine (Instron, MA, U.S.A.) was used for this test. The crosshead speed was decreased to 1 mm/min until failure. During the test, force-displacement data were collected to determine the maximum load and slope of the bones.’
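
      The ROI extent and the phantom-based calibration described in this response can be checked with a short calculation. The following is a minimal sketch, not the authors' analysis pipeline: it assumes that the ROI height equals the slice count times the voxel size, and that grayscale and HA density are linearly related. The phantom readings are hypothetical placeholders, and the names `roi_bounds_mm`, `fit_line`, and `gray_to_density` are illustrative.

```python
def roi_bounds_mm(offset_mm, n_slices, voxel_um):
    """Return (start, end) of the ROI in mm, measured from the growth plate."""
    height_mm = n_slices * voxel_um / 1000.0
    return offset_mm, offset_mm + height_mm


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Values stated in the response: 0.3 mm offset, 150 slices, 9.7 um voxels.
start, end = roi_bounds_mm(0.3, 150, 9.7)
print(f"ROI: {start:.3f}-{end:.3f} mm")  # ROI: 0.300-1.755 mm

# Hypothetical phantom readings (NOT measured data): HA density in
# mg CaHA/cm^3 and the grayscale value recorded for each insert.
phantom_density = [0.0, 400.0, 800.0, 1200.0]
phantom_gray = [500.0, 2500.0, 4500.0, 6500.0]

# Standard curve mapping grayscale to density.
slope, intercept = fit_line(phantom_gray, phantom_density)


def gray_to_density(gray):
    """Convert a grayscale value to HA density using the fitted curve."""
    return slope * gray + intercept
```

      The first result confirms that 150 slices of 9.7 µm starting 0.3 mm from the growth plate end at 1.755 mm, as stated. With real phantom readings in place of the placeholders, `gray_to_density` would translate a grayscale threshold into an equivalent mineral density.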

      (4) There are three things unclear about the calvarial injection mouse model. Firstly, were the mice injected over the calvariae or with a standard subcutaneous injection (e.g. at the back of the neck)? If they were injected over the calvaria, why were both surfaces measured? Secondly, why was the dose of the R25C-PTH double that of PTH(1-34)? Thirdly, there is no justification for the use of "more intense coloration" as a marker of new bone; this requires calcein labelling to prove it new bone. It would be more reliable to measure and report the thickness of the calvaria. Please address these technical questions.

      Thank you for your valuable feedback on the calvarial injection mouse model. Below are our responses to the specific points mentioned:

      (1) Injection method and measurement sites: The injections were administered subcutaneously above the calvaria, rather than at the standard subcutaneous site such as the back of the neck. This approach was chosen to ensure direct delivery of the peptide to the target area, enhancing the localized effects on bone formation. Measurements were taken at two different parts of the calvaria to account for any variation in the spread and absorption of the administered substance following injection. By analyzing both surfaces, we aimed to provide a comprehensive assessment of the impact on calvarial bone thickness.

      (2) Dose of <sup>R25C</sup>PTH compared to PTH(1-34): The dose of <sup>R25C</sup>PTH used in our study was determined based on molecular weight calculations. The molecular weight of the dimeric <sup>R25C</sup>PTH(1-34) is approximately twice that of the monomeric PTH(1-34). Therefore, to maintain a consistent molar concentration and ensure comparable biological effects, the dose of <sup>R25C</sup>PTH was adjusted accordingly.

      (3) Use of "more intense coloration" as a marker of new bone: We acknowledge that calcein labeling would provide a more reliable and quantifiable way to identify new bone formation. The use of “more intense coloration” was intended as a qualitative indicator in this study, and we recognize the technical limitations of this approach.
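      The molar-dose reasoning in point (2) can be sketched as a short calculation. The molecular weight used for PTH(1-34) is an approximate value, the dimer is assumed to weigh exactly twice the monomer (ignoring the R25C substitution and the hydrogens lost on disulfide formation), and the 40 µg/kg reference dose is hypothetical; none of these numbers are taken from the manuscript.

```python
def equimolar_mass_dose(reference_mass_dose, reference_mw, test_mw):
    """Mass dose of a test peptide that delivers the same number of moles
    as `reference_mass_dose` of the reference peptide."""
    return reference_mass_dose * test_mw / reference_mw


PTH_1_34_MW = 4117.7          # g/mol, approximate value for illustration
DIMER_MW = 2 * PTH_1_34_MW    # assumption: dimer is about twice the monomer

reference_dose = 40.0         # hypothetical mass dose of PTH(1-34), in ug/kg
dimer_dose = equimolar_mass_dose(reference_dose, PTH_1_34_MW, DIMER_MW)
# dimer_dose is twice the reference mass dose, i.e. the same molar dose
```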

      (5) The presentation of mechanical testing data is not sufficient. Example curves should be shown, and data corrected for bone size needs to be shown. The difference in mechanical behaviour is interesting, but does it stem from a difference in the amount of bone, or from a difference in the quality of the bone? Please explain this matter better in the manuscript.

      Thank you for your comment.

      As the reviewer requested, we provide example curves for the rat femur three-point bending test below.

      Author response image 3.

      (1) The cortical bone area was decreased in the OVX-Vehicle and OVX-<sup>R25C</sup>PTH(1-34) groups but not in the OVX-PTH(1-34) group compared to the Sham group. However, the total bone area was decreased in the PTH(1-34) and <sup>R25C</sup>PTH(1-34) treated groups, with no significant difference in the OVX-Vehicle group compared to the Sham group. Collectively, there was an increase in cortical thickness which resulted in a narrowing of the bone marrow space in OVX-<sup>R25C</sup>PTH(1-34) groups. Accordingly, we revised Fig 5B with the addition of Tt.Ar and Ct.Ar.

      (2) Following the reviewer’s suggestion, we revised the Results on page 10, lines 220-228, as follows.

      - ‘Quantitative micro-computed tomography (μ-CT) analysis of the femurs obtained from each group revealed that, as compared to OVX + vehicle controls, treatment with PTH(1–34) increased femoral trabecular bone volume fraction (Tb.BV/TV) by 121%, cortical bone volume fraction (Ct.BV/TV) by 128%, cortical thickness (Ct.Th) by 115%, cortical area (Ct.Ar) by 110%, and cortical area fraction (Ct.Ar/Tt.Ar) by 118%, while decreasing total tissue area (Tt.Ar) by 93% (Figure 5A and 5B). Treatment with dimeric <sup>R25C</sup>PTH(1-34) had similar effects on the femoral cortical bone parameters, as it increased Ct.BMD by 104%, Ct.BV/TV by 125%, Ct.Th by 107%, and Ct.Ar/Tt.Ar by 116%, while decreasing Tt.Ar by 86% (Figure 5). Considering the reduction of Tt.Ar and no change of Ct.Ar compared to the OVX+vehicle controls, the increase of Ct.Ar/Tt.Ar indicates a decrease in bone marrow space. The increase in cortical bone BMD was significant with dimeric <sup>R25C</sup>PTH(1-34) but not with PTH(1-34), whereas an increase in femoral trabecular bone was only observed with PTH(1-34).’

      (6) The micro-CT analysis of the cortical bone in the OVX model is insufficient. Please indicate whether cross-sectional area has increased. Is there an increase in the size of the bones, or is the increase in cortical thickness due to a narrowing of the marrow space? This may help resolve the apparent contradiction between the cortical thickness data (where there is no difference between the two PTH formulations) and the mechanical testing data (where there is a difference). Please explain this matter better in the manuscript.

      Thank you for your comment.

      (1) The cortical bone area was decreased in the OVX-Vehicle and OVX-<sup>R25C</sup>PTH(1-34) groups but not in the OVX-PTH(1-34) group compared to the Sham group. However, the total bone area was decreased in the PTH(1-34) and <sup>R25C</sup>PTH(1-34) treated groups, with no significant difference in the OVX-vehicle group compared to the Sham group. Taken together, there was an increase in cortical thickness due to a narrowing of the bone marrow space in OVX-<sup>R25C</sup>PTH(1-34) groups. Therefore, we revised as above.

      (2) Following the reviewer’s suggestion, we revised the Results on page 10, lines 220-228, as follows.

      - ‘Quantitative micro-computed tomography (μ-CT) analysis of the femurs obtained from each group revealed that, as compared to OVX + vehicle controls, treatment with PTH(1–34) increased femoral trabecular bone volume fraction (Tb.BV/TV) by 121%, cortical bone volume fraction (Ct.BV/TV) by 128%, cortical thickness (Ct.Th) by 115%, cortical area (Ct.Ar) by 110%, and cortical area fraction (Ct.Ar/Tt.Ar) by 118%, while decreasing total tissue area (Tt.Ar) by 93% (Figure 5A and 5B). Treatment with dimeric <sup>R25C</sup>PTH(1-34) had similar effects on the femoral cortical bone parameters, as it increased Ct.BMD by 104%, Ct.BV/TV by 125%, Ct.Th by 107%, and Ct.Ar/Tt.Ar by 116%, while decreasing Tt.Ar by 86% (Figure 5B). Considering the reduction of Tt.Ar and no change of Ct.Ar compared to the OVX+vehicle controls, the increase of Ct.Ar/Tt.Ar indicates a decrease in bone marrow space. The increase in cortical bone BMD was significant with dimeric <sup>R25C</sup>PTH(1-34) but not with PTH(1-34), whereas an increase in femoral trabecular bone was only observed with PTH(1-34).’

      (7) The evidence that dimeric PTH has a different effect to monomeric PTH is very slim; I am not sure this is a real effect. Such differences take a long time to sort out (e.g. the field is still trying to determine whether teriparatide and abaloparatide are different). I think the authors need to look more carefully at their data - almost all effects are the same. Ultimately, the statement that dimeric PTH may be a more effective anabolic therapy than monomeric PTH is not supported by the data, and this should be removed. There is little to no difference found between normal PTH and the variant in their effects on calcium and phosphate homeostasis or on bone mass. However, the analysis has been somewhat cursory, with insufficient mechanical testing or cortical data presented. Many of the effects seem to be the same (e.g. cortical thickness, P1NP, ALP, vertebral BV/TV and MAR), but the way it is written it sounds like there is a difference. Please remove some of the unfounded claims that you have made in this manuscript.

      Thank you for your insightful comments. We strongly agree with your conclusion that PTH and dimeric <sup>R25C</sup>PTH exhibit similar activities. We have toned down our statement; however, some results remain statistically significant and need to be stated clearly. Specifically, when we changed the statistical method from the t-test to one-way ANOVA, significance for the bone formation markers was observed only in the dimeric PTH-treated samples. We have revised the Results section (page 9, lines 206-212) as follows to reflect this change.

      - ‘These analyses revealed that both PTH(1-34) and dimeric <sup>R25C</sup>PTH(1-34) significantly increased the width of the new bone area by approximately four-fold, as compared to the vehicle group (Figure 4B). These findings thus support a capacity of dimeric <sup>R25C</sup>PTH(1-34) to induce new bone formation in vivo, similar to PTH, despite molecular and structural changes.’

      Although it is unclear whether <sup>R25C</sup>PTH circulates in a dimeric form or as a mutant monomer, the absence of bone resorption associated with long-term PTH exposure in the patients suggests the potential for a bone anabolic drug without side effects. Continued observation of the recently reported young patient in Denmark is also expected to clarify this effect further. However, we acknowledge that our current data alone are insufficient to claim that <sup>R25C</sup>PTH may be a more effective anabolic therapy than wild-type PTH, and we have adjusted our tone accordingly.

      (8) Statistical analysis used multiple t-tests. ANOVA would be more appropriate.

      We agree with your suggestion. To compare the means among three or more groups, ANOVA is more appropriate than the t-test. Accordingly, we performed new statistical analyses using one-way and two-way ANOVA. One-way ANOVA was applied to Figures 4, 5, and 6 (previously Figures 5, 6, and 7), and two-way ANOVA was applied to Figure 3, considering both time and treatment variables. We revised some of the figures and descriptions to reflect the changes in significance.
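      For readers who want to see what the one-way ANOVA computes when three groups are compared at once, a minimal sketch of the F statistic follows. The function name and the three sample groups are hypothetical and are not the study's data; the p-value, which requires the F distribution, is deliberately left out (a statistics package would normally supply it).

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical measurements for three treatment groups.
vehicle = [1.0, 2.0, 3.0]
treatment_a = [2.0, 3.0, 4.0]
treatment_b = [3.0, 4.0, 5.0]
f_stat, df_b, df_w = one_way_anova_f([vehicle, treatment_a, treatment_b])
```

      A single F test across all groups avoids the inflation of the false-positive rate that comes from running a separate t-test for every pair of groups.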

      We thank Reviewer #1 for the thorough and thoughtful review. We greatly appreciate the suggestions and have incorporated them to enhance the quality of our paper.

      Reviewer #2 (Public Review):

      Summary:

      The study conducted by Noh et al. investigated the effects of parathyroid hormone (PTH) and a dimeric PTH peptide on bone formation and serum biochemistry in ovariectomized mice as a model for postmenopausal osteoporosis. The authors claimed that the dimeric PTH peptide has pharmacological benefits over PTH in promoting bone formation, despite both molecules having similar effects on bone formation and serum Ca2+. However, after careful evaluation, I am not convinced that this manuscript adds a significant contribution to the literature on bone and mineral research.

      Strengths:

      Experiments are well performed, but strengths are limited to the methodology used to evaluate bone formation and serum biochemical analysis.

      Weaknesses:

      (1) Limited significance of this study:

      • This study follows a previous study (not cited) reporting the effect of the dimeric R25CPTH(1-34) on bone regeneration in an osteoporotic dog (Beagle) model (Jeong-Oh Shin et al., eLife 13:RP93830, 2024). It's unclear why the authors tested the dimeric R25C-PTH peptide on a rodent animal model, which has limitations because the healing mechanism of human bone is more similar in dogs than in mice.

      Thank you for your interest in our research. Regarding the paper by Shin et al. (2024, DOI: 10.7554/eLife.93830.1), we would like to clarify that our research on dimeric <sup>R25C</sup>PTH(1-34) was conducted first. Initially, we confirmed dimerization under in vitro conditions and observed its effects in a mouse model. Recognizing the need for additional animal models, we collaborated with Shin et al.'s team. Due to delays during the submission process, our paper was submitted later, which seems to have led to this misunderstanding. Shin et al. (2024), however, cited our preprint on bioRxiv (Noh, M., Che, X., Jin, X., Lee, D. K., Kim, H. J., Park, D. R., ... & Lee, S. (2024). Dimeric R25CPTH (1-34) Activates the Parathyroid Hormone-1 Receptor in vitro and Stimulates Bone Formation in Osteoporotic Female Mice. bioRxiv, 2024-03. DOI: 10.1101/2024.03.13.584815). Both the work of Shin et al. and our mouse study support the action of dimeric <sup>R25C</sup>PTH(1-34) in regulating bone metabolism.

      • The authors should clarify why they tested the effects of dimeric <sup>R25C</sup>PTH(1-34) and not dimeric <sup>R25C</sup>PTH(1-84)?

      Thank you for your valid comments. There are several reasons why we used the 1-34 fragment peptide in our experiments. Currently, PTH analog peptides for medical purposes include human parathyroid hormone fragment 1-34 (PTH(1-34)) and full-length recombinant human parathyroid hormone (rhPTH(1-84)). PTH(1-34) is used as a bone anabolic agent, while rhPTH(1-84) is used for PTH replacement therapy in hypoparathyroid patients with hypocalcemia. We aimed to compare the bone formation effects of <sup>R25C</sup>PTH with those of wild-type PTH, for which PTH(1-34) was deemed more appropriate. Additionally, previous studies have shown that PTH(1-34) and PTH(1-84) possess equal ligand binding affinity for the PTH1 receptor. Key sites within the first 34 N-terminal amino acids of PTH are critical for high-affinity interactions and receptor activation. Alterations in the N-terminal sequence of PTH(1-84) significantly reduce receptor binding, while truncations at the C-terminal end do not affect receptor affinity. The peptide used in our experiments was synthetic, and because peptide length does not affect receptor affinity, the shorter PTH(1-34) was the more practical choice for synthesis. Consequently, we tested PTH(1-34) and dimeric <sup>R25C</sup>PTH(1-34) because of the known bone anabolic efficacy of PTH(1-34) and its relevance to receptor interactions. We do, however, aim to conduct a functional analysis of dimeric <sup>R25C</sup>PTH(1-84) in a future study.

      • The study is descriptive with no mechanism.

      We recognize that your concern is legitimate. While our study includes descriptive elements, it extends beyond mere observation. The <sup>R25C</sup>PTH research, which began with a case report, has evolved to use molecular techniques to better understand the unique physiological phenomena observed in patients. We have validated in vitro that the mutation causes the peptide to dimerize, and we have assessed the effects of the dimer in both in vitro cell line models and in vivo mouse models. Although we have not yet confirmed whether <sup>R25C</sup>PTH exists as a dimer or monomer in patient blood, we anticipate that at least some fraction may exist in dimeric form, and we are currently conducting mass spectrometry on patient blood samples to determine this. This paper therefore serves as the first report suggesting that this PTH mutant may form a homodimer. Importantly, we are actively investigating the molecular mechanisms and downstream signaling pathways that differentiate normal PTH from dimeric <sup>R25C</sup>PTH. This includes analyzing differences in the proteome and transcriptome induced by PTH and dimeric <sup>R25C</sup>PTH, and examining the molecular characteristics and structural changes associated with the mutation. Through this comprehensive approach, we aim to provide a detailed mechanistic understanding of <sup>R25C</sup>PTH in a subsequent publication.

      (2) Statistics are inadequately described or performed for the experimental design:

      • The statistical analysis in Figure 5 needs to be written in a way that makes it clearer how statistics were done; t-test or one-way ANOVA?

      Sorry for the inconvenience and thank you for your thorough review. Initially, we conducted the statistical analysis using a t-test. However, during the revision process, we performed a new statistical analysis using one-way ANOVA, as it is more appropriate for comparing the means among three or more groups. Despite this change, there were no differences in statistical significance, so the descriptions remained unchanged.

      • Statistics in Figures 6 and 7 should be performed by one-way ANOVA to compare the mean values of one variable among three or more groups, and not t-test.

      Thank you for your thorough review, and I apologize for any inconvenience. I agree with your suggestion that ANOVA is more appropriate than the t-test for comparing means among three or more groups. Accordingly, we performed new statistical analyses using one-way ANOVA. When we changed the statistical method from t-test to one-way ANOVA, the significance of bone formation markers, P1NP and ALP, appeared only in dimeric R25CPTH and not in wild-type PTH. We have reflected these findings in the text.

      (3) Misleading and confused discussion:

      • The first paragraph lacks clarity in the PTH nomenclature and the authors should provide a clear statement that the PTH mutant found in patients is likely a monomeric R25CPTH(1-84), considering that there has been no proof of a dimeric form.

      Thank you for your insightful comments. We agree that there was some ambiguity in the nomenclature used in the first paragraph of the Discussion section. However, we do not believe that the absence of proof of a dimeric form of the <sup>R25C</sup>PTH(1-84) mutant necessarily indicates that the PTH mutant in the blood is solely monomeric. Identifying the in vivo structure of <sup>R25C</sup>PTH(1-84) is one of the goals of our ongoing project. While the exact form of <sup>R25C</sup>PTH(1-84) in patients is still elusive, we are investigating the possibility that some fraction may exist as a dimer. On page 12, lines 274-276, we have revised the content to address this issue and improve clarity as follows.

      - ‘In this study, we show that the introduction of a cysteine mutation at the 25th amino acid position of mature parathyroid hormone (<sup>R25C</sup>PTH) facilitates the formation of homodimers comprised of the resulting dimeric R25CPTH peptide in vitro.’

      • Moreover, the authors should discuss the study by White et al. (PNAS 2019), which shows that there are defective PTH1R signaling responses to monomeric R25CPTH(1-34). This results in faster ligand dissociation, rapid receptor recycling, a short cAMP time course, and a loss of calcium ion allosteric effect.

      We apologize for the inconvenience and thank you for your thorough review. We were aware of the referenced paper and deeply apologize for its omission during the writing and editing process. Citing this paper will enhance the credibility of our findings. We have now included this citation and made the necessary adjustments to the Discussion section (page 12, lines 295-296) as follows.

      - ‘We also observed that the potency of cAMP production in cells was lower for dimeric <sup>R25C</sup>PTH as compared to the monomeric <sup>R25C</sup>PTH, in accordance with a lower PTH1R-binding affinity. Previous reports indicated that a mutation at the 25th position of PTH results in the loss of calcium ion allosteric effects on monomeric <sup>R25C</sup>PTH, leading to faster ligand dissociation, rapid receptor recycling, and a shorter cAMP time course (50). Correspondingly, the weaker receptor affinity and reduced cAMP production observed in dimeric <sup>R25C</sup>PTH suggest a possibility that the formation of a disulfide bond at the 25th position significantly alters the function of PTH as a PTH1R ligand. These structural effects are not yet fully understood and need to be investigated further.’

      • The authors should also clarify what they mean by "the dimeric form of R25CPTH can serve as a new peptide ...(lines 328-329)" The dimeric R25CPTH(1-34) induces similar bone anabolic effects and calcemic responses to PTH(1-34), so it is unclear what the new benefit of the dimeric PTH is.

      We apologize for any confusion in our previous description. We concur that, as you mentioned, PTH and dimeric <sup>R25C</sup>PTH exhibit similar activities. We have toned down our statement; however, some results remain statistically significant and need to be stated clearly. Specifically, when we changed the statistical method from the t-test to one-way ANOVA, significance for the bone formation markers was observed only in the dimeric PTH-treated samples. We have revised the Results section (page 9, lines 206-212) as follows to reflect this change.

      - ‘These analyses revealed that both PTH(1-34) and dimeric <sup>R25C</sup>PTH(1-34) significantly increased the width of the new bone area by approximately four-fold, as compared to the vehicle group (Figure 4B). These findings thus support a capacity of dimeric <sup>R25C</sup>PTH(1-34) to induce new bone formation in vivo, similar to PTH, despite molecular and structural changes.’

      Although it is unclear whether <sup>R25C</sup>PTH circulates in a dimeric form or as a mutant monomer, the absence of bone resorption associated with long-term PTH exposure in the patients suggests the potential for a bone anabolic drug without side effects. Continued observation of the recently reported young patient in Denmark is also expected to clarify this effect further. However, we acknowledge that our current data alone are insufficient to claim that <sup>R25C</sup>PTH may be a more effective anabolic therapy than wild-type PTH, and we have adjusted our tone accordingly.

      We thank Reviewer #2 for the comprehensive and considerate review. We are grateful for the suggestions and have revised our manuscript accordingly to improve our paper.

      Reviewer #1 (Recommendations For The Authors):

      (1) Figure 1D lacks molecular weight markers.

      Thank you for your thorough review. We added protein molecular weight markers in the figure.

      (2) The lack of change in plasma cAMP is very surprising, particularly given that there is no difference in the effect of the two forms of PTH on serum calcium or phosphate, or urinary phosphate. This data is somewhat of a distraction since no effort has been made to assess the difference in the effects of these PTH forms on kidney function. I suggest removing this data and spending time working on the origin of this difference.

      Thank you for your insightful comments and valuable suggestions on our manuscript. We also could not precisely explain the discrepancy between the cell line and animal model experiments. However, since the results were consistently observed, we included them in the paper as they may be significant. We acknowledge that in the context of our current research, these data lack sufficient correlation with other findings. Therefore, we have removed the data about the lack of change in plasma cAMP by PTH injection (Figure 4. Effect of cAMP production by PTH injection in CD1 female mice) and revised the manuscript accordingly (Page 8, line 188-194; page 12, line 301-306; page 19, line 454-456). We are currently conducting further research with multiomics data analysis to elucidate potential differences in the sub-signaling pathways between PTH and dimeric R25CPTH, to identify the specific functions affected by these variations, and to understand the underlying mechanisms. The lack of changes in plasma cAMP levels in vivo will be addressed in a subsequent publication detailing our findings.

      (3) Introduction, line 61. The authors state that "most" anti-resorptive therapies cannot stimulate new bone formation. I don't believe that ANY anti-resorptive therapies stimulate new bone formation! If there is one, this should be referenced.

      Thank you for pointing out important aspects. Romosozumab, a humanized monoclonal anti-sclerostin antibody, has a dual effect by enhancing bone formation and inhibiting bone resorption. Sclerostin, a protein produced by osteocytes, plays a role in the regulation of bone metabolism. It promotes osteoclast differentiation, which is associated with bone resorption, and suppresses osteoblast activity, which is crucial for bone formation. By binding to sclerostin, Romosozumab prevents it from blocking the signaling pathways necessary for osteogenesis. Consequently, Romosozumab therapy not only regulates bone resorption but also affects new bone formation. We added the references to that information.

      (4) The authors tend to include a lot of methods in the results section (e.g. describing the number of replicates, and details of histological analysis). This should be minimized.

      Thank you for your thorough review, and we apologize for the inconvenience. We have minimized the methodological details in the Results section, so that only the information essential for understanding the findings and procedures remains.

      (5) Lines 302-305: If retaining the blood cAMP data, please provide references for the assertion that renal PTH receptors mediate this response.

      PTH exerts its effects primarily through the PTH1 receptor (PTH1R), a G protein-coupled receptor present in various tissues, including bone and kidney (Chase et al., 1968; Chase et al., 1970). When activated by PTH, this receptor stimulates the production of cyclic AMP (cAMP), with the kidneys playing a significant role in this process (Maeda et al., 2013). In the initial manuscript, the importance of renal PTH receptors in mediating the blood cAMP response may have been overemphasized. We appreciate your feedback on this point and provide references below to support the assertion. However, because we removed the data on the lack of change in plasma cAMP after PTH injection in response to recommendation (2) above, the description of renal PTH receptors mediating the blood cAMP response has also been removed.

      - Chase, Lewis R., and G. D. Aurbach. "Renal adenyl cyclase: anatomically separate sites for parathyroid hormone and vasopressin." Science 159.3814 (1968): 545-547. DOI: 10.1126/science.159.3814.545

      - Chase, Lewis R., and G. D. Aurbach. "The effect of parathyroid hormone on the concentration of adenosine 3', 5'-monophosphate in skeletal tissue in vitro." Journal of Biological Chemistry 245.7 (1970): 1520-1526. DOI: 10.1016/S0021-9258(19)77126-9

      - Maeda, Akira, et al. "Critical role of parathyroid hormone (PTH) receptor-1 phosphorylation in regulating acute responses to PTH." Proceedings of the National Academy of Sciences 110.15 (2013): 5864-5869. DOI: 10.1073/pnas.1301674110

      (6) Eosin stains bone pink and haematoxylin stains cells purple. This has been incorrectly described in the manuscript.

      Thank you for your thorough review, and I apologize for any confusion caused by the poor description. It appears that the terms were used interchangeably during the editing process. We have corrected the description in the manuscript and will ensure such mistakes do not occur again in the future.

      (7) Sodium thiosulphate is a fixative for Von Kossa staining, not an agent that removes nonspecific binding.

      Thank you for your careful review. However, there seems to be a misunderstanding of sodium formaldehyde as sodium thiosulfate. A 5% sodium thiosulfate solution is a critical in vitro diagnostic agent used in various staining kits. As a reducing agent, it effectively removes excess silver ions in staining kits based on silver impregnation techniques. In our experiment, sodium thiosulfate was specifically used to remove residual silver ions in Von Kossa staining. For more details, please refer to the following link: https://www.morphisto.de/en/shop/detail/d/Natriumthiosulfat_5//12825/.

      Reviewer #2 (Recommendations For The Authors):

      Moderate-to-Minor points:

      • Line 73: it's either class B GPCR or secretin receptor family but not class B GPCR family.

      Thank you for your thorough review, and I apologize for any confusion in our previous description. We corrected the description in the manuscript as class B GPCR.

      • Line 79: correct "adenylate cyclase" to "transmembrane adenylate cyclases"

      Thank you for your thorough review, and I apologize for any confusion in our previous description. We corrected the description in the manuscript as transmembrane adenylate cyclases.

      • Line 89: should "hypothyroidism" be "hypoparathyroidism"?

      Thank you for your thorough review, and I apologize for any confusion in our previous description. We corrected the description in the manuscript as hypoparathyroidism.

      • Line 159: all agonists display higher binding affinities when their receptors are coupled to G proteins, so it's unclear why the higher affinity of the dimeric <sup>R25C</sup>PTH(1-34) for the RG state seems to be important for the authors.

      Thank you for your insightful comments. First, directly comparing the binding affinities measured for the R<sup>0</sup> (G protein-uncoupled) and RG (G protein-coupled) conformations of the receptor is not appropriate. The radiolabeled ligands bound to each conformation differ in form and size, which affects their binding affinities and, in turn, influences the measured binding strength of test ligands such as PTH, monomeric <sup>R25C</sup>PTH, and dimeric <sup>R25C</sup>PTH. It is therefore preferable to compare how the binding strengths of the test ligands differ within each conformation. Second, the fact that dimeric <sup>R25C</sup>PTH loses significant binding affinity for R<sup>0</sup> while retaining high affinity for the RG conformation of PTH1R is important because typical PTH exhibits high binding affinity for R<sup>0</sup>, whereas PTHrP shows higher affinity for the RG conformation. This suggests that dimeric <sup>R25C</sup>PTH may possess distinct molecular characteristics and potentially induce different downstream signaling pathways compared to typical PTH.

      • Line 169-170 and Fig. 2: According to the theory of receptor pharmacology established in the 1960s for native receptors (Arch. Int. Pharmacodyn. 127:459-478 (1960); Arch. Int. Pharmacodyn. 136:385-413 (1962)) and verified later in the 1980s-90s for recombinant GPCRs, the activity constant (Kact or EC50) value of hormone actions in various tissues or cells is equal to the dissociation constant (Kd) of the hormone when receptors are not overexpressed (EC50 = Kd). When receptors are overexpressed (presence of spare receptors), then EC50 < Kd. Assuming that after Cheng-Prusoff correction for data in Fig. 2, IC50 < Ki = Kd, how do the authors explain that IC50 values for RG are about 1-Log lower than EC50s (i.e., EC50 > Kd)?

      We appreciate your insightful comment and fully acknowledge the established theory of receptor pharmacology, which states that EC50 equals Kd when receptors are not overexpressed and that EC50 is less than Kd when they are. After reading your comments, we revisited Okazaki et al. (PNAS, 2008) to better understand the interaction of PTH with PTH1R. While our data might appear to contradict this theory, we believe that a direct comparison between the IC50 for RG and the EC50 in Figure 2 may not be entirely appropriate, for the following reasons. First, the IC50 was determined from membrane preparations of a receptor-overexpressing cell line (GP-2.3), whereas the EC50 was calculated from the cAMP response in SaOS-2 cells. These different experimental conditions contribute to the observed discrepancies. Second, the radioligands used in the competition assays differ: the R<sup>0</sup> assay used radiolabeled PTH(1-34), while the RG assay used M-PTH(1-15), a shorter peptide with several amino acid substitutions. This further complicates a direct comparison between the EC50 and IC50 values in our study.

      We thank all the reviewers for their thorough and thoughtful reviews. We greatly appreciate your suggestions and have addressed all the issues to enhance the quality of our paper.
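      For reference, the Cheng-Prusoff correction mentioned in the reviewer's question converts a competition-assay IC50 into an inhibition constant using the concentration and Kd of the radioligand. The sketch below shows the standard relation; the function name and the numerical values are hypothetical and are not taken from the binding data in Figure 2.

```python
def cheng_prusoff_ki(ic50, radioligand_conc, radioligand_kd):
    """Ki = IC50 / (1 + [L] / Kd), with all quantities in the same units.

    [L] is the free radioligand concentration and Kd is the radioligand's
    own dissociation constant for the receptor.
    """
    if radioligand_kd <= 0:
        raise ValueError("radioligand Kd must be positive")
    return ic50 / (1.0 + radioligand_conc / radioligand_kd)


# Hypothetical example: IC50 = 10 nM measured with 1 nM radioligand (Kd = 1 nM).
ki_nm = cheng_prusoff_ki(ic50=10.0, radioligand_conc=1.0, radioligand_kd=1.0)
```

      Because the correction depends on which radioligand is used and at what concentration, IC50 values obtained with different tracers are not directly comparable before correction.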

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      In this manuscript, the authors investigate differences between Tibetans and Han Chinese at altitude in terms of placental transcriptomes during full-term pregnancy. Most importantly, they found that the inter-population differentiation is mostly male-specific and the observed direction of transcriptional differentiation seems to be adaptive at high altitude. In general, it is of great importance and provides new insights into the functional basis of Tibetan high-altitude adaptations, which so far have been mostly studied via population genetic measures only. More specifically, I firmly believe that we need more phenotype data (including molecular phenotypes such as gene expression data) to fully understand Tibetan adaptations to high altitude, and this manuscript is a rare example of such a study. I have a few suggestions and/or questions with which I hope to improve the manuscript further, especially in terms of 1) testing if the observed DEG patterns are truly adaptive, and 2) how and whether the findings in this study can be linked to EPAS1 and EGLN1, the signature adaptation genes in Tibetans.

      We appreciate the reviewer’s constructive comments. We have addressed these points and the details are discussed below.

      Major Comments:

      1) The DEG analysis is the most central result in this manuscript, but the discrepancy between sex-combined and sex-specific DEGs is quite mind-boggling. For those that were differentially expressed in the sex-specific sets but not in the sex-combined one, the authors suggest an opposite direction of DE as an explanation (page 11, Figure S5). But Figure S5A does not show such a trend, showing that down-regulated genes in males are mostly not at all differentially expressed in females. Figure S5B does show such a trend, but it doesn't seem to be a dominant explanation. I would like to recommend the authors test alternative ways of analysis to boost statistical power for DEG detection other than simply splitting data into males and females and performing analysis in each subset. For example, the authors may consider utilizing gene-by-environment interaction analysis schemes here biological sex as an environmental factor.

      We agree with the reviewer that the opposite direction of DEGs is likely only one of the possible explanations for the discrepancy between the sex-combined and the sex-specific DEGs. We have toned down the description of this point in the revised manuscript.

      Following the reviewer’s suggestion, we performed an ANCOVA to evaluate the variance in the expression data explained by sex. For each gene, we compared the mean expression between Tibetans and Han Chinese using the ANCOVA test in the R aov function with sex as a covariate: aov(Expression ~ Ethnicity + Fetal sex). We observed a significantly higher variance explained by sex than by ethnicity in six layers of the placenta (all except the CN layer) (Author response image 1). For example, in the UC layer, fetal sex explains ~0.203 of the variance, while ethnicity explains ~0.107 (P-value = 4.9e-4). These results suggest a significant contribution of fetal sex to the observed variance of gene expression, consistent with the observed sex-biased DEG patterns.

      Author response image 1.

      The ANCOVA results of the seven layers of placenta. The scatter plot shows the comparison of the explained variance (y-axis) and significance (x-axis, denoted by –log10(P-value)) between ethnicity (dots in red) and fetal sex (dots in blue). Each dot represents an investigated gene, and only genes with P<0.05 in significance are shown in the plots. The table is the summary statistics of the ANCOVA analysis.
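      The per-gene comparison of variance explained by ethnicity and by fetal sex can be illustrated with a minimal Python sketch. It assumes that "variance explained" means the sequential (Type I) sum of squares of each term divided by the total sum of squares, which is what R's aov reports term by term; the response does not state the exact definition used, and the function names and the toy data here are hypothetical. No F-test or p-value is computed.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (e.g. both sexes and both
    ethnicities are present and not perfectly confounded).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of an ordinary least-squares fit of y on columns."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(columns[i], y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    fitted = [sum(beta[i] * columns[i][n] for i in range(k)) for n in range(len(y))]
    return sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))


def sequential_variance_explained(expression, ethnicity, fetal_sex):
    """Fraction of total variance attributed to each term, added in order.

    ethnicity and fetal_sex are 0/1 codes (hypothetical encoding).
    """
    ones = [1.0] * len(expression)
    eth = [float(v) for v in ethnicity]
    sex = [float(v) for v in fetal_sex]
    rss_null = rss([ones], expression)            # intercept only
    rss_eth = rss([ones, eth], expression)        # + ethnicity
    rss_full = rss([ones, eth, sex], expression)  # + fetal sex
    return {
        "ethnicity": (rss_null - rss_eth) / rss_null,
        "fetal_sex": (rss_eth - rss_full) / rss_null,
    }
```

      Because the sums of squares are sequential, the share attributed to fetal sex is what remains after ethnicity has been fitted; in an unbalanced design, swapping the order of the terms would change both numbers.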

      2) Please clarify how the authors handled multiple testing correction of p-values.

      There were three analyses involving multiple testing in this study: 1) for the differential expression analysis, we obtained multiple-testing-corrected p-values using the Benjamini-Hochberg FDR (false discovery rate) procedure; 2) for the GO enrichment analysis, we calculated FDR-adjusted q-values from the overall p-values to correct for multiple testing.

      3) for the WGCNA analysis, 12 traits were involved: population, birth weight (BW), biparietal diameter (BPD), femur length (FL), gestation time (GT), placental weight (PW), placental volume (PLV), abdominal girth (AG), amniotic fluid maximum depth (AFMD), amniotic fluid index (AFI), fetal heart rate (FH) and fundal height (FUH). We calculated a Bonferroni threshold (p-value = 0.05/the number of independent traits) using the correlation matrix of the traits to evaluate the significant modules. We estimated that the number of independent traits among the 12 investigated traits was 4 (Author response image 2). Therefore, we used a more stringent significance threshold of p-value = 0.0125 (0.05/4) as the final threshold to correct for the multiple testing introduced by multiple traits in our WGCNA analyses. We have updated this section based on the new threshold.

      Author response image 2.

      The correlation matrix of the 12 traits involved in the WGCNA analysis. Correlation coefficients larger than 0.2 (or smaller than -0.2) are regarded as significant correlations and are marked in gradient colors.
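      The threshold arithmetic above (0.05/4 = 0.0125) can be sketched in Python. The response does not say how the number of independent traits was derived from the correlation matrix. As one plausible reading, and purely as an assumption, the sketch links traits whose absolute correlation exceeds 0.2 and counts the resulting groups; eigenvalue-based estimators are a common alternative. All names and the toy matrix are hypothetical.

```python
def count_independent_groups(corr, cutoff=0.2):
    """Count groups of traits, linking any two traits with |r| > cutoff.

    corr is a symmetric matrix given as a list of lists. This grouping rule
    is an assumption, not the authors' documented procedure.
    """
    n = len(corr)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i][j]) > cutoff:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})


def bonferroni_threshold(alpha, n_independent):
    """Per-test significance threshold for n_independent tests."""
    return alpha / n_independent


# Toy example: traits 0 and 1 are correlated, as are traits 2 and 3.
toy_corr = [
    [1.0, 0.5, 0.1, 0.0],
    [0.5, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, -0.4],
    [0.0, 0.1, -0.4, 1.0],
]
n_groups = count_independent_groups(toy_corr)
toy_threshold = bonferroni_threshold(0.05, n_groups)
```

      With 4 independent traits, as reported in the response, bonferroni_threshold(0.05, 4) gives the 0.0125 threshold, compared with roughly 0.0042 if all 12 traits were treated as independent.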

      3) The "natural selection acts on the placental DEGs ..." section is potentially misleading readers to assume that the manuscript reports evidence for positive selection on the observed DEG pattern between Tibetans and Han, which is not.

      a) Currently the section simply describes an overlap between DEGs and a set of 192 genes likely under positive selection in Tibetans (TSNGs). The overlap is quite small, leading to only 13 genes in total (Figure 6). The authors are currently not providing any statistical measure of whether this overlap is significantly enriched or at the level expected for random sampling.

      We understand the reviewer’s point that the overlap between the TSNGs and the DEGs from the three sets (4 for female + male; 9 for male only and 0 for female only) should be tested using a statistical method. Therefore, we adopted a permutation approach to evaluate the enrichment of the overlapped DEGs with TSNGs.

      For each permutation, we randomly extracted 192 genes from the human genome, then overlapped with DEGs of the three sets (female + male; female only and male only) and counted the gene numbers. After 10,000 permutations, we constructed a null distribution for each set, and found that the overlaps between DEGs and TSNGs were significantly enriched in the “female + male” set (p-value = 0.048) and the “male only” set (p-value = 9e-4), but not in the “female only” set (p-value = 0.1158) (Author response image 3). This result suggests that the observed DEGs are significantly enriched in TSNGs when compared to random sampling, especially for the male DEGs. We added this analysis in the revised manuscript.

      Author response image 3.

      The distribution of 10,000 permutation tests of counts of the overlapped genes between DEGs and the 192 randomly selected genes in the genome. The red-dashed lines indicate the observed values based on the 192 TSNGs.
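      The permutation procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' script: the gene universe, the function name and the seed are hypothetical, and the empirical p-value is taken as the fraction of random draws whose overlap is at least the observed one, which matches the granularity of the reported values (multiples of 1e-4 for 10,000 permutations). Adding 1 to both numerator and denominator is a common, more conservative variant.

```python
import random


def permutation_enrichment(universe, degs, target_set, n_perm=10_000, seed=0):
    """Empirical enrichment p-value for the overlap of degs and target_set.

    Null model: gene sets of the same size as target_set are drawn at random,
    without replacement, from universe.
    Returns (observed_overlap, p_value).
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    universe = list(universe)
    degs = set(degs)
    targets = set(target_set)
    observed = len(degs & targets)

    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(universe, len(targets))
        overlap = sum(gene in degs for gene in sample)
        if overlap >= observed:
            hits += 1
    return observed, hits / n_perm
```

      In the setting of the response, universe would be the list of genes in the human genome, target_set the 192 TSNGs, and degs one of the three DEG sets (female + male, female only, male only).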

      b) The authors are describing sets of DEGs that seem to affect important phenotypic changes in a consistent and adaptive direction. A relevant form of natural selection for this situation may be polygenic adaptation while the authors only consider strong positive selection at a single variant/gene level.

      We agree with the reviewer that polygenic adaptation might be a potential mechanism by which DEGs affect the adaptive phenotypes. Therefore, following the suggestion in the comment below, we conducted a polygenic adaptation analysis using eQTL information.

      c) The manuscript is currently providing no eQTL information that can explain the differential expression of key genes. The authors can actually do this based on the genotype and expression data of the individuals in this study. Combining eQTL info, they can set up a test for polygenic adaptation (e.g., Berg and Coop; https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004412). This will provide a powerful and direct test for the adaptiveness of the observed DEG pattern.

      Following the reviewer’s suggestion, we employed the PolyGraph (Racimo et al., 2018) tool to identify signatures of polygenic selection in Tibetans using eQTL information. We conducted eQTL analysis for the seven layers and collected a set of 5,251 eQTLs, covering the SNPs associated with gene expression at a significance threshold of p-value < 5e-8. To obtain a list of independent eQTLs, we removed SNPs in linkage disequilibrium (r2 > 0.2 in the 1000 Genomes Project). Finally, we obtained 176 independent eQTLs. At the same time, we generated a set of 1,308,436 independent SNPs of Tibetans as the control panel. The PolyGraph result showed that Tibetans have a clear signature of polygenic selection on gene expression (Bonferroni-corrected p-value = 0.003) (Author response image 4).

      We have added this result to the revised manuscript (Figure S4), and added a detailed description of polygenic adaptation to the Methods section.
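      The step that reduces the 5,251 eQTLs to an independent set can be illustrated with a greedy pruning sketch. The response states only the r2 > 0.2 cutoff, so the ordering by significance, the pairwise lookup table and the SNP identifiers here are assumptions made for illustration; dedicated tools are normally used for this on real genotype data.

```python
def prune_by_ld(snps_by_pvalue, r2, max_r2=0.2):
    """Greedy LD pruning.

    snps_by_pvalue: SNP ids ordered from most to least significant.
    r2: dict mapping frozenset({snp_a, snp_b}) to their r2; pairs that are
        absent from the dict are treated as unlinked (r2 = 0).
    A SNP is kept only if its r2 with every already-kept SNP is <= max_r2.
    """
    kept = []
    for snp in snps_by_pvalue:
        if all(r2.get(frozenset((snp, other)), 0.0) <= max_r2 for other in kept):
            kept.append(snp)
    return kept


# Hypothetical SNP ids and LD values.
toy_r2 = {
    frozenset(("snpA", "snpB")): 0.8,
    frozenset(("snpB", "snpC")): 0.5,
}
independent = prune_by_ld(["snpA", "snpB", "snpC"], toy_r2)
```

      In this toy case snpB is dropped because it is in strong LD with the more significant snpA, while snpC is kept because its only recorded LD partner has already been removed.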

      Author response image 4.

      Polygraphs for the eQTLs that show evidence for polygenic adaptation in the five-leaf tree built using the allele frequency data of 1001 Tibetans (Zheng et al. 2023) and the 1000 Genomes Project. The colors indicate the marginal posterior mean estimate of the selection parameter for variants associated with gene expression. r, q, s and v in the tree nodes refer to the nodes in terminal branches and internal branches. TBN, Tibetans; CHB, Han Chinese in Beijing; JPT, Japanese in Tokyo, Japan; CEU, Northern Europeans from Utah; YRI, Yoruba in Ibadan, Nigeria.

      4) The manuscript is currently only minimally discussing how findings are linked to EPAS1 and EGLN1 genes, which show the hallmark signature of positive selection in Tibetans. In fact, the authors' group previously reported male-specific association between EPAS1 SNPs and blood hemoglobin level. Many readers will be intrigued to see a discussion about this point.

      According to the reviewer’s suggestion, in the revised manuscript, we added a paragraph to discuss the relationship between our transcriptomic data and the two genes with strong selective signals, i.e. EPAS1 and EGLN1.

      “As the gene with the strongest signal of natural selection in Tibetans, EPAS1 has been reported in numerous studies for its contribution to high-altitude adaptation. In this study, we detected a significant expression reduction of EPAS1 in the Tibetan UC compared to the high-altitude Han. It was reported that the selected-for EPAS1 variants/haplotype were associated with lower hemoglobin levels in the Tibetan highlanders with a major effect (Beall et al., 2010; Peng et al., 2017), and the low hemoglobin concentration of Tibetans is causally associated with better reproductive success (Cho et al., 2017). Therefore, we speculate that the selective pressure on EPAS1 likely acts through its effect on hemoglobin, rather than directly on the reproductive traits. The down-regulation of EPAS1 in placentas likely reflects a blunted hypoxic response that may improve vasodilation of the UC for better blood flow, eventually leading to the higher BW in Tibetans (He et al., 2023). For EGLN1, another well-known gene in Tibetans, we detected a between-population expression difference in the male UC layer, but not in other placental layers. Considering that the known adaptation mechanism of EGLN1 is attributed to the two Tibetan-enriched missense mutations, the contribution of EGLN1 to the gene expression changes in the Tibetan UC is unexpected and worth exploring in the future.”

      Reviewer #2 (Public Review):

      In this manuscript, the authors use newly-generated, large-scale transcriptomic data along with histological data to attempt to dissect the mechanisms by which individuals with Tibetan ancestry are able to mitigate the negative effects of high elevation on birth weight. They present detailed analyses of the transcriptomic data and find significant sex differences in the placenta transcriptome.

      I have significant concerns about the conclusions that are presented. The analyses also lack the information necessary to evaluate their reliability.

      The experimental design does not include a low elevation comparison and thus cannot be used to answer questions about how ancestry influences hypoxia responses and thus birthweight at high elevations. Importantly, because the placenta tissues (and trophoblasts specifically) are quickly evolving, there are a priori good reasons to expect to find population differences irrespective of adaptive evolution that might contribute to fetal growth protection. There are also significant details missing in the analyses that are necessary to substantiate and replicate the analyses presented.

      Although the datasets are ultimately valuable as reference sets, the absence of low elevation comparisons for Tibetans and Han Chinese individuals undermines the ability of the authors to assess whether differences observed between populations are linked to hypoxia responses or variation in the outcomes of interest (i.e., hypoxia-dependent fetal growth restriction).

      We understand the reviewer’s concern about the lack of a low-altitude comparison. For the placental transcriptomic data, we previously compared placentas from high-altitude Tibetans and low-altitude Han Chinese, including 63 placentas of Tibetans living in Lhasa (elevation: 3650m) and 14 placentas of Han in Kunming (elevation: 1800m) (Peng et al. 2017). The main finding was that, in general, the expression profiles are similar between the high-altitude Tibetans and the low-altitude Han. In particular, most high-altitude Tibetans have a similar level of EPAS1 expression in the placenta as the lowlander Han Chinese, a reflection of Tibetans’ adaptation at altitude (Peng et al. 2017). In this study, we observed a significant down-regulation of EPAS1 in the Tibetan UC when compared to Han Chinese living at the same high altitude. Therefore, the observed differences between Tibetan and Han Chinese placentas at high altitude are due to the adaptation of Tibetans.

      For phenotypic data, we made a systematic comparison of reproductive outcomes in our previous studies (He et al., 2023; He et al., 2022). We proved that polygenic adaptation of reproduction in Tibetans tends to reduce the chance of preterm birth and eliminate the restriction on fetal development at high altitude. Compared to the high-altitude Han Chinese migrants, the high-altitude Tibetans exhibit less hypoxia-induced birth weight reduction and infant mortality, similar to the lowland Han Chinese used as the reference.

      In summary, although we cannot perform a combined analysis of our high-altitude data and the published low-altitude data because of batch effects and differences in sampling strategy, we obtained more supportive evidence for the adaptation of placental expression regulation in Tibetans. To be objective, we have discussed the limitation of lacking lowlander placenta data in the Discussion section.

      The authors attempt to tackle this phenotypic association by looking for correlations between gene networks (WGCNA) and individual genes with birthweight and other measurements collected at birth. I have some reservations about this approach with only two groups (i.e., missing the lowland comparison), but it is further problematic that the authors do not present data demonstrating that there are differences in birthweight or any other traits between the populations in the samples they collected.

      Throughout, I thus find conclusions about the adaptive value and hypoxia-responses made by the authors to be unsubstantiated and/or the data to be inadequate. There are also a gratuitous number of speculative statements about mechanisms by which differential gene expression leads to the protection of birthweight that are not evaluated and thus cannot be substantiated by the data presented.

      As currently presented and discussed, these results thus can only be used to evaluate population differences and tissue-specific variation therein.

      We understand the reviewer’s point that the observed differences of gene expression between Tibetan natives and Han immigrants living at high altitude might be explained by ancestral divergence, rather than hypoxia-associated response and genetic adaptation of native Tibetans.

      Firstly, we conclude that Tibetans have a better reproductive outcome based not only on the comparison of the two highlander groups living at the same altitude, but also on the direction of change relative to the lowland level. For example, we observed a significantly higher BW in Tibetans than in Han migrants in our dataset (35 Tibetans vs. 34 Han: p-value = 0.012) (Author response image 5), and in a larger dataset (He et al. 2023) (1,317 Tibetans vs. 87 Han: p-value = 1.1e-6), suggesting an adaptation of Tibetans because BW decreases with increasing altitude. The same logic applies to the other traits. Following the reviewer’s suggestion, we added these phenotype comparisons to the revised manuscript. Detailed information on the investigated samples and the statistical results was also added as supplementary tables in the revised version.

      For the WGCNA, we agree with the reviewer that the detected modules showing significant correlation with both population and other reproductive traits cannot be fully explained by adaptation of Tibetans. Therefore, we toned down the description of this section and added other possible explanations, such as population differences, in the discussion.

      Author response image 5.

      Comparison of 11 reproductive traits between Tibetans and Han immigrants. (A) comparison based on the dataset of this study (35 Tibetans vs. 34 Han); (B) correlation between BW and altitude (left panel) and comparison analysis based on the larger sample size (the data were retrieved from (He et al., 2023)). Univariate comparisons of the average of each trait across populations were made using the ANCOVA test in the R aov function with fetal sex and maternal age as covariates.

      There is also some important methodological information missing that makes it difficult or impossible to assess the quality of the underlying data and/or reproduce the analyses, further limiting the potential impact of these data:

      1) Transcriptome data processing and analyses: RNA quality information is not mentioned (i.e., RIN). What # of reads are mapped to annotated regions? How many genes were expressed in each tissue (important for contextualizing the # of DE genes reported - are these a significant proportion of expressed genes or just a small subset?).

      According to the reviewer’s suggestion, we added more information about transcriptome data processing and analyses in the revised Methods and Results:

      “After RNA extraction, we assessed the RNA integrity and purity using 1% agarose gel electrophoresis. The RIN value of extracted RNA was 7.56 ± 0.71.”

      “In total, 10.6 billion reads were mapped to the annotated regions, and 17,283 genes were expressed in all the investigated placentas.”

      “We identified 579 differentially expressed genes (DEGs) between Tibetans and Han, accounting for 3.4% of the total number of expressed genes.”

      2) The methods suggest that DE analyses were run using data that were normalized prior to reading them into DESeq2. DESeq2 has an internal normalization process and should not be used on data that was already normalized. Please clarify how and when normalization was performed.

      We used the raw read count matrix, rather than normalized data, as the input when conducting the differential expression analysis with DESeq2. We have updated our description in the Methods section of the revised manuscript.

      3) For enrichment analyses, the background gene set (all expressed genes? all genes in the genome? or only genes expressed in the tissue of interest?) has deterministic effects on the outcomes. The background sets are not specified for any analyses.

      We used the genes expressed in the placenta as the background gene set for enrichment analyses. Genes with more than two transcripts per million transcripts (TPM) were regarded as expressed, which is a commonly used criterion for RNA-seq data.
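      The expression filter used to define the background set can be sketched as follows. The TPM formula is the standard one (length-normalized counts rescaled to sum to one million). How the more-than-two-TPM cutoff was applied across samples (per sample, mean, or any sample) is not stated in the response, so the sketch filters a single vector of TPM values; the gene names and numbers are hypothetical.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million from raw counts and gene lengths in kilobases."""
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(rates.values())
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def expressed_genes(tpm_values, min_tpm=2.0):
    """Genes with strictly more than min_tpm TPM; used as the enrichment background."""
    return {gene for gene, value in tpm_values.items() if value > min_tpm}


# Hypothetical two-gene example with equal gene lengths.
toy_tpm = tpm({"geneA": 1, "geneB": 3}, {"geneA": 1.0, "geneB": 1.0})
background = expressed_genes({"geneX": 1.5, "geneY": 2.0, "geneZ": 10.0})
```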

      4) In the WGCNA analysis, P-values for correlations of modules with phenotype data (birthweight etc.) should be corrected for multiple testing (i.e., running the module correlation for each outcome variables) and p.adjust used to evaluate associations to limit false positives given the large number of correlations being run.

      As we explained in response to comment #2 of Reviewer #1, we used a more stringent significance threshold of p-value = 0.0125 (0.05/4) as the final threshold to correct for the multiple testing introduced by multiple traits in the WGCNA analysis.

      5) The plots for umbilical histological data (Fig 5 C) contain more than 5 points, but the use of replicate sections is not specified. If replicate sections were used, the authors should control for non-independence of replicate sections in their analyses (i.e., random effects model).

      We did not use replicate sections. Figure 5C shows the umbilical artery intima and media. Because each human umbilical cord includes two umbilical arteries, the 5 vs. 5 individual comparison generates 10 vs. 10 umbilical artery comparison. To be clearer, we added an explanation in the revised manuscript.

      On more minor notes:

      There is significant and relevant published data on sex differences and hypoxia in rodents (see Cuffe et al 2014, "Mid- to late-term hypoxia in the mouse alters placental morphology, glucocorticoid regulatory pathways, and nutrient transporters in a sex-specific manner" and review by Siragher and Sferuzzi-Perro 2021, "Placental hypoxia: What have we learnt from small animal models?"), and historical work reporting sex differences in placental traits associated with high elevation adaptation in Andeans (series of publications by Moira Jackson in the late 1980s, reviewed in Wilsterman and Cheviron 2021, "Fetal growth, high altitude, and evolutionary adaptation: A new perspective").

      We thank the reviewer for the constructive comments on literature review. We have cited and discussed them in the revised manuscript.

      Reviewer #3 (Public Review):

      More than 80 million people live at high altitude. This impacts health outcomes, including those related to pregnancy. Longer-lived populations at high altitudes, such as the Tibetan and Andean populations show partial protection against the negative health effects of high altitude. The paper by Yue sought to determine the mechanisms by which the placenta of Tibetans may have adapted to minimise the negative effect of high altitude on fetal growth outcomes. It compared placentas from pregnancies from Tibetans to those from the Han Chinese. It employed RNAseq profiling of different regions of the placenta and fetal membranes, with some follow-up of histological changes in umbilical cord structure and placental structure. The study also explored the contribution of fetal sex in these phenotypic outcomes.

      A key strength of the study is the large sample sizes for the RNAseq analysis, the analysis of different parts of the placenta and fetal membranes, and the assessment of fetal sex differences.

      A main weakness is that this study, and its conclusions, largely rely on transcriptomic changes informed by RNAseq. Changes in genes and pathways identified through bioinformatic analysis were not verified by alternate methods, such as by western blotting, which would add weight to the strength of the data and its interpretations. There is also a lack of description of patient characteristics, so the reader is unable to make their own judgments on how placental changes may link to pregnancy outcomes. Another weakness is that the histological analyses were performed on n=5 per group and were rudimentary in nature.

      For the weaknesses raised by the reviewer, here are our responses:

      (1) Considering that our conclusions largely rely on the transcriptomic data, we agree with the reviewer that more experiments are needed to validate the results from our transcriptomic data. However, this study mainly aimed to provide a transcriptomic landscape of the high-altitude placenta and to characterize the gene-expression differences between native Tibetans and Han migrants. Exploring the molecular mechanisms is not the main task of this study, and more validation experiments are warranted in the future.

      (2) Regarding the lack of description of patient characteristics, we provided results at three levels on the placental changes of Tibetans: macroscopic phenotypes (higher placental weight and volume), histological phenotypes (larger umbilical vein walls and umbilical artery intima and media; lower syncytial knots/villi ratios) and transcriptomic phenotypes (DEG and differential modules). Combined with the previous studies, these placental changes suggest a better reproductive outcome. For example, the placental volume shows a significantly positive correlation with birth weight (R = 0.31, p-value = 2.5e-16); therefore, the larger placental volume of Tibetans is beneficial to fetal development at high altitude. In addition, the larger umbilical vein wall and umbilical artery intima and media of Tibetans can explain their adaptation in preventing preeclampsia.

      (3) For the sample size of histological analyses, we understand the reviewer’s concern that 5 vs. 5 samples is not a large sample for histological analyses. This is because it was difficult to collect high-altitude Han placenta samples; we obtained only 13 Han samples, from which we selected 5 infant-sex-matched samples.

      References

      Beall, C.M., Cavalleri, G.L., Deng, L.B., Elston, R.C., Gao, Y., Knight, J., Li, C.H., Li, J.C., Liang, Y., McCormack, M., et al. (2010). Natural selection on EPAS1 (HIF2 alpha) associated with low hemoglobin concentration in Tibetan highlanders. P Natl Acad Sci USA 107, 11459-11464.

      Cho, J.I., Basnyat, B., Jeong, C., Di Rienzo, A., Childs, G., Craig, S.R., Sun, J., and Beall, C.M. (2017). Ethnically Tibetan women in Nepal with low hemoglobin concentration have better reproductive outcomes. Evol Med Public Health 2017, 82-96.

      He, Y., Guo, Y., Zheng, W., Yue, T., Zhang, H., Wang, B., Feng, Z., Ouzhuluobu, Cui, C., Liu, K., et al. (2023). Polygenic adaptation leads to a higher reproductive fitness of native Tibetans at high altitude. Curr Biol.

      He, Y., Li, J., Yue, T., Zheng, W., Guo, Y., Zhang, H., Chen, L., Li, C., Li, H., Cui, C., et al. (2022). Seasonality and Sex-Biased Fluctuation of Birth Weight in Tibetan Populations. Phenomics 2, 64-71.

      Peng, Y., Cui, C., He, Y., Ouzhuluobu, Zhang, H., Yang, D., Zhang, Q., Bianbazhuoma, Yang, L., He, Y., et al. (2017). Down-Regulation of EPAS1 Transcription and Genetic Adaptation of Tibetans to High-Altitude Hypoxia. Mol Biol Evol 34, 818-830.

      Racimo, F., Berg, J.J., and Pickrell, J.K. (2018). Detecting Polygenic Adaptation in Admixture Graphs. Genetics 208, 1565-1584.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In "Changes in wing morphology..." Roy et al investigate the potential allometric scaling in wing morphology and wing kinematics in 8 different hoverfly species. Their study nicely combines different new and classic techniques, investigating flight in an important, yet understudied alternative pollinator. I want to emphasize that I have been asked to review this from a hoverfly biology perspective, as I do not work on flight kinematics. I will thus not review that part of the work.

      Strengths:

      The paper is well-written and the figures are well laid out. The methods are easy to follow, and the rationale and logic for each experiment are easy to follow. The introduction sets the scene well, and the discussion is appropriate. The summary sentences throughout the text help the reader.

      We thank the reviewer for these positive comments on our study.

      Weaknesses:

      The ability to hover is described as useful for either feeding or mating. However, several of the North European species studied here would not use hovering for feeding, as they tend to land on the flowers that they feed from. I would therefore argue that the main selection pressure for hovering ability could be courtship and mating. If the authors disagree with this, they could back up their claims with the literature.

      We thank the reviewer for this insight on potential selection pressures on hovering flight. As suggested, we now put the main emphasis on selection related to mating flight (lines 106–111).

      On that note, a weakness of this paper is that the data for both sexes are merged. If we agree that hovering may be a sexually dimorphic behaviour, then merging flight dynamics from males and females could be an issue in the interpretation. I understand that separating males from females in the movies is difficult, but this could be addressed in the Discussion, to explain why you do not (or do) think that this could cause an issue in the interpretation.

      We acknowledge that not distinguishing sexes in the flight experiment prevents us from investigating the hypothesis that selection may act especially on male flight. This weakness was not addressed in our first manuscript and is now discussed in the revised Discussion section. We nuanced the interpretation and suggested further investigation of flight dimorphism (lines 726–729).

      The flight arena is not very big. In my experience, it is very difficult to get hoverflies to fly properly in smaller spaces, and definitely almost impossible to get proper hovering. Do you have evidence that they were flying "normally" and not just bouncing between the walls? How long was each 'flight sequence'? You selected the parts with the slowest flight speed, presumably to get as close to hovering as possible, but how sure are you that this represented proper hovering and not a brief slowdown of thrust?

      We very much agree with the reviewer that flight studied in laboratory conditions does not perfectly reflect natural flight behavior. Moreover, having individual hoverflies perform stable hovering in the flight arena, in the intersecting field of view of all three cameras, is quite challenging. Therefore, we do not claim that we studied “true” hovering (i.e. flight speed = 0 m/s), but that we attempted to get as close as possible to true hovering by selecting the flight sections with the lowest flight speeds for our analysis.

      In most animal flight studies, hovering is defined as flight with advance ratios J<0.1, i.e. when the forward flight speed is less than 10% of the wingbeat-induced speed of the wingtip (Ellington, 1984a; Fry et al., 2005; Liu and Sun, 2008). By selecting the low flight-speed wingbeats for our analysis, the mean advance ratio in our experiment was 0.08±0.02 (mean±sd), providing evidence that the hoverflies were operating close to a hovering flight mode. This is explained in both the methods and results sections (lines 228–231 and 467–469, respectively).

      We however acknowledge that this definition of hovering, although generally accepted, is not perfect. We edited the manuscript to clarify that our experiment does not quantify perfect hovering (lines 186–188). We moreover added the mean±sd duration of the recorded flight sequence from which the slowest wingbeat was selected (line 179), as this information was missing, and we further describe the behaviour of the hoverflies during the experiment (lines 168–169).
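      The hovering criterion discussed in this response (advance ratio J < 0.1) can be made concrete with a short sketch. It assumes the usual form of the advance ratio, flight speed divided by the mean wingtip speed 2ΦnR, with Φ the stroke amplitude in radians, n the wingbeat frequency and R the wing length; the response itself only describes J verbally, and the numerical values in the example are hypothetical rather than measured hoverfly data.

```python
def advance_ratio(flight_speed, stroke_amplitude_rad, wingbeat_freq_hz, wing_length_m):
    """Advance ratio J = V / (2 * Phi * n * R).

    The denominator is the mean wingtip speed over a wingbeat: the tip sweeps
    an arc of Phi * R twice per beat (downstroke and upstroke), n times per second.
    """
    mean_tip_speed = 2.0 * stroke_amplitude_rad * wingbeat_freq_hz * wing_length_m
    return flight_speed / mean_tip_speed


def is_hovering(j, threshold=0.1):
    """Conventional criterion: hovering when J is below the threshold."""
    return j < threshold


# Hypothetical wingbeat: 0.2 m/s flight speed, 2 rad amplitude, 200 Hz, 10 mm wing.
j_example = advance_ratio(0.2, 2.0, 200.0, 0.01)
```

      With these illustrative numbers the mean wingtip speed is 8 m/s, so J is 0.025 and the wingbeat would count as hovering under the J < 0.1 criterion.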

      Your 8 species are evolutionarily well-spaced, but as they were all selected from a similar habitat (your campus), their ecology is presumably very similar. Can this affect your interpretation of your data? I don't think all 6000 species of hoverflies could be said to have similar ecology - they live across too many different habitats. For example, on line 541 you say that wingbeat kinematics were stable across hoverfly species. Could this be caused by their similar habitat?

      We agree with the reviewer that similarity in habitat and ecology might partially explain the similarity in the wingbeat kinematics that we observe. But this similarity in ecology between the eight studied species is in fact a design feature of our study. Here, we aim to study the effect of size on hoverfly flight, and so we designed our study such that we maximize size differences and phylogenetic spread among the eight species, while minimizing variations in habitat, ecology and flight behavior (~hovering). This allows us to best test for the effect of differences in size on the morphology, kinematics and aerodynamics of hovering flight.

      Despite this, we agree with the reviewer that it would be interesting to test whether the observed allometric morphological scaling and kinematic similarity is also present beyond the species that we studied. In our revision, we therefore extended our analysis to address this question. Performing additional flight experiments and fluid mechanics simulations was beyond the scope of our current study, but extending the morphological scaling analyses was certainly possible.

      In our revised study, we therefore extended our morphological scaling analysis by including the morphology of twenty additional hoverfly species. This extended dataset includes wing morphology data of 74 museum specimens from Naturalis Biodiversity Centre (Leiden, the Netherlands), including two males and two females per species, whenever possible (4.2±1.7 individuals per species (mean±sd)). This extended analysis shows that the allometric scaling of wing morphology with size is robust across the larger sample of species, from a wider range of habitats and ecologies. Nevertheless, we advocate for additional flight measurements in species from different habitats to ascertain the generality of our results (lines 729–732).

      Reviewer #2 (Public review):

      Summary

      Le Roy et al quantify wing morphology and wing kinematics across eight hoverfly species that differ in body mass; the aim is to identify how weight support during hovering is ensured. Wing shape and relative wing size vary significantly with body mass, but wing kinematics are reported to be size-invariant. On the basis of these results, it is concluded that weight support is achieved solely through size-specific variations in wing morphology and that these changes enabled hoverflies to decrease in size throughout their phylogenetic history. Adjusting wing morphology may be preferable compared to the alternative strategy of altering wing kinematics, because kinematics may be under strong evolutionary and ecological constraints, dictated by the highly specialised flight and ecology of the hoverflies.

      Strengths

      The study deploys a vast array of challenging techniques, including flight experiments, morphometrics, phylogenetic analysis, and numerical simulations; it so illustrates both the power and beauty of an integrative approach to animal biomechanics. The question is well motivated, the methods appropriately designed, and the discussion elegantly and convincingly places the results in broad biomechanical, ecological, evolutionary, and comparative contexts.

      We thank the reviewer for appreciating the strengths of our study.

      Weaknesses

      (1) In assessing evolutionary allometry, it is key to identify the variation expected from changes in size alone. The null hypothesis for wing morphology is well-defined (isometry), but the equivalent predictions for kinematic parameters remain unclear. Explicit and well-justified null hypotheses for the expected size-specific variation in angular velocity, angle-of-attack, stroke amplitude, and wingbeat frequency would substantially strengthen the paper, and clarify its evolutionary implications.

      We agree with the reviewer that the expected scaling of wingbeat kinematics with size was indeed unclear in our initial version of the manuscript. In our revised manuscript (and supplement), we now explicitly define how all kinematic parameters should scale with size under kinematic similarity, and how they should scale for maintaining weight support across various sizes. These are explained in the introduction (lines 46–78), method section (lines 316–327), and dedicated supplementary text (see Supplementary Info section “Geometric and kinematic similarity and scaling for weight support”). Here, we now also provide a thorough description of the isometric scaling of morphology, and scaling of the kinematics parameters under kinematic similarity.

      (2) By relating the aerodynamic output force to wing morphology and kinematics, it is concluded that smaller hoverflies will find it more challenging to support their body mass - a scaling argument that provides the framework for this work. This hypothesis appears to stand in direct contrast to classic scaling theory, where the gravitational force is thought to present a bigger challenge for larger animals, due to their disadvantageous surface-to-volume ratios. The same problem ought to occur in hoverflies, for wing kinematics must ultimately be the result of the energy injected by the flight engine: muscle. Much like in terrestrial animals, equivalent weight support in flying animals thus requires a positive allometry of muscle force output. In other words, if a large hoverfly is able to generate the wing kinematics that suffice to support body weight, an isometrically smaller hoverfly should be, too (but not vice versa). Clarifying the relation between the scaling of muscle force input, wing kinematics, and weight support would resolve the conflict between these two contrasting hypotheses, and considerably strengthen the biomechanical motivation and interpretation.

      The reviewer highlights a crucial aspect of our study: our perspective on the aerodynamic challenges associated with becoming smaller or larger. This comment made us realize that our viewpoint might be unconventional regarding general scaling literature and requires further clarification.

      Our approach is focused on the disadvantage of a reduction in size, in contrast with classic scaling theory, which focuses on the disadvantage of an increase in size. As correctly stated by the reviewer, producing an upward-directed force to maintain weight support is often considered the main challenge constrained by size. Here, researchers often focus on the limitations of the motor system, and specifically muscle force: as animals increase in size, the ability to achieve weight support is limited by muscle force availability. An isometric growth in muscle cannot sustain the increased weight, due to the disadvantageous surface-to-volume ratio.

      In animal flight, this detrimental effect of size on the muscular motor system is also present, particularly for large flying birds. But for natural flyers, there is also a detrimental effect of size on the propulsion system, being the flapping wings. The aerodynamic forces produced by a beating wing scale linearly with the second-moment-of-area of the wing. Under isometry, this second-moment-of-area decreases at a higher rate than body mass, and thus producing enough lift for weight support becomes more challenging with decreasing size. Because we study tiny insects, our study focuses precisely on this constraint on the wing-based propulsion system, and not on the muscular motor system.
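
      As a sketch of this scaling argument, assume a quasi-steady force model in which the wingbeat-average aerodynamic force $F$ is proportional to air density $\rho$, the squared angular velocity of the wing $\omega^{2}$, and the wing second-moment-of-area $S_{2}$. The symbols and the characteristic length $l$ are shorthand introduced here and may differ from the notation in the manuscript. Under isometry, lengths scale as $l \propto m^{1/3}$, so that:

      $$
      F \propto \rho\,\omega^{2} S_{2}, \qquad S_{2} \propto l^{4} \propto m^{4/3}, \qquad \frac{F}{mg} \propto \omega^{2}\, m^{1/3}.
      $$

      If the kinematics are size-invariant ($\omega \propto m^{0}$), the force available relative to body weight scales as $m^{1/3}$ and therefore drops as body mass decreases.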

      We revised the manuscript to better explain how physical scaling laws differentially affect force production by the muscular flight motor system and the wingbeat-induced propulsion system (lines 46–78).

      (3) The main conclusion - that evolutionary miniaturization is enabled by changes in wing morphology - is only weakly supported by the evidence. First, although wing morphology deviates from the null hypothesis of isometry, the difference is small, and hoverflies about an order of magnitude lighter than the smallest species included in the study exist. Including morphological data on these species, likely accessible through museum collections, would substantially enhance the confidence that size-specific variation in wing morphology occurs not only within medium-sized but also in the smallest hoverflies, and has thus indeed played a key role in evolutionary miniaturization.

      We thank the reviewer for the suggestion to add additional specimens from museum collections to strengthen the conclusions of our work. In our revised study, we did so by adding the morphology of 20 additional hoverfly species, from the Naturalis Biodiversity Centre (Leiden, the Netherlands). This extended dataset includes wing morphology data of 74 museum specimens, and whenever possible we sampled at least two males and two females (4.2±1.7 individuals per species (mean±sd)). This extended analysis shows that the allometric scaling of wing morphology with size is robust across the larger sample of species, including smaller ones. We now discuss these additional results explicitly in the revised manuscript (see Discussion).

      Second, although wing kinematics do not vary significantly with size, clear trends are visible; indeed, the numerical simulations revealed that weight support is only achieved if variations in wing beat frequency across species are included. A more critical discussion of both observations may render the main conclusions less clear-cut, but would provide a more balanced representation of the experimental and computational results.

      We agree with the reviewer that variations in wingbeat kinematics between species, and specifically wingbeat frequency, are important and non-negligible. As mentioned by the reviewer, this is most apparent from the fact that weight support is only achieved with the species-specific wingbeat frequency. To address this in a more balanced and thorough way, we revised the final section of our analysis approach by including changes in wingbeat kinematics in that analysis. By doing so, we now explicitly show that allometric changes in wingbeat frequency are important for maintaining weight support across the sampled size range, but that allometric scaling of morphology has a stronger effect. In fact, the relative contributions of morphology and kinematics to maintaining weight support across sizes are 81% and 22%, respectively (Figure 7). We now discuss this new analysis and its results thoroughly in the revised manuscript (lines 621–629, 650–664), resulting in a more balanced discussion and conclusion about the outcome of our study. We sincerely thank the reviewer for suggesting that we look more closely into the effect of variations in wingbeat kinematics on aerodynamic force production, as the revised analysis strengthened the study and its results.

      In many ways, this work provides a blueprint for work in evolutionary biomechanics; the breadth of both the methods and the discussion reflects outstanding scholarship. It also illustrates a key difficulty for the field: comparative data is challenging and time-consuming to procure, and behavioural parameters are characteristically noisy. Major methodological advances are needed to obtain data across large numbers of species that vary drastically in size with reasonable effort, so that statistically robust conclusions are possible.

      We thank the reviewer for their encouraging words about the scholarship of our work. We will continue to improve our methods and techniques for performing comparative evolutionary biomechanics research, and are happy to jointly develop this emerging field of research.

      Reviewer #3 (Public review):

      The paper by Le Roy and colleagues seeks to ask whether wing morphology or wing kinematics enable miniaturization in an interesting clade of agile flying insects. Isometry argues that insects cannot maintain both the same kinematics and the same wing morphology as body size changes. This raises a long-standing question of which varies allometrically. The authors do a deep dive into the morphology and kinematics of eight specific species across the hoverfly phylogeny. They show broadly that wing kinematics do not scale strongly with body size, but several parameters of wing morphology do in a manner different from isometry leading to the conclusion that these species have changed wing shape and size more than kinematics. The authors find no phylogenetic signal in the specific traits they analyze and conclude that they can therefore ignore phylogeny in the later analyses. They use both a quasi-steady simplification of flight aerodynamics and a series of CFD analyses to attribute specific components of wing shape and size to the variation in body size observed. However, the link to specific correlated evolution, and especially the suggestion of enabling or promoting miniaturization, is fraught and not as strongly supported by the available evidence.

      We thank the reviewer for the accurate description of our work, and the time and energy put into reviewing our paper. We regret that the reviewer found our conclusions with respect to miniaturization fraught and not strongly supported by the evidence. In our revision, we addressed this by no longer focusing primarily on miniaturization, by extending our morphology analysis to 20 additional species (Figures 4 and 5), improving our analysis of both the kinematics and morphology data (Figure 7), and by discussing our results in a more balanced way (see Discussion). We hope that the reviewer finds the revised manuscript of sufficient quality for publication in eLife.

      The aerodynamic and morphological data collection, modeling, and interpretation are very strong. The authors do an excellent job combining a highly interpretable quasi-steady model with CFD and geometric morphometrics. This allows them to directly parse out the effects of size, shape, and kinematics.

      We thank the reviewer for assessing our experimental and modelling approach as very strong.

      Despite the lack of a relationship between wing kinematics and size, there is a large amount of kinematic variation across the species and individual wing strokes. The absolute differences in Figure 3F - I could have a very large impact on force production but they do indeed not seem to change with body size. This is quite interesting and is supported by aerodynamic analyses.

      We agree with the reviewer that there are important and non-negligible variations in wingbeat kinematics between species. As mentioned by the reviewer, although these kinematics do not scale significantly with body mass, the interspecific variations are important for maintaining weight support during hovering flight. We thus also agree with the reviewer that these kinematic variations are interesting and deserve further investigation.

      In our revised study, we did so by including these wingbeat kinematic variations in our analysis of the effect of variations in morphology and kinematics on aerodynamic force production for maintaining in-flight weight support across the sampled size range (lines 422–444, Figure 7). By doing so, we now explicitly show that variations in wingbeat kinematics are important for maintaining weight support across sizes, but that allometric scaling of morphology has a stronger effect. In fact, the relative contributions of adaptations in morphology and kinematics to maintaining weight support across sizes are 81% and 22%, respectively (Figure 7). We now discuss this new analysis and its results in the revised manuscript (lines 621–629, 650–664), resulting in a more balanced discussion about the relative importance of adaptations in morphology and kinematics. We hope the reviewer appreciates this newly added analysis.

      The authors switch between analyzing their data based on individuals and based on species. This creates some pseudoreplication concerns in Figures 4 and S2 and it is confusing why the analysis approach is not consistent between Figures 4 and 5. In general, the trends appear to be robust to this, although the presence of one much larger species weighs the regressions heavily. Care should be taken in interpreting the statistical results that mix intra- and inter-specific variation in the same trend.

      We agree that it was sometimes unclear whether our analysis is performed at the individual or species level. To improve clarity and avoid pseudoreplication, we now analyze all data at the species level, using phylogenetically informed analyses. Because we think that showing within-species variation is nonetheless informative, we added dedicated figures to the supplement (Figures S3 and S5) in which we show data at the individual level, as equivalents of Figures 4 and 5, which show data at the species level. Note that this cannot be done for flight data due to our experimental procedure: because we performed flight experiments with multiple individuals in a single experimental setup, pseudoreplication is possible for these flight data. This is explained in the manuscript (lines 167–175). All morphological measurements, however, were done on a carefully organized series of specimens, and thus pseudoreplication is not possible there.

      The authors based much of their analyses on the lack of a statistically significant phylogenetic signal. The statistical power for detecting such a signal is likely very weak with 8 species. Even if there is no phylogenetic signal in specific traits, that does not necessarily mean that there is no phylogenetic impact on the covariation between traits. Many comparative methods can test the association of two traits across a phylogeny (e.g. a phylogenetic GLM) and a phylogenetic PCA would test if the patterns of variation in shape are robust to phylogeny.

      After extending our morphological dataset from 8 to 28 species, by including 20 additional species from a museum collection, we increased statistical power and found a significant phylogenetic signal in all morphological traits, except for the second moment of area (lines 458–460, Table S2). Although we do not detect an effect of phylogeny on flight traits, likely due to the limited number of species for which flight was quantified (n=8), we agree with the reviewer’s observation that the absence of a phylogenetic signal does not rule out the potential influence of phylogeny on the covariation between traits. This is now explicitly discussed in the manuscript (lines 599–608). As mentioned in the previous comment, we now test all relationships between body mass and other traits using phylogenetic generalized least squares (PGLS) regressions, thereby accounting for the impact of phylogeny everywhere. The revised analyses produce broadly similar results to those of our initial study, and so the main conclusions remain valid. We sincerely thank the reviewer for their suggestion to revise our statistical analysis, because the revised phylogenetic analysis strengthens our study as a whole.
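
      For readers less familiar with PGLS, the idea can be sketched as follows. This is a minimal, dependency-free illustration and not the analysis pipeline used for the manuscript. It assumes a Brownian-motion model in which the covariance between two species equals their shared branch length. The four species, the trait values, and the covariance matrix are all invented, and the helper names `solve` and `pgls_fit` are hypothetical.

```python
def solve(a, b):
    """Solve a @ z = b by Gauss-Jordan elimination with partial pivoting.

    a is an n x n list of lists, b is an n x k list of lists.
    """
    n = len(a)
    aug = [list(map(float, a[i])) + list(map(float, b[i])) for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_val = aug[col][col]
        aug[col] = [v / pivot_val for v in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]


def pgls_fit(x, y, phylo_cov):
    """Fit y = intercept + slope * x by generalized least squares.

    phylo_cov[i][j] is the assumed shared branch length of species i and j.
    """
    design = [[1.0, float(xi)] for xi in x]
    # V^-1 [X | y], obtained by solving V z = [X | y].
    weighted = solve(phylo_cov, [row + [float(yi)] for row, yi in zip(design, y)])
    n = len(design)
    # Normal equations: (X' V^-1 X) beta = X' V^-1 y.
    lhs = [[sum(design[i][p] * weighted[i][q] for i in range(n)) for q in range(2)]
           for p in range(2)]
    rhs = [[sum(design[i][p] * weighted[i][2] for i in range(n))] for p in range(2)]
    (intercept,), (slope,) = solve(lhs, rhs)
    return intercept, slope


# Invented data: log10 body mass and log10 wing length for four made-up species.
log_mass = [0.5, 1.0, 1.5, 2.0]
log_length = [0.60, 0.75, 0.88, 1.05]
# Invented covariance: species 0-1 and species 2-3 are sister pairs.
phylo_cov = [
    [1.0, 0.6, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]

intercept, slope = pgls_fit(log_mass, log_length, phylo_cov)
print(f"PGLS slope: {slope:.3f} (isometric expectation for a length: 0.333)")
```

      With an identity covariance matrix the estimator reduces to ordinary least squares. In practice, a dedicated phylogenetic comparative package would also estimate the covariance structure from the tree and provide confidence intervals.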

      The analysis of miniaturization on the broader phylogeny is incomplete. The conclusion that hoverflies tend towards smaller sizes is based on an ancestral state reconstruction. This is difficult to assess because of some important missing information. Specifically, such reconstructions depend on branch lengths and the model of evolution used, which were not specified. It was unclear how the tree was time-calibrated. Most often ancestral state reconstructions utilize a maximum likelihood estimate based on a Brownian motion model of evolution but this would be at odds with the hypothesis that the clade is miniaturizing over time. Indeed such an analysis will be biased to look like it produces a lot of changes towards smaller body size if there is one very large taxa because this will heavily weight the internal nodes. Even within this analysis, there is little quantitative support for the conclusion of miniaturization, and the discussion is restricted to a general statement about more recently diverged species. Such analyses are better supported by phylogenetic tests of directedness in the trait over time, such as fitting a model with an adaptive peak or others.

      We thank the reviewer for their expert insight into our ancestral state estimate of body size. We agree that the accuracy of this estimate is rather low. Based on the comments by the reviewer, we have now revised our main analysis and results, by no longer basing them on the apparent evolutionary miniaturization of hoverflies, but instead on the observed variations in size in our studied hoverfly species. As a result, we removed the figure mapping ancestral state estimates (called figure S1 in the first version) from the manuscript. We now explicitly mention that ascertaining the evolutionary directedness of body size is beyond the scope of our work, but that we nonetheless focus on the aerodynamic challenge of size reduction (lines 609–615).

      Setting aside whether the clade as a whole tends towards smaller size, there is a further concern about the correlation of variation in wing morphology and changes in size (and the corresponding conclusion about lack of co-evolution in wing kinematics). Showing that there is a trend towards smaller size and a change in wing morphology does not test explicitly that these two are correlated with the phylogeny. Moreover, the subsample of species considered does not appear to recapitulate the miniaturization result of the larger ancestral state reconstruction.

      As also mentioned above, we agree with the reviewer that we cannot ascertain the trajectory of body size evolution in the diversification of hoverflies. We therefore revised our manuscript such that we no longer focus explicitly on miniaturization; instead, we discuss how morphology and kinematics scale with size, independently of potential trends over the phylogeny. To do so, we revised the title, abstract, results and discussion accordingly.

      Given the limitations of the phylogenetic comparative methods presented, the authors did not fully support the general conclusion that changes in wing morphology, rather than kinematics, correlate with or enable miniaturization. The aerodynamic analysis across the 8 species does however hold significant value and the data support the conclusion as far as it extends to these 8 species. This is suggestive but not conclusive that the analysis of consistent kinematics and allometric morphology will extend across the group and extend to miniaturization. Nonetheless, hoverflies face many shared ecological pressures on performance and the authors summarize these well. The conclusions of morphological allometry and conserved kinematics are supported in this subset and point to a clade-wide pattern without having to support an explicit hypothesis about miniaturization.

      The reviewer correctly argues that we should be careful about extending our analysis based on eight species to hoverflies in general, and especially about extending it to miniaturization in this family of insects. As mentioned above, we therefore no longer specifically focus on miniaturization. Moreover, we extended our analysis by including the morphology of 20 additional species of hoverflies, sampled from a museum collection. We hope that the reviewer agrees with this more balanced and focused discussion of our study.

      The data and analyses on these 8 species provide an important piece of work on a group of insects that are receiving growing attention for their interesting behaviors, accessibility, and ecologies. The conclusions about morphology vs. kinematics provide an important piece to a growing discussion of the different ways in which insects fly. Sometimes morphology varies, and sometimes kinematics depending on the clade, but it is clear that morphology plays a large role in this group. The discussion also relates to similar themes being investigated in other flying organisms. Given the limitations of the miniaturization analyses, the impact of this study will be limited to the general question of what promotes or at least correlates with evolutionary trends towards smaller body size and at what phylogenetic scale body size is systematically decreasing.

      We thank the reviewer for their encouraging words about the importance of our work on hoverfly flight. As suggested by the reviewer, we narrowed down the main question of our study by no longer focusing on apparent miniaturization, but instead on the correlation between wing morphology, wingbeat kinematics and variations in size.

      In general, there is an important place for work that combines broad phylogenetic comparison of traits with more detailed mechanistic studies on a subset of species, but a lot of care has to be taken about how the conclusions generalize. In this case, since the miniaturization trend does not extend to the 8 species subsample of the phylogeny and is only minimally supported in the broader phylogeny, the paper warrants a narrower conclusion about the connection between conserved kinematics and shared life history/ecology.

      We truly appreciate the reviewer’s positive assessment of the importance of our work and study. We also thank the reviewer for their advice to generalize the outcome of our work in a more balanced way. Based on the above comments and suggestions of the reviewer, we did so by revising several aspects of our study, including adding species to our study, amending the analysis, and revising the title, abstract, results and discussion sections. We hope that the reviewer finds the revised manuscript of sufficient quality for final publication in eLife.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations for the authors):

      Figure S1 is lovely. I would recommend merging it with Figure 1 so that it does not disappear.

      We appreciate the reviewer's comment. However, reviewer 3 had several points of concern about the underlying analysis, which made us realize that our ancestral state estimation analysis does not conclusively support a miniaturization trend. We are therefore no longer focusing on miniaturization when interpreting our results.

      Figure 4 is beautiful. The consistent color coding throughout is very helpful.

      We thank the reviewer for this comment.

      Sometimes spaces are missing before brackets, and sometimes there are double brackets, or random line break.

      We did our best to remove these typos.

      Should line 367 refer to Table S2?

      Table S2 is now referred to when mentioning the result of the phylogenetic signal analysis (line 460 in the revised manuscript).

      Can you also refer to Figure 2 on line 377?

      Good suggestion, and so we now do so (line 462 in the revised manuscript).

      Lines 497-512: Please refer to relevant figures.

      We now refer to figure 4, and its panels (lines 621–629 in the revised manuscript).

      Figure legend 1: Do you need to say that the second author took the photos?

      We removed this reference.

      Figure legend 4: "(see top of A and B)" is not aligned with the figure layout.

      We corrected this.

      Figure 5 seems to have a double legend, A, B then A, B. Panel A says it's color-coded for body mass, but the figure seems to be color-coded for species.

      Thank you for noting this. We corrected this in the figure legend.

      Figure 6 legend: Can you confidently say that they were hovering, or do you need to modify this to flying?

      The CFD simulations were performed in full hovering (U<sub>∞</sub> = 0 m/s), but real flying hoverflies will by definition never hover perfectly. As explained in our manuscript, we define a hovering flight mode as flying with advance ratios smaller than 0.1 (Ellington, 1984a). Based on this, we can state that our hoverflies were flying in a hovering mode. We hope that the reviewer agrees with this approach.
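
      The hovering criterion can be illustrated with a short sketch. It assumes the commonly used definition of the advance ratio as forward flight speed divided by the mean wingtip speed, J = U / (2 Φ f R), with stroke amplitude Φ in radians, wingbeat frequency f, and wing length R. The function names and the numerical values are hypothetical and are not measurements from our study.

```python
import math


def advance_ratio(flight_speed, stroke_amplitude_deg, wingbeat_freq, wing_length):
    """Advance ratio J = U / (2 * Phi * f * R), with Phi converted to radians."""
    stroke_amplitude = math.radians(stroke_amplitude_deg)
    mean_wingtip_speed = 2.0 * stroke_amplitude * wingbeat_freq * wing_length
    return flight_speed / mean_wingtip_speed


def is_hovering(j, threshold=0.1):
    """Classify a flight sequence as hovering when J is below the threshold."""
    return j < threshold


# Hypothetical flight sequence: 0.2 m/s body speed, 120 degree stroke amplitude,
# 200 Hz wingbeat frequency, 8 mm wing length.
j = advance_ratio(flight_speed=0.2, stroke_amplitude_deg=120.0,
                  wingbeat_freq=200.0, wing_length=0.008)
print(f"J = {j:.3f}, hovering: {is_hovering(j)}")
```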

      Reviewer #2 (Recommendations for the authors):

      Below, I provide more details on the arguments made in the public review, as well as a few additional comments and observations; further detailed comments are provided in the word document of the manuscript file, which was shared with the authors via email (I am not expecting a point-by-point reply to all comments in the word document!).

      We thank the reviewer for this detailed list of additional comments, here and in the manuscript. As suggested by the reviewer, we did not provide a point-by-point response to all comments in the manuscript file, but did take them into account when improving our revised manuscript. Most importantly, we now explicitly define kinematic similarity as the equivalent of morphological similarity (isometry), we added a null hypothesis and the proposed references, and we revised the figures based on the reviewer's suggestions.

      Null hypotheses for kinematic parameters.

      Angular amplitudes should be size-invariant under isometry. The angular velocity is more challenging to predict, and two reasonable options exist. Conservation of energy implies:

      $W = \tfrac{1}{2} I \omega^{2}$

      where I is the mass moment of inertia and W is the muscle work output (I note that this result is approximate, for it ignores external forces; this is likely not a bad assumption to first order. See the reference provided below for a more detailed discussion and more complicated calculations). From this expression, two reasonable hypotheses may be derived.

      First, in line with classic scaling theory (Hill, Borelli, etc), it may be assumed that $W \propto m$; isometry implies that $I \propto m^{5/3}$, from which $\omega \propto m^{-1/3}$ follows at once. Note well the implication with respect to eq. 1: isometry now implies $F \propto m^{2/3}$, so that weight support presents a bigger challenge for larger animals; this result is completely analogous to the same problem in terrestrial animals, which has received much attention, but in strong contrast to the argument made by the authors: weight support is more challenging for larger animals, not for smaller animals.

      Second, in line with recent arguments, one may surmise that the work output is limited by the muscle shortening speed instead, which, assuming isometry and isophysiology, implies $\omega \propto m^{0}$ = constant; smaller animals would then indeed be at a seeming disadvantage, as suggested by the authors (but see below).

      The following references contain a more detailed discussion of the arguments for and against these two possibilities:

      Labonte, D. A theory of physiological similarity for muscle-driven motion. PNAS, 2023, 120, e2221217120

      Labonte, D.; Bishop, P.; Dick, T. & Clemente, C. J. Dynamic similarity and the peculiar allometry of maximum running speed. Nat Comms., 2024, 15, 2181

      Labonte, D. & Holt, N. Beyond power limits: the kinetic energy capacity of skeletal muscle. bioRxiv doi: 10.1101/2024.03.02.583090, 2024

      Polet, D. & Labonte, D. Optimising the flow of mechanical energy in musculoskeletal systems through gearing. bioRxiv doi: 10.1101/2024.04.05.588347, 2024

      Labonte et al 2024 also highlight that, due to force-velocity effects, the scaling of the velocity that muscle can impart will fall somewhere in between the extremes presented by the two hypotheses introduced above, so that, in general, the angular velocity should decrease with size with a slope of around -1/6 to -2/9 --- very close to the slope estimated in this manuscript, and to data on other flying animals.
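
      The exponents implied by the two null hypotheses above can be checked with a few lines of exact arithmetic. This is a sketch under stated assumptions: isometric exponents of 5/3 for the mass moment of inertia and 4/3 for the wing second-moment-of-area, and an aerodynamic force proportional to the squared angular velocity times the second-moment-of-area. The function and constant names are hypothetical.

```python
from fractions import Fraction

# Isometric exponents (trait proportional to m**exponent), assuming constant density.
INERTIA_EXP = Fraction(5, 3)        # mass moment of inertia, I ~ m * l^2
SECOND_MOMENT_EXP = Fraction(4, 3)  # wing second-moment-of-area, S2 ~ l^4


def omega_exponent_work_limited(work_exp=Fraction(1)):
    """Exponent of angular velocity from W = 1/2 I w^2, with W ~ m**work_exp."""
    return (work_exp - INERTIA_EXP) / 2


def force_exponent(omega_exp):
    """Exponent of aerodynamic force, assuming F ~ w^2 * S2."""
    return 2 * omega_exp + SECOND_MOMENT_EXP


hypotheses = {
    "work-limited (W ~ m)": omega_exponent_work_limited(),
    "speed-limited (w constant)": Fraction(0),
}

for name, omega_exp in hypotheses.items():
    force_exp = force_exponent(omega_exp)
    # Body weight scales as m^1, so force relative to weight scales as m^(force_exp - 1).
    print(f"{name}: w ~ m^({omega_exp}), F ~ m^({force_exp}), "
          f"F/weight ~ m^({force_exp - 1})")
```

      Under the work-limited hypothesis the force relative to weight falls with increasing mass (exponent -1/3), whereas under the speed-limited hypothesis it falls with decreasing mass (exponent 1/3), which is the contrast discussed above.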

      We greatly appreciate the reviewer's detailed insights on null hypotheses for kinematics, along with the accompanying references. As noted in the Public Review section (comment/reply 2.3), our study primarily explores how small-sized insects adapt to constraints imposed by the wing-based propulsion system, rather than by the muscular motor system.

      In this context, we chose to contrast the observed scaling of morphology and flight traits with a hypothetical scenario of geometric similarity (isometry) and kinematic similarity, where all size-independent kinematic parameters remain constant with body mass. While isometric expectations for morphological traits are well-defined (i.e., ), those for kinematic traits are more debatable (as pointed out by the reviewer). For this reason, we believe that adopting a simple approach based on kinematic similarity across sizes ($f \sim m^{0}$, etc.) enhances the interpretability of our results and strengthens the overall narrative.

      Size range

      The study would significantly benefit from a larger size range; it is unreasonable to ask for kinematic measurements, as these experiments become insanely challenging as animals get smaller; but it should be quite straightforward for wing shape and size, as this can be measured with reasonable effort from museum specimens. In particular, if a strong point on miniaturization is to be made, I believe it is imperative to include data points for or close to the smallest species.

      We appreciate that the reviewer recognizes the difficulty of performing additional kinematic measurements. Collecting additional morphological data to extend the size range was however feasible. In our revised study, we therefore extended our morphological scaling analysis by including the morphology of twenty additional hoverfly species. This extended dataset includes wing morphology data of 74 museum specimens (4.2±1.7 individuals per species (mean±sd)) from Naturalis Biodiversity Centre (Leiden, the Netherlands). This increased the studied mass range of our hoverfly species from 5–100 mg to 3–132 mg, and strengthened our results and conclusions on the morphological scaling in hoverflies.

      Is weight support the main problem?

      Phrasing scaling arguments in terms of weight support is consistent with the classic literature, but I am not convinced this is appropriate (neither here nor in the classic scaling literature): animals must be able to move, and so, by strict physical necessity, muscle forces must exceed weight forces; balancing weight is thus never really a concern for the vast majority of animals. The only impact of the differential scaling may be a variation in peak locomotor speed (this is unpacked in more detail in the reference provided above). In other words, the very fact that these hoverfly species exist implies that their muscle force output is sufficient to balance weight, and the arguably more pertinent scaling question is how the differential scaling of muscle and weight force influences peak locomotor performance. I appreciate that this is beyond the scope of this study, but it may well be worth it to hedge the language around the presentation of the scaling problem to reflect this observation, and to, perhaps, motivate future work.

      We agree with the reviewer that a question focused on muscle force would be inappropriate for this study, as muscle force and power availability are not under selection in the context of hovering flight, but instead in situations where producing increased output is advantageous (for example during take-off or rapid evasive maneuvers). But as explained in our revised manuscript (lines 81–85), we do not focus here on the scaling of the muscular motor with size and throughout phylogeny; instead, we focus on the scaling of the flapping wing-based propulsion system. For this system, there are known physical scaling laws that predict how the propulsion system should scale with size (in morphology and kinematics) to maintain weight support across sizes. In our study, we test in what way hoverflies achieve this weight support in hovering flight.

      Of course, it would be interesting to also test how peak thrust is produced by the propulsion system, for example during evasive maneuvers. In the revised manuscript, we now explicitly mention this as potential future research (lines 733–735).
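
      The physical scaling argument referred to in this reply can be sketched as a short worked derivation. This is a generic quasi-steady argument under assumed notation and is not necessarily the exact form of the manuscript's equation 1: the symbols below, the constant force coefficient, and the use of a single wing angular speed to summarize the kinematics are all assumptions made for illustration.

```latex
% Assumed notation (not taken from the manuscript):
% F: stroke-averaged vertical force, \rho: air density, C_F: force coefficient,
% \omega: wing angular speed, S: wing area, R: wing length,
% \hat{r}_2: non-dimensional radius of the second moment of area,
% S_2: second moment of wing area, m: body mass, g: gravitational acceleration.
\begin{align}
F &= \tfrac{1}{2}\,\rho\, C_F\, \omega^{2}\, S_2, \qquad S_2 = S R^{2} \hat{r}_2^{2} \\
\text{isometry:}\quad S &\propto m^{2/3},\; R \propto m^{1/3},\; \hat{r}_2 \propto m^{0}
  \;\Rightarrow\; S_2 \propto m^{4/3} \\
F = m g \;&\Rightarrow\; \omega^{2} \propto m^{1 - 4/3} = m^{-1/3}
  \;\Rightarrow\; \omega \propto m^{-1/6} \quad (\text{constant } C_F)
\end{align}
```

      Under these assumptions, a smaller isometric flier would need a higher wing angular speed, or a compensating change in wing morphology, to keep supporting its weight.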

      Other relevant literature

      Taylor, G. & Thomas, A. Evolutionary biomechanics: selection, phylogeny, and constraint, Oxford University Press, 2014

      This book has quite detailed analyses of the allometry of wing size and shape in birds in an explicit phylogenetic context. It was a while ago that I read it, but I think it may provide much relevant information for the discussion in this work.

      Schilder, R. J. & Marden, J. H. A hierarchical analysis of the scaling of force and power production by dragonfly flight motors J. Exp. Biol., 2004, 207, 767

      This paper also addresses the question of allometry of flight forces (if in dragonflies). I believe it is relevant for this study, as it argues that positive allometry of forces is partially achieved through variation of the mechanical advantage, in remarkable resemblance to Biewener's classic work on EMA in terrestrial animals (this is discussed and unpacked in more detail also in Polet and Labonte, cited above). Of course, the authors should not measure the mechanical advantage of this work, but perhaps this is an interesting avenue for future work.

      We thank the reviewer for these valuable literature suggestions and the insights they offer for future work.

      More generally, I thought the introduction misses an opportunity to broaden the perspective even further, by making explicit that running and flying animals face an analogous problem (with swimming likely being a curious exception!); some other references related to the role of phylogeny in biomechanical scaling analyses are provided in the comments in the word file.

      The introduction has been revised to better emphasize the generality of the scaling question addressed in our study. Specifically, we now explicitly highlight the similar constraints associated with increasing or decreasing size in both terrestrial and flying animals (lines 53–59). We thank the reviewer for this suggestion, which has improved our manuscript.

      Numerical results vs measurements

      I felt that the paper did not make the strongest possible use of the very nice numerical simulations. Part of the motivation, as I understood it, was to conduct more complex simulations to also probe the validity of the quasi-steady aerodynamics assumption on which eq. 1 is based. All parameters in eq. 1 are known (or can be approximated within reasonable bounds) - if the force output is evaluated analytically, what is the result? Is it comparable to the numerical simulations in magnitude? Is it way off? Is it sufficient to support body mass? The interplay between experiments and numerics is a main potential strength of the paper, which in my opinion is currently sold short.

      We agree with the reviewer that we did not make full use of the numerical simulation results. In fact, we did so deliberately because we aim to focus more on the fluid mechanics of hoverfly flight in a future study. That said, we thank the reviewer for suggesting that we use the CFD to validate our quasi-steady model. We now do so by correlating the vertical aerodynamic force with variations in morphology and kinematics (revised Figure 7A). The striking similarity between the predicted and empirical fit shows that the quasi-steady model captures the aerodynamic force production during hovering flight surprisingly well.

      Statistics

      There are errors in the Confidence Intervals in Tab 2 (and perhaps elsewhere). Please inspect all tables carefully, and correct these mistakes. The disagreement between confidence intervals and p-values suggests a significant problem with the statistics; after a brief consultation with the authors, it appears that this result arises because Standard Major Axis regression was used (and not Reduced Major Axis regression, as stated in the manuscript). This is problematic because SMA confidence intervals become unreliable if the variables are uncorrelated, as appears to be the case for some parameters here (see https://cran.r-project.org/web/packages/lmodel2/vignettes/mod2user.pdf for more details on this point). I strongly recommend that the authors avoid SMA, and use MA, RMA or OLS instead. My recommendation would be to use RMA and OLS to inspect if the conclusions are consistent, in which case one can be shown in the SI; this is what I usually do in scaling papers, as there are some colleagues who have very strong and diverging opinions about which technique is appropriate. If the results differ, further critical analysis may be required.

      The reviewer correctly identified an error in the statistical approach: a Standard Major Axis regression was indeed used under inappropriate conditions. Following Reviewer #3’s comments, and given the expanded sample size and the resulting increase in statistical power to detect phylogenetic signal, our revised analysis now accounts for phylogenetic effects in these regressions. We therefore now report the results from Phylogenetic Generalized Least Squares (PGLS) regressions (the phylogenetic equivalent of an OLS).

      Figures

      Please plot 3E-F in log space, add trendlines, and the expectation from isometry/isophysiology, to make the presentation consistent, and comparison of effect strengths across results more straightforward.

      The reviewer probably meant Figure 3F–I rather than E–F (the four panels depicting the relationships between kinematic variables and body mass). As requested, we added the expectation for kinematic similarity to the revised figure, but prefer not to show the non-significant PGLS fits, as they are not used in any analysis. For completeness, we did add the requested figure in log-space with all trendlines to the supplement (Figure S2), and refer to it in the figure legend.

      The visual impression of the effect strength in D is a bit misleading, due to the very narrow y-axis range; it took me a moment to figure this out. I suggest either increasing the y-range to avoid this incorrect impression or to notify the reader explicitly in the caption.

      We believe the reviewer is referring to Figure 4D. As rightly pointed out, variation in the non-dimensional second moment of area is very low among species, which is consistent with the literature (Ellington, 1984b). We agree that the small range on the y-axis might be confusing, and thus we increased it somewhat. More importantly, we now show, next to the trend line, the scaling for isometry (~m<sup>0</sup>) and for single-metric weight support. In particular, the steepness of the latter trend line shows the relatively small effect of this parameter on aerodynamic force production. This is further highlighted by the newly added pie charts of the relative allometric scaling factor, where variations in this parameter contribute only 5% to maintaining weight support across sizes.

      Despite this small variation, these adaptations in wing shape are still significant and are highly interesting in the context of our work. We now discuss this in more detail in the revised manuscript (lines 645–649).

      In Figure 7b, one species appears as a very strong outlier, driving the regression result. Data of the same species seems to be consistent with the other species in 7a, c, and d - where does this strong departure come from? Is this data point flagged as an outlier by any typical regression metric (Cook's distance etc) for the analysis in 7b?

      We agree with the reviewer: the species in dark green (Eristalis tenax) appears as an outlier in Figure 7B (non-dimensional second moment of area vs. vertical force) in our original manuscript. This is most likely due to the narrow range of variation in the non-dimensional second moment of area, as the reviewer pointed out in the previous comment, which amplifies differences among species. We expanded the y-axis range in the revised Figure 7, so that the point no longer appears as an outlier (see updated graph, now in Figure 7F).

      In Figure 1, second species from the top, it reads "Eristalix tenax" when it is "Eristalis tenax" (relayed info by the Editor).

      Corrected.

      Reviewer #3 (Recommendations for the authors):

      I really like the biomechanical and aerodynamic analyses and think that these alone make for a strong paper, albeit with narrower conclusions. I think it is perfectly valid and interesting to analyze these questions within the scope of the species studied and even to say that these patterns may therefore extend to the hoverflies as a whole group given the great discussion about the shared ecology and behavior of much of the clade. However, the extension to miniaturization is too tenuous. This would need much more support, especially from the phylogenetic methods which are not rigorously presented and likely need additional tests.

      We thank the reviewer for the positive words about our study. We agree that our attempt to infer the directedness of size evolution was too simplistic, and thus the miniaturization aspect of our study would need more support. As suggested by the reviewer, we therefore no longer focus on miniaturization, and have removed these aspects from the title, abstract and main conclusion of our revised manuscript.

      There is a lot of missing data about the tree and the parameters used for the phylogenetic methods that should be added (especially branch lengths and models of evolution). Phylogenetic tests for the relationships of traits should go beyond the analysis of phylogenetic signals in the specific traits. My understanding is also that phylogenetic signal is not properly interpreted as a "control" on the effect of phylogeny. The PCA should probably be a phylogenetic PCA with a corresponding morphospace reconstruction.

      We agree with the reviewer that our phylogenetic approach, based on phylogenetic signal only, was incomplete. In our revised manuscript, we not only test for phylogenetic signal but also account for phylogeny in all regressions between traits and body mass using Phylogenetic Generalized Least Squares (PGLS) regressions. Additionally, we have provided more details about the model of evolution and the parameter estimation method in the Methods section (lines 275–278).

      Following the reviewer's suggestion, in our revised study we now also performed a phylogenetic PCA instead of a traditional PCA on the superimposed wing shape coordinates. The resulting morphospace was, however, almost identical to that of the traditional PCA (Figure S4). We nonetheless included it in the revised manuscript for completeness. We thank the reviewer for this suggestion, as the revised phylogenetic analysis strengthens our study as a whole.
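
      For readers less familiar with the method, the general form of a PGLS regression can be written down as a short sketch. The notation is assumed for illustration and the specific evolutionary model used in the manuscript is not restated here; the only point shown is that PGLS is a generalized least squares fit whose error covariance is derived from the phylogeny, and that it reduces to OLS when species are treated as independent.

```latex
% Assumed notation (not taken from the manuscript):
% y: vector of log-transformed species trait means;
% X: design matrix with an intercept column and log body mass;
% V: covariance matrix implied by the tree (e.g. proportional to shared
%    branch lengths under a Brownian-motion model); I: identity matrix.
\begin{align}
y &= X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^{2} V) \\
\hat{\beta}_{\mathrm{GLS}} &= (X^{\top} V^{-1} X)^{-1} X^{\top} V^{-1} y \\
V = I \;&\Rightarrow\; \hat{\beta}_{\mathrm{GLS}} = (X^{\top} X)^{-1} X^{\top} y = \hat{\beta}_{\mathrm{OLS}}
\end{align}
```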

      For the miniaturization conclusion, my suggestion is a more rigorous phylogenetic analysis of directionality in the change in size across the larger phylogeny. However, even given this, I think the conclusion will be limited because it appears this trend does not hold up under the 8 species subsample. To support that morphology is evolutionarily correlated with miniaturization would for me require an analysis of how the change in body size relates to the change in wing shape and kinematics which is beyond what a scaling relationship does. In other words, you would need to test if the changes in body morphology occur in the same location phylogenetically with a shrinking of body size. I think even more would be required to use the words "enable" or "promote" when referring to the relationship of morphology to miniaturization because those imply evolutionary causality to me. To me, this wording would at least require an analysis that shows something like an increase in the ability of the wing morphological traits preceding the reduction in body size. Even that would likely be controversial. Both seem to be beyond the scope of what you could analyze with the given dataset.

      As mentioned in reply 3.1, we agree with the reviewer that the miniaturization aspect of our study would need more support. As suggested by the reviewer, we therefore no longer focus primarily on miniaturization, and have removed these aspects from the title, abstract and main conclusion of our revised manuscript.

      The pseudoreplication should be corrected. You can certainly report the data with all individuals, but you should also indicate in all cases if the analysis is consistent if only species are considered.

      As mentioned in the Public Review section, our revised approach avoids pseudoreplication by analyzing all data at the species level. Nonetheless, we have included supplementary figures (Figures S3 and S5) to visualize within-species variation.

      My overall suggestion is to remove the analysis of miniaturization and cast the conclusions with respect to the sampling you have. Add a basic phylogenetic test for the correlated trait analysis (like a phylogenetic GLM) which will likely still support your conclusions over the eight species and emphasize the specific conclusion about hoverflies' scaling relationships. I think that is still a very good study better supported by the extent of the data.

      We thank the reviewer for the positive assessment of our study, and their detailed and constructive feedback. As suggested by the reviewer, miniaturization is no longer the primary focus of our study, and we revised our analysis by extending the morphology dataset to more species, and by using phylogenetic regressions.

      References

      Ellington C. 1984a. The aerodynamics of hovering insect flight. III. Kinematics. Philosophical Transactions of the Royal Society of London B: Biological Sciences 305:41–78.

      Ellington C. 1984b. The aerodynamics of hovering insect flight. II. Morphological parameters. Philosophical Transactions of the Royal Society of London B: Biological Sciences 305:17–40.

      Fry SN, Sayaman R, Dickinson MH. 2005. The aerodynamics of hovering flight in Drosophila. Journal of Experimental Biology 208:2303–2318. doi:10.1242/jeb.01612

      Liu Y, Sun M. 2008. Wing kinematics measurement and aerodynamics of hovering droneflies. Journal of Experimental Biology 211:2014–2025. doi:10.1242/jeb.016931

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The authors prepared several Acinetobacter baumannii strains from which an essential protein of known or unknown function can be depleted. They chose to study one of the proteins (AdvA) in more detail. AdvA is a known essential cell division protein that accumulates at cell division sites together with other such proteins. No clear homologs are present in model bacteria such as E.coli, and the precise role(s) of AdvA is still unclear. The authors rename AdvA here as Aeg1. The authors searched for suppressors of lethality caused by AdvA-depletion and recovered an allele of ftsA (E202K) that is capable of doing so. Based on similar superfission alleles previously recovered in other division genes in E.coli, they test several mutant genes and find that certain alleles in ftsB, L and W can also suppress lethality of AdvA-minus cells.

      In addition, the authors perform bacterial two-hybrid assays and protein sublocalization studies of AdvA and of other division proteins, but the results of these studies are either not new (confirming previous work) or not convincing.

      We appreciate the rigor of this reviewer.

      We agree that the essentiality of AdvA/Aeg1 described in our submission is not new, but we believe our work has firmly established its role as a cell division protein. The earlier work by the Geisinger and Isberg labs (1) showed its essentiality and the cell morphology changes upon its depletion (Fig. 3 of ref. 1 at the end of this rebuttal letter). This protein was one of many addressed in their study, and their results only suggest its role in cell division, based on the close phenotypic relationships between AdvA/Aeg1 and genes associated with chromosome replication/segregation and cell division.

      Reviewer #2 (Public Review):

      In this study the authors confirm that one of the genes classified as essential in a Tn-mutagenesis study in A. baumannii is in fact an essential gene. It is also present in other closely related Gram-negative bacteria and the authors designated it Aeg1. Depletion of Aeg1 leads to cell filamentation and it appears that the requirement for Aeg1 can be suppressed by what appear to be activation mutations in various genes. Overall, it appears that Aeg1 is involved in cell division but many of the images suffer from poor quality - it may be due to conversion to PDF. One of the main issues is that depletion of Aeg1 is carried out for such long times (18 hr) (Fig. 2, 4 and 5). Depleting a cell division protein for such long times may have pleiotropic effects on cell physiology. A. baumannii grows quite fast and even with a small inoculum, cells will probably be in stationary phase. If Aeg1 is that essential cells should be quite filamentous 2-3 hours after Ara removal when they are still in exponential phase. Also, it would be better to see the recovery to small cells if cells are not grown such a long time before Ara is added back. Overall, Aeg1 is potentially interesting, but studies are needed to define its place in the assembly pathway for this to be published. What proteins are at the division site when Aeg1 is depleted and what proteins are required for Aeg1 to localize to the division site. These experiments should be done when cell are depleted of proteins for only 1 -2 hours.

      We appreciate these insightful suggestions and have followed them to make necessary modifications in the revised manuscript, including:

      First, we have redone the experiment for Fig. 1C to obtain images of higher resolution.

      Second, we have more carefully examined the kinetics of the depletion of Aeg1-mCherry upon removal of the inducer arabinose from the medium. We first evaluated the protein level of Aeg1-mCherry at 2, 4, and 6 h after withdrawing arabinose and found that at the 2 h and 4 h time points mCherry-Aeg1 was still readily detectable (Fig. S4). Importantly, we found that removal of arabinose for 6 h rendered Aeg1-mCherry undetectable in approximately 90% of the cells. We thus used the 6 h inducer depletion to examine the effects of Aeg1 depletion.

      In experiments aiming to analyze the co-localization of Aeg1 with other core divisome proteins, cultures of strains derived from Δaeg1(PBAD::mCherry-Aeg1) harboring the GFP fusions were induced by ara for 16 h. The saturated bacterial cultures were then diluted into fresh LB broth without ara for 6 h to induce the elongation morphology. IPTG (0.25 mM) and ara (0.25%) were added to induce the expression of fusion proteins for 4 h before samples were processed for microscopic analysis. Our results indicate that Aeg1 colocalized with ZipA, FtsK, FtsL, FtsB, and FtsW (Fig. 4C), which is consistent with results from the protein interaction experiments using the bacterial two-hybrid assay.

      We also determined the impact of Aeg1 depletion on the cellular localization of several core divisome proteins. In cells in which Aeg1 had been depleted (by removing the inducer arabinose), all of the examined core division proteins displayed midcell mistargeting, including ZipA, FtsK, FtsB, FtsL, and FtsN (Fig. 5A).

      Reviewer #1 (Recommendations For The Authors):

      Specific remarks 1) The manuscript title is misleading in that the 'novel cell division protein' studied in this paper has already been identified as such, and studied in some detail, by the Geisinger and Isberg labs (refs 37 and 20).

      We agree with this point. Because the data presented by the Geisinger and Isberg labs (1) demonstrated its essentiality and the morphological changes upon its depletion (Fig. 3 in ref. 1), we have changed the title to “A unique cell division protein critical for the assembly of the bacterial divisome”.

      2) The Isberg/Geisinger labs named this division protein AdvA in 2020 (ref 37). The authors of the present manuscript should follow this terminology, as there is no compelling reason to rename the protein Aeg1 here. It will only confuse the field.

      We named this protein Aeg1 because we identified and named it before the work by the Geisinger and Isberg labs (1) was published, and this name has been used in all of our records. In addition, this work is part of our research exploring hypothetical essential genes in A. baumannii, and we would thus like to keep the name in this manuscript.

      3) Membrane topology of AdvA? Line 103-104: The authors predict a single transmembrane domain in AdvA (Aeg1). However, reference 37 predicted two, and some prediction programs (e.g. CCTOP) predict three with the N-terminus periplasmic. A good understanding of the membrane topology of AdvA is important, if not only for the design of credible BACTH two-hybrid assays. Figure 6 indicates that the authors assume that the N-terminus of AdvA is periplasmic with the bulk of the protein cytoplasmic. But then they choose to use pKT25::AdvA for two-hybrid assays, which would place the CyaA T25 domain periplasmic as well. This should not yield faithful interaction data as both the T25 and T18 domains need to be cytoplasmic to restore CyaA activity.

      The Bacterial Adenylate Cyclase-Based Two-Hybrid (BACTH) technique is a powerful tool for studying protein-protein interactions, especially those involving integral membrane or membrane-associated proteins. It overcomes the limitations of traditional two-hybrid systems by allowing the detection of interactions that occur within the membrane or in other difficult-to-study protein environments (2). This method has been successfully used to analyze the relationships among bacterial cell division proteins (e.g., refs 3 and 4). Furthermore, our results from bacterial two-hybrid and immunofluorescence techniques are consistent. As a result, the results presented here should be valid.

      4) Strains and plasmids, Table S4 Far more detail is needed. a) Please provide complete genotypes of strains and, especially, of the plasmids used, including replication origin, antibiotic resistance markers, promoters, promoter repressors, inducible genes/fusions to be expressed, and the placement of genetic tags (T25, T18, XFP, Flag, etcetera).

      We have added the information to Table S4.

      b) In addition, provide details on how each strain/plasmid was constructed in the Methods section or as supplement. Currently, you only provide some details on one or two of the strains or plasmids.

      We have added the necessary details about how the constructs and plasmids used in this study were made.

      5) Lines 114-129, Fig 2. AdvA is needed for cell division. a) Similar results were already described by refs 37 and 20, so this is merely confirmatory.

      We revised the description accordingly.

      b) Refs 37 and 20 should be referenced here, as well as in the section above where you find AdvA to be essential for viability on rich medium.

      We have added the appropriate reference as suggested.

      c) The micrographs in panel C are of poor quality. Consider higher magnification and resolution.

      We have redone the experiments and images of higher resolution have been used in the revised manuscript.

      6) Lines 130-143, selection for suppressors of AdvA-depletion. I would expect quite a few mutations in araC repressor on the plasmid in this screen, rendering the promoter more constitutive (i.e. arabinose-independent). Did these not appear?

      This is an interesting point. Unfortunately, we did not recover suppressor mutants with mutations in araC or other elements of the BAD promoter. Given the complexity of AraC-mediated regulation (5), such mutants are likely rare, or we did not screen enough candidates.

      7) Lines 173-178, Fig3E. Sublocalization of AdvA-mCherry. a) The micrographs in Fig. 3E are very poor and I can not see any specific localization, or barely any signal whatsoever, of the AdvA-mCherry fusion. Thus, this result is not convincing

      We have replaced this image with a new one of higher resolution.

      b) In contrast, accumulation of an AdvA-GFP fusion at constriction sites was already clearly and convincingly shown in ref 37.

      We have revised the text to reflect this fact.

      c) So, this section needs convincing images, as well as a reference to ref 37.

      We have added an image of higher resolution and revised the text accordingly. Thank you.

      8) Lines 179-188, Fig4a-b. BACTH assays

      a) As noted above (see point 3), the T25-AdvA fusion would likely place the T25 domain in the periplasm, casting doubt on the validity of these results.

      b) Similarly, the T18-ZipA fusion would place the T18 domain in the periplasm, casting further doubt.

      The Bacterial Adenylate Cyclase-Based Two-Hybrid (BACTH) technique is a powerful tool for studying protein-protein interactions, especially those involving integral membrane or membrane-associated proteins. It overcomes the limitations of traditional two-hybrid systems by allowing the detection of interactions that occur within the membrane or in other difficult-to-study protein environments (2). This method has been successfully used to analyze the relationships among bacterial cell division proteins (e.g., refs 3 and 4). Furthermore, our results from bacterial two-hybrid and immunofluorescence techniques are consistent. As a result, the results presented here should be valid.

      9) Lines 189-201, Fig4c, co-localization of proteins in AdvA-depleted filaments. These co-localization results are not convincing for several reasons:

      a) None of the proteins accumulate in specific ring-like structures, as might be expected for ZipA, at least. One possible reason is that division rings are not made at all due to the partial depletion of AdvA in these cells. But another possible reason is that some or all the fusions are simply non-functional. Do any of these proteins (co-)localize to the septal ring in wt cells?

      b) At least for the GFP-ZipA fusion, there is good reason to predict it is not functional, as correct membrane insertion of the fusion would place GFP in the periplasm. In E. coli this prevents GFP from becoming fluorescent in the first place. So the fluorescence seen here may reflect failure of the fusion to insert properly.

      c) Another possible reason for rings being absent is that the fusions are massively overexpressed. The plasmids are multicopy, the BAD and TAC promoters are strong, and the used levels of inducers (Ara and IPTG) are high. How do fusion levels compare to that of native proteins? Perhaps some of the bright spots we see are inclusion bodies or other types of non-specific protein aggregates.

      We appreciate these excellent suggestions and have carried out experiments to investigate the (co-)localization of these proteins at the septal ring in Δaeg1 cells under conditions of low-level inducers (Ara and IPTG) and reduced induction time.

      Cultures of strains derived from Δaeg1(PBAD::mCherry-Aeg1) harboring the GFP fusions were induced by ara for 16 h. The saturated bacterial cultures were then diluted into fresh LB broth without ara for 6 h to induce the elongation morphology. IPTG (0.2 mM) and ara (0.2%) were added to induce the expression of fusion proteins for 4 h before samples were processed for microscopic analysis. Consistent with results from the protein interaction experiments using the bacterial two-hybrid assay, Aeg1 colocalized with ZipA, FtsK, FtsL, FtsB, and FtsW (Fig. 4C). Thus, Aeg1 interacts with multiple core divisome proteins of A. baumannii.

      In cells of the wild-type A. baumannii strain, we have observed cell elongation upon overexpression of FtsL, FtsB, FtsW, or FtsN. This raises concerns regarding the physiological relevance of the results obtained in wild-type cells. Of note, the phenotype of cell elongation following overexpression of division proteins has been observed in Escherichia coli by several groups (6-11).

      10) Lines 202-214, Fig5a, localization of division proteins in AdvA-depleted filaments. These localization results are not convincing for the same reasons outlined above (see point 9).

      a) Do any of the fusions localize correctly under similar expression conditions, but in normally dividing cells?

      In wild-type A. baumannii cells, cell elongation occurs upon overexpression of FtsL, FtsB, FtsW or FtsN, which raises the concern that the results from the suggested experiments may not be physiologically relevant.

      b) Even the regular structures seen with GFP-FtsZ do not resemble rings, but appear more like blobs. Perhaps fixation with glutaraldehyde would preserve structures better?

      We have followed the suggestion to fix cells with glutaraldehyde. The new images have been used in the revised manuscript.

      11) Other points:

      a) Line 97, Fig1. Is AdvA essential on minimal medium (~ slow growth) as well?

      We have performed this experiment. Yes, AdvA/Aeg1 is essential for A. baumannii growth in the Vogel-Bonner minimal medium with succinate (VBS) as the sole carbon source (12) (Fig S1).

      b) Fig1. What residues are actually missing (or replaced?) in the delta-TM version of AdvA?

      We have added this information: residues 1–23 have been removed.

      c) Fig1D. Also, the delta-TM version of HA-AdvA runs slower than HA-AdvA itself. Why?

      We have also been puzzled by this phenomenon that full-length AdvA/Aeg1 migrated faster than the delta-TM mutant. Interestingly, this discrepancy did not occur when the proteins were expressed in E. coli (see Author response image 1). We do not have a good explanation for this phenomenon.

      Author response image 1.

      The expression of Aeg1 and Aeg1∆TM in A. baumannii and E. coli. Total proteins resolved by SDS-PAGE were probed by immunoblotting with the HA-specific antibody. The metabolic enzyme isocitrate dehydrogenase (ICDH) was probed as a loading control. Similar results were obtained in three independent experiments.

      d) Lines 159, 165 and elsewhere. The mutation in E. coli is actually FtsA(R286W), not Q286W.

      We have corrected this error. Thank you!

      e) Line 161. These alleles of ftsA should be referenced properly: ref 33 for I143L and ref 29 for E124A.

      We have made the correction. Thank you!

      f) Line 692, you incorrectly switched the two CyaA domains here.

      We have corrected this error.

      g) Fig4b. Is 'none' a vector control (pUT18C-Flag)?

      We have specified the control; it is the vector pUT18C-Flag.

      h) Lines 727-729. I don't understand this sentence. Please explain.

      We have revised this sentence.

      Reviewer #2 (Recommendations For The Authors):

      Line 159 and Fig. 2 Panel D. I am not sure that this panel should be in the paper for two reasons: 1) FtsA from E. coli and A. baumannii are only 50% identical and its not clear that one can make corresponding mutations and expect similar behavior. FtsA* from E. coli is R286W not Q286W. R286 does not appear to be conserved in A. baumannii. Also, what you label as Q286 appears to be Q285. Please check. 2) the alleles that are tested in this panel do not rescue the deletion of Aeg1. This may be due to the instability of the mutant proteins. It would be better to characterize the mutant that you have isolated - is it a superfission mutation; that is does it produce small cells in a strain that contains WT Aeg1?

      Thank you! We have more carefully examined the relevant sites in these proteins. We did not observe the small cell phenotype when FtsAE202K was overexpressed in WT strains (please see Author response image 2).

      Author response image 2

      The overexpression of FtsAE202K did not cause a small cell phenotype in A. baumannii. Bacterial strains derived from WT (Ptac::FtsAE202K) grown in LB broth overnight were diluted into fresh medium and induced with IPTG for 4 h prior to being processed for imaging (A). Total proteins were resolved by SDS-PAGE, transferred onto nitrocellulose membranes, and detected by immunoblotting with the HA-specific antibody. ICDH was probed as a loading control (B, right panels). Images are representative of three parallel cultures. Bar, 10 µm.

      The images in Fig. 3, Panel C are quite poor (perhaps the original images [not PDF] are better). It is difficult to see the localization.

      We have redone the experiments and replaced the images with ones of higher resolution.

      Fig. 4. Panel C. This is an effort to show that Aeg1 colocalizes with known cell division proteins. Since in Fig. 3, panel C it is claimed that Aeg1 localizes to the division site, then it must colocalize with known division proteins. Doing the long-term depletion of Aeg1 is likely causing artefacts. The localization of proteins seems very erratic. A better experiment would be to express the GFP fusions to the known proteins and then deplete Aeg1 and see what happens. Does depletion of Aeg1 prevent the localization of FtsZ, FtsK or FtsN? Another important question is whether Aeg1 still localizes to division sites when one of the known cell division proteins is depleted. Since it is speculated that Aeg1 interacts with ZipA and FtsN, these proteins could be depleted to see if Aeg1 localizes.

      We greatly appreciate your insightful suggestions. We have carefully redone these experiments as follows: each of the tested strains was grown in LB broth with ara overnight and then diluted into fresh medium without ara for 6 h to induce the elongated morphology. IPTG (0.25 mM) and ara (0.25%) were added to induce the expression of fusion proteins for 4 h before samples were processed for microscopic analysis. Consistent with results from the protein interaction experiments using the bacterial two-hybrid assay, we observed that Aeg1 colocalized with ZipA, FtsK, FtsL, FtsB, and FtsW (Fig. 4C).

      In cells not expressing Aeg1, all of the examined core division proteins, including FtsZ, FtsK, and FtsN, displayed midcell mistargeting (Fig. 5A).

      As for the localization of Aeg1 upon depleting ZipA or FtsN, this is an ongoing project in our lab. Such information is beyond the scope of this manuscript.

      Fig. 5. Panel A. again the images are not of good quality. Also, why deplete for 18 hrs. This is too long.

      We have redone these experiments, and images of higher resolution are now used in the revised manuscript. After extensive testing, we chose a 6-h depletion, which gave us the window to observe the phenotype (Fig. 5A).

      Line 25. Change 'so' to 'as'

      Corrected as suggested. Thank you!

      Line 28. Change 'Induces' to 'induce'

      We have made the suggested correction. Thank you!

      Line 43. Change 'of' to 'with'

      Corrected as suggested. Thank you!

      Line 74. Change 'determine' to 'test'

      Corrected as suggested. Thank you!

      Line 89. Delete 'of the'

      We have made the suggested correction. Thank you!

      Line 102. Some strains of E. coli? Does that mean there are strains that do not contain Aeg1? What are they?

      Yes, this is indeed the case: the common strains of E. coli derived from strain K12 do not have a discernible homolog of aeg1. This gene is present in some clinical E. coli isolates (e.g. HAY5567682, HBI862710, MDD9849866, EFE8345364, and KAE9874289).

      Line 112. Note this TM domain has a rare topology as it is similar to ZipA. Please mention that this is a Type 1b.

      We have made the suggested revision. Thank you!

      References:

      1. Geisinger E, Mortman NJ, Dai Y, Cokol M, Syal S, Farinha A, et al. Antibiotic susceptibility signatures identify potential antimicrobial targets in the Acinetobacter baumannii cell envelope. Nature communications. 2020;11:4522. doi: 10.1038/s41467-020-18301-2

      2. Karimova G, Gauliard E, Davi M, Ouellette SP, Ladant D. Protein-Protein Interaction: Bacterial Two-Hybrid. Methods in molecular biology (Clifton, NJ). 2017;1615:159-76. doi: 10.1007/978-1-4939-7033-9_13

      3. Karimova G, Dautin N, Ladant D. Interaction network among Escherichia coli membrane proteins involved in cell division as revealed by bacterial two-hybrid analysis. Journal of bacteriology. 2005;187:2233-43. doi: 10.1128/jb.187.7.2233-2243.2005

      4. Boldridge WC, Ljubetič A, Kim H, Lubock N, Szilágyi D, Lee J, et al. A multiplexed bacterial two-hybrid for rapid characterization of protein-protein interactions and iterative protein design. Nature communications. 2023;14:4636. doi: 10.1038/s41467-023-38697-x

      5. Schleif R. AraC protein, regulation of the l-arabinose operon in Escherichia coli, and the light switch mechanism of AraC action. FEMS microbiology reviews. 2010;34:779-96. doi: 10.1111/j.1574-6976.2010.00226.x

      6. Addinall SG, Cao C, Lutkenhaus J. FtsN, a late recruit to the septum in Escherichia coli. Molecular microbiology. 1997;25:303-9. doi: 10.1046/j.1365-2958.1997.4641833.x

      7. Pichoff S, Lutkenhaus J. Identification of a region of FtsA required for interaction with FtsZ. Molecular microbiology. 2007;64:1129-38. doi: 10.1111/j.1365-2958.2007.05735.x

      8. Du S, Henke W, Pichoff S, Lutkenhaus J. How FtsEX localizes to the Z ring and interacts with FtsA to regulate cell division. Molecular microbiology. 2019;112:881-95. doi: 10.1111/mmi.14324

      9. Park KT, Du S, Lutkenhaus J. Essential Role for FtsL in Activation of Septal Peptidoglycan Synthesis. mBio. 2020;11. doi: 10.1128/mBio.03012-20

      10. Barre FX, Aroyo M, Colloms SD, Helfrich A, Cornet F, Sherratt DJ. FtsK functions in the processing of a Holliday junction intermediate during bacterial chromosome segregation. Genes & development. 2000;14:2976-88. doi: 10.1101/gad.188700

      11. Cameron TA, Vega DE, Yu C, Xiao H, Margolin W. ZipA Uses a Two-Pronged FtsZ-Binding Mechanism Necessary for Cell Division. mBio. 2021;12:e0252921. doi: 10.1128/mbio.02529-21

      12. Vogel HJ, Bonner DM. Acetylornithinase of Escherichia coli: partial purification and some properties. The Journal of biological chemistry. 1956;218:97-106.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer 1:

      While the role of Rab27 was strongly examined, the hits of the VAMP proteins were not explored in detail. I was wondering whether the decrease in the presence of VAMPs directly suggests that the final step of membrane fusion in the exocytosis of EVs is what is being impaired, or whether other trafficking steps along the EV secretion pathway are affected.

      We appreciate the relevance of this comment and agree that the decrease of VAMP gene expression in the β-catenin-mutated HepG2 cells could suggest an impairment of the final membrane fusion step in the exocytosis of EVs. We have therefore expanded this important point in the discussion (page 10). Indeed, we identified an upregulation of VAMP2, VAMP5 and VAMP8 expression after mutated β-catenin depletion in the transcriptomic analysis of HepG2 cells. However, these proteins were not detected in the mass spectrometry analysis. Only VAMP3 and VAMP7 proteins were detected in the proteomic analysis, without any variation. This is why we did not focus on this trafficking step, but it could be interesting to explore it further in the future.

      Reviewer 2:

      (1) In Figure 1F, it is essential to investigate why mass spectrometry analysis indicated no significant changes in SDC4 levels.

      We agree with the reviewer: although we observed a significant alteration of syndecan-4 expression at the mRNA level, we did not observe significant changes in syndecan-4 levels by mass spectrometry. One possible explanation is that heparan sulfate proteoglycans like syndecan-4 exhibit a high degree of structural heterogeneity due to the biosynthetic process that produces linear polysaccharides. This characteristic can reduce the robustness of mass spectrometry analyses, leading to greater variability.

      (2) Figure 2G lacks clarity in explaining how the quantification of MVBs (multivesicular bodies) was conducted.

      We apologize for the lack of clarity in explaining how the quantification of MVBs was conducted in Figure 2G. The Materials and methods section (part electron microscopy-cells, page 23) has been modified to clarify this point.

      (3) In Supplementary Figure 1F, there is a suggestion to highlight exosomes using arrowheads for enhanced clarity.

      Following the reviewer’s suggestion, we added arrowheads to Supplementary Figure 1F to highlight the exosomes (page 16). This indeed improves clarity.

      (4) Figure 3C prompts a question about the peculiar appearance of Actin staining in KD cells, requiring further investigation.

      The peculiar appearance of this intense phalloidin staining between hepatocytes corresponds to bile canaliculi (BC), a feature of more differentiated HepG2 cells. As phalloidin-stained BC are very bright, they may diminish the visibility of other, thinner actin structures. We replaced the image of KD cells with a more relevant one (new Figure 3C).

      (5) An intriguing avenue for exploration is suggested in testing how the treatment of a GSK inhibitor on HepG2 cells might impact Rab27a and SDC4 expression.

      We appreciate the suggestion to test how treatment of HepG2 cells with a GSK inhibitor might impact Rab27a and SDC4 expression. Following the reviewer’s suggestion, we carried out these experiments, and the data are presented in Author response image 1 below. In HepG2 cells, the GSK inhibitor stabilized the wild-type β-catenin protein but, surprisingly, slightly decreased the mutated form of β-catenin (Author response image 1A). The mRNA levels of both Rab27a and SDC4 increased slightly (Author response image 1B), and Rab27a protein also increased upon treatment with the GSK inhibitor (Author response image 1C). This increase in expression could be due to the decrease of the mutated form of β-catenin in HepG2 cells, which would be consistent with Rab27a and SDC4 being repressed by the mutated β-catenin.

      Author response image 1.

      Impact of a GSK inhibitor (CHIR99021) on Rab27a and syndecan-4 (SDC4) expression in HepG2 cells. HepG2 cells were treated with 3 µM CHIR99021 or DMSO as control for 48 h. A) Western blot (upper panel) and quantification (lower panel) of wild-type (WT) and mutated (MUT) β-catenin proteins in HepG2 cells treated with DMSO (control) or with CHIR99021. B) qRT-PCR analysis of Rab27a and SDC4 expression in HepG2 cells treated with DMSO (control) or with CHIR99021. C) Western blot (left panel) and quantification (right panel) of Rab27a protein in HepG2 cells treated with DMSO (control) or with CHIR99021. *P<0.05

      Reviewer 3:

      (1) One limitation of this study is that the mechanistic relationship of exosome release and how they affect immune cells remains to be elucidated. In this context, the authors' conclusions rest on the assumption that hepatocarcinoma immune evasion is based exclusively on the reduced number of exosomes. However, the authors do not analyze exosome composition between exosomes of wild type and oncogenic background, which could be different.

      We agree that the mechanistic relationship of exosome release and how they affect immune cells remains to be elucidated. In the discussion, we mentioned that the content of β-catenin-regulated EVs remains to be explored to fully understand their function in the immunomodulation of the tumor microenvironment. Along these lines, we have ongoing experiments to analyse the exosomal content in terms of proteins and microRNAs. Our preliminary results indicate that the exosome composition in mutated β-catenin knock-down HepG2 cells seems to differ from that in control HepG2 cells, suggesting that not only the number of exosomes but also their content is involved in the immunomodulation.

      (2) The manuscript would benefit from minor language editing and the introduction from restructuring to enhance clarity.

      The manuscript has now benefited from language editing by Professor William A. Thomas (Colby-Sawyer College, New Hampshire). The Acknowledgments have been modified (page 12) to thank Professor William A. Thomas for proofreading the manuscript. The introduction has also been restructured and modified according to the reviewer's suggestions to enhance clarity (page 3).

      (3) I believe that within the abstract, the authors mean 'defect' not 'default' in the sentence: Then, we demonstrated in 3D spheroid models that activation of β-catenin promotes a decrease of immune cell infiltration through a default in exosome secretion.

      We apologize for the confusion between 'default' and 'defect' in the abstract. The abstract has been modified accordingly.

      (4) Within the 'Introduction' part of the manuscript, the authors might consider reviewing and reorganizing the first paragraph for more clarity - I suggest leading with the first three sentences of the second paragraph (HCC is the most...) and then introducing β-catenin and the effects and implications of oncogenic β-catenin in HCC.

      If the authors prefer the current structure of the 'Introduction', I would like to propose exchanging some of the wording:

      -In line 4: 'despite' instead of 'in front of'? Sentence: Thus, in front of the therapeutic revolution for cancers, with the emergence of immunotherapy and more particularly immune checkpoint inhibitors (anti-PD1, anti-PD-L1)

      -Additionally in line 7: In these tumors, the oncogenic β-catenin is able to set up a microenvironment that favors tumor progression notably by promoting immune escape. Here, 'establish' might be a better choice instead of 'set up'

      -In line 9 I suggest rephrasing the sentence: Few studies have reported that the defect of intercellular communication between cancer cells and immune cells is partly mediated by a decrease of chemokines production leading to a reduction of immune infiltrates.... and maybe adding a reference here.

      The introduction has been altered accordingly. Thanks for these suggestions that helped us to improve our manuscript.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We gratefully thank the editors and all reviewers for the time they spent making constructive remarks and useful suggestions, which have enabled us to significantly improve the quality of the manuscript. Each comment brought forward by the reviewers was carefully considered, and the manuscript has been revised in light of all suggestions.

      Reviewer #1 (Public Review):

      Wang et al. present an interesting body of work focused on the effects of high altitude and hypoxia on erythropoiesis, resulting in erythrocytosis. This work is specifically focused on the spleen, identifying splenic macrophages as central cells in this effect. This is logical since these cells are involved in erythrophagocytosis and iron recycling. The results suggest that hypoxia induces splenomegaly with decreased number of splenic macrophages. There is also evidence that ferroptosis is induced in these macrophages, leading to cell destruction. Finally, the data suggest that ferroptosis in splenic red pulp macrophages causes the decrease in RBC clearance, resulting in erythrocytosis aka lengthening the RBC lifespan. However, there are many issues with the presented results, with somewhat superficial data, meaning the conclusions are overstated and there is decreased confidence that the hypotheses and observed results are directly causally related to hypoxia.

      Major points:

      1) The spleen is a relatively poorly understood organ but what is known about its role in erythropoiesis especially in mice is that it functions both to clear as well as to generate RBCs. The latter process is termed extramedullary hematopoiesis and can occur in other bones beyond the pelvis, liver, and spleen. In mice, the spleen is the main organ of extramedullary erythropoiesis. The finding of transiently decreased spleen size prior to splenomegaly under hypoxic conditions is interesting but not well developed in the manuscript. This is a shortcoming as this is an opportunity to evaluate the immediate effect of hypoxia separately from its more chronic effect. Based just on spleen size, no conclusions can be drawn about what happens in the spleen in response to hypoxia.

      Thank you for your insightful comments and questions. The spleen is instrumental in both immune response and the clearance of erythrocytes, as well as serving as a significant reservoir of blood in the body. This organ, characterized by its high perfusion rate and pliability, constricts under conditions of intense stress, such as during peak physical exertion, the diving reflex, or protracted periods of apnea. This contraction can trigger an immediate release of red blood cells (RBCs) into the bloodstream in instances of substantial blood loss or significant reduction of RBCs. Moreover, elevated oxygen consumption rates in certain animal species can be partially attributed to splenic contractions, which augment hematocrit levels and the overall volume of circulating blood, thereby enhancing venous return and oxygen delivery (Dane et al. J Appl Physiol, 2006, 101:289-97; Longhurst et al. Am J Physiol, 1986, 251: H502-9). In our investigation, we noted a significant contraction of the spleen following exposure to hypoxia for a period of one day. We hypothesized that the body, under such conditions, is incapable of generating sufficient RBCs promptly enough to facilitate enhanced oxygen delivery. Consequently, the spleen reacts by releasing its stored RBCs through splenic constriction, leading to a measurable reduction in spleen size.

      However, we agree with you that further investigation is required to fully understand the implications of these changes. Considering the comments, we extended our research by incorporating more detailed examinations of spleen morphology and function during hypoxia, including the potential impact on extramedullary hematopoiesis. We anticipate that such an expanded analysis would not only help elucidate the initial response to hypoxia but also provide insights into the more chronic effects of this condition on spleen function and erythropoiesis.

      2) Monocyte repopulation of tissue resident macrophages is a minor component of the process being described and it is surprising that monocytes in the bone marrow and spleen are also decreased. Can the authors conjecture why this is happening? Typically, the expectation would be that a decrease in tissue resident macrophages would be accompanied by an increase in monocyte migration into the organ in a compensatory manner.

      We appreciate your insightful query regarding the observed decrease in monocytes in the bone marrow and spleen, particularly considering the typical compensatory increase in monocyte migration into organs following a decrease in tissue resident macrophages.

      The observed decrease in monocytes within the bone marrow is likely attributable to the fact that monocytes and precursor cells for red blood cells (RBCs) both originate from the same hematopoietic stem cells within the bone marrow. It is well established that exposure to hypobaric hypoxia (HH) induces erythroid differentiation specifically within the bone marrow, originating from these hematopoietic stem cells (Exp Hematol, 2021 May;97:32-46). As such, differentiation into monocytes is reduced under hypoxic conditions, which may subsequently cause a decrease in migration to the spleen.

      Furthermore, we hypothesize that an increased migration of monocytes to other tissues under HH exposure may also contribute to the decreased migration to the spleen. The liver, which partially contributes to the clearance of RBCs, may play a role in this process. Our investigations to date have indeed identified an increased monocyte migration to the liver. We were pleased to discover an elevation in CSF1 expression in the liver following HH exposure for both 7 and 14 days. This finding was corroborated through flow cytometry, which confirmed an increase in monocyte migration to the liver.

      Consequently, we propose that under HH conditions, the liver requires an increased influx of monocytes, which in turn leads to a decrease in monocyte migration to the spleen. However, it is important to note that these findings will be discussed more comprehensively in our forthcoming publication, and as such, the data pertaining to these results have not been included in the current manuscript.

      Author response image 1.

      3) Figure 3 does not definitively provide evidence that cell death is specifically occurring in splenic macrophages and the fraction of Cd11b+ cells is not changed in NN vs HH. Furthermore, the IHC of F4/80 in Fig 3U is not definitive as cells can express F4/80 more or less brightly and no negative/positive controls are shown for this panel.

      We appreciate your insightful comments and critiques regarding Figure 3. We acknowledge that the figure, as presented, does not definitively demonstrate that cell death is specifically occurring in splenic macrophages. While it is challenging to definitively determine the occurrence of cell death in macrophages based solely on Figure 3D-F, our single-cell analysis provides strong evidence that such an event occurs. We initially observed cell death within the spleen under hypobaric hypoxia (HH) conditions, and to discern the precise cell type involved, we conducted single-cell analyses. Regrettably, we did not articulate this clearly in our preliminary manuscript.

      In the revised version, we have modified the sequence of Figure 3A-C and Figure 3D-F for better clarity. In addition, we observed a significant decrease in the fraction of F4/80hiCD11bhi macrophages under HH conditions compared to NN. To make the changes in CD86 and CD206 more evident, we have converted these scatter plots into histograms in our revised manuscript.

      Author response image 2.

      Considering the limitations of F4/80 as a conclusive macrophage identifier, we have concurrently presented the immunohistochemical (IHC) analyses of heme oxygenase-1 (HO-1). Functioning as a macrophage marker, particularly in cells involved in iron metabolism, HO-1 offers additional diagnostic accuracy. Observations from both F4/80 and HO-1 staining suggested a primary localization of positively stained cells within the splenic red pulp. Following exposure to hypobaric hypoxia (HH) conditions, a decrease was noted in the expression of both F4/80 and HO-1. This decrease implies that HH conditions contribute to a reduction in macrophage population and impede the iron metabolism process. In the revised version of our manuscript, we have enhanced the clarity of Figure 3U to illustrate the presence of positive staining, with an emphasis on HO-1 staining, which is predominantly observed in the red pulp.

      Author response image 3.

      4) The phagocytic function of splenic red pulp macrophages relative to infection cannot be used directly to understand erythrophagocytosis. The standard approach is to use opsonized RBCs in vitro. Furthermore, RBC survival is a standard method to assess erythrophagocytosis function. In this method, biotin is injected via tail vein directly and small blood samples are collected to measure the clearance of biotinylation by flow; kits are available to accomplish this. Because the method is standard, Fig 4D is not necessary and Fig 4E needs to be performed only in blood by sampling mice repeatedly and comparing the rate of biotin decline in HH with NN (not comparing 7 d with 14 d).

      We appreciate your insightful comments and suggestions. We concur that the phagocytic function of splenic red pulp macrophages in the context of infection may not be directly translatable to understanding erythrophagocytosis. Given our assessment that the use of Cy5.5-labeled E. coli alone may not be sufficient to accurately evaluate the phagocytic function of macrophages, we extended our study to include the use of NHS-biotin-labeled RBCs to assess phagocytic capabilities. While the presence of biotin-labeled RBCs in the blood could provide an indication of RBC clearance, this measure does not exclusively reflect the spleen's role in the process, as it fails to account for the clearance activities of other organs.

      Consequently, we propose that the remaining biotin-labeled RBCs in the spleen may provide a more direct representation of the organ's function in RBC clearance and sequestration. Our observations of diminished erythrophagocytosis at both 7 and 14 days following exposure to HH guided our subsequent efforts to quantify biotin-labeled RBCs in both the circulatory system and spleen. These measurements were conducted during the 7- to 14-day span following the confirmation of impaired erythrophagocytosis. Comparative evaluation of RBC clearance rates under NN and HH conditions provided further evidence supporting our preliminary observations, with the data revealing a decrease in the RBC clearance rate under HH conditions. In response to feedback from other reviewers, we have elected to exclude the phagocytic results and the diagram of the erythrocyte labeling assay. These amendments will be incorporated into the revised manuscript. The reviewers' constructive feedback has played a crucial role in refining the methodological precision and coherence of our investigation.

      5) It is unclear whether Tuftsin has a specific effect on phagocytosis of RBCs without other potential confounding effects. Furthermore, quantifying iron in red pulp splenic macrophages requires alternative readily available more quantitative methods (e.g. sorted red pulp macrophages non-heme iron concentration).

      We appreciate your comments and questions regarding the potential effect of Tuftsin on the phagocytosis of RBCs and the quantification of iron in red pulp splenic macrophages. Regarding the role of Tuftsin, we concur that the literature directly associating Tuftsin with erythrophagocytosis is scant. The work of Gino Roberto Corazza et al. does suggest a link between Tuftsin and general phagocytic capacity, but it does not specifically address erythrophagocytosis (Am J Gastroenterol, 1999;94:391-397). We agree that further investigations are required to elucidate the potential confounding effects and to ascertain whether Tuftsin has a specific impact on the phagocytosis of RBCs. Concerning the quantification of iron in red pulp splenic macrophages, we acknowledge your suggestion to employ readily available and more quantitative methods. We have incorporated additional Fe2+ staining in the spleen at two time points: 7 and 14 days after HH exposure (refer to the following figure). The resulting data reveal increased deposition of Fe2+ within the red pulp, as shown in Figure 5 (panels L and M) and Figure S1 (panels L and M).

      Author response image 4.

      6) In Fig 5, PBMCs are not thought to represent splenic macrophages and, although of some interest, do not contribute significantly to the conclusions regarding splenic macrophages at the heart of the current work. The data is also in the wrong direction, namely providing evidence that PBMCs are relatively iron poor which is not consistent with ferroptosis which would increase cellular iron.

      We appreciate your insightful critique regarding Figure 5 and the interpretation of our data on peripheral blood mononuclear cells (PBMCs) in relation to splenic macrophages. We understand that PBMCs do not directly represent splenic macrophages, and we agree that any conclusions drawn from PBMCs must be considered with caution when discussing the behavior of splenic macrophages.

      The primary rationale for incorporating PBMCs into our study was to investigate the potential correspondence between their gene expression changes and those observed in the spleen after HH exposure. This was posited as a working hypothesis for further exploration rather than a conclusive statement. The gene expression in PBMCs was congruent with changes in the spleen's gene expression, demonstrating an iron deficiency phenotype, ostensibly due to the mobilization of intracellular iron for hemoglobin synthesis. Thus, it is plausible that NCOA4 may facilitate iron mobilization through the degradation of ferritin, releasing stored iron.

      It remains ambiguous whether ferroptosis was initiated in the PBMCs during our study. Ferroptosis primarily occurs as a response to an increase in Fe2+ rather than an overall increase in intracellular iron. Our preliminary proposition was that relative changes in gene expression in PBMCs could potentially mirror corresponding changes in protein expression in the spleen, thereby potentially indicating alterations in iron processing capacity post-HH exposure. However, we fully acknowledge that this is a conjecture requiring further empirical substantiation or clinical validation.

      7) Tfr1 increase is typically correlated with cellular iron deficiency while ferroptosis consistent with iron loading. The direction of the changes in multiple elements relevant to iron trafficking is somewhat confusing and without additional evidence, there is little confidence that the authors have reached the correct conclusion. Furthermore, the results here are analyses of total spleen samples rather than specific cells in the spleen.

      We appreciate your astute comments and agree that the observed increase in transferrin receptor (TfR) expression, typically associated with cellular iron deficiency, appears contradictory to the expected iron-loading state associated with ferroptosis. We understand that this apparent contradiction might engender some uncertainty about our conclusions. In our investigation, we evaluated total spleen samples as opposed to distinct cell types within the spleen, a factor that could have contributed to the seemingly discordant findings. An important point to bear in mind is the presence of immature RBCs in the spleen, particularly within the hematopoietic islands, where these immature RBCs cluster around nurse macrophages. These immature RBCs contain abundant TfR, which is needed for iron uptake and hemoglobin synthesis. These cells, which are difficult to eliminate via perfusion, might have played a role in the observed upregulation in TfR expression, especially in the aftermath of HH exposure. Our further research revealed that the expression of TfR in macrophages diminished following hypoxic conditions, thereby suggesting that the elevated TfR expression in tissue samples may predominantly originate from other cell types, especially immature RBCs (refer to Author response image 5).

      Author response image 5.

      Reviewer #2 (Public Review):

      The authors aimed at elucidating the development of high altitude polycythemia which affects mice and men staying in the hypoxic atmosphere at high altitude (hypobaric hypoxia; HH). HH causes increased erythropoietin production which stimulates the production of red blood cells. The authors hypothesize that increased production is only partially responsible for exaggerated red blood cell production, i.e. polycythemia, but that decreased erythrophagocytosis in the spleen contributes to high red blood cell counts.

      The main strength of the study is the use of a mouse model exposed to HH in a hypobaric chamber. However, not all of the reported results are convincing due to some smaller effects which one may doubt to result in the overall increase in red blood cells as claimed by the authors. Moreover, direct proof for reduced erythrophagocytosis is compromised due to a strong spontaneous loss of labelled red blood cells, although effects of labelled E. coli phagocytosis are shown. Their discussion addresses some of the unexpected results, such as the reduced expression of HO-1 under hypoxia but due to the above-mentioned limitations much of the discussion remains hypothetical.

      Thank you for your valuable feedback and insight. We appreciate the recognition of the strength of our study model, the exposure of mice to hypobaric hypoxia (HH) in a hypobaric animal chamber. We also understand your concerns about the smaller effects and their potential impact on the overall increase in red blood cells (RBCs), as well as the apparent reduced erythrophagocytosis due to the loss of labelled RBCs.

      Polycythemia under HH has been attributed predominantly to amplified erythropoiesis. The focus of our research was to highlight that compromised erythrophagocytosis may accelerate high-altitude polycythemia (HAPC). Considering the spontaneous loss of labelled RBCs in vivo, we assessed RBC clearance at 7 and 14 days of HH exposure and then compared the clearance rate over the interval from day 7 to day 14, since erythrophagocytosis was clearly impaired at both of these time points in our study. This approach was designed to negate the effects of spontaneous loss of labelled RBCs in both NN and HH conditions. Correspondingly, the results from blood and spleen analyses corroborated a decline in the RBC clearance rate under HH compared with NN conditions.
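
      The interval-based comparison described above can be sketched as follows. This is only a minimal illustration of the arithmetic, not the analysis used in the study: the function name `clearance_rate_per_day` and every percentage are hypothetical placeholders, and the sketch assumes that spontaneous label loss affects the NN and HH groups equally.

```python
def clearance_rate_per_day(pct_day7, pct_day14, days=7):
    """Average loss of labelled RBCs (percentage points per day) between two time points."""
    return (pct_day7 - pct_day14) / days


# Hypothetical percentages of labelled RBCs still detectable (NOT study data).
nn_rate = clearance_rate_per_day(pct_day7=80.0, pct_day14=59.0)
hh_rate = clearance_rate_per_day(pct_day7=82.0, pct_day14=68.0)

# If spontaneous label loss is the same in both groups (an assumption),
# it cancels out when the two interval rates are subtracted.
rate_difference = nn_rate - hh_rate

print(f"NN: {nn_rate:.1f} points/day, HH: {hh_rate:.1f} points/day, "
      f"difference: {rate_difference:.1f} points/day")
```

      A positive difference in this sketch would correspond to slower clearance under HH than under NN over the same interval.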

      Apart from the E. coli phagocytosis and labeled RBC experiments (these results were removed in the revision), the injection of Tuftsin further substantiated the impairment of erythrophagocytosis in the HH spleen, as evidenced by the observed decrease in iron within the red pulp of the spleen post-perfusion. Furthermore, to validate our findings, we added RBC staining in splenic cells at 7 and 14 days of HH exposure, which confirmed impaired erythrophagocytosis (new Figure 4E).

      Author response image 6.

      As for the reduced expression of heme oxygenase-1 (HO-1) under hypoxia, we agree that this was an unexpected result, and we are in the process of further exploring the underlying mechanisms. It is possible that there are other regulatory pathways at play that are yet to be identified. However, we believe that by offering possible interpretations of our data and potential directions for future research, we contribute to the ongoing scientific discourse in this area.

      Reviewer #3 (Public Review):

      The manuscript by Yang et al. investigated in mice how hypobaric hypoxia can modify the RBC clearance function of the spleen, a concept that is of interest. Via interpretation of their data, the authors proposed a model that hypoxia causes an increase in cellular iron levels, possibly in RPMs, leading to ferroptosis, and downregulates their erythrophagocytic capacity. However, most of the data is generated on total splenocytes/total spleen, and the conclusions are not always supported by the presented data. The model of the authors could be questioned by the paper by Youssef et al. (which the authors cite, but in an unclear context) that the ferroptosis in RPMs could be mediated by augmented erythrophagocytosis. As such, the loss of RPMs in vivo, which is indeed clear in the histological section shown (and is a strong and interesting finding), may not be directly caused by hypoxia, but by enhanced RBC clearance. Such a possibility should be taken into account.

      Thank you for your insightful comments and constructive feedback. Youssef et al. (2018) found that elevated erythrophagocytosis of stressed red blood cells (RBCs) triggers ferroptosis in red pulp macrophages (RPMs) within the spleen, as shown in a mouse model of transfusion. This increase in erythrophagocytosis was evident five hours after injection of RBCs. In contrast, our study showed a decrease in erythrophagocytosis in the spleen after both 7 and 14 days of HH exposure.

      Typically, macrophages exhibit an enhanced phagocytic capacity immediately after stress or stimulation. The time points in our study, however, were considerably later (7 and 14 days). It is currently unclear whether phagocytic capacity is increased during the acute phase of HH exposure, especially on the first day. Spleen contraction on the day after the onset of HH releases stored RBCs into the bloodstream; whether this initial reaction leads to ferroptosis, with the capacity to phagocytose RBCs subsequently weakened after 7 or 14 days of sustained HH, remains to be determined.

      Major points:

      1) The authors present data from total splenocytes and then relate the obtained data to RPMs, which are quantitatively a minor population in the spleen. Eg, labile iron is increased in the splenocytes upon HH, but the manuscript does not show that this occurs in the red pulp or RPMs. They also measure gene/protein expression changes in the total spleen and connect them to changes in macrophages, as indicated in the model Figure (Fig. 7). HO-1 and levels of Ferritin (L and H) can be attributed to the drop in RPMs in the spleen. Are any of these changes preserved cell-intrinsically in cultured macrophages? This should be shown to support the model (relates also to lines 487-88, where the authors again speculate that hypoxia decreases HO-1 which was not demonstrated). In the current stage, for example, we do not know if the labile iron increase in cultured cells and in the spleen in vivo upon hypoxia is the same phenomenon, and why labile iron is increased. To improve the manuscript, the authors should study specifically RPMs.

      We are grateful for your perceptive remarks. In our initial manuscript, we did not evaluate labile iron within the red pulp and red pulp macrophages (RPMs). To address this oversight, we used the Lillie staining method, following the protocol of Liu et al. (Chemosphere, 2021, 264(Pt 1):128413), to detect Fe2+ within these regions. The results were consistent with our earlier Western blot and flow cytometry findings in the spleen, corroborating an increase in labile iron specifically within the red pulp of the spleen.

      Author response image 7.

      However, we acknowledge that additional experiments are needed to further validate these findings. We also examined the expression of heme oxygenase-1 (HO-1) and iron-related proteins, including transferrin receptor (TfR), ferroportin (Fpn), ferritin (Ft), and nuclear receptor coactivator 4 (NCOA4), in primary macrophages subjected to 1% hypoxic conditions, both with and without hemoglobin treatment. The expression of ferroptosis-related proteins was consistent with the in vivo results, whereas the expression of iron-related proteins differed between the in vitro and in vivo settings. This suggests that the increases in labile iron in cultured cells and in the spleen in vivo upon hypoxia are not identical phenomena. The precise mechanism, however, remains elusive.

      In our study, we observed a decrease in HO-1 protein expression following 7 and 14 days of HH exposure, as shown in Figure 3U, 5A, and S1A. This finding contradicts previous research that identified HO-1 as a hypoxia-inducible factor (HIF) target under hypoxic conditions (P J Lee et al., 1997). Our discussion, therefore, addressed the potential discrepancy in HO-1 expression under HH. According to our findings, HO-1 regulation under HH appears to be predominantly influenced by macrophage numbers and the RBCs to be processed in the spleen or macrophages, rather than by hypoxia alone.

      It is challenging to discern whether the increased labile iron observed in vitro accurately reflects the in vivo phenomenon, as replicating in vitro the iron requirements for RBC production induced by HH is inherently difficult. However, by integrating our in vivo and in vitro studies, we determined that the elevated Fe2+ levels were not dependent on HO-1 protein expression, as HO-1 levels increased in vitro but decreased in vivo under hypoxic/HH exposure.

      Author response image 8.

      2) The paper uses flow cytometry, but how this method was applied is suboptimal: there are no gating strategies, no indication if single events were determined, and how cell viability was assessed, which are the parent populations when % of cells is shown on the graphs. How RBCs in the spleen could be analyzed without dedicated cell surface markers? A drop in splenic RPMs is presented as the key finding of the manuscript but Fig. 3M shows gating (suboptimal) for monocytes, not RPMs. RPMs are typically F4/80-high, CD11b-low (again no gating strategy is shown for RPMs). Also, the authors used single-cell RNAseq to detect a drop in splenic macrophages upon HH, but they do not indicate in Fig. A-C which cluster of cells relates to macrophages. Cell clusters are not identified in these panels, hence the data is not interpretable.

      Thank you for your comments and constructive critique regarding our flow cytometry methodology and presentation. We understand the need for greater transparency and detailed explanation of our procedures, and we acknowledge that the lack of gating strategies and other pertinent information in our initial manuscript may have affected the clarity of our findings.

      In our initial report, we provided an overview of the decline in migrated macrophages (F4/80hiCD11bhi), including both M1 and M2 expression in migrated macrophages, as illustrated in Figure 3, but did not specifically address the changes in red pulp macrophages (RPMs). In the original data, it was difficult to distinguish CD11b- from CD11blo cells. We therefore reanalyzed the data to identify F4/80hiCD11blo cells, and the results are now included in the revised manuscript (Figure 3M). However, single-cell in vivo analysis may identify the specific cell types that decrease after exposure to HH more accurately.

      Author response image 9.

      Furthermore, we substantiated the reduction in red pulp, as evidenced by Figure 4J, given that iron processing primarily occurs within the red pulp. In Figure 3, our initial objective was merely to illustrate the reduction in total macrophages in the spleen following HH exposure.

      To further clarify the characterization of various cell types, we conducted a single-cell analysis. Our findings indicated that clusters 0, 1, 3, 4, 14, 18, and 29 represented B cells; clusters 2, 10, 12, and 28 represented T cells; clusters 15 and 22 corresponded to NK cells; clusters 5, 11, 13, and 19 represented NKT cells; clusters 6, 9, and 24 represented cell cycle cells; clusters 26 and 17 represented plasma cells; clusters 21 and 23 represented neutrophils; cluster 30 represented erythrocytes; and clusters 7, 8, 16, 20, 24, and 27 represented dendritic cells (DCs) and macrophages, as depicted in Figure 3E.
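
      The cluster assignments listed above can be written down as a lookup table and checked for consistency. The sketch below is illustrative only: the dictionary is transcribed from the paragraph above, the helper `check_annotation` is a hypothetical name, and the assumption that clusters are numbered 0 to 30 without gaps is ours rather than something stated in the response.

```python
from collections import defaultdict

# Cluster-to-cell-type assignments transcribed from the paragraph above.
ANNOTATION = {
    "B cells": [0, 1, 3, 4, 14, 18, 29],
    "T cells": [2, 10, 12, 28],
    "NK cells": [15, 22],
    "NKT cells": [5, 11, 13, 19],
    "cell cycle cells": [6, 9, 24],
    "plasma cells": [26, 17],
    "neutrophils": [21, 23],
    "erythrocytes": [30],
    "DCs and macrophages": [7, 8, 16, 20, 24, 27],
}


def check_annotation(annotation, n_clusters):
    """Return clusters assigned to more than one label and clusters with no label."""
    owners = defaultdict(list)
    for cell_type, clusters in annotation.items():
        for cluster in clusters:
            owners[cluster].append(cell_type)
    duplicated = {c: types for c, types in owners.items() if len(types) > 1}
    missing = [c for c in range(n_clusters) if c not in owners]
    return duplicated, missing


# Assumption: clusters are numbered 0-30 (31 clusters in total).
duplicated, missing = check_annotation(ANNOTATION, n_clusters=31)
print("assigned to more than one cell type:", duplicated)
print("not assigned to any cell type:", missing)
```

      Under that numbering assumption, the check flags cluster 24 as listed under two labels and cluster 25 as unlabeled, which may simply reflect how the list was written out in this response.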

      3) The authors draw conclusions that are not supported by the data, some examples: a) they cannot exclude eg the compensatory involvement of the liver in the RBCs clearance (the differences between HH sham and HH splenectomy is mild in Fig. 2 E, F and G).

      Thank you for your insightful comments and for pointing out the potential involvement of other organs, such as the liver, in the RBC clearance under HH conditions. We concur with your observation that the differences between the HH sham and HH splenectomy conditions in Fig. 2 E, F, and G are modest. This could indeed suggest a compensatory role of other organs in RBC clearance when splenectomy is performed. Our intent, however, was to underscore the primary role of the spleen in this process under HH exposure.

      In fact, after our initial investigations, we conducted a more extensive study examining the role of the liver in RBC clearance under HH conditions. Our findings, as illustrated in the figures submitted with this response, indeed support a compensatory role for the liver. Specifically, we observed an increase in macrophage numbers and phagocytic activity in the liver under HH conditions. Although the differences in RBC count between the HH sham and HH splenectomy conditions may seem minor, it is essential to consider the unit of this measurement, which is 10^12/L. Even a small numerical difference can represent a significant biological variation at this scale.

      Author response image 10.

      b) splenomegaly is typically caused by increased extramedullary erythropoiesis, not RBC retention. Why do the authors support the second possibility? Related to this, why do the authors conclude that data in Fig. 4 G,H support the model of RBC retention? A significant drop in splenic RBCs (poorly gated) was observed at 7 days, between NN and HH groups, which could actually indicate increased RBC clearance capacity = less retention.

      Prior investigations have predominantly suggested that spleen enlargement under hypoxic conditions stems from the spleen's extramedullary hematopoiesis. Nevertheless, an intriguing study conducted in 1994 by the General Hospital of Xizang Military Region reported substantial enlargement and congestion of splenic sinuses in high altitude polycythemia (HAPC) patients. This finding was based on the dissection of spleens from 12 patients with HAPC (Zou Xunda, et al., Southwest Defense Medicine, 1994;5:294-296). Moreover, a recent study indicated that extramedullary erythropoiesis reaches its zenith between 3 and 7 days (Wang H et al., 2021).

      Considering these findings, the present study postulates that hypoxia-induced inhibition of erythrophagocytosis may lead to RBC retention. However, we acknowledge that the manuscript in its current preprint form does not offer conclusive evidence for this hypothesis. To bridge this gap, we conducted further experiments in which the spleen was perfused and total cells were collected after HH exposure. These cells were then smeared onto slides and subjected to Wright staining. Our results show a clear increase in deformation and retention of RBCs in the spleen following 7 and 14 days of HH exposure. This finding strengthens our initial hypothesis and contributes a novel perspective to the understanding of splenic responses under hypoxic conditions.

      Author response image 11.

      c) lines 452-54: there is no data for decreased phagocytosis in vivo, especially in the context of erythrophagocytosis. This should be done with stressed RBCs transfusion assays, very good examples, like from Youssef et al. or Theurl et al. are available in the literature.

      Thanks. In their seminal work, Youssef and colleagues demonstrated that the transfusion of stressed RBCs triggers erythrophagocytosis and subsequently induces ferroptosis in red pulp macrophages (RPMs) within five hours. Given these observations, the applicability of this model for evaluating macrophage phagocytosis in the spleen or RPMs under HH conditions may be limited, as HH has already induced erythropoiesis in vivo. In addition, it is unclear whether the membrane characteristics of stress-induced RBCs are similar to those of HH-induced RBCs, which is an important signal for phagocytosis in vivo. We therefore currently lack sufficient knowledge to discern whether changes in phagocytosis would be caused by the presence of stressed RBCs or by HH-induced changes in macrophages in vivo. Nonetheless, we appreciate the potential value of this approach. Distinguishing the effects of stressed RBCs from those of HH on macrophage phagocytosis is an intriguing line of inquiry that could yield significant insights into the underlying mechanisms, and we intend to investigate it in future studies.

      d) Line 475 - ferritinophagy was not shown in response to hypoxia by the manuscript, especially that NCOA4 is decreased, at least in the total spleen.

      Research published in eLife in 2015 established that ferritinophagy, facilitated by nuclear receptor coactivator 4 (NCOA4), is indispensable for erythropoiesis. This process is modulated by iron-dependent proteolysis mediated by HECT and RLD domain containing E3 ubiquitin protein ligase 2 (HERC2) (Joseph D Mancias et al., eLife. 2015; 4: e10308). NCOA4 plays a critical role in directing ferritin (Ft) to the lysosome, where both NCOA4 and Ft undergo coordinated degradation. In our study, we provide evidence that exposure to HH stimulates erythropoiesis (Figure 1). We propose that this, in turn, could promote ferritinophagy via NCOA4, resulting in a decrease in NCOA4 protein levels after HH exposure. We will perform additional experiments to verify this point. This interpretation aligns with the established understanding of ferritinophagy and erythropoiesis and adds a new dimension to the understanding of cellular responses to hypoxic conditions.

      4) In a few cases, the authors show only representative dot plots or histograms, without quantification for n>1. In Fig. 4B the authors write about a significant decrease (although with n=1 no statistics could be applied here; of note, it is not clear what kind of samples were analyzed here). Another example is Fig. 6I. In this case, it is even more important as the data conflict with the cited article and the new one: PMCID: PMC9908853 which shows that hypoxia stimulates efferocytosis. Sometimes the manuscript claims that some changes are observed, although they are not visible in representative figures (eg for M1 and M2 macrophages in Fig. 3M).

      We recognize that our initial portrayal of Figure 4B was lacking in precision, given that it did not include the corresponding statistical graph. While our results demonstrated a significant reduction in the ability to phagocytose E. coli, in line with the recommendations of other reviewers, we have opted to remove the results pertaining to E. coli phagocytosis in this revision, as they primarily reflected immune function.

      In relation to PMC9908853, which reported metabolic adaptation facilitating enhanced macrophage efferocytosis in limited-oxygen environments, it is worth noting that the macrophages investigated in this study were derived from ER-Hoxb8 macrophage progenitors following the removal of β-estradiol. Consequently, questions arise regarding the comparability between these cultured macrophages and primary macrophages obtained fresh from the spleen post HH exposure. The characteristics and functions of these two different macrophage sources may not align precisely, and this distinction necessitates further investigation.

      5) There are several unclear issues in methodology:

      • what is the purity of primary RPMs in the culture? RPMs are quantitatively poorly represented in splenocyte single-cell suspensions. This reviewer is quite skeptical that the processing of splenocytes from approx 1 mm3 of tissue was sufficient to establish primary RPM cultures. The authors should prove that the cultured cells were indeed RPMs, not monocyte-derived macrophages or other splenic macrophage subtypes.

      Thank you for your thoughtful comments and inquiries. We apologize if this was not made clear in the original manuscript. The purity of the primary RPMs in our culture was approximately 40%, as identified by F4/80hiCD11blo markers using flow cytometry. We recognize that RPMs are typically underrepresented in splenocyte single-cell suspensions, and the concern you raise about potential contamination by other cell types is valid.

      We apologize for any ambiguities in the methodological description that may have led to misunderstandings during the review. The entire spleen is typically used for splenic macrophage culture. The size of the spleen can vary depending on the species and age of the animal, but in mice it is commonly approximately 1 cm in length. The spleen is dissected into small fragments, each approximately 1 mm3 in volume, to aid enzymatic digestion. The procedure does not use only a single 1 mm3 tissue fragment for RPM cultures. Although the isolation and culture of spleen macrophages can present considerable challenges, our method has been optimized to enhance the yield of this specific cell population.

      • (around line 183) In the description of flow cytometry, there are several missing issues. In 1) it is unclear which type of samples were analyzed. In 2) it is not clear how splenocyte cell suspension was prepared.

      1) Whole blood was extracted from the mice and collected into an anticoagulant tube, which was then set aside for subsequent thiazole orange (TO) staining.

      2) Splenic tissue was procured from the mice and subsequently processed into a single-cell suspension using a 40 μm filter. The erythrocytes within the entire sample were subsequently lysed and eliminated, and the remaining cell suspension was resuspended in phosphate-buffered saline (PBS) in preparation for ensuing analyses.

      We have meticulously revised these methodological details in the corresponding section of the manuscript to ensure clarity and precision.

      • In line 192: what does it mean: 'This step can be omitted from cell samples'?

      The intracellular divalent iron content and lipid peroxidation level were quantified as follows. Splenic tissue was first processed into a single-cell suspension, followed by lysis of RBCs. This step is unnecessary for isolated cell samples. Subsequently, 1 × 10^6 cells were incubated with 100 μL of BioTracker Far-red Labile Fe2+ Dye (1 mM, Sigma, SCT037, USA) for 1 hour or with C11-Bodipy 581/591 (10 μM, Thermo Fisher, D3861, USA) for 30 minutes. After incubation, cells were washed twice with PBS. Flow cytometric analysis was then performed, using the FL6 (638 nm/660 nm) channel to determine intracellular divalent iron content and the FL1 (488 nm/525 nm) channel to quantify the lipid peroxidation level.

      • 'TO method' is not commonly used anymore and hence it was unclear to this Reviewer. Reticulocytes should be analyzed with proper gating, using cell surface markers.

      We appreciate your observation regarding the method we used to analyze reticulocytes. We value your recommendation to use cell surface markers for gating, which is indeed a more modern and accurate approach. However, as reticulocyte identification is not the central focus of our investigation, we opted for the TO staining method because of its simplicity and the reliability of its results. We followed the published protocol (Sci Rep, 2018, 8(1):12793), primarily owing to its established use and demonstrated efficacy in reticulocyte identification.

      • The description of 'phagocytosis of E. coli and RBCs' in the Methods section is unclear and incomplete. The Results section suggests that for the biotinylated RBCs, phagocytosis? or retention? of RBCs was quantified in vivo, upon transfusion. However, the Methods section suggests an in vitro/ex vivo approach. It is vague what was indeed performed and how in detail. If RBC transfusion was done, this should be properly described. Of note, biotinylation of RBCs is typically done in vivo only, being a first step in RBC lifespan assay. Such an assay is missing in the manuscript. Also, it is not clear if the detection of biotinylated RBCs was performed in permeabilized cells (this would be required).

      Thanks for the comments. In our initial methodology, we employed Cy5.5-labeled Escherichia coli to probe phagocytic function, albeit with the understanding that this may not constitute the most ideal model for phagocytosis detection within this context (in light of recommendations from other reviewers, we have removed the E. coli phagocytosis results from this revision, as they predominantly mirror immune function). Our fundamental aim was to ascertain whether HH compromises the erythrophagocytic potential of splenic macrophages. In pursuit of this, we subsequently analyzed the clearance of biotinylated RBCs in both the bloodstream and spleen to assess phagocytic functionality in vivo.

      In the present study, instead of transfusing biotinylated RBCs into mice, we opted to inject N-Hydroxysuccinimide (NHS)-biotin into the bloodstream. NHS-biotin is capable of binding with cell membranes in vivo and can be recognized by streptavidin-fluorescein isothiocyanate (FITC) after cells are extracted from the blood or spleen in vitro. Consequently, biotin-labeled RBCs were detectable in both the blood and spleen following NHS-biotin injection for a duration of 21 days. Ultimately, we employed flow cytometry to analyze the NHS-biotin labeled RBCs in the blood or spleen. This method facilitates the detection of live cells and is not applicable to permeabilized cells. We believe this approach better aligns with our investigative goals and offers a more robust evaluation of erythrophagocytic function under hypoxic conditions.

      Recommendations for the authors: please note that you control which, if any, revisions, to undertake.

      Thank you for your comments and recommendations. We appreciate your understanding that the choice of implementing revisions ultimately rests with us. However, we also value your expertise and will seriously consider your suggestions as they can provide additional perspectives to our work and contribute to the overall quality and robustness of our study.

      We strive to produce research that meets the highest scientific standards and we believe that constructive criticism, such as yours, helps us to achieve this objective. We will carefully review your comments and consider the appropriate changes to make in order to address your concerns and improve our manuscript.

      Reviewer #1 (Recommendations For The Authors):

      Minor:

      1) HCV in text is a typo, should be HCT. Please edit.

      Thanks for the correction. We’ve revised it.

      2) Fig 2D is not useful beyond the more accurate measure of HCT in Fig 2G and should be removed.

      Thank you for your feedback and suggestion about Fig. 2D. We understand your point regarding the comparative accuracy of HCT in Fig. 2G. However, our intention in including Fig. 2D was to provide a more intuitive visual representation of the erythrocyte position levels, which we believe complements the more precise HCT data. We have observed that the erythrocyte positions significantly increased for 14 days after HH splenectomy, and this trend is visually depicted in Fig. 2D. While HCT provides a more accurate measure, Fig. 2D provides a snapshot that can be more immediately graspable, especially for readers who may prefer visual data. Nevertheless, we appreciate your perspective and will reassess whether the inclusion of Fig. 2D adds enough value to the overall understanding of our findings. If we find that it indeed does not contribute significantly, we will consider removing it in line with your suggestion.

      3) What is the purpose of performing splenectomy? It is well established that reticuloendothelial cells of the liver perform a redundant function to splenic macrophages and since these cells are not being evaluated, data following splenectomy is of limited value. Please remove or move to supplement. Alternatively, evaluate what happens in the liver in response to hypoxia. Is there an increase in erythroblasts? Is there a decrease in liver macrophages in the same way as in the spleen in non-splenectomized mice? The minimally increased HCT in hypoxic splenectomized mice (relative to non-splenectomized mice) suggests that the spleen does the primary work of clearance but not exclusively since there is still a major increase in response to hypoxia in splenectomized mice. The sentence (page 16, line 292) states that the spleen is essential which is not the case based on this data.

      Thank you for your comments and recommendations. In fact, we have also been studying the liver's response to hypobaric hypoxia (HH) exposure. The changes observed in the liver are opposite to those in the spleen, including an increase in macrophage count and in the capacity for erythrophagocytosis and heme iron processing (refer to the above figure for details).

      It is widely accepted that HH exposure predominantly induces erythropoiesis by stimulating bone marrow production. The primary objective of this study was not to refute this central mechanism behind erythrocytosis. Instead, our intent was to supplement this understanding by proposing that impaired clearance of red blood cells (RBCs) could potentially exacerbate erythrocytosis. We believe this additional perspective could significantly enhance our understanding of the complex dynamics involved in RBC production and clearance under hypoxic conditions.

      Reviewer #2 (Recommendations For The Authors):

      The following questions and remarks should be considered by the authors:

      1). The methods should clearly state whether the HH was discontinued during the 7- or 14-day exposure for cleaning, fresh water etc. Moreover, how was CO2 controlled? The procedure for splenectomy needs to be described in the methods.

      Thank you for your insightful comments and questions. We apologize for any lack of clarity in our original description. To address your questions:

      During the 7- or 14-day HH exposure, the HH was not discontinued for cleaning or providing fresh water. We ensured that the cage was thoroughly cleaned, and food and water were sufficiently stocked before placing the mice into the HH chamber. The design of the cage and the HH chamber allowed the mice to have continuous access to food and water during the entire exposure period.

      Regarding the control of CO2, the HH chamber was equipped with a CO2 scrubbing system. The system utilized soda lime to absorb excess CO2 produced by the mice, and the air inside the chamber was exchanged with the air outside 25 times per hour to maintain a stable atmospheric concentration and ensure adequate oxygen supply.

      As for the procedure for splenectomy, we apologize for the omission in the original manuscript. The mice were anesthetized using isoflurane, and a small incision was made in the left flank to expose the spleen. The spleen was then gently exteriorized, ligated, and excised. The incision was sutured, and the mice were allowed to recover under close monitoring. We ensured that all procedures were performed in accordance with our institution's guidelines for animal care.

      2) The lack of changes in MCH needs explanation? During stress erythropoiesis some limit in iron availability should cause MCH decrease particularly if the authors claim that macrophages for rapid iron recycling are decreased. Fig 1A is dispensable. Fig 1G NN control 14 days does not make sense since it is higher than 7 days of HH.

      Thank you for your insightful comments and queries. Regarding the lack of changes in Mean Corpuscular Hemoglobin (MCH), our hypothesis is that the decrease in iron recycling in the spleen following HH is potentially compensated by the increased iron absorption or supply from the liver, thus maintaining the iron requirement for erythropoiesis. This may explain why MCH levels did not significantly change after HH exposure. We have indeed observed an increase in macrophage numbers and their erythrophagocytosis/heme iron processing ability after HH exposure for 7 or 14 days in liver (please refer to the above figure for details), suggesting a compensatory mechanism to ensure adequate iron for erythropoiesis.

      Regarding your comment on Fig 1A, we included this figure to provide a baseline of the experimental condition before any treatment. However, we understand your point and will consider removing it if it does not contribute significantly to the interpretation of our results. As for Fig 1G, we agree that the control at 14 days being higher than 7 days of HH may seem counterintuitive. We believe this could be due to individual variations among the mice or potential experimental errors. However, considering recommendations from other reviewers, we have removed this result from the revised manuscript.

      3) Fig 2, the difference between sham and splenectomy is really marginal and not convincing. Is there also a difference at 7 days? Why does the spleen size decrease between 7 and 14 days?

      We understand your concerns regarding the observed differences in Fig. 2 between sham and splenectomy groups. We acknowledge that while the absolute numerical differences may appear marginal, it is important to consider the unit of measurement. In the case of RBC count, the unit is 10^12/L, hence even slight numerical differences can translate to significant variations in the actual count of RBCs.
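      As a minimal sketch of the unit argument above, the following converts a difference between two RBC counts reported in 10^12/L into absolute cells per litre. The two counts and both function names are illustrative assumptions, not data or code from the study.

```python
# RBC counts are reported in units of 10^12 cells per litre.
UNIT_CELLS_PER_LITRE = 10**12


def absolute_difference_per_litre(count_a, count_b):
    """Convert a difference between two RBC counts (in 10^12/L) to cells per litre."""
    return abs(count_a - count_b) * UNIT_CELLS_PER_LITRE


def cells_per_microlitre(count):
    """Convert an RBC count in 10^12/L to cells per microlitre (1 L = 10^6 uL)."""
    return count * UNIT_CELLS_PER_LITRE / 10**6


# Illustrative values only (hypothetical, not measurements from the manuscript).
sham, splenectomy = 10.0, 10.5
diff = absolute_difference_per_litre(sham, splenectomy)
print(f"Difference: {diff:.1e} cells per litre")
```

      The conversion only rescales the numbers; it is shown here to make the size of the reporting unit explicit.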

      We did not examine alterations occurring 7 days post-splenectomy in our study. The discernible trend of spleen size diminution between the 7th and 14th days is indeed compelling. It is plausible that this might be attributable to the body's adaptive response to hypobaric hypoxia (HH) exposure, wherein spleen size initially enlarges (at day 7) in response to compensatory erythropoiesis, followed by a reduction (at day 14) as the body acclimatizes to the HH conditions. Nevertheless, we did not identify a statistically significant difference between the measurements at day 7 and day 14, suggesting that this observation warrants further scrutiny.

      4) Fig 3B, the clusters should be explained in detail. If the decrease in macrophages in Fig 3K/L is responsible for the effect, why does splenectomy not have a much stronger effect? How do the authors know which cells died in the calcein stained population in Fig 3D?

      Thank you for your insightful queries and comments. Regarding Fig. 3B, we apologize for not providing sufficient detail on the clusters in the original manuscript. We will ensure that we include a comprehensive explanation of the clusters, including the specific cell types and their respective markers, in our revision. The cluster assignments are as follows: clusters 0, 1, 3, 4, 14, 18, and 29 represented B cells; clusters 2, 10, 12, and 28 represented T cells; clusters 15 and 22 corresponded to NK cells; clusters 5, 11, 13, and 19 represented NKT cells; clusters 6, 9, and 24 represented cell cycle cells; clusters 26 and 17 represented plasma cells; clusters 21 and 23 represented neutrophils; cluster 30 represented erythrocytes; and clusters 7, 8, 16, 20, 24, and 27 represented dendritic cells (DCs) and macrophages.

      As for the decrease in macrophages observed in Fig. 3K/L, it's important to note that the spleen is a complex organ comprising numerous cell types, all of which can contribute to its overall function. While macrophages play a crucial role in iron recycling and erythropoiesis, other cell types and factors may also influence these processes. Therefore, while splenectomy results in the removal of all splenic cells, the overall impact on these processes may not be as pronounced as the specific reduction in macrophages due to compensatory mechanisms from other tissues and cells.

      Concerning Fig. 3D, we acknowledge the ambiguity in the initial interpretation. The calcein staining was utilized to determine cell viability, but it doesn't identify the specific cell types that have died. To address this, we performed a single-cell analysis, which can provide a more accurate identification of the specific cell types affected.

      5) Is the reduced phagocytic capacity in Fig4B significant? Erythrophagocytosis is compromised due to the considerable spontaneous loss of labelled erythrocytes; could other assays help? (potentially by a modified Chromium release assay?). Is it necessary to stimulate phagocytosis to see a significant effect?

      We express our gratitude for your insightful queries and recommendations. In response to your initial question, the observed reduction in phagocytic capacity illustrated in Fig. 4B was indeed statistically significant. However, in alignment with feedback from other reviewers, we have elected to exclude the phagocytic results from this revised manuscript, as they predominantly reflect immune function rather than erythrophagocytosis of macrophages.

      With respect to your proposal of potential alternatives to the erythrophagocytosis assay, we concur that the spontaneous loss of labeled erythrocytes could have influenced our results. Your suggestion of implementing a modified Chromium release assay is indeed an intriguing possibility that warrants further exploration.

      Regarding the requirement for stimulating phagocytosis, we employed stimulation as a mechanism to investigate the potential for augmenting erythrophagocytosis and iron processing within the red pulp. Our findings suggest that increased phagocytosis in the spleen contributes positively to these processes. As part of the Tuftsin injection experiment, we assessed the RBC count and hemoglobin content. Despite an observed reduction trend, there were no statistically significant alterations. We are uncertain if the observation period was insufficiently long. Nevertheless, we concur that it would be worthwhile to explore inherent changes without external stimulation, and we will take this into consideration in our future research.

      6) Can the observed ferroptosis be influenced by bi- and not trivalent iron chelators?

      Thank you for your insightful question. Indeed, the role of iron chelators in the observed ferroptosis is an important aspect to explore. Ferroptosis is a form of regulated cell death characterized by an iron-dependent accumulation of lipid peroxides, and the role of different iron chelators could potentially influence this process.

      In the case of bi- versus trivalent iron chelators, their influence on ferroptosis could be distinct due to their specificities for different forms of iron. However, we have not yet investigated this in our current study.

      Your suggestion has highlighted a valuable direction for our future research. We agree that examining the influence of bi- and trivalent iron chelators on the observed ferroptosis would provide a deeper understanding of the iron-dependent mechanisms involved in this process. We will consider this important aspect in our subsequent investigations.

      Reviewer #3 (Recommendations For The Authors):

      Methodology:

      1) Several syntax and grammatical errors, and unclear phrasing. Some factual errors as well: eg, line 380-81 the authors wrote that hypoxia increased viable cell numbers and phagocytosis ability, although their data suggest the opposite. Lines in Discussion 454-55 and in the Results 346-47 convey opposite messages.

      We appreciate your attention to detail and your feedback on the language and factual discrepancies within the manuscript.

      Upon revisiting lines 380-381, we would like to clarify that we had made a mistake. Our data indeed suggest that hypoxia led to a reduction in viable cell numbers and phagocytosis ability, not an increase as originally stated. We sincerely apologize for the confusion and will correct this statement in our revised manuscript.

      As for the opposing messages between lines 454-455 in the Discussion and 346-347 in the Results, we apologize for any confusion caused. We understand that it is crucial to maintain consistent interpretation of our data throughout the manuscript. We will carefully reevaluate these sections and adjust our phrasing to ensure that our interpretations accurately reflect our results.

      2) It is not clear why the authors investigated CD47 expression.

      Thank you for your question regarding our investigation of CD47 expression. CD47, also known as integrin-associated protein, is ubiquitously expressed on many cell types, including red blood cells (RBCs). In the context of our study, we used CD47 expression as an indicator of young RBCs, as CD47 is known to be highly expressed on newly produced RBCs. Our intention was to use CD47 positive cells as a proxy for new RBC production, which would give us insights into erythropoiesis under hypobaric hypoxia conditions. This marker thus provides valuable information about the rate and effectiveness of erythropoietic response to hypoxic stress. However, following other reviewers’ suggestions, we removed this part of the results from the revised manuscript.

      Minor:

      1) Y axis is often labeled without sufficient detail.

      2) The legends do not specify the exact statistical tests.

      3) Some in vivo exp contain n=3 which is relatively low for mouse-based studies.

      Some suggestions for the text:

      Line 60: is the main cause of erythrocytosis which in turn alleviates..

      62-66 - argumentation is not clear/grammatically correct and should be rephrased (e.g., "RBC homeostasis is disturbed and never formed into a homeostasis status" - "homeostasis.. is never formed into a homeostasis status" sounds incorrect).

      Ref # 8 - does not fit, I assume this was a mistake and the authors aimed to cite a Review article by Slusarczyk and Mleczko-Sanecka in Genes. However, this reference seems appropriate to be discussed in the Discussion section as it is very directly connected to the content of the present manuscript

      76-78 - unclear/incomplete sentence (binding of iron to Tf and Tf-Fe delivery to the erythroid compartment is missing in this sentence, please, rephrase)

      80 - iron is not stored ON FtL

      90 - should be written: important role in iron recycling from RBCs

      94 - phrasing 'damage of erythrophagocytosis' is incorrect

      96-97 - should be written, for example: 'followed by eryptosis and iron recycling defects in the spleen'

      282 - the sentence is grammatically incorrect and unclear.

      292-94 - the statement is completely unclear, what can 'inhibit the excessive proliferation of RBCs'? What does it mean?

      Reference to tuftsin was not provided (Am J Gastroenterol, 1999;94:391-397; PLoS One. 2012;7(4):e34933)

      How was quantification of microscopy images for the F4/80 signal performed?

      In Figure 5, more explanation is required for the readers regarding the measured genes/proteins - why does the pattern of gene expression changes suggest ferroptosis?

      Writing that ferroptosis INHIBITS phagocytosis is incorrect

      Line 460 is unclear

      468 - erythrocytophagy is not a commonly used term.

      We are grateful for your keen eye and the time you have taken to provide such thorough feedback. It will undoubtedly help us to significantly enhance the clarity and completeness of our research. We have modified the corresponding sections in our manuscript to include these details. The comments have helped us ensure that our methodology is transparent and our findings are presented clearly. We have taken all your comments into consideration in our revision. We have also revised our manuscript to discuss these alternative interpretations more clearly and to acknowledge the potential limitations of our data.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study, utilizing CITE-Seq to explore CML, is considered a useful contribution to our understanding of treatment response. However, the reviewers express concern about the incomplete evidence due to the small sample size and recommend addressing these limitations. Strengthening the study with additional patient samples and validation measures would enhance its significance.

      We thank the editors for the assessment of our manuscript. In view of the comments of the three reviewers, we have increased the number of CML patient samples analyzed to confirm all the major findings included in the manuscript. In total, more than 80 patient samples across different approaches have now been analyzed and incorporated in the revised manuscript.

      To the best of our knowledge, this is the first single cell multiomics report in CML and differs substantially from the recent single cell omics-based reports where single modalities were measured one at a time (Krishnan et al., 2023; Patel et al., 2022). Thus, the sc-multiomic investigation of LSCs and HSCs from the same patient addresses a major gap in the field towards managing efficacy and toxicity of TKI treatment by enumerating CD26+CD35- LSCs and CD26-CD35+ HSCs burden and their ratio at diagnosis vs. 3 months of therapy. The findings suggest design of a simpler and cheaper FACS assay to simultaneously stratify CML patients for TKI efficacy as well as hematologic toxicity.

      Reviewer 1:

      Summary:

      This manuscript by Warfvinge et al. reports the results of CITE-seq to generate single-cell multi-omics maps from BM CD34+ and CD34+CD38- cells from nine CML patients at diagnosis. Patients were retrospectively stratified by molecular response after 12 months of TKI therapy using European Leukemia Net (ELN) recommendations. They demonstrate heterogeneity of stem and progenitor cell composition at diagnosis, and show that compared to optimal responders, patients with treatment failure after 12 months of therapy demonstrate increased frequency of molecularly defined primitive cells at diagnosis. These results were validated by deconvolution of an independent previously published dataset of bulk transcriptomes from 59 CML patients. They further applied a BCR-ABL-associated gene signature to classify primitive Lin-CD34+CD38- stem cells as BCR:ABL+ and BCR:ABL-. They identified variability in the ratio of leukemic to non-leukemic primitive cells between patients, showed differences in the expression of cell surface markers, and determined that a combination of CD26 and CD35 cell surface markers could be used to prospectively isolate the two populations. The relative proportion of CD26-CD35+ (BCR:ABL-) primitive stem cells was higher in optimal responders compared to treatment failures, both at diagnosis and following 3 months of TKI therapy.

      Strengths:

      The studies are carefully conducted and the results are very clearly presented. The data generated will be a valuable resource for further studies. The strengths of this study are the application of single-cell multi-omics using CITE-Seq to study individual variations in stem and progenitor clusters at diagnosis that are associated with good versus poor outcomes in response to TKI treatment. These results were confirmed by deconvolution of a historical bulk RNAseq data set. Moreover, they are also consistent with a recent report from Krishnan et al. and are a useful confirmation of those results. The major new contribution of this study is the use of gene expression profiles to distinguish BCR-ABL+ and BCR-ABL- populations within CML primitive stem cell clusters and then applying antibody-derived tag (ADT) data to define molecularly identified BCR:ABL+ and BCR-ABL- primitive cells by expression of surface markers. This approach allowed them to show an association between the ratio of BCR-ABL+ vs BCR-ABL- primitive cells and TKI response and study dynamic changes in these populations following short-term TKI treatment.

      Weaknesses:

      One of the limitations of the study is the small number of samples employed, which is insufficient to make associations with outcomes with confidence. Although the authors discuss the potential heterogeneity of primitive stem, they do not directly address the heterogeneity of hematopoietic potential or response to TKI treatment in the results presented. Another limitation is that the BCR-ABL + versus BCR-ABL- status of cells was not confirmed by direct sequencing for BCR-ABL. The BCR-ABL status of cells sorted based on CD26 and CD35 was evaluated in only two samples. We also note that the surface markers identified were previously reported by the same authors using different single-cell approaches, which limits the novelty of the findings. It will be important to determine whether the GEP and surface markers identified here are able to distinguish BCR-ABL+ and BCR-ABL- primitive stem cells later in the course of TKI treatment. Finally, although the authors do describe differential gene expression between CML and normal, BCR:ABL+ and BCR:ABL-, primitive stem cells they have not as yet taken the opportunity to use these findings to address questions regarding biological mechanisms related to CML LSC that impact on TKI response and outcomes.

      Reviewer #1 (Recommendations For The Authors):

      Minor comment: Fig 4 legend -E and F should be C and D.

      We thank the reviewer for positive assessment of our work. Here, we highlight the updates in the revised manuscript considering the feedback received.

      Minor comment: Fig 4 legend -E and F should be C and D.

      We have edited the revised manuscript accordingly.

      One of the limitations of the study is the small number of samples employed, which is insufficient to make associations with outcomes with confidence.

      Although we performed CITE-seq for 9 CML patient samples at diagnosis, we extended our investigations to include additional samples (e.g., large-scale deconvolution analysis of samples, Fig 3 C-E, qPCR for BCR::ABL1 status, Fig. 6A, and the ratio between CD35+ and CD26+ populations at diagnosis and during TKI therapy, Fig. 6C-D) as described in the manuscript.

      In comparison to scRNA-seq, multiomic CITE-seq involves preparation and sequencing of separate libraries corresponding to RNA and ADTs, making it even more resource-demanding and limiting our capacity to process an extensive number of patient samples. To confirm our findings in a larger cohort, we therefore adopted a computational deconvolution approach, CIBERSORT, to analyze a larger number of independent samples (n=59). This reflects a growing, sustainable trend to study larger numbers of patients in the face of still prohibitively expensive but potentially insightful sc-omics approaches (for example, please see Zeng et al., A cellular hierarchy framework for understanding heterogeneity and predicting drug response in acute myeloid leukemia, Nature Medicine, 2022).

      However, in view of the comment, we have now substantially increased the number of analyzed patients in the revised manuscript. This includes an increased number of patient samples to investigate the ratio between CD35 and CD26 marked populations at diagnosis and at 3 months of TKI therapy (from n=8 to n=12, now with 6 optimal responders and 5 treatment failures at diagnosis and after TKI therapy) and qPCR for BCR::ABL1 expression status at diagnosis (from n=3 to n=9); we also followed up the BCR::ABL1 expression in three additional samples after TKI therapy. Moreover, we examined the CD26 and CD35 marked populations for expression of GAS2, one of our top candidate LSC signature genes, in three additional samples at diagnosis and at 3-month follow-up. Thus, >80 patient samples across different approaches have been analyzed to strengthen all major conclusions of the study.

      We emphasize that we were cautious in generalizing the observation obtained from any one approach and sought to confirm any major finding using at least one complementary method. As an example, although CITE-seq (n=9) showed altered frequency of all cell clusters between optimal and poor responders (Fig. 3B), we refrained from generalizing because our independent large-scale computational deconvolution analysis (n=59) only substantiated the altered proportion of primitive and myeloid cell clusters (Fig. 3E).

      Although the authors discuss the potential heterogeneity of primitive stem, they do not directly address the heterogeneity of hematopoietic potential or response to TKI treatment in the results presented.

      Thanks for noting the discussion on heterogeneity of the primitive stem cells. As described in the original manuscript, figure 6 D-E showed a relationship between heterogeneity and TKI therapy response. The results showed that the CD35+/CD26+ ratio within the HSC fraction was associated with therapy response. We have now increased the number of patient samples analyzed and present the updated results in the revised manuscript (now figure 6 C-D). These observations set the stage for assessing whether long term therapy outcome can also be influenced by heterogeneity at diagnosis.
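      To make the CD35+/CD26+ ratio concrete, the sketch below computes it from per-sample event counts. The sample names, counts, and the function `cd35_to_cd26_ratio` are hypothetical placeholders for illustration and do not reproduce the analysis pipeline or data of the study.

```python
def cd35_to_cd26_ratio(cd35_pos, cd26_pos):
    """Return the CD35+/CD26+ ratio, or None when no CD26+ events were recorded."""
    if cd26_pos == 0:
        return None  # ratio is undefined without CD26+ events
    return cd35_pos / cd26_pos


# Hypothetical event counts per sample: (CD35+ events, CD26+ events).
samples = {
    "patient_A_diagnosis": (120, 480),
    "patient_A_3_months": (300, 100),
}

ratios = {name: cd35_to_cd26_ratio(*counts) for name, counts in samples.items()}

for name, ratio in ratios.items():
    print(name, ratio)
```

      Returning None for samples without CD26+ events is one possible design choice; it keeps undefined ratios from being silently treated as zero or infinity.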

      We have shown the hematopoietic potential of HSCs marked by CD35 expression in an independent parallel study and therefore only mentioned it concisely in the current manuscript. A combination of scRNA-seq, scATAC-seq and cell surface proteomics showed CD35+ cells at the apex of healthy human hematopoiesis, containing an HSC-specific epigenetic signature and molecular program, as well as possessing self-renewal capacity and multilineage reconstitution in vivo and in vitro. The preprint is available as Sommarin et al. ‘Single-cell multiomics reveals distinct cell states at the top of the human hematopoietic hierarchy’, Biorxiv; https://www.biorxiv.org/content/10.1101/2021.04.01.437998v2.full

      We also note that the surface markers identified were previously reported by the same authors using different single-cell approaches, which limits the novelty of the findings.

      Our current manuscript is indeed a continuation of, and builds on, our previous paper (Warfvinge R et al. Blood, 2017). In contrast to our previous report, which was limited to examination of only 96 genes per cell, CITE-seq allowed us to examine the molecular program of cells using unbiased global gene expression profiling. Finally, although CD26 appears once again as a reliable marker of BCR::ABL1+ primitive cells, CD35 emerges as a novel and previously undescribed marker of BCR::ABL1- residual stem cells. A combination of CD35 and CD26 allowed us to efficiently distinguish between the two populations housed within the Lin-34+38/low stem cell immunophenotype.

      Another limitation is that the BCR-ABL + versus BCR-ABL- status of cells was not confirmed by direct sequencing for BCR-ABL. The BCR-ABL status of cells sorted based on CD26 and CD35 was evaluated in only two samples

      Single cell detection of fusion transcripts is challenging, with low detection sensitivity in single cell RNA-seq, as has been noted previously (Krishnan et al. Blood, 2023, Giustacchini et al. Nature Medicine, 2017, Rodriguez-Meira et al. Molecular Cell, 2019). However, this is likely to change with the inclusion of target-specific probes in scRNA-seq library preparation protocols. Nonetheless, in view of the comment, we have included more patient samples (from the previous n=3 to the current n=10, including TKI-treated samples) for direct assessment of BCR-ABL1 status by qPCR analysis; the updated results are included in the revised manuscript (Figure 6A).

      It will be important to determine whether the GEP and surface markers identified here are able to distinguish BCR-ABL+ and BCR-ABL- primitive stem cells later in the course of TKI treatment.

      We performed qPCR to check for BCR::ABL1 status, and the level of GAS2, one of the top genes expressed in CML cells within CD26+ and CD35+ cells at diagnosis and following 3 months of TKI therapy. The results showed that while CD26+ are BCR::ABL1+, the CD35+ cells are BCR::ABL1- at both time points. Moreover, the expression of LSC-specific gene, GAS2 was specific to BCR::ABL1+ CD26+ cells at both diagnosis as well as following 3 months of TKI therapy. The new results are presented in figure 6B in the revised manuscript.

      Finally, although the authors do describe differential gene expression between CML and normal, BCR:ABL+ and BCR:ABL-, primitive stem cells they have not as yet taken the opportunity to use these findings to address questions regarding biological mechanisms related to CML LSC that impact on TKI response and outcomes.

      We agree with the reviewer that our major focus here was to characterize the cellular heterogeneity coupled to treatment outcome and therefore we did not delve deep into the molecular mechanisms underlying TKI response. However, in response to this comment, as mentioned above, we noted that one of the top genes in BCR::ABL1 cells (Fig. 4 C; right; in red), GAS2 (Growth Arrest Specific 2), was expressed at both diagnosis and TKI therapy within CD26+ cells relative to CD35+ cells (updated figure 6B). Interestingly, GAS2 was also detected in CML LSCs in a recent scRNA-seq study (Krishnan et al. Blood, 2023), suggesting GAS2 upregulation could be a consistent molecular feature of CML cells. GAS2 has been previously noted as deregulated in CML (Janssen JJ et al. Leukemia, 2005, Radich J et al, PNAS, 2006) and has been linked to control of cell cycle, apoptosis, and response to Imatinib (Zhou et al. PLoS One, 2014). Future investigations are warranted to assess whether GAS2 could play a role in the outcome of long-term TKI therapy.

      Reviewer 2:

      Summary:

      The authors use single-cell "multi-omics" to study clonal heterogeneity in chronic myeloid leukemia (CML) and its impact on treatment response and resistance. Their main results suggest 1) Cell compartments and gene expression signatures both shared in CML cells (versus normal), yet 2) some heterogeneity of multiomic mapping correlated with ELN treatment response; 3) further definition of a unique combination of CD26 and CD35 surface markers associated with gene expression defined BCR::ABL1+ LSCs and BCR::ABL1- HSCs. The manuscript is well-written, and the method and figures are clear and informative. The results fit the expanding view of cancer and its therapy as a complex Darwinian exercise of clonal heterogeneity and the selective pressures of treatments.

      Strengths:

      Cutting-edge technology by one of the expert groups of single-cell 'omics.

      Weaknesses:

      Very small sample sizes, without a validation set. The obvious main problem with the study is that an enormous amount of results and conjecture arise from a very small data set: only nine cases for the treatment response section (three in each of the ELN categories), only two normal marrows, and only two patient cases for the division kinetic studies. Thus, it is very difficult to know the "noise" in the system - the stability of clusters and gene expression and the normal variation one might expect, versus patterns that may be reproducibly study artifact, effects of gene expression from freezing-thawing, time on the bench, antibody labeling, etc. This is not so much a criticism as a statement of reality: these elegant experiments are difficult, time-consuming, and very expensive. Thus in the Discussion, it would be helpful for the authors to just frankly lay out these limitations for the reader to consider. Also in the Discussion, it would be interesting for the authors to consider what's next: what type of validation would be needed to make these studies translatable to the clinic? Is there a clever way to use these data to design a faster/cheaper assay?

      We thank the reviewer for appraisal of our manuscript. We take the opportunity to point out the updates in the revised manuscript in view of the comments.

      Very small sample sizes, without a validation set. The obvious main problem with the study is that an enormous amount of results and conjecture arise from a very small data set: only nine cases for the treatment response section (three in each of the ELN categories), only two normal marrows, and only two patient cases for the division kinetic studies.

      As the reviewer has noted, single cell omics experiments remain resource demanding, thereby placing a limitation on the number of patients analyzed. As described above in response to the comments from reviewer 1, multiomic CITE-seq allows extraction of two modalities in comparison to a typical scRNA-seq; however, this also makes it even more limited in the number of samples that can be processed in a sustainable way. This was one of the motivations to analyze a larger number of independent samples (n=59) while benefiting from the insights gained from CITE-seq (n=9). Furthermore, by analyzing CD34+ cells from bone marrow and peripheral blood of CML patients, including both responders and non-responders after one year of Imatinib therapy, we were able to significantly diversify the patient pool, which was lacking in our CITE-seq patient pool. As mentioned above, this reflects a growing trend to analyze larger numbers of patients while anchoring the analysis on prohibitively expensive but potentially insightful sc-omics approaches (for example, please see Zeng et al., A cellular hierarchy framework for understanding heterogeneity and predicting drug response in acute myeloid leukemia, Nature Medicine, 2022).

      As emphasized above, we frequently sought to confirm the findings from one approach using a complementary method and independent samples. For example, although CITE-seq (n=9) showed altered frequency of all cell clusters between optimal and poor responders (Fig. 3B), we refrained from generalizing because an independent large-scale computational deconvolution analysis (n=59) only substantiated the altered proportion of primitive and myeloid clusters.

      In view of the comment, we have now increased the number of patients analyzed during the revision process. These include increased numbers to investigate the ratio between CD35+ and CD26+ populations at diagnosis, as well as 3 months of TKI therapy, qPCR for BCR::ABL1, and patients examined for GAS2, one of the top genes expressed in CML cells (see response to reviewer 1 for details). Altogether, >80 patient samples across different approaches were analyzed to strengthen the conclusions.

      During the revision, we analyzed cells from 8 CML patients for cell cycle using gene activity scores. These results, together with the previously reported cell division kinetics data, are now described in supplementary figures 9C-F.

      It is very difficult to know the "noise" in the system - the stability of clusters and gene expression and the normal variation one might expect, versus patterns that may be reproducibly study artifact, effects of gene expression from freezing-thawing, time on the bench, antibody labeling, etc. This is not so much a criticism as a statement of reality: these elegant experiments are difficult, time-consuming, and very expensive. Thus in the Discussion, it would be helpful for the authors to just frankly lay out these limitations for the reader to consider.

      We agree with the reviewer that sc-omics approaches can be noisy despite continuing efforts to denoise single cell datasets through both experimental and bioinformatic innovations. Therefore, we have updated the discussion as recommended by the reviewer (paragraph 5 in the discussion).

      We also note that CITE-seq, in contrast to scRNA-seq alone, provides dual features (surface marker/protein as well as RNA) for annotating the same cluster. In our manuscript, for example, cell clusters in the UMAP for normal BM (Fig. 1B) were described using both surface markers (Fig. 1C) and RNA (Fig. 1D), making the cluster identity robust. To further elaborate this approach, a new supplementary figure 1C shows annotations of clusters using both RNA and surface markers.

      To potentially address the issue of stability of clusters and gene expression, we compared the marker genes for major clusters from nBM from this study (supplementary table 4, Warfvinge et al.) with those described recently in a scRNA-seq study by Krishnan et al. (supplementary table 8; Blood, 2023) using Cell Radar, a tool that identifies and visualizes which hematopoietic cell types are enriched within a given gene set (description: https://github.com/KarlssonG/cellradar; direct link: https://karlssong.github.io/cellradar/). To compare, we used as inputs to Cell Radar our in-house gene list for the major clusters and the same number of top marker genes, ranked by log2FC, from the corresponding cluster in Krishnan et al. The Cell Radar plot outputs are shown below.

      Author response image 1.

      This approach showed broad similarities across clusters from this study with their counterparts from the other study suggesting the cluster identities reported here are likely to be robust. Please note these figures are for reviewer response only and not included in the final manuscript.
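      The list-matching step described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `top_markers`, the record layout, and the gene and cluster labels are hypothetical placeholders, and Cell Radar itself is not called.

```python
def top_markers(rows, cluster, n):
    """Return the n genes with the highest log2FC for one cluster.

    rows: iterable of dicts with keys "gene", "cluster", "log2fc"
    (a hypothetical layout for a marker-gene table).
    """
    hits = [r for r in rows if r["cluster"] == cluster]
    hits.sort(key=lambda r: r["log2fc"], reverse=True)
    return [r["gene"] for r in hits[:n]]


# Placeholder records standing in for an external marker table.
external_markers = [
    {"gene": "GENE_A", "cluster": "HSC", "log2fc": 1.2},
    {"gene": "GENE_B", "cluster": "HSC", "log2fc": 0.4},
    {"gene": "GENE_C", "cluster": "HSC", "log2fc": 2.5},
    {"gene": "GENE_D", "cluster": "Myeloid", "log2fc": 3.0},
]

# Match the size of the in-house list so both inputs have equal length.
in_house_hsc = ["GENE_X", "GENE_Y"]
matched = top_markers(external_markers, "HSC", n=len(in_house_hsc))
print(matched)
```

      Keeping the two gene lists the same length avoids an enrichment difference that would only reflect list size.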

      Also in the Discussion, it would be interesting for the authors to consider what's next: what type of validation would be needed to make these studies translatable to the clinic? Is there a clever way to use these data to design a faster/cheaper assay?

      Our findings that CD26+ and CD35+ surface markers enrich for BCR::ABL1+ and BCR::ABL1- cells, respectively, suggest that a simpler, faster and cheaper FACS panel could quantify leukemic and non-leukemic stem cells in CML patients. We anticipate that future investigations and clinical studies might examine whether CD26-CD35+ cells could be plausible candidates for restoring normal hematopoiesis once TKI therapy diminishes the leukemic load, and whether patients with low counts of CD35+ cells at diagnosis have a relatively higher chance of developing hematologic toxicity such as cytopenia during therapy.

      We briefly mentioned this possibility in the original discussion; we have now moved it to a separate paragraph to give it more prominence. Please see paragraph 5 of the discussion in the revised manuscript.

      Reviewer 3:

      Summary:

      In this study, Warfvinge and colleagues use CITE-seq to interrogate how CML stem cells change between diagnosis and after one year of TKI therapy. This provides important insight into why some CML patients are "optimal responders" to TKI therapy while others experience treatment failure. CITE-seq in CML patients revealed several important findings. First, substantial cellular heterogeneity was observed at diagnosis, suggesting that this is a hallmark of CML. Further, patients who experienced treatment failure demonstrated increased numbers of primitive cells at diagnosis compared to optimal responders. This finding was validated in a bulk gene expression dataset from 59 CML patients, in which it was shown that the proportion of primitive cells versus lineage-primed cells correlates to treatment outcome. Even more importantly, because CITE-seq quantifies cell surface protein in addition to gene expression data, the authors were able to identify that BCR/ABL+ and BCR/ABL- CML stem cells express distinct cell surface markers (CD26+/CD35- and CD26-/CD35+, respectively). In optimal responders, BCR/ABL- CD26-/CD35+ CML stem cells were predominant, while the opposite was true in patients with treatment failure. Together, these findings represent a critical step forward for the CML field and may allow more informed development of CML therapies, as well as the ability to predict patient outcomes prior to treatment.

      Strengths:

      This is an important, beautifully written, well-referenced study that represents a fundamental advance in the CML field. The data are clean and compelling, demonstrating convincingly that optimal responders and patients with treatment failure display significant differences in the proportion of primitive cells at diagnosis, and the ratio of BCR-ABL+ versus negative LSCs. The finding that BCR/ABL+ versus negative LSCs display distinct surface markers is also key and will allow for a more detailed interrogation of these cell populations at a molecular level.

      Weaknesses:

      CITE-seq was performed in only 9 CML patient samples and 2 healthy donors. Additional samples would greatly strengthen the very interesting and notable findings.

      Reviewer #3 (Recommendations For The Authors):

      My only recommendation is to bolster findings with additional CML and healthy donor samples.

      CITE-seq was performed in only 9 CML patient samples and 2 healthy donors. Additional samples would greatly strengthen the very interesting and notable findings.

      We thank the reviewer for the positive assessment of our manuscript. As mentioned in response to comments from reviewers 1 and 2, CITE-seq remains a resource-consuming single-cell method, potentially limiting the number of patients that can be analyzed. However, during the revision process, we have increased the number of patient samples analyzed in other assays; these include more samples to investigate the ratio between CD35+ and CD26+ populations at diagnosis and at 3 months of TKI therapy, qPCR for BCR::ABL1, and patients examined for GAS2, one of the top genes expressed in CML cells. Thus, >80 patient samples across different assays have been analyzed to strengthen the conclusions (please see the response to reviewer 1 for more details).

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      Detecting unexpected epistatic interactions among multiple mutations requires a robust null expectation - or neutral function - that predicts the combined effects of multiple mutations on phenotype, based on the effects of individual mutations. This study assessed the validity of the product neutrality function, where the fitness of double mutants is represented as the multiplicative combination of the fitness of single mutants, in the absence of epistatic interactions. The authors utilized a comprehensive dataset on fitness, specifically measuring yeast colony size, to analyze epistatic interactions.

      The study confirmed that the product function outperformed other neutral functions in predicting the fitness of double mutants, showing no bias between negative and positive epistatic interactions. Additionally, in the theoretical portion of the study, the authors applied a well-established theoretical model of bacterial cell growth to simulate the growth rates of both single and double mutants under various parameters. The simulations further demonstrated that the product function was superior to other functions in predicting the fitness of hypothetical double mutants. Based on these findings, the authors concluded that the product function is a robust tool for analyzing epistatic interactions in growth fitness and effectively reflects how growth rates depend on the combination of multiple biochemical pathways.

      Strengths:

      By leveraging a previously published extensive dataset of yeast colony sizes for single- and double-knockout mutants, this study validated the relevance of the product function, commonly used in genetics to analyze epistatic interactions. The finding that the product function provides a more reliable prediction of double-mutant fitness compared to other neutral functions offers significant value for researchers studying epistatic interactions, particularly those using the same dataset.

      Notably, this dataset has previously been employed in studies investigating epistatic interactions using the product neutrality function. The current study's findings affirm the validity of the product function, potentially enhancing confidence in the conclusions drawn from those earlier studies. Consequently, both researchers utilizing this dataset and readers of previous research will benefit from the confirmation provided by this study's results.

      Weaknesses:

      This study exhibits several significant logical flaws, primarily arising from the following issues: a failure to differentiate between distinct phenotypes, instead treating them as identical; an oversight of the substantial differences in the mechanisms regulating cell growth between prokaryotes and eukaryotes; and the adoption of an overly specific and unrealistic set of assumptions in the mutation model. Additionally, the study fails to clearly address its stated objective of investigating the mechanistic origin of the multiplicative model. Although it discusses conditions under which deviations occur, it falls short of achieving its primary goal. Moreover, the paper includes misleading descriptions and unsubstantiated reasoning, presented without proper citations, as if they were widely accepted facts. Readers should consider these issues when evaluating this paper. Further details are discussed below.

      (1) Misrepresentation of the dataset and phenotypes

      The authors analyze a dataset on the fitness of yeast mutants, describing it as representative of the Malthusian parameter of an exponential growth model. However, they provide no evidence to support this claim. They assert that the growth of colony size in the dataset adheres to exponential growth kinetics; in contrast, it is known to exhibit linear growth over time, as indicated in [Supplementary Note 1 of https://doi.org/10.1038/nmeth.1534]. Consequently, fitness derived from colony size should be recognized as a different metric and phenotype from the Malthusian parameter. Equating these distinct phenotypes and fitness measures constitutes a fundamental error, which significantly compromises the theoretical discussions based on the Malthusian parameter in the study.

      The reviewer is correct in pointing out that colony-size measurements are distinct from exponential growth kinetics. We acknowledge that our original text implied that the dataset directly measured the exponential growth rate (Malthusian parameter), when in fact it was measuring yeast colony expansion rates on solid media. Colony growth under these conditions often follows a biphasic pattern in that there is typically an initial microscopic phase where cells can grow exponentially, but as the colony expands further then the growth dynamics become more linear (Meunier and Choder 1999). We have revised our text to state clearly what the experiment measured.

      However, while colony size does not exhibit exponential growth kinetics, several studies have argued that the rate of colony expansion is related to the exponential growth rate of cells growing in non-limiting nutrient conditions in liquid culture. This is because colony growth is dominated by cells at the colony boundaries that have access to nutrients and are in exponential growth. Cells in the colony interior lack nutrients and therefore contribute little to colony growth. This has been shown both in theoretical and experimental studies, finding that the linear growth rate of the colony is directly linked to the single-cell exponential growth rate (Pirt 1967; Gray and Kirwan 1974; Korolev et al. 2012; Gandhi et al. 2016; Meunier and Choder 1999). In particular, the above studies suggest that the linear colony growth rate is directly proportional to the square root of the exponential growth rate. Therefore, one would expect that the validity of the product model for one fitness measure implies its validity for the other measure. In addition, colony size was found to be highly correlated with the exponential growth rate of cells in non-limiting nutrients in liquid culture (Baryshnikova et al. 2010; Zackrisson et al. 2016; Miller et al. 2022). For these reasons, we treated the colony size and exponential growth rate as interchangeable in our original manuscript. 

      To address the important point raised by the reviewer, we now explain more clearly in the text what the analyzed data on colony size show and why we believe it is reflective of the exponential growth rate. Finally, we note that our results supporting the product neutrality function are consistent with the work of (Mani et al. 2008), which used smaller datasets based on liquid culture growth rates (Jasnos and Korona 2007; Onge et al. 2007).

      The text in Section 2.3 now reads:

      “Having verified empirically that the Product neutrality function is supported by the latest data for cell proliferation, we now turn our attention to its origins. Addressing this question requires some mechanistic model of biosynthesis. However, most mechanistic models of growth apply directly to single cells in rich nutrient conditions, which may not directly apply to the SGA measurements of colony expansion rates. In particular, colony growth has been shown to follow a biphasic pattern (Meunier et al. 1999). A first exponential phase is followed by a slower linear phase as the colony expands. Previous modeling and empirical work indicates that this second linear expansion rate reflects the underlying exponential growth of cells in the periphery of the colony (Pirt 1967; Gray et al. 1974; Gandhi et al. 2016; Baryshnikova, Costanzo, S. Dixon, et al. 2010; Zackrisson et al. 2016; Miller et al. 2022). More precisely, mathematical models show the linear colony-size expansion rate is directly proportional to the square root of the exponential growth rate under non-limiting conditions. Intuitively, this relationship arises because colony growth is dominated by the expansion of the population of cells in an annulus at the colony border that are exposed to rich nutrient conditions. These cells expand at a rate similar to the exponential rate of cells growing in a rich nutrient liquid culture. In contrast, the cells in the interior of the colony experience poor nutrient conditions, grow very slowly, and do not contribute to colony growth.

      This intimate relationship between both proliferation rates allows us to explore the origin of the Product neutrality function in mechanistic models of cell growth. Indeed, if colony-based fitnesses follow a Product model, then

      where the superscript c indicates colony-based values for the fitness W and the growth rate λ. Taking into account the relationship between single-cell exponential growth rates and colony growth rates, we can write

      where the superscript l denotes liquid cultures. Combining these expressions, we obtain

      In other words, from the perspective of the Product neutrality function, fitnesses based on colony expansion rates are equivalent to fitnesses based on single-cell exponential growth rates. The prevalence of the Product neutrality model—both in the SGA data and in previous studies on datasets from liquid cultures (Jasnos et al. 2007; Onge et al. 2007; Mani et al. 2008)—encourages the exploration of its origin in mechanistic models of cell growth.”
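      The algebra behind the quoted passage can be sketched as follows. This is a reconstruction from the definitions given in the text, assuming the stated square-root relationship between colony and liquid-culture growth rates; the subscripts 1, 2, 12 (single and double mutants) and wt (wild type) are notation introduced here for illustration.

```latex
% Product model for colony-based fitness, with W^c_i = \lambda^c_i / \lambda^c_{wt}
W^{c}_{12} = W^{c}_{1}\, W^{c}_{2}

% Assumed scaling between colony (c) and liquid-culture (l) growth rates
\lambda^{c} \propto \sqrt{\lambda^{l}}
\quad\Longrightarrow\quad
W^{c}_{i} = \frac{\lambda^{c}_{i}}{\lambda^{c}_{wt}}
          = \sqrt{\frac{\lambda^{l}_{i}}{\lambda^{l}_{wt}}}
          = \sqrt{W^{l}_{i}}

% Combining the two expressions and squaring both sides
\sqrt{W^{l}_{12}} = \sqrt{W^{l}_{1}}\,\sqrt{W^{l}_{2}}
\quad\Longrightarrow\quad
W^{l}_{12} = W^{l}_{1}\, W^{l}_{2}
```

      The proportionality constant cancels in each fitness ratio, which is why the product form carries over unchanged from one fitness measure to the other.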

      (2) Misapplication of prokaryotic growth models

      The study attempts to explain the mechanistic origin of the multiplicative model observed in yeast colony fitness using a bacterial cell growth model, particularly the Scott-Hwa model. However, the application of this bacterial model to yeast systems lacks valid justification. The Scott-Hwa model is heavily dependent on specific molecular mechanisms such as ppGpp-mediated regulation, which plays a crucial role in adjusting ribosome expression and activity during translation. This mechanism is pivotal for ensuring the growth-dependency of the ribosome fraction in the proteome, as described in [https://doi.org/10.1073/pnas.2201585119]. Unlike bacteria, yeast cells do not possess this regulatory mechanism, rendering the direct application of bacterial growth models to yeast inappropriate and potentially misleading. This fundamental difference in regulatory mechanisms undermines the relevance and accuracy of using bacterial models to infer yeast colony growth dynamics.

      If the authors intend to apply a growth model with macroscopic variables to yeast double-mutant experimental data, they should avoid simply repurposing a bacterial growth model. Instead, they should develop and rigorously validate a yeast-specific growth model before incorporating it into their study.

      There is nothing prokaryote-specific in the Scott-Hwa model. It does not include the ppGpp mechanism for regulating the ribosome fraction, which does not exist in eukaryotes. The general features of the model, such as the proportionality between the ribosome fraction and the growth rate, have indeed been validated in yeast (Metzl-Raz et al. 2017; Elsemman et al. 2022; Xia et al. 2022). Performing a detailed physiological analysis of budding yeast across varying growth conditions in order to build a more extensive model is beyond the scope of this work. Finally, we note that the Weiße model, which we also analyzed, is also generic and has replicated empirical measurements from both bacteria and yeast (Weiße et al. 2015).

      To clarify this point in the text, we have added the following to Section 2.3: 

      “Experimental measurements in other organisms suggest that the observations leading to this model, including that the cellular ribosome fraction increases with growth rate, are in fact generic and also seen in the yeast S. cerevisiae (Metzl-Raz et al. 2017; Elsemman et al. 2022; Xia et al. 2022).”

      (3) Overly specific assumptions in the theoretical model

      The theoretical model in question assumes that two mutations affect only independent parameters of specific biochemical processes, an overly restrictive premise that undermines its ability to broadly explain the occurrence of the multiplicative model in mutations. Additionally, experimental evidence highlights significant limitations to this approach. For example, in most viable yeast deletion mutants with reduced growth rates, the expression of ribosomal proteins remains largely unchanged, in direct contradiction to the predictions of the Scott-Hwa model, as indicated in [https://doi.org/10.7554/eLife.28034]. This discrepancy emphasizes that the Scott-Hwa model and its derivatives do not reliably explain the growth rates of mutants based on current experimental data, suggesting that these models may need to be reevaluated or alternative theories developed to more accurately reflect the complex dynamics of mutant growth.

      In the data from the Barkai lab referenced by the reviewer (reproduced below), we see that the ribosomal transcript fraction is in fact proportional to growth rate in response to gene deletions, in contradiction to the reviewer’s interpretation. However, it is notable that the ribosomal transcript fraction is somewhat higher for a given growth rate if that growth rate results from a mutation rather than from a suboptimal nutrient condition. We know that the very simple Scott-Hwa model is not a perfect representation of the cell. Nevertheless, it does recapitulate important aspects of growth physiology, and we therefore thought it useful to analyze its response to mutations and compare those responses to the different neutrality functions. We never claimed the Scott-Hwa model was a perfect model and fully agree with the referee’s statement above that “... these models may need to be reevaluated, or alternative theories developed to more accurately reflect the complex dynamics of mutant growth.” Indeed, we say as much in our discussion where we wrote:

      “While we focused on coarse-grained models for their simplicity and mechanistic interpretability, they might be too simple to effectively model large double-mutant datasets and the resulting double-mutant fitness distributions. We therefore expect the combination of high throughput genetic data with the analysis of larger-scale models, for instance based on Flux Balance Analysis, Metabolic Control Analysis, or whole-cell modeling, to lead to important complementary insights regarding the regulation of cell growth and proliferation.”

      To further clarify this point, we now discuss and cite the Barkai lab data for gene deletions (see Figure 2 of Metzl-Raz et al. 2017).

      (4) Lack of clarity on the mechanistic origin of the multiplicative model

      The study falls short of providing a definitive explanation for its primary objective: elucidating the "mechanistic origin" of the multiplicative model. Notably, even in the simplest case involving the Scott-Hwa model, the underlying mechanistic basis remains unexplained, leaving the central research question unresolved. Furthermore, the study does not clearly specify what types of data or models would be required to advance the understanding of the mechanistic origin of the multiplicative model. This omission limits the study's contribution to uncovering the biological principles underlying the observed fitness patterns.

      We appreciate the reviewer’s interest in a more complete mechanistic explanation for the product model of fitness. The primary goal of this study was to explore the validity of the Product model from the perspective of coarse-grained models of cell growth, and to extract mechanistic insights where possible. We view our work as a first step toward a deeper understanding of how double-mutant fitnesses combine, rather than a final, all-encompassing theory. As the referee notes, we are limited by the current state of the field, which has an incomplete understanding of cell growth. 

      Nonetheless, our analysis does propose concrete, mechanistically informed explanations. For example, we highlight how growth-optimizing feedback—such as cells’ ability to reallocate ribosomes or adjust proteome composition—naturally leads to multiplicative rather than additive or minimal fitness effects. We also link the empirical deviations from pure multiplicative behavior to differences in how specific pathways re-balance under perturbation, and we suggest that a product-like rule emerges when multiple interconnected processes each partially limit cell growth.

      In the discussion, we clarify what additional data and models we think will be required to advance this question. Namely, we propose extending our approach through larger-scale, more detailed modeling frameworks, which may include explicit modeling of ppGpp or TOR activities in bacteria or eukaryotic cells, respectively. We also emphasize the importance of refining the measurement of cell growth rates to uncover subtle deviations from the product rule that could yield greater mechanistic insight. By integrating high-throughput genetic data with next-generation computational models, it should be possible to home in on the specific biological principles (e.g., metabolic bottlenecks, resource reallocation) that underlie the multiplicative neutrality function.

      Reviewer #2 (Public review):

      The paper deals with the important question of gene epistasis, focusing on asking what is the correct null model for which we should declare no epistasis.

      In the first part, they use the Synthetic Genetic Array dataset to claim that the effects of a double mutation on growth rate are well predicted by the product of the individual effects (much more than e.g. the additive model). The second (main) part shows this is also the prediction of two simple, coarse-grained models for cell growth.

      I find the topic interesting, the paper well-written, and the approach innovative.

      One concern I have with the first part is that they claim that:

      "In these experiments, the colony area on the plate, a proxy for colony size, followed exponential growth kinetics. The fitness of a mutant strain was determined as the rate of exponential growth normalized to the rate in wild type cells."

      There are many works on "range expansions" showing that colonies expand at a constant velocity, the speed of which scales as the square root of the growth rate (these are called "Fisher waves", predicted in the 1940s, and there are many experimental works on them, e.g. https://www.pnas.org/doi/epdf/10.1073/pnas.0710150104). If that's the case, the area of the colony should be proportional to growth_rate X time^2, rather than exp(growth_rate*time), so the fitness they might be using here could be the log(growth_rate) rather than growth_rate itself? That could potentially have a big effect on the results.

      We thank the reviewer for their thoughtful remarks. As they rightly pointed out, a large body of literature supports that colonies expand at constant velocity both from a theoretical and experimental standpoint. 

      As discussed in the answer to the first question of Reviewer 1, this body of work also suggests that the linear expansion rate of the colony front is directly related to the single-cell exponential growth rate of the cells at the periphery. Hence, although the macroscopic colony growth may not be exponential in time, measuring colony size (or radial expansion) across different genotypes still provides a consistent and meaningful proxy for comparing their underlying growth capabilities. 

      In particular, these studies suggest (consistently with Fisher-wave theory) that the linear growth rate of the colony 𝐾 is proportional to the square root of the exponential growth rate 𝜆. Under the assumption that the product model is valid for a given double mutant and for the exponential growth rate, we would have that

      The associated wave-front velocities would then be predicted to be

      In other words, if the product model is valid for fitness measures based on exponential growth rates, it should also be valid for fitness measures based on linear colony growth rates. 
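      The argument above can be checked numerically with a short sketch. It assumes the Fisher-wave scaling K = c·sqrt(λ); the growth-rate values and the prefactor are hypothetical, and the additive form W1 + W2 − 1 is the commonly used definition, included only for contrast.

```python
import math


def front_speed(growth_rate, prefactor=1.0):
    """Assumed Fisher-wave scaling: K = prefactor * sqrt(lambda)."""
    return prefactor * math.sqrt(growth_rate)


def fitness(value, wild_type_value):
    """Fitness as a rate normalized to the wild-type rate."""
    return value / wild_type_value


# Hypothetical exponential growth rates (arbitrary units).
lam_wt, lam_1, lam_2 = 1.0, 0.81, 0.64

# Double mutants constructed to obey each neutrality function exactly
# at the level of exponential growth rates.
lam_12_product = lam_wt * fitness(lam_1, lam_wt) * fitness(lam_2, lam_wt)
lam_12_additive = lam_wt * (fitness(lam_1, lam_wt) + fitness(lam_2, lam_wt) - 1.0)

# Single-mutant fitnesses measured from colony front speeds instead.
k_wt = front_speed(lam_wt)
w1_k = fitness(front_speed(lam_1), k_wt)
w2_k = fitness(front_speed(lam_2), k_wt)

# Deviation of the colony-based double-mutant fitness from each prediction.
product_gap = fitness(front_speed(lam_12_product), k_wt) - w1_k * w2_k
additive_gap = fitness(front_speed(lam_12_additive), k_wt) - (w1_k + w2_k - 1.0)

print(f"product gap:  {product_gap:.3e}")
print(f"additive gap: {additive_gap:.3e}")
```

      Under this scaling the product relation survives the change of fitness measure up to floating-point error, whereas a double mutant that is exactly additive in λ is no longer additive in K.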

      We now include this discussion in the revised version of Section 2.3.

      Additional comments/questions:

      (1) What is the motivation for the model where the effect of two genes is the minimum of the two?

      The motivation for the minimal model is the notion that there might be a particular process that is rate-limiting for growth due to a mutation. In this case, a mutation in process X makes it really slow and process Y proceeds in parallel and has plenty of time to finish its job before cell division takes place. In this case, even a mutation to process Y might not slow down growth because there is an excess amount of time for it to be completed. Thus, the double mutant might then be anticipated to have the growth rate associated with the single mutation to process X. We now add a similar description when we introduce the different neutrality functions in Section 2.1.
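      For concreteness, the three neutrality functions discussed in this response can be written as a minimal sketch. The function names are hypothetical, and the additive form W1 + W2 − 1 is the commonly used definition, assumed here rather than taken from this response; fitness values are relative to wild type (W = 1).

```python
def product_expectation(w1, w2):
    """Product model: single-mutant fitnesses multiply."""
    return w1 * w2


def additive_expectation(w1, w2):
    """Additive model (assumed form): fitness defects add."""
    return w1 + w2 - 1.0


def minimum_expectation(w1, w2):
    """Minimal model: the slower, rate-limiting process sets the growth rate."""
    return min(w1, w2)


def epistasis(w12_observed, w1, w2, neutrality):
    """Deviation of the observed double-mutant fitness from a null expectation."""
    return w12_observed - neutrality(w1, w2)


# Hypothetical single-mutant fitnesses.
w_x, w_y = 0.8, 0.9
for name, fn in [
    ("product", product_expectation),
    ("additive", additive_expectation),
    ("minimum", minimum_expectation),
]:
    print(name, round(fn(w_x, w_y), 3))
```

      In the rate-limiting scenario described above, a double mutant growing at the rate of the slower single mutant shows zero epistasis under the minimal model but positive epistasis under the product model.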

      (2) How seriously should we take the Scott-Hwa model? Should we view it as a toy model to explain the phenomenon or more than that? If the latter, then since the number of categories in the GO analysis is much more than two (47?) in many cases the analysis of the experimental data would take pairs of genes that both affect one process in the Scott-Hwa model - and then the product prediction should presumably fail? The same comment applies to the other coarse-grained model.

      From our perspective, models like the Scott-Hwa model constitute the simplest representation of growth based on data that is not trivial. Moreover, the Scott-Hwa model is able to incorporate interactions between two different biological processes. We believe models, like the Scott-Hwa and Weiße models, should be viewed as more than mere toy models because they have been backed up by some empirical data, such as that showing the ribosome fraction increases with growth rate. However, the Scott-Hwa model is inherently limited by its low dimensionality and relative simplicity. We do not claim that such models can provide a full picture of the cell. As argued in the main text, we have chosen to focus on such models because of their tractability and in the hope of extracting general principles. We nonetheless agree with the reviewer that they do not have the capacity to represent interactions between genes in the same biological process. We now note this limitation in the text. 

      (3) There are many works in the literature discussing additive fitness contributions, including Kauffman's famous NK model as well as spin-glass-type models (e.g. Guo and Amir, Science Advances 2019, Reddy and Desai, eLife 2021, Boffi et al., eLife 2023). These should be addressed in this context.

      We thank the reviewer for pointing out this part of the literature. We do believe these works constitute a relevant body of work tackling the emergence of epistasis patterns from a theoretical grounding, and now reference and discuss them in the text. 

      (4) The experimental data is for deletions, but it would be interesting to know the theoretical model's prediction for the expected effects of beneficial mutations and how they interact since that's relevant (as mentioned in the paper) for evolutionary experiments. Perhaps in this case the question of additive vs. multiplicative matters less since the fitness effects are much smaller.

      This is an interesting question. Since mutations increasing the growth rate generated by gene deletions or other systematic perturbations are rare, we did not focus on them. Of course, as the reviewer notes, in the case of evolution experiments, these fitness enhancing mutations are selected for. To address the reviewer's question, we can first consider the Scott-Hwa model. In this case, the analytical solution remains valid in the case of fitness enhancing mutations so that the fitness of the double mutant will be the product neutrality function multiplied by an additional interaction term (see Figure 3). The mathematical derivation predicts that the double mutant fitness can potentially grow indefinitely. Indeed, the denominator can be equal to zero in some cases. In simulations, we see that the observation for deleterious mutations does not seem to hold for beneficial mutations (new supplementary Figure S5 shown below). Indeed, no model seems to replicate double mutant fitnesses much better than any other. This suggests that the growth-optimizing feedback we discuss in section 2.3 may have compound effects that ultimately make double-mutant fitnesses much larger than any model predicts.

      We recognize this may be an important point, and discuss it in detail in the revised section 2.3 as well as in the discussion.

      Baryshnikova, Anastasia, Michael Costanzo, Scott Dixon, Franco J. Vizeacoumar, Chad L. Myers, Brenda Andrews, and Charles Boone. 2010. “Synthetic Genetic Array (SGA) Analysis in Saccharomyces Cerevisiae and Schizosaccharomyces Pombe.” Methods in Enzymology 470 (March):145–79.

      Elsemman, Ibrahim E., Angelica Rodriguez Prado, Pranas Grigaitis, Manuel Garcia Albornoz, Victoria Harman, Stephen W. Holman, Johan van Heerden, et al. 2022. “Whole-Cell Modeling in Yeast Predicts Compartment-Specific Proteome Constraints That Drive Metabolic Strategies.” Nature Communications 13 (1): 801.

      Gandhi, Saurabh R., Eugene Anatoly Yurtsev, Kirill S. Korolev, and Jeff Gore. 2016. “Range Expansions Transition from Pulled to Pushed Waves as Growth Becomes More Cooperative in an Experimental Microbial Population.” Proceedings of the National Academy of Sciences of the United States of America 113 (25): 6922–27.

      Gray, B. F., and N. A. Kirwan. 1974. “Growth Rates of Yeast Colonies on Solid Media.” Biophysical Chemistry 1 (3): 204–13.

      Jasnos, Lukasz, and Ryszard Korona. 2007. “Epistatic Buffering of Fitness Loss in Yeast Double Deletion Strains.” Nature Genetics 39 (4): 550–54.

      Korolev, Kirill S., Melanie J. I. Müller, Nilay Karahan, Andrew W. Murray, Oskar Hallatschek, and David R. Nelson. 2012. “Selective Sweeps in Growing Microbial Colonies.” Physical Biology 9 (2): 026008.

      Mani, Ramamurthy, Robert P. St Onge, John L. Hartman 4th, Guri Giaever, and Frederick P. Roth. 2008. “Defining Genetic Interaction.” Proceedings of the National Academy of Sciences of the United States of America 105 (9): 3461–66.

      Metzl-Raz, Eyal, Moshe Kafri, Gilad Yaakov, Ilya Soifer, Yonat Gurvich, and Naama Barkai. 2017. “Principles of Cellular Resource Allocation Revealed by Condition-Dependent Proteome Profiling.” eLife 6 (August). https://doi.org/10.7554/elife.28034.

      Meunier, J. R., and M. Choder. 1999. “Saccharomyces Cerevisiae Colony Growth and Ageing: Biphasic Growth Accompanied by Changes in Gene Expression.” Yeast (Chichester, England) 15 (12): 1159–69.

      Miller, James H., Vincent J. Fasanello, Ping Liu, Emery R. Longan, Carlos A. Botero, and Justin C. Fay. 2022. “Using Colony Size to Measure Fitness in Saccharomyces Cerevisiae.” PloS One 17 (10): e0271709.

      Onge, Robert P. St, Ramamurthy Mani, Julia Oh, Michael Proctor, Eula Fung, Ronald W. Davis, Corey Nislow, Frederick P. Roth, and Guri Giaever. 2007. “Systematic Pathway Analysis Using High-Resolution Fitness Profiling of Combinatorial Gene Deletions.” Nature Genetics 39 (2): 199–206.

      Pirt, S. J. 1967. “A Kinetic Study of the Mode of Growth of Surface Colonies of Bacteria and Fungi.” Journal of General Microbiology 47 (2): 181–97.

      Weiße, Andrea Y., Diego A. Oyarzún, Vincent Danos, and Peter S. Swain. 2015. “Mechanistic Links between Cellular Trade-Offs, Gene Expression, and Growth.” Proceedings of the National Academy of Sciences of the United States of America 112 (9): E1038–47.

      Xia, Jianye, Benjamin J. Sánchez, Yu Chen, Kate Campbell, Sergo Kasvandik, and Jens Nielsen. 2022. “Proteome Allocations Change Linearly with the Specific Growth Rate of Saccharomyces Cerevisiae under Glucose Limitation.” Nature Communications 13 (1): 2819.

      Zackrisson, Martin, Johan Hallin, Lars-Göran Ottosson, Peter Dahl, Esteban Fernandez-Parada, Erik Ländström, Luciano Fernandez-Ricaud, et al. 2016. “Scan-O-Matic: High-Resolution Microbial Phenomics at a Massive Scale.” G3 (Bethesda, Md.) 6 (9): 3003–14.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Weaknesses: 

      This study's weakness is that it requires the use of chloroplasts isolated from leaves and the need to freeze them on a grid for observation, so it is unclear to what extent the observations reflect physiological conditions. In particular, the mode of existence of the thylakoid membrane complexes seems to be strongly influenced by the physicochemical environment surrounding the membranes, as indicated by the different distribution of PSII between intact chloroplasts and those with ruptured envelope membranes. 

      We agree with the reviewer, as discussed in the “Limitations and Future Perspectives” section of our manuscript. The duration and conditions of the chloroplast isolation will very likely influence the state of the sample and hamper conclusions about physiological adaptations to environmental conditions, which are important for a dynamic process like photosynthesis. Isolated chloroplasts were the most feasible option for vitrification by plunge freezing, but we intend to improve our technological approaches to overcome this obstacle in the future (e.g., by using the more involved approach of cryo-lift out from high-pressure frozen tissue). Here, we hope that by using plants acclimated to a “standard state” (standard growth conditions under low light) and proceeding with fast isolation and grid preparation (chloroplasts were used only once per isolation and deposited on the grids as fast as 10 min from leaf harvesting), we preserve some physiological relevance. This is supported by: 1) a PSII distribution pattern and concentration that is similar to previous observations by us and others in cryo-ET of FIB-milled algae cells and freeze-fracture of whole plant cells, 2) a thylakoid lumen width that is similar to previous reports from whole light-adapted algae and leaf cells, but wider than previous reports of isolated plant thylakoids.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) Figure 1-3: It would be better if it was easier to see which part of the figure the explanation in the text refers to. For example, not only the figure number but also the color of the arrowheads could be indicated in the text. Also, it would be better to indicate which part of the figure the explanation in the text and in the figure legend refers to by adding arrows or circles on the figure images.

      Thank you for this idea. We have added color references to individual objects segmented in Figs. 1 and 2. They are now indicated in the figure references in the text to facilitate the reading. In Fig. 3, we have added additional arrows (and indication in the text) to point to examples of Rubisco densities (as also requested by Reviewer #2).

      (2) Figure 5: Without having read the authors' previous works on "membranograms", the reader may have no idea why the distribution of PSI and ATPase in the non-stack region in G can be inferred from the data in Figure 5C-E. Is it possible to add an explanation, for example by adding a supplementary figure?

      Thank you for this suggestion. Instead of creating another methods figure and movie about membranograms, we refer readers to our earlier work (Wietrzynski et al. 2020, eLife). This fits with the Research Advance format, and eLife should clearly link to that previous paper that our current study builds upon.

      Reviewer #2 (Recommendations for the authors): 

      Minor points: 

      (1) Please add to Figures 2A or 3A arrowheads showing Rubisco complexes.

      Done; we added colored arrowheads pointing to Rubisco complexes and an indication in the figure legend.

      (2) "We measured a membrane thickness of 5.1 ± 0.3 nm, a stromal gap of 3.2 ± 0.3 nm, a luminal thickness of 10.8 ± 2.0 nm, and a total thylakoid thickness (including two membranes plus the enclosed lumen) of 21.1 ± 1.8 nm (Fig. 4) (for comparison see [1, 2, 30, 40])."

      Please add ref: Kirchhoff, H. et al. Dynamic control of protein diffusion within the granal thylakoid lumen. Proc. Natl Acad. Sci. USA 108, 20248-20253 (2011).

      Thank you for this suggestion. The reference has been added.

      (3) Please add to the supplemental figures a raw data and a processed image with AI denoising.

      Denoising results differ between the tomograms. Below we provide an example of a significant improvement in signal-to-noise ratio in a denoised tomogram. On the left is a raw tomogram reconstructed using a standard approach: weighted back projection using the etomo program from the IMOD package. On the right is the same tomogram denoised using cryoCARE, which performs a noise comparison between odd and even frames that were used to reconstruct the tomogram on the left. Below is a zoom-in on the slices from the first row, highlighting the differences. The same approach was used for all the tomograms used in the figures. Please also see the Data deposition statement below (and the Data deposition section in the paper), which we hope fulfills the Reviewer’s request. All raw and denoised data, as well as segmentations and picked particle positions, are publicly available.

      “Data deposition statement

      The raw data consists of micrographs (frames) used to reconstruct each tomogram, an acquisition parameters file (.mdoc) for each tomogram and reference images of the microscope camera: 273.7 GB in total. Following the current standard in the cryo-EM field, all images used to generate figures in the manuscript (AI-denoised tomograms and corresponding segmentations) have been deposited in the Electron Microscopy Data Bank (EMDB) and are available under accession codes EMD-5243 through EMD-5248. They can be accessed here: https://www.ebi.ac.uk/emdb/EMD-52542. Additionally, all raw files (including tomograms used only for analysis), all used denoised tomographic volumes and unaltered membrane segmentations have been deposited onto the public EMPIAR server (www.ebi.ac.uk/empiar) and are available under the accession code EMPIAR-12612. Finally, positions of PSII particles used in the study, segmented single membrane instances and membrane meshes are available at: 10.5281/zenodo.15090119. All this data will be linked to (and is searchable by) the EMDB depositions and to the manuscript DOI. Accession numbers to the data are added in the “Data availability” section of the manuscript.”

      Author response image 1.

      Results of tomogram denoising. An example tomogram from the dataset. Top row: on the left is a 5-slice average of the tomographic volume reconstructed using the weighted back projection method. On the right is a single tomographic slice of the same tomographic volume denoised using the cryoCARE program. Bottom row: zoom-ins on the corresponding tomographic slices from the top row. All images were recorded using 3dmod from the IMOD package.

      Additional modifications:

      Following other comments and suggestions, we have included the following additions to the manuscript:

      Figure 4 – figure supplement 1. Its aim is to better explain the methodology behind thylakoid width measurements. The methods section concerning this figure has been slightly modified to match this addition.

      Figure 1 – video supplement 1. Overview of a chloroplast tomogram and segmentations of the thylakoid and chloroplast envelope membranes.

      Figure 3 – video supplement 1. Chloroplast stroma and top views of the thylakoid network, with stromal lamellae connecting the grana.

      Figure 8 – video supplements 1 and 2. These tomographic views highlight the organization of PSII particles in thylakoids from intact and broken chloroplasts.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Overall, this study provides a meticulous comparison of developmental transcriptomes between two sub-species of the annelid Streblospio benedicti. Different lineages of S. benedicti maintain one of two genetically programmed alternative life histories, the ancestral planktotrophic or derived lecithotrophic forms of development. This contrast is also seen at the inter-species level in many marine invertebrate taxa, such as echinoderms and molluscs. The authors report relatively (surprisingly?) modest differences in transcriptomes overall but also find some genes whose expression is essentially morph-specific (which they term "exclusive").

      Strengths:

      The study is based on a dense and appropriately replicated sampling of early development. The tight clustering of each stage/morph combination in PCA space suggests the specimens were accurately categorized. The similar overall trajectories of the two morphs were surprising to me for two stages: 1) the earliest stage (16-cell), at which we might expect maternal differences due to the several-fold difference in zygote size, and 2) the latest stage (1-week), where there appears to be the most obvious morphological difference. This is why we need to do experiments!

      The examination of F1 hybrids was another major strength of the study. It also produced one of the most surprising results: though intermediate in phenotype, F1 embryos have the most distinct transcriptomes, and reveal a range of fixed, compensatory differences in the parental lines.

      Weaknesses:

      Overall I really enjoyed this paper, but I see a few places where it can be tightened and made more insightful. These relate to better defining the basis for "exclusive" expression (regulation or gene presence/absence?), providing more examples of how specific genes related to trophic mode behave, and placing the study in the context of similar work in other phyla.

      As suggested, we changed the term “exclusive expression” to “morph-specific” expression throughout the paper to clarify which genes are only expressed in one morph. We also added references to similar work in other phyla such as recent work on lecithotrophic and planktotrophic development in species of Heliocidaris sea urchins in the 4th paragraph of the discussion. We added additional data about the F1 hybrids in “Gene expression of Genetic Crosses” section and the new Figure 8B. We find that gene expression in F1 offspring is divided between matching the maternal and paternal gene expression patterns, with slightly more genes matching paternal expression.

      Reviewer #2 (Public Review):

      The manuscript by Harry and Zakas determined the extent to which gene expression differences contribute to developmental divergence by using a model that has two distinct developmental morphs within a single species. Although the authors did collect a valuable dataset and trends in differential expression between the two morphs of S. benedicti were presented, we found limitations about the methods, system, and resources that the authors should address.

      We have two major points:

      (1) Background information about the biological system needs to be clarified in the introduction of this manuscript. The authors stated that F1 offspring can have intermediate larval traits compared to the parents (Line 81). However, the authors collected F1 offspring at the same time as the mother in the cross. If offspring have intermediate larval traits, their developmental timeline might be different than both parents and necessitate the collection of offspring at different times to obtain the same stages as the parents. Could the authors (1) explain why they collected offspring at the same time as parents given that other literature and Line 81 state these F1 offspring develop at intermediate rates, and (2) add the F1 offspring to Figure 1 to show morphological and timeline differences in development?

      Additionally, the authors state (Lines 83-85) that they detail the full-time course of embryogenesis for both the parents and the F1 crosses. However, we do not see where the authors have reported the full-time course for embryogenesis of the F1 offspring. Providing this information would shape the remaining results of the manuscript.

      (2) We have several concerns about the S. benedicti genome and steps regarding the read mapping for RNA-seq:

      The S. benedicti genome used (Zakas et al. 2022) was generated using the PP morph. The largest scaffolds of this assembly correspond to linkage groups, showing the quality of this genome. The authors should point out in the Methods and/or Results sections that the quality of this genome means that PP-specific gene expression can be quantified well. However, the challenges and limitations of mapping LL-specific expression data to the PP genome should be discussed.

      It is possible that the authors did not find exclusive gene expression in the LL morph because they require at least one gene to be turned on in one morph as part of the data-cleaning criteria. Because the authors are comparing all genes to the PP morph, they could be missing true exclusive genes responsible for the biological differences between the two morphs. Did they make the decision to only count genes expressed in one stage of the other morph because the gene models and mapping quality led to too much noise?

      The authors state that the mapping rates between the two morphs are comparable (Supplementary Figure 1). However, there is a lot of variation in mapping the LL individuals (~20% to 43%) compared to the PP individuals. What is the level of differentiation within the two morphs in the species (pi and theta)? The statistical tests for this comparison should be added and the associated p-value should be reported. The statistical test used to compare mapping rates between the two morphs may be inappropriate. The authors used Salmon for their RNA alignment and differential expression analysis, but it is possible that a different method would be more appropriate. For example, Salmon has some limitations as compared to Kallisto as others have noted. The chosen statistical test should be explained, as well as how RNA-seq data are processed and interpreted.

      What about the read mapping rate and details for the F1 LP and PL individuals? How did the offspring map to the P genome? These details should be included in Supplementary Figure 1. Could the authors also provide information about the number of genes expressed at each stage in the F1 LP and PL samples in S Figure 2? How many genes went into the PCA? Many of these details are necessary to evaluate the F1 RNA-seq analyses.

      Generally, the authors need to report the statistics used in data processing more thoroughly. The authors need to report the statistics used to (1) process and evaluate the RNA-seq data and (2) determine the significance between the two morphs (Supplementary Figures 1 and 2).

      (1) We clarified in the methods that F1 embryos are collected at the same stage (not absolute time) as the parental types. The “16-cell” stage is therefore comparable across planktotrophic, lecithotrophic and F1 offspring, regardless of the absolute time taken to reach that stage (which differs by ~3 hours; Figure 1).

      Figure 2A details every time point collected for all crosses. As mentioned in the methods, we were unable to collect two timepoints for one set of crosses (LP) due to limited tissue. However, we still cover the full development time from “16 cell” through “swimming larvae” stages, which is the full larval development time.

      (2) We appreciate the reviewer's concerns regarding the mapping to the reference genome. The S. benedicti genome is a largely complete and contiguous chromosome-length genome which we have now highlighted in the manuscript. However, the reference is only for the planktotrophic morph. So it is certainly possible that there could be mapping bias for lecithotrophic reads or F1 reads, as we point out in the discussion. While some bias is certainly possible, it is unlikely to be driving major differences in the results. We performed several tests to demonstrate this:

      (1) We conducted two-sided t-tests of the mapping rates between all sample groups in our dataset (PP, LL, PL, LP) to determine if there were significant differences in mapping rates among the populations. No significant differences were found. The specific results of these statistical tests are included in the updated manuscript in Supplementary Figure 1 and are as follows:

      Author response table 1.

      (2) In response to the comment about sequence-level divergence affecting mapping rate, we estimated pi (nucleotide diversity within a population) and dxy (genomic divergence between two populations) based on the sampled transcriptomic data of our planktotrophic and lecithotrophic populations. We used PIXY (Korunes, K.L. and Samuk, K., 2021) with its standard settings to estimate these values, with variant call files in BCF format produced with bcftools: one for all planktotrophic samples and one for all lecithotrophic samples in our dataset. We found that across regions of the transcriptome, the difference in pi between planktotrophs and lecithotrophs was between 0.11% and 4.2%. Genomic divergence across the transcriptome is also relatively minor: estimates of dxy ranged from 0.0049 to 0.0076. Given that these estimates show relatively modest differences in nucleotide diversity and overall sequence divergence, we maintain that it is unlikely that they significantly impact the results described in this study. From what we have seen in the literature, these values are within the range of other population studies that map to a species reference derived from one population.
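
      To make the two statistics concrete, the sketch below computes pi and dxy from their textbook definitions (average pairwise differences per site within a population and between two populations). It is a simplified illustration, not the PIXY implementation: it assumes fully aligned, equal-length sequences with no missing data, and the toy sequences and function names are hypothetical.

```python
from itertools import combinations, product


def pairwise_diff(a: str, b: str) -> float:
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nucleotide_diversity(seqs: list[str]) -> float:
    """pi: mean per-site difference over all pairs drawn from one population."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


def dxy(pop_x: list[str], pop_y: list[str]) -> float:
    """dxy: mean per-site difference over all between-population pairs."""
    pairs = list(product(pop_x, pop_y))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy alignments (10 sites, 3 haplotypes per population).
PLANKTOTROPHIC = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGAAC"]
LECITHOTROPHIC = ["ACGTTCGTAC", "ACGTTCGTAC", "ACGTTCGTAT"]

if __name__ == "__main__":
    print("pi (P):", nucleotide_diversity(PLANKTOTROPHIC))
    print("pi (L):", nucleotide_diversity(LECITHOTROPHIC))
    print("dxy   :", dxy(PLANKTOTROPHIC, LECITHOTROPHIC))
```

      Real transcriptome data contain missing genotypes and invariant sites that must be counted in the denominator, which is the main reason to use a dedicated tool rather than a direct calculation like this one.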

      We added the mapping rates of all samples in the Supplement (SFig. 1) as requested. We added the number of genes expressed at each stage in the Supplement (SFig. 2) as requested. We have also provided further details and figures (Fig 8B) on read mapping rates and statistics used in data processing, including those for F1 RNA-seq data.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We thank you for the time you took to review our work and for your feedback! We have made only minor changes in this submission and primarily wanted to respond to the concerns raised by reviewer 1.

      Reviewer #1 (Public review): 

      Summary: 

      Fluorescence imaging has become an increasingly popular technique for monitoring neuronal activity and neurotransmitter concentrations in the living brain. However, factors such as brain motion and changes in blood flow and oxygenation can introduce significant artifacts, particularly when activity-dependent signals are small. Yogesh et al. quantified these effects using GFP, an activity-independent marker, under two-photon and wide-field imaging conditions in awake behaving mice. They report significant GFP responses across various brain regions, layers, and behavioral contexts, with magnitudes comparable to those of commonly used activity sensors. These data highlight the need for robust control strategies and careful interpretation of fluorescence functional imaging data.

      Strengths: 

      The effect of hemodynamic occlusion in two-photon imaging has been previously demonstrated in sparsely labeled neurons in V1 of anesthetized animals (see Shen and Kara et al., Nature Methods, 2012). The present study builds on these findings by imaging a substantially larger population of neurons in awake, behaving mice across multiple cortical regions, layers, and stimulus conditions. The experiments are extensive, the statistical analyses are rigorous, and the results convincingly demonstrate significant GFP responses that must be accounted for in functional imaging experiments. 

      In the revised version, the authors have provided further methodological details that were lacking in the previous version, expanded discussions regarding alternative explanations of these GFP responses as well as potential mitigation strategies. They also added a quantification of brain motion (Fig. S5) and the fraction of responsive neurons when conducting the same experiment using GCaMP6f (Fig. 3D-3F), among other additional information. 

      Weaknesses: 

      (1) The authors have now included a detailed methodology for blood vessel area quantification, where they detect blood vessels as dark holes in GFP images and measure vessel area by counting pixels below a given intensity threshold (line 437-443). However, this approach has a critical caveat: any unspecific decrease in image fluorescence will increase the number of pixels below the threshold, leading to an apparent increase in blood vessel area, even when the actual vessel size remains unchanged. As a result, this method inherently introduces a positive correlation between fluorescence decrease and vessel dilation, regardless of whether such a relationship truly exists. 

      To address this issue, I recommend labelling blood vessels with an independent marker, such as a red fluorescence dye injected into the bloodstream. This approach would allow vessel dilation to be assessed independently of GFP fluorescence -- dilation would cause opposite fluorescence changes in the green and red channels (i.e., a decrease in green due to hemodynamic occlusion and an increase in red due to the expanding vessel area). In my opinion, only when such anti-correlation is observed can one reliably infer a relationship between GFP signal changes and blood vessel dynamics.

      Because this relationship is central to the authors' conclusion regarding the nature of the observed GFP signals, including this experiment would greatly strengthen the paper's conclusion.

      This is correct – a more convincing demonstration that blood vessels dilate or constrict in anticorrelation with apparent GFP fluorescence would use a separate blood vessel marker. However, we don’t think this experiment is worth doing, as it is also not conclusive in the sense the reviewer may have in mind. The anticorrelation does not mean that occlusion drives all of the observed effect. Our main argument is instead that there is no other potential source than hemodynamic occlusion with sufficient strength that we can think of. The experiment one would want to do is block hemodynamic changes and demonstrate that the occlusion explains all of the observed changes.
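
      The thresholding caveat raised in point (1) can be illustrated with a toy example. The sketch below is not the analysis code used in the study: the image, threshold and dimming factor are hypothetical, and the "vessel area" is simply the count of pixels below a fixed intensity threshold, as described by the reviewer.

```python
def vessel_area(image: list[list[float]], threshold: float) -> int:
    """Count pixels darker than the threshold (the 'dark hole' vessel estimate)."""
    return sum(1 for row in image for px in row if px < threshold)


def dim(image: list[list[float]], factor: float) -> list[list[float]]:
    """Scale every pixel by the same factor, mimicking an unspecific fluorescence drop."""
    return [[px * factor for px in row] for row in image]


# Hypothetical 5 x 5 frame: bright tissue (100), one dark vessel pixel (10),
# and a rim of intermediate pixels (55) around the vessel.
IMAGE = [
    [100, 100, 100, 100, 100],
    [100, 55, 55, 55, 100],
    [100, 55, 10, 55, 100],
    [100, 55, 55, 55, 100],
    [100, 100, 100, 100, 100],
]
THRESHOLD = 50

if __name__ == "__main__":
    print("area at baseline:", vessel_area(IMAGE, THRESHOLD))
    print("area after 20% dimming:", vessel_area(dim(IMAGE, 0.8), THRESHOLD))
```

      In this toy frame the vessel itself never changes size, yet a uniform 20% drop in brightness pushes the rim pixels below the threshold and inflates the measured area, which is the built-in correlation the reviewer describes.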

      (2) Regarding mitigation strategy, the authors advocate repeating key functional imaging experiments using GFP, and state that their aim here is to provide a control for their 2012 study (Keller et al., Neuron). Given this goal, I find it important to discuss how these new findings impact the interpretation of their 2012 results, particularly given the large GFP responses observed. 

      We are happy to discuss how the conclusions of our own work are influenced by this (see more details below), but the important response of the field should probably be to revisit the conclusions of a variety of papers published in the last two decades. This goes far beyond what we can do here. 

      For example, Keller et al. (2012) concluded that visuomotor mismatch strongly drives V1 activity (Fig. 3A in that study). However, in the present study, mismatch fails to produce any hemodynamic/GFP response (Fig. 3A, 3B, rightmost bar), and the corresponding calcium response is also the weakest among the three tested conditions (Fig. 3D). How do these findings affect their 2012 conclusions? 

      The average calcium response of L2/3 neurons to visuomotor mismatch is probably roughly similar to the average calcium response at locomotion onset (both are on the order of 1% to 5%, depending on indicator, dataset, etc.). In the Keller et al. (2012) paper, locomotion onset was about 1.5% and mismatch about 3% (see Figure 3A in that paper). What we quantify in Figure 3 of the paper here is the fraction of responsive neurons. Thus, mismatch drives strong responses in a small subset of neurons (approx. 10%), while locomotion drives a combination of weak responses in a large fraction of the neurons (roughly 70%) and large responses in a subset of neurons. A strong signal in a subset of neurons is what one would expect from a neuronal response; a weak signal from many neurons would be indicative of a contaminating signal. This all appears consistent.

      Regarding influencing the conclusions of earlier work, the movement related signals described in the Keller et al. (2012) paper are probably overestimated, but are also apparent in electrophysiological recordings (Saleem et al., 2013). Thus, the locomotion responses reported in the Keller et al. (2012) paper are likely too high, but locomotion related responses in V1 are very likely real. The only conclusion we draw in the Keller et al. 2012 paper on the strength of the locomotion related responses is that they are smaller than mismatch responses (this conclusion is unaffected by hemodynamic contamination). In addition, the primary findings of the Keller et al. (2012) paper are all related to mismatch, and these conclusions are unaffected. 

      Similarly, the present study shows that GFP reveals twice as many responsive neurons as GCaMP during locomotion (Fig. 3A vs. Fig. 3D, "running"). Does this mean that their 2012 conclusions regarding locomotion-induced calcium activity need reconsideration? Given that more neurons responded with GFP than with GCaMP, the authors should clarify whether they still consider GCaMP a reliable tool for measuring brain activity during locomotion. 

      Comparisons of the fraction of significantly responsive neurons between GFP and GCaMP are not straightforward to interpret. One needs to factor in the difference in signal to noise between the two sensors. (Please note, we added the GCaMP responses here upon request of the reviewers). Note, there is nothing inherently wrong with the data, and comparisons within a dataset are easily made (e.g. more grating responsive neurons than running responsive neurons in GCaMP, and vice versa with GFP). The comparison across datasets is not as straightforward, as we define “responsive neurons” using a statistical test that compares response to baseline activity for each neuron. GFP labelled neurons are very bright and occlusion can easily be detected. Baseline fluorescence in GCaMP recordings is much lower and often close to or below the noise floor of the data (i.e. we only see the cells when they are active). Occlusion in GCaMP recordings is therefore preferentially visible for cells that have high baseline fluorescence. Thus, in the GCaMP data we are likely underestimating the fraction of responsive neurons.

      Regarding whether GCaMP (or any other fluorescence indicator used in vivo) is a reliable tool, we are not sure we understand. Whenever possible, fluorescence-sensor based measurements should be corrected for hemodynamic contamination – to quantify locomotion related signals this will be more difficult than e.g. for mismatch, but that does not mean it is not reliable. 

      (3) More generally, the authors should discuss how functional imaging data should be interpreted going forward, given the large GFP responses reported here. Even when key experiments are repeated using GFP, it is not entirely clear how one could reliably estimate underlying neuronal activity from the observed GFP and GCaMP responses.

      We are not sure we have a good answer to this question. The strategy for addressing this problem will depend on the specifics of the experiment, and the claims. Take the case of mismatch. Here we have strong calcium responses and no evidence of GFP responses. We would argue that this is reasonable evidence that the majority of the mismatch driven GCaMP signal is likely neuronal. For locomotion onsets, both GFP and GCaMP signals go in the same direction on average. Then one could use a response amplitude distribution comparison to conservatively exclude all neurons with a GCaMP amplitude lower than e.g. the 99th percentile of the GFP response. Etc. But we don’t think there is an easy generalizable fix for this problem.  
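
      The percentile-based exclusion mentioned above could be sketched as follows. This is only an illustration of the idea under stated assumptions: the amplitudes are hypothetical, the function names are ours, and a nearest-rank percentile is used for simplicity (other percentile definitions would give slightly different cutoffs).

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def neurons_above_gfp_floor(gcamp: list[float], gfp: list[float], q: float = 99) -> list[int]:
    """Indices of GCaMP neurons whose response amplitude exceeds the q-th GFP percentile."""
    cutoff = percentile(gfp, q)
    return [i for i, amplitude in enumerate(gcamp) if amplitude > cutoff]


# Hypothetical response amplitudes (arbitrary dF/F units).
GFP_AMPLITUDES = [i / 100 for i in range(1, 101)]  # 0.01 ... 1.00
GCAMP_AMPLITUDES = [0.5, 1.2, 3.0, 0.99]

if __name__ == "__main__":
    print("cutoff:", percentile(GFP_AMPLITUDES, 99))
    print("retained neurons:", neurons_above_gfp_floor(GCAMP_AMPLITUDES, GFP_AMPLITUDES))
```

      The filter is conservative by design: neurons with real but small responses are discarded along with contaminated ones, so it bounds the neuronal signal from below rather than estimating it.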

      For example, consider the results in Fig. 3A vs. 3D: how should one assess the relative strength of neuronal activity elicited by running, grating, or visuomotor mismatch? Does mismatch produce the strongest neuronal activity, since it is least affected by the hemodynamic/GFP confounds (Fig. 3A)? Or does mismatch actually produce the weakest neuronal activity, given that both its hemodynamic and calcium responses are the smallest? 

      See above, the reviewer may be confounding “response strength” with “fraction of responsive neurons” here. Regarding the relationship between neuronal activity and hemodynamics, it is very likely not just the average activity of all neurons, but a specific subset that drives blood vessel constriction and dilation. This would of course be a very interesting question to answer for the interpretation of hemodynamic based measurements of brain activity, like fMRI, but goes beyond the aim of the current paper.  

      In my opinion, such uncertainty makes it difficult to robustly interpret functional imaging results. Simply repeating experiments with GFP does not fully resolve this issue, as it does not provide a clear framework for quantifying the underlying neuronal activity. Does this suggest a need for a better mitigation strategy? What could these strategies be? 

      If the reviewer has a good idea - we would be all ears. We don’t have a better idea currently.  

      In my opinion, addressing these questions is critical not only for the authors' own work but also for the broader field to ensure a robust and reliable interpretation of functional imaging data. 

      We agree, having a solution to this problem would be important – we just don’t have one.  

      (4) The authors now discuss various alternative sources of the observed GFP signals. However, I feel that they often appear to dismiss these possibilities too quickly, rather than appreciating their true potential impacts (see below). 

      For example, the authors argue that brain movement cannot explain their data, as movement should only result in a decrease in observed fluorescence. However, while this might hold for x-y motion, movement in the axial (z) direction can easily lead to both fluorescence increase and decrease. Neurons are not always precisely located at the focal plane -- some are slightly above or below. Axial movement in a given direction will bring some cells into focus while moving others out of focus, leading to fluorescence changes in both directions, exactly as observed in the data (see Fig. S2). 

      The reviewer is correct that z-motion can result in an increase of apparent fluorescence (as can x-y motion). On average, however, z-motion will result in a decrease, just as x-y motion does. This assumes that the user selecting regions of interest (the outlines of cells used to quantify fluorescence) will select these such that the distribution of selected cells is centered on the z-plane of the image. Thus, the distribution of the z-locations of the cells relative to the imaging plane will be some Gaussian-like distribution centered on the z-plane of the image (with half the cells above the z-plane and half below). Because the peak of the distribution is located on the z-plane at rest, any z-movement, up or down, will move away from the peak of the distribution (i.e. most cells will decrease in fluorescence). This is the same argument as for why x-y motion always results in decreases (assuming the user selects regions of interest centered on the location of the cells at rest).
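      A toy simulation can illustrate this averaging argument. It assumes, purely for illustration, that each cell's fluorescence falls off as a Gaussian function of its distance from the focal plane and that the selected cells lie close to the plane relative to the width of that profile; the function name and all parameter values are hypothetical.

```python
import numpy as np

def axial_shift_effect(z_offsets, dz, profile_sigma=3.0):
    """Effect of an axial shift dz (same units as z_offsets) on ROI fluorescence.

    Toy assumption: a cell's fluorescence is a Gaussian function of its
    distance from the focal plane, with width profile_sigma.
    Returns (fractional change of the population-mean fluorescence,
             fraction of cells whose fluorescence decreases).
    """
    z_offsets = np.asarray(z_offsets, dtype=float)
    at_rest = np.exp(-z_offsets**2 / (2 * profile_sigma**2))
    shifted = np.exp(-(z_offsets + dz) ** 2 / (2 * profile_sigma**2))
    mean_change = shifted.mean() / at_rest.mean() - 1.0
    fraction_decreasing = np.mean(shifted < at_rest)
    return mean_change, fraction_decreasing

# Hypothetical cell positions: Gaussian-distributed around the imaging plane.
rng = np.random.default_rng(0)
z_offsets = rng.normal(loc=0.0, scale=1.0, size=10_000)
for dz in (-1.0, 1.0):
    change, frac = axial_shift_effect(z_offsets, dz)
    print(f"dz = {dz:+.1f}: mean change {change:+.1%}, {frac:.0%} of cells decrease")
```

      In this sketch individual cells can brighten or dim, but the population mean drops for a shift in either direction. The result relies on the assumed centering of the selected cells on the imaging plane.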

      Furthermore, the authors state that they discard data with 'visible' z-motion. However, subtle axial movements that escape visual detection could still cause fluorescence fluctuations on the order of a few percent, comparable to the reported signal amplitudes. 

      Correct, but as explained above, z-motion will on average always result in a decrease of fluorescence.

      Finally, the authors state that "brain movement kinematics are different in shape than the GFP responses we observe". However, this appears to contradict what they show in Fig. 2A. Specifically, the first example neuron exhibits fast GFP transients locked to running onset, with rapid kinematics closely matching the movement speed signals in Fig. S5A. These fast transients are incompatible with slower blood vessel area signals (Fig. 4), suggesting that alternative sources could contribute significantly. 

      We meant population average responses here. We have clarified this. Some of the signals we observed do indeed look like they could be driven by movement artifacts (whole brain motion, or probably more likely blood vessel dilation driven tissue distortion). We show this neuron to illustrate that this can also happen. However, to illustrate that this is a rare event we also show the entire distribution of peak amplitudes and the position in the distribution this neuron is from.  

      In sum, the possibility that alternative signal sources could significantly contribute should be taken seriously and more thoroughly discussed. 

      All possible sources (we could think of) are explicitly discussed (in roughly equal proportion). Nevertheless, the reviewer is correct that our focus here is almost exclusively on what we think is the primary source of the problem. Given that, in my experience, this is also the one least frequently considered, I think the emphasis on what we think is the primary contributor is warranted.

      (5) The authors added a quantification of brain movement (Fig. S5) and claim that they "only find detectable brain motion during locomotion onsets and not the other stimuli." However, Fig. S5 presents brain 'velocity' rather than 'displacement'. A constant (non-zero) velocity in Fig. S5 B-D indicates that the brain continues to move over time, potentially leading to significant displacement from its initial position across all conditions. While displacements in the x-y plane are corrected, similar displacements in the z direction likely occur concurrently and cannot be easily accounted for. To assess this possibility, the authors should present absolute displacement relative to pre-stimulus frames, as displacement -- not velocity -- determines the size of movement-related fluorescence changes.

      We use brain velocity here as a natural measure when using frame times as time bins. The problem with using a signed displacement is that if different running onsets move the brain in opposing directions, this can average out to zero. To counteract this, one can take the absolute displacement in a response window away from the position in a baseline time window. If this is done with time bins that correspond to frame times, this just becomes displacement per frame, i.e. velocity. Using absolute changes in displacement (i.e. velocity) is more sensitive than signed displacement. The responses for signed displacement are shown below (Author response image 1), but given that we are averaging signed quantities here, the average is not interpretable. 

      Author response image 1.

      Average signed brain displacement. 
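      The difference between averaging signed displacement and averaging frame-to-frame speed can be illustrated with a small sketch. The traces below are simulated under the assumption that each running onset displaces the brain by a fixed amount in a random direction; the function name and all values are hypothetical.

```python
import numpy as np

def onset_averages(position, onset_frame):
    """Average onset-aligned brain position traces in two ways.

    position: array of shape (n_onsets, n_frames), in pixels.
    Returns (mean signed displacement from the pre-onset baseline,
             mean absolute frame-to-frame displacement, i.e. speed).
    """
    position = np.asarray(position, dtype=float)
    baseline = position[:, :onset_frame].mean(axis=1, keepdims=True)
    signed = (position - baseline).mean(axis=0)
    speed = np.abs(np.diff(position, axis=1)).mean(axis=0)
    return signed, speed

# Hypothetical data: 200 onsets, half moving the brain by +3 pixels, half by -3.
n_onsets, n_frames, onset_frame = 200, 60, 30
direction = np.repeat([-1.0, 1.0], n_onsets // 2)[:, None]
step = np.zeros(n_frames)
step[onset_frame:] = 3.0
position = direction * step

signed, speed = onset_averages(position, onset_frame)
print("max |mean signed displacement|:", np.abs(signed).max())
print("peak mean speed (pixels/frame):", speed.max())
```

      With opposing directions, the signed average cancels while the speed average retains the movement at onset, which is why the unsigned measure is the more sensitive one.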

      Regarding a constant drift, the reviewer might be misled by the fact that the baseline brain velocity is roughly 1 pixel per frame. The registration algorithm works in integer number of pixels only. 1 pixel per frame corresponds roughly to the noise floor of the registration algorithm. Registrations are done independently for each frame. As a consequence, the registration oscillates between a shift of 17 and 18 pixels – frame by frame – if the actual shift is somewhere between 17 and 18 pixels. This “jitter” results in a baseline brain velocity of about 1 pixel per frame. 
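      The origin of this baseline can be reproduced in a toy one-dimensional simulation. It assumes a stationary brain whose true shift lies between two integer pixel values, an independent noisy estimate on each frame, and rounding to whole pixels; the function name, noise level, and the one-dimensional simplification are assumptions, so the exact baseline value will differ from the one measured in the data.

```python
import numpy as np

def registered_speed(true_shift, noise_sd, n_frames, seed=0):
    """Frame-to-frame speed reported by an integer-pixel registration.

    The brain does not move: true_shift is constant. Each frame is registered
    independently with a noisy estimate that is rounded to whole pixels.
    """
    rng = np.random.default_rng(seed)
    estimate = true_shift + rng.normal(0.0, noise_sd, size=n_frames)
    integer_shift = np.rint(estimate)
    return np.abs(np.diff(integer_shift))

# True shift halfway between 17 and 18 pixels: the estimate jitters between the two.
jitter = registered_speed(true_shift=17.5, noise_sd=0.2, n_frames=10_000)
# True shift close to an integer: no jitter.
stable = registered_speed(true_shift=17.0, noise_sd=0.05, n_frames=10_000)
print("mean apparent speed with jitter:", jitter.mean())
print("mean apparent speed without jitter:", stable.mean())
```

      In this sketch the apparent speed is non-zero, with steps of one pixel, even though nothing moves, so a constant baseline velocity does not by itself indicate drift.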

      (6) In lines 132-133, the authors draw an analogy between the effect of hemodynamic occlusion and liquid crystal display (LCD) function. However, there are fundamental differences between the two. LCDs modulate light transmission by rotating the polarization of light, which then passes through a crossed polarizer. In contrast, hemodynamic occlusion alters light transmission by changing the number and absorbance properties of hemoglobin. Additionally, LCDs do not involve 'emission' light - back-illumination travels through the liquid crystal layer only once, whereas hemodynamic occlusion affects both incoming excitation light and the emitted fluorescence. Given these fundamental differences, the LCD analogy may not be entirely appropriate.

      The mechanism of occlusion is, as the reviewer correctly points out, different for an LCD. In both cases, however, there is a variable occluder between a light source and an observer. The fact that with hemodynamic occlusion the light passes through the occluder twice (excitation and emission) does not appear to us to hamper the analogy. We have rephrased to highlight the time-varying occlusion part.

      Reviewer #2 (Public review):

      -  Approach 

      In this study, Yogesh et al. aimed at characterizing hemodynamic occlusion in two photon imaging, where its effects on signal fluctuations are underappreciated compared to that in wide field imaging and fiber photometry. The authors used activity-independent GFP fluorescence, GCaMP and GRAB sensors for various neuromodulators in two-photon and widefield imaging during a visuomotor context to evaluate the extent of hemodynamic occlusion in V1 and ACC. They found that the GFP responses were comparable in amplitude to smaller GCaMP responses, though exhibiting context-, cortical region-, and depth-specific effects. After quantifying blood vessel diameter change and surrounding GFP responses, they argued that GFP responses were highly correlated with changes in local blood vessel size. Furthermore, when imaging with GRAB sensors for different neuromodulators, they found that sensors with lower dynamic ranges such as GRAB-DA1m, GRAB-5HT1.0, and GRAB-NE1m exhibited responses most likely masked by the hemodynamic occlusion, while a sensor with larger SNR, GRAB-ACh3.0, showed much more distinguishable responses from blood vessel change. They thoroughly investigate other factors that could contribute to these signals and demonstrate hemodynamic occlusion is the primary cause. 

      -  Impact of revision 

      This is an important update to the initial submission, adding much supplemental imaging and population data that provide greater detail to the analyses and increase the confidence in the authors' conclusions.

      Specifically, inclusion of the supplemental figures 1 and 2 showing GFP expression across multiple regions and the fluorescence changes of thousands of individual neurons provides a clearer picture of how these effects are distributed across the population. Characterization of brain motion across stimulation conditions in supplemental figure 5 provides strong evidence that the fluorescence changes observed in many of the conditions are unlikely to be primarily due to brain motion associated imaging artifacts. The role of vascular area on fluorescence is further supported by addition of new analyses on vasoconstriction leading to increased fluorescence in Figures 4C1-4, complementing the prior analyses of vasodilation. 

      The expansion of the discussion on other factors that could lead to these changes is thorough and welcome. The arguments against pH playing a factor in fluorescence changes of GFP, due to insensitivity to changes in the expected pH range are reasonable, as are the other discussed potential factors. 

      With respect to the authors' responses to prior critique, we agree that activity dependent hemodynamic occlusion is best investigated under awake conditions. Measurement of these dynamics under anesthesia could lead to an underestimation of their effects. Isoflurane anesthesia causes significant vasodilation and a large reduction in fluorescence intensity in non-functional mutant GRABs. This could saturate or occlude activity dependent effects.

      - Strengths 

      This work is of broad interest to two photon imaging users and GRAB developers and users. It thoroughly quantifies the hemodynamic driven GFP response and compares it to previously published GCaMP data in a similar context, and illustrates the contribution of hemodynamic occlusion to GFP and GRAB responses by characterizing the local blood vessel diameter and fluorescence change. These findings provide important considerations for the imaging community and a sobering look at the utility of these sensors for cortical imaging. 

      Importantly, they draw clear distinctions between the temporal dynamics and amplitude of hemodynamic artifacts across cortical regions and layers. Moreover, they show context dependent (Dark versus during visual stimuli) effects on locomotion and optogenetic light-triggered hemodynamic signals. 

      The authors suggest that the signal-to-noise ratio of an indicator likely affects the ability to separate hemodynamic response from the underlying fluorescence signal. With a new analysis (Supplemental Figure 4), they show that the relative degree of background fluorescence does not affect the size of the artifact.

      Most of the first generation neuromodulator GRAB sensors showed relatively small responses, comparable to blood vessel changes in two photon imaging, which emphasizes a need to improve the dynamic range and response magnitude of future sensors and encourages the sensor users to consider removing hemodynamic artifacts when analyzing GRAB imaging data.

      - Weaknesses 

      The largest weakness of the paper remains that, while they convincingly quantify hemodynamic artifacts across a range of conditions, they provide limited means of correcting for them. However, they now discuss the relative utility of some hemodynamic correction methods (e.g. from Ocana-Santero et al., 2024).

      The paper attributes the source of 'hemodynamic occlusion' primarily to blood vessel dilation, but leaves unanswered how much may be due to shifts in blood oxygenation. Figure 4 directly addresses the question of how much of the signal can be attributed to occlusion by measuring the blood vessel dilation, and has been improved by now showing positive fluorescence effects with vasoconstriction. They now also discuss the potential impact of oxygenation. 

      Along these lines, the authors carefully quantified the correlation between local blood vessel diameter and GFP response (or neuropil fluorescence vs blood vessel fluorescence with GRAB sensors). We are left to wonder to what extent does this effect depend on proximity to the vessels? Do GFP/ GRAB responses decorrelate from blood vessel activity in neurons further from vessels (refer to Figure 5A and B in Neyhart et al., Cell Reports 2024)? The authors argue that the primary impact of occlusion is from blood vessels above the plane of imaging, but without a vascular reconstruction, their evidence for this is anecdotal. 

      The choice of ACC as the frontal region provides a substantial contrast in location, brain movement, and vascular architecture as compared to V1. As the authors note, ACC is close to the superior sagittal sinus and thus is the region where the largest vascular effects are likely to occur. A less medial portion of M2 may have been a more appropriate comparison. The authors now include example imaging fields for ACC and interesting out-of-plane vascular examples in the supplementary figures that help assess these impacts. 

      - Overall Assessment

      This paper is an important contribution to our understanding of how hemodynamic artifacts may corrupt GRAB and calcium imaging, even in two-photon imaging modes. While it would be wonderful if the authors were able to demonstrate a reliable way to correct for hemodynamic occlusion which did not rely on doing the experiments over with a non-functional sensor or fluorescent protein, the careful measurement and reporting of the effects here is, by itself, a substantial contribution to the field of neural activity imaging. Its results are of importance to anyone conducting two-photon or widefield imaging with calcium and GRAB sensors and deserve the attention of the broader neuroscience and in vivo imaging community.

      We agree with this assessment.

      Reviewer #3 (Public review):

      Summary:

      In this study, the authors aimed to investigate if hemodynamic occlusion contributes to fluorescent signals measured with two-photon microscopy. For this, they image the activity-independent fluorophore GFP in 2 different cortical areas, at different cortical depths and in different behavioral conditions. They compare the evoked fluorescent signals with those obtained with calcium sensors and neuromodulator sensors and evaluate their relationship to vessel diameter as a readout of blood flow.

      They find that GFP fluorescence transients are comparable to GCaMP6f stimuli-evoked signals in amplitude, although they are generally smaller. Yet, they are significant even at the single neuronal level. They show that GFP fluorescence transients resemble those measured with the dopamine sensor GRAB-DA1m and the serotonin sensor GRAB-5HT1.0 in amplitude and nature, suggesting that signals with these sensors are dominated by hemodynamic occlusion. Moreover, the authors perform similar experiments with wide-field microscopy which reveals the similarity between the two methods in generating the hemodynamic signals. Together the evidence presented calls for the development and use of high dynamic range sensors to avoid measuring signals that have another origin from the one intended to measure. In the meantime, the evidence highlights the need to control for those artifacts such as with the parallel use of activity independent fluorophores.

      Strengths:

      - Comprehensive study comparing different cortical regions in diverse behavioral settings in controlled conditions.

      - Comparison to the state-of-the-art, i.e. what has been demonstrated with wide-field microscopy.

      - Comparison to diverse activity-dependent sensors, including the widely used GCaMP.

      Comments on revisions:

      The authors have addressed my concerns well. I have no further comments.

      We agree with this assessment.  


      The following is the authors’ response to the original reviews

      The major changes to the manuscript are:

      (1) We rewrote the discussion, going over all possible sources of the signals we describe.

      (2) We added a quantification of brain motion as Figure S5.

      (3) We added an example of blood vessel contraction as Figure 4C.

      (4) We added data on the fraction of responsive neurons when measured with GCaMP as Figures 3D-3F.

      (5) We added example imaging sites from all imaged regions as Figure S1.

      (6) We added GFP response heatmaps of all neurons as Figure S2.

      (7) We added a quantification of the relationship between GFP response amplitude and expression level as Figure S4.

      A detailed point-by-point response to all reviewer concerns is provided below.

      Public Reviews:

      Reviewer #1 (Public Review):

      Fluorescence imaging has become an increasingly popular technique for monitoring neuronal activity and neurotransmitter concentrations in the living brain. However, factors such as brain motion and changes in blood flow and oxygenation can introduce significant artifacts, particularly when activity-dependent signals are small. Yogesh et al. quantified these effects using GFP, an activity-independent marker, under two-photon and wide-field imaging conditions in awake behaving mice. They report significant GFP responses across various brain regions, layers, and behavioral contexts, with magnitudes comparable to those of commonly used activity sensors. These data highlight the need for robust control strategies and careful interpretation of fluorescence functional imaging data.

      Strengths:

      The effect of hemodynamic occlusion in two-photon imaging has been previously demonstrated in sparsely labeled neurons in V1 of anesthetized animals (see Shen and Kara et al., Nature Methods, 2012). The present study builds on these findings by imaging a substantially larger population of neurons in awake, behaving mice across multiple cortical regions, layers, and stimulus conditions. The experiments are extensive, the statistical analyses are rigorous, and the results convincingly demonstrate significant GFP responses that must be accounted for in functional imaging experiments. However, whether these GFP responses are driven by hemodynamic occlusion remains less clear, given the complexities associated with awake imaging and GFP's properties (see below).

      Weaknesses:

      (1) The authors primarily attribute the observed GFP responses to hemodynamic occlusion. While this explanation is plausible, other factors may also contribute to the observed signals. These include uncompensated brain movement (e.g., axial-direction movements), leakage of visual stimulation light into the microscope, and GFP's sensitivity to changes in intracellular pH (see e.g., Kneen and Verkman, 1998, Biophysical Journal). Although the correlation between GFP signals and blood vessel diameters supports a hemodynamic contribution, it does not rule out significant contributions from these (or other) factors. Consequently, whether GFP fluorescence can reliably quantify hemodynamic occlusion in two-photon microscopy remains uncertain.

      We concur; our data do not conclusively prove that the effect is only driven by hemodynamic occlusion. We have attempted to make this clearer in the text throughout the manuscript. In particular we have restructured the discussion to focus on this point. Regarding the specific alternatives the reviewer mentions here:

      a) Uncompensated brain motion. While this can certainly contribute, we think the effect is negligible in our interpretation for the following reasons. First, just to point out the obvious, as with all two-photon data we acquire in the lab, we only keep data with no visible z-motion (axial). Second, and more importantly, uncompensated brain motion results in a net decrease of fluorescence. As regions of interest (ROI) are selected to be centered on neurons (as opposed to being randomly selected, or placed next to, above, or below them), movement will, on average, result in a decrease in fluorescence, as neurons are moved out of the ROIs. In the early days of awake two-photon imaging (when preps were still less stable), we used this movement onset decrease in fluorescence as a sign that running onsets were selected correctly (i.e. with low variance). See e.g. the dip in the running onset trace at time zero in figure 3A of (Keller et al., 2012). Third, we find no evidence for any brain motion in the case of visual stimulation, while the GFP responses during locomotion and visual stimulation are of similar magnitude. We have added a quantification of brain motion (Figure S5) and a discussion of this point to the manuscript.

      b) Leakage of stimulation light. First, all light sources in the experimental room (the projector used for the mouse VR, the optogenetic stimulation light, as well as the computer monitors used to operate the microscope) are synchronized to the turnaround times of the resonant scanner of the two-photon microscope. Thus, light sources in the room are turned off for each line scan of the resonant scanner and turned on in the turnaround period. With a 12 kHz scanner this results in a light cycle of 24 kHz (see Leinweber et al., 2014 for details). The system is not perfect: we can occasionally get detectable light leak responses at the image edges (in the resonant axis, as a result of the exponential off kinetics of many LEDs and lasers). However, these are typically 2 orders of magnitude smaller than what one would get without synchronizing, far smaller than a single-digit percentage change in GFP responses, and only detectable at the image edges. Second, while in visual cortex dark running onsets are different from running onsets with the VR turned on (Figures 5A and B), they are indistinguishable in ACC (Figure 5C). Thus, we can rule out stimulation light artefacts.

      c) GFP’s sensitivity to changes in pH. Activity results in a decrease in neuronal intracellular pH (https://pubmed.ncbi.nlm.nih.gov/14506304/, https://pubmed.ncbi.nlm.nih.gov/24312004/) – decreasing pH decreases GFP fluorescence (https://pubmed.ncbi.nlm.nih.gov/9512054/).

      To reiterate, we don’t think hemodynamic occlusion is the only possible source of the effects we observe, but we do think it is most likely the largest.

      (2) Regardless of the underlying mechanisms driving the GFP responses, these activity-independent signals must be accounted for in functional imaging experiments. However, the present manuscript does not explore potential strategies to mitigate these effects. Exploring and demonstrating even partial mitigation strategies could have significant implications for the field.

      We concur. However, in brief, we think the only viable mitigation strategy (that we are capable of) is to repeat functional imaging with GFP imaging. To unpack this: There have been numerous efforts to mitigate these hemodynamic effects using isosbestic illumination. When we started to use such strategies in the lab for widefield imaging, we thought we would calibrate the isosbestic correction using GFP recordings. The idea was that if performed correctly, an isosbestic response should look like a GFP response. Try as we may, we could not get the isosbestic responses to look like a GFP response. We suspect this is a result of the fact that none of the light sources we used were perfectly matched to the isosbestic wavelength of the GCaMP variants we used (not for a lack of trying, but neither lasers nor LEDs were available for purchase with exact wavelength matches). Complicating this was then also the fact that the similarity (or dissimilarity) between isosbestic and GFP responses was a function of brain region. Importantly, however, just because we could not successfully apply isosbestic corrections does not mean it cannot be done. Hence, for the widefield experiments we then resorted to mitigating the problem by repeating the key experiments using GFP imaging (see e.g. (Heindorf and Keller, 2024)). Note, others have also argued that the best way to correct for hemodynamic artefacts is a GFP recording based correction (Valley et al., 2019). A second strategy we tried was using a second fluorophore (i.e. a red marker) in tandem with a GCaMP sensor. The problem here is that the absorption of the two by blood differs markedly, and once again a correction of the GCaMP signal using the red channel was questionable at best. Thus, we think the only viable mitigation strategy we have found is GFP recordings and testing whether the postulated effects seen with calcium indicators are also present in GFP responses.
      This work is our attempt at a post-hoc mitigation of the problem for our own previous two-photon imaging studies.

      (3) Several methodology details are missing from the Methods section. These include: (a) signal extraction methods for two-photon imaging data (b) neuropil subtraction methods (whether they are performed and, if so, how) (c) methods used to prevent visual stimulation light from being detected by the two-photon imaging system (d) methods to measure blood vessel diameter/area in each frame. The authors should provide more details in their revision.

      Our apologies, this was an oversight. All details have been added to the methods.

      Reviewer #2 (Public Review):

      In this study, Yogesh et al. aimed at characterizing hemodynamic occlusion in two photon imaging, where its effects on signal fluctuations are underappreciated compared to that in wide field imaging and fiber photometry. The authors used activity-independent GFP fluorescence, GCaMP and GRAB sensors for various neuromodulators in two-photon and widefield imaging during a visuomotor context to evaluate the extent of hemodynamic occlusion in V1 and ACC. They found that the GFP responses were comparable in amplitude to smaller GCaMP responses, though exhibiting context-, cortical region-, and depth-specific effects. After quantifying blood vessel diameter change and surrounding GFP responses, they argued that GFP responses were highly correlated with changes in local blood vessel size. Furthermore, when imaging with GRAB sensors for different neuromodulators, they found that sensors with lower dynamic ranges such as GRAB-DA1m, GRAB-5HT1.0, and GRAB-NE1m exhibited responses most likely masked by the hemodynamic occlusion, while a sensor with larger SNR, GRAB-ACh3.0, showed much more distinguishable responses from blood vessel change.

      Strengths

      This work is of broad interest to two photon imaging users and GRAB developers and users. It thoroughly quantifies the hemodynamic driven GFP response and compares it to previously published GCaMP data in a similar context, and illustrates the contribution of hemodynamic occlusion to GFP and GRAB responses by characterizing the local blood vessel diameter and fluorescence change. These findings provide important considerations for the imaging community and a sobering look at the utility of these sensors for cortical imaging.

      Importantly, they draw clear distinctions between the temporal dynamics and amplitude of hemodynamic artifacts across cortical regions and layers. Moreover, they show context dependent (Dark versus during visual stimuli) effects on locomotion and optogenetic light-triggered hemodynamic signals.

      Most of the first generation neuromodulator GRAB sensors showed relatively small responses, comparable to blood vessel changes in two photon imaging, which emphasizes a need to improve the dynamic range and response magnitude of future sensors and encourages the sensor users to consider removing hemodynamic artifacts when analyzing GRAB imaging data.

      Weaknesses

      (1) The largest weakness of the paper is that, while they convincingly quantify hemodynamic artifacts across a range of conditions, they do not quantify any methods of correcting for them. The utility of the paper could have been greatly enhanced had they tested hemodynamic correction methods (e.g. from Ocana-Santero et al., 2024) and applied them to their datasets. This would serve both to verify their findings-proving that hemodynamic correction removes the hemodynamic signal-and to act as a guide to the field for how to address the problem they highlight.

      See also our response to reviewer 1 comment 2.

      In the Ocana-Santero et al., 2024 paper they also first use GFP recordings to identify the problem. The mitigation strategy they then propose, and use, is to image a second fluorophore that emits at a different wavelength concurrently with the functional indicator. The authors then simply subtract the two signals to correct for hemodynamics (we think; the paper states “divisive”, but the data shown are more consistent with “subtractive” correction). However, the paper does not demonstrate that the hemodynamic signals in the red channel match those in the green channel. The evidence presented that this works is at best anecdotal. In our hands this does not work (meaning the red channel does not match GFP recordings). We suspect this is due to a combination of crosstalk from the simultaneously recorded functional channel and the fact that hemodynamic absorption is strongly wavelength specific, or to something we are doing wrong. Either way, we cannot contribute to this in the form of a mitigation strategy.

      Given that the GFP responses are a function of brain area and cortical depth, it is not a stretch to postulate that they also depend on the genetic cell type labelled. Thus, any GFP calibration used for correction will need to be repeated for each cell type and brain area. Once experiments are repeated using GFP (the strategy we advocate for; we don’t think there is a simpler way to do this), the “correction” is just a subtraction (or a visual comparison).
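      One way such a subtraction could look is sketched below. It assumes that GCaMP and GFP responses were recorded in matched conditions (same brain area, depth, cell type, and stimulus) and aligned to the same events, and that the comparison is made on population averages, since the two indicators are recorded in different neurons. The function name and array layout are hypothetical.

```python
import numpy as np

def gfp_corrected_response(gcamp_traces, gfp_traces):
    """Subtract the population-mean GFP response from the population-mean
    GCaMP response.

    Both inputs are event-aligned dF/F arrays of shape (n_neurons, n_timepoints)
    recorded under matched conditions; the neuron counts may differ.
    """
    gcamp_mean = np.mean(np.asarray(gcamp_traces, dtype=float), axis=0)
    gfp_mean = np.mean(np.asarray(gfp_traces, dtype=float), axis=0)
    if gcamp_mean.shape != gfp_mean.shape:
        raise ValueError("GCaMP and GFP traces must cover the same time points")
    return gcamp_mean - gfp_mean

# Hypothetical example: two GCaMP neurons, one GFP neuron, two time points.
corrected = gfp_corrected_response([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0]])
print(corrected)
```

      This only removes the average activity-independent component for the matched condition; it does not correct individual neurons.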

      (2) The paper attributes the source of 'hemodynamic occlusion' primarily to blood vessel dilation, but leaves unanswered how much may be due to shifts in blood oxygenation. Figure 4 directly addresses the question of how much of the signal can be attributed to occlusion by measuring the blood vessel dilation, but notably fails to reproduce any of the positive transients associated with locomotion in Figure 2. Thus, an investigation into or at least a discussion of what other factors (movement? Hb oxygenation?) may drive these distinct signals would be helpful.

      See also our response to reviewer 1 comment 1.

      We have added to Figure 4 an example of a positive transient. At running onset, superficial blood vessels in cortex tend to constrict and hence result in positive transients.

      We now also mention changes in blood oxygenation as a potential source of hemodynamic occlusion. And just to be clear, blood oxygenation (or flow) changes in absence of any fluorophore, do not lead to a two-photon signal. Just in case the reviewer was concerned about intrinsic signals – these are not detectable in two photon imaging.

      (3) Along these lines, the authors carefully quantified the correlation between local blood vessel diameter and GFP response (or neuropil fluorescence vs blood vessel fluorescence with GRAB sensors). To what extent does this effect depend on proximity to the vessels? Do GFP/ GRAB responses decorrelate from blood vessel activity in neurons further from vessels (refer to Figure 5A and B in Neyhart et al., Cell Reports 2024)?

      We indeed thought about quantifying this, but to do this properly would require having a 3d reconstruction of the blood vessel plexus above (with respect to the optical axis) the neuron of interest, as well as some knowledge of how each vessel dilates as a function of stimulus. The prime effect is likely from blood vessels that are in the 45 degrees illumination cone above the neuron (Author response image 2). Lateral proximity to a blood vessel is likely only of secondary relevance. Thus, performing such a measurement is impractical and of little benefit for others.

      Author response image 2.

      A schematic representation of the cone of illumination.

      While imaging a neuron (the spot on the imaging plane at the focus of the cone of illumination), the relevant blood vessels that primarily contribute to hemodynamic occlusion are those in the cone of illumination between the neuron and the objective lens. Blood vessels visible in the imaging plane (indicated by gray arrows), do not directly contribute to hemodynamic occlusion. Any distance dependence of hemodynamic occlusion in the observed response of a neuron to these blood vessels in the imaging plane is at best incidental.

      (4) Raw traces are shown in Figure 2 but we are never presented with the unaveraged data for locomotion of stimulus presentation times, which limits the reader's ability to independently assess variability in the data. Inclusion of heatmaps comparing event aligned GFP to GCaMP6f may be of value to the reader.

      We fear we are not sure what the reviewer means by “the unaveraged data for locomotion of stimulus presentation times”. We suspect this should read “locomotion or stimulus…”. We have added heat maps of the responses of all neurons of the data shown in Figure 1 – as Figure S2.

      (5) More detailed analysis of differences between the kinds of dynamics observed in GFP vs GCaMP6f expressing neurons could aid in identifying artifacts in otherwise clean data. The example neurons in Figure 2A hint at this as each display unique waveforms and the question of whether certain properties of their dynamics can reveal the hemodynamic rather than indicator driven nature of the signal is left open. Eg. do the decay rate and rise times differ significantly from GCaMP6f signals?

      The most informative distinction we have found is the difference in peak responses (Figure 2B). Decay and rise time measurements critically depend on the identification of “events”. As a function of how selective one is with what one calls an event (e.g. easy in example 1 of Figure 2 – but more difficult in examples 2 and 3), one gets very different estimates of rise and decay times. Because peak amplitudes are lower in GFP responses, rise and decay times will be either slower or noisier (depending on where the threshold for event detection is set).

      (6) The authors suggest that signal to noise ratio of an indicator likely affects the ability to separate hemodynamic response from the underlying fluorescence signal. Does the degree of background fluorescence affect the size of the artifact? If there was variation in background and overall expression level in the data this could potentially be used to answer this question. Could lower (or higher!) expression levels increase the effects of hemodynamic occlusion?

      There may be a misunderstanding (i.e. we might be misunderstanding the reviewer’s argument here). Our statement from the manuscript that the signal to noise ratio of an indicator matters is based on the simple consideration that hemodynamic occlusion is in the range of 0 to 2 % ΔF/F. The larger the dynamic range of the indicator, the less of a problem 2% ΔF/F are. Imagine an indicator with average responses in the 100’s of % ΔF/F - then this would be a non-problem. For indicators with a dynamic range less than 1%, a 2% artifact is a problem.
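      The arithmetic behind this argument can be sketched as follows. This is a minimal illustration, not an analysis from the manuscript: the function name `contamination_ratio` is hypothetical, and the 2% artifact is simply the upper end of the range quoted above.

```python
def contamination_ratio(artifact_dff, indicator_dff):
    """Size of the occlusion artifact relative to the indicator response.

    Both arguments are in % dF/F. A ratio well below 1 means the artifact
    is negligible; a ratio at or above 1 means it dominates the signal.
    """
    return artifact_dff / indicator_dff


# Upper end of the occlusion range quoted in the response (2% dF/F).
ARTIFACT_DFF = 2.0

# Hypothetical indicator response amplitudes in % dF/F.
for indicator_dff in (100.0, 10.0, 1.0):
    ratio = contamination_ratio(ARTIFACT_DFF, indicator_dff)
    print(f"indicator {indicator_dff:6.1f}% dF/F -> artifact/response = {ratio:.2f}")
```

      For a 100% response the artifact is a small fraction of the signal, whereas for a 1% response it is twice as large as the signal itself.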

      Regarding “background” fluorescence, we are not sure what is meant here. In case the reviewer means fluorescence that comes from indicator molecules in processes (as opposed to soma) that are typically ignored (or classified as neuropil) – we are not sure how this would help. The occlusion effects are identical for both somatic and axonal or dendritic GFP (the source of the GFP fluorescence is not relevant for the occlusion effect). In case the reviewer means “baseline” fluorescence – above a noise threshold ΔF/F<sub>0</sub> should be constant independent of F<sub>0</sub> (i.e. baseline fluorescence). This also holds in the data, see Figure S4. We might be stating the trivial - the normalization of fluorescence activity as ΔF/F<sub>0</sub> has the effect that the “occluder" effect is constant for all values of all F<sub>0</sub>.
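      The normalization argument can be made concrete with a small sketch. It assumes, purely for illustration, that occlusion acts as a multiplicative attenuation of the emitted fluorescence; the function name and the transmission values are hypothetical and not taken from the manuscript.

```python
def dff_from_occlusion(f_true, t_baseline, t_event):
    """dF/F0 produced by a change in transmission alone.

    f_true:      fluorescence that would be measured without any occlusion
    t_baseline:  fraction of light transmitted at baseline (assumed value)
    t_event:     fraction of light transmitted during the event (assumed value)
    """
    f0 = t_baseline * f_true  # measured baseline fluorescence
    f = t_event * f_true      # measured fluorescence during the event
    return (f - f0) / f0      # equals t_event / t_baseline - 1


# The same transmission change applied to dim and bright sources.
for f_true in (10.0, 100.0, 1000.0):
    print(f_true, round(dff_from_occlusion(f_true, 0.50, 0.49), 6))
```

      Under this assumption f_true cancels out of the ratio, so the occlusion effect expressed as ΔF/F<sub>0</sub> is the same for every baseline fluorescence level.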

      (7) The choice of the phrase 'hemodynamic occlusion' may cause some confusion as the authors address both positive and negative responses in the GFP expressing neurons, and there may be additional contributions from changes in blood oxygenation state.

      Regarding the potential confusion with regards to terminology, occlusion can decrease or increase.

      Only under the (incorrect) assumption that occlusion is zero at baseline would this be confusing – no? If the reviewer has a suggestion for a different term, we’d be open to changing it.

      Regarding blood oxygenation – this is absolutely correct, we did not explicitly point this out in the previous version of the manuscript. Occlusion changes are driven by a combination of changes to volume and “opacity” of the blood. Oxygenation changes would be in the second category. We have clarified this in the manuscript.

      (8) The choice of ACC as the frontal region provides a substantial contrast in location, brain movement, and vascular architecture as compared to V1. As the authors note, ACC is close to the superior sagittal sinus and thus is the region where the largest vascular effects are likely to occur. The reader is left to wonder how much of the ROI may or may not have included vasculature in the ACC vs V1 recordings as the only images of the recording sites provided are for V1. We are left unable to conclude whether the differences observed between these regions are due to the presence of visible vasculature, capillary blood flow or differences in neurovasculature coupling between regions. A less medial portion of M2 may have been a more appropriate comparison. At least, inclusion of more example imaging fields for ACC in the supplementary figures would be of value.

      Both the choice of V1 and ACC were simply driven by previous experiments we had already done in these areas with calcium indicators. And we agree, the relevant axis is likely distance from midline, not AP – i.e. RSC and ACC are likely more similar, and V1 and lateral M2 more similar. We have made this point explicitly in the manuscript and have added sample fields of view as Figure S1.

      (9) In Figure 3, How do the proportions of responsive GFP neurons compare to GCaMP6f neurons?

      We have added the data for GCaMP responses.

      (10) How is variance explained calculated in Figure 4? Is this from a linear model and R^2 value? Is this variance estimate for separate predictors by using single variable models? The methods should describe the construction of the model including the design matrix and how the model was fit and if and how cross validation was run.

      This is simply a linear model (i.e. R^2) – we have added this to the methods.
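      For readers unfamiliar with the measure, a minimal sketch of variance explained by a single-predictor linear model is given below. Whether the manuscript fits one predictor at a time or uses cross-validation is not stated here, so the single-variable, in-sample form shown is an assumption, and the function name is hypothetical.

```python
def variance_explained(x, y):
    """R^2 of an ordinary least-squares fit y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares.
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Toy example: x would be the blood vessel ROI response, y a GFP response.
print(variance_explained([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))
```

      The sketch does not guard against a constant predictor or a constant response, for which R^2 is undefined.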

      (11) Cortical depth is coarsely defined as L2/3 or L5, without numerical ranges in depth from pia.

      Layer 2/3 imaging was done at a depth of 100-250 μm from pia, and the same for layer 5 was 400-600 μm. This has been added to the methods.

      Overall Assessment:

      This paper is an important contribution to our understanding of how hemodynamic artifacts may corrupt GRAB and calcium imaging, even in two-photon imaging modes. Certain useful control experiments, such as intrinsic optical imaging in the same paradigms, were not reported, nor were any hemodynamic correction methods investigated. Thus, this limits both mechanistic conclusions and the overall utility with respect to immediate applications by end users. Nevertheless, the paper is of significant importance to anyone conducting two-photon or widefield imaging with calcium and GRAB sensors and deserves the attention of the broader neuroscience and in-vivo imaging community.

      Reviewer #3 (Public review):

      In this study, the authors aimed to investigate if hemodynamic occlusion contributes to fluorescent signals measured with two-photon microscopy. For this, they image the activity-independent fluorophore GFP in 2 different cortical areas, at different cortical depths and in different behavioral conditions. They compare the evoked fluorescent signals with those obtained with calcium sensors and neuromodulator sensors and evaluate their relationship to vessel diameter as a readout of blood flow.

      They find that GFP fluorescence transients are comparable to GCaMP6f stimuli-evoked signals in amplitude, although they are generally smaller. Yet, they are significant even at the single neuronal level. They show that GFP fluorescence transients resemble those measured with the dopamine sensor GRABDA1m and the serotonin sensor GRAB-5HT1.0 in amplitude and nature, suggesting that signals with these sensors are dominated by hemodynamic occlusion. Moreover, the authors perform similar experiments with wide-field microscopy which reveals the similarity between the two methods in generating the hemodynamic signals. Together the evidence presented calls for the development and use of high dynamic range sensors to avoid measuring signals that have another origin from the one intended to measure. In the meantime, the evidence highlights the need to control for those artifacts such as with the parallel use of activity independent fluorophores.

      Strengths:

      - Comprehensive study comparing different cortical regions in diverse behavioral settings in controlled conditions.

      - Comparison to the state-of-the-art, i.e. what has been demonstrated with wide-field microscopy.

      - Comparison to diverse activity-dependent sensors, including the widely used GCaMP.

      Weaknesses:

      (1) The kinetics of GCaMP is stereotypic. An analysis/comment on if and how the kinetics of the signals could be used to distinguish the hemodynamic occlusion artefacts from calcium signals would be useful.

      We might be misunderstanding what the reviewer means by “the kinetics of GCaMP are stereotypic”. The kinetics are clearly stereotypic if one has isolated single action potential responses in a genetically identified cell type. But data recorded in vivo looks very different, see e.g. example traces in figure 1g of (Keller et al., 2012). And these are selected example traces, the average GCaMP trace looks perhaps more like the three example traces shown in Figure 2 (this is not surprising if the GCaMP signals one records in vivo are a superposition of calcium responses and hemodynamic occlusion). All quantification of kinetics relies on identifying “events”. We cannot identify events in any meaningful way for most of the data (see e.g. examples 2 and 3 in Figure 2). The one feature we can reliably identify as differing between GCaMP and GFP responses is peak response amplitude (as quantified in Figure 2).

      (2) Is it possible that motion is affecting the signals in a certain degree? This issue is not made clear.

      See also our response to reviewer 1 comment 1. In brief, we have added a quantification of motion artefacts as Figure S5, and argue that motion artefacts could only account for locomotion onset responses (there is no detectable brain motion to visual responses) and would predict a decrease in fluorescence (not an increase).

      (3) The causal relationship with blood flow remains open. Hemodynamic occlusion seems a good candidate causing changes in GFP fluorescence, but this remains to be well addressed in further research.

      We agree – we have made this clearer in the manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Figure 2A shows three neurons with convincing GFP responses, with amplitudes often exceeding 100%. However, after seeing these data, I actually feel less convinced that these responses are related to hemodynamic occlusion. Blood vessel diameter changes by at most a few percent during behavior -- how could such small changes lead to >100% changes in GFP fluorescence?

      My guess is that these responses might instead be related to motion artifacts, particularly given the strong correlation between these responses and running speed (Figure 2A). One possible way to test this is by examining a pixelwise map of fluorescence changes (dF/F) during running vs. baseline. If hemodynamic effects are involved, one would likely see a shadow of the involved blood vessels in this map. Conversely, if motion artifacts are the primary factor, the map of dF/F should resemble the spatial gradients of the mean fluorescence image. Examining pixelwise maps of dF/F will likely provide insights regarding the nature of the GFP signals.

      The underlying assumption (“blood vessel diameter changes by at most a few percent”) might be incorrect here. (Note also, relevant is likely the cross section, not diameter.) See Figure 4A1 and B1 for quantification of example blood vessel area changes - both example vessels change area by approximately 50%. Also note, example 1 in Figure 2 is an extreme example. The example was chosen to highlight that effects can be large. To try to illustrate that this is not typical however, we also show the distribution of all neurons in Figure 2B and mark all three example cells – example 1 is at the very tail of the distribution.

      Regarding the analysis suggested, we have added examples of this for running onset to the manuscript (Figure S7). We have examples in which a blood vessel shadow is clearly visible. More typical however, is a general increase in fluorescence (on running onset) that we think is caused by blood vessels closer to the surface of the brain.
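      The pixelwise analysis suggested by the reviewer can be sketched as follows. This is a hedged illustration only: frames are represented as nested lists, the running flag per frame is assumed to be given, and the function names are hypothetical rather than those used for Figure S7.

```python
def mean_image(frames):
    """Average a list of equally sized 2D frames pixel by pixel."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]


def pixelwise_dff(frames, is_running):
    """Map of (F_running - F_baseline) / F_baseline for every pixel.

    frames:     list of 2D frames (lists of rows of pixel values)
    is_running: one boolean per frame, True while the animal is running
    """
    run = [f for f, flag in zip(frames, is_running) if flag]
    base = [f for f, flag in zip(frames, is_running) if not flag]
    f_run, f_base = mean_image(run), mean_image(base)
    return [[(fr - fb) / fb for fr, fb in zip(row_run, row_base)]
            for row_run, row_base in zip(f_run, f_base)]


# Toy data: the left column dims by 10% during running, as a vessel shadow would.
baseline_frame = [[100.0, 200.0], [100.0, 200.0]]
running_frame = [[90.0, 200.0], [90.0, 200.0]]
frames = [baseline_frame, baseline_frame, running_frame, running_frame]
print(pixelwise_dff(frames, [False, False, True, True]))
```

      In such a map a localized band of negative values would be consistent with a vessel shadow, whereas a pattern following the spatial gradients of the mean image would point to motion.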

      (2) Figure 3A shows strong GFP responses during running, while visuomotor mismatch elicit virtually no GFP-responsive neurons. This finding is puzzling, as visuomotor mismatch has been shown by the same group to activate L2/3 neurons more strongly than running (see Figure 3A, Keller et al., 2012, Neuron). Stronger neuronal activation should, in theory, result in more pronounced hemodynamic effects, and therefore, a higher proportion of GFP-responsive neurons. The absence of GFP responses during visuomotor mismatch raises questions about whether GFP signals are directly linked to hemodynamic occlusion.

      An alternative explanation is that the strong GFP responses observed during running could instead be driven by motion artifacts, e.g., those associated with the increased head or body movements during running onsets. Such artifacts could explain the observed GFP responses, rather than hemodynamic occlusion.

      This might be a misunderstanding. Mismatch responses are primarily observed in mismatch neurons. These are superficial L2/3 neurons (possibly the population that in higher mammals is L2 neurons). The fact that mismatch responses are primarily observed in this superficial population is likely the reason they were discovered using two-photon calcium imaging (which tends to have a bias towards superficial neurons as the image quality is best there), and seen in much fewer neurons when using electrophysiological techniques (Saleem et al., 2013) that are biased to deeper neurons. In response to Reviewer #2, we have now also added a quantification of the fraction of neurons responsive to these stimuli when using GCaMP (Figure 3D-F). The fraction of neurons responsive to visuomotor mismatch is smaller than the fractions responsive to locomotion onset or to visual stimuli.

      Thus, based on “average” responses across all cortical cell types (our L2/3 recordings here are as unbiased across all of L2/3 as possible) the response profiles (strong running onset and visual responses, and weak MM responses) are probably what one would expect in first approximation also in the blood vessel response profile. Complicating this is of course the fact that it is likely some cell type specific activity that contributes most to blood flow changes, not simply average neuronal activity.

      See response to public review comment 1 for a discussion of alternative sources, including motion artefacts.

      (3) Given the potential confound associated with brain motion, the authors might consider quantifying hemodynamic occlusion effects under more controlled conditions, such as in anesthetized animals, where brain movement is minimal. They could use drifting grating stimuli, which are known to produce well-characterized blood vessel and hemodynamic responses in V1. The effects of hemodynamic occlusion can then be quantified by imaging the fluorescence of an activity-independent marker. For maximal robustness, GFP should ideally be avoided, due to its known sensitivity to pH changes, as noted in the public review.

      Brain motion is negligible to visual stimuli in the awake mouse as well (Figure S5). This is likely the better control than anesthetized recordings – anesthesia has strong effects on blood pressure, heart rate, breathing, etc. all of which would introduce more confounds.

      (4) Regardless of the precise mechanism driving the observed GFP response, these activity-independent signals must be accounted for in functional imaging experiments. This applies not only to experiments using small dynamic range sensors but also to those employing 'high dynamic range' sensors like GCaMP6, which, according to the authors, exhibit responses only ~2-fold greater than those of GFP.

      In this context, the extensive GFP imaging data are highly valuable, as they could serve as a benchmark for evaluating the effectiveness of correction methods. Ideally, effective correction methods should produce minimal responses when applied to GFP imaging data. With these data at hand, I strongly encourage the authors to explore potential correction methods, as such methods could have far-reaching impact on the field.

      As discussed above, we have tested a number of such correction approaches for both widefield and two-photon imaging and could never recover a response profile that resembles the GFP response. The “correction method” we have come to favor, is repeating experiments using GFP (i.e. what we have done here).

      (5) Several correction approaches could be considered: for instance, the strong correlation between GFP responses and blood vessel diameter (as shown in Figure 4) could potentially be leveraged to predict and compensate for the activity-independent signals. Alternatively, expressing an activity-independent marker alongside the activity sensor in orthogonal spectral channels could enable simultaneous monitoring and correction of activity-independent signals. Finally, computational procedure to remove common fluctuations, measured from background or 'neuropil' regions (see, e.g., Kerlin et al., 2010, Neuron; Giovannucci et al., 2019, eLife), may help reduce the contamination in cellular ROIs. The authors could try some or all of these methods, and benchmark their effectiveness by assessing, e.g., the number of GFP responsive neurons after correction.

      Over the years we have tried many of these approaches. A correction using a second fluorophore of a different color likely fails because blood absorption is strongly wavelength dependent, making it challenging to calibrate the correction factor. Neuropil “correction” on GCaMP data, even with the best implementations, is just a common mode subtraction. The signal in the neuropil – as the name implies is just an average of many axons and dendrites in the vicinity – most of these processes are from nearby neurons making a neuropil response simply an average response of the neurons in some neighborhood. Adding the problem of hemodynamic responses (which on small scales will also influence nearby neurons and neuropil similarly) makes disentangling the two effects impossible (i.e. neuropil subtraction makes the problem worse, not better). However, just because we fail in implementing all of these methods, does not necessarily mean the method is faulty. Hence we have chosen not to comment on any such method, and simply provide the only mitigation strategy that works in our hands – record GFP responses.

      (6) Given the potential usefulness of the GFP imaging data, I encourage the authors to share these data in a public repository to facilitate the development of correction methods.

      Certainly – all of our data are always published: in the early years of the lab on an FMI repository (https://data.fmi.ch/), and more recently on Zenodo.

      (7) As noted in the public review, several methodology details are missing. Most importantly, I could not find the description in the Methods section explaining how fluorescence signals from individual neurons were extracted from two-photon imaging data. The existing section on 'Extraction of neuronal activity' appears to cover only the wide-field analysis, with details about two-photon analysis seemingly absent.

      Please excuse the omission – this has all been added to the methods. In brief, to answer your questions:

      Were regions of interest (ROIs) for individual cells identified manually or automatically?

      We use a mixture of manual and automatic methods for our two-photon data. Based on a median filtered (spatially) version of the mean fluorescence image, we used a threshold based selection of ROIs. This was then visually inspected and manually corrected where necessary such that ROIs were at least 250 pixels and only labelled clearly identifiable neurons.

      Was fluorescence within each ROI calculated by averaging signals across pixels, or were signal de-mixing algorithms (e.g., PCA, ICA, or NMF) applied?

      We use the average fluorescence across pixels without any de-mixing algorithms here and in all our two-photon experiments. De-mixing algorithms can introduce a variety of artefacts.

      Additionally, did the authors account for and correct the contribution of surrounding neuropil?

      No neuropil correction was applied. It would also be difficult to see how this would help. If the model of hemodynamic occlusion is correct, one would expect occlusion effects to change on the length scale of blood vessels (i.e. tens to hundreds of microns). Thus, the effect of occlusion on neuropil and cells should be similar. Neuropil “correction” is always based on the idea of removing signals that are common to both neuropil and somata, thereby complicating the interpretation of the resulting signal even further.
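      The extraction steps described in the three answers above can be sketched as follows. This is a simplified illustration and not the lab's pipeline: the spatial median filter and the manual correction step are omitted, pixel connectivity is assumed to be 4-neighbour, and the function names and threshold value are hypothetical. Only the 250-pixel minimum ROI size and the plain averaging across pixels come from the response.

```python
from collections import deque


def threshold_rois(mean_image, threshold, min_pixels=250):
    """Connected groups of supra-threshold pixels with at least min_pixels."""
    rows, cols = len(mean_image), len(mean_image[0])
    seen = [[False] * cols for _ in range(rows)]
    rois = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or mean_image[r][c] <= threshold:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and mean_image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_pixels:
                rois.append(pixels)
    return rois


def roi_trace(frames, roi):
    """Mean fluorescence across the ROI's pixels for each frame.

    No de-mixing and no neuropil subtraction is applied.
    """
    return [sum(frame[y][x] for y, x in roi) / len(roi) for frame in frames]
```

      A candidate region smaller than `min_pixels` is discarded rather than merged with a neighbour, which is a design choice of this sketch.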

      Without these methodological details, it is difficult to accurately interpret the two-photon signals reported in the manuscript.

      (8) The rationale for using the average fluorescence of a ROI within the blood vessel as a proxy for blood vessel diameter is not entirely clear to me. The authors should provide a clearer justification for this approach in their revision.

      Consider a ROI placed within a blood vessel at the focus of the illumination cone (Author response image 3). Given that the point-spread function of two-photon imaging is in the range of 0.5 μm laterally and 3 μm axially (indicated by the bicone), emitted photons from the fluorescent tissue outside of the blood vessel but within the two-photon volume will contribute to changes in fluorescence in the ROI. A change in the blood vessel volume, say an increase on dilation, would decrease the amount of emission photons reaching the objective by, one, pushing more of the fluorescent tissue outside of the two-photon volume, and two, by presenting greater hemodynamic occlusion to the photons emitted by the fluorescent tissue immediately below the vessel. Conversely, on vasoconstriction there are more emission photons at the objective.

      In line with this argument, as shown in Figure 4A1-A2, B1-B2 and C1-C2, we do find that the change in fluorescence of the blood vessel ROI varies inversely with the area of the blood vessel. Of course, change in blood vessel ROI fluorescence is only a proxy for vessel size. Extracting blood vessel boundaries from individual two-photon frames was noisy and proved unreliable in the absence of specific dyes to label the vessel walls. We thus resorted to using blood vessel ROI fluorescence as a proxy for hemodynamic occlusion, and tested how much of the variance in GFP responses is explained by the change in blood vessel ROI response.

      We have added an explanation to the manuscript, as suggested.

      Author response image 3.

      Average response of ROIs placed within blood vessels co-varies with hemodynamic occlusion.

      (9) I find that the Shen et al., 2012, Nature Methods paper has gone quite far to demonstrate the effect of hemodynamic occlusion in two photon imaging. Therefore, I suggest the authors describe and cite this work not only in the discussion but also in the introduction, where they can highlight the key questions left unanswered by that study and explain how their manuscript aims to address them.

      We have added the reference and point to the work in the introduction as suggested.

      Reviewer #3 (Recommendations for the authors):

      I appreciate very much that the study is presented in a very clear manner.

      A few comments that could clarify it even further:

      (1) Fig. 1: make clear on legend if it is an average of full FOVs.

      The traces shown are the average over ROIs (neurons) – we have clarified in the figure legend as suggested.

      (2) Give a more complete definition of hemodynamic occlusion to understand the hypothesis in the relationship between blood vessel dilation and GFP fluorescence (116-119). Maybe, move the phrase from conclusion "Since blood absorbs light, hemodynamic occlusion can affect fluorescence intensity measurements" (219-220).

      Very good point – we expanded on the definition in the introduction.

      (3) For clarity, mention in the main text the method used to assess how a parameter explains the variance (126-129).

      This has been implemented.

      (4) Discuss the possible relationship of the signals to neuronal activity.

      We have added this to the discussion.

      (5) Discuss if the measurements could provide any functional insights, whether they could be used to learn something about the brain.

      We have added this to the discussion.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The manuscript by Wagstyl et al. describes an extensive analysis of gene expression in the human cerebral cortex and the association with a large variety of maps capturing many of its microscopic and macroscopic properties. The core methodological contribution is the computation of continuous maps of gene expression for >20k genes, which are being shared with the community. The manuscript is a demonstration of several ways in which these maps can be used to relate gene expression with histological features of the human cortex, cytoarchitecture, folding, function, development and disease risk. The main scientific contribution is to provide data and tools to help substantiate the idea of the genetic regulation of multi-scale aspects of the organisation of the human brain. The manuscript is dense, but clearly written and beautifully illustrated.

      Main comments

      The starting point for the manuscript is the construction of continuous maps of gene expression for most human genes. These maps are based on the microarray data from 6 left human brain hemispheres made available by the Allen Brain Institute. By technological necessity, the microarray data is very sparse: only 1304 samples to map all the cortex after all subjects were combined (a single individual's hemisphere has ~400 samples). Sampling is also inhomogeneous due to the coronal slicing of the tissue. To obtain continuous maps on a mesh, the authors filled the gaps using nearest-neighbour interpolation followed by strong smoothing. This may have two potentially important consequences that the authors may want to discuss further: (a) the intrinsic geometry of the mesh used for smoothing will introduce structure in the expression map, and (b) strong smoothing will produce substantial, spatially heterogeneous, autocorrelations in the signal, which are known to lead to a significant increase in the false positive rate (FPR) in the spin tests they used.

      Many thanks to the reviewer for their considered feedback. We have addressed these primary concerns in point-by-point responses below. The key conclusions from our new analyses are: (i) while the intrinsic geometry of the mesh had not originally been accounted for in sufficient detail, the findings presented in this manuscript are not driven by mesh-induced structure, (ii) the spin test null models used in this manuscript (including a modified version introduced in response to (i)) are currently the most appropriate way to mitigate against inflated false positive rates when making statistical inferences on smooth, surface-based data.

      a. Structured smoothing

      A brain surface has intrinsic curvature (Gaussian curvature, which cannot be flattened away without tearing). The size of the neighbourhood around each surface vertex will be determined by this curvature. During surface smoothing, this will make that the weight of each vertex will be also modulated by the local curvature, i.e., by large geometric structures such as poles, fissures and folds. The article by Ciantar et al (2022, https://doi.org/10.1007/s00429-022-02536-4) provides a clear illustration of this effect: even the mapping of a volume of pure noise into a brain mesh will produce a pattern over the surface strikingly similar to that obtained by mapping resting state functional data or functional data related to a motor task.

      Comment 1

      It may be important to make the readers aware of this possible limitation, which is in large part a consequence of the sparsity of the microarray sampling and the necessity to map that to a mesh. This may confound the assessments of reproducibility (results, p4). Reproducibility was assessed by comparing pairs of subgroups split from the total 6. But if the mesh is introducing structure into the data, and if the same mesh was used for both groups, then what's being reproduced could be a combination of signal from the expression data and signal induced by the mesh structure.

      Response 1

      The reviewer raises an important question regarding the potential for interpolation and smoothing on a cortical mesh to induce a common/correlated signal due to the intrinsic mesh structure. We have now generated a new null model to test this idea which indicates that intrinsic mesh structure is not inflating reproducibility in interpolated expression maps. This new null model spins the original samples prior to interpolation, smoothing and comparison between triplet splits of the six donors, with independent spins shared across the triplet. For computational tractability we took one pair of triplets and regenerated the dataset for each triplet using 10 independent spins. We used these to estimate gene-gene null reproducibility for 90 independent pairwise combinations of these 10 spins. Across these 90 permutations, the average median gene-gene correlation was R=0.03, whereas in the unspun triplet comparisons this was R=0.36. These results indicate that the primary source of the gene-level triplet reproducibility is the underlying shared gene expression pattern rather than interpolation-induced structure.
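      The bookkeeping of this null comparison can be sketched as follows. The sketch is illustrative only: it assumes that the 90 combinations are the ordered pairs of distinct spins (spin i for one triplet, spin j for the other, with i ≠ j), which is one reading that yields 90 from 10 spins, and all function names and the dictionary layout are hypothetical.

```python
from itertools import permutations
from statistics import median


def pearson(a, b):
    """Pearson correlation between two equally long value lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def null_spin_pairs(n_spins=10):
    """Ordered pairs of distinct spin indices (assumed pairing scheme)."""
    return list(permutations(range(n_spins), 2))


def median_gene_correlation(maps_a, maps_b):
    """Median across genes of the spatial correlation between two map sets.

    maps_a, maps_b: dict mapping gene name -> list of per-vertex values.
    """
    return median(pearson(maps_a[g], maps_b[g]) for g in maps_a)


print(len(null_spin_pairs()))
```

      Averaging `median_gene_correlation` over all null pairs would give the quantity reported above as R=0.03, to be compared with R=0.36 for the unspun triplets.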

      In Methods 2a: "An additional null dataset was generated to test whether the intrinsic geometry of the cortical mesh, and its impact on interpolation, could influence benchmarking analyses of DEMs and gradients (Fig S1d, Fig S2d, Fig S3c). In these analyses, the original samples were rotated on the spherical surface prior to subsequent interpolation, smoothing and gradient calculation. Due to computational constraints the full dataset was recreated only for 10 independent spins. These are referred to as the “spun+interpolated null”."
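
      The order of operations in this null (rotate the samples first, interpolate afterwards) and the count of 90 comparisons can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and toy coordinates are hypothetical, and reading the 90 combinations as ordered pairs of distinct spins (one spin per triplet) is an assumption.

```python
import itertools
import math
import random


def random_rotation(rng):
    """Draw a 3x3 rotation matrix from a uniformly random unit quaternion."""
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    a, b = math.sqrt(1.0 - u1), math.sqrt(u1)
    w = a * math.sin(2.0 * math.pi * u2)
    x = a * math.cos(2.0 * math.pi * u2)
    y = b * math.sin(2.0 * math.pi * u3)
    z = b * math.cos(2.0 * math.pi * u3)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def spin_samples(sample_coords, rotation):
    """Apply one rotation to every sample coordinate on the unit sphere."""
    return [
        tuple(sum(rotation[i][k] * p[k] for k in range(3)) for i in range(3))
        for p in sample_coords
    ]


# Hypothetical sample locations on the unit sphere (not real AHBA coordinates).
sample_coords = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

rng = random.Random(0)
n_spins = 10
# Each spun set would then be interpolated and smoothed like the real data.
spun_sets = [spin_samples(sample_coords, random_rotation(rng)) for _ in range(n_spins)]

# Assumed pairing: spin i for one triplet and a different spin j for the other.
spin_pairs = list(itertools.permutations(range(n_spins), 2))
print(len(spin_pairs))  # 90
```

      Because the rotation is applied to the sample locations rather than to the finished maps, any structure that the mesh itself imposes during interpolation and smoothing is still present in the null.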

      Author response image 1.

      Figure S1d, Gene predictability was higher across all triplet-triplet pairs than when compared to spun+interpolated null.

      Comment 2

      It's also possible that mesh-induced structure is responsible in part for the "signal boost" observed when comparing raw expression data and interpolated data (fig S1a). How do you explain the signal boost of the smooth data compared with the raw data otherwise?

      Response 2

      We thank the reviewer for highlighting this issue of mesh-induced structure. We first sought to quantify the impact of mesh-induced structure through the new null model, in which the data are spun prior to interpolation. New figure S1d, S2d and S3c all show that the main findings are not driven by interpolation over a common mesh structure, but rather originate in the underlying expression data.

      Specifically, for the original Figure S1a, the reviewer highlights a limitation that we compared intersubject predictability of raw-sample to raw-sample and interpolated-to-interpolated. In this original formulation improved prediction scores for interpolated-to-interpolated (the “signal boost”) could be driven by mesh-induced structure being applied to both the input and predicted maps. We have updated this so that we are now comparing raw-to-raw and interpolated-to-raw, i.e. whether interpolated values are better estimations of the measured expression values. The new Fig S1a&b (see below) shows a signal boost in gene-level and vertex-level prediction scores (delta R = +0.05), and we attribute this to the minimisation of location and measurement noise in the raw data, improving the intersubject predictability of expression levels.

      In Methods 2b: "To assess the effect of data interpolation in DEM generation we compared gene-level and vertex-level reproducibility of DEMs against a “ground truth” estimate of these reproducibility metrics based on uninterpolated expression data. To achieve a strict comparison of gene expression values between different individuals at identical spatial locations we focused these analyses on the subset of AHBA samples where a sample from one subject was within 3 mm geodesic distance of another. This resulted in 1097 instances (spatial locations) with measures of raw gene expression of one donor, and predicted values from the second donor’s un-interpolated AHBA expression data and interpolated DEM. We computed gene-level and vertex-level reproducibility of expression using the paired donor data at each of these sample points for both DEM and uninterpolated AHBA expression values. By comparing DEM reproducibility estimates with those for uninterpolated AHBA expression data, we were able to quantify the combined effect of interpolation and smoothing steps in DEM generation. We used gene-level reproducibility values from DEMs and uninterpolated AHBA expression data to compute a gene-level difference in reproducibility, and we then visualized the distribution of these difference values across genes (Fig S1a). We used gene-rank correlation to compare vertex-level reproducibility values between DEMs and uninterpolated AHBA expression data (Fig S1b)."

      Author response image 2.

      Figure S1. Reproducibility of Dense Expression Maps (DEMs) interpolated from spatially sparse postmortem measures of cortical gene expression. a, Signal boost in the interpolated DEM dataset vs. spatially sparse expression data. Restricting to samples taken from approximately the same cortical location in pairs of individuals (within 3mm geodesic distance), there was an overall improvement in intersubject spatial predictability in the interpolated maps. Furthermore, genes with lower predictability in the interpolated maps were less predictable in the raw dataset, suggesting these regions exhibit higher underlying biological variability rather than methodologically introduced bias. b, Similarly at the paired sample locations, gene-rank predictability was generally improved in DEMs vs. sparse expression data (median change in R from sparse samples to interpolated for each pair of subjects, +0.5).

      Comment 3

      How do you explain that despite the difference in absolute value the combined expression maps of genes with and without cortical expression look similar? (fig S1e: in both cases there's high values in the dorsal part of the central sulcus, in the occipital pole, in the temporal pole, and low values in the precuneus and close to the angular gyrus). Could this also reflect mesh-smoothing-induced structure?

      Response 3

      As with comment 1, this is an interesting perspective that we had not fully considered. We would first like to clarify that non-cortical expression is defined from the independent datasets including the “cortex” tissue class of the human protein atlas and genes identified as markers for cortical layers or cortical cells in previous studies. This is still likely an underestimate of true cortically expressed genes as some of these “non-cortical genes” had high intersubject reproducibility scores. Nevertheless we think it appropriate to use a measure of brain expression independent of anything included in other analyses for this paper. These considerations are part of the reason we provide all gene maps with accompanying uncertainty scores for user discretion rather than simply filtering them out.

      In terms of the spatially consistent pattern of the gene ranks of Fig S1f, this consistent spatial pattern mirrors Transcriptomic Distinctiveness (r=0.52 for non-cortical genes, r=0.75 for cortical genes), so we think that as the differences in expression signatures become more extreme, the relative ranks of genes in that region are more reproducible/easier to predict.

      To assess whether mesh-smoothing-induced structure is playing a role, we carried out an additional analysis using the new null model introduced in response to comment 1, and asked if the per-vertex gene rank reproducibility of independently spun subgroup triplets showed a similar structure to that in our original analyses. Across the 90 permutations, the median correlation between vertex reproducibility and TD was R=0.10. We also recalculated the TD maps for the 10 spun datasets and the mean correlation with the original TD did not significantly differ from zero (mean R = 0.01, p=0.2, nspins =10). These results indicate that folding morphology is not the major driver of local or large scale patterning in the dataset. We have included this as a new Figure S3c.

      We have updated the text as follows:

      In Methods 3a: "Third, to assess whether the covariance in spatial patterning across genes could be a result of mesh-associated structure introduced through interpolation and smoothing, TD maps were recomputed for the spun+interpolated null datasets and compared to the original TD map (Fig S3c)."

      In Results: "The TD map observed from the full DEMs library was highly stable between all disjoint triplets of donors (Methods, Fig S3a, median cross-vertex correlation in TD scores between triplets r=0.77) and across library subsets at all deciles of DEM reproducibility (Methods, Fig S3b, cross-vertex correlation in TD scores r>0.8 for the 3rd-10th deciles), but was not recapitulated in spun null datasets (Fig S3c)."

      Author response image 3.

      Figure S3c, Correlations between TD and TD maps regenerated on datasets spun using two independent nulls, one where the rotation is applied prior to interpolation and smoothing (spun+interpolated) and one where it is applied to the already-created DEMs. In each null, the same rotation matrix is applied to all genes.

      Comment 4

      Could you provide more information about the way in which the nearest-neighbours were identified (results p4). Were they nearest in Euclidean space? Geodesic? If geodesic, geodesic over the native brain surface? over the spherically deformed brain? (Methods cite Moresi & Mather's Stripy toolbox, which seems to be meant to be used on spheres). If the distance was geodesic over the sphere, could the distortions introduced by mapping (due to brain anatomy) influence the geometry of the expression maps?

      Response 4

      We have clarified in the Methods that the mapping is to nearest neighbors on the spherically-inflated surface.

      The new null model we have introduced in response to comments 1 & 3 preserves any mesh-induced structure alongside any smoothing-induced spatial autocorrelations, and the additional analyses above indicate that the main results are not induced by systematic mesh-related interpolation signal. In response to an additional suggestion from the reviewer (Comment 13), we also assessed whether local distortions due to the mesh could be creating apparent border effects in the data, for instance at the V1-V2 boundary. At the V1-V2 border, which coincides anatomically with the calcarine sulcus, we computed the 10 genes with the highest expression gradient along this boundary in the actual dataset and the spun-interpolated null. The median expression gradient along this border in the actual dataset was higher than in any of the spun datasets, indicating that these boundary effects are not explained by the effects of interpolation and cortical geometry on the data (new Fig S2d). The text has been updated as follows:

      In Methods 1: "For cortical vertices with no directly sampled expression, expression values were interpolated from their nearest sampled neighbor vertex on the spherical surface (Moresi and Mather, 2019) (Fig 1b)."
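
      The nearest-neighbor step on the sphere can be illustrated with a brute-force search over great-circle distances. This is a minimal sketch under stated assumptions: it does not use the Stripy toolbox cited above, and the function names, coordinates and values are hypothetical.

```python
import math


def to_unit(v):
    """Project a 3D point onto the unit sphere."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def great_circle_angle(a, b):
    """Angle in radians between two unit vectors (distance on the unit sphere)."""
    dot = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))


def nearest_sample_interpolation(vertices, sample_coords, sample_values):
    """Give each vertex the value of its nearest sample on the sphere."""
    unit_samples = [to_unit(s) for s in sample_coords]
    interpolated = []
    for vertex in vertices:
        v = to_unit(vertex)
        nearest = min(
            range(len(unit_samples)),
            key=lambda i: great_circle_angle(v, unit_samples[i]),
        )
        interpolated.append(sample_values[nearest])
    return interpolated


# Toy example: two sampled locations and three unsampled vertices.
sample_coords = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
sample_values = [2.0, 5.0]
vertices = [(0.0, 0.1, 1.0), (1.0, 0.2, 0.0), (0.0, 0.0, -1.0)]
print(nearest_sample_interpolation(vertices, sample_coords, sample_values))
# [2.0, 5.0, 5.0]
```

      Distances here are measured on the sphere, not on the folded surface, which is why distortions from spherical inflation are the relevant concern raised in Comment 4.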

      In Methods 2: "We used the spun+interpolated null to test whether high gene gradients could be driven by non-uniform interpolation across cortical folds. We quantified the average gradient for all genes along the V1-V2 border in the atlas, as well as for 10 iterations of the atlas where the samples were spun prior to interpolation. We computed the median gradient magnitude for the 20 top-ranked genes for each (Fig S2d)."

      Author response image 4.

      Figure S2d, Mean of gradient magnitudes for 20 genes with largest gradients along V1-V2 border, compared to values along the same boundary on the spun+interpolated null atlas. Gradients were higher in the actual dataset than in all spun versions, indicating that this high gradient feature is not primarily due to the effects of calcarine sulcus morphology on interpolation.

      Comment 5

      Could you provide more information about the smoothing algorithm? Volumetric, geodesic over the native mesh, geodesic over the sphere, averaging of values in neighbouring vertices, cotangent-weighted laplacian smoothing, something else?

      Response 5

      We are using surface-based geodesic smoothing over the white surface, as described in Glasser et al., 2013 and implemented in the HCP workbench toolbox (https://www.humanconnectome.org/software/connectome-workbench). We have updated the methods to clarify this.

      In Methods 1: "Surface expression maps were smoothed using the Connectome Workbench toolbox (Glasser et al. 2013) with a 20mm full-width at half maximum Gaussian kernel, selected to be consistent with this sampling density (Fig 1c)."
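
      For readers more used to kernel widths expressed as a standard deviation, the 20 mm FWHM can be converted with the standard Gaussian relation sigma = FWHM / (2 * sqrt(2 * ln 2)). The sketch below only illustrates the kernel shape; it is not Connectome Workbench's implementation, and the function names are hypothetical.

```python
import math


def fwhm_to_sigma(fwhm_mm):
    """Convert a Gaussian full-width at half maximum to a standard deviation."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def gaussian_weight(distance_mm, sigma_mm):
    """Unnormalised Gaussian weight for a vertex at a given geodesic distance."""
    return math.exp(-(distance_mm ** 2) / (2.0 * sigma_mm ** 2))


sigma = fwhm_to_sigma(20.0)
print(round(sigma, 2))                         # 8.49
print(round(gaussian_weight(10.0, sigma), 2))  # 0.5
```

      By definition, a vertex at half the FWHM (10 mm) from the centre receives half the weight of the centre vertex.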

      Comment 6

      Could you provide more information about the method used for computing the gradient of the expression maps (p6)? The gradient and the laplacian operator are related (the laplacian is the divergence of the gradient), which could also be responsible in part for the relationships observed between expression transitions and brain geometry.

      Response 6

      We are using Connectome Workbench’s metric gradient command for this (Glasser et al., 2013), as used in the HCP workbench pipeline. The source code for gradient calculation can be found here: https://github.com/Washington-University/workbench/blob/131e84f7b885d82af76ebe21adf2fa97795e2484/src/Algorithms/AlgorithmMetricGradient.cxx

      In Methods 2: "For each of the resulting 20,781 gene-level expression maps, the orientation and magnitude of gene expression change at each vertex (i.e. the gradient) was calculated for folded, inflated, spherical and flattened mesh representations of the cortical sheet using Connectome Workbench’s metric gradient command (Glasser et al. 2013)."

      b. Potentially inflated FPR for spin tests on autocorrelated data.

      Spin tests are extensively used in this work and it would be useful to make the readers aware of their limitations, which may confound some of the results presented. Spin tests aim at establishing if two brain maps are similar by comparing a measure of their similarity over a spherical deformation of the brains against a distribution of similarities obtained by randomly spinning one of the spheres. It is not clear which specific variety of spin test was used, but the original spin test has well known limitations, such as the violation of the assumption of spatial stationarity of the covariance structure (not all positions of the spinning sphere are equivalent, some are contracted, some are expanded), or the treatment of the medial wall (a big hole with no data is introduced when hemispheres are isolated).

      Another important limitation results from the comparison of maps showing autocorrelation. This problem has been extensively described by Markello & Misic (2021). The strong smoothing used to make a continuous map out of just ~1300 samples introduces large, geometry dependent autocorrelations. Indeed, the expression maps presented in the manuscript look similar to those with the highest degree of autocorrelation studied by Markello & Misic (alpha=3). In this case, naive permutations should lead to a false positive rate ~46% when comparing pairs of random maps, and even the most sophisticated methods have FPR>10%.

      Comment 7

      There are currently several researchers working on testing spatial similarity, and the readers would benefit from being made aware of the problem of the spin test and potential solutions. There are also packages providing alternative implementations of spin tests, such as BrainSMASH and BrainSpace, which could be mentioned.

      Response 7

      We thank the reviewer for raising the issue of null models. First, with reference to the false positive rate of 46% when maps exhibit spatial autocorrelation, we absolutely agree that this is an issue that must be accounted for and we address this using the spin test. We acknowledge there has been other work on nulls such as BrainSMASH and BrainSpace. Nevertheless, in the Markello and Misic paper to which the reviewer refers, the BrainSMASH null models perform worse with smoother maps (with false positive rates approaching 30% in panel e below), whereas the spin test maintains false positive rates below 10%.

      Author response image 5.

      We have added a brief description of the challenge and our use of the spin test.

      In Methods 2a: "Cortical maps exhibit spatial autocorrelation that can inflate the False Positive Rate, for which a number of methods have been proposed(Alexander-Bloch et al. 2018; Burt et al. 2020; Vos de Wael et al. 2020). At higher degrees of spatial smoothness, this high False Positive Rate is most effectively mitigated using the spin test(Alexander-Bloch et al. 2018; Markello and Misic 2021; Vos de Wael et al. 2020). In the following analyses when generating a test statistic comparing two spatial maps, to generate a null distribution, we computed 1000 independent spins of the cortical surface using https://netneurotools.readthedocs.io, and applied them to the first map whilst keeping the second map unchanged. The test statistic was then recomputed 1000 times to generate a null distribution for values one might observe by chance if the maps shared no common organizational features. This is referred to throughout as the “spin test” and the derived p-values as pspin."
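
      The logic of deriving pspin from a spun null distribution can be sketched as below. This is a minimal illustration, not the netneurotools implementation: the function names are hypothetical, the two-sided p-value formula with a +1 correction is an assumption (the text above does not state the exact formula), and a circular shift on a ring of vertices stands in for a rotation of the cortical sphere.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_from_null(observed, null_stats):
    """Two-sided permutation p-value with a +1 correction (assumed formula)."""
    exceed = sum(abs(r) >= abs(observed) for r in null_stats)
    return (1 + exceed) / (1 + len(null_stats))


def spin_test(map_a, map_b, spin_fn, n_spins=1000, seed=0):
    """Spin the first map, keep the second fixed, and recompute the statistic."""
    rng = random.Random(seed)
    observed = pearson(map_a, map_b)
    null_stats = [pearson(spin_fn(map_a, rng), map_b) for _ in range(n_spins)]
    return observed, null_stats, p_from_null(observed, null_stats)


def ring_shift(values, rng):
    """Toy stand-in for a spherical rotation: circularly shift a ring of vertices."""
    k = rng.randrange(1, len(values))
    return values[k:] + values[:k]


# Toy maps on a ring of 50 vertices (hypothetical data).
map_a = [math.sin(2.0 * math.pi * i / 50) + 0.1 * i for i in range(50)]
map_b = list(map_a)
observed, null_stats, p_spin = spin_test(map_a, map_b, ring_shift, n_spins=200)
```

      Because the shift preserves the spatial autocorrelation of the first map, the null statistics reflect what smooth but unrelated maps would produce, which is the property the spin test relies on.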

      Comment 8

      Could it be possible to measure the degree of spatial autocorrelation?

      Response 8

      We agree this could be a useful metric to generate for spatial cortical maps. However, there are multiple potential metrics to choose from and each of the DEMs would have their own value. To address this properly would require the creation of a set of validated tools and it is not clear how we could summarize this variety of potential metrics for 20k genes. Moreover, as discussed above the spin method is an adequate null across a range of spatial autocorrelation degrees, thus while we agree that in general estimation of spatial smoothness could be a useful imaging metric to report, we consider that it is beyond the scope of the current manuscript.

      Comment 9

      Could you clarify which version of the spin test was used? Does the implementation come from a package or was it coded from scratch?

      Response 9

      As Markello & Misic note, at the vertex level, the various implementations of the spin test become roughly equivalent to the ‘original’ Alexander-Bloch et al. implementation. We used the code for the ‘original’ version implemented in python here: https://netneurotools.readthedocs.io/en/latest/_modules/netneurotools/stats.html#gen_spinsamples.

      This has been updated in the methods (see Response 7).

      Comment 10

      Cortex and non-cortex vertex-level gene rank predictability maps (fig S1e) are strikingly similar. Would the spin test come up statistically significant? What would be the meaning of that, if the cortical map of genes not expressed in the cortex appeared to be statistically significantly similar to that of genes expressed in the cortex?

      Response 10

      Please see response to comment 3, which also addresses this observation.

      Reviewer #2 (Public Review):

      The authors convert the AHBA dataset into a dense cortical map and conduct an impressively large number of analyses demonstrating the value of having such data.

      I only have comments on the methodology.

      Comment 1

      First, the authors create dense maps by simply using nearest neighbour interpolation followed by smoothing. Since one of the main points of the paper is the use of a dense map, I find it quite light in assessing the validity of this dense map. The reproducibility values they calculate by taking subsets of subjects are hugely under-powered, given that there are only 6 brains, and they don't inform on local, vertex-wise uncertainties. I wonder if the authors would consider using Gaussian process interpolation. It is really tailored to this kind of problem and can give local estimates of uncertainty in the interpolated values. They could use leave-one-brain-out for hyperparameter tuning.

      I know it is a lot to ask to change the base method, as that means re-doing all the analyses. But I think it would strengthen the paper if the authors put as much effort in the dense mapping as they did in their downstream analyses of the data.

      Response 1

      We thank the reviewer for the suggestion to explore Gaussian process interpolation. We have implemented this for our dataset and attempted to compare this with our original method with the 3 following tests: i) intertriplet reproducibility of individual gene maps, ii) microscale validations: area markers, iii) macroscale validations: bio patterns.

      Overall, compared to our original nearest-neighbor interpolation method, GP regression (i) did not substantially improve gene-level reproducibility of expression maps (median correlation increase of R=0.07, which was greater for genes without documented protein expression in cortex); (ii) substantially worsened performance in predicting areal marker genes; and (iii) showed similar but slightly worse performance at predicting macroscale patterns from Figure 1.

      Given the significantly poorer performance on one of our key tests (ii) we have opted not to replace our original database, but we do now include code for the alternative GP regression methodology in the github repository so others can reproduce/further develop these methods.

      Author response image 6.

      ii) Genes ranked by mean expression gradient from current DEMs (left) and Gaussian process-derived interpolation maps (right). Established human and macaque markers are consistently higher-ranked in DEM maps. iii) Figure 1 Interpolated vs GP regression

      Author response table 1.

      Comment 2

      It is nice that the authors share some code and a notebook, but I think it is rather light. It would be good if the code was better documented, and if the user could have access to the non-smoothed data, in case they want to produce their own dense maps. I was only wondering why the authors didn't share the code that reproduces the many analyses/results in the paper.

      Response 2

      We thank the reviewer for this suggestion. In response we have updated the shared github repository (https://github.com/kwagstyl/magicc). This now includes code and notebooks to reproduce the main analyses and figures.

      Reviewer #1 (Recommendations For The Authors):

      Minor comments

      Comment 11

      p4 mentions Fig S1h, but the supp figures only go from S1a to S1g

      Response 11

      We thank the reviewer for capturing this error. It was in fact referring to what is now Fig S1h and has been updated.

      Comment 12

      It would be important that the authors share all the code used to produce the results in the paper in addition to the maps. The core methodological contribution of the work is a series of continuous maps of gene expression, which could become an important tool for annotation in neuroimaging research. Many arbitrary (reasonable) decisions were made, it would be important to enable users to evaluate their influence on the results.

      Response 12

      We thank both reviewers for this suggestion. We have updated the github to be able to reproduce the dense maps and key figures with our methods.

      Comment 13

      p5: Could the sharp border reflect the effect of the geometry of the calcarine sulcus on map smoothing? More generally, could there be an effect of folds on TD?

      Response 13

      Please see our response to Reviewer 1, Comment 1 above, where we introduce the new null models now analyzed to test for effects of mesh geometry on our findings. These new null models, where original source data were spun prior to interpolation, suggest that neither the sharp V1/2 border nor the TD map are effects of mesh geometry. Specifically: (i) the magnitudes of gradients along the V1/2 boundary from null models were notably smaller than those in our original analyses (see new figure S2d), and (ii) TD maps computed from the new null models showed no correlation with TD maps from our original analyses (new Figure S3c, mean R = 0.01, p=0.2, nspins =10).

      Comment 14

      p5: Similar for the matching with the areas in Glasser's parcellation: the definition of these areas involves alignment through folds (based on freesurfer 'sulc' map, see Glasser et al 2016). If folds influence the geometry of TDs, could that influence the match?

      Response 14

      We note that Fig S3c provided evidence that folding was not the primary driver of the TD patterning. However, it is true that Glasser et al. use both neuroanatomy (folding, thickness and myelin) and fMRI-derived maps to delineate their cortical areas. As such Figure 2 f & g aren’t fully independent assessments. Nevertheless the reason that these features are used is that many of the sulci in question have been shown to reliably delineate cytoarchitectonic boundaries (Fischl et al., 2008).

      In Results: "A similar alignment was seen when comparing gradients of transcriptional change with the spatial orientation of putative cortical areas defined by multimodal functional and structural in vivo neuroimaging(Glasser et al., 2016) (expression change running perpendicular to area long-axis, pspin<0.01, Fig 2g, Methods)."

      Comment 15

      p6: TD peaks are said to overlap with functionally-specialised regions. A comment on why audition is not there, nor language, but ba 9-46d is? Would that suggest a lesser genetic regulation of those functions?

      Response 15

      The reviewer raises a valid point and this was a result that we were also surprised by. The finding that the auditory cortex is not as microstructurally distinctive as, say, V1 is consistent with other studies applying dimensionality-reduction techniques to multimodal microstructural receptor data (e.g. Zilles et al., 2017, Goulas et al., 2020). These studies found that the auditory microstructure is not as extreme as that of either visual or somatomotor areas. From a methodological viewpoint, the primary auditory cortex is significantly smaller than both visual and somatomotor areas, and therefore is captured by fewer independent samples, which could reduce the detail in which its structure is being mapped in our dataset.

      For the frontal areas, we would note that i) the frontal peak is the smallest of all peaks found and was more strongly characterised by low z-score genes than high z-score genes, and ii) the anatomical areas in the frontal cortex are much more highly variable with respect to folding morphology (e.g. Rajkowska 1995). The anatomical label of ba9-46d (and indeed all other labels) was automatically generated as a localiser rather than a strict area label. We have clarified this in the text as follows:

      In Methods 3a: "Automated labels to localize TD peaks were generated based on their intersection with a reference multimodal neuroimaging parcellation of the human cortex(Glasser et al., 2016). Each TD was given the label of the multimodal parcel that showed greatest overlap (Fig 2b)."

      Comment 16

      p7: The proposition that "there is a tendency for cortical sulci to run perpendicular to the direction of fastest transcriptional change", could also be "there is a tendency for the direction of fastest transcriptional change to run perpendicular to cortical sulci"? More pragmatically, this could result from the geometry of transcriptional maps being influenced by sulcal geometry in their construction.

      Response 16

      Please see our response to Reviewer 1, Comment 1 above, where we introduce the new null models now analyzed to test for effects of mesh geometry on our findings. These models indicate that the topography of interpolated gene expression maps does not reflect influences of sulcal geometry on their construction.

      Comment 17

      p7: TD transitions are indicated to precede folding. This is based on a consideration of folding development based on the article by Chi et al 1977, which is quite an old reference. In that paper, the authors estimated the tempo of human folding development based on the inspection of photographs, which may not be sufficient for detecting the first changes in curvature leading to folds. The work of the Developing Human Connectome consortium may provide a more recent indication for timing. In their data, by PCW 21 there's already central sulcus, pre-central, post-central, intra-parietal, superior temporal, superior frontal which can be detected by computing the mean curvature of the pial surface (I can only provide a tweet for reference: https://twitter.com/R3RT0/status/1617119196617261056). Even by PCW 9-13 the callosal sulcus, sylvian fissure, parieto-occipital fissure, olfactory sulcus, cingulate sulcus and calcarine fissure have been reported to be present (Kostovic & Vasung 2009).

      Response 17

      Our field lacks the data necessary to provide a comprehensive empirical test for the temporal ordering of regional transcriptional profiles and emergence of folding. Our results show that transcriptional identities of V1 and TGd are, at least, present at the very earliest stages of sulcation in these regions. In response to the reviewer's comment, we have updated the text to cite a similar fetal mapping project, which likewise shows evidence of the folds between weeks 17-21, and have made the language around directionality more cautious.

      In Results: "The observed distribution of these angles across vertices was significantly skewed relative to a null based on random alignment between angles (pspin<0.01, Fig 2f, Methods) - indicating that there is indeed a tendency for cortical sulci and the direction of fastest transcriptional change to run perpendicular to each other (pspin<0.01, Fig 2f).

      As a preliminary probe for causality, we examined the developmental ordering of regional folding and regional transcriptional identity. Mapping the expression of high-ranking TD genes in fetal cortical laser dissection microarray data(Miller et al., 2014) from 21 PCW (Post Conception Weeks) (Methods) showed that the localized transcriptional identity of V1 and TGd regions in adulthood is apparent during the fetal periods when folding topology begins to emerge (Chi et al. 1977; Xu et al. 2022) (Fig S2d)."

      In Discussion: "By establishing that some of these cortical zones are evident at the time of cortical folding, we lend support to a “protomap”(Rakic 1988; O'Leary 1989; O'Leary et al. 2007; Rakic et al. 2009) like model where the placement of some cortical folds is set-up by rapid tangential changes in cyto-laminar composition of the developing cortex(Ronan et al., 2014; Toro and Burnod, 2005; Van Essen, 2020). The DEMs are derived from fully folded adult donors, and therefore some of the measured genetic-folding alignment might also be induced by mechanical distortion of the tissue during folding(Llinares-Benadero and Borrell 2019; Heuer and Toro 2019). However, no data currently exist to conclusively assess the directionality of this gene-folding relationship."

      Comment 18

      p7: In my supplemental figures (obtained from biorxiv, because I didn't find them among the files submitted to eLife) there's no S2j (only S2a-S2i).

      Response 18

      We apologize, this figure refers to S3k (formerly S3j), rather than S2j. We have updated the main text.

      Comment 19

      p7: It is not clear from the methods (section 3b) how the adult and fetal brains were compared. Maybe using MSM (Robinson et al 2014)?

      Response 19

      We have now clarified this in Methods text as reproduced below.

      In Methods 3b: "We averaged scaled regional gene expression values between donors per gene, and filtered for genes in the fetal LDM dataset that were also represented in the adult DEM dataset - yielding a single final 20,476*235 gene-by-sample matrix of expression values for the human cortex at 21 PCW. Each TD peak region was then paired with the closest matching cortical label within the fetal regions. This matrix was then used to test if each TD expression signature discovered in the adult DEM dataset (Fig 2, Table 3) was already present in similar cortical regions at 21 PCW."

      Comment 20

      p7: WGCNA is used prominently, could you provide a brief introduction to its objectives? The gene coexpression networks are produced after adjusting the weight of the network edges to follow a scale-free topology, which is meant to reflect the nature of protein-protein interactions. Soft thresholding increases contrast, but doesn't this decrease a potential role of infinitesimal regulatory signals?

      Response 20

      We agree with the reviewer that the introduction to WGCNA needed additional details and have amended the Results (see below). One limitation of WGCNA-derived associations is that the method will downweight the role of smaller relationships, including potentially important regulatory signals. WGCNA methods have been titrated to capture strong relationships. This is an inherent limitation of all co-expression driven methods, which leads to an incomplete characterisation of the molecular biology. Nevertheless we feel these stronger relationships are still worth capturing and interrogating. We have updated the text to introduce WGCNA and acknowledge this potential weakness in the approach.

      In Results: "Briefly, WGCNA constructs a connectivity matrix by quantifying pairwise co-expression between genes, raising the correlations to a power (here 6) to emphasize strong correlations while penalizing weaker ones, and creating a Topological Overlap Matrix (TOM) to capture both pairwise similarities in expression and connectivity. Modules of highly interconnected genes are identified through hierarchical clustering. The resultant WGCNA modules enable topographic and genetic integration because they each exist as both (i) a single expression map (eigenmap) for spatial comparison with neuroimaging data (Fig 3a,b, Methods) and, (ii) a unique gene set for enrichment analysis against marker genes systematically capturing multiple scales of cortical organization, namely: cortical layers, cell types, cell compartments, protein-protein interactions (PPI) and GO terms (Methods, Table S2 and S4)."
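
      The soft-thresholding and topological-overlap steps quoted above can be illustrated with a minimal sketch in plain Python. It is not the WGCNA package and not the authors' pipeline: the toy expression values, the helper names (`pearson`, `soft_threshold_adjacency`, `topological_overlap`) and the choice of an unsigned network are assumptions made for illustration. Only the power of 6 comes from the text.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def soft_threshold_adjacency(expr, power=6):
    """Unsigned adjacency a_ij = |cor(i, j)| ** power, with a zero diagonal."""
    g = len(expr)
    return [[0.0 if i == j else abs(pearson(expr[i], expr[j])) ** power
             for j in range(g)] for i in range(g)]

def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    g = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    tom = [[1.0 if i == j else 0.0 for j in range(g)] for i in range(g)]
    for i in range(g):
        for j in range(g):
            if i != j:
                shared = sum(adj[i][u] * adj[u][j] for u in range(g))
                tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
    return tom

# Toy data (hypothetical): rows are genes, columns are samples.
expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],   # gene A
    [1.1, 2.1, 2.9, 4.2, 4.8],   # gene B, strongly co-expressed with A
    [2.0, 1.0, 2.5, 1.5, 2.2],   # gene C, only weakly related to A and B
]
adj = soft_threshold_adjacency(expr, power=6)
tom = topological_overlap(adj)
```

      In this toy case the A-B edge keeps a weight close to 1 while the weak A-C edge is pushed toward 0, which is the contrast-enhancing behaviour the reviewer asks about. Module detection would then cluster genes on a dissimilarity such as 1 - TOM.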

      Comment 21

      WGCNA modules look even more smooth than the gene expression maps. Are these maps comparable to low frequency eigenvectors? Autocorrelation in that case should be very strong?

      Response 21

      These modules are smooth because they are indeed eigenvectors, which likely smooth out some of the more detailed but less common features seen in individual gene maps. They do exhibit high degrees of autocorrelation; nevertheless, we apply the spin test, which is currently the appropriate null model for spatially autocorrelated cortical maps (Response 7).

      Comment 22

      If the WGCNA modules provide an orthogonal basis for surface data, is it completely unexpected that some of them will correlate with low-frequency patterns? What would happen if random low frequency patterns were generated? Would they also show correlations with some of the 16 WGCNA modules?

      Response 22

      We agree with the reviewer that if we used a generative model like BrainSMASH, we would likely see similar low frequency patterns. However, the inserted figure in Response 7 from Markello & Misic provides evidence that such a generative model is not as conservative a null as the spin test when data exhibit high spatial autocorrelation. The spatial enrichment tests carried out on the WGCNA modules are all carried out using the spin test.

      Comment 23

      In part (a) I commented on the possibility that brain anatomy may introduce artifactual structure into the data that's being mapped. But what if the relationship between brain geometry and brain organisation were deeper than just the introduction of artefacts? The work of Lefebre et al (2014, https://doi.org/10.1109/ICPR.2014.107; 2018, https://doi.org/10.3389/fnins.2018.00354) shows that clustering based on the 3 lowest frequency eigenvectors of the Laplacian of a brain hemisphere mesh produce an almost perfect parcellation into lobes, with remarkable coincidences between parcel boundaries and primary folds and fissures. The work of Pang et al (https://doi.org/10.1101/2022.10.04.510897) suggests that the geometry of the brain plays a critical role in constraining its dynamics: they analyse >10k task-evoked brain maps and show that the eigenvectors of the brain laplacian parsimoniously explain the activity patterns. Could brain anatomy have a downward effect on brain organisation?

      Response 23

      The reviewer raises a fascinating extension of our work identifying spatial modes of gene expression. We agree that these are low frequency in nature, but would first like to note that the newly introduced null model indicates that the overlaps with salient neuroanatomical features are inherent in the expression data and not purely driven by anatomy in a methodological sense.

      Nevertheless we absolutely agree there is likely to be a complex multidirectional interplay between genetic expression patterns through development, developing morphology and the “final” adult topography of expression, neuroanatomical and functional patterns.

      We think that the manuscript already contains many in-depth analyses of these expression data, but agree that a more extensive modeling analysis of how expression might pattern or explain functional activation would be a fascinating follow-on, especially in light of these studies from Pang and Lefèvre. Nevertheless, we think that this must be left for a future modeling paper integrating these modes of microscale, macroscale and functional anatomy.

      In Discussion: "Indeed, future work might find direct links between these module eigenvectors and similar low-frequency eigenvectors of cortical geometry, which have been used as basis functions to segment the cortex (Lefèvre et al. 2018) and explain complex functional activation patterns (Pang et al. 2023)."

      Comment 24

      On p11: ASD related to rare, deleterious mutations of strong effect is often associated with intellectual disability (where the social interaction component of ASD is more challenging to assess). Was there some indication of a relationship with that type of cognitive phenotype?

      Response 24

      Across the two ABIDE cohorts, the total number of participants with ASD and IQ <70 (the clinical threshold for intellectual disability) was n=10, which unfortunately did not allow us to conduct a meaningful test of whether ID impacts the relationship between imaging changes in ASD and the expression maps of genes implicated in ASD by rare variants.

      Comment 25

      Could you clarify if the 6 donors were aligned using the folding-based method in freesurfer?

      Response 25

      The 6 donors were aligned using MSMsulc (Robinson et al., 2014), which is a folding based method from the HCP group. This is now clarified in the methods.

      In Methods 1: "Cortical surfaces were reconstructed for each AHBA donor MRI using FreeSurfer(Fischl, 2012), and coregistered between donors using surface matching of individuals’ folding morphology (MSMSulc) (Robinson et al., 2018)."

      Comment 26

      The authors make available a rich resource and a series of tools to facilitate their use. They have paid attention to encode their data in standard formats, and their code was made in Python using freely accessible packages instead of proprietary alternatives such as matlab. All this should greatly facilitate the adoption of the approach. I think it would be important to state more explicitly the conceptual assumptions that the methodology brings. In the same way that a GWAS approach relies on a Mendelian idea that individual alleles encode for phenotypes, what is the idea about the organisation of the brain implied by the orthogonal gene expression modules? Is it that phenotypes - micro and macro - are encoded by linear combinations of a reduced number of gene expression patterns? What would be the role of the environment? The role of non-genic regulatory regions? Some modalities of functional organisation do not seem to be encoded by the expression of any module. Is it just for lack of data or should this be seen as the sign for a different organisational principle? Likewise, what about the aspects of disorders that are not captured by expression modules? Would that hint, for example, to stronger environmental effects? What about linear combinations of modules? Nonlinear? Overall, the authors adopt implicitly, en passant, a gene-centric conceptual standpoint, which would benefit from being more clearly identified and articulated. There are citations to Rakic's protomap idea (I would also cite the original 1988 paper, and O'Leary's 1989 "protocortex" paper stressing the role of plasticity), which proposes that a basic version of brain cytoarchitecture is genetically determined and transposed from the proliferative ventricular zone regions to the cortical plate through radial migration. In p13 the authors indicate that their results support Rakic's protomap. 
      Additionally, in p7 the authors suggest that their results support a causal arrow going from gene expression to sulcal anatomy. The reviews by O'Leary et al (2007), Ronan & Fletcher (2014, already cited), Llinares-Benadero & Borrell (2019) could be considered, which also advocate for a similar perspective. For nuances on the idea that molecular signals provide positional information for brain development, the article by Sharpe (2019, DOI: 10.1242/dev.185967) is interesting. For nuances on the gene-centric approach of the paper the articles by Rockman (2012, DOI: 10.1111/j.1558-5646.2011.01486.x) but also from the ENCODE consortium showing the importance of non-genic regions of the genome ("Perspectives on ENCODE" 2020 DOI: 10.1038/s41586-021-04213-8) could be considered. I wouldn't ask to cite ideas from the extended evolutionary synthesis about different inheritance systems (as reviewed by Jablonka & Lamb, DOI: 10.1017/9781108685412) or the idea of inherency (Newman 2017, DOI: 10.1007/978-3-319-33038-9_78-1), but the authors may find them interesting. Same goes for our own work on mechanical morphogenesis which expands on the idea of a downward causality (Heuer and Toro 2019, DOI: 10.1016/j.plrev.2019.01.012).

      Response 26

      We thank the reviewer for recommending these papers, which we enjoyed reading and have deepened our thinking on the topic. In addition to toning down some of the language with respect to causality that our data cannot directly address, we have included additional discussion and references as follows:

      In Discussion: "By establishing that some of these cortical zones are evident at the time of cortical folding, we lend support to a “protomap” (Rakic 1988; O'Leary 1989; O'Leary et al. 2007; Rakic et al. 2009) like model where the placement of some cortical folds is set up by rapid tangential changes in cyto-laminar composition of the developing cortex (Ronan et al., 2014; Toro and Burnod, 2005; Van Essen, 2020). The DEMs are derived from fully folded adult donors, and therefore some of the measured genetic-folding alignment might also be induced by mechanical distortion of the tissue during folding (Llinares-Benadero and Borrell 2019; Heuer and Toro 2019). However, no data currently exist to conclusively assess the directionality of this gene-folding relationship."

      Overall, the manuscript is very interesting and a great contribution. The amount of work involved is impressive, and the presentation of the results very clear. My comments indicate some aspects that could be made clearer, for example by providing additional methodological information in the supplemental material. It would also help to make readers and future users of MAGICC aware of the methodological and conceptual challenges that remain to be addressed in the future for this field of research.

      Reviewer #2 (Recommendations For The Authors):

      Comment 1

      The supplementary figures seem to be missing from the eLife submission (although I was able to find them on europepmc)

      Response 1

      We apologize that these were not included in the documents sent to reviewers. The up-to-date supplementary figures are included in this resubmission and again on bioRxiv.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Overall, the conclusions of the paper are mostly supported by the data but may be overstated in some cases, and some details are also missing or not easily recognizable within the figures. The provision of additional information and analyses would be valuable to the reader and may even benefit the authors' interpretation of the data. 

      We thank the reviewer for the thoughtful and constructive feedback. We are pleased that the reviewer found the overall conclusions of our paper to be well supported by the data, and we appreciate the suggestions for improving figure clarity and interpretive accuracy. Below, we address each point with corresponding revisions.

      The conclusion that DREADD expression gradually decreases after 1.5-2 years is only based on a select few of the subjects assessed; in Figure 2, it appears that only 3 hM4Di cases and 2 hM3Dq cases are assessed after the 2-year timepoint. The observed decline appears consistent within the hM4Di cases, but not for the hM3Dq cases (see Figure 2C: the AAV2.1-hSyn-hM3Dq-IRES-AcGFP line is increasing after 2 years.) 

      We agree that our interpretation should be stated more cautiously, given the limited number of cases assessed beyond the two-year timepoint. In the revised manuscript, we have clarified in the Results that the observed decline is based on a subset of animals. We have also included a text stating that while a consistent decline was observed in hM4Di-expressing monkeys, the trajectory for hM3Dq expression was more variable with at least one case showing an increased signal beyond two years.

      Revised Results section:

      Lines 140, “hM4Di expression levels remained stable at peak levels for approximately 1.5 years, followed by a gradual decline observed in one case after 2.5 years, and after approximately 3 years in the other two cases (Figure 2B, a and e/d, respectively). Compared with hM4Di expression, hM3Dq expression exhibited greater post-peak fluctuations. Nevertheless, it remained at ~70% of peak levels after about 1 year. This post-peak fluctuation was not significantly associated with the cumulative number of DREADD agonist injections (repeated-measures two-way ANOVA, main effect of activation times, F<sub>(1,6)</sub> = 5.745, P = 0.054). Beyond 2 years post-injection, expression declined to ~50% in one case, whereas another case showed an apparent increase (Figure 2C, c and m, respectively).”

      Given that individual differences may affect expression levels, it would be helpful to see additional labels on the graphs (or in the legends) indicating which subject and which region are being represented for each line and/or data point in Figure 1C, 2B, 2C, 5A, and 5B. Alternatively, for Figures 5A and B, an accompanying table listing this information would be sufficient. 

      We thank the reviewer for these helpful suggestions. In response, we have revised the relevant figures (Fig. 1C, 2B, 2C, and 5) as noted in the “Recommendations for the authors”, including simplifying visual encodings and improving labeling. We have also updated Table 2 to explicitly indicate the animal ID and brain regions associated with each data point shown in the figures.

      While the authors comment on several factors that may influence peak expression levels, including serotype, promoter, titer, tag, and DREADD type, they do not comment on the volume of injection. The range in volume used per region in this study is between 2 and 54 microliters, with larger volumes typically (but not always) being used for cortical regions like the OFC and dlPFC, and smaller volumes for subcortical regions like the amygdala and putamen. This may weaken the claim that there is no significant relationship between peak expression level and brain region, as volume may be considered a confounding variable. Additionally, because of the possibility that larger volumes of viral vectors may be more likely to induce an immune response, which the authors suggest as a potential influence on transgene expression, not including volume as a factor of interest seems to be an oversight. 

      We thank the reviewer for raising this important issue. We agree that injection volume could act as a confounding variable, particularly since larger volumes were used only in handheld cortical injections. This overlap makes it difficult to disentangle the effect of volume from those of brain region or injection method. Moreover, data points associated with these larger volumes also deviated when volume was included in the model.

      To address this, we performed a separate analysis restricted to injections delivered via microinjector, where a comparable volume range was used across cases. In this subset, we included injection volume as an additional factor in the model and found that volume did not significantly impact peak expression levels. Instead, the presence of co-expressed protein tags remained a significant predictor, while viral titer no longer showed a significant effect. These updated results have replaced the originals in the revised Results section and in the new Figure 5. We have also revised the Discussion to reflect these updated findings.

      The authors conclude that vectors encoding co-expressed protein tags (such as HA) led to reduced peak expression levels, relative to vectors with an IRES-GFP sequence or with no such element at all. While interesting, this finding does not necessarily seem relevant for the efficacy of long-term expression and function, given that the authors show in Figures 1 and 2 that peak expression (as indicated by a change in binding potential relative to non-displaced radioligand, or ΔBPND) appears to taper off in all or most of the constructs assessed. The authors should take care to point out that the decline in peak expression should not be confused with the decline in longitudinal expression, as this is not clear in the discussion; i.e. the subheading, "Factors influencing DREADD expression," might be better written as, "Factors influencing peak DREADD expression," and subsequent wording in this section should specify that these particular data concern peak expression only. 

      We appreciate this important clarification. In response, we have revised the title to "Protein tags reduce peak DREADD expression levels" in the Results section and “Factors influencing peak DREADD expression levels” in the Discussion section. Additionally, we specified that our analysis focused on peak ΔBP<sub>ND</sub> values around 60 days post-injection. We have also explicitly distinguished these findings from the later-stage changes in expression seen in the longitudinal PET data in both the Results and Discussion sections.

      Reviewer #1 (Recommendations for the authors):

      (1) Will any of these datasets be made available to other researchers upon request?

      All data used to generate the figures have been made publicly available via our GitHub repository (https://github.com/minamimoto-lab/2024-Nagai-LongitudinalPET.git). This has been stated in the "Data availability" section in the revised manuscript.

      (2) Suggested modifications to figures:

      a) In Figures 2B and C, the inclusion of "serotype" as a separate legend with individual shapes seems superfluous, as the serotype is also listed as part of the colour-coded vector

      We agree that the serotype legend was redundant since this information is already included in the color-coded vector labels. In response, we have removed the serotype shape indicators and now represent the data using only vector-construct-based color coding for clarity in Figure 2B and C.

      b) In Figures 3A and B, it would be nice to see tics (representing agonist administration) for all subjects, not just the two that are exemplified in panels C-D and F-H. Perhaps grey tics for the non-exemplified subjects could be used.

      In response, we have included black and white ticks to indicate all agonist administrations across all subjects in Figure 3A and B, with the type of agonist clearly specified.

      c) In Figure 4C, a Nissl-stained section is said to demonstrate the absence of neuronal loss at the vector injection sites. However, if the neuronal loss is subtle or widespread, this might not be easily visualized by Nissl. I would suggest including an additional image from the same section, in a non-injected cortical area, to show there is no significant difference between the injected and non-injected region.

      To better demonstrate the absence of neuronal loss at the injection site, we have included an image from the contralateral, non-injected region of the same section for comparison (Fig. 4C).

      d) In Figure 5A: is it possible that the hM3Dq construct with a titer of 5×10^13 gc/ml is an outlier, relative to the other hM3Dq constructs used?

      We thank the reviewer for raising this important observation. To evaluate whether the high-titer construct represented a statistical outlier that might artifactually influence the observed trends, we performed a permutation-based outlier analysis. This assessment identified the point in question, as well as one additional case (titer 4.6×10^13 gc/ml, #255, L_Put), as significant outliers relative to the distribution of the dataset.

      Accordingly, we excluded these two data points from the analysis. Importantly, this exclusion did not meaningfully alter the overall trend or the statistical conclusions; specifically, the significant effect of co-expressed protein tags on peak expression levels remains robust. We have updated the Methods section to describe this outlier handling and added a corresponding note in the figure legend.

      Reviewer #2 (Public review): 

      Weaknesses 

      This study is a meta-analysis of several experiments performed in one lab. The good side is that it combined a large amount of data that might not have been published individually; the downside is that all things were not planned and equated, creating a lot of unexplained variances in the data. This was yet judiciously used by the authors, but one might think that planned and organized multicentric experiments would provide more information and help test more parameters, including some related to inter-individual variability, and particular genetic constructs. 

      We thank the reviewer for bringing this important point to our attention. We fully acknowledge that the retrospective nature of our dataset, compiled from multiple studies conducted within a single laboratory, introduces variability related to differences in injection parameters and scanning timelines. While this reflects the practical realities and constraints of long-term NHP research, we agree that more standardized and prospectively designed studies would better control such sources of variance. To address this, we have added the following statement to the "Technical consideration" section in Discussion:

      Lines 297, "This study included a retrospective analysis of datasets pooled from multiple studies conducted within a single laboratory, which inherently introduced variability across injection parameters and scan intervals. While such an approach reflects real-world practices in long-term NHP research, future studies, including multicenter efforts using harmonized protocols, will be valuable for systematically assessing inter-individual differences and optimizing key experimental parameters."

      Reviewer #2 (Recommendations for the authors):

      I just have a few minor points that might help improve the paper:

      (1) Figure 1C y-axis label: should add deltaBPnd in parentheses for clarity.

      We have added “ΔBP<sub>ND</sub>” to the y-axis label for clarity.

      The choice of a sigmoid curve is the simplest clear fit, but it doesn't really consider the presence of the peak described in the paper. Would there be a way to fit the dynamic including fitting the peak?

      We agree that using a simple sigmoid curve for modeling expression dynamics is a limitation. In response to this and a similar comment from Reviewer #3, we tested a double logistic function (as suggested) to see if it better represented the rise and decline pattern. However, as described below, the original simple sigmoid curve was a better fit for the data. We have included a discussion of this limitation of the analysis. See Reviewer #3 recommendations (2) for details.

      The colour scheme in Figure 1C should be changed to make things clearer, and maybe use another dimension (like dotted lines) to separate hM4Di from hM3Dq.

      We have improved the visual clarity of Figure 1C by modifying the color scheme to represent vector construct and using distinct line types (dashed for hM4Di and solid for hM3Dq data) to separate DREADD type.

      (2) Figure 2

      I don't understand how the referencing to 100 was made: was it by selecting the overall peak value or the peak value observed between 40 and 80 days? If the former then I can't see how some values are higher than the peak. If the second then it means some peak values occurred after 80 days and data are not completely re-aligned.

      We thank the reviewer for the opportunity to clarify this point. The normalization was based on the peak value observed between 40–80 days post-injection, as this window typically captured the peak expression phase in our dataset (see Figure 1). However, in some long-term cases where PET scans were limited during this period (e.g., with one scan performed at day 40), it is possible that the actual peak occurred later. Therefore, instances where ΔBP<sub>ND</sub> values slightly exceeded the reference peak at later time points likely reflect this sampling limitation. We have clarified this methodological detail in the revised Results section to improve transparency.
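
      The normalization described above can be sketched as follows. The scan days and values are invented for illustration and the function name `normalize_to_window_peak` is hypothetical. The sketch only shows why a value above 100% can appear when the single scan inside the 40–80 day window precedes the true peak.

```python
def normalize_to_window_peak(scans, window=(40, 80)):
    """Express each (day, value) scan as a percentage of the largest value
    observed inside the reference window (bounds inclusive)."""
    lo, hi = window
    in_window = [value for day, value in scans if lo <= day <= hi]
    if not in_window:
        raise ValueError("no scan falls inside the reference window")
    reference = max(in_window)
    return [(day, 100.0 * value / reference) for day, value in scans]

# Hypothetical animal with a single scan (day 40) inside the window;
# the later scans at day 120 and day 400 are slightly higher.
scans = [(20, 0.30), (40, 0.80), (120, 0.90), (400, 0.85)]
normalized = normalize_to_window_peak(scans)
```

      Here the day-40 scan defines the 100% reference, so the later, higher scans come out above 100% even though nothing unusual happened to expression.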

      The methods section mentions the use of CNO but this is not in the main paper which seems to state that only DCZ was used: the authors should clarify this

      Although DCZ was the primary agonist used, CNO and C21 were also used in a few animals (e.g., monkeys #153, #221, and #207) for behavioral assessments. We have clarified this in the Results section and revised Figure 3 to indicate the specific agonist used for each subject. Additionally, we have updated the Methods section to clearly specify the use and dosage of DCZ, CNO, and C21, to avoid any confusion regarding the experimental design.

      Reviewer #3 (Public review): 

      Minor weaknesses are related to a few instances of suboptimal phrasing, and some room for improvement in time course visualization and quantification. These would be easily addressed in a revision. These findings will undoubtedly have a very significant impact on the rapidly growing but still highly challenging field of primate chemogenetic manipulations. As such, the work represents an invaluable resource for the community.

      We thank the reviewer for the positive assessment of our manuscript and for the constructive suggestions. We address each comment in the following point-by-point responses and have revised the manuscript accordingly.

      Reviewer #3 (Recommendations for the authors):

      (1) Please clarify what the reasoning was behind restricting the analysis in Figure 1 only to 7 monkeys with subcortical AAV injection?

      We focused the analysis shown in Figure 1 on 7 monkeys with subcortical AAV injections who received comparable injection volumes. These data were primarily part of vector test studies, allowing for repeated PET scans within 150 days post-injection. In contrast, monkeys with cortical injections (including larger volumes) were allocated to behavioral studies and therefore were not scanned as frequently during the early phase. We will clarify this rationale in the Results section.

      (2) Figure 1: Not sure if a simple sigmoid is the best model for these, mostly peaking and then descending somewhat, curves. I suggest testing a more complex model, for instance, double logistic function of a type f(t) = a + b/(1+exp(-c*(t-d))) - e/(1+exp(-g*(t-h))), with the first logistic term modeling the rise to peak, and the second term for partial decline and stabilization

      We appreciate the reviewer’s thoughtful suggestion to use a double logistic function to better model both the rising and declining phases of the expression curve. In response to this and a similar comment from Reviewer #2, we tested the proposed model and found that, while it could capture the peak and subsequent decline, the resulting fit appeared less biologically plausible (see below). Moreover, model comparison using BIC favored the original simple sigmoid model (BIC = 61.1 vs. 62.9 for the simple and double logistic model, respectively). This information has been included in the revised figure legend for clarity.

      Given these results, we retained the original simple sigmoid function in the revised manuscript, as it provides a sufficient and interpretable approximation of the early expression trajectory—particularly the peak expression-time estimation, which was the main purpose of this analysis. We have updated the Methods section to clarify our modeling and rationale as follows:

      Lines 530, "To model the time course of DREADD expression, we used a single sigmoid function, referencing past in vivo fluorescent measurements (Diester et al., 2011). Curve fitting was performed using least squares minimization. For comparison, a double logistic function was also tested and evaluated using the Bayesian Information Criterion (BIC) to assess model fit."
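
      The model comparison described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' analysis code: the single sigmoid is assumed to be the first logistic term of the reviewer's formula (four parameters), the BIC is assumed to take the common Gaussian least-squares form n·ln(RSS/n) + k·ln(n), and the scan days, observed values and hand-picked parameters are all hypothetical stand-ins for fitted values.

```python
from math import exp, log

def sigmoid(t, a, b, c, d):
    """Single logistic rise: baseline a, amplitude b, rate c, midpoint d."""
    return a + b / (1.0 + exp(-c * (t - d)))

def double_logistic(t, a, b, c, d, e, g, h):
    """Reviewer's form: a logistic rise minus a second logistic decline."""
    return sigmoid(t, a, b, c, d) - e / (1.0 + exp(-g * (t - h)))

def bic_least_squares(observed, predicted, n_params):
    """BIC = n * ln(RSS / n) + k * ln(n), assuming Gaussian residuals."""
    n = len(observed)
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return n * log(rss / n) + n_params * log(n)

# Hypothetical scan days and expression values (not the study's data).
days = [10, 20, 30, 45, 60, 90, 120, 150]
obs = [0.05, 0.22, 0.55, 0.80, 0.84, 0.78, 0.76, 0.77]

# Hand-picked parameters stand in for a least-squares optimum.
p_single = (0.0, 0.80, 0.15, 25.0)
p_double = (0.0, 0.86, 0.15, 26.0, 0.09, 0.10, 80.0)

bic_single = bic_least_squares(obs, [sigmoid(t, *p_single) for t in days], 4)
bic_double = bic_least_squares(obs, [double_logistic(t, *p_double) for t in days], 7)
```

      Under this form of the BIC, the lower value indicates the preferred model, and the seven-parameter double logistic must reduce the residual sum of squares enough to offset an extra penalty of 3·ln(n) relative to the four-parameter sigmoid. With few scans per animal that penalty is hard to recover.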

      We also acknowledge that a more detailed understanding of post-peak expression changes will require additional PET measurements, particularly between 60- and 120-days post-injection, across a larger number of animals. We have included this point in the revised Discussion to highlight the need for future work focused on finer-grained modeling of expression decline:

      Lines 317, “Although we modeled the time course of DREADD expression using a single sigmoid function, PET data from several monkeys showed a modest decline following the peak. While the sigmoid model captured the early-phase dynamics and offered a reliable estimate of peak timing, additional PET scans—particularly between 60- and 120-days post-injection—will be essential to fully characterize the biological basis of the post-peak expression trajectories.”

      Author response image 1.

      (3) Figure 2: It seems that the individual curves are for different monkeys, I counted 7 in B and 8 in C, why "across 11 monkeys"? Were there several monkeys both with hM4Di and hM3Dq? Does not look like that from Table 1. Generally, I would suggest associating specific animals from Tables 1 and 2 to the panels in Figures 1 and 2.

      Some animals received multiple vector types, leading to more curves than individual subjects. We have revised the figure legends and updated Table 2 to explicitly relate each curve with the specific animal and brain region.

      (4) I also propose plotting the average of (interpolated) curves across animals, to convey the main message of the figure more effectively.

      We agree that plotting the mean of the interpolated expression curves would help convey the group trend. We added averaged curves to Figure 2B and C.

      (5) Similarly, in line 155 "We assessed data from 17 monkeys to evaluate ... Monkeys expressing hM4Di were assessed through behavioral testing (N = 11) and alterations in neuronal activity using electrophysiology (N = 2)..." - please explain how 17 is derived from 11, 2, 5 and 1. It is possible to glean from Table 1 that the calculation is 11 (including 2 with ephys) + 5 + 1 = 17, but it might appear as a mistake if one does not go deep into Table 1.

      We have clarified in both the text and Table 1 that some monkeys (e.g., #201 and #207) underwent both behavioral and electrophysiological assessments, resulting in the overlapping counts. Specifically, the dataset includes 11 monkeys for hM4Di-related behavior testing (two of which underwent electrophysiology testing), 5 monkeys assessed for hM3Dq with FDG-PET, and 1 monkey assessed for hM3Dq with electrophysiology, totaling 19 assessments across 17 monkeys. We have revised the Results section to make this distinction more explicit to avoid confusion, as follows:

      Lines 164, "Monkeys expressing hM4Di (N = 11) were assessed through behavioral testing, two of which also underwent electrophysiological assessment. Monkeys expressing hM3Dq (N = 6) were assessed for changes in glucose metabolism via [<sup>18</sup>F]FDG-PET (N = 5) or alterations in neuronal activity using electrophysiology (N = 1)."

      (6) Line 473: "These stock solutions were then diluted in saline to a final volume of 0.1 ml (2.5% DMSO in saline), achieving a dose of 0.1 ml/kg and 3 mg/kg for DCZ and CNO, respectively." Please clarify: the injection volume was always 0.1 ml? then it is not clear how the dose can be 0.1 ml/kg (for a several kg monkey), and why DCZ and CNO doses are described in ml/kg vs mg/kg?

      We thank the reviewer for pointing out this ambiguity. We apologize for the oversight and also acknowledge that we omitted mention of C21, which was used in a small number of cases. To address this, we have revised the “Administration of DREADD agonist” section of the Methods to clearly describe the preparation, the volume, and dosage for each agonist (DCZ, CNO, and C21) as follows:

      Line 493: “Deschloroclozapine (DCZ; HY-42110, MedChemExpress) was the primary agonist used. DCZ was first dissolved in dimethyl sulfoxide (DMSO; FUJIFILM Wako Pure Chemical Corp.) and then diluted in saline to a final volume of 1 mL, with the final DMSO concentration adjusted to 2.5% or less. DCZ was administered intramuscularly at a dose of 0.1 mg/kg for hM4Di activation, and at 1–3 µg/kg for hM3Dq activation. For behavioral testing, DCZ was injected approximately 15 min before the start of the experiment unless otherwise noted. Fresh DCZ solutions were prepared daily.

      In a limited number of cases, clozapine-N-oxide (CNO; Toronto Research Chemicals) or Compound 21 (C21; Tocris) was used as an alternative DREADD agonist for some hM4Di experiments. Both compounds were dissolved in DMSO and then diluted in saline to a final volume of 2–3 mL, also maintaining DMSO concentrations below 2.5%. CNO and C21 were administered intravenously at doses of 3 mg/kg and 0.3 mg/kg, respectively.”
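      To make the distinction between dose (mg/kg) and injection volume (mL) concrete, the sketch below computes the drug mass per injection and the largest DMSO volume compatible with the 2.5% limit. The body weight is a hypothetical value chosen for illustration, and the function names are ours; only the doses, volumes, and DMSO limit come from the text above.

```python
def drug_mass_mg(dose_mg_per_kg, body_weight_kg):
    """Total drug mass for one injection, given a weight-based dose."""
    return dose_mg_per_kg * body_weight_kg


def max_dmso_volume_ml(final_volume_ml, max_dmso_fraction=0.025):
    """Largest DMSO volume that keeps the final solution at or below the limit."""
    return final_volume_ml * max_dmso_fraction


body_weight_kg = 6.0  # hypothetical animal, for illustration only

dcz_hm4di_mg = drug_mass_mg(0.1, body_weight_kg)      # DCZ, 0.1 mg/kg
dcz_hm3dq_max_mg = drug_mass_mg(0.003, body_weight_kg)  # DCZ, upper end of 1-3 ug/kg
cno_mg = drug_mass_mg(3.0, body_weight_kg)            # CNO, 3 mg/kg
c21_mg = drug_mass_mg(0.3, body_weight_kg)            # C21, 0.3 mg/kg

print(f"DCZ (hM4Di): {dcz_hm4di_mg:.3f} mg in 1 mL, DMSO <= {max_dmso_volume_ml(1.0):.3f} mL")
print(f"DCZ (hM3Dq, max): {dcz_hm3dq_max_mg:.3f} mg in 1 mL")
print(f"CNO: {cno_mg:.1f} mg; C21: {c21_mg:.1f} mg in 2-3 mL")
```

      The dose fixes the drug mass for a given animal, while the final volume only determines the concentration of the injected solution.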

      (7) Figure 5A: What do regression lines represent? Do they show a simple linear regression (then please report statistics such as R-squared and p-values), or is it related to the linear model described in Table 3 (but then I am not sure how separate DREADDs can be plotted if they are one of the factors)?

      We thank the reviewer for the insightful question. In the original version of Figure 5A, the regression lines represented simple linear fits used to illustrate the relationship between viral titer and peak expression levels, based on our initial analysis in which titer appeared to have a significant effect without any notable interaction with other factors (such as DREADD type).

      However, after conducting a more detailed analysis that incorporated injection volume as an additional factor and excluded cortical injections and statistical outliers (as suggested by Reviewer #1), viral titer was no longer found to significantly predict peak expression levels. Consequently, we revised the figure to focus on the effect of reporter tag, which remained the most consistent and robust predictor in our model.

      In the updated Figure 5, we have removed the relationship between viral titer and expression level with regression lines.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      This work revealed an important finding that the blood-brain barrier (BBB) functionality changes with age and is more pronounced in males. The authors applied a non-invasive, contrast-agent-free approach of MRI called diffusion-prepared arterial spin labeling (DP-pCASL) to a large cohort of healthy human volunteers. DP-pCASL works by tracking the movement of magnetically labeled water (spins) in blood as it perfuses brain tissue. It probes the molecular diffusion of water, which is sensitive to microstructural barriers, and characterizes the signal coming from fast-moving spins as blood and slow-moving spins as tissue, using different diffusion gradients (b-values). This differentiation is then used to assess the water exchange rates (kw) across the BBB, which acts as a marker for BBB functionality. The main finding of the authors is that kw decreases with age, and in some brain regions, kw decreases faster in males. The neuroprotective role of the female sex hormone, estrogen, on BBB function is discussed as one of the explanations for this finding, supported by literature. The study also shows that BBB function remains stable until the early 60s and remarkably decreases thereafter.

      Strengths:

      The two main strengths of the study are the MRI method used and the amount of data. The authors employed a contrast-agent-free MRI method called ASL, which offers the opportunity to repeat such experiments multiple times without any health risk - a significant advantage of ASL. Since ASL is an emerging field that requires further exploration and testing, a study evaluating blood-brain barrier functionality is of great importance. The authors utilized a large dataset of healthy humans, where volunteer data from various studies were combined to create a substantial pool. This strategy is effective for statistically evaluating differences in age and gender.

      Weaknesses:

      R1.0: Gender-related differences are only present in some brain regions, not in the whole brain or gray matter - which is usually the assumption unless stated otherwise. From the title, this was not clear. Including simulations could increase readers' understanding related to model fitting and the interdependence of parameters, if present. The discussion follows a clear line of argument supported by literature; however, focusing solely on AQP4 channels and missing a critical consideration of other known/proven changes in transport mechanisms through the BBB and their effects substantially weakens the discussion. 

      Thanks for your insightful feedback and suggestions. We have made the following changes to the manuscript:

      (1) The title has been modified to highlight the sex differences in specific brain regions: “Age-Related Decline in Blood-Brain Barrier Function is More Pronounced in Males than Females in Parietal and Temporal Regions.”

      (2) To study the potential impact of prolonged ATT seen in males on estimated kw, we simulated kw distribution for females by adjusting ATT by +60 ms to match males' ATT. This led to marginally higher kw values (Supplemental Figure S2), suggesting that the kw difference between males and females is not a direct result of prolonged ATT. Additionally, we have added a section titled “Data and Code Availability Statements” in the revised manuscript to indicate that we are willing to share the reconstruction toolbox with interested groups. The toolbox is a standalone MATLAB-based program (no license required) to generate kw, CBF, and ATT maps, which can run on Windows or Mac computers.

      (3) We agree with the reviewer that BBB water exchange can be facilitated by other transport mechanisms, as we mentioned in the introduction: “Water exchange across the BBB occurs at a relatively high level and is mediated by passive diffusion, active co-transport through the endothelial membrane, and facilitated diffusion through the dedicated water channel, aquaporin-4 (AQP4), at the end-feet of astrocytes.” We emphasized our findings related to AQP4 based on the technical properties of DP-pCASL, which is more sensitive to the exchange occurring across astrocyte end-feet. We also acknowledge that different techniques can be helpful to study other components of BBB water exchange, and we have added the following discussion to the updated manuscript: “Mahroo et al., utilized a multi-echo ASL technique to measure BBB permeability to water and reported shorter intra-voxel transit time and lower BBB exchange time (Tex) in the older participants (≥50 years) compared to the younger group (≤20 years). In animal studies, reduced BBB Tex was also reported in the older mice compared to the younger group using multi-echo ASL and a multi-flip-angle, multi-echo dynamic contrast-enhanced (MFAME-DCE) MRI method. These findings contrast with the results presented in this study, likely due to the different components assessed by different techniques, and increased BBB permeability to water has been suggested to indicate a leakage of tight junctions in aging. In contrast, our recent study utilizing high resolution MCDW-pCASL scans with long averages reveals the potential existence of an intermediate stage of water exchange between vascular and tissue compartments (e.g., paravascular space or basal lamina). The DP module of the DP-pCASL is hypothesized to null the fast-flowing and pseudo-random oriented spins, which may include both vascular flow and less restricted water in paravascular space. 
The observed lower kw in older participants may be more related to the delayed exchange across the astrocyte end-feet into the tissue due to loss of AQP-4 water channel with older age. However, these hypotheses require further investigation to understand the exact mechanisms, especially under different physiological states. Future studies, particularly with animal models targeting specific BBB components under different physiological or diseased conditions, will be valuable for validating these measurements.”

      Reviewer #1 (Recommendations For The Authors): 

      R1.1 The manuscript is well-organized and presents arguments in a logical order. The visual representation of results in the form of figures is sufficient (see style suggestions below). 

      Thank you for your suggestions on improving the figures. We have updated the figures for better visualization (please see our responses to R1.5, R1.6, R1.7, and R1.8).

      R1.2 It would be beneficial if the model/toolbox could be made publicly available so that fellow researchers from the community could apply and test it in their research. 

      We have added a section, “Data and code availability statements,” to the revised manuscript to indicate that we are willing to share the toolbox with interested groups (L529 in the annotated manuscript). The toolbox is a standalone MATLAB-based program (no license required) that generates kw, CBF, and ATT maps and runs on Windows or Mac computers. Indeed, we have already shared our reconstruction toolbox with over 50 collaboration sites. The following screenshots are examples of three steps performed by the toolbox (shared by one collaborator):

      Author response image 1.

      Step 1: Loading raw data and calculating the T1 map

      Author response image 2.

      Step 2: Motion correction and skull stripping

      Author response image 3.

      Step 3: kw, CBF and ATT quantification (nii files will be saved)

      R1.3 Line 46 states that the technique is novel, but it has been introduced and used before (Shao, et al. MRM 2019). It sure is innovative but the term novel is too strong and may confuse the readers that it is something new introduced in this manuscript.

      Thanks for the suggestion. We agree that the term ‘novel’ may cause confusion about the technique, and we have removed it from the revised manuscript (L48, L50).

      R1.4 Line 395, kw was generated using PLD = 1.8s with b = 0, 50 s/mm2. Is only one-time point enough for estimating kw? To me, it is not clear how robust is the kw estimation with only one PLD.

      According to the single-pass approximation (SPA) model (1), kw can be accurately estimated when the PLD is longer than the ATT. We recruited cognitively normal participants in this study and found the longest ATT to be 1526.7±117.4 and 1468.1±166.9 ms in aged (62-92 years) males and females, respectively. A PLD of 1.8 s was chosen to balance the SNR of the data and the accuracy of the model fitting, which should be sufficient for this study. However, for future studies involving diseased populations with prolonged ATT, a longer PLD should be used, or a multi-PLD protocol could be helpful to improve the robustness of quantification accuracy.

      We have added a limitation statement in the revised manuscript (L407): "A single PLD of 1800 ms was used in this study, which should be sufficient to allow all the labeled water to reach the tissue (i.e., the longest ATT was 1526.7±117.4 and 1468.1±166.9 ms in aged males and females, respectively) (1). However, a longer PLD should be used in participants with longer expected ATT, such as in stroke and cerebrovascular disorders. Additionally, a multi-PLD protocol can also be helpful to improve the robustness of quantification accuracy (2)."
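      The condition that the PLD exceeds the ATT can be checked directly against the group means quoted above. The sketch below is illustrative only and compares the PLD with the mean ATT values; it does not account for the reported spread or for individual participants.

```python
PLD_MS = 1800.0  # single post-labeling delay used in the study

# Longest reported group-mean ATT values (ms), as quoted in the response
MEAN_ATT_MS = {
    "aged males": 1526.7,
    "aged females": 1468.1,
}


def pld_margin_ms(pld_ms, att_mean_ms):
    """Return how far the PLD exceeds the mean ATT (positive means PLD > ATT)."""
    return pld_ms - att_mean_ms


margins = {group: pld_margin_ms(PLD_MS, att) for group, att in MEAN_ATT_MS.items()}
for group, margin in margins.items():
    print(f"{group}: PLD exceeds mean ATT by {margin:.1f} ms")
```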

      R1.5 Suggestion: Figure 3A, colormap for kw appears suboptimal. Regional differences are hard to see.

      Thanks for the suggestion. We have narrowed the range of the color scale (from [0, 200] to [70, 160]) to highlight the regional differences in the updated Figure 3:

      We prefer to keep the same blue colormap that we and our collaborators have been using in publications, to maintain consistency. We have also acknowledged the limited spatial resolution of the kw maps in the updated manuscript (L412): “To compensate for the half signal loss of the non-CPMG DP module, relatively low spatial resolution and TGV-regularized SPA modeling were employed. Our recently development of a motion-compensated diffusion weighted (MCDW)-pCASL can be utilized to improve the spatial resolution in the future studies (e.g. 3.5 mm3 isotropic maps in 10 mins) (2).”

      R1.6 Suggestion: use same/similar colormaps for the same parameters (kw, ATT, CBF) to help the reader follow across Figures 3, 4, and 5.

      Thanks for your suggestion. We agree that using the same colormap would make it easier for readers to follow. However, Figures 4 and 5 were created to show the age- and sex-dependent changes, so we used warm and cold colors to indicate decreases and increases, respectively. We have clarified the choice of colormap in the figure captions (L260, L284): “The effects of decrease or increase were represented by warm colors (yellow to red) and cold (gray to blue) colors, respectively.”

      R1.7 Suggestion: please be consistent with the ordering of parameters in Figures 3, 4, and 5.

      Thanks for the suggestion. We have updated Figure 3 to consistently show kw, CBF, and ATT results in order from left to right:

      R1.8 Suggestion: use the same scaling (e.g.[|1.9|, |11 |] for Fig. 4, [|1.9|, |4|] for Figure 5) to enhance comparability across parameters in the subfigures.

      Thanks for the suggestion, we agree that the same scaling would enhance the comparability across parameters. We have updated the color scales for Figure 5 using maximal |T| = 4:

      However, the range of maximal |T| was relatively large for Figure 4 (i.e., 5 for kw, 11 for CBF, and 7 for ATT), and using the same color scale might oversaturate the regional responses or diminish the visibility of regional differences. Therefore, we prefer to keep the original color scale for Figure 4.
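      The effect of a shared scale on visibility can be estimated with simple arithmetic. The sketch below assumes a lower display threshold of |T| = 1.9 (taken from the reviewer's suggested range) and a shared upper limit equal to the largest maximal |T|; it computes the fraction of the color range each parameter would actually occupy. The function name is illustrative.

```python
def fraction_of_scale_used(param_max, scale_min, scale_max):
    """Fraction of a color scale [scale_min, scale_max] spanned by values up to param_max."""
    return (param_max - scale_min) / (scale_max - scale_min)


T_MIN = 1.9  # lower display threshold, from the reviewer's suggested range
MAX_ABS_T = {"kw": 5.0, "CBF": 11.0, "ATT": 7.0}  # maximal |T| per parameter (Figure 4)

shared_max = max(MAX_ABS_T.values())
fractions = {
    name: fraction_of_scale_used(t_max, T_MIN, shared_max)
    for name, t_max in MAX_ABS_T.items()
}
for name, frac in fractions.items():
    print(f"{name}: uses {frac:.0%} of a shared [{T_MIN}, {shared_max}] scale")
```

      Under these assumptions, the kw map would be confined to roughly the lowest third of a shared scale, which is the loss of visible regional contrast described above.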

      R1.9 In Figure 5, the interaction of age with sex in kw parameter seems to be more on one side of the brain. What could be the reasons for possible lateralization? 

      We agree with the reviewer that the concentration of the age and sex interaction effects on one side of the brain is an interesting finding. While we do not have a clear explanation at present, we suspect it may relate to aging-related asymmetrical vascular burdens. Giannakopoulos et al. reported that vascular scores, indicating higher vascular burden, were significantly higher in the left hemisphere across all Clinical Dementia Rating scores. Moreover, the predominance of Alzheimer’s disease and vascular pathology in the right hemisphere correlated with significantly higher Clinical Dementia Rating scores (3). We added the following to the updated manuscript to discuss this potential mechanism (L370): “… We also observed an asymmetric effect on left and right brain hemispheres, which might be associated with asymmetrically developed vascular burdens in aging (3).”

      R1.10 A comparison between the present study and DCE MRI as well as other ASL methods evaluating BBB function with age is missing. ASL techniques probing transverse relaxation and DCE MRI have reported increased kw with age in humans as well as in animal models. What could be the reasons? 

      We agree with the reviewer that BBB water exchange measured by other methods should be sufficiently discussed, especially regarding their age-related changes. We added the following discussion in the updated manuscript (L415): “Mahroo et al., utilized a multi-echo ASL technique to measure BBB permeability to water and reported shorter intra-voxel transit time and lower BBB exchange time (Tex) in the older participants (≥50 years) compared to the younger group (≤20 years) (4). In animal studies, reduced BBB Tex was also reported in the older mice compared to the younger group using multi-echo ASL (5) and a multi-flip-angle, multi-echo dynamic contrast-enhanced (MFAME-DCE) MRI method (6). These findings contrast with the results presented in this study, likely due to the different components assessed by different techniques, and increased BBB permeability to water has been suggested to indicate a leakage of tight junctions in aging (5, 6). In contrast, our recent study utilizing high resolution MCDW-pCASL scans with long averages reveals the potential existence of an intermediate stage of water exchange between vascular and tissue compartments (e.g., paravascular space or basal lamina) (2). The DP module of the DP-pCASL is hypothesized to null the fast-flowing and pseudo-random oriented spins, which may include both vascular flow and less restricted water in paravascular space. The observed lower kw in older participants may be more related to the delayed exchange across the astrocyte end-feet into the tissue due to loss of AQP-4 water channel with older age. However, these hypotheses require further investigation to understand the exact mechanisms, especially under different physiological states (7, 8). Future studies, particularly with animal models targeting specific BBB components under different physiological or diseased conditions, will be valuable for validating these measurements (9-13).”

      R1.11 Line 163/164, a rapid decrease of CBF in males in the region of the hippocampus is reported. It would be beneficial to discuss this in discussion further (has this been reported before, possible reasons, etc). 

      Thanks for the suggestion. We agree that the accelerated CBF decline in the hippocampus of males is an important finding, and we have added the following discussion to the revised manuscript (L300): "Furthermore, we found a more pronounced age-related decline in CBF in the hippocampus of males compared to females (Fig. 2, Supplemental Table S2). To the best of our knowledge, no study has previously reported this accelerated hippocampal CBF decline in males. This finding may be linked to the accelerated hippocampal volume loss in males, as reported in a study analyzing 19,793 generally healthy UK Biobank participants (14). Lower hippocampal perfusion has been associated with poor memory performance (15, 16), suggesting that males might be more vulnerable to potential cognitive decline (17)."

      R1.12 Lines 198-202 describe a simulation done to test the dependence of kw on ATT. This is important and could be explained more in detail. Adding simulation results (numeric or figure) to supplementary materials would increase reproducibility and understanding for others. 

      We apologize for not referencing the simulation results in the main text. We simulated the kw distribution for females by adjusting ATT by +60 ms to match males’ ATT, which led to marginally higher kw values. These results are shown in Supplemental Figure S2C (yellow):

      We have now referenced the simulation results in the updated manuscript (L206).

      R1.13 No limitations of the presented work are mentioned. A critical perspective would increase the scientific impact on future research decisions and implementation of this method by others. 

      Thanks for the suggestion, we agree the limitations need to be acknowledged. We have added a limitation paragraph in the revised manuscript (L406): "Limitations of the study and future directions: There are a few limitations of this study. A single PLD of 1800 ms was used in this study, which should be sufficient to allow all the labeled water to reach the tissue (i.e., the longest ATT was 1526.7±117.4 and 1468.1±166.9 ms in aged males and females, respectively) (1). However, a longer PLD should be used in participants with longer expected ATT, such as in stroke and cerebrovascular disorders. Additionally, a multi-PLD protocol can also be helpful to improve the robustness of quantification accuracy (2). To compensate for the half signal loss of the non-CPMG DP module, relatively low spatial resolution and TGV-regularized SPA modeling were employed. Our recently development of a motion-compensated diffusion weighted (MCDW)-pCASL can be utilized to improve the spatial resolution in the future studies (e.g. 3.5 mm3 isotropic maps in 10 mins) (2). Mahroo et al., utilized a multi-echo ASL technique to measure BBB permeability to water and reported shorter intra-voxel transit time and lower BBB exchange time (Tex) in the older participants (≥50 years) compared to the younger group (≤20 years) (4). In animal studies, reduced BBB Tex was also reported in the older mice compared to the younger group using multi-echo ASL (5) and a multi-flip-angle, multi-echo dynamic contrast-enhanced (MFAME-DCE) MRI method (6). These findings contrast with the results presented in this study, likely due to the different components assessed by different techniques, and increased BBB permeability to water has been suggested to indicate a leakage of tight junctions in aging (5, 6). 
In contrast, our recent study utilizing high resolution MCDW-pCASL scans with long averages reveals the potential existence of an intermediate stage of water exchange between vascular and tissue compartments (e.g., paravascular space or basal lamina) (2). The DP module of the DP-pCASL is hypothesized to null the fast-flowing and pseudo-random oriented spins, which may include both vascular flow and less restricted water in paravascular space. The observed lower kw in older participants may be more related to the delayed exchange across the astrocyte end-feet into the tissue due to loss of AQP-4 water channel with older age. However, these hypotheses require further investigation to understand the exact mechanisms, especially under different physiological stages (7, 8). Future studies, particularly with animal models targeting specific BBB components under different physiological or diseased conditions, will be valuable for validating these measurements (9-13). Including race as a covariate in our study aims to account for potential variations in brain perfusion observed in previous research (18, 19). However, it is important to recognize that these differences may not be solely attributable to race. They can be influenced by a complex interplay of factors such as education, environmental exposures, lifestyle, healthcare access, and other social determinants of health (20). For example, education has been shown to be highly relevant to regional CBF changes in AD (21, 22). Additionally, the potential influence of ancestry and mixed-race on perfusion and BBB function requires further investigation in future studies. Other factors such as hematocrit (23), menopausal status (24, 25), and vascular risk factors (26) should also be considered. These variables were not included in this study due to the unavailability or limited availability in some cohorts. 
We attempted to minimize the impact of these factors on our observations by including a relatively large and diverse sample. However, future studies examining the specific mechanism of each of these factors on BBB function in aging would be valuable."

      Reviewer #2 (Public Review):

      Summary: 

      This study used a novel diffusion-weighted pseudo-continuous arterial spin labelling (pCASL) technique to simultaneously explore age- and sex-related differences in brain tissue perfusion (i.e., cerebral blood flow (CBF) & arterial transit time (ATT) - a measure of CBF delivery to brain tissue) and blood-brain barrier (BBB) function, measured as the water exchange (kw) across the BBB. While age- and sex-related effects on CBF are well known, this study provides new insights to support the growing evidence of these important factors in cerebrovascular health, particularly in BBB function. Across the brain, the decline in CBF and BBB function (kw) and elevation in ATT were reported in older adults, after the age of 60, and more so in males compared to females. This was also evident in key cognitive regions including the insular, prefrontal, and medial temporal regions, stressing the consideration of age and sex in these brain physiological assessments. 

      Strengths: 

      Simultaneous assessment of CBF with BBB along with transit time and at the voxel-level helped elucidate the brain's vulnerability to age and sex-effects. It is apparent that the investigators carefully designed this study to assess regional associations of age and sex with attention to exploring potential non-linear effects. 

      Weaknesses: 

      R2.0 It appears that no brain region showed concurrent CBF and BBB dysfunction (kw), based on the results reported in the main manuscript and supplemental information. Was an association analysis between CBF and kw performed? There is a potential effect of the level of formal education on CBF (PMID: 12633147; 15534055), which could have been considered and accounted for as well, especially for a cohort with stated diversity (age, race, sex). 

      Thank you for your positive feedback and comments on the potential associations between BBB kw and other physiological parameters (e.g., CBF) and socioeconomic factors (e.g., education). We have made the following changes to the updated manuscript:

      (1) We conducted additional linear regressions between regional kw and regional CBF or ATT, incorporating sex as a covariate, for participants aged 8-61 years and 62-92 years (when BBB kw starts declining). The results are summarized in Supplemental Table S6. We found that BBB kw was significantly negatively associated with CBF in the putamen, amygdala, hippocampus, parahippocampal gyrus, and medial temporal lobe in participants younger than 62 years, when kw was relatively consistent across ages. However, no significant correlations were found in any brain regions in the 62-92 years group. In contrast to CBF, kw was significantly negatively associated with ATT in the GM, temporal lobe, and precuneus in participants aged 8-61 years, and these correlations became significant in additional ROIs, including WM, frontal lobe, ACC, caudate, putamen, amygdala, hippocampus, PHG, and MTL in participants aged 62-92 years. These results suggest that BBB function may be influenced by different aspects of neurovascular function represented by CBF and ATT at different stages of aging.
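      For readers who wish to reproduce this type of analysis, the sketch below fits a linear model of regional kw on regional CBF with sex as a covariate. The data are synthetic and noise-free, generated only so that the recovered coefficients can be checked; the helper `ols` is a hypothetical name, and the sketch returns coefficients only, without the significance testing reported in Supplemental Table S6.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations, solved by Gauss-Jordan elimination."""
    n = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(n)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Synthetic example: kw = 120 - 0.5 * CBF + 4 * sex (values are not from the study)
cbf = [40.0, 45.0, 50.0, 55.0, 60.0, 65.0]
sex = [0, 1, 0, 1, 0, 1]  # arbitrary 0/1 coding of the covariate
kw = [120.0 - 0.5 * c + 4.0 * s for c, s in zip(cbf, sex)]

# Design matrix columns: intercept, CBF, sex
X = [[1.0, c, float(s)] for c, s in zip(cbf, sex)]
intercept, slope_cbf, effect_sex = ols(X, kw)
print(round(intercept, 3), round(slope_cbf, 3), round(effect_sex, 3))
```

      A negative `slope_cbf` corresponds to the negative kw-CBF association described above; in practice the model would be fitted separately for each ROI and age group.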

      (2) One limitation of this study is the lack of information on participants’ geographical, cultural, physical characteristics, and socioeconomic factors. While we included race as a covariate to account for potential variations observed in previous research, race is an imprecise proxy for the complex interplay of genetic, environmental, socioeconomic, and cultural factors that influence physiological outcomes. We have acknowledged this limitation by adding the following discussion in the updated manuscript: “Including race as a covariate in our study aims to account for potential variations in brain perfusion observed in previous research. However, it is important to recognize that these differences may not be solely attributable to race. They can be influenced by a complex interplay of factors such as education, environmental exposures, lifestyle, healthcare access, and other social determinants of health. For example, education has been shown to be highly relevant to regional CBF changes in AD. Additionally, the potential influence of ancestry and mixed-race on perfusion and BBB function requires further investigation in future studies.”

      Reviewer #2 (Recommendations For The Authors): 

      General comments: 

      I commend the authors on a very well-written and laid-out study. General remarks have been provided in the short assessment and public review sections. 

      We would like to thank the reviewer for the insightful suggestions and overall positive feedback. We have substantially revised and improved our manuscript; point-by-point responses can be found in the following sections and in the annotated manuscript.

      Specific comments: 

      Results: 

      R2.1 Line 127: "since race may influence the changes in perfusion and kw with aging, it was included as a covariate". It is not clear how race - a simplistic term for ethnicity or to be more specific ancestry has been shown to influence changes in perfusion? Is it known for a fact that for example, older Black people have lower/higher CBF or kw compared to Asians or Asians to Caucasian Americans? Can this be extrapolated to Japanese Brazilians having different patterns of regional CBF to Caucasian or Black Brazilians or similar patterns of CBF to Japanese people in Japan since they share similar race? Do Dutch people in the Netherlands share CBF characteristics to their descendants in the US or in South Africa? Would the geographical, cultural, and other physical characteristics of one's ethnicity or lineage impact CBF? Race is often used as a poor substitute for the complex interactions of physical, socioeconomic, and geopolitical factors that produce disparities that may have measurable biological effects including CBF. But it is not clear why being one race vs the other will impact CBF, without carefully parcelling out the many factors beyond biology, if any. Is any of the participants in the study mixed race? How about recently settled individuals who may identify for example as Black but have spent all their life up to adult years outside of the US and marked here in the study as simply African American? Not that I am saying this is the case. However this simplification may require more careful analysis. 

      In our study, no participant reported being mixed-race, and unfortunately we do not have additional information about their specific ancestry or their geographical, cultural, and other physical characteristics. We acknowledge that race is an imprecise proxy for the complex interplay of genetic, environmental, socioeconomic, and cultural factors that influence physiological outcomes, including perfusion and BBB function. The use of race as a covariate in our study is intended to account for potential variations observed in previous research, rather than to imply a direct causal relationship.

      Research has shown differences in blood flow among racial groups (18, 19). However, these differences are not solely attributable to race, and they are also shaped by environmental exposures, lifestyle factors, healthcare access, and other social determinants of health (20). We have added the following discussion in the updated manuscript (L436): “Including race as a covariate in our study aims to account for potential variations in brain perfusion observed in previous research (18, 19). However, it is important to recognize that these differences may not be solely attributable to race. They can be influenced by a complex interplay of factors such as education, environmental exposures, lifestyle, healthcare access, and other social determinants of health (20). For example, education has been shown to be highly relevant to regional CBF changes in AD (21, 22). Additionally, the potential influence of ancestry and mixed-race on perfusion and BBB function requires further investigation in future studies.”

      R2.2 Figure 3: Could the standard deviation of the reported values be also stated so the variance can be appreciated? 

      Thank you for the suggestion. We have added the standard deviations of the kw, CBF, and ATT values to the updated Figure 3.

      R2.3 Discussions: Line 280: "...observed distinct trajectory of kw changes with aging as compared with CBF and ATT." I presume this is as compared to the earlier statements (line 268) of pervasive increase in ATT and decrease in CBF across the brain. Were there any brain regions that showed increased ATT, decreased CBF and kw as a function of age or even sex? Was there any association between CBF and kw in any brain regions, across the participants after controlling for sex differences? If there is a suspicion of early BBB dysfunction (line 286) preceding cognitive decline that has been also suspected with CBF, is this concomitant with CBF in most people? This could maybe make CBF an easier and more straightforward biomarker since its effects mirror that of BBB? I suspect it generally does not, even in healthy aging. It would have been great to shed more light on this with your results and in your discussion.

      Thank you for your comments. By 'distinct trajectory of kw changes with aging,' we refer to the ‘turning point’ in age at which kw starts declining. BBB kw remained relatively stable and began to decline in the early 60s, while CBF consistently decreased and ATT consistently increased with age, although the rates of change differed at 22 years and 36 years, respectively. Using linear regressions for voxel analysis, Figure 4 shows that age-dependent decreases in CBF and increases in ATT were observed in most of the brain. However, significant age-related decreases in kw were more localized to specific brain regions and were mostly accompanied by simultaneous decreases in CBF and increases in ATT. We highlighted this finding in the updated manuscript (L250): “In the brain regions showing significant age-related kw decreases (Fig. 4A), these decreases are mostly accompanied by CBF decreases (Fig. 4B) and ATT increases (Fig. 4C).”

      Thank you for your suggestion regarding the relationship between kw and CBF. We further conducted linear regressions between regional kw and regional CBF or ATT, incorporating sex as a covariate, for participants aged 8-61 years and 62-92 years (when BBB kw starts declining). The results are summarized in Supplemental Table S6.

      This new supplemental table shows several interesting results. BBB kw was significantly negatively associated with CBF in the putamen, amygdala, hippocampus, parahippocampal gyrus, and medial temporal lobe in participants younger than 62 years, when kw was relatively consistent across ages. However, no significant correlations were found in any brain regions in the 62-92 years group. In contrast to CBF, kw was significantly negatively associated with ATT in the GM, temporal lobe, and precuneus in participants aged 8-61 years, and these correlations became significant in additional ROIs, including WM, frontal lobe, ACC, caudate, putamen, amygdala, hippocampus, PHG, and MTL in participants aged 62-92 years.

      We have added the following discussion to the updated manuscript (L307): “We observed a distinct trajectory of kw changes with aging compared to CBF and ATT. To study the potential regional associations between kw and CBF and ATT, we conducted linear regressions between regional kw and regional CBF or ATT, incorporating sex as a covariate, for participants aged 8-61 years and 62-92 years (when BBB kw starts declining), respectively. The results are shown in Supplemental Table S6. BBB kw was significantly negatively associated with CBF in the putamen, amygdala, hippocampus, PHG, and MTL in participants aged 8-61 years (when kw was relatively consistent across ages), but no significant correlations were found in any brain regions in the 62-92 years group. In contrast to CBF, kw was significantly negatively associated with ATT in the GM, temporal lobe, and precuneus in participants aged 8-61 years, and these correlations became significant in additional brain regions, including WM, frontal lobe, ACC, caudate, putamen, amygdala, hippocampus, PHG, and MTL in participants aged 62-92 years. These results suggest that BBB function may be affected by different aspects of neurovascular function represented by CBF and ATT at different stages of aging.”
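
      The regional regressions described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes ordinary least squares with sex coded 0/1, fits one ROI at a time, and returns only the sex-adjusted slope of kw on CBF, without any significance testing. The record fields (`age`, `sex`, `kw`, `cbf`), the function names, and the data are hypothetical; the cutoff of 62 years follows the age grouping stated in the response.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
        + [sum(X[k][i] * y[k] for k in range(n))]
        for i in range(p)
    ]
    for c in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def adjusted_slope(records, predictor="cbf", outcome="kw"):
    """Slope of outcome on predictor with sex as a covariate (one ROI)."""
    X = [[1.0, r[predictor], float(r["sex"])] for r in records]
    y = [r[outcome] for r in records]
    _intercept, slope, _sex_effect = ols(X, y)
    return slope


def split_by_age(records, cutoff=62):
    """Split into the 8-61 and 62-92 year groups used in the response."""
    younger = [r for r in records if r["age"] < cutoff]
    older = [r for r in records if r["age"] >= cutoff]
    return younger, older


# Synthetic, noise-free example data (not from the study).
records = [
    {"age": a, "sex": s, "cbf": c, "kw": 120.0 - 0.4 * c + 5.0 * s}
    for a, s, c in [(20, 0, 40.0), (30, 1, 45.0), (45, 0, 50.0),
                    (55, 1, 55.0), (60, 1, 60.0), (50, 0, 65.0),
                    (70, 0, 35.0), (75, 1, 42.0), (80, 0, 48.0), (85, 1, 38.0)]
]
younger, older = split_by_age(records)
slope_younger = adjusted_slope(younger)
slope_older = adjusted_slope(older)
```

      Passing `predictor="att"` (with an `att` field added to each record) would give the corresponding kw-ATT slope. In practice a statistics package would also be needed to obtain p-values and to correct for testing multiple ROIs.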

      Other notes: 

      R2.4 While reading the results section, two things jumped out at me when I saw the sex differences: 1) hematocrit and 2) menopausal status. I saw in the discussion that these were touched on. I may have missed this in the methods, but was hematocrit collected and included in the parameter estimates? Was the menopausal status, including ERT (estrogen replacement therapies), recorded and factored in? If not, these could be included as limitations that may confound the results, especially when the age groups were split to include a group potentially comprising both pre- and post-menopausal females (36-61).

      We do not have information about hematocrit or menopausal status, and they were not included in the data analysis. We agree this is a limitation of the current study and have discussed it in the updated manuscript (L442): “Other factors such as hematocrit (23), menopausal status (24, 25), and vascular risk factors (26) should also be considered. These variables were not included in this study due to data unavailability or limited availability in some cohorts. We attempted to minimize the impact of these factors on our observations by including a relatively large and diverse sample. However, future studies examining the specific mechanism of each of these factors on BBB function in aging would be valuable.”

      R2.5 The general vascular health of the cohort is not well described, especially if some of the participants were from a sickle cell study. While they are cognitively normal and free from major medical illnesses or neurological disorders, did the sample also include individuals with considerable vascular risk factors and metabolic syndrome (known to affect CBF), especially in the older cohort?

      We agree with the reviewer that vascular health can significantly impact perfusion and BBB function. Since the data presented in this study were collected from multiple cohorts, vascular risk factors were not available in all cohorts and thus were not included as covariates in the data analysis. To account for potential vascular variations across participants, we included CBF and ATT as covariates in our analysis of age-related BBB kw changes. We have added a discussion in the updated manuscript (L442, same as our response to the previous comment): “Other factors such as hematocrit (23), menopausal status (24, 25), and vascular risk factors (26) should also be considered. These variables were not included in this study due to data unavailability or limited availability in some cohorts. We attempted to minimize the impact of these factors on our observations by including a relatively large and diverse sample. However, future studies examining the specific mechanism of each of these factors on BBB function in aging would be valuable.”

      References:

      (1) K. S. St Lawrence, D. Owen, D. J. Wang, A two-stage approach for measuring vascular water exchange and arterial transit time by diffusion-weighted perfusion MRI. Magn Reson Med 67, 1275-1284 (2012).

      (2) X. Shao, C. Zhao, Q. Shou, K. S. St Lawrence, D. J. Wang, Quantification of blood–brain barrier water exchange and permeability with multidelay diffusion‐weighted pseudo‐continuous arterial spin labeling. Magnetic Resonance in Medicine (2023).

      (3) P. Giannakopoulos, E. Kövari, F. R. Herrmann, P. R. Hof, C. Bouras, Interhemispheric distribution of Alzheimer disease and vascular pathology in brain aging. Stroke (2009).

      (4) A. Mahroo, S. Konstandin, M. Günther, Blood–Brain Barrier Permeability to Water Measured Using Multiple Echo Time Arterial Spin Labeling MRI in the Aging Human Brain. Journal of Magnetic Resonance Imaging 59, 1269-1282 (2024).

      (5) Y. Ohene et al., Increased blood–brain barrier permeability to water in the aging brain detected using noninvasive multi‐TE ASL MRI. Magnetic resonance in medicine 85, 326-333 (2021).

      (6) B. R. Dickie, H. Boutin, G. J. Parker, L. M. Parkes, Alzheimer's disease pathology is associated with earlier alterations to blood–brain barrier water permeability compared with healthy ageing in TgF344‐AD rats. NMR in Biomedicine 34, e4510 (2021).

      (7) Y. Ying et al., Heterogeneous blood‐brain barrier dysfunction in cerebral small vessel diseases. Alzheimer's & Dementia (2024).

      (8) V. Zachariou et al., Regional differences in the link between water exchange rate across the blood–brain barrier and cognitive performance in normal aging. GeroScience, 1-18 (2023).

      (9) Y. Zhang et al., Increased cerebral vascularization and decreased water exchange across the blood-brain barrier in aquaporin-4 knockout mice. PLoS One 14, e0218415 (2019).

      (10) Y. Ohene et al., Non-invasive MRI of brain clearance pathways using multiple echo time arterial spin labelling: an aquaporin-4 study. NeuroImage 188, 515-523 (2019).

      (11) Y. V. Tiwari, J. Lu, Q. Shen, B. Cerqueira, T. Q. Duong, Magnetic resonance imaging of blood–brain barrier permeability in ischemic stroke using diffusion-weighted arterial spin labeling in rats. Journal of Cerebral Blood Flow & Metabolism 37, 2706-2715 (2017).

      (12) Z. Wei et al., Non-contrast assessment of blood-brain barrier permeability to water in mice: an arterial spin labeling study at cerebral veins. NeuroImage, 119870 (2023).

      (13) Y. Jia et al., Transmembrane water-efflux rate measured by magnetic resonance imaging as a biomarker of the expression of aquaporin-4 in gliomas. Nature Biomedical Engineering 7, 236-252 (2023).

      (14) L. Nobis et al., Hippocampal volume across age: Nomograms derived from over 19,700 people in UK Biobank. NeuroImage: Clinical 23, 101904 (2019).

      (15) S. Rane et al., Inverse correspondence between hippocampal perfusion and verbal memory performance in older adults. Hippocampus 23, 213-220 (2013).

      (16) S. Heo et al., Resting hippocampal blood flow, spatial memory and aging. Brain research 1315, 119-127 (2010).

      (17) O. Gannon, L. Robison, A. Custozzo, K. Zuloaga, Sex differences in risk factors for vascular contributions to cognitive impairment & dementia. Neurochemistry international 127, 38-55 (2019).

      (18) A. E. Leeuwis et al., Cerebral blood flow and cognitive functioning in a community-based, multi-ethnic cohort: the SABRE study. Frontiers in aging neuroscience 10, 279 (2018).

      (19) L. R. Clark et al., Association of cardiovascular and Alzheimer’s disease risk factors with intracranial arterial blood flow in Whites and African Americans. Journal of Alzheimer's Disease 72, 919-929 (2019).

      (20) D. R. Williams, S. A. Mohammed, Discrimination and racial disparities in health: evidence and needed research. Journal of behavioral medicine 32, 20-47 (2009).

      (21) N. Scarmeas et al., Association of life activities with cerebral blood flow in Alzheimer disease: implications for the cognitive reserve hypothesis. Archives of neurology 60, 359-365 (2003).

      (22) N.-T. Chiu, B.-F. Lee, S. Hsiao, M.-C. Pai, Educational level influences regional cerebral blood flow in patients with Alzheimer’s disease. Journal of Nuclear Medicine 45, 1860-1863 (2004).

      (23) R. C. Gur et al., Gender differences in age effect on brain atrophy measured by magnetic resonance imaging. Proceedings of the National Academy of Sciences 88, 2845-2849 (1991).

      (24) M. J. Cipolla, J. A. Godfrey, M. J. Wiegman, The effect of ovariectomy and estrogen on penetrating brain arterioles and blood-brain barrier permeability. Microcirculation 16, 685-693 (2009).

      (25) A. C. Wilson et al., Reproductive hormones regulate the selective permeability of the blood-brain barrier. Biochim Biophys Acta 1782, 401-407 (2008).

      (26) M. S. Stringer et al., Tracer kinetic assessment of blood–brain barrier leakage and blood volume in cerebral small vessel disease: Associations with disease burden and vascular risk factors. NeuroImage: Clinical 32, 102883 (2021).

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1

      Strengths:

      This study uses a carefully constructed experiment design and decision-making task that allows separation of multiple electroencephalographic (EEG) signals thought to track different stages of decision-making. For example, the steady-state visual evoked potential measures can be cleanly dissociated from more anterior beta-band activity over the motor cortex. They also allow evaluation of how cued expectancy effects may unfold over a number of testing sessions. This is important because the most consistent evidence of expectation-related modulations of electrophysiological measures (using EEG, local field potentials, or single neuron firing rates) is from studies of nonhuman primates that involved many days of cue-stimulus contingency learning, and there is a lack of similar work using several testing sessions in humans. Although there were several experimental conditions included in the study, careful trial-balancing was conducted to minimise biases due to incidental differences in the number of trials included for analyses across each condition. Performance for each individual was also carefully calibrated to maximise the possibility of identifying subtle changes in task performance by expectation and avoid floor or ceiling effects.

      We would like to thank Reviewer 1 for these very positive comments.

      Weaknesses:

      Although the experiment and analysis methods are cohesive and well-designed, there are some shortcomings that limit the inferences that can be drawn from the presented findings.

      Comment #1

      The first relates to the measures of SSVEPs and their relevance for decision-making in the task. In order to eliminate the influence of sporadic pulses of contrast changes that occurred during stimulus presentation, a time window of 680-975 ms post-stimulus onset was used to measure the SSVEPs. The mean response times for the valid and neutral cues were around 850-900 ms for correct responses, and within the same time window for errors in the invalid cue condition. In addition, a large portion of response times in perceptual decision-making tasks are substantially faster than the mean due to the right-skewed response time distributions that are typically observed. It has also been estimated that executing a motor action (e.g., a keypress response) requires 70-100 ms following the commitment to a decision. This raises some concerns about the proportion of trials in which the contrast-dependent visual responses (indexed by the SSVEPs) reflected visual input that was actually used to make the decision in a given trial. Additional analyses of SSVEPs that take the trial-varying pulses into account could be run to determine whether expectations influenced visual responses earlier in the trial.

      The reviewer raises a very valid point and, indeed, it is an issue that we grappled with in our analyses. Actually, in this study, the RT distributions were not right-skewed, but appeared to be relatively normal (RT distributions shown below). This is something that we have previously observed when using tasks that involve an initial zero-evidence lead-in at the start of each trial, which means that participants cannot start accumulating at stimulus onset and must rely on their knowledge of the lead-in duration to determine when the physical evidence has become available (e.g. Kelly et al 2021, Nat Hum Beh). We agree that it is important to establish whether the reported SSVEP modulations occur before or after choice commitment. In our original submission we had sought to address this question through our analysis of the response-locked ‘difference SSVEP’. Figure 4D clearly indicates that the cue modulations are evident before as well as after response.

      However, we have decided to include an additional Bayesian analysis of the response-locked signal to offer more evidence that the cue effect is not a post-response phenomenon.

      Manuscript Changes

      To quantify the evidence that the cue effect was not driven by changes in the signal after the response, we ran Bayesian one-way ANOVAs on the SSVEP comparing the difference across cue conditions before and after the response. If the cue effect only emerged after the response, we would expect the difference between invalid and neutral or invalid and valid cues to increase in the post-response window. There was no compelling evidence of an increase in the effect when comparing invalid to neutral (BF10 = 1.58) or valid cues (BF10 = 0.32).

      Comment #2

      Presenting response time quantile plots may also help to determine the proportions of motor responses (used to report a decision) that occurred during or after the SSVEP measurement window.

      We agree that it may be helpful for the reader to be able to determine the proportion of responses occurring at different phases of the trial, so we have included the requested response time quantile plot (shown below) as a supplementary figure.

      Author response image 1.

      Reaction time quantiles across cue conditions. The plot illustrates the proportion of trials where responses occurred at different stages of the trial. The SSVEP analysis window is highlighted in purple.
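
      The quantity this plot conveys, namely how many responses fall before, during, or after the SSVEP measurement window, can be computed along the following lines. This is only a sketch under stated assumptions: the 680-975 ms window is taken from the reviewer's comment, response times are assumed to be in milliseconds relative to stimulus onset, and the function name and example values are hypothetical rather than taken from the study.

```python
from statistics import quantiles

# Measurement window quoted in the review (ms after stimulus onset).
SSVEP_WINDOW_MS = (680, 975)


def rt_window_summary(rts_ms, window=SSVEP_WINDOW_MS):
    """Proportion of responses before, during and after the window, plus deciles."""
    start, end = window
    n = len(rts_ms)
    return {
        "before": sum(rt < start for rt in rts_ms) / n,
        "during": sum(start <= rt <= end for rt in rts_ms) / n,
        "after": sum(rt > end for rt in rts_ms) / n,
        "deciles": quantiles(rts_ms, n=10),  # nine cut points
    }


# Made-up response times for one cue condition (ms).
example_rts = [600, 700, 800, 900, 1000]
summary = rt_window_summary(example_rts)
```

      Running the same function separately on the valid, neutral, and invalid trials would give one summary per cue condition. Subtracting an assumed motor execution delay from each response time before calling it would approximate the time of decision commitment rather than the keypress.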

      Comment #3

      In addition, an argument is made for changes in the evidence accumulation rate (called the drift rate) by stimulus expectancy, corresponding to the observed changes in SSVEP measures and differences in the sensory encoding of the stimulus. This inference is limited by the fact that evidence accumulation models (such as the Diffusion Decision Model) were not used to test for drift rate changes as could be determined from the behavioural data (by modelling response time distributions). There appear to be ample numbers of trials per participant to test for drift rate changes in addition to the starting point bias captured in earlier models. Due to the very high number of trials, models could potentially be evaluated for each single participant. This would provide more direct evidence for drift rate changes than the findings based on the SSVEPs, particularly due to the issues with the measurement window relating to the response times as mentioned above.

      The focus of the present study was on testing for sensory-level modulations by predictive cues, rather than testing any particular models. Given that the SSVEP bears all the characteristics of a sensory evidence encoding signal, we believe it is reasonable to point out that its modulation by the cues would very likely translate to a drift rate effect. But we do agree with the reviewer that any connection between our results and previously reported drift rate effects can only be confirmed with modelling and we have tried to make this clear in the revised text. We plan to comprehensively model the data from this study in a future project. While we do indeed have the benefit of plenty of trials, the modelling process will not be straightforward as it will require taking account of the pulse effects which could have potentially complicated, non-linear effects. In the meantime, we have made changes to the text to qualify the suggestion and stress that modelling would be necessary to determine if our hypothesis about a drift rate effect is correct.

      Manuscript Changes

      (Discussion): [...] We suggest that participants may have been able to stabilise their performance across task exposure, despite reductions in the available sensory evidence, by incorporating the small sensory modulation we detected in the SSVEP. This would suggest that the decision process may not operate precisely as the models used in theoretical work describe. Instead, our study tentatively supports a small number of modelling investigations that have challenged the solitary role of starting point bias, implicating a drift bias (i.e. a modulation of the evidence before or upon entry to the decision variable) as an additional source of prior probability effects in perceptual decisions (Dunovan et al., 2014; Hanks et al., 2011; Kelly et al., 2021; van Ravenzwaaij et al., 2012; Wyart et al., 2012) and indicates that these drift biases could, at least partly, originate at the sensory level. However, this link could only be firmly established with modelling in a future study.
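
      To make the distinction between a starting point bias and a drift bias concrete, the following toy simulation contrasts the two in a basic two-bound diffusion process. It is an illustrative sketch only, not the model the authors plan to fit: all parameter values are arbitrary, the function names are hypothetical, and it ignores the zero-evidence lead-in and the contrast pulses that the response identifies as complications for modelling this task.

```python
import random


def simulate_trial(rng, drift=0.0, start=0.0, bound=1.0, noise=1.0,
                   dt=0.01, max_steps=5000):
    """One diffusion trial; returns (choice, decision_time_s).

    choice is +1 (upper bound, the cued alternative here), -1 (lower bound),
    or 0 if no bound is reached within max_steps.
    """
    x = start
    for step in range(1, max_steps + 1):
        # Euler-Maruyama step: deterministic drift plus Gaussian noise.
        x += drift * dt + noise * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        if x >= bound:
            return 1, step * dt
        if x <= -bound:
            return -1, step * dt
    return 0, max_steps * dt


def p_upper(n_trials=1000, seed=0, **params):
    """Fraction of trials that terminate at the upper bound."""
    rng = random.Random(seed)
    choices = [simulate_trial(rng, **params)[0] for _ in range(n_trials)]
    return sum(c == 1 for c in choices) / n_trials


p_neutral = p_upper()                 # no bias
p_start_bias = p_upper(start=0.3)     # head start toward the cued bound
p_drift_bias = p_upper(drift=1.0)     # evidence itself favours the cued bound
```

      Both manipulations raise the proportion of choices for the cued alternative, which is why choice proportions alone cannot separate them. The two accounts differ in their predicted response time distributions, and that is the information a model fit would exploit.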

      Recommendations For The Authors:

      Comment #4

      The text for the axis labels and legends in the figures is quite small relative to the sizes of the accompanying plots. I would recommend substantially increasing the size of the text to aid readability.

      Thank you for this suggestion. We have increased the size of the axis labels and made the text in the figure legends just 1pt smaller than the text in the main body of the manuscript.

      Comment #5

      It is unclear if the scalp maps for Figure 5 (showing the mu/beta distributions) are on the same scale or different scales. I assume they are on different scales (adjusted to the minimum/maximum within each colour map range), as a lack of consistent signals (in the neutral condition) would be expected to lead to a patchy pattern on the scalp as displayed in that figure (due to the colour range shrinking to the degree of noise across electrodes). I would recommend including some sort of colour scale to show that, for example, in the neutral condition there are no large-amplitude mu/beta fluctuations distributed somewhat randomly across the scalp.

      Thank you to the reviewer for pointing this out. They were correct: the original topographies were each plotted on their own scale. The topographies in Figure 5 have now been updated to put them on a common scale and we have included a colour bar (as shown below). The caption for Figure 5 has also been updated to confirm that the topographies are on a common scale.

      Author response image 2.

      Manuscript Changes

      (Figure 5 Caption): [...] The topography of MB activity in the window -200:0 ms before evidence onset is plotted on a common scale for neutral and cued conditions separately.

      Comment #6

      In Figure 2, the legend is split across the two panels, despite the valid/invalid/neutral legend also applying to the first panel. This gives an initial impression that the legend is incomplete for the first panel, which may confuse readers. I would suggest putting all of the legend entries in the first panel, so that all of this information is available to readers at once.

      We are grateful to the reviewer for spotting this. Figure 2 has been updated so that the full legend is presented in the first panel, as shown below.

      Author response image 3.

      Comment #7

      Although linear mixed-effects models (using Gaussian families) for response times are standard in the literature, they incorrectly specify the distributions of response times to be Gaussian instead of substantially right-skewed. Generalised linear mixed-effects models using gamma families and identity functions have been shown to more accurately model distributions of response times (see Lo and Andrews, 2015. Frontiers in Psychology). The authors may consider using these models in line with good practice, although it might not make a substantial difference relating to the patterns of response time differences.

      We appreciate this thoughtful comment from Reviewer 1. Although RT distributions are often right skewed, we have previously observed that RT distributions can be closer to normal when the trial incorporates a lead-in phase with no evidence (e.g. Kelly et al 2021, Nat Hum Beh). Indeed, the distributions we observed in this study were markedly Gaussian (as shown in the plot below). Given the shape of these distributions and the reviewer’s suggestion that adopting alternative models may not lead to substantial differences to our results, we have decided to leave the mixed effects models as they are in the manuscript, but we will take note of this advice in future work.

      Author response image 4.
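
      One simple numerical complement to inspecting such a plot is the sample skewness of the response times, which is zero for a symmetric distribution and positive for a right-skewed one. The sketch below is an assumption-laden illustration, not part of the authors' analysis: the function name and the example values are made up, and a skewness near zero is consistent with symmetry but does not on its own establish that the data are Gaussian.

```python
from statistics import fmean


def sample_skewness(values):
    """Fisher-Pearson coefficient g1 = m3 / m2**1.5 (0 for symmetric data)."""
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])
    m3 = fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5


symmetric_rts = [700, 800, 900, 1000, 1100]    # made-up, symmetric
skewed_rts = [500, 520, 540, 600, 1400]        # made-up, long right tail
```

      A clearly positive value would favour the gamma-family models the reviewer suggests, whereas a value near zero is in line with keeping the Gaussian mixed-effects models.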

      Reviewer #2

      Strengths:

      The work is executed expertly and focuses cleverly on two features of the EEG signals that can be closely connected to specific loci of the perceptual decision-making process - the SSVEP which connects closely to sensory (visual) encoding, and Mu-Beta lateralisation which connects closely to movement preparation. This is a very appropriate design choice given the authors' research question.

      Another advantage of the design is the use of an unusually long training regime (i.e., for humans) - which makes it possible to probe the emergence of different expectation biases in the brain over different timecourses, and in a way that may be more comparable to work with nonhuman animals (who are routinely trained for much longer than humans).

      We are very grateful for these positive comments from Reviewer 2.

      Weaknesses:

      In my view, the principal shortcoming of this study is that the experimental task confounds expectations about stimulus identity with expectations about to-be-performed responses. That is, cues in the task don't just tell participants what they will (probably) see, but what they (probably) should do.

      In many respects, this feature of the paradigm might seem inevitable, as if specific stimuli are not connected to specific responses, it is not possible to observe motor preparation of this kind (e.g., de Lange, Rahnev, Donner & Lau, 2013 - JoN).

      However, the theoretical models that the authors focus on (e.g., drift-diffusion models) are models of decision (i.e., commitment to a proposition about the world) as much as they are models of choice (i.e., commitment to action). Expectation researchers interested in these models are often interested in asking whether predictions influence perceptual processing, perceptual decision, and/or response selection stages (e.g., Feuerriegel, Blom & Hogendoorn, 2021 - Cortex), and other researchers have shown that parameters like drift bias and start point bias can be shifted in paradigms where observers cannot possibly prepare a response (e.g., Thomas, Yon, de Lange & Press, 2020 - Psych Sci).

      The present paradigm used by Walsh et al makes it possible to disentangle sensory processing from later decisional processes, but it blurs together the processes of deciding about the stimulus and choosing/initiating the response. This ultimately limits the insights we can draw from this study - as it remains unclear whether rapid changes in motor preparation we see reflect rapid acquisition of new decision criterion or simple cue-action learning. I think this would be important for comprehensively testing the models the authors target - and a good avenue for future work.

      Thank you to Reviewer 2 for these observations. We adopted this paradigm because it is typical of the perceptual decision making literature and our central focus in this study was to test for a sensory-level modulation as a source of a decision bias. We are pleased that the Reviewer agrees that the paradigm successfully disentangles sensory encoding from later decisional processes since this was our priority. However, we agree with Reviewer 2 that because the response mapping was known to the participants, the cues predicted both the outcome of the perceptual decision (“Is this a left- or right-tilted grating?”) and the motor response that the participant should anticipate making (“It’s probably going to be a left click on this trial”). They are correct that this makes it difficult to know whether the changes in motor preparation elicited by the predictive cues reflect action-specific preparation or a more general shift in the boundaries associated with the alternate perceptual interpretations. We fully agree that it remains an interesting and important question and in our future work we hope to conduct investigations that better dissect the distinct components of the decision process during prior-informed decisions. In the interim, we have made some changes to the manuscript to reflect the Reviewer’s concerns and better address this limitation of the study design (these are detailed in the response to the comment below).

      Recommendations For The Authors:

      Comment #8

      As in my public review, my main recommendation to the authors is to think a bit more in the presentation of the Introduction and Discussion about the difference between 'perceiving', 'deciding', and 'responding'.

      The paper is presently framed in terms of the debates around whether expectations bias decision or bias perception - and these debates are in turn mapped onto different aspects of the drift-diffusion model. Biases in sensory gain, for instance, are connected to biases in the drift rate parameter, while decisional shifts are connected to parameters like start points.

      In line with this kind of typology, the authors map their particular EEG signals (SSVEP and MB lateralisation) onto perception and decision. I see the logic, but I think the reality of these models is more nuanced.

      In particular, strictly speaking, the process of evidence accumulation to bound is the formation of a 'decision' (i.e., a commitment to having seen a particular stimulus). Indeed, the dynamics of this process have been beautifully described by other authors on this paper in the past. Since observers in this task simultaneously form decisions and prepare actions (because stimuli and responses are confounded) it is unclear whether changes in motor preparation are reflecting changes in what perceivers 'decide' (i.e., changes in what crosses the decision threshold) or what they 'do' (i.e., changes in the motor response threshold). This is particularly important for the debate around whether expectations change 'perception' or 'decision' because - in some accounts - it is the accumulation of evidence to the bound that is hypothesised to cause the perceptual experience observers actually have (Pereira, Perrin & Faivre, 2022 - TiCS). The relevant 'bound' here though is not the bound to push the button, but the bound for the brain to decide what one is actually 'seeing'.

      I completely understand the logic behind the authors' choices, but I would have liked more discussion of this issue. In particular, it seems strange to me to talk about the confounding of stimuli and responses as a particular 'strength' of this design in the manuscript - when really it is a 'necessary evil' for getting the motor preparation components to work. Here is one example from the Introduction:

      "While some have reported expectation effects in humans using EEG/MEG, these studies either measured sensory signals whose relevance to the decision process is uncertain (e.g. Blom et al., 2020; Solomon et al., 2021; Tang et al., 2018) and/or used cues that were implicit or predicted a forthcoming stimulus but not the correct choice alternative (e.g. Aitken et al., 2020; Feuerriegel et al., 2021b; Kok et al., 2017). To assess whether prior probabilities modulate sensory-level signals directly related to participants' perceptual decisions, we implemented a contrast discrimination task in which the cues explicitly predicted the correct choice and where sensory signals that selectively trace the evidence feeding the decision process could be measured during the process of deliberation."

      I would contend that this design allows you to pinpoint signals related to participants' 'choices' or 'actions' but not necessarily their 'decisions' in the sense outlined above.

      As I say though, I don't think this is fatal and I think the paper is extremely interesting in any case. But I think it would be strengthened if some of these nuances were discussed a bit more explicitly, as a 'perceptual decision' is more than pushing a button. Indeed, the authors might want to consider discussing work that shows the neural overlap between deciding and acting breaks down when Ps cannot anticipate which actions to use to report their choices ahead of time (Filimon, Philiastides, Nelson, Kloosterman & Heekeren, 2013 - JoN) and/or work which has combined expectations with drift diffusion modelling to show how expectations change drift bias (Yon, Zainzinger, de Lange, Eimer & Press, 2020 - JEP:General) and/or start bias (Thomas, Yon, de Lange & Press, 2020 - Psych Sci) even when Ps cannot prepare a motor response ahead of time.

      While our focus was on testing for sensory-level modulations, we think the question of whether the motor-level effects we observed are attributable to the task design or represent a more general perceptual bound adjustment is an important question for future research. In our previous work, we have examined this distinction between abstract, movement-independent evidence accumulation (indexed by the centro-parietal positivity, CPP) and response preparation in detail. The CPP has been shown to trace evidence accumulation irrespective of whether the sensory alternatives are associated with a specific response or not (Twomey et al 2016, J Neurosci). When speed pressure is manipulated in tasks with fixed stimulus-response mappings we have found that the CPP undergoes systematic adjustments in its pre-response amplitude that closely accord with the starting-level modulations observed in mu/beta, suggesting that motor-level adjustments do still translate to differences at the perceptual level under these task conditions (e.g. Kelly et al 2021, Nat Hum Beh; Steinemann et al., 2018, Nat Comms). We have also observed that the CPP and mu-beta exhibit corresponding adjustments in response to predictive cues (Kelly et al., 2021) that are consistent with both a starting-point shift and drift rate bias. However, the Kelly et al. study did not include a signature of sensory encoding and therefore could not test for sensory-level modulations.

      We have added some remarks to the discussion to acknowledge this issue with the interpretation of the preparatory shifts in mu-beta activity we observed when the predictive cues were presented, and we have included references to the papers that the reviewer helpfully provided. We have also offered some additional consideration of the features of the task design that may have influenced the SSVEP results.

      Manuscript Changes

      An implication of using cues that predict not just the upcoming stimulus, but the most likely response, is that it becomes difficult to determine if preparatory shifts in mu-beta (MB) activity that we observed reflect adjustments directly influencing the perceptual interpretation of the stimulus or simply preparation of the more probable action. When perceptual decisions are explicitly tied to particular modes of response, the decision state can be read from activity in motor regions associated with the preparation of that kind of action (e.g. de Lafuente et al., 2015; Ding & Gold, 2012; Shadlen & Newsome, 2001; Romo et al., 2004), but these modules appear to be part of a constellation of decision-related areas that are flexibly recruited based on the response modality (e.g. Filimon et al., 2013). When the response mapping is withheld or no response is required, MB no longer traces decision formation (Twomey et al., 2015), but an abstract decision process is still readily detectable (e.g. O’Connell et al., 2012), and modelling work suggests that drift biases and starting point biases (Thomas et al., 2020; Yon et al., 2021) continue to influence prior-informed decision making. While the design of the present study does not allow us to offer further insight about whether the MB effects we observed were inherited from strategic adjustments at this abstract level of the decision process, we hope to conduct investigations in the future that better dissect the distinct components of prior-informed decisions to address this question.

      Several other issues remain unaddressed by the present study. One is that it is not clear to what extent the sensory effects may be influenced by features of the task design (e.g. speeded responses under a strict deadline) and if these sensory effects would generalise to many kinds of perceptual decision-making tasks or whether they are particular to contrast discrimination.

      Comment #9

      On a smaller, unrelated point - I thought the discussion in the Discussion section about expectation suppression was interesting, but I did not think it was completely logically sound. The authors suggest that they may see relative suppression (rather than enhancement) of their marginal SSVEP under a 'sharpening' account because these accounts suggest that there is a relative suppression of off-channel sensory units, and there are more off-channel sensory units than on-channel sensory units (i.e., there are usually more possibilities we don't expect than possibilities that we do, and suppressing the things we don't expect should therefore yield overall suppression).

      However, this strikes me as a non-sequitur given that the marginal SSVEP only reflects feature-specific visual activity (i.e., activity tuned to one of the two grating stimuli used). The idea that there are more off-channel than on-channel units makes sense for explaining why we would see overall signal drops on expected trials e.g., in an entire visual ROI in an fMRI experiment. But surely this explanation cannot hold in this case, as there is presumably an equal number of units tuned to each particular grating?

      My sense is that this possibility should probably be removed from the manuscript - and I suspect it is more likely that the absence of a difference in marginal SSVEP for Valid vs Neutral trials has more to do with the fact that participants appear to be especially attentive on Neutral trials (and so any relative enhancement of feature-specific activity for expected events is hard to detect against a baseline of generally high-precision sensory evidence on these highly attentive, neutral trials).

      We thank the reviewer for flagging that we did not clearly articulate our thoughts in this section of the manuscript. Our primary purpose in mentioning this sharpening account was simply to point out that, where at first blush our results seem to conflict with expectation suppression effects in the fMRI literature, the sharpening account provides an explanation that can reconcile them. In the case of BOLD data, the sharpening account proposes that on-channel sensory units are boosted and off-channel units are suppressed and, due to the latter being more prevalent, this leads to an overall suppression of the global signal. In the case of the SSVEP, the signal isolates just the on-units and so the sharpening account would predict that when there is a valid cue, the SSVEP signal associated with the high-contrast, expected stimulus should be boosted and the SSVEP signal associated with the low-contrast, unexpected stimulus should be weakened; this would result in a larger difference between these signals and therefore, a larger ‘marginal SSVEP’. Conversely, when there is an invalid cue, the SSVEP signal associated with the, now unexpected, high-contrast stimulus should be relatively weakened and the SSVEP signal associated with the expected, but low-contrast stimulus should be relatively boosted; this would result in a smaller difference between these signals and therefore, a lower amplitude marginal SSVEP. We do not think that this account needs to make reference to any channels beyond those feature-specific channels driving the two SSVEP signals. Again, our central point is simply that the sharpening account offers a means of reconciling our SSVEP findings with expectation suppression effects previously reported in the fMRI literature.

      We suspect that this was not adequately explained in the discussion. We have adjusted the way this section is phrased to make it clear that we are not invoking off-channel activity to explain the SSVEP effect we observed and we thank the Reviewer for pointing out that this was unclear in the original text.

      Manuscript Changes

      An alternative account for expectation suppression effects, which is consistent with our SSVEP results, is that they arise, not from a suppression of expected activity, but from a ‘sharpening’ effect whereby the responses of neurons that are tuned to the expected feature are enhanced while the responses of neurons tuned to unexpected features are suppressed (de Lange et al., 2018). On this account, the expectation suppression commonly reported in fMRI studies arises because voxels contain intermingled populations with diverse stimulus preferences and the populations tuned to the unexpected features outnumber those tuned to the expected feature. In contrast to these fMRI data, the SSVEP represents the activity of sensory units driven at the same frequency as the stimulus, and thus better isolates the feature-specific populations encoding the task-relevant sensory evidence. Therefore, according to the sharpening account, an invalid cue would have enhanced the SSVEP signal associated with the low contrast grating and weakened the SSVEP signal associated with the high contrast grating. As this would result in a smaller difference between these signals, and therefore, a lower amplitude marginal SSVEP compared to the neutral cue condition, this could explain the effect we observed.
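
The direction of the sharpening prediction for the marginal SSVEP can be checked with a small numerical sketch. Everything in it is hypothetical and chosen purely for illustration: the two amplitudes, the 10% gain, and the helper name `marginal_ssvep` are assumptions, not values or code from the study.

```python
def marginal_ssvep(high, low, cue, gain=0.10):
    """Marginal SSVEP (high-contrast minus low-contrast signal) under sharpening.

    cue:  "valid"   -> the high-contrast grating is the expected one
          "invalid" -> the low-contrast grating is the expected one
          "neutral" -> no expectation, signals left unchanged
    gain: hypothetical proportional boost for the expected signal and
          proportional reduction for the unexpected signal.
    """
    if cue == "valid":
        high, low = high * (1 + gain), low * (1 - gain)
    elif cue == "invalid":
        high, low = high * (1 - gain), low * (1 + gain)
    elif cue != "neutral":
        raise ValueError(f"unknown cue: {cue}")
    return high - low


HIGH, LOW = 2.0, 1.0  # arbitrary amplitudes, not measured values
results = {cue: marginal_ssvep(HIGH, LOW, cue) for cue in ("valid", "neutral", "invalid")}
for cue, value in results.items():
    print(f"{cue:>8}: {value:.2f}")
```

Under these assumed numbers the marginal SSVEP is largest after a valid cue and smallest after an invalid cue, which is the ordering the sharpening account predicts; the sketch says nothing about effect sizes.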

      Reviewer #3

      Observers make judgements about expected stimuli faster and more accurately. How expectations facilitate such perceptual decisions remains an ongoing area of investigation, however, as expectations may exert their effects in multiple ways. Expectations may directly influence the encoding of sensory signals. Alternatively (or additionally), expectations may influence later stages of decision-making, such as motor preparation, when they bear on the appropriate behavioral response.

      In the present study, Walsh and colleagues directly measured the effect of expectations on sensory and motor signals by making clever use of the electroencephalogram (EEG) recorded from human observers performing a contrast discrimination task. On each trial, a predictive cue indicated which of two superimposed stimuli would likely be higher contrast and, therefore, whether a left or right button press was likely to yield a correct response. Deft design choices allowed the authors to extract both contrast-dependent sensory signals and motor preparation signals from the EEG. The authors provide compelling evidence that, when predictive cues provide information about both a forthcoming stimulus and the appropriate behavioral response, expectation effects are immediately manifest in motor preparation signals and only emerge in sensory signals after extensive training.

      Future work should attempt to reconcile these results with related investigations in the field. As the authors note, several groups have reported expectation-induced modulation of sensory signals (using both fMRI and EEG/MEG) on shorter timescales (e.g. just one or two sessions of a few hundred trials, versus the intensive multi-session study reported here). One interesting possibility is that perceptual expectations are not automatic but demand the deployment of feature-based attention, while motor preparation is comparatively less effortful and so dominates when both sources of information are available, as in the present study. This hypothesis is consistent with the authors' thoughtful analysis showing decreased neural signatures of attention over posterior electrodes following predictive cues. Therefore, observing the timescale of sensory effects using the same design and methods (facilitating direct comparison with the present work), but altering task demands slightly such that cues are no longer predictive of the appropriate behavioral response, could be illuminating.

      We would like to thank Reviewer 3 for their positive comments and thoughtful suggestions for future work.

      Recommendations For The Authors:

      Comment #10

      In the methods, the term 'session' is used early on but only fleshed out at the end of the 'Procedure' subsection and never entirely explained (e.g., did sessions take place over multiple days?). A brief sentence laying this out early on, perhaps in 'Participants' after the (impressive) trial counts are reported, might be helpful.

      Thank you to Reviewer 3 for pointing out that this was not clear in the original draft. We have amended the text in the Methods section to better explain the relationship between sessions, days, and trial bins.

      Manuscript Changes

      (Methods - Participants): [...] All procedures were approved by the Trinity College Dublin School of Psychology Ethics Committee and were in accordance with the Declaration of Helsinki. Participants completed between 4 and 6 testing sessions, each on a different day. While the sample size was small, on average, participants completed 5750 (SD = 1066) trials each.

      (Methods - Data Analysis): [...] As there were two lengths of testing session and participants completed different numbers of sessions, we analysed the effect of task exposure by pooling trials within-subjects and dividing them into five ‘trial bins’. The first bin represents the participants’ earliest exposure to the task and the final bin represents trials at the end of their participation, when they had had substantial task exposure. All trials with valid responses and reaction times greater than 100 ms were included in the analyses of behavioural data and the SSVEP.
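
The pooling-and-binning step described above could look roughly like the following sketch. It assumes that trials are already in chronological order for one participant, that bins are consecutive and of near-equal size (the text does not state how bin boundaries were chosen), and that filtering on "valid responses" has happened upstream. The function name `assign_trial_bins` is hypothetical.

```python
def assign_trial_bins(reaction_times_ms, n_bins=5, min_rt_ms=100.0):
    """Drop fast-guess trials, then label the remaining trials with a bin index.

    reaction_times_ms: one participant's trials pooled across sessions,
                       in chronological order.
    Returns (kept, labels): indices of retained trials, and for each retained
    trial a bin label from 0 (earliest exposure) to n_bins - 1 (latest).
    """
    # Keep only trials with reaction times greater than the cutoff.
    kept = [i for i, rt in enumerate(reaction_times_ms) if rt > min_rt_ms]
    # Map chronological position onto n_bins consecutive, near-equal bins.
    labels = [(position * n_bins) // len(kept) for position in range(len(kept))]
    return kept, labels


# Twelve made-up reaction times (ms); two fall at or below the 100 ms cutoff.
rts = [450, 80, 500, 520, 90, 610, 480, 530, 470, 495, 505, 515]
kept, labels = assign_trial_bins(rts)
print(kept)
print(labels)
```

Binning by position within each participant, rather than by session, is what lets participants with different numbers and lengths of sessions be compared on a common exposure axis.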

      Comment #11

      On a related note: participants completed a variable number of trials/sessions. To facilitate comparison across subjects, training effects are reported by dividing each subject's data into 5 exposure bins. This is entirely reasonable but does leave the reader wondering about whether you found any effects of rest or sleep between sessions.

      We agree with the reviewer that this is an interesting question that absolutely merits further investigation. As different participants completed different numbers of sessions, different session lengths, and had variable gaps between their sessions, we do not think a per-session analysis would be informative. We think it may be better addressed in a future study, perhaps one with a larger sample where we could collect data specifically about sleep and more systematically control the intervals between testing sessions.

      Comment #12

      Fig 2B: the 'correct' and 'neutral' labels in the legend are switched

      Thank you to the reviewer for spotting that error; the labels in Figure 2 have been corrected.

      Comment #13

      Fig 4B: it's a bit difficult to distinguish which lines are 'thick' and 'thin'

      We have updated Figure 4.B to increase the difference in line thickness between the thick and thin lines (as shown below).

      Author response image 5.

      Comment #14

      Fig 4C: missing (I believe?) the vertical lines indicating median reaction time

      We have updated Figure 4.C to include the median reaction times.

      Author response image 6.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This important work presents a new methodology for the statistical analysis of fiber photometry data, improving statistical power while avoiding the bias inherent in the choices that are necessarily made when summarizing photometry data. The reanalysis of two recent photometry data sets, the simulations, and the mathematical detail provide convincing evidence for the utility of the method and the main conclusions; however, the discussion of the re-analyzed data is incomplete and would be improved by a deeper consideration of the limitations of the original data. In addition, consideration of other data sets and photometry methodologies including non-linear analysis tools, as well as a discussion of the importance of the data normalization are needed.

      Thank you for reviewing our manuscript and giving us the opportunity to respond and improve our paper. In our revision, we have strived to address the points raised in the comments, and implement suggested changes where feasible. We have also improved our package and created an analysis guide (available on our GitHub - https://github.com/gloewing/fastFMM and https://github.com/gloewing/photometry_fGLMM), showing users how to apply our methods and interpret their results. Below, we provide a detailed point-by-point response to the reviewers.

      Reviewer #1:

      Summary:

      Fiber photometry has become a very popular tool for recording neuronal activity in freely behaving animals. Despite the number of papers published with the method, as the authors rightly note, there are currently no standardized ways to analyze the data produced. Moreover, most of the data analyses are confined to simple measurements of averaged activity and by doing so, erase valuable information encoded in the data. The authors offer an approach based on functional linear mixed modeling, where beyond changes in overall activity various functions of the data can also be analyzed. More in-depth analysis, more variables taken into account, and better statistical power all lead to higher quality science.

      Strengths:

      The framework the authors present is solid and well-explained. By reanalyzing formerly published data, the authors also further increase the significance of the proposed tool opening new avenues for reinterpreting already collected data.

      Thank you for your favorable and detailed description of our work!

      Weaknesses:

      However, this also leads to several questions. The normalization method employed for raw fiber photometry data is different from lab to lab. This imposes a significant challenge to applying a single tool of analysis.

      Thank you for these important suggestions. We agree that many data pre-processing steps will influence the statistical inference from our method. Note, though, that this would also be the case with standard analysis approaches (e.g., t-tests, correlations) applied to summary measures like AUCs. For that reason, we do not believe that variability in pre-processing is an impediment to widespread adoption of a standard analysis procedure. Rather, we would argue that the sensitivity of analysis results to pre-processing choices should motivate the development of statistical techniques that reduce the need for pre-processing, and properly account for structure in the data arising from experimental designs. For example, even without many standard pre-processing steps, FLMM provides smooth estimation results across trial timepoints (i.e., the “functional domain”), has the ability to adjust for between-trial and -animal heterogeneity, and provides a valid statistical inference framework that quantifies the resulting uncertainty. We appreciate the reviewer’s suggestion to emphasize and further elaborate on our method from this perspective. We have now included the following in the Discussion section:

      “FLMM can help model signal components unrelated to the scientific question of interest, and provides a systematic framework to quantify the additional uncertainty from those modeling choices. For example, analysts sometimes normalize data with trial-specific baselines because longitudinal experiments can induce correlation patterns across trials that standard techniques (e.g., repeated measures ANOVA) may not adequately account for. Even without many standard data pre-processing steps, FLMM provides smooth estimation results across trial time-points (the “functional domain”), has the ability to adjust for between-trial and -animal heterogeneity, and provides a valid statistical inference approach that quantifies the resulting uncertainty. For instance, session-to-session variability in signal magnitudes or dynamics (e.g., a decreasing baseline within-session from bleaching or satiation) could be accounted for, at least in part, through the inclusion of trial-level fixed or random effects. Similarly, signal heterogeneity due to subject characteristics (e.g., sex, CS+ cue identity) could be incorporated into a model through inclusion of animal-specific random effects. Inclusion of these effects would then influence the width of the confidence intervals. By expressing one’s “beliefs” in an FLMM model specification, one can compare models (e.g., with AIC). Even the level of smoothing in FLMM is largely selected as a function of the data, and is accounted for directly in the equations used to construct confidence intervals. This stands in contrast to “trying to clean up the data” with a pre-processing step that may have an unknown impact on the final statistical inferences.”

      Does the method that the authors propose work equally well when the data are normalized as a running-average dF/F, as described in the cited papers? For example, trace smoothing using running averages (Jeong et al. 2022) in itself may lead to pattern dilution.

      By modeling trial signals as “functions”, the method accounts for and exploits correlation across trial timepoints and, as such, any pre-smoothing of the signals should not negatively affect the validity of the 95% CI coverage. It will, however, change the inferential results and the interpretation of the data, although this is equally true of many other statistical procedures and is not specific to FLMM.

      The same question applies if the z-score is calculated based on various responses or even baselines. How reliable is the method if the data are non-stationary and the baselines undergo major changes between separate trials?

      Trial-to-trial variability in signal magnitudes or dynamics could be accounted for, at least in part, through the inclusion of trial-level random effects. This heterogeneity would then influence the width of the confidence intervals, directly conveying the effect of the variability on the conclusions being drawn from the data. This stands in contrast to “trying to clean up the data” with a pre-processing step that may have an unknown impact on the final statistical inferences. Indeed, non-stationarity (e.g., a decreasing baseline within-session) due to, for example, measurement artifacts (e.g., bleaching) or behavioral causes (e.g., satiation, learning) should, if possible, be accounted for in the model. As mentioned above, one can often achieve the same goals that motivate pre-processing steps by instead applying specific FLMM models (e.g., that include trial-specific intercepts to reflect changes in baseline) to the unprocessed data. One can then compare model criteria in an objective fashion (e.g., with AIC) and quantify the uncertainty associated with those modeling choices. Even the level of smoothing in FLMM is largely selected as a function of the data, and is accounted for directly in the equations used to construct confidence intervals. In sum, our method provides both a tool to account for challenges in the data, and a systematic framework to quantify the additional uncertainty that accompanies accounting for those data characteristics.

      Finally, what is the rationale for not using non-linear analysis methods? Following the paper’s logic, non-linear analysis can capture more information that is diluted by linear methods.

      This is a good question that we imagine many readers will be curious about as well. We have added in notes to the Discussion and Methods Section 4.3 to address this (copied below). We thank the reviewer for raising this point, as your feedback also motivated us to discuss this point in Part 5 of our Analysis Guide.

      Methods

      “FLMM models each trial’s signal as a function that varies smoothly across trial time-points (i.e., along the “functional domain”). It is thus a type of non-linear modeling technique over the functional domain, since we do not assume a linear model (straight line). FLMM and other functional data analysis methods model data as functions, when there is a natural ordering (e.g., time-series data are ordered by time, imaging data are ordered by x-y coordinates), and are assumed to vary smoothly along the functional domain (e.g., one assumes values of a photometry signal at close time-points in a trial have similar values). Functional data analysis approaches exploit this smoothness and natural ordering to capture more information during estimation and inference.”

      Discussion

      “In this paper, we specified FLMM models with linear covariate–signal relationships at a fixed trial time-point across trials/sessions, to compare the FLMM analogue of the analyses conducted in (Jeong et al., 2022). However, our package allows modeling of covariate–signal relationships with non-linear functions of covariates, using splines or other basis functions. One must consider, however, the tradeoff between flexibility and interpretability when specifying potentially complex models, especially since FLMM is designed for statistical inference.”

      Reviewer #2:

      Summary:

      This work describes a statistical framework that combines functional linear mixed modeling with joint 95% confidence intervals, which improves statistical power and provides less conservative statistical inferences than in previous studies. As recently reviewed by Simpson et al. (2023), linear regression analysis has been used extensively to analyze time series signals from a wide range of neuroscience recording techniques, with recent studies applying it to photometry data. The novelty of this study lies in 1) the introduction of joint 95% confidence intervals for statistical testing of functional mixed models with nested random-effects, and 2) providing an open-source R package implementing this framework. This study also highlights how summary statistics as opposed to trial-by-trial analysis can obscure or even change the direction of statistical results by reanalyzing two other studies.

      Strengths:

      The open-source R package, which uses a syntax similar to that of the lme4 package, makes this framework more accessible to other researchers working with photometry data. Moreover, because the model fits faster than a similar package on simulated data, the method has the potential to be more easily adopted.

      The reanalysis of two studies using summary statistics on photometry data (Jeong et al., 2022; Coddington et al., 2023) highlights how trial-by-trial analysis at each time-point on the trial can reveal information obscured by averaging across trials. Furthermore, this work also exemplifies how session and subject variability can lead to opposite conclusions when not considered.

      We appreciate the in-depth description of our work and, in particular, the R package. This is an area where we put a lot of effort, since our group is very concerned with the practical experience of users.

      Weaknesses:

      Although this work has reanalyzed previous work that used summary statistics, it does not compare with other studies that use trial-by-trial photometry data across time-points in a trial. As described by the authors, fitting pointwise linear mixed models and performing t-test and Benjamini–Hochberg correction as performed in Lee et al. (2019) has some caveats. Using joint confidence intervals has the potential to improve statistical robustness, however, this is not directly shown with temporal data in this work. Furthermore, it is unclear how FLMM differs from the pointwise linear mixed modeling used in this work.

      Thank you for making this important point. We agree that this offers an opportunity to showcase the advantages of FLMM over non-functional data analysis methods, such as the approach applied in Lee et al. (2019). As mentioned in the text, fitting entirely separate models at each trial timepoint (without smoothing regression coefficient point and variance estimates across timepoints), and applying multiple comparisons corrections as a function of the number of time points has substantial conceptual drawbacks. To see why, consider that applying this strategy with two different sub-sampling rates requires adjustment for different numbers of comparisons, and could thus lead to very different proportions of timepoints achieving statistical significance. In light of your comments, we decided that it would be useful to provide a demonstration of this. To that effect, we have added Appendix Section 2 comparing FLMM with the method in Lee et al. (2019) on a real dataset, and show that FLMM yields far less conservative and more stable inference across different sub-sampling rates. We conducted this comparison on the delay-length experiment (shown in Figure 6) data, sub-sampled at evenly spaced intervals at a range of sampling rates. We fit either a collection of separate linear mixed models (LMM) followed by a Benjamini–Hochberg (BH) correction, or FLMM with statistical significance determined with both Pointwise and Joint 95% CIs. As shown in Appendix Tables 1-2, the proportion of timepoints at which effects are statistically significant with FLMM Joint CIs is fairly stable across sampling rates. In contrast, the percentage is highly inconsistent with the BH approach and is often highly conservative. This illustrates a core advantage of functional data analysis methods: borrowing strength across trial timepoints (i.e., the functional domain), can improve estimation efficiency and lower sensitivity to how the data is sub-sampled. 
      A multiple comparisons correction may, however, yield stable results if one first smooths both regression coefficient point and variance estimates. Because it smooths both sets of estimates, this approach would essentially constitute a functional mixed model estimation strategy that uses multiple comparisons correction instead of a joint CI. We have now added in a description of this experiment in Section 2.4 (copied below).

      “We further analyze this dataset in Appendix Section 2, to compare FLMM with the approach applied in Lee et al. (2019) of fitting pointwise LMMs (without any smoothing) and applying a Benjamini–Hochberg (BH) correction. Our hypothesis was that the Lee et al. (2019) approach would yield substantially different analysis results, depending on the sampling rate of the signal data (since the number of tests being corrected for is determined by the sampling rate). The proportion of timepoints at which effects are deemed statistically significant by FLMM joint 95% CIs is fairly stable across sampling rates. In contrast, that proportion is both inconsistent and often low (i.e., highly conservative) across sampling rates with the Lee et al. (2019) approach. These results illustrate the advantages of modeling a trial signal as a function, and conducting estimation and inference in a manner that uses information across the entire trial.”
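
The dependence of a Benjamini–Hochberg correction on the number of timepoints tested can be seen in a toy sketch. The p-values below are invented for illustration and are held fixed while the trial is sub-sampled, which is a simplifying assumption (in real data, sub-sampling would also change the pointwise fits). This is a plain implementation of the standard BH step-up rule, not the code used in the Appendix analysis, and the helper name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses the BH rule rejects."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose sorted p-value is <= k * alpha / m.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * alpha / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for index in order[:cutoff_rank]:
        rejected[index] = True
    return rejected


# Invented pointwise p-values at 8 timepoints of one trial window.
dense = [0.70, 0.60, 0.020, 0.030, 0.012, 0.025, 0.50, 0.80]

proportions = {}
for step in (1, 2, 4):  # keep every timepoint, every 2nd, every 4th
    sub_sampled = dense[::step]
    flags = benjamini_hochberg(sub_sampled)
    proportions[step] = sum(flags) / len(flags)
    print(f"every {step} timepoint(s): {proportions[step]:.2f} significant")
```

With these made-up values, half of the timepoints survive correction at the two coarser sampling rates but none survive at the densest rate, because the per-rank threshold shrinks as the number of tests grows.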

      In this work, FLMM usages included only one or two covariates. However, in complex behavioral experiments, where variables are correlated, more than two may be needed (see Simpson et al. (2023); Engelhard et al. (2019); Blanco-Pozo et al. (2024)). It is not clear from this work how computationally feasible it would be to fit such complex models, which would also include more complex random effects.

      Thank you for bringing this up, as we endeavored to create code that is able to scale to complex models and large datasets. We agree that highlighting this capability in the paper will strengthen the work. We now state in the Discussion section that “[T]he package is fast and maintains a low memory footprint even for complex models (see Section 4.6 for an example) and relatively large datasets.” Methods Section 4.6 now includes the following:

      Our fastFMM package scales to the dataset sizes and model specifications common in photometry. The majority of the analyses presented in the Results Section (Section 2) included fairly simple functional fixed and random effect model specifications because we were implementing the FLMM versions of the summary measure analyses presented in Jeong et al. (2022). However, we fit the following FLMM to demonstrate the scalability of our method with more complex model specifications:

      We use the same notation as the Reward Number model in Section 4.5.2, with the additional variable TL_{i,j,l} denoting the Total Licks on trial j of session l for animal i. In a dataset with over 3,200 total trials (pooled across animals), this model took ∼1.2 min to fit on a MacBook Pro with an Apple M1 Max chip with 64 GB of RAM. Model fitting had a low memory footprint. This can be fit with the code:

      model_fit = fui(photometry ~ session + trial + iri + lick_time + licks + (session + trial + iri + lick_time + licks | id), parallel = TRUE, data = photometry_data)

      This provides a simple illustration of the scalability of our method. The code (including timing) for this demonstration is now included on our Github repository.

      Reviewer #3:

      Summary:

      Loewinger et al. extend a previously described framework (Cui et al., 2021) to provide new methods for statistical analysis of fiber photometry data. The methodology combines functional regression with linear mixed models, allowing inference on complex study designs that are common in photometry studies. To demonstrate its utility, they reanalyze datasets from two recent fiber photometry studies into mesolimbic dopamine. Then, through simulation, they demonstrate the superiority of their approach compared to other common methods.

      Strengths:

      The statistical framework described provides a powerful way to analyze photometry data and potentially other similar signals. The provided package makes this methodology easy to implement and the extensively worked examples of reanalysis provide a useful guide to others on how to correctly specify models.

      Modeling the entire trial (functional regression) removes the need to choose appropriate summary statistics, removing the opportunity to introduce bias, for example in searching for optimal windows in which to calculate the AUC. This is demonstrated in the re-analysis of Jeong et al., 2022, in which the AUC measures presented masked important details about how the photometry signal was changing.

      Meanwhile, using linear mixed methods allows for the estimation of random effects, which are an important consideration given the repeated-measures design of most photometry studies.

      We would like to thank the reviewer for the deep reading and understanding of our paper and method, and the thoughtful feedback provided. We agree with this summary, and will respond in detail to all the concerns raised.

      Weaknesses:

      While the availability of the software package (fastFMM), the provided code, and worked examples used in the paper are undoubtedly helpful to those wanting to use these methods, some concepts could be explained more thoroughly for a general neuroscience audience.

      Thank you for this point. While we went to great effort to explain things clearly, our efforts to be concise likely resulted in some lack of clarity. To address this, we have created a series of analysis guides for a more general neuroscience audience, reflecting our experience working with researchers at the NIH and the broader community. These guides walk users through the code, its deployment in typical scenarios, and the interpretation of results.

      While the methodology is sound and the discussion of its benefits is good, the interpretation and discussion of the re-analyzed results are poor:

      In section 2.3, the authors use FLMM to identify an instance of Simpson’s Paradox in the analysis of Jeong et al. (2022). While this phenomenon is evident in the original authors’ metrics (replotted in Figure 5A), FLMM provides a convenient method to identify these effects while illustrating the deficiencies of the original authors’ approach of concatenating a different number of sessions for each animal and ignoring potential within-session effects.

      Our goal was to demonstrate that FLMM provides insight into why the opposing within- and between-session effects occur: the between-session and within-session changes appear to occur at different trial timepoints. Thus, while the AUC metrics applied in Jeong et al. (2022) are enough to show the presence of Simpson’s paradox, it is difficult to hypothesize why the opposing within-/between-session effects occur. An AUC analysis cannot determine at what trial timepoints (relative to licking) those opposing trends occur.
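
      The opposing within- and between-session trends can be illustrated with a toy example. The sketch below is in Python and uses made-up numbers (three sessions of ten trials, a signal that falls by 0.5 per trial within a session and rises by 10 per session); none of the values come from Jeong et al. (2022) or from the FLMM fits, and the helper names are hypothetical.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


n_sessions, trials_per_session = 3, 10
# Each record: (session, trial within session, cumulative trial number, signal).
records = []
for s in range(n_sessions):
    for t in range(trials_per_session):
        # Assumed toy signal: rises across sessions, falls within a session.
        signal = 10.0 * s - 0.5 * t
        records.append((s, t, s * trials_per_session + t, signal))

# Pooled analysis: regress signal on cumulative trial number, ignoring sessions.
pooled_slope = ols_slope([r[2] for r in records], [r[3] for r in records])

# Stratified analysis: regress signal on trial number separately per session.
within_slopes = [
    ols_slope(
        [r[1] for r in records if r[0] == s],
        [r[3] for r in records if r[0] == s],
    )
    for s in range(n_sessions)
]

print(pooled_slope, within_slopes)
```

      The pooled slope is positive while every within-session slope is negative, which is the signature of Simpson's paradox. A scalar summary of this kind shows that the reversal exists but, as noted above, cannot say at which trial timepoints each trend arises.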

      The discussion of this result is muddled. Having identified the paradox, there is some appropriate speculation as to what is causing these opposing effects, particularly the decrease within sessions. In the discussion and appendices, the authors identify (1) changes in satiation/habituation/motivation, (2) the predictability of the rewards (presumably by the click of a solenoid valve) and (3) photobleaching as potential explanations of the decrease within days. Having identified these effects, but without strong evidence to rule all three out, the discussion of whether RPE or ANCCR matches these results is probably moot. In particular, the hypotheses developed by Jeong et al. were for a random (unpredictable) rewards experiment, whereas the evidence points to the rewards being sometimes predictable. The learning of that predictability (e.g. over sessions) and variation in predictability (e.g. by attention level to sounds of each mouse) significantly complicate the analysis. The FLMM analysis reveals the complexity of analyzing what is apparently a straightforward task design.

      While we are disappointed to hear the reviewer felt our initial interpretations and discussion were poor, the reviewer brings up an excellent point re: potential reward predictability that we had not considered. They have convinced us that acknowledging this alternative perspective will strengthen the paper, and we have added it into the Discussion. We agree that the ANCCR/RPE model predictions were made for unpredictable rewards and, as the reviewer rightly points out, there is evidence that the animals may sense the reward delivery. After discussing extensively with the authors of Jeong et al. (2022), it is clear that they went to enormous trouble to prevent the inadvertent generation of a CS+, and it is likely changes in pressure from the solenoid (rather than a sound) that may have served as a cue. Regardless of the learning theory one adopts (RPE, ANCCR or others), we agree that this potential learned predictability could, at least partially, account for the increase in signal magnitude across sessions. As this paper is focused on analysis methods, we feel that we can contribute most thoughtfully to the dopamine–learning theory conversation by presenting this explanation in detail, for consideration in future experiments. We have substantially edited this discussion and, as per the reviewer’s suggestion, have qualified our interpretations to reflect the uncertainty in explaining the observed trends.

      If this paper is not trying to arbitrate between RPE and ANCCR, as stated in the text, the post hoc reasoning of the authors of Jeong et al 2022 provided in the discussion is not germane. Arbitrating between the models likely requires new experimental designs (removing the sound of the solenoid, satiety controls) or more complex models (e.g. with session effects, measures of predictability) that address the identified issues.

      Thank you for this point. We agree with you that, given the scope of the paper, we should avoid any extensive comparison between the models. To address your comment, we have now removed portions of the Discussion that compared RPE and ANCCR. Overall, we agree with the reviewer, and think that future experiments will be needed for conclusively testing the accuracy of the models’ predictions for random (unpredicted) rewards. While we understand that our description of several conversations with the Jeong et al., 2022 authors could have gone deeper, we hope the reviewer can appreciate that inclusion of these conversations was done with the best of intentions. We wish to emphasize that we also consulted with several other researchers in the field when crafting our discussion. We do commend the authors of Jeong et al., 2022 for their willingness to discuss all these details. They could easily have avoided acknowledging any potential incompleteness of their theory by claiming that our results do not invalidate their predictions for a random reward, because the reward could potentially have been predicted (due to an inadvertent CS+ generated from the solenoid pressure). Instead, they emphasized that they thought their experiment did test a random reward, to the extent they could determine, and that our results suggest components of their theory that should be updated. We think that engagement with re-analyses of one’s data, even when findings are at odds with an initial theoretical framing, is a good demonstration of open science practice. For that reason as well, we feel providing readers with a perspective on the entire discussion will contribute to the scientific discourse in this area.

      Finally, we would like to reiterate that this conversation is happening at least in part because of our method: by analyzing the signal at every trial timepoint, it provides a formal way to test for the presence of a neural signal indicative of reward delivery perception. Ultimately, this was what we set out to do: help researchers ask questions of their data that may have been harder to ask before. We believe that having a demonstration that we can indeed do this for a “live” scientific issue is the most appropriate way of demonstrating the usefulness of the method.

      Of the three potential causes of within-session decreases, the photobleaching arguments advanced in the discussion and expanded greatly in the appendices are not convincing. The data being modeled is a processed signal (∆F/F) with smoothing and baseline correction and this does not seem to have been considered in the argument. Furthermore, the photometry readout is also a convolution of the actual concentration changes over time, influenced by the on-off kinetics of the sensor, which makes the interpretation of timing effects of photobleaching less obvious than presented here and more complex than the dyes considered in the cited reference used as a foundation for this line of reasoning.

      We appreciate the nuance of this point, and we have made considerable efforts in the Results and Discussion sections to caution that alternative hypotheses (e.g., photobleaching) cannot be definitively ruled out. In response to your criticism, we have consulted with more experts in the field regarding the potential for bleaching in this data, and it is not clear to us why photobleaching would be visible in one time-window of a trial, but not at another (less than a second away), despite high ∆F/F magnitudes in both time-windows. We do wish to point out that the Jeong et al. (2022) authors were also concerned about photobleaching as a possible explanation. At their request, we analyzed data from additional experiments, collected from the same animals. In most cases, we did not observe signal patterns that seemed to indicate photobleaching. Given the additional scrutiny, we do not think that photobleaching is more likely to invalidate results in this particular set of experiments than it would be in any other photometry experiment. While the role of photobleaching may be more complicated with this sensor than others in the references, that citation was included primarily as a way of acknowledging that it is possible that non-linearities in photobleaching could occur. Regardless, your point is well taken and we have qualified our description of these analyses to express that photobleaching cannot be ruled out.

      Within this discussion of photobleaching, the characterization of the background reward experiments used in part to consider photobleaching (appendix 7.3.2) is incorrect. In this experiment (Jeong et al., 2022), background rewards were only delivered in the inter-trial-interval (i.e. not between the CS+ and predicted reward as stated in the text). Both in the authors’ description and in the data, there is a 6s before cue onset where rewards are not delivered and while not described in the text, the data suggests there is a period after a predicted reward when background rewards are not delivered. This complicates the comparison of this data to the random reward experiment.

      Thank you for pointing this out! We removed the parenthetical on page 18 of the appendix that incorrectly stated that rewards can occur between the CS+ and the predicted reward.

      The discussion of the lack of evidence for backpropagation, taken as evidence for ANCCR over RPE, is also weak.

      Our point was initially included to acknowledge that, although our method yields results that conflict with the conclusions described by Jeong et al., 2022 on data from some experiments, on other experiments our method supports their results. Again, we believe that a critical part of re-analyzing shared datasets is acknowledging both areas where new analyses support the original results, as well as those where they conflict with them. We agree with the reviewer that qualifying our results so as not to emphasize support for/against RPE/ANCCR will strengthen our paper, and we have made those changes. We have qualified the conclusions of our analysis to emphasize they are a demonstration of how FLMM can be used to answer a certain style of question with hypothesis testing (how signal dynamics change across sessions), as opposed to providing evidence for/against the backpropagation hypothesis.

      A more useful exercise than comparing FLMM to the methods and data of Jeong et al., 2022, would be to compare against the approach of Amo et al., 2022, which identifies backpropagation (data publicly available: DOI: 10.5061/dryad.hhmgqnkjw). The replication of a positive result would be more convincing of the sensitivity of the methodology than the replication of a negative result, which could be a result of many factors in the experimental design. Given that the Amo et al. analysis relies on identifying systematic changes in the timing of a signal over time, this would be particularly useful in understanding if the smoothing steps in FLMM obscure such changes.

      Thank you for this suggestion. Your thoughtful review has convinced us that focusing on our statistical contribution will strengthen the paper, and we made changes to further emphasize that we are not seeking to adjudicate between RPE/ANCCR. Given the length of the manuscript as it stands, we could only include a subset of the analyses conducted on Jeong et al., 2022, and had to relegate the results from the Coddington et al., data to an appendix. Realistically, it would be hard for us to justify including analyses from a third dataset, only to have to relegate them to an appendix. We did include numerous examples in our manuscript where we already replicated positive results, in a way that we believe demonstrates the sensitivity of the methodology. We have also been working with many groups at NIH and elsewhere using our approach, in experiments targeting different scientific questions. In fact, one paper that extensively applies our method, and compares the results with those yielded by standard analysis of AUCs, is already published (Beas et al., 2024). Finally, in our analysis guide we describe additional analyses, not included in the manuscript, that replicate positive results. Hence there are numerous demonstrations of FLMM’s performance in less controversial settings. We take your point that our description of the data supporting one theory or the other should be qualified, and we have corrected that. Specifically for your suggestion of Amo et al. 2022, we have not had the opportunity to personally reanalyze their data, but we are already in contact with other groups who have conducted preliminary analyses of their data with FLMM. We are delighted to see this, in light of your comments and our decision to restrict the scope of our paper. We will help them and other groups working on this question to the extent we can.

      Recommendations for the Authors:

      Reviewer #2:

      First, I would like to commend the authors for the clarity of the paper, and for creating an open-source package that will help researchers more easily adopt this type of analysis.

      Thank you for the positive feedback!

      I would suggest the authors consider adding to the manuscript either some evidence or some intuition on how feasible it would be to use FLMM for very complex model specifications, in terms of computational cost and model convergence.

      Thank you for this suggestion. As we described above in response to Reviewer #2’s Public Reviews, we have added in a demonstration of the scalability of the method. Since our initial manuscript submission, we have further increased the package’s speed (e.g., through further parallelization). We are releasing the updated version of our package on CRAN.

      From my understanding, this package might potentially be useful not just for photometry data but also for two-photon recordings for example. If so, I would also suggest the authors add to the discussion this potential use.

      This is a great point. Our updated manuscript Discussion includes the following:

      “The FLMM framework may also be applicable to techniques like electrophysiology and calcium imaging. For example, our package can fit functional generalized LMMs with a count distribution (e.g., Poisson). Additionally, our method can be extended to model time-varying covariates. This would enable one to estimate how the level of association between signals, simultaneously recorded from different brain regions, fluctuates across trial time-points. This would also enable modeling of trials that differ in length due to, for example, variable behavioral response times (e.g., latency-to-press).”

      Reviewer #3:

      The authors should define ’function’ in context, as well as provide greater detail of the alternate tests that FLMM is compared to in Figure 7.

      We include a description of the alternate tests in Appendix Section 5.2. We have updated the Methods Section (Section 4) to introduce the reader to how ‘functions’ are conceptualized and modeled in the functional data analysis literature. Specifically, we added the following text:

      “FLMM models each trial’s signal as a function that varies smoothly across trial time-points (i.e., along the “functional domain”). It is thus a type of non-linear modeling technique over the functional domain, since we do not assume a linear model (straight line). FLMM and other functional data analysis methods model data as functions, when there is a natural ordering (e.g., time-series data are ordered by time, imaging data are ordered by x-y coordinates), and are assumed to vary smoothly along the functional domain (e.g., one assumes values of a photometry signal at close time-points in a trial have similar values). Functional data analysis approaches exploit this smoothness and natural ordering to capture more information during estimation and inference.”
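
      The benefit of exploiting smoothness along the functional domain can be sketched with a toy example. The code below is in Python and uses a centered moving average as a simple stand-in smoother; it is not the smoothing procedure used in the package, and the sinusoidal "true" curve, noise level, and window width are all assumptions made for illustration.

```python
import math
import random


def moving_average(values, half_window):
    """Centered moving average; the window is truncated at the edges."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def mean_squared_error(estimate, truth):
    """Average squared distance between an estimate and the true curve."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)


rng = random.Random(0)
n_points = 100
# Assumed smooth "true" coefficient curve over the trial timepoints.
truth = [math.sin(2 * math.pi * i / n_points) for i in range(n_points)]
# Noisy pointwise estimates, as if each timepoint were fit on its own.
noisy = [t + rng.gauss(0.0, 0.5) for t in truth]
# Borrow information from neighboring timepoints.
smoothed = moving_average(noisy, half_window=4)

print(mean_squared_error(noisy, truth), mean_squared_error(smoothed, truth))
```

      Because neighboring timepoints carry similar values, averaging across them reduces noise with little added bias, which is the intuition behind treating the trial signal as a function rather than as unrelated points.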

      Given the novelty of estimating joint CIs, the authors should be clearer about how this should be reported and how this differs from pointwise CIs (and how this has been done in the past).

      We appreciate your pointing this out, as the distinction is nuanced. Our manuscript includes a description of how joint CIs enable one to interpret effects as statistically significant for time-intervals as opposed to individual timepoints. Unlike joint CIs, assessing significance with pointwise CIs suffers from multiple-comparisons problems. As a result of your suggestion, we have added a short discussion of this to our analysis guide (Part 1), entitled “Pointwise or Joint 95% Confidence Intervals.” The Methods section of our manuscript also includes the following:

      “The construction of joint CIs in the context of functional data analysis is an important research question; see Cui et al. (2021) and references therein. Each point at which the pointwise 95% CI does not contain 0 indicates that the coefficient is statistically significantly different from 0 at that point. Compared with pointwise CIs, joint CIs take into account the autocorrelation of signal values across trial time-points (the functional domain). Therefore, instead of interpreting results at a specific timepoint, joint CIs enable joint interpretations at multiple locations along the functional domain. This aligns with interpreting covariate effects on the photometry signals across time-intervals (e.g., a cue period) as opposed to at a single trial time-point. Previous methodological work has provided functional mixed model implementations for either joint 95% CIs for simple random-effects models (Cui et al., 2021), or pointwise 95% CIs for nested models (Scheipl et al., 2016), but to our knowledge, do not provide explicit formulas or software for computing joint 95% CIs in the presence of general random-effects specifications.”
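
      The difference between pointwise and joint critical values can be sketched by simulation. The code below is in Python and is a generic illustration, not the construction used in the manuscript or the package: it assumes standardized estimates that follow an AR(1) process across 50 timepoints with correlation 0.9, and takes the 95th percentile of the maximum absolute value across the trial as the joint critical value. All parameter values and function names are assumptions.

```python
import random
from statistics import NormalDist


def simulate_max_abs(n_points, rho, rng):
    """Draw one AR(1) path with unit marginal variance; return its max |z|."""
    z = rng.gauss(0.0, 1.0)
    max_abs = abs(z)
    innovation_sd = (1.0 - rho ** 2) ** 0.5
    for _ in range(n_points - 1):
        z = rho * z + innovation_sd * rng.gauss(0.0, 1.0)
        max_abs = max(max_abs, abs(z))
    return max_abs


def joint_critical_value(n_points=50, rho=0.9, n_sims=2000, level=0.95, seed=1):
    """Empirical quantile of the maximum absolute standardized deviation."""
    rng = random.Random(seed)
    maxima = sorted(simulate_max_abs(n_points, rho, rng) for _ in range(n_sims))
    return maxima[round(level * n_sims) - 1]


# Pointwise 95% interval: estimate +/- pointwise_crit * SE at each timepoint.
pointwise_crit = NormalDist().inv_cdf(0.975)
# Joint 95% interval: estimate +/- joint_crit * SE, valid across all timepoints.
joint_crit = joint_critical_value()

print(round(pointwise_crit, 2), round(joint_crit, 2))
```

      The joint critical value exceeds the pointwise one, so joint intervals are wider; in exchange, a stretch of timepoints whose joint interval excludes 0 can be interpreted together rather than one timepoint at a time.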

      The authors identify that many photometry studies are complex nested longitudinal designs, using the cohort of 8 animals used in five task designs of Jeong et al. 2022 as an example. The authors miss the opportunity to illustrate how FLMM might be useful in identifying the effects of subject characteristics (e.g. sex, CS+ cue identity).

      This is a fantastic point and we have added the following into the Discussion:

      “...[S]ignal heterogeneity due to subject characteristics (e.g., sex, CS+ cue identity) could be incorporated into a model through inclusion of animal-specific random effects.”

      In discussing the delay-length change experiment, it would be more accurate to say that proposed versions of RPE and ANCCR do not predict the specific change.

      Good point. We have made this change.

      Minor corrections:

      Panels are mislabeled in Figure 5.

      Thank you. We have corrected this.

      The Crowder (2009) reference is incorrect, being a review of the book with the book presumably being the correct citation.

      Good catch, thank you! Corrected.

      In Section 5 (first appendix), the authors could include the alternate spelling ’fibre photometry’ to capture any citations that use British English spelling.

      This is a great suggestion, but we did not have time to recreate these figures before re-submission.

      Section 7.4 is almost all quotation, though unevenly using the block quotation formatting. It is unclear why such a large quotation is included.

      Thank you for pointing this out. We have removed this Appendix section (formerly Section 7.4) as the relevant text was already included in the Methods section.

      References

      Sofia Beas, Isbah Khan, Claire Gao, Gabriel Loewinger, Emma Macdonald, Alison Bashford, Shakira Rodriguez-Gonzalez, Francisco Pereira, and Mario A Penzo. Dissociable encoding of motivated behavior by parallel thalamo-striatal projections. Current Biology, 34(7):1549–1560, 2024.

      Erjia Cui, Andrew Leroux, Ekaterina Smirnova, and Ciprian Crainiceanu. Fast univariate inference for longitudinal functional models. Journal of Computational and Graphical Statistics, 31:1–27, 2021. doi: 10.1080/10618600.2021.1950006.

      Huijeong Jeong, Annie Taylor, Joseph R Floeder, Martin Lohmann, Stefan Mihalas, Brenda Wu, Mingkang Zhou, Dennis A Burke, and Vijay Mohan K Namboodiri. Mesolimbic dopamine release conveys causal associations. Science, 378(6626):eabq6740, 2022. doi: 10.1126/science.abq6740. URL https://www.science.org/doi/abs/10.1126/science.abq6740.

      Rachel S Lee, Marcelo G Mattar, Nathan F Parker, Ilana B Witten, and Nathaniel D Daw. Reward prediction error does not explain movement selectivity in DMS-projecting dopamine neurons. eLife, 8:e42992, 2019. ISSN 2050-084X. doi: 10.7554/eLife.42992. URL https://doi.org/10.7554/eLife.42992.

      Fabian Scheipl, Jan Gertheiss, and Sonja Greven. Generalized functional additive mixed models. Electronic Journal of Statistics, 10(1):1455 – 1492, 2016. doi: 10.1214/16-EJS1145. URL https://doi.org/10.1214/16-EJS1145.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations for the authors):

      Because many conclusions are drawn from overexpression studies and from a single cell line (HEK293), it is unclear how general these effects are. In particular, one of the main claims put forth in this manuscript is that of specificity, namely, that FZD5/8, and none of the other FZDs, are uniquely involved in this internalization and degradation. While there are examples of similar specificities, many of these examples can be attributed to a particular cellular context. Without demonstrating that this FZD5/8 specificity is observed in multiple cell lines and contexts, this point remains unconvincing and questionable. One way to address this point of criticism is to omit the word "specifically" in the title and soften the language concerning this idea throughout the manuscript.

      We appreciate your valuable comments and suggestions. We have removed the word “specifically” from the title and softened the language concerning this idea throughout the manuscript. Moreover, we performed new experiments to show that Wnt3a/5a induces FZD5/8 endocytosis and degradation and that IWP-2 treatment increases the cell surface levels of FZD5/8 in cell lines other than 293A (Figure 1-Figure supplement 1 and Figure 2-Figure supplement 1). These results indicate that Wnt-induced FZD5/8 endocytosis and degradation are not cell specific.

      The starting point for these studies is a survey of all 10 FZDs, V5-tagged and overexpressed in HEK293 cells. Here, the authors observed a decline in cell surface levels of only FZD5 and 8 in response to Wnt3a and Wnt5a. As illustrated in the immunoblot (Fig 1B), several FZDs were poorly expressed, including FZD1, 3, 6 and 9, which calls into question that only FZD5 and 8 were affected. Furthermore, total levels of FZD8 don't diminish appreciably, as claimed by the authors, and only FZD5 shows a subtle decline upon WNT treatment. All of these experiments are performed with overexpressed V5-tagged FZD proteins or with endogenously V5-tagged (KI) proteins, and it is possible that overexpression or tagging lead to potentially artifactual observations. Examining the effects of WNTs on FZD protein localization and levels need to be done with endogenously expressed, non-tagged FZDs. In this context, it is somewhat puzzling that the authors don't show such an experiment using the pan- and FZD5/8-specific antibodies, which they use in multiple experiments throughout the manuscript. With these available tools it should be possible to examine FZD levels at the cell surface in response to Wnt3a and Wnt5a, ideally in multiple cell lines.

      We appreciate your valuable comments and suggestions. Figure 1B shows the results of the follow-up study shown in Figure 1A. As shown in Figure 1A, we used flow cytometry analysis to detect the cell surface levels of stably expressed FZDs and found that Wnt3a/5a specifically reduced the levels of FZD5/8 on the cell surface, suggesting that Wnt3a/5a induces FZD5/8 endocytosis. As shown in Figure 1B and C, we performed immunoblotting to examine whether Wnt3a/5a-induced FZD5/8 internalization resulted in FZD5/8 degradation. Notably, most FZDs exhibit two bands on immunoblots, as also suggested by other published studies, and the upper bands represent the mature form that is fully glycosylated and presented to the cell surface (see also new Figure 2L), whereas the lower bands represent the immature form. Our results clearly indicated that Wnt3a/5a treatment reduced the levels of the mature forms of both FZD5 and FZD8, although the immunoblotting signals of the mature form of FZD8 (upper bands) were relatively weak. The immunoblotting signals of the other FZDs varied, and some of them (including FZD1, -3, -6 and -9) were relatively weak; however, according to the results in Figure 1A, all of the FZDs were expressed and present on the cell surface.

      Commercially available FZD5/8 antibodies, including those used in published studies, cannot detect endogenous FZD5/8 or can only recognize immature FZD5 in our hands, which is why we had to use the CRISPR-Cas9-based KI technique to introduce a V5 tag to FZD5 and FZD7. Notably, in the overexpression experiments, the V5 tag is on the amino terminus, and in the KI experiments, the V5 tag is on the carboxyl terminus of FZDs, which may minimize the potential artifactual effects of the V5 tag on the immunoblotting assays.

      The monoclonal antibodies used in this study, such as anti-pan-FZD, anti-FZD5/8, and anti-FZD4 antibodies, are neutralizing antibodies that can compete with Wnt ligands to bind to the FZD CRD. These antibodies have been successfully used to detect the surface levels of FZDs via flow cytometry assays. However, as the binding affinity of the Wnt-FZD CRD is comparable to the binding affinity of the antibody-FZD, we were cautious in using these antibodies to detect the cell surface levels of FZDs when the cells were treated with Wnt3a/5a CM, which contains relatively high concentrations of Wnt3a/5a. As shown in Author response image 1, Wnt3a or Wnt5a treatment dramatically reduced the endogenous cell surface level of FZD5/8, as detected by flow cytometry using the anti-FZD5/8 antibody. However, in another experiment, HEK293A cells were first incubated with cold Wnt3a or Wnt5a CM at 4°C to minimize endocytosis and then analyzed via flow cytometry using the anti-FZD5/8 antibody. The results showed that Wnt3a/5a incubation reduced the flow cytometry signals, suggesting that Wnt3a/5a binding to FZD5/8 might interfere with antibody-FZD5/8 binding, although we cannot exclude the possibility that Wnt3a/5a may induce FZD5/8 endocytosis at 4°C (Author response image 1).

      Author response image 1.

      (A) HEK293A cells were treated with control, Wnt3a or Wnt5a CM for 2 hours at 37°C in a humidified incubator and were analyzed via flow cytometry using the anti-FZD5/8 antibody.

      (B) HEK293A cells were incubated with control, Wnt3a or Wnt5a CM for 1 h at 4°C and analyzed by flow cytometry using the anti-FZD5/8 antibody.

       

      Several experiments rely on gene-edited clonal cell lines, including knockouts of FZD5/8, RNF43/ZNRF3, and DVL. Gene knockouts were confirmed by genomic DNA sequencing and, for DVL and FZD5/8, by loss of protein expression. While these KO lines are powerful tools to study gene function, there is a concern for clonal variability. Each cell line may have acquired additional changes as a result of gene editing. In addition, there may be compensatory changes in gene expression as a consequence of the loss of certain genes. For example, expression of other FZDs may increase in FZD5/8 DKO cells. To address this critique, the authors should show that re-expression of the knocked-out genes rescues the observed effect. This is done in some instances (Fig 5E, G, H) but not in other instances, such as with the DVL TKO (Fig. 3). Since the authors assert that DVL is important for FZD internalization in the absence of WNT, but not for FZD internalization in the presence of WNT, this particular rescue experiment is important. This is a potentially important finding and it should be confirmed by re-expression of DVL in the TKO line. As an alternative, conditional knockdown using Tet-inducible shRNA expression could address concerns for clonal variability.

      We appreciate your valuable comments and suggestions. We re-expressed DVL2 in DVLTKO cells stably expressing V5-linker-FZD5 or V5-linker-FZD7. As shown in Figure 3G-K, re-expression of DVL2 rescued the decreased Wnt-independent endocytosis of FZD5 and FZD7 caused by DVL1/2/3 knockout.

      Given the significant differences in signaling activity by Wnt3a and Wnt5a, it is somewhat surprising that all experiments shown in this manuscript do not identify distinguishing features between Wnt3a and Wnt5a. In addition, it is unclear why the authors switch between Wnt3a and Wnt5a. For example, Figures 1C, 3G-J, 4C-D only use Wnt5a. In contrast, Figures 6E and H use Wnt3a, most likely because b-catenin stabilization is examined, an effect generally not observed with Wnt5a. The choice of which Wnt is examined/used appears to be somewhat arbitrary and the authors never provide any explanations for these choices. In the end, this type of inconsistency becomes puzzling when the authors present, quite convincingly, in Figure 7, that both Wnt3a and 5a promote an interaction between FZD5/8 and RNF43 through proximity biotin labeling.

      Although Wnt3a and Wnt5a are significantly different in triggering intracellular signaling pathways, both bind FZD5/8 and induce FZD5/8 endocytosis and degradation similarly. When FZD5 is stably overexpressed, Wnt5a has slightly stronger effects on inducing FZD5 endocytosis and degradation, possibly because the Wnt5a concentration may be higher than the Wnt3a concentration in our CM, which is why we used Wnt5a CM in some experiments when V5-FZD5 was overexpressed. In the revised manuscript, we used both Wnt3a and Wnt5a CM in the experiments as you suggested, as shown in Figure 1C, 3G-K and Figure 4-Figure supplement 1.

      Minor Points:

      Figure 3G and I: it is curious that individual cells are shown in the "0 h" samples, while the "Con 1 h" and "Wnt5a 1 h" show multiple cells with several making direct contact with each other. This is notable because the V5 staining at sites of cell-cell contact are quite distinct and variable between control and Wnt5a-treated and WT versus DVL TKO cells. Also, sub-cellular localization of FZD5 (V5 tag) puncta is quite distinct between Con and Wnt5a: puncta in Wnt5a-treated cells appear to be more plasma membrane proximal than in Con cells. These points may be easy to address by showing images of cells that are more similar with respect to cell number and density for each condition.

      Thank you for your suggestions. We repeated these experiments and added Wnt3a treatment and adjusted the cell density. Images including an individual cell were selected for presentation.

      Figure 5E: the following statement is confusing/misleading: "Furthermore, reintroducing ZNRF3 or RNF43 into ZRDKO cells efficiently restored the increase in cytosolic β-catenin levels, whereas the expression of RNF130 or RNF150, two structurally similar transmembrane E3 ubiquitin ligases, did not (Fig. 5E)." First, reintroduction of ZNRF3 or RNF43 restores cytosolic b-catenin levels; it does not restore the increase in b-catenin. Second, the claim that RNF130 fails to have this effect is not substantiated since it is barely expressed.

      Thank you for your suggestions and comments. We reorganized the language to make the statement clearer. Notably, the expression level of RNF130 was relatively low compared with that of other E3 ligases, but RNF130 was expressed (Figure 5E darker exposure) and could reduce the cell surface levels of FZDs, as shown in Figure 5G.

      Reviewer #2 (Recommendations for the authors):

      (1) Given their results the authors conclude that upregulation of Frizzled on the plasma membrane is not sufficient to explain the stabilization of beta-catenin seen in the ZNRF3/RNF43 mutant cells. This interpretation is sound, and they suggest in the discussion that ZNRF3/RNF43-mediated ubiquitination could serve as a sorting signal to sort endocytosed FZD to lysosomes for degradation and that absence or inhibition of this process would promote FZD recycling. This should be relatively easy to test using surface biotinylation experiments and would considerably strengthen the manuscript.

      Thank you for your valuable suggestions and comments. We performed cell surface biotinylation experiments in HEK293A FZD5KI cells, as shown in Figure 2L. The results indicated that Wnt3a or Wnt5a treatment induced the degradation of FZD5 on the cell surface, which was antagonized by cotreatment with RSPO1. We did not perform a more detailed endocytosis/recycling biotinylation experiment that requires complex reversible biotinylation and multiple washing steps because HEK293A cells are fragile in culture and not easy to handle. Furthermore, the results shown in Figure 4 indicate that knockout of ZNRF3/RNF43 or RSPO1 significantly blocked the degradation of internalized FZD5 and reduced the colocalization of internalized FZD5 with lysosomal markers, suggesting that Wnt3a/5a induced lysosomal degradation of FZD5 in the presence of ZNRF3/RNF43 and that the internalized FZD5 was most likely recycled back to the cell surface when ZNRF3/RNF43 was knocked out or inhibited by RSPO1.

      (2) The authors show that the FZD5 CRD domain is required for endocytosis since a mutant FZD5 protein in which the CRD is removed does not undergo endocytosis. This is perhaps not surprising since this is the site of Wnt binding, but the authors show that a chimeric FZD5CRD-FZD4 receptor can confer Wnt-dependent endocytosis to an otherwise endocytosis incompetent FZD4 protein. Since the linker region between the CRD and the first TM differs between FZD5 and FZD4, it would be interesting to understand whether the CRD specifically or the overall arrangement (such as the spacing) is the most important determinant.

      Our results in Figure 1D-H clearly show that the CRD of FZD5 specifically is both necessary and sufficient for Wnt3a/5a-induced FZD5 endocytosis, as replacing the CRD alone in FZD5 with the CRD from either FZD4 or FZD7 completely abolished Wnt-induced endocytosis, whereas replacing the CRD alone in FZD4 or FZD7 with the FZD5 CRD alone could confer Wnt-induced endocytosis.

      (3) I find it surprising that only FZD5 and FZD8 appear to undergo endocytosis or be stabilized at the cell surface upon ZNRF3/RNF43 knockout. Is this consistent with previous literature? Is that a cell-specific feature? These findings should be tested in a different cell line, with possibly different relative levels of ZNRF3 and RNF43 expression.

      Thank you for your comments and suggestions. Our finding that ZNRF3/RNF43 specifically regulates FZD5/8 degradation is consistent with recently published studies in which FZD5 is required for the survival of RNF43-mutant PDAC or colorectal cancer cells (Nature Medicine, 2017, PMID: 27869803) and FZD5 is required for the maintenance of intestinal stem cells (Developmental Cell, 2024, PMID: 39579768 and 39579769), and in both cases, FZDs other than FZD5/8 are also expressed but not sufficient to compensate for the function of FZD5. The mechanism by which Wnt3a/5a specifically induces FZD5/8 endocytosis and degradation is currently unknown and needs to be explored in the future. We speculate that Wnt binding to FZD5/8 may recruit another protein on the cell surface to specifically facilitate FZD5/8 endocytosis. On the other hand, we cannot exclude the possibility that Wnts other than Wnt3a/5a may induce the endocytosis and degradation of FZDs other than FZD5/8 since there are 19 Wnts and 10 FZDs in humans. Notably, several previous studies have suggested that ZNRF3/RNF43 may regulate the endocytosis and degradation of all FZDs without selectivity (such as Nature, 2012, PMID: 22575959; Nature, 2012, PMID: 22895187; Mol Cell, 2015, PMID: 25891077). However, their conclusions were drawn mostly on the basis of overexpression studies. According to the results shown in Figure 5E-H, overexpressing a membrane-tethered E3 ligase (such as ZNRF3, RNF43, RNF130, or RNF150) may nonspecifically degrade FZD proteins on the cell surface.

      Furthermore, in the revised manuscript, we showed that Wnt3a/5a induced FZD5/8 endocytosis and degradation in multiple cell lines, including Huh7, U2OS, MCF7, and 769P cells (Figure 1-Figure supplement 1 and Figure 2-Figure supplement 1), suggesting that these phenomena are not specific to 293A cells.

      (4) If FZD7 is not a substrate of ZNRF3/RNF43 and therefore is not ubiquitinated and degraded, how do the authors reconcile that its overexpression does not lead to elevated cytosolic beta-catenin levels in Figure 5B?

      We are currently not sure of the mechanism underlying this result. Considering that most FZDs are expressed in 293A cells, we do not know how much of the mature form of overexpressed FZD7 was presented to the plasma membrane.

      (5) For Figure 5B, it would be interesting if the authors could evaluate whether overexpression of FZD5 in the ZNRF3/RNF43 double knockout lines would synergize and lead to further increase in cytosolic beta-catenin levels. As control if the substrate selectivity is clear FZD7 overexpression in that line should not do anything.

      Thank you for your suggestion. We performed these experiments as suggested, and the results indicated that overexpressing FZD5 further increased cytosolic beta-catenin levels in ZRDKO cells, whereas FZD7 had no effect (Figure 6D).

      (6) In Figure 6G, the authors need to show cytosolic levels of beta-catenin in the absence of Wnt in all cases.

      We did not add Wnt CM in this experiment. RSPO1 activity, which relies on endogenous Wnt, has been well documented in previous studies.

      (7) Since the authors show that DVL is not involved in the Wnt- and ZNRF3-dependent endocytosis, they should repeat the proximity biotinylation experiment in Figure 7 in the DVL triple KO cells. This is an important experiment since previous studies showed that DVL was required for the ZNRF3/RNF43-mediated ubiquitination of FZD.

      Thank you for your valuable suggestions. As you suggested, we performed a proximity biotinylation experiment in DVL TKO cells, and the results showed that Wnt3a/5a could still induce the interaction of FZD5 and RNF43 in DVLTKO cells (Figure 7-figure supplement 1), suggesting that the Wnt-induced FZD5‒RNF43 interaction is DVL independent.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This important study elucidates the molecular divergence of caspase 3 and 7 in the vertebrate lineage. Convincing biochemical and mutational data provide evidence that in humans, caspase 7 has lost the ability to cleave gasdermin E due to changes in a key residue, S234. However, the physiological relevance of the findings is incomplete and requires further experimental work.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary

      In this study, Xu et al. provide insights into the substrate divergence of CASP3 and CASP7 for GSDME cleavage and activation during vertebrate evolution. Using biochemical assays, domain swapping, site-directed mutagenesis, and bioinformatics tools, the authors demonstrate that the human GSDME C-terminal region and the S234 residue of human CASP7 are the key determinants that impede the cleavage of human GSDME by human CASP7.

      Strengths

      The authors made an important contribution to the field by demonstrating how human CASP7 has functionally diverged to lose the ability to cleave GSDME and showing that reverse-mutations in CASP7 can restore GSDME cleavage. The use of multiple methods to support their conclusions strengthens the authors' findings. The unbiased mutagenesis screen performed to identify S234 in huCASP7 as the determinant of its GSDME cleavability is also a strength.

      Weaknesses

      While the authors utilized an in-depth experimental setup to understand the CASP7-mediated GSDME cleavage across evolution, the physiological relevance of their findings is not assessed in detail. Additional methodology information should also be provided.

      Specific recommendations for the authors

      (1) The authors should expand their evaluation of the physiological relevance by assessing GSDME cleavage by the human CASP7 S234N mutant in response to triggers such as etoposide or VSV, which are known to induce CASP3 to cleave GSDME (PMID: 28045099). The authors could also test whether the human CASP7 S234N mutation affects substrate preference beyond human GSDME by testing cleavage of mouse GSDME and other CASP3 and CASP7 substrates in this mutant.

      (1) The physiological relevance was discussed in the revised manuscript (lines 328-340). Our study revealed the molecular mechanism underlying the divergence of CASP3- and CASP7-mediated GSDME activation in vertebrates. One of the physiological consequences is that in humans, CASP7 no longer directly participates in GSDME-mediated cell death, which enables CASP7 to be engaged in other cellular processes. Another physiological consequence is that GSDME activation is limited to CASP3 cleavage, thus restricting GSDME activity to more specific situations, such as those that induce CASP3 activation. The divergence and specialization of the physiological functions of different CASPs are consistent with and possibly conducive to the development of refined regulations of the sophisticated human GSDM pathways, which are executed by multiple GSDM members (A, B, C, D, and E), rather than by GSDME solely as in teleosts, such as Takifugu. More physiological consequences of CASP3/7 divergence in GSDME activation need to be explored in future studies.

      With respect to the reviewer’s suggestion of assessing GSDME cleavage by the human CASP7 S234N mutant in response to triggers such as etoposide or VSV: (i) CASP7 S234N is a creation of our study, not a natural human product, hence its response to CASP7 triggers cannot happen under normal physiological conditions except in the case of application, such as medical application, which is not the aim of our study. (ii) CASP3/7 activators (such as raptinal) induced robust activation of the endogenous CASP3 (Heimer et al., Cell Death Dis. 2019;10:556) and CASP7 (Author response image 1, below) in human cells. Since CASP3 is the natural activator of GSDME, the presence of the triggers inevitably activates GSDME via CASP3. Hence, under this condition, it will be difficult to examine the effect of CASP7 S234N.

      Author response image 1.

      HsCASP7 activation by raptinal. HEK293T cells were transfected with the empty vector (-), or the vector expressing HsCASP7 or HsCASP7-S234N for 24 h. The cells were then treated with or without (control) 5 μM raptinal for 4 h. The cells were lysed, and the lysates were blotted with anti-CASP7 antibody.

      (2) As suggested by the reviewer, the cleavage of other CASP7 substrates, i.e., poly (ADP-ribose) polymerase 1 (PARP1) and gelsolin, by HsCASP7 and S234N mutant was determined. The results showed that HsCASP7 and HsCASP7-S234N exhibited similar cleavage capacities. Figure 5-figure supplement 1 and lines 212-214.

      (2) It would also be interesting to examine the GSDME structure in different species to gain insight into the nature of mouse GSDME, which cannot be cleaved by either mouse or human CASP7.

      Because the three-dimensional structure of GSDME has not been solved, we are unable to explore the structural mechanism underlying GSDME cleavage by caspases. Since our results showed that the C-terminal domain was essential for caspase-mediated cleavage of GSDME, it is likely that the C-terminal domain of mouse GSDME possesses some specific features that render it resistant to mouse and human CASP7.

      (3) The evolutionary analysis does not explain why mammalian CASP7 evolved independently to acquire an amino acid change (N234 to S234) in the substrate-binding motif. Since it is difficult to experimentally identify why a functional divergence occurs, it would be beneficial for the authors to speculate on how CASP7 may have acquired functional divergence in mammals; potentially this occurred because of functional redundancies in cell death pathways, for example.

      According to the reviewer’s suggestion, a speculation was added. Lines 328-340.

      (4) For the recombinant proteins produced for these analyses, it would be helpful to know whether size-exclusion chromatography was used to purify these proteins and whether these purified proteins are soluble. Additionally, the SDS-PAGE in Figure S1B and C show multiple bands for recombinant mutants of TrCASP7 and HsCASP7. Performing protein ID to confirm that the detected bands belong to the respective proteins would be beneficial.

      The recombinant proteins in this study are soluble and purified by Ni-NTA affinity chromatography. Size-exclusion chromatography was not used in protein purification.

      For the SDS-PAGE in Figure 4-figure supplement 1B and C (Figure S1B and C in the previous submission), the multiple bands are most likely due to the activation cleavage of the TrCASP7 and HsCASP7 variants, which can result in multiple bands, including p10 and p20. According to the reviewer’s suggestion, the cleaved p10 was verified by immunoblotting. Figure 4-figure supplement 1B and C.

      (5) For Figures 3C and 4A, it would be helpful to mention what parameters or PDB files were used to attribute these secondary structural features to the proteins. In particular, in Figure 3C, residues 261-266 are displayed as a β-strand; however, the well-known α-model represents this region as a loop. Providing the parameters used for these callouts could explain this difference.

      For Figure 3C, in the revised manuscript, we used the structure of mouse GSDMA3 (PDB: 5b5r) for the structural analysis of HsGSDME. As indicated by the reviewer, the region of 261-266 is a loop. The description was revised in lines 172 and 174, Figure 3C and Figure 3C legend.

      For Figure 4A, the alignment of CASP7 was constructed by using ESPript (https://espript.ibcp.fr/ESPript/cgi-bin/ESPript.cgi) with human CASP7 (PDB: 1k86) as the template. The description was revised in the Figure legend.

      (6) Were divergent sequences selected for the sequence alignment analyses (particularly in Figure 6A)? The selection of sequences can directly influence the outcome of the amino acid residues in each position, and using diverse sequences can reduce the impact of the number of sequences on the LOGO in each phylogenetic group.

      In Figure 6A, the sequences were selected without bias. For Mammalia, 45 CASP3 and 43 CASP7 were selected; for Aves, 41 CASP3 and 52 CASP7 were selected; for Reptilia, 31 CASP3 and 39 CASP7 were selected; for Amphibia, 11 CASP3 and 12 CASP7 were selected; for Osteichthyes, 40 CASP3 and 43 CASP7 were selected. The sequence information is shown in Table 1 and Table 2.
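
      The reviewer's point about sequence selection can be illustrated with a minimal sketch of how a LOGO column is derived: each position is summarized by the residue frequencies among the aligned sequences of a group, so the composition of the group directly shapes the result. The function name `column_frequencies` and the toy aligned strings below are hypothetical placeholders, not real caspase sequences and not the pipeline used for Figure 6A.

```python
from collections import Counter


def column_frequencies(aligned_seqs, column):
    """Return residue frequencies at one alignment column, ignoring gaps."""
    residues = [seq[column] for seq in aligned_seqs if seq[column] != "-"]
    total = len(residues)
    if total == 0:
        return {}  # column contains only gaps
    counts = Counter(residues)
    return {residue: n / total for residue, n in counts.items()}


# Hypothetical toy alignment (illustrative only, not real data).
toy_group = ["ACNT", "ACNT", "ACST", "AC-T"]
freqs = column_frequencies(toy_group, 2)
print(freqs)
```

      Adding or removing near-identical sequences from `toy_group` shifts these frequencies, which is why the number and diversity of sequences per phylogenetic group matter when comparing LOGOs.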

      (7) For clarity, it would help if the authors provided additional rationale for the selection of residues for mutagenesis, such as selecting Q276, D278, and H283 as exosite residues, when the CASP7 PDB structures (4jr2, 3ibf, and 1k86) suggest that these residues are enriched with loop elements rather than the β sheets expected to facilitate substrate recognition in exosites for caspases (PMID: 32109412). It is possible that the inability to form β-sheets around these positions might indicate the absence of an exosite in CASP7, which further supports the functional effect of the exosite mutations performed.

      According to the suggestion, the rationale for the selection of residues for mutagenesis was added (lines 216-222). Unlike the exosite in HsCASP1/4, which is located in a β sheet, the Q276, D278, and H283 of HsCASP7 are located in a loop region (Figure 5-figure supplement 2), which may explain the mutation results and the absence of an exosite in HsCASP7 as suggested by the reviewer.

      Reviewer #2 (Public Review):

      The authors wanted to address the differential processing of GSDME by caspase 3 and 7, finding that while in humans GSDME is only processed by CASP3, Takifugu GSDME and other mammalian GSDMEs can be processed by CASP3 and 7. This is due to a change in a residue in the human CASP7 active site that abrogates GSDME cleavage. This phenomenon is present in humans and other primates, but not in other mammals such as cats or rodents. This study sheds light on the evolutionary changes inside CASP7, using sequences from different species. Although the study is somewhat interesting and elegantly provides strong evidence of this observation, it lacks the physiological relevance of this finding, i.e., in humans, mice, and fish, what are the consequences of CASP3/7 vs CASP3 cleavage of GSDME?

      Our study revealed the molecular mechanism underlying the divergence of CASP3- and CASP7-mediated GSDME activation in vertebrates. One of the physiological consequences is that in humans, CASP7 no longer directly participates in GSDME-mediated cell death, which enables CASP7 to be engaged in other cellular processes. Another physiological consequence is that GSDME activation is limited to CASP3 cleavage, thus restricting GSDME activity to more specific situations, such as those that induce CASP3 activation. The divergence and specialization of the physiological functions of different CASPs are consistent with and possibly conducive to the development of refined regulations of the sophisticated human GSDM pathways, which are executed by multiple GSDM members (A, B, C, D, and E), rather than by GSDME solely as in teleosts, such as Takifugu. More physiological consequences of CASP3/7 divergence in GSDME activation need to be explored in future studies. Lines 328-340.

      Fish also present a duplication of GSDME gene and Takifugu present GSDMEa and GSDMEb. It is not clear in the whole study if when referring to TrGSDME is the a or b. This should be stated in the text and discussed in the differential function of both GSDME in fish physiology (i.e. PMIDs: 34252476, 32111733 or 36685536).

      The TrGSDME used in this study belongs to the GSDMEa lineage of teleost GSDME. The relevant information was added. Figure 1-figure supplement 1 and lines 119, 271, 274-276, 287 and 288.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) For the chimeric and truncated constructs, such as HsNT-TrCT, TrNT-HsCT, Hsp20-Trp10, Trp20-Hsp10, etc., the authors should provide a table denoting which amino acids were taken from each protein to create the fusion or truncation.

      According to the reviewer’s suggestion, the information of the truncate/chimeric proteins was provided in Table 4.

      (2) Both reviewers agree that functional physiological experiments are needed to increase the significance of the work. Specifically, the physiological relevance of these findings can be assessed by using western blotting to monitor GSDME cleavage by the human CASP7 S234N mutant compared with wild type CASP7 in response to triggers such as etoposide or VSV, which are known to induce CASP3 to cleave GSDME (PMID: 28045099).

      Additionally, the authors can assess cell death in HEK293 cells, HEK293 cells transfected with TrGSDME, HEK293 cells expressing TrCASP3/7 plus TrGSDME, and TrCASP3/7 plus the D255R/D258A mutant. These cells can be stimulated, and pyroptosis can be assessed by using ELISA to measure the release of the cytoplasmic enzyme LDH as well as IL-1β and IL-18, and the percentage of cell death (PI+ positive cells) may also be assessed.

      (1) With respect to the physiological relevance, please see the above reply to Reviewer 1’s comment of “Specific recommendations for the authors, 1”.

      (2) As shown in our results (Fig. 2), co-expression of TrCASP3/7 and TrGSDME in HEK293T cells induced robust cell death without the need of any stimulation, as evidenced by LDH release and TrGSDME cleavage. In the revised manuscript, similar experiments were performed as suggested, and cell death was assessed by Sytox Green staining (Figure 2-figure supplement 3A and B) and immunoblot to detect the cleavage of both wild type and mutant TrGSDME (Figure 2-figure supplement 3C). The results confirmed the results of Figure 2.

      Reviewer #2 (Recommendations For The Authors):

      Abstract:

      Although the authors try to summarize the principal results of this study, please rewrite the abstract section to make it easier to follow and to emphasize the implications of their results.

      We have modified the Abstract as suggested by the reviewer.

      Introduction:

      The authors do not mention anything about the role of inflammasome activation in triggering pyroptosis through GSDM cleavage by inflammatory caspases. Please consider including this in the introduction section as they do in the discussion section.

      The introduction was modified according to the reviewer’s suggestion. Lines 58-61.

      From the results section the authors name the human GSDM as HsGSDM and the human CASP as HsCASP, maybe the author could use the same nomenclature in the introduction section. The same for the fish GSDM (Tr) and CASP.

      According to the reviewer’s suggestion, the same nomenclature was used in the introduction.

      Line 39. Remove the word necrotic.

      “necrotic” was removed.

      Line 42. Change channels by pores. In the manuscript, change channels by pores overall.

      “channels” was replaced by “pores”.

      Line 42: Include that proinflammatory cytokines can be released through these pores and that pyroptosis occurs if these pores are not resolved. Please rephrase this statement.

      According to the reviewer's suggestion, the sentence was rephrased. Lines 46-48.

      Line 45. GSDMF is not an approved gene name, its official nomenclature is PJVK (Uniprot Q0ZLH3). Please use PJVK instead of GSDMF.

      GSDMF was changed to PJVK.

      Line 103: Can the authors explain better the molecular determinant?

      The sentence was revised, line 109.

      Results:

      Line 110: Reference for this statement.

      The reference for this statement was added in line 116.

      Figure 1A, B: Concentration or units used of HsCASP?

      The unit (1 U) of HsCASPs was added to the figure legend (line 661).

      Line 113: Add Hs or Tr after CASP would be helpful to follow the story.

      “CASP” was changed to “HsCASP”.

      Fig 1D: Why do the authors not use the DMPD tetrapeptide (HsGSDME CASP3 cut site) in this assay? Compared with the data obtained in Fig 3B, the TrCASP3 activity is going to be very close to that obtained for VEID or VDQQD in the CASP3 panel.

      The purpose of Figure 1D was to determine the cleavage preference of TrCASPs. For this purpose, a series of commercially available CASP substrates were used, including DEVD, which is commonly used as a testing substrate for CASP3. Figure 3B was to compare the cleavage of HsCASP3/7 and TrCASP3/7 specifically against the motifs from TrGSDME (DAVD) and HsGSDME (DMPD).

      Figure 1D and Figure 3B are different experiments and were performed under different conditions. In Figure 1D, CASP3 was incubated with the commercial substrates at 37 ℃ for 2 h, while in Figure 3B, CASP3/7 were incubated with non-commercial DAVD (motif from TrGSDME) and DMPD (motif from HsGSDME) at 37 ℃ for 30 min. More experimental details were added to Materials and Methods, lines 443 and 447.

      Fig 1H: What is the concentration used of the inhibitors?

      The concentration (20 μM) was added to the figure legend (line 669).

      Does HsCASP3/7 fail to cleave the TrGSDME mutants (D255R and D258A)? The authors do not show this result, so they cannot assume that HsCASP3/7 cleaves that sequence (although this is to be expected).

      The result of HsCASP3/7 cleavage of the TrGSDME mutants was added as Figure 1-figure supplement 2 and described in Results, line 133.

      Line 132-133: Can the author specify where is placed the mCherry tag? In the N terminal or C terminal portion of the different engineered proteins?

      The mCherry tag is attached to the C-terminus. Figure 2 legend (line 676).

      Fig 2A: Although is quite clear, a column histogram showing the quantification is going to be helpful.

      The expression of TrGSDME-FL, -NT and -CT was determined by Western blot, and the result was added as Figure 2-figure supplement 1.

      Fig 2A, B, C: After how many hours of expression are the pictures taken? Can the authors show a Western blot showing that the expression of the different constructions is similar?

      The time was added to Figure 2 legend and Materials and Methods (line 466). The expression of TrGSDME-FL, -NT and -CT was determined by Western blot, and the result was added as Figure 2-figure supplement 1.

      Fig 2C: Another helpful assay can be to measure the YO-PRO or another small dye internalization, to complete the LDH data.

      According to the reviewer’s suggestion, in addition to LDH release, Sytox Green was also used to detect cell death. The result was added as Figure 2-figure supplement 2 and described in Results, line 146.

      Fig 2C: On the y axis of the figure, change LHD to LDH.

      The word was corrected.

      Fig 2D: Change HKE293T by HEK293T in the caption.

      The word was corrected.

      Fig 2G: Please add the concentration used with the two plasmids co-transfection. A Western blot showing CASP3/7 expression vs TrGSDME is missing. Is that assay after 24h? please specify better the methodology.

      The concentration of plasmid used in co-transfection and the time post transfection were added to the Materials and Methods (lines 422 and 424). In addition, the expression of CASP3/7 was added to Figure 2I.

      Fig 2 J, K: Change HKE293T by HEK293T in the figure caption. The concentration of the caspase inhibitors is missing. Depending on the concentration used, these inhibitors used could provoke toxicity on the cells by themselves.

      The word was corrected in the figure caption. The inhibitor concentration (10 μM) was added to the figure legend (line 690).

      Line 151: TrCASP3/7 instead of CASP3/7

      CASP3/7 was changed to TrCASP3/7.

      Fig 3A, 3B: Please add the units used of the HsCASP

      The unit was added to the figure legends (lines 697).

      Fig 3A: Can the authors add the SDS-PAGE to see the Nt terminal portion as has been done in Fig 1A? Maybe in a supplementary figure.

      The SDS-PAGE was added as Figure 3-figure supplement 1.

      Fig 3B: If the authors could add some data about the caspase activity using any other CASP such as CASP2, CASP1 to compare the activity data with CASP3 and CASP7 would be helpful.

      The proteolytic activity of TrCASP1 was provided as Figure 3-figure supplement 2.

      Fig 3C: To state this (Line 160), the authors should use another prediction software to reach a consensus with the sequences of the first analysis. In fact, what happens when GSDME is modelled 3-dimensionally by comparing it to crystalized structures such as mouse GSDMA? If the authors add an arrow indicating where the Nt terminal portion ends and where Ct portion begins would make the figure clearer.

      According to the suggestions of both reviewers, in the revised manuscript, we used mouse GSDMA3 (PDB: 5b5r) for the structural analysis of HsGSDME, which showed that the 261-266 region of HsGSDME was a loop. As a result, Figure 3C was revised. Relevant change in Results: lines 172 and 174.

      As suggested by the reviewer, we modelled the three-dimensional structure of HsGSDME by using SWISS-MODEL with mouse GSDMA3 as the template (Author response image 2, below).

      Author response image 2.

      The three-dimensional structure model of HsGSDME. (A) The structure of HsGSDME was modeled by using mouse GSDMA3 (MmGSDMA3) as the template. The N-terminal domain (1-246 aa) and the C-terminal domain (279-468 aa) of HsGSDME are shown in red and blue, respectively. (B) The superposed structure of HsGSDME (cyan) and MmGSDMA3 (purple).

      Fig 3F: If this is an immunoblot, why can the NT be seen? In other Western blots only the CT is detected. Why? The use of the TrGSDME mouse polyclonal antibody needs more details (is it a purified Ab, was it produced for this study, what dilutions were used...)

      Since the anti-TrGSDME antibody was generated using the full-length TrGSDME, it reacted with both the N-terminal and the C-terminal fragments of TrGSDME in Figure 3F. In Figure 3G, the GSDME chimera contained only TrGSDME-CT, so only the CT fragment was detected by anti-TrGSDME antibody. More information on antibody preparation and immunoblot was added to “Materials and Methods” (lines 390 and 391).

      Fig 4B: Can the authors show at which amino acid the p20 ends for each CASP? (Similarly to what they have done in panel 3E)

      Fig 4B was revised as suggested.

      Fig 5F: With 4 units of WT CASP7 the authors show HsGSDME Ct in the same proportion as when the S234N mutant is used (at lower concentrations). How do the authors explain this?

      The result showed that cleavage by 4U of HsCASP7 was comparable to cleavage by 0.25U of HsCASP7-S234N, indicating that the S234N mutation increased the cleavage ability of HsCASP7 approximately 16-fold.

      Line 203: Can the authors show an alignment between this region of casp1/4 and 7? Maybe in supplementary figures.

      As reported by Wang et al. (PMID: 32109412), the βIII/βIII’ sheet of CASP1/4 forms the exosite critical for GSDMD recognition. The structural comparison among HsCASP1/4/7 and the sequence alignment of the HsCASP1/4 βIII/βIII’ region with its corresponding region in HsCASP7 were added as Figure 5-figure supplement 2.

      Line 205: A mutation including S234N with the exosite mutations (S234+Q276W+D278E+H283S) is required to support this statement.

      The sentence “suggesting that, unlike human GSDMD, HsGSDME cleavage by CASPs probably did not involve exosite interaction” was deleted in the revised manuscript.

      Fig 5I, 5J: What is the amount of HsGSDME and TrGSDME? I would place these figures in supplementary material.

      The protein expression of TrGSDME/HsGSDME is shown in the figure. Fig 5I and 5J were moved to Figure 5-figure supplement 3.

      Line 218: I would specify that this importance is for HUMAN CASP7 to cleave human GSDME.

      “CASP7” and “GSDME” were changed to “HsCASP7” and “HsGSDME”, respectively.

      Fig 6C: 4 units is the amount of S234N mutant needed to see an optimal HsGSDME cleavage in Fig 5F.

      In Figure 6C, the cleavage efficacy of HsCASP3-N208S was apparently decreased compared to that of HsCASP3, and 4U of HsCASP3-N208S was roughly equivalent to 1U of HsCASP3 in cleavage efficacy. In Figure 5F, cleavage by 4U of HsCASP7 was comparable to cleavage by 0.25U of HsCASP7-S234N. Together, these results confirmed the critical role of S234/N208 in HsCASP3/7 cleavage of HsGSDME.

      Fig 6I: Could the fact that mouse GSDME has a longer Ct than human GSDME affect the interaction with CASP7, for example by making the cleavage site less accessible? A positive control of mouse GSDME with mouse caspase-3 is needed.
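      The fold differences quoted above follow from the ratio of enzyme units that give comparable cleavage. A minimal sketch of that arithmetic is given below; the function name `fold_change` is illustrative only, and the unit values are the ones stated in this response, read from the blots by eye rather than from a quantitative fit.

```python
def fold_change(units_a: float, units_b: float) -> float:
    """Return how many times more enzyme units A needs than B for comparable cleavage."""
    if units_a <= 0 or units_b <= 0:
        raise ValueError("enzyme units must be positive")
    return units_a / units_b

# Unit values are those quoted in the response above.
casp7_gain = fold_change(4.0, 0.25)  # HsCASP7 vs HsCASP7-S234N
casp3_loss = fold_change(4.0, 1.0)   # HsCASP3-N208S vs HsCASP3

print(f"S234N: ~{casp7_gain:g}-fold more efficient than HsCASP7")
print(f"N208S: ~{casp3_loss:g}-fold less efficient than HsCASP3")
```

      Because the comparison rests on band intensities judged to be "comparable", these ratios are best read as approximate.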

      Although mouse GSDME (MmGSDME) (512 aa) is larger than HsGSDME (496 aa), the length of the C-terminal domain of MmGSDME (186 aa) is comparable to that of HsGSDME (190 aa).

      Author response image 3.

      Conserved domain analysis of mouse (upper) and human (lower) GSDME.
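      The length comparison above can be restated as the share of each protein occupied by its C-terminal domain. The sketch below uses only the residue counts given in this response; the dictionary layout and the helper name `ct_fraction` are illustrative assumptions.

```python
# (total length in aa, C-terminal domain length in aa), as stated in the response
GSDME_LENGTHS = {
    "MmGSDME": (512, 186),
    "HsGSDME": (496, 190),
}

def ct_fraction(total_aa: int, ct_aa: int) -> float:
    """Fraction of the full-length protein made up by the C-terminal domain."""
    return ct_aa / total_aa

for name, (total_aa, ct_aa) in GSDME_LENGTHS.items():
    share = ct_fraction(total_aa, ct_aa)
    print(f"{name}: CT domain {ct_aa} aa of {total_aa} aa ({share:.1%})")

# Absolute difference in C-terminal domain length between the two orthologs
ct_difference = abs(GSDME_LENGTHS["MmGSDME"][1] - GSDME_LENGTHS["HsGSDME"][1])
print(f"CT length difference: {ct_difference} aa")
```

      On these numbers the mouse C-terminal domain is slightly shorter than the human one, which matches the point that the larger size of MmGSDME does not come from a longer C-terminal domain.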

      As suggested by the reviewer, the cleavage of MmGSDME by mouse caspase-3 (MmCASP3) was added as Figure 6-figure supplement 2 and described in Results, line 258.

      Material and Methods:

      -Overall, concentrations or amounts used in this study regarding the active enzyme or plasmids used are missing and need to be added.

      The missing concentrations of the enzymes and plasmids were added in Material and Methods (lines 421, 453, 457, and 470) or figure legends (Figure 1 and 3).

      -It would be helpful if the authors labelled, in the immunoblotting panels, which GSDME they are using (Hs GSDME FL...).

      As suggested, the labels were added to Figures 1A, 1B, and 3.

      -Add the units of enzyme used.

      The units of enzyme were added to figure legends (Figure 1A, 3A, 3D, and 3F) or Material and Methods (lines 453 and 457).

      The GSDME sequence obtained for Takifugu after amplification of the RNA extracted should be shown and specified (GSDMEa or GSDMEb). From which tissue was the RNA extracted?

      The details were added to Materials and Methods (lines 398 and 402).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This work made considerable efforts to explore the multifaceted roles of the inferior colliculus (IC) in auditory processing, extending beyond traditional sensory encoding. The authors recorded neuronal activity from the IC at the single-unit level when monkeys were passively exposed or actively engaged in a behavioral task. They concluded that 1) IC neurons showed sustained firing patterns related to sound duration, indicating their roles in temporal perception, 2) IC neuronal firing rates increased as sound sequences progressed, reflecting modulation by behavioral context rather than reward anticipation, 3) IC neurons encode reward prediction error and are capable of adjusting responses based on reward predictability, 4) IC neural activity correlates with decision-making. In summary, this study tried to provide a new perspective on IC functions by exploring its roles in sensory prediction and reward processing, which are not traditionally associated with this structure.

      Strengths:

      The major strength of this work is that the authors performed electrophysiological recordings from the IC of behaving monkeys. Compared with the auditory cortex and thalamus, the IC in monkeys has not been adequately explored.

      We appreciate the reviewer’s acknowledgment of the efforts and strengths of our study. Indeed, our goal was to provide a comprehensive exploration of the multifaceted roles of the inferior colliculus (IC) in auditory processing and beyond, particularly in sensory prediction and reward processing. The use of electrophysiological recordings in behaving monkeys was central to our approach, as we sought to uncover the underexplored aspects of IC function in these complex cognitive domains. We are pleased that the reviewer recognizes the value of investigating the IC, a structure that has not been adequately explored in primates compared to other auditory regions like the cortex and thalamus. This feedback reinforces our belief that our work contributes significantly to advancing the understanding of the IC's roles in cognitive processing.

      We look forward to addressing any further points the reviewers may have and refining our manuscript accordingly. Thank you for your constructive feedback and for recognizing the strengths of our research approach.

      Weaknesses:

      (1) The authors cited several papers focusing on dopaminergic inputs in the IC to suggest the involvement of this brain region in cognitive functions. However, all of the cited work was done in rodents. Whether the monkey IC shares similar inputs is not clear.

      We appreciate the reviewer's insightful comment on the limitations of extrapolating findings from rodent models to monkeys, particularly concerning dopaminergic inputs to the Inferior Colliculus (IC). The studies of dopaminergic inputs to the IC that we cited were indeed conducted in rodents, and to our knowledge no comparable studies have been conducted in primates. To address the reviewer's concern, we have added a statement in both the introduction and discussion sections of our manuscript:

      • Introduction: "However, these studies were conducted in rodents, and the existence and role of dopaminergic inputs in the primate IC remain underexplored." (P.5, Line. 16-17)

      • Discussion: "However, the exact mechanisms and functions of dopamine modulation in the inferior colliculus are still not fully understood, particularly in primates." (P.21, Line. 7-9)

      (2) The authors confused the two terms, novelty and deviation. According to their behavioral paradigm, deviation rather than novelty should be used in the paper because all the stimuli have been presented to the monkeys during training. Therefore, there are actually no novel stimuli but only deviant stimuli. This reflects that the authors have misunderstood the basic concept.

      We appreciate the reviewer's clarification regarding the distinction between "novelty" and "deviation" in the context of our behavioral paradigm. We agree that, given the nature of our experimental design where all stimuli were familiar to the monkeys during training, the term "deviation" more accurately describes the stimuli used in our study rather than "novelty."

      To address this, we have revised the manuscript to replace the term "novelty" with "deviation" wherever applicable. This change has been made to ensure accurate terminology is used throughout the paper, thereby eliminating any potential misunderstanding of the concepts involved in our study.

      We thank the reviewer for pointing out this important distinction, which has improved the clarity and precision of our manuscript.

      (3) Most of the conclusions were made based on correlational analysis or speculation without providing causal evidence.

      We appreciate the reviewer’s concern regarding the reliance on correlational analyses in our study. Indeed, we acknowledge that the conclusions drawn primarily reflect correlations between neuronal activity and behavioral outcomes, rather than direct causal evidence. This limitation is common in many electrophysiological studies, particularly those conducted in behaving primates, where directly manipulating specific neural circuits to establish causality presents significant challenges, especially in comparison to research in mice.

      This complexity is further compounded when considering the IC’s role as a key lower-level relay station in the auditory pathway. Manipulating IC activity could have a widespread impact on auditory responses in downstream pathways, potentially influencing sensory prediction and decision-making processes.

      Despite this limitation, our study provides novel evidence suggesting that the IC may exhibit multiple facets of cognitive signaling, which could inspire future research aimed at exploring the underlying mechanisms and broader functional implications of these signals.

      To address the reviewer's concerns, we have made the following adjustments to the manuscript:

      (1) Clarified the Scope of Conclusions: We have revised the language in the Results and Discussion sections to explicitly state that our findings represent correlational relationships rather than causal mechanisms. For example, we have referred to the associations observed between IC activity and behavioral outcomes as "correlational" and have refrained from making definitive causal claims without supporting experimental evidence.

      “Finally, to determine whether the IC plays a role in decision-making processes related to auditory perception, we analyzed the correlation between neuronal activity and behavioral choices in the duration deviation detection task.” (P.14, Line. 4-6)

      (2) Proposed Future Directions: In the Discussion section, we have included suggestions for future studies to directly test the causality of the observed relationships.

      “Further research is required to explore the underlying neuronal mechanisms and functional significance of this dynamic change comprehensively.” (P.18, Line. 11-12)

      We believe these revisions provide a more balanced interpretation of our findings while emphasizing the importance of future research to build on our results and establish causal relationships. Thank you for raising this critical point, which has led to a more rigorous and transparent presentation of our study.

      (4) Results are presented in a very "straightforward" manner with too many detailed descriptions of phenomena but lack of summary and information synthesis. For example, the first section of Results is very long but did not convey clear information.

      We appreciate the reviewer’s feedback regarding the presentation of our results. We understand that the detailed descriptions of phenomena may have made it difficult to discern the key findings and overarching themes in the study. We recognize the importance of balancing detailed reporting with clear summaries and synthesis to effectively communicate our findings.

      To address this concern, we have made the following revisions to the manuscript:

      (1) Condensed and Synthesized Key Findings: We have streamlined the presentation of the Results section by condensing overly detailed descriptions and focusing on the most critical aspects of the data. Key findings are now summarized at the end of each subsection to ensure that the main points are clearly conveyed.

      “The accumulation of the climbing effect alongside repetitive sound presentations suggests a potential linkage to reward prediction or sensory prediction, reflecting an increased probability of receiving a reward and the strengthening of sound prediction as the sound sequence progresses.” (P.10, Line. 17-20)

      “The distinct response in the control condition, where the reward was unpredictable, contrasted sharply with the predictable reward scenario in the deviant condition, underscoring the ability of auditory IC neurons to encode reward prediction errors.” (P.13, Line. 21-22; P.14, Line. 1-2)

      (2) Improved Flow and Clarity: We have revised the structure and organization of the Results section to improve the flow of information. By rearranging certain paragraphs and refining the language, we aim to present the results in a more cohesive and coherent manner.

      “Deviant Response dynamics in duration deviation detection” (P.6, Line. 12)

      “Standard Response dynamics in duration deviation detection” (P.9, Line. 4)

      We believe these changes will make the Results section more accessible and informative, allowing readers to more easily grasp the significance of our findings. Thank you for your valuable suggestion, which has significantly improved the clarity and impact of our manuscript.

      (5) The logic between different sections of Results is not clear.

      We appreciate the reviewer’s observation regarding the lack of clear logical connections between different sections of the Results. We acknowledge that a coherent flow is essential for effectively communicating the progression of findings and their implications.

      To address this concern, we have made the following revisions:

      (1) Enhanced Transitions Between Sections: We have introduced clearer transitional statements between sections of the Results. These transitions explicitly state how each new section builds upon or relates to the previous findings, creating a more cohesive narrative.

      “Building upon the findings from the deviant responses, we next explored whether the climbing effect also manifested in responses to preceding standard stimuli, thereby examining the influence of sensory prediction and repetition on IC neuronal activity.” (P.9, Line. 5-7)

      “To determine whether the observed climbing effect was driven by reward anticipation, we designed an experiment controlling for reward effects, thereby clarifying the underlying factors influencing IC neuronal activity.” (P.10, Line. 22; P.11, Line. 1-2)

      “Recognizing that some IC neurons responded to reward delivery, we investigated whether these responses reflected reward prediction errors, thereby further elucidating the IC's role in reward processing.” (P.12, Line. 9-11)

      “Finally, to determine whether the IC plays a role in decision-making processes related to auditory perception, we analyzed the correlation between neuronal activity and behavioral choices in the duration deviation detection task.” (P.14, Line. 4-6)

      (2) Integration of Findings: In several places within the Results, we have added brief synthesis paragraphs that integrate findings across sections. These integrative summaries help to tie together the different aspects of our study, demonstrating how they collectively contribute to our understanding of the Inferior Colliculus’s (IC) role in sensory prediction, decision-making, and reward processing.

      “These results demonstrate that reward anticipation does not drive the climbing effect, thereby reinforcing the idea that sensory prediction is the primary factor influencing the accumulation of the climbing effect in the IC.” (P.12, Line. 4-7)

      “The distinct response in the control condition, where the reward was unpredictable, contrasted sharply with the predictable reward scenario in the deviant condition, underscoring the ability of auditory IC neurons to encode reward prediction errors.” (P.13, Line. 21-22; P.14, Line. 1-2)

      (3) Clarified Rationale: At the beginning of each major section, we have clarified the rationale behind why certain experiments were conducted, connecting them more clearly to the overarching goals of the study. This should help the reader understand the purpose of each set of results in the context of the broader research objectives.

      “Building upon the findings from the deviant responses, we next explored whether the climbing effect also manifested in responses to preceding standard stimuli, thereby examining the influence of sensory prediction and repetition on IC neuronal activity.” (P.9, Line. 5-7)

      “To determine whether the observed climbing effect was driven by reward anticipation, we designed an experiment controlling for reward effects, thereby clarifying the underlying factors influencing IC neuronal activity.” (P.10, Line. 22; P.11, Line. 1-2)

      “Recognizing that some IC neurons responded to reward delivery, we investigated whether these responses reflected reward prediction errors, thereby further elucidating the IC's role in reward processing.” (P.12, Line. 9-11)

      “Finally, to determine whether the IC plays a role in decision-making processes related to auditory perception, we analyzed the correlation between neuronal activity and behavioral choices in the duration deviation detection task.” (P.14, Line. 4-6)

      We believe these changes improve the overall coherence and readability of the Results section, allowing readers to better follow the logical progression of our study. We are grateful for this constructive feedback and believe it has significantly enhanced the manuscript.

      (6) In the Discussion, there is excessive repetition of results, and further comparison with and discussion of potentially related work are very insufficient. For example, Metzger, R.R., et al. (J Neurosci, 2006) have shown similar firing patterns of IC neurons and correlated their findings with reward.

      We appreciate the reviewer's insightful critique regarding the excessive repetition in the Discussion and the lack of sufficient comparison with related work. We acknowledge that a well-balanced Discussion should not only interpret findings but also place them in the context of existing literature to highlight the novelty and significance of the study.

      To address these concerns, we have made the following revisions:

      (1) Reduction of Repetition: We have carefully revised the Discussion to minimize redundant repetition of the Results. Instead of restating the findings, we now focus more on their implications, limitations, and how they advance the current understanding of the Inferior Colliculus (IC) and its broader cognitive roles.

      “We demonstrated that the climbing effect is dynamically modulated (Figure 2D-G), and this modulation is driven primarily by sensory prediction rather than reward anticipation, as controlling for reward effects showed minimal impact on the response profile (Figure 3D, E). This modulation by preceding sensory experiences indicates that the IC is more than merely a relay station, suggesting a more intricate role in auditory processing influenced by both ascending and descending neural pathways.” (P.17, Line. 1-5)

      (2) Incorporation of Related Work: We have expanded the Discussion to include a more comprehensive comparison with existing literature, specifically highlighting studies that have reported similar findings. For example, we now discuss the work by Metzger et al. (2006), which demonstrated similar firing patterns of IC neurons and correlated these with reward-related processes. This comparison helps contextualize our results and emphasizes the novel contributions our study makes to the field.

      “Metzger and colleagues reported a gradual increase in neural activity—termed late-trial ramping—in the IC during an auditory saccade task. Similar to our results, they observed no climbing effect in the absence of a behavioral task. Both studies support the idea that the climbing effect depends on both behavioral engagement and reward. While both pieces of research emphasize the IC's complex role in integrating auditory processing with cognitive functions related to reward and behavior, our findings provide further insight by distinguishing between the effects of sensory prediction and reward anticipation on IC neuronal activity.” (P.16, Line. 16-24)

      We believe these revisions have significantly improved the quality of the Discussion by reducing unnecessary repetition and providing a more thorough engagement with the relevant literature. We are grateful for the reviewer's valuable feedback, which has helped us refine and strengthen the manuscript.

      Reviewer #2 (Public review):

      Summary:

      The inferior colliculus (IC) has been explored for its possible functions in behavioral tasks and has been suggested to play more important roles rather than simple sensory transmission. The authors revealed the climbing effect of neurons in IC during decision-making tasks, and tried to explore the reward effect in this condition.

      Strengths:

      Complex cognitive behaviors can be regarded as simple ideals of generating output based on information input, which depends on all kinds of input from sensory systems. The auditory system has hierarchic structures no less complex than those areas in charge of complex functions. Meanwhile, IC receives projections from higher areas, such as auditory cortex, which implies IC is involved in complex behaviors. Experiments in behavioral monkeys are always time-consuming works with hardship, and this will offer more approximate knowledge of how the human brain works.

      We greatly appreciate the reviewer's positive summary of our work and recognition of the effort involved in conducting experiments on behaving monkeys. We agree with the reviewer that the inferior colliculus (IC) plays a significant role beyond mere sensory transmission, particularly in integrating sensory inputs with higher cognitive functions. Our study aims to shed light on these complex functions by revealing the climbing effect of IC neurons during decision-making tasks and exploring how reward influences this dynamic.

      We are encouraged that the reviewer acknowledges the importance of investigating the IC's role within the broader framework of complex cognitive behaviors and appreciates the hierarchical nature of the auditory system. The reviewer's comments reinforce the value of our research in contributing to a more nuanced understanding of how the IC might contribute to sensory-cognitive integration.

      We thank the reviewer for highlighting the significance of using behavioral monkey models to approximate human brain function. We are hopeful that our findings will serve as a stepping stone for further research exploring the multifaceted roles of the IC in cognition and behavior.

      We will now proceed to address the specific concerns and suggestions provided by the reviewer in the following sections.

      Weaknesses:

      These findings are more about correlation but not causality of IC function in behaviors. And I have a few major concerns.

      We appreciate the reviewer’s concern regarding the reliance on correlational analyses in our study. We fully acknowledge the importance of distinguishing between correlation and causality. As outlined in our response to Question 3 from Reviewer #1, we recognize the limitations of relying on correlational data and the inherent challenges in establishing direct causal links, particularly in electrophysiological studies involving behaving primates, and given the lower-level role of the IC in the auditory pathway.

      We have taken steps to clarify this distinction throughout our manuscript. Specifically, we have revised the Results and Discussion sections to ensure that the findings are presented as correlational, not causal, and we have proposed future studies utilizing more direct manipulation techniques to assess causality. We hope these revisions adequately address your concerns.

      “Finally, to determine whether the IC plays a role in decision-making processes related to auditory perception, we analyzed the correlation between neuronal activity and behavioral choices in the duration deviation detection task.” (P.14, Line. 4-6)

      “Further research is required to explore the underlying neuronal mechanisms and functional significance of this dynamic change comprehensively.” (P.18, Line. 11-12)

      Comparing neurons' spike activities in different tests, a 'climbing effect' was found in the oddball paradigm. The effect is clearly related to the training and learning process, but it still requires more exploration to rule out a few explanations. First, repeated white noise bursts with a fixed inter-stimulus interval of 0.6 seconds were presented, so monkeys might remember the sounds by rhythm, which is some sort of learned auditory response. It would be interesting to know the monkeys' responses and the neurons' activities if the inter-stimulus interval were variable. Second, the task only asked monkeys to press one button and the reward ratio (the ratio of correct response trials) was around 78% (based on the number from Line 302), so that, in the sessions with reward, monkeys had a high expectation of reward. Does this expectation cause the climbing effect?

      We thank the reviewer for raising these insightful points regarding the 'climbing effect' observed in the oddball paradigm and its potential relationship with training, learning processes, and reward expectation. Below, we address each of the reviewer's specific concerns:

      (1) Inter-Stimulus Interval (ISI) and Rhythmic Auditory Response:

      The reviewer suggests that the fixed inter-stimulus interval (ISI) of 0.6 seconds might lead to a rhythmic auditory response, where monkeys could anticipate the sounds. We appreciate this perspective and recognize its relevance. However, we believe that rhythm is unlikely to be a significant contributor to the 'climbing effect' for two key reasons:

      a) The 'climbing effect' begins as early as the second sound in the block (as shown in Fig. 2D and Fig. 3B), before any rhythm or pattern could be fully established, since rhythm generally requires at least three repetitions to form.

      b) In our reward experiment (Figs. 4-5), the sounds were also presented at regular ISIs, which could have facilitated rhythmic learning, yet the observed climbing effect was comparatively small in those conditions.

      Unfortunately, we did not explore variable ISIs in this current study, so we cannot directly address this concern with the available data.
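      Point (a) above rests on comparing responses across sound positions within a block. One way such a comparison could be expressed is sketched below. This is not the analysis code used in the study: the firing rates are hypothetical values chosen for illustration, and the helper names `mean_rate_by_position` and `linear_slope` are assumptions.

```python
from statistics import mean

def mean_rate_by_position(trials):
    """Average firing rate at each sound position across equal-length trials."""
    return [mean(position) for position in zip(*trials)]

def linear_slope(values):
    """Least-squares slope of values against sound position (1, 2, ..., n)."""
    xs = list(range(1, len(values) + 1))
    x_bar, y_bar = mean(xs), mean(values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator

# Hypothetical firing rates (spikes/s): rows are trials, columns are sound positions 1-4.
trials = [
    [20.0, 24.0, 27.0, 29.0],
    [18.0, 23.0, 25.0, 30.0],
    [22.0, 25.0, 29.0, 31.0],
]

rates = mean_rate_by_position(trials)
slope = linear_slope(rates)
second_vs_first = rates[1] - rates[0]

print("mean rate by position:", rates)
print("slope (spikes/s per position):", round(slope, 3))
print("increase from 1st to 2nd sound:", second_vs_first)
```

      A positive difference between the first and second positions corresponds to the early onset described in point (a), while the slope summarizes the accumulation over the whole sequence.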

      (2) Reward Expectation and Climbing Effect:

      The reviewer raises a valid concern regarding whether the 'climbing effect' might be influenced by the monkeys' high reward expectation, especially given the high reward ratio (~78%) in the sessions. While it is plausible that reward expectation could contribute to the observed increase in neuronal firing rates, we believe the results from our reward experiment (Fig. 4) suggest otherwise.

      In this experiment, even though reward expectation was likely formed due to the consistent pairing of sounds with rewards (100% reward delivery), we did not observe a significant climbing effect in the auditory response. Additionally, the presence of reward prediction error (Fig. 4D) further supports the idea that while the monkeys may indeed form reward expectations, these expectations do not directly drive the climbing effect in the IC.

      To make this distinction clearer, we have added sentences in the revised manuscript explicitly discussing the relationship between reward expectation and the climbing effect.

      “Within the oddball paradigm, both sensory and reward predictions intensify alongside the recurrence of standard sounds, suggesting that the strength of these predictions could significantly influence neuronal responses. Our experimentation with rewards has effectively dismissed the role of reward prediction (Figures 3 and 4), highlighting the potential significance of sensory prediction in molding the climbing effect.” (P.17, Line. 14-19)

      We believe these revisions provide a clearer understanding of the factors contributing to the climbing effect and effectively address the reviewer's concerns. We sincerely thank the reviewer for these valuable suggestions, which have allowed us to improve the clarity and depth of our manuscript.

      "Reward effect" on IC neurons' responses were shown in Fig. 4. Is this auditory response caused by physical reward action or not? In reward sessions, IC neurons have obvious response related to the onset of water reward. The electromagnetic valve is often used in water-rewarding system and will give out a loud click sound every time when the reward is triggered. IC neurons' responses may be simply caused by the click sound if the electromagnetic valve is used. It is important to find a way to rule out this simple possibility.

      We appreciate the reviewer’s concern regarding the potential confounding factor introduced by the electromagnetic valve’s click sound during water reward delivery, which could be misinterpreted as an auditory response rather than a response to the reward itself. Anticipating this possibility, we took measures to eliminate it by placing the electromagnetic valve outside the soundproof room where the neuronal recordings were performed.

      To address your concern more explicitly, we have added sentences in the Methods section of the revised manuscript detailing this setup, ensuring that readers are aware of the steps we took to eliminate this potential confound. By doing so, we believe that the observed reward-related neural activity in the IC is attributable to the reward processing itself rather than an auditory response to the valve click. We appreciate you bringing this important aspect to our attention, and we hope our clarification strengthens the interpretation of our findings.

      “The reward was controlled electronically by a valve located outside the sound-proof room to prevent any noise interference from the valve.” (P.24, Line. 6-7)

      Reviewer #3 (Public review):

      Summary:

      The authors aimed to investigate the multifaceted roles of the Inferior Colliculus (IC) in auditory and cognitive processes in monkeys. Through extracellular recordings during a sound duration-based novelty detection task, the authors observed a "climbing effect" in neuronal firing rates, suggesting an enhanced response during sensory prediction. Observations of reward prediction errors within the IC further highlight its complex integration in both auditory and reward processing. Additionally, the study indicated IC neuronal activities could be involved in decision-making processes.

      Strengths:

      This study has the potential to significantly impact the field by challenging the traditional view of the IC as merely an auditory relay station and proposing a more integrative role in cognitive processing. The results provide valuable insights into the complex roles of the IC, particularly in sensory and cognitive integration, and could inspire further research into the cognitive functions of the IC.

      We appreciate the reviewer’s positive summary of our work and recognition of its potential impact on the field. We are pleased that the reviewer acknowledges the significance of our findings in challenging the traditional view of the Inferior Colliculus (IC) as merely an auditory relay station and in proposing its integrative role in cognitive processing.

      Our study indeed aims to provide new insights into the multifaceted roles of the IC, particularly in the context of sensory and cognitive integration. We believe that this research could pave the way for future studies that further explore the cognitive functions of the IC and its involvement in complex behavioral processes.

      We are encouraged by the reviewer’s positive assessment and are committed to continuing to refine our work in response to the constructive feedback provided. We hope that our findings will contribute to advancing the understanding of the IC’s role in the broader context of neuroscience.

      We will now proceed to address the specific concerns and suggestions provided by the reviewer in the following sections.

      Weaknesses:

      Major Comments:

      (1) Structural Clarity and Logic Flow:

      The manuscript investigates three intriguing functions of IC neurons: sensory prediction, reward prediction, and cognitive decision-making, each of which is a compelling topic. However, the logical flow of the manuscript is not clearly presented and needs to be well recognized. For instance, Figure 3 should be merged into Figure 2 to present population responses to the order of sounds, thereby focusing on sensory prediction. Given the current arrangement of results and figures, the title could be more aptly phrased as "Beyond Auditory Relay: Dissecting the Inferior Colliculus's Role in Sensory Prediction, Reward Prediction, and Cognitive Decision-Making."

      We appreciate the reviewer’s detailed feedback on the structural clarity and logical flow of the manuscript. We understand the importance of presenting our findings in a clear and cohesive manner, especially when addressing multiple complex topics such as sensory prediction, reward prediction, and cognitive decision-making.

      To address the reviewer's concerns, we have made the following revisions:

      (1) Reorganization of Figures and Results:

      We agree with the suggestion to merge Figure 3 into Figure 2. By doing so, we can present the population responses to the order of sounds more effectively, thereby streamlining the focus on sensory prediction. This will allow readers to more easily follow the progression of the results related to this key function of the IC.

      We have reorganized the Results section to ensure a smoother transition between the different aspects of IC function that we are investigating. The new structure will better guide the reader through the narrative, aligning with the themes of sensory prediction, reward prediction, and cognitive decision-making.

      “Deviant Response dynamics in duration deviation detection” (P.6, Line. 12)

      “Standard Response dynamics in duration deviation detection” (P.9, Line. 4)

      (2) Revised Title:

      In line with the reviewer's suggestion, we have revised the title to "Beyond Auditory Relay: Dissecting the Inferior Colliculus's Role in Sensory Prediction, Reward Prediction, and Cognitive Decision-Making." We believe this title more accurately reflects the scope and focus of our study, as it highlights the three core functions of the IC that we are investigating.

      (3) Improved Logic Flow:

      We have added introductory statements at the beginning of each section within the Results to clarify the rationale behind the experiments and the logical connections between them. This should help to improve the overall flow of the manuscript and make the progression of our findings more intuitive for readers.

      “Building upon the findings from the deviant responses, we next explored whether the climbing effect also manifested in responses to preceding standard stimuli, thereby examining the influence of sensory prediction and repetition on IC neuronal activity.” (P.9, Line. 5-7)

      “To determine whether the observed climbing effect was driven by reward anticipation, we designed an experiment controlling for reward effects, thereby clarifying the underlying factors influencing IC neuronal activity.” (P.10, Line. 22; P.11, Line. 1-2)

      “Recognizing that some IC neurons responded to reward delivery, we investigated whether these responses reflected reward prediction errors, thereby further elucidating the IC's role in reward processing.” (P.12, Line. 9-11)

      “Finally, to determine whether the IC plays a role in decision-making processes related to auditory perception, we analyzed the correlation between neuronal activity and behavioral choices in the duration deviation detection task.” (P.14, Line. 4-6)

      We believe these changes significantly enhance the clarity and logical structure of the manuscript, making it easier for readers to understand the sequence and importance of our findings. Thank you for your valuable suggestion, which has led to a more coherent and focused presentation of our work.

      (2) Clarification of Data Analysis:

      Key information regarding data analysis is dispersed throughout the results section, which can lead to confusion. Providing a more detailed and cohesive explanation of the experimental design would significantly enhance the interpretation of the findings. For instance, including a detailed timeline and reward information for the behavioral paradigms shown in Figures 1C and D would offer crucial context for the study. More importantly, clearly presenting the analysis temporal windows and providing comprehensive statistical analysis details would greatly improve reader comprehension.

      We appreciate the reviewer’s insightful comment regarding the need for clearer and more cohesive explanations of the data analysis and experimental design. We recognize that a well-structured presentation of this information is essential for the reader to fully understand and interpret our findings. To address this, we have made the following revisions:

      (1) Detailed Explanation of Experimental Design:

      We have included a more detailed explanation of the experimental design, particularly for the behavioral paradigms shown in Figures 1C and 1D. This includes a comprehensive timeline of the experiments, along with explicit information about the reward structure and timing. By providing this context upfront, we aim to give readers a clearer understanding of the conditions under which the neuronal recordings were obtained.

      (2) Cohesive Presentation of Data Analysis:

      Key information regarding data analysis, which was previously dispersed throughout the Results section, has been consolidated and moved to a dedicated subsection within the Methods. This subsection now provides a step-by-step description of the analysis process, including the temporal windows used for examining neuronal activity, as well as the specific statistical methods employed.

      We have also ensured that the temporal windows used for different analyses (e.g., onset window, late window, etc.) are clearly defined and consistently referenced throughout the manuscript. This will help readers track the use of these windows across different figures and analyses.

      (3) Enhanced Statistical Analysis Details:

      We have expanded the description of the statistical analyses performed in the study, including the rationale behind the choice of tests, the criteria for significance, and any corrections for multiple comparisons. This relevant information is highlighted in the Results section or figure legends to facilitate understanding.

      We believe these changes will significantly improve the clarity and comprehensibility of the manuscript, allowing readers to better follow the experimental design, data analysis, and the conclusions drawn from our findings. Thank you for this valuable feedback, which has helped us to enhance the rigor and transparency of our presentation.

      (3) Reward Prediction Analysis:

      The conclusion regarding the IC's role in reward prediction is underdeveloped. While the manuscript presents evidence that IC neurons can encode reward prediction, this is only demonstrated with two example neurons in Figure 6. A more comprehensive analysis of the relationship between IC neuronal activity and reward prediction is necessary. Providing population-level data would significantly strengthen the findings concerning the IC's complex functionalities. Additionally, the discussion of reward prediction in lines 437-445, which describes IC neuron responses in control experiments, does not sufficiently demonstrate that IC neurons can encode reward expectations. It would be valuable to include the responses of IC neurons during trials with incorrect key presses or no key presses to better illustrate this point.

      We deeply appreciate the detailed feedback provided regarding the conclusions on the inferior colliculus (IC)'s role in reward prediction within our manuscript. We acknowledge the importance of a robust and comprehensive presentation of our findings, particularly when discussing complex neural functionalities.

      In response to the reviewer's concerns, we have made the following revisions to strengthen our manuscript:

      (1) Inclusion of Population-Level Data for IC Neurons:

      In the revised manuscript, we have included population-level results for IC neurons in a supplementary figure. Initially, we focused on two example neurons that did not exhibit motor-related responses to key presses to isolate reward-related signals. However, most IC neurons exhibit motor responses during key presses (as indicated in Fig.6), which can complicate distinguishing between reward-related activity and motor responses. This complexity is why we initially presented neurons without motor responses. To clarify this point, we have added sentences in the Results section to explain the rationale behind our selection of neurons and to address the potential overlap between motor and reward responses in the IC.

      “This phenomenon was further supported by examining the responses in the duration deviation detection task. Since most IC neurons exhibit motor responses during key presses (Supplementary Figure 6), which can complicate distinguishing between reward-related activity and motor responses, we specifically selected two neurons without motor responses during key presses (Figure 5).” (P.13, Line. 10-15)

      (2) Addition of Data on Key Press Errors and No-Response Trials:

      In response to the reviewer’s suggestion, we present peri-stimulus time histograms (PSTHs) for two example neurons during error trials below, including incorrect key presses and no-response trials. Given that the monkeys performed the task with high accuracy, the number of error trials is relatively small, especially for the control condition (as shown in the top row of the figure below). While we remain cautious in drawing definitive conclusions from this limited number of trials, we observed no clear reward signals during the corresponding window (typically centered around 150 ms after the end of the sound). It is important to note that the experiment was initially designed to explore decision-making signals in the IC, rather than focusing specifically on reward processing. However, the data in Fig. 6 demonstrated intriguing signals of reward prediction error, which is why we believe it is important to present them.

      When combined with the results from our reward experiment (Fig. 5), we believe these findings provide compelling evidence of reward prediction errors being processed by IC neurons.

      Author response image 1.

      (A)  PSTH of the neuron from Figure 5A during a key press trial under control condition. The number in the parentheses in the legend represents the number of trials for control condition. (B) PSTHs of the neuron from Figure 5A during non-key press trials under experimental conditions. The numbers in the parentheses in the legend represent the number of trials for experimental conditions. (C-D) Equivalent PSTHs as in A-B but from the neuron in Figure 5B.
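
      For readers less familiar with the analysis, the following is a minimal sketch of how a PSTH of this kind can be computed. It is illustrative only and is not the analysis code used in the study. The function name `psth`, the 50 ms bin width, and the assumption that spike times are given in milliseconds relative to an alignment event (for example, sound offset) are all hypothetical choices made for the example.

```python
from typing import List, Sequence


def psth(trials: Sequence[Sequence[float]],
         t_start_ms: float,
         t_stop_ms: float,
         bin_width_ms: float) -> List[float]:
    """Trial-averaged firing rate (spikes/s) in consecutive time bins.

    `trials` holds one sequence of spike times per trial, in ms,
    relative to a common alignment event (hypothetical convention).
    """
    n_bins = int((t_stop_ms - t_start_ms) // bin_width_ms)
    counts = [0] * n_bins
    for spikes in trials:
        for t in spikes:
            idx = int((t - t_start_ms) // bin_width_ms)
            # Spikes outside [t_start_ms, t_stop_ms) are ignored.
            if 0 <= idx < n_bins:
                counts[idx] += 1
    n_trials = len(trials)
    if n_trials == 0:
        return [0.0] * n_bins
    bin_width_s = bin_width_ms / 1000.0
    # Convert summed counts to a mean rate per trial, in spikes/s.
    return [c / (n_trials * bin_width_s) for c in counts]


# Two made-up trials, binned at 50 ms over a 200 ms window.
example_rates = psth([[10, 60, 160], [20, 170]], 0, 200, 50)
```

      With few error trials, as here, each bin averages over very few spikes, which is one reason to interpret such histograms cautiously.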

      We are grateful for the reviewer's insightful suggestions, which have allowed us to improve the depth and rigor of our analysis. We believe these revisions significantly enhance our manuscript's conclusions regarding the complex functionalities of IC.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      One of the major issues of this work is that its writing fails to convey the focus and significance of the work. Sentences are too long and multiple pieces of information are often integrated in one sentence, causing great confusion.

      We appreciate the reviewer's feedback regarding the clarity and structure of the manuscript. We agree that scientific writing should be clear and concise to effectively communicate the significance of the work. In response to this comment, we have undertaken the following revisions to improve the readability and focus of the manuscript:

      (1) Simplified Sentence Structure:

      We have revisited the manuscript and revised sentences that were overly complex or contained multiple pieces of information. Long sentences have been broken into shorter, more digestible statements to improve clarity and readability. Each sentence now conveys a single, focused idea.

      (2) Improved Flow and Focus:

      We have restructured certain paragraphs to ensure that the narrative flows logically and highlights the key findings. This restructuring includes placing the most significant results in prominent positions within paragraphs and ensuring that each section begins with a clear statement of purpose.

      “Building upon the findings from the deviant responses, we next explored whether the climbing effect also manifested in responses to preceding standard stimuli, thereby examining the influence of sensory prediction and repetition on IC neuronal activity.” (P.9, Line. 5-7)

      “To determine whether the observed climbing effect was driven by reward anticipation, we designed an experiment controlling for reward effects, thereby clarifying the underlying factors influencing IC neuronal activity.” (P.10, Line. 22; P.11, Line. 1-2)

      “Recognizing that some IC neurons responded to reward delivery, we investigated whether these responses reflected reward prediction errors, thereby further elucidating the IC's role in reward processing.” (P.12, Line. 9-11)

      “Finally, to determine whether the IC plays a role in decision-making processes related to auditory perception, we analyzed the correlation between neuronal activity and behavioral choices in the duration deviation detection task.” (P.14, Line. 4-6)

      (3) Refined Significance of the Work:

      In response to the reviewer's concern that the manuscript fails to clearly convey the significance of the work, we have revised the Introduction and Discussion sections to better emphasize the focus and impact of our findings. We now explicitly highlight the novel contributions of this research to the understanding of the multifaceted role of the IC in sensory prediction, decision-making, and reward processing.

      “In this research, we embarked on a deviation detection task centered around sound duration with trained monkeys, performing extracellular recordings in the IC. Our observations unveiled a 'climbing effect'—a progressive increase in firing rate after sound onset, not attributable to reward but seemingly linked to sensory experience such as sensory prediction. Moreover, we identified signals of reward prediction error and decision-making. These findings propose that the IC's role in auditory processing extends into the realm of complex perceptual and cognitive tasks, challenging previous assumptions about its functionality.” (P.6, Line. 1-8)

      “Overall, our results strongly suggest that the inferior colliculus is actively engaged in sensory experience, reward prediction and decision making, shedding light on its intricate functions in these processes.” (P.16, Line. 10-12)

      We believe these revisions address the reviewer's concern and will make the manuscript more accessible to readers. Thank you for the valuable suggestion, which has led to a more precise and effective presentation of our work.

      Reviewer #2 (Recommendations for the authors):

      (1) In the oddball paradigm, an inter-stimulus interval of 0.6 seconds was used. Varying the inter-stimulus interval should show whether this effect reflects rhythm learning. It would be better to choose random inter-stimulus and inter-trial intervals across the whole experiment in case the monkeys try to remember the rhythm.

      The reviewer suggests that the fixed inter-stimulus interval (ISI) of 0.6 seconds may lead to a rhythmic auditory response, allowing monkeys to anticipate sounds. This is a valuable suggestion, and we appreciate this perspective. However, we believe that rhythm is unlikely to play a significant role in driving the 'climbing effect.' The 'climbing effect' starts as early as the second sound in the block (as shown in Fig. 2D and Fig. 3B), which is before any rhythm or pattern could be fully established. Typically, rhythm learning requires at least three repetitions to form a predictable sequence.

      Unfortunately, we did not vary the inter-stimulus interval in the current study, so we cannot test this hypothesis directly with the current dataset. However, we agree with the reviewer that using random ISIs would be an effective way to directly rule out any contribution of rhythm learning to the climbing effect.
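
      As an illustration of the randomized design the reviewer proposes, the sketch below draws a jittered ISI for each gap in a block instead of a fixed 0.6 s. It was not part of the present study. The function names, the 0.4 to 0.8 s range, the 0.3 s sound duration, and the convention that the ISI runs from sound offset to the next sound onset are all assumptions made for the example.

```python
import random
from typing import List, Optional


def jittered_isis(n_sounds: int,
                  isi_min_s: float,
                  isi_max_s: float,
                  seed: Optional[int] = None) -> List[float]:
    """Draw one uniformly distributed ISI for each gap between sounds."""
    rng = random.Random(seed)  # seeded for reproducible schedules
    return [rng.uniform(isi_min_s, isi_max_s) for _ in range(n_sounds - 1)]


def onset_times(isis_s: List[float],
                sound_duration_s: float,
                start_s: float = 0.0) -> List[float]:
    """Sound onset times, assuming the ISI is measured offset-to-onset."""
    onsets = [start_s]
    for isi in isis_s:
        onsets.append(onsets[-1] + sound_duration_s + isi)
    return onsets


# Hypothetical block of 5 sounds with ISIs jittered around 0.6 s.
block_isis = jittered_isis(5, 0.4, 0.8, seed=1)
block_onsets = onset_times(block_isis, sound_duration_s=0.3)
```

      Because each gap is drawn independently, the onset of the next sound cannot be predicted from the preceding intervals, which is what would separate rhythm learning from the climbing effect.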

      (2) Regarding "reward effect" on IC neurons' responses, we should rule out the possibility of simple auditory response to the switching of electromagnetic valve.

      We appreciate the reviewer’s concern about the potential confounding factor of the electromagnetic valve's click sound during water reward delivery, which could be interpreted as an auditory response rather than a true reward-related response. Anticipating this issue, we took measures to eliminate this possibility by placing the electromagnetic valve outside the soundproof room where neuronal recordings were conducted. This setup ensured that any potential auditory noise from the valve was minimized and unlikely to influence the IC neuronal activity.

      To address this concern more explicitly, we have added a description in the Methods section detailing this setup. This revision clarifies the steps we took to rule out this potential confound, strengthening the validity of our claim that the observed IC activity is genuinely related to reward processing and not a simple auditory response to the valve's operation.

      We thank the reviewer for bringing attention to this critical aspect of our experimental design, and we hope this clarification enhances the interpretation of our findings.

      “The reward was controlled electronically by a valve located outside the sound-proof room to prevent any noise interference from the valve.” (P.24, Line. 6-7)

      (3) Since monkeys are smart, simple Go/NoGo design is not a good strategy. The task with more buttons to press, such as 2-AFC or 4-AFC task, may prevent artificial effect of unwanted behaviors and offer us more reliable and useful data.

      We appreciate the reviewer’s suggestion to implement a more complex behavioral task, such as a 2-Alternative Forced Choice (2-AFC) or 4-AFC design, to reduce the possibility of unwanted behaviors and to gather more reliable data. We agree that such paradigms could offer additional insights and help control the monkeys’ decision-making processes by reducing potential confounding factors related to the simplicity of Go/NoGo responses.

      In our current study, we chose the Go/NoGo task because it aligns with our primary experimental goal: investigating the relationship between IC activity and sensory prediction, decision-making, and reward processing in a simplified manner. This task allowed us to focus on reward prediction and sensory responses without introducing additional complexity that could increase the cognitive load on the monkeys and affect their performance. It is worth noting that training monkeys to perform auditory tasks is generally more challenging compared to visual tasks, though they are indeed capable of complex learning.

      Moreover, this novelty detection task was initially designed as an oddball paradigm to explore predictive coding along the auditory pathway. Our lab has concentrated on this topic for several years, with the majority of current research focusing on non-behaving subjects such as rodents. Implementing a more advanced paradigm like 2-AFC would have increased training time and required an approach that departs from our core objective.

      That said, we agree that future studies would benefit from using more sophisticated tasks, such as 2-AFC or 4-AFC paradigms, as they could offer a more refined understanding of decision-making processes while enhancing the quality of data by minimizing unwanted behaviors. We believe that incorporating more advanced behavioral paradigms in future work will further enhance the rigor and reliability of our findings.

      (4) Line 52, "challenges...", sounds a little bit too much. The authors tried to sell the idea that the IC is more than a simple sensory relay point. I agree with that, and I know that it is not easy to obtain comprehensive data from experiments on monkeys. But to support the authors' bolder opinions, more analysis needs to be done.

      We appreciate the reviewer’s feedback on the tone of the statement in Line 52, where we describe the findings as “challenging” conventional views of the IC as a simple sensory relay point. We agree that, while our data provide intriguing insights into the multifunctionality of the IC, especially in sensory prediction, decision-making, and reward processing, the original wording was too strong.

      To address this, we have toned down the language in the revised manuscript to better reflect the current state of our findings. Rather than presenting the results as a direct challenge to existing knowledge, we now describe them as contributing to a growing body of evidence that suggests the IC plays a more integrative role in auditory processing and cognitive functions.

      “This research highlights a more complex role for the IC than traditionally understood, showcasing its integral role in cognitive and sensory processing and emphasizing its importance in integrated brain functions.” (Abstract, P.3, Line.12-15)

      “This modulation by preceding sensory experiences indicates that the IC is more than merely a relay station, suggesting a more intricate role in auditory processing influenced by both ascending and descending neural pathways.” (P.17, Line. 3-5)

      (5) Line 143, "peak response", it is better not to refer to this transient response as "peak response". How about "transient response" or "transient peak response"?

      Thank you for your suggestion regarding the terminology used in Line 143. We agree with the reviewer that referring to this as simply a "peak response" could be misleading. To improve clarity and precision, we have revised the term to "transient peak response" as recommended.

      We believe this adjustment better captures the nature of the neuronal activity observed and avoids confusion. The manuscript has been updated accordingly, and we appreciate the reviewer’s valuable input.

      (6) Is it possible to manipulate the IC and check the effect on the behavioral task?

      We appreciate the reviewer’s suggestion to manipulate the IC area and observe its effect on behavior during the task. Indeed, this would provide valuable causal evidence regarding the role of the IC in sensory prediction, decision-making, and reward processing, which would complement the correlational findings we have presented.

      However, in this particular study, we focused on electrophysiological recordings to observe naturally occurring neuronal activity in behaving monkeys. While it is certainly feasible to manipulate IC activity, such as through pharmacological inactivation, optogenetics, or electrical stimulation, these techniques pose technical challenges in primates. Moreover, manipulating the IC, given its role as a lower-level relay station in the auditory pathway, could potentially disrupt auditory processing more broadly, complicating the interpretation of behavioral outcomes.

      That said, we agree that introducing such manipulations in future studies would significantly enhance our understanding of the causal role of the IC in cognitive and sensory functions. We have now emphasized this as a key future research direction in the revised manuscript’s discussion section. Thank you for this insightful suggestion.

      “Further research is required to explore the underlying neuronal mechanisms and functional significance of this dynamic change comprehensively.” (P.18, Line. 11-12)

      Reviewer #3 (Recommendations for the authors):

      Minor Comments:

      (1) Figure Labeling:

      The figures require more precise labeling, particularly concerning the analysis time windows, to facilitate reader understanding of the results.

      We thank the reviewer for highlighting the importance of precise figure labeling, particularly regarding the analysis time windows. We understand that clear labeling is critical for conveying our findings effectively.

      In response to your suggestion, we have revised the figures to include more precise and detailed labels, especially for the analysis time windows. These changes will help guide readers through the experimental design and clarify the interpretation of the results. We hope these improvements enhance the overall clarity and accessibility of the figures.

      (2) Discrepancies in Figures and Text:

      There are discrepancies in the manuscript that could confuse readers. For example, on line 154, what was referred to as Supplementary Figure 1 seemed to actually be Supplementary Figure 2. Similar issues were noted on lines 480 and 606.

      We appreciate the reviewer bringing this issue to our attention. We apologize for the discrepancies between the figures referenced in the text and their actual labels in the manuscript, as this could indeed confuse readers.

      We have carefully reviewed the entire manuscript and corrected all discrepancies between the figures and their corresponding references in the text, including the issues noted on lines 154, 480, and 606. We have ensured that the figure and supplementary figure references are now consistent and accurate throughout the manuscript.

      (3) Inconsistent Formatting in Figure legends:

      Ensuring a more professional and uniform presentation throughout the manuscript would be appreciated. There was inconsistent use of uppercase and lowercase letters in legends.

      We appreciate the reviewer’s attention to detail regarding the formatting of figure legends. Ensuring a professional and consistent presentation is crucial for enhancing the readability and overall quality of the manuscript.

      We have carefully reviewed all figure legends and made the necessary corrections to ensure consistent use of uppercase and lowercase letters, as well as uniform formatting throughout the manuscript. This includes ensuring that all abbreviations and terminology are used consistently across the text and legends.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      (1) This study uses structural and functional approaches to investigate the regulation of the Na/Ca exchanger NCX1 by an activator, PIP2, and an inhibitor, SEA0400.  State-of-the-art methods are employed, and the data are of high quality and presented very clearly. The manuscript combines two rather different studies (one on PIP2; and one on SEA0400) neither of which is explored in the depth one might have hoped to form robust conclusions and significantly extend knowledge in the field.

      We combined the study of PIP2 and SEA0400 in this manuscript because both ligands regulate NCX1 by affecting the Na<sup>+</sup>-dependent inactivation of the exchanger: SEA0400 promotes inactivation by stabilizing the cytosolic inactivation assembly, whereas PIP2 mitigates inactivation by destabilizing the assembly. The current study aims to provide structural insights into the binding of these ligands. We did not perform extensive electrophysiological analysis, as the functional effects of both ligands have been extensively characterized over the last thirty years.

      (2) The novel aspect of this work is the study of PIP2. Unfortunately, technical limitations precluded structural data on binding of the native PIP2, so an unnatural short-chained analog, diC8 PIP2, was used instead. This raises the question of whether these two molecules, which have similar but very distinctly different profiles of activation, actually share the same binding pocket and mode of action. In an effort to address this, the authors mutate key residues predicted to be important in forming the binding site for the phosphorylated head group of PIP2. However, none of these mutations prevent PIP2 activation. The only ones that have a significant effect also influence the Na-dependent inactivation process independently of PIP2, thus casting doubt on their role in PIP2 binding, and thus identification of the PIP2 binding site. A more extensive mutagenic study, based on the diC8 PIP2 binding site, would have given more depth to this work and might have been more revealing mechanistically.

      The reviewer raises the important question of whether the short-chain PIP2 diC8 and long-chain native PIP2 share the same binding site. We have performed a pilot experiment to address this question. The data indicate that PIP2 diC8 competes with native brain PIP2 for its binding site (Author response image 1).  We believe that the mild effects of diC8 on the biophysical properties of NCX1 are due to its decreased affinity as compared to the long-chain PIP2. We have included this competition assay in the revised manuscript.

      The acyl-chain length-dependent PIP2 activation is consistent with some previous studies. Before PIP2 was demonstrated to regulate NCX1, some earlier studies showed that negatively charged long-chain lipids such as phosphatidylserine (PS) or phosphatidic acid (PA) could have the same potentiation effects on NCX1 as PIP2 (PMID: 1474504; PMID: 3276350). A later study showed that long-chain acyl-CoAs could also have the same potentiation effects on NCX1 as PIP2 (PMID: 16977318).  All these studies demonstrated that activation of NCX by the anionic lipids depends on their chain length with the short chain being ineffective or less effective. These findings have two implications. First, it is the negative surface charge rather than the specific IP3 head group of the lipid that is important for stimulating NCX1 activity. This would imply non-specific electrostatic interactions between the negatively charged lipids and those positively charged residues at the binding site. Second, a longer acyl chain is required for the high-affinity binding of PIP2 or negatively charged lipids. As further discussed in the revised manuscript (Discussion section), we suspect the tail of the long acyl chain from the native anionic lipids can enter the same binding pocket for SEA0400 thereby rendering higher affinity lipid binding than shorter chain lipids.

      As the interactions between PIP2 and NCX1 are both electrostatic, involving multiple charged residues, and hydrophobic, involving the long lipid acyl chain, single amino acid substitutions likely only decrease the affinity of PIP2 rather than completely disrupt its binding. Our data demonstrated that mutants R220A, K225A, and R220A/K225A do show a significantly decreased potentiation effect of PIP2 (Figure 3 in the manuscript). We also conducted an experiment with a mutant exchanger in which all four amino acids were mutated. This K164A/R167A/R220A/K225A mutant is insensitive to PIP2 and shows no Na<sup>+</sup>-dependent inactivation (Figure 3A). The unresponsiveness to PIP2 and lack of Na<sup>+</sup>-dependent inactivation in this mutant are consistent with previous studies demonstrating that PIP2 activates NCX by tuning the amount of Na<sup>+</sup>-dependent inactivation and that any mutation that decreases NCX sensitivity to PIP2 will affect the extent of Na<sup>+</sup>-dependent inactivation (PMID: 10751315). Such studies show that the two processes cannot be dissected from each other, making more extensive mutagenesis unlikely to provide new mechanistic insights. A brief discussion of this quadruple mutant has been added to the revised manuscript.

      Author response image 1.

      Giant patch recording of the human WT exchanger. Currents were first activated by intracellular application of 10 µM brain PIP2. Afterwards, a solution containing 100 mM Na<sup>+</sup> and 12 µM Ca<sup>2+</sup> was perfused for about 5 min (washout). The PIP2 effect was not reversible during this time. The same patch was then perfused internally with the same solution in the presence of 10 µM diC8. Application of the shorter-chained diC8 partially decreased the current, suggesting that PIP2 and diC8 compete for the binding site.

      (3) The SEA0400 aspect of the work does not integrate particularly well with the rest of the manuscript. This study confirms the previously reported structure and binding site for SEA0400 but provides no further information. While interesting speculation is presented regarding the connection between SEA0400 inhibition and Na-dependent inactivation, further experiments to test this idea are not included here.

      Our SEA0400-bound NCX structure was determined and deposited in 2023, along with our previous study on the apo NCX published in 2023 (PMID: 37794011). We decided to combine the SEA0400-bound structure with the later study of PIP2 binding because both represent ligand modulation of NCX by affecting the Na<sup>+</sup>-dependent inactivation of the exchanger. The SEA0400 inhibition of NCX1 has been extensively investigated previously, which demonstrated a strong connection between SEA0400 and the Na<sup>+</sup>-dependent inactivation. As discussed in the manuscript, SEA0400 is ineffective in an exchanger lacking Na<sup>+</sup>-dependent inactivation. Conversely, enhancing the extent of Na<sup>+</sup>-dependent inactivation increases the affinity for SEA0400. Our structural analysis provides explanations for these pharmacological features of SEA0400 inhibition.

      Reviewer #2 (Public review):

      (1) The study by Xue et al. reports the structural basis for the regulation of the human cardiac sodium-calcium exchanger, NCX1, by the endogenous activator PIP2 and the small molecule inhibitor SEA400. This well-written study contextualizes the new data within the existing literature on NCX1 and the broader NCX family. This work builds upon the authors' previous study (Xue et al., 2023), which presented the cryo-EM structures of human cardiac NCX1 in both inactivated and activated states. The 2023 study highlighted key structural differences between the active and inactive states and proposed a mechanism where the activity of NCX1 is regulated by the interactions between the ion-transporting transmembrane domain and the cytosolic regulatory domain. Specifically, in the inward-facing state and at low cytosolic calcium levels, the transmembrane (TM) and cytosolic domains form a stable interaction that results in the inactivation of the exchanger. In contrast, calcium binding to the cytosolic domain at high cytosolic calcium levels disrupts the interaction with the TM domain, leading to active ion exchange.

      In the current study, the authors present two mechanisms explaining how both PIP2 stimulates NCX1 activity by destabilizing the protein's inactive state (i.e., by disrupting the interaction between the TM domain and the cytosolic domain) and how SEA400 stabilizes this interaction, thereby acting as a specific inhibitor of the system.

      The first part of the results section addresses the effect of PIP2 and PIP2 diC8 on NCX1 activity. This is pertinent as the authors use the diC8 version of this lipid (which has a shorter acyl chain) in their subsequent cryo-EM structure due to the instability of native PIP2. I am not an electrophysiology expert; however, my main comment would be to ask whether there is sufficient data here to characterise fully the differences between PIP2 and PIP2 diC8 on NCX1 function. It appears from the text that this study is the first to report these differences, so perhaps this data needs to be more robust. The spread of the data points in Figure 1B is possibly a little unconvincing given that only six measurements were taken. Why is there one outlier in Figure 1A? Were these results taken using the same batch of oocytes? Are these technical or biological replicates? Is the convention to use statistical significance for these types of experiments?

      Oocytes were isolated from at least 3 different frogs, and each data point shown in Fig. 1A or 1B of the manuscript represents a recording obtained from a single oocyte. For clarity, we have added this information to the Methods section. We understand that 6 observations (Fig. 1B) are a small sample size, but electrophysiological recordings of NCX currents are technically very challenging due to the low transport activity of the exchanger. Because of these circumstances, this type of study relies on a small sample of observations. Nevertheless, our data clearly show that native PIP2 and the short-chain PIP2 diC8 can both activate NCX, although with different affinities. The spread of the steady-state current data points is due to the variability in the extent of Na<sup>+</sup>-dependent inactivation within each patch, likely due to slightly different levels of endogenous PIP2 or other regulatory mechanisms that control this allosteric process. As PIP2 acts on the Na<sup>+</sup>-dependent inactivation, this will lead to varying levels of potentiation. Because of that, we did occasionally observe some outliers in our recordings. Rather than cherry-picking in data analysis, we presented all the data points from patches with measurable NCX1 currents. Despite this variability, a t-test indicates that the effects of PIP2 are more pronounced on the steady-state current than on the peak current. The differences between native PIP2 and PIP2 diC8 on NCX1 function are consistent with previous investigations showing that both PIP2 and anionic lipids enhance NCX current by antagonizing the Na<sup>+</sup>-dependent inactivation and that long-chain lipids are more effective in potentiating NCX1 activity (PMID: 1474504; PMID: 3276350; PMID: 16977318). A discussion of the chain length-dependent lipid activation of NCX1 has been added to the Discussion of the revised manuscript.

      (2) I am also somewhat skeptical about the modelling of the PIP2 diC8 molecule. The authors state, "The density of the IP3 head group from the bound PIP2 diC8 is well-defined in the EM map. The acyl chains, however, are flexible and could not be resolved in the structure (Fig. S2)."

      However, the density appears rather ambiguous to me, and the ligand does not fit well within the density. Specifically, there is a large extension in the volume near the phosphate at the 5' position, with no corresponding volume near the 4' phosphate. Additionally, there is no bifurcation of the volume near the lipid tails. I attempted to model cholesterol hemisuccinate (PDB: Y01) into this density, and it fits reasonably well - at least as well as PIP2 diC8. I am also concerned that if this site is specific for PIP2, then why are there no specific interactions with the lipid phosphates? How can the authors explain the difference between PIP2 and PIP2 diC8 if the acyl chains don't make any direct interactions with the TM domain? In short, the structures do not explain the functional differences presented in Figure 1.

      The side chain densities for Arg167 and Arg220 are also quite weak. While there is some density for the side chain of Lys164, it is also very weak. I would expect that if this site were truly specific for PIP2, it should exhibit greater structural rigidity - otherwise, how is this specific?

      Given this observation, have the authors considered using other PIP2 variants to determine if the specificity lies with PI4,5P<sub>2</sub> as opposed to PI3,5P<sub>2</sub> or PI3,4P<sub>2</sub>? A lack of specificity may explain the observed poor density.

      The map we provided to the editor in the initial submission is the overall map for PIP2-bound NCX1. Due to the relative flexibility between the cytosolic CBD and TM regions, we also performed local refinement on each region during data processing to improve the map quality, as illustrated in Fig. S2. The locally refined map focused on the TM domain provides a much better density for PIP2 diC8 and its surrounding residues than the overall map. The map quality allowed us to unambiguously identify the lipid as PIP2 with the IP3 head group having phosphate groups at the 4,5 positions. Furthermore, no lipid density is observed at the equivalent location in the locally refined map of the apo NCX1 TM region, as shown in Fig. S3 in the revision. In the revised manuscript, the density for the bound PIP2 is shown in Fig. 2A. The locally refined maps for PIP2-bound NCX1 were also deposited as additional maps along with the overall map in the Electron Microscopy Data Bank under accession number EMD-60921. The locally refined maps for the apo-NCX1 were deposited in the Electron Microscopy Data Bank under accession number EMD-40457 in our previous study (https://www.ebi.ac.uk/emdb/EMD-40457?tab=interpretation).

      As discussed in our response to reviewer #1, the acyl-chain length-dependent PIP2 activation is consistent with some previous studies. Before PIP2 was identified as a physiological regulator of NCX1, some earlier studies showed that negatively charged long-chain lipids such as phosphatidylserine (PS) or phosphatidic acid (PA) could have the same potentiation effects on NCX as PIP2 (PMID: 1474504; PMID: 3276350). A later study showed that acyl-CoA could also have the same potentiation effects on NCX as PIP2 (PMID: 16977318). All these studies demonstrated that activation of NCX1 by the anionic lipids depends on their chain length, with the short chain being ineffective. These findings have two implications. First, it is the negative surface charge rather than the specific IP3 head group of the lipid that is important for stimulating NCX activity. This would imply non-specific electrostatic interactions between the negatively charged lipids and the positively charged residues at the binding site. Second, a longer acyl chain is required for the high-affinity binding of PIP2 or negatively charged lipids. As further discussed in the revised manuscript (Discussion section), we suspect the tail of the long acyl chain can enter the same binding pocket as SEA0400, thereby rendering higher affinity lipid binding than for shorter chain lipids. In light of the equivalent potentiating effect of various anionic lipids on NCX1, PI(4,5)P2 activation of NCX1 is likely non-specific, and PI(3,5)P2 or PI(3,4)P2 may also activate the exchanger. However, as a key player in membrane signaling, PI(4,5)P2 has been demonstrated to be a physiological regulator of NCX1 in many studies.

      (3) I also noticed many lipid-like densities in the maps for this complex. Is it possible that the authors overlooked something? For instance, there is a cholesterol-like density near Val51, as well as something intriguing near Trp763, where I could model PIP2 diC8 (though this leads to a clash with Trp763). I wonder if the authors are working with mixed populations in their dataset. The accompanying description of the structural changes is well-written (assuming it is accurate).

      Densities from endogenous lipids and cholesterols are commonly observed in membrane protein structures. Other than the bound PIP2, those lipid and cholesterol densities are present in both the apo and PIP2-bound structures, including the density around Trp763 and Val53. Whether those bound lipids/cholesterols play any functional roles or just stabilize the protein is beyond the scope of this study.  We have added a supporting figure (Fig. S3) showing a side-by-side comparison of the density at the PIP2 binding site between the PIP2-bound and apo structures.

      I would recommend that the authors update the figures associated with this section, as they are currently somewhat difficult to interpret without prior knowledge of NCX architecture. My suggestions include:

      - Including the density for the PIP2 diC8 in Figure 2A.

      As suggested, we have included the density of PIP2 diC8 in Figure 2A.

      - Adding membrane boundaries (cytosolic vs. extracellular) in Figure 2B.

      - Labeling the cytosolic domains in Figure 2B.

      - Adding hydrogen bond distances in Figure 2A.

      We have added and labeled the boundaries for the TM and cytosolic domains in Figure 2B as suggested. Although we can identify the positively charged residues in the vicinity of the PIP2 head group and observe local structural changes, the poorly defined side-chain densities of these residues do not allow us to properly determine the hydrogen bond distances.

      - Detailing the domain movements in Figure 2B (what is the significance of the grey vs. blue structures?).

      There is a rigid-body downward swing movement at CBDs between the apo (grey) and PIP2-bound (cyan) structures. The movement at the TM region is subtle. We have added the description in the legend for Figure 2B and also marked the movement at the tip of CBD1 in the figure.

      The section on the mechanism of SEA400-induced inactivation is strong. The maps are of better quality than those for the PIP2 diC8 complex, and the ligand fits well. However, I noticed a density peak below F02 on SEA400 that lies within the hydrogen bonding distance of Asp825. Is this a water molecule? If so, is this significant?

      The structure of SEA0400-bound NCX1 was determined at a higher resolution, likely because the drug stabilizes the exchanger in the inactivated state. The mentioned density could be an ordered water molecule. We don’t know if it is functionally significant.

      Furthermore, there are many unmodeled regions that are likely cholesterol hemisuccinate or detergent molecules, which may warrant further investigation.

      We consistently observed partial densities from bound lipids, cholesterols, or detergents in our structures. Most of them are difficult to identify and model unambiguously. Whether they play any functional roles is beyond the scope of this study.

      The authors introduce SEA400 as a selective inhibitor of NCX1; however, there is little to no comparison between the binding sites of the different NCX proteins. This section could be expanded. Perhaps Fig. 4C could include sequence conservation data.

      SEA0400 is more specific for NCX1 than NCX2 and NCX3, as demonstrated in an early study (PMID: 14660663). The lack of structural information for NCX2 or NCX3 makes it difficult to make a direct comparison to reveal the structural basis of SEA0400 specificity.

      Additionally, is the fenestration in the membrane physiological, or is it merely a hole forced open by the binding of SEA400? I was unclear as to whether the authors were suggesting a physiological role for this feature, similar to those observed in sodium channels.

      The fenestration likely serves as the portal for SEA0400 binding as discussed in the manuscript. As further discussed in the revised manuscript, we suspect this fenestration also allows the tail of a long-chain lipid to enter the same binding pocket for SEA0400 and results in higher affinity binding of a long-chain lipid than a short-chain lipid.

      Reviewer #3 (Public review):

      NCXs are key Ca<sup>2+</sup> transporters located on the plasma membrane, essential for maintaining cellular Ca<sup>2+</sup> homeostasis and signaling. The activities of NCX are tightly regulated in response to cellular conditions, ensuring precise control of intracellular Ca<sup>2+</sup> levels, with profound physiological implications. Building upon their recent breakthrough in determining the structure of human NCX1, the authors obtained cryo-EM structures of NCX1 in complex with its modulators, including the cellular activator PIP2 and the small molecule inhibitor SEA0400. Structural analyses revealed mechanistically informative conformational changes induced by PIP2 and elucidated the molecular basis of inhibition by SEA0400. These findings underscore the critical role of the interface between the transmembrane and cytosolic domains in NCX regulation and small molecule modulation. Overall, the results provide key insights into NCX regulation, with important implications for cellular Ca<sup>2+</sup> homeostasis.

      We appreciate this reviewer’s positive comments.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The manuscript would be strengthened enormously by a much deeper focus on the novel and very interesting PIP2 work, as noted above, and perhaps the removal of the SEA0400 data.

      If that is beyond the scope of the authors' options, then a more robust discussion of limitations of the current work, perhaps speculation regarding other future experiments, a clearer presentation of how these data on SEA0400 are different from/extend from the previously published work, and a better effort to link the two disparate aspects of the work into a more cohesive manuscript should be attempted.

      As discussed in our response to this reviewer’s public review, we combined the study of PIP2 and SEA0400 in this manuscript because both ligands activate or inhibit NCX1 by affecting the Na<sup>+</sup>-dependent inactivation of the exchanger. The functional effects of both ligands on NCX1 have been extensively characterized over the last thirty years. Thus the current study is focused on providing structural explanations for some unique pharmacological features of these ligands. In the revised manuscript, we have added an extra paragraph of discussion that provides a plausible explanation for chain length-dependent PIP2 activation.

      Reviewer #3 (Recommendations for the authors):

      A few comments to consider:

      (1) The short-chain PIP2 appears to have lower potency, but the mechanism remains unclear. Based on structural analyses, are there potential binding sites for the acyl chains of PIP2 that could contribute to this difference?

      As discussed in our response to other reviewers, long-chain anionic lipids can have the same potentiation effect on NCX1 activity as PIP2, but the short-chain ones are ineffective just like short-chain PIP2 diC8. We suspect the tail of a long acyl chain from the native PIP2 can enter the same binding pocket for SEA0400 thereby rendering higher affinity binding for a long-chain lipid than a short-chain lipid. A discussion related to this point has been added to the revised manuscript.

      (2) It is unclear why mutating residues that interact with the IP3 head group retain PIP2 activation. Would it be possible to assess PIP2 and C8 PIP2 binding to these NCX1 variants? Identifying a mutant that abolishes C8 PIP2 binding would be valuable in interpreting those results.

      As the interactions between PIP2 and NCX1 are both electrostatic involving multiple charged residues and hydrophobic involving the long lipid acyl chain, single amino acid substitutions likely only decrease the affinity of PIP2 rather than completely disrupt its binding.  Individual mutants R220A and K225A show a 5-fold decrease in their response to PIP2 application indicating that their replacement alters the affinity of NCX for PIP2.  We have added a new experiment showing that an exchanger with all four residues mutated is insensitive to PIP2 in the revision.

      (3) What are the functional effects of mutating Y226 and R247, residues that seem to play an important role in PIP2-mediated activation?

      In a previous study, a mutation at Y226 (Y226T), which is found within the XIP region of NCX, was shown to enhance Na<sup>+</sup>-dependent inactivation (PMID: 9041455). To our knowledge, the R247 mutation has not been investigated. As R247 is also positioned in the XIP region, we suspect its mutation could directly affect Na<sup>+</sup>-dependent inactivation. This would make it difficult to determine whether the functional effect of the mutation is caused by changing the stability of the XIP region or by changing the binding of PIP2.

      (4) Is there any overlap between the PIP2 and SEA0400 binding regions? Both appear to involve TM4, TM5, and TMD-beta hub interfaces. It might be interesting to discuss any shared mechanisms and why this region might serve as a hotspot for modulation.

      As mentioned in our previous response, we suspect the tail of a long acyl chain from the native PIP2 can enter the same binding pocket for SEA0400 thereby rendering higher affinity binding for a long-chain lipid than a short-chain lipid. A more detailed discussion related to this point has been included in the revision.

      (5) It would be helpful to show the density at the PIP2-binding site in the apo and PIP2-bound structures side by side

      This figure has been added in the revision as Fig. S3.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This important study combines psychophysics, fMRI, and TMS to reveal a causal role of FEF in generating an attention-induced ocular dominance shift, with potential relevance for clinical applications. The evidence supporting the claims of the authors is solid, but the theoretical and mechanistic interpretation of results and experimental approaches need to be strengthened. The work will be of broad interest to perceptual and cognitive neuroscience.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Based on a "dichoptic-background-movie" paradigm that modulates ocular dominance, the present study combines fMRI and TMS to examine the role of the frontoparietal attentional network in ocular dominance shifts. The authors claimed a causal role of FEF in generating the attention-induced ocular dominance shift.

      Strengths:

      A combination of fMRI, TMS, and "dichoptic-background-movie" paradigm techniques is used to reveal the causal role of the frontoparietal attentional network in ocular dominance shifts. The conclusions of this paper are mostly well supported by data.

      Weaknesses:

      (1) The relationship between eye dominance, eye-based attention shift, and cortical functions remains unclear and merits further delineation. The rationale of the experimental design related to the hemispheric asymmetry in the FEF and other regions should be clarified.

      Thanks for the reviewer’s comments! We have further clarified the relationship between eye dominance shift, eye-based attention, and cortical functions in the Introduction and Discussion. In the Introduction, we introduce the modulating effects of eye-based attention on eye dominance. On one hand, eye-based attention can enhance eye dominance of the attended eye in real time (see page 3 first paragraph or below):

      “For instance, presenting top-down attentional cues to one eye can intensify the competition strength of input signals in the attended eye during binocular rivalry (Choe & Kim, 2022; Zhang et al., 2012) and shift the eye balance towards the attended eye (Wong et al., 2021).”

      On the other hand, prolonged eye-based attention can induce a shift of eye dominance to the unattended eye (see page 3 second paragraph or below):

      “In Song et al. (2023)’s “dichoptic-backward-movie” adaptation paradigm (see Figure 1B), participants are presented with regular movie images in one eye (i.e., attended eye) while the other eye (i.e., unattended eye) received the backward movie images of the same episode. They were also instructed to try their best to follow the logic of the regular movie and ignore the superimposed backward movie. Therefore, the goal-directed eye-based attention was predominantly focused on the attended eye. Song et al. (2023) found that the predominance of the unattended eye in binocular rivalry increased after one hour of adaptation to the “dichoptic-backward-movie”, indicating a shift of perceptual ocular dominance towards the unattended eye. Since the overall energy of visual input from the two eyes was balanced throughout the adaptation period, the change of ocular dominance after adaptation is thought to result from unbalanced eye-based attention rather than unbalanced input energy as in typical short-term monocular deprivation (Bai et al., 2017; Lunghi et al., 2011; Zhou et al., 2014).”

      Moreover, we discussed how FEF regulates attention-induced ocular dominance shift (see page 21 second paragraph to page 23 first paragraph or below, which also respond to this reviewer’s comment of Weakness #2):

      “Then how does FEF regulate the attention-induced ocular dominance shift? Our previous work has found that the aftereffect (for simplicity, hereafter we use aftereffect to denote the attention-induced ocular dominance shift) can be produced only when the adapting stimuli involve adequate interocular competition, and is measurable only when the testing stimuli are not binocularly fused (Song et al., 2023). Given the indispensability of interocular competition, we explained those findings in the framework of the ocular-opponency-neuron model of binocular rivalry (Said & Heeger, 2013). The model suggests that there are some opponency neurons which receive excitatory inputs from monocular neurons for one eye and inhibitory inputs from monocular neurons for the other eye (e.g. AE-UAE opponency neurons receive excitatory inputs from the attended eye (AE) and inhibitory inputs from the unattended eye (UAE)). Then a difference signal is computed so that the opponency neurons fire if the excitatory inputs surpass the inhibitory inputs. Upon activation, the opponency neurons will in turn suppress the monocular neurons which send inhibitory signals to them.

      Based on this model, we proposed an ocular-opponency-neuron adaptation account to explain the aftereffect, and pointed out that the attentional system likely modulated the AE-UAE ocular opponency neurons (Song et al., 2023). So why would FEF modulate the AE-UAE opponency neurons? The reason may be twofold. Firstly, understanding the logic during the dichoptic-backward-movie viewing may require filtering out the distracting information (from the unattended eye) and sustaining attention (to the attended eye), which is exactly the role of FEF (Esterman et al., 2015; Lega et al., 2019).

      Secondly, due to the special characteristics of binocular vision system, filtering the distracting input from the unattended eye may have to rely on the interocular suppression mechanism. According to the ocular-opponency-neuron model, this is achieved by the firing of the AE-UAE opponency neurons that send inhibitory signals to the UAE monocular neurons.

      As mentioned previously, the firing of the AE-UAE opponency neurons requires stronger activity for the AE monocular neurons than for the UAE monocular neurons. This is confirmed by the results shown in Figure 8 of Song et al. (2023) that monocular response for the attended eye during the entire adaptation phase was slightly stronger than that for the unattended eye. Accordingly, during adaptation the AE-UAE opponency neurons were able to activate for a longer period thus adapted to a larger extent than the UAE-AE opponency neurons. This would cause the monocular neurons for the unattended eye to receive less inhibition from the AE-UAE opponency neurons in the post-test as compared with the pre-test, leading to a shift of ocular dominance towards the unattended eye. In this vein, the magnitude of this aftereffect should be proportional to the extent of adaptation of the AE-UAE relative to UAE-AE opponency neurons. Attentional enhancement on the AE-UAE opponency neurons is believed to strengthen this aftereffect, as it has been found that attention can enhance adaptation (Dong et al., 2016; Rezec et al., 2004). Inhibition of FEF likely led such attentional modulation to be much less effective. Consequently, the AE-UAE opponency neurons might not have the chance to adapt to a sufficiently larger extent than the UAE-AE opponency neurons, leading to a statistically non-detectable aftereffect in Experiment 2. Therefore, the results of Experiments 2-4 in the present study suggest that within the context of the ocular-opponency-neuron adaptation account, FEF might be the core area to fulfill the attentional modulations on the AE-UAE opponency neurons.”
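
      For illustration only, the adaptation account quoted above can be sketched as a toy calculation. This is not the authors' model code, nor the implementation of Said & Heeger (2013): the function name `dominance_shift`, the leaky-integrator form of adaptation, and all parameter values are assumptions chosen purely to show the direction of the predicted effect. Each opponency population responds to the rectified difference between its excitatory and inhibitory monocular inputs, attention scales the AE-UAE response, and the returned value is the difference in accumulated adaptation between the two populations (a larger value corresponds to a larger shift towards the unattended eye).

```python
def dominance_shift(ae_drive=1.1, uae_drive=1.0, attention_gain=1.5,
                    steps=3000, rate=0.001):
    """Toy estimate of the aftereffect under the opponency-adaptation account.

    All parameters are hypothetical. ae_drive and uae_drive are the monocular
    responses for the attended (AE) and unattended (UAE) eye; attention_gain
    scales the response of the AE-UAE opponency population only.
    """
    adapt_ae_uae = 0.0  # adaptation state of the AE-UAE opponency neurons
    adapt_uae_ae = 0.0  # adaptation state of the UAE-AE opponency neurons
    for _ in range(steps):
        # Opponency neurons fire only if excitation exceeds inhibition.
        resp_ae_uae = attention_gain * max(0.0, ae_drive - uae_drive)
        resp_uae_ae = max(0.0, uae_drive - ae_drive)
        # Adaptation drifts slowly towards the current response level.
        adapt_ae_uae += rate * (resp_ae_uae - adapt_ae_uae)
        adapt_uae_ae += rate * (resp_uae_ae - adapt_uae_ae)
    # More adaptation of AE-UAE than UAE-AE means less inhibition of the
    # unattended eye after adaptation, i.e. a shift towards that eye.
    return adapt_ae_uae - adapt_uae_ae


if __name__ == "__main__":
    with_attention = dominance_shift(attention_gain=1.5)
    reduced_attention = dominance_shift(attention_gain=1.0)
    print(with_attention > reduced_attention)
```

      In this sketch, lowering `attention_gain` stands in for a less effective attentional modulation (as hypothesized after FEF inhibition) and yields a smaller difference in adaptation, while equal monocular drives yield no difference at all.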

      We used the experimental design with hemispheric asymmetry in the FEF and other regions for two reasons. First, many studies have shown that the dorsal attentional network has a functional right-hemisphere dominance (Duecker et al., 2013; Mayrhofer et al., 2019; Sack, 2010). This was also indicated by the results of Experiment 1 (Figure 3). Second, a recent study applying TMS to FEF and IPS stimulated only the right hemisphere (Gallotto et al., 2022). Therefore, we selected the right FEF and right IPS as the target regions for cTBS. In the Methods section of Experiment 2, we have explained the reasons for the selection of cTBS target regions (see page 35, first paragraph or below):

      “Given that the dorsal attentional network primarily consists of the FEF and the IPS (Corbetta & Shulman, 2002; Mayrhofer et al., 2019), with a functional right-hemisphere dominance (Duecker et al., 2013; Mayrhofer et al., 2019; Sack, 2010), we selected the right FEF and right IPS from the four clusters identified in Experiment 1 as the target regions for cTBS (Gallotto et al., 2022).”

      (2) Theoretically, how the eye-related functions in this area could be achieved, and how it interacts with the ocular representation in V1 warrant further clarification.

      Thanks for the reviewer’s comment! In the revised manuscript, we have discussed how FEF regulates attention-induced ocular dominance shift (see page 21 second paragraph to page 23 first paragraph or the quoted paragraphs under this reviewer’s first Public comment).

      Reviewer #2 (Public Review):

      Summary

      Song et al investigate the role of the frontal eye field (FEF) and the intraparietal sulcus (IPS) in mediating the shift in ocular dominance (OD) observed after a period of dichoptic stimulation during which attention is selectively directed to one eye. This manipulation has been previously found to transiently shift OD in favor of the unattended eye, similar to the effect of short-term monocular deprivation. To this aim, the authors combine psychophysics, fMRI, and transcranial magnetic stimulation (TMS). In the first experiment, the authors determine the regions of interest (ROIs) based on the responses recorded by fMRI during either dichoptic or binocular stimulation, showing selective recruitment of the right FEF and IPS during the dichoptic condition, in line with the involvement of eye-based attention. In a second experiment, the authors investigate the causal role of these two ROIs in mediating the OD shift observed after a period of dichoptic stimulation by selectively inhibiting with TMS (using continuous theta burst stimulation, cTBS), before the adaptation period (50 min exposure to dichoptic stimulation). They show that, when cTBS is delivered on the FEF, but not the IPS or the vertex, the shift in OD induced by dichoptic stimulation is reduced, indicating a causal involvement of the FEF in mediating this form of short-term plasticity. A third control experiment rules out the possibility that TMS interferes with the OD task (binocular rivalry), rather than with the plasticity mechanisms. From this evidence, the authors conclude that the FEF is one of the areas mediating the OD shift induced by eye-selective attention.

      Strengths

      (1) The experimental paradigm is sound and the authors have thoroughly investigated the neural correlates of an interesting form of short-term visual plasticity combining different techniques in an intelligent way.

      (2) The results are solid and the appropriate controls have been performed to exclude potential confounds.

      (3) The results are very interesting, providing new evidence both about the neural correlates of eye-based attention and the involvement of extra-striate areas in mediating short-term OD plasticity in humans, with potential relevance for clinical applications (especially in the field of amblyopia).

      Weaknesses

      (1) Ethics: more details about the ethics need to be included in the manuscript. It is only mentioned for experiment 1 that participants "provided informed consent in accordance with the Declaration of Helsinki. This study was approved by the Institutional Review Board of the Institute of Psychology, Chinese Academy of Sciences". (Which version of the Declaration of Helsinki? The latest version requires the pre-registration of the study. The code of the approved protocol together with the code and date of the approval should be provided.) There is no mention of informed consent procedures or ethics approval for the TMS experiments. This is a huge concern, especially for brain stimulation experiments!

      Response: Thanks for the reviewer’s comment! In the revised manuscript, we have provided the code of the approved protocol and date of the approval (see page 25 second paragraph or below):

      “This study was approved (H21058, 11/01/2021) by the Institutional Review Board of the Institute of Psychology, Chinese Academy of Sciences.”

      Indeed, ethics approval and informed consent were obtained for each experiment. To avoid duplication in the text, we only presented the ethics instructions in the Methods section of Experiment 1. We have now clarified in that section that all the experiments in this study were approved by the IRB in our Institute.

      (2) Statistics: the methods section should include a sub-section describing in detail all the statistical analyses performed for the study. Moreover, in the results section, statistical details should be added to support the fMRI results. In the current version of the manuscript, the claims are not supported by statistical evidence.

      Response: Thanks for the reviewer’s suggestion! In the Methods section of the revised manuscript, we have added a section describing the statistical analyses for each experiment in detail (see page 37 last paragraph for Experiment 2 and page 38 last paragraph for Experiment 3 or below):

      “Statistical analyses were performed using MATLAB. A 3 (stimulation site: Vertex, FEF, IPS) × 2 (test phase: pre-test and post-test) repeated measures ANOVA was used to investigate the effect of cTBS delivery on ocular dominance shift. Moreover, for the blob detection test, the target detection rate of each experimental condition was calculated by dividing the summed number of detected blob targets by the total number of blob targets. Then, a 2 (eye: attended eye, unattended eye) × 3 (stimulation site: Vertex, FEF, IPS) repeated measures ANOVA on the detection performance was performed. Post-hoc tests were conducted using paired t-tests (2-tailed significance level at α = 0.05), and the resulting p-values were corrected for multiple comparisons using the false discovery rate (FDR) method (Benjamini & Hochberg, 1995).”
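
      The detection-rate calculation and the FDR correction described in the quoted paragraph can be sketched as follows. This is only an illustrative Python sketch, not the MATLAB code used in the study: the function names and all numbers in the usage note are hypothetical, and the correction is written out as the standard Benjamini-Hochberg step-up adjustment.

```python
def detection_rate(detected_counts, total_counts):
    """Summed number of detected blob targets divided by the total number of targets."""
    return sum(detected_counts) / sum(total_counts)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted
```

      For example, with hypothetical post-hoc p-values, `benjamini_hochberg([0.01, 0.04, 0.03])` yields adjusted values that can then be compared against α = 0.05.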

      “In addition to the data analysis in Experiment 2, we complemented the standard inferential approach with the Bayes factor (van den Bergh et al., 2023; van Doorn et al., 2021; Wagenmakers et al., 2018), which allows quantifying the relative evidence that the data provide for the alternative (H1) or null hypothesis (H0). We conducted the Bayesian repeated measures ANOVA using JASP with default priors and computed inclusion Bayes factors (BFincl) which suggest the evidence for the inclusion of a particular effect calculated across matched models. A BF greater than 1 provides support for the alternative hypothesis. Specifically, a BF between 1 and 3 indicates weak evidence, a BF between 3 and 10 indicates moderate evidence, and a BF greater than 10 indicates strong evidence (van Doorn et al., 2021). In contrast, a BF below 1 provides evidence in favor of the null hypothesis.”
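
      The verbal categories for the Bayes factor in the quoted paragraph can be written as a small lookup. This is a hedged sketch rather than anything produced by JASP: the function name and the label strings are hypothetical, and assigning the boundary values 3 and 10 to the "moderate" category is an assumption, since the quoted text does not say which side the boundaries fall on.

```python
def bf_evidence_label(bf):
    """Map a Bayes factor (H1 relative to H0) to the verbal categories quoted above."""
    if bf <= 0:
        raise ValueError("A Bayes factor must be positive.")
    if bf < 1:
        return "evidence for H0"
    if bf == 1:
        return "no preference"
    if bf < 3:
        return "weak evidence for H1"
    # Assumption: the boundary values 3 and 10 are counted as moderate.
    if bf <= 10:
        return "moderate evidence for H1"
    return "strong evidence for H1"
```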

      Furthermore, in the Results section of the revised manuscript, we have added the statistical details to support the fMRI results (see page 9 last paragraph or below):

      “To seek these brain regions, we used the AFNI program “3dttest++” to assess the difference of ‘dichoptic-binocular’ contrast between the experimental and control runs. The AFNI program “ClustSim” was then applied for multiple comparison correction, yielding a minimum significant cluster size of 21 voxels (voxel-wise p = .001; cluster threshold α = 0.05). We found 4 clusters showing stronger responses to the dichoptic movies than to the binocular movies especially in the experimental runs.”

      (3) Interpretation of the results: the TMS results are very interesting and convincing regarding the involvement of the FEF in the build-up of the OD shift induced by dichoptic stimulation, however, I am not sure that the authors can claim that this effect is related to eye-based attention, as cTBS has no effect on the blob detection task during dichoptic stimulation. If the FEF were causally involved in eye-based attention, one would expect a change in performance in this task during dichoptic stimulation, perhaps a similar performance for the unattended and attended eye. The authors speculate that the sound could have an additional role in driving eye-based attention, which might explain the lack of effect for the blob discrimination task, however, this hypothesis has not been tested.

      Response: Thanks for the reviewer’s comment! Following this reviewer’s insightful suggestion, we have conducted a new experiment to examine the effect of sound on the blob detection task (see Experiment 4 in the revised manuscript). The procedure was similar to that of Experiment 2 except that the sound was no longer presented during the dichoptic-backward-movie adaptation. The results showed that the interocular difference in blob detection rate remained unaffected by the cTBS after the sound was eliminated, which is inconsistent with the explanation we offered in the previous version of the manuscript. Based on the new data, we now question the validity of using the blob detection rate to precisely quantify eye-based attention. In the Discussion of the revised manuscript, we have tried to explain why the blob detection results do not contradict our account of the functional role of FEF in modulating the aftereffect (see page 23 second paragraph to page 24 first paragraph or below):

      “An unresolved issue is why inhibiting the cortical function of FEF did not impair the performance of blob detection task. One potential explanation is that the synchronized audio in Experiment 2 might help increase the length of time that the regular movie dominated awareness. However, the results of Experiment 4 did not support this explanation, in which the performance of blob detection survived from the inhibition of FEF even when silent movies were presented. Although this issue remains to be explored in future work, it does not contradict with our notion of FEF modulating AE-UAE opponency neurons. It should be noted that our notion merely states that FEF is the core area for attentional modulations on activities of AE-UAE opponency neurons. No other role of FEF during the adaptation is assumed here (e.g. boosting monocular responses or increasing conscious level of stimuli in the attended eye). In contrast, according to the most original definition, the blob detection performance serves as an estimation of visibility (or consciousness level) of the stimuli input from each eye, despite the initial goal of adopting this task is to precisely quantify eye-based attention (which might be impractical). Thus, according to our notion, inhibition of FEF does not necessarily lead to deteriorate performance of blob detection. Furthermore, our findings consistently indicated that the visibility of stimuli in the attended eye was markedly superior to that of stimuli in the unattended eye, yet the discrepancy in the SSVEP monocular responses between the two eyes was minimal though it had reached statistical significance (Song et al., 2023). Therefore, blob detection performance in our work may only faithfully reflect the conscious level in each monocular pathway, but it is probably not an appropriate index tightly associated with the attentional modulations on monocular responses in early visual areas. 
      Indeed, previous work has argued that attention but not awareness modulates neural activities in V1 during interocular competition (Watanabe et al., 2011), but see (Yuval-Greenberg & Heeger, 2013). We have noticed and discussed the counterintuitive results of blob detection performance in our previous work (Song et al., 2023). Here, with the new counterintuitive finding that inhibition of FEF did not impair the performance of blob detection, we suspect that blob detection performance in the “dichoptic-backward-movie” adaptation paradigm may not be an ideal index that can be used to accurately quantify eye-based attention.”

      (4) Writing: in general, the manuscript is well written, but clarity should be improved in certain sections.

      (a) fMRI results: the first sentence is difficult to understand at first read, but it is crucial to understand the results, please reformulate and clarify.

      Response: Thanks for the reviewer’s suggestion! In the revised manuscript, we have reformulated this sentence (see page 9 last paragraph or below):

      “It was only in the dichoptic condition of experimental runs that participants had to selectively pay more attention to one eye (i.e., eye-based attention). Therefore, we speculate that if certain brain regions exhibit greater activities in the dichoptic condition as compared to the binocular condition in the experimental runs but not in the control runs, the activation of these brain regions could be attributable to eye-based attention.”

      (b) Experiment 3: the rationale for experiment one should be straightforward, without a long premise explaining why it would not be necessary.

      Response: Thanks for the reviewer’s suggestion! In the revised manuscript, we have streamlined the lengthy premise to make the rationale of Experiment 3 more straightforward (see page 15 last two paragraphs or below):

      “The results of Experiment 2 support the notion that eye-based attention was the cause for attention-induced ocular dominance plasticity. However, an alternative account is that the significant two-way interaction between test phase and stimulation site did not stem from any persistent malfunction of FEF in modulating ocular dominance, but rather it was due to some abnormality of binocular rivalry measures in the post-test that occurred after stimulation at the FEF only (and not at the other two brain sites). For instance, stimulation at the FEF might simply reduce the ODI measured in the binocular rivalry post-test.

      Therefore, we conducted Experiment 3 to examine how suppression of the three target sites would impact binocular rivalry performance, in case that any unknown confounding factors, which were unrelated to adaptation but related to binocular rivalry measures, contributed to the results.”

      (c) Discussion: the language is a bit familiar here and there, a more straightforward style should be preferred (one example: p.19 second paragraph).

      Response: Thanks for the reviewer’s suggestion! We have carefully revised the language in the discussion. The discussion following the example paragraph has been largely rewritten.

      (5) Minor: the authors might consider using the term "participant" or "observer" instead of "subject" when referring to the volunteers who participated in the study.

      Response: Thanks for the reviewer’s suggestion! In the revised manuscript, we have replaced the term “subject” with “participant”.

      Reviewer #3 (Public Review):

      Summary:

      This study investigated the neural mechanisms underlying the shift of ocular dominance induced by "dichoptic-backward-movie" adaptation. The study is self-consistent.

      Strengths:

      The experimental design is solid and progressive (relationship among three studies), and all of the raised research questions were well answered.

      The logic behind the neural mechanisms is solid.

      The findings regarding the cTMS (especially the position/site) can be useful for future medical implications.

      Weaknesses:

      Why does the "dichoptic-backward-movie" adaptation matter? This part is severely missing. This kind of adaptation is neither intuitive like the classical (Gibson) visual adaptation, nor practical as adaptation as a research paradigm as well as the fundamental neural mechanism. If this part is not clearly stated and discussed, this study is just self-consistent in terms of its own research question. There are tons of "cool" phenomena in which the neural mechanisms are apparent as "FEF controls vision-attention" but never tested using TMS & fMRI, but we all know that this kind of research is just of incremental implications.

      Response: Thanks for the reviewer’s comment! We designed the "dichoptic-backward-movie" adaptation to study the perceptual consequence and mechanisms of sustained attention to a monocular pathway. Since the overall visual input to both eyes during adaptation was identical, any effect after adaptation (i.e., the change of ocular dominance in our study) can be readily ascribed to unbalanced eye-based attention between the two eyes rather than unbalanced input energy across the eyes. In typical short-term monocular deprivation, the input signal from one eye is blocked. Accordingly, attention is undoubtedly distributed to the non-deprived eye. The fact that in a short-term monocular deprivation paradigm the deprived eye is also the unattended eye prevents researchers from ascertaining whether unbalanced eye-based attentional allocation contributes to the shift of ocular dominance just like unbalanced visual input across the two eyes. That is why the “dichoptic-backward-movie” adaptation was adopted in the present study. This new paradigm balances the input energy across the eyes but leaves attention unbalanced across the eyes. In the revised manuscript, we have added the description of the “dichoptic-backward-movie” adaptation (see page 3 last paragraph and page 4 first paragraph or below). We hope this additional information improves clarity.

      “In Song et al. (2023)’s “dichoptic-backward-movie” adaptation paradigm (see Figure 1B), participants are presented with regular movie images in one eye (i.e., attended eye) while the other eye (i.e., unattended eye) received the backward movie images of the same episode. They were also instructed to try their best to follow the logic of the regular movie and ignore the superimposed backward movie. Therefore, the goal-directed eye-based attention was predominantly focused on the attended eye. Song et al. (2023) found that the predominance of the unattended eye in binocular rivalry increased after one hour of adaptation to the “dichoptic-backward-movie”, indicating a shift of perceptual ocular dominance towards the unattended eye. Since the overall energy of visual input from the two eyes was balanced throughout the adaptation period, the change of ocular dominance after adaptation is thought to result from unbalanced eye-based attention rather than unbalanced input energy as in typical short-term monocular deprivation (Bai et al., 2017; Lunghi et al., 2011; Zhou et al., 2014).” In short-term monocular deprivation, input signal from one eye is blocked. Accordingly, attention is biased towards the non-deprived eye. However, it is difficult to tease apart the potential contribution of unbalanced eye-based attention from the consequence of the unbalanced input energy, as the deprived eye is also the unattended eye. Therefore, the advantage of the “dichoptic-backward-movie” adaptation paradigm is to balance the input energy across the eyes but leave attention unbalanced across the eyes.

      Our previous work (Song et al., 2023) has shown that eye-based attention plays a role in the formation of ocular dominance shift following adaptation to dichoptic backward movie. However, because the “dichoptic-backward-movie” adaptation paradigm is new, to our knowledge, no study has yet identified the brain areas that are responsible for eye-based attention. Our fMRI experiment addresses this issue for the first time, which, we believe, is one of the novelties of the present study. Attention is a rather general term for our ability to select limited information for preferential or privileged processing, yet it includes numerous aspects (e.g., spatial attention for spatial locations, feature-based attention for visual features, object-based attention for objects, social attention for social cues, and eye-based attention for monocular pathways, etc.). Are we 100% sure that the same brain network always underlies every aspect of attention including eye-based attention? Without a test, there is no answer. Maybe the answer is yes, but we are not aware of any evidence for that in the literature. It is not unlikely that attention is like an elephant while researchers are like blind people touching the elephant from different angles. Even if all previous researchers have touched the side of the elephant and state that an elephant is no different from a wall, as long as one researcher grabs the elephant’s tail, the “wall” knowledge will be falsified. From this perspective of the essence of science (falsifiability), we have the confidence to say that our fMRI experiment on eye-based attention is novel, because to our knowledge our experiment is the first one to explore the issue. On the basis of the fMRI experiment (otherwise we would have no idea on which precise brain site to apply the cTBS), we could successfully complete the subsequent TMS experiments.

      Of course, if the reviewer can kindly point out any previous neuroimaging work we missed that has already disclosed the neural mechanisms underlying human eye-based attention, we would truly appreciate it. But even so, we would like to emphasize that the purpose of the current study was not to use TMS & fMRI to confirm that “FEF controls visual attention”. As we mentioned in the Abstract and expanded on in the last two paragraphs of the Introduction, the goal of the TMS experiments is to examine the causal role of eye-based attention in producing the aftereffect of “dichoptic-backward-movie” adaptation. This research question is also new; thus we do not think the TMS experiments are incremental, either. Our findings provided direct causal evidence for the effect of FEF on modulating ocular dominance through eye-based attention. Please see the last two sentences in the first paragraph on page 20 in the revised manuscript or below:

      “Interestingly, in our Experiment 2 this aftereffect was significantly attenuated after we temporarily inhibited the cortical function of FEF via cTBS. This finding indicates the crucial role of FEF in the formation of attention-induced ocular dominance shift.”

      as well as the last sentence of the Abstract,

      “…and in this network, FEF plays a crucial causal role in generating the attention-induced ocular dominance shift.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) The hemispheric asymmetry in the eye-based attention-related cortex should be further examined and discussed. For example, IPS in both hemispheres was identified in the fMRI experiment. It is not clear why only the right IPS was stimulated in the TMS experiment.

      Response: Thanks for the comment. We have elucidated the reasons for the experimental design with hemispheric asymmetry in FEF and IPS. Please see our response to the Weakness #1 raised by Reviewer #1 in the Public Review section.

      (2) It is known that the frontoparietal cortex plays a role in the contralateral shift of attentional allocation. Meanwhile, the latest stage of ocular-specific representation is V1. The authors should discuss how the eye-related function can be achieved in FEF.

      Response: Thanks for the comment. We have discussed how FEF regulates attention-induced ocular dominance shift (see page 21 second paragraph to page 23 first paragraph in the revised manuscript, and our response to the Weakness #2 raised by Reviewer #1 in the Public Review section).

      (3) To further validate the role of FEF in eye-related attention shifts, the authors may consider using the traditional monocular deprivation paradigm with fMRI and TMS. It would be valuable to compare the neural mechanisms related to the classical monocular deprivation paradigm with the current findings.

      Response: Thanks for the reviewer’s suggestion! That is indeed an interesting research topic that we are currently exploring. The current study investigated the attention-induced ocular dominance shift with the “dichoptic-backward-movie-adaptation” paradigm. This paradigm is substantially different from traditional short-term monocular deprivation. In our Neuroscience Bulletin paper (Song et al. 2023), we discuss the reason as follows.

      “An alternative account of our results is the homeostatic plasticity mechanism. The function of this mechanism is to stabilize neuronal activity and prevent the neuronal system from becoming hyperactive or hypoactive. For this goal, the mechanism moves the neuronal system back toward its baseline after a perturbation [51, 52]. In our case, the aftereffect can be explained such that the visual system boosts the signals from the unattended eye to maintain the balance of the network’s excitability. However, this account cannot easily explain why the change of neural ocular dominance led by prolonged eye-based attention was observed here using the binocular rivalry testing stimuli, but absent in the previous research using the binocularly fused stimuli [11]. In contrast, a recent SSVEP study also using the binocularly fused stimuli has successfully revealed a shift of neural ocular dominance after two hours of monocular deprivation [31], which is in line with the homeostatic plasticity account. Therefore, the mechanisms underlying the “dichoptic-backward-movie” adaptation and monocular deprivation are probably not fully overlapped with each other; and the binocular rivalry mechanism described in the ocular-opponency-neuron model seems to be more preferable than the homeostatic plasticity mechanism in accounting for the present findings.”

      Therefore, before asking whether FEF plays a role in the attention-induced ocular dominance shift in a traditional monocular deprivation paradigm, one should probably first examine whether attention also plays a role in traditional monocular deprivation, and whether the ocular-opponency-neuron adaptation account can also be used to explain the traditional monocular deprivation effect. Our newly accepted paper “Negligible contribution of adaptation of ocular opponency neurons to the effect of short-term monocular deprivation” (https://www.frontiersin.org/articles/10.3389/fpsyg.2023.1282113/full) gives a generally negative answer to the second question. As to the first question, we have one manuscript under review and another ongoing study. In other words, to give a satisfactory answer to this particular comment, we first need clear answers to the two questions above. We think this is far beyond the scope of one single manuscript.

      (4) The authors only presented regular movies to the dominant eye to maximize the ocular dominance shift. This critical information of design should be clarified, not only in the method section.

      Response: Thanks for the reviewer’s suggestion! In the Results section of Experiment 2, we have added a description of this critical information of design (see page 11 last paragraph to page 12 first paragraph or below):

      “Then, participants adapted to the “dichoptic-backward-movie” in which regular movie images were presented to the dominant eye to maximize the effect of eye dominance shift (Song et al., 2023). Meanwhile they were asked to detect some infrequent blob targets presented on the movie images in one eye at the same time.”

      (5) The frame rate of the movie is 30 fps, which is much lower than a typical 60 fps visual presentation, does this have an effect on the adaptation outcome?

      Response: To the best of our knowledge, there is no evidence that the frame rate of the movie influences the aftereffect of attention-induced ocular dominance shift. In our previous research, the frame rate of the movie during adaptation was 25 fps, which still produced a stable adaptation aftereffect (Song et al., 2023). The frame rate of the movie was 30 fps in our monocular deprivation work (Lyu et al., 2020), which showed a monocular deprivation effect similar to the one we previously observed in an altered reality study (Bai et al., 2017), where the frame rate of the altered-reality video was 60 fps. Taken together, these observations suggest that the frame rate is unlikely to affect the adaptation outcome.

      (6) Figure 5: The ODSE derived from ODI in Experiment 3 should also be illustrated, for a better comparison with results from Experiment 2.

      Response: Thanks for the reviewer’s suggestion! In the revised manuscript, we have added the results of ODSE in Experiment 3 to Figure 5 (see page 15 or below):

      Author response image 1.

      Figure 5. The results of (A) the ocular dominance index (ODI), (B) the ocular dominance shift effects (ODSE) in Experiment 2, (C) the ODI and (D) the ODSE in Experiment 3. The bars show the grand average data for each condition. The individual data are plotted with gray lines or dots. The dashed gray line represents the absolute balance point for the two eyes (ODI = 0.5). Error bars indicate standard errors of means. * p < .05; ** p < .01; n.s. p > .05.

      (7) Spelling issues: "i.e." → "i.e.,"

      Response: Thanks for the reviewer’s suggestion! In the revised manuscript, we have changed “i.e.” to “i.e.,”.

      Reviewer #2 (Recommendations For The Authors):

      Linked to weakness 3: Ideally, a control experiment with cTBS and dichoptic stimulation without sound but with the blob discrimination task should be performed to be able to make important claims about the neural mechanisms involved in eye-based attention.

      Response: Thanks for the comment. We have performed a new experiment as the reviewer suggested. Please see our response to the Weakness #3 raised by Reviewer #2 in the Public Review section.

      Reviewer #3 (Recommendations For The Authors):

      (1) The neural mechanisms are so apparent. We all know the FEF\IPS\SC matter in vision and attention and gaze. This is not groundbreaking.

      Response: As we addressed in our response to Reviewer #3’s public comment, the current study aimed at investigating the causal mechanism for eye-based attentional modulation of ocular dominance plasticity rather than simply the role of FEF\IPS\SC in visual attention. Moreover, eye-based attention is a less investigated aspect of visual attention. The neural mechanism underlying eye-based attention is still largely unknown, and identifying the brain areas that control eye-based attention is the necessary preparation work for applying the cTBS. In our response to Reviewer #3’s public comment, we have explained in detail why we think both the fMRI and TMS experiments are novel to the field; we do not reiterate those arguments here to avoid redundancy.

      (2) Why does the "dichoptic-backward-movie" adaptation matter? Is playing a backward movie to one eye realistic? Does that follow the efficient coding? Is that a mere consequence of information theory?

      Response: Thanks for the comments. We have added the description of the “dichoptic-backward-movie” adaptation paradigm in the revised manuscript (see page 3 last paragraph and page 4 first paragraph or our response to this reviewer’s Public comment).

      Is it realistic to play a backward movie to one eye? We find this question somewhat ambiguous. If the reviewer means the technical operability of such stimulus presentation, we can confirm it, since we have used this paradigm in both the current and previously published studies. To be more specific, we made the video stimuli in advance. The left half of the video was the regular movie and the right half was the backward version of the same movie (or vice versa). When viewing such video stimuli through stereoscopes, participants could only see the left half of the video with the left eye and the right half of the video with the right eye. In other words, the regular movie and backward movie were viewed dichoptically. Alternatively, if the reviewer means that such dichoptic presentation rarely happens in the real world and is thus not realistic, we agree with the reviewer on the one hand. On the other hand, we have explained on page 3 last paragraph and page 4 first paragraph why it is a particularly useful paradigm for the main purpose of the present study. Consider a similar example. The phenomenon of binocular rivalry rarely happens in everyday life, so people may say binocular rivalry is not realistic. However, our visual system does have the ability to deal with such conflicting visual inputs across the eyes, even though binocular rivalry is unrealistic. Sometimes it is fun to investigate those seemingly unrealistic functions of our brains, since they may also reveal the mystery of our neural system. Although binocular rivalry is uncommon in daily life, it is frequently used to investigate awareness. And in our work, we use binocular rivalry to measure perceptual ocular dominance.
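
      The stimulus construction described above (regular movie in one half of the video, backward version of the same movie in the other half) can be illustrated with a minimal Python sketch. This is not the code used to generate the actual stimuli: the function name, the `regular_on_left` parameter, and the representation of each frame as a list of pixel rows are all assumptions made for illustration.

```python
def make_dichoptic_frames(frames, regular_on_left=True):
    """Pair each regular frame with the time-reversed movie, side by side.

    frames: list of frames in playback order; each frame is a list of rows,
    and each row is a list of pixel values (hypothetical representation).
    """
    backward = frames[::-1]  # same movie played from the last frame to the first
    combined = []
    for regular_frame, backward_frame in zip(frames, backward):
        if regular_on_left:
            left, right = regular_frame, backward_frame
        else:
            left, right = backward_frame, regular_frame
        # Concatenate each row so the two movies occupy the two halves of the frame.
        combined.append([l_row + r_row for l_row, r_row in zip(left, right)])
    return combined
```

      Viewed through a stereoscope, the left half of each combined frame would reach only the left eye and the right half only the right eye, which is the dichoptic arrangement described in the response.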

      Finally, the reviewer asked whether the "dichoptic-backward-movie" adaptation paradigm follows efficient coding and information theory. Information theory and efficient coding assume that messages with low expectedness or of rare occurrence attract more attention and induce larger neural responses than those with high expectedness. In the "dichoptic-backward-movie" adaptation paradigm, the backward movie should be less expected, since the actions of the characters in the backward movie appeared illogical. Thus, according to information theory and efficient coding, more attention would be expected to be paid to the backward movie, and the backward movie might therefore dominate awareness for a longer period during adaptation (Zhang et al., 2012). However, we instructed participants to follow the regular movie during adaptation. The results of the blob detection task also showed better performance when the targets appeared in the eye presented with the regular movie, which contradicted the prediction of information theory and efficient coding. Thus, it seems unlikely that the "dichoptic-backward-movie" adaptation followed efficient coding and information theory.

      References

      Bai, J., Dong, X., He, S., & Bao, M. (2017). Monocular deprivation of Fourier phase information boosts the deprived eye’s dominance during interocular competition but not interocular phase combination. Neuroscience, 352, 122-130. https://doi.org/10.1016/j.neuroscience.2017.03.053

      Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289-300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x

      Choe, E., & Kim, M.-S. (2022). Eye-specific attentional bias driven by selection history. Psychonomic Bulletin & Review, 29(6), 2155-2166. https://doi.org/10.3758/s13423-022-02121-0

      Corbetta, M., & Shulman, G. L. (2002). Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience, 3(3), 201-215. https://doi.org/10.1038/nrn755

      Dong, X., Gao, Y., Lv, L., & Bao, M. (2016). Habituation of visual adaptation. Scientific Reports, 6, 19152. https://doi.org/10.1038/srep19152

      Duecker, F., Formisano, E., & Sack, A. T. (2013). Hemispheric differences in the voluntary control of spatial attention: direct evidence for a right-hemispheric dominance within frontal cortex. Journal of Cognitive Neuroscience, 25(8), 1332-1342. https://doi.org/10.1162/jocn_a_00402

      Esterman, M., Liu, G., Okabe, H., Reagan, A., Thai, M., & DeGutis, J. (2015). Frontal eye field involvement in sustaining visual attention: evidence from transcranial magnetic stimulation. Neuroimage, 111, 542-548. https://doi.org/10.1016/j.neuroimage.2015.01.044

      Gallotto, S., Schuhmann, T., Duecker, F., Middag-van Spanje, M., de Graaf, T. A., & Sack, A. T. (2022). Concurrent frontal and parietal network TMS for modulating attention. iScience, 25(3), 103962. https://doi.org/10.1016/j.isci.2022.103962

      Lega, C., Ferrante, O., Marini, F., Santandrea, E., Cattaneo, L., & Chelazzi, L. (2019). Probing the neural mechanisms for distractor filtering and their history-contingent modulation by means of TMS. Journal of Neuroscience, 39(38), 7591-7603. https://doi.org/10.1523/JNEUROSCI.2740-18.2019

      Lunghi, C., Burr, D. C., & Morrone, C. (2011). Brief periods of monocular deprivation disrupt ocular balance in human adult visual cortex. Current Biology, 21(14), R538-539. https://doi.org/10.1016/j.cub.2011.06.004

      Lyu, L., He, S., Jiang, Y., Engel, S. A., & Bao, M. (2020). Natural-scene-based Steady-state Visual Evoked Potentials Reveal Effects of Short-term Monocular Deprivation. Neuroscience, 435, 10-21. https://doi.org/10.1016/j.neuroscience.2020.03.039

      Mayrhofer, H. C., Duecker, F., van de Ven, V., Jacobs, H. I., & Sack, A. T. (2019). Hemifield-specific correlations between cue-related blood oxygen level dependent activity in bilateral nodes of the dorsal attention network and attentional benefits in a spatial orienting paradigm. Journal of Cognitive Neuroscience, 31(5), 625-638. https://doi.org/10.1162/jocn_a_01338

      Rezec, A., Krekelberg, B., & Dobkins, K. R. (2004). Attention enhances adaptability: evidence from motion adaptation experiments. Vision Research, 44(26), 3035-3044. https://doi.org/10.1016/j.visres.2004.07.020

      Sack, A. T. (2010). Using non-invasive brain interference as a tool for mimicking spatial neglect in healthy volunteers. Restorative Neurology and Neuroscience, 28(4), 485-497. https://doi.org/10.3233/RNN-2010-0568

      Said, C. P., & Heeger, D. J. (2013). A model of binocular rivalry and cross-orientation suppression. PLoS Computational Biology, 9(3), e1002991. https://doi.org/10.1371/journal.pcbi.1002991

      Song, F., Lyu, L., Zhao, J., & Bao, M. (2023). The role of eye-specific attention in ocular dominance plasticity. Cerebral Cortex, 33(4), 983-996. https://doi.org/10.1093/cercor/bhac116

      van den Bergh, D., Wagenmakers, E.-J., & Aust, F. (2023). Bayesian Repeated-Measures Analysis of Variance: An Updated Methodology Implemented in JASP. Advances in Methods and Practices in Psychological Science, 6(2), 25152459231168024. https://doi.org/10.1177/25152459231168024

      van Doorn, J., van den Bergh, D., Böhm, U., Dablander, F., Derks, K., Draws, T., Etz, A., Evans, N. J., Gronau, Q. F., Haaf, J. M., Hinne, M., Kucharský, Š., Ly, A., Marsman, M., Matzke, D., Gupta, A., Sarafoglou, A., Stefan, A., Voelkel, J. G., & Wagenmakers, E. J. (2021). The JASP guidelines for conducting and reporting a Bayesian analysis. Psychonomic Bulletin & Review, 28(3), 813–826. https://doi.org/10.3758/s13423-020-01798-5

      Wagenmakers, E. J., Love, J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Selker, R., Gronau, Q. F., Dropmann, D., Boutin, B., Meerhoff, F., Knight, P., Raj, A., van Kesteren, E. J., van Doorn, J., Šmíra, M., Epskamp, S., Etz, A., Matzke, D., de Jong, T., van den Bergh, D., Sarafoglou, A., Steingroever, H., Derks, K., Rouder, J. N., & Morey, R. D. (2018). Bayesian inference for psychology. Part II: Example applications with JASP. Psychonomic Bulletin & Review, 25(1), 58–76. https://doi.org/10.3758/s13423-017-1323-7

      Watanabe, M., Cheng, K., Murayama, Y., Ueno, K., Asamizuya, T., Tanaka, K., & Logothetis, N. (2011). Attention but not awareness modulates the BOLD signal in the human V1 during binocular suppression. Science, 334(6057), 829-831. https://doi.org/10.1126/science.1203161

      Wong, S. P., Baldwin, A. S., Hess, R. F., & Mullen, K. T. (2021). Shifting eye balance using monocularly directed attention in normal vision. Journal of Vision, 21(5), 4. https://doi.org/10.1167/jov.21.5.4

      Yuval-Greenberg, S., & Heeger, D. J. (2013). Continuous flash suppression modulates cortical activity in early visual cortex. Journal of Neuroscience, 33(23), 9635-9643. https://doi.org/10.1523/jneurosci.4612-12.2013

      Zhang, P., Jiang, Y., & He, S. (2012). Voluntary attention modulates processing of eye-specific visual information. Psychological Science, 23(3), 254-260. https://doi.org/10.1177/0956797611424289

      Zhou, J., Reynaud, A., & Hess, R. F. (2014). Real-time modulation of perceptual eye dominance in humans. Proceedings of the Royal Society B: Biological Sciences, 281(1795). https://doi.org/10.1098/rspb.2014.1717

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public reviews):

      Summary:

      Ciliary rootlet is a structure associated with the ciliary basal body (centriole) with beautiful striation observed by electron microscopy. It has been known for more than a century, but its function and protein arrangement are still unknown. This work reconstructed the near-atomic resolution 3D structure of the rootlet using cryo-electron tomography, discovered a number of interesting filamentous structures inside, and built a molecular model of the rootlet.

      Strengths:

      The authors exploited the currently possible ability of cryo-ET and used it appropriately to describe the 3D structure of the rootlet. They carefully conducted subtomogram averaging and classification, which enabled an unprecedented detailed view of this structure. The dual use of (nearly) intact rootlets from cilia and extracted (demembraned) rootlets enabled them to describe with confidence how D1/D2/A bands form periodic structures and cross with longitudinal filaments, which are likely coiled-coil.

      Weaknesses:

      Some more clarifications are needed. This reviewer believes that the authors can address them.

      Reviewer #1 (Recommendations for the authors):

      Recommendation 1: According to Fig.1B, the rootlet was mechanically pulled out from the visual cell for a long distance by vortexing. Is there no artifact? Can the authors comment on it by referring to old literature, for example, with EM of resin-embedded and sectioned basal bodies?

      Response: A previous study (Gilliam et al., 2012) compared cryoET of purified rootlets with resin-embedded ultrathin sections of mouse eyecups. They reported no changes in striation repeat or rootlet morphology, suggesting there is no artifact of purification. Our rootlet data are consistent with those of Gilliam et al., suggesting the tomograms we report are representative of rootlets prior to purification.

      We have clarified this in the text: pg 2: “As previously described (Gilliam et al., 2012), rootlet striation-repeat and morphology appear unaltered by the purification method. Moreover, …” 

      Recommendation 2: Fig.1F: It is not clear how to distinguish striation-membrane joints indicated by grey and white arrows. It seems relatively straight striation is indicated by a white arrow, while in the case of the bulky feature it is shown by a grey arrow (and the bulk is colored in blue). But there is no clear border between these features. How were they distinguished? Are they based on classification?

      Response: The membrane-associated densities (colored in blue) were assigned according to the TomoSeg neural network. It was trained on a small set of globular densities closely associated with a membrane. This training set included examples both close to and far away from the rootlet. We trained a separate network on recognizing rootlet striations. Both networks competed on assigning pixels in the tomogram as either striations or membrane-associated proteins. The different membrane connections were therefore defined by the probability within the TomoSeg network rather than classification.

      We clarified this in the main text: pg 3: “All the striations partially or fully spanned the width of the rootlet and extended beyond the outermost longitudinal filaments. These rootlet-protruding striation-densities frequently contacted the membrane (Fig 1E). Close examination suggested some make a direct contact, whereas others contact a subset of globular membrane-associated densities that are a striking feature of the tomograms. These densities are ~7 nm in diameter and cover almost every membrane surface. Where two membranes come into proximity, the intervening space is filled with two layers of these membrane-associated proteins, one layer associated with each membrane (Fig 1C, S1A, blue arrowheads). We trained a TomoSeg neural network to assign these densities and let this network compete with one that assigned striations. This resulted in a final segmentation with membrane-associated densities indicated in blue and striations in yellow (Fig 1E, F and S1D–F).”  

      We also clarified this in the methods:

      pg 12/13: “The tomograms were then preprocessed in EMAN2.2 for training of the TomoSeg CNN (Chen et al., 2017). Here, the features (filaments, D-bands, A-bands, gold fiducials, actin, membranes, membrane-associated densities and ice contaminations) were individually trained. Segmented maps were allowed to compete for the assignment of pixels in the tomograms, cleaned up in Amira (Thermo Fisher Scientific), and converted to object files. The object files and corresponding tomograms were displayed in ChimeraX (Pettersen et al., 2021). Assignment of direct and indirect striation-membrane connections was done manually by assessing whether TomoSeg-segmented striations and membranes were connected directly or via membrane-associated densities. The automated segmentation of amorphous striations picked up mostly dense amorphous features. The fainter densities that we observed to laterally connect the amorphous features were manually drawn by dotted lines.” 
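
      The idea of letting separately trained segmentation maps compete for pixels can be illustrated with a minimal sketch. This is not the EMAN2/TomoSeg implementation: the function name `compete`, the `threshold` parameter and the toy score arrays are assumptions made for illustration only. Each pixel is given to the feature with the highest score, and pixels where no map is confident stay unassigned.

```python
import numpy as np

def compete(prob_maps, labels, threshold=0.5):
    """Assign each pixel to the feature whose map has the highest score.

    prob_maps: list of arrays with identical shape, one per trained feature.
    labels: list of feature names, in the same order as prob_maps.
    threshold: hypothetical cut-off below which a pixel stays unassigned.
    Returns an integer array: -1 for unassigned, otherwise an index into labels.
    """
    stack = np.stack(prob_maps, axis=0)
    if stack.shape[0] != len(labels):
        raise ValueError("one label is needed per probability map")
    winner = np.argmax(stack, axis=0)   # index of the best-scoring feature
    best = np.max(stack, axis=0)        # score of that feature
    return np.where(best >= threshold, winner, -1)

# Toy 2x2 "tomogram slice" with two competing feature maps (made-up scores).
labels = ["striation", "membrane_associated"]
striation = np.array([[0.9, 0.6], [0.2, 0.1]])
membrane = np.array([[0.3, 0.8], [0.7, 0.4]])
assignment = compete([striation, membrane], labels)
```

      In this toy case the top-left pixel goes to the striation map, the two pixels where the membrane-associated map scores higher go to that feature, and the bottom-right pixel remains unassigned because neither score reaches the assumed threshold.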

      Recommendation 3: p.3 "All the striations partially or fully spanned the width of the rootlet before protruding from its surface." This reviewer would read the last part of this sentence as "before protruding from the surface of the rootlet membrane toward inside". Is this correct?

      Response: This was not what we had intended to imply. 

      We have changed this sentence in the text to avoid confusion:  pg 3: “All the striations partially or fully spanned the width of the rootlet and extended beyond the outermost longitudinal filaments. These rootlet-protruding striation-densities frequently contacted the membrane (Fig 1E).”

      Recommendation 4: Same for p.4 "The protrusions from the rootlets were flexible". This means the protrusions from the membrane if this reviewer understands correctly.

      We also clarified this sentence in the text:  pg 4: “The proteinaceous protrusions that extended from the rootlets were flexible and did not induce a regular spacing in the membrane-associated proteins they contacted (Fig 1F, S1D–F).”

      Recommendation 5: p.4 "Due to the thickness of the sample and the presence of membranes": How thick is the typical sample?

      Response: We typically collected data on samples thicker than 300 nm. We initially tried making thinner samples, for better contrast, but observed this led to sample disruption. We changed “sample” to “ice” to clarify that we refer to the prepared sample and not the biological object.

      Changes in text:

      pg 4: “Due to the ice-thickness and the presence of membranes, the tomograms had limited contrast.”

      Recommendation 6: p.4 "We were also able to see these bands with cryo-ET." It would be nice if the comparison between tomograms of the native and purified rootlets was done. This reviewer could not get where the D1/D2/A bands are in Fig.1E.

      Response: Due to the noise in the native tomograms it is difficult to see the regular striation pattern in Fig 1E. However, we see it better when we project the native rootlet onto a single image. We added the projection image, the corresponding Fourier transform, and repeat measurements to the supplement (Fig S1B, C). We updated all figure references in the text.

      We updated the text accordingly:

      pg 4: “We were also able to see these bands with cryo-ET. The striations in the purified rootlets appeared more ordered and clearer than in the cellular tomograms due to the improved contrast. In the cellular rootlets, we identified the bands in a tomogram projection (Fig S1B), with an average distance of 79.52 ± 0.26 nm between each repeat (Fig S1C). The repeat distance for the purified rootlets is 80.1 ± 0.03 nm based on a sine fit to A and D-bands of 10 Fourier-filtered tomogram projections (Fig 2D, Fig S2E–I).”
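
      As a rough illustration of how a striation repeat can be read from a projection, the sketch below finds the strongest spatial frequency in a 1D intensity profile with an FFT. This is a simpler stand-in for the sine fit quoted above, not the authors' procedure; the function name `dominant_period`, the pixel size and the synthetic profile are assumptions.

```python
import numpy as np

def dominant_period(profile, pixel_size_nm):
    """Return the period (nm) of the strongest non-zero frequency in a 1D profile."""
    profile = np.asarray(profile, dtype=float)
    centred = profile - profile.mean()                       # remove the constant offset
    power = np.abs(np.fft.rfft(centred)) ** 2
    freqs = np.fft.rfftfreq(profile.size, d=pixel_size_nm)   # cycles per nm
    peak = np.argmax(power[1:]) + 1                          # skip the zero-frequency term
    return 1.0 / freqs[peak]

# Synthetic profile: an 80 nm repeat sampled at 1 nm per pixel over 800 nm (assumed values).
pixel_size_nm = 1.0
x = np.arange(800) * pixel_size_nm
profile = 1.0 + 0.5 * np.sin(2 * np.pi * x / 80.0)
period = dominant_period(profile, pixel_size_nm)
```

      The frequency resolution of this approach is limited by the length of the profile, which is one reason a fit to the filtered signal may be preferable for reporting a repeat distance with a small uncertainty.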

      We updated the figure legend of Fig S1:

      pg 18: “(B) Projection image of a 53 nm thick slice through the tomogram and the corresponding Fast Fourier Transform (FFT). Measured frequencies are indicated with red lines. (C) Quantification of the distance measured between pairs of discrete striations. (D–F) …”

      Recommendation 7: Fig.2E-I: Could the authors explain how these bands were tracked? It is very difficult for this reviewer to trace, for example, the A-band in Fig.2g.

      Response: We trained the neural network of TomoSeg to pick up discrete and amorphous striations. The TomoSeg segmentation of the amorphous striations often only picked up dense features marked in green. However, we could see densities by eye in the tomograms that connect these dense features.

      These connecting densities were manually drawn with a dotted line.

      We clarified this in the methods:

      pg 13: “The automated segmentation of amorphous striations picked up mostly dense amorphous features. The fainter densities that we observed to laterally connect the amorphous features were manually drawn by dotted lines.”

      We also changed the figure legend of Fig2: 

      pg 5: “(F,G,I) fainter features not picked up by the automated segmentation were drawn with dotted lines.”

      Recommendation 8: Fig.2: The caption of Fig.2I is missing.

      We have edited the legend of Fig 2 to include this caption: pg 5: “(I) Segmentation that shows amorphous features occur as two bands and connect to the rootlet surface densities.”

      Recommendation 9: p.6 "Additionally, the surface densities show evidence of connecting to the A-bands (Fig 2I and S3I)." Does the author mean Fig.2J and S3I?

      Response: This is most clearly visible in figure 2I and S3I (S3J after revisions), but it is also visible in 2J. 

      We therefore edited this figure reference:

      pg 6: (Fig 2I, J and S3J)

      Recommendation 10:  p.8 "The metazoan rootlet is a cilium-associated fiber that is characterized by regular cross-striations." In this reviewer's memory, Tetrahymena also has a rootlet. Are they different in structure?

      Response: Tetrahymena and other protists have striated rootlets (known as kinetodesmal fibres or System-I fibres) that are classified as being different from mammalian rootlets (Andersen et al., 1991). Tetrahymena rootlets have a 32 nm repeat (Munn, 1970), which is less than half of the 80 nm repeat observed for mammalian rootlets. While the protein composition of Tetrahymena rootlets is unknown, a 250 kDa protein was proposed to be their main component (Williams et al., 1979). Tetrahymena rootlet proteins were proposed to span a minimum of 4-5 striation repeats, based on early thin-sectioning EM (Munn, 1970), while we show that rootletin predictions span at most ~3.3 repeats in mammalian rootlets. Since the early proposal of Tetrahymena rootlet protein organisation, more components have been identified: DisAp (Galati et al., 2014) with a predicted length of ~37 nm (0.15 nm/residue), and proteins of 170 kDa that cross-react with the Naegleria gruberi major rootlet component (Dingle & Larson, 1981). Thus, the available data suggest that Tetrahymena rootlets are different in structure from mammalian ones.

      Reviewer #2 (Public reviews):

      Summary:

      This work performs structural analysis on isolated or purified rootlets.

      Strengths:

      To date, most studies of this cellular assembly have been from fluorescence microscopy, conventional TEM methods, or through biochemical analysis of constituents. It is clearly a challenging target for structural analysis due to its complexity and heterogeneity. The authors combine observations from cryo-electron tomograms, automated segmentations, subtomogram averaging, and previous data from the literature to present an overall model of how the rootlet is organised.

      Their model will serve as a jumping-off point for future studies, and as such it is something of considerable value and interest.

      Weaknesses:

      It is speculative but is presented as such, and is well-reasoned, plausible, and thorough.

      Reviewer #2 (Recommendations for the authors):

      Recommendation 1: My suggestions to improve the manuscript lie in some of the technical details:

      The subtomogram averaging methods are overly brief - I am not convinced that someone could replicate the process from the text in the methods (and results sections).

      We have now extended our description of the subtomogram averaging methods: 

      pg 13: “For particle picking, the tomograms were deconvolved using the TOM package (Tegunov & Cramer, 2019). Dynamo was used for particle extraction using the Dynamo surface model (Castaño-Díez et al., 2012, 2017): Each D2 band was traced in multiple slices per rootlet to define dynamo surfaces. Surface triangulation was set to result in extraction coordinates approximately 4 times the number of expected filaments. The coordinates were extracted as a Dynamo table that was subsequently converted to the motl-format using subTOM scripts, available at https://github.com/DustinMorado/subTOM/ (Leneva et al., 2021). Particles were extracted from tomograms reconstructed using novaCTF (Turoňová et al., 2017).

      An initial reference was obtained by in-plane randomizing and averaging all particles prior to alignments. Initial alignments were performed to centre filaments by using a 10 nm wide cylindrical mask, limited to 4 nm shifts in X and Y with respect to the reference orientation. A spherical mask with a large diameter was used for alignments of the D-bands; these alignments were restricted to the reference Z direction. Cluster- and careful per-tomogram cross-correlation cleaning were applied to remove particle duplicates, particles with no filaments, and particles with disordered D-bands. This resulted in a cleaned particle dataset.

      Prior to classification in subTOM, alignments with limited X/Y/Z shifts and increasingly finer in-plane rotations were performed. 20 eigenvolumes were generated by K-means classification over 20 eigenvectors. The eigenvolumes and particles clustered per eigenvector were assessed to identify which vectors described the missing wedge or structural features (Leneva et al., 2021). The structural eigenvectors were used to cluster particles into the final class averages that described particle heterogeneity. 

      For the final subtomogram class-average that contained the twist, the cleaned particle dataset motl was converted to a STAR file compatible with RELION 4.0 alpha (Zivanov et al., 2022). Gold beads were removed from the preprocessed tomogram frames by converting the aligned tomogram gold coordinates initially obtained by Etomo bead-finder during preprocessing steps (Kremer et al., 1996). Particles were then extracted in RELION 4.0 alpha. The initial reference was an in-plane randomized average of the cleaned particle dataset. Instead of refinement, which resulted in anisotropic structures due to a lack of features for the alignment, we used simultaneous alignment and classification. We restricted the alignments to full in-plane rotations with respect to the reference Z-axis.”
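
      The classification step described in the quoted methods (clustering particles by their coefficients along eigenvectors) can be sketched generically as a principal component analysis followed by K-means. The code below is a NumPy illustration under stated assumptions, not the subTOM or RELION implementation: the function names, the deterministic farthest-point initialisation, the number of components and the synthetic 4x4x4 "particles" are all hypothetical.

```python
import numpy as np

def pca_coefficients(volumes, n_components):
    """Project flattened volumes onto their leading principal components."""
    flat = volumes.reshape(len(volumes), -1).astype(float)
    centred = flat - flat.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

def kmeans(points, k, n_iter=50):
    """Plain K-means with a deterministic farthest-point initialisation."""
    centres = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(points[np.argmax(dist)])
    centres = np.array(centres)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels

# Synthetic data: two structural "classes" of 10 noisy particles each.
rng = np.random.default_rng(0)
pattern_a = np.zeros((4, 4, 4))
pattern_a[:2] = 1.0
pattern_b = np.zeros((4, 4, 4))
pattern_b[:, :2] = 1.0
volumes = np.array(
    [pattern_a + 0.05 * rng.standard_normal((4, 4, 4)) for _ in range(10)]
    + [pattern_b + 0.05 * rng.standard_normal((4, 4, 4)) for _ in range(10)]
)
coeffs = pca_coefficients(volumes, n_components=2)
labels = kmeans(coeffs, k=2)
```

      In real subtomogram data some components mainly describe the missing wedge rather than structure, which is why the quoted methods inspect each eigenvector before choosing which ones to cluster on; this sketch does not model that step.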

      Recommendation 2: I find it difficult to assess the quality of the final subtomogram averages as presented in the manuscript. One potential worry is the fact that the authors state that nothing is visible outside the mask, which can be a sign of overfitting (though, as the authors state, can just be a sign of heterogeneity). I would suggest that the authors include FSC curves, as well as 2D slices through the unmasked subtomogram averages - it is easier to judge the impact of the mask when viewing it this way and not at the isosurface.

      Response: We understand the reviewer’s concern for overfitting and masking. To clarify our approach, the class averages we show in Fig 3G and Fig S5C are the result of simultaneous classification with alignment and not a gold-standard refined average. The classification does not produce an FSC since it does not work with half sets. We initially tried a refinement approach, but the filaments did not have enough features to align and resulted in anisotropic structures. The FSC of such a refinement is shown below. However, because of the anisotropy, we did not include these structures or FSCs in the manuscript and we make no claims about the resolution.

      Author response image 1.

      Instead, we presented the data from simultaneous classification with alignment, which revealed the twist in the filament. Like the reviewer, we were initially concerned that the filament twist could be an artefact of the narrow masks and reference we used. However, we only used rotationally symmetric references and masks that do not contain any features. We therefore realized that this asymmetric twist feature could not have arisen from the imposed alignment regimes, reference bias or overfitting.

      To make our approach clearer, we have updated the main text:

      pg 8: “To ensure unbiased alignment of any coiled-coil features we generated a smooth reference by randomizing the in-plane rotational orientation of the particles (Fig S5B). Initial refinement of the data resulted in an anisotropic structure since the filaments did not have enough features to align to. Therefore, we performed classification with alignment in RELION 4.0 alpha (Zivanov et al., 2022), and used a narrow 3.3 nm-wide mask with a smooth edge up to 7.7 nm (Fig S5B). This was the narrowest mask that still resulted in an isotropic structure and revealed features that were absent in the smooth reference. The resulting class averages contained a twist along the filament length in classes 2, 3 and 4 but most prominently in class 5 (Fig S5C). Class 5 contained a filament 2 nm thick by 5 nm wide with a groove along its length (Fig 3G).”

      We also clarified this in the methods:

      pg 13: “The initial reference was an in-plane randomized average of the cleaned particle dataset. Instead of refinement, which resulted in anisotropic structures due to a lack of features for the alignment, we used simultaneous alignment and classification. We restricted the alignments to full in-plane rotations with respect to the reference Z-axis.”

      Recommendation 3: The authors should include the version of Alphafold that they used to perform the structural predictions. Predictions, especially for multimers, have improved in the newest version, and it could be expected that further improvements will occur in the future. Including the version used here will act as a timestamp.

      We have now updated the methods to include the version:

      pg 14: “AlphaFold predictions of 300 AA long dimer fragments with 50 AA overlap were generated using ColabFold 4, which uses a modified version of AlphaFold2. To run the large number of sequences we used a customized script called alphascreen (version 1.15) available at https://github.com/samichaaban/alphascreen.”
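
      The fragment scheme quoted above (300-residue windows with a 50-residue overlap) can be sketched as a simple sliding window. This is an illustration only and not the alphascreen script: the function name `tile_sequence`, the handling of the final shorter fragment and the placeholder sequence are assumptions.

```python
def tile_sequence(sequence, length=300, overlap=50):
    """Split a sequence into windows of `length` residues sharing `overlap` residues.

    Returns a list of (start, end, fragment) tuples with 1-based inclusive positions.
    The last fragment is simply truncated at the sequence end (an assumption).
    """
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and smaller than length")
    if not sequence:
        return []
    step = length - overlap
    fragments = []
    start = 0
    while True:
        end = min(start + length, len(sequence))
        fragments.append((start + 1, end, sequence[start:end]))
        if end == len(sequence):
            break
        start += step
    return fragments

toy = "A" * 700  # placeholder sequence, not a real protein
fragments = tile_sequence(toy)
```

      For the 700-residue placeholder this yields windows covering residues 1-300, 251-550 and 501-700, so consecutive windows share 50 residues.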

      Recommendation 4: Figure 2G is not so clear in depicting two offset D bands. The authors could include a more zoomed-out image to make it clearer.

      Response: We have now included a more zoomed out image in the supplement (Fig S3A).

      We updated the figure legend of Fig 2G and Fig S3A: pg 5: “(G) Example where D1 aligns with D2 of a neighboring sub-fiber. Larger view in Fig S3A.”

      pg 20: “(A) Tomogram slice and segmentation where D1 aligns with D2 of a neighboring sub-fiber. The dotted square marks the location of Fig 2G. (B)”

      Recommendation 5: Did the authors attempt to predict the structure of rootletin oligomers? i.e. folding four rootletin fragments at once instead of two? This could be interesting.

      Response: We attempted to predict interactions between all combinations of rootletin fragments. We did this for two-fragment (e.g. CC1+CC1 or CC1+CC2) and four-fragment (e.g. CC1+CC1+CC1+CC1 or CC1+CC1+CC2+CC2) combinations.

      Homodimer combinations (e.g. CC1+CC1) were predicted with most confidence. We did not identify any higher oligomerization. AlphaFold did not identify interactions that were previously proposed in the literature–for example between two CC3 dimers (Ko et al., 2020) or weak interactions between CC2 and CC3 (Yang et al., 2002). These interactions were either not properly predicted or may require additional proteins other than the ones we tested (CCDC102B, CEP68, beta-catenin, ARL2, centlein). 

      We have updated our methods to include our AlphaFold attempts:

      pg 14: “This setup was used to predict interactions for dimeric and oligomeric combinations of rootletin fragments (e.g. CC2+CC2, CC3+CC4, CC1+CC1+CC1+CC1, CC3+CC3+CC4+CC4 etc). Homodimeric and oligomeric combinations were tested with other proteins identified as putative rootletin-binding proteins: CCDC102B, CEP68, beta-catenin, ARL2, centlein. In our hands, only homodimeric rootletin fragment combinations resulted in confident predictions.”
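
      The combinations described above can be enumerated with a short sketch. The fragment labels CC1 to CC4 are taken from the examples in the text, but treating them as the complete fragment set and building four-chain jobs as pairs of homodimers are assumptions made for illustration, not a description of the authors' script.

```python
from itertools import combinations_with_replacement

# Fragment labels as used in the examples above; the full set is an assumption.
fragment_names = ["CC1", "CC2", "CC3", "CC4"]

# Two-chain jobs: every unordered pair, including homodimers such as CC1+CC1.
dimers = list(combinations_with_replacement(fragment_names, 2))

# Four-chain jobs, assumed here to be pairs of homodimers (e.g. CC1+CC1+CC2+CC2).
homodimers = [(name, name) for name in fragment_names]
tetramers = [a + b for a, b in combinations_with_replacement(homodimers, 2)]

# Job names in the "CC1+CC2" style used in the text.
job_names = ["+".join(job) for job in dimers + tetramers]
```

      With four fragments this gives ten two-chain and ten four-chain jobs, which shows how quickly the number of predictions grows and why a batch script is useful.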

      Reviewer #3 (Public reviews):

      Summary:

      The study offers a compelling molecular model for the organization of rootlets, a critical organelle that links cilia to the basal body. Striations have been observed in rootlets, but their assembly, composition, and function remain unknown. While previous research has explored rootlet structure and organization, this study delivers an unprecedented level of resolution, valuable to the centrosome and cilia field. The authors isolated rootlets from mice's eyes. They apply EM to partially purified rootlets (first negative stain, then cryoET). From these micrographs, they observed striations along the membranes along the rootlet but no regular spacing was observed.

      The thickness of the sample and membranes prevented good contrast in the tomograms. Thus they further purified the rootlets using detergent, which allowed them to obtain cryoET micrographs of the rootlets with greater details. The tomograms were segmented and further processed to improve the features of the rootlet structures. From their analysis, they described 3 regular cross-striations and amorphous densities, which are connected perpendicularly to filaments along the length of the rootlets. They propose that various proteins provide the striations and rootletin (mouse homolog of human cnap1) forms parallel coiled coils that run along the rootlet. Overall their data provide a detailed model for the molecular organization of the rootlet.

      The major strength is that this high-quality study uses state-of-the-art cryo-electron tomography, subtomogram averaging, and image analysis to provide a model of the molecular organization of rootlets. The micrographs are exceptional, with excellent contrast and details, which also implies the sample preparation was well optimized to provide excellent samples for cryo-ET. The manuscript is also clear and accessible.

      To further validate their model, it would have been useful to identify some components in the EM maps through complementary approaches (mass spectrometry, mutants disrupting certain features, CLEM). Some potential candidates are mentioned in the discussion.

      This research marks a significant step forward in our understanding of rootlets' molecular organization.

      Response: We agree with the reviewer that it would be ideal to identify rootlet components in the EM densities using complementary approaches. Prior to submitting the manuscript, we attempted several approaches, the details of which are described below:

      We performed mass spectrometry on our purified rootlets. This identified the rootlet components rootletin and CCDC102B and various axonemal components, due to the association between the rootlet and axoneme. However, due to the limitations in quantifying components using mass spectrometry, we were unable to confidently identify novel rootlet constituents present in quantities comparable to rootletin.

      We further attempted cross-linking mass spectrometry on the rootlets to gain deeper insights to the interactions between rootletin molecules. Unfortunately, this effort resulted in a completely insoluble sample despite extended digestion times, leading to issues with mass spectrometry column clogging and rendering our results inconclusive.

      We attempted to express rootlet components recombinantly and were able to purify fibres, but they did not contain the characteristic repeat pattern seen in native rootlets. We also considered purifying native rootlets from cultured cells, but we were unable to obtain sufficient sample for cryoET imaging.

      We therefore regret that other approaches to validate our model are outside the scope of this current work.

      Reviewer #3 (Recommendations for the authors):

      Recommendation 1: There are some problems with spaces in references in the methods.

      Response: We have thoroughly checked the methods and manuscript for double spaces and corrected this.

      Recommendation 2: Figure 1A, the figure would benefit from more labelling, to show the reader the basal body and nucleus.

      Response: We have now added the labels "basal bodies" and "Nucleus" to the cartoon in Fig 1A.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study has uncovered some important initial findings about how certain extracellular vesicles (EVs) from the mother might impact the energy usage of an embryo. While the study's findings are in general solid, some experiments lack statistical power due to small sample sizes. The study's title might be a bit too assertive as the evidence linking maternal mtDNA transmission to changes in embryo energy use is still correlative.

      We would like to express our sincere gratitude to the editors and reviewers for their invaluable comments on this work. Their feedback has been instrumental in enhancing the quality of our manuscript; we have incorporated their suggestions to the best of our abilities.

      Reviewer #1 (Public Review):

      Q1. Bolumar et al. isolated and characterized EV subpopulations, apoptotic bodies (AB), Microvesicles (MV), and Exosomes (EXO), from endometrial fluid through the female menstrual cycle. By performing DNA sequencing, they found the MVs contain more specific DNA sequences than other EVs, and specifically, more mtDNA were encapsulated in MVs. They also found a reduction of mtDNA content in the human endometrium at the receptive and post-receptive period that is associated with an increase in mitophagy activity in the cells, and a higher mtDNA content in the secreted MVs was found at the same time. Last, they demonstrated that the endometrial Ishikawa cell-derived EVs could be taken by the mouse embryos and resulted in altered embryo metabolism.

      This is a very interesting study and is the first one demonstrating the direct transmission of maternal mtDNA to embryos through EVs.

      A1. Thank you for your kind comments.

      Reviewer #2 (Public Review):

      Q2. In Bolumar, Moncayo-Arlandi et al. the authors explore whether endometrium-derived extracellular vesicles contribute mtDNA to embryos and therefore influence embryo metabolism and respiration. The manuscript combines techniques for isolating different populations of extracellular vesicles, DNA sequencing, embryo culture, and respiration assays performed on human endometrial samples and mouse embryos.

      Vesicle isolation is technically difficult and therefore collection from human samples is commendable. Also, the influence of maternally derived mtDNA on the bioenergetics of embryos is unknown and therefore novel. However, several experiments presented in the manuscript fail to reach statistical significance, likely due to the small sample sizes. Additionally, the experiments do not demonstrate a direct effect of mtDNA transfer on embryo bioenergetics. This has the unfortunate consequence of making several of the authors' conclusions speculative.

      In my opinion the manuscript supports the following of the authors' claims:

      1) Different amounts of mtDNA are shed in human endometrial extracellular vesicles during different phases of the menstrual cycle

      2) Endometrial microvesicles are more enriched for mitochondrial DNA sequences compared to other types of microvesicles present in the human samples

      3) Fluorescently labelled DNA from extracellular vesicles derived from an endometrial adenocarcinoma cell line can be incorporated into hatched mouse embryos.

      4) Culture of mouse embryos with endometrial extracellular vesicles can influence embryo respiration and the effect is greater when cultured with isolated exosomes compared to other isolated microvesicles

      A2. Thank you for your detailed feedback. We have made every effort to enhance the manuscript in this revised version, ensuring that our conclusions are grounded in solid evidence and that they avoid any speculation.

      My main concerns with the manuscript:

      Q3. The authors demonstrate that microvesicles contain the most mtDNA, however, they also demonstrate that only isolated exosomes influence embryo respiration. These are two separate populations of extracellular vesicles.

      A3. This manuscript focuses on the DNA content secreted by the endometrium and captured by the embryo. We identified both mitochondrial DNA and genomic DNA. We have found that mitochondrial DNA is predominantly secreted and encapsulated within microvesicles, while all three types of vesicles encapsulate genomic DNA. Specifically, based on the results we presented in Response A8 to the reviewers and included in the latest version of the manuscript, we observed that exosomes contain the highest amount of genomic DNA. Furthermore, exosomes have the greatest impact on embryo bioenergetics, suggesting that this DNA content may primarily exert this effect. We have thoroughly revised the manuscript, focusing our message on DNA content.

      Q4. mtDNA is not specifically identified as being taken up by embryos only DNA.

      A4. We agree with the reviewer; as we mention in answer A9, EdU does not specifically label mitochondrial DNA. To solve this issue, we incubated a synthetic molecule of labeled mtDNA with embryos and analyzed mtDNA incorporation using confocal microscopy. We co-cultured hatched mouse embryos (3.5 days) with a biotin-conjugated ATP8 sequence overnight at 37 °C and 5% CO2. We then permeabilized embryos, incubated them with streptavidin-Cy3 for 45 min, and visualized the results using an SP8 confocal microscope (Leica). We observed mtDNA internalization by cells of the hatched embryos; please see new supplementary Figure 7 and lines 234-237 on page 9 and lines 583-592 of the M&M on page 21.

      Q5. The authors do not rule out that other components packaged in extracellular vesicles could be the factors influencing embryo metabolism.

      A5. The vesicular subtypes contain molecules beyond DNA, such as microRNAs, proteins, or lipids. Our laboratory has studied the transmission of vesicles and their relationship with their contents (particularly microRNAs) and their connection to maternal-fetal communication. In this study, we focused on genomic/mitochondrial DNA. We cannot exclude the possibility that other molecules may influence metabolism; this statement is already noted in the discussion section on lines 328-331 on page 12.

      Q6. Taken together, these concerns seem to contradict the implication of the title of the manuscript – the authors do not demonstrate that inheritance of maternal mtDNA has a direct causative effect on embryo metabolism.

      A6. We have modified the title to better align with the manuscript’s results. The proposed new title for the manuscript is “Vertical transmission of maternal DNA through extracellular vesicles modulates embryo bioenergetics during the periconceptional period.”

      Reviewer #1 (Recommendations for The Authors):

      Q7. Would it be possible to validate the mtDNA content and mitophagy activity in different periods using the Ishikawa cells?

      A7. Unfortunately, this validation cannot be achieved with in vitro cultures of cell lines, especially with a cell line such as the endometrial adenocarcinoma-derived Ishikawa cell line. Mimicking the menstrual cycle phases (as observed in Figure 3 of the manuscript) in such a culture would be entirely artificial, and we believe that the statistically significant results obtained in human samples faithfully represent the biological processes involved. Using a cell line, in our opinion, would not provide us with novel information.

      Q8. Characterization of the EVs subpopulations from Ishikawa cells and direct evidence to show the EdU labeled DNA is contained in the EVs are necessary.

      A8. To address this concern, we designed a novel experiment. We cultured Ishikawa cells in the presence of EdU, isolated the three types of vesicles, and evaluated labeled DNA content by flow cytometry (as illustrated in Supplementary Figure 5). All three types of vesicles exhibited positive EdU-DNA labeling; notably, the exosomal fraction demonstrated substantially higher DNA content than the other vesicle populations. Please see new supplementary Figure 5 and lines 217-218 on page 9, and lines 576-582 of the M&M on pages 20-21.

      Q9. Would EdU incorporate into the genomic DNA or mitochondrial DNA?

      A9. EdU (5-ethynyl-2′-deoxyuridine) is a nucleoside analog of thymidine and becomes incorporated into DNA during active DNA synthesis. EdU labels all newly synthesized DNA, both genomic and mitochondrial; however, we cannot differentiate between them with this technique.

      Q10. It is difficult to assess whether the EV-derived DNA was taken by the TE or ICM without immunostaining of cell lineage markers in mouse embryos.

      A10. We did not aim to label the inner cell mass, as the vesicles primarily enter through trophectodermal cells. The images presented in Figure 4 and Supplementary Figure 5 depict trophectoderm cells.

      Q11. It would also be valuable to perform co-staining with Mitotracker to show the co-localization of EdU-labelled DNA and the mitochondria.

      A11. Per the reviewer's suggestion, we conducted an experiment as described in the following text. We isolated MVs from the culture media of EdU-treated Ishikawa cells and co-incubated them with embryos overnight. The resulting images (see Author response image 1) show an embryo stained for EdU-tagged DNA labeled with Alexa Fluor 488 (green), Mitotracker Deep Red (red), and nuclei (blue). Detailed views of the embryo are presented in panels A and B. Notably, we observed co-localization of mitochondria and EdU-tagged DNA, as indicated by the white arrows. Despite this intriguing finding, we chose not to include these results in the initial version of the manuscript; however, if the editor deems it appropriate, we would be delighted to incorporate them into the final version. The experimental procedure for co-localization of EdU-tagged DNA with mitochondria involved the following steps: Mitotracker Deep Red FM (Thermo Fisher Scientific, M22426) was added to the embryo media at a final concentration of 200 nM, and the embryos were subsequently incubated for 45-60 minutes prior to fixation.

      Author response image 1.

      Co-localization of mitochondria and EdU-tagged DNA in mouse embryos. Representative micrograph of an embryo co-incubated with MVs isolated from the culture media of Ishikawa cells treated with EdU. EdU-tagged DNA was labeled with Alexa Fluor 488 (green), mitochondria with Mitotracker Deep Red (red), and nuclei are shown in blue. A and B) Magnified images of the embryo show detailed co-localization of mitochondria and EdU-tagged DNA (white arrows). Negative control) Embryos incubated with MVs isolated from control Ishikawa cells (without EdU incubation) and stained with the click-it reaction cocktail. A and B show magnified images of the embryo. Note the absence of EdU-Alexa Fluor 488 signal (green).

      Reviewer #2 (Recommendations for The Authors):

      Q12. It would be helpful if the authors could provide citations and rationale for why they chose specific molecular markers to validate the different population of extracellular vesicles.

      A12. Different extracellular vesicle populations are defined by molecular marker signatures that reflect their origin. VDAC1 forms ionic channels in the mitochondrial membrane, has a role in triggering apoptosis, and has been described as characteristic of ABs.[1]

      The ER protein Calreticulin has also been used as an AB marker [2]; however, other studies have noted the presence of Calreticulin in MVs. [1] This apparent non-specificity may derive from apoptotic processes, during which the ER membrane fragments and forms vesicles smaller than ABs, which would contain Calreticulin and sediment at higher centrifugal forces.[3,4] In fact, proteomic studies have linked the presence of Calreticulin with vesicular fractions of a size range relevant for MVs [5] and ABs [6].

      ARF6, a GTP-binding protein implicated in cargo sorting and promoting MV formation, has been proposed as an MV marker. [7,8]

      Classic markers of EXOs include molecules involved in biogenesis, such as tetraspanins (CD63, CD9, CD81), Alix, TSG101, and flotillin-1.[9,10] Nonetheless, studies have recently reported the widespread nature of such markers among various EV populations, although with different relative abundances (such as is the case for CD9, CD63, HSC70, and flotillin-1[11]). Notably, certain molecular markers (such as TSG101[1,11]) have been ratified as specific to EXOs.

      References

      1. D. K. Jeppesen, M. L. Hvam, B. Primdahl-Bengtson, A. T. Boysen, B. Whitehead, L. Dyrskjøt, T. F. Orntoft, K. A. Howard, M. S. Ostenfeld, J. Extracell. Vesicles. 2014, 3, 25011, doi: 10.3402/jev.v3.25011.

      2. J. van Deun, P. Mestdagh, R. Sormunen, V. Cocquyt, K. Vermaelen, J. Vandesompele, M. Bracke, O. De Wever, A. Hendrix, J. Extracell. Vesicles. 2014, 3, 24858, doi: 10.3402/jev.v3.24858.

      3. L. Abas, C. Luschnig, Anal. Biochem. 2010, 401, 217-227, doi: 10.1016/j.ab.2010.02.030.

      4. C. Lavoie, J. Lanoix, F. W. Kan, J. Paiement, J. Cell Sci. 1996, 109(6), 1415-1425.

      5. M. Tong, T. Kleffmann, S. Pradhan, C. L. Johansson, J. DeSousa, P. R. Stone, J. L. James, Q. Chen, L. W. Chamley, Hum. Reprod. 2016, 31(4), 687-699, doi: 10.1093/humrep/dew004.

      6. P. Pantham, C. A. Viall, Q. Chen, T. Kleffmann, C. G. Print, L. W. Chamley, Placenta. 2015, 36, 1463-1473, doi: 10.1016/j.placenta.2015.10.006.

      7. V. Muralidharan-Chari, J. Clancy, C. Plou, M. Romao, P. Chavrier, G. Raposo, C. D'Souza-Schorey, Curr. Biol. 2009, 19, 1875-1885.

      8. C. Tricarico, J. Clancy, C. D'Souza-Schorey, Small GTPases. 2016, 0(0), 1-13.

      9. M. Colombo, G. Raposo, C. Théry, Annu. Rev. Cell. Dev. Biol. 2014, 30, 255-289, doi: 10.1146/annurev-cellbio-101512-122326.

      10. S. Mathivanan, H. Ji, R. J. Simpson, J. Proteomics. 2010, 73(10), 1907-1920.

      11. J. Kowal, G. Arras, M. Colombo, M. Jouve, J. P. Morath, B. Primdal-Bengtson, F. Dingli, D. Loew, M. Tkach, C. Théry, Proc. Natl. Acad. Sci. U. S. A. 2016, 113(8), E968-77.

      Q13. The PCA analysis in supplementary figure 4 A&B needs more explanation for why they think separation of the two conditions based on principal component 1 is sufficient. The small number of replicates makes me concerned because principal component 2 does not show similarity of replicates for the DNase treated samples. Also, 4C has no description in the figure legend.

      A13. The PCA results show a clear separation between the two conditions; we believe this separation is primarily driven by the differences observed in principal component 1 (PC1). We would like to address the concerns raised by the reviewer with the following points:

      1. Interpretation of PCs: In PCA, the principal components represent orthogonal axes capturing the highest variance in the data. PC1 accounts for 56% and 57% of the variance in the two conditions, respectively. The significant variance explained by PC1 suggests that it effectively captures the major sources of variation between the samples.

      2. Sample Replicates and Variability: The concern regarding the small number of replicates is acknowledged, and we understand its impact on the analysis. Despite the limited number of replicates, the consistent pattern of separation in PC1 between the two conditions provides confidence in the observed separation. We also agree that PC2 does not show an apparent similarity among the DNase-treated samples; however, this does not diminish the significance of PC1, which robustly separates the two conditions.
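
      The role of PC1 described in the two points above can be illustrated with a minimal sketch. It is not the analysis pipeline used for Supplementary Figure 4; the toy matrix, the function names, and the use of power iteration to obtain the first component are all assumptions made for illustration. The sketch centers the data, finds the axis of greatest variance, and reports both the fraction of variance that axis explains and the sample scores along it.

```python
import math
import random


def center_columns(rows):
    """Subtract each column's mean so every feature has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[value - m for value, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix (features x features) of centered data."""
    n = len(centered)
    p = len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_principal_component(cov, iterations=500, seed=0):
    """Power iteration: unit eigenvector and eigenvalue of the largest eigenvalue."""
    rng = random.Random(seed)
    vec = [rng.random() + 0.1 for _ in cov]
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    eigenvalue = sum(
        vec[i] * sum(cov[i][j] * vec[j] for j in range(len(vec)))
        for i in range(len(vec))
    )
    return vec, eigenvalue


def pc1_summary(rows):
    """Return (fraction of variance explained by PC1, PC1 score per sample)."""
    centered = center_columns(rows)
    cov = covariance_matrix(centered)
    axis, eigenvalue = first_principal_component(cov)
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    scores = [sum(a * x for a, x in zip(axis, row)) for row in centered]
    return eigenvalue / total_variance, scores


# Hypothetical toy matrix (not the study's data): rows 0-2 stand in for one
# condition, rows 3-5 for the other; columns are three arbitrary features.
samples = [
    [10.0, 2.0, 5.0],
    [11.0, 3.5, 4.0],
    [9.5, 1.0, 6.0],
    [2.0, 2.5, 5.5],
    [1.0, 1.5, 4.5],
    [2.5, 3.0, 5.0],
]
explained, pc1_scores = pc1_summary(samples)
print(f"PC1 explains {explained:.0%} of the variance")
print("PC1 scores:", [round(s, 2) for s in pc1_scores])
```

      In this toy case the two groups fall on opposite sides of zero along PC1, which is the kind of separation the response refers to; with so few replicates, the spread along later components remains poorly constrained.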

      We include the Figure legend for 4C: “C) Principal component analysis shows EV sample grouping due to specificity in coding-gene sequences.”

      Q14. I am confused by the phrasing in the last two sentences of the top paragraph on page 7. Why would apoptotic bodies all have similar content if they encapsulate a greater amount of material making their contents less specific? Please clarify.

      A14. This sentence was intended to convey that apoptotic bodies (ABs) are formed from apoptotic cells, are larger in size, and have less specific content. This non-specific nature arises because, unlike the other two types of vesicles, they do not encapsulate molecules selectively. For more detailed information on ABs in human reproduction, we published an extensive review in 2018 (see below).

      Simon C, Greening DW, Bolumar D, Balaguer N, Salamonsen LA, Vilella F. Extracellular Vesicles in Human Reproduction in Health and Disease. Endocr. Rev. 2018 Jun 1;39(3):292-332. doi: 10.1210/er.2017-00229. PMID: 29390102.

      Q15. The first and last sentences of the last paragraph of page 8 seem to contradict each other. Please clarify.

      A15. We observe an enrichment in the amount of mitochondrial DNA in samples during the receptive and post-receptive phases. While the data may not show statistical significance, we observed a trend towards greater enrichment in receptivity compared to pre-receptivity. The lack of significant differences could be attributed to inherent variability among patients. We have also altered the text on page 8 to avoid confusion.

      Q16. Quantification of the rates of DNA incorporation into embryos would strengthen Figure 4 and Supplementary Figure 5.

      A16. We acknowledge the reviewer's feedback, and in response, we conducted an assay to quantify the total DNA incorporated into the embryos. To achieve this, we isolated EVs from the control Ishikawa cell culture media and EdU-treated Ishikawa cell culture media. Subsequently, we co-incubated both types of EVs with ten embryos overnight in G2 plus media at 37 °C and 5% CO2.

      After co-incubation, we collected embryos and the culture media containing co-incubated EVs. We then isolated total DNA using the QIAamp® DNA Mini kit (Qiagen; 51304). To label the EdU-DNA particles, we performed a click-it reaction using the Click-iT™ EdU Alexa Fluor™ 488 flow cytometry assay Kit (Thermo Fisher Scientific, ref: C10420) per the manufacturer's instructions. Subsequently, we cleaned and purified DNA using AMPure XP beads (Beckman Coulter, A63882) and eluted DNA in 150 μL of 0.1 M Tris-EDTA. Finally, we measured the fluorescence of each sample using a Victor3 plate reader (PerkinElmer). To ensure accuracy, we subtracted the background signal from non-labeled DNA-derived EVs and embryos incubated without EVs for each sample. Despite conducting the experiment twice, we encountered challenges in obtaining clear results, possibly due to the limited resolution of the technique.

      Q17. If mtDNA is most enriched in MVs but only embryos cultured with Exos demonstrated differences in respiration the authors need to comment on this discrepancy.

      A17. We ask the reviewer to refer to Answer A3; we have thoroughly revised the manuscript, focusing our message on DNA content.

      Q18. The authors should change the definitive language in the title of the manuscript because all evidence presented is correlative.

      A18. We have modified the title to better align with the manuscript's results. The proposed new title for the manuscript is “Vertical transmission of maternal DNA through extracellular vesicles modulates embryo bioenergetics during the periconceptional period.”

      Q19. I realize this is beyond what the authors intend for the scope of this paper, however, on page 6 the authors describe membranous structures within the ABs but say they couldn't study their presence with organelle-specific markers. Why? Presence of organelles in these vesicles is very interesting!

      A19. As the reviewer rightly points out, we did not study ABs in this manuscript. Analysis of the electron microscopy images suggests the presence of fragments of organelles, most likely originating from apoptotic processes; however, we did not use any specific markers to confirm our assertion. We have modified the text to avoid any confusion. Please see Page 6, Lines 120-121, for further details.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The authors have examined gene expression between life cycle stages in a range of brown macroalgae to examine whether there are conserved aspects of biological features. 

      Strengths: 

      The manuscript incorporates large gene expression datasets from 10 different species and therefore enables a comprehensive assessment of the degree of conservation of different aspects of gene expression and underlying biology. 

      The findings represent an important step forward in our understanding of the core aspects of cell biology that differ between life cycle phases and provide a substantial resource for further detailed studies in this area. Convincing evidence is provided for the conservation of lifecycle-specific gene expression between species, particularly in core housekeeping gene modules. 

      Weaknesses: 

      I found a few weaknesses in the methodology and experimental design. I think the manuscript could have been clearer when linking the findings to the biology of the brown algae. 

      Reviewer #2 (Public review): 

      Summary: 

      The manuscript by Ratchinski et al presents a comprehensive analysis of developmental and life history gene expression patterns in brown algal species. The manuscript shows that the degree of generation bias or generation-specific gene expression correlates with the degree of dimorphism. It also reports conservation of life cycle features within generations and marked changes in gene expression patterns in Ectocarpus in the transition between gamete and early sporophyte. The manuscript also reports considerable conservation of gene expression modules between two representative species, particularly in genes associated with conserved functional characteristics. 

      Strengths: 

      The manuscript represents a considerable "tour de force" dataset and analytical effort. While the data presented is largely descriptive, it is likely to provide a very useful resource for studies of brown algal development and for comparative studies with other developmental and life cycle systems. 

      Weaknesses: 

      Notwithstanding the well-known issues associated with inferring function from transcriptomics-only studies, no major weaknesses were identified by this reviewer. 

      Reviewing Editor Comments:

      The overall assessment of the reviewers does not contain major aspects of concern. We nevertheless recommend that the authors carefully consider the constructive comments, as this will further improve their manuscript. 

      Reviewer #1 (Recommendations for the authors): 

      (1) Line 32: The abstract states 'considerable conservation of co-expressed gene modules', but the degree of conservation between Ectocarpus and D. dichotoma appeared limited to specific subsets of genes with highly conserved housekeeping functions, e.g., translation. I think the wording of the abstract should be rephrased to better reflect this. 

      We agree that genes with housekeeping functions figure strongly in the gene modules that showed strong conservation between Ectocarpus species 7 and D. dichotoma (and we actually highlight this point in the manuscript) but we do not believe that this invalidates the conservation. In the analysis shown in Figure 6A, for example, high scores were obtained for both connectivity and density for about a third of the gene modules and these modules cover a broad range of cellular functions. This is a significant result given the large phylogenetic distance and we feel that "considerable conservation" is appropriate as a description of the level of correlation. 

      (2) Introduction - The Introduction needs a better explanation of the biology of the life cycle phases. Some of this information is present in the 1st paragraph of Materials and Methods, although it would be preferable to include this information within the main text, ideally within the Introduction before the Results are described. For example, when are flagella present? The presence of flagella could be indicated in Figure 3. The ecology of the life cycle is also not described. Are life cycles present in the same ecological niche? Do they co-exist or occupy distinct environments? It would be useful to understand how the observed genotypes could relate to this wider aspect of the brown algal biology. 

      We have added a sentence to explain that zoids (gametes and spores) are the only flagellated stages of the life cycle (line 678). In addition, in the legend for Figure 3, we have indicated which of the life cycle stages analysed in panel 3A consisted entirely or partially of flagellated cells. We have also added information about phenology to the Introduction. 

      (3) Line 127. 'The proportion of generation specific genes was positively correlated with the level of dimorphism'. The level of dimorphism between species was not clear to me. This needs to be clearly displayed in Figure 1B. 

      We had attempted to illustrate the level of dimorphism, using the size of each generation as a measurable proxy, in Figure S1 but we agree that the information was not very clearly presented. To improve clarity, we now provide independent size scales for each generation of the life cycle in this figure and state in the legend that "Size bars indicate the approximate sizes of each generation of each life cycle, providing an indication of the degree of dimorphism between the two generations.". In the text, Figure S1 is cited earlier in the paragraph but we now repeat the citation of the figure at the end of the sentence "The proportion of generation-specific genes (...) was positively correlated with the level of dimorphism" so that the reader can specifically consult the supplementary figure for this phenotypic parameter. 

      (4) Line 267. Are there known differences in cell wall composition between life cycle phases or within each generation as individual life cycle phases mature (e.g., differences between unicellular and multicellular stages)? 

      Detailed comparative analyses of cell wall composition at different stages of the life cycle have not been carried out for brown algae. However, Congo red stains Ectocarpus gametophytes but not sporophytes (Coelho et al., 2011), indicating a difference in cell wall composition between the two generations. Zoids (spores and gametes) do not have a cell wall and calcofluor white staining of meio-spores has indicated that a cell wall only starts to be deposited 24-48 hours post-release (Arun et al., 2013).

      (5) Line 388. The authors should comment on the accuracy of OrthoFinder for different gene types across this degree of divergence (250 MYA). The best conservation was found in genes with housekeeping characteristics (line 401). It may be that these gene modules show the highest degree of conservation in expression patterns, but I also wonder whether the pattern may also emerge because finding true orthologues is easier for highly conserved gene families. 

      We do not believe that this is the case because, as mentioned above, the "housekeeping" modules cover quite a broad range of cellular functions. Note also that the modules were given functional labels based on their being clearly enriched in genes corresponding to a particular class of function but not all the genes in a module have a predicted function that corresponds to the functional classification. 

      However, we have carried out an analysis to look for evidence of the bias proposed by the reviewer. For this, we used BLASTp identity scores as an approximate proxy for pairwise identity between Ectocarpus species 7 and D. dichotoma one-to-one orthologues in each module and plotted the mean identity score for each module against the Fisher's exact test p-value of the contingency table in Figure 6C (Author response image 1).

      Author response image 1.

      Plot of estimations of the mean percent shared identity between the orthologues within each module (based on mean BLASTp identity scores) against log10(pvalue) values obtained with the Fisher's exact test applied in Figure 6C to determine whether pairs of modules shared a greater number of one-to-one orthologues than expected from a random distribution. Error bars indicate the standard deviation. 

      This analysis did not detect any correlation between the degree of sequence conservation of orthologues in a module and the degree of conservation of the module between Ectocarpus species 7 and D. dichotoma.
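
      For readers unfamiliar with the module-overlap test referred to above, the idea can be sketched as follows. This is a minimal illustration, not the code used for Figure 6C: the function name, the one-sided alternative, and the example counts are assumptions. For a 2x2 contingency table, the one-sided Fisher's exact test is equivalent to the upper tail of a hypergeometric distribution, which asks how likely it is that two modules share at least the observed number of one-to-one orthologues by chance.

```python
from math import comb


def overlap_p_value(shared, size_a, size_b, universe):
    """One-sided Fisher's exact test, computed as a hypergeometric upper tail.

    Probability of observing at least `shared` orthologues in common when a
    module of `size_a` genes and a module of `size_b` genes are drawn at
    random from `universe` one-to-one orthologues.
    """
    max_shared = min(size_a, size_b)
    total = comb(universe, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(shared, max_shared + 1)
    )
    return tail / total


# Hypothetical counts (not taken from the manuscript): two modules of 50 and
# 40 genes drawn from 1000 one-to-one orthologues, sharing 10 of them.
p_enriched = overlap_p_value(shared=10, size_a=50, size_b=40, universe=1000)
# With only 3 shared orthologues the overlap is much closer to chance.
p_modest = overlap_p_value(shared=3, size_a=50, size_b=40, universe=1000)
print(p_enriched, p_modest)
```

      Taking log10 of such p-values for each module pair gives the quantity plotted against mean sequence identity in Author response image 1.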

      Minor comments 

      (1) Line 650 loose should be lose.

      The error has been corrected.

      (2) Line 695 filtered through a 1 μm filter to remove multicellular gametophyte fractions. Is this correct? It seems too small to allow gametes to pass through. 

      Yes, the text is correct, a 1 μm filter was used. The gametes do pass through this filter, presumably because they do not have a rigid cell wall, allowing them to squeeze through the filter when a light pressure is applied. 

      (3) Line 709 - DDT should be DTT 

      The error has been corrected.

      Reviewer #2 (Recommendations for the authors): 

      (1) It is not clear why the chosen species for analysis do not include fucoid algae, which display a high degree of dimorphism between generations and which are relatively well studied with respect to gene expression patterns during early development. Indeed, it was recently shown that gene expression patterns in developing embryos of Fucus spp. obey the "hourglass" pattern whereby gene expression shows a minimum of transcriptome age index (i.e., higher expression of evolutionarily older genes) associated with differentiation at the phylotypic stage. I am somewhat surprised that the manuscript does not consider this feature in the analysis or discussion. 

      Brown algae of the order Fucales have diploid life cycles and therefore do not alternate between a sporophyte and gametophyte generation. It is for this reason that we thought that it was more interesting to compare Ectocarpus species 7 with D. dichotoma, which has a haploid-diploid life cycle.

      (2) In Discussion, the comparison of maternal to zygote transition in animals and land plants, which show a high degree of dimorphism, with Ectocarpus would be strengthened by data/discussion from other brown algae that show a high degree of dimorphism. 

      Animals have diploid life cycles and dimorphism in that lineage generally refers to sexual rather than generational dimorphism. Land plants do have highly dimorphic haploid-diploid life cycles but it is unclear how this characteristic relates to events that occur during the maternal to zygote transition. In Ectocarpus, the transition from gamete to the first stages of sporophyte development involved more marked changes in gene expression than we observed when comparing the mature sporophyte and gametophyte generations (Figure 3C). At present, there is no evidence that events during these two transitions are correlated. The relationship between changes in gene expression during very early sporophyte development and during alternation of life cycle generations could be investigated further using a highly dimorphic kelp model system such as Saccharina latissima but we are not aware of any studies that have specifically addressed this point. 

      (3) Since marked changes were observed during the transition from gamete to early sporophyte in Ectocarpus, it would be interesting to know how gene expression patterns change during the transition from gamete to partheno-sporophyte. Would the same patterns of downregulation and upregulation be expected? 

      The sporophyte individuals derived from gamete parthenogenesis (partheno-sporophytes) are indistinguishable morphologically and functionally from diploid sporophytes derived from gamete fusions (see line 76). They also express generation marker genes in a comparable manner (Peters et al., 2008). Based on these observations, we have treated partheno-sporophytes and diploid sporophytes as equivalent in our experiments. For clarity, we have now distinguished partheno-sporophyte from diploid sporophyte samples in Table S1. 

      (4) The authors show a correlation between the degree of dimorphism and generation-biased or generation-specific expression. How was the degree of dimorphism quantified? 

      The degree of dimorphism is illustrated in Figure S1 using the relative size of the two generations as a proxy. Size estimations are approximate because the size of an individual of a particular species is quite variable but the ten species nonetheless represent a very clear gradient of dimorphism due to the extreme differences in size between generations of species at each end of the scale, with the sporophyte generation being several orders of magnitude larger than the gametophyte generation or vice versa. 

      References

      Arun A, Peters NT, Scornet D, Peters AF, Cock JM, Coelho SM. 2013. Non-cell autonomous regulation of life cycle transitions in the model brown alga Ectocarpus. New Phytol 197:503–510. doi:10.1111/nph.12007

      Coelho SM, Godfroy O, Arun A, Le Corguillé G, Peters AF, Cock JM. 2011. OUROBOROS is a master regulator of the gametophyte to sporophyte life cycle transition in the brown alga Ectocarpus. Proc Natl Acad Sci USA 108:11518–11523. doi:10.1073/pnas.1102274108

      Peters AF, Scornet D, Ratin M, Charrier B, Monnier A, Merrien Y, Corre E, Coelho SM, Cock JM. 2008. Life-cycle-generation-specific developmental processes are modified in the immediate upright mutant of the brown alga Ectocarpus siliculosus. Development 135:1503–1512. doi:10.1242/dev.016303

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      In this manuscript, Roy et al. used the previously published deep transfer learning tool, DEGAS, to map disease associations onto single-cell RNA-seq data from bulk expression data. The authors performed independent runs of DEGAS using T2D or obesity status and identified distinct β-cell subpopulations. β-cells with high obese-DEGAS scores contained two subpopulations derived largely from either non-diabetic or T2D donors. Finally, immunostaining using human pancreas sections from healthy and T2D donors validated the heterogeneous expression and depletion of DLK1 in T2D islets.

      Strengths:

      (1) This meta-analysis of previously published scRNA-seq data using a deep transfer learning tool.

      (2) Identification of novel beta cell subclusters.

      (3) Identified a relatively innovative role of DLK1 in T2D disease progression.

      Thank you for your comments on the strengths of our work.

      Weaknesses:

      “There is little overlap between the DE list of bulk RNA-seq analysis in Figure 1D and 1E and the DE list of pseudo-bulk RNA-seq analysis of all cells in Figure S2C.”

      Thank you for pointing this out. To clarify, we did not perform pseudo-bulk analysis on the scRNAseq data. Instead, we used the Seurat FindClusterMarkers function to identify differentially enriched genes between T2D and ND single cells. Indeed, there are many significant genes in new Fig S2D (original S2C). There is some overlap between those data and the DEGs from bulk RNAseq data in Fig 1D, including IAPP, ENTPD3, and FFAR4. However, the limited overlap supports the notion that improved approaches are necessary to identify candidate DEGs from single-cell data, as simply comparing T2D to ND across all β-cells may miss important genes or include many false positives. We have now added clarification to the text to highlight this point.
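      The kind of overlap discussed above can be quantified with simple set operations. The following is a minimal Python sketch, not the analysis used in the manuscript: the function name `overlap_summary` is hypothetical, and the gene lists are illustrative placeholders that contain only the three shared genes named in the response plus made-up filler entries.

```python
def overlap_summary(bulk_degs, sc_degs):
    """Summarize the overlap between two lists of differentially expressed genes."""
    bulk, sc = set(bulk_degs), set(sc_degs)
    shared = bulk & sc
    union = bulk | sc
    return {
        "shared": sorted(shared),
        # Share of each list that is also found in the other list.
        "pct_of_bulk": 100 * len(shared) / len(bulk) if bulk else 0.0,
        "pct_of_sc": 100 * len(shared) / len(sc) if sc else 0.0,
        # Jaccard index: shared genes relative to all genes in either list.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Placeholder lists: only IAPP, ENTPD3 and FFAR4 come from the text above;
# the PLACEHOLDER_* entries are invented to make the example non-trivial.
bulk_example = ["IAPP", "ENTPD3", "FFAR4", "PLACEHOLDER_B1", "PLACEHOLDER_B2"]
sc_example = ["IAPP", "ENTPD3", "FFAR4", "PLACEHOLDER_S1"]

summary = overlap_summary(bulk_example, sc_example)
print(summary)
```

      Reporting the overlap relative to each list separately, as well as the Jaccard index, avoids the ambiguity of a single "percent overlap" when the two lists differ in length.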

      The biological meaning of "beta cells had the lowest scores compared to other cell types" is not clear.

      The relatively lower T2D-DEGAS scores for beta cells overall compared to all other cell types (alpha cells, acinar cells, etc.) likely reflect the fact that beta cell-specific genes can be downregulated in T2D. This affects the DEGAS model, which in turn is reflected in the scores of all cells in the scRNAseq data. By subsetting the beta cells and replotting them on their own, we can analyze the relative differences in DEGAS scores between different subsets of beta cells. We have now amended the text to clarify, as follows:

      “We next mapped the T2D-association scores onto the single cells (Fig 3A). β-cells had a wide distribution of scores, possibly reflecting β-cell heterogeneity or altered β-cell gene expression after onset of T2D (Fig 3B).”

      The figures and supplemental figures were not cited following the sequence, which makes the manuscript very difficult to read. Some supplemental figures, such as Figures S1C-S1D, S2B-S2E, S3A-S3B, were not cited or mentioned in the text.

      We apologize for this oversight and have now amended the text to call out all figures/panels in order of first introduction.

      In Figure 7, the current resolution is too low to determine the localization of DLK1.

      We have confirmed that in our Adobe Illustrator file, each microscopy panel has a DPI of >600. We have also provided the highest quality TIFF file versions of our figure set. We hope the reviewer will have access to download the high-quality TIFF file for Fig 7 if possible, or the editorial staff can provide it.

      As a result of addressing the critiques, we identified CDKN1C as another promising candidate enriched in the β<sup>T2D-DEGAS</sup> and β<sup>obese-DEGAS</sup> subpopulations of β-cells. We found that CDKN1C is heterogeneously expressed at the protein level in β-cells and that it is increased in T2D, in agreement with the DEGAS predictions. We have amended the manuscript to highlight CDKN1C more prominently while still discussing DLK1. DLK1 is very interesting but exhibits greater donor-to-donor variability in its alterations in T2D.

      Reviewer #2 (Public Review):

      Summary:

      The manuscript by Gitanjali Roy et al. applies deep transfer learning (DEGAS) to assign patient-level disease attributes (metadata) to single cells of T2D and non-diabetic patients, including obese patients. This led to the identification of a singular cluster of T2D-associated β-cells; and two subpopulations of obese- β-cells derived from either non-diabetic or T2D donors. The objective was to identify novel and established genes implicated in T2D and obesity. Their final goal is to validate their findings at the protein level using immunohistochemistry of pancreas tissue from non-diabetic and T2D organ donors.

      Strengths:

      This paper is well-written, and the findings are relevant for β-cell heterogeneity in T2D and obesity.

      Thank you for your comments on the positive aspects of our work.

      Weaknesses:

      The validation they provide is not sufficiently strong: no DLK1 immunohistochemistry is shown of obese patient-derived sections.

      We have acquired additional FFPE pancreas samples from the Integrated Islet Distribution Program (IIDP) from lean, overweight, and obese humans with and without T2D. We have now stained for CDKN1C and DLK1 in these samples and have integrated the data into Fig 7 and Fig S5.

      Because the data with CDKN1C was more striking and consistent with the DEGAS predictions, we have chosen to highlight CDKN1C in the main figure and text. The DLK1 data is still quite interesting, although there is substantial variability between T2D donors when it comes to altered staining intensity. DLK1 presents an interesting challenge, given multiple isoforms and cleavage products, and will require further investigation as the focus of a different manuscript.

      Additional presumptive relevant candidates from this transcriptomic analysis should be screened for, at the protein level.

      Thank you for this suggestion. We also identified CDKN1C as a promising candidate enriched in the β<sup>T2D-DEGAS</sup> and β<sup>obese-DEGAS</sup> subpopulations of β-cells. We found that CDKN1C is heterogeneously expressed at the protein level in β-cells and that it is increased in T2D, in agreement with the DEGAS predictions. We have amended the manuscript to highlight CDKN1C more prominently while still discussing DLK1. DLK1 is very interesting but exhibits greater donor-to-donor variability in its alterations in T2D.

      Reviewer #1 (Recommendations For The Authors):

      Please explain and provide the detailed information on what percentage of the DE list of bulk RNA-seq analysis in Figures 1D and 1E overlap with the DE list of pseudo-bulk RNA-seq analysis of all cells in Figure S2C.

      Addressed in response to R1 Comment 1.

      Please provide the definition of each cluster of UMAP of the merged human islet scRNA-seq data.

      In figure panels 2A-B,D-G and 3A, the clusters are now labeled according to the marker genes described in Fig 2C.

      The integrative UMAP needs to be included in the main figure.

      We have now moved previous Fig S2A and S2B into the main figures as new Fig 2A-B.

      All figures and supplemental figures need to be cited following sequence.

      Addressed in response to R1 Comment 3.

      In Figure 7, high-resolution images are needed to determine the colocalization of INS and DLK1.

      Addressed in response to R1 Comment 4.

      Reviewer #2 (Recommendations For The Authors):

      Results: 124-128: Fig 1H_The error bars seem high, please include whether the boxplots are SEM or SD. Also, more detail on statistics is missing.

      Thank you for pointing out the need for clarification here. The whiskers on the box-and-whisker plots are not error bars. By default, in geom_boxplot() and stat_boxplot(), each whisker extends from the edge of the box to the most extreme data point that lies within 1.5 times the interquartile range of that edge. The box itself represents the middle 50% of the data: the bottom of the box is the first quartile, the middle horizontal line is the median, and the top of the box is the third quartile. We have now added a clearer description of this to the figure legend and to the methods section.
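      As a rough illustration of the whisker convention described above, the following Python sketch computes box and whisker positions for a small made-up sample. The function name `boxplot_stats`, the `coef` parameter, and the example values are assumptions for illustration only. Quartiles are computed with the standard-library `statistics.quantiles`, which may differ slightly from the hinges that ggplot2 uses.

```python
import statistics


def boxplot_stats(values, coef=1.5):
    """Return quartiles, whisker ends and outliers using the 1.5 x IQR rule."""
    x = sorted(float(v) for v in values)
    # Quartiles by linear interpolation between order statistics.
    q1, median, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - coef * iqr
    high_fence = q3 + coef * iqr
    # Whiskers stop at the most extreme observations inside the fences,
    # not at the fences themselves.
    inside = [v for v in x if low_fence <= v <= high_fence]
    outliers = [v for v in x if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Made-up sample with one extreme value.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```

      In this sample the upper whisker ends at 9, the largest value inside the upper fence, and the value 50 is drawn as an individual outlier point rather than stretching the whisker.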

      The genes shown in Fig 1H were selected because they are found in the T2D Knowledge Portal, illustrating a clear link to T2D. At the T2DKP (https://t2d.hugeamp.org/research.html?pageid=mccarthy_t2d_247), PAX4 and APOE are listed as causal, SLC2A2 has strong evidence, and CYTIP has a linked SNP. This is now discussed in the results section before the Fig 1H callout. These genes are significantly differentially expressed using edgeR in panel 1D with FDR<0.05. The individual data points for each human are shown.

      Figure 6: In general, the representation of the data is quite misleading. It would be nice to have an alternative way of presenting the data, especially when comparing beta-obese differentially expressed genes and pathways and T2D beta obese. Maybe an additional Venn diagram can help. Also, it would be nice to compare data from T2D beta nonobese to ND beta obese, especially given how the story is presented in the paper.

      Thank you for pointing out this clarity issue. We agree that additional, alternative ways to present the data would be helpful. When we performed DEGAS using BMI as the disease feature, we noted two major clusters and one minor cluster of high-scoring cells in Fig 6A.

      Author response image 1.

      Author response image 2.

      This contrasted with the score map when we ran DEGAS with T2D as the disease feature.

      The main difference seems to be that the low-scoring β<sup>T2D-DEGAS</sup> cluster differs from the low-scoring β<sup>obese-DEGAS</sup> cluster.

      Therefore, we could not easily apply thresholding to the β<sup>obese-DEGAS</sup> scores, so instead we subsetted them for comparison. It was also apparent from the metadata that single cells from the left-hand side of the β-cell cluster came from donors that had T2D.

      To clarify these points and address the reviewer’s concerns, we have added a comparison of the DEGs identified for β<sup>T2D-DEGAS</sup> high vs. low and T2D-β<sup>obese-DEGAS</sup> vs. ND-β<sup>obese-DEGAS</sup> in Fig S4J, also shown below. DLK1 and CDKN1C fall within the intersection, in addition to being two of the most enriched candidates in each DEGAS run (Fig 4C and Fig 6D).

      220-222: Figure 7C_ Is one of the nondiabetic beta samples obese? If so, please clearly label it; if not, that info is missing. One would expect that the DLK1 expression in ND obese beta cells resembles the T2D beta cell and not ND non-obese beta cells. That's a big point of this entire work, and experimentally missing. Additional candidate proteins should be checked.

      We have amended the entire Fig 7 to include more data for DLK1 staining and to add staining for CDKN1C. We also used CellProfiler to quantify the intensity distribution of DLK1 staining in β-cells and found that, overall, our initial conclusions were not supported once the sample size was increased. DLK1 expression is heterogeneous both within and between donors. While we have data from T2D donors showing that DLK1 is lost, other T2D samples indicate that DLK1 is not always lost. At least in the sample set we have analyzed so far, we cannot conclude that DLK1 staining is clearly correlated with diabetes or BMI. Why DLK1 labels some β-cells and not others, and what role this subpopulation plays, remain open questions.

      Separately, we greatly appreciate the reviewer’s suggestion to validate additional candidates, as this led us to CDKN1C. In new Fig 7E-H we now show that CDKN1C is increased in T2D β-cells, in agreement with the DEGAS predictions.

      This work shows that machine learning approaches are powerful for identifying potential candidates, but it also highlights the need for these predictions to be validated at the protein level in human samples.

      Discussion: Based on lack of supporting IHC data, this is an overstatement:

      “DLK1 expression highly overlapped with high scoring βT2D DEGAS cells (Figure 7A) and with T2D βobese-DEGAS cells (Figure 7B). DLK1 immunostaining primarily colocalized with β-cells in non-diabetic human pancreas (Figure 7C). DLK1 showed heterogeneous expression within islets and between islets within the same pancreas section, wherein some islets had DLK1/INS co-staining in most β-cells and other islets had only a few DLK1+ β-cells. In the T2D pancreas, DLK1 staining was much less intense and in fewer β-cells, yet DLK1+/INS+ cells were observed (Figure 7C). This contrasts with the relatively higher DLK1 gene expression seen in the β-cells from the βT2D-DEGAS and T2D-βobese-DEGAS subpopulations (Figure 4D & 6C) as highlighted in Figure 7A,B. which were up- or down-regulated in subpopulations of β-cells identified by DEGAS, and to validate our findings at the protein level using immunohistochemistry of pancreas tissue from non-diabetic and T2D organ donors.”

      This part was at the very end of the last results subsection. This section has been largely rewritten to better describe the new figure and the language has been tempered to not overinterpret the data shown.

      “Our current findings applying DEGAS to islet data have implications for β-cell heterogeneity in T2D and obesity. The abundance of T2D-related factors and functional β-cell genes in our analysis validates applying DEGAS to islet data to identify disease-associated phenotypes and increase confidence in the novel candidate.”

      This part was found at the end of the Background section. We have removed the second sentence to temper the language.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Structural colors (SC) are based on nanostructures reflecting and scattering light and producing optical wave interference. All kinds of living organisms exhibit SC. However, understanding the molecular mechanisms and genes involved may be complicated due to the complexity of these organisms. Hence, bacteria that exhibit SC in colonies, such as Flavobacterium IR1, can be good models.

      Based on previous genomic mining and co-occurrence with SC in flavobacterial strains, this article focuses on the role of a specific gene, moeA, in SC of Flavobacterium IR1 strain colonies on an agar plate. moeA is involved in the synthesis of the molybdenum cofactor, which is necessary for the activity of key metabolic enzymes in diverse pathways.

      The authors clearly showed that the absence of moeA shifts SC properties in a way that depends on the nutritional conditions. They further bring evidence that this effect was related to several properties of the colony, all impacted by the moeA mutant: cell-cell organization, cell motility and colony spreading, and metabolism of complex carbohydrates. Hence, by linking SC to a single gene in appearance, this work points to cellular organization (as a result of cell-cell arrangement and motility) and metabolism of polysaccharides as key factors for SC in a gliding bacterium. This may prove useful for designing molecular strategies to control SC in bacterial-based biomaterials.

      Strengths:

      The topic is very interesting from a fundamental viewpoint and has great potential in the field of biomaterials.

      Thank you for this.

      The article is easy to read. It builds on previous studies with already established tools to characterize SC at the level of the flavobacterial colony. Experiments are well described and well executed. In addition, the SIBR-Cas method for chromosome engineering in Flavobacteria is the most recent and is a leap forward for future studies in this model, even beyond SC.

      We appreciate these comments.

      Weaknesses:

      The paper appears a bit too descriptive and could be better organized. Some of the results, in particular the proteomic comparison, are not well exploited (not explored experimentally). In my opinion, the problem originates from the difficulty in explaining the link between the absence of moeA and the alterations observed at the level of colony spreading and polysaccharide utilization, and the variation in proteomic content.

      We have looked at the organisation of the manuscript carefully in this revision, as suggested. In terms of the proteomics, a large number of proteins are affected by the moeA deletion and not all could be followed up. We chose spreading, structural colour formation and starch degradation to follow up phenotypically, as the most likely to be relevant. For example, in L615-617 we discuss the downregulation of GldL (which is known to be involved in Flavobacterial gliding motility [Shrivastava et al., 2013]) in the moeA KO as a possible explanation for the reduced colony spreading of this mutant. Changes in polysaccharide (starch) utilization were seen on solid medium, as well as in the proteomic profile, where we observed the upregulation of carbohydrate metabolism proteins linked to PUL (polysaccharide utilisation locus) operons (Terrapon et al., 2015), such as PAM95095-90 (Figure 8), and other carbohydrate metabolism-related proteins, including a pectate lyase (Table S7) which is involved in starch degradation (Aspeborg et al., 2012). As noted in L555-566 and Figure 9, alterations in starch metabolism were also investigated experimentally.

      First, the effect of moeA deletion on molybdenum cofactor synthesis should be addressed.

      MoeA is the last enzyme in the MoCo synthesis pathway. Thus, if only MoeA is absent, the cell would accumulate MPT-AMP (molybdopterin-adenosine monophosphate) (Iobbi-Nivol & Leimkühler, 2013), and the expressed molybdoenzymes would not be functional. In L582-585, we commented on how the lack of molybdenum cofactor may affect the synthesis of molybdoenzymes. If you meant analysing the presence of the small molecules, i.e. the cofactors involved in these pathways, that was an assay we were not able to perform. However, in L585-587, we addressed how the deletion of moeA affected the proteins encoded by the rest of the genes in the operon, which is relevant to the question.

      Second, as I was reading the entire manuscript, I kept asking myself if moeA (and by extension molybdenum cofactor) was really involved in SC or it was an indirect effect. For example, what if the absence of moeA alters the cell envelope because the synthesis of its building blocks is perturbed, then subsequently perturbates all related processes, including gliding motility and protein secretion? It would help to know if the effects on colony spreading and polysaccharide metabolism can be uncoupled. I don't think the authors discussed that clearly.

      The message of the paper is that the moeA gene, as predicted from a previous genomics analysis, is important in SC. This prediction is based on the representation of the moeA gene in genomes of bacteria that display SC, and the analysis does not predict the mechanism. When moeA was knocked out, a significant change in structural colour occurred, supporting this hypothesis. Whether this effect is direct or indirect is difficult to assess, as this referee rightly suggests. To follow up this central result, we performed proteomics (both intra- and extracellular). As we observed, the deletion of a single gene generated many changes in the proteomic profile, and thus in biological processes. Based on the known functions of molybdenum cofactor, we could only hypothesize that pterin metabolism is important for SC, not exactly how.

      We have discussed the links between gliding/spreading and polysaccharide metabolism more clearly, with reference to the literature, as quite a bit is known here including possible links to SC.

      “Polysaccharide metabolism in IR1 has been linked to changes in colony color and motility through the study of fucoidan metabolism (van de Kerkhof et al., 2022). Polysaccharide degradation and gliding motility are coupled to the same mechanism: the phylum-specific type IX secretion system, used for the secretion of enzymes and proteins involved in both functions (McKee et al., 2021).” [L622-626]

      Reviewer #2 (Public review):

      Summary:

      The authors constructed an in-frame deletion of moeA gene, which is involved in molybdopterin cofactor (MoCo) biosynthesis, and investigated its role in structural colors in Flavobacterium IR1. The deletion of moeA shifted colony color from green to blue, reduced colony spreading, and increased starch degradation, which was attributed to the upregulation of various proteins in polysaccharide utilization loci. This study lays the ground for developing new colorants by modifying genes involved in structural colors.

      Major strengths and weaknesses:

      The authors conducted well-designed experiments with appropriate controls and the results in the paper are presented in a logical manner, which supports their conclusions.

      We appreciate these comments.

      Using statistical tests to compare the differences between the wild type and moeA mutant, and adding a significance bar in Figure 4B, would strengthen their claims regarding differences in cell motility.

      Thank you. Figure 4B contains the significance bars that represent the standard deviation of the mean value of the three replicates, but we have modified the figure to make them clearer.

      Additionally, in the result section (Figure 6), the authors suggest that the shift in blue color is "caused by cells which are still highly ordered but narrower", which to my knowledge is not backed up by any experimental evidence.

      Thanks. We mentioned that the mutant cells are narrower than the wild type based on the estimated periodicity resulting from the goniometry analysis (L427-430). We will now say “likely to be narrower based on the estimated periodicity from the optical analysis” rather than just “narrower”.

      “This optical analysis aligns with visual observations, confirming the blue shift in ΔmoeA, and suggests that this change in SC is caused by cells which are likely to be narrower based on the estimated periodicity from the optical analysis.” [L409-411]

      Overall, this is a well-written paper in which the authors effectively address their research questions through proper experimentation. This work will help us understand the genetic basis of structural colors in Flavobacterium and open new avenues to study the roles of additional genes and proteins in structural colors.

      Much appreciated.

      Recommendations for the authors:

      Reviewing Editor Comments:

      As you will see, the reviewers were rather positive about the paper but suggested a number of points to improve it, including a discussion of the direct role of moeA as well as specific editorial comments.

      Reviewer #1 (Recommendations for the authors):

      More specific comments to the authors:

      (1) Line 300, Paragraph on bioinformatic analysis of molybdopterin operon: As written, it is not clear whether this operon is crucial for pterin cofactor synthesis or only some genes are involved. And what is the contribution of moeA?

      Based on the bioinformatic analysis in Zomer et al., 2024, we know the scores of the genes in the molybdopterin cofactor synthesis operon that, in addition to moeA, may be relevant to the display of SC. We chose to knock out moeA because it had the highest score, taking care to delete the coding sequence and not any upstream promoter. The other genes in the predicted operon are moaE, moaC2, and moaA. In the proteomic analysis (L435-442), we then analysed which of the proteins encoded by this operon were upregulated (MoaA, MoaC2, and MobA), also indicating the unaltered proteins (MoeZ and MoaE) and the undetected proteins (MoaD and SumT). Nevertheless, the operon is crucial for pterin cofactor synthesis because it contains all the genes involved in the pathway. moeA encodes the enzyme for the last reaction of the pathway, so the mutated pathway produces adenylated molybdopterin (MPT-AMP) instead of molybdenum cofactor (MoCo).

      (2) Paragraph line 342 on moeA mutant phenotyping:

      Is the reduction in colony spreading caused by a defect in single-cell gliding motility or is the cause more complex? This can be quantified.

      We believe the cause is more complex. As mentioned above, in L615-617 we discuss the downregulation of GldL (which is known to be involved in Flavobacterial gliding motility [Shrivastava et al., 2013]) in the moeA KO as a possible explanation for the reduced colony spreading of this mutant. The blue shift in SC cannot be explained simply by reduced spreading, but must (from the optical analysis) indicate changes in cell organisation/dimensions.

      (3) During the description of the moeA mutant phenotype (associated with Figures 2 and 4) and throughout the article, the optical properties are « functions » of colony spreading and moeA-dependent metabolism. However it is not quite clear if these two effects are independent or if one may be a consequence of the other.

      As noted above, colony spreading alone does not explain the blue-shift in SC observed. Given the function of MoeA (molybdate insertion into MPT-AMP [adenylated molybdopterin], MoMPT [molybdenum-molybdopterin] formation) for the synthesis of MoCo (molybdenum cofactor), the primary effect seems to be on metabolism but as we are dealing with an influential enzymatic cofactor a number of secondary effects are likely, and indeed the proteomics supports this. It is likely that the effect on spreading is secondary as seen with the downregulation of GldL (see above), but we cannot be sure.

      (4) Paragraph starting line 381 and Figure 5 on gliding motility:

      Gliding motility has to be tested at the level of single cells, allowing a more thorough characterization of the spreading defects. In addition, since gliding is entangled with Type IX-dependent secretion in Flavobacteria, the authors should test if Type IX-dependent secretion was perturbed in the absence of moeA.

      Based on the intracellular and extracellular proteomic analyses, the T9SS proteins regulated in the absence of moeA are GldL and SprT (downregulated) and PorU (upregulated). The table below shows the log2 FC (moeA/WT) of each of these extracellular proteins:

      Author response table 1.

      <-1: downregulated in moeA KO, -1<X<1: no significant regulation, >1: upregulated in moeA KO, -: not detected
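      The legend above amounts to a simple threshold rule on log2 fold change, where a value of 1 corresponds to a two-fold difference. A minimal Python sketch of that rule follows. The function name `classify_log2fc` and the example values are hypothetical and are not taken from the table; the legend does not say how values of exactly -1 or 1 are treated, so the sketch assumes they count as "no significant regulation".

```python
from typing import Optional


def classify_log2fc(log2fc: Optional[float], threshold: float = 1.0) -> str:
    """Map a log2 fold change (moeA KO / WT) to the categories in the legend."""
    if log2fc is None:
        # Corresponds to the "-" entries: protein not detected.
        return "not detected"
    if log2fc < -threshold:
        return "downregulated in moeA KO"
    if log2fc > threshold:
        return "upregulated in moeA KO"
    # Assumption: boundary values of exactly +/- threshold land here.
    return "no significant regulation"


# Hypothetical values, for illustration only.
examples = {"protein_A": -1.8, "protein_B": 0.3, "protein_C": 2.1, "protein_D": None}
labels = {name: classify_log2fc(value) for name, value in examples.items()}
print(labels)
```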

      (5) L401: In my opinion, the section "Quantification of the optical responses of IR1 WT and ΔmoeA colonies" should be moved up, before the characterization of motility.

      We have done this, as suggested. The section was moved from L401-423 to L388-411.

      (6) L475: Proteome comparison: « Of the total known proteins in IR1, 27.5% (1,504 proteins) extracellular proteins were identified » Are some of these proteins also found in the cell fraction? Wouldn't it be more accurate to write that « 1504 proteins were found in the extracellular fraction »?

      We have done this, as suggested.

      “Of the total known proteins in IR1, 27.5% (1,504 proteins) were detected in the extracellular fraction, 60.4% (909) were statistically significant (p<0.01), with 20.5% (186) considered downregulated, and 20% (182) upregulated in ΔmoeA (Figure 7B).” [L484-486]
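      The denominators behind the percentages in the quoted sentence are not stated explicitly. The short Python check below is a sketch under the assumption that 60.4% is taken relative to the 1,504 extracellular proteins and that 20.5% and 20% are taken relative to the 909 significant proteins; under that assumption the counts reproduce the quoted figures. The helper name `pct` is hypothetical.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


extracellular = 1504  # proteins detected in the extracellular fraction
significant = 909     # of those, statistically significant (p < 0.01)
down = 186            # significant and considered downregulated in the mutant
up = 182              # significant and considered upregulated in the mutant

pct_significant = pct(significant, extracellular)
pct_down = pct(down, significant)
pct_up = pct(up, significant)

print(pct_significant, pct_down, pct_up)
```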

      How can the authors exclude contamination of the extracellular fraction? This could easily explain the number of proteins lacking secretion signals: "29.6% (55) were likely secreted through a non-classical way, lacking typical secretion sequence motifs in their N-terminus."

      Based on the results from SecretomeP and SignalP, we excluded contamination, reducing the significantly downregulated proteins from 186 (L476) to 69 (L486), and the significantly upregulated ones from 182 (L477) to 111 (L500).

      (7) L490: if the protein misannotated flagellin is highly downregulated, why not push the analysis a bit further and ask what true function may be perturbed? In addition, it should not be classified as a motility protein in Table S6 and considered as a motility protein in the article.

      We reconsidered the information given by this protein and decided to remove it because, after checking the homology of the polypeptide by BLAST searching, we feel the flagellin assignment is probably due to a misannotation.

      As is, the whole proteomic section is not that useful. Too many functions are evoked and the reader is not directed toward any particular conclusion. The most convincing hits from the proteomic analysis should be confirmed using another method. Transcriptional regulation could be easily probed by RT-qPCR. Or, since genetics is possible, proteins could be tagged and levels compared by western blot maybe? Do knock-out of the encoding genes generate any phenotype on SC? This would bring weight to the proteomic analysis.

      We have revised the proteomics section and removed functions that are not directly relevant to our conclusion.

      We feel the most important observation suggested by proteomics was the possible link between moeA and starch metabolism, because the metabolism of complex polysaccharides is important in the Flavobacteriia and known to be linked to SC (van de Kerkhof et al., 2022). It was not possible to follow up every pathway suggested by the proteomics, but the study is appropriately performed with the correct statistics.

      (8) Figure 9 : Does the absence of moeA affect the spreading of ASWS? Were colony sizes similar during the starch degradation assay? How can the authors rule out the idea that starch degradation is impacted by the difference in spreading rather than an independent function of moeA in starch metabolism? Slower spreading could lead to the accumulation of amylases, hence stronger activity. Why does starch degradation only accumulate at the center of the colony in the WT case?

      The WT and ΔmoeA colonies were of similar size during the starch degradation assay (2 days). However, after day 3, only WT colonies kept expanding in diameter.

      Starch degradation is logically in the centre of the colony as it is where the greatest concentration of cells exists, secreting degradative enzymes, for the longest time. Presumably starch degradation at the colony edge is not yet seen as the action of extracellular enzymes is low and has not had time to degrade the starch to the point that there is no iodine staining.

      “In contrast to other media where ΔmoeA colony expansion was less than WT, the ΔmoeA showed similar colony spreading and stronger starch degradation, supporting a role of moeA in complex polysaccharides metabolism.” [L562-565]

      (9) Finally, I am not quite sure what the authors mean by « a role of moeA in complex polysaccharides metabolism ». Are they referring to enzymes secreted in the medium to degrade starch? or to the incorporation and use of starch degradation products?

      We meant that the deletion of moeA showed an increase of extracellular starch degradation as seen in the iodine assay (Figure 9), as well as the upregulation of three different PUL operons (Figure 8).

      Reviewer #2 (Recommendations for the authors):

      The paper in general is well written with proper experimentation. However, here are a few recommendations for improving the writing and presentation, including minor corrections to the text and figures.

      Thank you.

      (1) It would be helpful for the readers if you could expand on "some metabolic pathways" in line 71. Please provide examples of metabolic pathways that are linked to SC.

      We have done this.

      “A recent bioinformatic study has shown the possible link of some metabolic pathways, such as carbohydrate, pterin, and acetolactate metabolism, to bacterial SC (Zomer et al., 2024).”[L70-72]

      (2) "Line 79 : a bioinformatics analysis", please mention what kind of bioinformatics analysis was done and by whom to provide clarity for the readers: Either mention bio info analysis or give more details on what kind of bio info analysis and study done by whom.

      We have clarified this, as suggested.

      “A large-scale, genomic-based analysis of 117 bacteria strains (87 with SC and 30 without) identified genes potentially involved in SC by comparing gene presence/absence, providing a SC-score (Zomer et al., 2024). By this method, pterin pathway genes were strongly predicted to be involved in SC.” [L80-83]

      (3) Please correct "Bacteria strains used in this study" to "bacterial" strains in Line 122.

      We have done so.

      (4) Please indicate in "Lines 394-396" that there were no vortex patterns observed in the moeA mutant.

      We have done so.

      “In contrast, ΔmoeA exhibited limited motility, with a more tightly packed cell organization and a fine, slow-moving layer at the edge (Figure 6, blue arrows), and did not show a ‘vortex’ pattern. This suggests that moeA deletion significantly impairs cell motility and colony expansion.” [L428-431]

      (5) In Figure 4 it looks like with a different carbon source (ASWB with agar and Fucoidan (ASWBF)) the moeA mutant and wild type exchanges its phenotype compared to ASWBKC. Could you explain why this happens in the discussion by highlighting the differences between fucose and Kappa-Carrageenan or confirm if there are any differences in the carbohydrate utilization between the wild type and moeA mutant using biolog assays?

      We have explained the differences. Biolog would not be appropriate as we are looking for metabolic processes of bacteria on surfaces (agar) and this is not necessarily appropriate to biolog, which we understand uses liquid cultivation in microplates.

      “On different polysaccharide media, the ΔmoeA strain showed varied SC and colony expansion patterns: green/blue SC and low colony expansion on agar, intense blue SC and low colony expansion on kappa-carrageenan, dull green SC and low colony expansion on fucoidan, and blue/green SC with higher colony expansion on starch. Interestingly, the color phenotype of the WT and ΔmoeA exchanged their phenotype on kappa-carrageenan (a simple linear sulfated polysaccharide of D-galactopyranose) and fucoidan (a complex sulfated polysaccharide of fucose and other sugars as galactose, xylose, arabinose and rhamnose), showing the importance of the polysaccharide metabolism in SC. While reduced motility has been associated with dull or absent SC, and reduced polysaccharide metabolism (Kientz et al., 2012a; Johansen et al., 2018), ΔmoeA showed reduced motility, but an intense blue SC, and high polysaccharide metabolism. Based on these results, we established a link among polysaccharide metabolism, MoCo biosynthesis, and SC, showing that intense SC is not strictly dependent on motility.” [L636-648]

      (6) In the discussion "Line 632" it is unclear what loss is being limited, and it would help strengthen your discussion if you could add references for lines: 633-636. There are a lot of hypotheses in lines 637-642, it would help the readers if you could clearly mention that these are hypotheses and will need experimental evidence or provide appropriate evidence to support these claims.

      We have done this.

      “Ecologically, we hypothesize that dense, highly structured bacterial colonies, such as necessary for the SC phenotype, can enhance the uptake of metabolic degradation products from complex polysaccharides. These large macromolecules are often partially hydrolyzed extracellularly because they are too large to pass through bacterial cell membranes. For example, marine Vibrionaceae strains that produce lower levels of extracellular alginate lyases tend to aggregate more strongly, potentially facilitating localized degradation and uptake of polysaccharides (D’Souza et al., 2023). Additionally, certain marine bacteria employ a "selfish" mechanism to internalize large polysaccharide fragments into their periplasmic space, minimizing loss to the environment and enhancing substrate utilization (Reintjes et al., 2017). Bacteria secrete enzymes into the surrounding environment to break these polysaccharides down into more easily absorbable monosaccharides or oligosaccharides. This mechanism suggests that the colony structure could create a physical barrier that keeps these products concentrated and near the cells, allowing the colony to efficiently access and utilize these products, preventing the leakage into the surrounding environment. While SC may also yield other ecological benefits associated with growth in biofilms, the highly structured colonies that characterize SC may be more resistant against invasion by competitor species scavenging for degradation products, than an unstructured biofilm. This model is consistent with the observation that SC is associated with polysaccharide metabolism genes, and with the recent observation that SC is mainly localized on surface and interface environments such as air-water interfaces, tidal flats, and marine particles (Zomer et al., 2024).” [L650-670]

      (7) It would help the readers if you could expand on how polysaccharide metabolism is linked to motility in Line 610.

      As indicated previously, this is known and we will clarify.

      “Polysaccharide metabolism in IR1 has been linked to changes in colony color and motility through the study of fucoidan metabolism (van de Kerkhof et al., 2022).” [L622-623]

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment:

      “…However, the findings are reliant on high concentrations of inhibitor drugs, and mechanistic details about the molecular interaction and respective functions of ABHD2 and mPRb are incomplete.”

      As discussed below in the response to Reviewers, the drug concentrations used span the full dose response of the active range of each drug. In cases where the drug concentrations required to block oocyte maturation were significantly higher than those reported in the literature, we considered those drugs ineffective. In terms of the molecular details of the mechanistic interaction between mPRb and ABHD2, we now provide additional data confirming their molecular interaction to produce PLA2 activity where each protein alone is insufficient. Although these new studies provide more mechanistic insights, there remain details of the ABHD2-mPR interactions that would need to be addressed in future studies, which are beyond the scope of the current, already extensive study.

      Public Reviews:

      Reviewer 1

      (1) The mechanism governing the molecular assembly of mPRbeta and ABHD2 remains unclear. Are they constitutively associated or is their association ligand-dependent? Does P4 bind not only to mPRbeta but also to ABHD2, as indicated in Figure 6J? In the latter case, the reviewer suggests that the authors conduct a binding experiment using labeled P4 with ABHD2 to confirm this interaction and assess any potential positive or negative cooperativity with a partner receptor.

      The co-IP experiments presented in Figure 5E argue that the two receptors are constitutively associated at rest before exposure to P4, but at low levels, since addition of P4 increases the association between mPRβ and ABHD2 by ~2-fold. Importantly, we know from previous work (Nader et al., 2020) and from imaging experiments in this study that mPR recycles in immature oocytes between the PM and the endosomal compartment. It is not clear at this point within which subcellular compartment the basal association of mPR and ABHD2 occurs. We have tried to elucidate this point but have not been able to generate a functional tagged ABHD2. We generated GFP-tagged ABHD2 at both the N- and C-terminus, but these constructs were not functional in terms of their ability to rescue ABHD2 knockdown. This prevented us from testing the association dynamics between ABHD2 and mPR.

      Regarding whether ABHD2 in the oocyte directly binds P4 or not, we had no data directly supporting this in the initial submission; rather, we based the cartoon in Fig. 6J on the findings from Miller et al. (Science 2016), who showed that ABHD2 in sperm binds biotinylated P4. With the use of a new expression system to produce ABHD2 in vitro (please see below), we were able to try the experiment suggested by the Reviewer. In vitro expressed ABHD2 was incubated with biotinylated P4, and binding was tested on a streptavidin column. Under these conditions we could not detect any specific binding of P4 to ABHD2; however, these experiments remain somewhat preliminary and would require validation using additional approaches to conclusively test whether Xenopus ABHD2 binds P4 or not. The discrepancy with the Miller et al. findings could be species specific, as they tested mammalian ABHD2.

      (2) The authors have diligently determined the metabolite profile using numerous egg cells. However, the interpretation of the results appears incomplete, and inconsistencies were noted between Figure 2B and Supplementary Figure 2C. Furthermore, PGE2 and D2 serve distinct roles and have different elution patterns by LC-MS/MS, thus requiring separate measurements. In addition, the extremely short half-life of PGI2 necessitates the measurement of its stable metabolite, 6-keto-PGF1a, instead. The authors also need to clarify why they measured PGF1a but not PGF2a.

      We believe the Reviewer meant to indicate discrepancies between Fig. 2E (not 2B) and Supp. Fig. 2C. Indeed, the Reviewer is correct, and this is because Fig. 2E shows pooled normalized data on a per PG species and per frog basis, whereas Supp. Fig. 2C shows an example of absolute raw levels from a single frog to illustrate the relative basal abundance of the different PG species. We had failed to clarify this in the Supp. Fig. 2C figure legend, and have now added this clarification in the revised manuscript. So, the discrepancies are due to variation between different donor animals, which is highlighted in Supp. Fig. 2A. Furthermore, to minimize confusion, in the revised manuscript we revised Supp. Fig. 2C to show only PG levels at rest, to illustrate basal levels of the different PG species relative to each other, which is the goal of this supplemental figure.

      (3) Although they propose PGs, LPA, and S1P are important downstream mediators, the exact roles of the identified lipid mediators have not been clearly demonstrated, as receptor expression and activation were not demonstrated. While the authors showed S1PR3 expression and its importance by genetic manipulation, there was no observed change in S1P levels following P4 treatment (Supplementary Figure 2D). It is essential to identify which receptors (subtypes) are expressed and how downstream signaling pathways (PKA, Ca, MAPK, etc.) relate to oocyte phenotypes.

      We agree conceptually with the Reviewer that identifying the details of the signaling of the different GPCRs involved in oocyte maturation would be interesting. However, our lipidomic data argue that the activation of a PLA2 early in the maturation process in response to P4 leads to the production of multiple lipid messengers that would activate GPCRs and branch out the signaling pathway to activate various pathways required for the proper and timely progression of oocyte maturation. Preparing the egg for fertilization is complex; so, it is not surprising that a variety of pathways are activated simultaneously to properly initiate both cytoplasmic and nuclear maturation to transition the egg from its meiotic arrest state to be ready to support the rapid growth during early embryogenesis. We focus on the S1P signaling pathway specifically because, as pointed out by the Reviewer, we could not detect an increase in S1P even though our metabolomic data collectively argued for an increase. Our results on the S1P pathway -as well as a plethora of other studies historically in the literature that we allude to in the manuscript- argue that these different GPCRs support and regulate oocyte maturation, but they are not essential for the early maturation signaling pathway. For example, for S1P, as shown in Figure 4, the delay/inhibition of oocyte maturation due to S1PR3 knockdown can be reversed at high levels of P4, which presumably leads to higher levels of other lipid mediators that would bypass the need for signaling through S1PR3. This is reminiscent of the kinase cascade driving oocyte maturation where there is significant redundancy and feedback regulation. Therefore, analyzing each receptor subtype that may regulate the different PG species, LPA, and S1P would be a tedious and time-consuming undertaking that goes beyond the scope of the current manuscript. 
      More importantly, based on the above arguments, we suggest that findings from such an analysis, similar to the conclusions from the S1PR3 studies (Fig. 4), would show a modulatory role in oocyte maturation rather than a core requirement for the maturation process, as observed with mPR and ABHD2. Thus, they would provide relatively little insight into the core signaling pathway driving P4-mediated oocyte maturation.

      Reviewer 2:

      (1) The ABHD2 knockdown and rescue, presented in Fig 1, is one of the most important findings. It can and should be presented in more detail to allow the reader to understand the experiments better. E.g.: the antisense oligos hybridize to both ABHD2.S and ABHD2.L, and they knock down both (ectopically expressed) proteins. Do they hybridize to either or both of the rescue constructs? If so, wouldn't you expect that both rescue constructs would rescue the phenotype since they both should sequester the AS oligo? Maybe I'm missing something here.

      For the ABHD2 rescue experiment, the ABHD2 constructs (S or L) were expressed 48 hrs before the antisense was injected. The experiment was conducted in this way to avoid the potential confounding issue of both constructs sequestering the antisense. The assumption is that the injected RNA after protein expression would be degraded thus allowing the injected antisense to target endogenous ABHD2. The idea is to confirm that ABHD2.S expression alone is sufficient to rescue the antisense knockdown as confirmed experimentally.

      However, to further confirm the rescue, we performed the experiment in a different chronological order, where we started with injecting the antisense to knock down endogenous ABHD2 and this was followed 24 hrs later by expressing wild type ABHD2.S. As shown in Author response image 1 this also rescues the knockdown.

      Author response image 1.

      ABHD2 knockdown and rescue. Oocytes were injected with control antisense (Ctrl AS) or specific ABHD2 antisense (AS) oligonucleotides and incubated at 18 °C for 24 hours. Oocytes were then injected with mRNA to overexpress ABHD2.S for 48 hours and then treated with P4 overnight. The histogram shows % GVBD in naïve oocytes and in oocytes injected with control or ABHD2 antisense, with or without mRNA to overexpress ABHD2.S.

      In addition, it is critical to know whether the partial rescue (Fig 1E, I, and K) is accomplished by expressing reasonable levels of the ABHD2 protein, or only by greatly overexpressing the protein. The authors' antibodies do not appear to be sensitive enough to detect the endogenous levels of ABHD2.S or .L, but they do detect the overexpressed proteins (Fig 1D). The authors could thus start by microinjecting enough of the rescue mRNAs to get detectable protein levels, and then titer down, assessing how low one can go and still get rescue. And/or compare the mRNA levels achieved with the rescue construct to the endogenous mRNAs.

      The dose response of ABHD2 protein expression in correlation with rescue of the ABHD2 knockdown is shown indirectly in Figure 1I and 1J. In these experiments, ABHD2 knockdown was rescued using either the WT protein or two mutants (H120A and N125A). All three constructs rescued ABHD2 KD with equal efficiency (Fig. 1I), even though their expression levels varied (Fig. 1J). The WT protein was expressed at significantly higher levels than both mutants, and N125A was expressed at higher levels than H120A (Fig. 1J; note the similar tubulin loading control). Crude estimation from the WBs argues for WT protein expression being ~3x that of H120A and ~2x that of N125A, yet all three show similar rescue of the ABHD2 knockdown (Fig. 1I). This argues that low levels of ABHD2 expression are sufficient to rescue the knockdown, consistent with the catalytic enzymatic nature of the ABHD2 PLA2 activity.

      Finally, please make it clear what is meant by n = 7 or n = 3 for these experiments. Does n = 7 mean 7 independently lysed oocytes from the same frog? Or 7 groups of, say, 10 oocytes from the same frog? Or different frogs on different days? I could not tell from the figure legends, the methods, or the supplementary methods. Ideally one wants to be sure that the knockdown and rescue can be demonstrated in different batches of oocytes, and that the experimental variability is substantially smaller than the effect size.

      The n reflects the number of independent female frogs. We have added this information to the figure legends. For each donor frog at each time point 10-30 oocytes were used.

      (2) The lipidomics results should be presented more clearly. First, please drop the heat map presentations (Fig 2A-C) and instead show individual time course results, like those shown in Fig 2E, which make it easy to see the magnitude of the change and the experiment-to-experiment variability. As it stands, the lipidomics data really cannot be critically assessed.

      [Even as heat map data go, panels A-C are hard to understand. The labels are too small, especially on the heat map on the right side of panel B. The 25 rows in panel C are not defined (the legend makes me think the panel is data from 10 individual oocytes, so are the 25 rows 25 metabolites? If so, are the individual oocyte data being collapsed into an average? Doesn't that defeat the purpose of assessing individual oocytes?) And those readers with red-green colorblindness (8% of men) will not be able to tell an increase from a decrease. But please don't bother improving the heat maps; they should just be replaced with more informative bar graphs or scatter plots.]

      We have revised the lipidomics data as requested by the Reviewer. The Reviewer asked that we show the data as a time course with each individual frog as in Fig. 2E. This turns out to be confusing and not a good way to present the data (please see Author response image 2).

      Author response image 2.

      Metabolite levels from 5 replicates of 10 oocytes each at each time point were measured and averaged per frog and per time point. Fold change was measured as the ratio at the 5- and 30-min time points relative to untreated oocytes (T0). FCs that are not statistically significant are shown as faded. Oocytes with mPR knockdown (KD) are boxed in green and ABHD2-KD in purple.
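
      As an illustration of the fold-change calculation described in the legend above (replicates averaged per time point, then expressed as a ratio to untreated oocytes at T0), a minimal Python sketch follows. The function name, the time-point labels, and all numerical values are hypothetical and are not taken from the study; the sketch covers only the averaging and ratio steps, not the statistical testing.

```python
from statistics import mean


def fold_changes(levels_by_timepoint, baseline="T0"):
    """Average replicates per time point and return each mean divided by the baseline mean.

    levels_by_timepoint maps a time-point label to a list of replicate measurements
    for one metabolite in one frog (hypothetical data layout).
    """
    means = {t: mean(values) for t, values in levels_by_timepoint.items()}
    base = means[baseline]
    if base == 0:
        raise ValueError("baseline mean is zero; fold change is undefined")
    # Report every non-baseline time point relative to the baseline.
    return {t: m / base for t, m in means.items() if t != baseline}


# Hypothetical levels for one metabolite: 5 replicates at each time point.
example = {
    "T0": [10.0, 12.0, 11.0, 9.0, 8.0],
    "T5": [5.0, 6.0, 5.5, 4.5, 4.0],
    "T30": [2.0, 3.0, 2.5, 2.5, 2.5],
}
fc = fold_changes(example)
print(fc)
```

      In this made-up example the metabolite falls to half of its T0 level at 5 min and to a quarter at 30 min, which is the kind of decrease a fold change below 1 would represent.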

      We therefore revised the metabolomics data as follows to improve clarity. The changes in the glycerophospholipids and sphingolipids determined on the Metabolon CLP platform (specific for lipids) are now shown as single metabolites clustered at the levels of species and pathways and arranged for the 5- and 30-min time points sequentially on the same heatmap as requested (Fig. 2B). This allows for a quick visual overview of the data that clearly shows the decrease in the lipid species following P4 treatment in the control oocytes and not in the mPR-KD or ABHD2-KD cells (Fig. 2B). The individual species are listed in Supplemental Tables 1 and 2. We also revised the Supplemental Tables to include the values for the non-significant changes, which were omitted from the previous submission.

      We revised the metabolomics data from the HD4 platform in a similar fashion, but because the lipid data were complementary to and less extensive than those from the CLP platform, we moved that heatmap to Supplemental Fig. 2B.

      For the single oocyte metabolomics, we now show the data as the correlation between FC and p value, which clearly shows the upregulated (including LPA) and downregulated metabolites at T30 relative to T0 (Fig. 2C). The raw data are now shown in a new Supplemental Table 7.

      (3) The reticulocyte lysate co-expression data are quite important and are both intriguing and puzzling. My impression had been that to express functional membrane proteins, one needed to add some membrane source, like microsomes, to the standard kits. Yet it seems like co-expression of mPR and ABHD2 proteins in a standard kit is sufficient to yield progesterone-regulated PLA2 activity. I could be wrong here - I'm not a protein expression expert - but I was surprised by this result, and I think it is critical that the authors make absolutely certain that it is correct. Do you get much greater activities if microsomes are added? Are the specific activities of the putative mPR-ABHD2 complexes reasonable?

      We thank the Reviewer for this insightful comment. We agree that this is a critical result that would benefit from cross validation, especially given the low level of PLA2 activity detected in the reticulocyte lysate expression system. We have therefore expanded these studies using another in vitro expression system with microsomal membranes based on tobacco extracts (ALiCE® Cell-Free Protein Synthesis System, Sigma Aldrich) to enhance production and stability of the expressed receptors, as suggested by the Reviewer. We further prepared virus-like particles (VLPs) from cells expressing each receptor individually or both receptors together. However, we could not detect any PLA2 activity from the VLPs. We thus focused on the coupled in vitro transcription/translation tobacco extracts that allow the expression of difficult-to-produce membrane proteins in microsomes. This kit targets membrane proteins directly to microsomes using a microsome-targeting melittin signal peptide. This system took significant time and effort to troubleshoot and adapt to mPR and ABHD2 expression. However, we were ultimately able to produce significantly higher amounts of both ABHD2 and mPRb, which were readily detected by WBs (Supplemental Fig. 4I). In contrast, we could not reliably detect mPR or ABHD2 using WBs from reticulocyte lysates given the limited amounts produced.

      Similarly to our previous findings with proteins produced in reticulocyte lysates, expression of ABHD2 or mPRβ alone was not associated with an increase in PLA2 activity over a two-hour incubation period (Fig. 5C). It is worth noting here that the tobacco lysates had high endogenous PLA2 activity. However, co-expression of both mPRb and ABHD2 produced robust PLA2 activity that was significantly higher than that detected in the reticulocyte lysate system (Fig. 5C). Surprisingly, however, this PLA2 activity was P4 independent, as it was observed when both receptors were co-expressed in the absence of P4.

      These results validate our earlier conclusion that PLA2 activity requires both mPR and ABHD2, so their interaction is needed for enzymatic activity. It is interesting, however, that in the tobacco expression system this mPR-ABHD2 PLA2 activity becomes for the most part P4 independent. As the tobacco expression system forces both ABHD2 and mPR into microsomes using a signal sequence, the two receptors are enriched in the same vesicular compartment. As they can interact independently of P4, as shown in the co-IP experiments in immature oocytes (Fig. 5D), their forced co-expression in the same microsomal compartment could lead to their association and thus PLA2 activity. This is an attractive possibility that fits the current data, but it would need independent validation.

      Reviewer 3:

      There were concerns with the pharmacological studies presented. Many of these inhibitors are used at high (double-digit micromolar) concentrations that could result in non-specific pharmacological effects and the authors have provided very little data in support of target engagement and selectivity under the multiple experimental paradigms. In addition, the use of an available ABHD2 small molecule inhibitor was lacking in these studies.

      For the inhibitors used, we performed a full dose response to define the active concentrations. So, inhibitors were not used at one high dose. We then compared the EC50 for each active inhibitor to the reported EC50 in the literature (Table 1). The inhibitors were deemed effective only if they inhibited oocyte maturation within the range reported in the literature. We applied this criterion despite the fact that frog oocytes are notorious for requiring higher concentrations of drugs given their high lipophilic yolk content, which acts as a sponge for drugs. So our criteria for an effective inhibitor are rather stringent.

      Based on these criteria, only 3 inhibitors were ‘effective’ in inhibiting oocyte maturation: Ibuprofen, ACA and MP-A08, with IC50s relative to those reported in the literature of 0.7, 1.1, and 1.6, respectively. Ibuprofen targets Cox enzymes, which produce prostaglandins. We independently confirmed an increase in PGs in response to P4 in oocytes, thus validating the drug inhibitory effect. ACA blocks PLA2 and inhibits maturation, a role supported by the metabolomics analyses that show a decrease in the PE/PE/LPE/LPC species, and by the ABHD2-mPR PLA2 activity following in vitro expression. Finally, MP-A08 blocks sphingosine kinase activity, a role supported by the metabolomics showing a decrease in sphingosine levels in response to P4, and by our functional studies validating a role for the S1P receptor 3 in oocyte maturation.

      As pointed out by the Reviewer, other inhibitors did block maturation at very high concentrations, but we do not consider these as effective and have not implicated the blocked enzymes in the early steps of oocyte maturation. To clarify this point, we edited the summary panel (now Fig. 2D) to simplify it and highlight the inhibitors with an effect in the reported range in red and those that don’t inhibit based on the above criteria in grey. Those with intermediate effects are shown in pink. We hope these edits clarify the inhibitor studies.
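
      The criterion described above (comparing the IC50 observed in oocytes with the IC50 reported in the literature, then sorting inhibitors into effective, intermediate, and ineffective groups) can be sketched as follows. This is a minimal illustration only: the function names and the two cutoff values are assumptions introduced here, since the response does not state numerical thresholds for the red, pink, and grey categories.

```python
def relative_ic50(observed_ic50, reported_ic50):
    """Return the observed IC50 divided by the literature IC50 (both in the same units)."""
    if reported_ic50 <= 0:
        raise ValueError("reported IC50 must be positive")
    return observed_ic50 / reported_ic50


def classify_inhibitor(ratio, effective_max=2.0, intermediate_max=10.0):
    """Sort an inhibitor by its relative IC50.

    effective_max and intermediate_max are hypothetical cutoffs chosen for
    illustration; they are not the thresholds used by the authors.
    """
    if ratio <= effective_max:
        return "effective"
    if ratio <= intermediate_max:
        return "intermediate"
    return "ineffective"


# Relative IC50 values quoted in the response for the three effective inhibitors.
quoted_ratios = {"Ibuprofen": 0.7, "ACA": 1.1, "MP-A08": 1.6}
labels = {name: classify_inhibitor(r) for name, r in quoted_ratios.items()}
print(labels)
```

      With these illustrative cutoffs, the three quoted ratios all fall in the effective group, whereas a drug that only blocks maturation at many times its published IC50 would be labelled ineffective.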

      Recommendations For the Authors

      Reviewer 2:

      (1) Introduction, para 1. Please change "mPRs mediated" to "mPR-mediated".

      Done

      (2) Introduction, para 2. Please change "cyclin b" to "cyclin B".

      Done

      (3) Introduction, para 2. Please change "that serves" to "which serves".

      Done

      (4) Introduction, para 4. I know that the authors have published evidence that "a global decrease in cAMP levels is not detectable" (2016), but old work from Maller and Krebs (JBC 1979) did see an early, transient decrease after P4 treatment, and subsequent work from Maller said that there was both a decrease in adenylyl cyclase activity and an increase in cAMP activity. Perhaps it would be better to say something like "early work showed a transitory drop in cAMP activity within 1 min of P4 treatment (Maller), although later studies failed to detect this drop and showed that P4-dependent maturation proceeds even when cAMP is high (25)".

      We agree and thank the Reviewer for this recommendation. The text was revised accordingly.

      (5) Results, para 1. Based on the results in Fig 1B, one should probably not assert that ABHD2 is expressed "at levels similar to those of mPRβ in the oocyte"-with different mRNAs and different PCR primers, it's hard to say whether they are similar or not. The RNAseq data from Xenbase in Supp Fig 1 supports the idea that the ABHD2 and mPRβ mRNAs are expressed at similar levels at the message level, although of course mRNA levels and protein levels do not correlate well when different gene products are compared (Wuhr's 2014 Curr Biol paper reported correlation coefficients of about 0.3).

      We agree and have changed the text as follows to refer specifically to RNA: “we confirmed that ABHD2 RNA is expressed in the oocyte at levels similar to those of mPRβ RNA (Fig. 1B).”

      (6) Results, para 2. It would be worth pointing out that since an 18 h incubation with microinjected antisense oligos was sufficient to substantially knock down both the ABHD2 mRNAs (Fig 1C) and the ectopically-expressed proteins (Fig 1D), the mRNA and protein half-lives must be fairly short, on the order of a few hours or less.

      Done

      (7) Figure 1. Please make the western blots (especially Fig 1D) and their labeling larger. These are key results and as it stands the labeling is virtually unreadable on printed copies of the figures. I'm not sure about eLife's policy, but many journals want the text in figures to be no smaller than 5-7 points at 100% size.

      Likewise for many of the western blots in subsequent figures.

      As requested by the Reviewer we have increased the font and size of all Western blots in the Figures.

      (8) Figure 1E, G. I am not sure one should compare the effectiveness of the ABHD2 rescue (Fig 1E) and the mPRβ rescue (Fig 1G). Even if these were oocytes from the same frog, we do not know how the levels of the overexpressed ABHD2 and mPRβ proteins compare. E.g. maybe ABHD2 was highly overexpressed and mPRβ was overexpressed by a tiny amount.

      Although this is a possibility, the expression levels of the proteins here are not of much concern because we previously showed that mPRβ expression effectively rescues mPRβ antisense knockdown, which inhibits maturation (please see (Nader et al., 2020)). This argues that at the levels of mRNA injected mPR is functional to support maturation, yet it does not rescue ABHD2 knockdown to the same levels (Fig. 1G). With that, it is fair to argue that mPRβ is not as effective at rescuing ABHD2 KD maturation.

      (9) Inhibitor studies: There are two likely problems in comparing the observed potencies with legacy data - in vitro vs in vivo data and frog vs. mammalian data. Please make it clear what is being compared to what when you are comparing legacy data.

      The legacy data are from the literature based on the early studies that defined the IC50 for inhibition primarily using in vivo models (cell line mostly) but not oocytes. Typically, frog oocytes require significantly higher concentrations of inhibitors to mediate their effect because of the high lipophilic yolk content which acts as a sponge for some drugs. So, the fact that the drugs that are effective in inhibiting oocyte maturation (ACA, MP-A08, and Ibuprofen) work in a similar or lower concentration range to the published IC50 gives us confidence as to the specificity of their effect. We have revised Table 1 to include the reference for each IC50 value from the literature to allow the reader to judge the exact model and context used.

      (10) Isn't it surprising that Gas seems to promote maturation, given the Maller data (and data from others) that cAMP and PKA oppose maturation (see also the authors' own Fig 1A) and the authors' previous data sees no positive effect (minor point 7 above)?

      We show that a specific Gas inhibitor NF-449 inhibits maturation (although at relatively high concentrations), which is consistent with a positive role for Gas in oocyte maturation. We argue based on the lipidomics data and the inhibitors data that GPCRs play a modulatory role and not a central early signaling role in terms of releasing oocyte meiotic arrest. They are likely to have effects on the full maturation of the egg in preparation for embryonic development. The actions of the multiple lipid messengers generated downstream of mPRβ activation are likely to act through GPCRs and could signal through Gas or other Ga or even through Gβγ. Minor point 7 refers to the size of Western blots.

      (11) Page 9, bottom: "...one would predict activation of sphingosine kinases...." Couldn't it just be the activity of some constitutively active sphingosine kinase? Maybe replace "activation" with "activity".

      A constitutively active sphingosine kinase would not make sense, as the kinase needs to be activated by P4.

      (12) Sometimes the authors refer to concentrations in molar units plus a power of 10 (e.g. 10-5 M) and sometime in µM or nM, sometimes even within the same paragraph. This makes it unnecessarily difficult to compare. Please keep consistent.

      We converted all the concentrations throughout the text to molar units (M) in scientific notation for consistency, as requested by the Reviewer.

      (13) Fig 3I: "Sphingosine kinase" is misspelled.

      This has been corrected. We thank the Reviewer for catching it.

      (14) Legend to Fig. 5: Please change "after P4 treatment in reticulocytes" to "after P4 treatment in reticulocyte lysates".

      Done

      (15) Fig 6J. Doesn't the MAPK cascade inhibit MYT1? I.e. shouldn't the arrow be -| rather than ->?

      Yes, the Reviewer is correct. This has been changed. We thank the Reviewer for noticing this error.

      (16) Materials and Methods, second paragraph. Please change "inhibitor's studies" to "inhibitor studies".

      Corrected, thanks.

      (17) Table 1: Please be consistent in how you write Cox-2.

      Done.

      Reviewer #3:

      The findings are of potential broad interest, but I have some concerns with the pharmacological studies presented. Many of these inhibitors are used at high (double-digit micromolar) concentrations that could result in non-specific pharmacological effects and the authors have provided very little data in support of target engagement and selectivity under the multiple experimental paradigms. Importantly, several claims regarding lipid metabolism signaling in the context of oocyte maturation are made without critical validation that the intended target is inactivated with reasonable selectivity across the proteome. Several of the inhibitors used for pharmacology and metabolomics are known covalent inhibitors (JZL184 and MJN110) that can readily bind additional lipases depending on the treatment time and concentration.

      I did not find any data using the reported ABHD2 inhibitor (compound 183; PMID: 31525885). Is there a reason not to include this compound to complement the knockdown studies? I believe this is an important control given that not all lipid effects were reversed with ABHD2 knockdown. The proper target engagement and selectivity studies should be performed with this ABHD2 inhibitor.

      We obtained aliquots of the reported ABHD2 inhibitor compound 183 from Dr. Van Der Stelt and tested its effect on oocyte maturation at 10<sup>-4</sup> M using either a low (10<sup>-7</sup> M) or a high (10<sup>-5</sup> M) P4 concentration. Compound 183 partially inhibited P4-mediated oocyte maturation. The new data were added to the manuscript as Supplemental Figure 3D.

      Additional comments:

      (1) Pristimerin was tested at low P4 concentration for effects on oocyte maturation. Authors should also test JZL184 and MJN110 under this experimental paradigm.

      We have tested the effect of a high concentration (2 × 10<sup>-5</sup> M) of JZL184 or MJN110 on oocyte maturation at low P4 concentration (Author response image 3). MJN110 did not have a prominent effect on oocyte maturation at low P4, whereas JZL184 inhibited maturation by 50%. However, this inhibition of maturation required concentrations of JZL184 that are 10 times higher than those reported in rat and human cells (Cui et al., 2016; Smith et al., 2015), arguing against an important role for a monoacylglycerol enzymatic activity in inducing oocyte maturation.

      Author response image 3.

      The effect of MJN110 and JZL184 compounds on oocyte maturation at low P4 concentration. Oocytes were pre-treated for 2 hours with the vehicle or with the highest concentration (2 × 10<sup>-5</sup> M) of either JZL184 or MJN110, followed by overnight treatment with P4 at 10<sup>-7</sup> M. Oocyte maturation was measured as % GVBD normalized to control oocytes (treated with vehicle) (mean + SEM; n = 2 independent female frogs for each compound).

      (2) Figure 4A showed different Ct values of ODC between oocytes and spleen; please explain them in the text. There is no description of the spleen data in Figure 4A; please make it clear in the text.

      We thank the Reviewer for this recommendation. The text was revised accordingly.

      (3) For Figures 3A, E, and I, there are different concentration settings for comparing the activity, is it possible to get the curves based on the same set of concentrations? The concentration gradient didn't include higher concentration points in these figures, thus the related values are incorrect. Please set more concentration points to improve the figures. And for the error bar, there are different display formats like Figure 4c and 4d, etc. Please uniform the format for all the figures. Additionally, for the ctrl. or veh., please add an error bar for all figures.

      Some of the drugs tested were toxic to oocytes at high concentrations so the dose response was adjusted accordingly. The graphs were plotted to encompass the entire tested dose response. We could have plotted the data on the same x-axis range but that would make the figures uneven and awkward.

      We are not clear what the Reviewer means by “The concentration gradient didn't include higher concentration points in these figures, thus the related values are incorrect.”

      The error bars for all dose responses are consistent throughout all the Figures. They are different from those on bar graphs to improve clarity. If the Reviewer wishes to have the error bars on the bar graphs and dose response the same, we are happy to do so. 

      For the inhibitor studies the data were normalized on a per frog basis to control for variability in the maturation rate in response to P4, which varies from frog to frog. It is thus not possible to add error bars for the controls.
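      The per-frog normalization described above can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the frog labels, condition names, and % GVBD values are hypothetical placeholders. It shows why the vehicle control equals 100% for every frog by construction and therefore carries no variance.

```python
from statistics import pstdev

# Hypothetical raw % GVBD values per frog; NOT data from the study.
raw_gvbd = {
    "frog_1": {"vehicle": 80.0, "inhibitor": 40.0},
    "frog_2": {"vehicle": 60.0, "inhibitor": 15.0},
}


def normalize_per_frog(raw):
    """Express each condition as a percentage of that frog's own vehicle control."""
    normalized = {}
    for frog, conditions in raw.items():
        control = conditions["vehicle"]
        normalized[frog] = {
            name: 100.0 * value / control for name, value in conditions.items()
        }
    return normalized


normalized = normalize_per_frog(raw_gvbd)
control_values = [frog["vehicle"] for frog in normalized.values()]
inhibitor_values = [frog["inhibitor"] for frog in normalized.values()]

# The control is 100% for every frog, so its spread is zero by construction.
control_spread = pstdev(control_values)
```

      After this step only the treated conditions retain frog-to-frog variability, which is why an error bar on the control would carry no information.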

      (4) Please check the sentence "However, the concentration of HA130...... higher that......"; Change "IC50" to "IC<sub>50</sub>" in the text and tables. Table 1 lists IC<sub>50</sub> values in the literature, but the references are not cited. Please include the references properly. For the IC<sub>50</sub> value obtained in the research, please include the standard deviation in the table. For reference parts, Ref 1, 27, 32, 46, doublecheck the title format.

      We edited the sentence as follows to make it clearer: “However, this inhibition of maturation required high concentrations of HA130 -at least 3 orders of magnitude higher than the reported HA130 IC<sub>50</sub>-…”

      We changed IC50 to IC<sub>50</sub> (subscript) in Table 1.

      We added the relevant references in Table 1 to provide context for the cited IC<sub>50</sub> values for the different inhibitors used.

      We added SEM to the IC<sub>50</sub> for inhibition of oocyte maturation values in Table 1.

      We checked the titles on the mentioned references and cannot identify any problems.

      References

      Cui, Y., Prokin, I., Xu, H., Delord, B., Genet, S., Venance, L., and Berry, H. (2016). Endocannabinoid dynamics gate spike-timing dependent depression and potentiation. eLife 5, e13185.

      Nader, N., Dib, M., Hodeify, R., Courjaret, R., Elmi, A., Hammad, A.S., Dey, R., Huang, X.Y., and Machaca, K. (2020). Membrane progesterone receptor induces meiosis in Xenopus oocytes through endocytosis into signaling endosomes and interaction with APPL1 and Akt2. PLoS Biol 18, e3000901.

      Smith, M., Wilson, R., O'Brien, S., Tufarelli, C., Anderson, S.I., and O'Sullivan, S.E. (2015). The Effects of the Endocannabinoids Anandamide and 2-Arachidonoylglycerol on Human Osteoblast Proliferation and Differentiation. PLoS One 10, e0136546.

    1. agree in emphasizing that in Germany and Switzerland vocational training enjoys more prestige than in Spain

      formación profesional = vocational training. Is the same true in your country?

    2. But Spaniards who live abroad and know the day-to-day reality of other countries maintain that the catastrophist and defeatist image held at home is exaggerated, that many of the stigmas we pin on ourselves are not true, and that the attitude of generalized complaint and a certain feeling of inferiority hinder progress in solving the concrete problems of Spanish society, because in other countries they do more with less.

      In these lines, pay special attention to the use of sophisticated language such as "imagen catastrofista y derrotista" (catastrophist and defeatist image) and "los sambenitos que nos atribuimos no son ciertos" (the stigmas we pin on ourselves are not true), or phrases such as "la actitud de queja generalizada" (the attitude of generalized complaint) and "un sentimiento de inferioridad" (a feeling of inferiority).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Weaknesses: 

      - Only one mutant (YafK) is used to make the conclusion. 

      The aim of the study is to determine the effect of the hydrolysis of the PG→Lpp bond on the dynamics of the tethering of Lpp to PG. Since YafK is the only enzyme catalyzing this reaction, it is appropriate to compare the wild-type strain to an isogenic yafK deletion mutant. Nonetheless, we carefully consider this comment and will investigate the dynamics of the tethering of Lpp to PG in mutants deficient in the production of the L,D-transpeptidases responsible for tethering Lpp to PG.

      Additional kinetic analyses were performed on strains relying on a single L,D-transpeptidase for Lpp tethering to PG. Escherichia coli produces three L,D-transpeptidases catalyzing the tethering of Lpp to PG (YbiS, YcfS, and ErfK). The corresponding genes were deleted from the chromosome of strain BW25113, thus generating strain BW25113Δ3. Plasmids encoding each one of these three enzymes were independently introduced in BW25113Δ3. Qualitatively, LC-MS analyses revealed similar kinetics for the four Tri-KR isotopologues purified from wild-type strain BW25113 and from the three BW25113Δ3 derivatives producing a single plasmid-encoded L,D-transpeptidase (YbiS, YcfS, or ErfK) under the control of a rhamnose-inducible promoter (Prha) of plasmid pHV30 (Voedts et al. EMBO J. 2021 40:e108126, doi: 10.15252/embj.2021108126) (see panel A in Author response image 1 below). Briefly, and as indicated in the first version of the main text, the old→new Tri→KR isotopologue was synthesized first. The new→new isotopologue was not detected 5 min after the medium switch. These results indicate that the newly synthesized PG disaccharide-peptide subunits and Lpp are independently incorporated into the expanding PG polymer. The proportion of the new→old isotopologue exceeded that of the old→new isotopologue at around 40 min (for the strain producing ErfK) or 20 min (for the strains producing YbiS or YcfS). This is the hallmark of the activity of the YafK hydrolase, which liberates existing (old) Lpp that can then be tethered to a newly synthesized disaccharide-peptide subunit, thereby generating the new→old isotopologue. In the absence of the YafK hydrolase, the relative proportion of the new→old isotopologue is lower since this isotopologue can only result from the tethering of the preexisting free forms of Lpp to newly synthesized disaccharide-peptide units.

      The contribution of YafK to variations in the relative abundance of the four isotopologues was also investigated by combining the relative abundance of isotopologues containing either old versus new KR (panel B) or old versus new PG stem peptide (panel C) moieties. As discussed in the first version of the manuscript for strains BW25113 and BW25113ΔyafK, this analysis revealed that the existing (old) disaccharide-tripeptide moieties in the Tri→KR isotopologues disappear more rapidly than the existing (old) KR moieties due to the hydrolysis of the old→old Tri-KR isotopologue by YafK. These results indicate that the mode of tethering of Lpp to PG and the dynamic equilibrium between the PG-tethered and free forms of Lpp are similar for the YbiS, YcfS, and ErfK L,D-transpeptidases. Quantitatively, we also noticed that the overall decrease in the relative abundance of all Tri→KR isotopologues containing existing (old) moieties was slower for the strains producing only ErfK, YbiS, or YcfS than for the wild type and ΔyafK strains. This could be accounted for by an increase in the generation time of the former group of three strains. This is a limitation of our study because it precludes the comparison of the evolution of a particular isotopologue in several strains, as performed in Fig. 3 for strains BW25113 and BW25113ΔyafK. For this reason, we prefer to present these data in the rebuttal rather than in the manuscript. Indeed, presentation of the data in the main text would require introducing a new mode of presentation of the data (variations in the relative abundance of all four isotopologues in the same strain; see figure below) in addition to variations of the relative abundance of any one of the four isotopologues between strains (Fig. 3). Introducing this additional mode of presentation would complicate the initial manuscript unnecessarily because the data obtained with mutants producing a single L,D-transpeptidase (ErfK, YbiS, or YcfS) confirmed the data obtained with the wild-type strains producing the three L,D-transpeptidases.

      Author response image 1.

      MS-based kinetic analysis of Lpp tethering to PG.

      -Time points to analyse Tri-KR isotopologues in Wt (0,10,20,40,60 min) and yafK mutant (0,15, 25, 40, 60 min) are not the same. 

      The purpose of the experiments is to compare the kinetics of formation and hydrolysis of the PG→Lpp bond in the WT versus ΔyafK strains. Comparison of the kinetics is therefore possible even though the kinetics are not based on the exact same time points. Nonetheless, we will reproduce the kinetics experiment (see also answers to Reviewer 2) and use the same time points in these additional experiments.

      We have performed additional analyses to provide kinetic data for at least three biological repeats and for the same periods of incubation after the medium switch (0, 10, 20, 40, and 60 min). The full set of data, including means and standard deviations, appears in the additional Table S1. We have also updated Fig. 3 with the means calculated with these additional values. The conclusions of the first version of the manuscript are fully supported by the additional data requested by the reviewer. We have also revised Fig. 4 based on the full set of data appearing in Table S2.

      Reviewer #2 (Public Review): 

      Weaknesses: 

      - However, the authors make a few other conclusions from their data which are harder to understand the logic of, or to feel confident in based on the existing data. They claim that their 5-time point kinetic data indicates that new lpp is not substantially added to lipidII before it is added to the peptidoglycan, and that instead lpp is attached primarily to old peptidoglycan. I believe that this conclusion comes from the comparison of Fig.s 3A and 3C, where it appears that new lpp is added to old peptidoglycan a few minutes before new lpp is added to new peptidoglycan. However, the very small difference in the timing of this result, the minimal number of time points and the complete lack of any presentation of calculated error in any of the data make this conclusion very tenuous. In addition, the authors conclude that lpp is not significantly attached to septal peptidoglycan. The logic behind this conclusion appears to be based on the same data, but the authors do not provide a quantitative model to support this idea.  

      The reviewer is correct in stating that we claim that Lpp is not substantially added to lipid II before incorporation of the disaccharide-pentapeptide subunit into the expanding PG network. This conclusion is based on the paucity of PG-Lpp covalent adducts containing light PG and Lpp moieties at the earliest time points. To substantiate this finding more thoroughly, we will reproduce the kinetic experiments with more early time points. The paucity of the new→new PG-Lpp isotopologues also implies that Lpp might not be extensively tethered to septal peptidoglycan since the latter is assembled from newly synthesized PG (see our previous publication Atze et al. 2021 and references therein). Quantitatively, septal synthesis roughly accounts for one third of the total PG synthesis. It is therefore expected that tethering of Lpp to septal PG would represent one third of the total number of newly synthesized Lpp molecules tethered to PG. We therefore proposed that the paucity of new→new PG-Lpp isotopologues at early time points of the kinetics implies that Lpp is preferentially tethered to the side wall. This is only one of several conclusions that we reach in the present study and we were very careful in the wording of our results.

      We would first like to stress that our claim that Lpp is primarily attached to old peptidoglycan rather than to lipid II is indeed supported by the results presented in the first version of the manuscript. In fact, the opposite mechanism, i.e. Lpp linking to Lipid II, as established for the linking of proteins to PG by sortases in Gram-positive bacteria, would result in the exclusive tethering of newly synthesized Lpp to newly synthesized PG stems (Fig. 3). This is clearly not the case since the new→new isotopologues are present in small amounts 10 min after the medium switch and are not detectable at 5 min (data appearing in Table S1 and new mass spectra added to Supplementary file 1). Instead, our data indicate that newly synthesized Lpp is tethered to existing PG. Thus, the relevant comparison is not the absolute value of the delay in the appearance of isotopologues in Figs 3A and 3C, as suggested by the reviewer. Rather, the relevant comparison should take into consideration the following two modes of Lpp tethering to PG: (i) tethering of Lpp to Lipid II versus (ii) tethering of Lpp to existing PG independently from insertion of new subunits into the expanding PG. The former mode implies the exclusive formation of new→new isotopologues, which were not detected at early time points. The latter mode implies the prevalent formation of old→new isotopologues, which were indeed preponderant at early time points. Thus, our analysis clearly eliminates the first mode of Lpp tethering to PG (tethering of Lpp to Lipid II) and validates the second one (tethering of Lpp to existing PG). As stated in our answers to reviewer 1, we have generated additional repeats and the full set of data, including means and SD values, appears in the additional Supplementary Tables S1 and S2.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      -All major reactions catalysed by L,D-transpeptidases must be studied using the labeling-mass spec technique and compared with YafK to strengthen the conclusions. 

      As described above (Author response image 1), we explored the dynamics of Lpp tethering in mutants producing a single L,D-transpeptidase.

      -Experiments on the effect of YafK on the bacterial envelope and production of vesicles should be concluded to support the claims. 

      We have analyzed the extent of outer membrane vesicle (OMV) formation both in the wild type strain and in each one of the mutant strains characterized in this study by using a procedure described in detail in one of our previous publications (Hugonneau-Beaufet et al. Microbiol Spectr. 2023 11:e0521722, doi: 10.1128/spectrum.05217-22). Author response image 2 below shows that loss of Lpp or of its tethering to PG, following deletion of genes encoding L,D-transpeptidases ErfK, YbiS, and YcfS, results in the formation of OMVs as revealed by the presence of the maltose-binding protein (MBP, 42 kDa) in the corresponding spent culture medium (as detected by immunoblotting). The RNA polymerase subunit RpoA (36 kDa), used as a control, was not detected in these spent culture media, indicating that loss of either Lpp alone or of ErfK, YbiS, and YcfS together was not associated with bacterial lysis. This analysis also showed that production of ErfK, YbiS, or YcfS alone was sufficient to prevent formation of OMVs. Finally, deletion of YafK, as expected, did not lead to OMV formation. These confirmatory results are outside the scope of the manuscript, which focuses on the dynamics of Lpp tethering to PG rather than on the role of that tethering in envelope stability.

      Author response image 2.

      Figure 2. Immuno-detection of OMV formation.

      Reviewer #2 (Recommendations For The Authors): 

      - Why so much background about previous results in the abstract? Previous results don't seem required for understanding the description of new results here. Maybe put a sentence about importance at the end, instead.

      The background information is important for two reasons. First, it is important to stress that the method used to determine the structure and dynamics of the isotopologues is novel and has been validated in various ways, including the modeling of isotopic clusters, in a previous study (https://doi.org/10.7554/eLife.72863). Since the current study is an extension of this previous report, it is relevant to introduce the type of information that can be obtained by this approach. Second, it is also important to stress that kinetic analyses have been previously reported for the incorporation of disaccharide-peptide units into the expanding peptidoglycan (https://doi.org/10.7554/eLife.72863). In the current study, we focused on the mode of Lpp-to-PG tethering in the context of PG expansion, which thus had to be introduced.

      - Abstract: tethering of lpp to septal pg is limited by what? Limited to what? Wording not clear.

      The unclear sentence has been rephrased. Revised version “Newly synthesized septum PG appears to contain small amounts of tethered Lpp.”  

      - The figure legend for fig 1b - I only see one red double arrow?

      Black double arrows indicate the position of glycosidic bonds cleaved by the muramidases. Their size was increased so that they appear more distinctly in the image.

      - Fig 3 and Fig 4- these should be shown with error. 

      The full set of data with means and standard deviations appears in Supplementary Tables S1 and S2.

      - This new-> old, old-> new annotation is confusing. Is the PG fragment or the lpp old or new? Are you distinguishing between which part is old and new by the ordering? Or, could either the PG fragment or the lpp be old to be annotated as old-> new? I think you are trying to explain it in the figure 3CD legend, but it could be presented more clearly. When you say respectively, do you mean that old->new means old muropeptide, new lpp? And new-> old means new muropeptide and old lpp? Why not just use the same annotation system you use in fig 2? Or, use subscripts to indicate old and new?. 

      The designation of isotopologues is correct and adequate to designate the products of transpeptidation catalyzed both by PBPs and L,D-transpeptidases. This nomenclature of transpeptidation products was introduced in the 1970s (see Schleifer and Kandler 1972 Bacteriological Reviews 36:407-477). In this bond designation, the acyl donor and the acyl acceptor appear left and right, respectively, separated by an arrow to indicate the CO-to-NH polarity of the amide bond. For the Tri→KR isotopologues, the peptide stem acts as the acyl donor whereas Lpp acts as the acyl acceptor. There is therefore no ambiguity in the annotation. This also applies to the old→new-type annotation: old→new designates an old (existing) PG stem linked to new (neosynthesized) Lpp. In the figures, we used a color code to identify old (red) and new (purple) in the Tri→KR moieties. Since a color code cannot be used in the main text, we used the old→new-type of annotation. A sentence has been added at the end of the legend to Fig. 1b to introduce this nomenclature: “Please note that we used the standard nomenclature for transpeptidation products in which the acyl donor and the acyl acceptor appear left and right, respectively, separated by an arrow to indicate the CO-to-NH polarity of the amide bond”.

      - Pg 5 - first paragraph. I'm struggling with the logic of your conclusion that lpp is not attached to lipid II - it seems that this conclusion is based on the timing of the appearance of the hybrid isotopes. You say you would expect the new-new ones to appear quickly, but how quickly would you expect that, and why? You do see new-new ones appearing fairly quickly, in 20 minutes, so I don't understand the logic of why that timing excludes the lipidII modification model. Please elaborate further. 

      See answer above to reviewer 2 and analysis of samples collected shortly after the medium switch (Table S1). See also the revised version of Supplementary file 1 that shows mass spectra for peptidoglycan extracted 5 min after the medium switch.

      - The conclusion about tethering of lpp to septal PG also appears to be somewhat tenuous, which the authors concede when then use the word "might" in the section of the results. However, the language in the abstract is more definitive. Please tone down the language in the abstract, or provide more evidence to support this conclusion. At the least, you could add a little discussion of the numbers. At a given time in mixed culture, how much PG is being constructed at the septum? How does that percentage line up with the rate of PG label loss vs the rate of lpp label loss? 

      -  Pg 5, bottom paragraph. I don't know what you mean by "there was no loss of old->old in the ∆yafK strains, " when you just a sentence above described the decrease. 

      The data of the MS analyses are presented as the relative abundance of isotopologues. If the old→old Tri→KR isotopologue present at the medium shift were not hydrolyzed by YafK, its absolute amount would remain constant over time. However, the relative abundance of the old→old isotopologue decreases by 50% in one generation because the total amount of the Tri→KR muropeptide doubles in one generation (as does any bacterial constituent). In Fig. 3B, we indeed observed that the relative amount of old→old isotopologue is about 50% after one generation in the ΔyafK mutant, indicating the persistence of the isotopologue. In contrast, production of YafK in the strain BW25113 results in lower abundance of this isotopologue (on the order of 90%). 

      To make this concept more explicit, we expanded the reasoning in the relevant paragraph of the revised version of the manuscript. 
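      The dilution argument above can be written as a short calculation. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the optional hydrolysis parameter is a deliberately crude stand-in (a fixed fraction of old→old removed per generation), not a kinetic model of YafK activity. With no hydrolysis, the absolute amount of old→old stays constant while the total Tri→KR pool doubles each generation, so the relative abundance halves.

```python
def relative_abundance_old_old(generations, hydrolyzed_fraction_per_generation=0.0):
    """Relative abundance of the old->old isotopologue after growth.

    Assumptions (illustrative only):
    - the total Tri->KR pool doubles every generation;
    - a fixed fraction of the remaining old->old isotopologue is hydrolyzed
      per generation (0.0 mimics a strain lacking the hydrolase).
    The value is relative to an initial old->old abundance of 1.0 (100%).
    """
    remaining_old_old = (1.0 - hydrolyzed_fraction_per_generation) ** generations
    total_pool = 2.0 ** generations
    return remaining_old_old / total_pool


# No hydrolysis: dilution by growth alone.
after_one_generation = relative_abundance_old_old(1)
after_two_generations = relative_abundance_old_old(2)

# Any hydrolysis pushes the relative abundance below the dilution-only value.
with_hydrolysis = relative_abundance_old_old(1, hydrolyzed_fraction_per_generation=0.8)
```

      A relative abundance close to 50% after one generation is therefore the signature of persistence, and values below that line indicate loss of the old→old bond.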

      - Pg 6 - I don't understand how you are drawing a conclusion about the proteolytic degradation of lpp from these data. Please clarify your reasoning.

      In the analysis presented in Fig. 4, we investigated the relative abundance of old and new Lpp based on the relative abundance of old and new KR moieties in all four Tri-KR isotopologues. As stated in the preceding answer, the relative abundance of KR moieties should be 50% after one generation if no degradation of Lpp occurs. This is observed both for BW25113 (Fig. 4A) and for the ΔyafK mutant (Fig. 4B), thus supporting our claim that Lpp is not degraded. In contrast, the relative abundance of the old Tri moiety is lower than 50% for the wild type strain (Fig. 4C) but not for the ΔyafK mutant (Fig. 4D). This reflects the fact that YafK hydrolyzes the PG-Lpp bond and that Lpp released by this reaction can be cross-linked to neo-synthesized PG stems. Please note that, in this reaction, the substrate is a tetrapeptide donor stem (Fig. 1C).
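      The way the four isotopologues are combined into old-versus-new moiety fractions can be sketched as follows. The abundances are hypothetical placeholders chosen only to illustrate the pattern described for the wild-type strain (old KR at 50%, old Tri below 50%); they are not values from Table S2, and the function name is an assumption.

```python
def moiety_fractions(isotopologues):
    """Sum isotopologue abundances by the state of each moiety.

    Keys are (tri_state, kr_state) tuples, following the donor->acceptor
    order of the Tri->KR nomenclature; values are relative abundances.
    """
    total = sum(isotopologues.values())
    old_tri = sum(v for (tri, _kr), v in isotopologues.items() if tri == "old")
    old_kr = sum(v for (_tri, kr), v in isotopologues.items() if kr == "old")
    return {"old_tri": old_tri / total, "old_kr": old_kr / total}


# Hypothetical abundances (%) after one generation; NOT data from the study.
example = {
    ("old", "old"): 30.0,
    ("old", "new"): 10.0,
    ("new", "old"): 20.0,
    ("new", "new"): 40.0,
}

fractions = moiety_fractions(example)
```

      In this illustration the old KR fraction stays at the dilution-only value while the old Tri fraction falls below it, which is the pattern expected when the PG-Lpp bond is hydrolyzed and the released Lpp is re-tethered to new stems without being degraded.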

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      The authors assess the effectiveness of electroporating mRNA into male germ cells to rescue the expression of proteins required for spermatogenesis progression in individuals where these proteins are mutated or depleted. To set up the methodology, they first evaluated the expression of reporter proteins in wild-type mice, which showed expression in germ cells for over two weeks. Then, they attempted to recover fertility in a model of late spermatogenesis arrest that produces immotile sperm. By electroporating the mutated protein, the authors recovered the motility of ~5% of the sperm, although the sperm regenerated was not able to produce offspring using IVF.

      We actually did not write that “sperm regenerated was not able to produce offspring using IVF” but rather that IVF was not attempted because the number of rescued sperm was too low. To address this important point, the ability of sperm to produce embryos was tested with two different assisted reproduction technologies, namely IVF and ICSI. To increase the number of motile sperm for IVF experiments, we injected both testes of one male. We also conducted intracytoplasmic sperm injection (ICSI) experiments, using only rescued sperm, identified as motile sperm with a normal flagellum. The results of these new experiments demonstrate that the rescued ARMC2 sperm successfully fertilized eggs and produced embryos at the two-cell stage by IVF and blastocysts by ICSI. These outcomes are presented in Figure 12.

      This is a comprehensive evaluation of the mRNA methodology with multiple strengths. First, the authors show that naked synthetic RNA, purchased from a commercial source or generated in the laboratory with simple methods, is enough to express exogenous proteins in testicular germ cells. The authors compared RNA to DNA electroporation and found that germ cells are efficiently electroporated with RNA, but not DNA. The differences between these constructs were evaluated using in vivo imaging to track the reporter signal in individual animals through time. To understand how the reporter proteins affect the results of the experiments, the authors used different reporters: two fluorescent (eGFP and mCherry) and one bioluminescent (Luciferase). Although they observed differences among reporters, in every case expression lasted for at least two weeks. 

      The authors used a relevant system to study the therapeutic potential of RNA electroporation. The ARMC2-deficient animals have impaired sperm motility phenotype that affects only the later stages of spermatogenesis. The authors showed that sperm motility was recovered to ~5%, which is remarkable due to the small fraction of germ cells electroporated with RNA with the current protocol. The 3D reconstruction of an electroporated testis using state-of-the-art methods to show the electroporated regions is compelling. 

      The main weakness of the manuscript is that although the authors manage to recover motility in a small fraction of the sperm population, it is unclear whether the increased sperm quality is substantial to improve assisted reproduction outcomes. The quality of the sperm was not systematically evaluated in the manuscript, with the endpoints being sperm morphology and sperm mobility. 

      We would like to thank the reviewers for their comments. As stated above, we performed additional rescue experiments and carried out CASA, morphology observation, IVF and ICSI with the rescued sperm. The rescued ARMC2 sperm exhibited normal morphology (new figure 11 and Supp Fig 8), motility (figure 11), and fecundity (figure 12). Whereas sperm from untreated KO males were unable to fertilize eggs by IVF, the rescued sperm fertilized eggs in vitro at a significant level (mean 62%, n=5), demonstrating that our strategy improves the sperm quality and assisted reproduction outcome (from 0 to 62%). 

      Some key results, such as the 3D reconstruction of the testis and the recovery of sperm motility, are qualitative given the low replicate numbers or the small magnitude of the effects. The presentation of the sperm motility data could have been clearer as well. For example, on day 21 after Armc2-mRNA electroporation, only one animal out of the three tested showed increased sperm motility. However, it is unclear from Figure 11A what the percentage of sperm motility for this animal is since the graph shows a value of >5% and the reported aggregate motility is 4.5%. It would have been helpful to show all individual data points in Figure 11A. 

      Figure 11A now shows a scatter dot plot of the percentage of rescued sperm for each animal. Moreover, we performed additional CASA experiments to analyze sperm motility in detail (Figure 11A2-A3). Individual CASA parameters for motile sperm cells were extracted, as requested by reviewer 3, and are presented in a new graph (Fig 11 A2).

      The expression of the reporter genes is unambiguous; however, better figures could have been presented to show cell type specificity. The DAPI staining is diffused, and it is challenging to understand where the basement membranes of the tubules are. For example, in Figures 7B3 and 7E3, the spermatogonia seems to be in the middle of the seminiferous tubule. The imaging was better for Figure 8. Suboptimal staining appears to lead to mislabeling of some germ cell populations. For example, in Supplementary Figure 4A3, the round spermatid label appears to be labeling spermatocytes. Also, in some instances, the authors seem to be confusing, elongating spermatids with spermatozoa, such as in the case of Supplementary Figures 4D3 and D4.

      Thank you for the comments; some spermatogenic cells were indeed mislabeled, as you mentioned. We have therefore corrected the labeling. We also changed spermatozoa to mature spermatids. The new sentence is now: “At the cellular level, fluorescence was detectable in germ cells (B1-B3) including Spermatogonia (Sg), Spermatocytes (Scytes), round Spermatids (RStids), mature spermatids (m-Sptids) and Sertoli cells (SC)”. Moreover, to indicate the position of the basal membrane, we have also labelled myoid cells.

      The characterization of Armc2 expression could have been improved as well. The authors show a convincing expression of ARMC2 in a few spermatids/sperm using a combination of an anti-ARMC2 antibody and tubules derived from ARMC2 KO animals. At the minimum, one would have liked to see at least one whole tubule of a relevant stage.  

      Thanks for the remark. 

      We now present new images showing transverse sections of seminiferous tubules, as requested (see supp fig 6). In this new figure, it is clear that Armc2 is only expressed in spermatids. We have also added to this figure an analysis of the RNA-seq database produced by Gan's team (Gan, Wen et al. 2013), confirming that Armc2 is predominantly expressed at the elongated spermatid stage. This point is now clearly indicated in the text.

      Overall, the authors show that electroporating mRNA can improve spermatogenesis as demonstrated by the generation of motile sperm in the ARMC2 KO mouse model. 

      Thank you

      Reviewer #2 (Public Review): 

      Summary: 

      Here, the authors inject naked mRNAs and plasmids into the rete testes of mice to express exogenous proteins - GFP and later ARMC2. This approach has been taken before, as noted in the Discussion to rescue Dmc1 KO infertility. While the concept is exciting, multiple concerns reduce reviewer enthusiasm. 

      Strengths: 

      The approach, while not necessarily novel, is timely and interesting.

      Weaknesses:

      Overall, the writing and text can be improved and standardized - as an example, in some places in vivo is italicized, in others it's not; gene names are italicized in some places, others not; some places have spaces between a number and the units, others not. This lack of attention to detail in the preparation of the manuscript is a significant concern to this reviewer - the presentation of the experimental details does cast some reasonable concern with how the experiments might have been done. While this may be unfair, it is all the reviewers have to judge. Multiple typographical and grammatical errors are present, and vague or misleading statements. 

      Thanks for the comment, we have revised the whole manuscript to remove all the mistakes. We have also added new experiments/figures to strengthen the message. Finally, we have substantially modified the discussion.

      Reviewer #3 (Public Review):

      Summary: 

      The authors used a novel technique to treat male infertility. In a proof-of-concept study, the authors were able to rescue the phenotype of a knockout mouse model with immotile sperm using this technique. This could also be a promising treatment option for infertile men. 

      Strengths: 

      In their proof-of-concept study, the authors were able to show that the novel technique rescues the infertility phenotype in vivo. 

      Weaknesses: 

      Some minor weaknesses, especially in the discussion section, could be addressed to further improve the quality of the manuscript. 

      We have substantially modified the discussion, following the remarks of the reviewers.

      It is very convincing that the phenotype of Armc2 KO mice could (at least in part) be rescued by injection of Armc2 RNA. However, a central question remains about which testicular cell types have been targeted by the constructs. From the pictures presented in Figures 7 and 8, this issue is hard to assess. Given the more punctate staining of the DNA construct a targeting of Sertoli cells is more likely, whereas the more broader staining of seminiferous tubules using RNA constructs is talking toward germ cells. Further, the staining for up to 119 days (Figure 5) would point toward an integration of the DNA construct into the genome of early germ cells such as spermatogonia and/or possibly to Sertoli cells. 

      Thanks for the comment. We would like to recall the peculiar properties of the non-insertional Enhanced Episomal Vector (EEV) plasmid, a non-viral episome based on the Epstein-Barr virus (EBV). It allows the plasmid to persist for long periods without integration. Its maintenance within the cell is made possible by its ability to replicate synchronously with the host genome and to segregate into daughter cells. This is because EEV is composed of two distinct elements derived from EBV: an origin of replication (oriP) and an Epstein-Barr Nuclear Antigen 1 (EBNA1) expression cassette (Gil, Gallaher, and Berk, 2010). The oriP is a locus comprising two EBNA1-binding domains, designated as the Family of Repeats (FR) and Dyad Symmetry (DS). The FR is an array of approximately 20 high-affinity EBNA1-binding sites (20 repeats of 30 bp), while the DS comprises four lower-affinity sites operating in tandem (Ehrhardt et al., 2008).

      The 641-amino-acid EBNA1 protein contains numerous domains. The N-terminal domains are rich in glycines and alanines, which enable interaction with host chromosomes. The C-terminal region is responsible for binding to oriP (Hodin, Najrana, and Yates, 2013). The binding of EBNA1 to the DS element results in the recruitment of the origin of replication. This results in the synchronous initiation of extra-chromosomal EEV replication with host DNA at each S phase of the cell cycle (Düzgüneş, Cheung, and Konopka 2018). Furthermore, EBNA1 binding to the FR domain induces the formation of a bridge between metaphase chromosomes and the vector during mitosis. This binding is responsible for the segregation of the EEV episome in daughter cells (Düzgüneş, Cheung, and Konopka 2018). It is notable that EEV is maintained at a rate of 90-95% per cell division.

      Because of the intrinsic properties of EEV described above, the presence of the reporter protein at 119 days after injection was likely due to the maintenance of the plasmid, mostly in Sertoli cells, and not to integration of the plasmid DNA.
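
      As a rough illustration of the retention figure quoted above, the sketch below computes the fraction of cell lineages expected to still carry the episome after a given number of divisions. It assumes, as a simplification made here for illustration and not taken from the cited references, that loss is independent at each division and that the per-division retention is constant at 90-95%. The function name and the division counts are hypothetical.

```python
def episome_retention(per_division_retention: float, divisions: int) -> float:
    """Fraction of lineages expected to retain the episome after `divisions` divisions.

    Assumes a constant, independent retention probability at each division
    (a simplifying assumption made for illustration only).
    """
    if not 0.0 <= per_division_retention <= 1.0:
        raise ValueError("retention must be between 0 and 1")
    if divisions < 0:
        raise ValueError("divisions must be non-negative")
    return per_division_retention ** divisions


if __name__ == "__main__":
    for n in (0, 1, 5, 10):  # hypothetical division counts
        low = episome_retention(0.90, n)
        high = episome_retention(0.95, n)
        print(f"{n:2d} divisions: {low:.2f} to {high:.2f} of lineages retain the episome")
```

      Under this assumption, retention declines with every division in proliferating cells but is unchanged in cells that do not divide, which is in line with the long-term signal being attributed mostly to Sertoli cells.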

      Of note, the specificity of EEV was already indicated in the introduction (lines 124-128 clean copy). Nevertheless, we have added more information about EEV to help the readers.  

      Given the expression after RNA transfection for up to 21 days (Figure 4) and the detection of motile sperm after 21 days (Figure 11), this would point to either round spermatids or spermatocytes.  These aspects need to be discussed more carefully (discussion section: lines 549-574).

      We added a sentence to highlight that spermatids are transfected and that the protein is synthesized at this stage; this question is discussed in detail (see lines 677-684 clean copy).

      It would also be very interesting to know in which testicular cell type Armc2 is endogenously expressed (lines 575-591)

      Thanks for the remarks. We now present new images showing the full seminiferous tubules, as requested by reviewer 1 (see supp fig 6). In this new figure, it is clear that Armc2 is only expressed in spermatids. We have also added to this figure an analysis of the RNA-seq database produced by Gan's team (Gan, Wen et al. 2013), confirming that Armc2 is predominantly expressed at the elongated spermatid stage. This point is now clearly indicated in the text (lines 570-579 clean copy).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      The article is well-structured and easy to read. Nonetheless, there are typos and mistakes in some places that are distracting to the reader, such as the capitalization of the word "Oligo-" in the title of the manuscript, the use of the word "Materiel" in the title of the Materials and methods and the presence of space holders "Schorr staining was obtained from Merck (XXX)".

      Thank you, we corrected the misspelling of "Materials and Methods" and corrected our error: "obtained from Merck (Darmstadt, Germany)". We also carefully corrected the manuscript to remove typos and mistakes.

      The discussion is too lengthy, with much repetition regarding the methods used and the results obtained. For example, these are two sentences from the discussion. "The vector was injected via the rete testis into the adult Armc2 KO mice. The testes were then electroporated." I would recommend shortening these passages.

      Thanks for your comments, we removed the sentences and we have substantially modified the discussion, following the remarks of the reviewers.

      The work is extensive, and many experiments have been done to prove the points made. However, a more in-depth analysis of critical experiments would have benefited the manuscript significantly. A more thorough analysis of sperm mobility and morphology using the CASA system would have been an initial step.

      In response to these observations, additional CASA experiments and sperm motility analyses were conducted, as illustrated in Figure 11 (A2-A3). Individual CASA parameters for motile sperm cells were extracted as suggested and are presented in a new graph (Fig 11 A2). We observed significant differences between WT and rescued sperm. In particular, the VSL and LIN parameters were lower for rescued sperm. Nevertheless, these differences were not sufficient to prevent IVF, possibly because the curvilinear velocity (VCL) was not modified.
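
      For readers less familiar with the CASA parameters mentioned above, the sketch below computes them from a tracked sperm head trajectory using their standard definitions: VCL is the point-to-point path length divided by the elapsed time, VSL is the straight-line distance between the first and last positions divided by the elapsed time, and LIN is VSL/VCL expressed as a percentage. The function name, the frame rate, and the coordinates are hypothetical and are not taken from our CASA system or data.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]


def casa_kinematics(track: Sequence[Point], frame_rate_hz: float) -> Dict[str, float]:
    """Return VCL and VSL (um/s) and LIN (%) for one track of (x, y) positions in um."""
    if len(track) < 2:
        raise ValueError("a track needs at least two positions")
    if frame_rate_hz <= 0:
        raise ValueError("frame rate must be positive")
    elapsed_s = (len(track) - 1) / frame_rate_hz
    # Curvilinear path: sum of distances between consecutive positions.
    path_um = sum(math.dist(a, b) for a, b in zip(track, track[1:]))
    # Straight-line path: distance from the first to the last position.
    straight_um = math.dist(track[0], track[-1])
    vcl = path_um / elapsed_s
    vsl = straight_um / elapsed_s
    lin = 100.0 * vsl / vcl if vcl > 0 else 0.0
    return {"VCL": vcl, "VSL": vsl, "LIN": lin}


if __name__ == "__main__":
    # Hypothetical zigzag track sampled at a hypothetical 4 Hz.
    zigzag = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0), (9.0, 4.0), (12.0, 0.0)]
    print(casa_kinematics(zigzag, frame_rate_hz=4.0))
```

      Because LIN is the ratio of VSL to VCL, a lower VSL with an unchanged VCL necessarily gives a lower LIN, which matches the pattern described above.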

      In the case of ARMC2 localization, an analysis of the different stages of spermatogenesis to show when ARMC2 starts to be expressed. 

      Thanks for the remarks. This important point was raised by all reviewers. As explained above, we have performed more experiments. We now present new images showing transverse sections of seminiferous tubules, as requested (see supp fig 6). In this new figure, it is clear that Armc2 is only expressed in spermatid layers. We have also added to this figure an analysis of the RNA-seq database produced by Gan's team (Gan, Wen et al. 2013), confirming that Armc2 is predominantly expressed at the elongated spermatid stage. This point is now clearly indicated in the text (lines 575-579 clean copy).

      Finally, exploring additional endpoints to understand the quality of the sperm generated, such as the efficiency of ICSI or sperm damage, could have helped understand the degree of the recovery.

      This point was raised in the public review. We paste our answer here: “To address this important point, the ability of sperm to produce embryos was challenged by two different assisted reproduction technologies, namely IVF and ICSI. To increase the number of motile sperm for IVF experiments, we injected both testes from one male. We also conducted intracytoplasmic sperm injection (ICSI) experiments, using only rescued sperm, identified as motile sperm with a normal flagellum. The results of these new experiments demonstrated that the rescued ARMC2 sperm successfully fertilized eggs and produced embryos at the two-cell stage by IVF and blastocysts by ICSI. These outcomes are presented in Figure 12.”

      Reviewer #2 (Recommendations For The Authors):

      38,74 intracellular

      Thanks, we changed it accordingly: "Intracytoplasmic sperm injection (ICSI) is required to treat such a condition, but it has limited efficacy and has been associated with a small increase in birth defects" and "such as intracytoplasmic sperm injection (ICSI)".

      39 "limited efficacy" Versus what? And for what reason? "small increase in birth defects" - compared to what? 

      We changed to “… but it is associated with a small increase in birth defects in comparison with pregnancies not involving assisted conception.”

      40 Just thinking through the logic of the argument thus far - the authors lay out that there are people with OAT (true), ICSI must be used (true), ICSI is bad (not convincing), and therefore a new strategy is needed... so is this an alternative to ICSI? And this is to restore fertility, not "restore spermatogenesis" - because ICSI doesn't restore spermatogenesis. This logic flow needs to be cleaned up some.

      Thanks we changed it accordingly: “restore fertility.”

      45 "mostly"?

      Thank you, we removed the word: “We show that mRNA-coded reporter proteins are detected for up to 3 weeks in germ cells, making the use of mRNA possible to treat infertility.”

      65 Reference missing. 

      We added the following reference Kumar, N. and A. K. Singh (2015). "Trends of male factor infertility, an important cause of infertility: A review of literature." J Hum Reprod Sci 8(4): 191-196.

      68 Would argue meiosis is not a reduction of the number of chromosomes - that happens at the ends of meiosis I and II - but the bulk of meiosis is doubling DNA and recombination; would re-word; replace "differentiation" with morphogenesis, which is much more commonly used:

      Thank you, we have changed the sentence accordingly: "proliferation (mitosis of spermatogonia), reduction of the number of chromosomes (meiosis of spermatocytes), and morphogenesis of sperm (spermiogenesis)".

      70 "almost exclusively" is an odd term, and a bit of an oxymoron - if not exclusively, then where else are they expressed? Can you provide some sense of scale rather than using vague words like "large", "almost", "several", "strongly" and "most...likely" - need some support for these claims by being more specific: 

      Thanks for the comment, we changed the sentence: "The whole process involves around two thousand genes, 60% of which are expressed exclusively in the testes."

      73 "severe infertility" is redundant - if they are infertile, is there really any more or less about it? I think what is meant is patients with immotile sperm can be helped by ICSI - so just be more specific... 

      We changed the transition: “Among infertility disorders, oligo-astheno-teratozoospermia (OAT) is the most frequent (50%; Thonneau, Marchand et al. 1991); it is likely to be of genetic origin. Spermatocytograms of OAT patients show a decrease in sperm concentration, multiple morphological defects and defective motility. Because of these combined defects, patients are infertile and can only conceive by IntraCytoplasmic Sperm Injection (ICSI). ICSI can efficiently overcome the problems faced. However, there are …”

      75 "some" is vague - how many concerns, and who has them? Be specific!

      Thanks for the comment, we removed the word.

      76-7 Again, be specific - "real" has little meaning - what is the increased risk, in % or fold? This is likely a controversial point, so make sure you absolutely support your contention with data .

      77 "these"? There was only one concern listed - increased birth defects; and "a number" is vague - what number, 1 or 1,000,000? A few (2-3), dozens, hundreds? 

      Thanks for the comment, we have reworded the sentence: “Nevertheless, concerns persist regarding the potential risks associated with this technique, including blastogenesis defect, cardiovascular defect, gastrointestinal defect, musculoskeletal defect, orofacial defect, leukemia, central nervous system tumors, and solid tumors. Statistical analyses of birth records have demonstrated an elevated risk of birth defects, with a 30–40% increased likelihood in cases involving ICSI, and a prevalence of birth defects between 1% and 4%.” We have added a list of references to support these claims.

      79-81 So, basically transgenesis? Again, vague terms "widely" - I don't think it's all that widely used yet... and references are missing to support the statement that integration of DNA into patient genomes is widely used. Give specific numbers, and provide a reference to support the contention. 

      Thanks for the comment, we removed the word "widely" and added references.

      81-5 Just finished talking about humans, but now it appears the authors have switched to talking about mice - got to let the readers know that! Unless you're talking about the Chinese group that deleted CCR5 in making transgenic humans? 

      Your feedback is greatly appreciated. In response to your comments, the sentence in question has been amended for clarity. Indeed, the text refers to experiments carried out in mice. The revised wording is as follows: “Given the genetic basis of male infertility, the first strategy, tested in mice, was to overcome spermatogenic failure associated with monogenic diseases by delivery of an intact gene to deficient germ cells (Usmani, Ganguli et al. 2013).”

      84-5 "efficiently" and "high" - provide context so the reader can understand what is meant - do the authors mean the experiments work efficiently, or that a high percentage of cells are transfected? And give some numbers or range of numbers - you're asking the readers to take your word for things when you choose adjectives - instead, provide values and let the readers decide for themselves.

      Thanks for the comment, we have reworded the sentence: “Gene therapy is effective in germ cells, as numerous publications have shown that conventional plasmids can be transferred into spermatogonia in several species with success, allowing their transcription in all cells of the germinal lineage (Usmani, Ganguli et al. 2013, Michaelis, Sobczak et al. 2014, Raina, Kumar et al. 2015, Wang, Liu et al. 2022).”

      93 Reference at the end of the sentence "most countries"

      Thanks, we changed the sentence and added the reference. The new sentence is: “… to avoid any eugenic deviations, transmissible changes in humans are illegal in 39 countries (Liu 2020)” (Liu, S. (2020). "Legal reflections on the case of genome-edited babies." Glob Health Res Policy 5: 24).

      93-4 Odd to say "multiple" and then list only one. 

      Thanks for the comment, we have reworded the sentence: “Furthermore, the genetic modification of germ cell lines poses biological risks, including the induction of cancer, off-target effects, and cell mosaicism. Errors in editing may have adverse effects on future generations. It is exceedingly challenging to anticipate the consequences of genetic mosaicism, for instance, in a single individual. (Sadelain, Papapetrou et al. 2011, Ishii 2017).”

      97 Is this really a "small" change? Again, would use adjectives carefully - to this reviewer, this is not a small change, but a significant one! And "should be" is not altogether convincing

      Thanks for the comment, we have reworded the sentence: “Thanks to this change, the risk of genomic insertion is avoided, and thus there is no question of heritable alterations.”

      What chance is there of retrotransposition? Is there any data in the literature for that, after injecting millions of copies of RNA one or more might be reverse transcribed and inserted into the genome?

      This is certainly possible and is the putative origin for multiple intronless spermatid-expressed genes: 

      The reviewer poses an interesting question, but one that unfortunately remains unanswered at present. Most papers on mRNA therapy state that there is no risk of genomic integration, but give no reference (see, for instance, mRNA-based therapeutics: looking beyond COVID-19 vaccines. Lancet. 2024 doi: 10.1016/S0140-6736(23)02444-3). This is an important question that deserves to be evaluated, but it is beyond the scope of this manuscript. Nevertheless, the issue remains much debated (Igyarto and Qin 2024).

      98 Odd to say "should be no risk" and then conclude with "there is no question" - so start the sentence with 'hedging', and then end with certainty - got to pick one or the other.

      Thanks for the comment, we have reworded the sentence.

      99 "Complete" - probably not, would delete:

      We removed the word: “The first part of this study presents a characterization of the protein expression patterns obtained following transfection of naked mRNA coding for reporter genes into the testes of mice”

      101-2 Reference missing, as are numbers - what % of cases? 

      Thank you, we changed the sentence and added the reference: “Among infertility disorders, oligo-astheno-teratozoospermia (OAT) is the most frequent (50%; Thonneau, Marchand et al. 1991)”. Thonneau, P., S. Marchand, A. Tallec, M. L. Ferial, B. Ducot, J. Lansac, P. Lopes, J. M. Tabaste and A. Spira (1991). "Incidence and main causes of infertility in a resident population (1,850,000) of three French regions (1988-1989)." Hum Reprod 6(6): 811-816.

      103 Once again, the reference is missing:

      We have added these references: (Colpi, Francavilla et al. 2018) (Cavallini 2006)

      104-5 Awkward transition.

      Thanks, we changed the transition: “The first part of this study presents a characterization of the protein expression patterns obtained following transfection of naked mRNA coding for reporter genes into the testes of mice. The second part is to apply the protocol to a preclinical mouse model of OAT.”

      105 Backslash is odd - never seen it used in that way before

      Removed

      108 "completely infertile" is redundant;

      Thank you, we changed it accordingly: “Patients and mice carrying mutations in the ARMC2 gene present a canonical OAT phenotype and are infertile”.

      and is a KO mouse really "preclinical"? 

      Preclinical research is defined as research involving the use of animals to ascertain the potential efficacy of a drug, procedure, or treatment. Preclinical studies are conducted prior to any testing in humans. Our KO mouse model has been shown to mimic human infertility: Armc2-/- mice exhibit a phenotype identical to that observed in humans. Our study is in line with this definition. For this reason, we have decided to keep the term "preclinical" in the article.

      110  Delete "sperm".

      Thank you, we changed it accordingly: “The preclinical Armc2 deficient (Armc2 KO) mouse model is therefore a valuable model to assess whether in vivo injection of naked mRNA combined with electroporation can restore spermatogenesis”

      111  "Easy"? Really? 

      We changed it accordingly: “We chose this model for several reasons: first, Armc2 KO mice are sterile and all sperm exhibit short, thick or coiled flagella [13].”

      112-3 "completely immobile" is redundant - either they are immobile or not.

      Thank you, we changed it accordingly: “As a result, 100 % of sperm are immobile, thus it should be easy to determine the efficacy of the technique by measuring sperm motility with a CASA system.”

      108-33 Condense this lengthy text into a coherent few sentences to give readers a sense of what you sought to accomplish, broadly how it was done, and what you found. This reads more like a Results section

      Thanks for the comment, we shortened the text.

      Materials and Methods 

      The sections appear to have been written by different scientists - the authors should standardize so that similar detail and formatting are used - e.g., in some parts the source is in parentheses with catalog number, in others not, some have city, state, country, others do not... the authors should check eLife mandates for this type of information and provide. 

      We are grateful for your feedback. We have standardized the text. If we have missed some instances, we can finish formatting the article, as outlined on the eLife website, once it has been accepted for publication and before sending the VOR.

      134 Misspelling

      We corrected the misspelling  

      142 Just reference, don't need to spell it out.

      Thanks, we changed it accordingly: “and the Armc2 KO mouse strain obtained by CRISPR-Cas9 (Coutton, Martinez et al. 2019). Experiments”

      150 What is XXX?

      We would like to express our gratitude for bringing this error to our attention. We have duly rectified the issue: “obtained from Merck (Darmstadt, Germany).”

      157-60 Are enough details provided for readers to repeat this if necessary? Doesn't seem so to this reviewer; if kits were followed, then can say "using manufacturer's protocol", or refer to another manuscript - but this is too vague. 

      Thanks, we changed it accordingly: “After expansion, plasmids were purified with a NucleoBond Xtra Midi kit (740410-50; Macherey-Nagel, Düren, Germany) using the manufacturer's protocol.”

      165 Again, too few details - how was it purified? What liquid was it in?

      Thanks for the comment; the EEV plasmids were purified like all other plasmids. We changed the text: “All plasmids, EEV CAGs-GFP-T2A-Luciferase (EEV604A-2; System Bioscience, Palo Alto, CA, USA), mCherry plasmid (given by Dr. Conti MD at UCSF, San Francisco, CA, USA) and EEV-Armc2-GFP plasmid (CUSTOM-S017188-R2-3, Trilink, San Diego, USA) were amplified by bacterial transformation”

      170 Seems some words are missing - and will everyone know Dr. Conti by last name alone? Would spell out, and the details of the plasmid must either be provided or a reference given; how was amplification done? Purification? What was it resuspended in? 

      Thanks for the remark; the mCherry plasmid was purified like all other plasmids. We changed the text: “All plasmids, EEV CAGs-GFP-T2A-Luciferase (EEV604A-2; System Bioscience, Palo Alto, CA, USA), mCherry plasmid (given by Dr. Conti MD, UCSF, San Francisco, CA, USA) and EEV-Armc2-GFP plasmid (CUSTOM-S017188-R2-3, Trilink, San Diego, USA) were amplified by bacterial transformation”

      175 Again, for this plasmid provide more information - catalog number, reference, etc; how amplified and purified, what resuspension buffer?

      Thank you for the remark. As mentioned above, we added this sentence on plasmid preparation: “All plasmids, EEV CAGs-GFP-T2A-Luciferase (EEV604A-2; System Bioscience, Palo Alto, CA, USA), mCherry plasmid (given by Dr. Conti MD at UCSF, San Francisco, CA, USA) and EEV-Armc2-GFP plasmid (CUSTOM-S017188-R2-3, Trilink, San Diego, USA) were amplified by bacterial transformation”, and we added this sentence: “The EEV-Armc2-GFP plasmid used for in vivo testes microinjection and electroporation was synthesized and customized by Trilink (CUSTOM-S017188-R2-3, San Diego, USA).”

      183 What sequence, or isoform was used? Mouse or human? 

      Thanks, we changed it accordingly: “This non-integrative episome contains the mouse cDNA sequence of Armc2 (ENSMUST00000095729.11)”

      186-7 Provide sequence or catalog number; what was it resolubilized in?

      Thanks we changed accordingly “the final plasmid concentration was adjusted to 9 μg μL-1 in water.” We provided the sequence of EEV-Armc2-GFP in supp data 6.

      207-219 Much better, this is how the entire section needs to be written! 

      237-240 Font

      Thanks for the comment, we changed it accordingly

      246 Cauda, and sperm, not sperm cells

      Thanks for the comment, we changed it accordingly

      255-6 Which was done first? Would indicate clearly.

      Thanks for the comment, we changed the sentence: “Adult mice were euthanized by cervical dislocation and then transcardially perfused with 1X PBS”

      281-2 Provide source for software - company, location, etc: 

      We changed it accordingly: “FIJI software (open-source software) was used to process and analyze images and Imaris software (Oxford Instruments, Tubney Woods, Abingdon, Oxon OX13 5QX, UK) for the 3D reconstructions.”

      323 um, not uM. 

      Thanks for the comment, we corrected our mistake: “After filtration (100 µm filter)”

      Results 

      369 Weighed.  

      Thanks for the comment, we corrected our mistake: “the testes were measured and weighed”

      371 No difference in what, specifically?

      Thanks for the comment, we changed the sentence to: “No statistical differences in length and weight were observed between control and treated testes”

      375 "was respected"? What does this mean?

      Thanks for the comment, we changed the sentence to: “The layered structure of germ cells was identical in all conditions”

      378  This is highly unlikely to be true, as even epididymal sperm from WT animals are often defective - the authors are saying there were ZERO morphological defects? Or that there was no difference between control and treated? Only showing 2-3 sperm for control vs treatment is not sufficient.

      Your observation that epididymal spermatozoa from wild-type animals often exhibit defective morphology is indeed correct. The prevalence of these defects varies by strain, with an average incidence of 20% to 40% (Kawai, Hata et al., 2006; Fan, Liu et al., 2015). To provide a more comprehensive representation, we performed Harris-Shorr staining and included a histogram of the percentage of normal sperm in each condition (new figure 2F4). This staining of the epididymal sperm cells revealed no discernible increase in morphological defects when mRNA and EEV were used, in comparison with the control. We added the sentence: “Finally, Harris-Shorr staining of the epididymal sperm cells demonstrated that there was no increase in morphological defects when mRNA and EEV were used in comparison with the control”.

      379  "safe" is not the right word - better to say "did not perturb spermatogenesis". 

      Thanks, we changed it accordingly: “these results suggest that in vivo microinjection and electroporation of EEV or mRNA did not perturb spermatogenesis”

      382-3 This sentence needs attention, doesn't make sense as written: 

      Thanks for the remark, we changed the sentence to: “No testicular lesions were observed on the testes at any post injection time”

      389  How long after injection? 

      Thanks for the comment, we changed the sentence to: “It is worth noting that both vectors induced GFP expression at one day post-injection”

      390  Given the duration of mouse spermatogenesis (~35 days), for GFP to persist past that time suggests that it was maintained in SSCs? How can the authors explain how such a strong signal was maintained after such a long period of time? How stable are the episomally-maintained plasmids, are they maintained 100% for months? And if they are inherited by progeny of SSCs, shouldn't they be successively diluted over time? And if they are inherited by daughter cells such that they would still be expressed 49 days after injection, shouldn't all the cells originating from that SSC also be positive, instead of what appear to be small subsets as shown in Fig. 3H2? Overall, this reviewer is struggling to understand how a plasmid would be inherited and passed through spermatogenesis in the manner seen in these results. 

      Thanks for the comment. 

      This point was already addressed in the public review. We paste our answer here: “The non-insertional Enhanced Episomal Vector (EEV) plasmid is a non-viral episome based on the Epstein-Barr virus (EBV). Its maintenance within the cell is made possible by its ability to replicate synchronously with the host genome and to segregate into daughter cells. This is because EEV is composed of two distinct elements derived from EBV: an origin of replication (oriP) and an Epstein-Barr Nuclear Antigen 1 (EBNA1) expression cassette (Gil, Gallaher, and Berk, 2010). The oriP is a locus comprising two EBNA1-binding domains, designated as the Family of Repeats (FR) and Dyad Symmetry (DS). The FR is an array of approximately 20 high-affinity EBNA1-binding sites (20 repeats of 30 bp), while the DS comprises four lower-affinity sites operating in tandem (Ehrhardt et al., 2008).

      The 641-amino-acid EBNA1 protein contains numerous domains. The N-terminal domains are rich in glycines and alanines, which enable interaction with host chromosomes. The C-terminal region is responsible for binding to oriP (Hodin, Najrana, and Yates, 2013a). The binding of EBNA1 to the DS element results in the recruitment of the origin of replication. This results in the synchronous initiation of extra-chromosomal EEV replication with host DNA at each S phase of the cell cycle (Düzgüneş, Cheung, and Konopka 2018a). Furthermore, EBNA1 binding to the FR domain induces the formation of a bridge between metaphase chromosomes and the vector during mitosis. This binding is responsible for the segregation of the EEV episome in daughter cells (Düzgüneş, Cheung, and Konopka 2018b). It is notable that EEV is maintained at a rate of 90-95% per cell division.”

      Because of the intrinsic properties of EEV described above, the presence of the reporter protein at 119 days after injection was likely due to the maintenance of the plasmid, mostly in Sertoli cells, and not to integration of the plasmid DNA.

      Of note, the specificity of EEV was already indicated in the introduction. Nevertheless, we have added more information about it to help the readers (lines 124-128, clean copy).

      398 Which "cell types"? 

      Your feedback is greatly appreciated, and the sentence in question has been amended to be more precise. The revised wording is as follows: “These results suggest that GFP-mRNA and EEV-GFP targeted different seminiferous cell types, such as Sertoli cells and all germline cells, or that there were differences in terms of transfection efficiency.”

      409 Why is it important to inject similar copies of EEV and mRNA? Wouldn't the EEV be expected to generate many, many more copies of RNA per molecule than the mRNAs when injected directly?? 

      We removed the word “importantly”.

      415 How is an injected naked mRNA stably maintained for 3 weeks? What is the stability of this mRNA?? Wouldn't its residence in germ cells for 21 days make it more stable than even the most stable endogenous mRNAs? Even mRNAs for housekeeping genes such as actin, which are incredibly stable, have half-lives of 9-10 hours.

      We appreciate your inquiry and concur with your assessment that mRNA stability is limited. We believe the confusion arises from the fact that we injected mRNA coding for the GFP protein, rather than mRNA tagged with GFP. After a three-week observation period, we did not observe the mRNA itself; we observed the GFP protein expressed from it. To draw the reader's attention to this point, we have added the following sentences to the text: “It is important to underline that the signal measured is the fluorescence emitted by the GFP. This signal is dependent on the half-lives of both the plasmid/mRNA and the GFP. Therefore, the kinetics of the signal persistence (which is called here expression) is a combination of the persistence of the vector and the synthesized protein.” See lines 469-472 (clean copy).

      That said, it is difficult to compare the lifespan of a cellular mRNA with that of an mRNA that has been modified at several levels, including 5′ cap, mRNA body, and poly(A) tail modifications, which increase both mRNA stability and translation (see “The Pivotal Role of Chemical Modifications in mRNA Therapeutics” (2022), https://doi.org/10.3389/fcell.2022.901510). This question is discussed in lines 687-698 (clean copy).

      467 "safely" should be deleted

      Thanks, we removed the word: “To validate and confirm the capacity of naked mRNA to express proteins in the testes after injection and electroporation”

      470  Except that apoptotic cells were clearly seen in Figure 2:

      We would like to thank the reviewer for their comment. We agree that the staining of the provided sections was of heterogeneous quality. To address the remark, we carried out additional HE staining for all conditions, and we now present correctly stained testis sections obtained under the different conditions in Fig. 2 and Supp. 7. Our observations revealed that the number of apoptotic cells remained consistent across all conditions.

      471  "remanence"?

      We appreciate your feedback and have amended the sentence to provide clear meaning. The revised wording is as follows: “The assessment of the temporal persistence of testicular mCherry fluorescent protein expression revealed a robust red fluorescence from day 1 post-injection, which remained detectable for at least 15 days (Fig. Supp. 3 B2, C2, and D2).”

      489 IF measures steady-state protein levels, not translation; should say you determined when ARMC2 was detectable. 

      Thanks for the remark, we changed the sentence to: “By IF, we determined when ARMC2 protein was detectable during spermatogenesis.”

      491 Flagella

      Thanks for the comment, we corrected our mistake: “in the flagella of the elongated spermatids (Fig 9A)”

      Discussion 

      The Discussion is largely a re-hashing of the Methods and Results, with additional background.

      Message stability must be addressed - how is a naked mRNA maintained for 21 days?

      As previously stated, it is our hypothesis that the source of the confusion lies in the fact that we injected mRNA coding for the GFP protein, rather than mRNA tagged with GFP. After a three-week observation period, we did not observe the mRNA, but we observed the synthesized GFP protein. This point and the stability of proteins in the testis are now discussed in lines 677-684 (clean copy).

      556 How do the authors define "safe"?

      Thanks for the comment, we changed the sentence to be clearer: “Our results also showed that the combination of injection and electroporation did not perturb spermatogenesis when electric pulses are carefully controlled”

      563 Synthesized

      Thanks, we changed it accordingly

      602 Again, this was not apparent, as there were more apoptotic cells in Fig. 2 - data must be provided to show "no effect".

      As previously stated, we carried out additional HE staining for all conditions, as can be observed in Fig. 2. Our observations revealed that the number of apoptotic cells remained consistent across all conditions.

      629-30 This directly contradicts the authors' contention in the Introduction that ICSI was unsafe - how is this procedure going to be an advancement over ICSI as proposed, if ICSI needs to be used?? Why not just skip all this and do ICSI then?? Perhaps if this technique was used to 'repair' defects in spermatogonia or spermatocytes, then that makes more sense. But if ICSI is required, then this is not an advancement when trying to rescue a sperm morphology/motility defect.

      In light of the latest findings (Fig 12), we have revised this part of the discussion and this paragraph no longer exists.

      Nevertheless, to address the reviewer’s remark specifically, we would like to underline that ICSI with sperm from a fertile donor is always more efficient than ICSI with sperm from patients suffering from OAT. Our strategy, by improving sperm quality, will improve the efficiency of ICSI and will ultimately increase the live birth rate resulting from the first fresh IVF cycle.

      640-2 What is meant by "sperm organelles" And what examples are provided for sperm proteins being required at or after fertilization? 

      This paragraph was also strongly modified and the notion of protein persistence during spermatogenesis was discussed in the paragraph on fluorescent signal duration. See lines 698-705.

      651 "Dong team"??

      Thanks for the comment, we added the references. 

      Figure 2D2 - tubule treated with EEV-GFP appears to have considerably more apoptotic cells - this reviewer counted ~10 vs 0 in control; also, many of the spermatocytes appear abnormal in terms of their chromatin morphology - the authors must address this by staining for markers of apoptosis - not fair to conclude there was no difference when there's a very obvious difference! 

      We would like to thank the reviewer for their comment. This point was already addressed. As previously stated, we now provide new testis sections for all conditions (see Fig. 2). Our observations revealed that the number of apoptotic cells remained consistent across all conditions.

      Figure 2D3 staining is quite different than D1-2, likely a technical issue - looks like no hematoxylin was added? Need to re-stain so results can be compared to the other 2 figures 

      As previously stated, we carried out additional HE staining for all conditions, and new images are provided, with similar staining. 

      Figure 3 - the fluorescent images lack any context of tubule structure so it is nearly impossible to get a sense of what cells express GFP, or whether they're in the basal vs adluminal compartment - can the authors outline them? Indicate where the BM and lumen are. 

      We would like to thank the reviewer for their comment. This figure actually provides a global view of green fluorescent protein (GFP) expression at the surface of the testis. The entire testis was placed under an inverted epifluorescence microscope, and a picture of the GFP signal was recorded. For this reason, it is impossible to delineate the BM and the lumen. It should be noted that the fluorescence likely originates from different seminiferous tubules.

      Author response image 1.

      So, for Figure 3 if the plasmid is being uptaken by cells and maintained as an episome, is it able to replicate? Likely not. 

      Yes, it is able to replicate: this is an intrinsic property of the episome. See the detailed explanation of the EEV plasmid provided above.

      So, initially, it could be in spermatogonia, spermatocytes, and spermatids. As time progressed those initially positive spermatids and then spermatocytes would be lost - and finally, the only cells that should be positive would be the progeny of spermatogonia that were positive - but, as they proliferate shouldn't the GFP signal decline? 

      Because EEV is able to replicate synchronously with the host genome and to segregate into daughter cells, with about 90% retention at each cell division, the expected decline is very slow.
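
      To make the arithmetic behind this statement concrete, the following minimal sketch assumes a simple geometric model in which the episome is retained independently at each division with a constant probability (the 90-95% per-division range quoted above). The function name and the division counts are hypothetical and chosen only for illustration; the number of divisions actually undergone by the transfected cells is not specified here.

```python
def retained_fraction(retention_per_division: float, divisions: int) -> float:
    """Expected fraction of a lineage still carrying the episome.

    Assumes (for illustration only) that every division retains the episome
    independently with the same probability, so the fraction decays
    geometrically: retention ** divisions.
    """
    if not 0.0 <= retention_per_division <= 1.0:
        raise ValueError("retention_per_division must lie in [0, 1]")
    if divisions < 0:
        raise ValueError("divisions must be non-negative")
    return retention_per_division ** divisions


if __name__ == "__main__":
    # 0.90 and 0.95 are the bounds quoted in the response; the division
    # counts below are hypothetical.
    for retention in (0.90, 0.95):
        for n_divisions in (1, 5, 10):
            fraction = retained_fraction(retention, n_divisions)
            print(f"retention={retention:.2f}, divisions={n_divisions}: {fraction:.3f}")
```

      Under this model, cells that divide rarely or not at all keep the episome almost indefinitely, whereas rapidly dividing lineages dilute it at a rate set by the per-division retention.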

      And, since clones of germ cells are connected throughout their development, shouldn't the GFP diffuse through the intercellular bridges so entire clones are positive? Was this observed? 

      We did not perform IF experiments beyond 7 days after injection, a time too short to observe what the reviewer suggested. Moreover, although GFP synthesized from injected EEV was found in both germ cells and Sertoli cells at 1 day after injection (Fig 7), after one week the reporter protein was observable only in Sertoli cells. This result suggests that EEV is maintained only in Sertoli cells, thus preventing the observation of stained clones.

      Can these sections be stained for the ICB TEX14 so that clonality can be distinguished? Based on the apparent distance between cells, it appears some are clones, but many are not... 

      We thank the reviewer for this suggestion, but we are not able to perform testis sectioning and co-staining experiments because the PFA treatment bleaches the GFP signal. We also tested several GFP antibodies, but all failed.

      Nevertheless, we were able to localize and identify transfected cells thanks to whole-testis optical clearing, combined with measurement of GFP fluorescence and three-dimensional image reconstruction.

      For Figure 4, with the mRNA-GFP, why does the 1-day image (which looks similar to the plasmid-transfected) look so different from days 7-21?

      And why do days 7-21 look so different from those days in Fig 3? 

      Thank you for your feedback. It is an excellent question. Because of the low resolution of whole-testis epifluorescence imaging and light penetration issues, we decided to carry out whole-testis optical clearing and three-dimensional image reconstruction experiments in order to gain insight into the transfection process. At day 1, GFP synthesized from injected EEV was found in spermatogonia, spermatocytes and Sertoli cells (Fig 7). After one week, the reporter protein synthesized from injected EEV was observable only in Sertoli cells.

      In contrast, for mRNA, on day 1 and day 7 post-injection, the GFP fluorescent signal was associated with both Sertoli cells and germ cells. This explains why the mRNA-GFP and EEV-GFP patterns are similar at day 1 and different at day 7.

      Why do the authors think the signal went from so strong at 21 to undetectable at 28? What changed so drastically over those 7 days?

      What is the half-life of this mRNA supposed to be? It seems that 21 days is an unreasonably long time, but then to go to zero at 28 seems also odd... Please provide some explanation, and context for whether the residence of an exogenous mRNA for 21 days is expected. 

      As previously stated, it is our hypothesis that the source of the confusion lies in the fact that we injected mRNA coding for the GFP protein, rather than mRNA tagged with GFP. After a three-week observation period, we did not observe the mRNA, but we observed the GFP protein produced by the mRNA. The time of observation of the reporter proteins expressed by the respective mRNA molecules (mCherry, luciferase, or GFP) ranged from 15 to 21 days. Proteins have very different turnover rates, with half-lives ranging from minutes to days. Half-lives depend on proteins but also on tissues. As explained in the discussion, it has been demonstrated that proteins involved in spermatogenesis exhibit a markedly low turnover rate and this explains the duration of the fluorescent signal. 

      The authors should immunostain testis sections from controls and those with mRNA and plasmid and immunostain with established germ cell protein fate markers to show what specific germ cell types are GFP+

      Thank you for your feedback. As previously mentioned, we were unable to perform testis sectioning and co-staining because the PFA treatment bleaches the GFP signal and because we were unable to reveal GFP with a GFP antibody, for unknown reasons.

      For the GFP signal to be maintained past 35 days, the plasmid must have integrated into SSCs - and for that to happen, the plasmid would have to cross the blood-testis-barrier... is this expected? 

      We are grateful for your observation. 

      First, as explained above, we do not think that the plasmid has been integrated. 

      Concerning the blood-testis barrier, it bears noting that electroporation is a technique that is widely used in biotechnology and medicine for the delivery of drugs and the transfer of genes into living cells (Boussetta, Lebovka et al. 2009). This process entails the application of an electric current, which induces the formation of hydrophilic pores in the lipid bilayer of the plasma membrane (Kanduser, Miklavcic et al. 2009). The pores remain stable throughout the electroporation process and then close again once it is complete. Consequently, as electroporation destabilizes the cell membrane, it can also destabilize the gap junctions responsible for the blood-testis barrier. This was confirmed by several studies, which observed plasmid transfection beyond the blood-testis barrier with injection into the rete testis following electroporation (Muramatsu, Shibata et al. 1997, Kubota, Hayashi et al. 2005, Danner, Kirchhoff et al. 2009, Kanduser, Miklavcic et al. 2009, Michaelis, Sobczak et al. 2014).

      Figure 9 - authors should show >1 cell - this is insufficient; also, it's stated it's only in the flagella, but it also appears to be in the head as well. And is this just the principal piece?? And are the authors sure those are elongating vs condensing spermatids? Need to show multiple tubules, at different stages, to make these claims

      We partly answered this question in the public review; we paste our answer here:

      “We now present new images showing the full seminiferous tubules as requested (see supp fig 6). In this new figure, it is clear that Armc2 is only expressed in spermatids. We have also added in this figure an analysis of the RNA-seq database produced by Gan's team (Gan, Wen et al. 2013), confirming that Armc2 is predominantly expressed at the elongated spermatid stage. This point is now clearly indicated in the text.”

      Concerning the localization of the protein in the head, we confirm that the base of the manchette is stained but we have no explanation so far. This point is now indicated in the manuscript.

      Figure 10B2 image - a better resolution is necessary

      We are grateful for your feedback. We concede that the quality of the image was not optimal. Consequently, we have replaced it with an alternative.

      Figure 11 - in control, need to show >1 sperm; and lower-mag images should be provided for all samples to show population-wide effects; showing 1 "normal" sperm per group (white arrows) is insufficient: 

      We are grateful for your feedback. We conducted further experiments and now provide additional images in Supp. figure 8.

      Reviewer #3 (Recommendations For The Authors)

      In this study, Vilpreux et al. developed a microinjection/electroporation method in order to transfect RNA into testicular cells. The authors studied several parameters of treated testis and compared the injection of DNA versus RNA. Using the injection of Armc2 RNA into mice with an Armc2 knockout the authors were able to (partly) rescue the fertility phenotype. 

      Minor points. 

      Figure 6 + lines 553+554: might it be that the staining pattern primarily on one side of the testis is due to the orientation of the scissor electrode during the electroporation procedure and the migration direction of negatively charged RNA molecules (Figure 6)? 

      Your input is greatly appreciated. We concur that the observed peripheral expression is due to both the electroporation and injection. Accordingly, we have amended the sentence as follows: "The peripheral expression observed was due to the close vicinity of cells to the electrodes, and to a peripheral dispersal of the injected solution, as shown by the distribution of the fluorescent i-particles NIRFiP-180."

      Discussion of the safety aspect (lines 601-608): The authors state several times that there are no visible tissue changes after the electroporation procedure. However, in order to claim that this procedure is "safe", it is necessary to examine the offspring born after microinjection/electroporation. 

      Your input is greatly appreciated. Consequently, the term "safe" has been replaced with "did not perturb spermatogenesis" in accordance with the provided feedback. Your assertion is correct; an examination of the offspring born would be necessary to ascertain the safety of the procedure. Due to the quantity of motile sperm obtained, it was not possible to produce offspring through natural mating. However, novel Armc2-/--rescued sperm samples have been produced and in vitro fertilization (IVF) and intracytoplasmic sperm injection (ICSI) experiments have been conducted. The results demonstrate that the Armc2-/--rescued sperm can successfully fertilize eggs and produce two-cell embryos by IVF and blastocysts by ICSI. These outcomes are visually represented in Figure 12. The development of embryos up to the blastocyst stage is a step in the right direction.

      The discussion section could be shortened. Lines 632-646 are largely a repetition of the introductory section. In addition, the Dong paper (ref. 25) may be interesting; however, this part could also be shortened (lines 647-676). This reviewer would prefer the authors to focus on the technique (different application sites and applied nucleotides) and proof of concept for (partial) phenotype rescue in the knockout mice. 

      Your contribution is highly valued. In light of your observations and the latest findings, we have substantially revised the discussion accordingly.

      Line 63: oocytes rather than eggs.

      We are grateful for your input, but we have decided to keep the term "eggs" rather than "oocytes" because an oocyte is defined as a female gametocyte or germ cell involved in reproduction. In other words, an oocyte is a germ cell inside the ovary, which becomes an egg after ovulation.

      Boussetta, N., N. Lebovka, E. Vorobiev, H. Adenier, C. Bedel-Cloutour and J. L. Lanoiselle (2009). "Electrically assisted extraction of soluble matter from chardonnay grape skins for polyphenol recovery." J Agric Food Chem 57(4): 1491-1497.

      Cavallini, G. (2006). "Male idiopathic oligoasthenoteratozoospermia." Asian J Androl 8(2): 143-157.

      Colpi, G. M., S. Francavilla, G. Haidl, K. Link, H. M. Behre, D. G. Goulis, C. Krausz and A. Giwercman (2018). "European Academy of Andrology guideline Management of oligo-asthenoteratozoospermia." Andrology 6(4): 513-524.

      Coutton, C., G. Martinez, Z. E. Kherraf, A. Amiri-Yekta, M. Boguenet, A. Saut, X. He, F. Zhang, M. Cristou-Kent, J. Escoffier, M. Bidart, V. Satre, B. Conne, S. Fourati Ben Mustapha, L. Halouani, O. Marrakchi, M. Makni, H. Latrous, M. Kharouf, K. Pernet-Gallay, M. Bonhivers, S. Hennebicq, N. Rives, E. Dulioust, A. Toure, H. Gourabi, Y. Cao, R. Zouari, S. H. Hosseini, S. Nef, N. Thierry-Mieg, C. Arnoult and P. F. Ray (2019). "Bi-allelic Mutations in ARMC2 Lead to Severe Astheno-Teratozoospermia Due to Sperm Flagellum Malformations in Humans and Mice." Am J Hum Genet 104(2): 331-340.

      Danner, S., C. Kirchhoff and R. Ivell (2009). "Seminiferous tubule transfection in vitro to define postmeiotic gene regulation." Reprod Biol Endocrinol 7: 67.

      Gan, H., L. Wen, S. Liao, X. Lin, T. Ma, J. Liu, C. X. Song, M. Wang, C. He, C. Han and F. Tang (2013). "Dynamics of 5-hydroxymethylcytosine during mouse spermatogenesis." Nat Commun 4: 1995.

      Igyarto, B. Z. and Z. Qin (2024). "The mRNA-LNP vaccines - the good, the bad and the ugly?" Front Immunol 15: 1336906.

      Ishii, T. (2017). "Germ line genome editing in clinics: the approaches, objectives and global society." Brief Funct Genomics 16(1): 46-56.

      Kanduser, M., D. Miklavcic and M. Pavlin (2009). "Mechanisms involved in gene electrotransfer using high- and low-voltage pulses--an in vitro study." Bioelectrochemistry 74(2): 265-271.

      Kubota, H., Y. Hayashi, Y. Kubota, K. Coward and J. Parrington (2005). "Comparison of two methods of in vivo gene transfer by electroporation." Fertil Steril 83 Suppl 1: 1310-1318.

      Michaelis, M., A. Sobczak and J. M. Weitzel (2014). "In vivo microinjection and electroporation of mouse testis." J Vis Exp(90).

      Muramatsu, T., O. Shibata, S. Ryoki, Y. Ohmori and J. Okumura (1997). "Foreign gene expression in the mouse testis by localized in vivo gene transfer." Biochem Biophys Res Commun 233(1): 45-49.

      Raina, A., S. Kumar, R. Shrivastava and A. Mitra (2015). "Testis mediated gene transfer: in vitro transfection in goat testis by electroporation." Gene 554(1): 96-100.

      Sadelain, M., E. P. Papapetrou and F. D. Bushman (2011). "Safe harbours for the integration of new DNA in the human genome." Nat Rev Cancer 12(1): 51-58.

      Thonneau, P., S. Marchand, A. Tallec, M. L. Ferial, B. Ducot, J. Lansac, P. Lopes, J. M. Tabaste and A. Spira (1991). "Incidence and main causes of infertility in a resident population (1,850,000) of three French regions (1988-1989)." Hum Reprod 6(6): 811-816.

      Usmani, A., N. Ganguli, H. Sarkar, S. Dhup, S. R. Batta, M. Vimal, N. Ganguli, S. Basu, P. Nagarajan and S. S. Majumdar (2013). "A non-surgical approach for male germ cell mediated gene transmission through transgenesis." Sci Rep 3: 3430.

      Wang, L., C. Liu, H. Wei, Y. Ouyang, M. Dong, R. Zhang, L. Wang, Y. Chen, Y. Ma, M. Guo, Y. Yu, Q. Y. Sun and W. Li (2022). "Testis electroporation coupled with autophagy inhibitor to treat nonobstructive azoospermia." Mol Ther Nucleic Acids 30: 451-464.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This is an interesting and potentially important paper, which however has some deficiencies.

      Strengths:

      A significant amount of potentially useful data.

      Weaknesses:

      One issue is a confusion of thermal stability with solubility. Thermal stability of a protein is a thermodynamic parameter that can be described by the Gibbs-Helmholtz equation, which gives the free energy difference between the folded and unfolded states as a function of temperature, as well as the entropy of unfolding. What is actually measured in PISA, by contrast, is a change in protein solubility, which is an empirical parameter affected by a great many variables, including the presence and concentration of other ambient proteins and other molecules. One might possibly argue that in TPP, where one measures the melting temperature change ∆Tm, thermal stability plays a decisive or at least an important role, but no such assertion can be made in PISA analysis that measures the solubility shift.

      We completely agree with the insightful comment from the reviewer and we are very grateful that the point was raised. Our goal was to make this manuscript easily accessible to the entire scientific community, not just experts in the field. In an attempt to simplify the language, we likely also simplified the underlying physical principles that these assays exploit. In defense of our initial manuscript, we did state that PISA measures “a fold change in the abundance of soluble protein in a compound-treated sample vs. a vehicle-treated control after thermal denaturation and high-speed centrifugation.” Despite this attempt to accurately communicate the reviewer’s point, we seem to have not been sufficiently clear. Therefore, we tried to further elaborate on this point and made it clear that we are measuring differences in solubility and interpreting these differences as changes in thermal stability. 

      In the revised version of the manuscript, we elaborated significantly on our original explanation. The following excerpt appears in the introduction (p. 3):

      “So, while CETSA and TPP measure a change in melting temperature (∆TM), PISA measures a change in solubility (∆SM). Critically, there is a strong correlation between ∆TM and ∆SM, which makes PISA a reliable, if still imperfect, surrogate for measuring direct changes in protein thermal stability (Gaetani et al., 2019; Li et al., 2020). Thus, in the context of PISA, a change in protein thermal stability (or a thermal shift) can be defined as a fold change in the abundance of soluble protein in a compound-treated sample vs. a vehicle-treated control after thermal denaturation and high-speed centrifugation. Therefore, an increase in melting temperature, which one could determine using CETSA or TPP, will lead to an increase in the area under the curve and an increase in the soluble protein abundance relative to controls (positive log2 fold change). Conversely, a decrease in melting temperature will result in a decrease in the area under the curve and a decrease in the soluble protein abundance relative to controls (negative log2 fold change).”

      And the following excerpt appears in the results section (p. 4): 

      “In a PISA experiment, a change in melting temperature or a thermal shift is approximated as a significant deviation in soluble protein abundance following thermal melting and high-speed centrifugation. Throughout this manuscript, we will interpret these observed alterations in solubility as changes in protein thermal stability. Most commonly this is manifested as a log2 fold change comparing the soluble protein abundance of a compound-treated sample to a vehicle-treated control (Figure 1 – figure supplement 1A).”
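
      As a minimal illustration of the readout described in these excerpts, the sketch below computes a log2 fold change from soluble-protein abundances. The function name, the use of replicate means, and the abundance values are assumptions made for the example; this is not the processing pipeline used in the manuscript.

```python
import math
from statistics import mean


def log2_fold_change(treated, vehicle):
    """log2 ratio of mean soluble abundance, compound-treated vs. vehicle.

    Positive values mean more protein stayed soluble after heating
    (interpreted as stabilization); negative values mean less.
    """
    if not treated or not vehicle:
        raise ValueError("both groups need at least one measurement")
    treated_mean = mean(treated)
    vehicle_mean = mean(vehicle)
    if treated_mean <= 0 or vehicle_mean <= 0:
        raise ValueError("abundances must be positive")
    return math.log2(treated_mean / vehicle_mean)


if __name__ == "__main__":
    # Hypothetical soluble abundances (arbitrary units), duplicates per group.
    vehicle = [200.0, 200.0]
    stabilized = log2_fold_change([230.0, 250.0], vehicle)
    destabilized = log2_fold_change([160.0, 170.0], vehicle)
    print(f"stabilized protein:   {stabilized:+.3f}")
    print(f"destabilized protein: {destabilized:+.3f}")
```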

      We have now drawn a clear distinction between what we were actually measuring (changes in solubility) and how we were interpreting these changes (as thermal shifts). We trust that the Reviewer will agree with this point, as they rightly note that many of the observations presented in our work, which measures thermal stability indirectly, are consistent with previous studies that measured thermal stability directly. Again, we thank the reviewer for raising the point and feel that these changes have significantly improved the manuscript.

      Another important issue is that the authors claim to have discovered for the first time a number of effects well described in prior literature, sometimes a decade ago. For instance, they marvel at the differences between the solubility changes observed in lysate versus intact cells, while this difference has been investigated in a number of prior studies. No reference to these studies is given during the relevant discussion.

      We thank the reviewer for raising this point. Our aim with this paper was to test the proficiency of this assay in high-throughput screening-type applications. We considered these observations as validation of our workflow, but admit that our choice of wording was not always appropriate and that we should have included more references to previous work. It was certainly never our intention to take credit for these discoveries. Therefore, we were more than happy to include more references in the revised version. We think that this makes the paper considerably better and will help readers better understand the context of our study.  

      The validity of statistical analysis raises concern. In fact, no calculation of statistical power is provided.

      As only two replicates were used in most cases, the statistical power must have been pretty limited. Also, there seems to be an absence of the multiple-hypothesis correction.

      We agree with the reviewer that a classical comparison using a t-test would be underpowered comparing all log2 normalized fold changes. We know from the data and our validation experiments that stability changes that generate log2 fold changes of 0.2 are indicative of compound engagement. When we used 0.2 to calculate power for a standard two-sample t-test with duplicates, we estimated a power of 19.1%. Importantly, increasing this to n=3 resulted in a power estimate of only 39.9%, which would canonically still be considered underpowered. It is therefore important to note that we instead use the distribution of all measurements for a single protein across all compound treatments to calculate standard deviations (nSD) as presented in this work. Thus, rather than a 2-by-2 comparison, we are comparing two duplicate compound treatments to 94 other compound treatments and 18 DMSO vehicle controls. Moreover, we are using this larger sample set to estimate the sampling distribution. Estimating this with a standard z-test would result in a p-value estimate far below 0.0001 using the population standard deviation. Additionally, rather than estimate an FDR using, say, a Benjamini-Hochberg correction, we estimated an empirical FDR for target calls based on applying the same cutoffs to our DMSO controls and measuring the proportion of hits called in control samples at each set of thresholds. Finally, we note that several other PISA-based methods have used fold-change thresholds similar to, or less than, those employed in this work (PMID: 35506705, 36377428, 34878405, 38293219).
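
      The empirical FDR idea described above can be illustrated with a minimal sketch. This is not the code used in the study: the function names `call_hits` and `empirical_fdr` are hypothetical, the default cutoffs (|nSD| > 3.5 and |log2 fold change| > 0.2) are taken from Reviewer #3's summary of the thresholds, and expressing the FDR as the ratio of the hit rate in DMSO controls to the hit rate in compound-treated samples is an assumption about how the proportion was turned into an estimate.

```python
def call_hits(nsd, log2fc, nsd_cutoff=3.5, fc_cutoff=0.2):
    """Flag measurements that exceed both thresholds (hypothetical helper)."""
    return [
        abs(z) > nsd_cutoff and abs(fc) > fc_cutoff
        for z, fc in zip(nsd, log2fc)
    ]


def empirical_fdr(control_nsd, control_fc, treated_nsd, treated_fc,
                  nsd_cutoff=3.5, fc_cutoff=0.2):
    """Estimate an FDR by applying identical cutoffs to vehicle controls.

    Assumption: FDR is approximated as (hit rate in DMSO controls) divided by
    (hit rate in compound-treated samples), capped at 1.
    """
    control_hits = call_hits(control_nsd, control_fc, nsd_cutoff, fc_cutoff)
    treated_hits = call_hits(treated_nsd, treated_fc, nsd_cutoff, fc_cutoff)
    control_rate = sum(control_hits) / len(control_hits)
    treated_rate = sum(treated_hits) / len(treated_hits)
    if treated_rate == 0:
        return 0.0  # no hits were called, so there are no false discoveries
    return min(1.0, control_rate / treated_rate)


# Toy values for illustration only; these are not measurements from the study.
fdr = empirical_fdr(
    control_nsd=[0.5, 4.0, 1.0, 0.2], control_fc=[0.05, 0.3, 0.1, 0.0],
    treated_nsd=[5.0, 4.2, 0.3, 6.0], treated_fc=[0.5, 0.1, 0.02, -0.4],
)
print(fdr)
```

      Repeating the calculation over a grid of cutoffs would give the FDR "at each set of thresholds" mentioned in the response.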

      Also, the authors forgot that whatever results PISA produces, even at high statistical significance, represent just a prediction that needs to be validated by orthogonal means. In the absolute majority of cases such validation is missing.

      We appreciate this point and can assure the reviewer that it was not lost on us. We state throughout the paper that its primary purpose was to execute a chemical screen. Furthermore, we do not claim to present a definitive list of protein targets for each compound. Instead, our intention is to provide a framework for performing PISA studies at scale. In total, we quantified thousands of changes and feel that it would be unreasonable to validate the majority of these cases. Instead, as has been done for CETSA (PMID: 34265272), PISA (PMID: 31545609), and TPP (PMID: 25278616) experiments before, we chose to highlight a few examples and provide a reasonable amount of validation for these specific observations. In Figure 2, we show that two screening compounds (palbociclib and NVP-TAE-226) have a similar impact on PLK1 solubility as the two known PLK1 inhibitors. We then assay each of these compounds, alongside BI 2536, and show that the same compounds that impact the solubility of PLK1 also inhibit its activity in cell-based assays. Finally, we model the structure of palbociclib (which is highly similar to BI 2536) in the PLK1 active site. In Figure 4, we show that AZD-5438 causes a change in solubility of RIPK1 in cell- and lysate-based assays to a similar extent as other compounds known to engage RIPK1. We then test these compounds in cell-based assays and show that they are capable of inhibiting RIPK1 activity in vivo. In Figure 5, we show that treatment with tyrosine kinase inhibitors and AZD-7762 results in a decrease in the solubility of CRKL. We show that these compounds, specifically, prevented the phosphorylation of CRKL at Y207. Next, we show that AZD-7762 impacts the thermal stability of tyrosine kinases in lysate-based PISA. Finally, we performed phosphoproteomic profiling of cells treated with bafetinib and AZD-7762 and find that the abundance of many pY sites is decreased after treatment with each compound. It is also worth stating that an important goal of this study was to determine the proficiency of these methods in identifying the targets of each compound. We do not feel that comprehensive validation of the “absolute majority of cases” would significantly improve this manuscript.

      Finally, to be a community-useful resource the paper needs to provide the dataset with a user interface so that the users can data-mine on their own.

      We agree and are working to develop an extensible resource for this. Owing to the size and complexity of the dataset, that work will need to be included in a follow-up manuscript. For now, we feel that the supplemental table we provide can be used to easily navigate the full dataset. Indeed, this has been the main resource that we have been emailed about since the preprint was first made public. We are glad that the Reviewer considers this dataset to be a highly valuable resource for the scientific community.

      Reviewer #2 (Public Review):

      Summary:

      Using K562 (leukemia) cells as an experimental model, Van Vranken et al. use Thermal Proteome Profiling (TPP) to investigate changes in protein stability after exposing either live cells or crude cell lysates to a library of anti-cancer drugs. This was a large-scale and highly ambitious study, involving thousands of hours of mass spectrometry instrument time. The authors used an innovative combination of TPP together with Proteome Integral Solubility Alteration (PISA) assays to reduce the amount of instrument time needed, without compromising on the amount of data obtained.

      The paper is very well written, the relevance of this work is immediately apparent, and the results are well-explained and easy to follow even for a non-expert. The figures are well-presented. The methods appear to be explained in sufficient detail to allow others to reproduce the work.

      We thank the reviewer. One of our major goals was to make these assays and the resulting data approachable, especially for non-experts. We are glad that this turned out to be the case. 

      Strengths:

      Using CDK4/6 inhibitors, the authors observe strong changes in protein stability upon exposure to the drug. This is expected and shows their methodology is robust. Further, it adds confidence when the authors report changes in protein stability for drugs whose targets are not well-known. Many of the drugs used in this study - even those whose protein targets are already known - display numerous off-target effects. Although many of these are not rigorously followed up in this current study, the authors rightly highlight this point as a focus for future work.

      Weaknesses:

      While the off-target effects of several drugs could've been more rigorously investigated, it is clear the authors have already put a tremendous amount of time and effort into this study. The authors have made their entire dataset available to the scientific community - this will be a valuable resource to others working in the fields of cancer biology/drug discovery.

      We agree with the reviewer that there are more leads here that could be followed and we look forward to both exploring these in future work and seeing what the community does with these data.

      Reviewer #3 (Public Review):

      Summary:

      This work aims to demonstrate how recent advances in thermal stability assays can be utilised to screen chemical libraries and determine the compound mechanism of action. Focusing on 96 compounds with known mechanisms of action, they use the PISA assay to measure changes in protein stability upon treatment with a high dose (10 µM) in live K562 cells and whole cell lysates from K562 or HCT116. They intend this work to showcase a robust workflow that can serve as a roadmap for future studies.

      Strengths:

      The major strength of this study is the combination of live and whole cell lysates experiments. This allows the authors to compare the results from these two approaches to identify novel ligand-induced changes in thermal stability with greater confidence. More usefully, this also enables the authors to separate the primary and secondary effects of the compounds within the live cell assay.

      The study also benefits from the number of compounds tested within the same framework, which allows the authors to make direct comparisons between compounds.

      These two strengths are combined when they compare CHEK1 inhibitors and suggest that AZD-7762 likely induces secondary destabilisation of CRKL through off-target engagement with tyrosine kinases.

      Weaknesses:

      One of the stated benefits of PISA compared to the TPP in the original publication (Gaetani et al 2019) was that the reduced number of samples required allows more replicate experiments to be performed. Despite this, the authors of this study performed only duplicate experiments. They acknowledge this precludes the use of frequentist statistical tests to identify significant changes in protein stability. Instead, they apply an 'empirically derived framework' in which they apply two thresholds to the fold change vs DMSO: absolute z-score (calculated from all compounds for a protein) > 3.5 and absolute log2 fold-change > 0.2. They state that the fold-change threshold was necessary to exclude nonspecific interactors. While the thresholds appear relatively stringent, this approach will likely reduce the robustness of their findings in comparison to an experimental design incorporating more replicates. Firstly, the magnitude of the effect size should not be taken as a proxy for the importance of the effect.

      They acknowledge this and demonstrate it using their data for PIK3CB and p38α inhibitors (Figures 2B-C). They have thus likely missed many small, but biologically relevant changes in thermal stability due to the fold-change threshold. Secondly, this approach relies upon the fold-changes between DMSO and compound for each protein being comparable, despite them being drawn from samples spread across 16 TMT multiplexes. Each multiplex necessitates a separate MS run and the quantification of a distinct set of peptides, from which the protein-level abundances are estimated. Thus, it is unlikely the fold changes for unaffected proteins are drawn from the same distribution, which is an unstated assumption of their thresholding approach. The authors could alleviate the second concern by demonstrating that there is very little or no batch effect across the TMT multiplexes. However, the first concern would remain. The limitations of their approach could have been avoided with more replicates and the use of an appropriate statistical test. It would be helpful if the authors could clarify if any of the missed targets passed the z-score threshold but fell below the fold-change threshold.
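
      The two-threshold rule discussed above, and the question of which proteins pass the z-score cutoff but not the fold-change cutoff, can be made concrete with a short sketch. It is illustrative only and rests on several assumptions: the function names `trimmed_sd` and `classify` are hypothetical, the trimming follows the description the authors give in their reply to Reviewer #3's recommendations (proteins measured in at least 30 conditions, top 5% of absolute log2 fold changes removed), the number of values dropped is rounded down, the sample standard deviation is used, and the z-score is taken as the log2 fold change divided by the trimmed standard deviation without re-centering.

```python
import statistics


def trimmed_sd(log2fcs, trim_fraction=0.05, min_conditions=30):
    """Standard deviation after dropping the largest absolute fold changes.

    Returns None when the protein was measured in too few conditions.
    """
    if len(log2fcs) < min_conditions:
        return None
    ranked = sorted(log2fcs, key=abs)             # smallest |log2FC| first
    n_drop = int(len(ranked) * trim_fraction)     # assumption: round down
    kept = ranked[:len(ranked) - n_drop]
    return statistics.stdev(kept)                 # assumption: sample SD


def classify(log2fc, sd, z_cutoff=3.5, fc_cutoff=0.2):
    """Label one compound-protein measurement against both thresholds."""
    if sd is None or sd == 0:
        return "not_scored"
    passes_z = abs(log2fc / sd) > z_cutoff
    passes_fc = abs(log2fc) > fc_cutoff
    if passes_z and passes_fc:
        return "hit"
    if passes_z:
        return "z_only"   # significant by z-score but below the fold-change cutoff
    return "not_hit"


# Toy data: 38 near-zero fold changes plus two extreme values that get trimmed.
values = [0.01 * (-1) ** i for i in range(38)] + [1.0, -2.0]
sd = trimmed_sd(values)
print(classify(0.5, sd), classify(0.1, sd), classify(0.02, sd))
```

      Counting the `z_only` labels for known compound-target pairs would give the number of targets that were excluded solely by the fold-change threshold.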

      The authors use a single, high concentration of 10 µM for all compounds. Given that many of the compounds likely have low nM IC50s, this concentration will often be multiple orders of magnitude above the one at which they inhibit their target. This makes it difficult to assess the relevance of the off-target effects identified to clinical applications of the compounds or biological experiments. The authors acknowledge this and use ranges of concentrations for follow-up studies (e.g. Figure 2E-F). Nonetheless, this weakness is present for the vast bulk of the data presented.

      We agree that there is potential to drive off-target effects at such high concentrations. However, we note that the concentration we employ is in the same range as previous PISA/CETSA/TPP studies. For example, 10 µM treatments were used in the initial descriptions of TPP (Savitski et al., 2014) and PISA (Gaetani et al., 2019). We also note that temperature may affect off-rates and binding interactions (PMID: 32946682), which may necessitate higher compound concentrations to overcome these effects.

      Additionally, these compounds likely accumulate in human plasma/tissues at concentrations that far exceed the compound IC50 values. For example, in patients treated with a standard clinical dose of ribociclib, the concentration of the compound in the plasma fluctuates between 1 µM and 10 µM (Bao, X., Wu, J., Sanai, N., & Li, J. (2019). Determination of total and unbound ribociclib in human plasma and brain tumor tissues using liquid chromatography coupled with tandem mass spectrometry. Journal of pharmaceutical and biomedical analysis, 166, 197–204. https://doi.org/10.1016/j.jpba.2019.01.017).

      The authors’ claim that combining cell-based and lysate-based assays increases coverage (Figure 3F) is not supported by their data. The '% targets' presented in Figure 3F have a different denominator for each bar. As it stands, all 49 targets quantified in both assays which have a significant change in thermal stability may be significant in the cell-based assay. If so, the apparent increase in % targets when combining reflects only the subsetting of the data. To alleviate this lack of clarity, the authors could update Figure 3F so that all three bars present the % targets figure for just the 60 compounds present in both assays.

      We spent much time debating the best way to present this data, so we are grateful for the feedback. Consistent with the Reviewer’s suggestion, we have included a figure that only considers the 60 compounds for which a target was quantified in both cell-based and lysate-based PISA (now Figure 3E). In addition, we included a pie chart that further illustrates our point (now Figure 3 – figure supplement 2A). Of the 60 compounds, there were 37 compounds that had a known target pass as a hit using both approaches, 6 compounds that had a known target pass as a hit in only cell-based experiments, and 6 compounds that had a known target pass as a hit in only lysate-based experiments.

      Within the Venn diagram, we also included a few examples of compounds that fit into each category. Furthermore, we highlighted two examples of compound-target pairs that pass as a hit with one approach, but not the other (Figure 3 – figure supplement 2B,C). We would also like to refer the reviewer to Figure 4D, which indicates that BRAF inhibitors cause a significant change in BRAF thermal stability in lysates but not cells. 
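
      The counts reported above translate directly into the '% targets' values on a common denominator of 60 compounds. The short calculation below is a worked illustration only; the function name `percent_targets` is hypothetical, and it assumes the three categories are disjoint, so that the remaining 11 compounds had no known target pass as a hit in either assay.

```python
def percent_targets(both, cell_only, lysate_only, total):
    """Percentage of compounds with a known target called as a hit."""
    return {
        "cell": 100 * (both + cell_only) / total,
        "lysate": 100 * (both + lysate_only) / total,
        "combined": 100 * (both + cell_only + lysate_only) / total,
    }


# Counts as stated in the response: 37 in both, 6 cell-only, 6 lysate-only, 60 total.
coverage = percent_targets(both=37, cell_only=6, lysate_only=6, total=60)
for assay, pct in coverage.items():
    print(f"{assay}: {pct:.1f}%")
# cell: 71.7%
# lysate: 71.7%
# combined: 81.7%
```

      On this shared denominator, each assay alone recovers a known target for 43 of 60 compounds, and the combination recovers one for 49 of 60.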

      Aims achieved, impact and utility:

      The authors have achieved their main aim of presenting a workflow that serves to demonstrate the potential value of this approach. However, by using a single high dose of each compound and failing to adequately replicate their experiments and instead applying heuristic thresholds, they have limited the impact of their findings. Their results will be a useful resource for researchers wishing to explore potential off-target interactions and/or mechanisms of action for these 96 compounds, but are expected to be superseded by more robust datasets in the near future. The most valuable aspect of the study is the demonstration that combining live cell and whole cell lysate PISA assays across multiple related compounds can help to elucidate the mechanisms of action.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      More specifically:

      P 1 l 20, we quantified 1.498 million thermal stability measurements.

      It's a staggering assertion, and it takes some reading to realize that the authors mean the total number of proteins identified and quantified in all experiments. But far from all of these proteins were quantified with enough precision to provide meaningful solubility shifts.

      We can assure the reviewer that we were not trying to deceive the readers. We stated ‘1.498 million thermal stability measurements.’ We did not say ‘1.498 million compound-specific thermal stability shifts.’ We assume that most readers will appreciate that the overall quality of the measurements will be variable across the dataset, as in any work that describes quantitation of thousands of proteins in a proteomics dataset. In accordance with the Reviewer’s suggestion, we have weakened this statement. The revised version of the manuscript now reads as follows (p. 1):

      “Taking advantage of this advance, we quantified more than one million thermal stability measurements in response to multiple classes of therapeutic and tool compounds (96 compounds in living cells and 70 compounds in lysates).”

      P 7 l 28. We observed a large range of thermal stability measurements for known compound-target pairs, from a four-fold reduction in protein stability to a four-fold increase in protein stability upon compound engagement (Figure 2A).

      PISA-derived solubility shift cannot be interpreted simply as a "four-fold reduction/increase in protein stability".

      We thank the Reviewer for highlighting this specific passage and agree that it was worded poorly. As such, we have modified the manuscript to the following (p. 8): 

      “We observed a large range of thermal stability measurements for known compound-target pairs, from a four-fold reduction in protein solubility after thermal denaturation to a four-fold increase in protein solubility upon compound engagement (Figure 2A).”

      P 8, l 6. Instead, we posit that maximum ligand-induced change in thermal stability is target-specific.

      Yes, that's right, but this has been shown in a number of prior studies.

      We agree with the reviewer and accept that we made a mistake in how we worded this sentence, which we regret upon reflection. As such, we have modified this sentence to the following:

      “Instead, our data appears to be consistent with the previous observation that the maximum ligand-induced change in thermal stability is target-specific (Savitski et al., 2014; Becher et al., 2016).”

      P 11 l 7. Combining the two approaches allows for greater coverage of the cellular proteome and provides a better chance of observing the protein target for a compound of interest. In fact, the main difference is that in-cell PISA provides targets in cases when the compound is a pro-drug that needs to be metabolically processed before engaging the intended target. This has been shown in a number of prior studies, but not mentioned in this manuscript.

      While our study was not focused on the issue of pro-drugs, this is an important point and we would be happy to re-iterate it in our manuscript. We thank the Reviewer for the suggestion and have modified the manuscript to reflect this point (p. 19): 

      “Cell-based studies, on the other hand, have the added potential to identify the targets of pro-drugs that must be metabolized in the cell to become active and secondary changes that occur independent of direct engagement (Savitski et al., 2014; Franken et al., 2015; Almqvist et al., 2016; Becher et al., 2016; Liang et al., 2022).”

      While we are happy to make this change, we would also like to point out that the reviewer’s assertion that “the main difference is that in-cell PISA provides targets in cases when the compound is a pro-drug that needs to be metabolically processed before engaging the intended target” may not fully capture the nuances of protein engagement effectors in the cellular context. Thus, we believe it is important to highlight the ability of cell-based assays to identify secondary changes in thermal stability.

      P 11 l 28. These data suggest that the thermal destabilization observed in cell-based experiments might stem from a complex biophysical rearrangement. That's right because it is not about thermal stability, but about protein solubility which is much affected by the environment.

      We agree that the readout of solubility is an important caveat for nearly every experiment in the family of assays associated with ‘thermal proteome profiling’. Complex biophysical arrangements could affect the inherent stability and solubility of a protein or complex. Thus, we would be happy to make the following change consistent with the reviewer’s suggestion (p. 12):

      “These data suggest that the decrease in solubility observed in cell-based experiments might stem from a complex biophysical rearrangement.”

      P 12 l 7 A). Thus, certain protein targets are more prone to thermal stability changes in one experimental setting compared to the other. Same thing - it's about solubility, not stability.

      We thank the Reviewer for the recommendation and have modified the revised manuscript as follows (p. 13):

      “Thus, certain protein targets were more prone to solubility (thermal stability) changes in one experimental setting compared to the other (Huber et al., 2015).”

      P13 l 15. While the data suggests that cell- and lysate-based PISA are equally valuable in screening the proteome for evidence of target engagement... No, they are not equally valuable - cell-based PISA can provide targets of prodrugs, which lysate PISA cannot.

      We have removed this sentence to avoid any confusion. We will not place any value judgments on the two approaches. 

      P 18 l 10. In general, a compound-dependent thermal shift that occurs in a lysate-based experiment is almost certain to stem from direct target engagement. That's true and has been known for a decade. Reference needed.

      We recognize this oversight and would be happy to include references. The revised manuscript reads as follows: 

      “In general, a compound-dependent thermal shift that occurs in a lysate-based experiment is almost certain to stem from direct target engagement (Savitski et al., 2014; Becher et al., 2016). This is because cell signaling pathways and cellular structures are disrupted and diluted. Cell-based studies, on the other hand, have the added potential to identify the targets of pro-drugs that must be metabolized in the cell to become active and secondary changes that occur independent of direct engagement (Savitski et al., 2014; Franken et al., 2015; Almqvist et al., 2016; Becher et al., 2016; Liang et al., 2022).”

      P 18 l 29. the data seemed to indicate that the maximal PISA fold change is protein-specific. Therefore, a log2 fold change of 2 for one compound-protein pair could be just as meaningful as a log2 fold change of 0.2 for another. This is also not new information.

      We again thank the Reviewer for highlighting this oversight. The revised manuscript reads as follows:

      “Ultimately, the data seemed to be consistent with previous studies that indicate the maximal change in thermal stability is protein-specific (Savitski et al., 2014; Becher et al., 2016; Sabatier et al., 2022). Therefore, a log2 fold change of 2 for one compound-protein pair could be just as meaningful as a log2 fold change of 0.2 for another.”

      P 19 l 5. Specifically, the compounds that most strongly impacted the thermal stability of targets, also acted as the most potent inhibitors. I wish this was true, but this is not always so. For instance, in Nat Meth 2019, 16, 894-901 it was postulated that large ∆Tm correspond to biologically most important sites ("hot spots") - the idea that was later challenged and largely discredited in subsequent studies.

      Indeed, we agree with the Reviewer that there may be no essential connection between these. Rather, we are simply drawing conclusions from observations within the presented dataset. 

      Saying nothing about the work presented in the paper that the reviewer notes above, the referenced definition is also more nuanced “…we hypothesized that ‘hotspot’ modification sites identified in this screen (namely, those significantly shifted relative to the unmodified, bulk and even other phosphomodiforms of the same protein) may represent sites with disproportionate effects on protein structure and function under specific cellular conditions.” Indeed, in the response to that work, Potel et al. (https://doi.org/10.1038/s41592-021-01177-5) “agree with the premise of the Huang et al. study that phosphorylation sites that have a significant effect on protein thermal stability are more likely to be functionally relevant, for example, by modulating protein conformation, localization and protein interactions.” 

      Anecdotally, we also speculate that if we observe proteome engagement for two compounds (let’s say two ATP-competitive kinase inhibitors) that bind in the same pocket (let’s say the ATP binding site) and one causes a greater change in solubility, then it is reasonable to assume that it engages the target more strongly, and we see evidence supporting this claim in Figures 2, 3, 4, and 5.

      It is also important to point out that previous work has made similar points. This is highlighted in a review article by Mateus et al. (doi: 10.1186/s12953-017-0122-4). The authors state, “To obtain affinity estimates with TPP, a compound concentration range TPP (TPP-CCR) can be performed. In TPP-CCR, cells are incubated with a range of concentrations of compound and heated to a single temperature.” In support of this claim, the authors reference two papers: Savitski et al., 2014 and Becher et al., 2016. We have updated this section in the revised manuscript (p. 20):

      “While the primary screen was carried out at fixed dose, the increased throughput of PISA allowed for certain compounds to be assayed at multiple doses in a single experiment. In these instances, there was a clear dose-dependent change in thermal stability of primary targets, off-targets, and secondary targets. This not only helped corroborate observations from the primary screen, but also seemed to provide a qualitative assessment of relative compound potency in agreement with previous studies (Savitski et al., 2014; Becher et al., 2016; Mateus et al., 2017). Specifically, the compounds that most strongly impacted the thermal stability of targets, also acted as the most potent inhibitors. In order to be a candidate for this type of study, a target must have a large maximal thermal shift (magnitude of log2 fold change) because there must be a large enough dynamic range to clearly resolve different doses.”

      Also, the compound efficacy is strongly dependent upon the residence time of the drug, which may or may not correlate with the PISA shift. Also important is the concentration at which target engagement occurs (Anal Chem 2022, 94, 15772-15780).

      In our study, the time and concentration of treatment were fixed for all compounds at 30 minutes and 10 µM, respectively. Therefore, we do not believe these parameters will affect our conclusions.

      P 19 l 19. For example, we found that the clinically-deployed CDK4/6 inhibitor palbociclib is capable of directly engaging and inhibiting PLK1. This is a PISA-based prediction that needs to be validated by orthogonal means.

      As we demonstrate in this work, the PISA assays serve as powerful screening methods, thus we agree that validation is important for these types of studies. To this end, we show the following:  

      • Proteomics: Palbociclib causes a decrease in solubility following thermal melting in cells.

      • Chemical informatics: Palbociclib is structurally similar to BI 2536.

      • Protein informatics: Modeling of palbociclib in empirical structures of the PLK1 active site generates negligible steric clashes. 

      • Biochemical: Palbociclib inhibits PLK1 activity in cells.

      We have changed this text to the following to clarify these points:

      “For example, we found that the clinically-deployed CDK4/6 inhibitor palbociclib has a dramatic impact on PLK1 thermal stability in live cells, is capable of inhibiting PLK1 activity in cell-based assays, and can be modelled into the PLK1 active site.”

      Reviewer #2 (Recommendations For The Authors):

      I am wondering why the authors chose to use K562 (leukaemia) cells in this work as opposed to a different cancer cell line (HeLa? Panc1?). It would be helpful if the authors could present some rationale for this decision.

      This is a great question. There are two reasons. First, K562 cells are commonly used in various fields of research, especially in previous studies using proteome-wide thermal shift assays (PMID: 25278616, 32060372) and large-scale chemical perturbation screens (PMID: 31806696). Second, they are a suspension line, which makes executing the experiments easier because the cells do not need to be detached from a plate prior to thermal melting. We think this is a valuable point to make in the manuscript, so that non-experts understand this concept. We tried to communicate this succinctly in the revised manuscript, but would be happy to elaborate further if the Reviewer would like us to.

      “To enable large-scale chemical perturbation screening, we first sought to establish a robust workflow for assessing protein thermal stability changes in living cells. We chose K562 cells, which grow in suspension, because they have been frequently used in similar studies and can easily be transferred from a culture flask to PCR tubes for thermal melting (Savitski et al., 2014; Jarzab et al., 2020).”

      I note that integral membrane proteins are over-represented among targets for anti-cancer therapeutics. To what extent is the membrane proteome (plasma membrane in particular) identified in this work? After examining the methods, I would expect at least some integral membrane proteins to be identified. Do the authors observe any differences in the behaviour of water-soluble proteins versus integral membrane proteins in their assays? It would be helpful if the authors could comment on this in a potential revision.

      We agree this is an important point when considering the usage of PISA and thermal stability assays in general for specific classes of therapeutics. To address this, we explored what effect the analysis of thermal stability/solubility had on the proportion of membrane proteins in our data (Author response image 1). Annotations were extracted from Uniprot based on each protein being assigned to the “plasma membrane” (07/2024). We quantified 1,448 (16.5% of total proteins) and 1,558 (17.3% of total proteins) membrane proteins in our cell and lysate PISA datasets, respectively. We also compared the proportion of annotated proteins in these datasets to a recent TMTpro dataset (Lin et al.; PMID: 38853901) and found that the PISA datasets recovered a slightly lower proportion of membrane proteins (~17% in PISA versus 18.9% in total proteome analysis). Yet, we note that we expect more membrane proteins in urea/SDS based lysis methods compared to 0.5% NP-40 extractions.

      Author response image 1.

      We were not able to find an appropriate place to insert these data into the manuscript, so we have left them here in the response. If the Reviewer feels strongly that these data should be included in the manuscript, we would be happy to include them.

      A final note: I commend the authors for making their full dataset publicly available upon submission to this journal. This data promises to be a very useful resource for those working in the field.

      We thank the Reviewer for this and note that we are excited for this data to be of use to the community.

      Reviewer #3 (Recommendations For The Authors):

      There is no dataset PDX048009 in ProteomeXchange Consortium. I assume this is because it's under an embargo which needs to be released.

      We can confirm that data was uploaded to ProteomeXchange.

      MS data added to the manuscript during revisions were submitted to ProteomeXchange with the identifier PXD053138.

      Page 9 line 5 refers to 59 compounds quantified in both cell-based and lysate-based, but Figure 3E shows 60 compounds quantified in both. I believe these numbers should match.

      We thank the Reviewer for catching this. In response to critiques from this Reviewer in the Public Review, we re-worked this section considerably. Please see the above critique/response for more details. 

      Page 10, lines 26-28: It would help the reader if some of the potential 'artefactual effects of lysate-based analyses' were described briefly.

      We thank the Reviewer for raising this point. The truth is that we are not exactly sure what is happening here, but we know that, at least for vorinostat, this excess of changes in lysate-based PISA is consistent across experiments. We also do not see pervasive issues within the plexes containing these compounds. Therefore, we do not think this is due to a mistake or other experimental error. We hypothesize that the effect might result from a change in pH or a similar property that occurs upon addition of the molecule, though we note that we have previously seen that vorinostat can induce large numbers of solubility changes in a related solvent shift assay (doi: 10.7554/eLife.70784). We have modified the text to indicate that we do not fully understand the reason for the observation (p. 11):

      “It is highly unlikely that these three molecules actively engage so many proteins and, therefore, the 2,176 hits in the lysate-based screen were likely affected in part by consistent, but artefactual effects of lysate-based analyses that we do not fully understand (Van Vranken et al., 2021).”

      Page 24, lines 29-30 appear to contain a typo. I believe the '>' should be '<' or the 'exclude' should be 'retain'.

      The Reviewer is completely correct. We appreciate the attention to detail. This mistake has been corrected in the revised manuscript.  

      Page 25, lines 5-7: The methods need to explain how the trimmed standard deviation is calculated.

      We apologize for this oversight. To calculate the trimmed standard deviation, we used proteins that were measured in at least 30 conditions. For these, we then removed the top 5% of absolute log2 fold-changes (compared to DMSO controls) and calculated the standard deviation of the resulting set of log2 fold-changes. This is similar in concept to the use of “trimmed means” in proteomics data (https://doi.org/10.15252/msb.20145625), which helps to overcome issues due to extreme outliers in datasets. We have added the following statement to the methods to clarify this point (p. 27):

      “Second, for each protein across all cells or lysate assays, the number of standard deviations away from the mean thermal stability measurement (z-score) for a given protein was quantified based on a trimmed standard deviation. Briefly, the trimmed standard deviation was calculated for proteins that were measured in at least 30 conditions. For these, we removed the top 5% of absolute log2 fold-changes (compared to DMSO controls) and calculated the standard deviation of the resulting set of log2 fold-changes.”
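
      As an illustration only, the trimming and z-score steps described above could be sketched as follows. The function names, the rounding of the 5% cut, the use of the sample (n - 1) standard deviation, and the per-protein application are all assumptions made for this sketch; the example profile is invented and is not data from the study.

```python
import math
import statistics


def trimmed_sd(log2_fold_changes, trim_fraction=0.05, min_conditions=30):
    """Standard deviation after discarding the largest absolute fold-changes.

    Returns None when fewer than `min_conditions` values are available.
    """
    n = len(log2_fold_changes)
    if n < min_conditions:
        return None
    # Assumption: the number of values to drop is rounded down.
    n_drop = int(trim_fraction * n)
    # Sort by magnitude so the most extreme values sit at the end.
    kept = sorted(log2_fold_changes, key=abs)[: n - n_drop]
    # Assumption: sample (n - 1) standard deviation.
    return statistics.stdev(kept)


def z_score(value, mean, sd):
    """Number of (trimmed) standard deviations between a value and the mean."""
    return (value - mean) / sd


# Hypothetical profile: 38 unremarkable conditions plus two extreme ones.
profile = [0.1, -0.1] * 19 + [5.0, -6.0]
sd = trimmed_sd(profile)
```

      Because the two extreme values are removed before the spread is estimated, the trimmed standard deviation stays close to the noise level of the unaffected conditions, so genuine outliers receive large z-scores rather than inflating their own denominator.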

      Page 25, lines 9-11 needs editing for clarity.

      We tested empirical hit rates to choose the mean and trimmed standard deviation (trimmedSD) thresholds to apply, aiming to maximize sensitivity and minimize the ‘False Hit Rate’, defined as the number of proteins in the DMSO control samples called as hits divided by the total number of proteins called as hits with a given threshold applied.

      Author response image 2.

      Hit calling threshold setting based on maximizing the total hits called and minimizing the False Hit Rate in cells (number of DMSO hits divided by the total number of hits).

      Author response image 3.

      Hit calling threshold setting based on maximizing the total hits called and minimizing the False Hit Rate in lysates (number of DMSO hits divided by the total number of hits).
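
      The threshold scan summarised in these two images can be sketched roughly as below. This is a minimal sketch under stated assumptions: a measurement is called a hit when both its absolute log2 fold change and its absolute z-score pass their cutoffs, DMSO-derived hits are counted within the total, and the cutoff grids and example measurements are hypothetical.

```python
def false_hit_rate(dmso_hits, total_hits):
    """DMSO-control hits divided by all hits at a given threshold."""
    if total_hits == 0:
        return 0.0
    return dmso_hits / total_hits


def is_hit(log2_fc, z, fc_cutoff, z_cutoff):
    # Assumption: both cutoffs must be passed for a hit to be called.
    return abs(log2_fc) >= fc_cutoff and abs(z) >= z_cutoff


def scan_thresholds(measurements, fc_cutoffs, z_cutoffs):
    """Tabulate total hits and False Hit Rate for every cutoff pair.

    `measurements` is a list of (log2_fc, z, is_dmso) tuples.
    """
    rows = []
    for fc_cutoff in fc_cutoffs:
        for z_cutoff in z_cutoffs:
            hits = [m for m in measurements
                    if is_hit(m[0], m[1], fc_cutoff, z_cutoff)]
            dmso_hits = sum(1 for m in hits if m[2])
            rows.append((fc_cutoff, z_cutoff, len(hits),
                         false_hit_rate(dmso_hits, len(hits))))
    return rows


# Hypothetical measurements: two compound-treated and two DMSO samples.
example = [(1.0, 4.0, False), (0.6, 3.0, False),
           (0.7, 3.6, True), (0.1, 0.5, True)]
table = scan_thresholds(example, fc_cutoffs=[0.5], z_cutoffs=[2.5, 3.5])
```

      Each row of the resulting table gives one candidate threshold pair, so the trade-off between the number of hits retained and the share attributable to DMSO controls can be read off directly.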

      Figure 1 supplementary 2a legend states: '32 DMSO controls'. Should that be 64?

      We thank the Reviewer for catching our mistake. This has been corrected in the revised manuscript. 

      I suggest removing Figure 1 supplementary 3c, which is superfluous as the only number it presents is already stated in the text (page 5, line 9).

      We thank the Reviewer for the suggestion and agree that this panel is superfluous. It has been removed from the revised manuscript.

      New data and tables added during revisions:  

      (1) Table 3 – All log2 fold change values for the cell-based screen. Using this table, protein-centric solubility profiles can be plotted (as in Figures 2D and others).

      (2) Table 4 – All log2 fold change values for the lysate-based screen. Using this table, protein-centric solubility profiles can be plotted (as in Figures 2D and others).

      (3) Figure 1 – Figure supplement 3H – Table highlighting proteins that pass log2 fold change cutoffs, but not nSD cutoffs and vice versa. 

      (4) Figure 2 – Panels H and I were updated with a new color scheme. 

      (5) Figure 3 – Updated main figure and supplement at the request of Reviewer 3. 

      • Figure 3E – Compares on-target hits for the cell- and lysate-based screens for all compounds for which a target was quantified in both screens. 

      • Figure 3 – Figure supplement 2 – Highlights on-target hits in both screens, exclusively in cells, and exclusively in lysates. 

      (6) Figure 5 – PISA data for K562 lysates treated with AZD-7762 at multiple concentrations.

      • Figure 5F

      • Figure 5 – Figure supplement 3A-C

      • Figure 5 – Source data 2

      (7) Figure 5 – Phosphoproteomic profiling of K562 cells treated with AZD7762 or Bafetinib. 

      • Figure 5G

      • Figure 5 – Figure supplement 4A-F

      • Figure 5 – Source data 3 (phosphoproteome)

      • Figure 5 – Source data 4 (associated proteome data)

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Wang et al investigated the evolution, expression, and function of the X-linked miR-506 miRNA family. They showed that the miR-506 family underwent rapid evolution. They provided evidence that miR-506 appeared to have originated from the MER91C DNA transposons. Human MER91C transposon produced mature miRNAs when expressed in cultured cells. A series of mouse mutants lacking individual clusters, a combination of clusters, and the entire X-linked cluster (all 22 miRNAs) were generated and characterized. The mutant mice lacking four or more miRNA clusters showed reduced reproductive fitness (litter size reduction). They further showed that the sperm from these mutants were less competitive in polyandrous mating tests. RNA-seq revealed the impact of deletion of miR-506 on the testicular transcriptome. Bioinformatic analysis analyzed the relationship among miR-506 binding, transcriptomic changes, and target sequence conservation. The miR-506-deficient mice did not have apparent effect on sperm production, motility, and morphology. Lack of severe phenotypes is typical for miRNA mutants in other species as well. However, the miR-506-deficient males did exhibit reduced litter size, such an effect would have been quite significant in an evolutionary time scale. The number of mouse mutants and sequencing analysis represent a tour de force. This study is a comprehensive investigation of the X-linked miR-506 miRNA family. It provides important insights into the evolution and function of the miR-506 family.

      The conclusions of this preprint are mostly supported by the data except being noted below. Some descriptions need to be revised for accuracy.

      L219-L285: The conclusion that X-linked miR-506 family miRNAs are expanded via LINE1 retrotransposition is not supported by the data. LINE1s and SINEs are very abundant, accounting for nearly 30% of the genome. In addition, the LINE1 content of the mammalian X chromosome is twice that of the autosomes. One can easily find flanking LINE1/SINE repeats. Therefore, the analyses in Fig. 2G, Fig. 2H and Fig. S3 are not informative. In order to claim LINE1-mediated retrotransposition, it is necessary to show the hallmarks of LINE1 retrotransposition, which are only possible for new insertions. The X chromosome is known to be enriched for testis-specific multi-copy genes that are expressed in round spermatids (PMID: 18454149). The conclusion on the LINE1-mediated expansion of miR-506 family on the X chromosome is not supported by the data and does not add additional insights. I think that the LINE1 related figure panels and description (L219-L285) need to be deleted. In discussion (L557-558), "...and subsequently underwent sequence divergence via LINE1-mediated retrotransposition during evolution" should also be deleted. This section (L219-L285) needs to deal only with the origin of miR-506 from MER91C DNA transposons, which is both convincing and informative.

      Reply: Agreed, the corresponding sentences were deleted.

      Fig. 3A: can you speculate/discuss why the miR-506 expression in sperm is higher than in round spermatids?

      Reply: RNAs are much less abundant in sperm than in somatic or spermatogenic cells (~1/100). Sperm-borne small RNAs represent a small fraction of total small RNAs expressed in their precursor spermatogenic cells, including spermatocytes and spermatids. Therefore, when the same amount of total/small RNA is used for quantitative analyses, sperm-borne small RNAs (e.g., miR-506 family miRNAs) would be proportionally enriched in sperm compared to other spermatogenic cells. We discussed this point in the text (Lines 550-556).
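
      The normalisation argument in this reply can be made concrete with a small worked example. All quantities below are hypothetical arbitrary units chosen only to show the arithmetic; they are not measurements from the study. The assumption is that sperm retain about 1/100 of the total small RNA pool but a larger share of one particular miRNA.

```python
def fraction_of_pool(mirna_amount, total_small_rna):
    """Share of the small RNA pool contributed by one miRNA."""
    return mirna_amount / total_small_rna


# Hypothetical round spermatid: 100 units of small RNA, 2 of them the miRNA.
spermatid_fraction = fraction_of_pool(2.0, 100.0)

# Hypothetical sperm: 1 unit of small RNA retained, 0.1 of it the miRNA.
sperm_fraction = fraction_of_pool(0.1, 1.0)

# With equal RNA input per assay, the measured signal tracks the fraction,
# so the miRNA appears enriched although its absolute amount has dropped.
apparent_enrichment = sperm_fraction / spermatid_fraction
```

      In this toy case the absolute amount of the miRNA falls twenty-fold while its share of the pool rises, which is the sense in which equal-input quantification can report higher expression in sperm.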

      Reviewer #2 (Public Review):

      In this paper, Wang and collaborators characterize the rapid evolution of the X-linked miR-506 cluster in mammals and characterize the functional reference of depleting a few or most of the miRNAs in the cluster. The authors show that the cluster originated from the MER91C DNA transposon and provide some evidence that it might have expanded through the retrotransposition of adjacent LINE1s. Although the animals depleted of most miRNAs in the cluster show normal sperm parameters, the authors observed a small but significant reduction in litter size. The authors then speculate that the depletion of most miRNAs in the cluster could impair sperm competitiveness in polyandrous mating. Using a successive mating protocol, they show that, indeed, sperm lacking most X-linked miR-506 family members is outcompeted by wild-type sperm. The authors then analyze the evolution of the miR-506 cluster and its predicted targets. They conclude that the main difference between mice and humans is the expansion of the number of target sites per transcript in humans.

      The conclusions of the paper are, in most cases, supported by the data; however, a more precise and in-depth analysis would have helped build a more convincing argument in most cases.

      (1) In the abstracts and throughout the manuscript, the authors claim that "... these X-linked miRNA-506 family miRNA [...] have gained more targets [...] " while comparing the human miRNA-506 family to the mouse. An alternative possibility is that the mouse has lost some targets. A proper analysis would entail determining the number of targets in the mouse and human common ancestor.

      Reply: This question alerted us that we did not describe our conclusion accurately, causing confusion for this reviewer. Our data suggest that although the sheer number of target genes remains the same between humans and mice, the human X-linked miR-506 family targets a greater number of genes than the murine counterpart on a per miRNA basis. In other words, mice never lost any targets compared to humans, but each miR-506 family miRNA tends to target more genes in humans than in mice.

      We revised the text to more accurately report our data. The pertaining text (lines 490-508) now reads: “Furthermore, we analyzed the number of all potential targets of the miR-506 family miRNAs predicted by the aforementioned four algorithms among humans, mice, and rats. The total number of targets for all the X-linked miR-506 family miRNAs among different species did not show significant enrichment in humans (Fig. S9C), suggesting the sheer number of target genes does not increase in humans. We then compared the number of target genes per miRNA. When comparing the number of target genes per miRNA for all the miRNAs (baseline) between humans and mice, we found that on a per miRNA basis, human miRNAs have more targets than murine miRNAs (p<0.05, t-test) (Fig. S9D), consistent with higher biological complexity in humans. This became even more obvious for the X-linked miR-506 family (p<0.05, t-test) (Fig. S9D). In humans, the X-linked miR-506 family, on a per miRNA basis, targets a significantly greater number of genes than the average of all miRNAs combined (p<0.05, t-test) (Fig. S9D). In contrast, in mice, we observed no significant difference in the number of targets per miRNA between X-linked miRNAs and all of the mouse miRNAs combined (mouse baseline) (Fig. S9D). These results suggest that although the sheer number of target genes remains the same between humans and mice, the human X-linked miR-506 family targets a greater number of genes than the murine counterpart on a per miRNA basis.”
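
      The distinction between the total number of targets and the number of targets per miRNA can be illustrated with a toy example. The miRNA and gene names below are placeholders, not real family members or targets, and the statistical test reported in the text is deliberately left out of this sketch.

```python
from statistics import mean


def total_targets(targets_by_mirna):
    """Number of distinct genes targeted by any miRNA in the family."""
    return len(set().union(*targets_by_mirna.values()))


def targets_per_mirna(targets_by_mirna):
    """Average number of target genes per individual miRNA."""
    return mean(len(genes) for genes in targets_by_mirna.values())


# Placeholder families: same three target genes overall in both species.
human = {
    "miR-h1": {"G1", "G2", "G3"},
    "miR-h2": {"G1", "G2", "G3"},
}
mouse = {
    "miR-m1": {"G1"},
    "miR-m2": {"G2"},
    "miR-m3": {"G3"},
}
```

      Here both families cover the same three genes in total, yet each placeholder human miRNA targets three genes while each mouse miRNA targets one, which is the pattern the revised text describes.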

      We also changed “have gained” to “have” throughout the text to avoid confusion.

      (2) The authors claim that the miRNA cluster expanded through L1 retrotransposition. However, the possibility of an early expansion of the cluster before the divergence of the species while the MER91C DNA transposon was active was not evaluated. Although L1 likely contributed to the diversity within mammals, the generalization may not apply to all species. For example, SINEs are closer on average than L1s to the miRNAs in the SmiR subcluster in humans and dogs, and the horse SmiR subcluster seems to have expanded by a TE-independent mechanism.

      Reply: Agreed. We deleted the data mentioned by this reviewer.

      (3) Some results are difficult to reconcile and would have benefited from further discussion. The miR-465 sKO has over two thousand differentially expressed transcripts and no apparent phenotype. Also, the authors show a sharp downregulation of CRISP1 at the RNA and protein level in the mouse. However, most miRNAs of the cluster increase the expression of Crisp1 on a reporter assay. The only one with a negative impact has a very mild effect. miRNAs are typically associated with target repression; however, most of the miRNAs analyzed in this study activate transcript expression.

      Reply: Both mRNA and protein levels of Crisp1 were downregulated in KO mice, and these results are consistent with the luciferase data showing overexpression of these miRNAs upregulated the Crisp1 3’UTR luciferase activity. We agree that miRNAs usually repress target gene expression. However, numerous studies have also shown that some miRNAs, such as human miR-369-3, Let-7, and miR-373, mouse miR-34/449 and the miR-506 family, and the synthetic miRNA miRcxcr4, activate gene expression both in vitro (1, 2) and in vivo (3-6). Earlier reports have shown that these miRNAs can upregulate their target gene expression, either by recruiting FXR1, targeting promoters, or sequestering RNA subcellular locations (1, 2, 6). We briefly discussed this in the text (Lines 605-611).

      (4) More information is required to interpret the results of the differential RNA targeting by the murine and human miRNA-506 family. The materials and methods section needs to explain how the authors select their putative targets. In the text, they mention the use of four different prediction programs. Are they considering all sites predicted by any method, all sites predicted simultaneously by all methods, or something in between? Also, what are they considering as a "shared target" between mice and humans? Is it a mRNA that any miR-506 family member is targeting? Is it a mRNA targeted by the same miRNA in both species? Does the targeting need to occur in the same position determined by aligning the different 3'UTRs?

      Reply: Since each prediction method has its merit, we included all putative targets predicted by any of the four methods. The "shared target" refers to a mRNA that any miR-506 family member targets because the miR-506 family is highly divergent among different species. We have added the information to the “Large and small RNA-seq data analysis” section in Materials and Methods (Lines 871-882).
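
      The two definitions given in this reply (a target is any gene predicted by at least one of the four methods; a shared target is any mRNA targeted by any family member in both species) can be sketched as set operations. The tool labels, miRNA names, and gene names are placeholders, and the sketch assumes genes have already been mapped to common ortholog identifiers.

```python
def predicted_targets(predictions_by_tool):
    """Union across tools: a gene counts if any one tool predicts it."""
    targets = set()
    for genes in predictions_by_tool.values():
        targets |= genes
    return targets


def family_targets(predictions_by_mirna):
    """Union across family members of their per-miRNA target sets."""
    targets = set()
    for by_tool in predictions_by_mirna.values():
        targets |= predicted_targets(by_tool)
    return targets


def shared_targets(human, mouse):
    """Genes targeted by any family member in both species."""
    return family_targets(human) & family_targets(mouse)


# Placeholder predictions keyed by miRNA, then by prediction tool.
human = {
    "miR-A": {"tool_1": {"GENE1", "GENE2"}, "tool_2": {"GENE3"}},
    "miR-B": {"tool_3": {"GENE4"}},
}
mouse = {
    "miR-C": {"tool_1": {"GENE2"}, "tool_4": {"GENE4", "GENE5"}},
}
```

      Taking the union rather than the intersection of tools maximises sensitivity at the cost of specificity, and defining shared targets at the family level avoids requiring the same miRNA to exist in both species.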

      (5) The authors highlight the particular evolution of the cluster derived from a transposable element. Given the tendency of transposable elements to be expressed in germ cells, the family might have originated to repress the expression of the elements while still active but then remained to control the expression of the genes where the element had been inserted. The authors did not evaluate the expression of transcripts containing the transposable element or discuss this possibility. The authors proposed an expansion of the target sites in humans. However, whether this expansion was associated with the expansion of the TE in humans was not discussed either. Clarifying whether the transposable element was still active after the divergence of the mouse and human lineages would have been informative to address this outstanding issue.

      Reply: Agreed. The MER91C DNA transposon is denoted as nonautonomous (7); however, whether it was active during the divergence of mouse and human lineages is unknown. To determine whether the expansion of the target sites in humans was due to the expansion of the MER91C DNA transposon, we analyzed the MER91C DNA transposon-containing transcripts and associated them with our DETs. Of interest, 28 human and 3 mouse mRNAs possess 3’UTRs containing MER91C DNA sequences, and only 3 and 0 out of those 28 and 3 genes belonged to DETs in humans and mice, respectively (Fig. S9E), suggesting a minimal effect of MER91C DNA transposon expansion on the number of target sites. We briefly discussed this in the text (Lines 511-518).

      Post-transcriptional regulation is exceptionally complex in male haploid cells, and the functional relevance of many regulatory pathways remains unclear. This manuscript, together with recent findings on the role of piRNA clusters, starts to clarify the nature of the selective pressure that shapes the evolution of small RNA pathways in the male germ line.

      Reply: Agreed. We appreciate your insightful comments.

      Reviewer #3 (Public Review):

      Summary:

      In this manuscript, the authors conducted a comprehensive study of the X-linked miR-506 family miRNAs in mice on its origin, evolution, expression, and function. They demonstrate that the X-linked miR-506 family, predominantly expressed in the testis, may be derived from MER91C DNA transposons and further expanded by retrotransposition. By genetic deletion of different combinations of 5 major clusters of this miRNA family in mice, they found these miRNAs are not required for spermatogenesis. However, by further examination, the mutant mice show a mild fertility problem and inferior sperm competitiveness. The authors conclude that the X-linked miR-506 miRNAs fine-tune spermatogenesis to enhance sperm competition.

      Strengths:

      This is a comprehensive study with extensive computational and genetic dissection of the X-linked miR-506 family providing a holistic view of its evolution and function in mice. The finding that this family of miRNAs could enhance sperm competition is interesting and could explain their roles in fine-tuning germ cell gene expression to regulate reproductive fitness.

      Weaknesses:

      The authors specifically addressed the function of 5 clusters of the X-linked miR-506 family containing 19 miRNAs. There is another small cluster containing 3 miRNAs close to the Fmr1 locus. Would this small cluster act in concert with the 5 clusters to regulate spermatogenesis? In addition, any autosomal miR-506-like miRNAs may compensate for the loss of the X-linked miR-506 family. These possibilities should be discussed.

      Reply: The three FmiRs were not deleted in this study because the SmiRs are much more abundant than the FmiRs in WT mice (Author Response image 1, heatmap version of Fig. 5C). Based on small RNA-seq, some FmiRs, e.g., miR-201 and miR-547, were upregulated in the SmiRs KO mice, suggesting that this small cluster may act in concert with the other 5 clusters and is thus worth further investigation. To the best of our knowledge, all the miR-506 family miRNAs are located on the X chromosome; although some other miRNAs were upregulated in the KO mice, they do not belong to the miR-506 family. We briefly discussed this point in the text (Lines 635-638).

      Author response image 1.

      sRNA-seq of WT and miR-506 family KO testis samples.

      Direct molecular link to sperm competitiveness defect remains unclear but is difficult to address.

      Reply: In this study, we identified a target of the miR-506 family, i.e. Crisp1. KO of Crisp1 in mice, or inhibition of CRISP1 in human sperm (7, 8), appears to phenocopy the quinKO mice, displaying largely normal sperm motility but compromised ability to penetrate eggs. The detailed mechanism warrants further investigation in the future.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Lines 84-85: "Several cellular events are unique to the male germ cells, e.g., meiosis, genetic recombination, and haploid male germ cell differentiation (also called spermiogenesis)". This statement is not accurate. Please revise. Meiosis and genetic recombination are common to both male and female germ cells. They are highly conserved in both sexes in many species including mouse.

      Reply: Agreed. We have revised the sentence and it now reads: “Several cellular events are unique to the male germ cells, e.g., postnatal formation of the adult male germline stem cells (i.e., spermatogonia stem cells), pubertal onset of meiosis, and haploid male germ cell differentiation (also called spermiogenesis) (9)” (Lines 83-86).

      Lines 163-164: "we found that Slitrk2 and Fmr1 were syntenically linked to autosomes in zebrafish and birds (Fig. 1A), but had migrated onto the X chromosome in most mammals". This description is not accurate. Chr 4 in zebrafish and birds is syntenic to the X chromosome in mammals. The term "migrated" is not appropriate. Suggestion: Slitrk2 and Fmr1 mapped to Chr 4 (syntenic with mammalian X chromosome) in zebrafish and birds but to the X chromosome in most mammals.

      Reply: Agreed. Revised as suggested.

      Reviewer #2 (Recommendations For The Authors):

      (1) In the significance statement, the authors mention that the mutants are "functionally infertile," although the decrease in competitiveness is partial. I suggest referring to them as "functionally sub-fertile."

      Reply: Agreed. Revised as suggested.

      (2) I will urge the authors to explain in more detail how some figures are generated and what they mean. Some critical information needs to be included in various panels.

      (2a) Figure S1. The phastCons track does not seem to align as expected with the rest of the figure. The highest conservation peak is only present in humans, and the sequence conserved in the sea turtle has the lowest phastCons score. I was expecting the opposite from the explanation.

      Reply: The tracks for phyloP and phastCons are the scores for all 100 species, whereas the tracks with the species names on the left are the corresponding sequences aligned to the human genome. We have revised our figure to make it clearer.

      (2b) Figure 2A and Figure S2C. Although all the functional analysis of the manuscript has been done in mice, the alignments showing sequence conservation do not include the murine miRNAs. Please include the mouse miRNAs in these panels.

      Reply: The mouse has Mir-506-P7 with the conserved miRNA-3P seed region, which was included in the lower panel in Figure S2C. However, mice do not have Mir-506-P6, which may have been lost or become too divergent to be recognized during evolution, and it was therefore not included in Figure 2A or the upper panel of Figure S2C.

      (2c) Figure S7H. The panel could be easier to read.

      Reply: Agreed. We combined all the same groups and turned Figure S7H (now Figure S6H) into a heatmap.

      (2d) The legend of Figure 6G reads, "The number of target sites within individual target mRNAs in both humans and mice ." Can the author explain why the value 1 of the human "Number of target sites" is connected to virtually all the "Number of target sites" values in mice?

      Reply: Sorry for the confusion. For example, for gene 1, we have 1 target site in the human and 1 target site in the mouse; but for gene 2, we have 1 target site in the human and multiple sites in the mouse; therefore, the value 1 is connected to more than one value in the mouse.
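
      The explanation in this reply can be illustrated with a minimal sketch that groups genes by their number of human target sites and collects the mouse values each group connects to. The gene labels and site counts are invented for illustration and do not come from Figure 6G.

```python
from collections import defaultdict


def connections(site_counts):
    """Map each human site count to the mouse site counts it links to.

    `site_counts` maps a gene to (human_sites, mouse_sites).
    """
    links = defaultdict(set)
    for human_sites, mouse_sites in site_counts.values():
        links[human_sites].add(mouse_sites)
    return {human: sorted(mouse) for human, mouse in links.items()}


# Hypothetical genes: three have one human site but differ in mouse.
example = {
    "gene_1": (1, 1),
    "gene_2": (1, 3),
    "gene_3": (1, 2),
    "gene_4": (2, 1),
}
links = connections(example)
```

      Because many genes have a single site in humans but varying numbers of sites in mice, the value 1 on the human side fans out to several mouse values in such a plot.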

      Reviewer #3 (Recommendations For The Authors):

      CRISP1 and EGR1 protein localization in WT and mutant sperm by immunostaining would be helpful.

      Reply: Agreed. We performed immunostaining for CRISP1 on WT sperm, and the new results are presented in Figure S8D. CRISP1 seems mainly expressed in the principal piece and head of sperm.

      The detailed description of the generation of various mutant lines should be included in the Methods.

      Reply: We added more details on the generation of knockout lines in the Materials and Methods (Lines 686-701).

      References:

      (1) S. Vasudevan, Y. Tong, J. A. Steitz, Switching from repression to activation: microRNAs can upregulate translation. Science 318, 1931-1934 (2007).

      (2) R. F. Place, L. C. Li, D. Pookot, E. J. Noonan, R. Dahiya, MicroRNA-373 induces expression of genes with complementary promoter sequences. Proc Natl Acad Sci U S A 105, 1608-1613 (2008).

      (3) Z. Wang et al., X-linked miR-506 family miRNAs promote FMRP expression in mouse spermatogonia. EMBO Rep 21, e49024 (2020).

      (4) S. Yuan et al., Motile cilia of the male reproductive system require miR-34/miR-449 for development and function to generate luminal turbulence. Proc Natl Acad Sci U S A 116, 3584-3593 (2019).

      (5) S. Yuan et al., Oviductal motile cilia are essential for oocyte pickup but dispensable for sperm and embryo transport. Proc Natl Acad Sci U S A 118 (2021).

      (6) M. Guo et al., Uncoupling transcription and translation through miRNA-dependent poly(A) length control in haploid male germ cells. Development 149 (2022).

      (7) V. G. Da Ros et al., Impaired sperm fertilizing ability in mice lacking Cysteine-RIch Secretory Protein 1 (CRISP1). Dev Biol 320, 12-18 (2008).

      (8) J. A. Maldera et al., Human fertilization: epididymal hCRISP1 mediates sperm-zona pellucida binding through its interaction with ZP3. Mol Hum Reprod 20, 341-349 (2014).

      (9) L. Hermo, R. M. Pelletier, D. G. Cyr, C. E. Smith, Surfing the wave, cycle, life history, and genes/proteins expressed by testicular germ cells. Part 1: background to spermatogenesis, spermatogonia, and spermatocytes. Microsc Res Tech 73, 241-278 (2010).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Below, we provide a detailed account of the changes we made. For clarity and ease of review:

      • Original reviewers' comments are included and highlighted in grey

      • Our responses to each comment are written in black text

      • Print screens illustrating the specific changes made to the manuscript are enclosed within black squares

      eLife assessment

      The authors aim to develop a CRISPR system that can be activated upon sensing an RNA. As an initial step to this goal, they describe RNA-sensing guide RNAs for controlled activation of CRISPR modification. Many of the data look convincing and while several steps remain to achieve the stated goal in an in vivo setting and for robust activation by endogenous RNAs, the current work will be important for many in the field.  

      The eLife assessment summarises our ambition to create a CRISPR system controlled by RNA sensing. The synopsis provided encapsulates the essence of our research, emphasising both the progress we have made and the challenges that lie ahead. This assessment fully resonates with our views.

      Public Reviews:

      Reviewer #1 (Public Review):

      This paper describes RNA-sensing guide RNAs for controlled activation of CRISPR modification. This works by having an extended guide RNA with a sequence that folds back onto the targeting sequence such that the guide RNA cannot hybridise to its genomic target. The CRISPR is "activated" by the introduction of another RNA, referred to as a trigger, that competes with this "back folding" to make the guide RNA available for genome targeting. The authors first confirm the efficacy of the approach using several RNA triggers and a GFP reporter that is activated by dCas9 fused to transcriptional activators. A major potential application of this technique is the activation of CRISPR in response to endogenous biomarkers. As these will typically be longer than the first generation triggers employed by the authors they test some extended triggers, which also work though not always to the same extent. They then introduce MODesign which may enable the design of bespoke or improved triggers. After that, they determine that the mode of activation by the RNA trigger involves cleavage of the RNA complexes. Finally, they test the potential for their system to work in a developmental setting - specifically zebrafish embryos. There is some encouraging evidence, though the effects appear more subtle than those originally obtained in cell culture. 

      Overall, the potential of a CRISPR system that can be activated upon sensing an RNA is high and there are a myriad of opportunities and applications for it. This paper represents a reasonable starting point having developed such a system in principle. 

      The weakness of the study is that it does not demonstrate that the system can be used in a completely natural setting. This would require an endogenous transcript as the RNA trigger with a clear readout. Such an experiment would clearly strengthen the paper and provide strong confidence that the method could be employed for one of the major applications discussed by the authors. The zebrafish data relied on exogenous RNA triggers whereas the major applications (as I understood them) would use endogenous triggers. 

      Related, most endogenous RNAs are longer than the various triggers tested and may require extensive modification of the system to be detected or utilised effectively. 

      While additional data would clearly be beneficial, there should nevertheless be a more detailed discussion of these caveats and/or the strengths and applications of the system as it is presented (i.e. utility with synthetic triggers).  

      We agree with the observation regarding the subtler effects in the zebrafish embryos and the reliance on exogenous RNA triggers. Indeed, the utilisation of endogenous transcripts as triggers in a natural setting is a logical next step. We further acknowledge the need to delve deeper into the complexities and challenges of our system, particularly concerning the detection of endogenous RNA, thus offering valuable insights for researchers looking to adapt our system for various applications. In order to clarify these limitations, we made some changes in the final version of our paper. The following paragraphs have been therefore included in the manuscript discussion:

      “In their current iteration, iSBH-sgRNAs show considerable promise for mammalian synthetic biology applications. Specifically, their ability to detect synthetic triggers could be pivotal in the development of complex synthetic RNA circuits and logic gates, thereby advancing the field of cellular reprogramming. However, further work is required to achieve better ON/OFF activation ratios in vivo and more homogeneous activity across tissues in the presence of RNA triggers. Additional chemical modifications could improve iSBH-sgRNA properties, and we believe that chemical modification strategies adopted for siRNA drugs or antisense oligos (Khvorova and Watts (2017)) could also be essential for further iSBH-sgRNA technology development. As iSBH-sgRNAs might be targeted by endogenous nucleases, leading to their degradation, a strategy for preventing this could involve additional chemical modifications. When inserted at certain key positions, such modifications could prevent interaction between iSBH-sgRNAs and cellular enzymes by introducing steric clashes or inhibiting RNA hydrolysis.

      Once achieving superior dynamic ranges of iSBH-sgRNA activation in vivo, the next steps would involve understanding the classes of endogenous RNAs that could act as triggers. The chances that an iSBH-sgRNA encounters an endogenous RNA trigger inside a cell would depend on the relative concentrations of the two RNA species. Therefore, a first step towards determining potential endogenous RNA triggers will involve identifying RNA species with comparable expression levels as iSBH-sgRNAs. Then, iSBH-sgRNAs could be designed against these RNA species, followed by experimental validation. It is important to note that eukaryotic cells express a wide range of transcripts of varying sizes, expression levels, and subcellular localisations, all of which could greatly affect iSBH-sgRNA activation levels. Based on the data presented here, we speculate that RNA species up to 300nt that are also highly expressed might act as good triggers. Furthermore, as sgRNAs are involved in targeting Cas9 to genomic DNA in the nucleus, attempting to detect transcripts that are sequestered in the nucleus might also provide additional benefit.”

      Reviewer #3 (Public Review):

      In this work, the authors describe engineering of sgRNAs that render Cas9 DNA binding controllable by a second RNA trigger. The authors introduce several iterations of their engineered sgRNAs, as well as a computational pipeline to identify designs for user-specified RNA triggers which offers a helpful alternative to purely rational design. Also included is an investigation of the fate of the engineered sgRNAs when introduced into cells, and the use of this information to inform installation of modified nucleotides to improve engineered sgRNA stability. Engineered sgRNAs are demonstrated to be activated by trigger RNAs in both cultured mammalian cells and zebrafish. 

      The conclusions made by the authors in this work are predominantly supported by the data provided. However, some claims are not consistent with the data shown and some of the figures would benefit from revision or further clarification. 

      Strengths: 

      - The sgRNA engineering in this paper is performed and presented in a systematic and logical fashion.

      - Inclusion of a computational method to predict iSBH-sgRNAs adds to the strength of the engineering. 

      - Investigation into the cellular fate of the engineered sgRNAs and the use of this information to guide inclusion of chemically modified nucleotides is also a strength. 

      - Demonstration of activity in both cultured mammalian cells and in zebrafish embryos increases the impact and utility of the technology reported in this work. 

      Weaknesses: 

      - While the methods here represent an important step forward in advancing the technology, they still fall short of the dynamic range and selectivity likely required for robust activation by endogenous RNA.

      - While the iSBH-sgRNAs where the RNA trigger overlaps with the spacer appear to function robustly, the modular iSBH-sgRNAs seem to perform quite a bit less well. The authors state that modular iSBH-sgRNAs show better activity without increasing background when the SAM system is added, but this is not supported by the data shown in Figure 3D, where in 3 out of 4 cases CRISPR activation in the absence of the RNA trigger is substantially increased.

      - There is very little discussion of how the performance of the technology reported in this work compares to previous iterations of RNA-triggered CRISPR systems, of which there are many examples.  

      Concerning the methods falling short of the dynamic range and selectivity required for robust activation by endogenous RNA, we acknowledge this limitation and recognise the need for improvement in this area. In the resubmitted version of the manuscript, we provided a detailed discussion on how the selection of appropriate triggers might partially improve dynamic ranges and selectivity. This includes an exploration of various strategies and considerations that may enhance the robustness of our system (print screen above, also used for addressing Reviewer #1 comments). 

      Regarding the inconsistent performance of the modular iSBH-sgRNAs, we acknowledge that modular iSBH-sgRNAs seem to perform slightly less well than first- and second-generation designs. To illustrate this, we modified the corresponding bar graphs to include fold turn-on iSBH-sgRNA activation in addition to significance (Figures 1, 2 and 3 of the manuscript). We also acknowledge this in the text, recognise the discrepancy in Figure 3.D, and provide further clarifications. To convey this message more clearly, we introduced a new figure (Figure 3 - figure supplement 2) to accompany the heat map shown in Figure 3.D with corresponding bar graphs. These changes are documented below:

      “…promoters. We ran 11 MODesign simulations for each trigger, incrementally extending the loop size while keeping the sgRNA 2 spacer input constant. HEK293T validation experiments showed that choosing modular iSBH-sgRNAs that detect the 4 U6-expressed triggers is possible (Figure 3.D, Figure 3 - figure supplement 1.C). Despite not performing quite as well as second-generation designs (Figure 2.A, Figure 3.D), modular iSBH-sgRNA still enable efficient RNA detection, especially for smaller RNAs such as triggers A and D. For highly efficient designs such as modular iSBH-sgRNA (D), addition of the SAM effector system (Konermann et al. (2015)) boosted ON-state activation with only a negligible increase in the OFF-state non-specific activation. Orthogonality tests suggested that activation of modular iSBH-sgRNA designs was specifically conditioned by complementary RNA triggers (Figure 3.E, Figure 3 - figure supplement 2), showing the exquisite specificity of the system.”

      Author response image 1.

      This supplementary figure reinterprets the data presented in Figure 3.E. using bar plots for enhanced clarity and comparison. It depicts the results of cotransfecting HEK293T cells with four modular iSBH-sgRNAs (A, B, C, and D) and examines all combinations of iSBH-sgRNA: RNA trigger pairings. The bar plots provide a visual representation of mean values with error bars indicating the standard deviation, based on three biological replicates.

      Regarding the concern about the lack of comparison with previous iterations of RNA-triggered CRISPR systems, we now acknowledge other similar technologies in the discussion. We also point readers to a literature review we recently published (doi/full/10.1089/crispr.2022.0052), in which we describe other similar technologies in more detail.

      “To date, a variety of RNA-inducible gRNA designs have been developed (Hanewich-Hollatz et al. (2019); Hochrein et al. (2021); Jakimo et al. (2018); Jiao et al. (2021); Jin et al. (2019); Li et al. (2019); Liu et al. (2022); Lin et al. (2020); Siu and Chen (2019); Galizi et al. (2020); Hunt and Chen (2022b,a); Ying et al. (2020); Choi et al. (2023)). Nevertheless, there is a lack of direct, head-to-head comparisons of these designs under standardised experimental conditions. Some designs were evaluated in vitro, others in bacterial systems, and some in mammalian cells. Consequently, it is challenging to conclusively determine which design exhibits superior properties (Pelea et al. (2022)). Notably, to the best of our knowledge, the iSBH-sgRNA system is the first RNA-inducible gRNA design tested in vivo and characterising the iSBH-sgRNA activation mechanism was essential for implementing iSBH-sgRNA technology in zebrafish embryos. In vivo, chemical modifications in the spacer sequence were vital for iSBH-sgRNA stability and function.”

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      In this study, the authors attempt to describe alterations in gene expression, protein expression, and protein phosphorylation as a consequence of chronic adenylyl cyclase 8 overexpression in a mouse model. This model is claimed to have resilience to cardiac stress.

      Major strengths of the study include 1) the large dataset generated which will have utility for further scientific inquiry for the authors and others in the field, 2) the innovative approach of using cross-analyses linking transcriptomic data to proteomic and phosphoproteomic data. One weakness is the lack of a focused question and clear relevance to human disease. These are all critical biological pathways that the authors are studying and essentially, they have compiled a database that could be surveyed to generate and test future hypotheses.

      Thank you for your efforts to review our manuscript. We are delighted to learn that you found our approach of linking transcriptomic, proteomic and phosphoproteomic data in our analysis to be innovative. Your comment that we have not focused on a question with clear relevance to human disease is “right on point!”

      During chronic pathophysiologic states, e.g., chronic heart failure (CHF) in humans, AC/cAMP/PKA/Ca2+ signaling increases progressively as the degree of heart failure progresses, leading to cardiac inflammation, mediated in part by cyclic-AMP-induced up-regulation of renin-angiotensin system (RAS) signaling. Standard therapies for CHF include β-adrenoreceptor blockers and RAS inhibitors, which, although effective, are suboptimal in ameliorating heart failure progression. One strategy to devise novel and better therapies for heart failure would be to uncover the full spectrum of concentric cardio-protective adaptations that become activated in response to severe, chronic AC/cAMP/PKA/Ca2+-induced cardiac stress.

      In our prior study (https://elifesciences.org/articles/80949v1) of the mouse harboring cardiac-specific overexpression of adenylyl cyclase type 8 (TGAC8), we employed unbiased omics analyses and identified more than 2,000 transcripts and proteins, comprising a broad array of biological processes across multiple cellular compartments, that differed in TGAC8 left ventricle compared to WT. These bioinformatic analyses revealed that marked overexpression of AC8 engages complex, concentric adaptation "circuitry" that has evolved in mammalian cells to confer resilience to stressors that threaten health or life. The main human disease category identified in these analyses was Organismal Injury and Abnormalities, suggesting that defenses against stress were activated, as would be expected, in response to cardiac stress. Specific concentric signaling pathways that were enriched and activated within the TGAC8 protection circuitry included cell survival initiation, protection from apoptosis, proliferation, prevention of cardiac-myocyte hypertrophy, increased protein synthesis and quality control, increased inflammatory and immune responses, facilitation of tissue damage repair and regeneration and increased aerobic energetics. These TGAC8 stress response circuits resemble many adaptive mechanisms that occur in response to the stress of disease states and may be of biological significance to allow for proper healing in disease states such as myocardial infarction or failure of the heart. The main human cardiac diseases identified in bioinformatic analyses were multiple types of cardiomyopathies, again suggesting that mechanisms that confer resilience to the stress of chronic increased AC-PKA-Ca2+ signaling are activated in the absence of heart failure in the super-performing TGAC8 heart at 3 months of age.

      In the present study, we performed a comprehensive in silico analysis of transcription, translation, and post-translational patterns, seeking to discover whether the coordinated transcriptome and proteome regulation of the adaptive protective circuitry within the AC8 heart that is common to many types of cardiac disease states identified in our previous study (https://elifesciences.org/articles/80949v1) extends to the phosphoproteome.

      Reviewer #2 (Public Review):

      In this study, the investigators describe an unbiased phosphoproteomic analysis of cardiac-specific overexpression of adenylyl cyclase type 8 (TGAC8) mice that was then integrated with transcriptomic and proteomic data. The phosphoproteomic analysis was performed using tandem mass tag-labeling mass spectrometry of left ventricular (LV) tissue in TGAC8 and wild-type mice. The initial principal component analysis showed differences between the TGAC8 and WT groups. The integrated analysis demonstrated that many stress-response, immune, and metabolic signaling pathways were activated at transcriptional, translational, and/or post-translational levels.

      The authors are to be commended for a well-conducted study with quality control steps described for the various analyses. The rationale for following up on prior transcriptomic and proteomic analyses is described. The analysis appears thorough and well-integrated with the group's prior work. Confirmational data using Western blot is provided to support their conclusions. Their findings have the potential of identifying novel pathways involved in cardiac performance and cardioprotection.

      Thank you for your efforts to review our manuscript. We are delighted that you found our work to be well conducted, that our analysis was thorough and well integrated with our prior work in this arena, and that our findings have the potential of identifying novel pathways involved in cardiac performance and cardioprotection.

      Reviewer #1 (Recommendations For The Authors):

      I humbly suggest that the authors reconsider the title, as it could be more clear as to what they are studying. Are the authors trying to highlight pathways related to cardiac resilience? Resilience might be a clearer word than "performance and protection circuitry".

      Thank you for this important comment. We have revised the title accordingly: Reprogramming of cardiac phosphoproteome, proteome and transcriptome confers resilience to chronic adenylyl cyclase-driven stress.

      Perhaps the text can be reviewed in detail by a copy-editor, as there are many grammatically 'awkward' elements (for example, line 56: "mammalians" instead of mammals), inappropriate colloquialisms (for example, line 73: "port-of-call"), and stylistic unevenness that make it difficult to read.

      We have reviewed the text in detail, with the assistance of a copy editor, in order to identify and correct awkward elements and to search for other colloquialisms. Finally, although the “stylistic unevenness” to which you refer may be difficult for us to identify during our re-edits, we have tried our best to identify and revise such passages.

      The best-written sections are the first few paragraphs of the discussion section, which finally clarify why the TGAC8 mouse is important in understanding cardiac resilience to stress and how the present study leverages this model to disentangle the biological processes underlying the resilience. I wish this had been presented in this manner earlier in the paper, (in the abstract and introduction) so I could have had a clearer context in which to interpret the data. It would also be helpful to point out whether the TGAC8 mouse has any correlates with human disease.

      Thank you for this very important comment. Well put! In addition to recasting the title to include the concept of resilience, we have revised both the abstract and introduction to feature what you consider to be important to the understanding of cardiac resilience to stress, and how the present study leverages this model to disentangle the biological processes underlying the resilience.

      Reviewer #2 (Recommendations For The Authors):

      1. How were the cutoffs determined to distinguish between upregulated/downregulated phosphoproteins and phosphopeptides?

      Thank you for this important question. We used the same criteria to distinguish differences between TGAC8 and WT for unnormalized and normalized phosphoproteins, -log10(p-value) > 1.3, and log2FoldChange <= -0.4 (down) or log2FoldChange >= 0.4 (up), as stated in the methods section, main text and figure legend. The results were consistent across all analyses and selectively verified by experiments.
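      The cutoff rule quoted in this response can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the authors' analysis code: the function name `classify` and the example values are hypothetical, and the rule is applied to a single (p-value, log2FoldChange) pair at a time.

```python
import math

P_CUTOFF = 1.3    # threshold on -log10(p-value), as stated in the response
LFC_CUTOFF = 0.4  # threshold on |log2FoldChange|, as stated in the response

def classify(p_value, log2_fold_change):
    """Label one phosphoprotein as 'up', 'down', or 'unchanged' (hypothetical helper)."""
    if p_value <= 0:
        raise ValueError("p_value must be positive")
    significant = -math.log10(p_value) > P_CUTOFF
    if significant and log2_fold_change >= LFC_CUTOFF:
        return "up"
    if significant and log2_fold_change <= -LFC_CUTOFF:
        return "down"
    return "unchanged"

# Made-up (p-value, log2FoldChange) pairs for illustration; not data from the study.
examples = {
    "protein_A": (0.01, 0.8),
    "protein_B": (0.01, -0.5),
    "protein_C": (0.20, 1.2),
    "protein_D": (0.01, 0.1),
}
labels = {name: classify(p, lfc) for name, (p, lfc) in examples.items()}
```

      A molecule must pass both the significance and the fold-change threshold to be called changed, so a large fold change with a weak p-value (protein_C above) stays "unchanged".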

      1. Were other models assessed for correlation between transcriptome and phosphoproteome other than a linear relationship of log2 fold change?

      Thank you for this comment. In addition to a linear relationship of log2 fold change of molecule expression, we also compared protein activities, e.g., Fig 4F, and pathways enriched from different omics, e.g., Fig 3D, 5J, 6B and 6F.

      1. Figures 1A and 5G seem to show outliers. How many biological and technical replicates would be needed to minimize error?

      Thank you for the question. Figures 1A and 5G are PCA plots which, as expected, show some variability among samples of the same genotype. The PCA plots are nevertheless useful for determining how the identified items separate, both within and among genotypes. For a bioinformatics analysis such as ours, 4-5 samples are sufficient for this purpose, as demonstrated by the separation of samples by genotype in the PCA. Thus, in addition to revealing true heterogeneity among the samples, our results still robustly identify the true differences between the genotypes.

      1. Were the up/downregulated genes more likely to be lowly expressed (which would lead to larger log2 changes identified)?

      In response to your query, we calculated the average phosphorylation levels across all samples to observe whether the changed molecules were present at low abundance in all samples. We also generated MA plots, an application of the Bland–Altman plot, to create a visual representation of the omics data. The MA plots in Author response image 1 illustrate that the target molecules with significantly changed phosphorylation levels did not cluster in the very low abundance range. To confirm this conclusion, we adopted two sets of cutoffs: (1) change: -log10(p-value) > 1.3, and log2FoldChange < 0 (down) or log2FoldChange > 0 (up); and (2) change_2: -log10(p-value) > 1.3, and log2FoldChange <= -0.4 (down) or log2FoldChange >= 0.4 (up).

      Author response image 1.
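      For readers unfamiliar with MA plots, the two coordinates can be computed as in the sketch below. It uses the conventional definitions (M is the log2 ratio of group means, A is the average of the two log2 group means); the response does not spell out the exact computation, so the function `ma_values` and its inputs are assumptions for illustration only.

```python
import math

def ma_values(wt_intensities, tg_intensities):
    """Return (M, A) for one molecule from raw, positive intensities (hypothetical helper).

    M = log2(mean TG) - log2(mean WT)          -> fold change on the y-axis
    A = 0.5 * (log2(mean TG) + log2(mean WT))  -> average abundance on the x-axis
    """
    mean_wt = sum(wt_intensities) / len(wt_intensities)
    mean_tg = sum(tg_intensities) / len(tg_intensities)
    log_wt = math.log2(mean_wt)
    log_tg = math.log2(mean_tg)
    return log_tg - log_wt, 0.5 * (log_tg + log_wt)

# Made-up intensities for a single molecule; not data from the study.
m, a = ma_values([4.0, 4.0], [16.0, 16.0])
```

      If significantly changed molecules were mostly low-abundance, they would pile up at small A values; a spread across the A axis argues against that.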

      1. "We verified some results through wet lab experiments" in the abstract is vague.

      Thank you for the good suggestion. What we meant to indicate here was that the genotypic differences in selected proteins, phosphoproteins and RNAs discovered in the omics analyses were verified by western blots, protein synthesis detection, proteasome activity detection, and detection of soluble and insoluble protein fractions. However, we have deleted the reference to the wet lab experiments in the revised manuscript.

      1. There are minor syntactical errors throughout the text.

      Thank you very much for the suggestion. As noted in our response, we have edited and revised those errors throughout the text.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      The objective of this study was to infer the population dynamics (rates of differentiation, division, and loss) and lineage relationships of clonally expanding NK cell subsets during an acute immune response. 

      Strengths: 

      A rich dataset and thorough analysis of a particular class of stochastic models. 

      We thank the reviewer for the positive comment.

      Weaknesses: 

      The stochastic models used are quite simple; each population is considered homogeneous with first-order rates of division, death, and differentiation. In Markov process models such as these, there is no dependence of cellular behavior on its history of divisions. In recent years models of clonal expansion and diversification, in the settings of T and B cells, have progressed beyond this picture. So I was a little surprised that there was no mention of the literature exploring the role of replicative history in differentiation (e.g. Bresser Nat Imm 2022), nor of the notion of family 'division destinies' (either in division number or the time spent proliferating, as described by the Cyton and Cyton2 models developed by Hodgkin and collaborators; e.g. Heinzel Nat Imm 2017). The emerging view is that variability in clone (family) size may arise predominantly from the signals delivered at activation, which dictate each precursor's subsequent degree of expansion, rather than from the fluctuations deriving from division and death modeled as Poisson processes. 

      As you pointed out, the Gerlach and Buchholz Science papers showed evidence for highly skewed distributions of family sizes and correlations between family size and phenotypic composition. Is it possible that your observed correlations could arise if the propensity for immature CD27+ cells to differentiate into mature CD27- cells increases with division number? The relative frequency of the two populations would then also be impacted by differences in the division rates of each subset - one would need to explore this. But depending on the dependence of the differentiation rate on division number, there may be parameter regimes (and time points) at which the more differentiated cells can predominate within large clones even if they divide more slowly than their immature precursors. One might not then be able to rule out the two-state model. I would like to see a discussion or rebuttal of these issues. 

      We thank the reviewer for the insightful comment and for drawing our attention to the Cyton models. We have discussed the Cyton models in the Introduction (lines 80-95) and the Discussion (lines 538-553) sections of the revised manuscript and carried out simulations for the variant of the Cyton model suggested by the reviewer. The two-state model showed that, for certain parameters, it can give rise to a negative correlation between the clone size and the percentage of immature (CD27+) NK cells in the absence of any death, suggesting the potential importance of division destiny, along with stochastic fluctuations, in giving rise to the heterogeneity observed in NK cell clone size distributions in the expansion phase. In addition, we considered a two-state model in which the activation time of individual NK cells varies following a log-normal distribution; this two-state model also shows negative correlations between clone sizes and the percentage of immature NK cells within the clones. We have added new results (Figs. S2-3) and discussed them in the Results (lines 223-232) and the Discussion (lines 538-553) sections. We believe these additional simulations provide new insights into the results we obtained with our two- and three-state models. 
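      The kind of two-state Markov model at issue in this exchange (first-order division, differentiation and death) can be simulated with a standard Gillespie algorithm, as in the minimal sketch below. It is not the authors' code and does not implement the Cyton variant or the log-normal activation times; the rate names in the `rates` dictionary and all numerical values are hypothetical. Each clone starts from a single immature (CD27+) cell, and the correlation between clone size and the percentage of immature cells is computed over surviving clones.

```python
import math
import random

def simulate_clone(rates, t_end, rng):
    """Gillespie simulation of one clone; returns (n_immature, n_mature) at t_end."""
    n_imm, n_mat, t = 1, 0, 0.0
    while True:
        propensities = [
            rates["div_imm"] * n_imm,    # immature cell divides
            rates["diff"] * n_imm,       # immature -> mature
            rates["death_imm"] * n_imm,  # immature cell dies
            rates["div_mat"] * n_mat,    # mature cell divides
            rates["death_mat"] * n_mat,  # mature cell dies
        ]
        total = sum(propensities)
        if total == 0.0:
            break  # no reaction can occur (e.g. clone extinct)
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_end:
            break
        event = rng.choices(range(5), weights=propensities)[0]
        if event == 0:
            n_imm += 1
        elif event == 1:
            n_imm -= 1
            n_mat += 1
        elif event == 2:
            n_imm -= 1
        elif event == 3:
            n_mat += 1
        else:
            n_mat -= 1
    return n_imm, n_mat

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def clone_size_correlation(rates, t_end, n_clones, seed):
    """Correlation between clone size and % immature cells over surviving clones."""
    rng = random.Random(seed)
    sizes, pct_immature = [], []
    for _ in range(n_clones):
        n_imm, n_mat = simulate_clone(rates, t_end, rng)
        size = n_imm + n_mat
        if size > 0:  # extinct clones have no defined composition
            sizes.append(size)
            pct_immature.append(100.0 * n_imm / size)
    return pearson(sizes, pct_immature)

# Hypothetical rates (per day), chosen only to make the sketch run quickly:
# rates = {"div_imm": 1.0, "diff": 0.5, "death_imm": 0.0, "div_mat": 0.5, "death_mat": 0.2}
# r = clone_size_correlation(rates, t_end=4.0, n_clones=200, seed=0)
```

      Scanning the rates in such a simulation is one way to see which parameter regimes yield a negative correlation; adding a division-number-dependent differentiation rate would require tracking each cell's generation, which this sketch deliberately omits.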

      Reviewer #2 (Public review): 

      Summary: 

      Wethington et al. investigated the mechanistic principles underlying antigen-specific proliferation and memory formation in mouse natural killer (NK) cells following exposure to mouse cytomegalovirus (MCMV), a phenomenon predominantly associated with CD8+ T cells. Using a rigorous stochastic modeling approach, the authors aimed to develop a quantitative model of NK cell clonal dynamics during MCMV infection. 

      Initially, they proposed a two-state linear model to explain the composition of NK cell clones originating from a single immature Ly49+CD27+ NK cell at 8 days post-infection (dpi). Through stochastic simulations and analytical investigations, they demonstrated that a variant of the two-state model incorporating NK cell death could explain the observed negative correlation between NK clone sizes at 8 dpi and the percentage of immature (CD27+) NK cells (Page 8, Figure 1e, Supplementary Text 1). However, this two-state model failed to accurately reproduce the first (mean) and second (variance and covariance) moments of the measured CD27+ and CD27- NK cell populations within clones at 8 dpi (Figure 1g). 

      To address this limitation, the authors increased the model's complexity by introducing an intermediate maturation state, resulting in a three-stage model with the transition scheme: CD27+Ly6C- → CD27-Ly6C- → CD27-Ly6C+. This three-stage model quantitatively fits the first and second moments under two key constraints: (i) immature CD27+ NK cells exhibit faster proliferation than CD27- NK cells, and (ii) there is a negative correlation (upper bound: -0.2) between clone size and the fraction of CD27+ cells. The model predicted a high proliferation rate for the intermediate stage and a high death rate for the mature CD27-Ly6C+ cells. 

      Using NK cell reporter mice data from Adams et al. (2021), which tracked CD27+/- cell population dynamics following tamoxifen treatment, the authors validated the three-stage model. This dataset allowed discrimination between NK cells originating from the bone marrow and those pre-existing in peripheral blood at the onset of infection. To test the prediction that mature CD27- NK cells have a higher death rate, the authors measured Ly49H+ NK cell viability in the mice spleen at different time points post-MCMV infection. Experimental data confirmed that mature (CD27-) NK cells exhibited lower viability compared to immature (CD27+) NK cells during the expansion phase (days 4-8 post-infection). 

      Further mathematical analyses using a variant of the three-stage model supported the hypothesis that the higher death rate of mature CD27- cells contributes to a larger proportion of CD27- cells in the dead cell compartment, as introduced in the new variant model. 

      Altogether, the authors proposed a three-stage quantitative model of antigen-specific expansion and maturation of naïve Ly49H+ NK cells in mice. This model delineates a maturation trajectory: (i) CD27+Ly6C- (immature) → (ii) CD27-Ly6C- (mature I) → (iii) CD27-Ly6C+ (mature II). The findings highlight the highly proliferative nature of the mature I (CD27-Ly6C-) phenotype and the increased cell death rate characteristic of the mature II (CD27-Ly6C+) phenotype. 

      Strengths: 

      By designing models capable of explaining correlations, first and second moments, and employing analytical investigations, stochastic simulations, and model selection, the authors identified the key processes underlying antigen-specific expansion and maturation of NK cells. This model distinguishes the processes of antigen-specific expansion, contraction, and memory formation in NK cells from those observed in CD8+ T cells. Understanding these differences is crucial not only for elucidating the distinct biology of NK cells compared to CD8+ T cells but also for advancing the development of NK cell therapies currently under investigation. 

      We thank the reviewer for the positive comments.

      Weaknesses: 

      The conclusions of this paper are largely supported by the available data. However, a comparative analysis of model predictions with more recent works in the field would be desirable. Moreover, certain aspects of the simulations, parameter inference, and modeling require further clarification and expansion, as outlined below: 

      (1) Initial Conditions and Grassmann Data: The Grassmann data is used solely as a constraint, while the simulated values of CD27+/CD27- cells could have been directly fitted to the Grassmann data, which assumes a 1:1 ratio of CD27+/CD27- at t = 0. This approach would allow for an alternative initial condition rather than starting from a single CD27+ cell, potentially improving model applicability. 

      We fit the moments of the cell populations along with the ratio of resulting cells from an initial condition of 1:1 ratio of CD27+/CD27- cells at t=0 in the model. The initial condition agrees with the experimental data. However, this fit produced parameter values that lead to greater growth of mature CD27- NK cells than of immature CD27+ NK cells. This could result from the equal weights given to the ratio and to the different moments; a realistic parameter estimate could correspond to unequal weights between the ratio and the moments. Imposing the constraint Δ<sub>k</sub> > 0 in the fitting restricts the parameter search to a region that seems to alleviate this issue and produces rate estimates consistent with higher growth of immature NK cells. We included Table S6 and an accompanying description to show this, as well as an additional section in the Materials and Methods (lines 669-676). 

      (2) Correlation Coefficients in the Three-State Model: Although the parameter scan of the three-state model (Figure 2) demonstrates the potential for achieving negative correlations between colony size and the fraction of CD27+ cells, the authors did not present the calculated correlation coefficients using the estimated parameter values from fitting the three-state model to the data. Including these simulations would provide additional insight into the parameter space that supports negative correlations and further validate the model.  

      We have included this figure (Figure 2d) in the revised manuscript.

      (3) Viability Dynamics and Adaptive Response: The authors measured the time evolution of CD27+/- dynamics and viability over 30 days post-infection (Figure 4). It would be valuable to test whether the three-state model can reproduce the adaptive response of CD27- cells to MCMV infection, particularly the observed drop in CD27- viability at 5 dpi (prior to the 8 dpi used in the study) and its subsequent rebound at 8 dpi. Reproducing this aspect of the experiment is critical to determine whether the model can simultaneously explain viability dynamics and moment dynamics. Furthermore, this analysis could enable sensitivity analysis of CD27- viability with respect to various model parameters. 

      We have compared the expansion kinetics of the adoptively transferred Ly49H+ NK cells (Figure 2) and endogenous Ly49H+ NK cells, where the endogenous NK cells show slower growth rates than their adoptively transferred counterparts (see lines 422-429). The data shown in Figure 4 refer to the relative percentage of the mature and immature endogenous NK cells, thus cannot be explained by the three-state model calibrated by the expansion of the adoptively transferred NK cells. One of the issues with using the viability data for parameter estimation for endogenous cells is the need to assume a model for dead cell clearance. We assume a model where dead cells are cleared according to a first-order decay reaction and vary the rate of this reaction to show that the qualitative results are in line with our model rates. This model cannot recreate the dip and rebound observed in the data, and instead monotonically and asymptotically approaches a percentage of live cells. We have attached a figure showing this behavior below. Rather, we intend to use this model as qualitative validation that the relative viability of mature NK cells is lower than that of immature NK cells. Models that include time-dependence of clearance of dead cells, or models with a higher-order (i.e. second) reaction for clearance of dead cells in which propensity for clearance is lower at early times and greater at later times may be better suited for this purpose but are beyond the scope of our validation. 

      Author response image 1.
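      Why first-order clearance of dead cells gives a monotonic approach to a plateau, rather than a dip and rebound, can be seen in a one-population reduction of such a model. The sketch below is an assumption-laden simplification, not the authors' three-state model: live cells L divide at rate `k_div` and die at rate `k_death`, dead cells D are cleared at rate `k_clear`, and D(0) = 0. The dead-to-live ratio x = D/L then obeys dx/dt = k_death - (k_clear + k_div - k_death) x, which has the closed-form solution used here.

```python
import math

def viability(t, k_div, k_death, k_clear):
    """Fraction of live cells L/(L+D) at time t, starting with no dead cells.

    Hypothetical single-population model:
      dL/dt = (k_div - k_death) * L
      dD/dt = k_death * L - k_clear * D
    """
    lam = k_clear + k_div - k_death  # relaxation rate of the dead/live ratio
    if lam <= 0:
        raise ValueError("requires k_clear + k_div > k_death for a finite plateau")
    ratio = (k_death / lam) * (1.0 - math.exp(-lam * t))  # D/L at time t
    return 1.0 / (1.0 + ratio)

def viability_limit(k_div, k_death, k_clear):
    """Long-time plateau of the live fraction in the same model."""
    return (k_clear + k_div - k_death) / (k_clear + k_div)

# Made-up rates (per day), for illustration only.
curve = [viability(t, k_div=1.0, k_death=0.5, k_clear=2.0) for t in range(0, 11)]
```

      Because the ratio D/L relaxes exponentially toward a fixed value, the live fraction can only fall toward its plateau in this reduction; a time-dependent or higher-order clearance term would be needed to produce a transient dip followed by recovery.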

      Reviewer #1 (Recommendations for the authors):  

      I think the manuscript could be improved substantially by exploring alternative models that incorporate replicative history. At the very least it needs a deeper discussion of the literature relating to clonal expansion, putting the existing models in the context of these studies, and arguing convincingly that your conclusions are robust.  

      We have substantially expanded our exploration of alternative models. In particular, we considered a variant of the Cyton model suggested by Reviewer #1, a model in which NK cells become activated at different times, and a model with asymmetric NK cell division. We show the results (Figs. S2-3) in the Supplementary Material and discuss them in the Results and Discussion sections. Please refer to our response #1 to Reviewer #1 for more details. 

      Reviewer #2 (Recommendations for the authors): 

      (1) Possible Typo (Page 12, Line 254): 

      The phrase: "immature NK cells compared to their immature counterparts" appears to contain a typo. Consider rephrasing for clarity. 

      Done. Thanks for finding this. 

      (2) Clarification of Data Source and Computational Procedure: 

      In the statement: "The NK cell clones reported by Flommersfeld et al. contained mixtures of CD27+ and CD27- NK cells. We evaluated the percentage of CD27+ NK cells in each clone and computed the correlation (Csize-CD27+) of the size of the clone with the percentage of CD27+ NK cells in the clones." Please clarify the data source and computational methodology for evaluating the percentage of CD27+ cells within clones. Additionally, consider including the curated data in the supplementary materials. Since the data originates from different immune compartments, explain which compartments were used. If data from all compartments were included, discuss how the calculated correlation changes when stratifying data from different sources (e.g., spleen and lymph nodes).  

      We have clarified the data source (spleen) where appropriate.

      (3) Figure 1b (Correlation Coefficient): 

      While the correlation coefficient with p-value is mentioned, it would be beneficial to also provide the standard deviation of the correlation coefficient and a 95% confidence band for the fitted line. This is particularly relevant as the authors use -0.2 as the upper bound for the correlation coefficient when fitting the three-stage model. 

      We have included the CI and the p-value for the correlation shown in Figure 1b. The figure with the 95% confidence band (appended below), in which both axes are on a linear scale, is not as visually clear as Figure 1b, in which the clone sizes are shown on a log scale. We therefore did not include the confidence band in Figure 1b but display the CI and p-value on the figure. If the reviewer prefers, we can include the figure with the confidence band in the SI.

      Author response image 2.
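
      For readers who wish to reproduce an interval of this kind, a percentile bootstrap is one common option. The sketch below is illustrative only and is not necessarily the procedure used for Figure 1b; the function name `bootstrap_corr_ci`, the variable names, and the synthetic data are all hypothetical and do not come from the Flommersfeld et al. dataset.

```python
import numpy as np

def bootstrap_corr_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Pearson correlation with a percentile-bootstrap (1 - alpha) CI.

    Clones (x[i], y[i]) are resampled in pairs, with replacement.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = x.size
    r_obs = np.corrcoef(x, y)[0, 1]
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample clone indices
        boot[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    # nanpercentile skips the (rare) degenerate resamples with zero variance
    lo, hi = np.nanpercentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return r_obs, lo, hi

# Synthetic, hypothetical data: log clone size vs. percentage of CD27+ cells
rng = np.random.default_rng(1)
log_size = rng.normal(3.0, 1.0, size=60)
pct_cd27 = 50.0 - 5.0 * log_size + rng.normal(0.0, 10.0, size=60)
r, lo, hi = bootstrap_corr_ci(log_size, pct_cd27)
print(f"r = {r:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

      Resampling whole clones keeps each clone's size paired with its own CD27+ percentage, which is the property the correlation depends on.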

      (4) Confidence Intervals in Tables: 

      If confidence intervals in the tables are calculated using bootstrapping, please mention this explicitly in the table headings for clarity. 

      Done.

      (5) Figure 2d-e (Simulation Method): 

      Specify the simulation method used (e.g., stochastic simulation algorithm [SSA], as mentioned in the materials and methods). Panel (e) lacks a caption; please provide one. Additionally, it would be interesting to include the correlation between clone size and the fraction of CD27+ cells in the clones (similar to the experimental data from Flommersfeld et al., 2021). 

      Done.

      (6) Figure 3 (Confidence Band): 

      Include a 95% confidence band for the simulated values to enhance the interpretability of the plots. 

      Done.

      (7) Materials and Methods Section:  Include a mathematical formula defining the metrics described, ensuring clarity and precision. 

      Done. See newly added lines 587-599, as well as existing content in the Supplementary Materials.

      (8) Supplementary Text 1 (Numerical Integration and AICc): 

      The section "Numerical Integration of Master Equation and Calculation of the AICc" is well done. However, given that the master equation involves a system of 106 coupled ODEs, it would be highly appreciated if the authors provided the formulation in matrix representation for better comprehension. 

      We have included a supplementary text (Supplementary Text I) and a schematic figure within the text to provide the details.
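
      As a generic illustration of what a matrix representation of a master equation looks like (a sketch of the standard form, not the specific matrix given in Supplementary Text I), let P(t) be the vector of state probabilities and let w_{j→i} denote the transition rate from state j to state i. The symbols P, A, and w are notation assumed here for illustration.

```latex
\frac{d\mathbf{P}(t)}{dt} = \mathbf{A}\,\mathbf{P}(t),
\qquad
A_{ij} = w_{j \to i} \quad (i \neq j),
\qquad
A_{jj} = -\sum_{i \neq j} w_{j \to i},
\qquad
\mathbf{P}(t) = e^{\mathbf{A} t}\,\mathbf{P}(0).
```

      In this form each column of A sums to zero, so total probability is conserved, and the coupled ODEs are integrated as a single linear system.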

      (9) Figure S7b (Three-State Model Validation): 

      Given that the three-state model fits the data, assess whether it can also fit the first and second-moment data effectively. This validation would strengthen the robustness of the model.

      Although we showed that the best fit of the clonal burst data (moments) vastly overestimates the growth rates of endogenous cells (Figure S9a, previously Figure S7a), we did not fully emphasize the differences in the datasets that make fitting both with the same parameters impossible. We have added additional text in the main text where Figure S9a is located (lines 427-429) to discuss this.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The study seeks to establish accurate computational models to explore the role of hydrodynamic interactions on energy savings and spatial patterns in fish schools. Specifically, the authors consider a system of (one degree-of-freedom) flapping airfoils that passively position themselves with respect to the streamwise direction, while oscillating at the same frequency and amplitude, with a given phase lag and at a constant cross-stream distance. By parametrically varying the phase lag and the cross-stream distance, they systematically explore the stability and energy costs of emergent configurations. Computational findings are leveraged to distill insights into universal relationships and clarify the role of the wake of the leading foil.

      We would like to thank the referee for their careful read of the manuscript and for their constructive feedback. We appreciate it.

      Strengths:

      (1) The use of multiple computational models (computational fluid dynamics, CFD, for full Navier-Stokes equations and computationally efficient inviscid vortex sheet, VS, model) offers an extra degree of reliability of the observed findings and backing to the use of simplified models for future research in more complex settings.

      (2) The systematic assessment of the stability and energy savings in multiple configurations of pairs and larger ensembles of flapping foils is an important addition to the literature.

      (3) The discovery of a linear phase-distance relationship in the formation attained by pairs of flapping foils is a significant contribution, which helps compare different experimental observations in the literature.

      (4) The observation of a critical size effect for larger in-line formations, above which cohesion and energetic benefits are lost at once, is a new discovery in the field.

      Thank you for this list of strengths – we are delighted that these ideas were clearly communicated in our manuscript.

      Note that Newbolt et al., PNAS, 2019 reported distance as a function of phase for pairs of flapping hydrofoils, and Li et al., Nat. Comm., 2020 also reported a phase-distance relationship in robotic and biological fish (calling it Vortex Phase Matching). We compiled their results, together with our own and other numerical and experimental results, showing that the linear distance-phase relationship is universal.

      Weaknesses:

      (1) The extent to which observations on one-degree-of-freedom flapping foils could translate to real fish schools is presently unclear so some of the conclusions on live fish schools are likely to be overstated and would benefit from some more biological framing.

      Thank you for bringing up this point. Indeed, flapping foils that are free to translate in both the x- and y-directions and rotate in the x-y plane could drift apart in the y-direction. However, this drift occurs on a much longer time scale than the forward swimming motion. For this reason, we feel justified in ignoring it for the purpose of this study, especially since the pairwise equilibria in the swimming x-direction are reached on a faster time scale.

      Below, we include two snapshots taken from published work from the group of Petros Koumoutsakos (Gazzola et al, SIAM 2014). The figures show, respectively, a pair and a group of five undulating swimmers, free to move and rotate in the x-y plane. The evolution of the two and five swimmers is computed in the absence of any control. The lateral drift is clearly sub-dominant to the forward motion. Similar results were reported in Verma et al, PNAS 2018.

      These results are independent of the details of the flow interaction model. For example, similar lateral drift is observed using the dipole model (Kanso & Tsang, FDR 2014, Tsang & Kanso, JNLS 2013).

      Another reason why we feel justified to ignore these additional degrees of freedom is the following: we assume a live fish or robotic vehicle would have feedback control mechanisms that correct for such drift. Given that it is a slowly-growing drift, we hypothesize that the organism or robot would have sufficient time to respond and correct its course.

      Indeed, in Zhu et al. 2022, an RL controller, which drives an individual fish-like swimmer to swim at a given speed and direction, when applied to pairs of swimmers, resulted in the pair "passively" forming a stable school without any additional information about each other.

      We edited page 4 of the main manuscript to include references to the work cited here and to explain the reasons for ignoring the lateral drift.

      Citations:  

      Gazzola, M., Hejazialhosseini, B., & Koumoutsakos, P. (2014). Reinforcement learning and wavelet adapted vortex methods for simulations of self-propelled swimmers. SIAM Journal on Scientific Computing, 36(3), B622-B639. DOI: https://doi.org/10.1137/130943078

      Verma, S., Novati, G., & Koumoutsakos, P. (2018). Efficient collective swimming by harnessing vortices through deep reinforcement learning. Proceedings of the National Academy of Sciences, 115(23), 5849-5854. DOI: https://doi.org/10.1073/pnas.1800923115

      Tsang, A. C. H., & Kanso, E. (2013). Dipole interactions in doubly periodic domains. Journal of Nonlinear Science, 23, 971-991. DOI: https://doi.org/10.1007/s00332-013-9174-5

      Kanso, E., & Tsang, A. C. H. (2014). Dipole models of self-propelled bodies. Fluid Dynamics Research, 46(6), 061407. DOI: https://doi.org/10.1088/0169-5983/46/6/061407

      Zhu, Y., Pang, J. H., & Tian, F. B. (2022). Stable schooling formations emerge from the combined effect of the active control and passive self-organization. Fluids, 7(1), 41. DOI: https://doi.org/10.3390/fluids7010041

      Author response image 1.

      Antiphase self-propelled anguilliform swimmers. (a) – (d) Wavelet adapted vorticity fields at, respectively, t = T, t = 4T, t = 10T. (e) Absolute normalized velocities |U|/L. (f) Swimmers’ centre of mass trajectories.

      Author response image 2.

      Parallel schooling formation. (a) – (d) wavelet adapted vorticity fields at, respectively, t = T, t = 4T, t = 7T, t = 10T. (e) Absolute normalized velocities |U|/L. (f) Swimmers’ center of mass trajectories.

      (2) The analysis of non-reciprocal coupling is not as novel as the rest of the study and potentially not as convincing due to the chosen linear metric of interaction (that is, the flow agreement).

      We thank the referee for this candid and constructive feedback. In fact, we view this aspect of the study as most “revolutionary” because it provides a novel approach to pre-computing the locations of stable equilibria even without doing expensive all-to-all coupled simulations or experiments.

      Basically, the idea is the following: you give me a flow field, it doesn’t matter how you obtained it, whether from simulations or experimentally, and I can tell you at what locations in this flow field a virtual flapping swimmer would be stable and save hydrodynamic energy!

      In the revised version, we revised pages 3 and 7 of the main text and added a new section, “Diagnostic tools,” to the SI to better illustrate this.

      Overall, this is a rigorous effort on a critical topic: findings of the research can offer important insight into the hydrodynamics of fish schooling, stimulating interdisciplinary research at the interface of computational fluid mechanics and biology.

      We thank the referee again for their careful read of the manuscript and their constructive feedback.

      Reviewer #2 (Public Review):

      The document "Mapping spatial patterns to energetic benefits in groups of flow-coupled swimmers" by Heydari et al. uses several types of simulations and models to address aspects of stability of position and power consumption in few-body groups of pitching foils. I think the work has the potential to be a valuable and timely contribution to an important subject area. The supporting evidence is largely quite convincing, though some details could raise questions, and there is room for improvement in the presentation. My recommendations are focused on clarifying the presentation and perhaps spurring the authors to assess additional aspects:

      We would like to thank the referee for their careful read of the manuscript and for their constructive feedback. We appreciate it.

      (1) Why do the authors choose to set the swimmers free only in the propulsion direction? I can understand constraining all the positions/orientations for investigating the resulting forces and power, and I can also understand the value of allowing the bodies to be fully free in x, y, and their orientation angle to see if possible configurations spontaneously emerge from the flow interactions. But why constrain some degrees of freedom and not others? What's the motivation, and what's the relevance to animals, which are fully free?

      We would like to thank the referee for raising this point. It is similar to the point raised above by the first referee. As explained above, the reason is the following: in freely-swimming, hydrodynamically-interacting “fish,” the lateral drift is sub-dominant to the forward swimming motion. Therefore, we ignore it in the model. Please see our detailed response above for further clarification, and see the changes on page 4 of the main manuscript.

      (2) The model description in Eq. (1) and the surrounding text is confusing. Aren't the authors computing forces via CFD or the VS method and then simply driving the propulsive dynamics according to the net horizontal force? It seems then irrelevant to decompose things into thrust and drag, and it seems irrelevant to claim that the thrust comes from pressure and the drag from viscous effects. The latter claim may in fact be incorrect since the body has a shape and the normal and tangential components of the surface stress along the body may be complex.

      Thank you for pointing this out! It is indeed confusing.

      In the CFD simulations, we compute the net force in the swimming x-direction by integrating the force density, as defined from the stress tensor, over the body. There is no ambiguity here.

      In the VS simulations, however, we compute the net force in the swimming x-direction by integrating the pressure jump across a plate of zero thickness. There is no viscous drag. Viscous drag is added by hand, so to speak. This method for adding viscous drag in the context of the VS model is not new; it has been used before in the literature, as explained in the SI section “Vortex sheet (VS) model” (pages 30 and 31).

      (3) The parameter τ_diss in the VS simulations takes on unusual values such as 2.45T, making it seem like this value is somehow very special, and perhaps 2.44 or 2.46 would lead to significantly different results. If the value is special, the authors should discuss and assess it. Otherwise, I recommend picking a round value, like 2 or 3, which would avoid distraction.

      Response: The dissipation time is chosen both to model viscous effects and to reduce computational complexity. Introducing it does indeed introduce forcing to the simulation. A round value, like 2 or 3, is an integer multiple of the flapping period, which is normalized to T = 1. Therefore, an integer value of τ_diss would cause forcing at the resonant frequency and lead to computational blow-up. To avoid this effect, a parameter choice of τ_diss = 2.45, 2.44, or 2.46 would be fine and would lead to only a small perturbation to the overall simulation, compared to no dissipation at all. This effect is studied in detail in the following published work from our group:

      Huang, Y., Ristroph, L., Luhar, M., & Kanso, E. (2018). Bistability in the rotational motion of rigid and flexible flyers. Journal of Fluid Mechanics, 849, 1043-1067. DOI: https://doi.org/10.1017/jfm.2018.446

      (4) Some of the COT plots/information were difficult to interpret because the correspondence of beneficial with the mathematical sign was changing. For example, DeltaCOT as introduced on p. 5 is such that negative indicates bad energetics as compared to a solo swimmer. But elsewhere, lower or more negative COT is good in terms of savings. Given the many plots, large amounts of data, and many quantities being assessed, the paper needs a highly uniform presentation to aid the reader.

      Thank you for pointing this out! We updated Figures 3 and 6 as suggested.

      (5) I didn't understand the value of the "flow agreement parameter," and I didn't understand the authors' interpretation of its significance. Firstly, it would help if this and all other quantities were given explicit definitions as complete equations (including normalization). As I understand it, the quantity indicates the match of the flow velocity at some location with the flapping velocity of a "ghost swimmer" at that location. This does not seem to be exactly relevant to the equilibrium locations. In particular, if the match were perfect, then the swimmer would generate no relative flow and thus no thrust, meaning such a location could not be an equilibrium. So, some degree of mismatch seems necessary. I believe such a mismatch is indeed present, but the plots such as those in Figure 4 may disguise the effect. The color bar is saturated to the point of essentially being three tones (blue, white, red), so we cannot see that the observed equilibria are likely between the max and min values of this parameter.

      Thank you for pointing this out! You are correct in your understanding of the flow agreement parameter, but not in your interpretation.

      Basically, “if the match were perfect, then the swimmer would generate no relative flow and thus no thrust” means that such a location is, in fact, an equilibrium. Let me elaborate. An equilibrium is a location at which the net thrust force is zero. The equilibrium is stable if the slope of the thrust force is negative. Ideally, this is what maximizing the flow agreement parameter would produce.

      For example, consider an ideal fluid in which the flow velocity has the form  in the vertical direction. Consider a “ghost swimmer” heaving at a velocity  . Under this scenario, the flow agreement and thrust parameters are

      Let’s now consider a balance of forces on the “ghost swimmer.” The ghost swimmer is in relative equilibrium if and only if:

      It gives us

      We then consider stability at this equilibrium by calculating the derivative of thrust parameter over phase

      The corresponding values at equilibria are

      Thus, when taking the positive  , which means the equilibrium is a stable fixed point. We included this analysis in a new section on page 32 of the SI.
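
      To make the notion of flow agreement concrete, here is a minimal numerical sketch. It is not the definition used in the manuscript: it assumes, for illustration only, a sinusoidal vertical flow velocity and a sinusoidal heaving velocity of the ghost swimmer with period T = 1, and it normalizes their period-averaged product by the swimmer's mean-square velocity. The function name, amplitudes, and normalization are all hypothetical.

```python
import numpy as np

def flow_agreement(phase, flow_amp=0.5, swimmer_amp=1.0, n_steps=2000):
    """Period-averaged product of an assumed sinusoidal flow velocity and
    an assumed sinusoidal ghost-swimmer velocity, normalized by the
    swimmer's mean-square velocity. `phase` is the lag of the flow
    relative to the swimmer, in radians.
    """
    t = np.arange(n_steps) / n_steps             # one period, T = 1
    v_flow = flow_amp * np.sin(2 * np.pi * t - phase)
    v_swim = swimmer_amp * np.sin(2 * np.pi * t)
    return np.mean(v_flow * v_swim) / np.mean(v_swim ** 2)

phases = np.linspace(0.0, 2.0 * np.pi, 9)
values = [flow_agreement(p) for p in phases]
for p, v in zip(phases, values):
    print(f"phase = {p:.2f} rad, agreement = {v:+.3f}")
```

      Under these assumptions the quantity reduces to (flow_amp / swimmer_amp) cos(phase), so it is largest when the local flow moves in phase with the swimmer and changes sign when the two are in antiphase.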

      (6) More generally, and related to the above, I am favorable towards the authors' attempts to find approximate flow metrics that could be used to predict the equilibrium positions and their stability, but I think the reasoning needs to be more solid. It seems the authors are seeking a parameter that can indicate equilibrium and another that can indicate stability. Can they clearly lay out the motivation behind any proposed metrics, and clearly present complete equations for their definitions? Further, is there a related power metric that can be appropriately defined and which proves to be useful?

      Thank you – these are excellent suggestions. Indeed, we needed to better explain the motivation and equations. Perhaps the main idea for these metrics can be best understood when explained in the context of the simpler particle model, which we now do in the SI and explain in the main text.

      (7) Why do the authors not carry out CFD simulations on the larger groups? Some explanations should be given, or some corresponding CFD simulations should be carried out. It would be interesting if CFD simulations were done and included, especially for the in-line case of many swimmers. This is because the results seem to be quite nuanced and dependent on many-body effects beyond nearest-neighbor interactions. It would certainly be comforting to see something similar happen in CFD.

      We are using an open-source version of the Immersed Boundary Method that is not specifically optimized for many interacting swimmers, so the computational cost of performing CFD simulations for more swimmers is high. We therefore used CFD simulations sparingly, with fewer swimmers (2 or 3), and performed systematic simulations in the context of the VS model.

      For the same Reynolds number as in Figure 1, we simulated three and four swimmers in CFD: three swimmers form a stable formation, whereas four swimmers do not, with the fourth swimmer colliding with the third one, consistent with the VS model. Results are included in the SI figure 8 of the main text.

      (8) Related to the above, the authors should discuss seemingly significant differences in their results for long in-line formations as compared to the CFD work of Peng et al. [48]. That work showed apparently stable groups for numbers of swimmers quite larger than that studied here. Why such a qualitatively different result, and how should we interpret these differences regarding the more general issue of the stability of tandem groups?

      Thank you for bringing up this important comparison. Peng et al. [48] (Hydrodynamic schooling of multiple self-propelled flapping plates) studied inline configurations of flapping airfoils at a Reynolds number of 200. There are several differences between their work and ours. The most important one is that they used a flexible plate, which makes the swimmer more adaptive to changes in the flow field (e.g., through changes in tailbeat amplitude and in phase along its body) and diverts some of the hydrodynamic energy to elastic energy. We edited page 10 of the main text, at the end of the section “Critical size of inline formations beyond which cohesion is lost,” to explain this distinction.

      (9) The authors seem to have all the tools needed to address the general question about how dynamically stable configurations relate to those that are energetically optimal. Are stable solutions optimal, or not? This would seem to have very important implications for animal groups, and the work addresses closely related topics but seems to miss the opportunity to give a definitive answer to this big question.

      Indeed, that is exactly the point – in pairwise formations, stable configurations are also energetically optimal! In larger groups, there is no unique stable configuration – each stable configuration is associated with a different degree of energy savings. Interestingly, when exploring various equilibrium configurations in a school of four, we found the diamond formation of D. Weihs, Nature, 1972 to be both stable and most optimal among the configurations we tested. However, claiming this as a global optimum may be misleading – our standpoint is that fish schools are always dynamic and that there are opportunities for energy savings in more than one stable configuration.

      We added a new section, “Mapping emergent spatial patterns to energetic benefits,” and added a new figure to the main text (Fig. 10) and a new figure to the SI (Fig. S8).

      (10) Time-delay particle model: This model seems to construct a simplified wake flow. But does the constructed flow satisfy basic properties that we demand of any flow, such as being divergence-free? If not, then the formulation may be troublesome.

      The simplified wake flow captures the hydrodynamic trail left by the swimmer in a very simplified manner. In the limit of small amplitude, it should be consistent with the inviscid vortex sheet shed in T. Wu’s waving swimmer model (Wu, 1961).

      The model was compared to experiments and used in several recent publications from the Courant Institute (Newbolt et al. 2019, 2022, 2024).

      Citations:  

      Wu, T. Y. T. (1961). Swimming of a waving plate. Journal of Fluid Mechanics, 10(3), 321-344. DOI: https://doi.org/10.1017/S0022112061000949

      Newbolt, J. W., Lewis, N., Bleu, M., Wu, J., Mavroyiakoumou, C., Ramananarivo, S., & Ristroph, L. (2024). Flow interactions lead to self-organized flight formations disrupted by self-amplifying waves. Nature Communications, 15(1), 3462. DOI: https://doi.org/10.1038/s41467-024-47525-9

      Newbolt, J. W., Zhang, J., & Ristroph, L. (2022). Lateral flow interactions enhance speed and stabilize formations of flapping swimmers. Physical Review Fluids, 7(6), L061101. DOI: https://doi.org/10.1103/PhysRevFluids.7.L061101

      Newbolt, J. W., Zhang, J., & Ristroph, L. (2019). Flow interactions between uncoordinated flapping swimmers give rise to group cohesion. Proceedings of the National Academy of Sciences, 116(7), 2419-2424. DOI: https://doi.org/10.1073/pnas.1816098116

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Congratulations on such a comprehensive and well-thought-out study; I truly enjoyed reading it and have only a couple of suggestions that I believe will help further strengthen the paper. I am including a bunch of references here that are very familiar to me without the expectation of you to include them all, just to point at areas that I feel you might consider useful.

      We thank the referee again for their careful read of the manuscript and for their constructive feedback. We appreciate it.

      First, I believe that some more rationale is needed to justify the chosen modeling framework. I am fully aware of how difficult it is to run these simulations, but I see some critical assumptions that need to be at least spelled out for the reader to appreciate the limitations of the study: (1) Constraining the cross-stream coordinate (a stability analysis should include perturbations on the cross-stream coordinate as well, see, for example, https://doi.org/10.1017/flo.2023.25 -- I know this is much simpler as it discards any vortex shedding) and (2) Assuming equal frequency and amplitude (there are studies showing variation of tail beat frequency in animals depending on their position in the school, see, for example, https://doi.org/10.1007/s00265-014-1834-4).

      Thank you for these suggestions. These are indeed important and interesting points to discuss in the manuscript. See response above regarding point 1. Regarding point 2, this is of course important and will be pursued in future extensions of this work. We edited the intro and discussion of the main text to explain this.

      In the paper “Stability of schooling patterns of a fish pair swimming against a flow”, the authors considered a pair of swimmers in a channel. They analyzed the stability of the system and found multiple equilibria, including inline and staggered formations and a special formation perpendicular to the wall. Studying fish schools in confined domains and analyzing their stability is very interesting. We added a citation to this paper in the discussion section at the end of page 10.

      In the paper “Fish swimming in schools save energy regardless of their spatial position”, the authors quantified the reduction in power of schooling fish by measuring tail beat frequency and oxygen consumption and comparing them to measurements in solitary fish. They found that individuals in a school always save power compared to swimming alone. However, there is one important caveat in this study: they considered a larger school of fish and expressed the results in terms of pairwise configurations (see the schematics we draw below). This is misleading because it may suggest that formations with only two fish provide benefits to each other, while in fact the data are obtained from a larger school with many neighbors. The authors consider only a fish’s relationship to its nearest neighbor, but in a large school, other neighbors will also influence its energy consumption. In the schematics below, we highlight several focal fish, marked in red, green, and blue. We also mark their nearest neighbors using the same color, but lighter. The nearest neighbors are the ones the authors use to define the neighbor relationship. A problematic example is the red fish, whose nearest neighbor is behind it, whereas its power saving may in fact come from other neighbors that are beside or ahead of it.

      Author response image 3.

      Second, I would like to see more biology context with respect to limitations that are inherent to a purely mechanical model, including, neglecting vision that we know plays a synergistic role in determining schooling patterns. For example, a recent study https://doi.org/10.1016/j.beproc.2022.104767 has presented experiments on fish swimming in the dark and in bright conditions, showing that it is unlikely that hydrodynamics alone could explain typically observed swimming patterns in the literature.

      Thank you for this suggestion and for sharing with us the paper “Collective response of fish to combined manipulations of illumination and flow”. This is a great study, and we are sorry to have missed it.

      In this paper, the authors found that fish swim more cohesively under illumination, which is consistent with another paper we already cited, “The sensory basis of schooling by intermittent swimming in the rummy-nose tetra (Hemigrammus rhodostomus)”. Another important conclusion of this paper is that, under brighter illumination and with flow, fish schools spend more time side by side. This connects well to the conclusion of another paper we cited, “Simple phalanx pattern leads to energy saving in cohesive fish schooling,” where at lower flow speeds in a water channel, fish tended to form a dynamic school, while at higher flow speeds, they organized in a side-by-side (phalanx) configuration. This conclusion is consistent with our finding that fish in a side-by-side formation share power savings.

      Importantly, it is well known that both vision and flow sensing play important roles in fish schooling. This study aimed to merely explore what is possible through passive hydrodynamic interactions, without visual and flow sensing and response. We clarify this in the revised version of the manuscript.

      Third, I am not too convinced about the flow agreement metric, which only accounts for linear interactions between the foils. More sophisticated approaches could be utilized as the one proposed here https://doi.org/10.1017/jfm.2018.369, based on a truly model-agnostic view of the interaction - therein, the authors show non-reciprocal (in strength and time-scale) coupling between two in-line flapping foils using information theory. I also would like to mention this older paper https://doi.org/10.1098/rsif.2012.0084, where an equivalent argument about the positioning of a trailing fish with respect to a leading robotic fish is made from experimental observations.

      Thank you for these remarks and for sharing these two interesting papers.

      The flow agreement metric is not specific to two fish, as we show in Fig. 6 of the manuscript. We edited the manuscript and SI to better explain the motivation and implementation of the flow agreement parameter. We edited the main text (see revisions on page 7) and added a new section called “diagnostic tools.”

      In the paper “An information-theoretic approach to study fluid–structure interactions”, the authors calculate the transfer entropy between two oscillating airfoils when they are hydrodynamically coupled. This is an interesting study! We will apply this approach to analyzing larger schools in the future. We cited this paper in the introduction.

      In the paper “Fish and robots swimming together: attraction towards the robot demands biomimetic locomotion”, the authors found that fish will swim behind an artificial fish robot, especially when the fish robot is beating its tail rather than static. Under specific conditions, the fish hold station behind the robot, which may be due to the hydrodynamic advantage obtained by swimming in the robot’s wake. DPIV resolved the wake behind a static or beating fish robot, but did not visualize the flow field when the fish is present. This study is similar to a paper we already cited, “In-line swimming dynamics revealed by fish interacting with a robotic mechanism”, in which the authors considered fish-foil interaction. In the revised manuscript, we cite both papers.

      Regarding the reviewer’s comment that the flow agreement metric only accounts for linear interactions between the foils, we would like to clarify. The flow agreement parameter is a nonlinear metric that considers the interaction between a virtual swimmer and an arbitrary unsteady flow field. Although the metric is a linear function of the swimmer’s speed, it is a nonlinear function of spacing and phase, which are the quantities we care about. Moreover, the flow field can be generated by either experiment or CFD simulation, and behind one or more swimmers. It is true that it is a one-way coupled system, since the virtual swimmer does not perturb the flow field.
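
      To make the linear-versus-nonlinear distinction concrete, the following is a minimal sketch, not the metric as implemented in the manuscript. It assumes a hypothetical wake (a decaying travelling wave) and a virtual swimmer with a prescribed sinusoidal flapping velocity; the function name `flow_agreement` and all of its parameters are illustrative assumptions. The period-averaged product of the two velocities scales linearly with the swimmer's flapping amplitude but depends nonlinearly on spacing and phase, and the swimmer never feeds back on the flow (one-way coupling).

```python
import math


def flow_agreement(spacing, phase, amplitude=1.0, freq=1.0,
                   wake_speed=1.0, wake_decay=5.0, wake_strength=1.0,
                   steps=2000):
    """Period-averaged product of swimmer and wake transverse velocities.

    All parameters are hypothetical and chosen only for illustration.
    """
    omega = 2.0 * math.pi * freq
    period = 1.0 / freq
    total = 0.0
    for i in range(steps):
        t = period * i / steps
        # Assumed wake: a travelling wave that decays with downstream distance.
        v_wake = (wake_strength * math.exp(-spacing / wake_decay)
                  * math.cos(omega * (t - spacing / wake_speed)))
        # Virtual swimmer: prescribed flapping velocity; it does not alter v_wake.
        v_swim = amplitude * omega * math.cos(omega * t + phase)
        total += v_swim * v_wake
    return total / steps


if __name__ == "__main__":
    base = flow_agreement(spacing=1.0, phase=0.3)
    print("doubling amplitude:", flow_agreement(1.0, 0.3, amplitude=2.0) / base)
    print("doubling spacing:  ", flow_agreement(2.0, 0.3) / base)
    print("shifting phase by pi:", flow_agreement(1.0, 0.3 + math.pi) / base)
```

      In this toy setting, doubling the amplitude doubles the metric, whereas doubling the spacing or shifting the phase changes it in a way that is not proportional to the change in the input.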

      Again, this is great work and I hope these suggestions are of help.

      Thank you again! We are delighted to receive such positive and constructive feedback.

      Reviewer #2 (Recommendations For The Authors):

      (1) About Figure 1: Panel C should be made to match between CFD and VS with regard to the swimmer positions. Also, if the general goal of the figure is to compare CFD and VS, then how about showing a difference map of the velocity fields as a third column of panels across A-D?

      Thank you for pointing this out. Figure 1 C is updated accordingly.

      The general goal is to show that the CFD and VS simulations produce qualitatively similar results. Some quantities are not the same across models, e.g. the swimming speeds of the swimmers differ, but the scaled distance is the same.

      (2) Figure 3: In A, it would be nice to keep the y-axis the same across all plots, which would aid quick visual comparison. In B, the legend labels for CFD and VS should be filled in with color so that the reader can more easily connect to the markers in the plot.

      Thank you for pointing this out. We’ve updated Figures 3 and 6.

      (3) Figures 4, 9, and Supplementary Figures too: As mentioned previously, the agreement parameter plots are saturated in the color map, possibly obscuring more detailed information.

      Thank you for pointing this out. The goal is to show that there is a large region with positive flow agreement parameter.

      We took the flow agreement behind a single swimmer in the VS simulation (Fig. 4B) and added contour lines to it (representing 0.25 and 0.5). Not many details are hidden by the saturated colormap.

      Author response image 4.

      We also updated Fig 4 and Fig 9 accordingly.

      (4) Figure 6: Is this CFD or VS? Why show one or the other and not both? In B, it seems that there are only savings available and no energetically costly positions. This seems odd. In C, it seems the absolute value on dF/dd is suppressing some important information about stability - the sign of this seems important. In E, the color bar seems to be reflected from what is standard, i.e. 0 on the left and 100 on the right, as in F.

      Thank you for asking. Fig. 6 is based only on VS simulations. There are hundreds of simulations in this figure, so we did not run CFD simulations, to save computational effort. Representative CFD simulations are shown in Figures 1, 2, and 3 for comparison. We added a sentence in the figure caption for clarification.

      In C, since dF/dd is always negative for emergent formations (only stable equilibria can appear during forward time simulation), we are showing its absolute value for comparison.

      In E, we are flipping this because a larger flow agreement parameter corresponds to more power saving, in other words, negative changes in COT.

      (5) Fig. 8: For cases such as in D that have >100% power savings, does this mean that the swimmer has work done by the flow? How to interpret this physically for a flapping foil and biologically for a fish?

      Yes, it means the hydrofoil/fish gets a free ride, and is even able to harvest energy from the incoming flow. In fact, similar phenomena have been reported in the biology and engineering literature. For example, Liao et al. 2003 and Beal et al. 2006 found that live or dead fish can harvest energy from incoming vortical flow by modulating their body curvature.

      In engineering, Chen et al. 2018 and Ribeiro et al. 2021 found that the following airfoil in a tandem/inline formation can harvest energy from the wake of the leading swimmer in both simulation and experiments.

      Citations:

      Liao, J. C., Beal, D. N., Lauder, G. V., & Triantafyllou, M. S. (2003). Fish exploiting vortices decrease muscle activity. Science, 302(5650), 1566-1569. DOI: https://doi.org/10.1126/science.1088295

      Beal, D. N., Hover, F. S., Triantafyllou, M. S., Liao, J. C., & Lauder, G. V. (2006). Passive propulsion in vortex wakes. Journal of Fluid Mechanics, 549, 385-402. DOI: https://doi.org/10.1017/S0022112005007925

      Chen, Y., Nan, J., & Wu, J. (2018). Wake effect on a semi-active flapping foil based energy harvester by a rotating foil. Computers & Fluids, 160, 51-63. DOI: https://doi.org/10.1016/j.compfluid.2017.10.024

      Ribeiro, B. L. R., Su, Y., Guillaumin, Q., Breuer, K. S., & Franck, J. A. (2021). Wake-foil interactions and energy harvesting efficiency in tandem oscillating foils. Physical Review Fluids, 6(7), 074703. DOI: https://doi.org/10.1103/PhysRevFluids.6.074703

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer 1:

      (1) Figure 2 is mentioned before Figure 1

      We thank the reviewer for pointing out this mistake. What was meant by Figure 2 was actually Figure 1. This has been corrected in the manuscript.

      (2) Figure 1c: red is used to indicate cell junctions on raw data, but also the error.

      In Figure 1c (left), red indicates cell junctions on the raw data, while in Figure 1c (right) it indicates the error.

      The Lagrangian error can be negative right? This is not reflected by the error scale which goes from 0% to 100%

      A negative Lagrangian error would mean that the distance between real and simulated cellular junctions decreased over time. We effectively treat this case as if there was no displacement, and the error is hence 0%.

      Why do you measure the error in percent?

      The error is measured in percentages because it is relative to the apical length of a cell.
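
      The two answers above (clamping negative values to 0% and normalizing by apical length) can be summarized in a short sketch. This is only an illustration under stated assumptions: the function name `lagrangian_error_percent` and its arguments are hypothetical, and the error is taken here to be the increase in distance between a real and a simulated junction relative to its initial value.

```python
def lagrangian_error_percent(dist_start, dist_now, apical_length):
    """Relative drift between real and simulated junctions, in percent.

    Hypothetical helper: a decrease in distance is treated as no
    displacement, so the returned error is never negative.
    """
    if apical_length <= 0:
        raise ValueError("apical_length must be positive")
    drift = dist_now - dist_start          # negative if the junctions got closer
    return 100.0 * max(0.0, drift) / apical_length


if __name__ == "__main__":
    print(lagrangian_error_percent(1.0, 3.0, 10.0))  # junctions drifted apart
    print(lagrangian_error_percent(3.0, 1.0, 10.0))  # junctions got closer
```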

      (3) Figure 2: The distinction between pink and red in e_2(t) is very difficult. What do the lines indicate?

      The lines indicate the directions of the eigenvectors of the strain rate tensor at every material particle of the embryo.

      (4) L156 "per unit length": Rather per unit time?

      We thank the reviewer for pointing this out. We apologize for this mistake. "per unit length" has been changed to "per unit time".

      (5) L159 "Eigen vectors in this sense": is there another sense?

      "In this sense" referred to the geometric description of eigenvectors. The phrase has been removed.

      (6) L164 "magnitude of the rate of change underwent by a particle at the surface of the embryo in the three orthogonal spatial directions of most significant rate of change."

      Would a decomposition in two directions within the surface's tangent plane and one perpendicular to it not be better?

      We also performed the decomposition of the strain rate tensor as suggested within the surface's tangent plane and one perpendicular to it, but did not notice any tangible differences in the overall analysis, especially after derivation of the scalar field.

      (7) L174 "morphological activity": I think this notion is never defined

      By morphological activity we mean any noticeable shape changes.

      (8) L177: I did not quite understand this part

      This part tries to convey that the scalar strain rate field reveals coordinated cell behaviors by highlighting wide regions of red that traverse cell boundaries (e.g. fig.2b, $t=5.48hpb$). At the same time, the strain rate field preserves cell boundaries, highlighted by bands of red at cellular intersections, when coordinated cell behaviors are not preponderant (e.g. fig.2b, $t=4hpb$).

      (9) Ll 194 "Unsurprisingly, these functions play an important role in many branches of science including quantum mechanics and geophysics Knaack and Stenflo (2005); Dahlen and Tromp (2021)." Does this really help in understanding spherical harmonics?

      This comment was made with the aim of showing to the reader that Spherical Harmonics have proved to be useful in other fields. Although it does not help in understanding spherical harmonics, it establishes that they can be effective.

      (10) Figure 3a: I do not find this panel particularly helpful. What does the color indicate? What are the prefactors of the spherical harmonics?

      This panel showcases the restriction of the strain rate scalar field to the spherical harmonics with the l and m specified. Each material particle of the embryo surface at the time  is colored with respect to the value of . The values are computed according to equation 2 and are showcased in figure 3c.

      (11) L 265: Please define "scalogram" as opposed to a spectrogram.

      Scalograms are the result of wavelet transforms applied to a signal. Although spectrogram can specifically refer to the spectrum of frequencies resulting for example from a Fourier transform, the term can also be used in a broader sense to designate any time-frequency representation. In the context of this paper, we used it interchangeably with scalogram. We have changed all occurrences of spectrogram to scalogram in the revised manuscript.

      (12) L 299 "the analysis was carried out the 64-cell stage.": Probably 'the analysis was carried out at the 64-cell stage'

      We thank the reviewer for pointing this out. The manuscript was revised to reflect the suggested change.

      (13) L 340 "Another outstanding advantage over traditional is": Something seems to be missing in this sentence.

      We thank the reviewer for pointing this out. We have modified the sentence in the revised manuscript. It now reads “Another outstanding advantage of our workflow over traditional methods is that our workflow is able to compress the story of the development ... ”.

      (14) Ll 357 "on the one hand, the overall spatial resolution of the raw data, on the other hand, the induced computational complexity.": Is there something missing in this sentence

      The sentence tries to convey the idea that, in implementing our method, there is a compromise to be made between the number of particles on the constructed mesh and the computational complexity induced by this choice. There is also a compromise to be made between the number of particles and the spatial resolution of the original dataset.

      Reviewer 2:

      (1) The authors should clearly state to which data this method has been applied in this paper. Also, to what kind of data can this method be applied? For instance, should the embryo surface be segmented?

      The method has been applied to 3D+time imaging data of ascidian embryonic development hosted on the morphonet (morphonet.org) platform. The data on the morphonet platform come in two formats: closed surface meshes of segmented cells spatially organized into the embryo, and 3D voxelated images of the embryo. The method was first designed for the former format and then extended to the latter. There is no requirement for the embryo surface to be segmented.

      (2) In this paper, it is essential to understand the way that the authors introduced the Lagrangian markers on the surface of the embryo. However, understanding the method solely based on the description in the main text was difficult. I recommend providing a detailed explanation of the methodology including equations in the main text for clarity.

      We believe that adding mathematical details of the method into the text will cloud the text and make it more difficult to understand. Interested readers can refer to the supplementary material for detailed explanation of the method.

      (3) In eq.(1) of the supplementary information, d(x,S_2(t)) could be a distance function between S_1 and S_2 although it was not stated. How was the distance function between the surfaces defined?

      What was meant here was d(x,S_1(t)), where x is a point of S_2(t) and d(x,S_1(t)) refers to the distance between point x and S_1(t). The definition of the distance function has been clarified in the supplementary information.

      (4) In the section on the level set scheme of supplementary information, the derivation of eq.(4) from eq.(3) was not clear.

      We added an intermediary equation for clarification.

      (5) Why is a reference shape S_1(0) absent at t=0?

      A reference shape S_1(0) is absent at t=0 precisely because that is what we are trying to achieve: construct an evolving Lagrangian surface S_2(t) matching S_1(t) at all times.

      (6) In Figure 2(a), it is unclear what was plotted. What do the colors mean? A color bar should be provided.

      The caption of the figure describes the colors: “a) Heatmap of the eigenvector fields of the strain rate tensor. Each row represents a vector field distinguished by a distinct root color (yellow, pink, white). The gradient from the root color to red represents increasing magnitudes of the strain rate tensor.”

      (7) With an appropriate transformation, it would be possible to create a 2D map from a 3D representation shown in for instance Figure 2. Such a 2D representation would be more tractable for looking at the overall activities.

      We thank the reviewer for pointing this out. In Figure 4b of the supplementary information, we provide a 2D projection of the scalar strain rate field.

      (8) The strain rate is a second-order tensor that contains rich information. In this paper, the information in the tensor has been compressed into a scalar field by taking the square root of the sum of the squares of the eigenvalues. However, such a representation may not distinguish important events such as stretching and compression of the tissue. The authors should provide appropriate arguments regarding the limitations of this analysis.

      The tensor form of the strain rate field is indeed endowed with more information than the scalar eigenvalue field derived from it. However, our objective in this project was not to exhaust the richness of the strain rate tensor field but rather to provide a proof of concept that our global approach to studying morphogenesis can unveil sufficiently rich information on the dynamical processes at play. Although outside the scope of this project, a more thorough exploration of the strain rate tensor field could be the object of future investigations.
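
      The limitation raised by the reviewer can be illustrated with a small sketch. It assumes only what is stated in the comment above, namely that the scalar field is the square root of the sum of the squared eigenvalues; the function name `scalar_strain_rate` and the example tensors are hypothetical. For a symmetric tensor this quantity equals the Frobenius norm, so no eigen-decomposition is needed, and stretching and compression of equal magnitude map to the same scalar.

```python
import math


def scalar_strain_rate(tensor):
    """sqrt(sum of squared eigenvalues) of a symmetric 3x3 tensor.

    For a symmetric matrix this equals the Frobenius norm, since
    trace(A @ A) is both the sum of squared entries and the sum of
    squared eigenvalues.
    """
    for i in range(3):
        for j in range(3):
            if abs(tensor[i][j] - tensor[j][i]) > 1e-12:
                raise ValueError("strain rate tensor must be symmetric")
    return math.sqrt(sum(tensor[i][j] ** 2 for i in range(3) for j in range(3)))


# Hypothetical example tensors: uniaxial stretching vs. compression.
stretch = [[0.2, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
compress = [[-0.2, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

if __name__ == "__main__":
    # Both print the same magnitude: the sign information is lost.
    print(scalar_strain_rate(stretch), scalar_strain_rate(compress))
```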

      (9) The authors claimed that similarities emerge between the spatiotemporal distribution of morphogenesis processes in the previous works and the heatmaps in this work. Some concrete data should be provided to support this claim.

      All claims have been backed with references to previous works. For instance, looking at the two middle panels on the lower row of figure 2b (5.48hpf, 6.97hpf), we explained that the concentrations of red correspond respectively to endoderm invagination during gastrulation and zippering during neurulation [we cited Hashimoto et al. (2015)]. Here, we relied on visual inspection to spot the similarities. The rest of the paper provides substantial and robust additional support for these claims using spectral decomposition in space and time.

      (10) The authors also claimed that "A notable by-product of this scalar field is the evidencing of the duality of the embryo as both a sum of parts constituted of cells and an emerging entity in itself: the strain rate field clearly discriminates between spatiotemporal locations where isolated single cell behaviours are preponderant and those where coordinated cell behaviours dominate." The authors should provide specific examples and analysis to support this argument.

      Here, we relied on visual inspection to make this claim. This whole section of the paper, “Strain rate field describes ascidian morphogenesis”, was about computing, plotting, and observing the strain rate field.

      However, specific examples were provided. This paragraph was building towards this statement, and the evidence was scattered through the paragraph. We have now revised the sentence to ensure that we highlight specific examples:

      “A notable by-product of this scalar field is the evidencing of the duality of the embryo as both a sum of parts constituted of cells and an emerging entity in itself: the strain rate field clearly discriminates between spatiotemporal locations where isolated single cell behaviours are preponderant (e.g. fig.2b, $t=4hpb$) and those where coordinated cell behaviours dominate (e.g. fig.2b, $t=5.48hpb$).”

      (11) The authors should provide the details of the analysis method used in Figure 3b, including relevant equations. In particular, it would be helpful to clarify the differences that cause the observed differences between Figure 3b and Figure 3c.

      Figure 3b was introduced with the sentence: “In analogy to Principal Components Analysis, we measure the average variance ratio over time of each harmonic with respect to the original signal (Fig.3b).” explaining the origin of variance ratio values used in figure 3b. We have now added the mathematical expression to further clarify.

      (12) The authors found that the variance ratio of Y_00 was 64.4%. Y_00 is a sphere, indicating that most of the activity can be explained by a uniform activity. Which actual biological process explains this symmetrical activity?

      The reviewer makes a good point, which also gave us a lot to think about during the analysis. Observing that the contribution of Y00 peaks during synchronous divisions, which are interestingly restricted to the animal pole, we conjecture that localized morphological ripples can be felt throughout the embryo.

      (13) The contribution of other spherical harmonics than Y_00 and Y_10 should be shown.

      Other spherical harmonics each contributed less than 1%, and we did not find it important to include them in the main figure. We will add supplementary material.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      In their manuscript entitled: "Is tumor mutational burden predictive of response to immunotherapy?", Gurjao and colleagues discuss the use of tumor mutational burden (TMB) as a predictive biomarker for cancer patients to respond to immune checkpoint blockade (ICB). By analyzing a large cohort of 882 patient samples across different tumor types they find either little or no association of TMB to the response of ICB. In addition, they showed that finding the optimal cutoff for patient stratification leads to a severe multiple testing problem. By rigorously addressing this multiple testing problem only non-small cell lung cancer out of 10 cancer types showed a statistically significant association of TMB and response to ICB. Nevertheless, it is clearly shown that in any case the rate of misclassification is too high for TMB alone to qualify as a clinically suitable biomarker for ICB response. Finally, the authors demonstrate with a simple mathematical model that only a few strong immunogenic mutations would be sufficient for an ICB response, thereby showing that also patients with a low TMB score could benefit from immunotherapy. The manuscript is clearly written, the results are well presented and the applied methods are state-of-the-art.

      We would like to thank the reviewer for their thoughtful suggestions and efforts towards improving our manuscript. We address below the reviewer’s recommendations.

      Reviewer #1 (Recommendations For The Authors):

      (1) The method used for mutation calling can also influence the TMB score. Mutation data was downloaded from public databases and not re-called for this study, so a potential caller bias could be present. What was the calling strategy of the used data sets? For the present study, I don't think that this is crucial because different callers or post-call processing would be used at different sites to determine TMB. I think the mutation calling bias should also be discussed in the manuscript as another shortcoming for TMB as a biomarker for ICB response.

      We thank the reviewer for this comment. Mutational data were not aggregated across studies, and caller bias would thus not have any impact on the results of this manuscript. In addition, we further clarified the role of mutation calling bias in the Discussion section.

      “Although attractive and scalable, TMB does not consider the effect of specific mutations (missense, frameshift etc), their presentation and clonality (19), nor the state of the tumour, its microenvironment, and interactions with the immune system that can be integrated into potentially better predictors of response to ICB (43, 44). In addition, another major limitation of TMB is the lack of standardized measures. This includes the lack of standard sequencing methods to assess TMB: TMB can be measured from Whole-Exome sequencing, Whole-Genome sequencing, targeted panel and even RNA sequencing. This also includes biases introduced by using different mutation calling pipelines resulting in different TMB, sequencing depth and different characteristics of the samples (e.g. low purity samples typically yield lower TMB).”

      (2) In their mathematical model of neoantigens and immunogenicity it is assumed that the probability of a mutation to be immunogenic is constant for all mutations. In reality this is certainly not satisfied. However, the central conclusion from the model still holds. I think that this is important to discuss in the manuscript.

      We thank the reviewer for this suggestion and now consider the case where each mutation has its own probability p(i) of being immunogenic.

      “Our model shows that achieving about constant 𝑃{𝑖𝑚𝑚𝑢𝑛𝑒 𝑟𝑒𝑠𝑝𝑜𝑛𝑠𝑒} for 𝑁 > 10 − 20 mutations, requires and . The same argument holds when each mutation has its  own probability to be immunogenic 𝑝(𝑖), then , where is the mean probability of a mutation to be immunogenic. Thus only the average probability of a mutation to be immunogenic matters. In summary, we find that the model agrees with clinical data if individual non-synonymous mutations have, on average, 𝑝~10 − 20% chance for triggering an immune response.”
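
      The statement that only the average probability matters can be sketched for the simplest case. The derivation below is an illustration under stated assumptions, not the expression from the manuscript: it assumes that a single immunogenic mutation suffices (𝑘crit = 1), that mutations are independent, and that each 𝑝(𝑖) is small enough for $\ln(1-p) \approx -p$ to hold.

```latex
\begin{align}
P\{\text{immune response}\}
  &= 1 - \prod_{i=1}^{N} \bigl(1 - p(i)\bigr) \\
  &= 1 - \exp\Bigl(\sum_{i=1}^{N} \ln\bigl(1 - p(i)\bigr)\Bigr) \\
  &\approx 1 - e^{-N\bar{p}},
  \qquad \bar{p} = \frac{1}{N}\sum_{i=1}^{N} p(i).
\end{align}
```

      Under these assumptions the individual values 𝑝(𝑖) enter only through their mean, and the probability approaches 1 once $N\bar{p}$ exceeds a few units.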

      (3) In the mathematical formula on page 8, C_N^k is the binomial coefficient. This should be stated or written out.

      Thank you for pointing this out. Corrected.

      “Due to immunodominance, only a few (𝑘crit) immunogenic mutations are sufficient to elicit a full immune response. Hence, the probability for a cancer with 𝑁 (=TMB) mutations to elicit an immune response is then the probability of having 𝑘crit or more immunogenic mutations among 𝑁:

      $$P\{\text{immune response}\} = \sum_{k=k_{crit}}^{N} C_N^k \, p^k (1-p)^{N-k},$$

      which is the CDF of a binomial distribution.”
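
      As a numerical illustration of this probability, the sketch below evaluates the chance of having at least 𝑘crit immunogenic mutations among 𝑁 independent mutations, each immunogenic with probability 𝑝, by summing binomial terms (the upper tail, i.e. one minus the binomial CDF at 𝑘crit − 1). The function name `p_immune_response` and the parameter values in the loop are assumptions chosen only to contrast a small 𝑝 with 𝑝 around 0.1.

```python
from math import comb


def p_immune_response(n_mutations, p_immunogenic, k_crit=1):
    """P{at least k_crit of n_mutations are immunogenic}, assuming independence."""
    if not 0.0 <= p_immunogenic <= 1.0:
        raise ValueError("p_immunogenic must lie in [0, 1]")
    if k_crit > n_mutations:
        return 0.0  # not enough mutations to reach the threshold
    n, p = n_mutations, p_immunogenic
    # Probability of fewer than k_crit immunogenic mutations.
    below = sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(k_crit))
    return 1.0 - below


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 50, 100):
        low_p = p_immune_response(n, 0.01)   # illustrative "p << 1" value
        high_p = p_immune_response(n, 0.1)   # illustrative "p ~ 0.1" value
        print(f"N={n:3d}  p=0.01: {low_p:.3f}  p=0.1: {high_p:.3f}")
```

      With the small value of 𝑝 the probability grows slowly with 𝑁, while with 𝑝 near 0.1 it rises steeply over the first few tens of mutations.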

      (4) The mathematical model provides an explanation that tumors with a low TMB can also respond on ICB. It cannot explain tumors with high TMB lacking ICB response. An explanation of this phenomenon is discussed in the paper but I think also the impact of the tumor immune microenvironment should be mentioned here.

      As we explained in the presentation of the model, even immunogenic tumors elicit response to ICB with some probability. In the revision we write:

      “𝑃{𝑐𝑙𝑖𝑛𝑖𝑐𝑎𝑙 𝑟𝑒𝑠𝑝𝑜𝑛𝑠𝑒} = 𝑃{𝑐𝑙𝑖𝑛𝑖𝑐𝑎𝑙 𝑟𝑒𝑠𝑝𝑜𝑛𝑠𝑒|𝑖𝑚𝑚𝑢𝑛𝑒 𝑟𝑒𝑠𝑝𝑜𝑛𝑠𝑒} · 𝑃{𝑖𝑚𝑚𝑢𝑛𝑒 𝑟𝑒𝑠𝑝𝑜𝑛𝑠𝑒}, where 𝑃{𝑐𝑙𝑖𝑛𝑖𝑐𝑎𝑙 𝑟𝑒𝑠𝑝𝑜𝑛𝑠𝑒|𝑖𝑚𝑚𝑢𝑛𝑒 𝑟𝑒𝑠𝑝𝑜𝑛𝑠𝑒} is the probability of clinical response, given that cancer elicits an immune response which is complex and depends on many factors including tumor immune microenvironment. Yet the prerequisite for the clinical response is the immune response 𝑃{𝑖𝑚𝑚𝑢𝑛𝑒 𝑟𝑒𝑠𝑝𝑜𝑛𝑠𝑒} that we focus on.”

      Reviewer #2 (Public Review):

      The manuscript points out that TMB cut-offs are not strong predictors of response to immunotherapy or overall survival. By randomly shuffling TMB values within cohorts to simulate a null distribution of log-rank test p-values, they show that under correction, the statistical significance of previously reported TMB cut-offs for predicting outcomes is questionable.

      We would like to thank the reviewer for their thoughtful suggestions and efforts towards improving our manuscript.
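
      The shuffling procedure summarized by the reviewer can be sketched as follows. This is a simplified illustration, not the analysis in the manuscript: in place of the log-rank test it uses the absolute difference in response rate above and below a cutoff, and the function names, the number of shuffles, and the seed are all assumptions. The key point it shows is that the best statistic over all cutoffs must be compared with the best statistic over all cutoffs in shuffled data, which accounts for the multiple testing introduced by cutoff optimization.

```python
import random


def best_cutoff_statistic(tmb, responded):
    """Largest |response rate at/above cutoff - response rate below| over cutoffs."""
    best = 0.0
    for cutoff in sorted(set(tmb))[1:]:  # skipping the minimum keeps both groups non-empty
        high = [r for t, r in zip(tmb, responded) if t >= cutoff]
        low = [r for t, r in zip(tmb, responded) if t < cutoff]
        best = max(best, abs(sum(high) / len(high) - sum(low) / len(low)))
    return best


def permutation_p_value(tmb, responded, n_shuffles=500, seed=0):
    """Corrected p-value: how often shuffled TMB does as well at its own best cutoff."""
    rng = random.Random(seed)
    observed = best_cutoff_statistic(tmb, responded)
    shuffled = list(tmb)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any link between TMB and response
        if best_cutoff_statistic(shuffled, responded) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    tmb_values = list(range(1, 21))          # made-up TMB values
    no_signal = [0, 1] * 10                  # made-up responses unrelated to TMB
    print(permutation_p_value(tmb_values, no_signal))
```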

      There is a clinical need for a better prediction of treatment response than TMB alone can provide. However, no part of the analysis challenges the validity of the well-known pan-cancer correlation between TMB and immunotherapy response.

      We address the pan-cancer correlation in the supplemental text and Figure S3. We realized that the supplemental text was missing from the eLife submission and included only in the bioRxiv version. We apologize for this oversight. In particular, we show that the “well-known pan-cancer correlation” is largely based on a few outlier cancer subtypes: MSI colorectal cancers and uveal/ocular melanomas. We show that when we remove these cancer types from the pan-cancer dataset, the correlation becomes non-significant for the remaining 15 cancer types.

      The failure to detect significant TMB cut-offs may be due to insufficient power, as the examined cohorts have relatively low sample sizes. A power analysis would be informative of what cohort sizes are needed to detect small to modest effects of TMB on immune response.

      Since we see no effect, we cannot perform a power analysis. Moreover, increasing cohort sizes cannot increase the effect -- dramatic misclassification of responders (the fraction of responders below the treatment cutoff) would remain the same, making TMB unsuitable for clinical decision-making.

      The manuscript provides a simple model of immunogenicity that is tailored to be consistent with a claimed lack of relationship between TMB and response to immunotherapy. Under the model, if each mutation that a tumor has acquired has a relatively high probability of being immunogenic (~10%, they suggest), and if 1-2 immunogenic mutations is enough to induce an immune response, then most tumors produce an immune response, and TMB and response should be uncorrelated except in very low-TMB tumors.

      Contrary to the reviewer’s suggestion, our modeling is not tailored to be consistent with the lack of association between TMB and response. Rather, we found that the model has two regimes: the first regime (where p<<1), in which higher TMB leads to a higher probability of response, which doesn’t agree with the data, and the second regime (p~0.1), in which cancers with TMB>10-20 are immunogenic, consistent with the clinical data.

      We further expanded on these key points in the Results:

      “The model shows two different behaviors. If individual mutations are unlikely to be immunogenic (𝑝 ≪ 1), e.g. due to a low probability of being presented, the probability of response increases gradually with TMB (Figure 5B). The neoantigen theory generally expects such gradual increase in immunogenicity of cancer with TMB. Yet, available data (Figure 2) don’t show such a trend.

      On the contrary, if mutations are more likely to be immunogenic 𝑝~0.1, the probability of response quickly saturates (Figure 5C), making such tumors respond to ICB irrespective of TMB, as we observed in clinical data.”

      We also expanded on these key points in the Introduction:

      “We develop a simple model that is based on the neoantigen theory and find that it has two regimes. In one regime, the probability of response increases gradually with TMB, as commonly believed. Yet in the other, the probability of response saturates after a few mutations, making a chance to respond independent of TMB. Our analysis of the clinical data is consistent with the latter regime. Thus our model shows that the neoantigen theory is fully consistent with the lack of association between TMB and response.”

      The question then becomes whether the response is sufficient to wipe out tumor cells in conjunction with immunotherapy, which is essentially the same question of predicting response that motivated the original analysis. While TMB alone is not an excellent predictor of treatment response, the pan-cancer correlation between TMB and response/survival is highly significant, so the model's only independent prediction is wrong.

      Our study indicates that TMB is a very poor predictor (writing that it’s “not an excellent predictor of treatment response” is an understatement). Moreover, we show that the widely believed “pan-cancer correlation” is shaky as well (Supplemental text and Figure S3). So we don’t see any contradictions between the model and the data.

      Additionally, experiments to predict and validate neoepitopes suggest that a much smaller fraction of nonsynonymous mutations produce immune responses1,2.

      We agree with the reviewer. That’s exactly what the model suggests.

      A key idea that is overlooked in this manuscript is that of survivorship bias: self-evidently, none of the mutations found at the time of sequencing have been immunogenic enough to provoke a response capable of eliminating the tumor. While the authors suggest that immunoediting "is inefficient, allowing tumors to accumulate a high TMB," the alternative explanation fits the neoepitope literature better: most mutations that reach high allele frequency in tumor cells are not immunogenic in typical (or patient-specific) tumor environments. Of course, immunotherapies sometimes succeed in overcoming the evolved immune evasion of tumors. Higher-TMB tumors are likely to continue to have higher mutation rates after sequencing; increased generation of new immunogenic mutations may partially explain their modestly improved responses to therapy.

      We disagree with reviewers' assertion that survivorship bias could explain observed phenomena. If immunogenic mutations that arise during cancer development were eliminated (by purifying selection, i.e. reduced fitness or cellular death) then observed mutations would carry noticeable signatures of purifying selection. On the contrary, cancer genomic data shows incredibly weak signals of purifying selection on non-synonymous mutations (Weghorn and Sunyaev, Nature Genetics 2017), and observed passenger mutations are practically indistinguishable from random in their effect on proteins (McFarland et al PNAS 2013).

      We do agree with the statement that “most mutations … in tumor cells are not immunogenic”. In fact, that’s exactly what our model predicts: (1-p)~90% of mutations in the model are non-immunogenic, while the remaining p~10% are sufficient to trigger an immune response. We clarify this in the text of the paper: “On the contrary, if mutations are more likely to be immunogenic 𝑝~0.1, the probability of response quickly saturates (Figure 5C), making such tumors respond to ICB irrespective of TMB, as we observed in clinical data.”

      Reviewer #2 (Recommendations For The Authors):

      Abstract

      Defining TMB as "number of non-synonymous mutations": while TMB is not consistently defined throughout the literature, it is usually given as a rate rather than a total count, and sometimes synonymous mutations are included. Consider adopting the definition used by the TMB Harmonization Project: "number of somatic mutations per megabase of interrogated genomic sequence.3"

      We thank the reviewer for their comment.

      Be more specific about your findings, so that abstract readers can get some understanding of your proposed explanation for the "immunogenicity of neoantigens and the lack of association between TMB and response."

      We thank the reviewer for their comment. We modified the abstract to explain that the theory we developed expands the neoantigen theory yet can be consistent with the observed lack of association between TMB and response:

      "Second, we develop a model that expands the neoantigen theory and can be consistent with both immunogenicity of neoantigens and the lack of association between TMB and response. Our analysis shows that the use of TMB in clinical practice is not supported by available data and can deprive patients of treatment to which they are likely to respond.”

      Introduction

      Again, consider using a more standard definition of TMB.

      We thank the reviewer for their comment. Our study did not seek to harmonize TMB across the datasets and we thus used the total number of mutations rather than the mutational rate often used for comparison across different datasets.

      Expand the introduction to provide a preview of the purpose and direction of your analysis. The current draft reveals only that the analysis will relate to TMB.

      We expanded the introduction providing the motivation, the approach, and the summary of main findings.

      “Using a biomarker to stratify and prioritize patients for treatment runs a risk of depriving patients who have a chance to respond to a life-saving treatment. High variability of response makes relying on a predictor particularly risky. Hence, we revisit original data that were used to establish correlation between TMB and response. We tested TMB as a predictor of both binary responder/non-responder labels from original clinical studies, as well as continuous survival data. We also investigated whether a TMB threshold could distinguish patients with high and low survival after multiple hypothesis testing. We find that no TMB threshold performs better on the clinical data than on randomized ones.

      We further show that irrespective of the strategy to choose the threshold, even if we were to employ the optimal TMB cutoff, it would still lead to about 25% of responders falling below the treatment prioritization threshold. In addition, we re-examine the pan-cancer association of TMB with response rate to ICB.

      “Finally we revisit the neoantigen theory that was the rationale for using TMB as a predictor of response to immunotherapy. The theory stipulates that non-synonymous mutations can lead to the production of unique antigens (_neo_antigens) that are recognized by the immune system as foreign, triggering the immune response to cancer. The theory further assumes that the more mutations a cancer has, the more likely it triggers the immune system, and the more likely it will benefit from immunotherapy. We develop a simple model that is based on the neoantigen theory and find that it has two regimes. In one regime, the probability of response increases gradually with TMB, as commonly believed. Yet in the other, the probability of response saturates after a few mutations, making a chance to respond independent of TMB. Our analysis of the clinical data is consistent with the latter regime. Thus our model shows that the neoantigen theory is fully consistent with the lack of association between TMB and response.”

      Section: Is TMB associated with response after treatment?

      The claim that after excluding melanoma and some colorectal cancers, there is no relationship between TMB and response rates in pan-cancer studies cites references 12 and 14. In reference 12 (Yarchoan et al.), it is clear from glancing at their Figure 1 that a pan-cancer correlation between TMB and response would remain with these cancer types excluded. This discrepancy requires explanation. "Supplementary text" is cited for this claim, but it was not included in the file that I received.

      We address the pan-cancer correlation in the supplemental text and Figure S3. While the figure was available, we realized the supplemental text was missing from the eLife submission. We apologize for this oversight.

      Plots of survival and TMB do not show "visible correlation": Please strengthen this claim with an appropriate statistical test.

      We expand the figure caption to explain the following:

      “Plots of progression-free survival and TMB for melanoma and lung cancer ICB cohorts show the lack of correlation or of an obvious TMB cutoff. Computing a simple correlation for survival and censored data cannot correctly represent the dependence since patients who are alive live longer than the reported survival, and limiting correlation to patients who are dead would bias the analysis. Thus other survival statistics are used through the paper.”

      Section: Model reconciles neoantigen theory and data

      Page 8: In the probability formula, the C term is not defined. My guess is that it means choose(N, k).

      Please clarify.

      Thank you for pointing this out. Corrected using more conventional notation.

      which is the CDF of a binomial distribution.

      Page 8: Assuming the above, P(immune response) = P(X >= k_crit); where X~Bin(N, p). The formula should be explicitly introduced in terms of the CDF of the binomial distribution to prevent readers from thinking the wheel is being re-invented.

      We thank the reviewer for pointing this out; we modified the equation in the text to make it easier to see this point (see above). We refrain from going further since the CDF of a binomial distribution doesn’t have a closed form and can only be written as the regularized incomplete beta function.
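
      As an illustration only (not the code used in the study), the relation the reviewer describes, P(immune response) = P(X >= k_crit) with X ~ Bin(N, p), can be evaluated as a finite sum. The function name and the example values of `k_crit` and `p` below are assumptions chosen for the sketch.

```python
from math import comb


def response_probability(n_mutations: int, p: float, k_crit: int) -> float:
    """Return P(X >= k_crit) for X ~ Binomial(n_mutations, p).

    n_mutations plays the role of TMB (N), p is the per-mutation probability
    of being immunogenic, and k_crit is the number of immunogenic mutations
    needed to trigger a response. All values used here are illustrative.
    """
    if k_crit <= 0:
        return 1.0
    if k_crit > n_mutations:
        return 0.0
    # Sum the lower tail P(X <= k_crit - 1) and take the complement.
    lower_tail = sum(
        comb(n_mutations, k) * p**k * (1 - p) ** (n_mutations - k)
        for k in range(k_crit)
    )
    return 1.0 - lower_tail


if __name__ == "__main__":
    # Compare a high-immunogenicity regime (p ~ 0.1) with a low one.
    for tmb in (5, 10, 50, 100, 500):
        high = response_probability(tmb, p=0.1, k_crit=1)
        low = response_probability(tmb, p=0.001, k_crit=1)
        print(f"N={tmb:4d}  p=0.1: {high:.3f}  p=0.001: {low:.3f}")
```

      With the larger p, the probability approaches 1 after a few tens of mutations, whereas with the smaller p it rises gradually with N, which corresponds to the two regimes described in the response.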

      Page 9: Missing word in "allowing cancers with as little as mutations to be"

      We thank the reviewer for pointing this out; we modified the text accordingly.

      See comments in public review. In brief, I think a convincing case is made regarding the significance of TMB cut-offs as predictors of survival within cancer types, but frankly this elementary model is not compelling.

      Section: Materials and Methods

      In the manuscript, it is stated that TMB is accepted as reported by data sources. Since most of the comparisons in the manuscript are within-data-source, that is acceptable. However, it should be ensured that TMB measurements are comparable between samples within each source. For example, when TMB is reported as a total mutation count, it can be verified that all samples have the same coverage, or measurement can be converted to mutations per megabase of coverage. In the same vein, if this manuscript's definition of TMB only includes nonsynonymous mutations, it should be confirmed that the TMB reported by data sources excludes synonymous mutations.

      We thank the reviewer for their comment. We leverage total TMB as reported in the original studies claiming an association between TMB and response/survival.

      Figure S2: Instead of writing "the Youden index associated cutoffs is also plotted," it can be stated that the asterisk represents the Youden index cutoff, or a legend can be added that provides this information.

      We thank the reviewer for pointing this out; we modified the text accordingly.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Tiedje et al. investigated the transient impact of indoor residual spraying (IRS) followed by seasonal malaria chemoprevention (SMC) on the Plasmodium falciparum parasite population in a high transmission setting. The parasite population was characterized by sequencing the highly variable DBLα tag as a proxy for var genes, a method known as varcoding. Varcoding presents a unique opportunity due to the extraordinary diversity observed as well as the extremely low overlap of repertoires between parasite strains. The authors also present a new Bayesian approach to estimating individual multiplicity of infection (MOI) from the measured DBLα repertoire, addressing some of the potential shortcomings of the approach that have been previously discussed. The authors also present a new epidemiological endpoint, the so-called "census population size", to evaluate the impact of interventions. This study provides a nice example of how varcoding technology can be leveraged, as well as the importance of using diverse genetic markers for characterizing populations, especially in the context of high transmission. The data are robust and clearly show the transient impact of IRS in a high transmission setting; however, some aspects of the analysis are confusing.

      (1) Approaching MOI estimation with a Bayesian framework is a well-received addition to the varcoding methodology that helps to address the uncertainty associated with not knowing the true repertoire size. It's unfortunate that while the authors clearly explored the ability to estimate the population MOI distribution, they opted to use only MAP estimates. Embracing the Bayesian methodology fully would have been interesting, as the posterior distribution of population MOI could have been better explored. 

      We thank the reviewer for appreciating the extension of _var_coding we present here. We believe the comment on maximum _a posteriori_ (MAP) refers to the way we obtained population-level MOI from the individual MOI estimates. We would like to note that reliance on MAP was only one of the two approaches we described, although we then presented only the MAP results. Having calculated both, we did not observe major differences between the two for this data set. Nonetheless, we revised the manuscript to include the result based on the mixture distribution, which considers all the individual MOI distributions, in Figure supplement 6.
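
      To make the distinction between the two population-level summaries concrete, the following is a minimal sketch under stated assumptions: each host's posterior over MOI is represented as a dictionary mapping MOI values to probabilities, and the function names are hypothetical. It is not the implementation used in the manuscript.

```python
from collections import Counter, defaultdict


def pooled_map_distribution(posteriors):
    """Population MOI distribution from each host's MAP estimate.

    posteriors: list of dicts {moi: posterior probability}, one per host.
    Each host contributes only its single most probable MOI value.
    """
    counts = Counter(max(post, key=post.get) for post in posteriors)
    n_hosts = len(posteriors)
    return {moi: counts[moi] / n_hosts for moi in sorted(counts)}


def mixture_distribution(posteriors):
    """Population MOI distribution as an equal-weight mixture of posteriors.

    Each host contributes its whole posterior, so individual-level
    uncertainty is carried through to the population level.
    """
    totals = defaultdict(float)
    for post in posteriors:
        for moi, prob in post.items():
            totals[moi] += prob
    n_hosts = len(posteriors)
    return {moi: totals[moi] / n_hosts for moi in sorted(totals)}


if __name__ == "__main__":
    # Two hypothetical hosts with made-up posteriors.
    toy = [{1: 0.7, 2: 0.3}, {1: 0.4, 2: 0.6}]
    print(pooled_map_distribution(toy))
    print(mixture_distribution(toy))
```

      When individual posteriors are sharply peaked, the two summaries are close, which is consistent with the authors' observation that they did not differ much for this data set.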

      (2) The "census population size" endpoint has unclear utility. It is defined as the sum of MOI across measured samples, making it sensitive to the total number of samples collected and genotyped. This means that the values are not comparable outside of this study, and are only roughly comparable between strata in the context of prevalence where we understand that approximately the same number of samples were collected. In contrast, mean MOI would be insensitive to differences in sample size, why was this not explored? It's also unclear in what way this is a "census". While the sample size is certainly large, it is nowhere near a complete enumeration of the parasite population in question, as evidenced by the extremely low level of pairwise type sharing in the observed data. 

      We consider the quantity a census in that it is a total enumeration or count of infections in a given population sample and over a given time period. In this sense, it gives us a tangible notion of the size of the parasite population, in an ecological sense, distinct from the formal effective population size used in population genetics. Given the low overlap between var repertoires of parasites (as observed in monoclonal infections), the population size we have calculated translates to a diversity of strains or repertoires. But our focus here is on a measure of population size itself. The distinction between population size in terms of infection counts and effective population size from population genetics has been made before for pathogens (see, for example, Bedford et al. (2011) for the seasonal influenza virus and the measles virus), and it is also clear in the ecological literature for non-pathogen populations (Palstra and Fraser, 2012).

      We completely agree with the dependence of our quantity on sample size. We used it for comparisons across time of samples of the same depth, to describe the large population size characteristic of high transmission which persists across the IRS intervention. Of course, one would like to be able to use this quantity across studies that differ in sampling depth, and the reviewer makes an insightful and useful suggestion. It is true that we can use mean MOI, and indeed there is a simple map between our population size and mean MOI (as we just need to divide or multiply by sample size, respectively) (Table supplement 7). We can go further, as with mean MOI we can presumably extrapolate to the full sample size of the host population, or to the population size of another sample in another location. What is needed for this purpose is a stable mean MOI relative to sample size. We can show that indeed in our study mean MOI is stable in that way, by subsampling our original sample to different depths (Figure supplement 8 in the revised manuscript). We now include a discussion of this point in the revision, which allows an extrapolation of the census population size to the whole population of hosts in the local area.

      We have also clarified the time denominator: Given the typical duration of infection, we expect our population size to be representative of a per-generation measure.
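
      The relationship between census population size, mean MOI, and sampling depth described in this response can be sketched as follows. This is an illustrative example with hypothetical function names and made-up MOI values; in particular, `n_infected_hosts` (the number of infected hosts one extrapolates to) is an assumption of the sketch rather than a quantity defined in the manuscript.

```python
import random


def census_population_size(moi_values):
    """Census population size: the sum of MOI over sampled infected hosts."""
    return sum(moi_values)


def mean_moi(moi_values):
    """Mean MOI: census population size divided by sample size."""
    return census_population_size(moi_values) / len(moi_values)


def extrapolated_census(moi_values, n_infected_hosts):
    """Scale mean MOI to a different number of infected hosts."""
    return mean_moi(moi_values) * n_infected_hosts


def subsampled_mean_moi(moi_values, depth, n_replicates=200, seed=0):
    """Mean MOI of random subsamples (without replacement) of size depth."""
    rng = random.Random(seed)
    return [mean_moi(rng.sample(moi_values, depth)) for _ in range(n_replicates)]


if __name__ == "__main__":
    # Made-up individual MOI estimates for 1,000 hosts.
    rng = random.Random(1)
    moi = [rng.choice([1, 1, 1, 2, 2, 3, 4, 6]) for _ in range(1000)]
    print("census:", census_population_size(moi), "mean MOI:", mean_moi(moi))
    for depth in (50, 200, 800):
        means = subsampled_mean_moi(moi, depth)
        print(depth, round(min(means), 2), round(max(means), 2))
```

      If the spread of subsampled means narrows around the full-sample mean as depth increases, mean MOI can be treated as stable with respect to sampling effort, which is the condition the response identifies for extrapolation.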

      (3) The extraordinary diversity of DBLα presents challenges to analyzing the data. The authors explore the variability in repertoire richness and frequency over the course of the study, noting that richness rapidly declined following IRS and later rebounded, while the frequency of rare types increased, and then later declined back to baseline levels. The authors attribute this to fundamental changes in population structure. While there may have been some changes to the population, the observed differences in richness as well as frequency before and after IRS may also be compatible with simply sampling fewer cases, and thus fewer DBLα sequences. The shift back to frequency and richness that is similar to pre-IRS also coincides with a similar total number of samples collected. The authors explore this to some degree with their survival analysis, demonstrating that a substantial number of rare sequences did not persist between timepoints and that rarer sequences had a higher probability of dropping out. This might also be explained by the extreme stochasticity of the highly diverse DBLα, especially for rare sequences that are observed only once, rather than any fundamental shifts in the population structure.

      We thank the reviewer for raising this question, which led us to consider whether the change in the number of DBLα types over the course of the study (and intervention) follows from simply sampling fewer P. falciparum cases. We interpreted this question as basically meaning that one can predict the former from the latter in a simple way, and that therefore, tracking the changes in DBLα type diversity would be unnecessary. A simple map would be, for example, a linear relationship (a given proportion of DBLα types lost given genomes lost), and even more trivially, a linear loss with a slope of one (same proportion). Note, however, that for such expectations, one needs to rely on some knowledge of strain structure and gene composition. In particular, we would need to assume a complete lack of overlap and no gene repeats in a given genome. We have previously shown that immune selection leads to selection for minimum overlap and distinct genes in repertoires at high transmission (see, for example, He et al. (2018) for theoretical and empirical evidence of both patterns). Also, since the size of the gene pool is very large, even random repertoires would lead to limited overlap (even though the empirical overlap is even smaller than that expected at random (Day et al., 2017)). Despite these considerations, we cannot a priori assume a pattern of complete non-overlap and distinct genes, and ignore plausible complexities introduced by the gene frequency distribution.

      To examine this insightful question, we simulated the loss of a given proportion of genomes from baseline in 2012 and examined the resulting loss of DBLα types. We specifically accumulated the loss of infections in individuals until it reached a given proportion (we can do this on the basis of the estimated individual MOI values). We repeated this procedure 500 times for each proportion, as the random selection of individual infections to be removed introduces some variation. Author response image 1 below shows that the relationship is nonlinear, and that one quantity is not a simple proportion of the other. For example, the loss of half the genomes does not result in the loss of half the DBLα types.

      Author response image 1.

      Non-linear relationship between the loss of DBLα types and the loss of a given proportion of genomes. The graph shows that the removal of parasite genomes from the population through intervention does not lead to the loss of the same proportion of DBLα types, as the initial removal of genomes mostly involves the loss of rare DBLα types, whereas common DBLα types persist until a high proportion of genomes is lost. The survey data (pink dots) used for this subsampling analysis were sampled at the end of the wet/high transmission season in Oct 2012 from Bongo District in northern Ghana. We used the Bayesian formulation of the _var_coding method proposed in this work to calculate the multiplicity of infection of each isolate to further obtain the total number of genomes. The randomized surveys (black dots) were obtained using the “curveball algorithm” (Strona et al., 2014), which preserves isolate lengths and the type frequency distribution.
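
      The subsampling analysis described above can be sketched roughly as follows. This is a simplified illustration, not the authors' code: it assumes that whole isolates are removed in random order, with each removed isolate contributing its estimated MOI to the count of genomes lost, and the data structure and function names are hypothetical.

```python
import random


def simulate_type_loss(isolates, proportion, rng):
    """Fraction of DBLα types lost after removing a proportion of genomes.

    isolates: list of (moi, set_of_types) pairs, one per infected host.
    Isolates are removed in random order until the summed MOI of the
    removed isolates reaches proportion * total genomes (an assumption
    about how the removal is accumulated).
    """
    total_genomes = sum(moi for moi, _ in isolates)
    all_types = set().union(*(types for _, types in isolates))
    target = proportion * total_genomes

    order = list(range(len(isolates)))
    rng.shuffle(order)

    removed_genomes = 0
    remaining_types = set()
    for idx in order:
        moi, types = isolates[idx]
        if removed_genomes < target:
            removed_genomes += moi  # this host's infections are removed
        else:
            remaining_types |= types  # this host's types persist
    return 1 - len(remaining_types) / len(all_types)


def mean_type_loss(isolates, proportion, n_replicates=500, seed=0):
    """Average the simulated type loss over repeated random removals."""
    rng = random.Random(seed)
    losses = [simulate_type_loss(isolates, proportion, rng) for _ in range(n_replicates)]
    return sum(losses) / n_replicates


if __name__ == "__main__":
    # Toy data: two common types shared by every host plus one rare type each.
    toy = [(1, {"common1", "common2", f"rare{i}"}) for i in range(20)]
    for prop in (0.25, 0.5, 0.75, 1.0):
        print(prop, round(mean_type_loss(toy, prop, n_replicates=50), 3))
```

      In the toy data, common types persist until every host carrying them is removed, so the proportion of types lost stays below the proportion of genomes lost until removal is nearly complete.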

      We also investigated whether the resulting pattern changed significantly if we randomized the composition of the isolates. We performed such randomization with the “curveball algorithm” (Strona et al., 2014). This algorithm randomizes the presence-absence matrix whose rows correspond to the isolates and whose columns correspond to the different DBLα types; importantly, it preserves the DBLα type frequency and the length of isolates. We generated 500 randomizations and repeated the simulated loss of genomes as above. The data presented in Author response image 1 above show that the pattern is similar to that obtained for the empirical data presented in this study in Ghana. We interpret this to mean that the number of genes is so large that the reduced overlap relative to random due to immune selection (see Day et al., 2017) does not play a key role in this specific pattern.
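
      For readers unfamiliar with the curveball approach, the following is a minimal sketch of the pair-trading idea, with each isolate stored as a set of DBLα types (one row of the presence-absence matrix). The function names and the number of trades are assumptions for illustration, and this is not the implementation used for the analysis.

```python
import random
from collections import Counter


def curveball(isolates, n_trades, seed=0):
    """Randomize isolate composition while preserving margins.

    isolates: list of sets of type labels (rows of a presence-absence matrix).
    Each trade picks two isolates, pools the types found in only one of
    them, and redistributes that pool at random so that both isolates keep
    their original sizes. Isolate lengths and per-type frequencies are
    therefore unchanged.
    """
    rng = random.Random(seed)
    rows = [set(r) for r in isolates]
    for _ in range(n_trades):
        i, j = rng.sample(range(len(rows)), 2)
        shared = rows[i] & rows[j]
        only_i = rows[i] - shared
        only_j = rows[j] - shared
        pool = sorted(only_i | only_j)  # sorted so results depend only on the seed
        rng.shuffle(pool)
        rows[i] = shared | set(pool[: len(only_i)])
        rows[j] = shared | set(pool[len(only_i):])
    return rows


def type_frequencies(isolates):
    """Number of isolates in which each type occurs."""
    return Counter(t for row in isolates for t in row)


if __name__ == "__main__":
    toy = [{"a", "b", "c"}, {"b", "d"}, {"a", "e", "f", "g"}]
    randomized = curveball(toy, n_trades=100, seed=42)
    print([len(r) for r in randomized])
    print(type_frequencies(randomized) == type_frequencies(toy))
```

      Because each trade only exchanges types that are unique to one of the two chosen isolates, no isolate can gain a duplicate type, and both row and column totals stay fixed.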

      Reviewer #2 (Public Review):  

      In this manuscript, Tiedje and colleagues longitudinally track changes in parasite numbers across four time points as a way of assessing the effect of malaria control interventions in Ghana. Some of the study results have been reported previously, and in this publication, the authors focus on age-stratification of the results. Malaria prevalence was lower in all age groups after IRS. Follow-up with SMC, however, maintained lower parasite prevalence in the targeted age group but not the population as a whole. Additionally, they observe that diversity measures rebound more slowly than prevalence measures. Overall, I found these results clear, convincing, and well-presented. They add to a growing literature that demonstrates the relevance of asymptomatic reservoirs. There is growing interest in developing an expanded toolkit for genomic epidemiology in malaria, and detecting changes in transmission intensity is one major application. As the authors summarize, there is no one-size-fits-all approach, and the Bayesian MOIvar estimate developed here has the potential to complement currently used methods. I find its extension to a calculation of absolute parasite numbers appealing as this could serve as both a conceptually straightforward and biologically meaningful metric. However, I am not fully convinced the current implementation will be applied meaningfully across additional studies.

      (1) I find the term "census population size" problematic as the groups being analyzed (hosts grouped by age at a single time point) do not delineate distinct parasite populations. Separate parasite lineages are not moving through time within these host bins. Rather, there is a single parasite population that is stochastically divided across hosts at each time point. I find this distinction important for interpreting the results and remaining mindful that the 2,000 samples at each time point comprise a subsample of the true population. Instead of "census population size", I suggest simplifying it to "census count" or "parasite lineage count".  It would be fascinating to use the obtained results to model absolute parasite numbers at the whole population level (taking into account, for instance, the age structure of the population), and I do hope this group takes that on at some point even if it remains outside the scope of this paper. Such work could enable calculations of absolute---rather than relative---fitness and help us further understand parasite distributions across hosts.

      Lineages moving exclusively through a given type of host or “patch” are not a necessary requirement for enumerating the total number of infections in such a subset. It is true that what we have is a single parasite population, but we are enumerating for the season the respective size in host classes (children and adults). This is akin to enumerating subsets of a population in ecological settings where one has multiple habitat patches, with individuals able to move across patches.

      Remaining mindful that the count is relative to sample size is an important point. Please see our response to comment (2) of reviewer 1, also for the choice of terminology. We prefer not to adopt “census count” as a census in our mind is a count, and we are not clear on the concept of lineage for these highly recombinant parasites.  Also, census population size has been adopted already in the literature for both pathogens and non-pathogens, to make a distinction with the notion of effective population size in population genetics (see our response to reviewer 1) and is consistent with our usage as outlined in the introduction. 

      Thank you for the comment on an absolute number which would extrapolate to the whole host population.  Please see again our response to comment (2) of reviewer 1, on how we can use mean MOI for this purpose once the sampling is sufficient for this quantity to become constant/stable with sampling effort.

      (2) I'm uncertain how to contextualize the diversity results without taking into account the total number of samples analyzed in each group. Because of this, I would like a further explanation as to why the authors consider absolute parasite count more relevant than the combined MOI distribution itself (which would have sample count as a denominator). It seems to me that the "per host" component is needed to compare across age groups and time points---let alone different studies.

      Again, thank you for the insightful comment. We provide this number as a separate quantity and not a distribution, although it is clearly related to the mean MOI of such distribution. It gives a tangible sense for the actual infection count (different from prevalence) from the perspective of the parasite population in the ecological sense. The “per host” notion which enables an extrapolation to any host population size for the purpose of a complete count, or for comparison with another study site, has been discussed in the above responses for reviewer 1 and now in the revision of the discussion.

      (3) Thinking about the applicability of this approach to other studies, I would be interested in a larger treatment of how overlapping DBLα repertoires would impact MOIvar estimates. Is there a definable upper bound above which the method is unreliable? Alternatively, can repertoire overlap be incorporated into the MOI estimator? 

      This is a very good point and one we now discuss further in our revision. There is no predefined upper bound one can present a priori. Intuitively, the approach to estimate MOI would appear to break down as overlap moves away from extremely low values, and therefore for locations with low transmission intensity. Interestingly, we have observed that this is not the case in our paper (Labbé et al., 2023), where we used model simulations in a gradient of three transmission intensities, from high to low values. The original _var_coding method performed well across the gradient. This robustness may arise from a nonlinear and fast transition from low to high overlap that is accompanied by MOI changing rapidly from primarily multiclonal (MOI > 1) to monoclonal (MOI = 1). This matter clearly needs to be investigated further, including ways to extend the estimation to explicitly include the distribution of overlap.

      Smaller comments:

      - Figure 1 provides confidence intervals for the prevalence estimates, but these aren't carried through on the other plots (and Figure 5 has lost CIs for both metrics). The relationship between prevalence and diversity is one of the interesting points in this paper, and it would be helpful to have CIs for both metrics when they are directly compared. 

      Based on the reviewer’s advice we have revised both Figure 4 and Figure 5, to include the missing uncertainty intervals. The specific approach for each quantity is described in the corresponding caption.

      Reviewer #3 (Public Review): 

      Summary: 

      The manuscript coins a term "the census population size" which they define from the diversity of malaria parasites observed in the human community. They use it to explore changes in parasite diversity in more than 2000 people in Ghana following different control interventions. 

      Strengths: 

      This is a good demonstration of how genetic information can be used to augment routinely recorded epidemiological and entomological data to understand the dynamics of malaria and how it is controlled. The genetic information does add to our understanding, though by how much is currently unclear (in this setting it says the same thing as age-stratified parasite prevalence), and its relevance moving forward will depend on the practicalities and cost of the data collection and analysis. Nevertheless, this is a great dataset with good analysis and a good attempt to understand more about what is going on in the parasite population. 

      Census population size is complementary to parasite prevalence: the former gives a measure of the “parasite population size”, and the latter describes the “proportion of infected hosts”. The reason we see similar trends for the “genetic information” (i.e., census population size) and “age-specific parasite prevalence” is that we identify all samples for _var_coding based on microscopy (i.e., all microscopy-positive _P. falciparum_ isolates). But what is more relevant here is the relative percentage change in parasite prevalence and census population size following the IRS intervention. To make this point clearer in the revised manuscript we have updated Figure 4 and included additional panels plotting this percentage change from the 2012 baseline, for both census population size and prevalence (Figure 4EF). Overall, we see a greater percentage change in 2014 (and 2015), relative to the 2012 baseline, for census parasite population size vs. parasite prevalence (Figure 4EF) as a consequence of the significant changes in distributions of MOI following the IRS intervention (Figure 3). As discussed in the Results, following the deployment of IRS in 2014, census population size decreased by 72.5% relative to the 2012 baseline survey (pre-IRS), whereas parasite prevalence only decreased by 54.5%.

      With respect to the reviewer’s comment on “practicalities and cost”, _var_coding has been used to successfully amplify _P. falciparum_ DNA collected as DBS that have been stored for more than 5 years from both clinical and lower-density asymptomatic infections, without the additional step and added cost of sWGA ($8 to $32 USD per isolate; for costing estimates see LaVerriere et al., 2022; Tessema et al., 2020), which is currently required by other molecular surveillance methods (Jacob et al., 2021; LaVerriere et al., 2022; Oyola et al., 2016). _Var_coding involves a single PCR per isolate using degenerate primers, where a large number of isolates can be multiplexed into a single pool for amplicon sequencing. Thus, the overall costs for incorporating molecular surveillance with _var_coding are mainly driven by the number of PCRs/clean-ups, the number of samples indexed per sequencing run, and the NGS technology used (discussed in more detail in our publication, Ghansah et al., 2023). Previous work has shown that _var_coding can be used both locally and globally for molecular surveillance, without the need to be customized or updated; thus it can be fairly easily deployed in malaria-endemic regions (Chen et al., 2011; Day et al., 2017; Rougeron et al., 2017; Ruybal-Pesántez et al., 2022, 2021; Tonkin-Hill et al., 2021).

      Weaknesses: 

      Overall the manuscript is well-written and generally comprehensively explained. Some terms could be clarified to help the reader and I had some issues with a section of the methods and some of the more definitive statements given the evidence supporting them. 

      Thank you for the overall positive assessment. However, it is impossible to address the “issues with a section of the methods” and “some of the more definitive statements given the evidence supporting them” without an explicit indication of which methods and statements the reviewer is referring to. Hopefully, the answers to the detailed comments and questions of reviewers 1 and 2 address any methodological concerns (i.e., in the Materials and Methods and Results). On the issue of “definitive statements”, we are unable to respond without further information.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      Line 273: there is a reference to a figure which supports the empirical distribution of repertoire given MOI = 1, but the figure does not appear to exist.

      We have now included the correct figure for the repertoire size distribution as Figure supplement 3 (previously published in Labbé et al., 2023). This figure was accidentally omitted when the manuscript was submitted for review; we thank the reviewer for bringing this to our attention.

      Line 299: while this likely makes little difference, an insignificant result from a Kolmogorov-Smirnov test doesn't tell you if the distributions are the same, it only means there is not enough evidence to determine they are different (i.e. fail to reject the null). Also, what does the "mean MOI difference" column in supplementary table 3 mean? 

      The mean MOI difference is the difference in the mean value between the pairwise comparison of the true population-level MOI distribution, that of the population-level MOI estimates from either pooling the maximum a posteriori (MAP) estimates per individual host or the mixture distribution, or that of the population-level MOI estimates from different prior choices. This is now clarified as requested in Table supplements 3-6.

      Figure 4: how are the confidence intervals for the estimated number of var repertoires calculated? Also should include horizontal error bars for prevalence measures.

      The confidence intervals were calculated using a bootstrap approach. We resampled 10,000 replicates from the original population-level MOI distribution with replacement, each replicate being the same size as the original sample. We then derived the 95% CI from the distribution of the mean MOI across those resampled replicates. This is now clarified as requested in the Figure 4 caption (as well as the Table supplement 7 footnotes). In addition, we have updated Figure 4AB to include the 95% CI for all measures for clarity.
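The bootstrap procedure described in this response can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `bootstrap_mean_ci`, the fixed seed, the use of the percentile method to read off the interval, and the example MOI values are all assumptions made for the sketch.

```python
import random
import statistics


def bootstrap_mean_ci(moi_values, n_replicates=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of a sample.

    Assumption: the interval is read off the empirical percentiles of the
    resampled means; the response does not say which bootstrap CI variant
    was used.
    """
    rng = random.Random(seed)
    n = len(moi_values)
    means = []
    for _ in range(n_replicates):
        # Resample with replacement; each replicate has the original sample size.
        replicate = rng.choices(moi_values, k=n)
        means.append(statistics.fmean(replicate))
    means.sort()
    lower = means[int((alpha / 2) * n_replicates)]
    upper = means[int((1 - alpha / 2) * n_replicates) - 1]
    return lower, upper


# Hypothetical MOI values, for illustration only (not study data).
example_moi = [1, 1, 2, 2, 3, 3, 4, 5, 6, 8]
low, high = bootstrap_mean_ci(example_moi)
print(f"mean MOI = {statistics.fmean(example_moi):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Fixing the seed makes the interval reproducible; with real data the input would be the per-host MOI estimates for one survey or age group.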

      Reviewer #2 (Recommendations For The Authors): 

      -  I would like to see a plot like Supplemental Figure 8 for the upsA DBLα repertoire size. 

      The upsA repertoire size for each survey and by age group has now been provided as requested in Figure supplement 5AB. 

      -  Supplemental Table 2 is cut off in the pdf. 

      We have now resolved this issue so that the Table supplement 2 is no longer cut off.  

      Reviewer #3 (Recommendations For The Authors): 

      The manuscript terms the phrase "census population size". To me, the census is all about the number of individuals, not necessarily their diversity. I appreciate that there is no simple term for this, and I imagine the authors have considered many alternatives, but could it be clearer to say the "genetic census population size"? For example, I found the short title not particularly descriptive "Impact of IRS and SMC on census population size", which certainly didn't make me think of parasite diversity.

      Please see our response to comment (2) of reviewer 1. We prefer not to add “genetic” to the phrase as the distinction from effective population size from population genetics is important, and the quantity we are after is an ecological one. 

      The authors do not currently say much about the potential biases in the genetic data and how this might influence results. It seems likely that because (i) patients with sub-microscopic parasitaemia were not sampled and (ii) because a moderate number of (likely low density) samples failed to generate genetic data, that the observed MOI is an overestimate. I'd be interested to hear the authors' thoughts about how this could be overcome or taken into account in the future. 

      We thank the reviewer for this comment and agree that this is an interesting area for further consideration. However, based on research from the Day Lab that is currently under review (Tan et al. 2024, under review), the estimated MOI using the Bayesian approach is likely not an “overestimate” but rather an “underestimate”. In this research by Tan et al. (2024), isolate MOI was estimated and compared using different initial whole blood volumes (e.g., 1, 10, 50, 100 μL) for the gDNA extraction. Using _var_coding and comparing these different volumes, it was found that MOI was significantly “underestimated” when small blood volumes were used for the gDNA extraction, i.e., there was a ~3-fold increase in median MOI between 1 μL and 100 μL of blood. Ultimately these findings will allow us to make computational corrections so that more accurate estimates of MOI can be obtained from the DBS in the future.

      The authors do not make much of LLIN use and for me, this can explain some of the trends. The first survey was conducted soon after a mass distribution whereas the last was done at least a year after (when fewer people would have been using the nets which are older and less effective). We have also seen a rise in pyrethroid resistance in the mosquito populations of the area which could further diminish the LLIN activity. This difference in LLIN efficacy between the first and last survey could explain similar prevalence, yet lower diversity (in Figures 4B/5). However, it also might mean that statements such as Line 478 "This is indicative of a loss of immunity during IRS which may relate to the observed loss of var richness, especially the many rare types" need to be tapered as the higher prevalence observed in this age group could be caused by lower LLIN efficacy at the time of the last survey, not loss of immunity (though both could be true).  

      We thank the reviewer for this question and agree that (i) LLIN usage and (ii) pyrethroid resistance are important factors to consider. 

      (i) Over the course of this study self-reported LLIN usage the previous night remained high across all age groups in each of the surveys (≥ 83.5%), in fact more participants reported sleeping under an LLIN in 2017 (96.8%) following the discontinuation of IRS compared to the 2012 baseline survey (89.1%). This increase in LLIN usage in 2017 is likely a result of several factors including a rebound in the local vector population making LLINs necessary again, increased community education and/or awareness on the importance of using LLINs, among others. Information on the LLINs (i.e., PermaNet 2.0, Olyset, or DawaPlus 2.0) distributed and participant reported usage the previous night has now been included in the Materials and Methods as requested by the reviewer.

      (ii) As to the reviewer’s question on increased pyrethroid resistance in Ghana over the study period, research undertaken by our entomology collaborators (Noguchi Memorial Institute for Medical Research: Profs. S. Dadzie and M. Appawu; and Navrongo Health Research Centre: Dr. V. Asoala) has shown that pyrethroid resistance is a major problem across the country, including the Upper East Region. Preliminary studies in Bongo District (2013 - 2015) were undertaken to monitor for mutations in the voltage-gated sodium channel gene that have been associated with knockdown resistance to pyrethroids and DDT in West Africa (kdr-w). Through this analysis the homozygote resistance kdr-w allele (RR) was found in 90% of An. gambiae s.s. samples tested from Bongo, providing evidence of high pyrethroid resistance in Bongo District dating back to 2013, i.e., prior to the IRS intervention (S. Dadzie, M. Appawu, personal communication). Although we do not have data on kdr-w in Bongo District from 2017 (i.e., post-IRS), we can hypothesize that pyrethroid resistance likely did not decline in the area, given the widespread deployment and use of LLINs.

      Thus, given that (i) self-reported LLIN usage remained high in all surveys (≥ 83.5%), and that (ii) there was evidence of high pyrethroid resistance in 2013 (i.e., kdr-w (RR) ~90%), the rebound in prevalence observed for the older age groups (i.e., adolescents and adults) in 2017 is best explained by a loss of immunity.

      I must confess I got a little lost with some of the Bayesian model section methods and the figure supplements. Line 272 reads "The measurement error is simply the repertoire size distribution, that is, the distribution of the number of non-upsA DBLα types sequenced given MOI = 1, which is empirically available (Figure supplement 3)." This does not appear correct as this figure is measuring kl divergence. If this is not a mistake in graph ordering please consider explaining the rationale for why this graph is being used to justify your point. 

      We have now included the correct figure for the repertoire size distribution as Figure supplement 3 (previously published in Labbé et al., 2023). This figure was accidentally omitted when the manuscript was submitted for review; we thank the reviewer for bringing this matter to our attention. We hope that the inclusion of this figure, as well as a more detailed description of the Bayesian approach, helps to make this section of the Materials and Methods clearer for the reader.

      I was somewhat surprised that the choice of prior for estimating the MOI distribution at the population level did not make much difference. To me, the negative binomial distribution makes much more sense. I was left wondering, as you are only measuring MOI in positive individuals, whether you used zero truncated Poisson and zero truncated negative binomial distributions, and if not, whether this was a cause of a lack of difference between uniform and other priors. 

      Thank you for the relevant question. We have indeed considered different priors and the robustness of our estimates to this choice, and have now better described this in the text. We focused on individuals who had a confirmed microscopic asymptomatic P. falciparum infection for our MOI estimation, as median P. falciparum densities were overall low in this population during each survey (i.e., median ≤ 520 parasites/µL, see Table supplement 1). Thus, we used either a uniform prior excluding zero or a zero-truncated negative binomial distribution when exploring the impact of priors on the final population-level MOI distribution. A uniform prior and a zero-truncated negative binomial distribution with parameters within the range typical of high-transmission endemic regions (higher mean MOI with tails around higher MOI values) produce similar MOI estimates at both the individual and population level. However, when the parameter range of the zero-truncated negative binomial is set to that of low-transmission endemic regions, where the empirical MOI distribution centers around monoclonal infections with the majority of MOI = 1 or 2 (mean MOI ≈ 1.5, no tail around higher MOI values), the final population-level MOI distribution does deviate more from that obtained under the aforementioned prior and parameter choices. The final individual- and population-level MOI estimates are not sensitive to the specifics of the prior MOI distribution as long as this distribution captures the tail around higher MOI values with above-zero probability.
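To make the prior comparison above concrete, the toy sketch below builds a uniform prior excluding zero and two zero-truncated negative binomial priors, then combines each with the same likelihood to obtain a posterior over MOI. Everything numeric here is an assumption for illustration: the MOI cap of 20, the negative binomial parameters `r` and `p`, and the bell-shaped likelihood centred on MOI = 6 are hypothetical and are not taken from the manuscript, whose actual measurement model is the empirical repertoire size distribution.

```python
import math


def nbinom_pmf(k, r, p):
    """Negative binomial pmf: probability of k failures before the r-th success."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)


def zero_truncated_nbinom_prior(max_moi, r, p):
    """Prior over MOI = 1..max_moi: NB pmf with zero excluded, renormalized."""
    weights = [nbinom_pmf(k, r, p) for k in range(1, max_moi + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def uniform_prior(max_moi):
    """Uniform prior over MOI = 1..max_moi (zero excluded)."""
    return [1.0 / max_moi] * max_moi


def posterior(prior, likelihood):
    """Bayes' rule on a discrete grid: normalize prior times likelihood."""
    unnormalized = [pr * lk for pr, lk in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def map_estimate(distribution):
    """Maximum a posteriori MOI (index 0 corresponds to MOI = 1)."""
    return max(range(len(distribution)), key=distribution.__getitem__) + 1


MAX_MOI = 20  # hypothetical cap on MOI
# Hypothetical likelihood of the observed data for each MOI, peaked at MOI = 6.
toy_likelihood = [math.exp(-0.5 * (m - 6) ** 2) for m in range(1, MAX_MOI + 1)]

priors = {
    "uniform": uniform_prior(MAX_MOI),
    "ZTNB, heavier tail (r=3, p=0.4)": zero_truncated_nbinom_prior(MAX_MOI, 3, 0.4),
    "ZTNB, mass near MOI 1-2 (r=1, p=0.7)": zero_truncated_nbinom_prior(MAX_MOI, 1, 0.7),
}
for name, prior in priors.items():
    print(name, "-> MAP MOI =", map_estimate(posterior(prior, toy_likelihood)))
```

In this toy setting the uniform prior and the heavier-tailed prior agree on the MAP estimate, while the prior concentrated on MOI = 1 or 2 pulls it downward, which mirrors the qualitative behaviour described in the response.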

      The high MOI in children <5yrs in 2017 (immediately after SMC) is very interesting. Any thoughts on how/why? 

      This result indicates that although the prevalence of asymptomatic P. falciparum infections remained significantly lower for the younger children targeted by SMC in 2017 compared to 2012, they still carried multiclonal infections, as the reviewer has pointed out (Figure 3B). Importantly, this upward shift in the MOI distributions (and median MOI) was observed in all age groups in 2017, not just the younger children, and provides evidence that transmission intensity in Bongo had rebounded in 2017, 32 months after the discontinuation of IRS. This increase in MOI for younger children may at first glance seem surprising, but instead likely shows the limitations of SMC to clear and/or suppress the establishment of newly acquired infections, particularly at the end of the transmission season following the final cycle of SMC (i.e., end of September 2017 in Bongo District; NMEP/GHS, personal communication) when the post-treatment prophylactic effects of SMC would have waned (Chotsiri et al., 2022).

      Line 521 in the penultimate paragraph says "we have analysed only low density...." should this not be "moderate" density, as low density infections might not be detected? The density range itself is not reported in the manuscript so could be added. 

      In Table supplement 1 we have provided the median, including the inter-quartile range, across each survey by age group. For the revision we have now provided the density min-max range, as requested by the reviewer. Finally, we have revised the statement in the discussion so that it now reads “….we have analysed low- to moderate-density, chronic asymptomatic infections (see Table supplement 1)……”.   

      Data availability - From the text the full breakdown of the epidemiological survey does not appear to be available, just a summary of defined age bounds in the SI. Provision of these data (with associated covariates such as parasite density and host characteristics linked to genetic samples) would facilitate more in-depth secondary analyses. 

      To address this question, we have updated the “Data availability statement” section with the following statement: “All data associated with this study are available in the main text, the Supporting Information, or upon reasonable request for research purposes to the corresponding author, Prof. Karen Day (karen.day@unimelb.edu.au).”  

      REFERENCES

      Bedford T, Cobey S, Pascual M. 2011. Strength and tempo of selection revealed in viral gene genealogies. BMC Evol Biol 11. doi:10.1186/1471-2148-11-220

      Chen DS, Barry AE, Leliwa-Sytek A, Smith T-AA, Peterson I, Brown SM, Migot-Nabias F, Deloron P, Kortok MM, Marsh K, Daily JP, Ndiaye D, Sarr O, Mboup S, Day KP. 2011. A molecular epidemiological study of var gene diversity to characterize the reservoir of Plasmodium falciparum in humans in Africa. PLoS One 6:e16629. doi:10.1371/journal.pone.0016629

      Chotsiri P, White NJ, Tarning J. 2022. Pharmacokinetic considerations in seasonal malaria chemoprevention. Trends Parasitol. doi:10.1016/j.pt.2022.05.003

      Day KP, Artzy-Randrup Y, Tiedje KE, Rougeron V, Chen DS, Rask TS, Rorick MM, Migot-Nabias F, Deloron P, Luty AJF, Pascual M. 2017. Evidence of Strain Structure in Plasmodium falciparum Var Gene Repertoires in Children from Gabon, West Africa. PNAS 114:E4103–E4111. doi:10.1073/pnas.1613018114

      Ghansah A, Tiedje KE, Argyropoulos DC, Onwona CO, Deed SL, Labbé F, Oduro AR, Koram KA, Pascual M, Day KP. 2023. Comparison of molecular surveillance methods to assess changes in the population genetics of Plasmodium falciparum in high transmission. Frontiers in Parasitology 2:1067966. doi:10.3389/fpara.2023.1067966

      He Q, Pilosof S, Tiedje KE, Ruybal-Pesántez S, Artzy-Randrup Y, Baskerville EB, Day KP, Pascual M. 2018. Networks of genetic similarity reveal non-neutral processes shape strain structure in Plasmodium falciparum. Nat Commun 9:1817. doi:10.1038/s41467-018-04219-3

      Jacob CG, Thuy-nhien N, Mayxay M, Maude RJ, Quang HH, Hongvanthong B, Park N, Goodwin S, Ringwald P, Chindavongsa K, Newton P, Ashley E. 2021. Genetic surveillance in the Greater Mekong subregion and South Asia to support malaria control and elimination. Elife 10:1–22.

      Labbé F, He Q, Zhan Q, Tiedje KE, Argyropoulos DC, Tan MH, Ghansah A, Day KP, Pascual M. 2023. Neutral vs. non-neutral genetic footprints of Plasmodium falciparum multiclonal infections. PLoS Comput Biol 19:e1010816. doi:10.1101/2022.06.27.497801

      LaVerriere E, Schwabl P, Carrasquilla M, Taylor AR, Johnson ZM, Shieh M, Panchal R, Straub TJ, Kuzma R, Watson S, Buckee CO, Andrade CM, Portugal S, Crompton PD, Traore B, Rayner JC, Corredor V, James K, Cox H, Early AM, MacInnis BL, Neafsey DE. 2022. Design and implementation of multiplexed amplicon sequencing panels to serve genomic epidemiology of infectious disease: A malaria case study. Mol Ecol Resour 2285–2303. doi:10.1111/1755-0998.13622

      Oyola SO, Ariani CV, Hamilton WL, Kekre M, Amenga-Etego LN, Ghansah A, Rutledge GG, Redmond S, Manske M, Jyothi D, Jacob CG, Otto TD, Rockett K, Newbold CI, Berriman M, Kwiatkowski DP. 2016. Whole genome sequencing of Plasmodium falciparum from dried blood spots using selective whole genome amplification. Malar J 15:1–12. doi:10.1186/s12936-016-1641-7

      Palstra FP, Fraser DJ. 2012. Effective/census population size ratio estimation: A compendium and appraisal. Ecol Evol 2:2357–2365. doi:10.1002/ece3.329

      Rougeron V, Tiedje KE, Chen DS, Rask TS, Gamboa D, Maestre A, Musset L, Legrand E, Noya O, Yalcindag E, Renaud F, Prugnolle F, Day KP. 2017. Evolutionary structure of Plasmodium falciparum major variant surface antigen genes in South America: Implications for epidemic transmission and surveillance. Ecol Evol 7:9376–9390. doi:10.1002/ece3.3425

      Ruybal-Pesántez S, Sáenz FE, Deed S, Johnson EK, Larremore DB, Vera-Arias CA, Tiedje KE, Day KP. 2021. Clinical malaria incidence following an outbreak in Ecuador was predominantly associated with Plasmodium falciparum with recombinant variant antigen gene repertoires. medRxiv.

      Ruybal-Pesántez S, Tiedje KE, Pilosof S, Tonkin-Hill G, He Q, Rask TS, Amenga-Etego L, Oduro AR, Koram KA, Pascual M, Day KP. 2022. Age-specific patterns of DBLa var diversity can explain why residents of high malaria transmission areas remain susceptible to Plasmodium falciparum blood stage infection throughout life. Int J Parasitol 20:721–731.

      Strona G, Nappo D, Boccacci F, Fagorini S, San-Miguel-Ayanz J. 2014. A fast and unbiased procedure to randomize ecological binary matrices with fixed row and column totals. Nat Commun 5. doi:10.1038/ncomms5114

      Tessema SK, Hathaway NJ, Teyssier NB, Murphy M, Chen A, Aydemir O, Duarte EM, Simone W, Colborn J, Saute F, Crawford E, Aide P, Bailey JA, Greenhouse B. 2020. Sensitive, highly multiplexed sequencing of microhaplotypes from the Plasmodium falciparum heterozygome. Journal of Infectious Diseases 225:1227–1237.

      Tonkin-Hill G, Ruybal-Pesántez S, Tiedje KE, Rougeron V, Duffy MF, Zakeri S, Pumpaibool T, Harnyuttanakorn P, Branch OH, Ruiz-Mesía L, Rask TS, Prugnolle F, Papenfuss AT, Chan Y, Day KP. 2021. Evolutionary analyses of the major variant surface antigen-encoding genes reveal population structure of Plasmodium falciparum within and between continents. PLoS Genet 7:e1009269. doi:10.1371/journal.pgen.1009269

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      This study makes an interesting finding: a polyunsaturated fatty acid, Lin-Glycine, increases the conductance of KCNQ1/KCNE1 channels by stabilizing a state of the selectivity filter that allows K+ conduction. The stabilization of a conducting state appears well supported by single-channel analysis, though some method details are missing. The linkage to PUFA action through the selectivity filter is supported by the disruption of PUFA effects by mutation of residues which change conformation in two KCNQ1 structures from the literature. Claims about differences in Lin-Glycine binding to these two structural conformations seem to lack clear support, thus the claim seems speculative that PUFAs increase Gmax by binding to a crevice in the pore domain. A potentially definitive functional experiment is conducted by single-channel recordings with selectivity filter domain mutation Y315F which ablates the Lin-Glycine effect on Gmax. However, this appears to be an n=1 experiment. Overall, the major claim of the abstract is supported: "... that the selectivity filter in KCNQ1 is normally unstable ... and that the PUFA-induced increase in Gmax is caused by a stabilization of the selectivity filter in an open-conductive state." However, the claim in the abstract that selectivity filter instability "explains the low open probability" seems too general.

      We thank the reviewer for the comments, and we would like to address the main concern regarding the single channels. We now state the number of experiments used for the single-channel analysis. We agree that the claim in the abstract seems too general, and we have now made it more specific to our findings.

      Reviewer #2 (Public Review):

      Golluscio et al. address one of the mechanisms of IKs (KCNQ1/KCNE1) channel upregulation by polyunsaturated fatty acids (PUFA). PUFA is known to upregulate KCNQ1 and KCNQ1/KCNE1 channels by two mechanisms: one shifts the voltage dependence to the negative direction, and the other increases the maximum conductance (Gmax). While the first mechanism is known to affect the voltage sensor equilibrium by charge effect, the second mechanism is less known. By applying the single-channel recordings and mutagenesis on the putative binding sites (most of them related to the selectivity filter), they concluded that the selectivity filter is stabilized to a conductive state by PUFA binding.

      Strengths:

      They mainly used single-channel recordings and directly assessed the behavior of the selectivity filter. The method is straightforward and convincing enough to support their claims.

      Weaknesses:

      The structural model they used is the KCNQ1 channel without KCNE1 because KCNQ1/KCNE1 channel complex is not available yet. As the binding site of PUFAs might overlap with KCNE1, it is not very clear how PUFA binds to the KCNQ1 channel in the presence of KCNE1.

      Using other previous PUFA-related KCNQ1 mutants will strengthen their conclusions. For example, the Gmax of the K326E mutant is reduced by PUFA binding. Examining whether K326E shows reduced numbers of non-empty sweeps in the single-channel recordings will be a good addition.

      We thank the reviewer for the public review. We would like to address the main weak points raised in the comments. As a structure of KCNQ1/KCNE1 in complex is not available yet, we used KCNQ1 alone. We believe that the PUFA and KCNE1 binding sites will not overlap, as we previously presented data in agreement with the idea that KCNE1 rotates the VSD relative to the PD (Wu et al., 2021). This would leave enough space for both PUFA and KCNE1, so that PUFA can bind to the crevice (K326 and D301) without competing with KCNE1. We appreciate the suggestion of adding single-channel recordings of the K326E mutant and we agree it would make a valuable addition to strengthen our conclusions. However, single-channel recordings for KCNQ1 are very challenging and time-consuming to obtain, so we would like to keep this in consideration for future studies.

      Reviewer #3 (Public Review):

      This manuscript reveals an important mechanism of KCNQ1/IKs channel gating such that the open state of the pore is unstable and undergoes intermittent closed and open conformations. PUFA enhances the maximum open probability of IKs by binding to a crevice adjacent to the pore and stabilizing the open conformation. This mechanism is supported by convincing single-channel recordings that show empty and open channel traces and the ratio of such traces is affected by PUFA. In addition, mutations of the pore residues alter PUFA effects, convincingly supporting that PUFA alters the interactions among these pore residues.

      Strengths:

      The data are of high quality and the description is clear.

      Weaknesses:

      Some comments about the presentation.

      (1) The structural illustrations in this manuscript in general need to be more clarified.

      (2) The manuscript heavily relies on the comparison between the S4-down and S4-up structures (Figures 3, 4, and 7) to illustrate the difference between the extracellular side of the pore and to lead to the hypothesis of open-state stability being affected by PUFA. This may mislead the readers to think that the closed conformation of the channel in the up-state is the same as that in the down-state.

      We thank the reviewer for the public review, and we would like to address the comments about the presentation. We agree that the structural illustrations need to be more detailed, and we amended our previous illustrations. We have now included a new Figure 3 with a more detailed legend and a new Figure 4 that includes more information, such as the main chain of the whole selectivity filter and surrounding peptide.

      We have now added some clarification regarding the structures of KCNQ1 with S4-down and S4-up to clarify that the closed conformation of the channel in the up-state is different from that in the down-state. We also emphasize this difference in the Discussion.

      Recommendations for the authors:

      Reviewer #1:

      (1) Explain more thoroughly how the single-channel recordings were done:

      - How was Lin-Glycine applied in these experiments? The patch configuration is unclear. Was Lin-Glycine added to the patch pipette? If not, why is Lin-Glycine expected to reach the proposed binding site in the outer leaflet? Were controls time-matched applications of vehicles with ethanol?

      Data were collected using the cell-attached patch configuration to minimize disruption to the patch and avoid rundown problems due to the loss of PIP2. Lin-Glycine was solubilized in DMSO and the desired concentration was added directly to the bath. We had no a priori reason to know if the PUFA would reach the proposed binding site, but the consistency with which there was an increase in channel activity 5-10 minutes after addition to the bath convinced us that it was indeed reaching the binding site. This time frame fits with our prior experience with mefenamic acid effects on single channels (Wang et al., 2020). The mefenamic acid binding site is external to the membrane, so the drug must enter the cell and cross the patch membrane to affect channel activity. In addition, shown below is a previous recording from our lab, where nothing was added to the bath over a 55-minute period while recording consecutive files. This shows the typical behavior of IKs, with activity tending to cluster with a few active sweeps in between many blank sweeps. The behavior in this patch contrasts with that seen in the presence of Lin-Glycine, where the clusters of activity spread over an increasing number of sweeps.

      In addition, we have previously shown that 0.1% DMSO (the concentration used in the present study) does not affect the GV of KCNQ1, but there is a non-significant decrease in tail current amplitudes of about 14% (Eldstrom et al., 2021). As such, we do not think that the increase in activity we see with Lin-Glycine can be explained by vehicle effects alone.

      Author response image 1.

       

      We have added more details to the Materials and Methods section.

      - How well the replicates match the representative data in Figures 1, S1, and 6 is unclear (except for average current and Po in the last second of the traces from Figure 1). Are the results in Fig 6 n=1? 

      We now show in a data supplement that 3 replicates were used to assess the change in channel activity upon addition of Lin-Glycine.

      - Diary plots (as in Werry et al. 2013) and additional descriptions of the timeline of Lin-Glycine application and analyses could add credibility to interpretations. 

      We added a diary plot of the first latency to open in Supplementary Figure S1.

      - Amounts of plasmids and lipofectamine that were used in transfections are missing. 

      We added the information to the Materials and Methods section as follows:

      “Single channel currents were recorded from transiently transfected mouse ltk- fibroblast cells (LM cells) using 1.5 µL Lipofectamine 2000 (Thermo Fisher Scientific). Cells were transfected with 1.5 µg of pcDNA3 containing a linked KCNE1-KCNQ1 construct 20, to ensure fully KCNE1-saturated complexes, in addition to a plasmid containing green fluorescent protein (GFP) to identify transfected cells”

      - Inclusion/exclusion criteria for patches analyzed are missing. 

      We added the information to the Materials and Methods section as follows:

      “Only patches that were largely free of endogenous currents and had few channels, such that there were several blank sweeps to average for use for leak subtraction, were analyzed.”

      - Whether blinding, randomization, or pre-determined n values were employed is not mentioned. 

      No blinding, randomization or pre-determined n values were employed.

      - Analysis methods are sometimes unclear: How was Po calculated? Representative sweeps appear to have been leak and capacitance subtracted. How was that done? 

      Po was estimated from all-point amplitude histograms as follows: $P_o = \sum_i (i \cdot N_i) / (i_{\text{estimate}} \cdot N_{\text{total}})$, where $N_i$ is the number of points at a specific current $i$ in the histogram, $i_{\text{estimate}}$ = 0.4 pA is taken from the peak of the histogram, and $N_{\text{total}}$ = 10,000 is the total number of points in the last second of the trace. p = 0.75 ± 0.12 (n = 8) and p = 0.87 ± 0.04 (n = 3) for Control and Lin-Glycine, respectively.

      Leak and capacitance were subtracted with averaged empty sweeps.
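The open-probability estimate described above amounts to dividing the histogram-weighted mean current by the single-channel current amplitude. The sketch below illustrates this under stated assumptions: the function name `open_probability`, the histogram bin width of 0.05 pA, and the synthetic two-level trace are hypothetical and are not taken from the manuscript; only the 0.4 pA amplitude and the 10,000-point window come from the response.

```python
def open_probability(currents_pa, i_estimate_pa=0.4, bin_width_pa=0.05):
    """Estimate Po from an all-points amplitude histogram.

    Implements Po = sum_i(i * N_i) / (i_estimate * N_total), where N_i is the
    number of sample points falling in the histogram bin centred on current i.
    The bin width is an assumption made for this sketch.
    """
    # Build the all-points histogram: bin index -> number of points.
    counts = {}
    for current in currents_pa:
        bin_index = round(current / bin_width_pa)
        counts[bin_index] = counts.get(bin_index, 0) + 1

    n_total = len(currents_pa)
    # Sum of (bin-centre current) x (points in that bin) over all bins.
    weighted_sum = sum(index * bin_width_pa * n for index, n in counts.items())
    return weighted_sum / (i_estimate_pa * n_total)


# Synthetic, idealized trace (not real data): 10,000 leak-subtracted points,
# of which 7,500 sit at the open level (0.4 pA) and 2,500 at the closed level.
synthetic_trace = [0.4] * 7500 + [0.0] * 2500
print(f"Po = {open_probability(synthetic_trace):.2f}")
```

Because the calculation is a mean current normalized by the unitary amplitude, it assumes a patch with a single active channel and a trace that has already been leak- and capacitance-subtracted.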

      (2) The change of cells used for whole cell vs single channel (oocytes vs mouse ltk- fibroblast cells) could be discussed. These cells likely have different lipids in their membranes. Is there any other evidence that PUFAs have the same effects on KCNE1-KCNQ1 in these cells? Does the V0.5 shift? 

      A similar effect on Gmax in both oocytes and mouse ltk- fibroblast cells is shown in Figures 1 and 2. In Figure 2, the shift in latency suggests a shift in V0.5, consistent with the binding of PUFA to Site I.

      (3) The manuscript associates selectivity filter changes with S4 being up or down. It would help to clarify whether there was a change in [K+] in the two KCNQ1 structures used for modeling, as Mandala and MacKinnon (2023) state: "We note that one interesting difference between the two up structures regards the occupancy of K+ ions in the selectivity filter (SI Appendix, Fig. S5 C and D). In the polarized sample, due to the low extravesicular concentration of K+, density is only visible at the first and third positions in the selectivity filter, while density is present at all four positions in the unpolarized sample. Similar differences were observed in our previous study on Eag (20) and are qualitatively consistent with crystal structures of KcsA solved under symmetrical high and low K+ concentrations (45)." 

      Our study states that there are some differences between the two structures, with S4 in the up-state and S4 in the down-state, and a reorganization of the pore. As for the change in [K+] occupancy in the two structures, we are not sure, as our knowledge comes only from what is stated in Mandala and MacKinnon (2023). Mandala and MacKinnon did not discuss the selectivity filter in the down-state structure in their paper, and there are no K+ ions in any of their PDB files. So, we don’t know how many K+ ions there are in the down state.

      (4) The manuscript states " PUFAs increase Gmax by binding to a crevice in the pore domain" and "we elucidated that Lin-Glycine binds to a crevice between K326 and D301", this seems speculative without any actual binding studies or concrete structural evidence. A quantitative structural modeling analysis of whether changes in the crevice change the theoretical binding of Lin-Glycine might provide a stronger basis for speculation. 

      We toned down these statements in Results and Discussion to:

      “Crevice residues affect PUFA ability to increase Gmax"

      And

      Discussion: “We tested the hypothesis that the effect of Lin-Glycine involved conformational changes in the selectivity filter following PUFA binding to two residues K326 and D301 at the pore domain. Those residues delimit a small crevice that seems to change in size in different structures with S4 up or S4 down (Figure 3, D-F).”

      (5) The several figures detailing differences in selectivity filter conformation in the KCNQ1 structures are interesting and relevant in that they identify the movement of residues such as Y315 that, when mutated, ablate Lin-Glycine effect on Gmax. It would help to clarify whether T312 and I313 also move between the two selectivity filter conformations. 

      From the morph of the selectivity filter in the two conformations, it is noticeable that the changes and residue movements involve only residues at the upper part of the selectivity filter (including Y315 and D317). T312 and I313 are in the lower part of the selectivity filter and do not seem to move or rotate from their position between the two conformations of the selectivity filter.

      We now include Supplementary Figures S3 and S4, which show the extent of movement of each residue in the pore region, and a short description of this in the Results section.

      (6) The claim in the abstract that selectivity filter instability "explains the low open probability" seems too general. Lin-Glycine seems to increase the likelihood of conduction by 2.5-fold, but it was not clear whether open probability ceases to be low or whether other mechanisms also keep Po low. 

      We reworded this sentence to: “Our results suggest that the selectivity filter in KCNQ1 is normally unstable, contributing to the low open probability, and that the PUFA-induced increase in Gmax is caused by a stabilization of the selectivity filter in an open-conductive state.”

      Reviewer #2:

      (1) While all the electrophysiological recordings used KCNQ1/KCNE1 channels, all the structural models they used are KCNQ1 channels (without KCNE1). I know it is because the KCNQ1/KCNE1 complex structure is unavailable. However, according to their previous results, KCNQ1 alone is also upregulated by PUFAs. I am curious about what the single-channel recordings of KCNQ1 alone look like in the presence and absence of PUFAs. 

      We would love to include single-channel recordings of KCNQ1, but they are extremely hard to measure due to the small size and flickering nature of the channel.

      (2) As mentioned above, we do not have the KCNQ1/KCNE1 structure yet have the KCNQ1/KCNE3 structures (Sun and MacKinnon, Cell, 2020). According to the PDBs (6V00 or 6V01), the clevis (K326 and D301) looks covered by KCNE3. Is it true that PUFAs do not upregulate KCNQ1/KCNE3? If true, KCNE1 may not cover the clevis, so the binding mode should differ from the KCNQ1/KCNE3 structures. Please discuss the possible blocking of the clevis by KCNE proteins. 

      We previously presented data consistent with KCNE1 rotating the VSD towards the PD (Wu et al., 2021). This mechanism would leave room for both PUFA and KCNE1, so that PUFA can bind to the crevice (K326 and D301). We therefore think that this rotation prevents PUFA and KCNE1 from competing for the same space. As for KCNQ1/KCNE3, we currently have no evidence regarding possible upregulation by PUFA.

      (3) In the cryoEM structure with S4 resting (Figure 3F), the clevis looks too narrow for PUFA to bind. Is there any (either previous or current) evidence supporting that PUFA binding is state-dependent? 

      Because PUFAs first integrate into the bilayer and then diffuse towards their binding site on the channel, it would be hard to test the state-dependence of binding. In addition, once PUFAs are in the bilayer, the rate of binding/unbinding is quite fast (within the ns range according to our previous MD simulations), whereas the opening/closing rate is very slow (100 ms-s). So, the combination of slow wash-in/washout, fast binding/unbinding, and slow opening/closing would make it very difficult to test the state-dependence of the binding by using fast perfusion or different voltage protocols.

      (4) In the previous report (Liin et al. Cell Reports, 2018), K326 is the most critical site for PUFA binding. Why are the K326 mutants not included in the current study? I also would like to see the single-channel recordings of the K326E mutant, which showed a smaller Gmax. Does the PUFA application reduce the probability of non-empty traces in this mutant?

      As Liin et al. reported, mutations of K326 reduce the ability of PUFA to increase the Gmax. In this work, we wanted to gain further biophysical information on the mechanism that leads to an increase in Gmax, considering the knowledge we had from work conducted in our lab previously. We therefore focused here on residues downstream of K326 that we think are important for inducing the conformational changes at the selectivity filter. We agree that single channel experiments on K326E would be very interesting but that has to be for a future study.

      Minor points 

      (1) Liin et al. used S209F (Po of 0.4) and I204F (Po of 0.04) mutants. Their single-channel recordings would be a good addition. 

      We thank the reviewer for the suggestion. However, single-channel analyses of S209F and I204F were previously shown (Eldstrom et al., 2010).

      (2) I would like to see how the Site I mutations (R2Q/Q3R) affect (or do not affect) the single-channel recordings (open probability and latency). 

      Thank you for the excellent suggestion. It would be interesting to assess the behavior of the channel when mutations occur at Site I. However, we think this information would not add further detail to this study, as we focus here on the mechanism for the Gmax increase. Single-channel recordings are extremely hard to obtain; therefore, we chose to include only mutations at Site II in this study.

      (3) I would like the G-V curves for all the mutations at 0 and 20 uM of Lin-Glycine (Figure 3C and Figures 5A and B). 

      We have now added the G-V curves in Supplementary Figure S7.

      (4) I assume all the PUFAs have a similar effect on the selectivity filter, but a few other examples of PUFAs would be nice to see. 

      We anticipate that PUFAs and analogues with similar properties to Lin-Glycine would increase Gmax by a similar mechanism, because other PUFAs have previously been shown to increase Gmax (Bohannon et al., 2020).

      (5) Although the probabilities of non-empty sweeps are written in the manuscript, bar graph presentations would be a nice addition to Figures 2 and 6. 

      We have added bar graphs of non-empty sweeps for Figures 2 and 6.

      (6) Is there no statistical significance for D317E and T309S in Figure 5A? 

      There is no statistical significance for D317E and T309S.

      (7) There is no reference to Figure 7 in the manuscript. 

      A reference to Figure 7 has been added to the manuscript in the following paragraph.

      “Taken together, our results suggest that the binding of PUFA to Site II increases Gmax by promoting a series of interactions that stabilize the channel pore in the conductive state. For instance, we speculate that in the conductive state, hydrogen bonds between W304-D317 and W305-Y315, which are likely absent in the non-conductive conformation of KCNQ1, are created and that PUFA binding to Site II favors the transition towards the conductive state of the channel (Figure 7)”

      Reviewer #3:

      (1) Clarify the structural figures. Figures 3 D, E, and F - explain what the colors indicate. 

      A more detailed description of Figure 3 has been added to the legend.

      “D, E and F) Structure of crevice between S5 and S6 in KCNQ1 with S4 up (D and E) and S4 down (F). Residues that surround the crevice from S6 shown in blue (K326, T327, S330, V334) and from S5 in red (D301, A300, L303, F270). Remaining KCNQ1 residues shown in purple…, linoleic acid (LIN: gold color)”

      Fig 4. Only side chains of the residues are shown, making it hard to relate the figure to the familiar K channel selectivity filter. The main chain of the entire selectivity should be shown to orient readers to the familiar view of the K channel selectivity filter. In addition, the structures shown are only part of the selectivity filter, it should be specified which part of the selectivity filter is shown. These will also help the discussion at the bottom of page 10 and subsequent text. 

      We now provide a new Figure 4 with more details such as the main chain of the whole selectivity filter and surrounding peptide.

      (2) Cautions should be stated clearly when the structural comparison between the S4-up and S4-down is made that the structure of the pore when it is closed with S4-up may differ from the structure of the pore with S4-down. 

      We now state in addition “Clearly, there will be other differences in the pore domain between structures with activated and resting VSDs, for example the state of the activation gate.”

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public Review):

      The authors did a great job addressing the weaknesses I raised in the previous round of review, except on the generalizability of the current result in the larger context of multi-attribute decision-making. It is not really a weakness of the manuscript but more of a limitation of the studied topic, so I want to keep this comment for public readers.

      The reward magnitude and probability information are displayed using rectangular bars of different colors and orientations. Would that bias subjects to choose an additive rule instead of the multiplicative rule? Also, could the conclusion be extended to other decision contexts such as quality and price, where a multiplicative rule is hard to formulate?

      We thank the reviewer for the comment. With regard to whether the current type of stimuli may have biased participants towards an additive rather than a multiplicative rule, we believe many other forms of stimuli for representing choice attributes would be equally likely to cause a similar bias. This is because the additive strategy is an inherently simple and natural way to integrate different pieces of non-interacting information. More importantly, even though it is easy to employ an additive strategy, most participants still used the multiplicative rule to some degree. However, it would indeed be interesting for future studies to explore whether the current composite model remains dominant in situations where the optimal solutions require an additive or subtractive rule, such as those concerning quality and price.

      “The same would apply even with a different choice of cues as long as the information is conveyed by two independent visual features.”

      “While the additive strategy is a natural and simple approach for integrating non-interacting pieces of information, to some extent, participants also used the multiplicative strategy that was optimal in the current experiment. A general question for such composite models is whether people mix two strategies in a consistent manner on every trial or whether there is some form of probabilistic selection occurring between the two strategies on each trial such that only one strategy is used on any given trial while, on average, one strategy is more probable than the other. It would also be interesting to examine whether a composite model is appropriate in contexts where the optimal solution is additive or subtractive, such as those concerning quality and price.”


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The current study provided a follow-up analysis using published datasets focused on the individual variability of both the distraction effect (size and direction) and the attribute integration style, as well as the association between the two. The authors tried to answer the question of whether the multiplicative attribute integration style concurs with a more pronounced and positively oriented distraction effect.

      Strengths:

      The analysis extensively examined the impacts of various factors on decision accuracy, with a particular focus on using two-option trials as control trials, following the approach established by Cao & Tsetsos (2022). The statistical significance results were clearly reported.

      The authors meticulously conducted supplementary examinations, incorporating the additional term HV+LV into GLM3. Furthermore, they replaced the utility function from the expected value model with values from the composite model.

      We thank the reviewer for the positive response and are pleased that the reviewer found our report interesting.

      Reviewer #1 Comment 1

      Weaknesses:

      There are several weaknesses in terms of theoretical arguments and statistical analyses.

      First, the manuscript suggests in the abstract and at the beginning of the introduction that the study reconciled the "different claims" about "whether distraction effect operates at the level of options' component attributes rather than at the level of their overall value" (see line 13-14), but the analysis conducted was not for that purpose. Integrating choice attributes in either an additive or multiplicative way only reflects individual differences in combining attributes into the overall value. The authors seemed to assume that the multiplicative way generated the overall value ("Individuals who tended to use a multiplicative approach, and hence focused on overall value", line 20-21), but such implicit assumption is at odds with the statement in line 77-79 that people may use a simpler additive rule to combine attributes, which means overall value can come from the additive rule.

      We thank the reviewer for the comment. We have made adjustments to the manuscript to ensure that the message delivered within this manuscript is consistent. Within this manuscript, our primary focus is on the different methods of value integration in which the overall value is computed (i.e., additive, multiplicative, or both), rather than the interaction at the individual level of attributes. However, we do not exclude the possibility that the distractor effect may occur at multiple levels. Nevertheless, in light of the reviewer’s comment, we agree that we should focus the argument on whether distractors facilitate or impair decision making and downplay the separate argument about the level at which distractor effects operate. We have now revised the abstract:

      “It is widely agreed that people make irrational decisions in the presence of irrelevant distractor options. However, there is little consensus on whether decision making is facilitated or impaired by the presence of a highly rewarding distractor or whether the distraction effect operates at the level of options’ component attributes rather than at the level of their overall value. To reconcile different claims, we argue that it is important to incorporate consideration of the diversity of people’s ways of decision making. We focus on a recent debate over whether people combine choice attributes in an additive or multiplicative way. Employing a multi-laboratory dataset investigating the same decision making paradigm, we demonstrated that people used a mix of both approaches and the extent to which approach was used varied across individuals. Critically, we identified that this variability was correlated with the effect of the distractor on decision making. Individuals who tended to use a multiplicative approach to compute value, showed a positive distractor effect. In contrast, in individuals who tended to use an additive approach, a negative distractor effect (divisive normalisation) was prominent. These findings suggest that the distractor effect is related to how value is constructed, which in turn may be influenced by task and subject specificities. Our work concurs with recent behavioural and neuroscience findings that multiple distractor effects co-exist.” (Lines 12-26)

      Furthermore, we acknowledge that the current description of the additive rule could be interpreted in several ways. The current additive utility model is described as:

      $U = \eta X + (1 - \eta) P$

      where $U$ is the option's utility, $X$ is the reward magnitude, $P$ is the probability, and $\eta$ is the magnitude/probability weighting ratio. If we perform a comparison between values according to this model (i.e., HV against LV), we arrive at the following comparison:

      $[\eta X_{HV} + (1 - \eta) P_{HV}] - [\eta X_{LV} + (1 - \eta) P_{LV}]$  (1)

      If we rearrange (1), we arrive at:

      $\eta (X_{HV} - X_{LV}) + (1 - \eta)(P_{HV} - P_{LV})$  (2)

      While equations (1) and (2) are mathematically equivalent, equation (1) illustrates the interpretation in which the comparison of utilities occurs after value integration, once an overall value has been formed for each option. On the other hand, equation (2) can be broadly interpreted as the comparison of individual attributes in the absence of an overall value estimate for each option. Nonetheless, while we do not exclude the possibility that the distractor effect may occur at multiple levels, we have modified the main manuscript to use more consistently a terminology that refers to different methods of value estimation, while recognizing that our empirical results are compatible with both interpretations.
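
      The equivalence of the two readings can be checked numerically. The following is a minimal sketch, assuming the additive utility is a weighted sum of magnitude and probability with a single weighting ratio; the function names, the weighting value, and the option values are hypothetical and chosen only for illustration.

```python
# Hypothetical illustration: the weighting ratio and the option attributes
# below are made-up numbers, not values from the dataset.

def additive_utility(magnitude: float, probability: float, eta: float) -> float:
    """Additive rule: weighted sum of reward magnitude and probability."""
    return eta * magnitude + (1.0 - eta) * probability


def option_level_difference(hv, lv, eta: float) -> float:
    """Reading (1): integrate each option first, then compare overall values."""
    return additive_utility(hv[0], hv[1], eta) - additive_utility(lv[0], lv[1], eta)


def attribute_level_difference(hv, lv, eta: float) -> float:
    """Reading (2): compare attribute by attribute, then weight the differences."""
    return eta * (hv[0] - lv[0]) + (1.0 - eta) * (hv[1] - lv[1])


hv = (0.8, 0.6)  # (magnitude, probability) of the higher-value option
lv = (0.5, 0.7)  # (magnitude, probability) of the lower-value option
eta = 0.4        # assumed magnitude/probability weighting ratio

d1 = option_level_difference(hv, lv, eta)
d2 = attribute_level_difference(hv, lv, eta)
print(d1, d2)
```

      Both functions return the same decision variable, which is why choice data alone cannot distinguish the two readings under this simple additive form.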

      Reviewer #1 Comment 2

      The second weakness is sort of related but is more about the lack of coherent conceptual understanding of the "additive rule", or "distractor effect operates at the attribute level". In an assertive tone (lines 77-80), the manuscript suggests that a weighted sum integration procedure of implementing an "additive rule" is equal to assuming that people compare pairs of attributes separately, without integration. But they are mechanistically distinct. The additive rule (implemented using the weighted sum rule to combine probability and magnitude within each option and then applying the softmax function) assumes value exists before comparing options. In contrast, if people compare pairs of attributes separately, preference forms based on the within-attribute comparisons. Mathematically these two might be equivalent only if no extra mechanisms (such as inhibition, fluctuating attention, evidence accumulation, etc) are included in the within-attribute comparison process, which is hardly true in the three-option decision.

      We thank the reviewer for the comment. As described in our response to Reviewer #1 Comment 1, we are aware and acknowledge that there may be multiple possible interpretations of the additive rule. We also agree with the reviewer that there may be additional mechanisms involved in three- or even two-option decisions, but these would require additional studies to tease apart. Another motivation for the approach used here, which does not explicitly model the extra mechanisms the reviewer refers to, was our intention to address and integrate findings from previous studies using the same dataset [i.e., (Cao & Tsetsos, 2022; Chau et al., 2020)]. Lastly, regardless of the mechanistic interpretation, our results show a systematic difference in the process of value estimation. Modifications to the manuscript text have been made consistent with our motivation (please refer to the reply and the textual changes proposed in response to the reviewer’s previous comment: Reviewer #1 Comment 1).

      Reviewer #1 Comment 3

      Could the authors comment on the generalizability of the current result? The reward magnitude and probability information are displayed using rectangular bars of different colors and orientations. Would that bias subjects to choose an additive rule instead of the multiplicative rule? Also, could the conclusion be extended to other decision contexts such as quality and price, where a multiplicative rule is hard to formulate?

      We thank the reviewer for the comment. We agree with the observation that the stimulus space, with colour linearly correlated with magnitude and orientation linearly correlated with probability, may bias subjects towards an additive rule. But that is indeed the point: in order to maximise reward, subjects should have focused on the outcome space without being driven by the stimulus space. In practice, people are more or less successful in such an endeavour. Nevertheless, we argue that the specific choice of visual stimuli we used is no more biased towards an additive rule than any other. In fact, as long as two or more pieces of information are provided for each option, as opposed to a single cue whose value was previously learned, there will always be a bias towards an additive heuristic (a linear combination), regardless of whether the cues are shapes, colours, graphs, numbers, or words.

      As the reviewer suggested, the dataset analyzed in the current manuscript suggests that the participants were leaning towards the additive rule. Although there was a general tendency towards using the additive rule when choosing between the rectangular bars, we can still observe a spread of individuals using either, or both, additive and multiplicative rules, suggesting that there was indeed diversity in participants’ decision making strategies in our data.

      In previous studies, it was observed that human and non-human individuals used a mix of multiplicative and additive rules when they were tested on experimental paradigms different from ours (Bongioanni et al., 2021; Farashahi et al., 2019; Scholl et al., 2014). It was also observed that positive and negative distractor effects can both be present in the same data set when human and non-human individuals made decisions about food and social partners (Chang et al., 2019; Louie et al., 2013). It was less clear in the past whether the precise way a distractor affects decision making (i.e., positive/negative distractor effect) is related to the decision strategy used (i.e., multiplicative/additive rules), and this is exactly what we are trying to address in this manuscript. A follow-up study looking at neural data (such as functional magnetic resonance imaging data) could provide a better understanding of the mechanistic nature of the relationship between distractor effects and decision strategy that we identified here.

      We agree with the reviewer that a multiplicative strategy may not be applicable to some decision contexts. Here it is important to look at the structure of the optimal solution (the one maximizing value in the long run). Factors modulating value (such as probability and temporal delay) require a non-linear (e.g., multiplicative) solution, while factors of the cost-benefit form (such as effort and price) require a linear solution (e.g., subtraction). In the latter scenario the additive heuristic would coincide with the optimal solution, and the effect addressed in this study may not be revealed. Nonetheless, the present data support the notion of distinct neural mechanisms at least for probabilistic decision-making, and this is likely applicable to decision-making in general.

      Our findings, in conjunction with the literature, also suggest that a positive distractor effect could be a general phenomenon in decision mechanisms that involve the medial prefrontal cortex. For example, it has been shown that the positive distractor effect is related to a decision mechanism linked to medial prefrontal cortex [especially the ventromedial prefrontal cortex (Chau et al., 2014; Noonan et al., 2017)]. It is also known that a similar brain region is involved not only when individuals are combining information using a multiplicative strategy (Bongioanni et al., 2021), but also when they are combining information to evaluate new experience or generalize information (Baram et al., 2021; Barron et al., 2013; Park et al., 2021). We have now revised the Discussion to explain this:

      “In contrast, the positive distractor effect is mediated by the mPFC (Chau et al., 2014; Fouragnan et al., 2019). Interestingly, the same or adjacent, interconnected mPFC regions have also been linked to the mechanisms by which representational elements are integrated into new representations (Barron et al., 2013; Klein-Flügge et al., 2022; Law et al., 2023; Papageorgiou et al., 2017; Schwartenbeck et al., 2023). In a number of situations, such as multi-attribute decision making, understanding social relations, and abstract knowledge, the mPFC achieves this by using a spatial map representation characterised by a grid-like response (Constantinescu et al., 2016; Bongioanni et al., 2021; Park et al., 2021) and disrupting mPFC leads to the evaluation of composite choice options as linear functions of their components (Bongioanni et al., 2021). These observations suggest a potential link between positive distractor effects and mechanisms for evaluating multiple component options and this is consistent with the across-participant correlation that we observed between the strength of the positive distractor effect and the strength of non-additive (i.e., multiplicative) evaluation of the composite stimuli we used in the current task. Hence, one direction for model development may involve incorporating the ideas that people vary in their ways of combining choice attributes and each way is susceptible to different types of distractor effect.” (Lines 260-274)

      Reviewer #1 Comment 4

      The authors did careful analyses on quantifying the "distractor effect". While I fully agree that it is important to use the matched two-option trials and examine the interaction terms (DV-HV)T as a control, the interpretation of the results becomes tricky when looking at the effects in each trial type. Figure 2c shows a positive DV-HV effect in two-option trials whereas the DV-HV effect was not significantly stronger in three-option trials. Further in Figure 5b,c, in the Multiplicative group, the effect of DV-HV was absent in the two-option trials and present in the three-option trials. In the Additive group, however, the effect of DV-HV was significantly positive in the two-option trials but was significantly lowered in the three-option trials. Hence, it seems the different distractor effects were driven by the different effects of DV-HV in the two-option trials, rather than the three-option trials?

      We thank the reviewer for the comment. While it may be a bit more difficult to interpret, the current method of examining the (DV−HV)T term rather than the (DV−HV) term was used because it was the approach used in a previous study (Cao & Tsetsos, 2022).

      During the design of the original experiments, trials were generated pseudo-randomly until the DV was sufficiently decorrelated from HV−LV. While this method allows for better group-level examination of behaviour, Cao and Tsetsos were concerned that this approach may have introduced unintended confounding covariations to some trials. In theory, one of the unintended covariations could occur between the DV and specific sets of reward magnitude and probability of the HV and LV. The covariation between parameters can lead to an observable positive distractor effect in the DV−HV as a consequence of the attraction effect or an unintended byproduct of using an additive method of integrating attributes [for further elaboration, please refer to Figure 1 in (Cao & Tsetsos, 2022)]. While it may have some limitations, the approach suggested by Cao and Tsetsos has the advantage of leveraging the DV−HV term to absorb any variance contributed by possible confounding factors such that true distractor effects, if any, can be detected using the (DV−HV)T term.
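
      The logic of the interaction term can be sketched in code. This is a minimal illustration, not the regression used in the manuscript: it assumes that each two-option control trial carries the DV of its matched three-option trial, that T is coded 1 for three-option and 0 for two-option trials, and it omits any normalisation of the regressors. The function name and the example values are hypothetical.

```python
from typing import Dict


def interaction_regressors(hv: float, lv: float, dv: float,
                           is_ternary: bool) -> Dict[str, float]:
    """Build one row of regressors with a trial-type interaction.

    T is 1.0 on three-option (distractor present) trials and 0.0 on
    matched two-option control trials (assumed coding).
    """
    t = 1.0 if is_ternary else 0.0
    hv_lv = hv - lv
    dv_hv = dv - hv
    return {
        "HV-LV": hv_lv,
        "DV-HV": dv_hv,          # present on both trial types
        "T": t,
        "(HV-LV)T": hv_lv * t,
        "(DV-HV)T": dv_hv * t,   # non-zero only when a distractor is shown
    }


# A three-option trial and its matched two-option control (made-up values).
ternary_row = interaction_regressors(hv=6.0, lv=4.0, dv=2.0, is_ternary=True)
binary_row = interaction_regressors(hv=6.0, lv=4.0, dv=2.0, is_ternary=False)
print(ternary_row)
print(binary_row)
```

      Because the DV−HV regressor is identical on both rows, any covariation that is unrelated to the physical presence of the distractor is shared between trial types and loads on that term, whereas the (DV−HV)T regressor differs from zero only on three-option trials.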

      Reviewer #1 Comment 5

      Note that the pattern described above was different in Supplementary Figure 2, where the effect of DV-HV on the two-option trials was negative for both Multiplicative and Additive groups. I would suggest considering using Supplementary Figure 2 as the main result instead of Figure 5, as it does not rely on multiplicative EV to measure the distraction effect, and it shows the same direction of DV-HV effect on two-option trials, providing a better basis to interpret the (DV-HV)T effect.

      We thank the reviewer for the comments and suggestion. However, as mentioned in the response to Reviewer #1 Comment 4, the current method of analysis adopted in the manuscript and the interpretation of only (DV−HV)T is aimed to address the possibility that the (DV−HV) term may be capturing some confounding effects due to covariation. Given that the debate that is addressed specifically concerns the (DV−HV)T term, we elected to display Figure 5 within the main text and keep the results of the regression after replacing the utility function with the composite model as Supplementary Figure 5 (previously labelled as Supplementary Figure 2).

      Reviewer #2 (Public Review):

      This paper addresses the empirical demonstration of "distractor effects" in multi-attribute decision-making. It continues a debate in the literature on the presence (or not) of these effects, which domains they arise in, and their heterogeneity across subjects. The domain of the study is a particular type of multi-attribute decision-making: choices over risky lotteries. The paper reports a re-analysis of lottery data from multiple experiments run previously by the authors and other laboratories involved in the debate.

      Methodologically, the analysis assumes a number of simple forms for how attributes are aggregated (additively, multiplicatively, or both) and then applies a "reduced form" logistic regression to the choices with a number of interaction terms intended to control for various features of the choice set. One of these interactions, modulated by ternary/binary treatment, is interpreted as a "distractor effect."

      The claimed contribution of the re-analysis is to demonstrate a correlation in the strength/sign of this treatment effect with another estimated parameter: the relative mixture of additive/multiplicative preferences.

      We thank the reviewer for the positive response and are pleased that the reviewer found our report interesting.

      Reviewer #2 Comment 1

      Major Issues

      (1) How to Interpret GLM 1 and 2

      This paper, and others before it, have used a binary logistic regression with a number of interaction terms to attempt to control for various features of the choice set and how they influence choice. It is important to recognize that this modelling approach is not derived from a theoretical claim about the form of the computational model that guides decision-making in this task, nor an explicit test for a distractor effect. This can be seen most clearly in the equations after line 321 and its corresponding log-likelihood after 354, which contain no parameter or test for "distractor effects". Rather the computational model assumes a binary choice probability and then shoehorns the test for distractor effects via a binary/ternary treatment interaction in a separate regression (GLM 1 and 2). This approach has already led to multiple misinterpretations in the literature (see Cao & Tsetsos, 2022; Webb et al., 2020). One of these misinterpretations occurred in the datasets the authors studied, in which the lottery stimuli contained a confound with the interaction that Chau et al., (2014) were interpreting as a distractor effect (GLM 1). Cao & Tsetsos (2022) demonstrated that the interaction was significant in binary choice data from the study, therefore it can not be caused by a third alternative. This paper attempts to address this issue with a further interaction with the binary/ternary treatment (GLM 2). Therefore the difference in the interaction across the two conditions is claimed to now be the distractor effect. The validity of this claim brings us to what exactly is meant by a "distractor effect."

      The paper begins by noting that "Rationally, choices ought to be unaffected by distractors" (line 33). This is not true. There are many normative models that allow for the value of alternatives (even low-valued "distractors") to influence choices, including a simple random utility model. Since Luce (1959), it has been known that the axiom of "Independence of Irrelevant Alternatives" (that the probability ratio between any two alternatives does not depend on a third) is an extremely strong axiom, and only a sufficiency axiom for a random utility representation (Block and Marschak, 1959). It is not a necessary condition of a utility representation, and if this is our definition of rational (which is highly debatable), not necessary for it either. Countless empirical studies have demonstrated that IIA is falsified, and a large number of models can address it, including a simple random utility model with independent normal errors (i.e. a multivariate Probit model). In fact, it is only the multinomial Logit model that imposes IIA. It is also why so much attention is paid to the asymmetric dominance effect, which is a violation of a necessary condition for random utility (the Regularity axiom).

      So what do the authors even mean by a "distractor effect"? It is true that the form of IIA violations (i.e. their path through the probability simplex as the low-option varies) tells us something about the computational model underlying choice (after all, different models will predict different patterns). However, we do not know how the interaction terms in the binary logit regression relate to the pattern of the violations because there is no formal theory that relates them. Any test for relative value coding is a joint test of the computational model and the form of the stochastic component (Webb et al, 2020). These interaction terms may simply be picking up substitution patterns that can be easily reconciled with some form of random utility. While we can not check all forms of random utility in these datasets (because the class of such models is large), this paper doesn't even rule any of these models out.

      We thank the reviewer for the comment. In this study, one objective is to address an issue raised by Cao and Tsetsos (2022), suggesting that the distractor effect claimed in the Chau et al. (2014) study was potentially confounded by unintended correlation introduced between the distractor and the chooseable options. They suggested that this could be tested by analyzing the control binary trials and the experimental ternary trials in a single model (i.e., GLM2) and introducing an interaction term (DV−HV)T. The interaction term can partial out any unintended confound and test the distractor effect that was present specifically in the experimental ternary trials. We adopted these procedures in our current studies and employed the interaction term to test the distractor effects. The results showed that overall there was no significant distractor effect in the group. We agree with the reviewer’s comment that if we were only analysing the ternary trials, a multinomial probit model would be suitable because it allows noise correlation between the choices. Alternatively, had a multinomial logistic model been applied, a Hausman-McFadden Test could be run to test whether the data violates the assumption of independence of irrelevant alternatives (IIA). However, in our case, a binomial model is preferred over a multinomial model because of: (1) the inclusion of the binary trials, and (2) the small number of trials in which the distractor was chosen (the median was 4% of all ternary trials).

      However, another main objective of this study is to consider the possibility that the precise distractor effect may vary across individuals. This is exactly why we employed the composite model to estimate each individual’s decision-making strategy and investigated how that varied with the precise way the distractor influenced decision making.

      In addition, we think that the reviewer here is raising a profound point and one with which we are in sympathy; it is true that random noise utility models can predict deviations from the IIA axiom. Central to these approaches is the notion that the representations of the values of choice options are noisy. Thus, when the representation is accessed, it might have a certain value on average but this value might vary from occasion to occasion as if each sample were being drawn from a distribution. As a consequence, the value of a distractor that is “drawn” during a decision between two other options may be larger than the distractor’s average value and may even have a value that is larger than the value drawn from the less valuable choice option’s distribution on the current trial. On such a trial it may become especially clear that the better of the two options has a higher value than the alternative choice option. Our understanding is that Webb, Louie and colleagues (Louie et al., 2013; Webb et al., 2020) suggest an explanation approximately along these lines when they reported a negative distractor effect during some decisions, i.e., they follow the predictions of divisive normalization suggesting that decisions become more random as the distractor’s value is greater.

      An alternative approach, however, assumes that rather than noise in the representation of the option itself, there is noise in the comparison process when the two options are compared. This is exemplified in many influential decision making models including evidence accumulation models such as drift diffusion models (Shadlen & Shohamy, 2016) and recurrent neural network models of decision making (Wang, 2008). It is this latter type of model that we have used in our previous investigations (Chau et al., 2020; Kohl et al., 2023). However, these two approaches are linked both in their theoretical origin and in the predictions that they make in many situations (Shadlen & Shohamy, 2016). We therefore clarify that this is the case in the revised manuscript as follows:

      “In the current study and in previous work we have used or made reference to models of decision making that assume that a noisy process of choice comparison occurs such as recurrent neural networks and drift diffusion models (Shadlen & Shohamy, 2016; Wang, 2008). Under this approach, positive distractor effects are predicted when the comparison process becomes more accurate because of an impact on the noisy process of choice comparison (Chau et al., 2020; Kohl et al., 2023). However, it is worth noting that another class of models might assume that a choice representation itself is inherently noisy. According to this approach, on any given decision a sample is drawn from a distribution of value estimates in a noisy representation of the option. Thus, when the representation is accessed, it might have a certain value on average but this value might vary from occasion to occasion. As a consequence, the value of a distractor that is “drawn” during a decision between two other options may be larger than the distractor’s average value and may even have a value that is larger than the value drawn from the less valuable choice option’s distribution on the current trial. On such a trial it may become especially clear that the better of the two options has a higher value than the alternative choice option. Louie and colleagues (Louie et al., 2013) suggest an explanation approximately along these lines when they reported a positive distractor effect during some decisions. Such different approaches share theoretical origins (Shadlen & Shohamy, 2016) and make related predictions about the impact of distractors on decision making.” (Lines 297-313)

      Reviewer #2 Comment 2

      (2) How to Interpret the Composite (Mixture) model?

      On the other side of the correlation are the results from the mixture model for how decision-makers aggregate attributes. The authors report that most subjects are best represented by a mixture of additive and multiplicative aggregation models. The authors justify this with the proposal that these values are computed in different brain regions and then aggregated (which is reasonable, though raises the question of "where" if not the mPFC). However, an equally reasonable interpretation is that the improved fit of the mixture model simply reflects a misspecification of two extreme aggregation processes (additive and EV), so the log-likelihood is maximized at some point in between them.

      One possibility is a model with utility curvature. How much of this result is just due to curvature in valuation? There are many reasonable theories for why we should expect curvature in utility for human subjects (for example, limited perception: Robson, 2001; Khaw, Li & Woodford, 2019; Netzer et al., 2022) and of course many empirical demonstrations of risk aversion for small stakes lotteries. The mixture model, on the other hand, has parametric flexibility.

      There is also a large literature on testing expected utility jointly with stochastic choice, and the impact of these assumptions on parameter interpretation (Loomes & Sugden, 1998; Apesteguia & Ballester, 2018; Webb, 2019). This relates back to the point above: the mixture may reflect the joint assumption of how choice departs from deterministic EV.

      We thank the reviewer for the comment. They are indeed right to mention the vast literature on curvature in subjective valuation; however, it is important to stress that the predictions of the additive model with linear basis functions are quite distinct from the predictions of a multiplicative model with non-linear basis functions. We have tested the possibility that participants’ behaviour was better explained by the latter and we showed that this was not the case. Specifically, we have added and performed model fitting on an additional model with utility curvature based on prospect theory (Kahneman & Tversky, 1979) with the weighted probability function suggested by Prelec (1998):

      In this model, the reward magnitude and probability (both rescaled to the interval between 0 and 1) are each transformed by a weighting function to give a weighted magnitude and a weighted probability, with a separate distortion parameter for each. This prospect theory (PT) model is included along with the four previous models (please refer to Figure 3) in a Bayesian model comparison. Results indicate that the composite model remains the best account of participants’ choice behaviour (exceedance probability = 1.000, estimated model frequency = 0.720). We have now included these results in the main text and Supplementary Figure 2:
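
      As a rough sketch of a model of this kind, the code below combines a power-function magnitude weighting with the one-parameter Prelec (1998) probability weighting. The power form for magnitude, the multiplication of the two weighted terms, and the names `alpha`, `gamma`, and `pt_utility` are assumptions made for illustration; they are not a statement of the exact parameterisation used in the model comparison.

```python
import math

def weighted_magnitude(x, alpha):
    """Assumed power-function weighting of magnitude x, rescaled to [0, 1]."""
    return x ** alpha

def weighted_probability(p, gamma):
    """One-parameter Prelec (1998) weighting: exp(-(-ln p) ** gamma)."""
    if p <= 0.0:
        return 0.0  # limiting value as p approaches 0
    if p >= 1.0:
        return 1.0  # certainty is left undistorted
    return math.exp(-((-math.log(p)) ** gamma))

def pt_utility(x, p, alpha, gamma):
    """Assumed subjective value: weighted magnitude times weighted probability."""
    return weighted_magnitude(x, alpha) * weighted_probability(p, gamma)

# With alpha = gamma = 1 the sketch reduces to expected value (x * p).
ev_like = pt_utility(0.5, 0.4, alpha=1.0, gamma=1.0)

# With gamma < 1 small probabilities are overweighted.
small_p_weight = weighted_probability(0.05, gamma=0.5)
```

      Setting both distortion parameters to 1 recovers the multiplicative (expected value) model, so curvature enters only through departures of the parameters from 1.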

      “Supplementary Figure 2 reports an additional Bayesian model comparison performed while including a model with nonlinear utility functions based on Prospect Theory (Kahneman & Tversky, 1979) with the Prelec formula for probability (Prelec, 1998). Consistent with the above finding, the composite model provides the best account of participants’ choice behaviour (exceedance probability = 1.000, estimated model frequency = 0.720).” (Lines 193-198)

      Reviewer #2 Comment 3

      (3) So then how should we interpret the correlation that the authors report?

      On one side we have the impact of the binary/ternary treatment which demonstrates some impact of the low value alternative on a binary choice probability. This may reflect some deep flaws in existing theories of choice, or it may simply reflect some departure from purely deterministic expected value maximization that existing theories can address. We have no theory to connect it to, so we cannot tell. On the other side of the correlation, we have a mixture between additive and multiplicative preferences over risk. This result may reflect two distinct neural processes at work, or it may simply reflect a misspecification of the manner in which humans perceive and aggregate attributes of a lottery (or even just the stimuli in this experiment) by these two extreme candidates (additive vs. EV). Again, this would entail some departure from purely deterministic expected value maximization that existing theories can address.

      It is entirely possible that the authors are reporting a result that points to the more exciting of these two possibilities. But it is also possible (and perhaps more likely) that the correlation is more mundane. The paper does not guide us to theories that predict such a correlation, nor reject any existing ones. In my opinion, we should be striving for theoretically-driven analyses of datasets, where the interpretation of results is clearer.

      We thank the reviewer for their clear comments. Based on our responses to the previous comments, it should be apparent that our results are consistent with several existing theories of choice. We are therefore not claiming that there are deep flaws in them. Rather, our results reveal distinct neural processes (additive and multiplicative), and this does not reflect a misspecification in the modelling. We have revised our manuscript in the light of the reviewer’s comments in the hope of clarifying the theoretical background which informed both our data analysis and our data interpretation.

      First, we note that there are theoretical reasons to expect a third option might impact on choice valuation. There is a large body of work suggesting that a third option may have an impact on the values of two other options (indeed Reviewer #2 refers to some of this work in their Reviewer #2 Comment 1), but the body of theoretical work originates partly in neuroscience and not just in behavioural economics. In many sensory systems, neural activity changes with the intensity of the stimuli that are sensed. Divisive normalization in sensory systems, however, describes the way in which such neural responses are altered also as a function of other adjacent stimuli (Carandini & Heeger, 2012; Glimcher, 2022; Louie et al., 2011, 2013). The phenomenon has been observed at neural and behavioural levels as a function not just of the physical intensity of the other stimuli but as a function of their associated value (Glimcher, 2014, 2022; Louie et al., 2011, 2015; Noonan et al., 2017; Webb et al., 2020).

      Analogously there is an emerging body of work on the combinatorial processes that describe how multiple representational elements are integrated into new representations (Barron et al., 2013; Papageorgiou et al., 2017; Schwartenbeck et al., 2023). These studies have originated in neuroscience, just as was the case with divisive normalization, but they may have implications for understanding behaviour. For example, they might be linked to behavioural observations that the values assigned to bundles of goods are not necessarily the sum of the values of the individual goods (Hsee, 1998; List, 2002). One neuroscience fact that we know about such processes is that, at an anatomical level, they are linked to the medial frontal cortex (Barron et al., 2013; Fellows, 2006; Hunt et al., 2012; Papageorgiou et al., 2017; Schwartenbeck et al., 2023). A second neuroscientific fact that we know about medial frontal cortex is that it is linked to any positive effects that distractors might have on decision making (Chau et al., 2014; Noonan et al., 2017). Therefore, we might make use of these neuroscientific facts and theories to predict a correlation between positive distractor effects and non-additive mechanisms for determining the integrated value of multi-component choices. This is precisely what we did; we predicted the correlation on the basis of this body of work and when we tested to see if it was present, we found that indeed it was. It may be the case that other behavioural economics theories offer little explanation of the associations and correlations that we find. However, we emphasize that this association is predicted by neuroscientific theory and in the revised manuscript we have attempted to clarify this in the Introduction and Discussion sections:

      “Given the overlap in neuroanatomical bases underlying the different methods of value estimation and the types of distractor effects, we further explored the relationship. Critically, those who employed a more multiplicative style of integrating choice attributes also showed stronger positive distractor effects, whereas those who employed a more additive style showed negative distractor effects. These findings concur with neural data demonstrating that the medial prefrontal cortex (mPFC) computes the overall values of choices in ways that go beyond simply adding their components together, and is the neural site at which positive distractor effects emerge (Barron et al., 2013; Bongioanni et al., 2021; Chau et al., 2014; Fouragnan et al., 2019; Noonan et al., 2017; Papageorgiou et al., 2017), while divisive normalization was previously identified in the posterior parietal cortex (PPC) (Chau et al., 2014; Louie et al., 2011).” (Lines 109-119)

      “At the neuroanatomical level, the negative distractor effect is mediated by the PPC, where signal modulation described by divisive normalization has been previously identified (Chau et al., 2014; Louie et al., 2011). The same region is also crucial for perceptual decision making processes (Shadlen & Shohamy, 2016). The additive heuristics for combining choice attributes are closer to a perceptual evaluation because distances in this subjective value space correspond linearly to differences in physical attributes of the stimuli, whereas normative (multiplicative) value has a non-linear relation with them (cf. Figure 1c). It is well understood that many sensory mechanisms, such as in primates’ visual systems or fruit flies’ olfactory systems, are subject to divisive normalization (Carandini & Heeger, 2012). Hence, the additive heuristics that are more closely based on sensory mechanisms could also be subject to divisive normalization, leading to negative distractor effects in decision making.

      In contrast, the positive distractor effect is mediated by the mPFC (Chau et al., 2014; Fouragnan et al., 2019). Interestingly, the same or adjacent, interconnected mPFC regions have also been linked to the mechanisms by which representational elements are integrated into new representations (Barron et al., 2013; Klein-Flügge et al., 2022; Law et al., 2023; Papageorgiou et al., 2017; Schwartenbeck et al., 2023). In a number of situations, such as multi-attribute decision making, understanding social relations, and abstract knowledge, the mPFC achieves this by using a spatial map representation characterised by a grid-like response (Constantinescu et al., 2016; Bongioanni et al., 2021; Park et al., 2021) and disrupting mPFC leads to the evaluation of composite choice options as linear functions of their components (Bongioanni et al., 2021). These observations suggest a potential link between positive distractor effects and mechanisms for evaluating multiple component options and this is consistent with the across-participant correlation that we observed between the strength of the positive distractor effect and the strength of non-additive (i.e., multiplicative) evaluation of the composite stimuli we used in the current task. Hence, one direction for model development may involve incorporating the ideas that people vary in their ways of combining choice attributes and each way is susceptible to different types of distractor effect.” (Lines 250-274)

      Reviewer #2 Comment 4

      (4) Finally, the results from these experiments might not have external validity for two reasons. First, the normative criterion for multi-attribute decision-making differs depending on whether the attributes are lotteries or not (i.e. multiplicative vs additive). Whether it does so for humans is a matter of debate. Therefore if the result is unique to lotteries, it might not be robust for multi-attribute choice more generally. The paper largely glosses over this difference and mixes literature from both domains. Second, the lottery information was presented visually and there is literature suggesting this form of presentation might differ from numerical attributes. Which is more ecologically valid is also a matter of debate.

      We thank the reviewer for the comment. Indeed, they are right that the correlation we find between value estimation style and distractor effects may not be detected in all contexts of human behaviour. What the reviewer suggests goes along the same lines as our response to Reviewer #1 Comment 3: multi-attribute value estimation may have different structures. In some cases, the optimal solution may require a non-linear (e.g., multiplicative) response, as in probabilistic or delayed decisions, but in other cases (e.g., when estimating the value of a snack based on its taste, size, healthiness, price) a linear integration would suffice. In the latter kind of scenario, both the optimal and the heuristic solutions may be additive and people’s value estimation “style” may not be teased apart. However, if different neural mechanisms associated with different estimation processes are observed in certain scenarios, it suggests that these mechanisms are always present, even in scenarios where they do not alter the predictions. Probabilistic decision-making is also pervasive in many aspects of daily life and not just limited to the case of lotteries.

      While behaviour has been found to differ depending on whether lottery information is presented graphically or numerically, there is insufficient evidence to suggest biases towards additive or multiplicative evaluation, or towards positive or negative distractor effects. As such, we may expect that the correlation that we reveal in this paper, grounded in distinct neural mechanisms, would still hold even under different circumstances.

      Taking previous literature as examples, similar patterns of behaviour have been observed in humans when making decisions during trinary choice tasks. In studies conducted by Louie, Webb and colleagues (Louie et al., 2013; Webb et al., 2020), human participants performed a snack choice task where their behaviour could be modelled by divisive normalization with biphasic response (i.e., both positive and negative distractor effects). While these two studies only use a single numerical value of price for behavioural modelling, these prices should originate from an internal computation of various attributes related to each snack that are not purely related to lotteries. Expanding towards the social domain, studies of trinary decision making have considered face attractiveness and averageness (Furl, 2016), desirability of hiring (Chang et al., 2019), as well as desirability of candidates during voting (Chang et al., 2019). These choices involve considering various attributes unrelated to lotteries or numbers and yet still display a combination of positive distractor and negative distractor (i.e. divisive normalization) effects, as in the current study. In particular, the experiments carried out by Chang and colleagues (Chang et al., 2019) involved decisions in a social context that resemble real-world situations. These findings suggest that both types of distractor effects can co-exist in other value based decision making tasks (Li et al., 2018; Louie et al., 2013) as well as decision making tasks in social contexts (Chang et al., 2019; Furl, 2016).

      Reviewer #2 Comment 5

      Minor Issues:

      The definition of EV as a normative choice baseline is problematic. The analysis requires that EV is the normative choice model (this is why the HV-LV gap is analyzed and the distractor effect defined in relation to it). But if the binary/ternary interaction effect can be accounted for by curvature of a value function, this should also change the definition of which lottery is HV or LV for that subject!

      We thank the reviewer for the comment. While the initial part of the paper discussed results that were defined by the EV model, the results shown in Supplementary Figure 2 were generated by replacing the utility function with one based on values obtained using the composite model. Here, we have also redefined HV and LV for each subject according to the updated values generated by the composite model prior to the regression.

      References

      Apesteguia, J. & Ballester, M. Monotone stochastic choice models: The case of risk and time preferences. Journal of Political Economy (2018).

      Block, H. D. & Marschak, J. Random Orderings and Stochastic Theories of Responses. Cowles Foundation Discussion Papers (1959).

      Khaw, M. W., Li, Z. & Woodford, M. Cognitive Imprecision and Small-Stakes Risk Aversion. Rev. Econ. Stud. 88, 1979-2013 (2020).

      Loomes, G. & Sugden, R. Testing Different Stochastic Specifications of Risky Choice. Economica 65, 581-598 (1998).

      Luce, R. D. Individual Choice Behavior. (John Wiley and Sons, Inc., 1959).

      Netzer, N., Robson, A. J., Steiner, J. & Kocourek, P. Endogenous Risk Attitudes. SSRN Electron. J. (2022) doi:10.2139/ssrn.4024773.

      Robson, A. J. Why would nature give individuals utility functions? Journal of Political Economy 109, 900-914 (2001).

      Webb, R. The (Neural) Dynamics of Stochastic Choice. Manage Sci 65, 230-255 (2019).

      Reviewer #3 (Public Review):

      Summary:

      The way an unavailable (distractor) alternative impacts decision quality is of great theoretical importance. Previous work, led by some of the authors of this study, had converged on a nuanced conclusion wherein the distractor can both improve (positive distractor effect) and reduce (negative distractor effect) decision quality, contingent upon the difficulty of the decision problem. In very recent work, Cao and Tsetsos (2022) reanalyzed all relevant previous datasets and showed that once distractor trials are referenced to binary trials (in which the distractor alternative is not shown to participants), distractor effects are absent. Cao and Tsetsos further showed that human participants heavily relied on additive (and not multiplicative) integration of rewards and probabilities.

      The present study by Wong et al. puts forward a novel thesis according to which interindividual differences in the way of combining reward attributes underlie the absence of detectable distractor effect at the group level. They re-analysed the 144 human participants and classified participants into a "multiplicative integration" group and an "additive integration" group based on a model parameter, the "integration coefficient", that interpolates between the multiplicative utility and the additive utility in a mixture model. They report that participants in the "multiplicative" group show a positive distractor effect while participants in the "additive" group show a negative distractor effect. These findings are extensively discussed in relation to the potential underlying neural mechanisms.
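
      The interpolation performed by the integration coefficient can be illustrated with a short sketch. The function names, the equal attribute weighting `w_mag=0.5`, and the convention that a coefficient of 1 is purely multiplicative are assumptions for illustration, not the fitted model itself.

```python
def multiplicative_value(x, p):
    """Expected value: magnitude times probability."""
    return x * p

def additive_value(x, p, w_mag=0.5):
    """Weighted sum of magnitude and probability (w_mag is an assumed weight)."""
    return w_mag * x + (1.0 - w_mag) * p

def composite_value(x, p, eta, w_mag=0.5):
    """Interpolate between the two rules with integration coefficient eta.

    Assumed convention: eta = 1 is purely multiplicative, eta = 0 purely additive.
    """
    return eta * multiplicative_value(x, p) + (1.0 - eta) * additive_value(x, p, w_mag)

# Two hypothetical options as (magnitude, probability), both rescaled to [0, 1].
a = (0.9, 0.2)  # large but unlikely reward
b = (0.4, 0.5)  # moderate reward, moderate probability

# The two extremes can rank the same pair of options differently.
prefers_a_multiplicative = composite_value(*a, eta=1.0) > composite_value(*b, eta=1.0)
prefers_a_additive = composite_value(*a, eta=0.0) > composite_value(*b, eta=0.0)
```

      In this toy case the multiplicative rule favours option b while the additive rule favours option a, which is the kind of disagreement that allows a fitted coefficient to separate the two integration styles.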

      Strengths:

      - The study is forward-looking, integrating previous findings well, and offering a novel proposal on how different integration strategies can lead to different choice biases.

      - The authors did an excellent job of connecting their thesis with previous neural findings. This is a very encompassing perspective that is likely to motivate new studies towards a better understanding of how humans and other animals integrate information in decisions under risk and uncertainty.

      - Although some aspects of the paper are very technical, methodological details are well explained and the paper is very well written.

      We thank the reviewer for the positive response and are pleased that the reviewer found our report interesting.

      Reviewer #3 Comment 1

      Weaknesses:

      The authors quantify the distractor variable as "DV - HV", i.e., the relative distractor variable. Do the conclusions hold when the distractor is quantified in absolute terms (as "DV", see also Cao & Tsetsos, 2023)? Similarly, the authors show in Suppl. Figure 1 that the inclusion of a HV + LV regressor does not alter their conclusions. However, the (HV + LV)*T regressor was not included in this analysis. Does including this interaction term alter the conclusions considering there is a high correlation between (HV + LV)*T and (DV - HV)*T? More generally, it will be valuable if the authors assess and discuss the robustness of their findings across different ways of quantifying the distractor effect.

      We thank the reviewer for the comment. In the original manuscript we had already demonstrated that the distractor effect was related to the integration coefficient using a number of complementary analyses. They include Figure 5 based on GLM2, Supplementary Figure 3 based on GLM3 (i.e., adding the HV+LV term to GLM2), and Supplementary Figure 4 based on GLM2 but applying the utility estimate from the composite model instead of expected value (EV). These three sets of analyses produced comparable results. The reason why we elected not to include the (HV+LV)T term in GLM3 (Supplementary Figure 3) was due to the collinearity between the regressors in the GLM. If this term is included in GLM3, the variance inflation factor (VIF) would exceed an acceptable level of 4 for some regressors. In particular, the VIF for the (HV+LV) and (HV+LV)T regressors is 5.420, while the VIF for (DV−HV) and (DV−HV)T is 4.723.
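
      The collinearity check can be illustrated with a simplified sketch. For a regressor predicted from a single other regressor, the VIF reduces to 1 / (1 − r²), where r is their Pearson correlation; with more regressors, r² is replaced by the R² from regressing that regressor on all of the others. The toy data and function names below are assumptions for illustration and are unrelated to the study's dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_regressors(xs, ys):
    """VIF for the two-regressor case: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)

# Toy example: a value regressor and its interaction with a trial-type
# indicator T (1 = ternary trial, 0 = binary trial). When most trials have
# T = 1, the interaction term closely tracks the main term.
value = [0.1, 0.4, 0.7, 1.0, 0.2, 0.5, 0.8, 1.1]
trial_type = [1, 1, 1, 1, 1, 1, 0, 0]
interaction = [v * t for v, t in zip(value, trial_type)]

vif = vif_two_regressors(value, interaction)
exceeds_threshold = vif > 4.0  # the acceptable level of 4 stated in the response
```

      A VIF of 1 indicates no shared variance between regressors, and larger values indicate that a coefficient's variance is inflated by overlap with the other terms.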

      Here, however, we consider the additional analysis suggested by the reviewer and test whether similar results are obtained. We constructed GLM4 including the (HV+LV)T term but replacing the relative distractor value (DV-HV) with the absolute distractor value (DV) in the main term and its interactions, as follows:

      GLM4:
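
      One way to write a model matching this description is sketched below. It assumes that the dependent variable is the choice of HV over LV, that $T$ codes trial type (ternary versus binary), and that the model keeps an (HV−LV) main effect and its interactions; the coefficient labels $\beta_0$ to $\beta_9$ and the exact set of lower-order terms are illustrative assumptions.

```latex
\begin{aligned}
\operatorname{logit} P(\text{HV chosen}) ={}& \beta_0
  + \beta_1 (HV - LV) + \beta_2 (HV + LV) + \beta_3\, DV
  + \beta_4 (HV - LV)(DV) \\
 &+ \beta_5\, T + \beta_6 (HV - LV)\,T + \beta_7 (HV + LV)\,T
  + \beta_8 (DV)\,T + \beta_9 (HV - LV)(DV)\,T
\end{aligned}
```

      In this form, $\beta_8$ is the (DV)T term, the absolute-distractor effect specific to the ternary trials.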

      A significant negative (DV)T effect was found for the additive group [t(72)=−2.0253, p=0.0465] while the multiplicative group had a positive trend despite not reaching significance. Between the two groups, the (DV)T term was significantly different [t(142)=2.0434, p=0.0429]. While these findings suggest that the current conclusions could be partially replicated, simply replacing the relative distractor value with the absolute value in the previous analyses resulted in non-significant findings. Taking these results together with the main findings, it is possible to conclude that the positive distractor effect is better captured using the relative DV-HV term rather than the absolute DV term. This would be consistent with the way in which option values are envisaged to interact with one another in the mutual inhibition model (Chau et al., 2014, 2020) that generates the positive distractor effect. The model suggests that evidence is accumulated as the difference between the excitatory input from the option (e.g. the HV option) and the pooled inhibition contributed partly by the distractor. We have now included these results in the manuscript:

      “Finally, we performed three additional analyses that revealed comparable results to those shown in Figure 5. In the first analysis, reported in Supplementary Figure 3, we added an (HV+LV) term to the GLM, because this term was included in some analyses of a previous study that used the same dataset (Chau et al., 2020). In the second analysis, we added an (HV+LV)T term to the GLM. We noticed that this change led to inflation of the collinearity between the regressors and so we also replaced the (DV−HV) term by the DV term to mitigate the collinearity (Supplementary Figure 4). In the third analysis, reported in Supplementary Figure 5, we replaced the utility terms of GLM2. Since the above analyses involved using HV, LV, and DV values defined by the normative Expected Value model, here, we re-defined the values using the composite model prior to applying GLM2. Overall, in the Multiplicative Group a significant positive distractor effect was found in Supplementary Figures 3 and 4. In the Additive Group a significant negative distractor effect was found in Supplementary Figures 3 and 5. Crucially, all three analyses consistently showed that the distractor effects were significantly different between the Multiplicative Group and the Additive Group.” (Lines 225-237)

      Reviewer #3 Comment 2

      The central finding of this study is that participants who integrate reward attributes multiplicatively show a positive distractor effect while participants who integrate additively show a negative distractor effect. This is a very interesting and intriguing observation. However, there is no explanation as to why the integration strategy covaries with the direction of the distractor effect. It is unlikely that the mixture model generates any distractor effect as it combines two "context-independent" models (additive utility and expected value) and is fit to the binary-choice trials. The authors can verify this point by quantifying the distractor effect in the mixture model. If that is the case, it will be important to highlight that the composite model is not explanatory; and defer a mechanistic explanation of this covariation pattern to future studies.

      We thank the reviewer for the comment. Indeed, the main purpose of applying the mixture model was to identify the way each participant combined attributes and, as the reviewer pointed out, the mixture model per se is context independent. While we acknowledge that the mixture model is not a mechanistic explanation, there is a theoretical basis for the observation that these two factors are linked.

      Firstly, studies that have examined the processes involved when humans combine and integrate different elements to form new representations (Barron et al., 2013; Papageorgiou et al., 2017; Schwartenbeck et al., 2023) have implicated the medial frontal cortex as a crucial region (Barron et al., 2013; Fellows, 2006; Hunt et al., 2012; Papageorgiou et al., 2017; Schwartenbeck et al., 2023). Meanwhile, previous studies have also identified that positive distractor effects are linked to the medial frontal cortex (Chau et al., 2014; Noonan et al., 2017). Therefore, the current study utilized these two facts to establish the basis for a correlation between positive distractor effects and non-additive mechanisms for determining the integrated value of multi-component choices. Nevertheless, we agree with the reviewer that it will be an important future direction to look at how the covariation pattern emerges in a computational model. We have revised the manuscript in an attempt to address this issue.

      “At the neuroanatomical level, the negative distractor effect is mediated by the PPC, where signal modulation described by divisive normalization has been previously identified (Chau et al., 2014; Louie et al., 2011). The same region is also crucial for perceptual decision making processes (Shadlen & Shohamy, 2016). The additive heuristics for combining choice attributes are closer to a perceptual evaluation because distances in this subjective value space correspond linearly to differences in physical attributes of the stimuli, whereas normative (multiplicative) value has a non-linear relation with them (cf. Figure 1c). It is well understood that many sensory mechanisms, such as in primates’ visual systems or fruit flies’ olfactory systems, are subject to divisive normalization (Carandini & Heeger, 2012). Hence, the additive heuristics that are more closely based on sensory mechanisms could also be subject to divisive normalization, leading to negative distractor effects in decision making.

      In contrast, the positive distractor effect is mediated by the mPFC (Chau et al., 2014; Fouragnan et al., 2019). Interestingly, the same or adjacent, interconnected mPFC regions have also been linked to the mechanisms by which representational elements are integrated into new representations (Barron et al., 2013; Klein-Flügge et al., 2022; Law et al., 2023; Papageorgiou et al., 2017; Schwartenbeck et al., 2023). In a number of situations, such as multi-attribute decision making, understanding social relations, and abstract knowledge, the mPFC achieves this by using a spatial map representation characterised by a grid-like response (Constantinescu et al., 2016; Bongioanni et al., 2021; Park et al., 2021) and disrupting mPFC leads to the evaluation of composite choice options as linear functions of their components (Bongioanni et al., 2021). These observations suggest a potential link between positive distractor effects and mechanisms for evaluating multiple component options and this is consistent with the across-participant correlation that we observed between the strength of the positive distractor effect and the strength of non-additive (i.e., multiplicative) evaluation of the composite stimuli we used in the current task. Hence, one direction for model development may involve incorporating the ideas that people vary in their ways of combining choice attributes and each way is susceptible to different types of distractor effect.” (Lines 250-274)
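
      As an illustrative aside, the divisive normalization account mentioned in the quoted passage can be sketched numerically. The functional form below (each option value divided by a constant plus the summed value of all options) follows the general scheme reviewed by Carandini & Heeger (2012). The function names, the constant `sigma`, and the example values are assumptions chosen for illustration; this is not the model fitted in the manuscript.

```python
def normalized_values(values, sigma=1.0):
    """Divide each option value by sigma plus the summed value of all options."""
    total = sigma + sum(values)
    return [v / total for v in values]


def hv_lv_gap(hv, lv, dv, sigma=1.0):
    """Difference between the normalized HV and LV signals for a given distractor value."""
    n_hv, n_lv, _ = normalized_values([hv, lv, dv], sigma)
    return n_hv - n_lv


# Hypothetical values (not taken from the study): raising the distractor value
# shrinks the normalized HV-LV gap, the signature of a negative distractor effect.
gap_low_dv = hv_lv_gap(hv=0.6, lv=0.4, dv=0.1)
gap_high_dv = hv_lv_gap(hv=0.6, lv=0.4, dv=0.5)
```

      Under these assumptions the gap between the two chooseable options is smaller when the distractor value is larger, which would make the choice between them harder.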

      Reviewer #3 Comment 3

      -  Correction for multiple comparisons (e.g., Bonferroni-Holm) was not applied to the regression results. Is the "negative distractor effect in the Additive Group" (Fig. 5c) still significant after such correction? Although this does not affect the stark difference between the distractor effects in the two groups (Fig. 5a), the classification of the distractor effect in each group is important (i.e., should future modelling work try to capture both a negative and a positive effect in the two integration groups? Or just a null and a positive effect?).

      We thank the reviewer for the comment. We have performed Bonferroni-Holm correction and, as the reviewer surmised, the negative distractor effect in the additive group becomes non-significant. However, we have to emphasize that our major claim is that there was a covariation between decision strategy (of combining attributes) and distractor effect (as seen in Figure 4). That analysis does not involve multiple comparisons. The analysis in Figure 5 that splits participants into two groups was mainly designed to illustrate the effects for an easier understanding by a more general audience. In many cases, the precise ways in which participants are divided into subgroups can have a major impact on whether each individual group’s effects are significant or not. It may be possible to identify an optimal way of grouping, but we refrained from taking such a trial-and-error approach, especially for the analysis in Figure 5 that simply supplements the point made in Figure 4. The key notion we would like the readers to take away is that there is a spectrum of distractor effects (ranging from negative to positive) that will vary depending on how the choice attributes were integrated.
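
      For readers unfamiliar with the procedure, the Bonferroni-Holm step-down rule referred to above can be sketched as follows. The function name and the example p-values are hypothetical; they are not the regression results from the manuscript.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per test: True where the null is rejected under Holm's rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Hypothetical p-values for four regressors (illustration only).
example_p = [0.030, 0.001, 0.200, 0.012]
decisions = holm_bonferroni(example_p)
```

      In this made-up example the first p-value (0.030) would pass an uncorrected 0.05 threshold but is no longer rejected after correction, which is the kind of change described in the reply.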

      Reviewer #1 (Recommendations For The Authors):

      Reviewer #1 Recommendations 1

      Enhancements are necessary for the quality of the scientific writing. Several sentences have been written in a negligent manner and warrant revision to ensure a higher level of rigor. Moreover, a number of sentences lack appropriate citations, including but not restricted to:

      - Line 39-41.

      - Line 349-350 (also please clarify what it means by "parameter estimate" is very accurate: correlation?).

      We thank the reviewer for the comment. We have made revisions to various parts of the manuscript to address the reviewer’s concerns.

      “Intriguingly, most investigations have considered the interaction between distractors and chooseable options either at the level of their overall utility or at the level of their component attributes, but not both (Chau et al., 2014, 2020; Gluth et al., 2018).” (Lines 40-42)

      “Additional simulations have shown that the fitted parameters can be recovered with high accuracy (i.e., with a high correlation between generative and recovered parameters).” (Lines 414-416)

      Reviewer #1 Recommendations 2

      Some other minor suggestions:

      - Correlative vs. Causality: the manuscript exhibits a lack of attentiveness in drawing causal conclusions from correlative evidence (manuscript title, Line 91, Line 153-155).

      - When displaying effect size on accuracy, there is no need to show the significance of intercept (Figure 2,5, & supplementary figures).

      - Adding some figure titles on Figure 2 so it is clear what each panel stands for.

      - In Figure 3, the dots falling on zero values are not easily seen. Maybe increasing the dot size a little?

      - Line 298: binomial linking function (instead of binomial distribution).

      - Line 100: composite, not compositive.

      - Line 138-139: please improve the sentence, if it's consistent with previous findings, what's the point of "surprisingly"?

      We thank the reviewer for the suggestions. We have made revisions to the title and various parts of the manuscript to address the reviewer’s concerns.

      - Correlative vs. Causality: the manuscript exhibits a lack of attentiveness in drawing causal conclusions from correlative evidence (manuscript title, Line 91, Line 153-155).

      We have now revised the manuscript:

      “Distractor effects in decision making are related to the individual’s style of integrating choice attributes” (title of the manuscript)

      “More particularly, we consider whether individual differences in combination styles could be related to different forms of distractor effect.” (Lines 99-100)

      “While these results may seem to suggest that a distractor effect was not present at an overall group level, we argue that the precise way in which a distractor affects decision making is related to how individuals integrate the attributes.” (Lines 164-167)

      - When displaying effect size on accuracy, there is no need to show the significance of intercept (Figure 2,5, & supplementary figures).

      We have also modified all Figures to remove the intercept.

      - Adding some figure titles on Figure 2 so it is clear what each panel stands for.

      We have added titles accordingly.

      - In Figure 3, the dots falling on zero values are not easily seen. Maybe increasing the dot size a little?

      In conjunction with addressing Reviewer #3 Recommendation 6, we have adapted the violin plots into histograms for a better representation of the values.

      - Line 298: binomial linking function (instead of binomial distribution).

      - Line 100: composite, not compositive.

      - Line 138-139: please improve the sentence, if it's consistent with previous findings, what's the point of "surprisingly"?

      We have made revisions accordingly.

      Reviewer #2 (Recommendations For The Authors):

      Reviewer #2 Recommendations 1

      Line 294. The definition of DV, HV, LV is not sufficient. Presumably, these are the U from the following sections? Or just EV? But this is not explicitly stated, rather they are vaguely referred to as "values." The computational modelling section refers to them as utilities. Are these the same thing?

      We thank the reviewer for the suggestion. We have clarified the exact method for calculating each of the values and updated the section accordingly.

      “where HV, LV, and DV refer to the values of the chooseable higher value option, chooseable lower value option, and distractor, respectively. Here, values (except those in Supplementary Figure 5) are defined as Expected Value (EV), calculated by multiplying magnitude and probability of reward.” (Lines 348-350)

      Reviewer #2 Recommendations 2

      The analysis drops trials in which the distractor was chosen. These trials are informative about the presence (or not) of relative valuation or other factors because they make such choices more (or less) likely. Ignoring them is another example of the analysis being misspecified.

      We thank the reviewer for the suggestion and this is related to Major Issue 1 raised by the same reviewer. In brief, we adopted the same methods implemented by Cao and Tsetsos (Cao and Tsetsos, 2022) and that constrained us to applying a binomial model. Please refer to our reply to Major Issue 1 for more details.

      Reviewer #2 Recommendations 3

      Some questions and suggestions on statistics and computational modeling:

      Have the authors looked at potential collinearity between the regressors in each of the GLMs?

      We thank the reviewer for the comment. For each of the following GLMs, the average variance inflation factor (VIF) has been calculated as follows:

      GLM2 using the Expected Value model:

      Author response table 1.

      GLM2 after replacing the utility function based on the normative Expected Value model with values obtained by using the composite model:

      Author response table 2.

      GLM3:

      Author response table 3.

      As indicated by the average VIF values calculated, none of them exceeds 4, suggesting that the estimated coefficients were not inflated due to collinearity between the regressors in each of the GLMs.
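
      As a minimal sketch of how such a check can be carried out, the variance inflation factor of each regressor equals the corresponding diagonal element of the inverse of the regressors' correlation matrix. The helper names and the toy regressors below are hypothetical and are not the design matrices of GLM2 or GLM3; the threshold of 4 is the one used in the reply.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long regressor columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([x - mean for x in col])
    norms = [math.sqrt(sum(x * x for x in col)) for col in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def variance_inflation_factors(columns):
    """VIF of each regressor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Toy regressors (illustration only); their correlation is 0.6.
regressor_a = [1.0, 2.0, 3.0, 4.0]
regressor_b = [2.0, 1.0, 4.0, 3.0]
vifs = variance_inflation_factors([regressor_a, regressor_b])
mean_vif = sum(vifs) / len(vifs)
below_threshold = mean_vif < 4.0
```

      With two regressors correlated at 0.6, each VIF is 1/(1 − 0.36) = 1.5625, well below the threshold of 4.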

      Reviewer #2 Recommendations 4

      - Correlation results in Figure 4. What is the regression line displayed on this plot? I suspect the regression line came from Pearson's correlation, which would be inconsistent with the Spearman's correlation reported in the text. A reasonable way would be to transform both x and y axes to the ranked data. However, I wonder why it makes sense to use ranked data for testing the correlation in this case. Those are both scalar values. Also, did the authors assess the influence of the zero integration coefficient on the correlation result? Importantly, did the authors redo the correlation plot after defining the utility function by the composite models?

      We thank the reviewer for the suggestion. The plotted line in Figure 4 was based on the Pearson’s correlation and we have modified the text to also report the Pearson’s correlation result.

      If we were to exclude the 32 participants with integration coefficients smaller than 1×10⁻⁶ from the analysis, we still observe a significant positive Pearson’s correlation [r(110)=0.202, p=0.0330].

      Author response image 1.

      Figure 4 after excluding 32 participants with integration coefficients smaller than 1×10⁻⁶.

      “As such, we proceeded to explore how the distractor effect (i.e., the effect of (DV−HV)T obtained from GLM2; Figure 2c) was related to the integration coefficient (η) of the optimal model via a Pearson’s correlation (Figure 4). As expected, a significant positive correlation was observed [r(142)=0.282, p=0.000631]. We noticed that there were 32 participants with integration coefficients that were close to zero (below 1×10⁻⁶). The correlation remained significant even after removing these participants [r(110)=0.202, p=0.0330].” (Lines 207-212)
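
      The exclusion analysis described above can be sketched in a few lines. Note that the two reported degrees of freedom (142 and 110) differ by exactly the 32 excluded participants, as expected when the degrees of freedom equal the number of participants minus two. The function names and the toy data are hypothetical; only the 1×10⁻⁶ cut-off is taken from the reply.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_excluding_near_zero(coefficients, effects, threshold=1e-6):
    """Drop participants whose coefficient is below the threshold, then correlate."""
    kept = [(c, e) for c, e in zip(coefficients, effects) if c >= threshold]
    xs, ys = zip(*kept)
    degrees_of_freedom = len(kept) - 2
    return pearson_r(xs, ys), degrees_of_freedom


# Toy data (illustration only): the first two participants have near-zero coefficients.
toy_coefficients = [0.0, 1e-9, 0.2, 0.4, 0.6, 0.8]
toy_effects = [0.3, -0.1, 0.1, 0.2, 0.3, 0.4]
r_kept, dof_kept = correlation_excluding_near_zero(toy_coefficients, toy_effects)
```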

      The last question relates to results already included in Supplementary Figure 5, in which the analyses were conducted using the utility function of the composite model. We notice that although there was a difference in integration coefficient between the multiplicative and additive groups, a correlational analysis did not generate significant results [r(142)=0.124, p=0.138]. It is possible that the relationship became less linear after applying the composite model utility function. However, it is noticeable that in a series of complementary analyses (Figure 5: r(142)=0.282, p=0.000631; Supplementary Figure 3: r(142)=0.278, p=0.000746) comparable results were obtained.

      Reviewer #2 Recommendations 5

      - From lines 163-165, were the models tested on only the three-option trials or both two- and three-option trials? It is ambiguous from the description here. It might be worth checking the model comparison based on different trial types, and the current model fitting results do not tell an absolute sense of the goodness of fit. I would suggest including the correctly predicted trial proportions in each trial type from different models.

      We thank the reviewer for the suggestion. We have only modeled the two-option trials, and the key reason is that the two-option trials can arguably provide a better estimate of participants’ style of integrating attributes as they are independent of any distractor effects. This was also the reason why Cao and Tsetsos applied the same approach when they were re-analyzing our data (Cao and Tsetsos, 2022). We have clarified the statement accordingly.

      “We fitted these models exclusively to the Two-Option Trial data and not the Distractor Trial data, such that the fitting (especially that of the integration coefficient) was independent of any distractor effects, and tested which model best describes participants’ choice behaviours.” (Lines 175-178)

      Reviewer #2 Recommendations 6

      - Along with displaying the marginal distributions of each parameter estimate, a correlation plot of these model parameters might be useful, given that some model parameters are multiplied in the value functions.

      We thank the reviewer for the suggestion. We have also generated the correlation plot of the model parameters. The Pearson’s correlation between the magnitude/probability weighting and integration coefficient was significant [r(142)=−0.259, p=0.00170]. The Pearson’s correlation between the inverse temperature and integration coefficient was not significant [r(142)=−0.0301, p=0.721]. The Pearson’s correlation between the inverse temperature and magnitude/probability weighting was not significant [r(142)=−0.0715, p=0.394].

      “Our finding that the average integration coefficient η was 0.325 coincides with previous evidence that people were biased towards using an additive, rather than a multiplicative rule. However, it also shows that, rather than being fully additive (η=0) or multiplicative (η=1), people’s choice behaviour is best described as a mixture of both. Supplementary Figure 1 shows the relationships between all the fitted parameters.” (Lines 189-193)
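
      To make the roles of the three fitted parameters concrete, a composite utility of this kind can be sketched as below. Only the interpretation of the integration coefficient (0 for fully additive, 1 for fully multiplicative) is taken from the text. The exact functional form, the logistic choice rule, and the names `gamma` (magnitude/probability weighting) and `beta` (inverse temperature) are assumptions made for illustration.

```python
import math


def composite_utility(magnitude, probability, eta, gamma):
    """Mix a multiplicative (expected value) and an additive utility.

    magnitude and probability are assumed to be rescaled to the 0-1 interval.
    """
    multiplicative = magnitude * probability
    additive = gamma * magnitude + (1.0 - gamma) * probability
    return eta * multiplicative + (1.0 - eta) * additive


def choice_probability(option_a, option_b, eta, gamma, beta):
    """Probability of choosing option_a over option_b under a logistic choice rule."""
    u_a = composite_utility(option_a[0], option_a[1], eta, gamma)
    u_b = composite_utility(option_b[0], option_b[1], eta, gamma)
    return 1.0 / (1.0 + math.exp(-beta * (u_a - u_b)))


# Hypothetical two-option trial given as (magnitude, probability) pairs.
option_a = (0.8, 0.3)
option_b = (0.4, 0.7)
p_additive = choice_probability(option_a, option_b, eta=0.0, gamma=0.5, beta=10.0)
p_multiplicative = choice_probability(option_a, option_b, eta=1.0, gamma=0.5, beta=10.0)
```

      In this toy trial a purely additive chooser is indifferent between the options, whereas a purely multiplicative chooser prefers the second option, so the same pair of options can separate the two integration styles.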

      Reviewer #2 Recommendations 7

      Have the authors tried any functional transformations on amounts or probabilities before applying the weighted sum? The two attributes are on entirely different scales and thus may not be directly summed together.

      We thank the reviewer for the comment. Amounts and probabilities were indeed both rescaled to the 0-1 interval before being summed, as explained in the methods (Line XXX). Additionally, we have now added and performed model fitting on an additional model with utility curvature based on the prospect theory (Kahneman & Tversky, 1979) and a weighted probability function (Prelec, 1998):

      where  and  represent the reward magnitude and probability (both rescaled to the interval between 0 and 1), respectively.  is the weighted magnitude and  is the weighted probability, while  and  are the corresponding distortion parameters. This prospect theory (PT) model was included along with the four previous models (please refer to Figure 3) in a Bayesian model comparison. Results indicate that the composite model remains as the best account of participants’ choice behaviour (exceedance probability = 1.000, estimated model frequency = 0.720).

      “Supplementary Figure 2 reports an additional Bayesian model comparison performed while including a model with nonlinear utility functions based on Prospect Theory (Kahneman & Tversky, 1979) with the Prelec formula for probability (Prelec, 1998). Consistent with the above finding, the composite model provides the best account of participants’ choice behaviour (exceedance probability = 1.000, estimated model frequency = 0.720).” (Lines 193-198)
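
      For orientation, a prospect-theory utility with Prelec probability weighting can be sketched as follows. The equation used in the reply is not reproduced in this text, so the power-law value function, the one-parameter Prelec form, and the parameter names `alpha` and `gamma` below are assumptions for illustration and may differ from the authors' exact specification.

```python
import math


def weighted_magnitude(x, alpha):
    """Power-law utility curvature for a magnitude x rescaled to the 0-1 interval."""
    return x ** alpha


def prelec_weight(p, gamma):
    """One-parameter Prelec (1998) probability weighting: exp(-(-ln p) ** gamma)."""
    if p <= 0.0:
        return 0.0
    return math.exp(-((-math.log(p)) ** gamma))


def pt_utility(x, p, alpha, gamma):
    """Utility as the product of the weighted magnitude and the weighted probability."""
    return weighted_magnitude(x, alpha) * prelec_weight(p, gamma)


# With alpha = gamma = 1 the sketch reduces to expected value (magnitude x probability).
ev_like = pt_utility(0.5, 0.4, alpha=1.0, gamma=1.0)
# With gamma < 1 small probabilities are overweighted.
small_p_weight = prelec_weight(0.05, gamma=0.6)
```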

      Reviewer #3 (Recommendations For The Authors):

      Reviewer #3 Recommendations 1

      - In the Introduction (around line 48), the authors make the case that distractor effects can co-exist in different parts of the decision space, citing Chau et al. (2020). However, if the distractor effect is calculated relative to the binary baseline this is no longer the case.

      - Relating to the above point, it might be useful for the authors to make a distinction between effects being non-monotonic across the decision space (within individuals) and effects varying across individuals due to different strategies adopted. These two scenarios are conceptually distinct.

      We thank the reviewer for the comment. Indeed, the ideas that distractor effects may vary across decision space and across different individuals are slightly different concepts. We have now revised the manuscript to clarify this:

      “However, as has been argued in other contexts, just because one type of distractor effect is present does not preclude another type from existing (Chau et al., 2020; Kohl et al., 2023). Each type of distractor effect can dominate depending on the dynamics between the distractor and the chooseable options. Moreover, the fact that people have diverse ways of making decisions is often overlooked. Therefore, not only may the type of distractor effect that predominates vary as a function of the relative position of the options in the decision space, but also as a function of each individual’s style of decision making.” (Lines 48-54)

      Reviewer #3 Recommendations 2

      - The idea of mixture models/strategies has strong backing from other Cognitive Science domains and will appeal to most readers. It would be very valuable if the authors could further discuss the potential level at which their composite model might operate. Are the additive and EV quantities computed and weighted (as per the integration coefficient) within a trial giving rise to a composite decision variable? Or does the integration coefficient reflect a probabilistic (perhaps competitive) selection of one strategy on a given trial? Perhaps extant neural data can shed light on this question.

      We thank the reviewer for the comment. The idea is related to whether the observed mixture in integration models derives from value being actually computed in a mixed way within each trial, or each trial involves a probabilistic selection between the additive and multiplicative strategies. We agree that this is an interesting question and to address it would require the use of some independent continuous measures to estimate the subjective values in quantitative terms (instead of using the categorical choice data). This could be done by collecting pupil size data or functional magnetic resonance imaging data, as the reviewer has pointed out. Although the empirical work is beyond the scope of the current behavioural study, it is worth bringing up this point in the Discussion:

      “The current finding involves the use of a composite model that arbitrates between the additive and multiplicative strategies. A general question for such composite models is whether people mix two strategies in a consistent manner on every trial or whether there is some form of probabilistic selection occurring between the two strategies on each trial such that only one strategy is used on any given trial while, on average, one strategy is more probable than the other. To test which is the case requires an independent estimation of subjective values in quantitative terms, such as by pupillometry or functional neuroimaging. Further understanding of this problem will also provide important insight into the precise way in which distractor effects operate at the single-trial level.” (Lines 275-282)

      Reviewer #3 Recommendations 3

      Line 80 "compare pairs of attributes separately, without integration". This additive rule (or the within-attribute comparison) implies integration, it is just not multiplicative integration.

      We thank the reviewer for the comment. We have made adjustments to the manuscript to ensure that the message delivered within this manuscript is consistent.

      “For clarity, we stress that the same mathematical formula for additive value can be interpreted as meaning that 1) subjects first estimate the value of each option in an additive way (value integration) and then compare the options, or 2) subjects compare the two magnitudes and separately compare the two probabilities without integrating dimensions into overall values. On the other hand, the mathematical formula for multiplicative value is only compatible with the first interpretation. In this paper we focus on attribute combination styles (multiplicative vs additive) and do not make claims on the order of the operations. More particularly, we consider whether individual differences in combination styles could be related to different forms of distractor effect.” (Lines 92-100)

      Reviewer #3 Recommendations 4

      - Not clear why the header in line 122 is phrased as a question.

      We thank the reviewer for the suggestion. We have modified the header to the following:

      “The distractor effect was absent on average” (Line 129)

      Reviewer #3 Recommendations 5

      - The discussion and integration of key neural findings with the current thesis are outstanding. It might help the readers if certain statements such as "the distractor effect is mediated by the PPC" (line 229) were further unpacked.

      We thank the reviewer for the suggestion. We have made modifications to the original passage to further elaborate the statement.

      “At the neuroanatomical level, the negative distractor effect is mediated by the PPC, where signal modulation described by divisive normalization has been previously identified (Chau et al., 2014; Louie et al., 2011). The same region is also crucial for perceptual decision making processes (Shadlen & Shohamy, 2016).” (Lines 250-253)

      Reviewer #3 Recommendations 6

      - In Fig. 3c, there seem to be many participants having the integration coefficient close to 0 but the present violin plot doesn't seem to best reflect this highly skewed distribution. A histogram would be perhaps better here.

      We thank the reviewer for the suggestion. We have modified the descriptive plots to use histograms instead of violin plots.

      “Figures 3c, d and e show the fitted parameters of the composite model: η, the integration coefficient determining the relative weighting of the additive and multiplicative value ( , ); , the magnitude/probability weighting ratio ( , ); and , the inverse temperature ( , ). Our finding that the average integration coefficient η was 0.325 coincides with previous evidence that people were biased towards using an additive, rather than a multiplicative rule.” (Lines 186-191)

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The current manuscript by Hajra et al deals with the role of the prominent Sirtuins SIRT1 and -3 during infection of macrophages with Salmonella Typhimurium (ST). Apparently, ST infection induces upregulation of host cell SRTs to aid its own metabolism during the intracellular lifestyle and to help reprogramming macrophage polarization. The manuscript has two parts, namely one part that deals with Salmonella infection in cells, where RAW 264.7 murine macrophage-like cells, sharing some features with primary macrophages, were employed. Infected RAW cells displayed a tendency to polarize towards wound-healing M2 and not inflammatory M1 macrophages, which was dependent on SRT. Consequently, the inflammatory response in RAW was more robust in the absence of SRT. Moreover, loss of SRTs leads to impaired bacterial proliferation in these cells, which was attributed to defects in metabolic adaption of the bacteria in the absence of SRT-activity and to the increased M1 inflammatory response.

      Unfortunately, the line of argumentation remains incomplete because corresponding assays in mice showed the opposite result as compared to the experiments using RAW 264.7 cells, i.e., loss of SRTs leads to increased bacterial load in animals (versus impaired proliferation in RAW 264.7 cells). The authors cannot explain this discrepancy.

      Strengths:

      Extensive analysis of Salmonella infection in RAW macrophage-like cells and mice in the context of SRT1/3 function.

      Weaknesses:

      Lack of connection between the cell-based and organismic data, which are not supportive of each other.

      We are highly grateful for your valuable and insightful comments. Thank you for appreciating the merit of our manuscript. We acknowledge that the findings from the RAW264.7 cell line (Fig. 2A), primary peritoneal macrophages (ex vivo) (Fig. 2B), and the in vivo mouse model (Fig. 8) show opposing phenotypes. Both RAW264.7 macrophage and peritoneal macrophage infection show attenuated intracellular bacterial proliferation owing to the heightened proinflammatory burst. This is in sharp contrast to our in vivo mouse model of infection, which shows increased organ burden and bacterial dissemination. The higher bacterial load in the organs including the spleen (Fig. 8B) is attributed to the increased pro-inflammatory cytokine burst and ROS production (Fig. 8F-H, Fig. S9) triggering bacterial dissemination. Pro-inflammatory mediators such as IL-6, IL-1β and ROS, which limit bacterial proliferation within the macrophages (F4/80+ macrophages within the spleen, RAW264.7 macrophages, or primary peritoneal macrophages), facilitate bacterial dissemination in blood and to the other organs (Fig. 8I-L, Fig. S3F-G). This is in line with the following previous findings:

      Klebsiella pneumoniae infection triggers an inflammatory response via secretion of IL-6 upon HIF-1α activation that induces bacterial dissemination (Holden VI, Breen P, Houle S, Dozois CM, Bachman MA. Klebsiella pneumoniae Siderophores Induce Inflammation, Bacterial Dissemination, and HIF-1α Stabilization during Pneumonia. mBio. 2016 Sep 13;7(5):e01397-16. doi: 10.1128/mBio.01397-16. PMID: 27624128; PMCID: PMC5021805.).

      Correlation analysis of immune responses to Salmonella infection revealed that increased innate immune “cassette” opposes the adaptive immune arm leading to increased bacterial load in mice (Hotson AN, Gopinath S, Nicolau M, Khasanova A, Finck R, Monack D, et al. Coordinate actions of innate immune responses oppose those of the adaptive immune system during Salmonella infection of mice. Science signaling. 2016;9(410):ra4). 

      In our revised manuscript, we have assessed additional splenic populations including CD45+, Ly6C+, and CD11c+ populations. Our results show that the CD45+ splenic population shows increased bacterial loads, similar to the total splenic population, within the SIRT1/3-inhibited cohorts. However, CD45+ monocytes and the Ly6C-positive splenic population exhibit a reduced burden within the SIRT1/3-inhibited cohorts. Moreover, within the CD11c+ population, CD45+ granulocytes or lymphocytes show organ loads comparable to those of the vehicle control or SIRT1 activator-treated mouse groups (Fig. 8M-S, Fig. S8). Overall, our data suggest heterogeneous bacterial burden in diverse splenic populations.

      Reviewer #2 (Public Review):

      Dipasree Hajra et al demonstrated that Salmonella was able to modulate the expression of Sirtuins (Sirt1 and Sirt3) and regulate the metabolic switch in both host and Salmonella, promoting its pathogenesis. The authors found Salmonella infection induced high levels of Sirt1 and Sirt3 in macrophages, which were skewed toward the M2 phenotype allowing Salmonella to hyper-proliferate. Mechanistically, Sirt1 and Sirt3 regulated the acetylation of HIF-1alpha and PDHA1, therefore mediating Salmonella-induced host metabolic shift in the infected macrophages. Interestingly, Sirt1 and Sirt3-driven host metabolic switch also had an effect on the metabolic profile of Salmonella. Counterintuitively, inhibition of Sirt1/3 led to increased pathogen burdens in an in vivo mouse model. Overall, this is a well-designed study. There are a few comments below that would further strengthen the current study.

      Major comments:

      In the in vivo study (lines 436-446) - the authors noticed increased pathogen burden in the EX-527 or the 3TYP-treated mice cohorts but decreased pathogen burden within the F4/80+ macrophage population. What are the other cell types that have increased pathogen burden in splenocytes from EX-527 or the 3TYP treated? Can this be further explored and explained?

      While the authors indicated that IL-6 cytokine storm and elevated ROS production could result in bacterial dissemination in vivo, one could also argue that Sirt1/3 inhibitors might have an impact on gut function and/or gut microbiota (PMID: 22115311). Did Sirt1/3 inhibitors also lead to increased pathogen burdens in the gut? If so, the potential effect of these in vivo treatments on gut microbiota/colonization resistance should be discussed.

      Minor comment:

      Sirt1 has been shown to be degraded during Salmonella infection (PMID: 28192515), which is different from the current study. An explanation should be provided for this.

      We thank you for your encouraging and gracious comments. We deeply appreciate your time and efforts in providing constructive feedback for the betterment of our work. As per your valuable suggestions, we have assessed additional splenic populations including CD45+, Ly6C+, and CD11c+ populations apart from F4/80+ macrophage populations. Our analysis suggests that the CD45+ splenic population shows increased bacterial loads, similar to the total splenic population, within the SIRT1/3-inhibited cohorts. However, CD45+ monocytes and the Ly6C-positive splenic population exhibit a reduced burden within the SIRT1/3-inhibited cohorts. Moreover, the CD11c+ population, CD45+ granulocytes or lymphocytes show organ loads comparable to those of the vehicle control or SIRT1 activator-treated mouse groups (Fig. 8M-S). Overall, our data suggest heterogeneous bacterial burden in diverse splenic populations.

      We immensely appreciate the reviewer for this insightful question about the effect of SIRT1/3 on the gut per se. To answer your question, we observed increased pathogen loads within the mesenteric lymph nodes of the gut in the SIRT1/3 inhibitor-treated mouse groups (Fig. 8B). In our revised manuscript, we evaluated gut inflammation via IL-1β estimation in mouse ileal tissues and observed heightened IL-1β production in the inhibitor-treated cohorts in comparison to the vehicle control (Fig. S3G). We have also examined gut epithelial pathology via Haematoxylin-Eosin (H&E) staining of the ileal sections to address the effect of in vivo treatment on gut microbiota and colonization resistance, which is appended here. However, the gut microbiota crosstalk and its effect on colonization resistance are part of a separate ongoing study and are being examined in detail there. Therefore, this appended H&E has not been incorporated in the revised manuscript.

      Author response image 1.

      In line with the reference PMID: 28192515, where Sirt1 has been shown to be degraded during Salmonella infection at later time points of infection, our study also has shown that both SIRT1 mRNA (Fig. 1A) and protein levels (Fig. S1A) show an elevated expression at 2h and 6h post-infection and show a downregulation at 16h in comparison to the 6h time point. However, SIRT3 expression levels remain elevated even at later time points of infection. Therefore, we speculate that there is a shared role between SIRT1 and SIRT3 that facilitates the phenotypes reported in our study.

      Reviewer #3 (Public Review):

      Summary:

      In this paper, Hajra et al have attempted to identify the role of Sirt1 and Sirt3 in regulating metabolic reprogramming and macrophage host defense. They have performed gene knockdown experiments in RAW macrophage cell lines to show that depletion of Sirt1 or Sirt3 enhances the ability of macrophages to eliminate Salmonella Typhimurium. However, in mice, inhibition of Sirt1 resulted in dissemination of the bacteria but the bacterial burden was still reduced in macrophages. They suggest that the effect they have observed is due to increased inflammation and ROS production by macrophages. They also try to establish a weak link with metabolism. They present data to show that the switch in metabolism from glycolysis to fatty acid oxidation is regulated by acetylation of Hif1a, and PDHA1.

      Strengths:

      The strength of the manuscript is that the role of Sirtuins in host-pathogen interactions has not been previously explored in-depth making the study interesting. It is also interesting to see that depletion of either Sirt1 or Sirt3 results in a similar outcome.

      Weaknesses:

      The major weakness of the paper is the low quality of data, making it harder to substantiate the claims. Also, there are too many pathways and mechanisms being investigated. It would have been better if the authors had focussed on either Sirt1 or Sirt3 and elucidated how it reprograms metabolism to eventually modulate host response against Salmonella Typhimurium. Experimental evidence is also lacking to prove the proposed mechanisms. For instance, they show correlative data that the knockdown of Sirt1-mediated shift in metabolism is due to HIF1a acetylation but this needs to be proven with further experiments.

      We appreciate the reviewer’s critical analysis of our work. In the revised manuscript, we aimed to eliminate the low-quality data sets and have tried to substantiate them with better and conclusive ones, as directed in the recommendations for the author section. We agree with the reviewer that the inclusion of both Sirtuins 1 and 3 has resulted in too many pathways and mechanisms and focusing on one SIRT and its mechanism of metabolic reprogramming and immune modulation would have been a less complicated alternative approach. However, as rightly pointed out, our work demonstrated the shared and few overlapping roles of the two sirtuins, SIRT1 and SIRT3, together mediating the immune-metabolic switch upon Salmonella infection. As per the reviewer’s suggestion, we have performed additional experiments with HIF-1α inhibitor treatment in our revised manuscript to substantiate our correlative findings on SIRT1-mediated regulation of host glycolysis (Fig.7G).

      Reviewer #1 (Recommendations For The Authors):

      The authors state "SIRT1 and SIRT3 inhibition resulted in increased pathogen loads in organs and triggered enhanced bacterial dissemination, together leading to increased susceptibility of the mice to S. Typhimurium infection owing to increased ROS and IL-6 production." How can this be reconciled? To the reviewer, this is not a convincing explanation. The reviewer is not a mouse pathologist, so maybe did not understand the argument in full.

      However, in order to clarify whether these phenomena can be brought into context and explained by for instance cell-autonomous (in (RAW) macrophages) versus non-autonomous (in mice) mechanisms, it would be required to bring in context the organismic phenotype with a cellular phenotype, using more physiologic primary macrophages.

      (1) The authors show in Figure 8 that in general SRT inhibition leads to increased infection whereas SRT activation results in decreased infection. This is even true for the spleen (e.g. Figure 8B), which should be full of macrophages upon infection.

      (2) Only Figure 8L implies that endogenous primary, splenic macrophages show a higher infection rate upon pharmacologic SRT activation, which would potentially mirror the RAW results. This is however not supportive of their own explanation: Who would now produce more ROS and IL6 if these macrophages are more supportive of intracellular ST? Is there a difference in the roles or SRTs between different types of macrophages and/or neutrophils? And between macrophages and somatic cells concerning ST infection? The reviewer tends to believe that RAW cells display a defective killing response (such as ROS production) as they are highly transformed cells. Therefore, the authors should use cultured peritoneal macrophages or BMDMs in addition to RAW264.7 cells.

      The literature cited by the authors also implies that the inflammatory response in mice is higher in the absence of SRTs. This is in line with a role for SRTs in (negatively) regulating M1 inflammatory polarization but probably not with increased bacterial burden in mice. If it was, then increased dissemination could be explained by increased tissue damage. However, the flow cytometry experiments from infected organs then do not confirm that, as the infection of individual cells is higher upon SRT inhibition. Thus there seems a broad gap between the role of SRTs in ST infection in RAW264.7 cells versus non-transformed cells.

      I would not discard the RAW results, as I am convinced that they contain valuable data. However, it needs to be clarified what aspect of the host response RAW 264.7 cells represent. Primary macrophages might likely be more aggressive towards the bacteria. Finally, the question arises: what is the role of the metabolic switch in the in vivo setting?

      The reviewer recommends repeating some key experiments by in-vitro-infecting BMDMs or isolated peritoneal macrophages (after some days of culturing) to bridge between the present RAW-derived data and the mouse data. How is the bacterial load with and without SRT inhibitor/activator in primary macrophages, when infected outside of the body? Can ex-vivo infection also affect polarization of e.g. peritoneal macrophages or the metabolic switch? If it is possible to find a conclusive explanation for their data, then this story might really add to our understanding of another aspect of how ST manipulates the host to survive.

      In case the reviewer understands the mouse experiments correctly, all assays on peritoneal cells were performed after in-vivo-infection and/or treatment.

      Together, RAW 264.7 murine macrophage-like cells might not be the right model to understand the phenotypes in full. As far as the reviewer knows, these cells are not capable of killing bacteria as effectively as activated primary macrophages or neutrophils.

      A few of the key findings from RAW264.7 macrophages have been replicated in primary peritoneal macrophages (Fig. 2B, S3E-F, S6B, S7B-D). We wanted to clarify that the peritoneal macrophage experiments were performed ex vivo: peritoneal macrophages were isolated from mice and then subjected to SIRT1/3 inhibitor treatments and Salmonella infection, and not analyzed after in vivo treatment or infection. In the ex vivo setting, we have examined the effect of SIRTs on the metabolic switch during Salmonella infection (Fig. S7B-D), which resembled our RAW264.7 macrophage data. Additionally, in the in vivo setting, we have analyzed the transcript-level expression of host metabolic genes and corresponding bacterial metabolic genes in infected mice liver and spleen tissue under SIRT1/3 inhibitor treatment (Fig. S7E-F, Fig. 6C-D). Our primary peritoneal macrophage data mirror the RAW264.7 macrophage findings, showing attenuated intracellular bacterial proliferation owing to the heightened proinflammatory burst upon SIRT1/3 knockdown or inhibition (Fig. 2A-B). This is opposite to our in vivo mouse model of infection, which shows increased organ burden and bacterial dissemination (Fig. 8A-H). The pro-inflammatory arsenals that limit bacterial proliferation within the macrophages (F4/80+ macrophages within the spleen, RAW264.7 macrophages, or primary peritoneal macrophages) facilitate bacterial dissemination in blood and to the other organs owing to tissue damage (Fig. 8E-L). This is in line with the following previous findings:

      Klebsiella pneumoniae infection triggers an inflammatory response via secretion of IL-6 upon HIF-1α activation that induces bacterial dissemination (Holden VI, Breen P, Houle S, Dozois CM, Bachman MA. Klebsiella pneumoniae Siderophores Induce Inflammation, Bacterial Dissemination, and HIF-1α Stabilization during Pneumonia. mBio. 2016 Sep 13;7(5):e01397-16. doi: 10.1128/mBio.01397-16. PMID: 27624128; PMCID: PMC5021805.).

      Correlation analysis of immune responses to Salmonella infection revealed that increased innate immune “cassette” opposes the adaptive immune arm leading to increased bacterial load in mice (Hotson AN, Gopinath S, Nicolau M, Khasanova A, Finck R, Monack D, et al. Coordinate actions of innate immune responses oppose those of the adaptive immune system during Salmonella infection of mice. Science Signaling. 2016;9(410):ra4). 

      As per the reviewer’s suggestions, we have analyzed other populations apart from F4/80+ macrophages and have observed that the CD45+ splenic population shows increased bacterial loads like that of the total splenic population within the SIRT1/3 inhibited cohorts. However, CD45+ monocytes and the Ly6C+ splenic population exhibit compromised burden within the SIRT1/3 inhibited cohorts. Moreover, the CD11c+ population and CD45+ granulocytes or lymphocytes show organ loads comparable to those of the vehicle control or SIRT1 activator-treated mice groups (Fig. 8M-S, Fig. S8). Overall, our data suggest heterogeneous bacterial burden in diverse splenic populations.

      Reviewer #3 (Recommendations For The Authors):

      Abstract

      The authors state that perturbing Sirt1 and Sirt3 results in a shift in Salmonella's metabolism. On the contrary, the data reflects the metabolism in the host cell and not the bacteria. This statement is wrong. They only show increased expression of some of the glycolytic genes in Salmonella, which is not sufficient to make the claim that the switch to fatty acid oxidation in macrophages is due to utilisation of glucose by the bacteria.

      We value the reviewer’s response and have accordingly reframed our sentence in the abstract (Line 24-25).

      Fig 1: Expression of Sirt1 - The data needs to be supported with a western blot for Sirt1 and Sirt3 but the Western blots shown in the supplementary figure are of very poor quality and do not support the authors' claim.

      We have repeated the western blot and have supplemented the previous blot with an alternate blot in Fig. S1A as per your precious input.

      Why haven't the authors shown any representative blots for Sirt1 and Sirt3 upon infection with Salmonella mutants? They need to italicize the genes when they describe mRNA expression.

      Previously we had only performed transcript-level expression of Sirt1 and Sirt3 upon infection with Salmonella mutants and therefore representative blot image was absent. The gene names have been duly italicized while describing mRNA expression (Line 126-154). We regret the inconvenience caused. We have performed the western blotting to assess the protein expression profile upon infection with Salmonella mutants as per the reviewer’s suggestion and the representative blot image has been duly appended in the revised manuscript (Fig. S1B).

      What is the rationale for examining Sirt1 and Sirt3 mRNA in M1 and M2 macrophages? Salmonella infection on its own will polarise the macrophages towards M1. How long were these macrophages infected? The time points are missing.

      The rationale behind examining Sirt1 and Sirt3 mRNA in M1- and M2-polarized macrophages was to ascertain whether M1-polarized macrophages indeed exhibit decreased expression of Sirt1 or Sirt3 and whether polarization of macrophages toward the M2 state leads to upregulation of Sirt1 and Sirt3 upon Salmonella infection. After confirming these findings through this preliminary experiment, we hypothesized that Salmonella infection on its own polarises the macrophages toward an immunosuppressive M2 state at a later time course of infection, as infection drives the induction of SIRT expression, and asked whether this is mediated by Sirt1 and Sirt3 (Fig. 3). We apologize for not mentioning the 16h time point in the figure; the missing time point has been duly documented in the revised manuscript (Line 155).

      Fig S2 knockdown of Sirt1 and Sirt3 are not convincing.

      We are extremely sorry for the inconclusive knockdown blot. An alternative blot has been provided in the revised manuscript (Fig. S2C-D).

      Fig 2A and 2B the time point post infection has not been mentioned. Although it is stated that 2h and 16h post-infection samples were analysed. Only one time point has been shown.

      We are sorry for the confusion. We wanted to clarify that Fig.2A and Fig. 2B show the fold proliferation where fold proliferation was calculated as CFU at 16hr divided by CFU at 2hr as mentioned in the materials and methods section under the heading of Intracellular proliferation or gentamicin protection assay.

      Fold Proliferation= [CFU at 16h]/[CFU at 2h]

      The cytokines data are intriguing in that the increase in IL-6 relative to control is seen only at 2h and 20h but not at 6h. Il-6 at 20h in untransfected cells is comparable to uninfected cells. Did the authors investigate cell death? Salmonella induces various forms of cell death which could account for the decreased cytokine production at later time points.

      We have investigated cell death upon Salmonella infection via MTT assay. At later time points of infection, we indeed observed an approximately 16 percent decrease in cell survival compared to the initial time point of 2h. The results have been appended here, and they support the reviewer’s reasoning for the decreased cytokine production at later time points.

      Author response image 2.

      Additional cytokines such as IL-1b would be helpful. Also, not sure how uninfected macrophages produce nearly 200pg of IL-10.

      As per the reviewer’s critical suggestion, we have assessed IL-1b cytokine production at 16h post-infection in RAW264.7 macrophages and peritoneal macrophages, and in mice serum samples at the 5th day post-infection (Fig. S3C, S3E-F). Our results indicate increased production of IL-1b in the infected SIRT1/3 knockdown RAW264.7 macrophages, SIRT1/3 inhibitor-treated peritoneal macrophages, and mice serum samples under SIRT1/3 inhibitor treatment in comparison to the vehicle control. Additionally, we have quantified IL-1b in mice ileal tissues under SIRT1/3 inhibitor treatment (Fig. S3G) and have observed heightened intestinal IL-1b production in the inhibitor-treated cohorts. We thank the reviewer for raising the concern about 200pg of IL-10 in the uninfected macrophages. We have repeated the experiment and have provided an alternative representative graph for the experiment wherein the IL-10 levels in the uninfected cohorts range between 20-40pg/ml (Fig. S3B).

      It is surprising that the authors have found increased Sirt1 binding to NFkB, however there is no change in acetylated NFkB upon infection (Fig 4B). Acetylated p65 is equally high in uninfected Scrambled siRNA, UI shSirt1, STM Scr, and STM shSirt1. Furthermore, increased binding of Sirt1 with NFkb would mean decreased acetylation hence decreased inflammation. However, Salmonella induces profound inflammation.

      We thank the reviewers for their insightful and critical questioning. We acknowledge that, due to oversaturation, there was no apparent change in the acetylated p65 among the different sample sets. Therefore, in the revised manuscript we have provided an image at lower exposure where the changes in the acetylation of the p65 subunit are apparent. Like other pathogens, Salmonella induces acute inflammatory responses upon challenge. This heightened acute inflammation at the initial phases of infection subsides at a later phase of infection. Here, we have examined the Sirt1 interaction with NFκB at 16hr post-infection, where increased binding of Sirt1 with NFκB facilitates the resolution of the Salmonella-induced acute inflammation. This is in line with previous reports that suggest SIRT1 suppresses acute inflammation through the promotion of p65 deacetylation and inhibition of NFκB activity (Yang H, Zhang W, Pan H, et al. SIRT1 activators suppress inflammatory responses through promotion of p65 deacetylation and inhibition of NF-κB activity. PLoS One. 2012;7(9):e46364. doi:10.1371/journal.pone.0046364; Liu TF, Yoza BK, El Gazzar M, Vachharajani VT, McCall CE. NAD+-dependent SIRT1 deacetylase participates in epigenetic reprogramming during endotoxin tolerance. J Biol Chem. 2011;286(11):9856–64; Liu TF, Vachharajani V, Millet P, Bharadwaj MS, Molina AJ, McCall CE. Sequential actions of SIRT1-RELB-SIRT3 coordinate nuclear-mitochondrial communication during immunometabolic adaptation to acute inflammation and sepsis. J Biol Chem. 2015;290(1):396–408).

      Please explain how the acetylated p65 was analysed.

      Total endogenous p65 subunit was immunoprecipitated using Anti-NFκB p65 antibody and the immunoprecipitated fraction was probed with Anti-Acetylated Lysine antibody to assess acetylated p65.

      An increase in ROS production is seen in a relatively small percentage of cells- not more than 4% of cells. How does this contribute to such a significant difference in intracellular bacterial burden? Also, it is not clear how the authors calculated the fold change in proliferation. It is better to show the actual bacterial burden logarithmically.

      We strongly agree with the reviewer’s concerns, and we have reanalyzed the flow cytometric data set. The revised data have been presented in Fig. S5 which shows a considerable increase in DCFDA positive population. For instance, the infected scrambled control shows around 2.44% of ROS-producing cells, however knockdown of SIRT1 and SIRT3 increases the ROS-producing cells to 27.34% and 28.64% respectively.

      Fold proliferation was calculated as CFU at 16hr divided by CFU at 2hr as mentioned in the materials and methods section under the heading of Intracellular proliferation or gentamicin protection assay. Fold proliferation has been calculated as opposed to absolute CFU values to nullify the differential phagocytosis of bacteria to the macrophages among the samples.

      Fold Proliferation= [CFU at 16h]/[CFU at 2h]
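      The ratio above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis script: the function name and the CFU counts are hypothetical and chosen only to show why normalizing to the 2h count offsets differences in initial bacterial uptake.

```python
def fold_proliferation(cfu_2h: float, cfu_16h: float) -> float:
    """Return CFU at 16 h divided by CFU at 2 h."""
    if cfu_2h <= 0:
        raise ValueError("CFU at 2 h must be positive")
    return cfu_16h / cfu_2h


# Hypothetical CFU counts as (2 h, 16 h); these are not data from the study.
samples = {
    "scrambled control": (2.0e4, 4.0e5),
    "knockdown": (5.0e4, 4.0e5),
}

for name, (cfu_2h, cfu_16h) in samples.items():
    print(f"{name}: {fold_proliferation(cfu_2h, cfu_16h):.1f}-fold")
```

      In this made-up example both samples reach the same absolute CFU at 16h, yet the sample that took up more bacteria at 2h has the lower fold proliferation, a difference that absolute counts alone would hide.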

      An increase in metabolic genes is not sufficient to show that the macrophages are metabolically reprogrammed.

      We thank the reviewer for the valuable comment. We agree that an increase in metabolic gene profile is not sufficient to claim metabolic reprogramming. Therefore, in addition to the metabolic gene profile, we have estimated lactate production (end-product of glycolysis) as an indicator of glycolysis (Fig. 5 C-E) and have performed the fatty acid β oxidation activity (Fig. 5G-H) to support our claims.

      Figure 5F the band intensities do not visually match the bands shown for PFK. For instance, shSIRT1 STM (1.00) and shSIRT3 STM (0.81).

      We are extremely sorry for the erroneous band intensity for shSIRT3. Upon reanalysis of the band intensities, we have corrected the band intensity for shSIRT3 to 2.28 (Fig.5F).

      It is surprising that HADHA is not expressed in uninfected samples.

      We are extremely apologetic for the inappropriate representative blot. We feel that the discrepancy might have arisen due to the usage of old antibodies. We have provided an alternate blot for the HADHA gene where fresh antibody staining solution was used for probing which shows expression even in the uninfected samples (Fig.5F).

      Figure 6A - What is the significance of PFA fixed samples (PI) compared to SI samples? This has not been discussed.

      PFA-fixed samples are paraformaldehyde-treated bacterial samples that harbor the immune signals or Pathogen-Associated Molecular Patterns (PAMPs). The rationale for using PI in addition to SI samples was to show whether the phenomenon is driven by live, metabolically active pathogens or is mediated by PAMPs.

      I understand that the hypothesis is that during the later phase of infection, there is an increase in fatty acid oxidation which correlates with a decrease in inflammation. However, at 6h there is no increase in genes regulating fatty acid oxidation. Why did the authors choose 6h when the previous experiments have been done at 16h?

      We indeed agree with the reviewer’s understanding of our hypothesis that there is an increase in fatty acid oxidation along the progression of infection which correlates with a decrease in inflammation. The Salmonella intracellular replication has been reported to commence at 6h post-internalization when SPI-2 effector expression is fully established (Helaine S, Thompson JA, Watson KG, Liu M, Boyle C, Holden DW. Dynamics of intracellular bacterial replication at the single cell level. Proc Natl Acad Sci U S A. 2010;107(8):3746-3751. doi:10.1073/pnas.1000041107). Therefore, we have assessed the 6h timepoint post-infection in addition to the initial and later timepoints of 2h and 16h respectively. Additionally, the nanostring gene profiling data of both host and bacterial genes indicate the onset of both metabolic (Fig. 5A, 6A) and immune genes (Fig. 3A) modulation at 6h post-infection. We have validated these results via qPCR studies and have observed an upregulation in the transcript level of fatty acid oxidation genes as depicted in Fig. S7A in RAW264.7 macrophages.

      Line 355 it is mentioned that Sirt1 and Sirt3 abrogate metabolic shift by reducing glycolytic flux. This is incorrect as experiments such as carbon chase assays have not been performed to investigate glycolytic flux.

      As per the reviewer’s valuable suggestion, we have removed the word ‘flux’ from the above-mentioned statement (Line 351, Line 353).

      Lines 392-393: "We immunoprecipitated PDHA1 and checked for its interaction with SIRT3 or SIRT1 under knockdown condition of SIRT3 or upon SIRT3 inhibitor treatment (Fig.7 G-H)"

      What is the rationale for checking PDHA1 interaction with Sirt under Sirt knockdown conditions?

      We are thankful to the reviewer for the critical comments. The rationale for checking PDHA1 interaction with Sirt was to ascertain that indeed Sirt interacted with PDHA1 under S. Typhimurium infection and abrogation of either protein expression (knockdown) or their enzymatic activity (inhibitor treatment) diminished the interaction.

      Moreover, the blots are very confusing and do not represent the authors' claims.

      (1) In the input blot I do not see Sirt3 depletion in shSirt3 knockdown sample.

      The knockdown has been quantified in the input blot as per your suggestion. A knockdown of 40% has been obtained in the uninfected dataset whereas a knockdown of 47.1% has been obtained in the infected data set at 16h post-infection (Fig.7H).

      (2) Why does Sirt1 interact with PDHA1 similar to Sirt3. Do both the proteins bind to PDHA1 at the same time/ competitively? If so do they both deacetylate?

      In literature, Sirt3 has been shown to interact with PDHA1 and deacetylate PDHA1. However, the interaction of Sirt1 with PDHA1 has not been reported previously and therefore we are unable to comment on the exact dynamics of the interaction. Future studies need to be performed to explore these phenomena in depth. However, SIRT1 agonist SRT1720 has been shown to impact PDH phosphorylation and its activity (Han Y, Sun W, Ren D, Zhang J, He Z, Fedorova J, Sun X, Han F, Li J. SIRT1 agonism modulates cardiac NLRP3 inflammasome through pyruvate dehydrogenase during ischemia and reperfusion. Redox Biol. 2020 Jul;34:101538).

      (3) Figure 7I in the IP: IgG samples Sirt3 seem to bind to IgG non-specifically, which questions the specificity of Sirt3 binding to PDHA1.

      We appreciate the reviewer for pointing out this concern. The immunoprecipitation experiment has been repeated and the same has been appended in the revised manuscript and we observe no non-specific binding of Sirt3 antibody to IgG.

      (4) In Figure 7I all the bands Ac PDHA1, PDHA1, and Sirt3 look similar with double bands, which has not been seen in other blots. How is this possible?

      This cannot explain the increase in beta-oxidation observed.

      We thank the reviewer for raising this concern. We have repeated the experiment and provided the alternative blot as per the reviewer’s suggestion.

      The rationale for performing this experiment was to show that SIRT plays an important role in the activation of downstream TCA cycle pathways via PDHA1 deacetylation during Salmonella infection. The deacetylation of PDHA1 has been previously reported to cause transcriptional activation of the downstream TCA cycle and oxidative phosphorylation (Zhang Y, Wen P, Luo J, et al., Cell Death Dis.,2021). Additionally, PDHA1 hyperacetylation has been reported to cause lactate overproduction (An, S., Yao, Y., Hu, H. et al. PDHA1 hyperacetylation-mediated lactate overproduction promotes sepsis-induced acute kidney injury via Fis1 lactylation. Cell Death Dis 14, 457 (2023)). In our study, increased lactate production and PDHA1 hyperacetylation have been observed during SIRT3 inhibition conditions upon Salmonella infection.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      In this study, the authors used a multi-alternative decision task and a multidimensional signal-detection model to gain further insight into the cause of perceptual impairments during the attentional blink. The model-based analyses of behavioural and EEG data show that such perceptual failures can be unpacked into distinct deficits in visual detection and discrimination, with visual detection being linked to the amplitude of late ERP components (N2P and P3) and discrimination being linked to the coherence of fronto-parietal brain activity.

      Strengths:

      The main strength of this paper lies in the fact that it presents a novel perspective on the cause of perceptual failures during the attentional blink. The multidimensional signal detection modelling approach is explained clearly, and the results of the study show that this approach offers a powerful method to unpack behavioural and EEG data into distinct processes of detection and discrimination.

      Thank you.

      Weaknesses:

      (1.1) While the model-based analyses are compelling, the paper also features some analyses that seem misguided, or, at least, insufficiently motivated and explained. Specifically, in the introduction, the authors raise the suggestion that the attentional blink could be due to a reduction in sensitivity or a response bias. The suggestion that a response bias could play a role seems misguided, as any response bias would be expected to be constant across lags, while the attentional blink effect is only observed at short lags. Thus, it is difficult to understand why the authors would think that a response bias could explain the attentional blink.

      In the revision, we seek to better motivate the bias component. A deficit in T2 identification accuracy could arise from either sensitivity or criterion effects at short lags. For example, in short T1-T2 lag trials participants may adopt a more conservative choice criterion for reporting the presence of T2 thereby yielding lower accuracies for short lags. Criterion effects need not be uniform across lags: A participant could infer the T1-T2 lag on each trial based on various factors, such as trial length, and systematically adjust their choice criterion across lags, prior to making a response.

      Below, we present a simple schematic for how a conservative choice criterion impacts accuracy. Consider a conventional attentional blink paradigm where the task is to detect and report T2's presence. For simplicity, we assume that prior probabilities for T2’s occurrence are equal, such that the number of “T2 present” and “T2 absent” trials are equal.

      We model this task with a one-dimensional signal detection theory (SDT) model (left panel). Here, ψ represents the decision variable and the red and gray Gaussians represent the conditional density of ψ for the T2 present (“signal”) and T2 absent (“noise”) conditions, respectively. We increase the criterion from its optimal value (here, midpoint of signal and noise means), to reflect increasingly conservative choices. As the criterion increases and deviates further from its optimal value – here, reflecting a conservative bias – accuracy drops systematically (right panel).
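      The relationship described above can be reproduced with a short numerical sketch. It assumes an equal-variance Gaussian SDT model with unit variance, noise centred at 0, signal centred at d', and equal numbers of T2-present and T2-absent trials; the value of d' and the criterion shifts are illustrative choices, not values from the study.

```python
import math


def phi(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def proportion_correct(d_prime: float, criterion: float) -> float:
    """Accuracy with equal numbers of T2-present and T2-absent trials.

    Noise ~ N(0, 1), signal ~ N(d_prime, 1); the observer reports
    "present" when the decision variable exceeds `criterion`.
    """
    hit_rate = 1.0 - phi(criterion - d_prime)
    correct_rejection_rate = phi(criterion)
    return 0.5 * (hit_rate + correct_rejection_rate)


d_prime = 2.0            # assumed sensitivity, for illustration only
optimal = d_prime / 2.0  # midpoint of the noise and signal means

# Shift the criterion upward (more conservative) with sensitivity held fixed.
for shift in (0.0, 0.5, 1.0, 1.5):
    c = optimal + shift
    print(f"criterion={c:.2f}  accuracy={proportion_correct(d_prime, c):.3f}")
```

      With sensitivity held constant, accuracy falls as the criterion moves above the midpoint, which is the pattern the schematic is meant to convey.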

      Author response image 1.

      We have revised the Introduction as follows:

      “Distinguishing between sensitivity and criterion effects is crucial because a change in either of these parameters can produce a change in the proportion of correct responses[41,42]. A lower proportion of correct T2 detections may reflect not only a lower detection d’ at short lags but also a sub-optimal choice criterion corresponding, for instance, to a conservative detection bias (Fig. 1, right, top). Importantly, such criterion effects need not be uniform across intertarget lags: the lag on each trial could be inferred based on various factors, such as trial length, allowing participants to adopt different choice criteria for the different lags prior to making a response.”

      (1.2) A second point of concern regards the way in which the measures for detection and discrimination accuracy were computed. If I understand the paper correctly, a correct detection was defined as either correctly identifying T2 (i.e., reporting CW or CCW if T2 was CW or CCW, respectively, see Figure 2B), or correctly reporting T2's absence (a correct rejection).

      Here, it seems that one should also count a misidentification (i.e., incorrect choice of CW or CCW when T2 was present) as a correct detection, because participants apparently did detect T2, but failed to judge/remember its orientation properly in case of a misidentification. Conversely, the manner in which discrimination performance is computed also raises questions. Here, the authors appear to compute accuracy as the average proportion of T2-present trials on which participants selected the correct response option for T2, thus including trials in which participants missed T2 entirely. Thus, a failure to detect T2 is now counted as a failure to discriminate T2. Wouldn't a more proper measure of discrimination accuracy be to compute the proportion of correct discriminations for trials in which participants detected T2?

      Indeed, detection and discrimination accuracies were computed with precisely the same procedure, and under the same conditions, as described by the Reviewer. We regret our poor description. For clarity, we have revised the following line in the Results section; we have also updated the Methods (section on Behavioral data analysis: Measuring attentional blink effects on psychometric quantities).

      “Detection accuracies were calculated based on the proportion of trials in which T2 was correctly detected (Methods). Briefly, we computed the average proportion of hits, misidentifications, and correct rejections; misidentifications were included because, although incorrectly identified, the target was nevertheless correctly detected. In contrast, discrimination accuracies were derived from T2 present trials, based on the proportion of correct identifications alone (Methods).”
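
      The two accuracy definitions quoted above can be sketched as follows. This is an illustrative example only, not the analysis code used in the study: the trial encoding (`"CW"`, `"CCW"`, `"absent"`) and the function names are assumptions, and the sketch pools over trials rather than averaging per-category proportions.

```python
def detection_accuracy(trials):
    """Proportion of trials on which T2's presence or absence was reported correctly.

    `trials` is a list of (stimulus, response) pairs, each value being
    "CW", "CCW", or "absent" (a hypothetical encoding).
    """
    correct = 0
    for stimulus, response in trials:
        if stimulus == "absent":
            correct += response == "absent"  # correct rejection
        else:
            correct += response != "absent"  # hit or misidentification
    return correct / len(trials)


def discrimination_accuracy(trials):
    """Proportion of T2-present trials with the correct orientation report."""
    present = [(s, r) for s, r in trials if s != "absent"]
    return sum(s == r for s, r in present) / len(present)


example_trials = [
    ("CW", "CW"),          # hit
    ("CW", "CCW"),         # misidentification: detected, wrong orientation
    ("CCW", "absent"),     # miss
    ("absent", "absent"),  # correct rejection
    ("absent", "CW"),      # false alarm
    ("CCW", "CCW"),        # hit
]
det = detection_accuracy(example_trials)
dis = discrimination_accuracy(example_trials)
```

      In this toy example the misidentification counts toward detection accuracy but against discrimination accuracy, and the miss counts against both.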

      (1.3) My last point of critique is that the paper offers little if any guidance on how the inferred distinction between detection and discrimination can be linked to existing theories of the attentional blink. The discussion mostly focuses on comparisons to previous EEG studies, but it would be interesting to know how the authors connect their findings to extant, mechanistic accounts of the attentional blink. A key question here is whether the finding of dissociable processes of detection and discrimination would also hold with more meaningful stimuli in an identification task (e.g., the canonical AB task of identifying two letters shown amongst digits).

      There is evidence to suggest that meaningful stimuli are categorized just as quickly as they are detected (Grill-Spector & Kanwisher, 2005; Grill-Spector K, Kanwisher N. Visual recognition: as soon as you know it is there, you know what it is. Psychol Sci. 2005 Feb;16(2):152-60. doi: 10.1111/j.0956-7976.2005.00796.x. PMID: 15686582.). Does that mean that the observed distinction between detection and discrimination would only apply to tasks in which the targets consist of otherwise meaningless visual elements, such as lines of different orientations?

      Our results are consistent with the previous literature suggested by the reviewer. Specifically, we model detection and discrimination not as sequential processes, but as concurrent computations (Figs. 3A-B). Yet, our results suggest that these processes possess distinct neural bases. We have further revised the Discussion in the context of this literature in the revised manuscript.

      “…Interestingly, we found no evidence indicating that these two computations (detection and discrimination) were sequential; in fact, the modulation of beta coherence occurred almost immediately after T2 onset, and lasted well afterwards (>400 ms from T2 onset) (Fig. 5A-B) suggesting that an analysis of T2’s features proceeded in parallel with its detection and consolidation. We also modeled detection and discrimination as concurrent computations in our SDT model (Fig. 3A-B). Previous work suggests that while object detection and categorization processes proceed in parallel, detection and identification processes occur sequentially[77]. Our results are in line with this literature, if we consider T2’s discrimination judgement – clockwise versus counterclockwise of vertical – to be a categorization, rather than an identification judgement. Moreover, this earlier study[75] observed significant trial-wise correlations between detection and categorization responses, suggesting that the two processes involve the operation of the same perceptual filters (“analyzers”). Our study, on the other hand, reports distinct neural bases for detection and discrimination computations. Yet, the two sets of findings are not mutually contradictory.

      In many conventional attentional blink tasks[3,20,25], complex visual stimuli, like letters, must be detected among a stream of background distractors with closely similar features, such as digits. In this case, target detection would require the operation of shape-selective perceptual filters for feature analysis. These same shape-selective filters would be involved also for discriminating between distinct, but related target stimuli (e.g., two designated candidate letters). In our task, target gratings needed to be distinguished in a stream of plainly distinct background distractors (plaids), whereas the discrimination judgement involved analysis of grating orientation. As a result, our task design likely precludes the need for the same perceptual filters in the detection and the discrimination judgements. Absent this common feature analysis, our results suggest distinct electrophysiological correlates for the detection and discrimination of targets.”

      Reviewer #2 (Public review):

      Summary:

      The authors had two aims: First, to decompose the attentional blink (AB) deficit into the two components of signal detection theory: sensitivity and bias. Second, the authors aimed to assess the two subcomponents of sensitivity: detection and discrimination. They observed that the AB is only expressed in sensitivity. Furthermore, detection and discrimination were doubly dissociated. Detection modulated N2p and P3 ERP amplitude, but not frontoparietal beta-band coherence, whereas this pattern was reversed for discrimination.

      Strengths:

      The experiment is elegantly designed, and the data - both behavioral and electrophysiological - are aptly analyzed. The outcomes, in particular the dissociation between detection and discrimination blinks, are consistently and clearly supported by the results. The discussion of the results is also appropriately balanced.

      Thank you.

      Weaknesses:

      (2.1) The lack of an effect of stimulus contrast does not seem very surprising from what we know of the nature of AB already. Low-level perceptual factors are not thought to cause AB. This is fine, as there are also other, novel findings reported, but perhaps the authors could bolster the importance of these (null) findings by referring to AB-specific papers, if there are indeed any, that would have predicted different outcomes in this regard.

      While there is consensus that low-level perceptual factors are not affected by the attentional blink, other studies have suggested evidence to the contrary (e.g., Chua et al, Percept. Psychophys., 2005)[1]. We have mentioned the significance of our findings in the context of such conflicting evidence in the literature, in the revised Discussion.

      “Surprisingly, we found no significant effect of contrast on either type of deficit (Figs. 2A-B). In other words, high (100%) contrast T2 stimuli were also strongly susceptible to the detection and discrimination bottlenecks associated with the attentional blink. Thus, despite a clear contrast-dependent encoding of T2 in early sensory cortex, the attentional blink produced a significant deficit with downstream processing, even for targets of high contrast. While at odds with some earlier work, which suggests an early-stage perceptual bottleneck [82–84], these results are largely consistent with findings from the majority of previous studies [3,7,9,11,19,20,82,85,86] which suggest a late-stage bottleneck.”

      (2.2) On an analytical note, the ERP analysis could be finetuned a little more. The task design does not allow measurement of the N2pc or N400 components, which are also relevant to the AB, but the N1 component could additionally be analyzed. In doing so, I would furthermore recommend selecting more lateral electrode sites for both the N1, as well as the P1. Both P1 and N1 are likely not maximal near the midline, where the authors currently focused their P1 analysis.

      We performed these suggested analyses. Whereas in the original submission we had used the O1, O2 and Oz electrodes, we now estimate the P1 and N1 with the more lateral P7 and P8 electrodes[2], as suggested by the reviewer.

      Even with these more lateral electrodes, we did not observe a significant N1 component in a 90-160 ms window[3] in the long lag trials (p=0.207, signed rank test for amplitude less than zero); a one-tailed Bayes factor (BF=1.35) revealed no clear evidence for or against an N1 component. Analysis of the P1 component with these more lateral electrodes also yielded no statistically significant blink-induced modulation (P1 (short lag - long lag) = 0.25 ± 0.16 µV, p=0.231, BF=0.651) (SI Figure S3, revised).

      These updated analyses are now reported in the revised Results (lines 317-319) and Methods (lines 854-855). In addition, we have revised SI Table S2 with the new P1 component analysis.

      (2.3) Impact & Context:

      The results of this study will likely influence how we think about selective attention in the context of the AB phenomenon. However, I think its impact could be further improved by extending its theoretical framing. In particular, there has been some recent work on the nature of the AB deficit, showing that it can be discrete (all-or-none) and gradual (Sy et al., 2021; Karabay et al., 2022, both in JEP: General). These different faces of target awareness in the AB may be linked directly to the detection and discrimination subcomponents that are analyzed in the present paper. I would encourage the authors to discuss this potential link and comment on the bearing of the present work on these behavioural findings.

      Thank you. We have now discussed our findings in the context of these recent studies in the revised manuscript.

      “…In line with this hypothesis, we discovered that the attentional blink induced dissociable detection and discrimination deficits. There was no statistically significant correlation between these two types of deficits within and across participants and evidence for such a correlation was weak, at best. Unlike previous target identification designs that conflated attentional blink’s effect on detection versus discrimination performance[3,4,9,25,37], our 3-AFC task, and associated signal detection model enabled quantifying each of these deficits separately and identifying a double dissociation between their respective neural correlates. Our dissociation of the attentional blink into distinct subcomponents is complementary to recent studies, which examined whether the attentional blink reflects an all-or-none phenomenon[73,74]. For example, the T2 deficit induced by the attentional blink can be either all-or-none or graded, depending on whether T1 and T2 judgements involve distinct or common features, respectively[73]. While a graded change in precision could reflect sensitivity effects, an all-or-none change in guess rates – without a concomitant change in precision – may reflect a criterion increase (conservative detection bias) effect. Future experiments that incorporate a three-alternative response, with concurrent detection and discrimination, along with key task elements of these earlier studies, may further help resolve these findings.”

      Reviewer #3 (Public review):

      Summary:

      In the present study, the authors aimed to achieve a better understanding of the mechanisms underlying the attentional blink, that is, a deficit in processing the second of two target stimuli when they appear in rapid succession. Specifically, they used a concurrent detection and identification task in- and outside of the attentional blink and decoupled effects of perceptual sensitivity and response bias using a novel signal detection model. They conclude that the attentional blink selectively impairs perceptual sensitivity but not response bias, and link established EEG markers of the attentional blink to deficits in stimulus detection (N2p, P3) and discrimination (fronto-parietal high-beta coherence), respectively. Taken together, their study suggests distinct mechanisms mediating detection and discrimination deficits in the attentional blink.

      Strengths:

      Major strengths of the present study include its innovative approach to investigating the mechanisms underlying the attentional blink, an elegant, carefully calibrated experimental paradigm, a novel signal detection model, and multifaceted data analyses using state-of-the-art model comparisons and robust statistical tests. The study appears to have been carefully conducted and the overall conclusions seem warranted given the results. In my opinion, the manuscript is a valuable contribution to the current literature on the attentional blink. Moreover, the novel paradigm and signal detection model are likely to stimulate future research.

      Thank you.

      Weaknesses:

      Weaknesses of the present manuscript mainly concern the negligence of some relevant literature, unclear hypotheses, potentially data-driven analyses, relatively low statistical power, potential flaws in the EEG methods, and the absence of a discussion of limitations. In the following, I will list some major and minor concerns in detail.

      (3.1) Hypotheses: I appreciate the multifaceted, in-depth analysis of the given dataset including its high amount of different statistical tests. However, neither the Introduction nor the Methods contain specific statistical hypotheses. Moreover, many of the tests (e.g., correlations) rely on selected results of previous tests. It is unclear how many of the tests were planned a priori, how many more were performed, and how exactly corrections for multiple tests were implemented. Thus, I find it difficult to assess the robustness of the results.

      We hypothesized that neural computations associated with target detection would be characterized by regional (local) neuronal markers (e.g., parietal or occipital ERPs), whereas computations linked to feature discrimination would involve neural coordination across multiple brain regions (e.g. fronto-parietal coherence) (lines 135-138). We planned and conducted our statistical tests based on this hypothesis. All multiple comparison corrections (Bonferroni-Holm correction, see Methods) were performed separately for each class of analyses.

      Based on this overarching hypothesis, the following tests were planned and conducted.

      ERP analysis: Based on an extensive review of recent literature[4] (Zivony et al., 2022), we performed the following tests: i) We tested whether four ERP component amplitudes (parietal P1, fronto-central P2, occipito-parietal N2p, and parietal P3) were significantly different between short and long lags with a Wilcoxon signed rank test followed by Bonferroni-Holm multiple comparison correction; ii) We correlated the ERPs whose amplitudes showed a significant difference in analysis (i) with detection and discrimination d’ deficits (six correlations) using robust (bend) correlations[5]; again, this was followed by a Bonferroni-Holm multiple comparison correction. Note that there is no circularity with planning analysis (ii) based on the results of analysis (i) because the latter is agnostic to detection versus discrimination blink deficits. In case (i), where no a priori hypothesis about directionality was available, all p-values were based on two-tailed tests, but for case (ii), where we had an a priori directional hypothesis, p-values were computed from one-tailed tests. This has now been clarified in the revised Methods (lines 937-940 and 950-952).

      Coherence analysis: Based on a seminal study of long-range synchrony modulation by the attentional blink[6], we examined fronto-parietal coherence in the beta (13-30 Hz) band, separately for the left and right hemispheres, and performed the following comparisons. i) We computed differences between the fronto-parietal coherogram (time-frequency representation of coherence, Fig. 5A-D) between short-lag and long-lag conditions, and performed a two-dimensional cluster-based permutation test[7]; this method inherently corrects for multiple comparisons across time-frequency windows. ii) Because the analysis in (i) revealed the clearest evidence for coherence differences in the canonical high-beta (20-30 Hz) band in the left fronto-parietal electrodes (Figs. 5C-D; 0-300 ms following target onset), we correlated power in this band with detection and discrimination d’ deficits; this was followed by a Bonferroni-Holm multiple comparison correction. As before, there is no circularity with planning analysis (ii) based on the results of analysis (i) because the latter is agnostic to detection versus discrimination blink deficits. Again, in case (i), where no a priori hypothesis about directionality was made, all p-values were based on two-tailed tests, but for case (ii), where we had an a priori directional hypothesis, p-values were computed from one-tailed tests.

      For completeness, we performed all of the other correlations, for example, correlations with coherence in the low-beta band or with the right fronto-parietal electrodes (SI Table 3). These latter analyses were not planned, nor did they yield significant results.

      Neural distance analysis: This was a novel analysis designed to test the hypothesis that detection and discrimination deficits would be correlated with neural distances along distinct dimensions. i) First, we compared neural distances across lag conditions at different timepoints following target onset with a one-dimensional cluster-based permutation test[7]; ii) Next, we correlated the neural distances along the detection and discrimination dimension with the detection and discrimination d’ deficits (Fig. 6E-F, 6G-H), as well as with the ERP and coherence markers (Fig. 7A-B, 7C-D). For each of these analyses, we employed robust (bend) correlations[5] followed by a Bonferroni-Holm multiple comparison correction. As before, p-values were computed using two-tailed tests for case (i) and one-tailed tests for case (ii), based on the absence or presence of an a priori directional hypothesis.
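
      For readers unfamiliar with the Bonferroni-Holm step-down procedure applied within each class of analyses above, a minimal sketch follows. It is a generic illustration of the procedure, not the code used in the study; the function name and the example p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm (step-down Bonferroni) adjusted p-values, in the input order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone, and each is capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical family of three tests.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

      A test is declared significant at level α when its adjusted p-value is at most α. Because the correction is applied per family, the number of tests m is the size of that class of analyses, not the total number of tests in the paper.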

      (3.2) Power: Some important null findings may result from the rather small sample sizes of N = 24 for behavioral and N = 18 for ERP analyses. For example, the correlation between detection and discrimination d' deficits across participants (r=0.39, p=0.059) (p. 12, l. 263) and the attentional blink effect on the P1 component (p=0.050, no test statistic) (p. 14, 301) could each have been significant with one more participant. In my opinion, such results should not be interpreted as evidence for the absence of effects.

      We have modified these claims in the revised Results. In addition, we now compute and report Bayes factors, which enable evaluating evidence for the presence versus absence of effects.

      “Detection and discrimination d’ deficits were not statistically significantly correlated (r=0.39, t=2.28, p=0.059); Bayes factor analysis revealed no clear evidence for or against a correlation between these subcomponent deficits (BF=1.18) (SI Fig. S2, left).”

      “Discrimination accuracy deficits were not statistically significantly different between high and low detection accuracy deficit blocks (z=1.97, p=0.067), and the Bayes factor revealed no strong evidence for or against such a difference (BF=1.42) (Fig. 3G).”

      In addition, the results are interpreted as follows (lines 294-296):

      “Moreover, detection and discrimination d’ deficits were not significantly correlated either within or across participants, with no clear evidence for or against a correlation, based on the Bayes factor.”

      The null result on the P1 has changed because of the analysis with the alternative electrode set suggested by Reviewer #2 (see comment #2.2). We now report these results as follows:

      “By contrast, the P1, an early sensory component, showed no statistically significant blink-induced modulation (P1 = 0.25 ± 0.16 µV, z = 1.19, p=0.231, BF = 0.651) (SI Fig. S3).”

      (3.3) Neural basis of the attentional blink: The introduction (e.g., p. 4, l. 56-76) and discussion (e.g., p. 19, 427-447) do not incorporate the insights from the highly relevant recent review by Zivony & Lamy (2022), which is only cited once (p. 19, l. 428). Moreover, the sections do not mention some relevant ERP studies of the attentional blink (e.g., Batterink et al., 2012; Craston et al., 2009; Dell'Acqua et al., 2015; Dellert et al., 2022; Eiserbeck et al., 2022; Meijs et al., 2018).

      We have now cited these previous studies at the appropriate places in the revised Introduction.

      “The effect of the attentional blink on the processing of the second target is well studied. In particular, previous studies have investigated the stage at which attentional blink affects T2’s processing (early or late) [14–17] and the neural basis of this effect, including the specific brain regions involved[15,18–20]. Several theoretical frameworks characterize a sequence of phases of the attentional blink, including target selection based on relevance, detection, feature processing, and encoding into working memory[9,21]. Overall, there is little support for attentional blink deficits at an early, sensory encoding[14] stage; by contrast, the vast majority of literature suggests that T2’s processing is affected at a late stage[8,10]. Consistent with these behavioral results, scalp electroencephalography (EEG) studies have reported partial or complete suppression of late event-related potential (ERP) components, particularly those linked to attentional engagement (P2, N2, N2pc or VAN)[15,22–25], working memory (P3) [20,26–30] or semantic processing (N400)[31]; early sensory components (P1/N1) are virtually unaffected[20,24] (reviewed in detail in Zivony and Lamy, 2022[32]).”

      (3.4) Detection versus discrimination: Concerning the neural basis of detection versus discrimination (e.g., p. 6, l. 98-110; p. 18, l. 399-412), relevant existing literature (e.g., Broadbent & Broadbent, 1987; Hillis & Brainard, 2007; Koivisto et al., 2017; Straube & Fahle, 2011; Wiens et al., 2023) is not included.

      Thank you for these suggestions. We have now cited these studies in the revised Discussion.

      “It is increasingly clear that detection and discrimination are separable processes, each mediated by distinct neural mechanisms. Behaviorally, accurately identifying the first target, versus merely detecting it, produces stronger deficits with identifying the second target[59]. Moreover, dissociable mechanisms have been reported to mediate object detection and discrimination in visual adaptation contexts[60]. Neurally, shape detection and identification judgements produce activations in non-overlapping clusters in various brain regions in the visual cortex, inferior parietal cortex, and the medial frontal lobe[61]. Similarly, occipital ERPs associated with conscious awareness also show clear differences between detection and discrimination. For instance, an early posterior negative component (200-300 ms) was significantly modulated in amplitude by success in detection, but not in identification[62]. The closely related visual awareness negativity (VAN) was substantially stronger at the detection, compared to the discrimination, threshold[63].

      Furthermore, a significant body of previous work has reported dissociable behavioural and neural mechanisms underlying attention’s effects on target detection versus discrimination. Behavioral studies have reported distinct effects on target detection versus discrimination in both endogenous[64] and exogenous[65] attention tasks.”

      (3.5) Pooling of lags and lag-1 sparing: I wonder why the authors chose to include 5 different lags when they later pooled early (100, 300 ms) and late (700, 900 ms) lags, and whether this pooling is justified. This is important because T2 at lag 1 (100 ms) is typically "spared" (high accuracy) while T2 at lag 3 (300 ms) shows the maximum AB (for reviews, see, e.g., Dux & Marois, 2009; Martens & Wyble, 2010). Interestingly, this sparing was not observed here (p. 43, Figure 2). Nevertheless, considering the literature and the research questions at hand, it is questionable whether lag 1 and 3 should be pooled.

      Lag-1 sparing is not always observed in attentional blink studies; there are notable exceptions to reports of lag-1 sparing[8,9]. Our statistical tests revealed no significant difference in accuracies between short lag (100 and 300 ms) trials or between long lag (700 and 900 ms) trials but did reveal significant differences between the short and long lag trials (ANOVA, followed by post-hoc tests). To simplify the presentation of the findings, we pooled together the short lag (100 and 300 ms) and, separately, the long lag (700 and 900 ms) trials. We have presented these analyses, and clarified the motivation for pooling these lags in the revised Methods.

      “Based on these psychometric measures, we computed detection and discrimination accuracies as follows. Detection accuracies were computed as the average proportion of the hits, misidentification and correct rejection responses; misidentifications were included because not missing the target reflected accurate detection. By contrast, discrimination accuracies were computed based on the average proportion of the two correct identifications (hits) on T2 present trials alone. We performed 2-way ANOVAs on both detection and discrimination accuracies with inter-target lag (5 values) and T2 contrast as independent factors. We found main effects of both lag (F(4,92)=18.81, p<0.001) and contrast (F(1,92)=21.78, p<0.001) on detection accuracy, but no interaction effect between lag and contrast (F(4,92)=1.92, p=0.113). Similarly, we found main effects of both lag (F(4,92)=25.08, p<0.001) and contrast (F(1,92)=16.58, p<0.001) on discrimination accuracy, but no interaction effect between lag and contrast (F(4,92)=0.93, p=0.450). Post-hoc tests based on Tukey’s HSD revealed a significant difference in accuracies between the two shortest lags (100 ms and 300 ms) and the two longest lags (700 and 900 ms) for both low and high contrast targets, and for both detection and discrimination accuracies (p<0.01). However, they revealed no significant difference between the two shortest lags (p>0.25) or the two longest lags (p>0.40) for either target contrast or for either accuracy type. As a result, for subsequent analyses, we pooled together the “short lag” (100 ms and 300 ms) and the “long lag” (700 ms and 900 ms) trials. We quantified the effect of the attentional blink on each of the psychometric measures as well as detection and discrimination accuracies by comparing their respective, average values between the short lag and long lag trials, separately for the high and low T2 contrasts.”

      (3.6) Discrimination in the attentional blink. Concerning the claims that previous attentional blink studies conflated detection and discrimination (p. 6, l. 111-114; p. 18, l. 416), there is a recent ERP study (Dellert et al., 2022) in which participants did not perform a discrimination task for the T2 stimuli. Moreover, since the relevance of all stimuli except T1 was uncertain in this study, irrelevant distractors could not be filtered out (cf. p. 19, l. 437). Under these conditions, the attentional blink was still associated with reduced negativities in the N2 range (cf. p. 19, l. 427-437) but not with a reduced P3 (cf. p. 19, l 439-447).

      We have addressed the relationship between our findings and those of Dellert et al (2022)[10] in the revised Discussion.

      “… In the present study, we observed that the parietal P3 amplitude was correlated selectively with detection, rather than discrimination deficits. This suggests that the P3 deficit indexes a specific bottleneck with encoding and consolidating T2 into working memory, rather than an inability to reliably maintain its features. In this regard, a recent study[22] measured ERP correlates of the perceptual awareness of the T2 stimulus whose relevance was uncertain at the time of its presentation. In contrast to earlier work, this study observed no change in P3b amplitude across seen (detected) and unseen targets. Taken together with this study, our findings suggest that rather than indexing visual awareness, the P3 may index detection, but only when information about the second target, or a decision about its appearance, needs to be maintained in working memory. Additional experiments, involving targets of uncertain relevance, along with our behavioral analysis framework, may help further evaluate this hypothesis.”

      (3.7) General EEG methods: While most of the description of the EEG preprocessing and analysis (p. 31/32) is appropriate, it also lacks some important information (see, e.g., Keil et al., 2014). For example, it does not include the length of the segments, the type and proportion of artifacts rejected, the number of trials used for averaging in each condition, specific hypotheses, and the test statistics (in addition to p-values).

      We regret the lack of details. We have included these in the revised Methods, and expanded on the description of the trial rejection (SCADS) algorithm.

      The revised Methods section on EEG Preprocessing mentions the type and proportion of artifacts rejected:

      “We then epoched the data into trials and applied SCADS (Statistical Control of Artifacts in Dense Array EEG/MEG Studies[90]) to identify bad epochs and artifact contaminated channels. SCADS detects artifacts based on three measures: maximum amplitude over time, standard deviation over time, and first derivative (gradient) over time. Any electrode or trial exhibiting values outside the specified boundaries for these measures was excluded. The boundaries were defined as M ± n*λ, where M is the grand median across electrodes and trials for each of the three measures, and λ is the root mean square (RMS) of the deviation of medians across sensors relative to the grand median. We set n to 3, allowing data within three boundaries to be retained. The percentage of electrodes per participant rejected was 6.3 ± 0.43% (mean ± s.e.m. across participants), whereas the percentage of trials rejected per electrode and participant was 3.4 ± 0.33% (mean ± s.e.m.).”
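
      The boundary rule M ± n·λ quoted above can be sketched as follows. This is only an illustration of that one rule under our reading of the quoted description, not the SCADS implementation itself (which applies several editing passes); the function names, the nested-list layout `measure[electrode][trial]`, and the example values are assumptions.

```python
from math import sqrt
from statistics import median


def scads_bounds(measure, n=3.0):
    """Lower and upper bounds M - n*lam and M + n*lam for one artifact measure.

    measure[e][t] holds the measure (e.g. maximum amplitude over time)
    for electrode e on trial t.
    """
    grand_median = median(v for row in measure for v in row)
    sensor_medians = [median(row) for row in measure]
    # RMS deviation of the per-sensor medians from the grand median.
    lam = sqrt(
        sum((m - grand_median) ** 2 for m in sensor_medians) / len(sensor_medians)
    )
    return grand_median - n * lam, grand_median + n * lam


def flag_outliers(measure, n=3.0):
    """True wherever an electrode/trial value falls outside the bounds."""
    low, high = scads_bounds(measure, n)
    return [[not (low <= v <= high) for v in row] for row in measure]


# Hypothetical values: 3 electrodes x 4 trials, with one obvious artifact.
example = [
    [1.0, 1.2, 0.9, 9.0],
    [1.1, 1.0, 1.3, 1.2],
    [0.8, 1.0, 1.1, 0.9],
]
flags = flag_outliers(example)
```

      In practice the same rule would be evaluated separately for each of the three measures (maximum amplitude, standard deviation, and gradient), and an electrode or trial would be excluded if it exceeded the bounds on any of them.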

      The revised Methods section on ERP analysis mentions the number of trials for averaging in each condition and the length of the segments:

      “First, trials were sorted based on inter-target lags (100, 300, 500, 700 and 900 ms). This yielded an average of (200 ± 13, 171 ± 9.71, 145 ± 7.54, 117 ± 5.43, 87 ± 4.51) (mean ± s.e.m. across participants) trials for each of the 5 lags, respectively.”

      “Then, EEG traces were epoched from 300 ms before to 700 ms after either T1 onset or T2 onset and averaged across trials to estimate T1-evoked and T2-evoked ERPs, respectively.”
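
      The epoching and averaging step described above can be sketched for a single channel as follows. This is an illustrative example, not the study's pipeline: the function names, the sampling rate, and the toy signal are assumptions, and baseline correction and artifact rejection are omitted.

```python
def epoch_window(onset_sample, sampling_rate_hz, pre_ms=300, post_ms=700):
    """Start and stop sample indices for an epoch around a target onset."""
    pre = round(pre_ms * sampling_rate_hz / 1000)
    post = round(post_ms * sampling_rate_hz / 1000)
    return onset_sample - pre, onset_sample + post


def average_epochs(signal, onsets, sampling_rate_hz):
    """Average equal-length epochs of one channel across trials (a simple ERP)."""
    epochs = []
    for onset in onsets:
        start, stop = epoch_window(onset, sampling_rate_hz)
        # Skip epochs that would run past either end of the recording.
        if start >= 0 and stop <= len(signal):
            epochs.append(signal[start:stop])
    n = len(epochs)
    return [sum(samples) / n for samples in zip(*epochs)]


# Toy single-channel recording at a hypothetical 1000 Hz sampling rate.
toy_signal = [float(i) for i in range(3000)]
erp = average_epochs(toy_signal, onsets=[100, 500, 1500], sampling_rate_hz=1000)
```

      The onset at sample 100 is dropped because its 300 ms pre-stimulus window would start before the recording begins, so the average is taken over the remaining two epochs.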

      Specific hypotheses are mentioned in response #3.1; we also now mention the test statistic associated with each test at the appropriate places in the Results. For example:

      “Among these ERP components, the N2p component and the P2 component were both significantly suppressed during the blink (∆amplitude, short-lag – long-lag: N2p=-0.47 ± 0.12 µV, z=-3.20, p=0.003, BF=40, P2=-0.19 ± 0.07 µV, z=-2.54, p=0.021, BF=4.83, signed rank test) (Fig. 4A, right). Similarly, the parietal P3 also showed a significant blink-induced suppression (P3= -0.45 ± 0.09 µV, z=-3.59, p < 0.001, BF>10²) (Fig. 4B, right).”

      “Neural inter-class distances (||η||) along both the detection and discrimination dimensions decreased significantly during the blink (short lag-long lag: ∆||ηdet|| = -1.30 ± 0.70, z=-3.68, p=0.006, BF=20; ∆||ηdis|| = -1.23 ± 0.42, z=-3.54, p<0.001, BF>10²) (Figs. 6C-D).”

      (3.8) EEG filters: P. 31, l. 728: "The data were (...) bandpass filtered between 0.5 to 18 Hz (...). Next, a bandstop filter from 9-11 Hz was applied to remove the 10 Hz oscillations evoked by the RSVP presentation." These filter settings do not follow common recommendations and could potentially induce filter distortions (e.g., Luck, 2014; Zhang et al., 2024). For example, the 0.5 high-pass filter could distort the slow P3 wave. Mostly, I am concerned about the bandstop filter. Since the authors commendably corrected for RSVP-evoked responses by subtracting T2-absent from T2-present ERPs (p. 31, l. 746), I wonder why the additional filter was necessary, and whether it might have removed relevant peaks in the ERPs of interest.

      Thank you for this suggestion. Originally, the 9-11 Hz bandstop filter was added to remove the strong 10 Hz evoked oscillation from the EEG response and thereby obtain a cleaner signal for the other analyses, such as the analysis of neural dimensions (Fig. 6).

      We performed two control ERP analyses to address the reviewer’s concern:

      (1) We removed the bandstop filter and re-evaluated the P1, P2, N2pc and P3 ERP amplitudes. We observed no statistically significant difference in the modulation of any of the 4 ERP components (P1: p=0.031, BF=0.692, P2: p=0.038, BF=1.21, N2pc: p=0.286, BF=0.269, P3: p=0.085, BF=0.277). In particular, Bayes Factor analysis revealed substantial evidence against a difference in the N2pc and P3 amplitudes before versus after the bandstop filter removal (BF<0.3).

      (2) We removed the bandstop filter and repeated all of the same analyses as reported in the Results and summarized in SI Table S2. We observed a virtually identical pattern of results, summarized in an analogous table, below (compare with SI Table S2, revised, in the Supplementary Information).

      Author response table 2.

      We have now mentioned this control analysis briefly in the Methods (lines 863-865).

      (3.9) Coherence analysis: P. 33, l. 786: "For subsequent, partial correlation analyses of coherence with behavioral metrics and neural distances (...), we focused on a 300 ms time period (0-300 ms following T2 onset) and high-beta frequency band (20-30 Hz) identified by the cluster-based permutation test (Fig. 5A-C)." I wonder whether there were any a priori criteria for the definition and selection of such successive analyses. Given the many factors (frequency bands, hemispheres) in the analyses and the particular shape of the cluster (p. 49, Fig 5C), this focus seems largely data-driven. It remains unclear how many such tests were performed and whether the results (e.g., the resulting weak correlation of r = 0.22 in one frequency band and one hemisphere in one part of a complexly shaped cluster; p. 15, l. 327) can be considered robust.

      Please see responses to comments #3.1 and #3.2 (above). In addition to reporting further details regarding statistical tests, their hypotheses, and multiple comparisons corrections, we computed Bayes factors to quantify the strength of the evidence for correlations, as appropriate. Interpretations have been rephrased depending on whether the evidence for the null or alternative hypothesis is strong or equivocal. For example:

      “Bayes factor analysis revealed no clear evidence for or against a correlation between these subcomponent deficits (BF=1.18) (SI Fig. S2, left).”

      “Discrimination accuracy deficits were not statistically significantly different between high and low detection accuracy deficit blocks (z=1.97, p=0.067), and the Bayes factor revealed no strong evidence for or against such a difference (BF=1.42) (Fig. 3G).”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1.a) Line 76-79: "Despite this extensive literature, previous studies have essentially treated the attentional blink as a unitary, monolithic phenomenon. As a result, fundamental questions regarding the component mechanisms of the attentional blink remain unanswered." This statement seems antithetical to the fact that theories of the AB suggest a variety of different mechanisms as possible causes of the effect.

      The statement has been revised as follows:

      “Despite this extensive literature, many previous studies have studied the attentional blink as a unitary phenomenon. While some theoretical models[9,21,32] and experimental studies[38,39] have explored distinct mechanisms underlying the attentional blink, several fundamental questions about its distinct component mechanisms remain unanswered.”

      (1.b) Line 95-97: Here, the authors should explain in more detail how a response bias could fluctuate across lags.

      Addressed in response to public reviews, #1.1.

      (1.c) Line 98: I found this second question a much more compelling motivation for the study than the earlier stated question of whether the AB reflects a reduction in sensitivity or a fluctuation (?) of response bias.

      Thank you.

      (1.d) Line 143: What do the authors mean by "geometric" distribution of lags? In virtually all AB studies, the distribution of lags is uniform. Wasn't that the case in this study?

      We employed a geometric distribution for the trials of different lags, and verified that the sampled distribution of lags was well fit by this distribution (χ<sup>2</sup>(3, 312)=0.22, p=0.974). We chose a geometric distribution – with a flat hazard function[11] – over the uniform distribution to avoid conflating the effects of temporal expectation with those of the attentional blink on criterion[12] at different lags.
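
      The flat-hazard property mentioned above can be illustrated with a short sketch. The hazard at lag k is the probability that T2 appears at lag k given that it has not yet appeared, h(k) = P(X = k) / P(X ≥ k). The parameter value p = 0.3 and the helper names below are assumptions chosen for illustration; they are not the values used in the study.

```python
def geometric_pmf(p, n_lags):
    """P(X = k) = p * (1 - p)**(k - 1) for k = 1..n_lags (untruncated tail)."""
    return [p * (1 - p) ** (k - 1) for k in range(1, n_lags + 1)]


def uniform_pmf(n_lags):
    """Equal probability for each of n_lags lags."""
    return [1.0 / n_lags] * n_lags


def hazard(pmf):
    """Discrete hazard h(k) = P(X = k) / P(X >= k) for a pmf given as a list."""
    hazards = []
    remaining = 1.0  # P(X >= k); equals 1 at the first lag
    for prob in pmf:
        hazards.append(prob / remaining)
        remaining -= prob
    return hazards


# Hypothetical parameter, for illustration only.
geometric_hazard = hazard(geometric_pmf(0.3, 5))  # constant, equal to p
uniform_hazard = hazard(uniform_pmf(5))           # rises toward 1 at the last lag
```

      Under the geometric distribution the hazard stays at p for every lag, so the passage of time carries no information about when T2 will appear. Under the uniform distribution the hazard climbs from 1/5 at the first lag to 1 at the last, which is the temporal expectation confound the response refers to.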

      (1.e) Line 158-160: Explain why incorrect discrimination responses were not counted as correct detection. Explain why failure to detect T2 was counted as a discrimination error.

      Addressed in response to public reviews, #1.2.

      (1.f) Line 167: The results do not show lag-1 sparing, which is a typical property of the AB.

      The authors should report this, and explain why their paradigm did not show a sparing effect.

      Addressed in response to public reviews, #3.5.

      (1.g) Line 262-263: With only 24 participants, the study appears to be underpowered to reliably detect correlations. This should be noted as a limitation.

      Addressed in response to public reviews, #3.2.

      (1.h) Line 399-412: This section could be moved to the introduction to explain and motivate the aim of examining the distinct contributions of detection and discrimination to the AB.

      We have revised the Introduction to better motivate the aims of the study.

      Reviewer #2 (Recommendations for the authors):

      (2.a) A small note about the writing: as a matter of style, I would advise editing the generic phrasing (e.g., "shedding new light", "complex interplay") in abstract and general discussion.

      These are now revised as follows (for example):

      Line 26 - “These findings provide detailed insights into the subcomponents of the attentional blink….”

      Line 596 - “More broadly, these findings contribute to our understanding of the relationship between attention and perception….”

      (2.b) Some references appear double and/or without volume or page numbers (e.g., 44/61).

      Thank you. Amended now.

      Reviewer #3 (Recommendations for the authors):

      (3.a) Suggestions for additional analyses:

      I appreciate that the authors have quantified the evidence for null effects in simple comparisons using Bayes factors. In my opinion, the study would additionally benefit from Bayesian ANOVAs, which can also easily be implemented in JASP (Keysers et al., 2020), which the authors have already used for the other tests. As a result, they could further substantiate some of their claims related to null effects (e.g., p. 9, l. 175; p. 12, l. 246).

      Thank you. We have added Bayes factor values for ANOVAs (implemented in JASP[13]) wherever applicable in the revised manuscript. For example:

      “While we found a main effect of both lag (detection: F(1,23)=29.8, p<0.001, BF>10<sup>3</sup>, discrimination: F(1,23)=54.1, p<0.001, BF>10<sup>3</sup>) and contrast (detection: F(1,23)=21.02, p<0.001, BF>10<sup>2</sup>, discrimination: F(1,23)=13.75, p=0.001, BF=1.22), we found no significant interaction effect between lag and contrast (detection: F(1,23)=1.92, p=0.113, BF=0.49, discrimination: F(1,23)=0.93, p=0.450, BF=0.4).”

      “A two-way ANOVA with inter-target lag and T2 contrast as independent factors revealed a main effect of lag on both d’<sub>det</sub> (F(1,23)=30.3, p<0.001, BF>10<sup>3</sup>) and d’<sub>dis</sub> (F(1,23)=100.3, p<0.001, BF>10<sup>3</sup>). Yet, we found no significant interaction effect between lag and contrast for d’<sub>det</sub> (F(1,23)=2.3, p=0.141, BF=0.44).”

      Minor points

      (3.b) Statistics: Many p-values are reported without the respective test statistics (e.g., p. 9, l. 164; p. 12, l. 241-244 and 252-258; p. 13, l. 271, etc.).

      Addressed in response to public reviews, #3.7.

      (3.c) P. 4, l. 58: It is not entirely clear how the authors define "early or late". For example, while they consider the P2/N2/N2pc complex as "late" (l. 62-64), these ERP components are considered "early" in the debate on "early vs. late" neural correlates of consciousness (for a review, see Förster et al., 2020).

      We appreciate the debate. Our naming convention follows these seminal works[3,14–16].

      (3.d) P. 5., l. 77: "previous studies have essentially treated the attentional blinks as a unitary, monolithic phenomenon": There are previous studies in which both the presence and identity of T2 were queried (e.g., Eiserbeck et al., 2022; Harris et al., 2013).

      Addressed in response to recommendations for authors, #1.a.

      (3.e) P. 9, l. 169-177: The detection and discrimination accuracies are analyzed using two-way ANOVAs with the factors lags and contrast. I wonder why the lag effects are additionally analyzed using Wilcoxon signed rank tests using data pooled across the T2 contrasts (p. 9, l. 161-168)? If I understand it correctly, these tests should correspond to the main effects of lag in the ANOVAs. Indeed, both analyses lead to the same conclusions (l. 167 and l. 176).

      Our motivation was to first establish the attentional blink effect, with data pooled across contrasts. The subsequent ANOVA allowed delving deeper into contrast and interaction effects. Indeed, the results were consistent across both tests.

      (3.f) P. 12, l. 242: I wonder why the T2 contrasts are pooled in the statistical tests (but plotted separately, p. 45, Figure 3C).

      Model selection analysis indicated distinct d’<sub>det</sub> parameter values across contrasts, as reflected in Fig. 3C. As mentioned in response #3.e, contrast effects were analyzed with an ANOVA.

      (3.g) P. 13, l. 287: "high and low contrast T2 trials were pooled to estimate reliable ERPs". The amount of trials per condition is not provided.

      Addressed in response to public reviews, #3.7.

      (3.h) P. 45, Figure 3D/F: In my opinion, plotting the contrasts and lags separately (despite the results of the model selection) would have provided a better idea of the data.

      We appreciate the reviewer’s suggestion, but followed the results of model selection for consistency.

      (3.i) P. 21, l. 470: "the left index finger to report clockwise orientations and the right index finger to report counter-clockwise orientations": This left/right mapping seems counterintuitive to me, and the authors also used the opposite mapping in Figures 1 and 2. It is not described in the Methods (p. 25) and thus is unclear.

      We regret the typo. Revised as follows:

      “...the left index finger to report counter-clockwise orientations and the right index finger to report clockwise orientations.”

      (3.j) P. 22, l. 514: "Taken together, these results suggest the following, testable schema (SI Figure S5)." Figure S5 seems to be missing.

      Amended. This is Fig. 8 in the revised manuscript.

      (3.k) P. 25, l. 559: I do not understand why the circular placeholders around the stimuli were included, and they are not mentioned in Figure 2A (p. 43). When I saw the figure and read the inscription, I wondered whether they were actually part of the stimulus presentation or symbolized something else.

      The placeholder was described in the earlier Methods section. We have now also mentioned it in the caption for Fig. 2A.

      “All plaids were encircled by a circular placeholder. The fixation dot and the placeholder were present on the screen throughout the trial.”

      This avoided spatial uncertainty in estimating stimulus dimensions during the presentation.

      (3.l) P. 32, l. 754: The interval of interest for the P1 from 40 to 140 ms seems unusually early to me. The component usually peaks at 100 ms (e.g., at 96 ms in the cited study by Sergent et al., 2005), which also seems to be the case in the present study (Fig. S3, p. 57). I wonder how they were defined.

      For our analyses, we employed the peak value of the P1 ERP component in a window from 40-140 ms. The peak occurred around 100 ms (SI Fig. S3), which aligns with the literature.
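
      A windowed peak measure of this kind can be sketched as below. The function name `peak_in_window` and the list-based inputs are hypothetical; the sketch only illustrates how a 40-140 ms search window can contain a peak near 100 ms, and it is not the authors' implementation.

```python
def peak_in_window(erp, times_ms, t_start=40, t_end=140):
    """Return (peak_amplitude, peak_latency_ms) of the largest positive
    deflection between t_start and t_end (inclusive).

    erp      : list of ERP amplitudes, one per time point
    times_ms : list of time stamps in ms relative to stimulus onset
    """
    window = [
        (amplitude, t)
        for amplitude, t in zip(erp, times_ms)
        if t_start <= t <= t_end
    ]
    if not window:
        raise ValueError("no samples fall inside the search window")
    # Tuples compare by amplitude first, so max() picks the highest sample.
    return max(window)
```

      The window bounds only limit where the peak is searched for; the reported latency is that of the maximum itself, so a wide window does not imply an early component.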

      Additional minor comments:

      These comments have been all addressed, and typos corrected, by revising the manuscript at the appropriate places.

      3.m.1. L. 14: In my opinion, this sentence is difficult to read due to the nested combination of singular and plural forms. Importantly, as the authors also acknowledge (e.g., l. 83), perceptual sensitivity and choice bias could both be compromised, so I would suggest using plural and adding "or both" as a third option for clarity. See also p. 10, l. 204.

      3.m.2. L. 14: The comma before "As a result" should be replaced by a period.

      3.m.3. L. 45 "to guide Behavior" should be lowercase.

      3.m.4. L. 67: "Activity in the parietal, lateral prefrontal cortex and anterior cingulate cortex" could be read as if there was a "parietal, prefrontal cortex", so I would suggest removing the first "cortex".

      Revised/amended.

      3.m.5. L. 77: "fundamental questions regarding the component mechanisms of the attentional blink remain unanswered": The term "component mechanisms" is a bit unclear to me.

      We elaborate on this term in the very next set of paragraphs in the Introduction.

      3.m.6. L. 88: "a lower proportion of correct T2 detections can arise from a lower detection d'". "Arise from" sounds a bit off given that d' is a function of hits and false alarms.

      3.m.7. L. 95: I would suggest citing the updated edition of the classic "Detection Theory: A User's Guide" by Hautus, Macmillan & Creelman (2021).

      3.m.8. L. 102: "a oriented grating" should be "an".

      3.m.9. L. 126: "key neural markers - a local neural marker (event-related potentials) potentials" should be rephrased/corrected.

      3.m.10. L. 129: There are inconsistent tenses (mostly past tense but "we synthesize").

      3.m.11. L. 138: Perhaps the abbreviations (e.g., dva, cpd) should be introduced here (first mention) rather than in the Methods below.

      3.m.12. L. 148: "at the end of each trial participants first, indicated": The comma position should be changed.

      3.m.13. L. 176 "attentional blink-induced both a ...": The hyphen should be removed.

      3.m.14. L. 396: I think "but neither of them affects" would be better here.

      3.m.15. L. 383: "Detection deficits were signaled by ERP components such as the occipitoparietal N2p and the parietal P3": In my opinion, "such as" is too vague here.

      Revised/amended.

      3.m.16. L. 403: "Neurally, improved detection of attended targets is accompanied by (...) higher ERP amplitudes". Given the different mechanisms underlying the ERP, this section would benefit from more details.

      Addressed in response to public reviews, #3.4.

      3.m.17. L. 924: References 18 and 46 seem to be the same.

      3.m.18. L. 1181: I think d'det should be d'dis here.

      3.m.19. L. 1284: "détection" should be "detection".

      3.m.20. I found some Figure legends a bit confusing. For example, 5E refers to 4E, but 4E refers to 4C.

      3.m.21. In Figures 4A/B and 6C/D, some conditions are hidden due to the overlap of CIs. Could they be made more transparent?

      Revised/amended.

      References:

      (1) Chua, F. K. The effect of target contrast on the attentional blink. Percept Psychophys 67, 770–788 (2005).

      (2) Chmielewski, W. X., Mückschel, M., Dippel, G. & Beste, C. Concurrent information affects response inhibition processes via the modulation of theta oscillations in cognitive control networks. Brain Struct Funct 221, 3949–3961 (2016).

      (3) Sergent, C., Baillet, S. & Dehaene, S. Timing of the brain events underlying access to consciousness during the attentional blink. Nat Neurosci 8, 1391–400 (2005).

      (4) Zivony, A. & Lamy, D. What processes are disrupted during the attentional blink? An integrative review of event-related potential research. Psychon Bull Rev 29, 394–414 (2022).

      (5) Pernet, C. R., Wilcox, R. & Rousselet, G. A. Robust Correlation Analyses: False Positive and Power Validation Using a New Open Source Matlab Toolbox. Front Psychol 3, (2013).

      (6) Gross, J. et al. Modulation of long-range neural synchrony reflects temporal limitations of visual attention in humans. Proceedings of the National Academy of Sciences 101, 13050–13055 (2004).

      (7) Maris, E. & Oostenveld, R. Nonparametric statistical testing of EEG and MEG data. J Neurosci Methods 164, 177–190 (2007).

      (8) Hommel, B. & Akyürek, E. G. Lag-1 sparing in the attentional blink: Benefits and costs of integrating two events into a single episode. The Quarterly Journal of Experimental Psychology Section A 58, 1415–1433 (2005).

      (9) Livesey, E. J. & Harris, I. M. Target sparing effects in the attentional blink depend on type of stimulus. Atten Percept Psychophys 73, 2104–2123 (2011).

      (10) Dellert, T. et al. Neural correlates of consciousness in an attentional blink paradigm with uncertain target relevance. Neuroimage 264, 119679 (2022).

      (11) Nobre, A., Correa, A. & Coull, J. The hazards of time. Curr Opin Neurobiol 17, 465–470 (2007).

      (12) Bang, J. W. & Rahnev, D. Stimulus expectation alters decision criterion but not sensory signal in perceptual decision making. Sci Rep 7, 17072 (2017).

      (13) JASP Team. JASP (version 0.19.0.) [Computer Software]. Preprint at (2022).

      (14) Luck, S. J. Electrophysiological Correlates of the Focusing of Attention within Complex Visual Scenes: N2pc and Related ERP Components. (Oxford University Press, 2011). doi:10.1093/oxfordhb/9780195374148.013.0161.

      (15) Brydges, C. R., Fox, A. M., Reid, C. L. & Anderson, M. Predictive validity of the N2 and P3 ERP components to executive functioning in children: a latent-variable analysis. Front Hum Neurosci 8, (2014).

      (16) Michalewski, H. J., Prasher, D. K. & Starr, A. Latency variability and temporal interrelationships of the auditory event-related potentials (N1, P2, N2, and P3) in normal subjects. Electroencephalography and Clinical Neurophysiology/Evoked Potentials Section 65, 59–71 (1986).

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The expression and localization of Foxc2 strongly suggest that its role is mainly confined to As undifferentiated spermatogonia (uSPGs). Lineage tracing demonstrated that all germ cells were derived from the FOXC2+ uSPGs. Specific ablation of the FOXC2+ uSPGs led to the depletion of all uSPG populations. Full spermatogenesis can be achieved through the transplantation of Foxc2+ uSPGs. Male germ cell-specific ablation of Foxc2 caused Sertoli-only testes in mice. CUT&Tag sequencing revealed that FOXC2 regulates the factors that inhibit the mitotic cell cycle, consistent with its potential role in maintaining a quiescent state in As spermatogonia. These data made the authors conclude that the FOXC2+ uSPG may be the true SSCs, essential for maintaining spermatogenesis. The conclusion is largely supported by the data presented, but two concerns should be addressed: 1) terminology used is confusing: primitive SSCs, primitive uSPGs, transit amplifying SSCs... 2) the GFP+ cells used for germ cell transplantation should be better controlled using THY1+ cells.

      Thanks for your good comments. According to your suggestions, we have addressed your two concerns as follows:

      1> Overall, our work suggests that FOXC2+ SSCs are a subpopulation of SSCs in a quiescent state; we have therefore replaced the term ‘primitive’ with ‘quiescent’ in the revised manuscript. In general, ‘transient amplifying SSCs’ are considered to be ‘progenitors’, so we have replaced ‘transient amplifying SSCs’ with ‘progenitors’ in the revised manuscript.

      2> The transplantation experiment was conducted using MACS-sorted THY1+, FACS-sorted THY1+, and FACS-sorted GFP+ (FOXC2+) uSPGs simultaneously. To be consistent with the single-cell RNA-seq, which used MACS-sorted THY1+ uSPGs, we presented only the results from MACS-sorted THY1+ and FACS-sorted GFP+ (FOXC2+) uSPGs in the previous manuscript. Following the reviewer’s suggestion, we have included the results derived from FACS-sorted THY1+ uSPGs as the control. The overall conclusion is still fully supported by the more comprehensive dataset, i.e., FOXC2+ cells generated significantly higher numbers of colonies than THY1+ cells after transplantation (Figure 2D, E).

      Reviewer #2 (Public Review):

      The authors found FOXC2 is mainly expressed in As of mouse undifferentiated spermatogonia (uSPG). About 60% of As uSPG were FOXC2+ MKI67-, indicating that FOXC2 uSPG were quiescent. Similar spermatogonia (ZBTB16+ FOXC2+ MKI67-) were also found in human testis.

      The lineage tracing experiment using Foxc2iCreERT2/+;Rosa26LSL-T/G/LSL-T/G mice demonstrated that all germ cells were derived from the FOXC2+ uSPG. Furthermore, specific ablation of the FOXC2+ uSPGs using Foxc2iCreERT2/+;Rosa26LSL-DTA/+ mice resulted in the depletion of all uSPG population. In the regenerative condition created by busulfan injection, all FOXC2+ uSPG survived and began to proliferate at around 30 days after busulfan injection. The survived FOXC2+ uSPGs generated all germ cells eventually. To examine the role of FOXC2 in the adult testis, spermatogenesis of Foxc2f/-;Ddx4Cre/+ mice was analyzed. From a 2-month-old, the degenerative seminiferous tubules were increased and became Sertoli cell-only seminiferous tubules, indicating FOXC2 is required to maintain normal spermatogenesis in adult testes. To get insight into the role of FOXC2 in the uSPG, CUT&Tag sequencing was performed in sorted FOXC2+ uSPG from Foxc2iCreERT2/+;Rosa26LSL-T/G/LSL-T/G mice 3 days after TAM diet feeding. The results showed some unique biological processes, including negative regulation of the mitotic cell cycle, were enriched, suggesting the FOXC2 maintains a quiescent state in spermatogonia.

      Lineage tracing experiments using transgenic mice of the TAM-inducing system was well-designed and demonstrated interesting results. Based on all data presented, the authors concluded that the FOXC2+ uSPG are primitive SSCs, an indispensable subpopulation to maintain adult spermatogenesis.

      The conclusion of the mouse study is mostly supported by the data presented, but to accept some of the authors' claims needs additional information and explanation. Several terminologies define cell populations used in the paper may mislead readers.

      1) "primitive spermatogonial stem cell (SSC)" is confusing. SSCs are considered the most immature subpopulation of uSPG. Thus, primitive uSPGs are likely SSCs. The naming, primitive SSCs, and transit-amplifying SSCs (Figure 7K) are weird. In general, the transit-amplifying cell is progenitor, not stem cell. In human and even mouse, there are several models for the classification of uSPG and SSCs, such as reserved stem cells and active stem cells. The area is highly controversial. The authors' definition of stem cells and progenitor cells should be clarified rigorously and should compare to existing models.

      Thanks for your good comments. Considering that our results showed that FOXC2+ SSCs are in a quiescent state and that, mechanistically, FOXC2 maintains the quiescent state of SSCs by promoting the expression of negative regulators of the cell cycle, we have replaced ‘primitive SSCs’ with ‘quiescent SSCs’ in the revised manuscript. We agree with the reviewer that ‘transient amplifying SSCs’ are considered to be ‘progenitors’, so we have replaced ‘transient amplifying SSCs’ with ‘progenitors’ in the revised manuscript. Further, from our point of view, the FOXC2+Ki67+ SSCs could be regarded as active stem cells and the FOXC2+Ki67- SSCs as reserved stem cells, although further evidence is needed to confirm this.

      2) scRNA seq data analysis and an image of FOXC2+ ZBTB16+ MKI67- cells by fluorescent immunohistochemistry are not sufficient to conclude that they are human primitive SSCs as described in the Abstract. The identity of human SSCs is controversial. Although Adark spermatogonia are a candidate population of human SSCs, the molecular profile of the Adark spermatogonia seems to be heterogeneous. None of the molecular profiles was defined by a specific cell cycle phase. Thus, more rigorous analysis is required to demonstrate the identity of FOXC2+ ZBTB16+ MKI67- cells and Adark spermatogonia.

      We agree with the reviewer that the identity of human SSCs remains elusive, even though the Adark population demonstrates certain characteristics of SSCs. To acknowledge this, we have revised our conclusion to suggest only that FOXC2+ZBTB16+MKI67- cells represent a quiescent state of human SSCs.

      3) FACS-sorted GFP+ cells and MACS-THY1 cells were used for functional transplantation assay to evaluate SSC activity. In general, the purity of MACS is significantly lower than that of FACS. Therefore, FACS-sorted THY1 cells must be used for the comparative analysis. As uSPGs in adult testes express THY1, the percentage of GFP+ cells in THY1+ cells determined by flow cytometry is important information to support the transplantation data.

      Thanks for your good comments. According to your suggestions, we have addressed your concerns as follows:

      1> The transplantation experiment was conducted using MACS-sorted THY1+, FACS-sorted THY1+, and FACS-sorted GFP+ (FOXC2+) uSPGs simultaneously. To be consistent with the single-cell RNA-seq, which used MACS-sorted THY1+ uSPGs, we presented only the results from MACS-sorted THY1+ and FACS-sorted GFP+ (FOXC2+) uSPGs in the previous manuscript. Following the reviewer’s suggestion, we have included the results derived from FACS-sorted THY1+ uSPGs as the control. The overall conclusion is still fully supported by the more comprehensive dataset, i.e., FOXC2+ cells generated significantly higher numbers of colonies than THY1+ cells after transplantation (Figure 2D, E).

      2> We performed FACS analysis to determine the proportion of GFP+ cells in FACS-sorted THY1+ cells from Rosa26LSL-T/G/LSL-T/G or Foxc2iCreERT2/+;Rosa26LSL-T/G/LSL-T/G mice at day 3 post TAM induction. The result showed that GFP+ cells account for approximately 20.9±0.21% of THY1+ cells (see Author response image 1).

      Author response image 1.

      4) The lineage tracing experiments of FOXC2+-SSCs in Foxc2iCreERT2/+;Rosa26LSL-T/G/LSL-T/G showed ~95% of spermatogenic cells and 100% progeny were derived from the FOXC2+ (GFP+) spermatogonia (Figure 2I, J) at month 4 post-TAM induction, although FOXC2+ uSPG were quiescent and a very small subpopulation (~ 60% of As, ~0.03% in all cells). This means that 40% of As spermatogonia and most of Apr/Aal spermatogonia, which were FOXC2 negative, did not contribute to spermatogenesis at all eventually. This is a striking result. There is a possibility that FOXC2CRE expresses more widely in the uSPG population although immunohistochemistry could not detect them.

      Thanks for your good comments. In our lineage tracing results, over 95% of the spermatogenic cells in the testes of 4-month-old mice are derived from FOXC2+ SSCs, which means that FOXC2+ SSCs maintain long-term, stable spermatogenesis. In addition, previous studies have shown that only a portion of As spermatogonia are SSCs with complete self-renewal ability (PMID: 28087628, PMID: 25133429), which is consistent with our findings. Therefore, we speculate that the 40% of As spermatogonia and most of the Apr/Aal spermatogonia that were FOXC2 negative did contribute to spermatogenesis but cannot maintain it in the long term because of their limited self-renewal ability.

      5) The CUT&Tag_FOXC2 analysis on the FACS-sorted FOXC2+ showed functional enrichment in biological processes such as DNA repair and mitotic cell cycle regulation (Figure 7D). The cells sorted were induced Cre recombinase expression by TAM diet and cut the tdTomato cassette out. DNA repair process and negative regulation of the mitotic cell cycle could be induced by the Cre/lox recombination process. The cells analyzed were not FOXC2+ uSPG in a normal physiological state.

      We appreciate the reviewer’s concern that the functions enriched in this analysis might result from Cre/lox recombination. However, we think it is unlikely that the Cre/lox recombination process, which is expected to be local and specific, can trigger such a systemic and robust response in the DNA damage and cell cycle regulatory pathways. The reasons are as follows. First, as far as we are aware, there are insufficient data to support the suggested scenario. Second, we did not observe any alteration in either SSC behavior or spermatogenesis in general upon the TAM-induced genomic changes, suggesting that the impact of Cre/lox recombination on DNA damage or the cell cycle was not significant. Third, no factors associated with the DNA repair process were revealed in the differential analysis of single-cell transcriptomes of FOXC2-WT and FOXC2-KO.

      6) Wei et al (Stem Cells Dev 27, 624-636) have published that FOXC2 is expressed predominately in As and Apr spermatogonia and requires self-renewal of mouse SSCs; however, the authors did not mention this study in Introduction, but referred shortly this at the end of Discussion. Their finding should be referred to and evaluated in advance in the Introduction.

      Thanks for your good comments. Following your suggestion, we have revised the Introduction to refer to this recent parallel work on FOXC2. We are happy to see that our discoveries converge on the important role of FOXC2 in regulating SSCs in adult mammals.

      Reviewer #3 (Public Review):

      By popular single-cell RNA-seq, the authors identified FOXC2 as an undifferentiated spermatogonia-specific expressed gene. The FOXC2+-SSCs can sufficiently initiate and sustain spermatogenesis, the ablation of this subgroup results in the depletion of the uSPG pool. The authors provide further evidence to show that this gene is essential for SSCs maintenance by negatively regulating the cell cycle in adult mice, thus well-established FOXC2 as a key regulator of SSCs quiescent state.

      The experiments are well-designed and conducted, the overall conclusions are convincing. This work will be of interest to stem cell and reproductive biologists.

      Thanks for the positive feedback.  

      Reviewer #1 (Recommendations for the Authors):

      The authors should address the following concerns:

      1) The most primitive uSPGs should be the true SSCs. The term "primitive SSCs" is very confusing.

      2) In addition to FACS-sorted GFP+ cells, FACS-sorted THY1+ cells should also be used for transplantation.

      Thanks for your good comments. According to your suggestions, we have addressed your two concerns as follows:

      1) Overall, our work suggests that FOXC2+ SSCs are a subpopulation of SSCs in a quiescent state; we have therefore replaced the term ‘primitive’ with ‘quiescent’ in the revised manuscript.

      2) The transplantation experiment was conducted using MACS-sorted THY1+, FACS-sorted THY1+, and FACS-sorted GFP+ (FOXC2+) uSPGs simultaneously. To be consistent with the single-cell RNA-seq, which used MACS-sorted THY1+ uSPGs, we presented only the results from MACS-sorted THY1+ and FACS-sorted GFP+ (FOXC2+) uSPGs in the previous manuscript. Following the reviewer’s suggestion, we have included the results derived from FACS-sorted THY1+ uSPGs as the control. The overall conclusion is still fully supported by the more comprehensive dataset, i.e., FOXC2+ cells generated significantly higher numbers of colonies than THY1+ cells after transplantation (Figure 2D, E).

      Reviewer #3 (Recommendations for the Authors):

      The experiments are well-designed and conducted, the overall conclusions are convincing. The only concerns are the writing, especially the introduction which was not well-rationalized. Sounds the three subtypes and three models for SSCs' self-renew are irrelevant to the major points of this manuscript. I don't think you need to talk too much about the markers of SSCs. Instead, I suggest you provide more background about the quiescent or activation states of the SSCs. In addition to that, as a nuclear-localized protein, it cannot be used to flow cytometric sorting, I don't think it should be emphasized as a marker. You identified a key transcription factor for maintaining the quiescent state of the primitive SSCs, that's quite important!

      We appreciate the positive feedback and the constructive suggestions on the writing. We have substantially revised our manuscript to include relevant advances and understanding from the field and to highlight the importance of FOXC2 in regulating the quiescent state of SSCs.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      (1) One issue that needs to be considered is the nomenclature of the enhancer. The authors have presented data to show this enhancer controls the expression of Ctnnb1 in the stomach, intestine, and colon tissues. However, the name proposed by the authors, ieCtnnb1 (intestinal enhancer of Ctnnb1), doesn't represent its functions. It might be more appropriate to call it a different name, such as gieCtnnb1 (gastrointestinal enhancer of Ctnnb1).

      We thank the reviewer for the insightful suggestion and agree that whole-mount reporter assays indicated that ieCtnnb1 and ieCTNNB1 indeed display activity in the stomach. However, in the current study, we focused on their cellular distribution and function in intestinal epithelia. After careful consideration, we reasoned that the current designation, ieCtnnb1, more appropriately represents its expression pattern and functions based on the evidence provided. We hope the reviewer can understand our reasoning.

      (2) The writing of this manuscript can be improved in a few places. 

      a) The definitions or full names for the abbreviations of some terms, e.g., Ctnnb1, ieCtnnb1, in both abstract and main text, are needed when they first appear. Specifically, Line 108 should be moved to Lines 26 and 95. Lines 125-126 are redundant. ieCtnnb1 in Line 130 needs to be defined.

      We appreciate the suggestion. In the revision, we have included the definition of Ctnnb1 and the full name of ieCtnnb1 when they first appear in the abstract and the main text. Lines 125-126 were deleted in the revision.

      b) Line 192-194, the description of the result needs to be rewritten to reflect the higher expression of LacZ transcript in eGFP+ cells.

      We would like to emphasize that the key point of this part is that the enhancer activity of ieCtnnb1 is present in both Lgr5-eGFP+ and Lgr5-eGFP- cells. This was validated by single-cell sequencing, which revealed the presence of LacZ transcripts in the Paneth cells. Moreover, we could not confidently conclude that eGFP+ cells have higher expression levels of LacZ, as these measurements were obtained from separate, semi-quantitative RT-qPCR experiments.

      c)  More details are needed for how the data using human tumor samples were generated and how they were analyzed. 

      We thank the reviewer for the suggestion. In the revision, we have provided additional details regarding the data and subsequent analyses of human CRC samples as follows: “We previously conducted paired analyses of chromatin immunoprecipitation sequencing (ChIP-seq) for H3K27ac and H3K4me3, alongside RNA-seq on 68 CRC samples and their adjacent normal (native) tissue (Li et al., 2021). In the current study, we performed analyses for the enrichment of H3K27ac and H3K4me3 at ieCTNNB1 and CTNNB1 promoter regions, as well as the expression levels of CTNNB1, followed by combined analyses (Figure 5A, Figure 5 - figure supplement 1).”

      d) The genomic structures from multiple species are presented at the bottom of Figure 1a. However, the description and explanation are lacking in both the main text and the figure legend.

      We apologize for the unclear presentation. We have added a related description to the legend of Figure 1A: “The sequence conservation of the indicated species is shown at the bottom as vertical lines”. We also added an explanation in lines 162-163 of the main text: “Notably, unlike neCtnnb1, the primary sequence of ieCtnnb1 is not conserved among vertebrates (Figure 1A, bottom)”.

      Reviewer #2:

      (1) One of the main issues emerging during reading concerns the interpretation of the consequence of deleting the ieCtnnb1 enhancer. The authors write on line 235 that the deletion of ieCtnnb1 "undermined" Wnt signaling in the intestinal epithelium. This feels too strong, as the status of the pathway is only mildly affected, testified by the observation that mice with homozygous deletion on ieCtnnb1 are alive and well. The enhancer likely "only" drives higher Ctnnb1 expression, and it does not affect Wnt signaling by other mechanisms. The reduction of Wnt target gene expression upon its deletion is easily interpreted as the consequence of reduced β-catenin. Also the title, in my opinion, allows this ambiguity to stick in readers' minds. In other words, the authors present no evidence that the ieCtnnb1 enhancer controls Wnt signaling dosage via any mechanism other than its upregulation of Ctnnb1 expression in the intestinal epithelium. Reduced Ctnnb1, in turn, could explain the observed reduction of Wnt signaling output and the interesting downstream physiological consequences. Unless the authors think otherwise, I suggest they clarify this throughout the text, including necessary modifications to the title.

      We greatly appreciate the reviewer’s important comments and suggestion. We agree that ieCtnnb1’s direct effect on canonical Wnt signaling is to regulate the transcription of Ctnnb1 in the intestinal epithelia. Therefore, knockout of ieCtnnb1 leads to compromised expression of Ctnnb1 and, consequently, reduced Wnt signaling. The term “undermined” is indeed too strong and has been revised to “compromised” in the revision (line 237). Similar revisions have been made throughout the manuscript. In particular, the title was changed to “A Ctnnb1 enhancer transcriptionally regulates Wnt signaling dosage to balance homeostasis and tumorigenesis of intestinal epithelia”. However, as we state in the following point, decreased levels of β-catenin upon ieCtnnb1 loss could lead to indirect effects, including the reduced expression of Bambi, which might cause a more significant decrease of nuclear β-catenin.

      (2) It is unclear how the reduction of Ctnnb1 mRNA caused by deletion of ieCtnnb1 in mice could lead to a preferential decrease of nuclear more than membranous β-catenin (Fig. 1K and L). This might reflect a general cell autonomous reduction in Wnt signaling activation; yet, it is not clear how this could occur. Do the authors have any explanations for this?

      This is a very important question. We observed that in ieCtnnb1 knockout epithelia, the expression of Bambi (BMP and activin membrane-bound inhibitor) was significantly downregulated. Since BAMBI has been reported to stabilize β-catenin and facilitate its nuclear translocation, it is likely that the reduced level of BAMBI resulting from the loss of ieCtnnb1 further decreased nuclear β-catenin. In the revision, the expression change of Bambi has been added in Figure 1M. Moreover, the related content was extensively discussed with proper citations: “We noticed that after knocking out ieCtnnb1, the level of β-catenin in the nuclei of small intestinal crypt cells of Ctnnb1Δi.enh mice decreased more significantly compared to that in the cytoplasm (49.5% vs. 29.8%). Although the loss of ieCtnnb1 should not directly lead to reduced nuclear translocation of β-catenin, RNA-seq results showed that the loss of ieCtnnb1 causes a reduction in the expression of Bambi (BMP and activin membrane-bound inhibitor), a target gene in the canonical Wnt signaling pathway (Figure 1M). BAMBI promotes the binding of Frizzled to Dishevelled, thereby stabilizing β-catenin and facilitating its nuclear translocation (Lin et al., 2008; Liu et al., 2014; Mai et al., 2014; Zhang et al., 2015). Thus, it is likely that the decreased level of BAMBI resulting from the loss of ieCtnnb1 further reduced nuclear β-catenin”.

      (3) In Figure 1 K-L the authors show β-catenin protein level. Why not show its mRNA?

      The mRNA levels of Ctnnb1 in small and large intestinal crypts were shown in Figure 1I and 1J, demonstrating reduced expression of Ctnnb1 upon ieCtnnb1 knockout. We hope the reviewer understands that it is unnecessary to measure the nuclear and cytosolic levels of Ctnnb1 transcripts, as the total mRNA level generally reflects the protein level. 

      (4) Concerning the GSEA of Figure 1 that includes the Wnt pathway components: a) it would be interesting to see which components and to what extent is their expression affected; b) why should the expression of Wnt components that are not Wnt target genes be affected in the first place? It is odd to see this described uncritically and used to support the idea of downregulated Wnt signaling.

      We appreciate the suggestion and apologize for any lack of clarity. The affected components of the Wnt signaling pathway and the extent of their changes are summarized in Figure 1 – figure supplement 3. Additionally, we have provided explanations for their downregulation. For instance, the reduced expression of Wnt3 and Wnt2b ligands in ieCtnnb1-KO crypts may be attributed to the decreased numbers of Paneth cells.  

      (5) In lines 251-252 the authors refer to "certain technical issues" in the isolation of cell type from the intestinal epithelium. Why this part should be obscure in the characterization of a tissue for which there are several established protocols of isolation and analysis is not clear. I would rather describe what these issues have been and how they might have affected the data presented.

      We thank the reviewer for pointing this out. The single-cell preparation and sequencing of small intestinal cryptal epithelial cells were carried out largely according to reported protocols with slight modification. The enrichment of live crypt epithelial cells (EpCAM+DAPI-) by flow cytometry and cell filtering after single-cell sequencing were appropriate (Figure 2 – figure supplement 1A-1C). We would like to emphasize a few points: 1) Unlike other protocols, we did not exclude immune cells, erythrocytes, or endothelial cells using negative sorting antibodies. 2) When defining cell populations, we focused exclusively on epithelial cell types and did not consider other cell types, such as immune cells. As a result, the so-called “undefined” cells include a mixture of non-epithelial cells. Indeed, markers for erythrocytes (AY036118/Erf1, PMID:12894589) and immune cells (Gm42418 and Lars2, PMID:30940803, PMID:35659337) were the top three enriched genes in the “undefined” cluster (Figure 2 – figure supplement 1D). 3) Nonetheless, the overall findings remain robust, as key observations such as the loss of Paneth cells and reduced cell proliferation were validated through histological studies. This information has been incorporated into the revised manuscript with related references cited (lines 254-259).

      (6) It is interesting that human SNPs exist that seem to fall within the ieCTNNB1 enhancer and affect the gastrointestinal expression of CTNNB1. Could the author report or investigate whether this SNP is present in human populations that have been considered in large-scale studies for colorectal cancer susceptibility? It seems to me a rather obvious next step of extreme importance to be ignored.

      (7) From Figure 5A a reader could conclude that colorectal tumor cells have a higher expression of CTNNB1 mRNA than in normal epithelium. This is the first time I have seen this observation which somewhat undermines our general understanding of Wnt-induced carcinogenesis exclusively initiated by APC mutations whereby it is β-catenin's protein level, not expression of its mRNA, of crucial importance. I find this to be potentially the most interesting observation of the current study, which could be linked to the activity of the enhancer discovered, and I suggest the authors elaborate more on this and perhaps consider it for future experimental follow-ups.

      We appreciate the comments and suggestions.  We therefore added related content in the revision (lines 470-475): “Importantly, ieCTNNB1 displayed higher enhancer activity in most CRC samples collected in the study. Moreover, the SNP rs15981379 (C>T) within ieCTNNB1 is associated with the expression of CTNNB1 in the GI tract. Future population studies could investigate how the enhancer activity of ieCTNNB1 and this particular SNP are associated with CRC susceptibility and prognosis”.

      (8) I am surprised that the authors, who seem to have dedicated lots of resources to this study, are satisfied by analyzing their ChIP experiments with qPCR rather than sequencing (Figure 6). ChIP-seq would produce a more reliable profile of the HNF4a and CREB1 binding sites on these loci and in other control regions, lending credibility to the whole experiment and binding site identification. Sequencing would also take care of the two following conceptual problems in primer design. 

      First: while the strategy to divide enhancer and promoter in 6 regions to improve the resolution of their finding is commendable, I wonder how the difference in signal reflects primers' efficiency rather than HNF4/CREB1 exact positioning. The possibility of distinguishing between regions 2 and 3, for example, in a ChIP-qPCR experiment, also depends on the average DNA fragment length after sonication, a parameter that is not specified here. 

      Second: what are the primers designed to detect the ieCtnnb1 enhancer amplifying in the yellow-columns samples of Figure 6G? In this sample, the enhancer is deleted, and no amplification should be possible, yet it seems that a value is obtained and set to 1 as a reference value.

      This is indeed a crucial point, and we fully agree with the reviewer that “ChIP-seq would produce a more reliable profile of the HNF4a and CREB1 binding sites on these loci and in other control regions”. However, we believe that our current ChIP-qPCR experiments have adequately addressed the potential concerns raised by the reviewer. (1) We have ensured that the DNA fragment length after sonication falls within the range of 200 bp to 500 bp, with an average length of approximately 300 bp (Author response image 1A). We have stated this point in the revised methods section (line 633). (2) We have randomly inspected 14 out of 26 primer sets used in Figure 6 and its supplemental figure (Author response image 1B-E), confirming that all inspected primer sets show comparable amplification efficiency (ranging from 90% to 110%). This information has also been included in the revised methods section (line 650). (3) Figures 6G and 6H show reduced enrichment of HNF4α (6G) and p-S133-CREB1 (6H) at the Ctnnb1 promoter in ieCtnnb1 knockout ApcMin/+ tumor tissues. The ChIP-qPCR primers used were positioned at the Ctnnb1 promoter, not at ieCtnnb1, with IgG control enrichment serving as the reference values on the Y-axes.

      Author response image 1.

      (A) Agarose gel electrophoresis of sonicated DNA. (B-E) Tests of amplification efficiency for primer sets used in ChIP-qPCR.
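
      For readers who wish to reproduce the two quantities discussed in this response, the following is a minimal sketch and not the authors' actual analysis pipeline. It assumes the textbook standard-curve relation, efficiency = 10^(-1/slope) - 1, where the slope comes from regressing Ct on log10 of template input, and the common 2^(Ct_IgG - Ct_IP) estimate of enrichment over the IgG control, which itself assumes roughly 100% primer efficiency. The function names and input values are hypothetical.

```python
def amplification_efficiency(log10_inputs, ct_values):
    """Return primer efficiency (%) from a dilution-series standard curve.

    Fits Ct = slope * log10(input) + intercept by least squares, then
    applies the standard relation E = 10^(-1/slope) - 1.
    """
    n = len(log10_inputs)
    mean_x = sum(log10_inputs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_inputs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_inputs, ct_values))
    slope = sxy / sxx
    return (10 ** (-1.0 / slope) - 1.0) * 100.0


def fold_enrichment_over_igg(ct_ip, ct_igg):
    """Return ChIP enrichment relative to IgG, assuming ~100% efficiency."""
    return 2.0 ** (ct_igg - ct_ip)


if __name__ == "__main__":
    # Hypothetical 10-fold dilution series (illustrative values only).
    dilutions = [0.0, -1.0, -2.0, -3.0]
    cts = [20.0, 23.4, 26.7, 30.1]
    eff = amplification_efficiency(dilutions, cts)
    print(f"efficiency: {eff:.1f}% (acceptable range cited above: 90-110%)")
    print(f"fold over IgG: {fold_enrichment_over_igg(25.0, 28.0):.1f}")
```

      Under this convention the IgG sample is by definition 1 on the Y-axis, which is why a reference value of 1 does not imply amplification of a deleted region.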

      (9) The ChIP-qPCR showing preferential binding of pS133-CREB1 in small intestinal crypts and CHT15 cells (line 393) should be shown. 

      The ChIP-qPCR results demonstrating preferential binding of p-S133-CREB1 over CREB1 have been added in revised Figure 6C, 6D and Figure 6 – Supplement 1C.

      (10) It is not entirely clear what the blue tracks represent at the bottom of Figures 6C-D and Figure 6 - Figure Supplement 1C-D. The ChIP-seq profiles of both CREB1 and HNF4a shown in Figures 6A and Figure 6 - Figure Supplement 1A do not seem to match. Taking HNF4a, for example from Figure 6 - Figure Supplement 1A it seems to bind on the Ctnnb1 promoter, while in Figure 6 - Figure Supplement 1D the peaks are within the first intron. I realize this might all be a problem with a different scale across figure panels, but I suggest producing a cleared figure.

      We apologize for the confusion. We have revised Figure 6C-6D, Figure 6 - figure supplement 1C-D, and the corresponding legends to enhance clarity. (1) The top panels of Figures 6C and 6D respectively highlight shaded regions of ieCTNNB1 (pink) and the CTNNB1 promoter (grey) in Figure 6A, emphasizing the enrichment of p-S133-CREB1. (2) The top panels of Figure 6 – figure supplement 1C and 1D respectively highlight shaded regions of ieCtnnb1 (pink) and the Ctnnb1 promoter (grey) in Figure 6 – figure supplement 1A, emphasizing the enrichment of HNF4α. (3) Because Figures 6C-6D and Figure 6 - figure supplement 1C-1D respectively correspond to human and mouse genomes, the positions of peaks and scales differ.

      (11) In the intro the authors refer to "TCF-4". I suggest they use the more recent unambiguous nomenclature for this family of transcription factors and call it TCF7L2.

      TCF-4 has been changed into TCF7L2 in the revision (line 81)

      (12) In lines 121-122, the authors write "Although numerous putative enhancers...only a fraction of them were functionally annotated". To what study/studies are the authors referring? Please provide references.

      References were added in the revision (line 124)

      (13) In some parts the authors use strong words that should in my opinion be attenuated. Examples are: (i) at line 224, "maintains" would be better substituted with "contribute", as in the absence of ieCtnnb1, Ctnnb1 is still abundantly expressed; (ii) at line 266 "compromised" when the proliferative capacity of CFCs and TACs seems to be only mildly reduced; (iii) at line 286 "disrupts", the genes are simply downregulated.

      We thank the reviewer for these great suggestions. 1) On lines 224-225, the sentence was revised to: “These data suggest that ieCtnnb1 plays a specific role in regulating the transcription of Ctnnb1 in intestinal epithelia”. 2) On line 271, “compromised” was replaced with “mildly reduced”. 3) In ieCtnnb1 knockout epithelial cells of the small intestine, genes related to secretory functions were decreased, while genes related to absorptive functions were increased. Therefore, the term 'disrupts' is more appropriate than 'downregulates'.

      Reviewer #3:

      Line 81, c-Myc should be human MYC (italics) to agree with the other human gene names in this sentence. 

      c-Myc has been changed into MYC in the revision (line 82)

      Line 215, wildtype should be wild-type. 

      “wildtype” has been changed into “wild-type” in the revision (line 215)

      Line 224, Elimination of the enhancer did not abolish expression of Ctnnb1; therefore, it would be better to say that it "helps to maintain Ctnnb1 transcription" 

      The sentence was changed into “These data suggest that ieCtnnb1 plays a specific role in regulating the transcription of Ctnnb1 in intestinal epithelia” in revision (lines 224-225)

      Line 228, perhaps "to activate transcription" is meant. 

      “active” has been changed into “activate” in the revision (line 228)

      Line 235, consider "reduced" instead of "undermined". 

      “undermined” has been replaced with “compromised” in the revision (line 237)

      Line 262, "em" dashes should be at both ends of this insertion. 

      Line 298, "dysfunctional" would be better.

      Line 356, "samples were". 

      Line 481, 12-hr (add hyphen). 

      All of the above points have been addressed according to the reviewer’s suggestions.

      Line 712, Is "poly-N" meant? 

      “Poly-N” indicates undetected bases during sequencing. This explanation was added in the revision (lines 759-760).

      Figure 1K, the GAPDH signal is not visible and that panel is unnecessary as there is an H3 control.   

      Figure 1K and 1L respectively show levels of nuclear and cytoplasmic β-catenin. GAPDH and H3 were used as internal references for the cytoplasmic and nuclear fractions, respectively, confirming both robust fractionation and equal loading.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #3 (Public Review):

      The iron manipulation experiments are in the whole animal and it is likely that this affects general feeding behaviour, which is known to affect NB exit from quiescence and proliferative capacity. The loss of ferritin in the gut and iron chelators enhancing the NB phenotype are used as evidence that glia provide iron to NB to support their number and proliferation. Since the loss of NB is a phenotype that could result from many possible underlying causes (including low nutrition), this specific conclusion is one of many possibilities.

      We investigated the feeding behavior of flies using Brilliant Blue (Sigma, 861146)[1]. Our results showed that the amount of dye in the fly body was similar between the control group and the BPS group, suggesting that BPS had little effect on feeding behavior (Figure 3—figure supplement 1A).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      There was a gap between the Pros nuclear localization and downstream targets of ferritin, particularly NADH dehydrogenase and biosynthesis. Could overexpression of Ndi1 restore Pros localization in NBs?

      A ferritin defect lowers the iron level, which leads to cell cycle arrest of NBs via ATP shortage, and cell cycle arrest of NBs probably results in NB differentiation[2, 3]. We have added the requested experiment in Figure 5—figure supplement 2. The result showed that overexpression of Ndi1 could significantly restore Pros localization in NBs.

      The abstract requires revision to cover the major findings of the manuscript, particularly the second half.

      We revised the abstract to add more major findings of the manuscript in the second half as follows:

      “Abstract

      Stem cell niche is critical for regulating the behavior of stem cells. Drosophila neural stem cells (Neuroblasts, NBs) are encased by glial niche cells closely, but it still remains unclear whether glial niche cells can regulate the self-renewal and differentiation of NBs. Here we show that ferritin produced by glia, cooperates with Zip13 to transport iron into NBs for the energy production, which is essential to the self-renewal and proliferation of NBs. The knockdown of glial ferritin encoding genes causes energy shortage in NBs via downregulating aconitase activity and NAD+ level, which leads to the low proliferation and premature differentiation of NBs mediated by Prospero entering nuclei. More importantly, ferritin is a potential target for tumor suppression. In addition, the level of glial ferritin production is affected by the status of NBs, establishing a bicellular iron homeostasis. In this study, we demonstrate that glial cells are indispensable to maintain the self-renewal of NBs, unveiling a novel role of the NB glial niche during brain development.”

      In Figure 2B Mira appeared to be nuclear in NBs, which is inconsistent with its normal localization. Was it Dpn by mistake?

      We confirmed that the signal in Figure 2B is Mira. Moreover, we also provide a magnified picture in Figure 2B’, showing that Mira mainly localizes to the cortex or the cytoplasm, as previously reported.

      Figure 2C, Fer1HCH-GFP/mCherry localization was non-uniform in the NBs revealing 1-2 regions devoid of protein localization potentially corresponding to the nucleus and Mira crescent enrichment. It is important to co-label the nucleus in these cells and discuss the intracellular localization pattern of Ferritin.

      We have revised the picture in Figure 2C to include the nuclear marker DAPI. The result showed that Fer1HCH-GFP/Fer2LCH-mCherry did not co-localize with DAPI, indicating that Drosophila ferritin is predominantly distributed in the cytosol[4, 5]. Regarding the reviewer's concern, the GFP/mCherry signal in NBs originated from ferritin overexpressed in glia, which probably explains the non-uniform signal.

      In Figure 3-figure supplement 3F, glial cells in Fer1HCH RNAi appeared to be smaller in size. This should be quantified. Given the significance of ferritin in cortex glial cells, examining the morphology of cortex glial cells is essential.

      In Figure 3—figure supplement 3F, we did not label single glial cells, so it was difficult to determine whether their size was changed. However, the chamber formed by the cellular processes of glial cells appears to become smaller in Fer1HCH RNAi. The glial chamber undergoes remodeling during neurogenesis, responding to NB signals to enclose the NB and its progeny[6]. Thus, the size of the glial chamber is regulated by NB lineage size. In our study, the ferritin defect leads to low proliferation and therefore a smaller lineage for each NB, which likely makes the chamber smaller.

      Since the authors showed that the reduced NB number was not due to apoptosis, a time-course experiment for glial ferritin KD is recommended to identify the earliest stage when the phenotype in NB number /proliferation manifests during larval brain development.

      We observed brains at different larval stages upon glial ferritin KD. The result showed that NB proliferation decreased significantly, whereas NB number declined only slightly, at the second-instar larval stage (Figure 5—figure supplement 1E and F), suggesting that the brain defect caused by glial ferritin KD first manifests at the second-instar larval stage.

      Transcriptome analysis on ferritin glial KD identified genes in mitochondrial functions, while the in vivo EM data suggested no defects in mitochondria morphology. A short discussion on the inconsistency is required.

      When examining mitochondrial morphology in the in vivo EM data, we focused on whether cristae were visible in mitochondria, a criterion used to determine whether ferroptosis occurs[7]. It is possible that other aspects of mitochondrial morphology were altered, but we did not examine them. To describe this result more accurately, we replaced “However, our observation revealed no discernible defects in the mitochondria of NBs after glial ferritin knockdown” with “However, our result showed that the mitochondrial double membrane and cristae were clearly visible whether in the control group or glial ferritin knockdown group, which suggested that ferroptosis was not the main cause of NB loss upon glial ferritin knockdown” in lines 207-209.

      The statement “we found no obvious defects of brain at the first-instar larval stage (0-4 hours after larval hatching) when knocking down glial ferritin (Figure 5-figure supplement 1C).” lacks quantification of NB number and proliferation, making it challenging to conclude.

      We have provided the quantification of NB number and proliferation rate of the first-instar larval stage in Figure 5—figure supplement 1C and D. The data showed that there is no significant change in NB number and proliferation rate when knocking down ferritin, suggesting that no brain defect manifests at the first-instar larval stage.

      A wild-type control is necessary for Figure 6A-C as a reference for normal brain sizes.

      We have added Insc>mCherry RNAi as a reference in Figure 6A-D, which showed that the brain size of tumor model is larger than normal brain. Moreover, we removed brat RNAi data from Figure 6A-D to Figure 6—figure supplement 1A-D for the better layout.

      In Figures 6B, D, “Tumor size” should be corrected to “Larval brain volume”.

      Here, we used ImageJ to measure the brain area, rather than 3D brain volume, to assess the severity of the tumor. We therefore think “Larval brain size” is more appropriate than “Larval brain volume”, and we have corrected “Tumor size” to “Larval brain size” in Figure 6B and D and in Figure 6—figure supplement 1B and D.

      Considering that asymmetric division defects in NBs may lead to premature differentiation, it is advisable to explore the potential involvement of ferritin in asymmetric division.

      aPKC is a classic marker to determine the asymmetric division defect of NB. We performed the aPKC staining and found it displayed a crescent at the apical cortex based on the daughter cell position whether in control or glial ferritin knockdown (Figure 5—figure supplement 3A). This result indicated that there was no obvious asymmetric defect after glial ferritin knockdown.

      In the statement "Secondly, we examined the apoptosis in glial cells via Caspase-3 or TUNEL staining, and found the apoptotic signal remained unchanged after glial ferritin knockdown (Figure 3-figure supplement 3A-D).", replace "the apoptosis in glial cells" with "the apoptosis in larval brain cells".

      We have replaced "the apoptosis in glial cells" with "the apoptosis in larval brain cells" in line 216.

      Include a discussion on the involvement of ferritin in mammalian brain development and address the limitations associated with considering ferritin as a potential target for tumor suppression.

      We have added a discussion of ferritin in mammalian brain development in lines 428-430 and of the limitations of ferritin as a target for tumor suppression in lines 441-444.

      Indicate Insc-GAL4 as BDSC#8751, even if obtained from another source. Additionally, provide information on the extensively used DeRed fly stock used in this study within the methods section.

      We have provided the stock information for Insc-GAL4 and DsRed in lines 673-674.

      Reviewer #2 (Recommendations For The Authors):

      Major points:

      The number of NBs differs a lot between experiments. For example, in Fig 1B and 1K controls present less than 100 NBs whereas in Figure 1 Supplementary 2B it can be seen that controls have more than 150. Then, depending on which control you compare the number of NBs in flies silencing Fer1HCH or Fer2LCH, the results might change. The authors should explain this.

      Figure 1 Supplementary 2B (Figure 1 Supplementary 3B in the revised version) shows NB number in the VNC region, whereas Fig 1B and 1K show NB number in the CB region. We first described the general phenotype by showing NB number in the CB and VNC separately (Fig 1 and Fig 1-Supplementary 1 and 3 in the revised version); NB number is consistent within each region. Thereafter, we focused on NB number in the CB for convenience.

      This reviewer encourages the authors to use better Gal4 lines to describe the expression patterns of ferritins and Zip13 in the developing brain. On the one hand, the authors do not state which lines they are using (including supplementary table). On the other hand, new Trojan GAL4 (or at least InSite GAL4) lines are a much better tool than classic enhancer trap lines. The authors should perform this experiment.

      All stock sources and numbers are documented in Table 2. The ferritin GAL4 and Zip13 GAL4 lines in this study are InSite GAL4 lines. In addition, we used another Fer2LCH enhancer-trap GAL4 (DGRC104255) to verify our result and provide the data in Figure 2—figure supplement 1. Our data showed that DsRed driven by Fer2LCH-GAL4 co-localized with the glial nuclear protein Repo rather than the NB nuclear protein Dpn, consistent with the result obtained with Fer1HCH/Fer2LCH GAL4. We will also try to obtain the Trojan GAL4 lines (Fer1HCH/Fer2LCH GAL4 and Zip13 GAL4) and validate this result in the future.

      The authors exclude very rapidly the possibility of ferroptosis based only on some mitochondrial morphological features without analysing the other hallmarks of this iron-driven cell death. The authors should at least measure lipid peroxidation levels in their experimental scenario, either with a kit that quantifies by-products of lipid peroxidation such as malondialdehyde (MDA) or with an anti-4-HNE antibody.

      We combined multiple experiments to exclude the possibility of ferroptosis. First, ferroptosis can be blocked by iron chelators. We fed flies an iron chelator upon glial ferritin knockdown, but NB number and proliferation were not restored, which suggests that ferroptosis is probably not the cause of the NB loss induced by glial ferritin knockdown (Figure 3B and C). Second, Zip13 transports iron into the secretory pathway and further out of the cells in the Drosophila gut[8]. Our data showed that knocking down the iron transporter Zip13 in glia reduced NB number and proliferation, consistent with the phenotype upon glial ferritin knockdown (Figure 3E-G). More importantly, simultaneous knockdown of Zip13 and ferritin aggravated the phenotype in NB number and proliferation (Figure 3E-G). These results suggest that the phenotype is induced by iron deficiency in NBs, which argues against iron overload or ferroptosis as the main cause of NB loss upon glial ferritin knockdown. Finally, we examined mitochondrial morphology, in particular the double membrane and the cristae, whose damage is a critical hallmark of ferroptosis, and found no significant damage (Figure 3-figure supplement 2E and F).

      In addition, we have added the 4-HNE measurement in Figure 3—figure supplement 2G and H. The 4-HNE level did not change significantly, suggesting that lipid peroxidation was unchanged, which further argues against ferroptosis as the cause of NB loss upon glial ferritin knockdown.

      All of the above results together indicate that ferroptosis is not the cause of NB loss after ferritin knockdown.

      A major flaw of the manuscript is related to the chapter Glial ferritin defects result in impaired Fe-S cluster activity and ATP production and the results displayed in Figure 4. The authors talk about the importance of FeS clusters for energy production in the mitochondria. Surprisingly, the authors do not analyse the genes involved in this process such as but they present the interaction with the cytosolic FeS machinery that has a role in some extramitochondrial proteins but no role in the synthesis of FeS clusters incorporated in the enzymes of the TCA cycle and the respiratory chain. The authors should repeat the experiments incorporating the genes Nfs1 (CG12264), ISCU (CG9836), ISD11 (CG3717), and fh (CG8971) or remove (or at least rewrite) this entire section.

      Thank you for this constructive advice; we have revised Figure 4B and C accordingly. We repeated the experiment while blocking mitochondrial Fe-S cluster biosynthesis by knocking down Nfs1 (CG12264), ISCU (CG9836), ISD11 (CG3717), and fh (CG8971), respectively. Nfs1 knockdown in NBs led to low proliferation, which was consistent with CIA knockdown. However, we did not observe an obvious brain defect upon knockdown of ISCU (CG9836), ISD11 (CG3717), or fh (CG8971) in NBs. Our interpretation of these results is that Nfs1 is probably a necessary core component of Fe-S cluster assembly, while the others are dispensable[9].

      The presence and aim of the mouse model is unclear to this reviewer. On the one hand, it is not used to corroborate the fly findings regarding the iron needs of neuroblasts. On the other hand, and without further explanation, the authors migrate from a fly tumor model based on modifying all neuroblasts to a mammalian model based exclusively on a glioma. The authors should clarify these issues.

      Although iron transporters probably differ between Drosophila and mammals, the function of iron as an essential nutrient for cell growth and proliferation is conserved from Drosophila to mammals. The fly data suggested that iron is critical for brain tumor growth, and we therefore tested this in a mammalian model. Glioma is the most common form of central nervous system neoplasm and originates from neuroglial stem or progenitor cells[10]. We therefore examined the effect of the iron chelator DFP on glioma in mice and found that DFP could suppress glioma growth and prolong the survival of tumor-bearing mice.

      Minor points

      Although they refer to adult flies, the authors did not include, either in the introduction or in the discussion, existing literature about the expression of ferritins in glia or alterations of iron metabolism in fly glial cells (PMID: 21440626 and 25841783, respectively) or the use of the iron chelator DFP in Drosophila (PMID: 23542074). The authors should check these papers and consider incorporating them into their manuscript.

      Thank you for the reminder. We have incorporated all recommended papers into our manuscript (lines 65-67 and 168).

      The number of experiments in each figure is missing.

      All experiments were repeated at least three times. We have stated this in the Quantifications and Statistical Analysis section of Materials and methods.

      If graphs are expressed as mean +/- sem, it is difficult to understand the significance stated by the authors in Figure 2E.

      We apologize for this mistake and have corrected it in Quantifications and Statistical Analysis. All statistical results are presented as mean ± SD.

      When authors measure aconitase activity, are they measuring all (cytosolic and mitochondrial) or only one of them? This is important to better understand the experiments done by the authors to describe any mitochondrial contribution (see above in major points).

      In this experiment, we measured total aconitase activity. We also tried to measure mitochondrial aconitase activity, but the attempt failed, possibly because of the low biomass of the tissue sample.

      Along the same lines, why do the controls in the aconitase and ATP measurements lack an error bar? Are the statistical tests applied the correct ones? It is not the same to have paired or unpaired observations.

      This is due to the normalization. We repeated these experiments at least three times, in different weeks, because the whole process (collecting brains, determining protein content, and measuring ATP or aconitase activity) is time-consuming and labor-intensive. In addition, the efficiency of the aconitase and ATP kits changed over time, so we could not keep the experimental conditions identical across batches. We therefore normalized within each batch to present a more accurate result: the control group was divided by itself, giving a value of 1, and the other groups were divided by the control. This normalization was repeated for each of the three batches, so the control group has no error bar. We think it is appropriate to apply ANOVA with a Bonferroni test to the three groups.
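
      The per-batch normalization described above can be sketched as follows. This is a minimal illustration only: the raw readings, the group names, and the helper normalize_to_control are hypothetical and are not taken from the manuscript. It shows why the control group ends up with a value of exactly 1 and no spread after normalization.

```python
from statistics import mean, stdev

# Hypothetical raw kit readings (arbitrary units) from three independent batches.
# The absolute scale differs between batches, as it would with varying kit efficiency.
batches = [
    {"control": 2.0, "Fer1HCH_RNAi": 1.2, "Fer2LCH_RNAi": 1.0},
    {"control": 3.0, "Fer1HCH_RNAi": 1.5, "Fer2LCH_RNAi": 1.8},
    {"control": 1.0, "Fer1HCH_RNAi": 0.7, "Fer2LCH_RNAi": 0.5},
]


def normalize_to_control(batch, control_key="control"):
    """Divide every group in one batch by that batch's control reading."""
    reference = batch[control_key]
    return {group: value / reference for group, value in batch.items()}


normalized = [normalize_to_control(batch) for batch in batches]

# Mean and SD of the normalized values across batches, per group.
summary = {}
for group in batches[0]:
    values = [batch[group] for batch in normalized]
    summary[group] = (mean(values), stdev(values))

for group, (avg, sd) in summary.items():
    print(f"{group}: mean={avg:.3f}, SD={sd:.3f}")
```

      In this sketch the control is 1 in every batch by construction, so its standard deviation is 0, while the other groups keep their batch-to-batch variability.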

      In some cases, further rescue experiments would be appreciated. For example, expression of Ndi1 restores control NAD+ levels or number of NBs; it would be interesting to know if this is accompanied by restoration of mitochondrial integrity and of its ability to produce ATP.

      We have determined ATP production after overexpressing Ndi1 and provided this result in Figure 4—figure supplement 1B. The data showed that expression of Ndi1 could restore ATP production upon glial Fer2LCH knockdown, which was consistent with our conclusion.

      Lines 293-299 on page 7 are difficult to understand.

      According to the results above, the decrease in NB number and proliferation upon glial ferritin knockdown (KD) was caused by energy deficiency. As shown in the schematic diagram (Author response image 1), “T” represents the total energy used for NB maintenance and proliferation, “N” the energy for maintaining NB number, and “P” the energy for NB proliferation, so that “T” equals “N” plus “P”. When ferritin was knocked down in glia, “T”, “N” and “P” all declined in “Ferritin KD” compared to “wildtype (WT)”. Knockdown of pros can prevent NB differentiation, but it cannot supply energy to NBs, which probably explains why NB number, but not proliferation, was rescued. Specifically, NB number increased significantly in “Ferritin KD Pros KD” compared to “Ferritin KD”, so more energy was consumed for NB maintenance in “Ferritin KD Pros KD”. As shown in the schematic diagram, “T” did not change between “Ferritin KD Pros KD” and “Ferritin KD”, whereas “N” increased in “Ferritin KD Pros KD” compared to “Ferritin KD”. Thus “P” decreased, meaning that less energy remained for proliferation, which led to the failure to rescue NB proliferation. Indeed, the level of proliferation in “Ferritin KD Pros KD” seemed even lower than in “Ferritin KD”.

      Author response image 1.

      The schematic diagram of relationship between energy and NB function in different groups. “T” represents total energy for NB maintenance and proliferation. “N” represents the energy for NB maintenance. “P” represents the energy for NB proliferation. T=N+P 

      Line 601 should indicate that Tables 2 and 3 are part of the supplementary material.

      We have revised this in line 678.

      Figure 4-supplement 1. Only validation of 2 genes from a RNAseq seems too little.

      We dissected hundreds of brains to sort NBs because of the low biomass of the fly brain, which is difficult and labor-intensive work. Most of the NBs were used for RNA-seq, so only a small amount of sample was left for validation, which was not enough to test more genes.

      Figure 6E, the authors indicate that 10 mg/ml DFP injection could significantly prolong the survival time. Which increase in % is produced by DFP?

      We have provided the bar graph in Author response image 2. DFP injection increased survival time by about 16.67%.

      Author response image 2.

      The bar graph of survival time of mice treated with DFP. (The unpaired two-sided Student’s t test was employed to assess statistical significance. Statistical results were presented as means ± SD. n=7,6; *: p<0.05)
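
      For readers who want to see how a percentage increase of this kind is obtained, the arithmetic can be sketched as below. The survival times are hypothetical placeholders chosen only to match the group sizes in the legend (n=7 and n=6); they are not the measured data.

```python
from statistics import mean

# Hypothetical survival times in days (NOT the measured data); group sizes follow
# the figure legend: n=7 for vehicle-treated mice and n=6 for DFP-treated mice.
control_days = [28, 29, 30, 30, 30, 31, 32]
dfp_days = [33, 34, 35, 35, 36, 37]


def percent_increase(reference, treated):
    """Relative change of the treated mean with respect to the reference mean, in percent."""
    reference_mean = mean(reference)
    return (mean(treated) - reference_mean) / reference_mean * 100.0


increase = percent_increase(control_days, dfp_days)
print(f"Increase in mean survival: {increase:.2f}%")
```

      With these placeholder values the mean rises from 30 to 35 days, which corresponds to an increase of about 16.67%.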

      Reviewer #3 (Recommendations For The Authors):

      As I read the initial results that built the story (glia make ferritin>release it> NBs take them up>use it for TCA and ETC) I kept thinking about what it meant for NBs to be 'lost'. This led me to consider alternate possibilities that the results might point to, other than the ones the authors were suggesting. It was only in Figure 5 that the authors ruled out some of those possibilities. I would suggest that they first illustrate how NBs are lost upon glial ferritin loss of function before they delve into the mechanism. This would also be a place to similarly address that glial numbers and general morphology are unchanged upon ferritin loss.

      This recommendation provides a valuable guideline for building the story, especially for researchers who are interested in neural stem cell studies. We did try to present our study in this order, but found that it left several gaps in the middle of the manuscript, such as the relationship between glial ferritin and Pros localization in NBs, so that the story could not be presented fluently. We therefore decided to keep the current order of presentation.

      More details of the screen would be useful to know. How many lines did they screen, what was the assay? This is not mentioned anywhere in the text.

      We have added this in the Screen section of Materials and methods. We screened about 200 lines targeting components of classical signaling pathways, genes highly expressed in glial cells, or genes encoding secretory proteins. UAS-RNAi lines were crossed with repo-Gal4, and third-instar F1 larvae were dissected. The brains of the F1 larvae were immunostained for Dpn and PH3 and then imaged on a confocal microscope.

      Many graphs seem to be repeated in the main figures and the supplementary data. This is unnecessary, or at least should be mentioned.

      We appreciate your kind reminder. However, we carefully went through all the figures and did not find the repeated graphs, though some of them look similar.

      The authors mention that they tested which glial subtypes ferritin is needed in, but don't show the data. Could they please show the data? Same with the other iron transport/storage/regulation. Also, in both this and later sections, the authors could mention which Gal4 was used to label what cell types. The assumption is that the reader will know this information.

      We have added the result of ferritin knockdown in glial subpopulations in Figure 1—figure supplement 2. Because of the large number of iron-related genes, we did not take images for them, but we recorded the results in Table 3.

      For all their images showing colocalisation, magnified, single-colour images shown in grayscale will be useful. For example, without the magnification, it is not possible to see the NB expression of the protein trap line in Figure 2B. A magnified crop of a few NBs (not a single one like in 2C) would be more useful.

      We have provided Figure 2A’, B’, D’ and Figure 3D’ as suggested.

      There are a lot of very specific assays used to detect ROS, NAD, aconitase activity, among others. It would be nice to have a brief but clear description of how they work in the main text. I found myself having to refer to other sources to understand them. (I believe SoNAR should be attributed to Zhao et al 206 and not Bonnay et al 2020.)

      We have added brief descriptions of the ROS, aconitase activity, and NAD assays in lines 198-199, 229-231, and 269 as suggested.

      I did not understand the normalisation done with respect to SoNAR. Is this standard practice? Is the assumption that 'overall protein levels will be higher in slowly proliferating NBs' reasonable? This is why they state the need to normalise.

      The SoNar normalization is not standard practice. However, we think that our normalization of SoNar is reasonable. According to our results, the expression levels of Dpn and Mira seemed higher upon glial ferritin knockdown, so we speculated that some proteins accumulate in slowly proliferating NBs. We therefore used Insc-GAL4 to drive DsRed as a readout of Insc expression and found that DsRed rose after glial ferritin knockdown, suggesting that Insc expression was indeed increased. For this reason, we normalized SoNar driven by Insc-GAL4 to DsRed driven by Insc-GAL4, which eliminates the effect of the increased Insc expression upon glial ferritin knockdown.
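
      The logic of this normalization can be sketched as a simple ratio. The intensities and the helper normalize_sensor below are hypothetical and serve only to illustrate how dividing by a reporter driven by the same GAL4 line removes a change in driver strength from the comparison.

```python
def normalize_sensor(sensor_intensity, reporter_intensity):
    """Express the sensor signal per unit of reporter driven by the same GAL4 line."""
    if reporter_intensity <= 0:
        raise ValueError("reporter intensity must be positive")
    return sensor_intensity / reporter_intensity


# Hypothetical mean fluorescence intensities (arbitrary units), NOT measured data.
control = {"sonar": 100.0, "dsred": 50.0}
knockdown = {"sonar": 120.0, "dsred": 80.0}

control_ratio = normalize_sensor(control["sonar"], control["dsred"])
knockdown_ratio = normalize_sensor(knockdown["sonar"], knockdown["dsred"])

print(f"raw sensor signal: control={control['sonar']}, knockdown={knockdown['sonar']}")
print(f"normalized signal: control={control_ratio}, knockdown={knockdown_ratio}")
```

      In this made-up case the raw sensor signal is higher in the knockdown, yet the normalized signal is lower, because the driver itself is more strongly expressed.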

      FAC is mentioned as a chelator? But the authors seem to use it oppositely. Is there an error?

      FAC is an iron salt that is used to supply iron. We have clarified this in line 156 following your advice.

      The lack of any cell death in the L3 brain surprised me. There should be plenty of hemilineages that die, as do many NBs, particularly in the abdominal segments. Is the stain working? Related to this, P35 is not the best method for rescuing cell death. H99 might be a better way to go.

      We were also surprised by this result and repeated the experiment several times with both negative and positive controls. Moreover, we used TUNEL staining to validate the finding and obtained the same result. We will try to use H99 to rescue NB loss in the future, because it first needs to be recombined with our current genetic tools.

      It would be nice to see the aconitase activity signal as opposed to just the quantification.

      This assay reports aconitase activity only as an absorbance reading, so our result can only be shown as a quantification.

      Glia are born after NBs are specified. In fact, they arise from NBs (and glioblasts). So, it's unlikely that the knockdown of ferritin in glia can at all affect initial NB specification.

      We completely agree with this statement.

      The section on tumor suppression seems out of place. The fly data on which the authors base this as an angle to chase is weak. Dividing cells will be impaired if they have inadequate energy production. As a therapeutic, this will affect every cell in the body. I'm not sure that cancer therapeutics is pursuing such broadly acting lines of therapies anymore.

      Our data suggested that iron/ferritin is more critical for highly proliferative cells. Tumor cells express high levels of TfR (Transferrin Receptor)[11], which can bind both transferrin and ferritin[12], and ferritin specifically targets tumor cells[11]. Thus, we think iron/ferritin is particularly essential for tumor cells. If an appropriate dose of an iron/ferritin inhibitor can be found that suppresses tumor growth while maintaining normal cell growth, iron/ferritin might be an effective target for tumor treatment.

      The feedback from NB to glial ferritin is also weak data. The increased cell numbers (of unknown identity) could well be contributing to the increase in ferritin. I would omit the last two sections from the MS.

      In brat RNAi and numb RNAi, the additional cells are NB-like cells, which cannot undergo further differentiation and are not expected to produce ferritin. More importantly, we used Repo (a glial marker) as the reference and quantified the ratio of the ferritin level to the Repo level, which excludes the possibility that an increased number of glial cells accounts for the increase in ferritin.

      References

      (1) Tanimura T, Isono K, Takamura T, et al. Genetic Dimorphism in the Taste Sensitivity to Trehalose in Drosophila-Melanogaster. J Comp Physiol, 1982,147(4):433-7

      (2) Myster DL, Duronio RJ. Cell cycle: To differentiate or not to differentiate? Current Biology, 2000,10(8):R302-R4

      (3) Dalton S. Linking the Cell Cycle to Cell Fate Decisions. Trends in Cell Biology, 2015,25(10):592-600

      (4) Nichol H, Law JH, Winzerling JJ. Iron metabolism in insects. Annu Rev Entomol, 2002,47:535-59

      (5) Pham DQ, Winzerling JJ. Insect ferritins: Typical or atypical? Biochim Biophys Acta, 2010,1800(8):824-33

      (6) Speder P, Brand AH. Systemic and local cues drive neural stem cell niche remodelling during neurogenesis in Drosophila. Elife, 2018,7

      (7) Mumbauer S, Pascual J, Kolotuev I, et al. Ferritin heavy chain protects the developing wing from reactive oxygen species and ferroptosis. PLoS Genet, 2019,15(9):e1008396

      (8) Xiao G, Wan Z, Fan Q, et al. The metal transporter ZIP13 supplies iron into the secretory pathway in Drosophila melanogaster. Elife, 2014,3:e03191

      (9) Marelja Z, Leimkühler S, Missirlis F. Iron Sulfur and Molybdenum Cofactor Enzymes Regulate the Drosophila Life Cycle by Controlling Cell Metabolism. Front Physiol, 2018,9

      (10) Morgan LL. The epidemiology of glioma in adults: a "state of the science" review. Neuro-Oncology, 2015,17(4):623-4

      (11) Fan K, Cao C, Pan Y, et al. Magnetoferritin nanoparticles for targeting and visualizing tumour tissues. Nat Nanotechnol, 2012,7(7):459-64

      (12) Li L, Fang CJ, Ryan JC, et al. Binding and uptake of H-ferritin are mediated by human transferrin receptor-1. Proc Natl Acad Sci U S A, 2010,107(8):3505-10

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The propagation of electrical signals within neuronal circuits is tightly regulated by the physical and molecular properties of neurons. Since neurons vary in size across species, the question arises whether propagation speed also varies to compensate for it. The present article compares numerous speed-related properties in human and rat neurons. They found that the larger size of human neurons seems to be compensated by a faster propagation within dendrites but not the axons of these neurons. The faster dendritic signal propagation was found to arise from wider dendritic diameters and greater conductance load in human neurons. In addition, the article provides a careful characterization of human dendrites and axons, as the field has only recently begun to characterize post-operative human cells. There are only a few studies reporting dendritic properties and these are not all consistent, hence there is the added value of reporting these findings, particularly given that the characterization is condensed in a compartmental model.

      Strengths:

      The study was performed with great care using standard techniques in slice electrophysiology (pharmacological manipulation with somatic patch-clamp) as well as some challenging ones (axonal and dendritic patch-clamp). Modeling was used to parse out the role of different features in regulating dendritic propagation speed. The finding that propagation speed varies across species is novel as previous studies did not find a large change in membrane time constant or axonal diameters (a significant parameter affecting speed). A number of possible, yet less likely factors were carefully tested (Ih, membrane capacitance). The main features outlined here are well-known to regulate speed in neuronal processes. The modeling was also carefully done to verify that the magnitude of the effects is consistent with the difference in biophysical properties. Hence, the findings appear very solid to me.

      Weaknesses:

      The role of diameter in regulating propagation speed is well-known in the axon literature.

      We thank the reviewer for this comment. This is indeed true. The paper does not claim that this is new; we simply referred to Waxman’s book to acknowledge this established effect. Our main emphasis is on the impact of dendritic (rather than axonal) diameter: we highlight that EPSP speed is faster near the input synapse and converges to a steady-state value further away from the soma, and we use this to explore the impact of the difference in dendritic diameter between rat and human on EPSP latency and velocity. We have now made this point clearer in the revised text.

      Reviewer #2 (Public Review):

      Summary:

      In this paper, Oláh and colleagues introduce new research data on the cellular and biophysical elements involved in transmission within the pyramidal circuits of the human neocortex. They gathered a comprehensive set of patch-clamp recordings from human and rat pyramidal neurons to compare how the temporal aspect of neuronal processing is maintained in the larger human neocortex. A broad range of experimental, theoretical, and computational methods are used, including two-photon guided dual whole-cell recordings, electron microscopy, and computational simulations of reconstructed neurons.

      Recordings from synaptically connected pyramidal neurons revealed longer intercellular path lengths within the human neocortex. Further, by using dual whole-cell recordings from soma-dendrite and soma-axon locations, they found that short latencies from soma to soma can be partly attributed to an increased propagation speed for synaptic potentials, but not for the propagation of action potentials along the axon.

      Next, in a series of extensive computational modeling studies focusing on the synaptic potentials, the authors observe that the short-latency within large human pyramidal neural circuits may have a passive origin. For a wide array of local synaptic input sites, the authors show that the conductance load of the dendrites, electrically coupled to a large diameter apical dendrite, affects the cable properties. The result is a relatively faster propagation of EPSPs in the human neuron.

      The manuscript is well-written and the physiological experiments and biophysical arguments are very well explained. I appreciated the in-depth theoretical steps for the simulations. That passive cable properties of the dendrites are causing a higher velocity in human dendrites is interesting but there is a disconnect between the experimental findings and the model simulations. Based on the present data the contribution of active membrane properties cannot be dismissed and deserves further experiments.

      See our response below.

      Strengths:

      The authors present state-of-the-art 2P-guided dual whole-cell recordings in human neurons. In combination with detailed reconstructions, these approaches represent the next steps in unravelling the information processing in human circuits.

      The computational modeling based on cable theory and experimentally constrained simulations provides an excellent integrated view of the passive membrane properties.

      Weaknesses:

      There are smaller and larger issues with the statistical analyses of the experimental data which muddles the interim conclusions.

      That the cable properties alone are the main explanation for speeding the electrical signaling in human pyramidal neurons appears inconsistent with the experimental data.

      This is an excellent point. We indeed analyzed only passive cases, highlighting (and now also ranking) the impact of the various morpho-electrical properties of the neurons on the differences in signal latency between human and rat. We did explore (not shown) the effect of active channels in the dendrites (including the h-current); as expected, the results strongly depend on channel density and on the spatial distribution of the channels over the dendritic tree. As we do not know these parameters for the modelled cells, we decided to remain focused on the impact of passive/morphological parameters. We also note that the experimental results (pages 4-5 in the manuscript) show only a minor contribution of the h-current, emphasizing that the passive properties play the main role in differentiating human and rat neurons.

      Some of the electrophysiological experiments require further control experiments to make robust conclusions.

      Reviewer #3 (Public Review):

      Summary:

      This study indicates that connections across human cortical pyramidal cells have identical latencies despite a larger mean dendritic and axonal length between somas in the human cortex. A precise demonstration combining detailed electrophysiology and modeling indicates that this property is due to faster propagation of signals in proximal human dendrites. This faster propagation is itself due to a slightly thicker dendrite, a larger capacitive load, and stronger hyperpolarizing currents. Hence, the biophysical properties of human pyramidal cells are adapted such that they do not compromise information transfer speed.

      Strengths:

      The manuscript is clear and very detailed. The authors have experimentally verified a large number of aspects that could affect propagation speed and have pinpointed the most important one. This paper provides an excellent comparison of biophysical properties between rat and human pyramidal cells. Thanks to this approach a comprehensive description of the mechanisms underlying the acceleration of propagation in human dendrite is provided.

      Weaknesses:

      Several aspects having an impact on propagation speed are highlighted (dendritic diameter, ionic channels, capacitive load) and there is no clear ranking of their impact on signal propagation speed. It seems that the capacitive load plays a major role, much more than dendritic diameter for which only a 10% increase is observed across species. Both aspects actually indicate that there is an increase in passive signal propagation speed with bigger cells at least close to the soma. This suggests that bigger cells are mechanically more rapid. An intuitive reason why capacitive load increases speed would also help the reader follow the demonstration.

      We thank the referee for both these excellent points. In response to them:

      (i) We have now performed a new comprehensive statistical analysis that ranks the effect of the different morphological/cable factors on EPSP propagation. This analysis appears in Supp. Tables 5 and 6, in Fig. S16, and in the main text as follows:

      To rank the impact of the various factors affecting EPSP propagation latency in human and rat neurons, we conducted a comprehensive statistical analysis using two complementary approaches: the generalized linear model (GLM) (Kiebel & Holmes, 2007) and SHAP (SHapley Additive exPlanations) (Lundberg & Lee, 2017), the latter based on fitting a Gradient Tree Boosting model (Friedman, 2002). We began by fitting a GLM without interaction terms among the factors affecting EPSP latency (Suppl. Table 5). This enabled us to quantify the primary individual factors affecting EPSP propagation. Our analysis revealed the following ranking order: 1) physical distance of synapses from the soma had the strongest effect; 2) species differences; 3) conductance load, as demonstrated by our “hybrid cells” manipulation; 4) radii of the apical dendrite, affecting the cable’s space constant, λ; and 5) the specific cable parameters, whose effect, as revealed when using per-cell fitted parameters versus uniform cable parameters, was minimal. We next performed a GLM analysis with interaction terms, showing that, as expected, there are significant interactions between the factors affecting EPSP latency (Suppl. Table 6). To further validate the above ranking while incorporating the interactions between the various factors affecting EPSP latency, we performed a SHAP analysis. Notably, even with interactions included, the ranking of the factors affecting signal propagation is aligned with the results from the analysis based on the GLM without interaction terms (see Fig. S16).

      (ii) As for the intuitive explanation requested by the referee, we added the following paragraph to the Discussion:

      The intuitive reason for this enhancement is that the large conductance load (the “leaky end” boundary conditions) more effectively “steals” the synaptic (axial) current (like water pouring faster into a large pool). The more mathematical intuition is that the large soma (sink) adds fast time constants to the system (see also related explanation in Fig. 4 in Eyal et al., 2014).
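
      The attribution idea behind the SHAP-based ranking mentioned in point (i) can be illustrated with an exact Shapley-value computation on a toy function. This is a sketch under stated assumptions, not the analysis performed in the study: the function toy_latency, the three factor names, and all numerical values are hypothetical, and exact Shapley values over all orderings stand in for the SHAP estimates obtained from a fitted Gradient Tree Boosting model.

```python
from itertools import permutations


def toy_latency(distance, radius, load):
    """Hypothetical latency model with interacting factors (illustration only)."""
    return distance / (100.0 * radius * (1.0 + load))


# Hypothetical reference cell and cell of interest (arbitrary units).
BASELINE = {"distance": 100.0, "radius": 1.0, "load": 0.0}
INSTANCE = {"distance": 200.0, "radius": 1.25, "load": 1.0}


def shapley_values(model, baseline, instance):
    """Average each factor's marginal contribution over all orderings of the factors."""
    names = list(baseline)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(**current)
        for name in order:
            # Switch one factor from its baseline value to the instance value.
            current[name] = instance[name]
            value = model(**current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


phi = shapley_values(toy_latency, BASELINE, INSTANCE)
ranking = sorted(phi, key=lambda name: abs(phi[name]), reverse=True)

for name in ranking:
    print(f"{name}: {phi[name]:+.4f}")
```

      The contributions sum to the difference between the model output for the cell of interest and for the reference cell, so interactions between factors are shared among them rather than ignored, which is the property that makes this kind of attribution suitable for ranking interacting factors.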

      We thank the editors for considering and revising our manuscript for publication in eLife. We appreciate the positive appreciation of the work and the critical points raised by the reviewers. We have responded in detail to all the excellent comments from all reviewers. We believe that these revisions have significantly improved the quality of our study.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      There are two points that could improve the reading experience of this nice manuscript. These should be easily addressed with minor re-phrasing.

      Credit to conduction velocity literature. Less widely known in the dendrite literature, in the axon literature, the relationship between propagation speed and process diameter is well established. I thought the two articles cited (Jack Noble Tsien and Agmon-Snir & Segev) were not as direct in the treatment of this relationship. The work of Stephen Waxman, for instance, made clear how axon diameter tightly controls propagation speed (see for instance the Scholarpedia entry by Swadlow and Waxman). In my opinion, this is a widely known piece of work, that is part of some introductory books to neuroscience. While the article does not claim they found this relationship, parts of the presentation are better understood if we ignore this well-known fact. I am referring to the abstract, intro, and the beginning of results where 'larger' is presented as synonymous with 'slower'. For instance 'to compensate for the increase neurons' size' (abstract) or 'the increase in size of dendrites and axons might come with a cost of longer signal propagation times' only makes sense if 'size' refers to spatial extent, not diameter.

      We thank the reviewer for this valid point; leaving out axon diameter references was not intentional. We have now added the suggested reference to our manuscript. In the size comparisons, we have only pointed out the obvious size differences between the body and the dendritic processes. We have reworded sentences with size comparisons.

      In Abstract (lines 1-6):

      Human-specific cognitive abilities depend on information processing in the cerebral cortex, where neurons are significantly larger and their processes are longer and sparser compared to rodents. We found that, in synaptically-connected layer 2/3 pyramidal cells (L2/3 PCs), soma-to-soma signal propagation delay is similar in humans and rodents. Thus, to compensate for the longer processes of neurons, membrane potential changes must propagate faster in human axons and/or dendrites.

      In section “Effect of dendritic thickness” in Results we have modified it as follows:

      The relationship between conduction velocity and axon diameter is well known for small myelinated and unmyelinated axons (Waxman and Bennett, 1972). Anatomical features of neuronal processes dendrites also have a major influence on signal propagation properties 5,19, thus …

      Waxman, S. G. and Bennett, M. V. L. Relative conduction velocity of small myelinated and nonmyelinated fibres in the central nervous system. Nature New Biol., 238, 217-219, 1972.

      Two or four dendritic factors? The study identifies two major dendritic factors influencing the propagation speed (diameter and load), however the end of the results highlights four factors. I did not understand how factor 2 was different than factor 1. Neither did I understand how factor 4 was different from the other factors. There seemed to be a little redundancy here that could be streamlined.

      We thank the reviewer for pointing this out. We have now changed the respective text and added the ranking statistics (see above) to assess the effect of the different parameters on signal propagation in dendrites.

      Microcircuits? The study found that the changes in speed arise from the dendrites rather than the axons, as such it seems it would be more precise to replace 'microcircuits' with 'dendrites'.

      We are thankful for this suggestion. We have changed the title to "Accelerated signal propagation speed in human neocortical dendrites".

      Typos

      P3 line 24 'find significant difference the propagation'.

      P6 line 35 'how morphological differences' it would be useful to specify which morphological difference here.

      Corrected.

      Reviewer #2 (Recommendations For The Authors):

      (1) The statistical analyses should be changed. T-testing populations and comparing visual differences of differences ("human minus rats") is a common but egregious error in the field of neurosciences (see doi:10.1038/nn.2886). The conclusion that HCN channels "... do not by themselves explained the differences between the two species" (lines 174-176) is not compelling. The design of the experiments presented in Figure 3 is paired recordings and the addition of a blocker (ZD7288 or TTX cocktail). These are classic 2 x 2 factorial designs (species x drug). The authors will need to perform a repeated-measured analysis of variance (RM-ANOVA) and provide information on the interaction significance. Please revise the figures and improve statistical reporting. Post-hoc comparisons of the velocity populations are required to support the idea of whether h-channels are explaining the observed differences.

      Thank you for drawing our attention to this error. The statistical analysis of the pharmacological experiments was repeated as suggested. After the two-way ANOVA with repeated measures and Bonferroni post-hoc correction, we indeed find significant differences only in the control group, namely that the propagation speed of bAPs in human dendrites was significantly higher. The proposed statistical analysis demonstrates that the administration of ZD has no statistically significant effect on the propagation speed in human or rat dendrites. The treatment with the TTX cocktail resulted in a significant difference in signal propagation in humans but not in rodents. However, the trend is discernible and the value P = 0.0588 is close to the widely accepted 0.05 threshold. After the TTX cocktail treatment, the speed of signal propagation did not differ significantly between the two species. However, on average, the human dendrites remained faster. These alterations in P-values do not affect our primary conclusions. The MS text has been modified accordingly.
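      The reviewer's point about the species x drug interaction can be illustrated with a minimal sketch. It assumes that the drug condition is measured within cells and species between cells; under that assumption, the interaction in a 2 x 2 design reduces to comparing the per-cell drug effects between species, and the squared Student t statistic below equals the interaction F of the mixed ANOVA. The function names and the speed values are hypothetical and are not the data or the code used in the study.

```python
from math import sqrt
from statistics import mean, variance


def interaction_t(ctrl_a, drug_a, ctrl_b, drug_b):
    """Pooled-variance t statistic for the group x drug interaction.

    Each cell contributes one difference score (drug - control); the
    interaction asks whether these scores differ between the two groups.
    Returns the t statistic and its degrees of freedom.
    """
    diff_a = [d - c for c, d in zip(ctrl_a, drug_a)]
    diff_b = [d - c for c, d in zip(ctrl_b, drug_b)]
    n_a, n_b = len(diff_a), len(diff_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(diff_a) + (n_b - 1) * variance(diff_b)) / df
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(diff_a) - mean(diff_b)) / se, df


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical propagation speeds (m/s), one entry per cell.
human_ctrl = [0.30, 0.28, 0.33, 0.31]
human_drug = [0.25, 0.24, 0.27, 0.26]
rat_ctrl = [0.22, 0.20, 0.24, 0.21]
rat_drug = [0.20, 0.19, 0.21, 0.20]

t_stat, dof = interaction_t(human_ctrl, human_drug, rat_ctrl, rat_drug)
print(f"interaction t = {t_stat:.3f}, df = {dof}")
```

      A significant interaction, rather than a significant effect in one species and a non-significant effect in the other, is what would support a species-specific drug effect.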

      (2) Although ZD7288, in my opinion, influences the bAP (see point #1) the authors subsequently leave the h-current unblocked in the experiments in Figures 3D, E. Here, they use sodium, potassium, and calcium currents as well as synaptic conductances. I am puzzled why (in line 188) they claim the dendrites are "passive" although the data show h-currents are contributing to the shape of the bAP in human neurons. In line 196 they conclude voltage-gated conductances have a "minor" contribution and passive properties a main role. Please revise conclusions or provide better experimental support.

      Thank you for this point. We meant to refer to the state in which no action potential can be generated, although the word 'passive' might be misleading in this context; we have rephrased these sentences in the MS accordingly.

      (3) A major concern is the injection of an AP in voltage-clamp mode. Although this is the right choice and I'm in support of the experiment, it is technically challenging to space clamp the soma and fully recapitulate the speed and amplitude of a 100 mV depolarization. The voltage drop in peak amplitude as well as the increased delay between the baseline AP (current clamp) and AP in blocker conditions (voltage clamp) could be fully explained by switching between current- and voltage-clamp modes. In additional control experiments, the authors should add a second voltage follower electrode (CC) at the soma showing whether the authors can preserve the original AP (from CC) in VC/blocker condition. It may well be they need to adjust the injection protocol.

      Our experiments were designed to replicate the work of Stuart et al. (1994), in which they compared the attenuation of active and passive backpropagating signals. When they blocked Na+ channels with TTX they injected simulated action potentials in voltage-clamp mode. They concluded that TTX-sensitive Na+ channels cause somatic action potential entry into the dendritic compartment. They found a comparable attenuation of the backward propagating action potential in the dendrites in control conditions (~70 %). 

      We performed control recordings based on the reviewer’s suggestion (Author response image 1).

      Author response image 1.

      Injection of the previously recorded AP (blue) in VC mode produced a closely matching somatic AP in CC mode (orange). The slight temporal delay between the two signals is caused by the different positions of the pipettes on the cell body. The right panel shows the plot of the two peak-aligned APs as a function of each other, close to the blue ‘equality’ line. We concluded that the original AP is well preserved in VC/blocker condition.

      (5) From the paragraph entitled "Modeling EPSP propagation in dendrites" and onwards the authors make countless conclusions based on theory and modelling results but without any statistical support. Multiple neurons are used thus it is rather straightforward to provide numerical support for the assertions. For example, but this is not an exhaustive list, how should we interpret that latency ranges are different (line 240, line 253) etc.? Or were the estimated Cm values of human and rat neurons (0.6 versus 1.1) significantly different? And if so, how does this align with the Cm estimates in the nucleated patch experiments?

      We thank the referee for this comment and have now added a set of statistical analyses. The results now appear throughout the theoretical part of the revised article, in particular with respect to Figs. 6&7, where we now show that our various manipulations (e.g., hybrid vs. original cells) as well as the cable parameters (Cm, Rm) are indeed significantly different between human and rat, whereas the membrane time constant is not. As for Cm in human, our limited sample size shows a significant difference between human and rat. Yet, the range of values for Cm that we found in our modeling study does fall within the experimental range reported in the present study.

      Minor

      Line 44. The "simulated EPSP" example in Figure 2C is not a command waveform for an EPSC. Line 526 in the methods states that also ramp currents were used. Please revise to clarify the main text.

      Thank you for bringing this discrepancy to our attention. In the experiments, we used ramp injections. We have made this clear in the main text as follows: ”... we tested orthodromic or forward propagating signal propagation velocity by injecting short-duration current ramps to simulate EPSP (sEPSP) signals in the dendrites and recorded the resultant subthreshold voltage response in the soma”

      Line 522. The authors state the recordings were all carried out "in current clamp mode" but detailed VC method information is lacking. Did they use series resistance compensation?

      We did not use series resistance compensation.

      Line 479. From which region(s) were human "neocortical slices" sampled? Please add this information.

      We have added regions of origin to the Methods section: frontal (n = 21), temporal (n = 20), parietal (n = 20), and occipital (n = 1).  

      Please show higher temporal resolution example traces, for example in Figure 3. Differences are at the micrometer scale, but APs are shown at the millisecond scale. Hard to judge the quality of the data. Showing the command potentials (inset Figure 3D, E) is misleading (see major point #3).

      In response to the reviewer's request, we have redrawn the example traces in Figure 3.

      Please check the labeling of figures. There is information missing. For example, in Figure 5 A to C I am missing information and the units of the axes.

      In the black plots on the right side of panels B and C, the y-axis shows the thickness measurements for the given dendrite stacked on top of each other and the x-axis shows the measurement values, the units for the x-axis are µm as mentioned in the figure legend.

      Line 981 "scalebars" should read "scale bars".

      Line 986 "bootstraped" should read "bootstrapped".

      Done.

      Are the dendritic diameters increased for all basal and apical higher-order branches? It is unclear how the model simulations were built on diameters of primary and higher-order branches.

      In our modelling study we took the actual diameter of the reconstructed PCs in both proximal and higher order branches. We did compare per-distance differences in diameter – but it is automatically incorporated into the computation of the basal load (“equivalent cables” in Figs 6&8).

      The velocity calculation for axonal propagation (yielding a ~0.9 m/s conduction velocity, Figure 2B) is incorrect. Using the peak of the action potentials between soma and axon misses the fact that action potentials start earlier and spatially distally from the soma in the axon. Please revise the calculation to include the temporal delay and actual distance travelled by the forward propagating action potential.

      Thank you for this question. We are aware that the AP is generated at the AIS and that it is located between the two recording electrodes and we have to take into account that the signal propagates from the AIS to the soma and this may shorten the delay in the system. To the best of our knowledge, there is no experimental evidence of the location of the AP generation site on the AIS in layer 2-3 pyramidal cells in the human neocortex, so we assumed that it is located 35 microns from the soma, and that the propagation speed from the AIS to the two directions is the same. Consequently, we have corrected our propagation velocity values as follows:

      “For the axon bleb recordings we assumed that the axon initial segment (AIS) of the cells is 35 µm from the axon hillock, and that the APs propagate forward (to the bleb) and backward (to the soma) at the same speed. For the correction of the AIS we used the following formula: (2)

      where vcorr is the corrected propagation speed for AIS position, l is the axonal distance between the soma and the axon bleb, t is the latency between the two measuring points, and ais is the assumed position of the AIS along the axon (35 µm).”
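      The correction described in this passage can be sketched from its stated assumptions; the exact formula (2) is not reproduced in this response, so the expression below is a reconstruction rather than the authors' equation. If the AP starts at the AIS and travels at the same speed in both directions, it reaches the soma after ais/v and the bleb after (l - ais)/v, so the measured latency is t = (l - 2·ais)/v and hence vcorr = (l - 2·ais)/t. The function name and the example values are hypothetical, and ais is treated here as a distance from the soma along the axon.

```python
def ais_corrected_velocity(l_um, t_ms, ais_um=35.0):
    """Propagation speed (m/s) corrected for AP initiation at the AIS.

    l_um   : axonal distance between soma and bleb (micrometres)
    t_ms   : latency between the somatic and bleb recordings (milliseconds)
    ais_um : assumed distance of the initiation site from the soma (micrometres)
    """
    if t_ms <= 0:
        raise ValueError("latency must be positive")
    # Difference between the path to the bleb and the path to the soma.
    path_difference_um = (l_um - ais_um) - ais_um
    # 1 um/ms equals 1e-3 m/s.
    return path_difference_um / t_ms * 1e-3


# Hypothetical recording: bleb 200 um from the soma, 0.2 ms latency.
uncorrected = 200.0 / 0.2 * 1e-3
corrected = ais_corrected_velocity(200.0, 0.2)
print(f"uncorrected: {uncorrected:.2f} m/s, corrected: {corrected:.2f} m/s")
```

      Under these assumptions the correction lowers the estimate, because part of the soma-to-bleb distance is covered by both wavefronts at the same time.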

      What explains the strongly attenuated axonal action potential at the bleb? Is this representative?

      The strongly attenuated axonal action potential at the bleb can be explained by a few key factors:

      (1) Membrane Integrity: Bleb formation often indicates some level of membrane damage or alteration. This can disrupt the normal ionic gradients across the membrane, leading to a failure in generating or propagating action potentials effectively.

      (2) Current Leakage: Bleb formation may create additional pathways for ion leakage, which can dissipate the electrical current that would normally propagate the action potential. This leakage reduces the overall amplitude of the action potential.

      Line 275 "To our delight", please rephrase.

      Corrected.

      Reviewer #3 (Recommendations For The Authors):

      - In Figure 1, the number of cells used to assess intersomatic distance is quite low. A larger number of neuron pairs should be analyzed to be more representative. Or at least an explanation of why such a low sampling can be conclusive.

      We appreciate the reviewer’s concerns on sample sizes of the first set of experiments, where the anatomical pathways were measured through the synapses of coupled cells with electrophysiological recordings. We acknowledge that this is a limitation of our study. However, in this series of experiments, we simply wanted to experimentally confirm already known results which consisted of two parts: first, that in humans the dendrites and axons of neurons are longer, and second, that they have the same time delay in terms of synaptic latency. 

      The reported similarity in synaptic latencies is consistent with the results of a recent study by Campagnola et al. (2022) showing that EPSP latencies of local connections between layer 2/3 pyramidal cells are in the same range in humans and mice (human median latency = 1.73 ms vs. mouse median latency = 1.49 ms). We came to the same conclusion in our previous work, where we compared synaptically coupled pyramidal cell to basket cell pairs in human and rat (Molnár et al. 2016).

      On the other hand, we report interspecific differences in cable pathways from soma to soma, again consistent with the literature suggesting that the length of pyramidal neural processes is longer in humans than in rodents (see Supplementary Figure 1 and e.g. Berg et al. 2021).

      From a practical point of view the collection of experimental data in this hard won experiment is particularly difficult. The electrophysiological recording of a connected pair with an appropriate pre- and postsynaptic series resistance, where human tissue samples are limited, is the first step here. To obtain information about the path of the signals between pre- and postsynaptic cells, an anatomical reconstruction is required. This requires a) a high-quality recovery of postsynaptic dendrites and presynaptic axons, b) successful tracing of all potential contact points between presynaptic axons and postsynaptic dendrites back to the pre- and postsynaptic soma. The difficulty of the latter point in particular arises from the fact that parts of the presynaptic axonal arbor are myelinated and the success of biocytin-based tracing depends on the length of the myelinated axon branches. The success/failure of complete axonal tracing only becomes apparent at the end of these efforts.

      - The author should provide an intuitive explanation of why capacitive load accelerates propagation in the dendrite.

      See answer above  

      - The author should more clearly rank the contribution of each difference between rat and human neurons. The 10% increase in dendritic diameter which affects velocity only via a square root seems a very weak contribution. This should be clarified.

      We now added a set of statistical methods to perform such a ranking in the theoretical part of this study, as described above (and in a new paragraph, attached above) in the revised article. 
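      The reviewer's square-root argument can be made concrete with a short calculation. It is a sketch that assumes the passive-cable scaling in which propagation velocity is proportional to the square root of the diameter, with all other parameters held fixed; the function name is hypothetical and the calculation is not part of the ranking analysis described above.

```python
from math import sqrt


def relative_velocity_change(diameter_ratio):
    """Fractional change in velocity when the diameter is scaled by diameter_ratio.

    Assumes velocity is proportional to sqrt(diameter), as in a passive cable
    where only the diameter changes.
    """
    if diameter_ratio <= 0:
        raise ValueError("diameter ratio must be positive")
    return sqrt(diameter_ratio) - 1.0


# A 10% thicker dendrite under this assumption.
gain = relative_velocity_change(1.10)
print(f"velocity change: {gain * 100:.1f}%")
```

      Under this scaling alone, a 10% increase in diameter yields a speed-up of roughly 5%, which is why the relative contribution of diameter versus load needs an explicit ranking.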

      References

      Eyal, G., Mansvelder, H. D., de Kock, C. P. J., & Segev, I. (2014). Dendrites impact the encoding capabilities of the axon. Journal of Neuroscience, 34(24), 8063–8071. https://doi.org/10.1523/JNEUROSCI.5431-13.2014

      Friedman, J. H. (2002). Stochastic gradient boosting. In Computational Statistics & Data Analysis (Vol. 38). www.elsevier.com/locate/csda

      Kiebel, S. J., & Holmes, A. P. (2007). The General Linear Model. In K. Friston, J. Ashburner, S. Kiebel, T. Nichols, & P. William (Eds.), Statistical Parametric Mapping (pp. 101–125). Academic Press.

      Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems, 4768–4777.

    1. Author response:

      The following is the authors’ response to the original reviews

      General response 

      Our modeling study integrates recent experimental advances on dendritic physiology, biophysical plasticity rules, and network connectivity motifs into a single model, aiming to clarify their hypothesized inseparable functional roles in neocortical learning. By modelling excitatory plasticity in multi-synaptic connections on dendrites within a network with biologically constrained higher-order structure, we show these aspects are sufficient to account for a wide range of interesting phenomena: First, the calcium-based plasticity rule acted sparsely and specifically, keeping the network stable without requiring homeostatic mechanisms or inhibitory plasticity, as usually employed for models based on STDP rules. Most importantly, simulations of the network initiated in a recurrent-excitation induced synchronous state transitioned to an in vivo-like asynchronous state, and remained there. Second, plastic changes were stimulus-dependent and could be predicted by neurons’ membership in functional assemblies, spatial clustering of synapses on dendrites, and the topology of the network’s connectivity. Several of our predictions could be confirmed by comparison to the MICrONS dataset.

      Our study thus aims to provide a first broad exploration of these phenomena and their interactions in a model, as well as a foundation for future studies that examine specific aspects more deeply. Specific concerns of the reviewers about parameter choices (reviewer 2’s 2nd point - 2.2), claims about stability (2.1 and 3.1), the STDP control (1.5), and the motivation behind network metrics (1.8, 2.3) are addressed in detail below and in the revised manuscript.

      Reviewer #1 (Public review): 

      This paper investigates the dynamics of excitatory synaptic weights under a calcium-based plasticity rule, in long (up to 10 minutes) simulations of a 211,000-neuron biophysically detailed model of a rat cortical network. 

      Strengths 

      (1) A very detailed network model, with a large number of neurons, connections, synapses, etc., and with a huge number of biological considerations implemented in the model. 

      (2) A carefully developed calcium-based plasticity rule, which operates with biologically relevant variables like calcium concentration and NMDA conductances. 

      (3) The study itself is detailed and thorough, covering many aspects of the cellular and network anatomy and properties and investigating their relationships to plasticity. 

      (4) The model remains stable over long periods of simulations, with the plasticity rule maintaining reasonable synaptic weights and not pushing the network to extremes. 

      (5) The variety of insights the authors derive in terms of relationships between the cellular and network properties and dynamics of the synaptic weights are potentially interesting for the field. 

      (6) Sharing the model and the associated methods and tools is a big plus. 

      We thank the reviewer for their comments.

      Weaknesses 

      (1) Conceptually, there seems to be a missed opportunity here in that it is not clear what the network learns to do. The authors present 10 different input patterns, the network does some plasticity, which is then analyzed, but we do not know whether the learning resulted in anything functionally significant. Did the network learn to discriminate the patterns much better than at the beginning, to capture or anticipate the timing of pattern presentation, detect similarities between patterns, etc.? This is important to understand if one wants to assess the significance of synaptic changes due to plasticity. For example, if the network did not learn much new functionally, relative to its initial state, then the observed plasticity could be considered minor and possibly insufficient. In that case, were the network to learn something substantial, one would potentially observe much more extensive plasticity, and the results of the whole study could change, possibly including the stability of the network. While this could be a whole separate study, this issue is of central importance, and it is hard to judge the value of the results when we do not know what the network learned to do, if anything. 

      (1.1) The reviewer raises a very interesting point of discussion. As they remarked, it is very hard to judge what the network learned to do. However, our model was not designed to solve a specific task, and even defining precisely what "learning" entails in a primary sensory region is still an open question. Like many before us, we hypothesized that one of the roles of the primary somatosensory cortex would be to represent stimulus features and that most of the learning process would happen in an unsupervised manner. This is indeed what we have demonstrated by showing the stimulus-specificity of changes as well as an increase in the reliability of assembly sequences between repetitions after plasticity. We have added this to the Discussion in lines 523-525.

      (2) In this study, plasticity occurs only at E-to-E connections but not at others. However, it is well known that inhibitory connections in the cortex exhibit at the very least a substantial short-term plasticity. One would expect that not including these phenomena would have substantial consequences on the results.

      (1.2) This is indeed well known. Please consider that we do have short-term plasticity (called synapse dynamics in the manuscript) at all connections, including inhibitory ones. We thank the reviewer for pointing out this potential confusion in the wording. We have now clarified this  in the Methods in lines: 691-697. Furthermore, we have listed not having long-term plasticity at inhibitory connections in the limitations part of the Discussion in line: 593.

      (3) Lines 134-135: "We calibrated layer-wise spontaneous firing rates and evoked activity to brief VPM inputs matching in vivo data from Reyes-Puerta et al. (2015)."

      (4) Can the authors show these results? It is an important comparison, and so it would be great to see firing rates (ideally, their distributions) for all the cell types and layers vs. experimental data, for the evoked and spontaneous conditions. 

      (1.3) The layer- and cell type specific spontaneous firing rates were indeed hidden in the Methods and on Supplementary Figure S3. We now reference that figure in the Results in line: 136. Furthermore, we have amended Supplementary Figure S3 (panel A2), to show these rates in the evoked state as well.

      (5) That being said, the Reyes-Puerta et al. paper reports firing rates for the barrel cortex, doesn't it? Whereas here, the authors are simulating a non-barrel cortex. Is such a comparison appropriate?

      (1.4) As correctly pointed out by the reviewer, we made the assumption that these rates would generalize to the whole S1 because of the sparsity of experimental data. This assumption is discussed at length in Isbister et al. (2023) and now in the limitations part of the Discussion in lines: 564-568.

      (6) Comparison with STDP on pages 5-7 and Figure 2: if I got this right, the authors applied STDP to already generated spikes, that is, did not run a simulation with STDP. That seems strange. The spikes they use here were generated by the system utilizing their calcium-based plasticity rule. Obviously, the spikes would be different if STDP was utilized instead. The traces of synaptic weights would then also be different. The comparison therefore is not quite appropriate, is it?

      (1.5) Yes, the reviewer's understanding is correct. However, considering the findings of Morrison et al. 2007 [PMID: 17444756], and Zenke et al. 2017 [PMID: 28431369] (cited in the manuscript in lines: 165-166), running STDP in a closed loop simulation would most likely make the network “blow up” because of the positive feedback loop. Thus, we argue that our comparison is more conservative, since by using pre-generated spikes, we opened the loop and avoided positive feedback. This is now further explained in lines: 166-167.
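      The open-loop comparison described above can be sketched as follows: a plasticity rule is evaluated on spike times that were already recorded, so the resulting weight change cannot feed back onto the activity. The sketch uses a classical pair-based STDP rule with exponential windows and all-to-all spike pairing; this choice, the function name, and all parameter values are assumptions for illustration and are not taken from the manuscript.

```python
from math import exp


def stdp_weight_change(pre_spikes, post_spikes,
                       a_plus=0.01, a_minus=0.012,
                       tau_plus=20.0, tau_minus=20.0):
    """Total weight change from pair-based STDP on pre-recorded spike times (ms).

    Pre-before-post pairs potentiate, post-before-pre pairs depress.
    The spike trains are fixed inputs, so the weight never alters them.
    """
    dw = 0.0
    for t_pre in pre_spikes:
        for t_post in post_spikes:
            dt = t_post - t_pre
            if dt > 0:
                dw += a_plus * exp(-dt / tau_plus)
            elif dt < 0:
                dw -= a_minus * exp(dt / tau_minus)
    return dw


# Hypothetical spike times (ms) taken from an already finished simulation.
causal = stdp_weight_change([10.0], [15.0])
acausal = stdp_weight_change([15.0], [10.0])
print(f"pre before post: {causal:+.5f}, post before pre: {acausal:+.5f}")
```

      In a closed-loop simulation the updated weight would change subsequent spiking, which is the positive feedback the open-loop evaluation avoids.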

      (7) Section 2.3 and Figure 5: I am not sure this analysis adds much. The main finding is that plasticity occurs more among cells in assemblies than among all cells. But isn't that expected given what was shown in the previous figures? Specifically, the authors showed that for cells that fire more, plasticity is more prominent. Obviously, cells that fire little or not at all won't belong to any assemblies. Therefore, we expect more plasticity in assemblies.

      (1.6) We thank the reviewer for this comment. We added additional panels (G1 and G2) to Figure 5 (and describe their content in lines: 329-337) showing that this is not the case. Firing rate alone is indeed predictive of plastic changes, but co-firing in assemblies is even more so.

      (8) Section 2.4 and Figure 6: It is not clear that the results truly support the formulation of the section's title ("Synapse clustering contributes to the emergence of cell assemblies, and facilitates plasticity across them") and some of the text in the section. What I can see is that the effect on rho is strong for non-clustered synapses (Figure 6C and Figure S8A). In some cases, it is substantially higher than what is seen for clustered synapses. Furthermore, the wording "synapse clustering contributes to the emergence of cell assemblies" suggests some kind of causal role of clustered synapses in determining which neurons form specific cell assemblies. I do not see how the data presented supports that. Overall, it appears that the story about clustered synapses is quite complicated, with both clustered and non-clustered synapses driving changes in rho across the board. 

      (1.7) We agree with the reviewer that it is “quite complicated”, and we also see that the writing could have been more precise and better supported by the data shown in the figure. We updated both the section title and a large part of the text to take the suggestions into account in lines: 361-373.

      (9) Section 2.5 and Figure 7: Can we be certain that it is the edge participation that is a particularly good predictor of synaptic changes and/or strength, as opposed to something simpler? For example, could it be the overall number of synapses, excitatory synapses, or something along these lines, that the source and/or target neurons receive, that determine the rho dynamics? And then, I do not understand the claim that edge participation allows one to "delineate potentiation from depression". The only related data I can find is in Figure 7A3, about which the authors write "this effect was stronger for potentiation than depression". But I don't see what they mean. For both depression and facilitation, the changes observed are in the range of ~12% of probability values. And even if the effect is stronger, does it mean one can "delineate" potentiation from depression better? What does it mean, to "delineate"? If it is some kind of decoding based on the edge participation, then the authors did not show that.  

      (1.8) We thank the reviewer for this comment. We have included an analysis of the predictive power of the indegree of the pre- and postsynaptic neurons of a connection on the rho dynamics in Figure 7 (panel B). Please consider that the rho dynamics are described at the level of connections, while properties like indegree are at the level of nodes. Any procedure transferring a node-based property to an edge-based property involves choices: e.g., should the values be added or multiplied, should one be preferred over the other, or should they be considered independently? As edge-based metrics avoid these arbitrary choices, we would argue that they are ultimately the simpler and more natural choice in this context.

      Though we believe that the metric of edge participation is simple, we recognize it is perhaps not common. Thus, we have switched to using a version of it that is perhaps more intuitive for the community at large i.e., as a metric of common innervation.  Moreover, we have changed the name “(k+2) edge participation” to “(k)-edge indegree”, to make it even more accessible. For k=0, this is the number of neurons that commonly innervate the connection, i.e., a common neighbour. And for k=1, this is the number of connections that commonly innervate the connection.  This is equivalent to edge participation from the next to last to the last neuron in a simplex.  Furthermore, in lines: 391-418 we have added additional text and references explaining the intuition of why we think this metric is relevant, as it has been shown to affect correlated activity of pairs of neurons, as well as assembly formation.
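      The k=0 case described above can be sketched on a toy graph. The sketch assumes that "commonly innervating" a connection means projecting to both its presynaptic and its postsynaptic neuron; the function name and the example edge list are hypothetical and do not come from the model or the MICrONS dataset.

```python
def common_innervation(edges, source, target):
    """Count neurons that project to both ends of the connection source -> target.

    edges is an iterable of directed (presynaptic, postsynaptic) pairs.
    """
    presynaptic = {}
    for pre, post in edges:
        presynaptic.setdefault(post, set()).add(pre)
    common = presynaptic.get(source, set()) & presynaptic.get(target, set())
    # The two neurons forming the connection are not counted as innervators.
    return len(common - {source, target})


# Hypothetical directed network of five neurons.
edges = [(0, 1), (0, 2), (1, 2), (3, 1), (3, 2), (4, 2)]
print(common_innervation(edges, 1, 2))
print(common_innervation(edges, 0, 1))
```

      Because the count is defined directly on the connection, no rule is needed for combining the properties of its two neurons.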

      Furthermore, we have clarified the language referring to potentiation and depression in lines: 420-422 and 448.

      (10) "test novel predictions in the MICrONS (2021) dataset, which while pushing the boundaries of big data neuroscience, was so far only analyzed with single cells in focus instead of the network as a whole (Ding et al., 2023; Wang et al., 2023)." That is incorrect. For example, the whole work of Ding et al. analyzes connectivity and its relation to the neuron's functional properties at the network level. 

      (1.9) We thank the reviewer for pointing this out. Indeed, the sentence was improperly worded. We have appropriately changed this phrasing in lines: 616-618.

      Reviewer #2 (Public review): 

      Summary: 

      This paper aims to understand the effects of plasticity in shaping the dynamics and structure of cortical circuits, as well as how that depends on aspects such as network structure and dendritic processing. 

      Strengths: 

      The level of biological detail included is impressive, and the numerical simulations appear to be well executed. Additionally, they have done a commendable job in open-sourcing the model.

      We thank the reviewer for their comments.

      Weaknesses: 

      The main result of this work is that activity in their network model remains stable without the need for a homeostatic mechanism. However, as the authors acknowledge, this has been  demonstrated in previous studies (e.g., Higgins et al. 2014). In those studies, stability was attributed to calcium-based rules combined with calcium concentrations at in vivo levels and background neuronal activity. Since the authors use the same calcium-based rule, it is unclear what new result, if any, is being presented. If the authors are suggesting that the mechanism in their simulations differs, that should be stated clearly, and evidence supporting that claim should be provided. 

      (2.1) We do not see this as the main result of our study, but rather as a critical validation step, since our calcium rule, while similar to previous ones, is not exactly the same (see equations (1) and especially (2) in Methods). This has been clarified in the text in lines 150-151. Note in particular that one of the main differences is the stochastic synaptic transmission and the role of calcium concentration in the release probability. Furthermore, our model involves multicompartmental neurons instead of point neuron models, which, to our knowledge, had never been tested with calcium-based plasticity rules at the network level. Moreover, determining the time required for stability to be reached is a necessary step in setting up the simulation parameters to test the main hypotheses about the rules governing the plastic changes.

      The other findings discussed in the paper are related to a characterization of the dependency of plastic changes on network structure. While this analysis is potentially interesting, it has the following limitations. 

      First, I believe the authors should include an analysis of the generality and specificity of their results. All the findings seem to be derived from a single run of the simulation. How do the results vary with different network initializations, simulation times, or parameter choices? 

      (2.2) All simulations were run with 3 different random seeds (mentioned in the Methods), and results for some selected analyses are now shown in Supplementary Figure S8. The maximum duration of our simulations was limited by our hardware constraints. However, from the long (10-minute) simulation we concluded that most changes happen within the first minute. This is how we determined 2 minutes as the simulation time for all other experiments. Parameters determining both the spontaneous and evoked network state are discussed at length in Isbister et al. (2023). While we acknowledge that they are only shown in Supplementary Figure S3, we did not want to lengthen the manuscript with redundant details, and instead refer the reader to the manuscript where this is discussed at length.

      Crucially, we tried slightly different parameters of the plasticity model in the early phases of the research, and while they changed the exact numerical values of our results, the main trends (i.e., stabilization time, assemblies, synapse clustering, and network topology influencing plastic changes) remained unchanged. This is now shown in Supplementary Figure S13 and referenced in the Discussion in lines: 572-575.

      Second, the presentation of the results is difficult to follow. The characterization comes across as a long list of experiments, making it hard to identify a central message or distinguish key findings from minor details. The authors provide little intuition about why certain outcomes arise, and the complexity of the simulation makes it challenging - if not impossible - to determine which model elements are essential for specific results and which mechanisms drive emergent properties. Additionally, the text often lacks crucial details. For instance, the description of k-edge participation should be expanded, and an explanation of what this method quantifies should be included. Overall, I believe the authors should focus on a smaller set of significant results and provide a more in-depth discussion. 

      (2.3) We acknowledge the complexity of these large-scale simulations and the interpretation of their results. We appreciate the reviewer's feedback on the areas that needed more detail. To address this, we have extended the Results section describing k-edge indegree with more background and intuition in lines: 391-418. See also our reply to reviewer 1 (1.8) above. 

      While the manuscript may appear to be "a long list of experiments," it is actually guided by the following logic. We chose a calcium-based rule because it was the natural choice in a multicompartmental model that already included calcium dynamics and NMDA receptors. After setting up the main network state, verifying stability (Figure 2), doing traditional basic analysis (Figure 3), and verifying that the changes are non-random (Figure 4), we elaborated on long-standing ideas about how co-firing in cell assemblies (Figure 5) and spatial clustering of synapses on dendrites (Figure 6) interact with plasticity. Finally, as we had access to the network’s non-random connectivity, we tried to link the network's topology to the observed plastic changes. This was done from a higher-order perspective, given previous evidence for the relevance of these structures to co-firing and correlated activity.

      While we understand the frustration, we would highlight that the study is the first of its kind at this scale and level of biological detail. Our goal was to offer a broad exploration of the factors influencing plasticity and their interactions at this scale, thus laying the groundwork for future studies to investigate specific aspects more deeply.

      The comparison of the model with the MICrONS dataset could be improved. In Figure 7B, the authors should show how the same quantification looks in a network model without plasticity. In Figure 8B, the data aligns with the model before plasticity, so it's unclear how this serves as a verification of the theoretical predictions.

      (2.4) Our only claim is that, because we are used to working with both functional and structural data, we were able to develop a metric (k-edge indegree) that can also be used to study the non-random, high-order topology of the MICrONS connectivity. In Figure 8, spike correlations in MICrONS more or less align with both cases (before vs. after plasticity); however, the spike correlations in the model looked different enough between the two cases that we thought both were worth showing. Moreover, as the changes are sparse (Figures 2 and 3), the synapse strength panel of Figure 7(D) looks almost exactly the same before plasticity (see the first two panels of Author response image 1). In line with our results, the small but significant changes increase as k-edge indegree increases (last panel of Author response image 1). As the first two panels look almost the same and the third is shown in a slightly different way in Figure 7C2, we would prefer to include this only in our response and not in the manuscript.

      Author response image 1.

      Reviewer #3 (Public review): 

      Summary: 

      Ecker et al. utilized a biologically realistic, large-scale cortical model of the rat's non-barrel somatosensory cortex, incorporating a calcium-dependent plasticity rule to examine how various factors influence synaptic plasticity under in vivo-like conditions. Their analysis characterized the resulting plastic changes and revealed that key factors, including the co-firing of stimulus-evoked neuronal ensembles, the spatial organization of synaptic clusters, and the overall network topology, play an important role in affecting the extent of synaptic plasticity. 

      Strengths: 

      The detailed, large-scale model employed in this study enables the evaluation of diverse factors across various levels that influence the extent of plastic changes. Specifically, it facilitates the assessment of synaptic organization at the subcellular level, network topology at the macroscopic level, and the co-activation of neuronal ensembles at the activity level. Moreover, modeling plasticity under in vivo-like conditions enhances the model's relevance to experiments. 

      We thank the reviewer for their comments.

      Weaknesses: 

      (1) The authors claimed that, under in vivo-like conditions and in the presence of plasticity, firing rates and weight distributions remain stable without additional homeostatic mechanisms during a 10-minute stimulation period. However, the weights do not reach the steady state immediately after the 10-minute stimulation. Therefore, extended simulations are necessary to substantiate the claim. 

      (3.1) We thank the reviewer for this comment, as it gave us the opportunity to clarify in the text our stabilization criteria. Indeed, the dynamical system of weight changes has not reached a zero-change steady state because the changes, while small, are non-zero. However, in a stochastic system with ongoing activity (stimulus- or noise-driven), non-zero changes are expected. Thus, we consider the system to be at steady state when changes become negligible relative to a null model given by a random walk. Our results show that this condition is met around the 2-minute mark, with negligible changes in the subsequent 8 minutes.

      Moreover, for spontaneous activity, we showed that an unstable network exhibiting synchronous activity can be stabilized into an asynchronous regime by the calcium-based plasticity rule within 10 minutes. These results show that the system reaches a stochastic steady state within 10 minutes without requiring homeostatic mechanisms. Our work reveals that incorporating more biological detail (i.e., calcium-based plasticity) reduces the need for additional mechanisms to stabilize network activity (e.g., fast homeostatic mechanisms).

      Interestingly, one might argue that after 10 minutes of stimulation the network might transition to a different weight configuration if the stimuli change or cease. We agree this is an intriguing question, which we added to the Discussion in lines 611-613. However, this scenario concerns continuous learning, not the system’s steady-state dynamics.

      (2) Another major limitation of the paper lies in its lack of mechanistic insights into the observed phenomena (particularly on aspects that are typically impossible to assess in traditional simplified models, like layer-specific and layer-to-layer pathways-specific plasticity changes), as well as the absence of discussions on the potential computational implications of the corresponding observed plastic changes.

      (3.2) Our study integrates recent experimental advances, aiming to clarify their hypothesized inseparable functional roles in neocortical learning. In particular, we study three different kinds of mechanistic insight: co-firing in assemblies (Figure 5), synapse clustering on postsynaptic dendrites (Figure 6), and high-order network topology (Figure 7). Furthermore, layer specificity is shown (Figure 3A1, B1, B2, D1) and so is layer-to-layer specificity (Figure 4A2). Synapse clustering on postsynaptic dendrites (Figure 6) is likewise not available in simplified models.

      As such, the mechanistic insights provided in our work are integrative in nature and aim to provide a first broad exploration of these phenomena and their interactions, which are rarely considered together in experimental or modelling studies. This foundation paves the way for future studies that examine specific aspects more deeply at this level of biological detail.

      Reviewer #1 (Recommendations for the authors):

      (1) I would suggest the authors explain more explicitly that their study uses plasticity for E-to-E connections and not others. Doing so in multiple places in the paper, but certainly in Methods and early in Results, would be helpful. This is stated in lines 117-119 ("To simulate long-term plasticity, we integrated our recently published calcium-based plasticity model that was used to describe functional long-term potentiation and depression between pairs of pyramidal cells"), but could be highlighted more.

      We have added it to several lines in the Methods: 621, 648, 649.

      (2) "Simulations were always repeated at least three times to assess the consistency of the results." This sounds important. How is this used for the analysis? Do the results reported combine the data from the 3 simulations? How did the authors check the "consistency of the results"? Did they run any statistical tests comparing the results between the 3 simulations or was it more of a visual check?

      The reported results come from a single simulation. Three simulations were run to check that no obvious qualitative differences could be found, such as a change of network regime or a change in the association between stimuli and assemblies. No meaningful statistical tests can be run with a sample size of three. These simulations are now shown in Supplementary Figure S8, and additional clarifying text has been added in Methods, line 722.

      (3) "We needed 12M core hours to run the simulation presented in this manuscript." The Methods section mentions ~2.4 M core hours for a 10-minute simulation, which may be confusing. It might be helpful to provide a table with all the simulations run for this study.

      We wanted to provide a rough estimate of the runtime, but did not run a deep profiling of all campaigns. The results depend on the actual hardware and configurations used (e.g., temporal resolution of synapse reporting).  We understand the potential source of confusion and have clarified this in the Methods in lines 719-721 (and took it out from the Discussion).

      Reviewer #2 (Recommendations for the authors):

      (1) I found the paper somewhat challenging to follow, as there are many small points, making it unclear what the main message is. It sometimes feels like a list of 'we did this and found that.' It might be helpful if the authors focused on a smaller number of key results with more in-depth discussion. For instance, the discussion of network topology on page 9 is intriguing but condensed into a single, dense paragraph that is hard to follow. Clarifying how the random control is generated would also be beneficial.

      See our response to the public review’s third point (2.3).

      (2) Line 245: typo? "Furthermore, the maximal simplex dimension found in the subgraph was two higher than expected by chance.".

      We changed the grammar in line: 249.

      (3) Line 410: typo? "It has been previously shown before that  assemblies have many edges".

      Noted and fixed in line: 463.

      Reviewer #3 (Recommendations for the authors):

      (1) The authors claimed that plasticity operates in a sparse and specific manner, with firing rates and weight distributions remaining stable without additional homeostatic mechanisms. However, as shown in Figure 2D inset, the weights do not reach their steady-state values immediately after the 10-minute stimulation. A similar issue is observed in Figure 2G. It would be necessary to show the claim is indeed true as the weights reach the steady states.

      See our response to the public review’s first point (3.1).

      (2) In the model, synapses undergo both short- and long-term plasticity, but the contribution of short-term plasticity to the stated claim is unclear. It would be helpful to demonstrate how the results of Figure 2 are affected when short-term plasticity is excluded.

      STP is needed to achieve the asynchronous in vivo-like firing state in our model (and is intimately linked to the fitting procedure of the plasticity rules - mean-field approximation is not possible due to the important role of synaptic failures in thresholded plasticity outcomes), thus it cannot be excluded. We have added this to the Methods in lines: 691-697.

      (3) It would be helpful to include a supplementary plot, similar to Figure 2F, illustrating the corresponding results for STDP.

      This is not possible, as we did not run a different simulation with STDP; we only evaluated the changes in connections with an STDP model using spikes from our simulation. We did not incorporate the STDP equations into our detailed network, as there is no canonical or unambiguous way of doing so (e.g., one would need to handle the fact that the connections are multi-synaptic). Note, however, that considering the findings of Morrison et al. 2007 [PMID: 17444756] and Zenke et al. 2017 [PMID: 28431369] (cited in the manuscript in lines 165-166), running STDP in a closed-loop simulation would most likely make the network “blow up” because of the positive feedback loop.

      (4) It would be helpful to provide mechanistic insights into the current observations and to discuss the potential computational implications of the observed plastic changes. Particularly on aspects that are typically impossible to examine in traditional models, like layer-specific plastic changes presented in Fig. 3A1, B1, B2, D1, and layer-to-layer pathways-specific plastic changes illustrated in Figure 4A2.

      See our response to the public review’s second point (3.2).

      (5) The use of the term 'assembly' in most places of the manuscript may cause confusion. To enhance clarity and foster effective discussions in the field, I would recommend replacing it with 'ensemble,' as suggested in Miehl et al. (2023), 'Formation and computational implications of assemblies in neural circuits' (The Journal of Physiology, 601(15), 3071-3090), which should also be cited.

      We read the mentioned manuscript when it was published (and appreciated it a lot), now reference it, and explain why we did not exactly follow the suggestion in lines: 293-299.

      (6) The title of Figure 5 is not directly supported by the current figure. To strengthen the alignment, it would be helpful to present the results from lines 303-306 in bar plots and incorporate them into Figure 5 to better substantiate the figure title.

      While the mentioned lines compare maximum values to those within the whole dataset, we think those 2*12*12 values are better presented in condensed matrices than in bar plots (while the maximum values are still easily grasped from the colorbars). We have added panel G2 to the figure to address a comment by reviewer 1 (1.7), and we believe that this further supports the title of the Figure.

      (7) Line 326, cite "Kirchner, J. H., & Gjorgjieva, J. (2021). Emergence of local and global synaptic organization on cortical dendrites. Nature Communications, 12(1), 4005." and "Kirchner, J. H., & Gjorgjieva, J. (2022). Emergence of synaptic organization and computation in dendrites. Neuroforum, 28(1), 21-30."

      Although we were aware of the mentioned manuscripts, we did not include them originally because they are models of a different species. However, we have now cited these in line: 347.

      (8) The contrast results for ensembles 11 and 12 do not appear to support the claims made in lines 339-341. Clarification on this point would be helpful.

      The reviewer is right. We have updated lines 360-361 to clarify the difference between the two late assemblies.

      (9) For Figure 6C and 6D in Section 2.4, rather than presenting the results for individual ensembles (which could be moved to the supplementary materials), it would be easier if the authors could summarize the results by grouping them into three categories: early, middle, and late ensembles.

      We agree with the reviewer’s suggestion and tried it before, but as the results also depend slightly on functional assembly size (not only on temporal order), averaging them loses information (see the different xlims of the panels). Given that the issue is complex, we decided to show all the data in the Figure, but we have now revised the text to provide a more high-level interpretation.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      Comment#1: Ren et al developed a novel computational method to investigate cell evolutionary trajectory for scRNA-seq samples. This method, MGPfact, estimates pseudotime and potential branches in the evolutionary path by explicitly modeling the bifurcations in a Gaussian process. They benchmarked this method using synthetic as well as real-world samples and showed superior performance for some of the tasks in cell trajectory analysis. They further demonstrated the utilities of MGPfact using single-cell RNA-seq samples derived from microglia or T cells and showed that it can accurately identify the differentiation timepoint and uncover biologically relevant gene signatures. Overall I think this is a useful new tool that could deliver novel insights for the large body of scRNA-seq data generated in the public domain. The manuscript is written in a logical way and most parts of the method are well described.

      Thank you for reviewing our manuscript and for your positive feedback on MGPfact. We are pleased that you find it useful for identifying differentiation timepoints and uncovering gene signatures. We will continue to refine MGPfact and explore its applications across diverse datasets. Your insights are invaluable, and we appreciate your support.

      Comment#2: Some parts of the methods are not clear. It should be outlined in detail how pseudo time T is updated in Methods. It is currently unclear either in the description or Algorithm 1.

      We thank the reviewer for this comment. We have added a description of how pseudotime T is obtained in lines 138 to 147 of the article. In brief, the pseudotime of MGPfact is inferred through Gaussian process regression on the downsampled single-cell transcriptomic data. Specifically, T is treated as a continuous variable representing the progression of cells through the differentiation process. We describe the relationship between pseudotime and expression data using the formula:

      $$Y^{*} = f(T) + \varepsilon$$

      where f(T) is a Gaussian process (GP) with covariance matrix S, and ε represents the error term. The Gaussian process is defined as:

      $$f(T) \sim \mathcal{GP}(0, S), \qquad \varepsilon \sim \mathcal{N}(0, \sigma^{2} I)$$

      where σ² is the variance, set to 1e-6.

      During inference, we update the pseudotime by maximizing the posterior likelihood. Specifically, the posterior distribution of pseudotime T can be represented as:

      $$P(T \mid Y^{*}) \propto P(Y^{*} \mid T)\, P(T)$$

      where P(Y* | T) is the likelihood function of the observed data Y*, and P(T) is the prior distribution of the Gaussian process. This posterior distribution integrates the observed data with model priors, enabling inference of pseudotime and trajectory simultaneously. Due to the high autocorrelation of T in the posterior distribution, we use Adaptive Metropolis within Gibbs (AMWG) sampling (Roberts and Rosenthal, 2009; Tierney, 1994). Other parameters are estimated using the more efficient slice sampling technique (Neal, 2003).
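      To give a feel for the sampling step, the following is a minimal sketch of adaptive Metropolis-within-Gibbs on a toy target. It is not the MGPfact implementation: the function name, the batch size of 50, the adaptation step of at most 0.01 on the log proposal width, the 44% target acceptance rate, and the standard-normal toy log-posterior are all assumptions chosen for illustration.

```python
import math
import random


def amwg_sample(log_post, init, n_iter=5000, batch=50, seed=0):
    """Update one coordinate at a time with a Gaussian random-walk proposal and
    tune each proposal width toward roughly 44% acceptance after every batch."""
    rng = random.Random(seed)
    x = list(init)
    log_sd = [0.0] * len(x)   # log proposal standard deviation per coordinate
    accepted = [0] * len(x)   # acceptances per coordinate in the current batch
    current = log_post(x)
    samples = []
    for it in range(1, n_iter + 1):
        for i in range(len(x)):
            proposal = x[:]
            proposal[i] += rng.gauss(0.0, math.exp(log_sd[i]))
            cand = log_post(proposal)
            # Metropolis acceptance rule for a symmetric proposal
            if rng.random() < math.exp(min(0.0, cand - current)):
                x, current = proposal, cand
                accepted[i] += 1
        if it % batch == 0:
            # Diminishing adaptation: widen if accepting too often, else narrow
            delta = min(0.01, (it // batch) ** -0.5)
            for i in range(len(x)):
                log_sd[i] += delta if accepted[i] / batch > 0.44 else -delta
                accepted[i] = 0
        samples.append(x[:])
    return samples


def toy_log_post(t):
    """Toy stand-in for log P(T | Y*): independent standard normals."""
    return -0.5 * sum(v * v for v in t)


samples = amwg_sample(toy_log_post, init=[3.0, -3.0])
kept = samples[1000:]  # discard burn-in
mean_first = sum(s[0] for s in kept) / len(kept)
```

      Coordinate-wise updates keep each proposal one-dimensional, which is what makes per-coordinate tuning of the proposal width straightforward when the coordinates are strongly autocorrelated.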

      Comment#3: There should be a brief description in the main text of how synthetic data were generated, under what hypothesis, and specifically how bifurcation is embedded in the simulation.

      We thank the reviewer for this comment. We have added descriptions of the synthetic datasets to the Methods section. The revised content is in lines 487 to 493:

      “The synthetic datasets were generated using four simulators: dyngen (Saelens et al., 2019), dyntoy (Saelens et al., 2019), PROSSTT (Papadopoulos et al., 2019), and Splatter (Zappia et al., 2017), each modeling different trajectory topologies such as linear, branching, and cyclic. Splatter simulates branching events by setting expression states and transition probabilities, dyntoy generates random expression gradients to reflect dynamic changes, and dyngen focuses on complex branching structures within gene regulatory networks.”

      Comment#4: Please explain what the abbreviations mean at their first occurrence.

      We appreciate the reviewers' feedback. We have thoroughly reviewed the entire manuscript and made sure that all abbreviations have had their full forms provided upon their first occurrence.

      Comment#5: In the benchmark analysis (Figures 2/3), it would be helpful to include a few trajectory plots of the real-world data to visualize the results and to evaluate the accuracy.

      We appreciate the reviewer's feedback. To more clearly demonstrate the performance of MGPfact, we selected three representative cases from the dataset for visual comparison. These cases represent different types of trajectory structures: linear, bifurcation, and multifurcation. The revised content is between line 220 and 226.

      As shown in Supplementary Fig. 5, it is evident that MGPfact excels in capturing main developmental paths and identifying key bifurcation points. In the linear trajectory structure, MGPfact accurately predicted the linear structure without bifurcation events, showing high consistency with the ground truth (overall=0.871). In the bifurcation trajectory structure, MGPfact accurately captured the main bifurcation event (overall=0.636). In the multifurcation trajectory structure, although MGPfact predicted only one bifurcation point, its overall structure remains close to the ground truth, as evidenced by its high overall score (overall=0.566). Overall, MGPfact demonstrates adaptability and accuracy in reconstructing various types of trajectory structures.

      Comment#6: It is not clear how this method selects important genes/features at bifurcation. This should be elaborated on in the main text.

      We thank the reviewer for this comment. To enhance understanding, we have added detailed descriptions of gene selection in the main text and appendix, specifically in lines 150 to 161. In brief, MGPfact employs a Gaussian process mixture model to infer cell fate trajectories and identify independent branching events. We calculate loading matrices using formulas 1 and 14 to assess each gene's contribution to the trajectories. Genes with an absolute weight greater than 0.05 are considered predominant in specific branching processes. Subsequently, SCENIC (Aibar et al., 2017; Bravo González-Blas et al., 2023) analysis was conducted to further infer the underlying regulons and annotate the biological processes of these genes.
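      The thresholding step can be illustrated with a short sketch. Only the 0.05 cutoff on the absolute weight comes from the text above; the function name, the dictionary layout, and the weight values are hypothetical and are not taken from MGPfact.

```python
def select_branch_genes(loadings, threshold=0.05):
    """Return, per branching process, the genes whose absolute weight exceeds threshold."""
    selected = {}
    for branch, weights in loadings.items():
        selected[branch] = sorted(
            gene for gene, w in weights.items() if abs(w) > threshold
        )
    return selected


# Hypothetical loading values, for illustration only.
loadings = {
    "bifurcation_1": {"geneA": 0.12, "geneB": -0.07, "geneC": 0.01},
    "bifurcation_2": {"geneA": 0.03, "geneB": 0.05, "geneC": -0.20},
}
selected = select_branch_genes(loadings)
```

      Because the comparison is strict, a gene with a weight of exactly 0.05 is not selected, and negative weights are selected on their magnitude.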

      Comment#7: It is not clear how survival analysis was performed in Figure 5. Specifically, were critical confounders, such as age, clinical stage, and tumor purity controlled?

      To evaluate the predictive and prognostic impacts of the selected genes, we utilized the Cox multivariate regression model, where the effects of relevant covariates, including age, clinical stage, and tumor purity, were adjusted. We then conducted the Kaplan-Meier survival analysis again to ensure the reliability of the results. The revisions mainly include the following sections:

      (1) We modified the description of adjusting for confounding factors in the survival analysis, from line 637 to 640:

      “To adjust for possible confounding effects, the relevant clinical features including age, sex and tumor stage were used as covariates. The Cox regression model was implemented using R-4.2 package “survival”. And we generated Kaplan-Meier survival curves based on different classifiers to illustrate differences in survival time and report the statistical significance based on Log-rank test.”

      (2) We updated the images in the main text regarding the survival analysis, including Fig. 5a-b, Fig. 6c, and Supplementary Fig. 8e.

      Comment#8: I recommend that the authors perform some sort of 'robustness' analysis for the consensus tree built from the bifurcation Gaussian process. For example, subsample 80% of the cells to see if the bifurcations are similar between each bootstrap.

      We appreciate the reviewer's feedback. We performed a robustness analysis of the consensus tree using 100 training datasets. This involved sampling the original data at different proportions, and then calculating the topological similarity between the consensus trajectory predictions of MGPfact and those without sampling, using the Hamming-Ipsen-Mikhailov (HIM) metric. A higher score indicates greater robustness. The relevant figure is Supplementary Fig. 4, and the description is in the main text in lines 177 to 182.

      The results indicate that the consensus trajectory predictions based on various sampling proportions of the original data maintain a high topological similarity with the unsampled results (HIM<sub>mean</sub>=0.686). This demonstrates MGPfact’s robustness and generalizability under different data conditions, hence the capability of capturing bifurcative processes in the cells’ trajectory.

      Reviewer #2:

      Comment#1: The authors present MGPfact<sup>XMBD</sup>, a novel model-based manifold-learning framework designed to address the challenges of interpreting complex cellular state spaces from single-cell RNA sequences. To overcome current limitations, MGPfact<sup>XMBD</sup> factorizes complex development trajectories into independent bifurcation processes of gene sets, enabling trajectory inference based on relevant features. As a result, it is expected that the method provides a deeper understanding of the biological processes underlying cellular trajectories and their potential determinants. MGPfact<sup>XMBD</sup> was tested across 239 datasets, and the method demonstrated similar to slightly superior performance in key quality-control metrics to state-of-the-art methods. When applied to case studies, MGPfact<sup>XMBD</sup> successfully identified critical pathways and cell types in microglia development, validating experimentally identified regulons and markers. Additionally, it uncovered evolutionary trajectories of tumor-associated CD8+ T cells, revealing new subtypes with gene expression signatures that predict responses to immune checkpoint inhibitors in independent cohorts. Overall, MGPfact<sup>XMBD</sup> represents a relevant tool in manifold learning for scRNA-seq data, enabling feature selection for specific biological processes and enhancing our understanding of the biological determinants of cell fate.

      Thank you for your thoughtful review of our manuscript. We are thrilled to hear that you find MGPfact<sup>XMBD</sup> beneficial for exploring cellular evolutionary paths in scRNA-seq data. Your insights are invaluable, and we look forward to incorporating them to further enrich our study. Thank you once again for your support and constructive feedback.

      Comment#2: How the methods compare with existing Deep Learning based approaches such as TIGON is a question mark. If a comparison would be possible, it should be conducted; if not, it should be clarified why.

      We appreciate the reviewer's comments. We have added a comparison with the scTour (Li, 2023) and TIGON (Sha, 2024) methods.

      It is important to note that the encapsulation and comparison of MGPfact are based on traditional differentiation trajectory construction. Saelens et al. established a systematic evaluation framework that categorizes differentiation trajectory structures into topological subtypes such as linear, bifurcation, multifurcation, graph, and tree, focusing on identifying branching structures in the cell differentiation process (Saelens et al., 2019). The scTour and TIGON methods mentioned by the reviewer are primarily used for estimating RNA velocity, focusing on continuous temporal evolution rather than explicit branching structures, and do not explicitly model branches. Therefore, we considered the predictions of these two methods as linear trajectories and compared them with MGPfact. While scTour explicitly estimates pseudotime, TIGON uses the concept of "growth," which is analogous to pseudotime, so we made the necessary adaptations.

      Author response image 1 shows that within this framework, compared to scTour (overall<sub>mean</sub>=0.448) and TIGON (overall<sub>mean</sub>=0.263), MGPfact still maintains a relatively high standard (overall<sub>mean</sub>=0.534). This indicates that MGPfact has a significant advantage in accurately capturing branching structures in cell differentiation, especially in applications where explicit modeling of branches is required.

      Author response image 1.

      Comparison of MGPfact with scTour and TIGON in trajectory inference performance across 239 test datasets. a. Overall scores; b. F1<sub>branches</sub>; c. HIM; d. cor<sub>dist</sub>; e. wcor<sub>features</sub>. All results are color-coded based on the trajectory types, with the black line representing the mean value. The “Overall” assessment is calculated as the geometric mean of all four metrics.
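      For readers unfamiliar with this aggregation, a minimal sketch of a geometric-mean overall score follows. The function name and the example values are hypothetical; returning 0 when any metric is non-positive is an assumption made here so that the function is defined for all inputs.

```python
from statistics import geometric_mean


def overall_score(f1_branches, him, cor_dist, wcor_features):
    """Geometric mean of the four metrics; 0.0 if any metric is not positive."""
    metrics = [f1_branches, him, cor_dist, wcor_features]
    if any(m <= 0 for m in metrics):
        return 0.0
    return geometric_mean(metrics)


# Hypothetical metric values, for illustration only.
balanced = overall_score(0.5, 0.5, 0.5, 0.5)
one_weak = overall_score(0.25, 1.0, 1.0, 1.0)
```

      Unlike an arithmetic mean, the geometric mean penalizes a method that does poorly on any single metric: in the example, three perfect scores and one score of 0.25 give roughly 0.71 rather than 0.81.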

      Comment#3: Missing Methods:

      - The paper lacks a discussion of Deep Learning approaches for bifurcation analysis. e.g. scTour, Tigon.

      - I am missing comments on methods such CellRank, and alternative approaches to delineate a trajectory.

      We thank the reviewer for these comments.

      (1) As mentioned in our response to Comment#2, the scTour and TIGON methods are primarily used for estimating RNA velocity, focusing on continuous temporal evolution rather than explicit branching structures, and they do not explicitly model branches. We therefore treat the predictions of these two methods as linear trajectories and compare them with MGPfact. The relevant description and discussion are provided in that response.

      (2) We have added a description of RNA velocity estimation methods (scTour, TIGON, CellRank) in the introduction section. The revised content is from line 66 to 71:

      “Moreover, recent studies based on RNA velocity has provided insights into cell state transitions. These methods measure RNA synthesis and degradation rates based on the abundance of spliced and unspliced mRNA, such as CellRank (Lange et al., 2022). Nevertheless, current RNA velocity analyses are still unable to resolve cell-fates with complex branching trajectory. Deep learning methods such as scTour (Li, 2023) and TIGON (Sha, 2024) circumvent some of these limitations, offering continuous state assumptions or requiring prior cell sampling information.”

      Comment#4: Impact of MURP:

      The rationale for using MURP is well-founded, especially for trajectory definition. However, its impact on the final results needs evaluation.

      How does the algorithm compare with a random subselection of cells or the entire cell set?

      Thank you for the comments. We fully agree that MURP is crucial in trajectory prediction. As a downsampling method, MURP is specifically designed to address noise issues in single-cell data by dividing the data into several subsets, thereby maximizing noise reduction while preserving the main structure of biological variation (Ren et al., 2022). In MGPfact, MURP typically reduces the data to fewer than 100 downsampled points, preserving the core biological structure while lowering computational complexity. To assess MURP's impact, we conducted experiments by randomly selecting 20, 40, 60, 80, and 100 cells for trajectory inference. These results were mapped back to the original data using the KNN graph structure for final predictions, which were then compared with the MURP downsampling results. Supplementary results can be found in Supplementary Fig. 3, with additional descriptions in the main text from line 170 to 176.

      The results indicate that trajectory inference using randomly sampled cells has significantly lower prediction accuracy compared to that using MURP. This is particularly evident in branch assignment (F1<sub>branches</sub>) and correlation cor<sub>dist</sub>, where the average levels decrease by 20.5%-64.9%. In contrast, trajectory predictions using MURP for downsampling show an overall score improvement of 5.31%-185%, further highlighting MURP's role in enhancing trajectory inference within MGPfact.
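
      The random-subsampling control described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the MGPfact implementation: `random_subsample` and `map_back_nearest` are hypothetical helper names, cells are plain coordinate tuples, and a 1-nearest-neighbor lookup with Euclidean distance stands in for the KNN graph mentioned in the response.

```python
import math
import random


def random_subsample(cells, n_points, seed=0):
    """Pick n_points cell indices uniformly at random, without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(cells)), n_points))


def map_back_nearest(cells, sampled_idx, sampled_values):
    """Give every cell the value inferred for its nearest sampled cell.

    Assumption: a 1-nearest-neighbor rule replaces the KNN graph used in
    the actual analysis.
    """
    mapped = []
    for cell in cells:
        nearest = min(
            range(len(sampled_idx)),
            key=lambda j: math.dist(cell, cells[sampled_idx[j]]),
        )
        mapped.append(sampled_values[nearest])
    return mapped


# Toy data: six cells along a line in a 2-D embedding (hypothetical values).
cells = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0), (2.0, 0.0), (2.1, 0.0)]
sampled_idx = [0, 2, 4]               # fixed here so the result is easy to follow
sampled_pseudotime = [0.0, 0.5, 1.0]  # stand-in for inferred pseudotime

print(map_back_nearest(cells, sampled_idx, sampled_pseudotime))
# prints [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
```

      In a real comparison, `sampled_idx` would come from `random_subsample(cells, n_points)` with `n_points` set to 20, 40, 60, 80, or 100, and the mapped values would be scored against the reference trajectory.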

      Comment#5: What is the impact of the number of components selected?

      Thank you for the comments. In essence, MGPfact consists of two main steps: 1) trajectory inference; 2) calculation of factorized scores and identification of high-weight genes. After step 1, MGPfact estimates parameters such as pseudotime T and bifurcation points B. In step 2, we introduce a rotation matrix to obtain factor scores W<sub>l</sub> for each trajectory l by rotating Y*.

      For all trajectories,

      where e<sub>l</sub> is the error term for the l-th trajectory. The number of features in Y* must match the dimensions of the rotation matrix R to ensure the factorized score matrix W contains factor scores for all trajectories, achieving effective feature representation and interpretation in the model.

      Additionally, to further illustrate the impact of the number of principal components (PCs) on model performance in step 1, we conducted additional experiments. We used 3 PCs as the default and adjusted the number to evaluate changes from this baseline. As shown in Author response image 2, setting the number of PCs to 1 significantly decreases the overall performance score (overall<sub>mean</sub>=0.363), as well as the wcor<sub>features</sub> and cor<sub>dist</sub> metrics. In contrast, increasing the number of PCs does not significantly affect the metrics. It should be noted that the number of components used should be determined by the intrinsic biological characteristics of the cell fate determination process. Our experiment, based on a limited number of datasets, may not represent more complex scenarios in other cell types.

      Author response image 2.

      Robustness testing of the number of MURP PCA components on 100 training datasets. With the number of principal components (PCs) set to 3 by default, we tested the impact of different numbers of components (1-10) on the prediction results. In all box plots, the asterisk represents the mean value, while the whiskers extend to the farthest data points within 1.5 times the interquartile range. Significance is denoted as follows: not annotated indicates non-significant; * P < 0.05; ** P < 0.01; *** P < 0.001; two-sided paired Student’s t-tests.

      Comment#6: Please comment on the selection of the kernel functions (rbf and polynomial) and explain why other options were discarded.

      Thank you for the comments. We have added a description regarding the selection of radial basis functions and polynomial kernels in lines 126-130. As the reviewers mentioned, the choice of kernel functions is crucial in the MGPfact analysis pipeline for constructing the covariance matrix of the Gaussian process. We selected the radial basis function (RBF) kernel and the polynomial kernel to balance capturing data complexity and computational efficiency. The RBF kernel is chosen for its ability to effectively model smooth functions and capture local variations in the data, making it well-suited to the continuous and smooth characteristics of biological processes; its hyperparameters offer modeling flexibility. The polynomial kernel is used to capture more complex nonlinear relationships between input features, with its hyperparameters also allowing further customization of the model. In contrast, other complex kernels, such as Matérn or spectral kernels, were omitted due to their interpretability challenges and the risk of overfitting with limited data. However, as suggested by the reviewers, we will consider and test the impact of other kernel functions on the covariance matrix of the Gaussian process and their role in trajectory inference in our subsequent phases of algorithm design.
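
      For readers less familiar with these kernels, the sketch below shows the standard textbook forms of the RBF and polynomial kernels and how either one yields a covariance matrix over pseudotime points. The parameter names (`length_scale`, `variance`, `degree`, `offset`) and their default values are assumptions made for illustration; the exact parameterization and priors used in MGPfact are not specified in this response.

```python
import math


def rbf_kernel(t1, t2, length_scale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel: smooth, decays with distance."""
    return variance * math.exp(-((t1 - t2) ** 2) / (2.0 * length_scale ** 2))


def polynomial_kernel(t1, t2, degree=2, offset=1.0):
    """Polynomial kernel: captures nonlinear trends of a fixed degree."""
    return (t1 * t2 + offset) ** degree


def covariance_matrix(times, kernel, **params):
    """Evaluate a kernel on every pair of time points."""
    return [[kernel(a, b, **params) for b in times] for a in times]


pseudotime = [0.0, 0.5, 1.0]  # hypothetical pseudotime values
K_rbf = covariance_matrix(pseudotime, rbf_kernel, length_scale=0.5)
K_poly = covariance_matrix(pseudotime, polynomial_kernel, degree=2)

for row in K_rbf:
    print([round(v, 4) for v in row])
```

      The RBF entries depend only on the difference between two time points, which is what gives the smooth, local behavior described above, whereas the polynomial entries grow with the magnitude of the inputs.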

      Comment#7: What is the impact of the Pseudotime method used initially? This section should be expanded with clear details on the techniques and parameters used in each analysis.

      We are sorry for the confusion. We have added a description of how pseudotime T is obtained in lines 138 to 147 of the main text. The specific hyperparameters involved in the model and their prior settings are detailed in the supplementary information.

      In brief, the pseudotime and related topological parameters of the bifurcative trajectories in MGPfact are inferred by Gaussian process regression from downsampled single-cell transcriptomic data (MURP). Specifically, T is treated as a continuous variable representing the progression of cells through the differentiation process. We describe the relationship between pseudotime and expression data as:

      where f(T) is a Gaussian Process (GP) with covariance matrix S, and ε represents the error term. The Gaussian process is defined as:

      where the variance is set to 1e-6. During inference, we update the pseudotime by maximizing the posterior likelihood. Specifically, the posterior distribution of pseudotime is obtained by combining the observed data Y* with the prior distribution of the Gaussian process model.

      We use the Markov Chain Monte Carlo method for parameter estimation, particularly employing the adaptive Metropolis-within-Gibbs (AMWG) sampling to handle the high autocorrelation of pseudotime.
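
      As a rough illustration of the sampling scheme, the sketch below follows the generic adaptive Metropolis-within-Gibbs recipe of Roberts and Rosenthal (2009): each coordinate is updated in turn with a Gaussian random-walk proposal, and the log proposal scale is nudged after every batch toward an acceptance rate of about 0.44. The function name `amwg_sample`, the batch size of 50, and the toy standard-normal target are assumptions for illustration; how MGPfact configures its sampler is not described here.

```python
import math
import random


def amwg_sample(log_post, init, n_iter=5000, batch=50, target_acc=0.44, seed=0):
    """Adaptive Metropolis-within-Gibbs with per-coordinate proposal scales."""
    rng = random.Random(seed)
    x = list(init)
    dim = len(x)
    log_sd = [0.0] * dim   # log of each coordinate's proposal std. dev.
    accepted = [0] * dim   # acceptances within the current batch
    current = log_post(x)
    samples = []
    for it in range(1, n_iter + 1):
        for i in range(dim):
            proposal = list(x)
            proposal[i] += rng.gauss(0.0, math.exp(log_sd[i]))
            candidate = log_post(proposal)
            # Metropolis rule for a symmetric proposal; 1 - random() is in (0, 1].
            if math.log(1.0 - rng.random()) < candidate - current:
                x, current = proposal, candidate
                accepted[i] += 1
        samples.append(list(x))
        if it % batch == 0:
            # Diminishing adaptation: step size shrinks with the batch number.
            delta = min(0.01, (it // batch) ** -0.5)
            for i in range(dim):
                if accepted[i] / batch > target_acc:
                    log_sd[i] += delta
                else:
                    log_sd[i] -= delta
                accepted[i] = 0
    return samples


def log_standard_normal(v):
    """Toy target: independent standard normals (up to a constant)."""
    return -0.5 * sum(c * c for c in v)


draws = amwg_sample(log_standard_normal, init=[3.0, -3.0], n_iter=6000)
kept = draws[1000:]  # discard burn-in
means = [sum(d[i] for d in kept) / len(kept) for i in range(2)]
print(means)
```

      Updating one coordinate at a time is what makes this kind of scheme practical for strongly autocorrelated parameters such as pseudotime, since each coordinate receives its own tuned step size.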

      Comment#8: Enhancing Readability: For clarity, provide intuitive descriptions of each evaluation function used in simulated and real data. The novel methodology performs well for some metrics but less so for others. A clear understanding of these measurements is essential.

      To address the concern of readability, we have added descriptions of the 5 evaluation metrics to the methodology section (Benchmarking MGPfact to state-of-the-art methods) in lines 494 to 515. Additionally, we have included a summary and discussion of these metrics in the conclusion section in lines 214 to 240 to help readers better understand the significance and impact of these measurements.

      (1) In brief, the Hamming-Ipsen-Mikhailov (HIM) distance measures the similarity between topological structures, combining the normalized Hamming distance and the Ipsen-Mikhailov distance, which focus on edge length differences and degree distribution similarity, respectively. The F1<sub>branches</sub> is used to assess the accuracy of a model's branch assignment via Jaccard similarity between branch pairs. In trajectory inference, cor<sub>dist</sub> quantifies the similarity of inter-cell distances between predicted and true trajectories, evaluating the accuracy of cell ordering. The wcor<sub>features</sub> assesses the similarity of key features through weighted Pearson correlation, capturing biological variation. The Overall score is calculated as the geometric mean of these metrics, providing an assessment of overall performance.

      (2) MGPfact and the other seven methods included in the comparison each have their own focus. MGPfact specializes in factorizing complex cell trajectories using Gaussian process mixture models, making it particularly capable of identifying bifurcation events. Therefore, it excels in the accuracy of branch partitioning and similarity of trajectory topology. Among other methods, scShaper (Smolander et al., 2022) and TSCAN (Ji and Ji, 2016) are more suited for generating linear trajectories and excel in linear datasets, accurately predicting pseudotime. The Monocle series, as typical representatives of tree methods, effectively capture complex topologies and are suitable for analyzing cell data with diversified differentiation paths.
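
      To make the Overall score in item (1) concrete, the following sketch computes the geometric mean of the four metrics, together with the Jaccard similarity that underlies F1<sub>branches</sub>. It assumes each metric has already been scaled to [0, 1] with higher values being better; the function names and example values are hypothetical.

```python
import math


def jaccard(a, b):
    """Jaccard similarity between two sets of cells (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def overall_score(him, f1_branches, cor_dist, wcor_features):
    """Geometric mean of the four metrics, each assumed to lie in [0, 1]."""
    metrics = (him, f1_branches, cor_dist, wcor_features)
    if any(m < 0.0 or m > 1.0 for m in metrics):
        raise ValueError("each metric is expected to lie in [0, 1]")
    return math.prod(metrics) ** (1.0 / len(metrics))


print(jaccard({"c1", "c2", "c3"}, {"c2", "c3", "c4"}))  # prints 0.5
print(overall_score(0.9, 0.8, 0.7, 0.0))               # prints 0.0
```

      Because the geometric mean is a product, a method that scores zero on any single metric receives an Overall score of zero, so a strong result on one metric cannot compensate for a complete failure on another.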

      Comment#9: Microglia Analysis: In Figures 3A-C, the genes mentioned in the text for each bifurcation do not always match those shown in the panels. Please confirm this.

      Thank you for pointing this out. We have carefully reviewed the article and corrected the error where the genes shown in the figures did not correspond to the descriptions in the text. The specific corrections have been made between lines 257 and 264:

      “The first bifurcation determines the differentiated cell fates of PAM and HM, which involves a set of notable marker genes of both cell types, such as Apoe, Selplg (HM), and Gpnmb (PAM). The second bifurcation determines the proliferative status, which is crucial for the development and function of PAM and HM (Guzmán, n.d.; Li et al., 2019). The genes affected by the second bifurcation are associated with cell cycle and proliferation, such as Mki67, Tubb5, Top2a. The third bifurcation influences the development and maturity of microglia, of which the highly weighted genes, such as Tmem119, P2ry12, and Sepp1 are all previously annotated markers for establishment of the fates of microglia (Anderson et al., 2022; Li et al., 2019) (Supplementary Table 4).”

      Comment#10: Regulons:

      - The conclusions rely heavily on regulons. The Methods section describes using SCENIC, GENIE3, RCisTarget, and AUCell, but their relation to bifurcation analysis is unclear.

      - Do you perform trajectory analysis on all MURP-derived cells or within each identified trajectory based on bifurcation? This point needs clarification to make the outcomes comprehensible. The legend of Figure 4 provides some ideas, but further clarity is required.

      Thank you for the comments.

      (1) To clarify, we used tools such as SCENIC to annotate the highly weighted genes (HWG) resulting from the bifurcation analysis for transcription factor regulatory activity and possible impacts on biological processes. We have added descriptions to the analysis of our microglial data. The revised content is between lines 265 and 266:

      “Moreover, we retrieved highly active regulons from the HWG by MGPfact, of which the significance is quantified by the overall weights of the member genes.”

      (2) We apologize for any confusion caused by our description. It is important to clarify that we performed an overall trajectory analysis on all MURP results, rather than analyzing within each identified trajectory. Specifically, we first used MURP to downsample all preprocessed cells, where each MURP subset represents a group of cells. We then conducted trajectory inference on all MURP subsets and identified bifurcation points. This process generated multiple independent differentiation trajectories, encompassing all MURP subsets. To clearly convey this point, we have added descriptions in the legend of Figure 4. The revised content is between lines 276 and 283:

      “Fig. 4. MGPfact reconstructed the developmental trajectory of microglia, recovering known determinants of microglia fate. a-c. The inferred independent bifurcation processes with respect to the unique cell types (color-coded) of microglia development, where phase 0 corresponds to the state before bifurcation; and phases 1 and 2 correspond to the states post-bifurcation. Each colored dot represents a metacell of unique cell type defined by MURP. The most highly weighted regulons in each trajectory were labeled by the corresponding transcription factors (left panels). The HWG of each bifurcation process include a set of highly weighted genes (HWG), of which the expression levels differ significantly among phases 1, 2, and 3 (right panels).”

      Comment#11: CD8+ T Cells: The comparison is made against Monocle2, the method used in the publication, but it would be beneficial to compare it with more recent methods. Otherwise, the added value of MGPfact is unclear.

      Per your request, we have expanded our comparative analysis to include not only Monocle2 but also more recent methods such as Monocle3 (Cao et al., 2019) and scFates Tree (Faure et al., 2023). We used adjusted R-squared values to evaluate each method's ability to explain trajectory variation. The results have been added to Table 2 and Supplementary Table 6. The revised content is between lines 318 and 326:

      We assessed the goodness-of-fit (adjusted R-square) of the consensus trajectory derived by MGPfact and three methods (Monocle 2, Monocle 3 and scFates Tree) for the CD8+ T cell subtypes described in the original studies (Guo et al., 2018; Zhang et al., 2018). The data showed that MGPfact significantly improved the explanatory power for most CD8+ T cell subtypes over Monocle 2, which was used in the original studies (P < 0.05, see Table 2 and Supplementary Table 6), except for the CD8-GZMK cells in the CRC dataset. Additionally, MGPfact demonstrated better explanatory power in specific cell types when compared to Monocle 3 and scFates Tree. For instance, in the NSCLC dataset, MGPfact exhibited higher explanatory power for CD8-LEF1 cells (Table 2, R-squared = 0.935), while Monocle 3 and scFates Tree perform better in other cell types.
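
      For reference, the adjusted R-squared used in this comparison penalizes the ordinary R-squared for the number of predictors. The sketch below gives the standard formula; the sample size and predictor count in the example are hypothetical, since the regression setup behind Table 2 is not detailed in this response.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / dof


# Hypothetical example: R^2 = 0.5 from 11 cells and a single predictor.
print(round(adjusted_r_squared(0.5, 11, 1), 4))  # prints 0.4444
```

      With many cells and few predictors the adjustment is small, so differences between methods mainly reflect how well each inferred trajectory explains the variation in the cell subtype being tested.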

      Comment#12: Consensus Trajectory: A panel explaining how the consensus trajectory is generated would be helpful. Include both visual and textual explanations tailored to the journal's audience.

      Thank you for the comments. Regarding how the consensus trajectory is constructed, we have illustrated and described this in Figure 1 and the supplementary methods. Taking the reviewers' suggestions into account, we have added more details about the generation process of the consensus trajectory in the methods section to enhance the completeness of the manuscript. The revised content is from line 599 to 606:

      “Following MGPfact decomposition, we obtained multiple independent bifurcative trajectories, each corresponds to a binary tree within the temporal domain. These trajectories were then merged to construct a coherent diffusion tree, representing the consensus trajectory of cells’ fate. The merging process involves initially sorting all trajectories by their bifurcation time. The first (earliest) bifurcative trajectory is chosen as the initial framework, and subsequent trajectories are integrated to the initial framework iteratively by adding the corresponding branches at the bifurcation timepoints. As a result, the trajectories are ultimately merged into a comprehensive binary tree, serving as the consensus trajectory.”

      Comment#13: Discussion:

      - Check for typos, e.g., line 382 "pseudtime.".

      - Avoid considering HVG as the entire feature space.

      - The first three paragraphs are too similar to the Introduction. Consider shortening them to succinctly state the scenario and the implications of your contribution.

      Thank you for pointing out the typos.

      (1) We conducted a comprehensive review of the document to ensure there are no typographical errors.

      (2) We restructured the first three paragraphs of the discussion section to clarify the limitations in the use of current manifold-learning methods and removed any absolute language regarding treating HVGs as the entire feature space. The revised content is from line 419 to 430:

      “Single-cell RNA sequencing (scRNA-seq) provides a direct, quantitative snapshot of a population of cells in certain biological conditions, thereby revealing the actual cell states and functions. Although existing clustering and embedding algorithms can effectively reveal discrete biological states of cells, these methods become less efficient when depicting continuous evolving of cells over the temporal domain. The introduction of manifold learning offers a new dimension for discovery of relevant biological knowledge in cell fate determination, allowing for a better representation of continuous changes in cells, especially in time-dependent processes such as development, differentiation, and clonal evolution. However, current manifold learning methods face major limitations, such as the need for prior information on pseudotime and cell clustering, and lack of explainability, which restricts their applicability. Additionally, many existing trajectory inference methods do not support gene selection, making it difficult to annotate the results to known biological entities, thereby hindering the interpretation of results and subsequent functional studies.”

      Comment#14: Minor Comments:

      (1) Review the paragraph regarding the "current manifold-learning methods are faced with two major challenges." The message needs clarification.

      (2) Increase the quality of the figures.

      (3) Update the numbering of equations from #(.x) to (x).

      We thank the reviewer for these detailed suggestions.

      (1) We have thoroughly revised the discussion section, addressing overly absolute statements. The revised content is from line 426 to 428:

      “However, current manifold learning methods face major limitations, such as the need for prior information on pseudotime and cell clustering, and lack of explainability, which restricts their applicability.”

      (2) We conducted a comprehensive review of the figures in the article to more clearly present our results.

      (3) We have meticulously reviewed the equations in the article to ensure there are no display issues with the indices.

      References

      Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine J-C, Geurts P, Aerts J, van den Oord J, Atak ZK, Wouters J, Aerts S. 2017. SCENIC: single-cell regulatory network inference and clustering. Nat Methods 14:1083–1086. doi:10.1038/nmeth.4463

      Anderson SR, Roberts JM, Ghena N, Irvin EA, Schwakopf J, Cooperstein IB, Bosco A, Vetter ML. 2022. Neuronal apoptosis drives remodeling states of microglia and shifts in survival pathway dependence. Elife 11:e76564.

      Bravo González-Blas C, De Winter S, Hulselmans G, Hecker N, Matetovici I, Christiaens V, Poovathingal S, Wouters J, Aibar S, Aerts S. 2023. SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks. Nat Methods. doi:10.1038/s41592-023-01938-4

      Cao J, Spielmann M, Qiu X, Huang X, Ibrahim DM, Hill AJ, Zhang F, Mundlos S, Christiansen L, Steemers FJ, Trapnell C, Shendure J. 2019. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566:496–502. doi:10.1038/s41586-019-0969-x

      Faure L, Soldatov R, Kharchenko PV, Adameyko I. 2023. scFates: a scalable python package for advanced pseudotime and bifurcation analysis from single-cell data. Bioinformatics 39:btac746. doi:10.1093/bioinformatics/btac746

      Guo X, Zhang Y, Zheng L, Zheng C, Song J, Zhang Q, Kang B, Liu Z, Jin L, Xing R, Gao R, Zhang L, Dong M, Hu X, Ren X, Kirchhoff D, Roider HG, Yan T, Zhang Z. 2018. Global characterization of T cells in non-small-cell lung cancer by single-cell sequencing. Nat Med 24:978–985. doi:10.1038/s41591-018-0045-3

      Guzmán AU. n.d. Single-cell RNA sequencing of spinal cord microglia in a mouse model of neuropathic pain.

      Ji Z, Ji H. 2016. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res 44:e117–e117. doi:10.1093/nar/gkw430

      Lange M, Bergen V, Klein M, Setty M, Reuter B, Bakhti M, Lickert H, Ansari M, Schniering J, Schiller HB, Pe’er D, Theis FJ. 2022. CellRank for directed single-cell fate mapping. Nat Methods 19:159–170. doi:10.1038/s41592-021-01346-6

      Li Q. 2023. scTour: a deep learning architecture for robust inference and accurate prediction of cellular dynamics. Genome Biology.

      Li Q, Cheng Z, Zhou L, Darmanis S, Neff NF, Okamoto J, Gulati G, Bennett ML, Sun LO, Clarke LE, Marschallinger J, Yu G, Quake SR, Wyss-Coray T, Barres BA. 2019. Developmental Heterogeneity of Microglia and Brain Myeloid Cells Revealed by Deep Single-Cell RNA Sequencing. Neuron 101:207-223.e10. doi:10.1016/j.neuron.2018.12.006

      Neal RM. 2003. Slice sampling. The Annals of Statistics 31:705–767.

      Papadopoulos N, Gonzalo PR, Söding J. 2019. PROSSTT: probabilistic simulation of single-cell RNA-seq data for complex differentiation processes. Bioinformatics 35:3517–3519. doi:10.1093/bioinformatics/btz078

      Ren J, Zhang Q, Zhou Y, Hu Y, Lyu X, Fang H, Yang J, Yu R, Shi X, Li Q. 2022. A downsampling method enables robust clustering and integration of single-cell transcriptome data. Journal of Biomedical Informatics 130:104093. doi:10.1016/j.jbi.2022.104093

      Roberts GO, Rosenthal JS. 2009. Examples of adaptive MCMC. Journal of Computational and Graphical Statistics 18:349–367.

      Saelens W, Cannoodt R, Todorov H, Saeys Y. 2019. A comparison of single-cell trajectory inference methods. Nat Biotechnol 37:547–554. doi:10.1038/s41587-019-0071-9

      Sha Y. 2024. Reconstructing growth and dynamic trajectories from single-cell transcriptomics data 6.

      Smolander J, Junttila S, Venäläinen MS, Elo LL. 2022. scShaper: an ensemble method for fast and accurate linear trajectory inference from single-cell RNA-seq data. Bioinformatics 38:1328–1335. doi:10.1093/bioinformatics/btab831

      Tierney L. 1994. Markov chains for exploring posterior distributions. The Annals of Statistics 1701–1728.

      Zappia L, Phipson B, Oshlack A. 2017. Splatter: simulation of single-cell RNA sequencing data. Genome Biol 18:174. doi:10.1186/s13059-017-1305-0

      Zhang L, Yu X, Zheng L, Zhang Y, Li Y, Fang Q, Gao R, Kang B, Zhang Q, Huang JY, Konno H, Guo X, Ye Y, Gao S, Wang S, Hu X, Ren X, Shen Z, Ouyang W, Zhang Z. 2018. Lineage tracking reveals dynamic relationships of T cells in colorectal cancer. Nature 564:268–272. doi:10.1038/s41586-018-0694-x

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this study, Yang et al. investigated the locations and hierarchies of NFATc1+ and PDGFRα+ cells in dental and periodontal mesenchyme. By combining intersectional and exclusive reporters, they attempted to distinguish among NFATc1+PDGFRα+, NFATc1+PDGFRα-, and NFATc1- PDGFRα+ cells. Using tissue clearing and serial section-based 3D reconstruction, they mapped the distribution atlas of these cell populations. Through DTA-induced ablation of PDGFRα+ cells, they demonstrated the crucial role of PDGFRα+ cells in the formation of the odontoblast cell layer and periodontal components.

      Thank you for your valuable comments and suggestions, which have greatly enhanced the quality of this research article. The manuscript has been significantly revised in accordance with the reviewers’ comments. All necessary experimental conditions and required data have been included, and all the questions and considerations have been well-addressed in the revised manuscript and supporting information.

      Main issues:

      (1) The authors did not quantify the contribution of PDGFRα+ cells or NFATc1+ cells to dental and periodontal lineages in PDGFRαCreER; Nfatc1DreER; LGRT mice. Zsgreen+ cells represented PDGFRα+ cells and their lineages. Tomato+ cells represented NFATc1+ cells and their lineages. Tomato+Zsgreen+ cells represented NFATc1+PDGFRα+ cells and their lineages. Conducting immunostaining experiments with lineage markers is essential to determine the physiological contributions of these cells to dental and periodontal homeostasis.

      Thank you for your question; we apologize for the insufficient description. Figure S9 provides a statistical analysis of the number of PDGFR-α+ cells, NFATc1+ cells, and PDGFR-α+&NFATc1+ cells in the dental pulp and periodontal ligament (PDL). The results allow for a clear comparison of the contributions of single-positive and double-positive cells to both tissues. Additionally, the tracing results showed whether these three cell populations have the capacity to produce progeny cells. We further supplemented the analysis with immunofluorescence results of double-positive cells to identify their cell types, selecting AlphaV as the marker for mesenchymal stem cells (MSCs) and CD45 as the marker for hematopoietic cells. This part is further discussed in the manuscript as follows:

      Page 14-15 in the revised manuscript, “To identify the population of PDGFR-α+ and NFATc1+ co-expressing cells in the pulp and periodontal ligament (PDL), we generated Pdgfr-aCreER; Nfatc1DreER; R26-LSL-RSR-tdT-DTR (LRTD) mice... Strong tdTomato signals were detected in both the PDL (Figure S22B) and pulp (Figure S22C). With respect to the MSC-specific marker AlphaV, we observed AlphaV+tdTomato+ cells in both regions. Additionally, CD45+ (hematopoietic marker) tdTomato+ cells were also present in these areas (Figure S22B, C). These findings suggest that the population of PDGFR-α+ and NFATc1+ co-expressing cells is heterogeneous.”

      (2) The authors attempted to use PDGFRαCreER; Nfatc1DreER;IR1 mice to illustrate the hierarchies of NFATc1+ and PDGFRα+ cells. According to the principle of the IR1 reporter, it requires sequential induction of PDGFRα-CreER and Nfatc1-DreER to investigate their genetic relationship. Upon induction by tamoxifen, NFATc1+PDGFRα- cells and NFATc1-PDGFRα+ cells were labeled by Tomato and Zsgreen, respectively. However, the reporter expression of NFATc1+PDGFRα+ cells was uncertain, most likely random. Therefore, the hierarchical relationship of NFATc1+ and PDGFRα+ cells cannot be reliably determined from PDGFRαCreER; Nfatc1DreER; IR1 mice.

      Thank you for your question. We have supplemented the control group (Pdgfr-αCreER; IR1) experimental data (Figure 8). By comparing these data with the results of the Pdgfr-αCreER; Nfatc1DreER; LGRT tracing assays, we confirmed that the expression pattern and range of PDGFR-α+ cells in the pulp and PDL of Pdgfr-αCreER; IR1 mice are consistent with those observed in Pdgfr-αCreER; Nfatc1DreER; LGRT mice (Figure 6), and the same applies to NFATc1+ cells. All of our experimental results have been repeated multiple times. In addition, the IR1 system was initially developed by Professor Bin Zhou's lab and was validated for feasibility and stability in a paper published in Nature Medicine in 2017 (https://doi.org/10.1038/nm.4437). Moreover, Professor Zhou Bo O's team applied the IR1 dual-recombinase system to bone lineage tracing in a paper published in Cell Stem Cell in 2021, which also confirmed its feasibility and stability (DOI: 10.1016/j.stem.2021.08.010).

      Reviewer #2 (Public Review):

      Summary:

      Yang et al. present an article investigating the spatiotemporal atlas of NFATc1+ and PDGFR-α+ cells within the dental and periodontal mesenchyme. The study explores their capacity for progeny cell generation and their relationships - both inclusive and hierarchical - under homeostatic conditions. Utilizing the Cre/loxP-Dre/Rox system to construct tool mice, combined with tissue transparency and continuous tissue slicing for 3D reconstruction, the researchers effectively mapped the distribution of NFATc1+ and PDGFR-α+ cells. Additionally, in conjunction with DTA mice, the study provides preliminary validation of the impact of PDGFR-α+ cells on dental pulp and periodontal tissues. Primarily, this study offers an in-situ distribution atlas for NFATc1+ and PDGFR-α+ cells but provides limited information regarding their origin, fate differentiation, and functionality.

      We would like to thank the reviewer for the positive assessment of our study. In response to the many constructive suggestions, the manuscript has been revised to improve the quality of this study. All the necessary discussions have also been added, and all the questions and concerns have been addressed in the revised manuscript. The point-by-point reply to the comments is listed below:

      Strengths:

      (1) Tissue transparency techniques and continuous tissue slicing for 3D reconstruction, combined with transgenic mice, provide high-quality images and rich, reliable data.

      (2) The Cre/loxP and Dre/Rox systems used by the researchers are powerful and innovative.

      (3) The IR1 lineage tracing model is significantly important for investigating cellular differentiation pathways.

      (4) This study provides effective spatial distribution information of NFATc1+/PDGFR-α+ cell populations in the dental and periodontal tissues of adult mice.

      Weaknesses:

      (1) In the functional experiment section, the investigation into the role of NFATc1+/PDGFR-α+ cell populations is somewhat lacking.

      Thank you so much for your comments and suggestions. We have supplemented the analysis with immunofluorescence results of double-positive cells to identify NFATc1+&PDGFR-α+ cell populations, selecting AlphaV as the marker for mesenchymal stem cells (MSCs) and CD45 as the marker for hematopoietic cells. This part is shown below:

      Page 14-15 in the revised manuscript, “To identify the population of PDGFR-α+ and NFATc1+ co-expressing cells in the pulp and periodontal ligament (PDL), we generated Pdgfr-aCreER; Nfatc1DreER; R26-LSL-RSR-tdT-DTR (LRTD) mice… Strong tdTomato signals were detected in both the PDL (Figure S22B) and pulp (Figure S22C). With respect to the MSC-specific marker AlphaV, we observed AlphaV+tdTomato+ cells in both regions. Additionally, CD45+ (hematopoietic marker) tdTomato+ cells were also present in these areas (Figure S22B, C). These findings suggested that the population of PDGFR-a+ and NFATc1+ co-expressing cells is heterogeneous.”

      We also supplemented the discussion regarding the role of the PDGFR-α+ population on page 17, where its potential role in pulp and periodontal formation is also suggested.

      Page 17 in the revised manuscript, “After ablating PDGFR-α+ cells, we observed damage to the odontoblast layer and shrinkage of the pulp core in dental pulp tissue, indicating that PDGFR-α+ cells contribute to the composition of dental pulp tissue, particularly the odontoblast layer (Figure. 9C, D). In the periodontal ligament, we noted a reduction and destruction of collagen fibers, suggesting a role for PDGFR-α+ cells in periodontal tissue structure (Figure. 9E, F).”

      (2) The author mentions that 3D reconstruction of consecutive tissue slices can provide more detailed information on cell distribution, so what is the significance of using tissue-clearing techniques in this article?

      Thank you for your insightful comment, and we are sorry for the insufficient statement here. In our study, the utilization of tissue clearing techniques was to address some of the shortcomings associated with the 3D reconstruction of consecutive tissue slices, such as the compromised integrity of samples due to section layering, leading to discontinuities along the z-axis and potential loss of positive signals (Fig. S5, S13). Additionally, unavoidable tissue damage during the sectioning process may result in the loss of some information. As one of the most advanced imaging technologies currently available, tissue clearing/imaging allows for direct observation of the spatial location and relationships of fluorescently labeled cells within the intact tissue, which is more persuasive. Also, evolving beyond the analysis of structural and molecular biology of selected tissue sections, and expanding the focus to entire organs and organisms, is a trend in the development of the biomedical field (Nat Methods. 2024 Jul;21(7):1153-1165; Nat Commun. 2024 Feb 26;15(1):1764). Admittedly, no method is flawless; thus, our employment of two advanced imaging approaches aims to answer questions regarding the spatial positioning and relationships of PDGFR-α single-positive, NFATc1 single-positive cells, and PDGFR-α+ NFATc1+ cells from multiple perspectives. This is done to enhance the credibility and persuasiveness of our results.

      We greatly appreciate your suggestions, which have significantly complemented the content of our article. The corresponding statements have been added to the revised manuscript as below:

      Page 6 in the revised manuscript, “As one of the most advanced imaging technologies currently available, tissue clearing/imaging allows for direct observation of the spatial location and relationships of fluorescently labeled cells within the intact tissue. Therefore, according to the existing SUMIC tissue deep clearing (TC) methods, we modified and improved a rapid and efficient procedure, which enable rapid single-cell resolution and quantitative panoptic 3D light-sheet imaging.”

      (3) After reading the entire article, it is confusing whether the purpose of the article is to explore the distribution and function of NFATc1+/PDGFR-α+ cells in teeth and periodontal tissues, or to compare the differences between tissue clearing techniques and 3D reconstruction of continuous histological slices using NFATc1+/PDGFR-α+ cells?

      We sincerely appreciate your question and apologize for any ambiguous descriptions.

      The purpose of our study is to map the atlas of the inclusive, exclusive, and hierarchical distribution of NFATc1+/PDGFR-α+ cells in dental and periodontal mesenchyme. Under this premise, the two advanced imaging techniques were employed merely as means to address this question. Indeed, in the previous manuscript, we overemphasized the comparison of tissue clearing techniques with 3D reconstruction of continuous slices, which led to unnecessary misunderstandings, for which we sincerely apologize. Consequently, in this version of the manuscript, we have reduced the descriptions comparing their advantages and disadvantages and focused instead on exploring the importance of NFATc1+/PDGFR-α+ cells. We appreciate your suggestions once again.

      Page 6 in the revised manuscript, “These two 3D-reconstruction and imaging technologies complement each other to jointly address the spatial positioning and hierarchical relationships of PDGFR-α+, NFATc1+, and PDGFR-α+ NFATc1+ cells from multiple perspectives.”

      (4) The researchers did not provide a clear definition of the cell types of NFATc1+/PDGFR-α+ cells in teeth and periodontal tissues.

      Thanks for your suggestions. We discovered through cell ablation experiments that the removal of PDGFR-α+ cells resulted in the destruction of the odontoblast layer in the dental pulp, shrinkage of the pulp core, and disruption of collagen fibers in the periodontal ligament. Combined with the results from lineage tracing, we conclude that PDGFR-α+ cells primarily constitute the mesenchymal cells that form the supporting tissues in both the dental pulp and periodontal ligament (Part 4.1). Through immunofluorescence staining, with AlphaV as the marker for mesenchymal stem cells (MSCs) and CD45 as the marker for hematopoietic cells, we observed that the double-positive cell population was a heterogeneous group, containing both MSCs and hematopoietic cells (Part 4.2).

      (5) In studies related to long bones, the author defines the NFATc1+/PDGFR-α+ cell population as SSCs, which as a stem cell group should play an important role in tooth development or injury repair. However, the distribution patterns and functions of the NFATc1+/PDGFR-α+ cell population in these two conditions have not been discussed in this study.

      Thanks for your suggestions. The NFATc1+/PDGFR-α+ cell population was identified as playing an important role in tissue regeneration, especially in oral and maxillofacial tissues. Our research primarily focuses on the identification of NFATc1+ and PDGFR-α+ cells within dental and periodontal mesenchyme, highlighting their contribution to tissue homeostasis and regeneration. Although the NFATc1+/PDGFR-α+ cells were characterized in the context of other tissue types, their detailed role in tooth development and injury repair remains an area for further exploration.

      This part was further discussed on page 17-18 in the revised manuscript, “Cell ablation and immunofluorescence staining experiments further characterized the types and functions of PDGFR-α+/PDGFR-α+&NFATc1+ populations. After ablating PDGFR-α+ cells, we observed damage to the odontoblast layer and shrinkage of the pulp core in dental pulp tissue, indicating that PDGFR-α+ cells contribute to the composition of dental pulp tissue, particularly the odontoblast layer (Figure. 9C, D). In the periodontal ligament, we noted a reduction and destruction of collagen fibers, suggesting a role for PDGFR-α+ cells in periodontal tissue structure (Figure. 9E, F). Previous results confirmed the presence of double-positive cells in both dental pulp and periodontal tissues and provided insights into their hierarchical relationships in the periodontal ligament (Figure. 8). To further investigate the double-positive cell population, we developed an inducible dual-editing enzyme reporter system to label these cells with tdTomato signals. Using AlphaV as a marker for mesenchymal stem cells (MSCs) and CD45 for hematopoietic cells, we found that double-positive cells included components of both MSCs and hematopoietic cells (Figure S22B, C), indicating a heterogeneous population. Further experiments are necessary to determine whether the predominant role in this co-positive MSC population is played by PDGFR-α+ or NFATc1+ and to clarify the specific functions of these cells in the future.”

      Reviewer #3 (Public Review):

      Summary:

      This groundbreaking study provided the most advanced transgenic lineage tracing and advanced imaging techniques in deciphering dental/periodontal mesenchyme cells. In this study, authors utilized CRISPR/Cas9-mediated transgenic lineage tracing techniques to concurrently demonstrate the inclusive, exclusive, and hierarchical distributions of NFATc1+ and PDGFR-α+ cells and their lineage commitment in dental and periodontal mesenchyme.

      Strengths:

      In cooperating with tissue clearing-based advanced imaging and three-dimensional slices reconstruction, the distribution and hierarchical relationship of NFATc1+ and PDGFR-α+ cells and progeny cells plainly emerged, which undoubtedly broadens our understanding of their in vivo fate trajectories in craniomaxillofacial tissue. Also, the experiment design is comprehensive and well-executed, and the results are convincing and compelling.

      Weaknesses:

      Minor modifications could be made to the paper, including more details on the advantages of the methodology used by the authors in this study, compared to other studies.

      Thanks for your constructive comments and advice on how to improve the quality of this research article. We have thoroughly and carefully corrected the manuscript based on your suggestion, and all the necessary data have been added to support our claims. Meanwhile, all the questions and concerns have been well-addressed in the revised manuscript and the revised supplementary information. Thus, we believe that the quality of this paper has been significantly enhanced. We thank you again for your great efforts.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Line 134, the authors categorized the reporter systems into three types: intersectional reporters, exclusive reporters, and nested reporters. However, Figure 1A does not depict the nested reporters.

      Thanks for your helpful recommendation to improve the quality of this manuscript, and we are sorry for the mistake. In this revised manuscript, we have modified the content of Figure 1A, as displayed below:

      (2) Line 238, the authors mentioned that NFATc1 is expressed in the mandible and periodontal tissues based on their previous sequencing analyses. It would be better to cite the related reference or display the expression of NFATc1 in the Supplemental Figures.

      Thanks for your suggestions. We sincerely apologize for the omission that occurred during the writing process and have revised the original text on page 9 to:

      “The previous sequencing analyses have reported the expression of NFATc1 in mandible and periodontal tissues20. (DOI: 10.1177/00220345221074356)”

      (3) Line 264, the figure callout "Figure 5E" does not exist, and the figure legends of Figure 5 contain the same error.

      We greatly appreciate your rigor and diligence, and we have corrected this error.

      (4) Line 280, the figure callout "Figure S12" is incorrect.

      Thank you for your efforts, and we are sorry for our negligence. The corresponding descriptions have been amended as below:

      Page 10 in the revised manuscript, “Consistent with the quantification of TC-based imaging results (Figure S9), the number of PDGFR-α+ cells and NFATc1+ cells were significantly higher than that in pulse group.”

      (5) Line 301, the figure callout "Figure 4" is erroneous.

      Thank you for your efforts, and we are sorry for our negligence. The corresponding descriptions have been amended as below:

      Page 11 in the revised manuscript, “After 11 days tracing, the number of PDGFR-α+ & NFATc1+ cells and PDGFR-α+NFATc1+ cells increased significantly (Figure 7)…”

      (6) Line 306, the sentence "Our previous study identified the presence of NFATc1+ cells in the cranium by single-cell sequencing (unpublished data)" could be improved by referencing specific data or findings.

      Thanks for your suggestions, and we are sorry for our negligence. The corresponding citation has been amended as below:

      Page 11 in the revised manuscript, “As a part of craniomaxillofacial hard tissue, we also intended to explore whether the presence of NFATc1+ and PDGFR-α+ cells in cranial bone tissue/suture is different from dental and periodontal tissue (our previous study has identified the presence of NFATc1+ cells in the cranium by single-cell sequencing28”

      (7) Line 341, the statement "Moreover, no PDGFR-α+ cells were detected in the Nfatc1DreER; IR1 group," needs further explanation or context.

      Thanks for your suggestions. The corresponding descriptions have been amended as below:

      Page 13 in the revised manuscript,  “Moreover, since the recombinase recognition sites are interleaved (loxP–rox–loxP–rox), recombination by one system will naturally remove a recognition site of the other system, rendering its reporter gene inactive for further recombination. The results showed no tdTomato+ cells or ZsGreen+ cells were detected in the Pdgfr-αCreER; IR1 or Nfatc1DreER; IR1 group respectively demonstrating the feasibility and accuracy of the IR1 system.”
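      The mutually exclusive behaviour described in this passage can be illustrated with a minimal Python sketch that treats the reporter as an ordered list of elements. The exact element order used here (including the STOP cassettes), the convention that Cre-loxP recombination yields ZsGreen and Dre-rox recombination yields tdTomato, and the function names `recombine` and `expressed_reporter` are assumptions made for illustration; they are not taken from the manuscript.

```python
# Hypothetical model of an interleaved reporter (loxP-rox-loxP-rox).
# The element order below is an illustrative assumption, not the published construct map.
IR1 = ["loxP", "rox", "STOP", "loxP", "ZsGreen", "STOP", "rox", "tdTomato"]


def recombine(cassette, site):
    """Excise everything between the two matching sites, leaving one site behind."""
    positions = [i for i, element in enumerate(cassette) if element == site]
    if len(positions) < 2:
        # Fewer than two recognition sites: the recombinase has nothing to act on.
        return list(cassette)
    first, last = positions[0], positions[-1]
    return cassette[: first + 1] + cassette[last + 1 :]


def expressed_reporter(cassette):
    """Return the first reporter reached before any STOP cassette, or None."""
    for element in cassette:
        if element == "STOP":
            return None
        if element in ("ZsGreen", "tdTomato"):
            return element
    return None


after_cre = recombine(IR1, "loxP")       # Cre acts on the loxP pair
after_dre = recombine(IR1, "rox")        # Dre acts on the rox pair
cre_then_dre = recombine(after_cre, "rox")
dre_then_cre = recombine(after_dre, "loxP")

print(expressed_reporter(IR1))           # no recombination
print(expressed_reporter(after_cre))
print(expressed_reporter(after_dre))
print(expressed_reporter(cre_then_dre))
print(expressed_reporter(dre_then_cre))
```

      In this toy model, whichever recombinase acts first deletes one recognition site of the other system, so a second recombination event cannot switch the reporter.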

      (8) Several statements in this text were duplicated. For instance, lines 365 to 376 are identical to lines 497 to 508. This redundancy should be addressed to improve the manuscript's clarity and conciseness.

      We greatly appreciate your suggestions, and we are sorry for the misunderstanding we may have caused. We have revised and integrated the entire Results 4 section (including lines 365 to 376 of the original manuscript) into the Discussion section to avoid unnecessary redundancy and misunderstandings. This adjustment also emphasizes that the goal of using two imaging techniques is to draw more credible conclusions from multiple perspectives, thereby mitigating the shortcomings of relying on a single advanced imaging method. The revised content is as follows:

      Page 18 in the revised manuscript, “TC-based advanced imaging procedure can clearly visualize its 3D structure, reconstruct the whole across latitudes, and understand the spatial position and expression of each structure, which could avoid the bias of traditional single-layer slicing may cause, and provides a more intuitive and objective description of the existing situation. However, our results demonstrated TC still has some limitations…”

      Page 19 in the revised manuscript, “The 3D sections reconstruction results, however, effectively addressed the issue of weak tdTomato signal and provide a clearer visualization of the distribution of ZsGreen and tdTomato signals. For example, the tdTomato signal in the root pump, which was almost completely unobservable by TC-based imaging, can be clearly seen using confocal imaging and 3D reconstruction (Figure 3C-D, Figure 6C-D, and Figure S4, Figure S12). However, compared to TC, the quality of 3D reconstruction of sections still relies on the angle and quality of the sections, with the section angle having a significant impact on the reconstruction outcome. In addition, because the slice itself has a certain thickness (10 μM in this study), which leads to the appearance of discontinuous in the final reconstructed image, and the aesthetics and accuracy could be affected to a certain extent. Also, unavoidable tissue damage during the sectioning process may result in the loss of some information. Therefore, a variety of different information could be obtained through two different imaging technologies, which prompt us to use the advanced experimental procedure according to the actual purpose.”

      Reviewer #2 (Recommendations For The Authors):

      (1) It should be further highlighted in the article what cell type the NFATc1+/PDGFR-α+ cells should be defined as in teeth and periodontal tissues.

      Thank you so much for your suggestions. We have supplemented the analysis with immunofluorescence results of double-positive cells to identify NFATc1+&PDGFR-α+ cell populations, selecting AlphaV as the marker for mesenchymal stem cells (MSCs) and CD45 as the marker for hematopoietic cells.

      This part was on page 14-15 in the revised manuscript, “To identify the population of PDGFR-α+ and NFATc1+ co-expressing cells in the pulp and periodontal ligament (PDL), we generated Pdgfr-aCreER; Nfatc1DreER; R26-LSL-RSR-tdT-DTR (LRTD) mice… Strong tdTomato signals were detected in both the PDL (Figure S22B) and pulp (Figure S22C). With respect to the MSC-specific marker AlphaV, we observed AlphaV+tdTomato+ cells in both regions. Additionally, CD45+ (hematopoietic marker) tdTomato+ cells were also present in these areas (Figure S22B, C). These findings suggested that the population of PDGFR-a+ and NFATc1+ co-expressing cells is heterogeneous.”

      We also supplemented the discussion regarding the role of the PDGFR-α+ population on page 17, where its potential role in pulp and periodontal formation is also suggested:

      Page 17 in the revised manuscript: “After ablating PDGFR-α+ cells, we observed damage to the odontoblast layer and shrinkage of the pulp core in dental pulp tissue, indicating that PDGFR-α+ cells contribute to the composition of dental pulp tissue, particularly the odontoblast layer (Figure. 9C, D). In the periodontal ligament, we noted a reduction and destruction of collagen fibers, suggesting a role for PDGFR-α+ cells in periodontal tissue structure (Figure. 9E, F).”

      (2) The authors are advised to supplement the description of the cellular origin and the differentiation trajectory of NFATc1+/PDGFR-α+ cells in teeth and periodontal tissues.

      Thank you for your suggestion. Our study currently focused more on mapping the distribution atlas of NFATc1+PDGFRα+, NFATc1+PDGFRα-, and NFATc1-PDGFRα+ cells in adult homeostatic mice. In the next step, we plan to explore the differentiation trajectory of NFATc1+/PDGFRα+ cells during development using single-cell sequencing and other methods.

      (3) It is recommended to add figure labels to Figure 1B to facilitate reader comprehension.

      Thank you for your valuable suggestion to improve the quality of this manuscript. We have modified Figure 1B in the revised manuscript as follows:

      (4) Why compare 3D images from tissue clearing with 3D reconstructions of confocal imaging after consecutive tissue slicing?

      Thanks for your important and helpful comments to improve the quality of this manuscript, and we are sorry for the insufficient statement.

      The original intention of comparing the two methods was to draw more credible conclusions from multiple perspectives, thereby minimizing the limitations inherent in relying on a single advanced imaging technique. Indeed, the description in the previous manuscript could lead to misunderstandings among readers. Therefore, in the revised manuscript, we have modified and integrated the content of the Results 4 section into the Discussion section to eliminate unnecessary verbosity and potential confusion.

      Page 18 in the revised manuscript, “TC-based advanced imaging procedure can clearly visualize its 3D structure, reconstruct the whole across latitudes, and understand the spatial position and expression of each structure, which could avoid the bias of traditional single-layer slicing may cause, and provides a more intuitive and objective description of the existing situation. However, our results demonstrated TC still has some limitations…”

      Page 19 in the revised manuscript, “The 3D sections reconstruction results, however, effectively addressed the issue of weak tdTomato signal and provide a clearer visualization of the distribution of Zsgreen and tdTomato signals. For example, the td-tomato signal in the root pump, which was almost completely unobservable by TC-based imaging, can be clearly seen using confocal imaging and 3D reconstruction (Figure 3C-D, Figure 6C-D, and Figure S4, Figure S12). However, compared to TC, the quality of 3D reconstruction of sections still relies on the angle and quality of the sections, with the section angle having a significant impact on the reconstruction outcome. In addition, because the slice itself has a certain thickness (10 μM in this study), which leads to the appearance of discontinuous in the final reconstructed image, and the aesthetics and accuracy could be affected to a certain extent. Also, unavoidable tissue damage during the sectioning process may result in the loss of some information. Therefore, a variety of different information could be obtained through two different imaging technologies, which prompt us to use the advanced experimental procedure according to the actual purpose.”

      (5) The experimental results section does not specify the age of the mice used, which lacks clarity for the reader and makes it difficult to determine at what developmental stage the observed distribution of NFATc1+/PDGFR-α+ cells occurs.

      Thank you for your suggestion. We apologize for overlooking this point. The age of the mice was previously displayed only in some of the figures. All the transgenic mice discussed in this article are adults of around 12-14 weeks of age. We have added the specific ages in weeks to the main text.

      (6) What is the rationale behind selecting day 1, day 3, and day 5 as the experimental time points in Figure 2B?

      Thanks for your questions. 48 hours after injection, TAM can be metabolized in the body and converted into 4-OHT, which then distributes thoroughly to various tissue systems through the bloodstream. Therefore, we chose to administer a booster dose 48 hours after the initial injection to ensure timely replenishment and achieve high labeling efficiency. This drug administration scheme has already been validated for feasibility in our preliminary studies.

      (7) In Figure 2E, why is there a large area of red signal visible in the tooth enamel?

      Thanks for your valuable comments and advice on how to improve the quality of this research article and our future work. As we discussed in the main text, the existing TC-based imaging techniques cannot capture tdTomato signals as conspicuously as ZsGreen signals, which may be due to the following: 1) the editing efficiency of the DNA recombinase-mediated lineage-tracing system has limitations; 2) the lower abundance of NFATc1+ cells in the region of interest (ROI) results in weak tdTomato signals; 3) the TC method as described may result in poor penetration of tdTomato fluorescence signals. Therefore, to display the NFATc1+ cells in the ROI (periodontal ligament, pulp, and alveolar bone) as clearly as possible, we increased the excitation intensity of the 561 channel of the light-sheet fluorescence microscope, which led to a large area of unrelated red signal in non-target areas (tooth enamel). In future work, we will further improve the TC procedure to shorten the sample processing time and develop other transgenic mice to address this issue. Thanks again.

      (8) In the text at Line 249, the author notes that PDGFRα+ cells are widely distributed, and NFATc1+ cells are primarily located in the pulp horns. What is the relevance of their distribution to their function?

      Thank you very much for your suggestion. We found that PDGFRα+ cells are widely distributed in dental pulp tissue. Combined with the results from the subsequent cell ablation experiments, this indicates that PDGFRα+ cells contribute to the formation of the odontoblast layer and the pulp core. In our supplementary data, immunofluorescence staining showed that double-positive cells co-expressed AlphaV in the dental pulp, indicating that they include an MSC component. The relationship between their distribution and function requires further investigation in the future.

      (9) In Line 301 of the text, there is a mislabeling of Figure 4. Please verify this carefully throughout the document.

      Thank you for your efforts, and we are sorry for our negligence. We have made the necessary corrections and have meticulously reviewed the entire manuscript to ensure that there were no similar mistakes. The corresponding descriptions have been amended as below:

      Page 11 in the revised manuscript, “After 11 days tracing, the number of PDGFR-α+ & NFATc1+ cells and PDGFR-α+NFATc1+ cells increased significantly (Figure 7)…”

      (10) Between Lines 323 to 325, the author states: "the wider range of PDGFR-α+ cells than NFATc1+ cells were observed, which laid the foundation for our conjecture that NFATc1+ cells may contribute as subpopulation of PDGFR-α+ cells." This statement is inaccurate.

      Thank you for your suggestions. We apologize for the inaccuracies in our description and have made corrections in the original text.

      Page 12 in the revised manuscript, “the wider range of PDGFR-α+ cells than NFATc1+ cells were observed, we speculate that there may be a hierarchical relationship between the two.”

      (11) The author is advised to combine the use of single-cell sequencing data for cell trajectory analysis to corroborate the differentiation relationships between NFATc1+/PDGFR-α+ cells, discussing their specific origins and final differentiation fates.

      Thank you for your suggestion; it is very meaningful to us and will be the focus of our future research work.

      (12) In the Results 4 section, the comparison between tissue clearing imaging and 3D reconstruction of consecutive tissue slices could be discussed in the discussion section.

      We greatly appreciate your suggestions. We have revised and integrated the entire Results 4 section into the Discussion section to avoid unnecessary redundancy and misunderstandings. This adjustment also emphasizes that the goal of using two imaging techniques is to draw more credible conclusions from multiple perspectives, thereby mitigating the shortcomings of relying on a single advanced imaging method. The revised content is as follows:

      Page 18 in the revised manuscript, “TC-based advanced imaging procedure can clearly visualize its 3D structure, reconstruct the whole across latitudes, and understand the spatial position and expression of each structure, which could avoid the bias of traditional single-layer slicing may cause, and provides a more intuitive and objective description of the existing situation. However, our results demonstrated TC still has some limitations…”

      Page 19 in the revised manuscript, “The 3D sections reconstruction results, however, effectively addressed the issue of weak tdTomato signal and provide a clearer visualization of the distribution of Zsgreen and tdTomato signals. For example, the td-tomato signal in the root pump, which was almost completely unobservable by TC-based imaging, can be clearly seen using confocal imaging and 3D reconstruction (Figure 3C-D, Figure 6C-D, and Figure S4, Figure S12). However, compared to TC, the quality of 3D reconstruction of sections still relies on the angle and quality of the sections, with the section angle having a significant impact on the reconstruction outcome. In addition, because the slice itself has a certain thickness (10 μM in this study), which leads to the appearance of discontinuous in the final reconstructed image, and the aesthetics and accuracy could be affected to a certain extent. Also, unavoidable tissue damage during the sectioning process may result in the loss of some information. Therefore, a variety of different information could be obtained through two different imaging technologies, which prompt us to use the advanced experimental procedure according to the actual purpose.”

      (13) The article only demonstrates the impact of removing PDGFR-α+ cells on the dental pulp and periodontal tissues of adult mice. What would be the impact of removing NFATc1α cells on teeth and periodontal tissues?

      Thank you for your suggestions. Our lab has been investigating the role of NFATc1+ cells in PDL and dental pulp tissues, and this work is currently under submission to another journal, so please forgive us for not being able to present the data here. The ablation assays showed that NFATc1+ cells may be involved in the formation of the odontoblast layer in dental pulp and in promoting osteogenic differentiation in the periodontal ligament.

      (14) The effects of removing PDGFR-α+ cells on the teeth and periodontal tissues of adult mice are shown in the article. What would be the impact on teeth and periodontal tissues if PDGFR-α cells were removed during early development?

      Thank you for your question. Our current research has not yet focused on the impact of PDGFR-α+ cells on the formation of periodontal ligaments and dental pulp tissue during the developmental stage. In our literature search, we found articles indicating that PDGFR-α is expressed at all stages of tooth development, and that PDGFR-α signaling is crucial for regulating the growth of the tooth apex and the proper extension of the palatal shelves during palatal fusion. Disruption of PDGFR-α signaling interferes with apex growth and the critical extension of palatal shelves during craniofacial development. In the future, we would like to focus on the role of PDGFR-α+ cells during tooth development.

      (15) If the data on the skull are not presented in this paper, it is suggested not to overly describe it in the results section, or to include related skull data in supplementary figures.

      We appreciate your attention to detail and your suggestions for improving the clarity and presentation of our work. The corresponding results for the cranium and cranial suture regions are shown in Videos S7-9 in the revised manuscript.

      Reviewer #3 (Recommendations For The Authors):

      We sincerely appreciate your thorough review and positive feedback on our manuscript. In accordance with your recommendations, all the questions and concerns have been addressed in the revised manuscript. We believe these revisions further enhance the clarity and quality of our work. The point-by-point reply to the comments is listed below:

      (1) In line 181, the author claimed that "we modified and improved a rapid and efficient procedure...this ultrafast clearing technique could minimize the impact on transgenic mice." However, there is no mention in the main text of the amount of time required for other methods. How can the "rapid" element of your improved method be reflected? The author should briefly list a few other studies and discuss them.

      Thanks for your important and helpful comments, and we are sorry for the insufficient statement. In recent years, a variety of tissue clearing methods have emerged. Here is a summary of the methods and durations used for hard tissue clearing as published in several authoritative journals:

      Author response table 1.

      In comparison, our approach requires only approximately two days, thereby minimizing the potential damage to the tissue itself. Additionally, because the study employs transgenic lineage-tracing mice, the shorter processing time also helps to minimize the loss of fluorescence in the labeled cells.

      (2) In Figure S6, the author mentioned the use of another 3D reconstruction method-DICOM-3D. What is the advantage of this methodology? Is the conclusion drawn the same as the previous approaches? The author should propose corresponding discussions in this section.

      We sincerely appreciate your comments. The purpose of employing DICOM-3D reconstruction for the serial section images is to validate the constructed results obtained by Imaris. This method is based on sequential 2D DICOM images and utilizes 3D reconstruction and visualization technology to generate a stereoscopic 3D image with intuitive effects. Compared to Imaris reconstruction, this method offers a more straightforward and time-efficient approach. Regardless of the different reconstruction methods employed in this study, the ultimate goal remains consistent, which is to jointly address the spatial positioning and hierarchical relationships of PDGFR-α+, NFATc1+, and PDGFR-α+NFATc1+ cells from multiple perspectives, to enhance the credibility and persuasiveness of our results. We have also included the corresponding description in the revised manuscript as follows:

      Page 8-9 in the revised manuscript, “To enhance the comprehensive and accurate display of the reconstruction results and to mitigate the potential errors that may arise from relying on single reconstruction method, we employed an alternative 3D reconstruction method—DICOM-3D. This method is based on sequential 2D DICOM images and utilizes 3D reconstruction and visualization technology to generate a stereoscopic 3D image with intuitive effects, which was a comparatively straightforward and highly efficient approach. We transformed the serial IF images into DICOM format and subsequently reconstruct it, and the same conclusion can be drawn, namely, PDGFR-α+ cells almost constituted the whole structure of pulp and PDL, with NFATc1+ cells as subpopulation (Figure S6).”
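      The response does not describe how the DICOM-3D software works internally, so the following is only a generic sketch of reconstructing a volume from ordered 2D sections. The tiny 2x2 "images" and the function names are hypothetical and stand in for real serial sections. The sketch stacks the sections along z and computes a maximum-intensity projection, one common way to view a stacked volume.

```python
# Stack ordered 2D sections into a 3D volume and take a maximum-intensity projection.
# The 2x2 "images" below are hypothetical placeholders, not data from the study.
sections = [
    [[0, 1], [2, 0]],  # section at z = 0
    [[3, 0], [0, 0]],  # section at z = 1
    [[0, 0], [1, 4]],  # section at z = 2
]


def stack_sections(sections):
    """Check that all sections share one shape and return a copy as volume[z][y][x]."""
    rows, cols = len(sections[0]), len(sections[0][0])
    for section in sections:
        if len(section) != rows or any(len(row) != cols for row in section):
            raise ValueError("all sections must have the same dimensions")
    return [[list(row) for row in section] for section in sections]


def max_intensity_projection(volume):
    """Collapse the z-axis by keeping the brightest value at each (y, x) position."""
    rows, cols = len(volume[0]), len(volume[0][0])
    return [
        [max(plane[y][x] for plane in volume) for x in range(cols)]
        for y in range(rows)
    ]


volume = stack_sections(sections)
mip = max_intensity_projection(volume)
print(mip)
```

      Real serial sections would additionally need registration between neighbouring slices and a known section thickness to set the z-spacing; both are omitted here.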

      (3) Line 292: Why was the tdTomato signal in confocal-based reconstruction more conspicuous than the TC procedure? Some descriptions would be beneficial for readers' understanding.

      Thank you very much for your comments. We hypothesize that current light-sheet systems have inherent limitations in capturing tdTomato signals in intact tissue, which become more evident in tissues with inherently low fluorescence intensity (in this work, the limited editing efficiency of the DNA recombinase-mediated lineage-tracing system resulted in a weaker tdTomato signal than ZsGreen). In contrast, traditional confocal imaging techniques do not encounter such issues. The corresponding descriptions in the revised manuscript are as follows:

      Page 11 in the revised manuscript, “We hypothesize that the current light-sheet systems for intact tissue-imaging have inherent limitations in capturing tdTomato signals, which become more evident in tissues with inherently low fluorescence strengths (in this work, due to the limitations of editing efficiency in DNA recombinase mediated lineage-tracing system, which guaranteed weaker tdTomato signal compared to ZsGreen). In contrast, traditional confocal imaging techniques do not encounter such issues.”

      (4) Part 2.2, line 305: What is the purpose of analyzing the cranium and cranial sutures region through TC technology?

      Thank you for your comments. This part of the experiment had three main purposes. First, our research group has long been committed to studying the distribution and role of NFATc1+ SSCs in a variety of hard tissues, and our previous study identified NFATc1+ cells in the cranium by single-cell sequencing. Therefore, in this work, we also intended to investigate the spatiotemporal atlas of NFATc1+ and PDGFR-α+ cells in the cranium and cranial suture region using transgenic lineage tracing techniques. Second, because the cranium is part of the craniomaxillofacial hard tissue, we intended to explore whether the presence of NFATc1+ and PDGFR-α+ cells in cranial bone tissue/suture differs from that in dental and periodontal tissue. Third, the results in Video S7-9 further demonstrated that our improved tissue clearing procedure is applicable to a variety of hard tissues, which lays a foundation for our future research.

      Page 11 in the revised manuscript, “As a part of craniomaxillofacial hard tissue, we also intended to explore whether the presence of NFATc1+ and PDGFR-α+ cells in cranial bone tissue/suture is different from dental and periodontal tissue (our previous study has identified the presence of NFATc1+ cells in the cranium by single-cell sequencing28).”

      (5) Some images before & after the tissue-clearing procedure need to be provided in the supplemental file.

      Thanks for your important and helpful comments to improve the quality of this manuscript. We have included the corresponding description and photographs in the main text and the supplemental file as follows:

      Page 7 in the revised manuscript, “As shown in Figure S1A-B, we recorded bright-field images of the maxilla before and after clearing, and our procedure achieved high transparency of the whole tissue. On this basis, whole-tissue imaging can be achieved, with the observation of different cell type distribution in spatial 3D structure.”

      (6) In part 5, line 394, the author investigated the consequences of the ablation of PDGFR-α+ cells in dental pulp and periodontal mesenchymal tissues, but some research objectives and mechanisms need to be discussed here, regarding: "why choosing to ablation PDGFR-α+ cells instead of NFATc1+ cells? Was the hierarchical relationship between PDGFR-α+ cells and NFATc1+ cells considered during the experimental design?", etc.

      Thank you very much for your suggestion, it has been very helpful. We chose PDGFR-α+ cells as the subject for the cell ablation experiments based on the results from the previous lineage tracing and hierarchical relationship studies. We have included the corresponding description and photographs in the main text and the supplemental file as follows:

      Page 13 in the revised manuscript, “The results from the aforementioned lineage tracing experiments showed that PDGFR-α+ cells constitute a significant component of both dental pulp and periodontal tissues. Additionally, the hierarchical relationship experiments revealed that a portion of NFATc1+ cells in the periodontal ligament derives from PDGFR-α+ progenitor cells. Therefore, investigating the role of PDGFRα+ cells in dental pulp and periodontal tissues has become more urgent.”

      (7) Some claims in the main text lacked literature citations, such as in lines 207 and 234.

      Thank you very much for your comments. We are deeply sorry for the mistakes. We have added the relevant references at the appropriate locations in the main text as follows:

      (1) line 207 of previous manuscript (page 8, line 206 in the revised manuscript): We sincerely apologize for the typo that occurred during the writing process and have revised the original text to: which was consistent with RNA-sequencing results in the previous study20. (DOI: 10.1177/00220345221074356)

      (2) line 234 of previous manuscript (page 9, line 234 in the revised manuscript): “we employed an alternative 3D reconstruction method—DICOM-3D27.” (DOI: 10.1177/09544119211020148)

      (8) What were the specific reasons for the conspicuous tdTomato signal in the reconstructed images obtained by traditional serial section-based confocal imaging, which were not as evident in TC imaging?

      Thank you very much for your comments. Traditional sectioning and subsequent confocal imaging can clearly display fluorescence signals on a single plane (Figure 3B, Figure 6B, Figure S3, S8, S11, S16, S19); therefore, the 3D reconstruction of multiple planes retains a high resolution (Figure 3, 4, 7, 8). However, for TC imaging, current light-sheet systems have inherent limitations in capturing tdTomato signals in intact tissue, which become more evident in tissues with inherently low fluorescence intensity (in this work, the limited editing efficiency of the DNA recombinase-mediated lineage-tracing system resulted in a weaker tdTomato signal than ZsGreen). In contrast, traditional confocal imaging techniques do not encounter such issues.

      (9) In tissue clearing techniques, do the chemical reagents and procedures used affect the signal intensity of tdTomato and Zsgreen?

      We appreciate your helpful comment. In this work, we modified and improved a rapid and efficient tissue deep clearing (TC) procedure based on the existing SUMIC method (Nature Cardiovascular Research, 2024, 3, 474–491; Cell, 2023, 186, 382-397.e24). These studies have confirmed that the chemical reagents used in this method do not affect the inherent fluorescence signal of transgenic animals. With our improvements, we minimized the sample processing time as much as possible to avoid any potential adverse effects. The results in Figure 2, Figure 5, and Figure S1 indicated that after the TC procedure, the tissue exhibits significant ZsGreen signals and certain tdTomato signals, which sufficiently support our conclusions.

      (10) How did you address the issue of sample integrity and discontinuities in the z-axis caused by the stratification of slices in your reconstructions?

      We greatly appreciate your comments. Currently, reconstruction techniques based on continuous sectioning cannot fully eliminate discontinuities in the z-axis. For this reason, we compensate for this deficiency by imaging the whole tissue with the TC procedure. These two 3D-reconstruction and imaging technologies complement each other to jointly address the spatial positioning and hierarchical relationships of PDGFR-α+, NFATc1+, and PDGFR-α+NFATc1+ cells from multiple perspectives. Additionally, this deficiency can be minimized by improving technical skills, reducing section thickness, and minimizing tissue loss during sectioning, which we will pursue in future research.

      (11) In Figure 2B, the schematic representation of the operational principle "Cre-loxp/Dre-loxp" does not correspond to the genotype "CreER/DreER". Please correct it.

      Thanks for your important comments. We are sincerely sorry for the mistake. We have modified Figure 2B in the revised manuscript as below:

      (12) Line 450, the specific distribution and differences of PDGFR-α+, NFATc1+, and PDGFR-α+&NFATc1+ cells in pulp and periodontal tissues need to be further described and explained.

      Thank you for your question. We have described this part on page 16 in the revised manuscript, “In PDL tissue, pulse data demonstrated widespread and abundant expression of PDGFR-α single-positive cells as well as NFATc1 single-positive cells, with no significant alteration in expression pattern or quantity after lineage tracing. Consequently, we conclude that in periodontal ligament and dental pulp tissues, PDGFR-α single-positive and NFATc1 single-positive cells primarily label intrinsic periodontal mesenchyme in PDL. Conversely, PDGFR-α+&NFATc1+ cells exhibited a more confined localization in PDL. The tracing data clearly illustrated that PDGFR-α+&NFATc1+ cells successfully gave rise to numerous progenies, which become predominant constituents within the periodontal ligament. In pulp tissue, the distribution of PDGFR-α single-positive cells was similar as that in PDL, primarily labeled odontoblast cell layer and there was not a significant increase in ZsGreen signal after tracing assay.”

      (13) In Figure S9, the sparse presence of NFATc1+ cells in pulp and periodontal tissue raises questions about the plasticity and differentiation potential of these cells. The author should include relevant discussions in this section.

      Thanks for your suggestion. Considering the plasticity and differentiation potential of NFATc1+ cells, we conducted immunofluorescence staining and found that the PDGFR-α+&NFATc1+ cell lineage in dental pulp and periodontal tissues represents a heterogeneous population. This population includes non-terminally differentiated mesenchymal stem cells (MSCs) as well as hematopoietic cells, indicating significant heterogeneity. We have also added this part of the discussion on page 17 of the manuscript.

      Page 17 in the revised manuscript, “Cell ablation and immunofluorescence staining experiments further characterized the types and functions of PDGFR-α+/PDGFR-α+&NFATc1+ populations. After ablating PDGFR-α+ cells, we observed damage to the odontoblast layer and shrinkage of the pulp core in dental pulp tissue, indicating that PDGFR-α+ cells contribute to the composition of dental pulp tissue, particularly the odontoblast layer (Figure. 9C, D). In the periodontal ligament, we noted a reduction and destruction of collagen fibers, suggesting a role for PDGFR-α+ cells in periodontal tissue structure (Figure. 9E, F). Previous results confirmed the presence of double-positive cells in both dental pulp and periodontal tissues and provided insights into their hierarchical relationships in the periodontal ligament (Figure. 8). To further investigate the double-positive cell population, we developed an inducible dual-editing enzyme reporter system to label these cells with tdTomato signals. Using AlphaV as a marker for mesenchymal stem cells (MSCs) and CD45 for hematopoietic cells, we found that double-positive cells included components of both MSCs and hematopoietic cells (Figure S22B, C), indicating a heterogeneous population. Further experiments are necessary to determine whether the predominant role in this co-positive MSC population is played by PDGFR-α+ or NFATc1+ and to clarify the specific functions of these cells in the future.”

      (14) Part 3, line 351, the authors were unable to confirm the hierarchical relationship between PDGFR-α+ and NFATc1+ cells in the dental pulp region. Could this be due to limitations in experimental design or technical methods? Have you considered other factors that might explain these results?

      Thank you for your question. We believe that the possible reason was that PDGFR-α+ cells were a widely distributed constitutive component of dental pulp tissue, while NFATc1+ cells had a more limited expression range, resulting in a significant difference between the two. Therefore, we were unable to calculate the differences. In the future, we could further investigate the hierarchical relationship between the two by increasing the sample size or through in vitro experiments such as immunoprecipitation.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We thank the reviewers for their time in evaluating the strengths and weaknesses of our manuscript.

      We are pleased to see that all reviewers recognized the high significance of our work, noting that the manuscript addresses “longstanding question of which cell types are infected during congenital or perinatal rubella virus infection”. As noted by reviewer 1, “This study reveals a new cellular target that will have important implications for basic studies on rubella virus-host interactions and for the potential development of therapies or improved vaccines targeting this virus. As the rubella virus is a pathogen of high concern during human pregnancy, this study also has important implications in the field of neonatal infectious diseases”.

      Below, we provide responses (in blue) to specific critiques:

      Reviewer #1 (Public Review):

      A weakness is that the current data do not provide information on the full replicative potential of the rubella virus in microglia, or whether the virus persists in this system.

      See our response below. Briefly, we include new experimental evidence from primary tissue, microglia-transplanted organoids, and Vero cells to further characterize the dynamics of viral infection.

      Reviewer #1 (Recommendations for the authors):

      Most of the viral assays in the brain slices and organoids examine viral protein synthesis, which is a surrogate for genome replication. However, basic virological characterization is lacking and would improve the robustness of the model and its potential utility to understand better rubella virus-microglia interactions. Questions the authors should consider with new experiments include:

      Are new virions produced? Can viruses be detected in the media?

      Or, are the infections abortive, with viral protein synthesis occurring, but no virus production?

      We performed RV titering experiments in dissociated microglia co-cultured with other cell types, as well as Vero cells as a control. While we can detect a robust increase in viral titer from Vero cells, it fell below detection levels in microglia co-cultures. See Author response image 1. We now include these data in Supplementary Figure 2D.

      Author response image 1.

      Rubella virus titering experiment performed in Vero cells (positive control) or dissociated microglia co-cultures. In primary microglia co-cultures, viral titer falls below detection levels after several days of infection.

      While we could not detect an increase in viral particles from microglia mixed cultures, we confirmed the presence of GFP from the RV-GFP reporter construct, and we believe it serves as proof that the virus can infect microglia cells and lead to production of functional viral protein (Author response image 2, Figure 1E-F of the current manuscript):

      Author response image 2.

      We also observed an increase in RV RNA over time in tissue slice infections, using qPCR (Author response image 3, not included in the manuscript).

      Author response image 3.

      Modest increase in RV RNA over time in brain slice infections. Rubella virus RNA measured by qPCR relative to GAPDH gene, in n=3 samples (2 technical replicates each condition). Brain slices were exposed to RV, then collected at end of inoculation (4 hours post infection), or at 3 or 5 days post infection, and processed for RNA extraction and RT-qPCR.
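      The caption above reports RV RNA relative to GAPDH but does not spell out the calculation. A minimal sketch of one common approach, the delta-Ct method, is given below. It assumes roughly 100% amplification efficiency, and every Ct value is a hypothetical placeholder rather than data from the study; the function name is also illustrative.

```python
# Relative RV RNA by the delta-Ct method, normalised to GAPDH.
# All Ct values are hypothetical placeholders, not measurements from the study.
from statistics import mean


def relative_expression(ct_target, ct_reference):
    """Return 2^-(Ct_target - Ct_reference), averaging technical replicates first."""
    delta_ct = mean(ct_target) - mean(ct_reference)
    return 2 ** (-delta_ct)


# time point -> (RV Ct replicates, GAPDH Ct replicates); values are made up
samples = {
    "4 hpi": ([30.0, 30.0], [20.0, 20.0]),
    "3 dpi": ([29.0, 29.0], [20.0, 20.0]),
    "5 dpi": ([28.0, 28.0], [20.0, 20.0]),
}

# Express each time point as fold change over the end-of-inoculation sample.
baseline = relative_expression(*samples["4 hpi"])
fold_change = {
    time_point: relative_expression(*cts) / baseline
    for time_point, cts in samples.items()
}

for time_point, fc in fold_change.items():
    print(f"{time_point}: {fc:.1f}-fold relative to 4 hpi")
```

      With these placeholder values, each one-cycle drop in the RV Ct at constant GAPDH Ct corresponds to a doubling of relative RV RNA.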

      How long do the infections persist in the model? What is the fate of infected microglia over time? Time courses to monitor infection and cell health would be useful.

      We performed a longer infection with RV in organoids transplanted with microglia, and after two weeks of infection, we can detect multiple microglia cells positive for the RV capsid. These data are now included in Figure 4 of the current manuscript.

      Author response image 4.

      After 2 weeks post infection, microglia remain positive for RV capsid.

      Reviewer #2 (Public Review):

      Weaknesses

      The set of data is rather descriptive. It suggests that microglia are the predominant brain target of RV in vivo, without identifying the targeting mechanism that provides cell type specificity. Moreover, what are the diffusible cues released from the brain environment that increase microglia infection and RV replication?

      We agree with the reviewer that identifying molecular mechanisms that underlie this phenotype will be very interesting to explore in future research, and we acknowledge the limitation of the study in the Discussion.

      It is unclear why brain organoids not supplemented by microglia are susceptible to RV inoculation.

      We could not detect RV capsid in organoids without microglia after 72 hours of inoculation. We attribute any changes seen at the level of single cell transcriptomics in the absence of microglia transplantation to exposure to virus-associated particles, including but not limited to viral RNA species, viral proteins, or even other components of the viral stocks made in Vero cells. These factors may induce transcriptomic differences even in the absence of RV infection. In the text, we take care to refer to these conditions as “Rubella virus-exposed” rather than “Rubella virus-infected”. We now include the following panel from Author response image 5 in Figure 4B of the current manuscript.

      Author response image 5.

      Organoids without microglia do not show positive RV immunofluorescence.

      Reviewer #2 (Recommendations for the authors):

      Several points could be further addressed to improve the data set and shed more light on some aspects of this manuscript:

      • Figure 1. Additional microglia markers should be used to reinforce the evidence that microglia cells are the principal RV targets. Since Iba1 is a marker of activated microglia, does RV have a selective tropism to all microglia or only to activated ones in human fetal brain slices?

      The reviewer brings up an interesting point that, in our mind, can be separated into two independent questions:

      1. Are Iba1-positive cells bona fide microglia, or are there other cell populations of macrophage/monocyte origin that are labeled with Iba1? Therefore, additional markers should be used for immunolabeling;

      2. Is RV infection selective for microglia “activation” status, when only immune-primed cells can be infected?

      For the first point, we have previously shown that in the developing human brain, virtually all Iba1-positive cells are also P2RY12-positive (unpublished; Author response image 6). Therefore, in primary human brain slices, there is a negligible amount of non-microglia macrophages. However, in culture microglia quickly lose their “homeostatic” identity, including P2RY12 expression, as quickly as six hours after ex vivo extraction (Gosselin et al., 2017; DOI: 10.1126/science.aal3222).

      Author response image 6.

      P2RY12 co-localizes with Iba1 in primary brain tissue from gestational week 17.5, including cells with more ameboid morphology (arrows)

      However, in organoids at 2 weeks post-RV exposure, we found microglia with both ameboid and more ramified morphology (Author response image 7). It would be challenging and beyond the scope of this manuscript to use morphology or Iba1 intensity levels to determine cause and effect as microglia activation state relates to RV infectivity (i.e. do activated microglia preferentially get infected with the virus, or do infected microglia become activated and upregulate Iba1 levels and change morphology).

      Author response image 7.

      Examples of microglia with round (top) and ramified (bottom) morphology that co-localize with RV capsid staining.

      Regarding RV tropism in the 2D culture of microglia, some Iba- cells are infected by RV as they show capsid staining. What are these cells? Are neurons and/or glia also susceptible to RV in vitro infection? Are non-microglial cells getting RV infected in the absence of microglia?

      In the absence of microglia cells, a small proportion of non-microglia cells get infected with RV. There is no statistically significant difference in the number of cells that get infected with RV in the presence or absence of microglia across different cell types. We add these data as Supplement Figure 3.

      Author response image 8.

      Rubella infection in non-microglia cells. A. Representative images of different cell types depleted of microglia. Cell cultures were stained for RV capsid (green) and DAPI. B. Quantification of total cells that are positive for RV capsid across conditions. C. Quantification of RV+ cells that are not microglia across different cell populations. No statistically significant difference was detected in RV infectivity in cells co-cultured with or without microglia.

      • Figure 3. The low rate of Rubella virus infection in homogenous CD11b+ cell culture raises the question of whether the Rubella virus can infect microglia at a specific activation stage. It is also surprising that there is no infection of such cell population (also CD11b+) alone while cultured in 2D, as reported in figure 2. Why such a difference?

      It is well established that culture of microglial cells isolated from brain tissue alters their molecular properties, which likely alters the cell surface protein composition. In the revised discussion, we include activation as a possible mechanism that will require further investigation.

      • Fig 4A-B, it is unclear whether organoids that are not engrafted with microglia get infected upon RV (with active viral replication) inoculation. If non-microglia-supplemented organoids are indeed infected and allow RV replication, this suggests that organoids might not be the ideal system to model human fetal brain RV infection at GW18-23.

      We could not detect RV capsid in organoids without microglia after 72 hours of inoculation. We now include the following panel from Author response image 9 in Figure 4.

      Author response image 9.

      Organoids without microglia do not show positive RV immunofluorescence.

      • Figure 4E, why are cells derived from microglia-free organoids so much enriched in the UMAP plots as compared to the other organoid condition? Is RV impacting cell fitness, proliferation, or neurodifferentiation?

      This perceived difference is due to data presentation. Based on cell proportions, cells from organoids that were treated with microglia are more represented in the scRNAseq data, and this difference most likely comes from user-introduced imbalance in cell loading and possible cell losses during demultiplexing (Author response image 10, panel A). Cell number composition across different conditions and cell types, including RV and MG treatment, are shown in Supplement Figure 4 of the current manuscript (Author response image 10, panel B).

      Contribution of each condition can be visualized via UCSC single cell data browser: https://cells.ucsc.edu/?ds=rubella-organoids

      Author response image 10.

      Data composition depending on condition. A. Cell number contribution from organoids with and without microglia. B. Contribution of each condition to each cluster composition.
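      The point about data presentation can be illustrated with a short sketch: when conditions contribute unequal numbers of cells, raw counts on a UMAP can mislead, whereas fractions computed within each condition are directly comparable. The labels and counts below are hypothetical, and the function name is illustrative; this is not the authors' analysis code.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, cluster) labels, one tuple per cell; not the study's data.
cells = [
    ("with_microglia", "RGC"), ("with_microglia", "RGC"), ("with_microglia", "EN"),
    ("with_microglia", "EN"), ("with_microglia", "EN"), ("with_microglia", "MG"),
    ("no_microglia", "RGC"), ("no_microglia", "EN"),
]

# Panel A analogue: how many cells each condition contributes in total.
cells_per_condition = Counter(condition for condition, _ in cells)


def cluster_fractions(cells):
    """Return, for each condition, the fraction of its cells falling in each cluster."""
    counts = defaultdict(Counter)
    for condition, cluster in cells:
        counts[condition][cluster] += 1
    return {
        condition: {
            cluster: n / sum(cluster_counts.values())
            for cluster, n in cluster_counts.items()
        }
        for condition, cluster_counts in counts.items()
    }


fractions = cluster_fractions(cells)
print(dict(cells_per_condition))
print(fractions)
```

      In this toy example the two conditions contribute six and two cells respectively, yet excitatory neurons make up half of each condition once the counts are normalised.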

      • Figure 4F-H. If microglia is the predominant target for RV in the brain, why are microglia-free organoids susceptible to RV and who are the other cellular targets, whose infection leads to activation of interleukin pathway genes and dysregulation of brain developmental markers in selected subpopulations (RGCs, ENs..).

      Thank you for bringing up this point. We did not detect any appreciable RV genomic RNA in our published single cell data, nor did we identify RV capsid in the RV-exposed organoids without microglia. Our experiments on dissociated cell cultures show that a small population (~1-4%) of other cell types was positive for the RV capsid, including neuron-enriched and glial-enriched fractions (Author response image 11; Supplementary Figure 3C in current manuscript). We expect a similar proportion of non-microglia cells to be infected in the brain organoids. One possible explanation for the robust interferon response even in the absence of productive infection in other cell types is exposure to virions and virus-associated particles, including but not limited to viral RNA species, viral proteins, or even other components of the viral stocks made in Vero cells (which is a cell line that should not produce interferons, but may produce other unmeasured cytokines as a virally infected cell culture).

      Author response image 11.

      Quantification of RV+ cells that are not microglia across different cell populations. No statistically significant difference was detected in RV infectivity in cells cultured with or without microglia.

      • QRT-PCR validations of some of these key brain targets should be performed.

      We agree with the reviewer that further validation of the predicted molecular changes downstream of Rubella exposure would be valuable. We have opted to validate IFITM3 and NOVA1 expression differences using immunostaining, and the results are consistent with our predictions from scRNAseq, and the data is presented in revised Figure 5 and 6 of the current manuscript.

      Reviewer #3 (Public Review):

      Weaknesses of the paper: Overall, additional control experiments are needed to support the stated conclusions. Affinity chromatography is used to purify microglia and other cell types, but the overall cell enrichment is not quantified.

      We appreciate the reviewer's concern. However, affinity-based enrichment rarely guarantees purity, and we do not believe that an accurate estimate of the purification purity would alter the biological interpretation of the data.

      In cell mixing experiments, the authors do not rule out the possibility that the added non-microglia cells also become infected, releasing additional infectious viruses. The finding that a diffusible factor is required for RV infection would be unusual if not unprecedented; therefore, additional data are required to support this claim and rule out other interpretations.

      We provide quantification of non-microglia cells that are positive for RV capsid in the presence and absence of microglia. A small proportion (~1-4%) of non-microglia cells get infected with the virus and can potentially release more of the virus (see Author response image 12), but we do not know how this newly produced virus would differ from the one that was applied to the cells directly. To follow up our co-culture experiments, we wanted to exclude the possibility of microglia engulfing RV-infected cells in co-cultures; therefore, we separated the two cell fractions by a liquid-permeable membrane (Figure 3 of the current manuscript). It is possible that factors secreted by other cell populations in the transwell assay experiments act on microglia cells to upregulate a yet unidentified receptor on the microglia surface or another infection-dependent molecule, rendering them infectable by the virus.

      We re-phrase the text by de-emphasizing “soluble factors” and focusing on excluding phagocytosis of infected cells as a possible mechanism of RV capsid immunoreactivity in microglia cells.

      Author response image 12.

      Rubella infection in non-microglia cells. A. Representative images of different cell types depleted of microglia. Cell cultures were stained for RV capsid (green) and DAPI. B. Quantification of total cells that are positive for RV capsid across conditions. C. Quantification of RV+ cells that are not microglia across different cell populations. No statistically significant difference was detected in RV infectivity in cells co-cultured with or without microglia.

      The methods section would be improved by including details about the iPSC line that was used.

      We include the following section in Materials and Methods:

      iPSC lines.

      All work related to human iPS cells has been approved by the UCSF Committee on Human Research and the UCSF GESCR (Gamete, Embryo, and Stem Cell Research) Committee. Human iPS cell line “WTC-10” derived from healthy 30-year-old Japanese male fibroblasts was from the Conklin Lab, UCSF (Bershteyn et al., 2017; Kreitzer et al., 2013). Human iPSC line “13325” was derived from 9-year-old female fibroblasts originally obtained from Coriell cell repository. Human iPSC line “1323-4” derived from healthy 48-year-old Caucasian female fibroblasts (gift from the Conklin Lab, UCSF) was used for immunofluorescence validation analysis as we found that this line generates more reproducible brain organoid differentiations.

      and by a more thorough description of virus-specific details, including the numbers of infectious particles added per volume of incubation media.

      We now include the following data in the Materials and Methods section:

      Rubella virus infection

      Cells cultured in 2D were inoculated by adding RV stock virus to culture media in 1:1 dilution (250 ul of media to the equal volume of viral stock, 1.75 × 10^5 total ffu/well) to achieve a multiplicity of infection (MOI) of 2. After four hours, media was exchanged with fresh cell culture media. Cortical brain slices were treated with 500 ul of RV viral stock (3.5 × 10^5 total ffu/slice) applied over the slice culture filter for four hours, and then the viral culture media was removed and replaced with fresh slice culture media. Organoids were treated in 6-well plates with 2 ml of 1:1 dilution of viral stock:organoid maintenance media (7 × 10^5 total ffu) for four hours, and then viral media was exchanged for fresh media. For all experimental conditions, cells were fixed and processed for downstream analysis at 72 hours post infection. Supernatant from non-infected Vero cells (mock) or heat-inactivated RV (65 °C, 30 min) was used as control.
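      The doses quoted in this paragraph are mutually consistent, which a short calculation makes explicit. The sketch below uses only the volumes and ffu totals stated above. The cell number per well is not reported in the text, so the value computed here is merely what an MOI of 2 implies; variable and function names are illustrative.

```python
# Inoculum arithmetic for the conditions described above, using the quoted figures.
STOCK_VOLUME_ML = 0.250      # 250 ul of viral stock per well (2D cultures)
TOTAL_FFU_PER_WELL = 1.75e5  # total focus-forming units added per well
TARGET_MOI = 2               # stated multiplicity of infection

# Stock titer implied by the stated volume and dose.
stock_titer_ffu_per_ml = TOTAL_FFU_PER_WELL / STOCK_VOLUME_ML

# MOI = infectious units / cells, so MOI 2 implies this many cells per well.
# This number is derived, not reported in the source text.
implied_cells_per_well = TOTAL_FFU_PER_WELL / TARGET_MOI


def ffu_added(stock_volume_ml, titer=stock_titer_ffu_per_ml):
    """Total ffu delivered by a given volume of undiluted viral stock."""
    return stock_volume_ml * titer


print(f"stock titer: {stock_titer_ffu_per_ml:.2e} ffu/ml")
print(f"implied cells per well at MOI 2: {implied_cells_per_well:.0f}")
print(f"brain slice (0.5 ml stock): {ffu_added(0.5):.2e} ffu")
print(f"organoids (1 ml stock in 2 ml of 1:1 mix): {ffu_added(1.0):.2e} ffu")
```

      The implied stock titer of 7 × 10^5 ffu/ml reproduces the 3.5 × 10^5 ffu quoted for slices and the 7 × 10^5 ffu quoted for organoids.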

      In addition to immunofluorescence, adding additional data to demonstrate and quantify virus infection (PCR and plaque assays, or immunofluorescence using an anti-double-stranded RNA antibody such as J2) from the infected brain slices and organoids would provide greater assurance that the virus is indeed replicating under the experimental conditions.

      We performed RV titering experiments in dissociated microglia co-cultured with other cell types, as well as in Vero cells as a control. While we can detect a robust increase in viral titer from Vero cells, it fell below detection levels in microglia co-cultures. We now include these data in Supplementary Figure 2D.

      Author response image 13.

      Rubella virus titering experiment performed in Vero cells (positive control) or dissociated microglia co-cultures. In primary microglia co-cultures, viral titer falls below detection levels after several days of infection.

      Unfortunately, we did not find J2 staining informative because we could detect signal in both wild type RV infection conditions and in heat-inactivated RV, presumably due to native dsRNA species present in cells. We did not detect any increase or difference in the pattern of staining between RV and heat-inactivated virus-exposed conditions (Author response image 14; not included in the manuscript).

      Author response image 14.

      J2 antibody labels dsRNA in both RV-exposed and control heat-inactivated virus conditions, presumably due to native dsRNA that is not unique to viral replication.

      Organoid imaging with immunofluorescence would be very informative in demonstrating the presence of microglia and also in showing which cells are virus-infected in the context of organoid structures.

      We provide images from 72 hrs and 2 weeks of RV infection, giving a zoomed-out view of organoids with microglia and RV capsid staining. We also provide images at 72 hrs post-infection in organoids without microglia (Author response image 15, Figure 4C in current manuscript).

      Author response image 15.

      Microglia in organoids co-localize with RV capsid staining.

      GenBank accession numbers are listed for the recombinant RV and GFP-RV reporter, but a search using those numbers did not locate the deposits--perhaps the deposits were very recent?

      Information for both viral constructs is now available on GenBank:

      M33 RV strain can be found here: https://www.ncbi.nlm.nih.gov/nuccore/OM816674

      RV-GFP can be found here: https://www.ncbi.nlm.nih.gov/nuccore/OM816675

      The authors incorrectly refer to the GFP virus as a new strain; it is not a viral strain and should be referred to as a reporter virus.

      Thank you. We changed the description to:

      “To confirm functional transcription and translation of the viral genome, a new reporter construct of RV designed to express GFP within the non-structural P150 gene was generated (RV-GFP, GenBank Accession OM816675)”

      Given that the authors show that Vero cell cultures are infected by the Rubella virus in the absence of other cells, additional evidence is needed to demonstrate that a diffusible factor from other cells enables microglia to be infected by the Rubella virus.

      We have revised the manuscript to indicate that our data is consistent with the possibility that a diffusible factor is involved. Our experiment utilizing transwell assay argues against phagocytosis and physical interactions as primary drivers, but future studies will be needed to determine if soluble factors are involved.

      The authors did not detect Rubella virus transcripts in the single-cell RNA sequencing experiment, nor was a microglia cluster found.

      Indeed, microglia recovery using scRNAseq is very inefficient. We note this limitation in the discussion.

      Innate immune responses can be activated in the presence of viral particles but without virus replication, as in inactivated viral vaccines; therefore changes in interferon responses do not necessarily prove virus replication.

      We agree with the reviewer on this point. It is difficult, if possible at all, to entirely exclude the possibility that some of the transcriptomic changes, particularly the interferon responses, are induced by exposure to viral particles rather than by viral replication. We have revised the manuscript to more rigorously describe the conditions as “RV-exposed”.

      Figure 4: it would be helpful to define the abbreviations used in the figure legend (e.g. IPC, RG, EN). In the volcano plots, the gene names are blocked by the dots, and the figure becomes very pixelated when enlarged to read the text.

      We have added abbreviations and replaced the figure files with higher resolution images (Figure 6 in current manuscript).

      The value of including Supplemental Figure 2 (MOG) is not clear because it receives little mention in the text and also seems to be previously published data that could be cited.

      We have removed the figure and replaced it with a citation and a link to the Cell Browser.

      Supplemental Figure 4: In panel G, the legend shows "YH10" and "13325". These terms are not described in the Figure legend, nor did a search of the manuscript identify these terms. In its current form Supp. Fig. 4G is not interpretable. In addition, it would be clearer to use the term "RV-infected" instead of "treated" to describe the addition of the virus.

      We have expanded the Methods section to include the description of different organoid lines and added a revised legend for Supplementary Figure 4. We do not provide evidence of RV infecting organoids without microglia, therefore we have revised the claims that organoid cells become infected with the virus and replaced it with “RV-exposed” to better reflect the conditions studied.

      Reviewer #3 (Recommendations for the authors):

      1) Demonstrate and quantify virus replication to provide data to complement the imaging. In order of data quality, plaque assays would be most convincing in demonstrating infection and release of infectious virus, while a time course of PCR on RV transcripts would support a conclusion of replicating virus. Further, staining with an anti-double-stranded RNA antibody (J2) would represent evidence of virus replication.

      In response to the reviewer’s comment, we performed an RV titering experiment in dissociated microglia co-cultured with other cell types, as well as in Vero cells as a control. While we could detect a robust increase in viral titer from Vero cells, the titer fell below detection levels in microglia co-cultures. We now include these data in Supplementary Figure 2D.

      Author response image 16.

      Rubella virus titering experiment performed in Vero cells (positive control) or dissociated microglia co-cultures. In primary microglia co-cultures, viral titer falls below detection levels after several days of infection.

      We detected a very modest increase in RV RNA in infected brain slices over time using RT-qPCR (see Author response image 17, not included in current manuscript).

      Author response image 17.

      Modest increase in RV RNA over time in brain slice infections. Rubella virus RNA measured by qPCR relative to GAPDH gene, in n=3 samples (2 technical replicates each condition). Brain slices were exposed to RV, then collected at end of inoculation (4 hours post infection), or at 3 or 5 days post infection, and processed for RNA extraction and RT-qPCR.
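      The legend above reports RV RNA relative to GAPDH but does not state how the normalization was computed. A minimal sketch of one common approach, the 2^-ΔCt calculation with technical replicates averaged first, is given below. This is an assumption for illustration, not the authors' stated method; the Ct values are hypothetical placeholders, not data from the study, and the function names are invented for this example.

```python
from statistics import mean

def relative_expression(ct_target, ct_reference):
    """Return 2^-(Ct_target - Ct_reference).

    Assumes roughly 100% amplification efficiency for both primer sets.
    """
    return 2.0 ** -(ct_target - ct_reference)

def sample_relative_expression(target_cts, reference_cts):
    """Average the technical replicates of one sample, then normalize."""
    return relative_expression(mean(target_cts), mean(reference_cts))

# Hypothetical Ct values for one sample with two technical replicates each.
# These numbers are placeholders chosen for illustration only.
rv_cts = [30.0, 30.2]
gapdh_cts = [20.0, 20.2]

example = sample_relative_expression(rv_cts, gapdh_cts)
print(f"RV RNA relative to GAPDH: {example:.6f}")
```

      With this form, each one-cycle decrease in the RV Ct relative to GAPDH corresponds to a doubling of the relative RV RNA level, so a modest increase over time appears as a small shift in ΔCt between time points.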

      Unfortunately, we did not find J2 staining informative because we could detect signal in both wild type RV infection conditions and in heat-inactivated RV, presumably due to native dsRNA species present in cells. We did not detect any increase or difference in the pattern of staining between RV and heat-inactivated virus-exposed conditions (Author response image 18; not included in the manuscript).

      Author response image 18.

      J2 antibody labels dsRNA in both RV-exposed and control heat-inactivated virus conditions, presumably due to native dsRNA that is not unique to viral replication.

      We utilized FISH to detect negative-stranded (non-genomic) RV RNA as an alternative to J2 to indicate RNA replication. However, it proved to be not very sensitive, as a small quantity of negative-strand RV RNA could be detected in highly infected Vero cells, but negative-strand RV RNA was not detected in more modestly infected microglia (based on positive-strand RV RNA quantification), as in Author response image 19, not included in current manuscript.

      Author response image 19.

      FISH probes to positive strand (genomic) and negative strand (replication template) RV RNA in Vero cells and microglia co-cultures. A: representative images of Vero cells infected with RV (top row) or Zika virus as control (bottom row). At 72hpi, cells were fixed and processed for immunofluorescence with anti-RV capsid antibody (RVcap) or Zika virus antibody (Zika4G2), and then FISH was performed using probes to positive strand (+) or negative strand (-) RV RNA. Negative strand RV RNA was difficult to visualize at low-power magnification and required quantification within cell borders defined by wheat germ agglutinin staining, with results in panel B. B: In Vero cells, negative strand RV RNA is detected in strongly infected cells. Infection strength was determined by intensity of RV capsid immunofluorescence staining and positive strand RV RNA (RVcap/(+) 2/3 indicates robust infection, RVcap/(+) 1 indicates weak infection). ZIKVinf = Zika virus infected control. C: In microglia co-cultures, positive strand RV RNA is detected in cells with RV capsid immunopositivity (RVcap_pos). RVinf = RV infected. RVHI = heat-inactivated RV. D: In microglia co-cultures, negative strand RV RNA quantification is not significantly different between mock, heat-inactivated RV (RVHI), or RV-infected conditions (RVinf), including cells with weak positive-strand RV RNA (RVinf, (+)<8) or cells with stronger positive-strand RV RNA (RVinf, (+)>=8). Two biological replicates (bHR60 and bHR61), n indicates number of cells counted.

      While we could not detect an increase in viral particles from microglia mixed cultures, we confirmed the presence of GFP from the RV-GFP reporter construct, and we believe this serves as proof that the virus can infect microglia and lead to production of functional viral protein (see Author response image 20, Figure 1E-F of the current manuscript).

      Author response image 20.

      Thus, overall we detect replication of viral RNA and protein (qPCR, RV-GFP), but not an appreciable increase in released newly-made virions. The discussion now reflects this more clearly in the current manuscript.

      2) The claim of requiring a diffusible factor to enable RV infection requires additional data. A suggestion would be to include further characterization of affinity-purified cells to define the levels of cell enrichment and to determine which other cell types are present. It is also important to test the RV infection of the fractionated cell types alone before adding to the microglia, in order to demonstrate whether RV is replicating in cell types other than microglia.

      We performed quantifications of RV capsid-positive cells in each of the affinity-purified cell populations: neuron-enriched (purified with PSA-NCAM beads), glia-enriched (PSA-NCAM depleted cell fraction), or non-microglia fraction (“Flow through”, depleted of CD11b+ cells). We show that across each condition, we have low infectivity (ranging from ~1 to 4% of total cell population) after 72 hours post-infection. We include these data in Supplementary Figure 3.

      Author response image 21.

      Rubella infection in non-microglia cells. A. Representative images of different cell types depleted of microglia. Cell cultures were stained for RV capsid (green) and DAPI. B. Quantification of total cells that are positive for RV capsid across conditions. C. Quantification of RV+ cells that are not microglia across different cell populations. No statistically significant difference was detected in RV infectivity in cells co-cultured with or without microglia.

      Another approach to limit cell heterogeneity would be to use iPSC-derived cells, which are highly enriched for a single, specific cell type, to test the requirement for additional cell types to achieve RV infection of microglia.

      In our prior publication (Popova et al. 2021), we identified a number of molecular differences between primary and iPSC-derived microglia. iPSC-derived microglia-like cells could show differences in infection tropism from primary microglia, and those results may be difficult to interpret biologically. We agree with the reviewer that iPSC-derived cells would be an interesting model. However, there are now several distinct protocols for deriving microglia-like cells from pluripotent stem cells, and we feel that embarking on a protocol comparison project would fall outside the scope of the current manuscript.

      3) Consider a longer organoid infection. The authors did not identify viral RNA transcripts in their organoid scRNAseq data after a 72-hour infection. Although the 72-hour time point seems right for cells in 2D culture, it’s possible that the infection in the organoids is slower because the virus has to spread inwardly. It would be worth trying a time course out to 2 weeks, collecting organoids every few days and then imaging and doing PCR or plaque assays. Zoomed-out views that show immunofluorescence of the entire organoid would also be beneficial in assessing organoid quality and immunofluorescent staining to identify cell types.

      We performed longer RV infection for two weeks and now present data on RV capsid in microglia in 72 hrs and 2 weeks post-infection (Author response image 22, Figure 4C of the current manuscript). We have also validated one of the scRNAseq-generated gene candidates in combination with different cell type markers and present data on whole organoids immunostained with NeuN for neurons and EOMES for intermediate progenitor cells that demonstrate the overall structure of the organoids (Author response image 23; Figure 6 of the current manuscript).

      Author response image 22.

      Microglia in organoids co-localize with RV capsid staining. Organoids with microglia were exposed to RV for 72 hrs or two weeks.

      Author response image 23.

      Organoids labeled with splice regulator NOVA1 (magenta), neuronal marker NeuN (green) and intermediate progenitor cell marker EOMES (cyan).

    1. Author Response

      The following is the authors’ response to the original reviews.

      We are grateful to the reviewers for their constructive comments. The following are our point-by-point responses.

      Reviewer #1 (Recommendations For The Authors):

      Point 1- Abstract: advanced morning peak « opposite » to pdf/pdfr mutants. To my knowledge, the alteration of PDF/PDFR suppresses the morning peak. I am not sure that an advance of the peak is « opposite » to its inhibition?

      Mutants with disruptions in CNMa or CNMaR display advanced morning activity, indicating an enhanced state. Mutants with disruptions in Pdf or Pdfr exhibit no morning anticipation, suggesting a promoting role of these genes in morning anticipation. Therefore, our revised version is: “Specific elimination of each from clock neurons revealed that loss of the neuropeptide CNMa in two posterior dorsal clock neurons (DN1ps) or its receptor (CNMaR) caused advanced morning activity, indicating a suppressive role of CNMa-CNMaR on morning anticipation, opposite to the promoting role of PDF-PDFR on morning anticipation.” (Line 43-51)

      Point 2- Fig 1K-L: the authors should show the sleep phenotype of the homozygous nAChRbeta2 mutant (if not lethal) for a direct comparison with the FRT/FLP genotype and thus evaluate the efficiency of the system.

      We have incorporated sleep profiles of the nAChRbeta2 mutant and w1118 into Fig 1K-L. nAChRbeta2 mutants (red) exhibited a sleep amount comparable to that of pan-neural nAChRbeta2 knockout flies (dark red), as shown below.

      Author response image 1.

      Point 3- Dh31-EGFP-FRT expression patterns look different in figS1 A (or fig1 H) and J. why that?

      We re-examined the original data. Both samples (with R57C10-GAL4; Fig. S1A, right, and Fig. S1J, left) are Dh31EGFP.FRT samples, displayed below, and show consistent primary expression subsets. Any observed disparities in region "e" could potentially be attributed to variations during dissection.

      Author response image 2.

      Point 4- The knockdown experiments with the elav-switch (RU486) system (fig S2) do not seem to be as efficient as the HS-FLP system (fig 1H-J). The conclusions on the efficiency should be toned down.

      We have revised accordingly: "Near Complete Disruption of Target Genes by GFPi and Flp-out Based cCCTomics" (Line 130): "Knocking out at the adult stage using either hsFLP driven Flp-out (Golic and Lindquist, 1989) (Fig. 1H-1J) or neural (elav-Switch) driven shRNAGFP (Nicholson et al., 2008; Osterwalder et al., 2001) (Fig. S2A-S2I), also resulted in the elimination of most, though not all, GFP signals." (Line 145-149)

      Point 5- Fig 2H-J: the LD behavioral phenotype of pdfr pan-neuronal CRISPR does not seem to correspond to what is described in the literature for the pdfr mutant (han), see hyun et al 2005 (no morning anticipation and advanced evening peak). I understand that the activity index is lower than controls but fig2H shows a large anticipatory activity that seems really unusual, and no advanced evening peak is observed. I think that the authors should show the CRISPR flies and pdfr mutants together, to better compare the phenotypes.

      Thank you for pointing out that, in flies with pan-neuronal knockout of Pdfr by unmodified Cas9 (Fig. 2H-2I of the previous version), morning anticipation still exists (Fig. 2H of the previous manuscript), and that the significant decrease in the morning anticipation index (Fig. 2I of the previous manuscript) and the advanced evening activity are not as pronounced as those observed in han5304 (Fig. 3C in Hyun et al., 2005).

      First, we have separated the activity plots of Fig. 2H of the previous manuscript, as shown below. In Pdfr knockout flies, activity tends to decrease from ZT18 to ZT21 and to increase from ZT21 to ZT24. The lowest activity before dawn occurs at about ZT21, and the activity at ZT18 is comparable to the activity at ZT24. This is significantly different from the two control groups, whose activity tends to increase from ZT18 to ZT24 with an activity peak at ZT24.

      The activity from ZT6 to ZT12 increased much faster in Pdfr knockout flies and reached a plateau at about ZT11, whereas the two control groups showed a slower increase from ZT6 to ZT12 with no plateau but an activity peak at ZT12.
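      The pre-dawn pattern described above can be summarized numerically. The sketch below is illustrative only: the response does not define how its morning anticipation index was computed, so the code assumes one commonly used form (activity in the last 3 h before lights-on divided by activity in the last 6 h) and hourly activity bins starting at ZT18. The activity values and function names are hypothetical placeholders, not data from the study.

```python
def anticipation_index(hourly_activity):
    """Activity in the last 3 h divided by activity in the whole 6 h window.

    Assumed definition: values above 0.5 mean activity rises toward the
    light transition. `hourly_activity` holds six hourly bins.
    """
    if len(hourly_activity) != 6:
        raise ValueError("expected six hourly bins")
    total = sum(hourly_activity)
    if total == 0:
        return float("nan")
    return sum(hourly_activity[3:]) / total

def trough_zt(hourly_activity, start_zt=18):
    """Return the ZT at which the lowest-activity hourly bin begins."""
    lowest_bin = min(range(len(hourly_activity)), key=hourly_activity.__getitem__)
    return start_zt + lowest_bin

# Hypothetical hourly counts for ZT18-ZT24 (placeholders for illustration).
control = [1, 2, 3, 4, 5, 6]    # steadily rising toward ZT24
knockout = [5, 3, 2, 1, 3, 5]   # falls until about ZT21, then recovers

print(anticipation_index(control), anticipation_index(knockout))
print(trough_zt(knockout))
```

      A V-shaped profile of this kind gives a lower index than a steadily rising one even though activity before dawn is still present, which matches the distinction drawn above between a reduced index and a complete loss of anticipation.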

      Author response image 3.

      Second, we have compared the phenotype of the Pdfr mutant we previously generated (Pdfr-attpKO; Deng et al., 2019) with that of Pdfr pan-neuronal knockout by Cas9.HC. This mutant lacks all seven transmembrane regions of Pdfr (a). The phenotypes of Pdfr-attpKO flies and Pdfr pan-neuronal knockout flies are very similar. In this experimental repeat, we observed a much more obvious advanced evening activity peak in both pan-neuronal knockout flies and Pdfr-attpKO flies.

      To further analyze the phenotypes of Pdfr pan-neuronal knockout flies by Cas9.HC, we referred to the literature. The activity pattern at ZT18 to ZT24 (activity tends to decrease from ZT18 to ZT21 and tends to increase from ZT21 to ZT24, with the lowest activity before dawn occurring at about ZT21, and activity at ZT18 comparable to activity at ZT24) is also reported in Pdfr knockout flies, such as in Fig. 3C and 3H in Hyun et al., 2005, Fig. 2B in Lear et al., 2009, Fig. 3B in Zhang et al., 2010, Fig. 5A in Guo et al., 2014, and Fig. 5B in Goda et al., 2019. Additionally, the less pronounced advanced evening activity peak compared to han5304 (Fig. 3C in Hyun et al., 2005) is also reported in Fig. 2B in Lear et al., 2009, Fig. 3B in Zhang et al., 2010, and Fig. 5B in Goda et al., 2019. We consider that this difference is more likely to be caused by environmental conditions or recording strategies (DAM system vs. video tracing).

      Therefore, we revised the text to: “Pan-neuronal knockout of Pdfr resulted in a tendency towards advanced evening activity and weaker morning anticipation compared to control flies (Fig. 2H-2I), which is similar to Pdfr-attpKO flies. These phenotypes were not as pronounced as those reported previously, when han5304 mutants exhibited a more obvious advanced evening peak and no morning anticipation (Hyun et al., 2005)”.

      Author response image 4.

      Point 6-The authors should provide more information about the DD behavior (power is low, but how about the period of rhythmic flies, which is shortened in pdf (renn et al) and pdfr (hyun et al) mutants).

      We have incorporated period data into Fig. 2I. Indeed, conditional knock out of Pdfr by Cas9.HC driven by R57C10-GAL4 shortens the period length, as shown below (previous data), also in Fig. 2I of the revised version.

      In the revised Fig. 2I, we tested 45 Pdfr-attpKO flies during DD condition (3 out of 48 flies died during video tracing in DD condition), and only one fly was rhythmic. In contrast, 9 out of 48 Pdfr pan-neuronal knockout flies were rhythmic.

      Author response image 5.

      Point 7- P15 and fig6. The authors indicate that type II CNMa neurons do not show advanced morning activity as type I do, but Figs 6 I and K seem to show some advance although less important than type I. I am not sure that this supports the claim that type I is the main subset for the control of morning activity. This should be toned down.

      We have re-organized Fig. 6 and revised the summary of these results as: “However, Type II neurons-specific CNMa knockout (CNMa ∩ GMR91F02) showed weaker advanced morning activity without advanced morning peak (Fig. 6N), while Type I neurons-specific CNMa knockout did (Fig. 6J), indicating a possibility that these two type I CNMa neurons constitute the main functional subset regulating the morning anticipation activity of fruit fly”. (Line 400-405)

      Point 8- Figs 6M and N: is power determined from DD data? if yes, how about the period and arrhythmicity? Please also provide the LD activity profiles for the mutants and rescued pdfr genotypes.

      Yes, the power was determined from the DD data. In the new version of the manuscript, we have included the activity plots for the LD phase in supplementary Fig S13, as well as shown below (A, B), and the period and arrhythmicity data for the DD phase in Fig. 6S and Table S7. We have also refined the related description as follows: “Moreover, knocking out Pdfr by GMR51H05, GMR79A11 and CNMa GAL4, which cover type I CNMa neurons, decreased morning anticipation of flies (Fig. 6T, Fig. S13B). However, the decrease in morning anticipation observed in the Pdfr knockout by CNMa-GAL4 was not as pronounced as with the other two drivers. Because the presumptive main subset of functional CNMa is also PDFR-positive, there is a possibility that CNMa secretion is regulated by PDF/PDFR signal”. (Line 413-419)

      Author response image 6.

      Point 9- Fig 7: does CNMaR affect DD behavior? This should be tested.

      We analyzed CNMaR-/- activity in the dark-dark (DD) condition over a span of six days. The results revealed a higher power in CNMaR mutants than in control flies (Power: 93.5±41.9 (CNMaR-/-, n=48) vs 47.3±31.6 (w1118, n=47); Period: 23.7±0.3 h (CNMaR-/-, n=46) vs 23.7±0.3 h (w1118, n=47); arrhythmic rate: 2/48 (CNMaR-/-) vs 0/47 (w1118)). However, because mutating CNMa had no obvious effect on DD behavior, any effect of CNMaR on DD behavior cannot be attributed to CNMa signaling, so we did not repeat or further analyze the DD behavior of CNMaR mutants. We believe this raises a separate question beyond the scope of the current study.

      Reviewer #2 (Recommendations For The Authors):

      Point 1-One major concern is the apparent discrepancies in clock network gene expression using the Flp-Out and split-LexA approaches compared to what is known about the expression of several transmitter and peptide-related genes. For example, it is well established that the 5th-sLNv expresses CHAT (along with a single LNd), yet there appears to be no choline acetyltransferase (ChAT) signal in the 5th-sLNv as assayed by the Split-LexA approach (Fig. 4). This approach also suggests that DH31 is expressed in the s-LNvs, which, as one of the most intensely studied clock neuron are known to express PDF and sNPF, but not DH31. The results also suggest that the sLNvs express ChAT, which they do not. Remarkably PDF is not included in the expression analysis, this peptide is well known to be expressed in only two subgroups of clock neurons, and would therefore be an excellent test case for the expression analysis in Fig. 4. PDF should therefore be added to analysis shown in Fig. 4. Another discrepancy is PdfR, which split LexA suggests is expressed in the Large LNvs but not the small LNvs, the opposite of what has been shown using both reporter expression and physiology. The authors do acknowledge that discrepancies exist between their data and previous work on expression within the clock network (lines 237 and 238). However, the extent of these discrepancies is not made clear and calls into question the accuracy of Flp-Out and Split LexA approaches.

      The concerns mentioned above are:

      (1) sLNvs express PDF and sNPF but not Dh31;

      (2) ChAT presents in 5th-sLNv and one LNd but not in other sLNvs;

      (3) PDFR presents in sLNvs but not l-LNvs.

      (4) PDF is not included in the analysis.

      To verify the accuracy of these intersection analyses, all of which relate to PDF-positive neurons (except the 5th s-LNv and LNds), we stained for PDF and examined the co-localization between PDF-positive LNvs and the respective drivers ChAT-KI-LexA, Pdfr-KI-LexA, Dh31-KI-LexA, and Pdf-KI-LexA.

      First, Dh31-KI-LexA labeled four s-LNvs, as shown below (also in Fig. S9A). Therefore, the results of the intersection analysis of Dh31-KI-LexA with Clk856-GAL4 are correct. We attribute the difference from previous literature to Dh31-KI-LexA labeling different neurons than the previous driver or antibody.

      Second, no s-LNv was labeled by ChAT-KI-LexA, as shown below. We rechecked our intersection data and found that, of the 10 ChAT-KI-LexA∩Clk856-GAL4 brains we analyzed, only two showed positive s-LNvs. To enhance the accuracy of the intersection analysis results, we marked all positive signal records in which positive subsets were found in less than 1/3 of the total analyzed brains (Table S4).

      Third, one l-LNv and at least two s-LNvs were labeled by Pdfr-KI-LexA, as shown below (also in Fig. S9B). Fourth, Pdf-KI-LexA labels all PDF-positive neurons, but the intersection analysis by Pdf-KI-LexA and Clk856-GAL4 only showed scattered signals, as shown below (D, also in Fig. S9C). In these cases, some expected positive signals were not observed in our dissection. A possible reason is the inefficiency of LexAop-FRT-myr::GFP driven by LexA. Therefore, our intersection results are likely to miss some positive signals.
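      The one-third rule used to mark uncertain records in Table S4 can be sketched as follows. This is an illustrative reading of the rule as described in this response, not the authors' actual procedure: the function name is hypothetical, and records with no positive brains are treated as negative rather than marked.

```python
from fractions import Fraction

def is_marked_unstable(positive_brains, total_brains, threshold=Fraction(1, 3)):
    """Return True when a positive record should be marked as unstable.

    A record is marked when the subset was positive in at least one brain
    but in fewer than `threshold` of all analyzed brains.
    """
    if total_brains <= 0:
        raise ValueError("total_brains must be positive")
    if not 0 <= positive_brains <= total_brains:
        raise ValueError("positive_brains must lie between 0 and total_brains")
    if positive_brains == 0:
        return False  # no positive signal, so there is nothing to mark
    return Fraction(positive_brains, total_brains) < threshold

# Case described in the response: s-LNvs positive in 2 of 10 brains.
print(is_marked_unstable(2, 10))
```

      Exact fractions are used so that a record at exactly one third of brains (for example 3 of 9) is not marked because of floating-point rounding.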

      Author response image 7.

      Finally, we revised the text to (Line 286-317):

      To assess the accuracy of expression profiles using CCT drivers, we compared our dissection results with previous reports. Initially, we confirmed the expression of CCHa1 in two DN1s (Fujiwara et al., 2018), sNPF in four s-LNvs and two LNds (Johard et al., 2009), and Trissin in two LNds (Ma et al., 2021), aligning with previous findings. Additionally, we identified the expression of nAChRα1, nAChRα2, nAChRβ2, GABA-B-R2, CCHa1-R, and Dh31-R in all or subsets of LNvs, consistent with suggestions from studies using ligands or agonists in LNvs (Duhart et al., 2020; Fujiwara et al., 2018; Lelito and Shafer, 2012; Shafer et al., 2008) (Table S4).

      Regarding previously reported Nplp1 in two DN1as (Shafer et al., 2006), we found approximately five DN1s positive for Nplp-KI-LexA, indicating a broader expression than previously reported. A similar pattern emerged in our analysis of Dh31-KI-LexA, where four DN1s, four s-LNvs, and two LNds were identified, contrasting with the two DN1s found in immunocytochemical analysis (Goda et al., 2016). Colocalization analysis of Dh31-KI-LexA and anti-PDF revealed labeling of all PDF-positive s-LNvs but not l-LNvs (Fig S9A), suggesting that the differences may arise from the broader labeling of 3' end knock-in LexA drivers or the amplitude effect of the binary expression system. The low protein levels might go undetected in immunocytochemical analysis. This aligns with transcriptome analysis findings showing Nplp1 positive in DN1as, a cluster of CNMa-positive DN1ps, and a cluster of DN3s (Ma et al., 2021), which is more consistent with our dissection.

      Despite the well-known expression of PDF in LNvs and PDFR in s-LNvs (Renn et al., 1999; Shafer et al., 2008), we did not observe stable positive signals for both in Flp-out intersection experiments, although both Pdf-KI-LexA and Pdfr-KI-LexA label LNvs as expected (Fig S9B-S9C). We also noted fewer positive neurons in certain clock neuron subsets compared to previous reports, such as NPF in three LNds and some LNvs (Erion et al., 2016; He et al., 2013; Hermann et al., 2012; Johard et al., 2009; Lee et al., 2006) and ChAT in four LNds and the 5th s-LNv (Johard et al., 2009; Duhart et al., 2020) (Table S4). We attribute this limitation to the inefficiency of LexAop-FRT-myr::GFP driven by LexA, acknowledging that our intersection results may miss some positive signals.

      Point 2-Related to this, the authors rather inaccurately suggest that the field's understanding of PdfR expression within the clock neuron network is "inconsistent" and "variable" (lines 368-377). This is not accurate. It is true that the first attempts to map PdfR expression with antisera and GAL4s were inaccurate. However, subsequent work by several groups has produced strong convergent evidence that with the exception of the l-LNvs after several days post-eclosion, PdfR is expressed in the Cryptochrome expressing a subset of the clock neuron network. This section of the study should be revised.

      We thank the reviewer for pointing this out. As we have already addressed and revised the related part in the RESULTS section (Line 308-317), we have now removed this part from the DISCUSSION section of the revised version.

      Point 3-One minor issue that would avoid unnecessary confusion by readers familiar with the circadian literature is the way that activity profiles are plotted in the study. The authors have centered their averaged activity profiles on the 12h of darkness. This is the opposite of the practice of the field, and it leads to some initial confusion in the examination of the morning and evening peak data. The authors may wish to avoid this by centering their activity plots on the 12h light phase, which would put the morning peak on the left and the evening peak on the right. This is the way the field is accustomed to examining locomotor activity profiles.

      The averaged activity profiles are centered on the 12 h of darkness to highlight the phenotype of advanced morning activity. To prevent any confusion among readers, we have included a sentence in the figure legend explaining how our activity profiles differ from those in the previous literature: "Activity profiles were centered on the 12 h of darkness in all figures, with evening activity on the left and morning activity on the right, which differs from the general circadian literature." (Fig. 2H legend; Line 957-959)

      Point 4-The authors conclude that the loss of PDF and CNMa have opposite effects on the morning peak of locomotor activity (line 392). But they also acknowledge, briefly, that things are not that simple: loss of CNMa causes a phase advance, but loss of PDF causes a loss or reduction in the anticipatory peak. It is still significant to find a peptide transmitter with the clock neuron network that regulates morning activity, but the authors should revise their conclusion regarding the opposing actions of PDF and CNMa, which is not well supported by the data.

      We have revised the relevant parts.

      ABSTRACT: “Specific elimination of each from clock neurons revealed that loss of the neuropeptide CNMa in two posterior dorsal clock neurons (DN1ps) or its receptor (CNMaR) caused advanced morning activity, indicating a suppressive role of CNMa-CNMaR on morning anticipation, opposite to the promoting role of PDF-PDFR on morning anticipation.” (Line 43-48)

      DISCUSSION: “Furthermore, given that the morning anticipation vanishing phenotype of Pdf or Pdfr mutant indicates a promoting role of PDF-PDFR signal, while the enhanced morning anticipation phenotype of CNMa mutant suggests an inhibiting role of CNMa signal, we consider the two signals to be antagonistic.” (Line 492-495)

      Point 5-The authors should acknowledge, cite, and incorporate the substantive discussion of CNMa peptide and the DN1p neuronal class in Reinhard et al. 2022 (Front Physiol. 13: 886432).

      We have revised the text accordingly and cited this paper: “Type I with two neurons whose branches projecting to the anterior region, as in CNMa∩GMR51H05, CNMa∩Pdfr, and CNMa∩GMR79A11 (Fig. 6E, 5G, 6H), and type II with four neurons branching on the posterior side with few projections to the anterior region, as in CNMa∩GMR91F02 (Fig. 6F). These two types of DN1ps’ subsets were also reported and profound discussed previously (Lamaze et al., 2018; Reinhard et al., 2022)”. (Line 393-397)

      Reviewer #3 (Recommendations For The Authors):

      Point 1-Throughout the manuscript figure legends (axis, genotypes, etc) are too small to be appreciated. Fig. 1. Panel A. The labels are very difficult to read.

      We have attempted to enlarge the font as much as possible in the revised version.

      Point 2-Fig. 1. H-J Why is efficiency not mentioned in all the examples?

      The results of Fig. 1H-1J are now discussed in the revised manuscript (Line 145-147). We did not calculate an exact efficiency because the GFP intensity is not stable enough: it may change during dissection or mounting, or with the laser intensity used in our experimental process. Therefore, for all results related to GFP signal (Fig. 1B-1J, Fig. S1, Fig. S2, Fig. 2B-2D), we relied on qualitative rather than quantitative judgment, unless the GFP signal was easily quantifiable (such as in cases with limited cells or no GFP signal in the experimental group).

      Point 3-Fig. 1. Panel L, left (light phase): the statistical comparisons are not clearly indicated (the same happens in Figs 3Q and 3R).

      We have now re-arranged Fig. 1L and Fig. 3Q-3R to make the statistical comparisons clear in the new version.

      Point 4-Line 792. Could induced be introduced?

      Yes, we have now corrected this typo.

      Point 5-Fig. S1. Check labels for consistency. GMR57C10 Gal4 driver is most likely R57C10.

      We have now revised the labels (Fig. S1).

      Point 6-Fig. S2. If the experiments were repeated and several brains were observed, the authors should include the efficiency and the number of flies as reported in Fig. S1.

      We have now added the number of flies in Fig. S2, as reported in Fig. S1. As mentioned in our response to Point 2, the instability of the GFP signal prevents us from providing a quantitative efficiency in this context.

      Point 7-Fig S4. The fig legend describes panels I-J which are not shown in the current version of the manuscript.

      We have now deleted them.

      Point 8-Fig 2I. Surprising values for morning anticipation indexes even for controls (0.5 would indicate "no anticipation"; in controls, the expected values would be >>0.5, as most of the activity is concentrated right before the transition). Could the authors explain this unexpected result?

      We have revised the description of the calculation in the methods section (Line 612). After calculating the ratio of activity in the last three hours to total activity over the six hours, we subtract 0.5 from the result. Therefore, the index is always ≤0.5, and an index of 0 indicates no morning anticipation.
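      For concreteness, the index calculation described above can be sketched as follows. This is only a minimal illustration, assuming hourly activity counts for the six hours before the transition; the function name and the handling of a fly with no recorded activity are assumptions made for the example and are not taken from the analysis pipeline.

```python
def anticipation_index(activity_6h):
    """Return (activity in last 3 h / activity in all 6 h) - 0.5.

    activity_6h: six hourly activity counts preceding the light
    transition, ordered from earliest to latest (assumed binning).
    """
    if len(activity_6h) != 6:
        raise ValueError("expected exactly six hourly activity values")
    total = sum(activity_6h)
    if total == 0:
        # Assumption: the index is undefined for a fly with no activity.
        return float("nan")
    return sum(activity_6h[3:]) / total - 0.5
```

      With this definition, activity spread evenly over the six hours gives 0 (no anticipation), and activity confined to the last three hours gives the maximum of 0.5.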

      Point 9-Fig 2K/L. The authors mention that not all genes are effectively knocked out with their strategy. Could this be accounted for by the specific KD strategy, its duration, or the promoter strength? It is surprising no explanation is provided in the text (page 9 line 179).

      In our pursuit of a broadly effective method for gene editing, Fig. 2H-2L and Fig. 2D revealed that previous attempts have fallen short of this objective. The observed inefficiency may be attributed to the strength of the promoter, resulting in inadequate expression. Alternatively, an insufficient duration of manipulation may also contribute to the lack of success. However, in sleep and rhythm research applications, the age at which flies are tested is typically fixed, limiting the potential to enhance efficiency by extending the manipulation time. Moreover, increasing the expression level may pose challenges related to cytotoxicity, as reported in previous studies (Port et al., 2014). We refrain from offering specific explanations in the text, as we lack a definitive plan and cannot provide additional robust evidence to support the above speculations. Consequently, in our ongoing efforts, we aim to enhance the efficiency of the tool system while operating within the current constraints.

      Point 10-Page 9, line 179. Can the authors include a brief description of the reason for the different modifications? Only one was referenced.

      We have revised the related part in the manuscript (Line 223-231):

      Cas9.M9: We fused a chromatin-modulating peptide (Ding et al., 2019), HMGN1 (High mobility group nucleosome binding domain 1), at the N-terminus of Cas9 and HMGB1 (High mobility group protein B1) at its C-terminus with GGSGP linker, termed Cas9.M9.

      Cas9.M6: We also obtained a modified Cas9.M6 with HMGN1 at the N-terminus and an undefined peptide (UDP) at the C-terminus. (Note: the UDP was obtained by accident.)

      Cas9.M0: We replaced the STARD linker between Cas9 and NLS in Cas9.HC with the GGSGP linker (Zhao et al., 2016), termed Cas9.M0.

      Point 11-The authors tested the impact of KO nAChR2 across the different versions of conditional disruption (Fig 1K-L, Fig 2L, Fig 3R). It is surprising they observe a difference in daytime sleep upon knocking down with Cas9.HC (2L) but not with Cas9.M9 (3R) and the reverse is seen for night-time sleep. Could the authors provide an explanation? Efficiency is not the issue at stake, is it?

      In Fig. 2K, the day sleep of flies (R57C10-GAL4/UAS-sgRNAnAChRbeta2; UAS-Cas9/+) was significantly decreased compared to flies (R57C10-GAL4/UAS-sgRNAnAChRbeta2; +/+), but not when compared to flies (R57C10-GAL4/+; UAS-Cas9/+). Our criterion for asserting a difference is that the experimental group must show a significant distinction from both control groups. Therefore, we concluded that there was no significant difference between the experimental group and the control groups in Fig. 2K.
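      The decision rule stated above can be written out explicitly. The sketch below is illustrative only: it assumes that each comparison has already been reduced to a p-value, and the function name, argument names, and the 0.05 threshold are assumptions made for the example.

```python
def differs_from_both_controls(p_vs_gal4_control, p_vs_sgrna_control, alpha=0.05):
    """Report a difference only if the experimental group differs
    significantly from BOTH parental controls (alpha is an assumed threshold)."""
    return p_vs_gal4_control < alpha and p_vs_sgrna_control < alpha
```

      Under this rule, a result that is significant against one control but not the other, as in the day-sleep comparison discussed above, is reported as no difference.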

      Point 12-Fig. 4. Which of the two strategies described in A-B was employed to assemble the expression profile of CCT genes in clock neurons shown in C? This information should be part of the fig legend.

      We have now revised the legend as follows: “(A-B) Schematic of intersection strategies used in Clk856 labelled clock neurons dissection, Flp-out strategy (A) and split-LexA strategy (B). The exact strategy used for each gene is annotated in Table S5.”

      Point 13-Similarly, how many brains were analyzed to give rise to the table shown in C?

      We have now revised the legend of Table S4 to address this concern. As indicated in: “The largest N# for each gene in Table S4 is the brain number analyzed for each gene”.

      Point 14-Finally, the sentence "The figure is..." requires revision.

      We have now revised it: “The exact cell number for each subset is annotated in Table S4”.

      Point 15-Legend to Table S3. The authors have done an incredible job testing many gRNAs for each gene potentially relevant for communication. However, there is very little information to make the most out of it; for instance, the legend does not inform why many of the targeted genes do not appear to have been tested any further. It would be useful to the reader to discern whether despite being the 3 most efficient gRNAs, they were still not effective in targeting the gene of interest, or whether they showed off-targets, or it was simply a matter of testing the educated guesses. This information would be invaluable for the reader.

      First, we designed and generated transgenic UAS-sgRNA fly lines for all these sgRNAs. We randomly selected 14 receptor genes, known for their difficulty in editing based on our experience, to assess the efficiency of our strategy, as depicted in Fig. 3M-3P, Fig. S5, and Fig. S6. We believe these results are representative and indicative of the efficiency of sgRNAs designed using our process and applied with the modified Cas9.

      Secondly, we acknowledge your valid concern. While we selected sgRNAs with no predicted off-target effects through various prediction models (outlined in the Methods under C-cCCTomics sgRNA design), we did not conduct whole-genome sequencing. Consequently, we can only assert that the off-target possibility is relatively low. To address potential misleading effects arising from off-target concerns, it is essential to validate these results through mutants, RNAi, or alternative UAS-sgRNAs targeting the same gene.

      Point 16-Table S4. Some of the data presented derives from observations made in 1-2 brains for a specific cluster; isn't it too little to base a decision on whether a certain gene is (or not) expressed? It is surprising since the same CCT line was observed/analysed in more brains for other clusters. Can the authors explain the rationale?

      N# represents the number of GFP-positive brains, and we have revised the legend of Table S4 accordingly. The largest N# denotes the total number of brains analyzed for a specific CCT line. It is possible that, due to variations in our dissection or mounting process, some clusters were only observed in 1-2 brains out of the total brains analyzed. To enhance the accuracy of the intersection analysis results, we marked all positive signal records in which positive subsets were found in less than 1/3 of the total analyzed brains (Table S4).
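      The marking rule can be illustrated with a short sketch. It assumes that, for one CCT line, the N# value of each clock-neuron subset is available as a count, and that the largest N# is taken as the total number of brains analyzed, as described above; the function name and the example cluster labels are hypothetical.

```python
def flag_sparse_subsets(n_positive_by_subset):
    """Return the subsets whose GFP-positive count is above zero but
    below one third of the total brains analyzed for this line.

    n_positive_by_subset: dict mapping subset name -> N# for one CCT line.
    The total number of brains is assumed to equal the largest N#.
    """
    total_brains = max(n_positive_by_subset.values())
    return {
        subset: n
        for subset, n in n_positive_by_subset.items()
        # Integer comparison avoids floating-point issues with total / 3.
        if 0 < n and 3 * n < total_brains
    }
```

      For example, with counts of 9, 2, 3, and 0 across four subsets, only the subset seen in 2 of 9 brains would be marked.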

      Point 17-The paragraph describing this data in the results section needs revising (lines 233-243).

      We have now revised this. (Line 286-317)

      Point 18-While it is customary for authors to attempt to improve the description of the activity patterns by introducing new parameters (i.e. MAPI and EAPI, lines 253-258), it would be interesting to understand the difference between the proposed method and the one already in use (which compares the same parameter, i.e., the slope, defined as "the slope of the best-fitting linear regression line over a period of 6 h prior to the transition"; e.g., Lamaze et al. 2020 and many others). Is there a need to introduce yet another one?

      This approach is necessary. The slope defined by Lamaze et al. utilizes data from only 2 time points, which may not accurately capture the pattern within a period before light on or off. Linear regression is not well-suited for a single fly due to the high variability in activity at each time point, making it challenging to fit the model at the individual level. The parameters we have introduced (MAPI and EAPI) in this paper are concise and can be applied at the individual level, effectively reflecting the morning or evening anticipation characteristics of each fly.

      As an alternative, the activity plot of a certain fly line could be represented by an average of all flies' activity in one experiment. This would make linear regression easier to fit. However, several independent experiments are required for statistical robustness, necessitating the inclusion of hundreds of flies for each strain in a single analysis.

      Point 19-In general, the legends of supplementary figures are a bit too brief. S7 and S8: it is not clear which of the two intersectional strategies were used (it would benefit whoever is interested in replicating the experiments). Legend to Fig S8 should read "similar to Fig S7".

      We have now revised the legend and included “The exact strategy used for each gene is annotated in Table S5” in the legend.

      Point 20-The legend in Table S6 should clearly state the genotypes examined. What does the marking in bold refer to?

      We have now revised the annotation of Table S6. Bold marking refers to results that fall more than one SD from the control group.

      Point 21-Line 314. The sentence needs revision.

      We have revised these sentences.

      Point 22-Line 391 (and also in the results section). The authors attempt to describe the CNMa phenotype as the opposite of pdf/pdfr mutant phenotypes. However, no morning anticipation/advanced morning anticipation are not necessarily opposite phenotypes.

      We have revised the related description.

      ABSTRACT: “Specific elimination of each from clock neurons revealed that loss of the neuropeptide CNMa in two posterior dorsal clock neurons (DN1ps) or its receptor (CNMaR) caused advanced morning activity, indicating a suppressive role of CNMa-CNMaR on morning anticipation, opposite to the promoting role of PDF-PDFR on morning anticipation.” (Line 43-48)

      DISCUSSION: “Furthermore, given that the morning anticipation vanishing phenotype of Pdf or Pdfr mutant indicates a promoting role of PDF-PDFR signal, while the enhanced morning anticipation phenotype of CNMa mutant suggests an inhibiting role of CNMa signal, we consider the two signals to be antagonistic.” (Line 492-495)

      Reference

      Deng, B., Li, Q., Liu, X., Cao, Y., Li, B., Qian, Y., Xu, R., Mao, R., Zhou, E., Zhang, W., et al. (2019). Chemoconnectomics: mapping chemical transmission in Drosophila. Neuron 101, 876-893.e874.

      Ding, X., Seebeck, T., Feng, Y., Jiang, Y., Davis, G.D., and Chen, F. (2019). Improving CRISPR-Cas9 genome editing efficiency by fusion with chromatin-modulating peptides. CRISPR J 2, 51-63.

      Duhart, J.M., Herrero, A., de la Cruz, G., Ispizua, J.I., Pírez, N., and Ceriani, M.F. (2020). Circadian Structural Plasticity Drives Remodeling of E Cell Output. Curr Biol 30, 5040-5048.e5045.

      Erion, R., King, A.N., Wu, G., Hogenesch, J.B., and Sehgal, A. (2016). Neural clocks and Neuropeptide F/Y regulate circadian gene expression in a peripheral metabolic tissue. eLife 5, e13552.

      Fujiwara, Y., Hermann-Luibl, C., Katsura, M., Sekiguchi, M., Ida, T., Helfrich-Förster, C., and Yoshii, T. (2018). The CCHamide1 neuropeptide expressed in the anterior dorsal neuron 1 conveys a circadian signal to the ventral lateral neurons in Drosophila melanogaster. Front Physiol 9, 1276.

      Goda, T., Tang, X., Umezaki, Y., Chu, M.L., Kunst, M., Nitabach, M.N.N., and Hamada, F.N. (2016). Drosophila DH31 neuropeptide and PDF receptor regulate night-onset temperature preference. J Neurosci 36, 11739-11754.

      Goda, T., Umezaki, Y., Alwattari, F., Seo, H.W., and Hamada, F.N. (2019). Neuropeptides PDF and DH31 hierarchically regulate free-running rhythmicity in Drosophila circadian locomotor activity. Sci Rep 9, 838.

      Guo, F., Cerullo, I., Chen, X., and Rosbash, M. (2014). PDF neuron firing phase-shifts key circadian activity neurons in Drosophila. eLife 3.

      He, C., Cong, X., Zhang, R., Wu, D., An, C., and Zhao, Z. (2013). Regulation of circadian locomotor rhythm by neuropeptide Y-like system in Drosophila melanogaster. Insect Mol Biol 22, 376-388.

      Hermann, C., Yoshii, T., Dusik, V., and Helfrich-Förster, C. (2012). Neuropeptide F immunoreactive clock neurons modify evening locomotor activity and free-running period in Drosophila melanogaster. J Comp Neurol 520, 970-987.

      Hyun, S., Lee, Y., Hong, S.T., Bang, S., Paik, D., Kang, J., Shin, J., Lee, J., Jeon, K., Hwang, S., et al. (2005). Drosophila GPCR Han is a receptor for the circadian clock neuropeptide PDF. Neuron 48, 267-278.

      Johard, H.A., Yoshii, T., Dircksen, H., Cusumano, P., Rouyer, F., Helfrich-Förster, C., and Nässel, D.R. (2009). Peptidergic clock neurons in Drosophila: ion transport peptide and short neuropeptide F in subsets of dorsal and ventral lateral neurons. J Comp Neurol 516, 59-73.

      Lamaze, A., Krätschmer, P., Chen, K.F., Lowe, S., and Jepson, J.E.C. (2018). A Wake-Promoting Circadian Output Circuit in Drosophila. Curr Biol 28, 3098-3105.e3093.

      Lear, B.C., Zhang, L., and Allada, R. (2009). The neuropeptide PDF acts directly on evening pacemaker neurons to regulate multiple features of circadian behavior. PLoS Biol 7, e1000154.

      Lee, G., Bahn, J.H., and Park, J.H. (2006). Sex- and clock-controlled expression of the neuropeptide F gene in Drosophila. Proc Natl Acad Sci USA 103, 12580-12585.

      Lelito, K.R., and Shafer, O.T. (2012). Reciprocal cholinergic and GABAergic modulation of the small ventrolateral pacemaker neurons of Drosophila's circadian clock neuron network. J Neurophysiol 107, 2096-2108.

      Ma, D., Przybylski, D., Abruzzi, K.C., Schlichting, M., Li, Q., Long, X., and Rosbash, M. (2021). A transcriptomic taxonomy of Drosophila circadian neurons around the clock. eLife 10.

      Port, F., Chen, H.M., Lee, T., and Bullock, S.L. (2014). Optimized CRISPR/Cas tools for efficient germline and somatic genome engineering in Drosophila. Proc Natl Acad Sci USA 111, E2967-2976.

      Reinhard, N., Schubert, F.K., Bertolini, E., Hagedorn, N., Manoli, G., Sekiguchi, M., Yoshii, T., Rieger, D., and Helfrich-Förster, C. (2022). The Neuronal Circuit of the Dorsal Circadian Clock Neurons in Drosophila melanogaster. Front Physiol 13, 886432.

      Renn, S.C., Park, J.H., Rosbash, M., Hall, J.C., and Taghert, P.H. (1999). A pdf neuropeptide gene mutation and ablation of PDF neurons each cause severe abnormalities of behavioral circadian rhythms in Drosophila. Cell 99, 791-802.

      Shafer, O.T., Helfrich-Förster, C., Renn, S.C., and Taghert, P.H. (2006). Reevaluation of Drosophila melanogaster's neuronal circadian pacemakers reveals new neuronal classes. J Comp Neurol 498, 180-193.

      Shafer, O.T., Kim, D.J., Dunbar-Yaffe, R., Nikolaev, V.O., Lohse, M.J., and Taghert, P.H. (2008). Widespread receptivity to neuropeptide PDF throughout the neuronal circadian clock network of Drosophila revealed by real-time cyclic AMP imaging. Neuron 58, 223-237.

      Zhang, L., Chung, B.Y., Lear, B.C., Kilman, V.L., Liu, Y., Mahesh, G., Meissner, R.A., Hardin, P.E., and Allada, R. (2010). DN1(p) circadian neurons coordinate acute light and PDF inputs to produce robust daily behavior in Drosophila. Curr Biol 20, 591-599.

      Zhao, P., Zhang, Z., Lv, X., Zhao, X., Suehiro, Y., Jiang, Y., Wang, X., Mitani, S., Gong, H., and Xue, D. (2016). One-step homozygosity in precise gene editing by an improved CRISPR/Cas9 system. Cell Res 26, 633-636.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This paper describes the development and initial validation of an approach-avoidance task and its relationship to anxiety. The task is a two-armed bandit where one choice is 'safer' - has no probability of punishment, delivered as an aversive sound, but also lower probability of reward - and the other choice involves a reward-punishment conflict. The authors fit a computational model of reinforcement learning to this task and found that self-reported state anxiety during the task was related to a greater likelihood of choosing the safe stimulus when the other (conflict) stimulus had a higher likelihood of punishment. Computationally, this was represented by a smaller value for the ratio of reward to punishment sensitivity in people with higher task-induced anxiety. They replicated this finding, but not another finding that this behavior was related to a measure of psychopathology (experiential avoidance), in a second sample. They also tested test-retest reliability in a sub-sample tested twice, one week apart and found that some aspects of task behavior had acceptable levels of reliability. The introduction makes a strong appeal to back-translation and computational validity, but many aspects of the rationale for this task need to be strengthened or better explained. The task design is clever and most methods are solid - it is encouraging to see attempts to validate tasks as they are developed. There are a few methodological questions and interpretation issues, but they do not affect the overall findings. The lack of replicated effects with psychopathology may mean that this task is better suited to assess state anxiety, or to serve as a foundation for additional task development.

      We thank the reviewer for their kind comments and constructive feedback. We agree that the approach taken in this paper appears better suited to state anxiety, and further work is needed to assess/improve its clinical relevance.

      Reviewer #1 (Recommendations For The Authors):

      1) For the introduction, the authors communicate well the appeal of tasks with translational potential, and setting up this translation through computational validity is a strong approach. However, I had some concerns about how the task was motivated in the introduction:

      a) The authors state that current approach-avoidance tasks used in humans do not resemble those used in the non-human literature, but do not provide details on what exactly is missing from these tasks that makes translation difficult.

      Our intention for the section that the reviewer refers to was to briefly convey that historically, approach-avoidance conflict would have been measured either using questionnaires or joystick-based tasks which have no direct non-human counterpart. However, we note that the phrasing was perhaps unfair to recent tasks that were explicitly designed to be translatable across species. Therefore, we have amended the text to the following:

      In humans, on the other hand, approach-avoidance conflict has historically been measured using questionnaires such as the Behavioural Inhibition/Activation Scale (Carver & White, 1994), or cognitive tasks that rely on motor biases, for example by using joysticks to approach/move towards positive stimuli and avoid/move away from negative stimuli, which have no direct non-human counterparts (Guitart-Masip et al., 2012; Kirlic et al., 2017; Mkrtchian et al., 2017; Phaf et al., 2014).

      b) Although back-translation to 'match' human paradigms to non-animal paradigms is useful for research, this isn't the end goal of task development. What really matters is how well these tasks, whether in humans or not, capture psychopathology-relevant behavior. Many animal paradigms were developed and brought into extensive use because they showed sensitivity to pharmacological compounds (e.g., benzodiazepines). The introduction accepts the validity of these paradigms at face value, and doesn't address whether developing human tests of psychopathology based on sensitivity to existing medication classes is the best way to generate new insights about psychopathology.

      We agree that whilst paradigms with translational and computational validity have merits of their own for neuroscientific theory, clinical validity (i.e. how well the paradigm reflects a phenomenon relevant to psychopathology) is key in the context of clinical applications. While our findings of associations between task performance and self-reported (state) anxiety suggest that our approach is a step in the right direction, the lack of associations with clinical measures was disappointing. Although future work is needed to more directly test the sensitivity of the current approach to psychopathology, this may mean that it, and its non-human counterparts, do not measure behaviours relevant to pathological anxiety. Since our primary focus in this paper was on translational and computational validity, we have opted to discuss the reviewer’s suggestion in the ‘Discussion’ section, as follows:

      Further, it is worth noting that many animal paradigms were developed and widely adopted due to their sensitivity to anxiolytic medication (Cryan & Holmes, 2005). Given the lack of associations with clinical measures in our results, it is possible that current translational models of anxiety may not fully capture behaviours that are directly relevant to pathological anxiety. To develop translational paradigms of clinical utility, future research should place a stronger emphasis on assessing their clinical validity in humans.

      c) The authors may want to bring in the literature on the description-experience gap (e.g., PMID: 19836292) when discussing existing decision tasks and their computational dissimilarity to non-human operant conditioning tasks.

      We thank the reviewer for this useful addition to the introduction. We have now added the following to the 'Introduction’ section:

      Moreover, evidence from economic decision-making suggests that explicit offers of probabilistic outcomes can impact decision-making differently compared to when probabilistic contingencies need to be learned from experience (referred to as the ‘description-experience gap’; Hertwig & Erev, 2009); this finding raises potential concerns regarding the use of offer-based tasks in humans as approximations of non-human tasks that do not involve explicit offers.

      d) How does one evaluate how computationally similar human vs. non-human tasks are? What are the criteria for making this judgement? Specific to the current tasks, many animal learning tasks are not learning tasks in the same sense that human learning tasks are, in terms of the number of trials used and if the animals are choosing from a learned set of contingencies versus learning the contingencies during the testing.

      The computational similarity of human and non-human strategies in a given translational task can be tested empirically. This can be done by fitting models to the data and assessing whether similar models explain choices, even if parameter distributions might vary across species due to, for example, physiological differences. Indeed, non-human animals require much more training to perform even uni-dimensional reinforcement learning, but once they are trained, it should be possible to model their responses. In fact, it should even be possible to take training data into account in some cases. For example, the training phase of the Vogel/Geller-Seifter preclinical tests requires an animal to learn to emit a certain action (e.g. lever press) simply to obtain some reward. In the next phase, an aversive outcome is introduced as an additional outcome, but one could model both the training and test phase together – the winning model in our studies would be a suitable candidate to model behaviour here. As we also discuss predictive validity in the ‘Discussion’ section, we opted to add the following text there too:

      … computational validity would also need to be assessed directly in non-human animals by fitting models to their behavioural data. This should be possible even in the face of different procedures across species such as number of trials or outcomes used (shock or aversive sound). We are encouraged by our finding that the winning computational model in our study relies on a relatively simple classical reinforcement learning strategy. There exist many studies showing that non-human animals rely on similar strategies during reward and punishment learning (Mobbs et al., 2020; Schultz, 2013); albeit to our knowledge this has never been modelled in non-human animals where rewards and punishment can occur simultaneously.
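      To make concrete what fitting such a model involves, the sketch below shows one simple form that a reinforcement learning model with separate reward and punishment terms could take. It is an assumption-laden illustration rather than the winning model's exact specification: the delta-rule update, the logistic choice rule, and all function and parameter names are chosen for the example only.

```python
import math


def update(expectation, outcome, learning_rate):
    """Delta-rule update of an outcome expectation (assumed form)."""
    return expectation + learning_rate * (outcome - expectation)


def p_choose_conflict(q_reward, q_punish, reward_sens, punish_sens):
    """Probability of choosing the conflict option over the safe option.

    q_reward, q_punish: dicts with keys "conflict" and "safe" holding the
    learned reward and punishment expectations for each option.
    The net value weights reward positively and punishment negatively
    (an assumed parameterisation).
    """
    v_conflict = reward_sens * q_reward["conflict"] - punish_sens * q_punish["conflict"]
    v_safe = reward_sens * q_reward["safe"] - punish_sens * q_punish["safe"]
    # Logistic (two-option softmax) choice rule on the value difference.
    return 1.0 / (1.0 + math.exp(-(v_conflict - v_safe)))
```

      In a sketch of this kind, raising punishment sensitivity relative to reward sensitivity lowers the probability of choosing the conflict option, which is the qualitative pattern the same model could be tested for in non-human data.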

      2) What do the authors make of the non-linear relationship between probability of punishment and probability of choosing the conflict stimulus (Fig 2d), especially in the high task-induced anxiety participants? Did this effect show up in the replication sample as well?

      Figures 2c-e were created by binning the continuous predictors of outcome probabilities into discrete bins of equal interval. Since punishment probability varied according to Gaussian random walks, it was also distributed with more of its mass in the central region (~ 0.4), and so values at the extreme bins were estimated on fewer data and with greater variance. The non-linear relationships are likely thus an artefact of our task design and plotting procedure. The pattern was also evident in the replication sample, see Author response image 1:

      Author response image 1.

      However, since these effects were estimated as linear effects in the logistic regression models, and to avoid overfitting/interpretations of noise arising from our task design, we now plot logistic curves fitted to the raw data instead.
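      The binning procedure behind the original plots can be sketched as follows, to show why sparsely populated bins are noisy. This is a minimal illustration under stated assumptions: the predictor is a probability in [0, 1], choices are coded 1 for the conflict option and 0 for the safe option, and the function name and the number of bins are hypothetical.

```python
def binned_choice_rates(probabilities, choices, n_bins=5):
    """Bin a predictor in [0, 1] into equal-width bins and return, per bin,
    (number of trials, proportion of conflict choices or None if empty)."""
    counts = [0] * n_bins
    conflict_choices = [0] * n_bins
    for p, choice in zip(probabilities, choices):
        # A value of exactly 1.0 is placed in the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        counts[index] += 1
        conflict_choices[index] += choice
    return [
        (n, conflict_choices[i] / n if n > 0 else None)
        for i, n in enumerate(counts)
    ]
```

      When most trials fall in the central bins, the proportions in the extreme bins rest on very few trials and can swing widely, which is consistent with the non-linear appearance of the binned plots.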

      3) How correlated were learning rate and sensitivity parameters? The EM algorithm used here can sometimes result in high correlations among these sets of parameters.

      As the reviewer suspects, the parameters were strongly correlated, especially across the punishment-specific parameters. The Pearson’s r estimates for the untransformed parameter values were as follows:

      Reward parameters: discovery sample r = -0.39; replication sample r = -0.78

      Punishment parameters: discovery sample r = -0.91; replication sample r = -0.85

      We have included the correlation matrices of the estimated parameters as Supplementary Figure 2 in the ‘Computational modelling’ section of the Supplement.
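      For reference, the Pearson correlation between two sets of per-participant parameter estimates can be computed as in the sketch below. It is a generic, self-contained illustration; the function name and the example values in the usage note are hypothetical and are not our data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)
```

      A strongly negative value, such as that between punishment learning rate and punishment sensitivity, indicates that the two parameters trade off against each other in estimation: a higher estimate of one tends to be compensated by a lower estimate of the other.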

      We have now also re-fitted the winning model using variational Bayesian inference (VBI) via Stan, and found that the cross-parameter correlations were much lower than when the data were fitted using EM. We also ran a sensitivity analysis assessing whether using VBI changed the main findings of our studies. This showed that the correlation between task-induced anxiety and the reward-punishment sensitivity index was robust to fitting method, as was the mediating effect of reward-punishment sensitivity index on anxiety’s effect on choice. This indicates that overall our key findings are robust to different methods of parameter-fitting.

      We now direct readers to these analyses from the new ‘Sensitivity analyses’ section in the manuscript, as follows:

      As our procedure for estimating model parameters (the expectation-maximisation algorithm, see ‘Methods’) produced high inter-parameter correlations in our data (Supplementary Figure 2), we also re-estimated the parameters using Stan’s variational Bayesian inference algorithm (Stan Development Team, 2023) – this resulted in lower inter-parameter correlations, but our primary computational finding, that the effect of anxiety on choice is mediated by relative sensitivity to reward/punishment was consistent across algorithms (see Supplement section 9.8 for details).

      We have included the relevant analyses comparing EM and VBI in the Supplement, as follows:

      [9.8 Sensitivity analysis: estimating parameters via expectation maximisation and variational Bayesian inference algorithms]

      Given that the expectation maximisation (EM) algorithm produced high inter-parameter correlations, we ran a sensitivity analysis by assessing the robustness of our computational findings to an alternative method of parameter estimation – (mean-field) variational Bayesian inference (VBI) via Stan (Stan Development Team, 2023). Since, unlike EM, the results of VBI are very sensitive to initial values, we fitted the data 10 times with different initial values.

      Inter-parameter correlations

      The VBI produced lower inter-parameter correlations than the EM algorithm (Supplementary Figure 8).

      Sensitivity analysis

      Since multicollinearity in the VBI-estimated parameters was lower than for EM, indicating less trade-off in the estimation, we re-tested our computational findings from the manuscript as part of a sensitivity analysis. We first assessed whether we observed the same correlations between task-induced anxiety and punishment learning, and reward-punishment sensitivity index (Supplementary Figure 9a). Punishment learning rate was not significantly associated with task-induced anxiety in any of the 10 VBI iterations in the discovery sample, although it was in 9/10 in the replication sample. On the other hand, the reward-punishment sensitivity index was significantly associated with task-induced anxiety in 9/10 VBI iterations in the discovery sample and all iterations in the replication sample. This suggests that the correlation of anxiety and sensitivity index is robust to these two fitting approaches.

      We also re-estimated the mediation models, where in the EM-estimated parameters, we found that the reward-punishment sensitivity index mediated the relationship between task-induced anxiety and task choice proportions (Supplementary Figure 9b). Again, we found that the reward-punishment sensitivity index was a significant mediator in 9/10 VBI iterations in the discovery sample and all iterations in the replication sample. Punishment learning rate was also a significant mediator in 9/10 iterations in the replication sample, although it was not in the discovery sample for all iterations, and this was not observed for the EM-estimated parameters.

      Overall, we found that our key results, that anxiety is associated with greater sensitivity to punishment over reward, and this mediates the relationship between anxiety and approach-avoidance behaviour, were robust across both fitting methods.

      As an aside, we were unable to run the model fitting using Markov chain Monte Carlo sampling approaches due to the computational power and time required for a sample of this size (Pike & Robinson, 2022, JAMA Psychiatry).

      4) What is the split-half reliability of the task parameters?

      We thank the reviewer for this query. We have now included a brief section on the (good-to-excellent) split-half reliability of the task in the manuscript:

      We assessed the split-half reliability of the task by correlating the overall proportion of conflict option choices and model parameters from the winning model across the first and second half of trials. For overall choice proportion, reliability was simply calculated via Pearson’s correlations. For the model parameters, we calculated model-derived estimates of Pearson’s r values from the parameter covariance matrix when first- and second-half parameters were estimated within a single model, following a previous approach recently shown to accurately estimate parameter reliability (Waltmann et al., 2022). We interpreted indices of reliability based on conventional values of < 0.40 as poor, 0.4 - 0.6 as fair, 0.6 - 0.75 as good, and > 0.75 as excellent reliability (Fleiss, 1986). Overall choice proportion showed good reliability (discovery sample r = 0.63; replication sample r = 0.63; Supplementary Figure 5). The model parameters showed good-to-excellent reliability (model-derived r values ranging from 0.61 to 0.85 [0.76 to 0.92 after Spearman-Brown correction]; Supplementary Figure 5).
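
      As an illustration only, the Spearman-Brown step-up and the reliability bands quoted above can be sketched as follows. The function names, and the handling of values that fall exactly on a band boundary, are assumptions of this sketch rather than part of the reported analysis pipeline.

```python
def spearman_brown(r_half: float) -> float:
    """Step a split-half correlation up to the reliability of the full-length task."""
    return 2 * r_half / (1 + r_half)


def reliability_label(r: float) -> str:
    """Map a coefficient onto the bands quoted in the text (Fleiss, 1986).

    Boundary handling (e.g. whether exactly 0.75 is 'good') is an assumption.
    """
    if r < 0.40:
        return "poor"
    if r < 0.60:
        return "fair"
    if r <= 0.75:
        return "good"
    return "excellent"


# Lowest and highest model-derived split-half r values reported in the text.
for r in (0.61, 0.85):
    corrected = spearman_brown(r)
    print(f"r = {r:.2f} -> corrected = {corrected:.2f} ({reliability_label(corrected)})")
```

      Applied to the reported range of 0.61 to 0.85, the correction gives the bracketed values of 0.76 and 0.92.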

      5) The authors do a good job of avoiding causal language when setting up the cross-sectional mediation analysis, but depart from this in the discussion (line 335). Without longitudinal data, they cannot claim that "mediation analyses revealed a mechanism of how anxiety induces avoidance".

      Thank you for spotting this, we have now amended the text to:

      … mediation analyses suggested a potential mechanism of how anxiety may induce avoidance.

      Reviewer #2 (Public Review):

      Summary:

      The authors develop a computational approach-avoidance-conflict (AAC) task, designed to overcome limitations of existing offer based AAC tasks. The task incorporated likelihoods of receiving rewards/ punishments that would be learned by the participants to ensure computational validity and estimated model parameters related to reward/punishment and task induced anxiety. Two independent samples of online participants were tested. In both samples participants who experienced greater task induced anxiety avoided choices associated with greater probability of punishment. Computational modelling revealed that this effect was explained by greater individual sensitivities to punishment relative to rewards.

      Strengths:

      Large internet-based samples, with discovery sample (n = 369), pre-registered replication sample (n = 629) and test-retest sub group (n = 57). Extensive compliance measures (e.g. audio checks) seek to improve adherence.

      There is a great need for RL tasks that model threatening outcomes rather than simply loss of reward. The main model parameters show strong effects and the additional indices with task based anxiety are a useful extension. Associations were broadly replicated across samples. Fair to excellent reliability of model parameters is encouraging and badly needed for behavioral tasks of threat sensitivity.

      We thank the reviewer for their comments and constructive feedback.

      The task seems to have lower approach bias than some other AAC tasks in the literature. Although this was inferred by looking at Fig 2 (it doesn't seem to drop below 46%) and Fig 3d seems to show quite a strong approach bias when using a reward/punishment sensitivity index. It would be good to confirm some overall stats on % of trials approached/avoided overall.

      The range of choice proportions is indeed an interesting statistic that we have now included in the manuscript:

      Across individuals, there was considerable variability in overall choice proportions (discovery sample: mean = 0.52, SD = 0.14, min/max = [0.03, 0.96]; replication sample: mean = 0.52, SD = 0.14, min/max = [0.01, 0.99]).

      Weaknesses:

      The negative reliability of punishment learning rate is concerning as this is an important outcome.

      We agree that this is a concerning finding. As reviewer 3 notes, this may have been due to participants having control over the volume used to play the aversive sounds in the task (see below for our response to this point). Future work with better controlled experimental settings will be needed to determine the reliability of this parameter more accurately.

      This may also have been due to the asymmetric nature of the task, as only one option could produce the punishment. This means that there were fewer trials on which to estimate learning about the occurrence of a punishment. Future work using continuous outcomes, as the reviewer suggests below, whilst keeping the asymmetric relationship between the options, could help in this regard.

      We have included the following comment on this issue in the manuscript:

      Alternatively, as participants self-determined the loudness of the punishments, differences in volume settings across sessions may have impacted the reliability of this parameter (and indeed punishment sensitivity). Further, the asymmetric nature of the task may have impacted our ability to estimate the punishment learning rate, as there were fewer occurrences of the punishment compared to the reward.

      The Kendall's tau values underlying task induced anxiety and safety reference/ various indices are very weak (all < 0.1), as are the mediation effects (all beta < 0.01). This should be highlighted as a limitation, although the interaction with P(punishment|conflict) does explain some of this.

      We now include references to the effect sizes to emphasise this limitation. We also note, as the reviewer suggests, that this may be due to crudeness of overall choice proportion as a measure of approach/avoidance, as it is contaminated with variables such as P(punishment|conflict).

      One potentially important limitation of our findings is the small effect size observed in the correlation between task-induced anxiety and avoidance (Kendall's tau values < 0.1, mediation betas < 0.01). This may be attributed to the simplicity of using overall choice proportion as a measure of approach/avoidance, as the effect of anxiety on choice was also influenced by punishment probability.

      The inclusion of only one level of reward (and punishment) limits the ecological validity of the sensitivity indices.

      We agree that using multi-level outcomes will be an important question for future work and now explicitly note this in the manuscript, as below:

      Using multi-level or continuous outcomes would also improve the ecological validity of the present approach and interpretation of the sensitivity parameters.

      Appraisal and impact:

      Overall this is a very strong paper, describing a novel task that could help move the field of RL forward to take account of threat processing more fully. The large sample size with discovery, replication and test-retest gives confidence in the findings. The task has good ecological validity and associations with task-based anxiety and clinical self-report demonstrate clinical relevance. The authors could give further context but test-retest of the punishment learning parameter is the only real concern. Overall this task provides an exciting new probe of reward/threat that could be used in mechanistic disease models.

      We thank the reviewer again for helping us to improve our analyses and manuscript.

      Reviewer #2 (Recommendations For The Authors):

      Additional context:

      In the introduction "cognitive tasks that bear little semblance to those used in the non-human literature" seems a little unfair. One study that is already cited (Ironside et al, 2020) used a task that was adapted from non-human primates for use in humans. It has almost identical visual stimuli (different levels of simultaneous reward and aversive outcome/punishment) and response selection processes (joystick) between species and some overlapping brain regions were activated across species for conflict and aversiveness. The later point that non-human animals must be trained on the association between action and outcome is well taken from the point of view of computational validity but perhaps not sufficient to justify the previous statement.

      Our intention for this section was to briefly convey that historically, approach-avoidance conflict would have been measured either using questionnaires or joystick-based tasks which have no direct non-human counterpart. However, we agree that this phrasing is unfair to recent studies such as those by Ironside and colleagues. Therefore, we have amended the text to the following:

      In humans, on the other hand, approach-avoidance conflict has historically been measured using questionnaires such as the Behavioural Inhibition/Activation Scale (Carver & White, 1994), or cognitive tasks that rely on motor biases to approach/move towards positive stimuli and avoid/move away from negative stimuli which have no direct non-human counterparts (Guitart-Masip et al., 2012; Kirlic et al., 2017; Mkrtchian et al., 2017; Phaf et al., 2014).

      It would be good to speculate on why task induced anxiety made participants slower to update their estimates of punishment probability.

      Although a meta-analysis of reinforcement learning studies using reward and punishment outcomes suggests a positive association between punishment learning rate and anxiety symptoms (and depressed mood), we paradoxically found the opposite effect. However, previous work has suggested that distinct forms of anxiety associate differently with punishment learning rate (Wise & Dolan, 2020, Nat. Commun.), where somatic anxiety was negatively correlated with punishment learning rate whereas cognitive anxiety showed the opposite effect. We have now added the following to the manuscript, and noted that future work is needed to understand the potentially complex relationship between anxiety and learning from punishments:

      Notably, although a recent computational meta-analysis of reinforcement learning studies showed that symptoms of anxiety and depression are associated with elevated punishment learning rates (Pike & Robinson, 2022), we did not observe this pattern in our data. Indeed, we even found the contrary effect in relation to task-induced anxiety, specifically that anxiety was associated with lower rates of learning from punishment. However, other work has suggested that the direction of this effect can depend on the form of anxiety, where cognitive anxiety may be associated with elevated learning rates, but somatic anxiety may show the opposite pattern (Wise & Dolan, 2020) and this may explain the discrepancy in findings. Additionally, parameter values are highly dependent on task design (Eckstein et al., 2022), and study designs to date may be more optimised in detecting differences in learning rate (Pike & Robinson, 2022) – future work is needed to better understand the potentially complex association between anxiety and punishment learning rate. Lastly, as punishment learning rate was severely unreliable in the test-retest analyses, and the associations between punishment learning rate and state anxiety were not robust to an alternative method of parameter estimation (variational Bayesian inference), the negative correlation observed in our study should be treated with caution.

      Were those with more task-based anxiety more inflexible in general?

      The lack of associations across reward learning rate and task-induced anxiety suggest that this was not a general inflexibility effect. To test the reviewer’s hypothesis more directly, we conducted a sensitivity analysis by examining the model with a general learning rate – this did not support a general inflexibility effect. Please see the new section in the Supplement below:

      [9.10 Sensitivity analysis: anxiety and inflexibility]

      As anxious participants were slower to update their estimates of punishment probability, we determined whether this was due to greater general inflexibility by examining the model including two sensitivity parameters, but one general learning rate (i.e. not split by outcome). The correlation between this general learning rate and task-induced anxiety was not significant in either sample (discovery: tau = -0.02, p = 0.504; replication: tau = -0.01, p = 0.625), suggesting that the effect is specific to punishment.

      Was the 16% versus 20% of the two samples with clinically relevant anxiety symptoms significantly different? What about other demographics in the two samples?

      The proportions were not significantly different (χ² = 2.33, p = 0.127). The discovery sample included more females and was older on average compared to the replication sample – information which we now report in the manuscript:

      The discovery sample consisted of a significantly greater proportion of female participants than the replication sample (59% vs 52%, χ² = 4.64, p = 0.031). The average age was significantly different across samples (discovery sample mean = 37.7, SD = 10.3, replication sample mean = 34.3, SD = 10.4; t(785.5) = 5.06, p < 0.001). The differences in self-reported psychiatric symptoms across samples did not reach significance (p > 0.086).
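
      The fractional degrees of freedom in the age comparison suggest a Welch's (unequal-variance) t-test. A minimal sketch from summary statistics follows. The group sizes of 372 and 627 are an assumption (423 − 51 and 725 − 98, taken from the exclusion counts reported in this response), and because the means and SDs are rounded, the sketch only approximately reproduces the reported statistic.

```python
from math import sqrt


def welch_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Return (t, df) for Welch's unequal-variance t-test from summary statistics."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first group mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second group mean
    t = (m1 - m2) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Means and SDs are the rounded values from the text; group sizes are assumed.
t, df = welch_from_summary(37.7, 10.3, 372, 34.3, 10.4, 627)
print(f"t({df:.1f}) = {t:.2f}")
```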

      It would be interesting to know how many participants failed the audio attention checks.

      We have now included information about what proportion of participants failed each of the task exclusion criteria in the manuscript:

      Firstly, we excluded participants who missed a response to more than one auditory attention check (see above; 8% in both discovery and replication samples) – as these occurred infrequently and the stimuli used for the checks were played at relatively low volume, we allowed for incorrect responses so long as a response was made. Secondly, we excluded those who responded with the same response key on 20 or more consecutive trials (> 10% of all trials; 4/6% in discovery and replication samples, respectively). Lastly, we excluded those who did not respond on 20 or more trials (1/2% in discovery and replication samples, respectively). Overall, we excluded 51 out of 423 (12%) in the discovery sample, and 98 out of 725 (14%) in the replication sample.
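
      The three exclusion rules can be expressed compactly, as in the sketch below. The data layout (one response key per trial, with `None` marking a missed trial) and the function names are hypothetical and chosen for illustration; they are not the study's actual preprocessing code.

```python
from itertools import groupby


def longest_run(keys):
    """Length of the longest run of identical consecutive responses (missed trials break a run)."""
    return max((len(list(g)) for k, g in groupby(keys) if k is not None), default=0)


def should_exclude(missed_checks, keys):
    """Apply the three criteria described in the text to one participant.

    missed_checks: number of auditory attention checks with no response.
    keys: one entry per trial; None where the participant did not respond.
    """
    n_missing = sum(k is None for k in keys)
    return missed_checks > 1 or longest_run(keys) >= 20 or n_missing >= 20


# Overall exclusion rates from the counts reported in the text.
print(round(100 * 51 / 423), round(100 * 98 / 725))  # 12 14
```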

      There doesn't appear to be a model with only learning from punishment (i.e. no reward learning) included in the model comparison. It would be interesting to see how it compared.

      We have fitted the suggested model and found that it is the least parsimonious of the models. Since participants were monetarily incentivised based on the rewards only, this was to be expected. We have now added this ‘punishment learning only’ model and its variant including a lapse term into the model comparison. The two lowest bars on the y-axis in Author response image 2 represent these models.

      Author response image 2.

      Were sex effects examined, as these have been commonly found in AAC tasks? How about other covariates such as age?

      We have now tested the effects of sex and age on behaviour and on parameter values. There were indeed some significant effects, albeit with some inconsistencies across the two samples, which for completeness we have included in the manuscript, as follows:

      While sex was significantly associated with choice in the discovery sample (β = 0.16 ± 0.07, p = 0.028) with males being more likely to choose the conflict option, this pattern was not evident in the replication sample (β = 0.08 ± 0.06, p = 0.173), and age was not associated with choice in either sample (p > 0.2).

      Comparing parameters across sexes via Welch’s t-tests revealed significant differences in reward sensitivity (t(289) = -2.87, p = 0.004, d = 0.34; lower in females) and consequently reward-punishment sensitivity index (t(336) = -2.03, p = 0.043, d = 0.22; lower in females i.e. more avoidance-driven). In the replication sample, we observed the same effect on reward-punishment sensitivity index (t(626) = -2.79, p = 0.005, d = 0.22; lower in females). However, the sex difference in reward sensitivity did not replicate (p = 0.441), although we did observe a significant sex difference in punishment sensitivity in the replication sample (t(626) = 2.26, p = 0.024, d = 0.18).

      Minor: Still a few placeholders (Supplementary Table X/ Table X) in the methods

      We thank the reviewer for spotting these errors. We have now corrected these references.

      Reviewer #3 (Public Review):

      This study investigated cognitive mechanisms underlying approach-avoidance behavior using a novel reinforcement learning task and computational modelling. Participants could select a risky "conflict" option (latent, fluctuating probabilities of monetary reward and/or unpleasant sound [punishment]) or a safe option (separate, generally lower probability of reward). Overall, participant choices were skewed towards more rewarded options, but were also repelled by increasing probability of punishment. Individual patterns of behavior were well-captured by a reinforcement learning model that included parameters for reward and punishment sensitivity, and learning rates for reward and punishment. This is a nice replication of existing findings suggesting reward and punishment have opposing effects on behavior through dissociated sensitivity to reward versus punishment.

      Interestingly, avoidance of the conflict option was predicted by self-reported task-induced anxiety. This effect of anxiety was mediated by the difference in modelled sensitivity to reward versus punishment (relative sensitivity). Importantly, when a subset of participants were retested over 1 week later, most behavioral tendencies and model parameters were recapitulated, suggesting the task may capture stable traits relevant to approach-avoidance decision-making.

      We thank the reviewer for their useful analysis of our study. Indeed, it was reassuring to see that performance indices were reliable across time.

      However, interpretation of these findings is severely undermined by the fact that the aversiveness of the auditory punisher was largely determined by participants, with the far-reaching impacts of this not being accounted for in any of the analyses. The manipulation check to confirm participants did not mute their sound is highly commendable, but the thresholding of punisher volume to "loud but comfortable" at the outset of the task leaves substantial scope for variability in the punisher delivered to participants. Indeed, participants' ratings of the unpleasantness of the punishment were moderate and highly variable (M = 31.7 out of 50, SD = 12.8 [distribution unreported]). Despite having this rating, it is not incorporated into analyses. It is possible that the key finding of relationships between task-induced anxiety, reward-punishment sensitivity and avoidance are driven by differences in the punisher experienced; a louder punisher is more unpleasant, driving greater task-induced anxiety, model-derived punishment sensitivity, and avoidance (and vice versa). This issue can also explain the counterintuitive findings from re-tested participants; lower/negatively correlated task-induced anxiety and punishment-related cognitive parameters may have been due to participants adjusting their sound settings to make the task less aversive (retest punisher rating not reported). It can therefore be argued that the task may not actually capture meaningful cognitive/motivational traits and their effects on decision-making, but instead spurious differences in punisher intensity.

      We thank the reviewer for raising this important potential limitation of our study. We agree that how participants self-adjusted their sound volume may have important consequences for our interpretations of the data. Unfortunately, despite the scalability of online data collection, this highlights one of its major weaknesses in the lack of controllability over experimental parameters. The previous paper from which we obtained our aversive sounds (Seow & Hauser, 2021, Behav Res, doi.org/10.3758/s13428-021-01643-0) contains useful analyses with regards to this discussion. When comparing the unpleasantness of the sounds played at 50% vs 100% volume, the authors indeed found that the lower volumes led to lower unpleasantness ratings. However, the magnitude of this effect did not appear to be substantial (Fig. 4 from the paper), and even at 50% volume, the scream sounds we used were rated in the top quartile for unpleasantness, on average. This implies that the sounds have sufficient inherent unpleasantness, even when played at half intensity. We find this reassuring, in the sense that any self-imposed volume effects may not be large. Of note, our instructions to participants to adjust the volume to a ‘loud but comfortable’ level was based on the same phrasing used in this study.

      To the reviewer’s point on how this might affect the reliability of the task, we have included the following in the ‘Discussion’ section:

      Alternatively, as participants self-determined the loudness of the punishments, differences in volume settings across sessions may have impacted the reliability of this parameter (and indeed other measures).

      Please see below for analyses accounting for punishment unpleasantness ratings.

      This undercuts the proposed significance of this task as a translational tool for understanding anxiety and avoidance. More information about ratings of punisher unpleasantness and its relationship to task behavior, anxiety and cognitive parameters would be valuable for interpreting findings. It would also be of interest whether the same results were observed if the aversiveness of the punisher was titrated prior to the task.

      As suggested, we have now included sensitivity analyses using the unpleasantness ratings that show their effect is minimal on our primary inference. We report relevant results below in the ‘Recommendations For The Authors’ section. At the same time, we think it is important to acknowledge that unpleasantness is a combination of both the inherent unpleasantness of the sound and the volume it is presented at, where only the latter is controlled by the participant. Therefore, these analyses are not a perfect indicator of the effect of participant control. For convenience, we reproduce the key findings from this sensitivity analysis here:

      Approach-avoidance hierarchical logistic regression model

      We assessed whether approach and avoidance responses, and their relationships with state anxiety, were impacted by punishment unpleasantness, by including unpleasantness ratings as a covariate into the hierarchical logistic regression model. Whilst unpleasantness was a significant predictor of choice (positively predicting safe option choices), all significant predictors and interaction effects from the model without unpleasantness survived (Supplementary Figure 11). Critically, this suggests that punishment unpleasantness does not account for all of the variance in the relationship between anxiety and avoidance.

      Mediation model

      When unpleasantness ratings were included in the mediation models, the mediating effect of the reward-punishment sensitivity index did not survive (discovery sample: standardised β = 0.003 ± 0.003, p = 0.416; replication sample: standardised β = 0.004 ± 0.003, p = 0.100; Supplementary Figure 12). Pooling the samples resulted in an effect that narrowly missed the significance threshold (standardised β = 0.004 ± 0.002, p = 0.068).

      More generally, whether or not to titrate the punishments (and indeed the rewards) is an interesting experimental decision, which we think should be guided by the research question. In our case, we were interested in individual differences in reward/punishment learning and sensitivity and their relation to anxiety, so variation in how the aversive sounds affected approach-avoidance decisions was an important aspect of our design. In studies where the aim is to understand more general processes of how humans act under approach-avoidance conflict, it may be better to tightly control the salience of reinforcers.

      Ultimately, the best test of the causal role of anxiety on avoidance, and against the hypothesis that our results were driven by spurious volume control effects, would be to run within-subjects anxiety interventions, where these volume effects are naturally accounted for. This will be an important direction for future studies using similar measures. We have added a paragraph in the ‘Discussion’ section on this point:

      Relatedly, participants had some control over the intensity at which the punishments were presented, which may have driven our findings relating to anxiety and putative mechanisms of anxiety-related avoidance. Sensitivity analyses showed that our finding that anxiety is positively associated with avoidance in the task was robust to individual differences in self-reported punishment unpleasantness, whilst the mediation effects were not. Future work imposing better control over the stimuli presented, and/or using within-subjects designs will be needed to validate the role of reward/punishment sensitivities in anxiety-related avoidance.

      Although the procedure and findings reported here remain valuable to the field, claims of novelty including its translational potential are perhaps overstated. This study complements and sits within a much broader literature that investigates roles for aversion and cognitive traits in approach-avoidance decisions. This includes numerous studies that apply reinforcement learning models to behavior in two-choice tasks with latent probabilities of reward and punishment (e.g., see doi: 10.1001/jamapsychiatry.2022.0051), as well as other translationally-relevant paradigms (e.g., doi: 10.3389/fpsyg.2014.00203, 10.7554/eLife.69594, etc).

      We agree with the reviewer that our approach builds on previous work in reinforcement learning, approach-avoidance conflict and translational measures of anxiety. Whilst there are by now many studies using two-choice learning tasks with latent reward and punishment probabilities, our main aim, and the one we refer to as ‘novel’, was to bring these fields together so as to model anxiety-related behaviour.

      We note that we do not make strong statements about whether these effects speak to traits per se, and as Reviewer 1 notes, the evidence from our study suggests that the present measure may be better suited to assessing state anxiety. While computational model parameters can be, and often are, interpreted as constituting stable individual traits, a simpler interpretation of our findings may be that state anxiety is associated with a momentary preference for punishment avoidance over reward pursuit. This can still be informative for the study of anxiety, especially given the notion of a continuous relationship between adaptive/state anxiety and maladaptive/persistent anxiety.

      Having said that, we agree with the underlying premise of the reviewer’s point that how the measure relates to trait-level avoidance/inhibition measures will be an interesting question for future work. We appreciate the importance of using tasks such as ours and those highlighted by the reviewer as trait-level measures, especially in computational psychiatry. We have now included a discussion on the potential roles of cognitive/motivational traits, in line with the reviewer’s recommendation – briefly, we have included the references suggested by the reviewer, discussed the measure’s potential relevance to cognitive/motivational traits, and directed interested readers to the broader literature. Please see below for details.

      Reviewer #3 (Recommendations For The Authors):

      As stated in the public review, punisher unpleasantness and its relationship to key findings (including for retest) should be reported and discussed.

      We signpost readers to our new analyses, incorporating unpleasantness ratings into the statistical models, from the main manuscript as follows:

      Since participants self-determined the volume of the punishments in the task, and therefore (at least in part) their aversiveness, we conducted sensitivity analyses by accounting for self-reported unpleasantness ratings of the punishment (see the Supplement). Our finding that anxiety impacts approach-avoidance behaviour was robust to this sensitivity analysis (p < 0.001); however, the mediating effect of the reward-punishment sensitivity index was not (p > 0.1; see Supplement section 9.9 for details).

      We reproduce the relevant section from the Supplement below. Overall, we found that the effect of anxiety on choices (via its interaction with punishment probability) remained significant after accounting for unpleasantness; however, the mediating effect of reward-punishment sensitivity was no longer significant when unpleasantness ratings were included in the model. As noted above, unpleasantness ratings are not a perfect measure of self-imposed sound volume, and indeed punishment sensitivity is essentially a computationally-derived measure of unpleasantness, which makes it difficult to interpret the mediation model which contains both of these measures. However, since we found that anxiety affected choice over and above any effects of self-imposed sound volume (using unpleasantness ratings as a proxy measure), we argue that the task still holds value as a model of anxiety-related avoidance.

      [Supplement Section 9.9: Sensitivity analyses of punishment unpleasantness]

      Distribution of unpleasantness

      The punishments were rated as unpleasant by the participants, on average (discovery sample: mean rating = 31.1 [scored between 0 and 50], SD = 13.1; replication sample: mean rating = 32.1, SD = 12.7; Supplementary Figure 10).

      Approach-avoidance hierarchical logistic regression model

      We assessed whether approach and avoidance responses, and their relationships with state anxiety, were impacted by punishment unpleasantness, by including unpleasantness ratings as a covariate in the hierarchical logistic regression model. Whilst unpleasantness was a significant predictor of choice (positively predicting safe option choices), all significant predictors and interaction effects from the model without unpleasantness ratings survived (Supplementary Figure 11). Critically, this suggests that punishment unpleasantness does not account for all of the variance in the relationship between anxiety and avoidance.

      Mediation model

      When unpleasantness ratings were included in the mediation models, the mediating effect of the reward-punishment sensitivity index did not survive (discovery sample: standardised β = 0.003 ± 0.003, p = 0.416; replication sample: standardised β = 0.004 ± 0.003, p = 0.100; Supplementary Figure 12). Pooling the samples resulted in an effect that narrowly missed the significance threshold (standardised β = 0.004 ± 0.002, p = 0.068).

      Test-retest reliability of unpleasantness

      The test-retest reliability of unpleasantness ratings was excellent (ICC(3,1) = 0.75), although participants gave significantly lower ratings in the second session (t(56) = 2.7, p = 0.008, d = 0.37; mean difference of 3.12, SD = 8.63).
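
      For reference, ICC(3,1) is the two-way mixed, single-measure, consistency form of the intraclass correlation (Shrout & Fleiss convention). The sketch below is purely illustrative and is not the analysis code used for this study; the function name `icc_3_1` and the toy ratings are hypothetical. Because this form removes between-session mean differences from the error term, a uniform drop in ratings across sessions does not by itself lower the coefficient, which is how high reliability and a significant session difference can coexist.

      ```python
      def icc_3_1(scores):
          """ICC(3,1) from a participants x sessions table of ratings.

          scores: list of rows, one per participant, each holding k session ratings.
          Formula: (MS_rows - MS_error) / (MS_rows + (k - 1) * MS_error).
          """
          n = len(scores)          # number of participants
          k = len(scores[0])       # number of sessions
          grand = sum(sum(row) for row in scores) / (n * k)
          row_means = [sum(row) / k for row in scores]
          col_means = [sum(row[j] for row in scores) / n for j in range(k)]

          # Two-way ANOVA sums of squares (no replication).
          ss_total = sum((x - grand) ** 2 for row in scores for x in row)
          ss_rows = k * sum((m - grand) ** 2 for m in row_means)
          ss_cols = n * sum((m - grand) ** 2 for m in col_means)
          ss_error = ss_total - ss_rows - ss_cols

          ms_rows = ss_rows / (n - 1)
          ms_error = ss_error / ((n - 1) * (k - 1))
          return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


      # Toy data (hypothetical): every participant rates 3 points lower at retest.
      shifted = [[30, 27], [20, 17], [40, 37]]
      consistency = icc_3_1(shifted)
      ```

      In this toy example the rank order and spacing of participants are preserved across sessions, so the consistency coefficient stays at its maximum despite the systematic shift.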

      Reliability of other measures with/out unpleasantness

      To assess the effect of accounting for unpleasantness ratings on reliability estimates of task performance, we extracted variance components from linear mixed models, following a standard approach (Nakagawa et al., 2017) – note that this was not the method used to estimate reliability values in the main analyses, but we used this specific approach to compare the reliability values with and without the covariate of unpleasantness ratings. The results indicated that unpleasantness ratings did not have a material effect on reliability (Supplementary Figure 14).

      We discuss the findings of these sensitivity analyses in the ‘Discussion’ section, as follows:

      Relatedly, participants had some control over the intensity at which the punishments were presented, which may have driven our findings relating to anxiety and putative mechanisms of anxiety-related avoidance. Sensitivity analyses showed that our finding that anxiety is positively associated with avoidance in the task was robust to individual differences in self-reported punishment unpleasantness, whilst the mediation effects were not. Future work imposing better control over the stimuli presented, and/or using within-subjects designs will be needed to validate the role of reward/punishment sensitivities in anxiety-related avoidance.

      Introduction and discussion should spend more time relating the task and current findings to existing procedures and findings examining individual differences in avoidance and cognitive/motivational correlates.

      We thank the reviewer for the opportunity to expand on the literature. Whilst there are numerous behavioural paradigms in both the human and non-human literature that involve learning about rewards and punishments, our starting point for the introduction was the state-of-the-art in translational approach-avoidance conflict models of anxiety. Therefore, for the sake of brevity and logical flow of our introduction, we have opted to bring in the discussion on other procedures primarily in the ‘Discussion’ section of the manuscript.

      We have now included the reviewer’s suggested citations from their ‘Public Review’ as follows:

      Since we developed our task with the primary focus on translational validity, its design diverges from other reinforcement learning tasks that involve reward and punishment outcomes (Pike & Robinson, 2022). One important difference is that we used distinct reinforcers as our reward and punishment outcomes, compared to many studies which use monetary outcomes for both (e.g. earning and losing £1 constitute the reward and punishment, respectively; Aylward et al., 2019; Jean-Richard-Dit-Bressel et al., 2021; Pizzagalli et al., 2005; Sharp et al., 2022). Other tasks have been used that induce a conflict between value and motor biases, relying on prepotent biases to approach/move towards rewards and withdraw from punishments, which makes it difficult to approach punishments and withdraw from rewards (Guitart-Masip et al., 2012; Mkrtchian et al., 2017). However, since translational operant conflict tasks typically induce a conflict between different types of outcome (e.g. food and shocks/sugar and quinine pellets; Oberrauch et al., 2019; van den Bos et al., 2014), we felt it was important to implement this feature. One study used monetary rewards and shock-based punishments, but also included four options for participants to choose from on each trial, with rewards and punishments associated with all four options (Seymour et al., 2012). This effectively requires participants to maintain eight probability estimates (i.e. reward and punishment at each of the four options) to solve the task, which may be too difficult for non-human animals to learn efficiently.

      We have also included a discussion on the measure’s potential relevance to cognitive/motivational traits as follows:

      Finally, whilst there is a broad literature on the roles of behavioural inhibition and avoidance tendency traits in decision-making and behaviour (Carver & White, 1994; Corr, 2004; Gray, 1982), we did not replicate the correlation of experiential avoidance and avoidance responses or the reward-punishment sensitivity index. Since there were also no significant correlations across task performance indices and clinical symptom measures, our findings suggest that the measure may be more sensitive to behaviours relating to state anxiety, rather than more stable traits. Nevertheless, how performance in the present task relates to other traits such as behavioural approach/inhibition tendencies (Carver & White, 1994), as has been found in previous studies on reward/punishment learning (Sharp et al., 2022; Wise & Dolan, 2020) and approach-avoidance conflict (Aupperle et al., 2011), will be an important question for future work.

      We also now direct readers to a recent, comprehensive review on applying computational methods to approach-avoidance behaviours in the ‘Introduction’ section:

      A fundamental premise of this approach is that the brain acts as an information-processing organ that performs computations responsible for observable behaviours, including approach and avoidance (for a recent review on the application of computational methods to approach-avoidance conflict, see Letkiewicz et al., 2023).

      I am curious why participants were excluded if they made the same response on 20+ consecutive trials. How does this represent a cut-off between valid versus invalid behavioral profiles?

      We apologise for the lack of clarity on this point in our original submission – this exclusion criterion applied specifically if participants used the same response key (e.g. the left arrow button) on 20 or more consecutive trials, indicating inattention. Since the left-right positions of the stimuli were randomised across trials, this did not exclude participants who frequently chose the same option on consecutive trials. However, as we show in the Supplement, this, along with the other exclusion criteria, did not affect our main findings.

      We have now clarified this as follows:

      … we excluded those who responded with the same response key on 20 or more consecutive trials (> 10% of all trials; 4%/6% in discovery and replication samples, respectively) – note that as the options randomly switched sides on the screen across trials, this did not exclude participants who frequently and consecutively chose a certain option.
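
      The key-repetition criterion described above can be sketched as a longest-run check on the sequence of response keys. This is a minimal illustration, not the study's exclusion script; the function names and the key labels ("left"/"right") are hypothetical, and only the threshold of 20 consecutive trials comes from the text.

      ```python
      from itertools import groupby


      def longest_key_run(keys):
          """Length of the longest run of identical consecutive response keys."""
          return max((sum(1 for _ in run) for _, run in groupby(keys)), default=0)


      def should_exclude(keys, threshold=20):
          """True if the same key was pressed on `threshold` or more consecutive trials."""
          return longest_key_run(keys) >= threshold


      # Hypothetical participant: 19 left presses, one right press, then 5 left presses.
      example_keys = ["left"] * 19 + ["right"] + ["left"] * 5
      excluded = should_exclude(example_keys)
      ```

      The check operates on the key pressed rather than the option chosen, so a participant who consistently selects one option is not flagged as long as that option's on-screen side varies.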

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #2 (Public review):

      Summary:

      This work by Grogan and colleagues aimed to translate animal studies showing that acetylcholine plays a role in motivation by modulating the effects of dopamine on motivation. They tested this hypothesis with a placebo-controlled pharmacological study administering a muscarinic antagonist (trihexyphenidyl; THP) to a sample of 20 adult men performing an incentivized saccade task while undergoing electroencephalography (EEG). They found that reward increased vigor and reduced reaction times (RTs) and, importantly, these reward effects were attenuated by trihexyphenidyl. High incentives increased preparatory EEG activity (contingent negative variation), and though THP also increased preparatory activity, it also reduced this reward effect on RTs.

      Strengths:

      The researchers address a timely and potentially clinically relevant question with a within-subject pharmacological intervention and a strong task design. The results highlight the importance of the interplay between dopamine and other neurotransmitter systems in reward sensitivity and even though no Parkinson's patients were included in this study, the results could have consequences for patients with motivational deficits and apathy if validated in the future.

      Weaknesses:

      The main weakness of the study is the small sample size (N=20) that unfortunately is limited to men only. Generalizability and replicability of the conclusions remain to be assessed in future research with a larger and more diverse sample size and potentially a clinically relevant population. The EEG results do not shape a concrete mechanism of action of the drug on reward sensitivity.

      We thank the reviewer for their time and their assessment of this manuscript, and we appreciate their helpful comments on the previous version.

      We agree that the sample size being smaller than planned due to the pandemic restrictions is a weakness for this study, and hope that future studies into cholinergic effects on motivation in humans will use larger sample sizes. They should also ensure women are not excluded from sample populations, which will become even more important if the research progresses to clinical populations.

      Reviewer #3 (Public review):

      Summary:

      Grogan et al examine a role for muscarinic receptor activation in action vigor in a saccadic system. This work is motivated by a strong literature linking dopamine to vigor, and some animal studies suggesting that ACH might modulate these effects, and is important because patient populations with symptoms related to reduced vigor are prescribed muscarinic antagonists. The authors use a motivated saccade task with distractors to measure the speed and vigor of actions in humans under placebo or muscarinic antagonism. They show that muscarinic antagonism blunts the motivational effects of reward on both saccade velocity and RT, and also modulates the distractibility of participants, in particular by increasing the repulsion of saccades away from distractors. They show that preparatory EEG signals reflect both motivation and drug condition, and make a case that these EEG signals mediate the effects of the drug on behavior.

      Strengths:

      This manuscript addresses an interesting and timely question and does so using an impressive within subject pharmacological design and a task well designed to measure constructs of interest. The authors show clear causal evidence that ACH affects different metrics of saccade generation related to effort expenditure and their modulation by incentive manipulations. The authors link these behavioral effects to motor preparatory signatures, indexed with EEG, that relate to behavioral measures of interest and in at least one case statistically mediate the behavioral effects of ACH antagonism.

      Weaknesses:

      A primary weakness of this paper is the sample size - since only 20 participants completed the study. The authors address the sample size in several places and I completely understand the reason for the reduced sample size (study halt due to covid). Nonetheless, it is worth stating explicitly that this sample size is relatively small for the effect sizes typically observed in such studies highlighting the need for future confirmatory studies.

      We thank the reviewer for their time and their assessment of this manuscript, and we appreciate their helpful comments on the previous version.

      We agree that the small sample size is a weakness of the study, and hope that future work into cholinergic modulation of motivation can involve larger samples to replicate and extend this work.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Thank you for addressing my comments and clarifying the analysis sections. Women can be included in such studies by performing a pregnancy test before each test session, but I understand how this could have added to the pandemic limitations. Best of luck with your future work!

      Thank you for your time in reviewing this paper, and your helpful comments.

      Reviewer #3 (Recommendations for the authors):

      The authors have done a great job at addressing my concerns and I think that the manuscript is now very solid. That said, I have one minor concern.

      Thank you for your time in reviewing this paper, and your helpful comments.

      For descriptions of mass univariate analyses and cluster correction, I am still a bit confused on exactly what terms were in the regression. In one place, the authors state:

      On each iteration we shuffled the voltages across trials within each condition and person, and regressed it against the behavioural variable, with the model 'variable ~1 + voltage + incentive*distractorPresent*THP + (1 | participant)'.

      I take this to mean that the regression model includes a voltage regressor and a three-way interaction term, along with participant level intercept terms.

      However, elsewhere, the authors state:

      "We regressed each electrode and time-point against the three behavioural variables separately, while controlling for effects of incentive, distractor, THP, the interactions of those factors, and a random effect of participant."

      I take this to mean that the regression model included regressors for incentive, distractorPresent, THP, along with their 2 and 3 way interactions. I think that this seems like the more reasonable model - but I just want to 1) verify that this is what the authors did and 2) encourage them to articulate this more clearly and consistently throughout.

      We apologise for the lack of clarity about the whole-brain regression analyses.

      We used Wilkinson notation for this formula, where ‘A*B’ denotes ‘A + B + A:B’, so all main effects and lower-order interaction terms were included in the regression, as your second interpretation says. The model written out in full, with ‘:’ denoting an interaction term, would be:

      'variable ~ 1 + voltage + incentive + distractorPresent + THP + incentive:distractorPresent + incentive:THP + distractorPresent:THP + incentive:distractorPresent:THP + (1 | participant)'

      We will clarify this in the Version of Record.
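
      To make the expansion rule concrete, the short sketch below expands a crossed term such as ‘incentive*distractorPresent*THP’ into its main effects and interactions. It is only an illustration of the notation; the helper name `expand_star` is hypothetical and it does not call any model-fitting software.

      ```python
      from itertools import combinations


      def expand_star(factors):
          """Expand crossed factors (A*B*...) into main effects and all interactions.

          Interactions are written with ':' as in Wilkinson notation.
          """
          terms = []
          for order in range(1, len(factors) + 1):
              for combo in combinations(factors, order):
                  terms.append(":".join(combo))
          return terms


      # Fixed-effect part of the model: intercept, voltage, and the crossed factors.
      fixed_terms = ["1", "voltage"] + expand_star(["incentive", "distractorPresent", "THP"])
      formula = "variable ~ " + " + ".join(fixed_terms) + " + (1 | participant)"
      ```

      Crossing three factors yields seven terms (three main effects, three two-way interactions and one three-way interaction), in addition to the intercept and the voltage regressor.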


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors used a motivated saccade task with distractors to measure response vigor and reaction time (RT) in healthy human males under placebo or muscarinic antagonism. They also simultaneously recorded neural activity using EEG with event-related potential (ERP) focused analyses. This study provides evidence that the muscarinic antagonist Trihexyphenidyl (THP) modulates the motivational effects of reward on both saccade velocity and RT, and also increases the distractibility of participants. The study also examined the correlational relationships between reaction time and vigor and manipulations (THP, incentives) with components of the EEG-derived ERPs. While an interesting correlation structure emerged from the analyses relating the ERP biomarkers to behavior, it is unclear how these potentially epiphenomenal biomarkers relate to relevant underlying neurophysiology.

      Strengths:

      This study is a logical translational extension from preclinical findings of cholinergic modulation of motivation and vigor and the CNV biomarker to a normative human population, utilizing a placebo-controlled, double-blind approach.

      While framed in the context of Parkinson's disease where cholinergic medications can be used, the authors do a good job in the discussion describing the limitations in generalizing their findings obtained in a normative and non-age-matched cohort to an aged PD patient population.

      The exploratory analyses suggest alternative brain targets and/or ERP components that relate to the behavior and manipulations tested. These will need to be further validated in an adequately powered study. Once validated, the most relevant biomarkers could be assessed in a more clinically relevant population.

      Weaknesses:

      The relatively weak correlations between the main experimental outcomes provide unclear insight into the neural mechanisms by which the manipulations lead to behavioral manifestations outside the context of the ERP. It would have been interesting to evaluate how other quantifications of the EEG signal through time-frequency analyses relate to the behavioral outcomes and manipulations.

      The ERP correlations to relevant behavioral outcomes were not consistent across manipulations demonstrating they are not reliable biomarkers to behavior but do suggest that multiple underlying mechanisms can give rise to the same changes in the ERP-based biomarkers and lead to different behavioral outcomes.

      We thank the reviewer for their review and their comments.

      We agree that these ERPs may not be reliable biomarkers yet, given the many-to-one mapping we observed where incentives and THP antagonism both affected the CNV in different ways, and hope that future studies will help clarify the use and limitations of the CNV as a potential biomarker of invigoration.

      Our original hypothesis was specifically about the CNV as an index of preparatory behaviour, but we plan to look at potential changes to frequency characteristics in future work. We have included this in the discussion of future investigations. (page 16, line 428):

      “Future investigations of other aspects of the EEG signals may illuminate us. Such studies could also investigate other potential signals that may be more sensitive to invigoration and/or muscarinic antagonism, including frequency-band power and phase-coherence, or measures of variability in brain signals such as entropy, which may give greater insight into processes affected by these factors.”

      Reviewer #2 (Public Review):

      Summary:

      This work by Grogan and colleagues aimed to translate animal studies showing that acetylcholine plays a role in motivation by modulating the effects of dopamine on motivation. They tested this hypothesis with a placebo-controlled pharmacological study administering a muscarinic antagonist (trihexyphenidyl; THP) to a sample of 20 adult men performing an incentivized saccade task while undergoing electroencephalography (EEG). They found that reward increased vigor and reduced reaction times (RTs) and, importantly, these reward effects were attenuated by trihexyphenidyl. High incentives increased preparatory EEG activity (contingent negative variation), and though THP also increased preparatory activity, it also reduced this reward effect on RTs.

      Strengths:

      The researchers address a timely and potentially clinically relevant question with a within-subject pharmacological intervention and a strong task design. The results highlight the importance of the interplay between dopamine and other neurotransmitter systems in reward sensitivity and even though no Parkinson's patients were included in this study, the results could have consequences for patients with motivational deficits and apathy if validated in the future.

      Weaknesses:

      The main weakness of the study is the small sample size (N=20) that unfortunately is limited to men only. The generalizability and replicability of the conclusions remain to be assessed in future research with a larger and more diverse sample size and potentially a clinically relevant population. The EEG results do not shape a concrete mechanism of action of the drug on reward sensitivity.

      We thank the reviewer for their review, and their comments.

      We agree that our study was underpowered, not reaching our target of 27 participants due to pandemic restrictions halting our recruitment, and hope that future studies into muscarinic antagonism in motivation will have larger sample sizes, and include male and female participants across a range of ages, to assess generalisability.

      We only included men to prevent the chance of administering the drug to someone pregnant. Trihexyphenidyl is categorized by the FDA as a Pregnancy Category C drug, and the ‘Summary of Product Characteristics’ states: “There is inadequate information regarding the use of trihexyphenidyl in pregnancy. Animal studies are insufficient with regard to effects on pregnancy, embryonal/foetal development, parturition and postnatal development. The potential risk for humans is unknown. Trihexyphenidyl should not be used during pregnancy unless clearly necessary.”

      While the drug can be prescribed where benefits may outweigh this risk, as there were no benefits to participants in this study, we only recruited men to keep the risk at zero.

      We have updated the Methods/Drugs section to explain this (page 17, line 494):

      “The risks of Trihexyphenidyl in pregnancy are unknown, but the Summary of Product Characteristics states that it “should not be used during pregnancy unless clearly necessary”. As this was a basic research study with no immediate clinical applications, there was no justification for any risk of administering the drug during pregnancy, so we only recruited male participants to keep this risk at zero.”

      And we refer to this in the Methods/Participants section (page 18, line 501):

      “We recruited 27 male participants (see Drugs section above),…”

      We agree that future work is needed to replicate this in different samples, and that this work cannot tell us the mechanism by which the drug is dampening invigoration, but we think that showing these effects do occur and can be linked to anticipatory/preparatory activity rather than overall reward sensitivity is a useful finding.

      Reviewer #3 (Public Review):

      Summary:

      Grogan et al examine a role for muscarinic receptor activation in action vigor in a saccadic system. This work is motivated by a strong literature linking dopamine to vigor, and some animal studies suggesting that ACH might modulate these effects, and is important because patient populations with symptoms related to reduced vigor are prescribed muscarinic antagonists. The authors use a motivated saccade task with distractors to measure the speed and vigor of actions in humans under placebo or muscarinic antagonism. They show that muscarinic antagonism blunts the motivational effects of reward on both saccade velocity and RT, and also modulates the distractibility of participants, in particular by increasing the repulsion of saccades away from distractors. They show that preparatory EEG signals reflect both motivation and drug condition, and make a case that these EEG signals mediate the effects of the drug on behavior.

      Strengths:

      This manuscript addresses an interesting and timely question and does so using an impressive within-subject pharmacological design and a task well-designed to measure constructs of interest. The authors show clear causal evidence that ACH affects different metrics of saccade generation related to effort expenditure and their modulation by incentive manipulations. The authors link these behavioral effects to motor preparatory signatures, indexed with EEG, that relate to behavioral measures of interest and in at least one case statistically mediate the behavioral effects of ACH antagonism.

      Weaknesses:

      In full disclosure, I have previously reviewed this manuscript in another journal and the authors have done a considerable amount of work to address my previous concerns. However, I have a few remaining concerns that affect my interpretation of the current manuscript.

      Some of the EEG signals (figures 4A&C) have profiles that look like they could have ocular, rather than central nervous, origins. Given that this is an eye movement task, it would be useful if the authors could provide some evidence that these signals are truly related to brain activity and not driven by ocular muscles, either in response to explicit motor effects (ie. Blinks) or in preparation for an upcoming saccade.

      We thank the reviewer for re-reviewing the manuscript and for raising this issue.

      All the EEG analyses (both ERP and whole-brain) analyse the preparation period between the ready-cue and target appearance when no eye-movements are required. We reject trials with blinks or saccades over 1 degree in size, as detected by the Eyelink software according to the sensitive velocity and acceleration criteria specified in the manuscript (Methods/Eye-tracking, page 19, line 550). This means that there should be no overt eye movements in the data. However, microsaccades and ocular drift are still possible within this period, which indeed could drive some effects. To measure this, we counted the number of microsaccades (<1 degree in size) in the preparation period between incentive cue and the target onset, for each trial. Further, we measured the mean absolute speed of the eye during the preparation period (excluding the periods during microsaccades) for each trial.

      We have run a control analysis to check whether including ocular drift speed or number of microsaccades as a covariate in the whole-brain regression analysis changes the association between EEG and the behavioural metrics at frontal or other electrodes. Below we show these ‘variable ~ EEG’ beta-coefficients when controlling for each eye-movement covariate, in the same format as Figure 4. We did not run the permutation testing on this due to time/computational costs (it takes >1 week per variable), so p-values were not calculated, only the beta-coefficients. The beta-coefficients are almost unchanged, both in time-course and topography, when controlling for either covariate. The frontal associations with velocity and distractor pull remain, suggesting they are not due to these eye movements.

      We have added this figure as a supplemental figure.

      For additional clarity in this response, we also plot the differences between these covariate-controlled beta-coefficients, and the true beta-coefficients from figure 4 (please note the y-axis scales are -0.02:0.02, not -0.15:0.15 as in Figure 4 and Figure 4-figure supplement 2). This shows that the changes to the associations between EEG and velocity/distractor-pull were not frontally-distributed, demonstrating eye-movements were not driving these effects. Relatedly, the RT effect’s change was frontally-distributed, despite Figure 4 showing the true relationship was central in focus, again indicating that effect was also not related to these eye movements.

      Author response image 1.

      Difference in beta-coefficients when eye-movement covariates are included. This is the difference from the beta-coefficients shown in Figure 4, please note the smaller y-axis limits.

      The same pattern was seen if we controlled for the change in eye-position from the baseline period (measured by the eye-tracker) at each specific time-point, i.e., controlling for the distance the eye had moved from baseline at the time the EEG voltage is measured. The topographies and time-course plots were almost identical to the above ones:

      Author response image 2.

      Controlling for change in eye-position at each time-point does not change the regression results. Left column shows the beta-coefficients between the variable and EEG voltage, and the right column shows the difference from the main results in Figure 4 (note the smaller y-axis limits for the right-hand column).

      Therefore, we believe the brain-behaviour regressions are independent of eye-movements. We have included the first figure presented here as an additional supplemental figure, and added the following to the text (page 10, line 265):

      “An additional control analysis found that these results were not driven by microsaccades or ocular drift during the preparation period, as including these as trial-wise covariates did not substantially change the beta-coefficients (Figure 4 – Figure Supplement 2).”

      For other EEG signals, in particular, the ones reported in Figure 3, it would be nice to see what the spatial profiles actually look like - does the scalp topography match that expected for the signal of interest?

      Yes, the CNV is a central negative potential peaking around Cz, while the P3a is slightly anterior to this (peaking between Cz and FCz). We have added the topographies to the main figure (see point below).

      This is the topography of the mean CNV (1200:1500ms from the preparation cue onset), which is maximal over Cz, as expected.

      The P3a’s topography (200:280ms after preparation cue) is maximal slightly anterior to Cz, between Cz and FCz.

      A primary weakness of this paper is the sample size - since only 20 participants completed the study. The authors address the sample size in several places and I completely understand the reason for the reduced sample size (study halt due to COVID). That said, they only report the sample size in one place in the methods rather than through degrees of freedom in their statistical tests conducted throughout the results. In part because of this, I am not totally clear on whether the sample size for each analysis is the same - or whether participants were removed for specific analyses (ie. due to poor EEG recordings, for example).  

      We apologise for the lack of clarity here. All 20 participants were included in all analyses, although the number of trials included differed between behavioural and EEG analyses. We only excluded trials with EEG artefacts from the EEG analyses, not from the purely behavioural analyses such as Figures 1&2, although trials with blinks/saccades were removed from behavioural analyses too. Removing the EEG artefactual trials from the behavioural analyses did not change the findings, despite the lower power. The degrees of freedom in the figure supplement tables are the total number of trials (less 8 fixed-effect terms) included in the single-trial / trial-wise regression analyses we used.

      We have clarified this in the Methods/Analysis (page 20, line 602):

      “Behavioural and EEG analysis included all 20 participants, although trials with EEG artefacts were included in the behavioural analyses (18585 trials in total) and not the EEG analyses (16627 trials in total), to increase power in the former. Removing these trials did not change the findings of the behavioural analyses.”

      And we state the number of participants and trials in the start of the behavioural results (page 3, line 97):

      “We used single-trial mixed-effects linear regression (20 participants, 18585 trials in total) to assess the effects of Incentive, Distractors, and THP, along with all the interactions of these (and a random-intercept per participant), on residual velocity and saccadic RT.”

      and EEG results section (page 7, line 193):

      “We used single-trial linear mixed-effects regression to see the effects of Incentive and THP on each ERP (20 participants, 16627 trials; Distractor was included too, along with all interactions, and a random intercept by participant).”

      Beyond this point, but still related to the sample size, in some cases I worry that results are driven by a single subject. In particular, the interaction effect observed in Figure 1e seems like it would be highly sensitive to the single subject who shows a reverse incentive effect in the drug condition.

      Repeating that analysis after removing the participant with the large increase in saccadic RT with incentives did not remove the incentive*THP interaction effect, although it weakened slightly from (β = 0.0218, p = .0002) to (β = 0.0197, p = .0082). This is likely because, while that participant did have slower RTs for higher incentives on THP, they were also slower for higher incentives under placebo (and similarly for distractor present/absent), making them less of an outlier in terms of effects than in raw RT terms. Author response image 3 shows the mean figure without that participant, and Author response image 4 shows that participant separately.

      Author response image 3.

      Author response image 4.

      There are not sufficient details on the cluster-based permutation testing to understand what the authors did or whether it is reasonable. What channels were included? What metric was computed per cluster? How was null distribution generated?

      We apologise for not giving sufficient details of this, and have updated the Methods/Analysis section to include these details, along with a brief description in the Results section.

      To clarify here, we adapted the DMGroppe Mass Univariate Testing toolbox to also run cluster-based permutation regressions to examine the relationship between the behavioural variables and the voltages at all EEG electrodes at each time point. On each iteration we shuffled the voltages across trials within each condition and person, and regressed it against the behavioural variable, with the model ‘variable ~1 + voltage + incentive*distractorPresent*THP + (1 | participant)’. The Voltage term measured the association between voltage and the behavioural variable, after controlling for effects of incentive*distractor*THP on behaviour – i.e. does adding the voltage at this time/channel explain additional variance in the variable not captured in our main behavioural analyses. By shuffling the voltages, we removed the relationship to the behavioural variable, to build the null distribution of t-statistics across electrodes and time-samples. We used the ‘cluster mass’ method (Bullmore et al., 1999; Groppe et al., 2011; Maris & Oostenveld, 2007) to build the null distribution of cluster mass (across times/channels per iteration), and calculated the p-value as the proportion of this distribution further from zero than the absolute true t-statistics (two-tailed test).

      We have given greater detail for this in the Methods/Analysis section (page 20, line 614):

      “We adapted this toolbox to also run cluster-based permutation regressions to examine the relationship between the behavioural variables and the voltages at all EEG electrodes at each time point. On each iteration we shuffled the voltages across trials within each condition and person, and regressed it against the behavioural variable, with the model ‘~1 + voltage + incentive*distractorPresent*THP + (1 | participant)’. The Voltage term measured the association between voltage and the behavioural variable, after controlling for effects of incentive*distractor*THP on behaviour. By shuffling the voltages, we removed the relationship to the behavioural variable, to build the null distribution of t-statistics across electrodes and time-samples. We used the ‘cluster mass’ method (Bullmore et al., 1999; Groppe et al., 2011; Maris & Oostenveld, 2007) to build the null distribution, and calculated the p-value as the proportion of this distribution further from zero than the true t-statistics (two-tailed test). Given the relatively small sample size here, these whole-brain analyses should not be taken as definitive.”

      And we have added a brief explanation to the Results section also (page 9, line 246):

      “We regressed each electrode and time-point against the three behavioural variables separately, while controlling for effects of incentive, distractor, THP, the interactions of those factors, and a random effect of participant. This analysis therefore asks whether trial-to-trial neural variability predicts behavioural variability. To assess significance, we used cluster-based permutation tests (DMGroppe Mass Univariate toolbox; Groppe, Urbach, & Kutas, 2011), shuffling the trials within each condition and person, and repeating it 2500 times, to build a null distribution of ‘cluster mass’ from the t-statistics (Bullmore et al., 1999; Maris & Oostenveld, 2007) which was used to calculate two-tailed p-values with a family-wise error rate (FWER) of .05 (see Methods/Analysis for details).”
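The cluster-mass logic described above can be illustrated with a minimal sketch. It is deliberately simplified and is not the toolbox code: it handles one participant and one channel (so clusters are runs of adjacent time points only), replaces the mixed-effects regression with a simple regression t-statistic, and uses a cluster-forming threshold of |t| > 2 and a "+1" correction in the p-value, both of which are assumptions made here. All function names are hypothetical.

```python
import math
import random

def t_stat(x, y):
    """t-statistic for the slope of a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    r = sxy / math.sqrt(sxx * syy)
    r = max(min(r, 0.999999), -0.999999)  # avoid dividing by zero when |r| = 1
    return r * math.sqrt((n - 2) / (1 - r * r))

def max_cluster_mass(tvals, threshold):
    """Largest |sum of t| over runs of adjacent, same-sign, supra-threshold samples."""
    best, run, sign = 0.0, 0.0, 0
    for t in tvals:
        s = (t > threshold) - (t < -threshold)  # +1, -1, or 0
        if s != 0 and s == sign:
            run += t                      # extend the current cluster
        else:
            run, sign = (t if s != 0 else 0.0), s  # start a new cluster or reset
        best = max(best, abs(run))
    return best

def cluster_permutation_p(voltage, behaviour, threshold=2.0, n_perm=2500, seed=0):
    """voltage[trial][time], behaviour[trial] -> (observed max cluster mass, p-value)."""
    rng = random.Random(seed)
    n_time = len(voltage[0])

    def mass(volt):
        tvals = [t_stat([row[k] for row in volt], behaviour) for k in range(n_time)]
        return max_cluster_mass(tvals, threshold)

    observed = mass(voltage)
    order = list(range(len(voltage)))
    null = []
    for _ in range(n_perm):
        rng.shuffle(order)  # shuffle whole trials, breaking the voltage-behaviour link
        null.append(mass([voltage[i] for i in order]))
    p = (1 + sum(m >= observed for m in null)) / (1 + n_perm)
    return observed, p
```

Taking the maximum cluster mass on each permutation is what controls the family-wise error rate across time points: an observed cluster is only significant if it is larger than the largest cluster that chance alone tends to produce anywhere in the epoch.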

      The authors report that "muscarinic antagonism strengthened the P3a" - but I was unable to see this in the data plots. Perhaps it is because the variability related to individual differences obscures the conditional differences in the plots. In this case, event-related difference signals could be helpful to clarify the results.

      We thank the reviewer for spotting this wording error. This should refer to the incentive effect weakening the P3a, as no other significant effects were found on the P3a, as stated correctly in the previous paragraph. We have corrected this in the manuscript (page 9, line 232):

      “This suggests that while incentives strengthened the incentive-cue response and the CNV and weakened the P3a, muscarinic antagonism strengthened the CNV,”

      The reviewer’s suggestion for difference plots is very valuable, and we have added these to Figure 3, as well as increasing the y-axis scale for figure 3c to show the incentives weakening the P3a more clearly, and adding the topographies suggested in an earlier comment. The difference waves for Incentive and THP effects show that both are decreasing voltage, albeit with slightly different onset times – Incentive starts earlier, thus weakening the positive P3a, while both strengthen the negative CNV. The Incentive effects within THP and Placebo separately illustrate the THP*Incentive interaction.

      We have amended the Results text and figure (page 7, line 200):

      “The subsequent CNV was strengthened (i.e. more negative; Figure 3d) by incentive (β = -.0928, p < .0001) and THP (β = -0.0502, p < .0001), with an interaction whereby THP decreased the incentive effect (β= 0.0172, p = .0213). Figure 3h shows the effects of Incentive and THP on the CNV separately, using difference waves, and Figure 3i shows the incentive effect grows more slowly in the THP condition than the Placebo condition.”

      For mediation analyses, it would be useful in the results section to have a much more detailed description of the regression results, rather than just reporting things in a binary did/did not mediate sort of way. Furthermore, the methods should also describe how mediation was tested statistically (ie. What is the null distribution that the difference in coefficients with/without moderator is tested against?).

      We have added a more detailed explanation of how we investigated mediation and mediated moderation, and now report the mediation effects for all tests run and the permutation-test p-values.

      We had been using the Baron & Kenny (1986) method, based on 4 tests outlined in the updated text below, which gives a single measure of change in absolute beta-coefficients when all the tests have been met, but without any indication of significance; any reduction found after meeting the other 3 tests indicates a partial mediation under this method. We now use permutation testing to generate a p-value for the likelihood of finding an equal or larger reduction in the absolute beta-coefficients if the CNV were not truly related to RT. This found that the CNV’s mediation of the Incentive effect on RT was highly significant, while the Mediated Moderation of CNV on THP*Incentive was weakly significant.

      During this re-analysis, we noticed that we had different trial-numbers in the different regression models, as EEG-artefactual trials were not excluded from the behavioural-only model (‘RT ~ 1 + Incentive’). However, this causes issues with the permutation testing as we are shuffling the ERPs and need the same trials included in all the mixed-effects models. Therefore, we have redone these mediation analyses, including only the trials with valid ERP measures (i.e. no artefactual trials) in all models. This has changed the beta-coefficients we report, but not the findings or conclusions of the mediation analyses. We have updated the figure to have these new statistics.

      We have updated the text to explain the methodology in the Results section (page 12, line 284):

      “We have found that neural preparatory activity can predict residual velocity and RT, and is also affected by incentives and THP. Finally, we ask whether the neural activity can explain the effects of incentives and THP, through mediation analyses. We used the Baron & Kenny (1986) method to assess mediation (see Methods/Analysis for full details). This tests whether the significant Incentive effect on behaviour could be partially reduced (i.e., explained) by including the CNV as a mediator in a mixed-effects single-trial regression. We measured mediation as the reduction in (absolute) beta-coefficient for the incentive effect on behaviour when the CNV was included as a mediator (i.e., RT ~ 1 + Incentive + CNV + Incentive*CNV + (1 | participant)). This is a directional hypothesis of a reduced effect, and to assess significance we ran a permutation-test, shuffling the CNV within participants, and measuring the change in absolute beta-coefficient for the Incentive effect on behaviour. This generates a distribution of mediation effects where there is no relationship between CNV and RT on a trial (i.e., a null distribution). We ran 2500 permutations, and calculated the proportion with an equal or more negative change in absolute beta-coefficient, equivalent to a one-tailed test. We ran this mediation analysis separately for the two behavioural variables of RT and residual velocity, but not for distractor pull as it was not affected by incentive, so failed the assumptions of mediation analyses (Baron & Kenny, 1986; Muller et al., 2005). We took the mean CNV amplitude from 1200:1500ms as our Mediator.

      Residual velocity passed all the assumption tests for Mediation analysis, but no significant mediation was found. That is, Incentive predicted velocity (β=0.1304, t(1,16476)=17.3280, p<.0001); Incentive predicted CNV (β=-0.9122, t(1,16476)=-12.1800, p<.0001); and CNV predicted velocity when included alongside Incentive (β=0.0015, t(1,16475)=1.9753, p=.0483). However, including CNV did not reduce the Incentive effect on velocity, and in fact strengthened it (β=0.1318, t(1,16475)=17.4380, p<.0001; change in absolute coefficient: Δβ=+0.0014). Since there was no mediation (reduction), we did not run permutation tests on this.

      However, RT did show a significant mediation of the Incentive effect by CNV: Incentive predicted RT (β=-0.0868, t(1,16476)=-14.9330, p<.0001); Incentive predicted CNV (β=-0.9122, t(1,16476)=-12.1800, p<.0001); and CNV predicted RT when included alongside Incentive (β=0.0127, t(1,16475)=21.3160, p<.0001). The CNV mediated the effect of Incentive on RT, reducing the absolute beta-coefficient (β=-0.0752, t(1,16475)=-13.0570, p<.0001; change in absolute coefficient: Δβ= -0.0116). We assessed the significance of this change via permutation testing, shuffling the CNV across trials (within participants) and calculating the change in absolute beta-coefficient for the Incentive effect on RT when the permuted CNV was included as a mediator. We repeated this 2500 times to build a null distribution of Δβ, and calculated the proportion with equal or stronger reductions for a one-tailed p-value, which was highly significant (p<.0001). This suggests that the Incentive effect on RT is partially mediated by the CNV’s amplitude during the preparation period, and this is not the case for residual velocity.

      We also investigated whether the CNV could explain the cholinergic reduction in motivation (THP*Incentive interaction) on RT – i.e., whether the CNV mediated the THP moderation. We measured Mediated Moderation as suggested by Muller et al. (2005; see Methods/Analysis for full explanation): Incentive*THP was associated with RT (β=0.0222, t(1,16474)=3.8272, p=.0001); and Incentive*THP was associated with CNV (β=0.1619, t(1,16474)=2.1671, p=.0302); and CNV*THP was associated with RT (β=0.0014, t(1,16472)=2.4061, p=.0161). Mediated Moderation was measured by the change in absolute Incentive*THP effect when THP*CNV was included in the mixed-effects model (β=0.0214, t(1,16472)=3.7298, p=.0002; change in beta-coefficient: Δβ= -0.0008), and permutation-testing (permuting the CNV as above) found a significant effect (p=.0132). This indicates cholinergic blockade changes how incentives affect preparatory negativity, and how this negativity reflects RT, which can explain some of the reduced invigoration of RT. However, this was not observed for saccade velocity.”

      And we have updated the Methods/Analysis section with a more detailed explanation too (page 21, line 627):

      “For the mediation analysis, we followed the 4-step process (Baron & Kenny, 1986; Muller et al., 2005), which requires 4 tests be met for the outcome (behavioural variable, e.g. RT), mediator (ERP, e.g., CNV) and the treatment (Incentive):

      (1) Outcome is significantly associated with the Treatment (RT ~ 1 + Incentive + (1 | participant))

      (2) Mediator is significantly associated with the Treatment (ERP ~ 1 + Incentive + (1 | participant))

      (3) Mediator is significantly associated with the Outcome (RT ~ 1 + Incentive + ERP + (1 | participant))

      (4) And the inclusion of the Mediator reduces the association between the Treatment and Outcome (Incentive effect from model #3)

      The mediation was measured by the reduction in the absolute standardised beta coefficient between incentive and behaviour when the ERP mediator was included (model #3 vs model #1 above). We used permutation-testing to quantify the likelihood of finding these mediations under the null hypothesis, achieved by shuffling the ERP across trials (within each participant) to remove any link between the ERP and behaviour. We repeated this 2500 times to build a null distribution of the change in absolute beta-coefficients for the RT ~ Incentive effect when this permuted mediator was included (model #3 vs model #1). We calculated a one-tailed p-value by finding the proportion of the null distribution that was equal or smaller than the true values (as Mediation is a one-tailed prediction).

      Mediated moderation (Muller et al., 2005) was used to see whether the effect of THP (the Moderator) on behaviour is mediated by the ERP, with the following tests (after the previous Mediation tests were already satisfied):

      (5) THP moderates the Incentive effect, via a significant Treatment*Moderator interaction on the Outcome (RT ~ 1 + Incentive + THP + Incentive*THP + (1 | participant))

      (6) THP moderates the Incentive effect on the Mediator, via a Treatment*Moderator interaction on the Mediator (ERP ~ 1 + Incentive + THP + Incentive*THP + (1 | participant))

      (7) THP’s moderation of the Incentive effect is mediated by the ERP, via a reduction in the association of Treatment*Moderator on the Outcome when the Mediator*Moderator interaction is included (RT ~ 1 + Incentive + THP + Incentive*THP + ERP + ERP*THP + (1 | participant))

      Mediated moderation is measured as the reduction in absolute beta-coefficients for ‘RT ~ Incentive*THP’ between model #5 and #7, which captures how much of this interaction could be explained by including the Mediator*Moderator interaction (ERP*THP in model #7). We tested the significance of this with permutation testing as above, permuting the ERP across trials (within participants) 2500 times, and building a null distribution of the change in the absolute beta-coefficients for RT ~ Incentive*THP between models #7 and #5. We calculated a one-tailed p-value from the proportion of these that were equal or smaller than the true change.”
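The permutation test for mediation (model #3 versus model #1) can be sketched as follows. This is an illustrative simplification rather than the analysis code: it treats the data as coming from a single participant, so the random intercept is dropped and the mediator is shuffled across all trials, and it uses the closed-form standardised coefficients for one- and two-predictor regressions. Function names are hypothetical.

```python
import math
import random

def corr(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def delta_abs_beta(treatment, mediator, outcome):
    """Change in |standardised beta| for the treatment once the mediator is added."""
    r_ty = corr(treatment, outcome)   # model #1: outcome ~ treatment
    r_tm = corr(treatment, mediator)
    r_my = corr(mediator, outcome)
    # model #3: outcome ~ treatment + mediator (standardised two-predictor OLS)
    beta_with = (r_ty - r_tm * r_my) / (1 - r_tm ** 2)
    return abs(beta_with) - abs(r_ty)  # negative values indicate a reduction

def mediation_p(treatment, mediator, outcome, n_perm=2500, seed=0):
    """One-tailed p: proportion of permuted changes equal to or more negative than observed."""
    rng = random.Random(seed)
    observed = delta_abs_beta(treatment, mediator, outcome)
    shuffled = list(mediator)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # removes any link between mediator and outcome
        if delta_abs_beta(treatment, shuffled, outcome) <= observed:
            count += 1
    return observed, count / n_perm
```

With several participants, the shuffle would be done separately within each participant and the coefficients would come from the mixed-effects fits; the comparison against the null distribution of Δβ stays the same.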

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      (1) The analysis section could benefit from greater detail. For example, how exactly did they assess that the effects of the drug on peak velocity and RT were driven by non-distracting trials? Ideally, for every outcome, the analysis approach used should be detailed and justified.

      We apologise for the confusion from this. To clarify, we found a 2-way interaction (incentive*THP) on both residual velocity and saccadic RT and this pattern was stronger in distractor-absent trials for residual velocity, and stronger in distractor-present trials for saccadic RT, as can be seen in Figure 1d&e. However, as there was no significant 3-way interaction (incentive*THP*distractor) for either metric, and the 2-way interaction effects were in the same direction in distractor present/absent trials for both metrics, we think these effects were relatively unaffected by distractor presence.

      We have updated the Results section to make this clearer: (page 3, line 94):

      We measured vigour as the residual peak velocity of saccades within each drug session (see Figure 1c & Methods/Eye-tracking), which is each trial’s deviation of velocity from the main sequence. This removes any overall effects of the drug on saccade velocity, while still allowing incentives and distractors to have different effects within each drug condition. We used single-trial mixed-effects linear regression (20 participants, 18585 trials in total) to assess the effects of Incentive, Distractors, and THP, along with all the interactions of these (and a random-intercept per participant), on residual velocity and saccadic RT. As predicted, residual peak velocity was increased by incentives (Figure 1d; β = 0.1266, p < .0001), while distractors slightly slowed residual velocity (β = -0.0158, p = .0294; see Figure 1 – Figure supplement 1 for full behavioural statistics). THP decreased the effect of incentives on velocity (incentive * THP: β = -0.0216, p = .0030), indicating that muscarinic blockade diminished motivation by incentives. Figure 1d shows that this effect was similar in distractor absent/present trials, although slightly stronger when the distractor was absent; the 3-way (distractor*incentive*THP) interaction was not significant (p > .05), suggesting that the distractor-present trials had the same effect but weaker (Figure 1d).

      Saccadic RT (time to initiation of saccade) was slower when participants were given THP (β = 0.0244, p < .0001), faster with incentives (Figure 1e; β = -0.0767, p < .0001), and slowed by distractors (β = 0.0358, p < .0001). Again, THP reduced the effects of incentives (incentive*THP: β = 0.0218, p = .0002). Figure 1e shows that this effect was similar in distractor absent/present trials, although slightly stronger when the distractor was present; as the 3-way (distractor*incentive*THP) interaction was not significant and the direction of effects was the same in the two, it suggests the effect was similar in both conditions. Additionally, the THP*Incentive interactions were correlated between saccadic RT and residual velocity at the participant level (Figure 1 – Figure supplement 2).

      We have given more details of the analyses performed in the Methods section and the results, as requested by you and the other reviewers (page 20, line 602):

      Behavioural and EEG analysis included all 20 participants, although trials with EEG artefacts were included in the behavioural analyses (18585 trials in total) and not the EEG analyses (16627 trials in total), to increase power in the former. Removing these trials did not change the findings of the behavioural analyses.

      We used single-trial linear-mixed effects models to analyse our data, including participant as a random effect of intercept, with the formula ‘~1 + incentive*distractor*THP + (1 | participant)’. We z-scored all factors to give standardised beta coefficients.
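Written out in full, the formula above corresponds to a trial-level model of the following form. The notation is introduced here purely for illustration and does not appear in the manuscript: y_ij is the behavioural or ERP measure on trial i of participant j; I, D and T are the z-scored Incentive, Distractor and THP factors; u_j is the participant random intercept. The Gaussian assumptions are the standard ones for a linear mixed-effects model.

```latex
\begin{aligned}
y_{ij} ={}& \beta_0 + u_j + \beta_1 I_{ij} + \beta_2 D_{ij} + \beta_3 T_{ij} \\
          &+ \beta_4 I_{ij} D_{ij} + \beta_5 I_{ij} T_{ij} + \beta_6 D_{ij} T_{ij}
           + \beta_7 I_{ij} D_{ij} T_{ij} + \varepsilon_{ij}, \\
u_j \sim{}& \mathcal{N}(0, \sigma_u^2), \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2).
\end{aligned}
```

The model has eight fixed-effect coefficients (β_0 to β_7), which is consistent with the 8 fixed-effect terms subtracted from the trial count to give the degrees of freedom.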

      For the difference-wave cluster-based permutation tests (Figure 3 – Figure supplement 4), we used the DMGroppe Mass Univariate toolbox (Groppe et al., 2011), with 2500 permutations, to control the family-wise error rate at 0.05. This was used for looking at difference waves to test the effects of incentive, THP, and the incentive*THP interaction (using difference of difference-waves), across all EEG electrodes.

      We adapted this toolbox to also run cluster-based permutation regressions to examine the relationship between the behavioural variables and the voltages at all EEG electrodes at each time point. On each iteration we shuffled the voltages across trials within each condition and person, and regressed it against the behavioural variable, with the model ‘~1 + voltage + incentive*distractorPresent*THP + (1 | participant)’. The Voltage term measured the association between voltage and the behavioural variable, after controlling for effects of incentive*distractor*THP on behaviour. By shuffling the voltages, we removed the relationship to the behavioural variable, to build the null distribution of t-statistics across electrodes and time-samples. We used the ‘cluster mass’ method (Bullmore et al., 1999; Groppe et al., 2011; Maris & Oostenveld, 2007) to build the null distribution, and calculated the p-value as the proportion of this distribution further from zero than the true t-statistics (two-tailed test). Given the relatively small sample size here, these whole-brain analyses should not be taken as definitive.

      For the mediation analysis, we followed the 4-step process (Baron & Kenny, 1986; Muller et al., 2005), which requires 4 tests be met for the outcome (behavioural variable, e.g. RT), mediator (ERP, e.g., CNV) and the treatment (Incentive):

      (1) Outcome is significantly associated with the Treatment (RT ~ 1 + Incentive + (1 | participant))

      (2) Mediator is significantly associated with the Treatment (ERP ~ 1 + Incentive + (1 | participant))

      (3) Mediator is significantly associated with the Outcome (RT ~ 1 + Incentive + ERP + (1 | participant))

      (4) And the inclusion of the Mediator reduces the association between the Treatment and Outcome (Incentive effect from model #3)

      The mediation was measured by the reduction in the absolute standardised beta coefficient between incentive and behaviour when the ERP mediator was included (model #3 vs model #1 above). We used permutation-testing to quantify the likelihood of finding these mediations under the null hypothesis, achieved by shuffling the ERP across trials (within each participant) to remove any link between the ERP and behaviour. We repeated this 2500 times to build a null distribution of the change in absolute beta-coefficients for the RT ~ Incentive effect when this permuted mediator was included (model #3 vs model #1). We calculated a one-tailed p-value by finding the proportion of the null distribution that was equal or more negative than the true value (as Mediation is a one-tailed prediction). For this mediation analysis, we only included trials with valid ERP measures, even for the models without the ERP included (e.g., model #1), to keep the trial-numbers and degrees of freedom the same.

      Mediated moderation (Muller et al., 2005) was used to see whether the effect of THP (the Moderator) on behaviour is mediated by the ERP, with the following tests (after the previous Mediation tests were already satisfied):

      (5) THP moderates the Incentive effect, via a significant Treatment*Moderator interaction on the Outcome (RT ~ 1 + Incentive + THP + Incentive*THP + (1 | participant))

      (6) THP moderates the Incentive effect on the Mediator, via a Treatment*Moderator interaction on the Mediator (ERP ~ 1 + Incentive + THP + Incentive*THP + (1 | participant))

      (7) THP’s moderation of the Incentive effect is mediated by the ERP, via a reduction in the association of Treatment*Moderator on the Outcome when the Mediator*Moderator interaction is included (RT ~ 1 + Incentive + THP + Incentive*THP + ERP + ERP*THP + (1 | participant))

      Mediated moderation is measured as the reduction in absolute beta-coefficients for ‘RT ~ Incentive*THP’ between model #5 and #7, which captures how much of this interaction could be explained by including the Mediator*Moderator interaction (ERP*THP in model #7). We tested the significance of this with permutation testing as above, permuting the ERP across trials (within participants) 2500 times, and building a null distribution of the change in the absolute beta-coefficients for RT ~ Incentive*THP between models #7 and #5. We calculated a one-tailed p-value from the proportion of these that were equal or more negative than the true change.

      (2) Please explain why only men were included in this study. We are all hoping that men-only research is a practice of the past.

      We only included men to prevent any chance of administering the drug to someone pregnant. Trihexyphenidyl is categorized by the FDA as a Pregnancy Category Class C drug, and the ‘Summary of Product Characteristics’ states: “There is inadequate information regarding the use of trihexyphenidyl in pregnancy. Animal studies are insufficient with regard to effects on pregnancy, embryonal/foetal development, parturition and postnatal development. The potential risk for humans is unknown. Trihexyphenidyl should not be used during pregnancy unless clearly necessary.”

      While the drug can be prescribed where benefits may outweigh this risk, as there were no benefits to participants in this study, we only recruited men to keep the risk at zero.

      We have updated the Methods/Drugs section to explain this (page 17, line 494):

      “The risks of Trihexyphenidyl in pregnancy are unknown, but the Summary of Product Characteristics states that it “should not be used during pregnancy unless clearly necessary”. As this was a basic research study with no immediate clinical applications, there was no justification for any risk of administering the drug during pregnancy, so we only recruited male participants to keep this risk at zero.”

      And we have referenced this in the Methods/Participants section (page 18, line 501):

      “Our sample size calculations suggested 27 participants would detect a 0.5 effect size with .05 sensitivity and .8 power. We recruited 27 male participants (see Drugs section above)”

      (3) Please explain acronyms (eg EEG) when first used.

      Thank you for pointing this out. We have explained EEG at first use in the abstract and the main text, along with FWER, M1r, and ERP, which had also been missed at first use.

      Reviewer #3 (Recommendations For The Authors):

      The authors say: "Therefore, acetylcholine antagonism reduced the invigoration of saccades by incentives, and increased the pull of salient distractors. We next asked whether these effects were coupled with changes in preparatory neural activity." But I found this statement to be misleading since the primary effects of the drug seem to have been to decrease the frequency of distractor-repulsed saccades... so "decreased push" would probably be a better analogy than "increased pull".

      Thank you for noticing this, we agree, and have changed this to (page 5, line 165):

      “Therefore, acetylcholine antagonism reduced the invigoration of saccades by incentives, and decreased the repulsion of salient distractors. We next asked whether these effects were coupled with changes in preparatory neural activity.”

      I don't see anything in EEG preprocessing about channel rejection and interpolation. Were these steps performed? There are very few results related to the full set of electrodes.

      We did not reject or interpolate any channels, as visual inspection found no obvious outliers in terms of noisiness, and no channels had standard deviations (across time/trials) higher than our standard cutoff (of 80). The artefact rejection was applied across all EEG channels, so any trials with absolute voltages over 200μV in any channel were removed from the analysis. On average 104/120 trials were included (having passed this check, along with eye-movement artefact checks) per condition per person, and we have added the range of these, along with totals across conditions to the Analysis section and a statement about channel rejection/interpolation (page 20, line 588):

      “Epochs were from -200:1500ms around the preparation cue onset, and were baselined to the 100ms before the preparation cue appeared. Visual inspection found no channels with outlying variance, so no channel rejection or interpolation was performed. We rejected trials from the EEG analyses where participants blinked or made saccades (according to EyeLink criteria above) during the epoch, or where EEG voltage in any channel was outside -200:200μV (muscle activity). On average 104/120 trials per condition per person were included (SD = 21, range = 21-120), and 831/960 trials in total per person (SD=160, range=313-954). A repeated-measures ANOVA found there were no significant differences in number of trials excluded for any condition (p > .2).”
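The baselining and voltage-based rejection steps can be sketched as below. This is a minimal illustration, not the preprocessing pipeline: it assumes each epoch is a list of channels holding voltages in μV, that baseline correction is applied before the ±200 μV check (the order is not stated in the text), and it leaves out the blink and saccade criteria. Function names are hypothetical.

```python
def baseline_correct(epoch, times_ms, start=-100, end=0):
    """Subtract each channel's mean over the baseline window [start, end) ms."""
    idx = [i for i, t in enumerate(times_ms) if start <= t < end]
    corrected = []
    for channel in epoch:
        base = sum(channel[i] for i in idx) / len(idx)
        corrected.append([v - base for v in channel])
    return corrected

def is_artefact(epoch, limit_uv=200.0):
    """True if any sample in any channel lies outside -limit_uv:limit_uv."""
    return any(abs(v) > limit_uv for channel in epoch for v in channel)

def keep_clean(epochs, times_ms, limit_uv=200.0):
    """Baseline-correct every epoch, then drop those flagged as artefactual."""
    corrected = [baseline_correct(e, times_ms) for e in epochs]
    return [e for e in corrected if not is_artefact(e, limit_uv)]
```

Because the check runs over every channel, a single noisy electrode is enough to exclude a trial, which matches the trial-level (rather than channel-level) rejection described above.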

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Review #1:

      Summary:

      Jin et al. investigated how the bacterial DNA damage (SOS) response and its regulator protein RecA affect the development of drug resistance under short-term exposure to beta-lactam antibiotics. Canonically, the SOS response is triggered by DNA damage, which results in the induction of error-prone DNA repair mechanisms. These error-prone repair pathways can increase mutagenesis in the cell, leading to the evolution of drug resistance. Thus, inhibiting the SOS regulator RecA has been proposed as a means to delay the rise of resistance. 

      In this paper, the authors deleted the RecA protein from E. coli and exposed this ∆recA strain to selective levels of the beta-lactam antibiotic, ampicillin. After an 8-hour treatment, they washed the antibiotic away and allowed the surviving cells to recover in regular media. They then measured the minimum inhibitory concentration (MIC) of ampicillin against these treated strains. They note that after just 8-hour treatment with ampicillin, the ∆recA had developed higher MICs towards ampicillin, while by contrast, wild-type cells exhibited unchanged MICs. This MIC increase was also observed in subsequent generations of bacteria, suggesting that the phenotype is driven by a genetic change.

      The authors then used whole genome sequencing (WGS) to identify mutations that accounted for the resistance phenotype. Within resistant populations, they discovered key mutations in the promoter region of the beta-lactamase gene, ampC; in the penicillin-binding protein PBP3 which is the target of ampicillin; and in the AcrB subunit of the AcrAB-TolC efflux machinery. Importantly, mutations in the efflux machinery can impact the resistance towards other antibiotics, not just beta-lactams. To test this, they repeated the MIC experiments with other classes of antibiotics, including kanamycin, chloramphenicol, and rifampicin. Interestingly, they observed that the ∆recA strains pre-treated with ampicillin showed higher MICs towards all other antibiotics tested. This suggests that the mutations conferring resistance to ampicillin are also increasing resistance to other antibiotics.

      The authors then performed an impressive series of genetic, microscopy, and transcriptomic experiments to show that this increase in resistance is not driven by the SOS response, but by independent DNA repair and stress response pathways. Specifically, they show that deletion of the recA reduces the bacterium's ability to process reactive oxygen species (ROS) and repair its DNA. These factors drive the accumulation of mutations that can confer resistance to different classes of antibiotics. The conclusions are reasonably well-supported by the data, but some aspects of the data and the model need to be clarified and extended.

      We sincerely appreciate your overall summary of the manuscript and your positive evaluation of our work.

      Strengths:

      A major strength of the paper is the detailed bacterial genetics and transcriptomics that the authors performed to elucidate the molecular pathways responsible for this increased resistance. They systematically deleted or inactivated genes involved in the SOS response in E. coli. They then subjected these mutants to the same MIC assays as described previously. Surprisingly, none of the other SOS gene deletions resulted in an increase in drug resistance, suggesting that the SOS response is not involved in this phenotype. This led the authors to focus on the localization of DNA PolI, which also participates in DNA damage repair. Using microscopy, they discovered that in the RecA deletion background, PolI co-localizes with the bacterial chromosome at much lower rates than wild-type. This led the authors to conclude that deletion of RecA hinders PolI and DNA repair. Although the authors do not provide a mechanism, this observation is nonetheless valuable for the field and can stimulate further investigations in the future.

      In order to understand how RecA deletion affects cellular physiology, the authors performed RNA-seq on ampicillin-treated strains. Crucially, they discovered that in the RecA deletion strain, genes associated with antioxidative activity (cysJ, cysI, cysH, sodA, sufD) and Base Excision Repair (mutH, mutY, mutM), which repairs oxidized forms of guanine, were all downregulated. The authors conclude that down-regulation of these genes might result in elevated levels of reactive oxygen species in the cells, which in turn, might drive the rise of resistance. Experimentally, they further demonstrated that treating the ∆recA strain with an antioxidant GSH prevents the rise of MICs. These observations will be useful for more detailed mechanistic follow-ups in the future.

      We are grateful to you for your positive assessment of the strengths of our manuscript and your recognition of its potential future applications.

      Weaknesses:

      Throughout the paper, the authors use language suggesting that ampicillin treatment of the ∆recA strain induces higher levels of mutagenesis inside the cells, leading to the rapid rise of resistance mutations. However, as the authors note, the mutants enriched by ampicillin selection can play a role in efflux and can thus change a bacterium's sensitivity to a wide range of antibiotics, in what is known as cross-resistance. The current data is not clear on whether the elevated "mutagenesis" is driven by ampicillin selection or by a bona fide increase in mutation rate.

      We greatly appreciate your raising this issue, as it is an important premise that must be clearly stated throughout the entire manuscript. To verify that the observed increase in mutation rate is a bona fide increase and not due to experimental error, we used a non-selective antibiotic, rifampicin, to evaluate the mutation frequency after drug induction, as it is a gold-standard method documented in other studies [Heterogeneity in efflux pump expression predisposes antibiotic-resistant cells to mutation, Science, 362, 6415, 686-690, 2018.]. In the absence of ampicillin treatment, the natural mutation rates detected using rifampicin were consistent between the wild-type and the ΔrecA strain. However, after ampicillin treatment, the mutation rate detected using rifampicin was significantly elevated only in the ΔrecA strain (Fig. 1G). We also employed other antibiotics, such as ciprofloxacin and chloramphenicol, in our experiments to treat the cells (data not shown). However, we observed that beta-lactam antibiotics specifically induced the emergence of resistance or altered the MIC in our bacterial populations. If resistance had pre-existed before antibiotic exposure, rather than arising from a bona fide increase in mutation rate, we would expect other antibiotics to exhibit a similar selective effect, particularly given the potential for cross-resistance to multiple antibiotics.

      Furthermore, on a technical level, the authors employed WGS to identify resistance mutations in the ampicillin-treated wild-type and ∆recA strains. However, the WGS methodology described in the paper is inconsistent. Notably, wild-type WGS samples were picked from non-selective plates, while ΔrecA WGS isolates were picked from selective plates with 50 μg/mL ampicillin. Such an approach biases the frequency and identity of the mutations seen in the WGS and cannot be used to support the idea that ampicillin treatment induces higher levels of mutagenesis.

      We appreciate your concern regarding potential inconsistencies in the WGS methodology. However, we would like to clarify that the primary aim of the WGS experiment was to identify the types of mutations present in the wild-type and ΔrecA strains after ampicillin treatment, rather than to quantify or compare mutation frequencies. This purpose was explicitly stated in the manuscript.

      Furthermore, the choice of selective and non-selective conditions was made to ensure the successful isolation of mutants in both strains. Specifically, if selective conditions (50 μg/mL ampicillin) were applied to the wild-type strain, it would have been nearly impossible to recover colonies for WGS analysis, as wild-type cells are highly susceptible to ampicillin at this concentration (Top, Author response image 1). Conversely, under non-selective conditions, ΔrecA mutants carrying resistance mutations may not have been effectively isolated, which would have limited our ability to identify resistance mutations in these strains (Bottom, Author response image 1). Thus, the use of different selection pressures was essential for achieving the objective of mutation identification in this study.

      Author response image 1.

      After 8 hours of antibiotic treatment, the wild type or the ΔrecA cells were plated on agar plates either without ampicillin or with 50 μg/mL ampicillin and incubated for 24-48 hours. Top: Under selective conditions, no wild type colonies were recovered, indicating high susceptibility to the antibiotic, preventing further analysis. Bottom: In non-selective conditions, both ΔrecA resistant mutants and non-resistant cells grew, making it difficult to distinguish and isolate the mutants carrying resistance mutations.

      Finally, it is important to establish what the basal mutation rates of both the WT and ∆recA strains are. Currently, only the ampicillin-treated populations were reported. It is possible that the ∆recA strain has inherently higher mutagenesis than WT, with a larger subpopulation of resistant clones. Thus, ampicillin treatment might not in fact induce higher mutagenesis in ∆recA.

      Thanks for this suggestion. The basal mutation frequencies of the wild-type and the ∆recA strains have been measured using rifampicin (Fig. 1G), and there is no significant difference between them.

      Reviewer #2:

      Summary:

      This study aims to demonstrate that E. coli can acquire rapid antibiotic resistance mutations in the absence of a DNA damage response. To investigate this, the authors employed a sophisticated experimental framework based on a modified Adaptive Laboratory Evolution (ALE) workflow. This workflow involves numerous steps culminating in the measurement of antibiotic resistance. The study presents evidence that a recA strain develops ampicillin resistance mutations more quickly than the wild-type, as shown by measuring the Minimum Inhibitory Concentration (MIC) and mutation frequency. Whole-genome sequencing of 15 recA-colonies resistant to ampicillin revealed predominantly inactivation of genes involved in the multi-drug efflux pump system, whereas, in the wild-type, mutations appear to enhance the activity of the chromosomal ampC cryptic promoter. By analyzing mutants involved in the SOS response, including a lexA3 mutant incapable of inducing the SOS response, the authors conclude that the rapid evolution of antibiotic resistance occurs in an SOS-independent manner when recA is absent.

      Furthermore, RNA sequencing (RNA-seq) of the four experimental conditions suggests that genes related to antioxidative responses drive the swift evolution of antibiotic resistance in the recA-strain.

      We greatly appreciate your overall summary of the manuscript and your positive evaluation of our work.

      Weaknesses:

      However, a potential limitation of this study is the experimental design used to determine the 'rapid' evolution of antibiotic resistance. It may introduce a significant bottleneck in selecting ampicillin-resistant mutants early on. A recA mutant could be more susceptible to ampicillin than the wild-type, and only resistant mutants might survive after 8 hours, potentially leading to their enrichment in subsequent steps. To address this concern, it would be critical to perform a survival analysis at various time points (0h, 2h, 4h, 6h, and 8h) during ampicillin treatment for both recA and wild-type strains, ensuring there is no difference in viability.

      We appreciate your suggestion. We measured the survival fraction at 0, 2, 4, 6, and 8 hours after ampicillin treatment. The results show no significant difference in antibiotic sensitivity between the wild-type and ΔrecA strains (Fig. S2). We therefore added a description in the main text: “Meanwhile, after 8 hours of treatment with 50 μg/mL ampicillin, the survival rates of both wild type and ΔrecA strain were consistent (Fig. S2)”.
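
      As a rough illustration of the survival analysis described above, the survival fraction at each time point can be taken as the viable count at that time divided by the viable count at 0 h. The sketch below is only an assumption about how such a calculation might be set up: the function name and all CFU/mL values are hypothetical and are not taken from the manuscript or Fig. S2.

      ```python
      def survival_fraction(cfu_by_hour):
          """Return {hour: CFU(hour) / CFU(0 h)} for one strain.

          cfu_by_hour maps a sampling time in hours to a viable count (CFU/mL).
          """
          baseline = cfu_by_hour[0]
          if baseline <= 0:
              raise ValueError("the CFU count at 0 h must be positive")
          return {hour: cfu / baseline for hour, cfu in sorted(cfu_by_hour.items())}


      # Hypothetical CFU/mL counts, for illustration only (not data from the study).
      wild_type = {0: 1.0e8, 2: 4.0e7, 4: 1.0e7, 6: 2.0e6, 8: 5.0e5}
      delta_recA = {0: 1.0e8, 2: 3.8e7, 4: 1.1e7, 6: 2.1e6, 8: 5.2e5}

      for name, counts in (("wild type", wild_type), ("delta-recA", delta_recA)):
          for hour, fraction in survival_fraction(counts).items():
              print(f"{name:>10}  {hour} h  survival fraction = {fraction:.2e}")
      ```

      Comparing the two strains at matched time points in this way is what would reveal an early viability bottleneck, if one existed.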

      The observation that promoter mutations are absent in ΔrecA strains could be explained by previous research indicating that amplification of the AmpC genes is a mechanism for E. coli resistance to ampicillin, which does not occur in a recA-deficient background (PMID# 19474201).

      We are very grateful to you for providing this reference. We did examine the amplification of the ampC gene in both wild-type and recA-deficient strains, but we found no significant changes in its copy number after ampicillin treatment (Author response image 2). Therefore, the results and discussion regarding gene copy number were not included in this manuscript.

      Author response image 2.

      Copy number variations of genes in the chromosome before and after exposure to ampicillin at 50 µg/mL for 8 hours in the wild type and ΔrecA strain.

      The section describing Figure 3 is poorly articulated, and the conclusions drawn are apparent. The inability of a recA strain to induce the SOS response is well-documented (lines 210 and 278). The data suggest that merely blocking SOS induction is insufficient to cause 'rapid' evolution in their experimental conditions. To investigate whether SOS response can be induced independently of lexA cleavage by recA, alternative experiments, such as those using a sulA-GFP fusion, might be more informative.

      Thanks for your suggestion. We agree that detecting the expression level of SulA can provide valuable information to reveal the impact of the SOS system on rapid drug resistance. In addition to fluorescence visualization and quantification of SulA expression, measuring the transcription level of the sulA gene can achieve the same objective. Therefore, in our transcriptome sequencing analysis, we focused on evaluating the transcription level of sulA (Fig. 4E).

      In Figure 4E, the lack of increased SulA gene expression in the wild-type strain treated with ampicillin is unexpected, given that SulA is an SOS-regulated gene. The fact that polA (Pol I) is going down should be taken into account in the interpretation of Figures 2D and 2E.

      Thank you for your observation regarding the lack of increased SulA gene expression in the wild-type strain treated with ampicillin in Figure 4E. We agree that SulA is typically an SOS-regulated gene, and its expression is expected to increase in response to DNA damage induced by antibiotics like ampicillin. However, in our experimental conditions, the observed lack of increased SulA expression could be due to different factors. One possibility is that the concentration of ampicillin used, or the duration of treatment, was not sufficient to induce a strong SOS response in the wild-type strain under the specific conditions tested. Additionally, differences in experimental setups such as timing, sampling, or cellular stress responses could account for the lack of a pronounced upregulation of SulA.

      You note that the decrease in polA (Pol I) expression should be taken into account in the interpretation of Figures 3D and 3E, and we agree.

      The connection between compromised DNA repair, the accumulation of Reactive Oxygen Species (ROS) based on RNA-seq data, and accelerated evolution is merely speculative at this point and not experimentally established.

      We greatly appreciate your comments. First, the correlation between DNA mutations and the accumulation of reactive oxygen species (ROS) has been experimentally confirmed. As shown in Fig. 4I, after the addition of the antioxidant GSH, DNA resistance mutations were not detected in the ΔrecA strain treated with ampicillin for 8 hours, compared to those without the addition of GSH, proving that the rapid accumulation of ROS induces the enhancement of DNA resistance mutations. Second, the enhancement of DNA resistance mutations in relation to bacterial resistance has been widely validated and is generally accepted. Finally, we appreciate your suggestion to strengthen the evidence supporting ROS enhancement. To address this, we have added an experiment to measure ROS levels. Through flow cytometry, we found that ROS levels significantly increased in both the wild-type and ΔrecA strains after 8 hours of ampicillin treatment. However, ROS levels in the ΔrecA strain showed a significant further increase compared to the wild-type strain (Fig. 4G). Additionally, with the addition of 50 mM glutathione, no significant change in ROS levels was observed in either the wild-type or ΔrecA strain before and after ampicillin treatment (Fig. 4H). This result further confirms our finding in Fig. 4I, where adding GSH inhibited the development of antibiotic resistance.

      Reviewer #3:

      Summary:

      In the present work, Zhang et al investigate the involvement of the bacterial DNA damage repair SOS response in the evolution of beta-lactam drug resistance in Escherichia coli. Using a combination of microbiological, bacterial genetics, laboratory evolution, next-generation, and live-cell imaging approaches, the authors propose short-term drug resistance evolution that can take place in RecA-deficient cells in an SOS response-independent manner. They propose the evolvability of drug resistance is alternatively driven by the oxidative stress imposed by the accumulation of reactive oxygen species and inhibition of DNA repair. Overall, this is a nice study that addresses a growing and fundamental global health challenge (antimicrobial resistance). However, although the authors perform several multi-disciplinary experiments, there are several caveats to the authors' proposal that ultimately do not fully support their interpretation that the observed antimicrobial resistance evolution phenotype is due to compromised DNA repair.

      We greatly appreciate your overall summary of the manuscript and positive evaluation of our work.

      Strengths:

      The authors introduce new concepts to antimicrobial resistance evolution mechanisms. They show short-term exposure to beta-lactams can induce durably fixed antimicrobial resistance mutations. They propose this is due to compromised DNA repair and oxidative stress. This is primarily supported by their observations that resistance evolution phenotypes only exist for recA deletion mutants and not other genes in the SOS response.

      Thanks for your positive comments.

      Weaknesses:

      The authors do not show any direct evidence (1) that these phenotypes exist in strains harboring deletions in other DNA repair genes outside of the SOS response, (2) that DNA damage is increased, (3) that reactive oxygen species accumulate, (4) that accelerated resistance evolution can be reversed by anything other than recA complementation. The authors do not directly test alternative hypotheses. The conclusions drawn are therefore premature.

      We sincerely thank you for your insightful comments. First, in this study, our primary focus is on the role of recA deficiency in bacterial antibiotic resistance evolution. Therefore, we conducted an in-depth investigation on E. coli strains lacking RecA and found that its absence promotes resistance evolution through mechanisms involving increased ROS accumulation and downregulation of DNA repair pathways. While we acknowledge the importance of other DNA repair genes outside of the SOS response, exploring them is beyond the scope of this paper. However, in a separate unpublished study, we have identified the involvement of another DNA recombination protein, whose role in resistance evolution is not yet fully elucidated, in promoting resistance development. This finding is part of another independent investigation.

      Regarding DNA damage and repair, our paper emphasizes that resistance-related mutations in DNA are central to the development of antibiotic resistance. These mutations are a manifestation of DNA damage. To demonstrate this, we measured mutation frequency and performed whole-genome sequencing, both of which confirmed an increase in DNA mutations.

      We appreciate the reviewer's suggestion to provide additional evidence for ROS accumulation, and we have now supplemented our manuscript with relevant experiments. Through flow cytometry, we found that ROS levels significantly increased in both the wild type and ΔrecA strains after 8 hours of ampicillin treatment. However, ROS levels in the ΔrecA strain showed a significant further increase compared to the wild-type strain (Fig. 4G). Additionally, with the addition of 50 mM glutathione, no significant change in ROS levels was observed in either the wild-type or ΔrecA strain before and after ampicillin treatment (Fig. 4H). This result further confirms our finding in Fig. 4I, where adding GSH inhibited the development of antibiotic resistance.

      Finally, in response to your question about reversing accelerated resistance evolution, we would like to highlight that, in addition to recA complementation, we successfully suppressed rapid resistance evolution by supplementing with an antioxidant, GSH (Fig. 4I). This further supports our hypothesis that increased ROS levels play a key role in driving accelerated resistance evolution in the absence of RecA.

      Recommendations for the authors:

      Reviewer #1:

      The authors' model asserts that deletion of recA impairs DNA repair in E. coli, leading to an accumulation of ROS in the cell, and ultimately driving the rapid rise of resistance mutations. However, the experimental evidence does not adequately address whether the resistance mutations are true, de novo mutations that arose due to beta-lactam treatment, or mutations that confer cross-resistance enriched by ampicillin selection.

      a. Major: In Figure 1F & G, the authors show that the ∆recA strain, following ampicillin treatment, has higher resistance and mutation frequency towards rifampicin than WT. However, it is not clear whether the elevated resistance and mutagenesis are driven by mutations enriched by the ampicillin treatment (e.g. mutations in acrB, as seen in Figure 2) or by "new" mutations in the rpoB gene. As the authors note, the mutants enriched by ampicillin selection can play a role in efflux and can thus change a bacterium's sensitivity to a wide range of antibiotics, including rifampicin, in what is known as cross-resistance. Therefore, the mutation frequency calculation, which relies on quantifying rifampicin-resistant clones, might be confounded by bacteria with mutations that confer cross-resistance. A better approach to calculate mutation frequency would be to employ an assay that does not require antibiotic selection, such as a lac-reversion assay. This would mitigate the confounding effects of cross-resistance of drug-resistant mutations.

      We appreciate your thoughtful comments regarding the potential for cross-resistance to confound the mutation frequency calculation based on rifampicin-resistant clones. Indeed, as noted, ampicillin selection can enrich for mutants with enhanced efflux activity, which may confer cross-resistance to a range of antibiotics, including rifampicin.

      However, we believe that the current approach of calculating mutation frequency using rifampicin-resistant mutants is still valid in our specific context. Rifampicin targets the RNA polymerase β subunit, and resistance typically arises from specific mutations in the rpoB gene. These mutations are well-characterized and distinct from those typically associated with efflux-related cross-resistance. Thus, the likelihood of cross-resistance affecting our mutation frequency calculation is minimized in this scenario.

      Additionally, while the lac-reversion assay could be an alternative, it focuses on specific metabolic pathway mutations (such as those affecting lacZ) and would not necessarily capture the same types of mutations relevant to rifampicin resistance or antibiotic-induced mutagenesis. Given our experimental objective of understanding how ampicillin induces mutations that confer antibiotic resistance, the current approach of using rifampicin selection provides a direct and relevant measurement of mutation frequency under antibiotic stress.

      b. Major: It is important to establish what the basal mutation frequencies/rates of both the WT and ∆recA strains are. Currently, only the ampicillin-treated populations were reported. It is possible that the ∆recA strain has an inherently higher mutagenesis than WT. Thus, ampicillin treatment might not in fact induce higher mutagenesis in ∆recA.

      Thanks for your suggestion. The basal mutation frequencies of the wild-type and the ∆recA strains have been measured using rifampicin (Fig. 1G), and there is no significant difference between them.

      c. Major: In the text, the authors write, "To verify whether drug resistance associated DNA mutations have led to the rapid development of antibiotic resistance in recA mutant strain, we randomly selected 15 colonies on non-selected LB agar plates from the wild type surviving isolates, and antibiotic screening plates containing 50 μg/mL ampicillin from the ΔrecA resistant isolates, respectively." Why were the WT clones picked from non-selective plates and the recA mutant from selective ones for WGS? It appears that such a procedure would bias the recA mutant clones to show more mutations (caused by selection on the ampicillin plate). The authors need to address this discrepancy.

      We appreciate your concern regarding potential inconsistencies in the WGS methodology. However, we would like to clarify that the primary aim of the WGS experiment was to identify the types of mutations present in the wild-type and ΔrecA strains after ampicillin treatment, rather than to quantify or compare mutation frequencies. This purpose was explicitly stated in the manuscript.

      Furthermore, the choice of selective and non-selective conditions was made to ensure the successful isolation of mutants in both strains. Specifically, if selective conditions (50 μg/mL ampicillin) were applied to the wild type strain, it would have been nearly impossible to recover colonies for WGS analysis, as wild-type cells are highly susceptible to ampicillin at this concentration (Top, Author response image 1). Conversely, under non-selective conditions, ΔrecA mutants carrying resistance mutations may not have been effectively isolated, which would have limited our ability to identify resistance mutations in these strains (Bottom, Author response image 1). Thus, the use of different selection pressures was essential for achieving the objective of mutation identification in this study.

      d. Major: In some instances, the authors do not use accurate language to describe their data. In Figure 2A, the authors randomly selected 15 ∆recA clones from a selective plate with 50 µg/mL of ampicillin. These clones were then subjected to WGS, which subsequently identified resistant mutations. Based on the described methods, these mutations are a result of selection: in other words, resistant mutations were preexisting in the bacterial population, and the addition of ampicillin selection killed off the sensitive cells, enabling the proliferation of the resistant clones. However, in the Figure 2 legend and associated text, the authors suggest that these mutations were "induced" by beta-lactam exposure, which is misleading. The data does not support that.

      We appreciate your detailed feedback on the language used to describe our data. We understand the concern regarding the use of the term "induced" in relation to beta-lactam exposure. To clarify, we employed not only beta-lactam antibiotics but also other antibiotics, such as ciprofloxacin and chloramphenicol, in our experiments (data not shown). However, we observed that beta-lactam antibiotics specifically induced the emergence of resistance or altered the MIC in our bacterial populations. If resistance had pre-existed before antibiotic exposure, we would expect other antibiotics to exhibit a similar selective effect, particularly given the potential for cross-resistance to multiple antibiotics.

      Furthermore, we used two different ∆recA strains, and the results were consistent between the strains (Fig. S3). Given that spontaneous mutations can occur with significant variability in populations, if resistance mutations pre-existed before antibiotic exposure, the selective outcomes should have varied between the two strains.

      Most importantly, we found that the addition of the antioxidant GSH prevented the evolution of antibiotic resistance following ampicillin treatment in the ΔrecA strain. If we assume that resistant bacteria preexist in the ∆recA strain, then the addition of GSH should not affect the evolution of resistance. Therefore, we believe that the resistance mutations we detected were not simply the result of selection from preexisting mutations but were indeed induced by beta-lactam exposure.

      e. Major: For Figure 4J, using WGS the authors show that the addition of GSH to WT and ∆recA cells inhibited the rise of resistance mutations; no resistance mutations were reported. However, in the "Whole genome sequencing" section under "Materials and Methods", they state that "Resistant clones were isolated by selection using LB agar plates with the supplementation of ampicillin at 50 μg/mL". These clones were then genome-extracted and sequenced. Given the methodology, it is surprising that the WGS did not reveal any resistance mutations in the GSH-treated cells. How were these cells able to grow on 50 μg/mL ampicillin plates for isolation in the first place? The authors need to address this.

      We sincerely apologize for the confusion caused by the incorrect expression in the "Materials and Methods" section. Indeed, when bacteria were treated with the combination of antibiotics and GSH, resistance was significantly suppressed, and no resistant clones could be isolated from selective plates (i.e., LB agar supplemented with 50 μg/mL ampicillin).

      To address this, we instead plated the bacteria treated with antibiotics and GSH onto non-selective plates (without ampicillin) and randomly selected 15 colonies for WGS. None of them showed resistance mutations. We will revise the text in the "Materials and Methods" section to accurately reflect this procedure and provide clarity.

      f. Minor: for Figure 1G, it is misleading to have both "mutation frequency" and "mutant rate" in the y-axis; the two are defined and calculated differently. Based on the Materials and Methods, "mutation frequency" would be the appropriate term. Also, for the ∆recA strain, it is a bit unusual to see mutation frequencies that are tightly clustered. Usually, mutation frequencies follow the Luria-Delbruck distribution. Can the authors explain why the ∆recA data looks so different compared to, say, the WT mutation frequencies?

      Thank you for your insightful feedback. We agree that having both "mutation frequency" and "mutant rate" on the y-axis is misleading, as these terms are defined and calculated differently. To avoid confusion, we will revise Figure 1G to use only "mutation frequency" as the correct term, in line with the methods described in the Materials and Methods section.

      Regarding the ∆recA strain's mutation frequencies, we acknowledge that the data appear more tightly clustered compared to the expected Luria-Delbruck distribution seen in the wild-type strain. In fact, the y-axis of Figure 1G is logarithmic, which causes the data to appear more clustered.

      We further added the basal mutation frequencies of the wild-type and ∆recA strains before exposure to ampicillin. These basal mutation frequencies were measured using rifampicin (Fig. 1G), and there is no significant difference between them.
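
      To make the distinction raised in point f concrete, a mutation frequency is simply the ratio of resistant to total viable cells in one culture, whereas a mutation rate (mutations per cell per generation) would require a fluctuation analysis that is not attempted here. The sketch below is a minimal illustration under stated assumptions: the function names and all colony counts are hypothetical, and the calculation is not claimed to be the exact procedure used in the manuscript.

      ```python
      import math


      def mutation_frequency(resistant_cfu_per_ml, total_cfu_per_ml):
          """Rifampicin-resistant CFU/mL divided by total viable CFU/mL."""
          if total_cfu_per_ml <= 0:
              raise ValueError("total viable count must be positive")
          return resistant_cfu_per_ml / total_cfu_per_ml


      def log10_frequency(resistant_cfu_per_ml, total_cfu_per_ml):
          """log10 of the mutation frequency, as plotted on a logarithmic y-axis."""
          frequency = mutation_frequency(resistant_cfu_per_ml, total_cfu_per_ml)
          if frequency <= 0:
              raise ValueError("no resistant colonies: log10 is undefined")
          return math.log10(frequency)


      # Hypothetical replicate cultures: (resistant CFU/mL, total CFU/mL).
      replicates = [(50, 1.0e9), (80, 1.0e9), (120, 1.0e9)]
      for resistant, total in replicates:
          print(f"frequency = {mutation_frequency(resistant, total):.1e}, "
                f"log10 = {log10_frequency(resistant, total):.2f}")
      ```

      On a logarithmic axis, replicates that differ by a factor of two or three lie within a fraction of one log unit of each other, which is one reason such data can look tightly clustered.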

      g. Minor: It needs to be made clear in the Main Text what the selective antibiotic agar plate used was, rifampicin or ampicillin. I am assuming it was rifampicin, as ampicillin plates would yield resistance frequencies close to 100%, given the prior treatment of the culture with ampicillin.

      Thanks for your comments. Depending on the objective, we used different selective plates. For example, when testing the mutation frequency of antibiotic resistance, we used a selective plate containing rifampicin in order to utilize a non-inducing antibiotic, which is the standard method for calculating resistance mutation frequency. In the WGS experiment, to obtain mutations specific to ampicillin resistance, we used a selective plate containing ampicillin.

      Reviewer #2:

      The Y-axis label (log10 mutant rate) in Figure 1G is misleading or incorrect.

      Thanks for your comments, and we apologize for this misleading information. Figure 1G has been revised accordingly.

      In line 393 of the discussion, the authors claim that excessive ROS accumulation drives the evolution of ampicillin resistance, which has not been conclusively demonstrated. Additional experiments are needed to support this statement.

      We greatly appreciate your comments. First, the correlation between DNA mutations and the accumulation of reactive oxygen species (ROS) has been experimentally confirmed. As shown in Fig. 4I, after the addition of the antioxidant GSH, DNA resistance mutations were not detected in the ΔrecA strain treated with ampicillin for 8 hours, compared to those without the addition of GSH, proving that the rapid accumulation of ROS induces the enhancement of DNA resistance mutations. Second, the enhancement of DNA resistance mutations in relation to bacterial resistance has been widely validated and is generally accepted. Finally, we appreciate your suggestion to strengthen the evidence supporting ROS enhancement. To address this, we have added an experiment to measure ROS levels. Through flow cytometry, we found that ROS levels significantly increased in both the wild-type and ΔrecA strains after 8 hours of ampicillin treatment. However, ROS levels in the ΔrecA strain showed a significant further increase compared to the wild-type strain (Fig. 4G). Additionally, with the addition of 50 mM glutathione, no significant change in ROS levels was observed in either the wild-type or ΔrecA strain before and after ampicillin treatment (Fig. 4H). This result further confirms our finding in Fig. 4I, where adding GSH inhibited the development of antibiotic resistance.

      The abstract is overly complex and difficult to read (e.g. "Contrary to previous findings, it is shown that this accelerated resistance development process is dependent on the hindrance of DNA repair, which is completely orthogonal to the SOS response").

      Thank you for the valuable feedback regarding the complexity of the abstract. We agree that certain sections could be simplified for clarity. In response, we have revised the abstract to make it more concise and easier to understand. For example, the sentence “Contrary to previous findings, it is shown that this accelerated resistance development process is dependent on the hindrance of DNA repair, which is completely orthogonal to the SOS response” has been rewritten as: "Unlike earlier studies, we found that the rapid development of resistance relies on the hindrance of DNA repair, a mechanism that operates independently of the SOS response."

      Reviewer #3:

      As indicated above, direct evidence is needed to show (1) that these phenotypes exist in strains harboring deletions in other DNA repair genes outside of the SOS response, (2) that DNA damage is increased, (3) that reactive oxygen species accumulate, (4) that accelerated resistance evolution can be reversed by anything other than recA complementation. There are also other resistance evolution mechanisms untested here, including transcription-coupled repair (TCR) mechanisms involving Mfd. These need to be shown in order to draw the conclusions proposed.

      We sincerely thank you for your insightful comments. First, in this study, our primary focus is on the role of recA deficiency in bacterial antibiotic resistance evolution. Therefore, we conducted an in-depth investigation on E. coli strains lacking RecA and found that its absence promotes resistance evolution through mechanisms involving increased ROS accumulation and downregulation of DNA repair pathways. While we acknowledge the importance of other DNA repair genes outside of the SOS response and other resistance evolution mechanisms including the TCR mechanism, exploring them is beyond the scope of this paper. However, in a separate unpublished study, we have identified the involvement of another DNA recombination protein, whose role in resistance evolution is not yet fully elucidated, in promoting resistance development. This finding is part of another independent investigation.

      Regarding DNA damage and repair, our paper emphasizes that resistance-related mutations in DNA are central to the development of antibiotic resistance. These mutations are a manifestation of DNA damage. To demonstrate this, we measured mutation frequency and performed whole-genome sequencing, both of which confirmed an increase in DNA mutations.

      We appreciate the reviewer's suggestion to provide additional evidence for ROS accumulation, and we have now supplemented our manuscript with relevant experiments. Through flow cytometry, we found that ROS levels significantly increased in both the wild type and ΔrecA strains after 8 hours of ampicillin treatment. However, ROS levels in the ΔrecA strain showed a significant further increase compared to the wild-type strain (Fig. 4G). Additionally, with the addition of 50 mM glutathione, no significant change in ROS levels was observed in either the wild-type or ΔrecA strain before and after ampicillin treatment (Fig. 4H). This result further confirms our finding in Fig. 4I, where adding GSH inhibited the development of antibiotic resistance.

      Finally, in response to your question about reversing accelerated resistance evolution, we would like to highlight that, in addition to recA complementation, we successfully suppressed rapid resistance evolution by supplementing with an antioxidant, GSH (Fig. 4I). This further supports our hypothesis that increased ROS levels play a key role in driving accelerated resistance evolution in the absence of RecA.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment 

      This useful study reports how neuronal activity in the prefrontal cortex maps time intervals during which animals have to wait until reaching a reward and how this mapping is preserved across days. However, the evidence supporting the claims is incomplete as these sequential neuronal patterns do not necessarily represent time but instead may be correlated with stereotypical behavior and restraint from impulsive decision, which would require further controls (e.g. behavioral analysis) to clarify the main message. The study will be of interest to neuroscientists interested in decision making and motor control.

      We thank the editors and reviewers for the constructive comments. In light of the questions mentioned by the reviewers, we have performed additional analyses in our revision, particularly aiming to address issues related to single-cell scalability, and effects of motivation and movement. We believe these additional data will greatly improve the rigor and clarity of our study. We are grateful for the review process of eLife.

      Public Reviews:

      Reviewer #1 (Public Review): 

      Summary:

      This paper investigates the neural population activity patterns of the medial frontal cortex in rats performing a nose poking timing task using in vivo calcium imaging. The results showed neurons that were active at the beginning and end of the nose poking and neurons that formed sequential patterns of activation that covaried with the timed interval during nose poking on a trial-by-trial basis. The former were not stable across sessions, while the latter tended to remain stable over weeks. The analysis on incorrect trials suggests the shorter non-rewarded intervals were due to errors in the scaling of the sequential pattern of activity.

      Strengths:

      This study measured stable signals using in vivo calcium imaging during experimental sessions that were separated by many days in animals performing a nose poking timing task. The correlation analysis on the activation profile to separate the cells in the three groups was effective and the functional dissociation between beginning and end, and duration cells was revealing. The analysis on the stability of decoding of both the nose poking state and poking time was very informative. Hence, this study dissected a neural population that formed sequential patterns of activation that encoded timed intervals. 

      We thank the reviewer for the positive comments.

      Weaknesses:

      It is not clear whether animals had enough simultaneously recorded cells to perform the analyses of Figures 2-4. In fact, rat 3 had 18 responsive neurons, which probably is not enough to get robust neural sequences for the trial-by-trial analysis and the correct and incorrect trial analysis. 

      We thank the reviewer for the comment. Our imaging data generally yielded 50-150 cells in each session. The 18 neurons mentioned by the reviewer are from the duration cell category. We have now provided the number of imaged cells from each rat in the new Supplementary figure 1D. In addition, we have plotted the duration cells’ sequential activity of individual trials for each rat in new Supplementary figure 1B and 1C. These data demonstrate robust sequential activities from the duration cells.

      In addition, the analysis of behavioral errors could be improved. The analysis in Figure 4A could be replaced by a detailed analysis on the speed, and the geometry of neural population trajectories for correct and incorrect trials.

      We thank the reviewer for the suggestions. We have now performed analyses of the neural population trajectories as the reviewer suggested. We calculated the neural population trajectories using the first two principal components of the neural activities during nose poke events. While both correct and incorrect trials show similar shapes of the trajectories, correct trials show more expanded paths, with longer lengths on average. These new results are now included in Figure 4. Type I or type II errors would likely generate trajectories that deviate from the general direction, which is not what we observed; these results are therefore consistent with our conclusion that scaling errors contribute to the incorrect behavioral timing in these rats.
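
      The path-length comparison described above can be illustrated with a minimal sketch. It assumes each trial has already been projected onto the first two principal components; the function name `path_length` and the example coordinates are hypothetical and are not taken from the manuscript.

```python
import math

def path_length(points):
    """Sum of Euclidean distances between consecutive points of a trajectory.

    `points` is a sequence of (PC1, PC2) coordinates, one per imaging frame.
    """
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

# Hypothetical example: a more "expanded" trajectory has a longer path.
short_trial = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
long_trial = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
print(path_length(short_trial), path_length(long_trial))
```

      Averaging this quantity separately over correct and incorrect trials would give one way to compare how expanded the two sets of trajectories are.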

      In the case of Figure 4G, it is not clear why the density of errors formed two clusters instead of having a linear relation with the produced duration. It would be advisable to compute the scaling factor on neuronal population trajectories and single cell activity, or to compute the center of mass, to test the type III errors. 

      To clarify the original Figure 4G, the correct trials tended to show positive time estimation errors while the incorrect trials showed negative time estimation errors. We believe that the polarity switch between these two types suggests a possible use of this neural mechanism to time the action of the rats.

      In addition, we have performed the analysis suggested by the reviewer in our revision. We calculated two types of scaling factors. At the individual cell level, we computed the peak position of individual trials relative to the expected positions from the averaged template. At the neural population level, we searched for a scaling multiplier to resample the calcium activity data that minimized the differences between the scaled activity and the expected template. Using these two factors, we found that correct trials show significantly larger scaling compared to incorrect trials, consistent with our original interpretation that behavior errors are primarily correlated with scaling errors in the neural activities (type III error). These new results are now incorporated in Figure 4, and we have also updated the descriptions in the main text.
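
      As an illustration of the population-level search described above, the following is a minimal sketch under stated assumptions: a grid search over candidate multipliers, linear interpolation for resampling, and mean squared error as the difference measure. These choices, and the names `resample` and `best_scaling_factor`, are hypothetical; the response does not specify the actual procedure.

```python
def resample(trace, factor, n_out):
    """Stretch the time axis of `trace` by `factor` using linear interpolation.

    Output sample i is read from position i / factor in the original trace;
    positions past the end are clamped to the last sample.
    """
    last = len(trace) - 1
    out = []
    for i in range(n_out):
        pos = i / factor
        if pos >= last:
            out.append(trace[last])
            continue
        lo = int(pos)
        frac = pos - lo
        out.append(trace[lo] * (1 - frac) + trace[lo + 1] * frac)
    return out

def best_scaling_factor(trial, template, candidates):
    """Return the candidate multiplier minimizing the squared error between
    the resampled trial activity and the template, summed over cells."""
    def error(factor):
        total = 0.0
        for cell_trace, cell_template in zip(trial, template):
            scaled = resample(cell_trace, factor, len(cell_template))
            total += sum((s - t) ** 2 for s, t in zip(scaled, cell_template))
        return total
    return min(candidates, key=error)

def bump(center, width, n):
    """Triangular activity bump, used here only as synthetic test data."""
    return [max(0.0, 1 - abs(i - center) / width) for i in range(n)]

# Synthetic two-cell example: the trial is a 2x time-compressed template.
template = [bump(10, 4, 21), bump(14, 4, 21)]
trial = [bump(5, 2, 21), bump(7, 2, 21)]
print(best_scaling_factor(trial, template, [0.5, 1.0, 1.5, 2.0, 2.5]))
```

      A finer candidate grid or a continuous optimizer could replace the coarse list of multipliers used here.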

      Due to the slow time resolution of calcium imaging, it is difficult to perform robust analysis on ramping activity. Therefore, I recommend downplaying the conclusion that: "Together, our data suggest that sequential activity might be a more relevant coding regime than the ramping activity in representing time under physiological conditions." 

      We agree with the reviewer, and have now modified this sentence in the abstract.

      Reviewer #2 (Public Review):

      In this manuscript, Li and collaborators set out to investigate the neuronal mechanisms underlying "subjective time estimation" in rats. For this purpose, they conducted calcium imaging in the prefrontal cortex of water-restricted rats that were required to perform an action (nosepoking) for a short duration to obtain drops of water. The authors provided evidence that animals progressively improved in performing their task. They subsequently analyzed the calcium imaging activity of neurons and identified start, duration, and stop cells associated with the nose poke. Specifically, they focused on duration cells and demonstrated that these cells served as a good proxy for timing on a trial-by-trial basis, scaling their pattern of activity in accordance with changes in behavioral performance. In summary, as stated in the title, the authors claim to provide mechanistic insights into subjective time estimation in rats, a function they deem important for various cognitive conditions.

      This study aligns with a wide range of studies in system neuroscience that presume that rodents solve timing tasks through an explicit internal estimation of duration, underpinned by neuronal representations of time. Within this framework, the authors performed complex and challenging experiments, along with advanced data analysis, which undoubtedly merits acknowledgement. However, the question of time perception is a challenging one, and caution should be exercised when applying abstract ideas derived from human cognition to animals. Studying so-called time perception in rats has significant shortcomings because, whether acknowledged or not, rats do not passively estimate time in their heads. They are constantly in motion. Moreover, rats do not perform the task for the sake of estimating time but to obtain their rewards, because they are water restricted. Their behavior will therefore reflect their motivation and urgency to obtain rewards. Unfortunately, it appears that the authors are not aware of these shortcomings. These alternative processes (motivation, sensorimotor dynamics) that occur during task performance are likely to influence neuronal activity. Consequently, my review will be rather critical. It is not however intended to be dismissive. I acknowledge that the authors may have been influenced by numerous published studies that already draw similar conclusions. Unfortunately, all the data presented in this study can be explained without invoking the concept of time estimation. Therefore, I hope the authors will find my comments constructive and understand that as scientists, we cannot ignore alternative interpretations, even if they conflict with our a priori philosophical stance (e.g., duration can be explicitly estimated by reading neuronal representation of time) and anthropomorphic assumptions (e.g., rats estimate time as humans do). 
      While space is limited in a review, if the authors are interested, they can refer to a lengthy review I recently published on this topic, which demonstrates that my criticism is supported by a wide range of timing experiments across species (Robbe, 2023). In addition to this major conceptual issue that casts doubt on most of the conclusions of the study, there are also several major statistical issues.

      Main Concerns

      (1) The authors used a task in which rats must poke for a minimal amount of time (300 ms and then 1500 ms) to be able to obtain a drop of water delivered a few centimeters right below the nosepoke. They claim that their task is a time estimation task. However, they forget that they work with thirsty rats that are eager to get water sooner rather than later (there is a reason why they start with a short duration!). This task is mainly probing the animals' ability to wait (that is, impulse control) rather than time estimation per se. Second, the task does not require estimating time precisely because there appear to be no penalties when the nosepokes are too short or when they exceed. So it will be unclear if the variation in nosepoke reflects motivational changes rather than time estimation changes. The fact that this behavioral task is a poor assay for time estimation and rather reflects impulse control is shown by the tendency of animals to perform nose-pokes that are too short, the very slow improvement in their performance (Figure 1, with most of the mice making short responses), and the huge variability. Not only do the behavioral data not support the claim of the authors in terms of what the animals are actually doing (estimating time), but this also completely annihilates the interpretation of the Ca++ imaging data, which can be explained by motivational factors (changes in neuronal activity occurring while the animals nose poke may reflect a growing sense of urgency to check if water is available). 

      We would like to respond to the reviewer’s comments 1, 2 and 4 together, since they all focus on the same issue. We thank the reviewer for the very thoughtful comments and for sharing his detailed reasoning from a recently published review (Robbe, 2023). Much of this discussion goes beyond the scope of this study, and we agree that whether there is an explicit representation of time (an internal clock) in the brain is a difficult question to answer, particularly by using animal behaviors. In fact, even with fully conscious humans and elaborate task design, we think it is still questionable whether the neural substrate of “timing” can be clearly dissociated from “motor”. In the end, it may as well be that, as the reviewer cited from Bergson’s article, the experience of time cannot be measured.

      Studying the neural representation of any internal state may suffer from the same ambiguity. With all due respect, however, we would like to limit our response to the scope of our results. According to the reviewer, two alternative interpretations of the task-related sequential activity exist: 1, duration cells may represent fidgeting or orofacial movements and 2, duration cells may represent motivation or motion plan of the rats. To test the first alternative interpretation, we have now performed a more comprehensive analysis of the behavior data at all the limbs and visible body parts of the experimental rats during nose poke and analyzed its periodicity among different trials. We found that the coding cells (including duration, start and end cells) activities were not modulated by these motions, arguing against this possibility. These data are now included in the new Supp. Figure 2, and we have added corresponding texts in the manuscript.

      Regarding the second alternative interpretation, we think our data in the original Figure 4G argue against it. In this graph, we plotted the decoding error of time using the duration cells’ activity against the actual duration of the trials. If the sequential activity of duration cells only represents motivation, then the errors should be linearly modulated by trial durations. The unimodal distribution we observed (Figure 4G and see graph below for a re-plot without signs) suggests that the scaling factor of the sequential activity represents information related to time. And the fact that this unimodal distribution is centered at the time threshold of the task provides strong evidence for the active use of the scaling factor for time estimation.

      In order to further test the relationship to motivation, we have measured the time interval between exiting the nose poke and the start of licking the water reward as an independent measurement of motivation for each trial. We found that this reward-seeking time was positively correlated with the trial durations, suggesting that the durations were correlated with motivation to some degree. When we scaled the activities of the duration cells by this reward-seeking time, we found that the patterns of the sequential activities were largely diminished and showed a significantly lower peak entropy compared to the same activities scaled by trial durations. The remaining sequential pattern may be due to the correlation between trial durations and motivation (Supp. Figure 2), and the sequential pattern reflects timing more prominently. These analyses provide further evidence that the sequential activities were not coding motivation. These data are included in Figure 2F, 2K and Supp. Figure 3 in the revised manuscript.
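
      The response does not define "peak entropy". One plausible reading, shown here purely as an assumption, is the Shannon entropy of the distribution of peak positions across cells: peaks that tile the interval evenly give high entropy, while peaks that collapse onto one position give low entropy. The function name `peak_entropy` and the binning scheme are hypothetical.

```python
import math
from collections import Counter

def peak_entropy(traces, n_bins):
    """Shannon entropy (in bits) of binned peak positions across cells.

    `traces` holds one activity trace per cell, all on a common time axis.
    This definition is an assumption, not the authors' stated method.
    """
    counts = Counter()
    for trace in traces:
        peak = max(range(len(trace)), key=trace.__getitem__)
        bin_index = min(peak * n_bins // len(trace), n_bins - 1)
        counts[bin_index] += 1
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Synthetic example: four cells peaking in four different quarters of the trial.
sequential = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
]
# Four cells all peaking at the same position.
collapsed = [[0, 0, 1, 0, 0, 0, 0, 0]] * 4
print(peak_entropy(sequential, 4), peak_entropy(collapsed, 4))
```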

      Author response image 1.

      Regarding whether the scaling sequential activity we report represents behavioral timing or true time estimation, we did not have evidence on this point. However, a previous study has shown that PFC silencing led to disruption of the mouse’s timing behavior without affecting the execution of the task (PMID: 24367075), arguing against the behavior timing interpretation. The main surprising finding of our present study is that these duration cells are different from the start and end cells in terms of their coding stability. Thus, future studies dissecting the anatomical microcircuit of these duration cells may provide further clues regarding whether they are connected with reward-related or motion-related brain regions. This may help partially resolve the “time” vs. “motor” debate the reviewer mentioned.

      (2) A second issue is that the authors seem to assume that rats are perfectly immobile and perform like some kind of robots that would initiate nose pokes, maintain them, and remove them in a very discretized manner. However, in this kind of task, rats are constantly moving from the reward magazine to the nose poke. They also move while nose-poking (either their body or their mouth), and when they come out of the nose poke, they immediately move toward the reward spout. Thus, there is a continuous stream of movements, including fidgeting, that will covary with timing. Numerous studies have shown that sensorimotor dynamics influence neural activity, even in the prefrontal cortex. Therefore, the authors cannot rule out that what the recordings reflect are movements (and the scaling of movement) rather than underlying processes of time estimation (some kind of timer). Concretely, start cells could represent the ending of the movement going from the water spout to the nosepoke, and end cells could be neurons that initiate (if one can really isolate any initiation, which I doubt) the movement from the nosepoke to the water spout. Duration cells could reflect fidgeting or orofacial movements combined with an increasing urgency to leave the nose pokes.

      (3) The statistics should be rethought for both the behavioral and neuronal data. They should be conducted separately for all the rats, as there is likely interindividual variability in the impulsivity of the animals.

      We thank the reviewer for the comment, yet we are not quite sure what specifically the reviewer is asking for. It appears that the reviewer is asking us to conduct our analysis for each rat individually. In our revised manuscript, we have conducted and reported analyses for individual rats in the original Figure 1C, Figure 2C, G, K, and Figure 4F.

      (4) The fact that neuronal activity reflects an integration of movement and motivational factors rather than some abstract timing appears to be well compatible with the analysis conducted on the error trials (Figure 4), considering that the sensorimotor and motivational dynamics will rescale with the durations of the nose poke. 

      (5) The authors should mention upfront in the main text (result section) the temporal resolution allowed by their Ca++ probe and discuss whether it is fast enough with regard to the behavioral dynamics occurring in the task. 

      We thank the reviewer for the suggestion. We originally mentioned the caveat of calcium imaging in the interpretation of our results. We have now incorporated more text for this purpose during our revision. In terms of behavioral dynamics (start and end of nose poke in this case), we think calcium imaging could provide sufficient kinetics. However, the more refined dynamics related to the reproducibility of the sequential activity or the precise representation of individual cells on the scaled duration may benefit from improved time resolution.

      Recommendations for the authors:  

      Reviewer #1 (Recommendations For The Authors): 

      (1) Please refer explicitly to the three types of cells in the abstract. 

      We have now modified the abstract as suggested during revision.

      (2) Please refer to the work of Betancourt et al., 2023 Cell Reports, where a trial-by-trial analysis on the correlation between neural trajectory dynamics in MPC and timing behavior is reported. In that same paper the stability of neural sequences across task parameters is reported. 

      We have now cited and discussed the study in the discussion section of the revised manuscript.

      (3) Please state the number of studied animals at the beginning of the results section. 

      We have now provided this information as requested. The numbers of rats are also plotted in Figure 1D for each analysis.

      (4) Why do the middle and right panels of Figure 2E show duration cells?

      Figure 2E was intended to show examples of duration cells’ activity. We included different examples of cells that peak at different points in the scaled duration. We believe these multiple examples would give the readers a straightforward impression of these cells’ activity patterns.

      (5) Which behavioral sessions of Figure 1B were analyzed further?

      We have now labeled the analyzed sessions in Figure 1B with red color in the revised manuscript.

      (6) In Figure 3A-C please increase the time before the beginning of the trial in order to visualize properly the activation patterns of the start cells.

      We thank the reviewer for the suggestion and have now modified the figure accordingly in the revised manuscript.

      (7) Please state what could be the behavioral and functional effect of the ablation of the cortical tissue on top of mPFC.

      We thank the reviewer for the question. In our experience, mice with a lens implanted in the mPFC did not show observable differences from mice without surgery in the acquisition of the task and the distribution of the nose-poke durations. In our dataset, rats with the lens implantation showed similar nose-poking behavior to those without lens implantation (Figure 1B). Thus, it seems that the effect of ablation, if any, was quite limited within the scope of our task.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We thank the reviewers and appreciate their recommendations to improve this work.

      Reviewer 1:

      Reviewer 1 recognizes that ‘This is an important finding that is relevant to the actions of VDR on colorectal cancer. The data presented to support the presented conclusion is convincing’.

      Reviewer 1 identifies as a major weakness ‘that the site of SIRT1 regulatory lysine acetylation is defined by mutational analysis rather than by direct biochemical analysis’.

      However, as the reviewer mentions “previous reports of K610 acetylation using mass spec (https://www.phosphosite.org/proteinAction.action?id=5946&showAllSites=true), and the absence of SIRT1 mutant K610R in the immunoprecipitates using anti-acetylated lysine antibodies presented in Fig. 4E clearly overcome this weakness”.

      In addition, overall SIRT1 acetylation is reduced by vitamin D and by the specific SIRT1 activator SRT1720, as shown by decreased SIRT1 in the anti-acetyl-lysine immunoprecipitates (Fig. 4A and B). The second weakness identified by Reviewer 1 concerns “the use of only one shRNA to deplete VDR in CRC cells.”

      We have made efforts to demonstrate that the results are specific, though we do not have results with alternative shRNAs for a variety of reasons. To mitigate this issue, we have compared two colon cancer cell lines originating from the same patient which differ in the presence/absence of VDR. SW480 cells derive from the primary tumor and express VDR, whereas SW620 cells were derived from a lung metastasis and lack VDR. Similar to the comparison of HCT116 with shVDR HCT116 cells presented in this study, VD induced SIRT1 levels in SW480 in contrast to a lack of induction in SW620, as shown in Author response image 1. This result provides support for the specificity of the shVDR.

      Author response image 1.

      Vitamin D requires the presence of VDR to increase SIRT1 protein levels. SW480 and SW620 cell lines derive from the same patient, from primary tumor and lung metastasis respectively and differ in their VDR content. 1α,25-dihydroxyvitamin D3 (1,25(OH)2D3) was added at 100 nM for 24 h. Representative western-blot, where TBP was used as a loading control, of four biological replicates. Statistical analysis by ANOVA and values represent mean ± SEM; *p<0.05; *** p<0.001.

      The referee noticed the inclusion of an siRNA for SIRT1 in Table 1. We apologize for that, since this is an error, and no results are presented in this study with SIRT1 depletion. Table 1 has been modified accordingly.

      Concerning the third and fourth weaknesses that Reviewer 1 identifies, we agree that mapping the interacting domains in both VDR and SIRT1 and in vitro reconstitution would improve the present study. However, we believe that these would constitute long-term studies that are not strictly necessary at this stage. Consequently, we favor the publication of the present body of work. In vitro reconstitution of the present work and the putative relevance of the proposed mechanism of vitamin D action via SIRT1 in types of cancer other than colon (e.g., breast) are certainly very interesting and warrant further investigation.

      Reviewer 2:

      This reviewer acknowledges that “…this study provides very interesting and solid information on the link between vitamin D and colorectal cancer. It is likely that this study will provide insight into the importance of vitamin D in other types of cancer. It may also lead to new therapeutic strategies for specific cases. This article is convincing, although the authors can improve their study as outlined…”

      We acknowledge the proposed changes and recommendations, and have changed the text and Figures as suggested by the Reviewer as follows:

      Figure 1

      Figure 1E and F: the cell lines used were described in the figure legend, but we agree that including the name in the figure brings more clarity and these are now added.

      Figure 1G: the statistical analysis was for all panels of Figure 1 as described in the Figure legend (lines 731-32). We have amended the original omission of panels 1G and 1H. In panel G, * represents statistical analysis by ANOVA (comparing the four groups) whereas # was the analysis by Student's t-test (comparing the two indicated groups), where * or # p<0.05. We hope to have clarified this point now.

      Figure 2

      Figure 2C: We originally showed the SIRT1/VDR interaction by immunoprecipitation of VDR and detection of SIRT1 in immunoprecipitates. We also showed immunoprecipitation of exogenously expressed Myc-SIRT1 (WT or mutants) and detection of VDR in immunoprecipitates (Figure 4F). The reviewer requests that we perform the inverse IP for endogenous SIRT1, that is, immunoprecipitate SIRT1 and detect VDR in the immunoprecipitates, which we now supply for the reviewer in Author response image 2.

      Author response image 2

      Immunoprecipitation of endogenous SIRT1 to show interaction with VDR. 1α,25-dihydroxyvitamin D3 (1,25(OH)2D3) was added at 100 nM for 24 h. Representative western-blots, where TBP was used as a loading control.

      Figure 3

      • Figure 3D: ‘The authors should indicate the color of the different stainings’. Immunostainings have been revealed with DAB (diaminobenzidine); thus, positivity is highlighted by light or dark brown according to low or high protein expression. Counterstaining has been performed with hematoxylin, which stains nuclei in dark blue and cytoplasm in light blue.

      Do the authors mean that the secondary antibody marks in brown/red? If so, these results are inconsistent with the text considering that hematoxylin was used for non-tumor tissue. This part needs to be clarified.

      We thank the Reviewer for asking us to clarify this issue. Neither the primary antibodies nor the anti-Ig horseradish peroxidase-conjugated secondary antibodies produced positive staining on their own. Therefore, the secondary antibody does not mark in any color. Hematoxylin has been used as counterstaining for both non-tumor and tumor tissues.

      What about the level of FOXO3A in these tissues/tumors?

      We did not probe the tumor sections for specific SIRT1 substrates such as FoxO3A since their levels may not entirely depend on SIRT1-specific deacetylation.

      What is the level of 1,25(OH)2D3 in these patients?

      We agree with this referee that this information would be very useful, but unfortunately, we do not have data on vitamin D levels for these patients since they were not specifically recruited for this study and vitamin D levels are not routinely measured.

      Figure 3D, the following information is missing: "A detailed amplification is shown in the lower left of each micrograph."

      We decided not to include the amplification in micrographs because the aim of the manuscript is focused on protein levels, not localization and including the amplification was more confusing than enlightening. This has been amended now in the text.

      Figure 3E, it says p=0.325, in the legend p<0.01, and in the text there is a trend. Which is the correct version?

      We apologize for this inconsistency. As stated in the Figure, p=0.325, and therefore the difference does not reach statistical significance. We have amended the main text and figure legend to report that differences between SIRT1 expression levels of healthy and cancer human colon samples are not statistically significant.

      Figure 4

      Figure 4F. The quality of the presented blots is not optimal. It needs to be improved. In addition, the number of independent biological experiments is not indicated.

      We have substituted the representative western-blot and included statistical analysis of four independent biological replicates. Since 4F is now a bigger panel, it has required a slight reorganization of the whole Figure, but the remaining panels are unchanged. Now we indicate in the figure legend that at least three independent biological replicates were analyzed. In addition, we supply below the four experiments for the reviewer in Author response image 3.

      Author response image 3

      Immunoprecipitation of exogenous myc-tagged SIRT1 to show interaction with VDR of wild type (WT) or mutants. 1α,25-dihydroxyvitamin D3 (1,25(OH)2D3) was added at 100 nM for 24 h. FT: Flow Through. TBP as a loading control.

      Regarding the last general comment concerning the number of independent experiments performed, this is indicated in the Figure legends (lines 732-36, 757-58, 823-24, 840-41). All the in vitro experiments were performed at least as three independent experiments and not by repeating a western blot. A representative western blot is shown, and the statistical analysis corresponds to the analysis of the three biological replicates. For experiments with patient samples, the number of patients appears clearly indicated in the corresponding panel.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors attempt to validate Fisher Kernels on the top of HMM as a way to better describe human brain dynamics at resting state. The objective criterion was the better prediction of the proposed pipeline of the individual traits.

      Strengths:

      The authors analyzed rs-fMRI dataset from the HCP providing results also from other kernels.

      The authors also provided findings from simulation data.

      Weaknesses:

      (1) The authors should explain in detail how they applied cross-validation across the dataset for both optimization of parameters, and also for cross-validation of the models to predict individual traits.

      Indeed, there were details about the cross-validation for hyperparameter tuning and prediction missing. This problem was also raised by Reviewer #2. We have now rephrased this section in 4.4 and added details: ll. 804-813:

      “We used k-fold nested cross-validation (CV) to select and evaluate the models. We used 10 folds for both the outer loop (used to train and test the model) and the inner loop (used to select the optimal hyperparameters) such that 90% were used for training and 10% for testing. The optimal hyperparameters λ (and τ in the case of the Gaussian kernels) were selected using grid-search from the vectors λ=[0.0001,0.001,0.01,0.1,0.3,0.5,0.7,0.9,1] and . In both the outer and the inner loop, we accounted for family structure in the HCP dataset so that subjects from the same family were never split across folds (Winkler et al., 2015). Within the CV, we regressed out sex and head motion confounds, i.e., we estimated the regression coefficients for the confounds on the training set and applied them to the test set (Snoek et al., 2019).” and ll. 818-820: “We generated the 100 random repetitions of the 10 outer CV folds once, and then used them for training and prediction of all methods, so that all methods were fit to the same partitions.”
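
      For illustration, the nested, family-aware cross-validation with fold-wise deconfounding described above could be sketched as follows. This is a minimal sketch on synthetic data and not the code used in the manuscript. The helper names (`fit_predict`, `nested_group_cv`), the use of scikit-learn's `GroupKFold` and `KernelRidge`, the linear toy kernel, and the choice to regress confounds out of the target are all assumptions made for the example. Only λ is tuned here, and a single repetition is run rather than 100.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

# Regularisation grid quoted in the methods text.
LAMBDAS = [0.0001, 0.001, 0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1]


def fit_predict(K, y, conf, train, test, lam):
    """Deconfound the target using training subjects only, then fit kernel ridge.

    Hypothetical helper: K is a precomputed (subjects x subjects) kernel.
    """
    reg = LinearRegression().fit(conf[train], y[train])
    y_train = y[train] - reg.predict(conf[train])
    y_test = y[test] - reg.predict(conf[test])  # training coefficients applied to test
    model = KernelRidge(alpha=lam, kernel="precomputed")
    model.fit(K[np.ix_(train, train)], y_train)
    return model.predict(K[np.ix_(test, train)]), y_test


def nested_group_cv(K, y, conf, families, n_folds=10):
    """Outer loop evaluates the model, inner loop selects lambda; families stay together."""
    preds = np.full(len(y), np.nan)
    targets = np.full(len(y), np.nan)
    for train, test in GroupKFold(n_splits=n_folds).split(K, y, groups=families):
        inner_err = []
        for lam in LAMBDAS:
            err = 0.0
            inner = GroupKFold(n_splits=n_folds)
            for i_tr, i_va in inner.split(train, groups=families[train]):
                p, t = fit_predict(K, y, conf, train[i_tr], train[i_va], lam)
                err += np.sum((p - t) ** 2)
            inner_err.append(err)
        best_lam = LAMBDAS[int(np.argmin(inner_err))]
        preds[test], targets[test] = fit_predict(K, y, conf, train, test, best_lam)
    return preds, targets


# Synthetic stand-in data: 40 families of 3 subjects, 2 confounds, linear kernel.
rng = np.random.default_rng(0)
n_subjects, n_features = 120, 20
families = np.repeat(np.arange(40), 3)
X = rng.normal(size=(n_subjects, n_features))
conf = rng.normal(size=(n_subjects, 2))
y = X @ rng.normal(size=n_features) + 0.5 * conf[:, 0] + 0.1 * rng.normal(size=n_subjects)
K = X @ X.T

preds, targets = nested_group_cv(K, y, conf, families)
print("out-of-sample correlation:", np.corrcoef(preds, targets)[0, 1])
```

      Passing the family identifiers as `groups` in both loops is what keeps relatives in the same fold. Repeating the procedure with randomised folds would additionally require shuffling the family-to-fold assignment, which this sketch omits.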

      (2) They discussed throughout the paper that their proposed (HMM+Fisher) kernel approach outperformed dynamic functional connectivity (dFC). However, they compared the proposed methodology with just static FC.

      We would like to clarify that the HMM is itself a method for estimating dynamic (or time-varying) FC, just like the sliding window approach, see also Vidaurre, 2024 (https://direct.mit.edu/imag/article/doi/10.1162/imag_a_00363/124983) for an overview of terminology.

      See also our response to Q3.

      (3) If the authors wanted to claim that their methodology is better than dFC, then they have to demonstrate results based on dFC with the trivial sliding window approach.

      We would like to be clear that we do not claim in the manuscript that our method outperforms other dynamic functional connectivity (dFC) approaches, such as sliding window FC. We have now made changes to the manuscript to make this clearer.

      First, we have clarified our use of the term “brain dynamics” to signify “time-varying amplitude and functional connectivity patterns” in this context, as Reviewer #2 raised the point that the former term is ambiguous (ll.33-35: “One way of describing brain dynamics are state-space models, which allow capturing recurring patterns of activity and functional connectivity (FC) across the whole brain.”).

      Second, our focus is on our method being a way of using dFC for predictive modelling, since there currently is no widely accepted way of doing this. One reason why dFC is not usually considered in prediction studies is that it is mathematically not trivial how to use the parameters from estimators of dynamic FC for a prediction. This includes the sliding window approach. We do not aim at comparing across different dFC estimators in this paper. To make these points clearer, we have revised the introduction to now say:

      Ll. 39-50:

      “One reason why brain dynamics are not usually considered in this context pertains to their representation: They are represented using models of varying complexity that are estimated from modalities such as functional MRI or MEG. Although there exists a variety of methods for estimating time-varying or dynamic FC (Lurie et al., 2019), like the commonly used sliding-window approach, there is currently no widely accepted way of using them for prediction problems. This is because these models are usually parametrised by a high number of parameters with complex mathematical relationships between the parameters that reflect the model assumptions. How to leverage these parameters for prediction is currently an open question.

      We here propose the Fisher kernel for predicting individual traits from brain dynamics, using information from generative models that do not assume any knowledge of task timings. We focus on models of brain dynamics that capture within-session changes in functional connectivity and amplitude from fMRI scans, in this case acquired during wakeful rest, and how the parameters from these models can be used to predict behavioural variables or traits. In particular, we use the Hidden Markov Model (HMM), which is a probabilistic generative model of time-varying amplitude and functional connectivity (FC) dynamics (Vidaurre et al., 2017).”
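
      To make the idea behind the Fisher kernel concrete, the following toy sketch may help. It does not use an HMM. As a deliberately simplified stand-in (an assumption made only for this example), the group-level generative model is a univariate Gaussian, each subject is represented by the gradient of their log-likelihood with respect to the group parameters, and the linear kernel is the inner product of these gradients. The function name `fisher_score` and the simulated subjects are hypothetical.

```python
import numpy as np


def fisher_score(x, mu, var):
    """Gradient of one subject's Gaussian log-likelihood w.r.t. (mu, var).

    Toy stand-in for the gradient w.r.t. HMM parameters used in the manuscript.
    """
    d_mu = np.sum(x - mu) / var
    d_var = np.sum((x - mu) ** 2 - var) / (2.0 * var ** 2)
    return np.array([d_mu, d_var])


rng = np.random.default_rng(0)
# Four simulated "subjects" that differ in mean and/or variance.
subject_params = [(0.0, 1.0), (0.2, 1.0), (0.0, 1.5), (-0.3, 0.8)]
subjects = [rng.normal(loc=m, scale=s, size=200) for m, s in subject_params]

# Group-level model: maximum-likelihood fit to the concatenated data.
pooled = np.concatenate(subjects)
mu_group, var_group = pooled.mean(), pooled.var()

# One gradient vector per subject, evaluated at the group-level parameters.
scores = np.array([fisher_score(x, mu_group, var_group) for x in subjects])

# Linear Fisher kernel (Fisher information normalisation omitted in this sketch).
K_fisher = scores @ scores.T
print(K_fisher.round(1))
```

      Because the group parameters are the maximum-likelihood estimate for the pooled data, the subject gradients sum to zero. Each gradient therefore describes how a subject would pull the group model away from its fitted parameters.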

      Reviewer #2 (Public Review):

      Summary:

      The manuscript presents a valuable investigation into the use of Fisher Kernels for extracting representations from temporal models of brain activity, with the aim of improving regression and classification applications. The authors provide solid evidence through extensive benchmarks and simulations that demonstrate the potential of Fisher Kernels to enhance the accuracy and robustness of regression and classification performance in the context of functional magnetic resonance imaging (fMRI) data. This is an important achievement for the neuroimaging community interested in predictive modeling from brain dynamics and, in particular, state-space models.

      Strengths:

      (1) The study's main contribution is the innovative application of Fisher Kernels to temporal brain activity models, which represents a valuable advancement in the field of human cognitive neuroimaging.

      (2) The evidence presented is solid, supported by extensive benchmarks that showcase the method's effectiveness in various scenarios.

      (3) Model inspection and simulations provide important insights into the nature of the signal picked up by the method, highlighting the importance of state rather than transition probabilities.

      (4) The documentation and description of the methods are solid including sufficient mathematical details and availability of source code, ensuring that the study can be replicated and extended by other researchers.

      Weaknesses:

      (1) The generalizability of the findings is currently limited to the young and healthy population represented in the Human Connectome Project (HCP) dataset. The potential of the method for other populations and modalities remains to be investigated.

      As suggested by the reviewer, we have added a limitations paragraph and included a statement about the dataset: Ll. 477-481: “The fMRI dataset we used (HCP 1200 Young Adult) is a large sample taken from a healthy, young population, and it remains to be shown how our findings generalise to other datasets, e.g. other modalities such as EEG/MEG, clinical data, older populations, different data quality, or smaller sample sizes both in terms of the number of participants and the scanning duration”.

      We would like to emphasise that this is a methodological contribution, rather than a basic science investigation about cognition and brain-behaviour associations. Therefore, the method would be equally usable on different populations, even if the results vary.

      (2) The possibility of positivity bias in the HMM, due to the use of a population model before cross-validation, needs to be addressed to confirm the robustness of the results.

      As pointed out by both Reviewers #2 and #3, we did not separate subjects into training and test set before fitting the HMM. To address this issue, we have now repeated the predictions for HMMs fit only to the training subjects. We show that this has no effect on the results. Since this question has consequences for the Fisher kernel, we have also added simulations showing how the different kernels react to increasing heterogeneity between training and test set. These new results are added as results section 2.4 (ll. 376-423).

      (3) The statistical significance testing might be compromised by incorrect assumptions about the independence between cross-validation distributions, which warrants further examination or clearer documentation.

      We have now replaced the significance testing with repeated k-fold cross-validated corrected tests. Note that this required re-running the models to be able to test differences in accuracies on the level of individual folds, resulting in different plots throughout the manuscript and different statistical results. This does not, however, change the main conclusions of our manuscript.
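
      As an illustration of why the correction matters, one common form of such a test is the corrected repeated k-fold cross-validation t-test (Nadeau and Bengio, 2003; Bouckaert and Frank, 2004), which inflates the variance term by the ratio of test to training set size. The sketch below assumes this form and uses simulated fold-wise accuracy differences. The function name and the numbers are hypothetical, and the exact test used in the manuscript may differ in detail.

```python
import numpy as np
from scipy import stats


def corrected_repeated_kfold_ttest(diffs, n_train, n_test):
    """Two-sided corrected t-test on fold-wise performance differences.

    diffs: one difference per fold and repetition (length = folds x repetitions).
    The factor (1/m + n_test/n_train) accounts for overlapping training sets.
    """
    diffs = np.asarray(diffs, dtype=float)
    m = diffs.size
    t_stat = diffs.mean() / np.sqrt((1.0 / m + n_test / n_train) * diffs.var(ddof=1))
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=m - 1)
    return t_stat, p_value


# Simulated differences for 100 repetitions of 10-fold CV (hypothetical values).
rng = np.random.default_rng(0)
fold_diffs = rng.normal(loc=0.01, scale=0.02, size=1000)
n_train_subjects, n_test_subjects = 900, 100

t_corr, p_corr = corrected_repeated_kfold_ttest(fold_diffs, n_train_subjects, n_test_subjects)
t_naive, p_naive = stats.ttest_1samp(fold_diffs, 0.0)
print(f"naive:     t={t_naive:.2f}, p={p_naive:.2e}")
print(f"corrected: t={t_corr:.2f}, p={p_corr:.3f}")
```

      The uncorrected test treats all fold-wise differences as independent, so its t statistic grows with the number of repetitions. The corrected statistic is smaller by a factor of sqrt(1 + m * n_test / n_train), where m is the number of folds times repetitions.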

      (4) The inclusion of the R^2 score, sensitive to scale, would provide a more comprehensive understanding of the method's performance, as the Pearson correlation coefficient alone is not standard in machine learning and may not be sufficient (even if it is common practice in applied machine learning studies in human neuroimaging).

      We have now added the coefficient of determination to the results figures.
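
      The difference between the two metrics can be seen in a small worked example. The values below are made up for illustration: predictions that are a rescaled and shifted copy of the target are perfectly correlated with it, yet their coefficient of determination is strongly negative.

```python
import numpy as np


def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted values."""
    return np.corrcoef(y_true, y_pred)[0, 1]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat_scaled = 3.0 * y_obs + 10.0  # right ordering, wrong scale and offset

r_value = pearson_r(y_obs, y_hat_scaled)
r2_value = r_squared(y_obs, y_hat_scaled)
print(r_value, r2_value)  # r is 1.0 (up to rounding), R^2 is -131.0
```

      Here the residual sum of squares is 1320 and the total sum of squares is 10, giving R^2 = 1 - 132 = -131 although r = 1.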

      (5) The process for hyperparameter tuning is not clearly documented in the methods section, both for kernel methods and the elastic net.

      As mentioned above in the response to Reviewer #1, we have now added details about hyperparameter tuning for the kernel methods and the non-kernelised static FC regression models (see also Reviewer #1 comment 1): Ll.804-813: “We used k-fold nested cross-validation (CV) to select and evaluate the models. We used 10 folds for both the outer loop (used to train and test the model) and the inner loop (used to select the optimal hyperparameters) such that 90% were used for training and 10% for testing. The optimal hyperparameters λ (and τ in the case of the Gaussian kernels) were selected using grid-search from the vectors λ=[0.0001,0.001,0.01,0.1,0.3,0.5,0.7,0.9,1] and . In both the outer and the inner loop, we accounted for family structure in the HCP dataset so that subjects from the same family were never split across folds (Winkler et al., 2015). Within the CV, we regressed out sex and head motion confounds, i.e., we estimated the regression coefficients for the confounds on the training set and applied them to the test set (Snoek et al., 2019).” and ll. 818-820: “We generated the 100 random repetitions of the 10 outer CV folds once, and then used them for training and prediction of all methods, so that all methods were fit to the same partitions.”, as well as ll.913-917: “All time-averaged FC models are fitted using the same (nested) cross-validation strategy as described above (10-fold CV using the outer loop for model evaluation and the inner loop for model selection using grid-search for hyperparameter tuning, accounting for family structure in the dataset, and repeated 100 times with randomised folds).”

      (6) For the time-averaged benchmarks, a comparison with kernel methods using metrics defined on the Riemannian SPD manifold, such as employing the Frobenius norm of the logarithm map within a Gaussian kernel, would strengthen the analysis, cf. Jayasumana (https://arxiv.org/abs/1412.4172) Table 1, log-euclidean metric.

      We have now added the log-Euclidean Gaussian kernel proposed by the reviewer to the model comparisons. The additional model does not change our conclusions.
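
      For readers unfamiliar with this kernel, a minimal sketch is given below. It assumes the form k(A, B) = exp(-||log A - log B||_F^2 / (2 τ^2)) for symmetric positive definite matrices A and B, following the log-Euclidean metric in the reference cited by the reviewer. The bandwidth parametrisation, the function names, and the random covariance matrices are assumptions made for the example.

```python
import numpy as np


def spd_log(S):
    """Matrix logarithm of a symmetric positive definite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(S)
    return (eigvecs * np.log(eigvals)) @ eigvecs.T


def log_euclidean_gaussian_kernel(covs, tau):
    """Gaussian kernel on the Frobenius distance between matrix logarithms."""
    flat = np.array([spd_log(S).ravel() for S in covs])
    sq_norms = np.sum(flat ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * flat @ flat.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative rounding errors
    return np.exp(-sq_dists / (2.0 * tau ** 2))


# Hypothetical data: five random 4 x 4 covariance-like matrices.
rng = np.random.default_rng(0)
n_regions = 4
cov_list = []
for _ in range(5):
    A = rng.normal(size=(n_regions, 3 * n_regions))
    cov_list.append(A @ A.T / (3 * n_regions))

K_le = log_euclidean_gaussian_kernel(cov_list, tau=2.0)
print(K_le.round(3))
```

      Taking the matrix logarithm first maps the covariance matrices into a flat space, so the resulting Gaussian kernel is positive semi-definite and can be used like any other kernel in ridge regression.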

      (7) A more nuanced and explicit discussion of the limitations, including the reliance on HCP data, lack of clinical focus, and the context of tasks for which performance is expected to be on the low end (e.g. cognitive scores), is crucial for framing the findings within the appropriate context.

      We have now revised the discussion section and added an explicit limitations paragraph: Ll. 475-484:

      “We here aimed to show the potential of the HMM-Fisher kernel approach to leverage information from patterns of brain dynamics to predict individual traits in an example fMRI dataset as well as simulated data. The fMRI dataset we used (HCP 1200 Young Adult) is a large sample taken from a healthy, young population, and it remains to be shown how the exhibited performance generalises to other datasets, e.g. other modalities such as EEG/MEG, clinical data, older populations, different data quality, or smaller sample sizes both in terms of the number of participants and the scanning duration. Additionally, we only tested our approach for the prediction of a specific set of demographic items and cognitive scores; it may be interesting to test the framework also on clinical variables, such as the presence of a disease or the response to pharmacological treatment.”

      (8) While further benchmarks could enhance the study, the authors should provide a critical appraisal of the current findings and outline directions for future research, considering the scope and budget constraints of the work.

      In addition to the new limitations paragraph (see previous comment), we have now rephrased our interpretation of the results and extended the outlook paragraph: Ll. 485-507:

      “There is growing interest in combining different data types or modalities, such as structural, static, and dynamic measures, to predict phenotypes (Engemann et al., 2020; Schouten et al., 2016). While directly combining the features from each modality can be problematic, modality-specific kernels, such as the Fisher kernel for time-varying amplitude and/or FC, can be easily combined using approaches such as stacking (Breiman, 1996) or Multi Kernel Learning (MKL) (Gönen & Alpaydın, 2011). MKL can improve prediction accuracy of multimodal studies (Vaghari et al., 2022), and stacking has recently been shown to be a useful framework for combining static and time-varying FC predictions (Griffin et al., 2024). A detailed comparison of different multimodal prediction strategies including kernels for time-varying amplitude/FC may be the focus of future work.

      In a clinical context, while there are nowadays highly accurate biomarkers and prognostics for many diseases, others, such as psychiatric diseases, remain poorly understood, diagnosed, and treated. Here, improving the description of individual variability in brain measures may have potential benefits for a variety of clinical goals, e.g., to diagnose or predict individual patients’ outcomes, find biomarkers, or to deepen our understanding of changes in the brain related to treatment responses like drugs or non-pharmacological therapies (Marquand et al., 2016; Stephan et al., 2017; Wen et al., 2022; Wolfers et al., 2015). However, the focus so far has mostly been on static or structural information, leaving the potentially crucial information from brain dynamics untapped. Our proposed approach provides one avenue of addressing this by leveraging individual patterns of time-varying amplitude and FC, and it can be flexibly modified or extended to include, e.g., information about temporally recurring frequency patterns (Vidaurre et al., 2016).”

      Reviewer #3 (Public Review):

      Summary:

      In this work, the authors use a Hidden Markov Model (HMM) to describe dynamic connectivity and amplitude patterns in fMRI data, and propose to integrate these features with the Fisher Kernel to improve the prediction of individual traits. The approach is tested using a large sample of healthy young adults from the Human Connectome Project. The HMM-Fisher Kernel approach was shown to achieve higher prediction accuracy with lower variance on many individual traits compared to alternate kernels and measures of static connectivity. As an additional finding, the authors demonstrate that parameters of the HMM state matrix may be more informative in predicting behavioral/cognitive variables in this data compared to state-transition probabilities.

      Strengths:

      - Overall, this work helps to address the timely challenge of how to leverage high-dimensional dynamic features to describe brain activity in individuals.

      - The idea to use a Fisher Kernel seems novel and suitable in this context.

      - Detailed comparisons are carried out across the set of individual traits, as well as across models with alternate kernels and features.

      - The paper is well-written and clear, and the analysis is thorough.

      Potential weaknesses:

      - One conclusion of the paper is that the Fisher Kernel "predicts more accurately than other methods" (Section 2.1 heading). I was not certain this conclusion is fully justified by the data presented, as it appears that certain individual traits may be better predicted by other approaches (e.g., as shown in Figure 3) and I found it hard to tell if certain pairwise comparisons were performed -- was the linear Fisher Kernel significantly better than the linear Naive normalized kernel, for example?

      We have revised the abstract and the discussion to state the results more appropriately. For instance, we changed the relevant section in the abstract to (ll. 24-26):

      “We show here, in fMRI data, that the HMM-Fisher kernel approach is accurate and reliable. We compare the Fisher kernel to other prediction methods, both time-varying and time-averaged functional connectivity-based models.”,

      and in the discussion, removing the sentence

      “resulting in better generalisability and interpretability compared to other methods”,

      and adding (given the revised statistical results) ll. 435-436:

      “though most comparisons were not statistically significant given the narrow margin for improvements.”

      In conjunction with the new statistical approach (see Reviewer #2, comment 3), we have now streamlined the comparisons. We explained which comparisons were performed in the methods ll.880-890:

      “For the main results, we separately compare the linear Fisher kernel to the other linear kernels, and the Gaussian Fisher kernel to the other Gaussian kernels, as well as to each other. We also compare the linear Fisher kernel to all time-averaged methods. Finally, to test for the effect of tangent space projection for the time-averaged FC prediction, we also compare the Ridge regression model to the Ridge Regression in Riemannian space. To test for effects of removing sets of features, we use the approach described above to compare the kernels constructed from the full feature sets to their versions where features were removed or reduced. Finally, to test for effects of training the HMM either on all subjects or only on the subjects that were later used as training set, we compare each kernel to the corresponding kernel constructed from HMM parameters, where training and test set were kept separate.”

      Model performance evaluation is done on the level of all predictions (i.e., across target variables, CV folds, and CV iterations) rather than for each of the target variables separately. That means different best-performing methods depending on the target variables are to be expected.

      - While 10-fold cross-validation is used for behavioral prediction, it appears that data from the entire set of subjects is concatenated to produce the initial group-level HMM estimates (which are then customized to individuals). I wonder if this procedure could introduce some shared information between CV training and test sets. This may be a minor issue when comparing the HMM-based models to one another, but it may be more important when comparing with other models such as those based on time-averaged connectivity, which are calculated separately for train/test partitions (if I understood correctly).

      The lack of separation between training and test set before fitting the HMM was also pointed out by Reviewer #2. We are addressing this issue in the new Results section 2.4 (see also our response to Reviewer #2, comment 2).

      Recommendations for the authors:

      The individual public reviews all indicate the merits of the study, however, they also highlight relatively consistent questions or issues that ought to be addressed. Most significantly, the authors ought to provide greater clarity surrounding the use of the cross-validation procedures they employ, and the use of a common atlas derived outside the cross-validation loop. Also, the authors should ensure that the statistical testing procedures they employ accommodate the dependencies induced between folds by the cross-validation procedure and give care to ensuring that the conclusions they make are fully supported by the data and statistical tests they present.

      Reviewer #1 (Recommendations For The Authors):

      Overall, the study is interesting but demands further improvements. Below, I summarize my comments:

      (1) The authors should explain in detail how they applied cross-validation across the dataset for both optimization of parameters, and also for cross-validation of the models to predict individual traits.

      How did you split the dataset for both parameters optimization, and for the CV of the prediction of behavioral traits?

      A review and a summary of the various CV schemes that have been applied to the same dataset should be provided.

      We apologise for the oversight and have now added more details to the CV section of the methods, see our response to Reviewer #1 comment 1:

      In ll. 804-813:

      “We used k-fold nested cross-validation (CV) to select and evaluate the models. We used 10 folds for both the outer loop (used to train and test the model) and the inner loop (used to select the optimal hyperparameters) such that 90% were used for training and 10% for testing. The optimal hyperparameters λ (and τ in the case of the Gaussian kernels) were selected using grid-search from the vectors λ=[0.0001,0.001,0.01,0.1,0.3,0.5,0.7,0.9,1] and . In both the outer and the inner loop, we accounted for family structure in the HCP dataset so that subjects from the same family were never split across folds (Winkler et al., 2015). Within the CV, we regressed out sex and head motion confounds, i.e., we estimated the regression coefficients for the confounds on the training set and applied them to the test set (Snoek et al., 2019).” and ll. 818-820: “We generated the 100 random repetitions of the 10 outer CV folds once, and then used them for training and prediction of all methods, so that all methods were fit to the same partitions.”

      (2) The authors should explain in more detail how they applied ICA-based parcellation at the group-level.

      A. Did you apply it across the whole group? If yes, then this is problematic since it rejects the CV approach. It should be applied within the folds.

      B. How did you define the representative time-source per ROI?

      A: How group ICA was applied was stated in the Methods section (4.1 HCP imaging and behavioural data), ll. 543-548:

      “The parcellation was estimated from the data using multi-session spatial ICA on the temporally concatenated data from all subjects.”

      We have now added a disclaimer about the divide between training and test set:

      “Note that this means that there is no strict divide between the subjects used for training and the subjects for testing the later predictive models, so that there is potential for leakage of information between training and test set. However, since this step does not concern the target variable, but only the preprocessing of the predictors, the effect can be expected to be minimal (Rosenblatt et al., 2024).”

      We understand that in order to make sure we avoid data leakage, it would be desirable to estimate and apply group ICA separately for the folds, but the computational load of this would be well beyond the constraints of this particular work, where we have instead used the parcellation provided by the HCP consortium.

      B: This was also stated in 4.1, ll. 554-559: “Timecourses were extracted using dual regression (Beckmann et al., 2009), where group-level components are regressed onto each subject’s fMRI data to obtain subject-specific versions of the parcels and their timecourses. We normalised the timecourses of each subject to ensure that the model of brain dynamics and, crucially, the kernels were not driven by (averaged) amplitude and variance differences between subjects.”

      (3) The authors discussed throughout the paper that their proposed (HMM+Fisher) kernel approach outperformed dynamic functional connectivity (dFC). However, they compared the proposed methodology with just static FC.

      A. The authors didn't explain how static and dFC have been applied.

      B. If the authors wanted to claim that their methodology is better than dFC, then they have to demonstrate results based on dFC with the trivial sliding window approach.

      C. Moreover, the static FC networks have been constructed by concatenating time samples that belong to the same state across the time course of resting-state activity.

      So, it's HMM-informed static FC analysis, which is problematic since it's derived from HMM applied over the brain dynamics.

      I don't agree that connectivity is derived exclusively from the clustering of human brain dynamics!

      D. A static approach of using the whole time course, and a dFC following the trivial sliding-window approach should be adopted and presented for comparison with (HMM+Fisher) kernel.

      We do not intend to claim in our manuscript that our method outperforms other methods for doing dynamic FC. Indeed, we would like to be clear that the HMM itself is a method for capturing dynamic FC. Please see our responses to public review comments 2 and 3 by Reviewer #1, copied below, which are intended to clear up this misunderstanding:

      We would like to clarify that the HMM is itself a method for estimating dynamic (or time-varying) FC, just like the sliding window approach, see also Vidaurre, 2024 (https://direct.mit.edu/imag/article/doi/10.1162/imag_a_00363/124983) for an overview of terminology.

      We would like to be clear that we do not claim in the manuscript that our method outperforms other dynamic functional connectivity (dFC) approaches, such as sliding window FC. We have now made changes to the manuscript to make this clearer.

      First, we have clarified our use of the term “brain dynamics” to signify “time-varying amplitude and functional connectivity patterns” in this context, as Reviewer #2 raised the point that the former term is ambiguous.

      Second, our focus is on our method being a way of using dFC for predictive modelling, since there currently is no widely accepted way of doing this. One reason why dFC is not usually considered in prediction studies is that it is mathematically not trivial how to use the parameters from estimators of dynamic FC for a prediction. This includes the sliding window approach. We do not aim at comparing across different dFC estimators in this paper. To make these points clearer, we have revised the introduction to now say:

      Ll. 39-50:

      “One reason why brain dynamics are not usually considered in this context pertains to their representation: They are represented using models of varying complexity that are estimated from modalities such as functional MRI or MEG. Although there exists a variety of methods for estimating time-varying or dynamic FC (Lurie et al., 2019), like the commonly used sliding-window approach, there is currently no widely accepted way of using them for prediction problems. This is because these models are usually parametrised by a high number of parameters with complex mathematical relationships between the parameters that reflect the model assumptions. How to leverage these parameters for prediction is currently an open question.

      We here propose the Fisher kernel for predicting individual traits from brain dynamics, using information from generative models that do not assume any knowledge of task timings. We focus on models of brain dynamics that capture within-session changes in functional connectivity and amplitude from fMRI scans, in this case acquired during wakeful rest, and how the parameters from these models can be used to predict behavioural variables or traits. In particular, we use the Hidden Markov Model (HMM), which is a probabilistic generative model of time-varying amplitude and functional connectivity (FC) dynamics (Vidaurre et al., 2017).”

      To the additional points raised here:

      A: How static and dynamic FC have been estimated is explicitly stated in the relevant Methods sections 4.2 (The Hidden Markov Model), which explains the details of using the HMM to estimate dynamic functional connectivity; and 4.5 (Regression models based on time-averaged FC features), which explains how static FC was computed.

      B: We are not making this claim. We have now modified the Introduction to avoid further misunderstandings, as per ll. 33-36: “One way of describing brain dynamics are state-space models, which allow capturing recurring patterns of activity and functional connectivity (FC) across the whole brain.”

      C: This is not how static FC networks were constructed; we apologise for the confusion. We also do not perform any kind of clustering. The only “HMM-informed static FC analysis” is the static FC KL divergence model to allow for a more direct comparison with the time-varying FC KL divergence model, but we have included several other static FC models (log-Euclidean, Ridge regression, Ridge regression Riem., Elastic Net, Elastic Net Riem., and Selected Edges), which do not use HMMs. This is explained in Methods section 4.5.

      D: As explained above, we have included four (five in the revised manuscript) static approaches using the whole time course, and we do not claim that our method outperforms other dynamic FC models. We also disagree that using the sliding window approach for predictive modelling is trivial, as explained in the introduction of the manuscript and under public review comment 3.
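
      For readers unfamiliar with the sliding-window approach referred to under point D, the following is a minimal sketch of how it produces time-varying FC estimates. It is not the analysis used in the manuscript: the function name `sliding_window_fc`, the window length, the step size, and the simulated data are all illustrative assumptions.

```python
import numpy as np

def sliding_window_fc(ts, window, step):
    """Correlation-based FC in successive windows (hypothetical helper).

    ts: array of shape (n_timepoints, n_regions).
    Returns an array of shape (n_windows, n_regions, n_regions).
    """
    n_t = ts.shape[0]
    starts = range(0, n_t - window + 1, step)
    # One correlation matrix per window; columns of ts are regions.
    return np.stack([np.corrcoef(ts[s:s + window], rowvar=False) for s in starts])

rng = np.random.default_rng(0)
ts = rng.normal(size=(300, 4))            # simulated time courses, 4 regions
fc = sliding_window_fc(ts, window=60, step=30)
```

      Each subject yields one FC matrix per window, so the number of candidate features grows with both the number of windows and the number of edges, and the windows are not aligned across subjects at rest. This is one reason why turning such estimates into predictors requires further modelling choices.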

      (4) Did you correct for multiple comparisons across the various statistical tests?

      All statistical comparisons have been corrected for multiple comparisons. Please find the relevant text in Methods section 4.4.1.

      (5) Do we expect that behavioral traits are encapsulated in resting-state human brain dynamics, and on which brain areas mostly? Please, elaborate on this.

      While this is certainly an interesting question, our paper is a methodological contribution about how to predict from models of brain dynamics, rather than a basic science study about the relation between resting-state brain dynamics and behaviour. The biological aspects and interpretation of the specific brain-behaviour associations are a secondary point and out of scope for this paper. Our approach uses whole-brain dynamics, which does not require selecting brain areas of interest.

      Reviewer #2 (Recommendations For The Authors):

      Beyond the general principles included in the public review, here are a few additional pointers to minor issues that I would wish to see addressed.

      Introduction:

      - The term "brain dynamics" encompasses a broad spectrum of phenomena, not limited to those captured by state-space models. It includes various measures such as time-averaged connectivity and mean EEG power within specific frequency bands. To ensure clarity and relevance for a diverse readership, it would be beneficial to adopt a more inclusive and balanced approach to the terminology used.

      The reviewer rightly points out the ambiguity of the term “brain dynamics”, which we use in the interest of readability. The HMM is one of several possible descriptions of brain dynamics. We have now included a statement early in the introduction to narrow this down:

      Ll. 32-35:

      “… the patterns in which brain activity unfolds over time, i.e., brain dynamics. One way of describing brain dynamics are state-space models, which allow capturing recurring patterns of activity and functional connectivity (FC) across the whole brain.”

      And ll. 503-507:

      “Our proposed approach provides one avenue of addressing this by leveraging individual patterns of time-varying amplitude and FC, as one of many possible descriptions of brain dynamics, and it can be flexibly modified or extended to include, e.g., information about temporally recurring frequency patterns (Vidaurre et al., 2016).”

      Figures:

      - The font sizes across the figures, particularly in subpanels 2B and 2C, are quite small and may challenge readability. It is advisable to standardize the font sizes throughout all figures to enhance legibility.

      We have slightly increased the overall font sizes while continuing to follow the figure recommendations set out by Nature. The font sizes are consistent across all figures.

      - When presenting performance comparisons, a horizontal layout is often more intuitive for readers, as it aligns with the natural left-to-right reading direction. This is not just a personal preference; it is supported by visualization best practices as outlined in resources like the NVS Cheat Sheet (https://github.com/GraphicsPrinciples/CheatSheet/blob/master/NVSCheatSheet.pdf) and Kieran Healy's book (https://socviz.co/lookatdata.html).

      We have changed all figures to use a horizontal layout, hoping that this will ease visual comparison between the different models.

      - In the kernel density estimation (KDE) and violin plot representations, it appears that the data displays may be truncated. It is crucial to indicate where the data distribution ends. Overplotting individual data points could provide additional clarity.

      To avoid confusion about the data distribution in the violin plots, we have now overlaid scatter plots, as suggested by the reviewer. Overlaying the fold-level accuracies was not feasible (since this would result in ~1.5 million transparent points for a single figure), so we instead show the accuracies averaged over folds but separately for each target variable and CV iteration. Only the newly added coefficient of determination plots had to be truncated, which we have noted in the figure legend.

      - Figure 3 could inadvertently suggest that time-varying features correspond to panel A and time-averaged features to panel B. To avoid confusion, consider reorganizing the labels at the bottom into two rows for clearer attribution.

      We have changed the layout of the time-varying and time-averaged labels in the new version of the plots to avoid this issue.

      Discussion:

      - The discussion on multimodal modeling might give the impression that it is more effective with multiple kernel learning (MKL) than with other methods. To present a more balanced view, it would be appropriate to rephrase this section. For instance, stacking, examples of which are cited in the same paragraph, has been successfully applied in practice. The text could be adjusted to reflect that Fisher Kernels via MKL adds to the array of viable options for multimodal modeling. As a side thought: additionally, a well-designed comparison between MKL and stacking methods, conducted by experts in each domain, could greatly benefit the field. In certain scenarios, it might even be demonstrated that the two approaches converge, such as when using linear kernels.

      We would like to thank the reviewer for the suggestion about the discussion concerning multimodal modelling. We agree that there are other relevant methods that may lead to interesting future work and have now included stacking and refined the section: ll. 487-494:

      “While directly combining the features from each modality can be problematic, modality-specific kernels, such as the Fisher kernel for time-varying amplitude and/or FC, can be easily combined using approaches such as stacking (Breiman, 1996) or Multi Kernel Learning (MKL) (Gönen & Alpaydın, 2011). MKL can improve prediction accuracy of multimodal studies (Vaghari et al., 2022), and stacking has recently been shown to be a useful framework for combining static and time-varying FC predictions (Griffin et al., 2024). A detailed comparison of different multimodal prediction strategies including kernels for time-varying amplitude/FC may be the focus of future work.”
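
      As an illustration of how modality-specific kernels can be combined, the following is a minimal sketch under simplifying assumptions. It uses a fixed convex combination of two precomputed kernels followed by kernel ridge regression, whereas MKL would typically learn the kernel weights from the data. The helper names (`combine_kernels`, `kernel_ridge_fit`), the simulated features, and the regularisation value are hypothetical and not taken from the manuscript.

```python
import numpy as np

def combine_kernels(kernels, weights):
    """Convex combination of precomputed kernel matrices (weights fixed, not learned)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))

def kernel_ridge_fit(K_train, y, lam=1.0):
    """Dual coefficients of kernel ridge regression: (K + lam * I)^-1 y."""
    return np.linalg.solve(K_train + lam * np.eye(len(y)), y)

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))              # simulated features, modality 1
B = rng.normal(size=(20, 4))              # simulated features, modality 2
y = A[:, 0] + B[:, 1] + 0.1 * rng.normal(size=20)

K_a, K_b = A @ A.T, B @ B.T               # one linear kernel per modality
K = combine_kernels([K_a, K_b], [0.5, 0.5])
alpha = kernel_ridge_fit(K, y, lam=0.1)
y_hat = K @ alpha                         # in-sample predictions
```

      A weighted sum of valid kernels is itself a valid kernel, which is why kernels built from different modalities can be merged without concatenating the underlying features.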

      - The potential clinical applications of brain dynamics extend beyond diagnosis and individual outcome prediction. They play a significant role in the context of biomarkers, including pharmacodynamics, prognostic assessments, responder analysis, and other uses. The current discussion might be misinterpreted as being specific to hidden Markov model (HMM) approaches. For diagnostic purposes, where clinical assessment or established biomarkers are already available, the need for new models may be less pressing. It would be advantageous to reframe the discussion to emphasize the potential for gaining deeper insights into changes in brain activity that could indicate therapeutic effects or improvements not captured by structural brain measures. However, this forward-looking perspective is not the focus of the current work. A nuanced revision of this section is recommended to better reflect the breadth of applications.

      We appreciate the reviewer’s thoughtful suggestions regarding the discussion of potential clinical applications. We have included the suggestions and refined this section of the discussion: Ll. 495-507:

      “In a clinical context, while there are nowadays highly accurate biomarkers and prognostics for many diseases, others, such as psychiatric diseases, remain poorly understood, diagnosed, and treated. Here, improving the description of individual variability in brain measures may have potential benefits for a variety of clinical goals, e.g., to diagnose or predict individual patients’ outcomes, find biomarkers, or to deepen our understanding of changes in the brain related to treatment responses like drugs or non-pharmacological therapies (Marquand et al., 2016; Stephan et al., 2017; Wen et al., 2022; Wolfers et al., 2015). However, the focus so far has mostly been on static or structural information, leaving the potentially crucial information from brain dynamics untapped. Our proposed approach provides one avenue of addressing this by leveraging individual patterns of time-varying amplitude and FC, and it can be flexibly modified or extended to include, e.g., information about temporally recurring frequency patterns (Vidaurre et al., 2016).”

      Reviewer #3 (Recommendations For The Authors):

      - I wondered if the authors could provide, within the Introduction, an intuitive description for how the Fisher Kernel "preserves the structure of the underlying model of brain dynamics" / "preserves the mathematical structure of the underlying HMM"? Providing more background may help to motivate this study to a general audience.

      We agree that this would be helpful and have now added this to the introduction: Ll. 61-67:

      “Mathematically, the HMM parameters lie on a Riemannian manifold (the structure). This defines, for instance, the relation between parameters, such as: how changing one parameter, like the probabilities of transitioning from one state to another, would affect the fitting of other parameters, like the states’ FC. It also defines the relative importance of each parameter; for example, how a change of 0.1 in the transition probabilities would not be the same as a change of 0.1 in one edge of the states’ FC matrices.”

      To communicate the intuition behind the concept, the idea is also illustrated in Figure 1, panel 4, which shows Euclidean distances as straight lines through a curved surface (4a, Naïve kernel), as opposed to the tangent space projection onto the curved manifold (4b, Fisher kernel).
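
      The general idea of a Fisher kernel can be sketched with a toy example. The sketch below replaces the HMM with a one-dimensional Gaussian model, which is an assumption made purely for brevity, and uses the simplified ("practical") form of the kernel in which the Fisher information matrix is replaced by the identity. The function names and simulated data are hypothetical.

```python
import numpy as np

def fisher_scores(X, mu, sigma2):
    """Gradient of each subject's Gaussian log-likelihood w.r.t. (mu, sigma2).

    X: array of shape (n_subjects, n_timepoints).
    mu, sigma2: group-level parameters shared by all subjects.
    """
    resid = X - mu
    d_mu = resid.sum(axis=1) / sigma2
    d_sigma2 = (-0.5 / sigma2 + 0.5 * resid**2 / sigma2**2).sum(axis=1)
    return np.column_stack([d_mu, d_sigma2])

def linear_fisher_kernel(scores):
    """Inner products between subjects' Fisher scores (identity metric)."""
    return scores @ scores.T

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(5, 200))   # 5 simulated subjects
mu, sigma2 = X.mean(), X.var()                      # group-level ML estimates
scores = fisher_scores(X, mu, sigma2)
K = linear_fisher_kernel(scores)
```

      Each subject is represented by how the group-level parameters would need to change to better fit that subject's data, so subjects are compared in the tangent space of the model rather than through raw parameter values. In this toy case the scores sum to zero across subjects because the group parameters are the pooled maximum-likelihood estimates.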

      - Some clarifications regarding Figure 2a would be helpful. Was the linear Fisher Kernel significantly better than the linear Naive normalized kernel? I couldn't find whether this comparison was carried out. Apologies if I have missed it in the text. For some of the brackets indicating pairwise tests and their significance values, the start/endpoints of the bracket fall between two violins; in this case, were the results of the linear and Gaussian Fisher Kernels pooled together for this comparison?

      We have now streamlined the statistical comparisons and avoided plotting brackets falling between two violin plots. The comparisons that were carried out are stated in Methods section 4.4.1. Please also see our response above to Reviewer #3, public review, potential weaknesses, point 1; the relevant point is copied below:

      In conjunction with the new statistical approach (see Reviewer #2, comment 3), we have now streamlined the comparisons. We explained which comparisons were performed in the methods, ll. 880-890:

      “For the main results, we separately compare the linear Fisher kernel to the other linear kernels, and the Gaussian Fisher kernel to the other Gaussian kernels, as well as to each other. We also compare the linear Fisher kernel to all time-averaged methods. Finally, to test for the effect of tangent space projection for the time-averaged FC prediction, we also compare the Ridge regression model to the Ridge Regression in Riemannian space. To test for effects of removing sets of features, we use the approach described above to compare the kernels constructed from the full feature sets to their versions where features were removed or reduced. Finally, to test for effects of training the HMM either on all subjects or only on the subjects that were later used as training set, we compare each kernel to the corresponding kernel constructed from HMM parameters, where training and test set were kept separate”.

      - The authors may wish to include, in the Discussion, some remarks on the use of all subjects in fitting the group-level HMM and the implications for the cross-validation performance, and/or try some analysis to ensure that the effect is minor.

      As suggested by Reviewers #2 and #3, we have now performed this analysis and show that fitting the group-level HMM to all subjects, rather than only to the training subjects, has no effect on the results. Please see our response to Reviewer #2, public review, comment 2.

      - The decision to use k=6 states was made here, and I wondered if the authors may include some support for this choice (e.g., based on findings from prior studies)?

      We have now refined and extended our explanation and rationale behind the number of states: Ll. 586-594: “The number of states can be understood as the level of detail or granularity with which we describe the spatiotemporal patterns in the data, akin to a dimensionality reduction, where a small number of states will lead to a very general, coarse description and a large number of states will lead to a very detailed, fine-grained description. Here, we chose a small number of states, K=6, to ensure that the group-level HMM states are general enough to be found in all subjects, since a larger number of states increases the chances of certain states being present only in a subset of subjects. The exact number of states is less relevant in this context, since the same HMM estimation is used for all kernels.”

      - (minor) Abstract: "structural aspects" - do you mean structural connectivity?

      With “structural aspects”, we refer to the various measures of brain structure that are used in predictive modelling. We have now specified: Ll. 14-15: “structural aspects, such as structural connectivity or cortical thickness”.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      This study demonstrates a key role of oxLDL in enhancing Ang II-induced Gq signaling by promoting the AT1/LOX1 receptor complex formation. Importantly, Gq-mediated calcium influx was only observed in LOX1 and AT1 both expressing cells, and AT1-LOX1 interaction aggravated renal damage and dysfunction under the condition of a high-fat diet with Ang II infusion, so this study indicated a new therapeutic potential of AT1-LOX1 receptor complex in CKD patients with dyslipidemia and hypertension.

      Strengths:

      This study is very exciting and the work is also very detailed, especially regarding the mechanism of LOX1-AT1 receptor interaction and its impact on oxidative stress, fibrosis, and inflammation.

      Weaknesses:

      The direct evidence for the interaction between AT1 and LOX1 receptors in cell membrane localization is relatively weak. Here I raise some questions that may further improve the study.

      Major points:

      (1) The authors hypothesized that in the interaction of AT1/LOX1 receptor complex in response to ox-LDL and AngII, there should be strong evidence of fluorescence detection of colocalization for these two membrane receptors, both in vivo and in vitro. Although the video evidence for AT1 internalization upon complex activation is shown in Figure S1, the more important evidence should be membrane interaction and enhanced signal of intracellular calcium influx.

      Thank you for your valuable feedback. We agree that demonstrating the colocalization and interaction of AT1 and LOX-1 receptors at the membrane is critical to supporting our hypothesis.

      We previously provided visual evidence of membrane co-localization of the AT1/LOX-1 receptor complex using an in situ PLA assay with anti-FLAG and anti-V5 antibodies in CHO cells expressing FLAG-tagged AT1 and V5-tagged LOX-1 (Yamamoto et al., FASEB J 2015). This was further supported by immunoprecipitation of membrane proteins in CHO cells co-expressing LOX-1 and AT1, which confirmed the presence of the receptor complex. In the current study, we offer additional evidence of enhanced intracellular calcium influx following simultaneous stimulation with oxLDL and Ang II, confirming the functional activation of the AT1/LOX-1 receptor complex (Fig. 1g-j and Fig. 3e-h). Together, these findings provide substantial support for the colocalization of AT1 and LOX-1 and their influence on downstream signaling in our in vitro experiments.

      However, we acknowledge the limitation of direct evidence for membrane co-localization of LOX-1 and AT1 in vivo. This constraint is attributed to the fact that both available anti-AT1 and anti-LOX-1 antibodies are derived from rabbits, making co-immunofluorescence or PLA challenging in our study. To address this, we employed co-immunofluorescent staining with megalin, a well-established marker for proximal renal tubules, as shown in Fig. S10. We found that both AT1 and LOX-1 co-localized with megalin, particularly at the brush borders, indicating their presence in the same renal compartments relevant to AT1/LOX-1 signaling.

      We have revised the manuscript to highlight the functional evidence from calcium influx assays, supported by prior PLA results, demonstrating the interaction between LOX-1 and AT1. Additionally, we included a figure showing the co-localization of AT1 and LOX-1 with megalin in proximal renal tubules to reinforce these findings. Lastly, we have emphasized in the discussion the limitation regarding the lack of direct in vivo evidence for membrane co-localization of LOX-1 and AT1.

      (2) Co-IP experiment should be provided to prove the AT1/LOX1 receptor interaction in response to ox-LDL and AngII in AT1 and LOX1 both expressing cells but not in AT1 only expressing cells.

      We thank the reviewer for the insightful suggestion to validate the AT1/LOX1 receptor interaction under various stimulation conditions. In our previous study (Yamamoto et al., FASEB J 2015), we demonstrated the interaction between AT1 and LOX1 receptors through Co-IP and in situ PLA assays in cells overexpressing both receptors, without stimulation. These experiments provided solid evidence of the receptor interaction under static conditions at the cell membrane.

      However, as noted in the previous work, we did not perform Co-IP experiments under AngII or oxLDL stimulation. The primary reason for this is that both AngII and oxLDL trigger internalization of the AT1 and/or LOX1 receptors, which may complicate the detection of receptor interaction at the membrane via Co-IP. This is supported by our real-time imaging, which showed a reduction in AT1 and/or LOX1 puncta following stimulation, indicating internalization of the receptors (Fig. 2a).

      While we acknowledge the reviewer’s interest in investigating the interaction under AngII stimulation, we believe that the current data—especially from the PLA and Co-IP assays under static conditions—strongly support the interaction of AT1 and LOX1 receptors at the membrane.

      (3) The authors mentioned that the Gq signaling-mediated calcium influx may change gene expression and cellular characteristics, including EMT and cell proliferation. They also provided evidence that oxidative stress, fibrosis, and inflammation were all enhanced after activating both receptors and inhibiting Gq was effective in reversing these changes. However, single stimulation with ox-LDL or AngII also has strong effects on ROS production, inflammation, and cell EMT, which has been extensively proved by previous studies. So, how to distinguish the biased effect of LOX1 or AT1r alone or the enhanced effect of receptor conformational changes mediated by their receptor interaction? Is there any better evidence to elucidate this point?

      Thank you for raising this important point regarding the distinction between the individual effects of LOX-1 or AT1R activation and the enhanced effects mediated by their interaction. In our study, the concentration of oxLDL used (2–10 μg/ml) was significantly lower than concentrations typically employed in other studies (which often exceed 20 μg/ml). As a result, oxLDL alone produced minimal effects, aside from a reduction in cell proliferation observed in the BrdU assay. This suggests that oxLDL, at the concentrations used in our experiments, does not elicit a strong cellular response on its own.

      The key to distinguishing the effect of the LOX-1/AT1 interaction from the individual effects of either receptor lies in the amplification of Gq signaling, a pathway specifically activated by AngII. In our experiments, co-treatment with oxLDL and AngII led to a significant increase in IP1 levels and calcium influx, both critical indicators of Gq signaling activation. While AngII alone also raised IP1 levels, the combined treatment with oxLDL further amplified the Gq signaling response, as reflected in the enhanced calcium influx. Importantly, oxLDL alone did not alter IP1 levels, even at high concentrations (100 μg/ml) (Takahashi et al., iScience 2021).

      This enhancement of Gq signaling provides strong evidence of the synergistic interaction between LOX-1 and AT1, which surpasses the individual effects of either receptor alone. The LOX-1/AT1 interaction is thus crucial for the observed amplification of AngII-specific signaling pathways. The combination of increased IP1 levels and calcium influx serves as compelling evidence of this interaction, clearly differentiating the effects of individual receptor activation from the enhanced response driven by receptor conformational changes and interaction.

      Thank you again for your insightful comment, which has helped us to better articulate the significance of receptor interaction in this study.

      (4) How does the interaction between AT1 and LOX1 affect the RAS system and blood pressure? What about the serum levels of renin, angiotensin, and aldosterone in ND-fed or HFD-fed mice?

      Thank you for your insightful question regarding the effects of AT1 and LOX-1 interaction on the renin-angiotensin system (RAS) and blood pressure, as well as the plasma levels of renin, angiotensin, and aldosterone in normal diet (ND)-fed and high-fat diet (HFD)-fed mice.

      OxLDL binds to LOX-1, amplifying AT1 receptor activation and Gq signaling, which enhances the effects of Ang II. This interaction between AT1 and LOX-1 can lead to increased vasoconstriction, oxidative stress, and inflammation, which contribute to elevated blood pressure. This pathway may play a crucial role in modulating the RAS, particularly under conditions of elevated oxLDL, such as those induced by an HFD. Regarding the components of the RAS, we focused on plasma aldosterone levels, as this is a direct consequence of Ang II signaling. As shown in Fig. S7, when mice were treated with a pressor dose of Ang II infusion and subjected to an HFD to elevate oxLDL levels, we did not observe a significant increase in plasma aldosterone levels (102.8 ± 11.6 pg/mL vs. 141.8 ± 15.0 pg/mL, P = 0.081).

      In terms of blood pressure, Fig. 7b shows that no significant changes were observed under these treatment conditions, despite the AT1/LOX-1 interaction. These findings suggest that while oxLDL, via the AT1/LOX-1 interaction, can enhance Ang II signaling, its effect on blood pressure was not apparent in our study. This may be due to several factors, including heterogeneous cellular responses to the combined treatment across different cell types, as shown by the lack of reaction in vascular endothelial cells, vascular smooth muscle cells, and macrophages (Fig. S2). This may also be attributed to the high concentration of angiotensin II used in this study, which could have saturated aldosterone production under our experimental conditions. We have revised the manuscript to reflect these points. 

      Thank you again for your thoughtful comment, which has allowed us to expand and refine the discussion on this important aspect of our study.

      Reviewer #2 (Public Review):

      (1) Individuals with chronic kidney disease often have dyslipidemia, with the latter both a risk factor for atherosclerotic heart disease and a contributor to progressive kidney disease. Prior studies suggest that oxidized LDL (oxLDL) may cause renal injury through the activation of the LOX1 receptor. The authors had previously reported that LOX1 and AT1 interact to form a complex at the cell surface. In this study, the authors hypothesize that oxLDL, in the setting of angiotensin II, is responsible for driving renal injury by inducing a more pronounced conformational change of the AT1 receptor which results in enhanced Gq signaling.

      They go about testing the hypothesis in a set of three studies. In the first set, they engineered CHO cell lines to express AT1R alone, LOX1 in combination with AT1R, or LOX1 with an inactive form of AT1R and indirectly evaluated Gq activity using IP1 and calcium activity as read-outs. They assessed activity after treatment with AngII, oxLDL, or both in combination and found that treatment with both agents resulted in the greatest level of activity, which could be effectively blocked by a Gq inhibitor but not a Gi inhibitor nor a downstream Rho kinase inhibitor targeting G12/13 signaling. These results support their hypothesis, though variability in the level of activation was dramatically inconsistent from experiment to experiment, differing by as much as 20-fold. In contrast, within the experiment, differences between the AngII and AngII/oxLDL treatments, while nominally significant and consistent with their hypothesis, generally were only 10-20%. Another example of unexplained variability can be found in Figures 1g-1j. AngII, at a concentration of 10-12, has no effect on calcium flux in one set of studies (Figure 1g, h) yet has induced calcium activity to a level as great as AngII + oxLDL in another (Figure 1i). The inconsistency of results lessens confidence in the significance of these findings. In other studies with the LOX1-CHO line, they tested for conformational change by transducing AT1 biosensors previously shown to respond to AngII and found that one of them in fact showed enhanced BRET in the setting of oxLDL and AngII compared to AngII alone, which was blocked by an antibody to AT1R. The result is supportive of their conclusions. Limiting enthusiasm for these results is the fact that there isn't a good explanation as to why only 1 sensor showed a difference, and the study should have included a non-specific antibody to control for non-specific effects.

      We sincerely appreciate the reviewer’s thorough and insightful feedback, especially regarding the variability observed in our experimental results. As the reviewer pointed out, the differences in activation levels between the calcium influx assay and the IP1 assay, particularly between AngII and AngII/oxLDL co-treatment, were indeed significant. These differences can be attributed to the inherent sensitivity of these assays, which are used to indirectly evaluate Gq activity. Despite the variability, we believe that the reliability of our results is supported by the consistent directional trends across both assays, which align with our hypothesis.

      Regarding the inconsistencies in intracellular calcium dynamics observed in Fig. 1i, we have performed additional analysis of calcium kinetics during ligand stimulation, similar to the analysis in Fig. 1g. As shown in Author response image 1, the background signal in the experiment related to Fig. 1i was relatively higher than in Fig. 1g and 1h. This elevated background, which may have been influenced by variations between cells and experimental days, resulted in a higher percent change from baseline in samples treated with AngII alone. However, the combined effect of AngII with oxLDL was still apparent. This clarification further supports the consistency of our findings.

      Author response image 1.

      In reference to the BRET sensor experiments, we acknowledge the reviewer’s concern regarding the variability in sensor responses. As outlined in Devost et al. (J Biol Chem. 2017), the sensitivity of AT1 intramolecular FlAsH-BRET biosensors in detecting conformational changes induced by AngII is highly dependent on the insertion site of the FlAsH sequence. In our experiments, co-treatment with oxLDL and AngII enhanced AT1 conformational changes, but this effect was only detectable with the CHO-LOX-1-AT1-3p3 sensor (with FlAsH inserted in the third intracellular loop), and not with the CHO-LOX-1-AT1-C-tail P1 sensor (with FlAsH inserted at the C-terminal tail). This differential sensitivity likely explains why only one sensor showed a significant response, highlighting the critical role of FlAsH insertion site selection in these assays. We hope these clarifications address the reviewer’s concerns and improve confidence in the significance of our findings.

      (2) The authors then repeated similar studies using publicly available rat kidney epithelial and fibroblast cell lines that have an endogenous expression of AT1R and LOX1. In these studies, oxLDL in combination with AngII also enhanced Gq signaling, while knocking down either AT1R or LOX1, and treatment with inhibitors of Gq and AT1R blocked the effects. Like the prior set of studies, however, the effects are very modest and there was significant inter-experimental variability, reducing confidence in the significance of the findings. The authors then tested for evidence that the enhanced Gq signaling could result in renal injury by comparing qPCR results for target genes. While the results show some changes, their significance is difficult to assess. A more global assessment of gene expression patterns would have been more appropriate. In parallel with the transcriptional studies, they tested for evidence of epithelial-mesenchymal transition (EMT) using a single protein marker (alpha-smooth muscle actin) and found that its expression increased significantly in cells treated with oxLDL and AngII, which was blocked by inhibition of Gq and AT1R. While the data are sound, their significance is also unclear since EMT is a highly controversial cell culture phenomenon. Compelling in vivo studies have shown that most if not all fibroblasts in the kidney are derived from interstitial cells and not a product of EMT. In the last set of studies using these cell lines, the authors examined the effects of AngII and oxLDL on cell proliferation as assayed using BrdU. These results are puzzling: while the two agents together enhanced proliferation which was effectively blocked by an inhibitor to either AT1R or Gq, silencing of LOX1 had no effect.

      Thank you for your thorough review and comments. We acknowledge your concerns regarding the modest effects observed and the variability in experimental outcomes. We would like to address your points systematically.

      (1) Gq signaling and experimental variability:

      Regarding the question of Gq signaling in Fig. 3, as previously mentioned, the observed differences in the IP1 assay are likely due to the sensitivity of the assay and the technical issues associated with detecting calcium influx and IP1 levels. While the overall differences between treatments may appear modest, the most critical comparison, between AngII alone and AngII combined with oxLDL, consistently showed significant differences, which aligns with the calcium influx results shown in Fig. 1. Notably, we found that the EC50 for IP1 production decreased by 80% in response to co-treatment with oxLDL and AngII, compared to AngII treatment alone. These findings demonstrate the robustness of Gq signaling enhancement with co-treatment, even if the absolute differences in the IP1 assay appear small.

      (2) Gene expression in Fig. 4:

      Regarding the gene expression analysis in Fig. 4, we used relatively low concentrations of oxLDL (5 μg/ml) compared to the higher concentrations typically employed in other studies (mostly exceeding 20 μg/ml). This may explain the lack of robust responses in some conditions. However, in combination with AngII, the co-treatment significantly upregulated several genes, particularly pro-inflammatory markers such as IL-6, TNFα, IL1β, and MCP-1 in NRK49F cells. These results suggest that the co-treatment induces a complex response, potentially activating multiple downstream signaling pathways beyond just Gq signaling, which may obscure more straightforward effects.

      While we agree that a more global assessment of gene expression would provide further insights, due to cost constraints, we focused on key representative genes that are highly relevant to inflammation and fibrosis in this study.

      (3) EMT in renal fibrosis:

      We appreciate the reviewer’s insightful comments regarding the role of EMT in renal fibrosis. Regarding full EMT, in which epithelial cells completely transition into mesenchymal cells, previous studies using the unilateral ureteral obstruction (UUO) model suggest that full EMT may not play a significant role (J Clin Invest. 2011 Feb;121(2):468-74). The role of full EMT remains controversial in the context of renal fibrosis, with most kidney fibroblasts thought to originate from interstitial cells rather than through full EMT.

      Recent studies, however, suggest that partial epithelial-mesenchymal transition (pEMT) could be involved in CKD, especially in association with inflammation, oxidative stress, and elevated TGF-β levels—conditions also present in our model involving Ang II infusion combined with an HFD. pEMT refers to a state in which epithelial cells acquire mesenchymal traits, such as increased α-SMA expression and secretion of pro-fibrotic cytokines, while remaining attached to the basement membrane without fully transitioning into fibroblasts (Front Physiol. 2020 Sep 15;11:569322). This phenomenon has been observed in kidney fibrosis models, including UUO, which shares inflammatory and oxidative stress conditions with our Ang II and HFD treatment model. The observed increase in α-SMA in our model may thus indicate a pEMT-like state, indirectly contributing to fibrosis through the secretion of growth factors and cytokines.

      We are mindful of the importance of not overstating EMT's role. Accordingly, we interpret increased α-SMA expression as a potential marker of the pEMT process rather than definitive evidence of its presence or direct role in fibroblast formation. Furthermore, we acknowledge limitations in providing direct in vivo evidence for pEMT and recognize that further mechanistic studies are needed to elucidate its specific role in renal fibrosis, despite inherent challenges.

      In response to the reviewer’s concern, we have revised the manuscript to clarify that our data support the possibility of pEMT contributing to fibrosis in this model, without overstating its impact. We also acknowledge the challenges in translating in vitro pEMT findings to in vivo models, where detecting the subtle effects of pEMT is inherently challenging.

      (4) BrdU assay and fibroblast proliferation (Fig. 6b):

      In Fig. 6b, the BrdU assay shows that fibroblast proliferation was significantly enhanced by the co-treatment with AngII and oxLDL, and this effect was abolished by LOX-1 knockdown, similar to the results observed with AT1 knockdown. These findings strongly suggest a combinatorial effect of AT1/LOX-1 interaction in promoting fibroblast proliferation, supporting the idea that the co-treatment operates through a coordinated mechanism involving both receptors. Notably, LOX-1 silencing did not affect the proliferation induced by AngII alone, as this response is independent of LOX-1.

      We will incorporate these points into the Discussion section of the manuscript, specifically regarding the differences in sensitivity between the Ca influx and IP1 assays, as well as the emerging role of partial EMT in renal fibrosis. This will provide a clearer context for the interpretation of our findings and further strengthen the discussion on the significance of these phenomena.

      Thank you again for your valuable feedback, which has helped us improve the clarity and depth of our manuscript.

      (3) The final set of studies looked to test the hypothesis in mice by treating WT and Lox1-KO mice with different doses of AngII and either a normal or high-fat diet (to induce oxLDL formation). The authors found that the combination of high-dose AngII and a high-fat diet (HFD) increased markers of renal injury (urinary 8-OHdG and urine albumin) in normal mice compared to mice treated with just AngII or HFD alone, which was blunted in Lox1-KO mice. These results are consistent with their hypothesis. However, there are other aspects of these studies that are either inconsistent or complicating factors that limit the strength of the conclusions. For example, Lox1-KO had no effect on renal injury marker expression in mice treated with low-dose AngII and HFD. It also should be noted that Lox1-KO mice had a lower BP response to AngII, which could have reduced renal injury independent of any effects mediated by the AT1R/LOX1 interaction. Another confounding factor was the significant effect the HFD had on body weight. While the groups did not differ based on AngII treatment status, the HFD consistently was associated with lower total body weight, which is unexplained. Next, the authors sought to find more direct evidence of renal injury using qPCR of candidate genes and renal histology. The transcriptional results are difficult to interpret; moreover, there were no significant histologic differences between groups. They conclude the study by showing the pattern of expression of LOX1 and AT1R in the kidney by immunofluorescence and conclude that the proteins overlap in renal tubules and are absent from the glomerulus. Unfortunately, they did not co-stain with any other markers to identify the specific cell types. However, these results are inconsistent with other studies that show AT1R is highly expressed in mesangial cells, renal interstitial cells, near the vascular pole, JG cells, and proximal tubules but generally absent from most other renal tubule segments.

      Thank you for your valuable comments and for raising these important points. We appreciate the opportunity to clarify several aspects of our study and address the limitations and inconsistencies you have pointed out.

      (1) Renal injury markers (urinary albumin and 8-OHdG) and the effect of LOX-1 loss-of-function:

      Our results showed that the combination of high-dose AngII and HFD led to a significant increase in renal injury markers, such as urinary albumin and 8-OHdG, in WT mice. In LOX-1 KO mice, this increase was significantly blunted, supporting a protective role of LOX-1 loss-of-function. However, as you noted, at low-dose AngII, there was no significant difference in urinary 8-OHdG between ND-fed and HFD-fed mice. Despite this, we observed a significant increase in urinary albumin in HFD-fed WT mice compared to ND-fed mice under low-dose AngII, and this difference was abolished in LOX-1 KO mice. Moreover, gene expression analysis showed that oxidative stress markers such as p67phox and p91phox (Fig. 8b), as well as p40phox, p47phox (Fig. S8), and inflammatory markers like IL1β (Fig. 8b), were significantly elevated in HFD-fed WT mice even with low-dose AngII, while these increases were absent in LOX-1 KO mice. These results suggest that the LOX-1/AT1 interaction contributes to renal injury under both low- and high-dose AngII conditions.

      We acknowledge that the treatment duration may have influenced our results, as urine and renal tissue samples were only examined at a single time point (1.5 months after treatment initiation). The impact of AT1/LOX-1 interaction may evolve over time, and different treatment durations might yield varying outcomes. This is a limitation of our study, which we have addressed in the revised manuscript.

      (2) Blood pressure and its effect on renal injury:

      As shown in Fig. 7b and Fig. S6f, LOX-1 KO mice exhibited a lower blood pressure response to high-dose AngII compared to WT mice, which could indeed have contributed to the reduced renal injury in the LOX-1 KO group, independent of the AT1/LOX-1 interaction. However, it is important to note that the differences in renal injury markers between AngII alone and AngII + HFD were largely abolished in LOX-1 KO mice, suggesting the in vivo relevance of the LOX-1/AT1 interaction observed in vitro. Additionally, as shown in Fig. 7d (urinary albumin), Fig. 8b (p67phox, p91phox), and Fig. S8b (p40phox, p47phox), even under subpressor doses of AngII, where no significant blood pressure differences were observed, HFD-fed WT mice exhibited exacerbated renal injury compared to ND-fed mice. These effects were ameliorated in LOX-1 KO mice, indicating that the protective effects in LOX-1 KO mice are at least partly independent of blood pressure changes and that the AT1/LOX-1 interaction plays a significant role in modulating renal injury under co-treatment with AngII and HFD.

      (3) HFD and body weight changes:

      We agree with your observation regarding the effect of HFD on body weight, which was consistently lower in HFD-fed groups, despite no differences in AngII treatment status. This is atypical compared to previous studies, most of which show increased body weight with HFD feeding. The HFD used in this study was intended to elevate oxLDL levels, as previously reported (Atherosclerosis 200:303–309 (2008)). As shown in Fig. S6d and S6e, the lower body weight can be attributed to reduced food intake in HFD-fed mice. Although modest, this weight reduction may influence renal function. We have added this point to the limitations.

      (4) Histological findings and qPCR results:

      As discussed in the manuscript, despite significant changes in urinary markers and gene expression, we did not observe histological evidence of fibrosis or mesangial expansion, even under co-treatment with AngII and HFD. This may be due to the relatively short treatment period of 4 weeks, and a longer duration might be necessary to detect such changes. Additionally, we acknowledge that we did not detect increased Gq signaling in kidney tissue, which is another limitation of the study. Nevertheless, the gene expression data on oxidative stress, fibrosis, inflammation, and renal injury markers (e.g., p67phox, IL1β) are consistent with our hypothesis that the AT1/LOX-1 interaction exacerbates renal injury under AngII and HFD conditions.

      (5) Immunostaining for AT1 and LOX-1:

      Due to the use of rabbit-derived antibodies for both AT1 and LOX-1, it was technically not feasible to perform co-immunostaining for both receptors simultaneously. Instead, we performed co-immunofluorescent staining using megalin, a well-established marker of proximal renal tubules, to help localize these receptors. As shown in Fig. S10, both AT1 and LOX-1 were co-localized with megalin, particularly at the brush borders of proximal tubules. This pattern suggests the presence of these receptors in renal compartments relevant to AT1/LOX-1 signaling. While we did not perform additional co-staining with other markers to identify specific cell types, the strong localization with megalin provides robust evidence of their expression in proximal renal tubules, which is consistent with the literature on AT1R in this nephron segment. We acknowledge that previous studies have identified AT1R expression in mesangial cells, renal interstitial cells, the vascular pole, juxtaglomerular (JG) cells, and proximal tubules. In our immunofluorescence experiments, we did not detect significant AT1R expression in the glomerulus or mesangium. This finding aligns with other reports showing strong expression of AT1R in proximal tubules (Am J Physiol Renal Physiol. 2021 Apr 1;320(4)), although it does not exclude the possibility of AT1 expression in other compartments, given the sensitivity limitations of the immunofluorescence. Our focus on proximal tubules allowed us to observe clear AT1/LOX-1 co-localization in this region, particularly in the context of oxLDL and AngII signaling. Given that the AT1/LOX-1 interaction is crucial in kidney disease pathogenesis, this co-localization in proximal tubules highlights a key site of action for these receptors in the renal system.

      In summary, while our study focused on the co-localization of AT1 and LOX-1 in proximal tubules, we agree that further exploration of AT1R expression in other renal cell types would provide a more comprehensive understanding of its role across different kidney compartments. We have addressed this in the revised discussion.

      Reviewer #1 (Recommendations For The Authors):

      Minor points:

      (1) In this study, AT1/LOX1 receptor complex was mainly observed in some renal cells, how about other types of cells that also highly express LOX1 and AT1r? Such as cardiomyocytes? Vascular endothelial cells?

      Thank you for your insightful comment. In our study, we demonstrated that enhanced Gq signaling through co-treatment with AngII and oxLDL was not observed in other cell types, including vascular endothelial cells, smooth muscle cells, and macrophages, as indicated by the lack of an IP1 increase in response to the co-treatment (Fig. S2). The factors contributing to this heterogeneous response remain unclear, and further investigation is needed to explore this observation more thoroughly.

      (2) Has the author detected such an effect on the AT2 receptor?

      We greatly appreciate the reviewer’s insightful inquiry regarding the potential interaction between the AT2 receptor and LOX-1. In our previous work (Yamamoto et al., FASEB J 2015), we conducted an immunoprecipitation (IP) assay to investigate the interaction between LOX-1 and AT2 on cell membranes. The results of this assay demonstrated that, unlike AT1, LOX-1 exhibits minimal binding to the AT2 receptor under the experimental conditions tested. Specifically, our IP studies showed that while LOX-1 readily co-immunoprecipitated with AT1, indicating a strong interaction, this was not the case with AT2, where the binding was negligible. These findings suggest that the interaction between LOX-1 and AT1 is receptor-specific and that LOX-1 does not significantly associate with AT2 to influence signaling pathways.

      (3) Which kind of ARBs are more effective for the inhibition of this AT1/LOX1 receptor conformational change?

      Thank you for your insightful question regarding the effectiveness of ARBs in inhibiting the AT1/LOX-1 receptor conformational change. Based on our current understanding, any ARB should similarly block the downstream signaling resulting from the interaction between AT1 and LOX-1. This is because all ARBs function by inhibiting the binding of Ang II to AT1, thereby preventing receptor activation and the conformational changes that facilitate its interaction with LOX-1. Additionally, our previous study (FASEB J. 2015) demonstrated that even in the absence of Ang II, the activation of AT1 via the binding of oxLDL to LOX-1 was similarly blocked by ARBs, including olmesartan, telmisartan, valsartan, and losartan.

      When oxLDL and Ang II are co-treated, the Gq signaling pathway is significantly amplified due to the interaction between LOX-1 and AT1. In this setting, all ARBs act by competitively inhibiting Ang II binding to AT1, effectively reducing Gq signaling. 

      However, a subtle but important difference arises when considering the inverse agonist activity of certain ARBs. Olmesartan, telmisartan, and valsartan are thought to act not only as competitive inhibitors of Ang II but also as inverse agonists, meaning they reduce the baseline activity of the AT1 receptor by preventing the conformational changes in the absence of Ang II. This inverse agonist property is particularly relevant in pathological conditions where AT1 receptor activation can occur independently of Ang II binding, such as in the presence of oxLDL. In these cases, ARBs with inverse agonist activity may offer an additional therapeutic advantage by reducing receptor activation beyond what is achieved by simple antagonism.

      Thus, while the general efficacy of ARBs in blocking the AT1/LOX-1 interaction could be similar under conditions of oxLDL and Ang II co-treatment, ARBs with inverse agonist properties may provide additional benefit by further reducing AT1 activity.

      We have revised the manuscript to clarify these points and to highlight the role of inverse agonist activity in ARB efficacy under these conditions.

      Thank you again for your valuable comment, which has allowed us to refine our discussion on the relative efficacy of ARBs in inhibiting AT1/LOX-1 receptor interaction.

      Reviewer #2 (Recommendations For The Authors):

      My comments were pretty thorough in the public review. The only other comments I would add are the following:

      (1) Why are there so few overlapping LOX1 and ATR puncta in Supplementary Figure 1 if the receptors co-localize? The figure would suggest a very small proportion of the receptors actually are co-localized.

      Thank you for your insightful comment regarding the apparent scarcity of overlapping LOX-1 and AT1R puncta in Fig. S1. We agree that at first glance, the low number of colocalized puncta may raise questions about the extent of interaction between these receptors. However, based on our previous findings reported in FASEB J 2015, we believe this phenomenon can be explained by the dynamic nature of the LOX-1 and AT1 interaction.

      As we reported in FASEB J 2015, the interaction between LOX-1 and AT1 is sensitive to buffer conditions. Specifically, in non-reducing conditions, LOX-1 and AT1 form complexes, whereas in reducing buffer, this interaction is not observed. This suggests that the interaction between these receptors is not stabilized by strong covalent (disulfide) bonds but is instead transient, likely involving non-covalent interactions. Thus, LOX-1 and AT1 may form and dissociate repeatedly, contributing to a dynamic receptor complex rather than a permanent colocalization. This transient interaction could explain the relatively low number of overlapping puncta observed at a given time point in the live-imaging analysis.

      Moreover, as you pointed out, it is likely that only a small fraction of LOX-1 and AT1 are physically co-localized at any one moment. However, when these receptors do interact, co-treatment with oxLDL and Ang II has been shown to significantly enhance Gq signaling. This suggests that the functional consequence of the LOX-1/AT1 interaction, particularly in response to stimuli such as oxLDL and Ang II, is more critical than the frequency of receptor colocalization at any one time.

      We have revised the manuscript to include this explanation and to clarify the dynamic nature of the LOX-1/AT1 interaction. This revision also highlights the importance of considering not just the number of colocalized receptors but also the functional outcomes of their interaction, such as enhanced Gq signaling in response to co-treatment.

      Thank you again for your careful observation, which has allowed us to better communicate the complexity of the receptor dynamics in our study.

      (2) Tubulin is misspelled in Figure 5 ("tublin").

      Thank you for pointing out the typographical error in Fig. 5. We have corrected the spelling of "tubulin" in the revised figure. We appreciate your attention to detail, and we apologize for the oversight.

      (3) Why does the number of replicates differ for some experimental sets (i.e. Figure 1h vs other panels in Figure 1, Figure 2d vs other panels in Figure 2, Figure 7: Lox-1KO treated with High dose AngII and HFD)? There aren't obvious reasons why the number of replicates should differ so much within a set of studies.

      We are grateful to the reviewer for highlighting the discrepancies in the number of replicates across different figures in our manuscript. We would like to provide detailed explanations for each case.

      (1) Fig. 1h vs Other Panels in Fig. 1:

      The calcium influx assay (Fig. 1h) required a higher number of replicates due to the inherent biological variability associated with calcium signaling. To achieve statistical significance and account for variability in these measurements, we conducted additional replicates. Other panels, such as those measuring IP1 accumulation (Fig. 1a–f), displayed more consistent and reproducible results, allowing us to use fewer replicates while still maintaining statistical power.

      (2) Fig. 2d vs Fig. 2b and 2c: 

      The difference in the number of replicates between Fig. 2d (N=8) and Fig. 2b and 2c (N=4) is due to the distinct nature of the measurements and the variability expected in each assay. In Fig. 2d, which measures the effects of a LOX-1 neutralizing antibody on BRET, additional replicates were needed to ensure the robustness of the statistical analysis due to the greater complexity and sensitivity of the assay. The inclusion of an antibody treatment introduces more variability, necessitating a higher number of replicates (N=8) to confidently assess the effects of the neutralizing antibody. In contrast, Fig. 2b and 2c involved BRET measurements of AT1 conformational changes without antibody intervention. These assays are more reproducible and have less experimental variability, allowing for a smaller sample size (N=4) while still achieving reliable and statistically significant results. The differences in sample size across these panels were carefully considered to ensure appropriate statistical power for each specific experimental condition.

      (3) Fig. 7: LOX-1 KO Mice Treated with High-dose AngII vs Saline:

      We acknowledge the reviewer’s concern regarding the higher number of LOX-1 KO mice treated with high-dose Ang II compared to the saline group. The number of saline-treated mice was indeed sufficient for reliable statistical analysis. However, the decision to increase the number of mice in the high-dose Ang II group was driven by the anticipated higher variability in the physiological responses under these conditions, such as blood pressure and renal injury. To ensure that we captured the full spectrum of responses and to maintain robust statistical power in the high-dose group, we opted to include more mice in this cohort. 

      We hope this response provides clarity on the rationale behind the varying number of replicates across different experiments. We have rigorously applied appropriate statistical methods to account for these differences, ensuring that the conclusions drawn are robust and scientifically sound. We appreciate the reviewer’s understanding of the experimental constraints and variations that can arise in complex studies such as these.
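
      The general relationship between assay variability and the number of replicates invoked above can be illustrated with a minimal sketch. It assumes a two-sided, two-sample comparison of means under a normal approximation. This is a textbook formula, not the statistical procedure used in the study, and the standard deviations and effect size below are hypothetical placeholders.

```python
from math import ceil
from statistics import NormalDist


def replicates_per_group(sd, effect, alpha=0.05, power=0.8):
    """Approximate replicates per group for a two-sided, two-sample comparison
    of means (normal approximation):
        n = 2 * ((z_{1 - alpha/2} + z_{power}) * sd / effect) ** 2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) * sd / effect) ** 2)


# Hypothetical values: the same expected effect, two different assay variabilities.
n_low_variability = replicates_per_group(sd=1.0, effect=2.0)
n_high_variability = replicates_per_group(sd=1.5, effect=2.0)
```

      With these placeholder inputs the function returns 4 and 9 replicates per group, respectively: a 1.5-fold increase in standard deviation more than doubles the required sample size, because n scales with the square of the variability. The normal approximation tends to underestimate n for very small groups, so a t-based calculation would be more conservative.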

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Joint Public Review: 

      The molecular mechanisms that mediate the regulated exocytosis of neuropeptides and neurotrophins from neurons via large dense-core vesicles (LDCVs) are still incompletely understood. Motivated by their earlier discovery that the Rab3-RIM1 pathway is essential for neuronal LDCV exocytosis, the authors now examined the role of the Rab3 effector Rabphilin-3A in neuronal LDCV secretion. Based on multiple live and confocal imaging approaches, the authors provide evidence for a synaptic enrichment of Rabphilin-3A and for independent trafficking of Rabphilin-3A and LDCVs. Using an elegant NPY-pHluorin imaging approach, they show that genetic deletion of Rabphilin-3A causes an increase in electrically triggered LDCV fusion events and increased neurite length. Finally, knock-out-replacement studies, involving Rabphilin-3A mutants deficient in either Rab3- or SNAP25-binding, indicate that the synaptic enrichment of Rabphilin-3A depends on its Rab3 binding ability, while its ability to bind to SNAP25 is required for its effects on LDCV secretion and neurite development. The authors conclude that Rabphilin-3A negatively regulates LDCV exocytosis and propose that this mechanism also affects neurite growth, e.g. by limiting neurotrophin secretion. These are important findings that advance our mechanistic understanding of neuronal large dense-core vesicle (LDCV) secretion. 

      The major strengths of the present paper are: 

      (i) The use of a powerful Rabphilin-3A KO mouse model. 

      (ii) Stringent lentiviral expression and rescue approaches as a strong genetic foundation of the study. 

      (iii) An elegant FRAP imaging approach. 

      (iv) A cutting-edge NPY-pHluorin-based imaging approach to detect LDCV fusion events. 

      We thank the reviewers for their positive evaluation of our manuscript.

      Weaknesses that somewhat limit the convincingness of the evidence provided and the corresponding conclusions include the following: 

      (i) The limited resolution of the various imaging approaches introduces ambiguity to several parameters (e.g. LDCV counts, definition of synaptic localization, Rabphilin-3A-LDCV colocalization, subcellular and subsynaptic localization of expressed proteins, AZ proximity of Rabphilin-3A and LDCVs) and thereby limits the reliability of corresponding conclusions. Super-resolution approaches may be required here. 

      We thank the reviewer for their constructive suggestion. We fully agree that super-resolution imaging would produce a more precise localization of RPH3A and co-localization with DCVs. We have now repeated our (co)-localization experiments with STED microscopy. We find that RPH3A colocalized with the pre-synaptic marker Synapsin1 and, to a lesser extent, with the post-synaptic marker Homer and the DCV marker chromogranin B (new Figure 1). This indicates that RPH3A is highly enriched in synapses, mostly the pre-synapse, and that RPH3A partly co-localizes with DCVs.

      (ii) The description of the experimental approaches lacks detail in several places, thus complicating a stringent assessment. 

      We apologize for the lack of detail in explaining the experimental approaches. We have included a more detailed description in the revised manuscript. 

      (iii) Further analyses of the LDCV secretion data (e.g. latency, release time course) would be important in order to help pinpoint the secretory step affected by Rabphilin-3A. 

      We agree. To address this comment, we have now included the duration of the fusion events (new Figure S2D-F). The start times of the fusion events are shown in the cumulative plots (now Figure 3F and I). The kinetics are normal in the RPH3A KO neurons.

      (iv) It remains unclear why a process that affects a general synaptic SNARE fusion protein - SNAP25 - would specifically affect LDCV but not synaptic vesicle fusion. 

      We agree that we have not addressed this issue systematically enough in the original manuscript. We have now added a short discussion on this topic in the Discussion of the revised manuscript (p 15, line 380-386). In brief, we do not claim full selectivity for the DCV pathway. Some effects of RPH3A deficiency on the synaptic vesicle cycle have been observed. Furthermore, because DCVs typically do not mix in the synaptic vesicle cluster and fuse outside the active zone (and outside the synapse), DCVs might be more accessible to RPH3A regulation.

      (v) The mechanistic links between Rabphilin-3A function, LDCV density in neurites, neurite outgrowth, and the proposed underlying mechanisms involving trophic factor release remain unclear. 

      We agree that we have not addressed all these links systematically enough in the original manuscript, although we feel that we have at least postulated the best possible working model to link RPH3A function to DCV exocytosis/neurotrophic factor release and neurite outgrowth (p 15-16, line 396-400). Of course, a single study cannot support all these links with sufficient experimental evidence. We have now added a short text on what we can conclude exactly based on our experiments and how we see the links between RPH3A function, DCV exocytosis/neurotrophic factor release, neurite outgrowth and DCV density in neurites (p 13-14, line 317-325).

      Reviewer #1 (Public Review): 

      Summary:

      The manuscript by Hoogstraaten et al. investigates the effect of constitutive Rabphilin 3A (RPH3A) ko on the exocytosis of dense core vesicles (DCV) in cultured mouse hippocampal neurons. Using mCherry- or pHluorin-tagged NPY expression and EGFP- or mCherry-tagged RPH3A, the authors first analyse the colocalization of DCVs and RPH3A. Using FRAP, the authors next analyse the mobility of DCVs and RAB3A in neurites. The authors go on to determine the number of exocytotic events of DCVs in response to high-frequency electrical stimulation and find that RPH3A ko increases the number of exocytotic events by a factor 2-3, but not the fraction of released DCVs in a given cell (8x 50Hz stim). In contrast, the release fraction is also increased in RPH3A KOs when doubling the stimulation number (16x 50Hz). They further observe that RPH3A ko increases dendrite and axon length and the overall number of ChgrB-positive DCVs. However, the overall number of DCVs and dendritic length in ko cells directly correlate, indicating that the number of vesicles per dendritic length remains unaffected in the RPH3A KOs. Lentiviral co-expression of tetanus toxin (TeNT) showed a non-significant trend to reduce axon and dendrite length in RPH3A KOs. Finally, the authors use co-expression of RAB3A and SNAP25 constructs to show that RAB3A but not SNAP25 interaction is required to allow the exocytosis-enhancing effect in RPH3A KOs.

      While the authors' methodology is sound and the microscopy experiments are performed well and analyzed appropriately, their results in large part do not sufficiently support their conclusions. Moreover, the experiments are not always described in sufficient detail (e.g. FRAP; DCV counts vs. neurite length) to fully understand their claims.

      Overall, I thus feel that the manuscript does not provide a sufficient advance in knowledge. 

      Strengths: 

      - The authors' methodology is sound, and the microscopy results are performed well and analyzed appropriately. 

      - Figure 2: The exocytosis imaging is elegant and potentially very insightful. The effect in the RPH3A KOs is convincing. 

      - Figure 4: the logic of this experiment is elegant. It shows that the increased number of DCV fusion events in RPH3A KOs is related to the interaction of RPH3A with RAB3A but not with SNAP25. 

      We thank the reviewer for their positive evaluation of our manuscript.

      Weaknesses: 

      - The results in larger parts do not sufficiently support the conclusions. 

      - The experiments are not always described in sufficient detail (e.g. FRAP; DCV counts vs. neurite length) to fully understand their claims. 

      - Not of sufficient advance in knowledge for this journal 

      - The significance of differences in control experiments (WT vs. KO) varies between experiments shown in different figures.

      - Axons and dendrites were not analyzed separately in Figures 1 and 2. 

      - The colocalization study in Figure 1 would require super-resolution microscopy. 

      To address the reviewers’ comments, we have provided a more detailed explanation of our analysis (p 19-20, line 521-542). In addition, we have repeated our colocalization experiments using STED microscopy, see Joint Public Review item (i).  

      Reviewer #2 (Public Review): 

      Summary: 

      Hoogstraaten et al. investigated the involvement of rabphilin-3A (RPH3A) in DCV fusion in neurons during calcium-triggered exocytosis at the synapse and during neurite elongation. They suggest that RPH3A acts as an inhibitory factor for LDV fusion and this is mediated partially via its interaction with SNAP25 and not Rab3A/Rab27. It is a very elegant study although several questions remain to be clarified.

      Strengths: 

      The authors use state-of-the-art techniques like tracking NPY-PHluorin exocytosis and FRAP experiments to quantify these processes providing novel insight into LDCs exocytosis and the involvement of RPH3A. 

      We thank the reviewer for their positive evaluation of our manuscript.

      Weaknesses: 

      At the current state of the manuscript, further supportive experiments are necessary to fully support the authors' conclusions. 

      We thank the reviewer for their comments and suggestions. We have performed additional experiments to support our conclusions, see Joint Public Review items (i) – (iv).

      Reviewer #3 (Public Review): 

      Summary: 

      The molecular mechanism of regulated exocytosis has been extensively studied in the context of synaptic transmission. However, in addition to neurotransmitters, neurons also secrete neuropeptides and neurotrophins, which are stored in dense core vesicles (DCVs). These factors play a crucial role in cell survival, growth, and shaping the excitability of neurons. The mechanism of release for DCVs is similar, but not identical, to that used for SV exocytosis. This results in slow kinetic and low release probabilities for DCV compared to SV exocytosis. There is a limited understanding of the molecular mechanisms that underlie these differences. By investigating the role of rabphilin-3A (RPH3A), Hoogstraaten et al. uncovered for the first time a protein that inhibits DCV exocytosis in neurons. 

      Strengths: 

      In the current work, Hoogstraaten et al. investigate the function of rabphilin-3A (RPH3A) in DCV exocytosis. This RAB3 effector protein has been shown to possess a Ca2+ binding site and an independent SNAP25 binding site. Using colocalization analysis of confocal imaging the authors show that in hippocampal neurons RPH3A is enriched at pre- and post-synaptic sites and associates specifically with immobile DCVs. Using site-specific RPH3A mutants they found that the synaptic location was due to its RAB3 interaction site. They further could show that RPH3A inhibits DCV exocytosis due to its interaction with SNAP25. They came to that conclusion by comparing NPY-pHluorin release in WT and RPH3A KO cells and by performing rescue experiments with RPH3A mutants. Finally, the authors showed that by inhibiting stimulated DCV release, RPH3A controlled the axon and dendrite length possibly through the reduced release of neurotrophins. Thereby, they pinpoint how the proper regulation of DCV exocytosis affects neuron physiology.

      We thank the reviewer for their positive evaluation of our manuscript.

      Weaknesses: 

      Data context 

      One of the findings is that RPH3A accumulates at synapses and is mainly associated with immobile DCVs.

      However, Farina et al. (2015) showed that 66% of all DCVs are secreted at synapses and that these DCVs are immobile prior to secretion. To provide additional context to the data, it would be valuable to determine if RPH3A KO specifically enhances secretion at synapses. Additionally, the authors propose that RPH3A decreases DCV exocytosis by sequestering SNAP25 availability. At first glance, this hypothesis appears suitable. However, due to RPH3A synaptic localization, it should also limit SV exocytosis, which it does not. In this context, the only explanation for RPH3A's specific inhibition of DCV exocytosis is that RPH3A is located at a synapse site remote from the active zone, thus protecting the pool of SNAP25 involved in SV exocytosis from binding to RPH3A. This hypothesis could be tested using super-resolution microscopy. 

      We thank the reviewer for their suggestion. We have now performed super-resolution microscopy, see Joint Public Review item (i). However, these new data do not necessarily explain the stronger effect of RPH3A deficiency on DCV exocytosis, relative to SV exocytosis. We have added a short discussion on this topic to the revised manuscript, see Joint Public Review item (iv).

      Technical weakness 

      One technical weakness of this work consists in the proper counting of labeled DCVs. This is significant since most findings in this manuscript rely on this analysis. Since the data was acquired with epi-fluorescence or confocal microscopy, it doesn't provide the resolution to visualize individual DCVs when they are clumped. The authors use a proxy to count the number of DCVs by measuring the total fluorescence of individual large spots and dividing it by the fluorescence intensity of discrete spots assuming that these correspond to individual DCVs. This is an appropriate method but it heavily depends on the assumption that all DCVs are loaded with the same amount of NPY-pHluorin or chromogranin B (ChgB). Due to the importance of this analysis for this manuscript, I suggest that the authors show that the number of DCVs per µm2 is indeed affected by RPH3A KO using super-resolution techniques such as dSTORM, STED, SIM, or SRRF. 

      The reviewer is correct that this is a crucial issue that we have not addressed optimally until now. We devoted a large part of a previous manuscript to this issue, but did not refer to this work clearly enough. We have now clarified this (p 7, line 187-190). In brief, we have previously quantified the ratio between fluorescent intensity of ChgB and NPY-pHluorin in confocal microscopy over the number of dSTORM puncta in sparse areas of WT mouse hippocampal neurons (Persoon et al., 2018). This quantification yielded a unitary fluorescence intensity per vesicle that was very stable across different neurons. Although there might be some underestimation of the total number of DCVs when using confocal microscopy, the study of Persoon et al. (2018) has demonstrated that these parameters correlate well and that the estimations are accurate. Considering that the ΔF/F0 is similar in RPH3A WT and KO neurons (now Figure S2I), meaning that the intensity of NPY-pHluorin of one fusion event is comparable, we can presume that this correlation also applies for the RPH3A KO neurons.
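
      The unitary-intensity approach described above can be illustrated with a minimal sketch. This is not the published analysis pipeline; the function name, the rounding rule, the assumption that every detected punctum holds at least one vesicle, and all numbers are hypothetical and chosen only to show the arithmetic.

```python
def estimate_dcv_counts(punctum_intensities, unitary_intensity):
    """Estimate vesicles per punctum from integrated fluorescence.

    Assumption (illustrative only): each punctum contains at least one
    vesicle, and the count is the intensity divided by the unitary
    (single-vesicle) intensity, rounded to the nearest integer.
    """
    if unitary_intensity <= 0:
        raise ValueError("unitary_intensity must be positive")
    return [max(1, round(i / unitary_intensity)) for i in punctum_intensities]


# Hypothetical integrated intensities (arbitrary units), not measured data.
puncta = [100.0, 210.0, 95.0, 480.0]
unitary = 100.0  # hypothetical single-vesicle intensity

counts = estimate_dcv_counts(puncta, unitary)
total_dcvs = sum(counts)
print(counts, total_dcvs)
```

      The estimate is only as good as the assumption that vesicles carry similar amounts of reporter, which is the point the reviewer raises.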

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      Major points: 

      (1) The authors perform an extensive analysis regarding the colocalization of RPH3A and DCVs (Figure 1 upper part). This analysis is hampered by the fact that the recorded data has in relation to vesicle size limited resolution (> 1 µm) to allow making strong claims here. In my view, super-resolution microscopy would be required for the co-localization studies shown in Figure 1. 

      We fully agree and have now performed super-resolution microscopy, see Joint Public Review item (i).

      (2) The FRAP experiments (Figure 1 lower part) cannot be sufficiently understood from what is presented. The methods say that both laser channels were activated during bleaching but NPY-pHluorin is not bleached in Fig.1E. Explanation of the bleaching is not very circumspect. In 1D, it is rather EGFP-RPH3A that is entering the bleached area than the NPY vesicles. These experiments require a more careful explanation of methodology, observed results, and their interpretation. Overall, the observed effects in the original kymograph traces require a better explanation. 

      We acknowledge that NPY-pHluorin in Figure 1E (now Figure 2C) is not completely bleached. NPY-pHluorin appeared to be more difficult to bleach than NPY-mCherry. However, it is important to clarify that we merely bleached the neurites to remove the stationary puncta and facilitate our analysis of DCV/RPH3A dynamics. This bleaching step does not affect the interpretation of our results. We apologize that this was not clearly stated in the text and have made the necessary adjustments in the legend, results, and methods sections (p 6-7, line 162-163; p 5, line 140-142 and p 19, line 508-513). Additionally, we apologize for the accidental switch of the kymographs for NPY-mCherry and EGFP-RPH3A in Figure 1D (now Figure 2B, C). We greatly appreciate the reviewer identifying this error.

      (3) Figure 1: The authors need to mention whether axons, dendrites, or both were analyzed throughout the different panels and how they were identified. Is it possible that axons were wrapping around dendrites in their cultures (compare e.g. Shimojo et al., 2015)? Given the limited spatial resolution and because of this wrapping, interpretation of results could be affected. 

      We completely agree with the reviewer’s assessment and conclusion. We are unable to distinguish axons from dendrites using this experimental design. We have made sure to specify in the text that our observation that RPH3A does not co-travel with DCVs is true for both dendrites and axons, (p 5, line 150).

      (4) Figure 2: The exocytosis imaging is elegant and potentially very insightful. The effect in the RPH3A KOs is convincing. However, the authors determine the efficacy of exocytosis from NPY-pHluorin unquenching of DCVs only. This is only one of several possible parameters to read out the efficiency of exocytosis. Kinetics like e.g. delay between stimulation and start of exocytosis events or release time course of NPY after DCV fusion were not determined. Such analysis could give a better insight into what process before or after the fusion of DCVs is affected by RPH3A ko. 

      We fully agree with the reviewer. We have now included the duration of the fusion events (new Figure S2D-F). The start times of the fusion events are shown in the cumulative plots in what are now Figure 3F and I. The kinetics are normal in the RPH3A KO neurons.

      Moreover, it needs to be mentioned whether 2C and D are from WT or ko cultures. It would be best to show representative examples from both genotypes. 

      We have now adjusted this in the new figure (now Figure 3C, D).

      The number of fusion events is much increased but the release fraction is not significantly changed. While this is consistent with results in Figure 4C it is at variance with 4F. This raises questions about the reliability of the effects in RPH3A KOs. 

      The release fraction indicates the number of fusion events normalized to the total DCV pool. In Figure 4D, we observed a slightly bigger pool size, which explains the lack of significance when analyzing the released fraction. In Figure 4G, however, DCV pool sizes are similar between KO and WT, leading to a statistically significant effect on release fraction in KO neurons. Furthermore, Figures 4B and E distinctly show a substantial increase in fusion events in RPH3A KO neurons. The observed variability in pool size could potentially be attributed to variation between cultures or to inherent biological variability.
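
      As a worked illustration of why pool size matters for this normalization, the sketch below computes a release fraction for three hypothetical cells. The function name and all counts are invented for illustration and are not taken from the figures.

```python
def release_fraction(fusion_events, total_pool):
    """Number of fusion events normalized to the total DCV pool."""
    if total_pool <= 0:
        raise ValueError("total_pool must be positive")
    return fusion_events / total_pool


# Hypothetical counts, not measured data.
wt = release_fraction(20, 1000)

# More fusion events, but a proportionally larger pool: fraction unchanged.
ko_larger_pool = release_fraction(50, 2500)

# More fusion events with a WT-sized pool: fraction increases.
ko_same_pool = release_fraction(50, 1000)

print(wt, ko_larger_pool, ko_same_pool)
```

      With these made-up numbers, the same increase in fusion events yields an unchanged fraction in one case and a 2.5-fold higher fraction in the other, depending only on the pool size used as denominator.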

      Given the increased number of ChgrB-positive DCVs in RPH3A KOs (shown in Figure 2) and that only the cumulative number of exocytosis events were analysed, how can the authors exclude that the RPH3A ko only affects vesicle number but not release, if the % change in released vesicles is not different to WT? Kinetics of release don't seem to be affected. Importantly, what was the density of NPY-pHluorin vesicles in WT vs. ko? 

      In Figure 2 (now Figure 5) we show that RPH3A KO neurons are larger and contain more endogenous ChgB+ puncta than WT neurons. This increased number of ChgB+ puncta scales with neuron size, as puncta density is not increased. A previous study (Persoon et al., 2018) has demonstrated a strong correlation between DCV number and neuron size. Our data show that RPH3A deficiency increased DCV exocytosis, but the released fraction of vesicles depends on the total number of DCVs, which we determined during live recording by dequenching NPY-pHluorin using NH4+. Considering that this is an overexpression of a heterologous DCV-fusion reporter, and not endogenous staining of DCVs, as in the case of ChgB+ puncta, some variability is not unexpected.

      Also in these experiments, the question arises of whether the authors analyse axons, dendrites, or both throughout the different panels and how they were identified. 

      In our experimental design we record all fusion events per cell, including both axons and dendrites but excluding the cell soma. We have clarified this in the method section, (p 19, line 508 and p 19, line 521-522).

      (5) Figure 3: in D the authors show that ChgrB-pos. DCV density is slightly increased in KOs. How does this relate to the density of NPY-pHluorin DCVS in Figure 2? 

      We do not observe a difference in NPY-pHluorin density (see Author response image 1). However, it is important to note that we relied on tracing neurites in live recording images to determine the neuronal size. In contrast, the ChgB density was based on dendritic length determined using (post-hoc) MAP2 staining and was therefore limited to dendrites. In addition, ChgB+ puncta represent an endogenous DCV staining, whereas NPY-pHluorin quantification is based on overexpression of a heterologous DCV-fusion reporter. These two factors likely contribute some variability.

      Author response image 1.

      The authors show a non-significant trend of TeNT coexpression to reduce axon and dendrite lengths in RPH3A KOs. While this trend is visible, I think one cannot draw conclusions from that when not reaching significance. The argument of the authors that the increased axon and dendrite lengths are created by growth factor peptide release from DCV during culture time is interesting. However, the fact that TeNT expression shows a trend toward reducing this effect on axons/dendrites is not sufficient to prove the release of such growth factors. 

      We agree. We have toned down this speculation in the revised manuscript, (p 15-16, line 395-400).

      Lastly, the authors don't provide insight into the mechanisms, of how RPH3A ko increases the number of DCVs per µm dendritic length in the neurons. In my view, there are too many loose ends in this story of how RPH3A ko first increases spontaneous release of DCVs and then enhances neurite growth and DCV density. Did the authors e.g. measure the spontaneous release of DCVs in their cultures? 

      We measured spontaneous release of DCVs during the 30s baseline recording prior to stimulation. We observed no difference in spontaneous release between WT and KO neurons (now Figure S2H). However, baseline recording lasted only 30 seconds. It is possible that this was too short to detect subtle effects.

      Other points: 

      (1) Figure 4: the logic of this experiment is elegant. It shows that the increased number of DCV fusion events in RPH3A KOs is related to the interaction of RPH3A with RAB3A but not with SNAP25. As mentioned above, it is irritating that the reduction of fusion events in KOs and on the release fraction is sometimes reaching significance, but sometimes it does not. Likewise, the absence of significant effects on DCV numbers is not consistent with the results shown in Figures 3C and D. 

      DCV numbers in Figure 3 (now Figure 5) are determined by staining for endogenous ChgB, whereas in Figure 4D and G DCV numbers are determined by overexpressing NPY-pHluorin and counting the dequenched puncta following an NH4+ puff.

      (2) Figure 1B: truncation of the y-axis needs to be clearly indicated. 

      We have replaced this figure with new Figure 1 and have indicated truncations of the y-axis when needed (new Figure 1E). 

      (3) Page 10: "Given that neuropeptides are key modulators of adult neurogenesis (Mu et al., 2010), and that RPH3A depletion leads to increased DCV exocytosis, it is coherent that we observed longer neurites in RPH3A KO neurons." I cannot follow the argument of the authors here: what has neurogenesis to do with neurite length? 

      We apologize for the confusion. We have clarified this in the revised text, (p 16, line 398-400).

      Minor point: 

      There are some typos in the manuscript. e.g., page 8: "... may partially dependent on regulated secretion...); page 6: "...to dequence all...". 

      Thank you for noticing; we have corrected the typos.

      Reviewer #2 (Recommendations For The Authors): 

      (1) Supplementary Figure S1A, in my opinion, should be in Figure 1A as it illustrates all the constructs used in this study and helps the reader to follow it up. 

      We thank the reviewer for their suggestion. However, we feel that with the adjustments we have made in Figure 1, the illustrations of the constructs fit better in Figure S1, since new Figure 1 shows the localization of endogenous RPH3A and not that of the constructs.  

      (2) One of the conclusions of the manuscript is the synaptic localization of the different RPH3A mutants. The threshold for defining synaptic localization is clear neither from the images nor from the analysis: for example, the Manders coefficient for VGLUT1-Syn1, which is used as a positive control, ranges from 0.65-0.95 and that of RPH3A and Syn1 ranges from 0.5-0.95. These values should be compared to all mutants and the conclusions should be based on such comparison.

      We agree. We have now repeated our initial co-localization experiment with all the RPH3A mutants (now Figure S1D-F).  
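
      For readers unfamiliar with the coefficient discussed in this point, a Manders-type coefficient is the fraction of one channel's total intensity that lies in pixels where the other channel is above a threshold. The following is a minimal sketch on flattened pixel lists; the function name, the threshold handling, and the pixel values are illustrative assumptions and do not reproduce the software or settings used in the study.

```python
def manders_m1(channel_a, channel_b, threshold_b=0.0):
    """Fraction of channel A intensity in pixels where channel B > threshold_b."""
    if len(channel_a) != len(channel_b):
        raise ValueError("channels must have the same number of pixels")
    total_a = sum(channel_a)
    if total_a == 0:
        raise ValueError("channel A has no signal")
    # Sum channel A only over pixels in which channel B is above threshold.
    coloc_a = sum(a for a, b in zip(channel_a, channel_b) if b > threshold_b)
    return coloc_a / total_a


# Hypothetical background-subtracted pixel intensities, not real image data.
rph3a_pixels = [10, 0, 5, 5]
syn1_pixels = [8, 3, 0, 4]

m1 = manders_m1(rph3a_pixels, syn1_pixels)
print(m1)
```

      Because the value depends on the threshold applied to the second channel, comparing mutants against the same positive control, as the reviewer requests, requires identical thresholding across conditions.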

      (3) Strengthening this figure with STED/SIM/dSTORM microscopy can verify and add a new understanding of the subtle changes of RPH3A localization. 

      We fully agree and have now added super-resolution microscopy data, see Joint Public Review item (i).

      (4) As RAB3A/RAB27A (ΔRAB3A/RAB27A) loses the punctate distribution, please clarify how it can function at the synapse and not act as a KO. Is it sorted to the synapse, and if so, how?

      We used lentiviral delivery to introduce our constructs, resulting in the overexpression of ΔRAB3A/RAB27A mutant RPH3A. This overexpression likely compensates for the loss of the punctate distribution of RPH3A, thereby maintaining its limiting effect on DCV exocytosis. It is plausible that under physiological conditions, the mislocalization of RPH3A would lead to increased exocytosis, similar to what we observed in the KO. 

      (5) Is RPH3A expressed in both excitatory and inhibitory neurons? 

      We agree this is an important question. Single cell RNA-seq already suggests the protein is expressed in both, but we nevertheless decided to test expression of RPH3A protein in excitatory and inhibitory neurons, using immunocytochemistry with VGAT and VGLUT as markers in hippocampal and striatal WT neurons. We found that RPH3A is expressed in both VGLUT+ hippocampal neurons and VGAT+ striatal neurons (new Figure S1A, B).  

      (6) The differential use of ChgB and NPY as markers for DCVs should be clarified and compared as these are used at different stages of the manuscript. 

      We have previously addressed the comparison between ChgB and NPY-pHluorin (Persoon et al., 2018). We made sure to indicate this more clearly throughout the manuscript to clarify the use of the two markers. 

      (7) FRAP experiments- A graph describing NPY recovery should be added as a reference to 2H and discussed. 

      We agree. We have made the necessary adjustments (new Figure 2G).

      (8) Figure 2E shows some degree of "facilitation" between the 2 8x50 pulses RPH3A KO neurons. Can the author comment on that? What was the reason for using this dual stimulation protocol? 

      There is indeed some facilitation between the two 8 x 50 pulses in KO neurons and to a lesser extent also in the WT neurons, which we have observed before in WT neurons (Baginska et al., 2023). Baginska et al. (2023) showed recently that different stimulation protocols can influence certain fusion dynamics, like the ratio of persistent and transient events and event duration. We used two different stimulation protocols to thoroughly investigate the effect of RPH3A on exocytosis, and assess the robustness of our findings regarding the number of fusion events. Fusion kinetics were similar in WT and KO neurons for both stimulation protocols (new Figure 2D-F).

      (9) Figure 3 quantifies dendrites length and then moves to quantify both axon and dendrites for the Tetanus toxin experiment. What are the effects of KO on axon length? In the main figures, it is not mentioned but in S3 it seems not to be affected. How does it reconcile with the main conclusion on neurite length? 

      Figure 3H (now Figure 6C) shows the effect of the KO on axon length: the axon length is increased in RPH3A KO neurons compared to WT, similar to dendrite length. Re-expressing RPH3A in KO neurons rescues axonal length to WT levels. In Figure S3, we observe a similar trend as in main Figure 3 (new Figure 6), yet this effect did not reach significance. Based on this, we concluded that neurite length is increased upon RPH3A depletion.

      (10) For lay readers, please explain the total pool and how you measured it. However, see the next comment. 

      We agree. We have now defined this better in the revised manuscript, (p 19, line 524-527 and p 20, line 535-539).

      (11) It is a bit hard to understand if the total number of DCV was increased in the KO and if the pool size was increased and in which figure it is quantified. Some sentences like: "A trend towards a larger intracellular DCV pool in KO compared to WT neurons was observed" do not fit with "No difference in DCV pool size was observed between WT and KO neurons (Figure S2D)" or with "During stronger stimulation (16 bursts of 50 APs at 50 Hz), the total fusion and released fraction of DCVs were increased in KO neurons compared to WT". They are not directly supported, or not related to specific figures. Please indicate if the total DCVs pool, as measured by NH4, was increased and based on that, the fraction of the releasable DCVs following the long stimulation. From Figure 2H, the conclusion is an increase in fusion events. In general, NH4 is not quantified clearly- is it quantified in Figure S2C? And if it is a trend, how can it become significant in Figure 3? 

      We agree there has been some inconsistency in the way we describe the data on the total number of DCVs. We have addressed this in the revised text to ensure better clarity. The total DCV pool measured by NPY-pHluorin was not significantly increased in KO neurons; we see a trend towards a bigger DCV pool in the 2x8 50 Hz stimulation paradigm (now Figure S2C), and therefore the released fraction of vesicles is not increased in Figure 1G (now Figure 3G). The number of DCVs in Figure 3 (now Figure 5) is based on endogenous ChgB staining and not on overexpression, unlike the DCV pool measured by NPY-pHluorin. In Figure 3 (now Figure 5) we show that RPH3A KO neurons have slightly more ChgB+ puncta compared to WT.

      (12) In Figure 3, the quantification is not clear, discrete puncta are not visible but rather a smear of chromogranin staining. How was it quantified? An independent method to count DCV number, size, and distribution like EM is necessary to support and add further understanding. 

      We acknowledge that discrete ChgB puncta are not completely visible in Figure 3 (now Figure 5). Besides the inherent limitation in resolution with confocal imaging, we believe that this is due to ChgB accumulation in the KO neurons, as shown in now Figure 5D. Nonetheless, to address this concern of the reviewer, we have selected other images that represent our dataset (now Figure 5A). Furthermore, the number of ChgB+ DCVs was calculated using SynD software (Schmitz et al., 2011; van de Bospoort et al., 2012) (see previous reply). EM would offer valuable independent confirmation on the total DCV number, size and distribution. However, with the current method we already know that vesicle numbers are at least similar. Does that justify the (major) investment in a quantitative EM study? Moreover, this issue does not affect the central message of the current study.

      (13) Can the author discuss if the source of DCVs that are released at the synapse is similar or different from the source of DCVs fused while neurites elongate? 

      With our current experimental design, we are unable to draw conclusions regarding this aspect. We are not sure how experiments to identify this source (probably the Golgi?) would be crucial to sustain the central message of our study.

      (14) An interesting and related question: what are the expression levels of RPH3A during development and neuronal growth during the nervous system development? 

      While we have not specifically examined the expression levels of RPH3A over development, public databases show that RPH3A expression increases over time in mice, consistent with other synaptic proteins (Blake et al., 2021; Baldarelli et al., 2021; Krupke et al., 2017). We have now added this to the revised manuscript (p 2, line 55-56).

      (15) The conclusion from Figure 4 about the contribution of SNAP25 interaction to RPH3A inhibitory effect is not convincing. The data are scattered and in many neurons, high levels of fusion events were detected. Further or independent experiments are needed to support this conclusion. For example, is the interaction with SNAP25 important for its inhibitory activity in other DCV-releasing systems like adrenal medulla chromaffin cells? 

      We agree that further studies in other DCV-releasing systems like chromaffin cells would provide valuable insight into the role of SNAP25 interaction in RPH3A’s inhibitory effect on exocytosis. However, we believe that starting new series of experiments in another model system is outside of the scope of our current study.

      (16) Furthermore, the number of DCVs in the KO is similar in this experiment, raising some more questions about the quantification of the number of vesicles, that differ, in different sections of the manuscript (points # 10,11). 

      The total DCV pool in the fusion experiments is measured by overexpressing NPY-pHluorin; this cannot be directly compared to the number of endogenous ChgB+ DCVs in Figure 3 (now Figure 5), see also item (11).

      (17) The statement - "RPH3A is the only negative regulator of DCV" is not completely accurate as other DCV inhibitors like tomosyn were described before. 

      We agree. By this statement, we intend to convey that RPH3A is the only negative regulator of DCVs without substantial impact on synaptic vesicle exocytosis, unlike Tomosyns. We have clarified this in the revised text, (p 15, line 366-367).

      (18) The support for the effect of KO on the "clustering of DCVs" is not convincing. 

      The intensity of endogenous ChgB puncta was decreased in RPH3A KO neurons (now Figure 5E). However, the peak intensity induced by single NPY-pHluorin labeled DCV fusion events (quanta) was unchanged (now Figure S2I). This indicates that the decrease in ChgB puncta intensity must be due to a reduced number of DCVs (quanta) in this specific location. We have interpreted that as ‘clustering’, or maybe ‘accumulation’. However, we only put forward this possibility. We are now more careful in our speculations within the text (p 11, line 271-277).

      (19) Final sentence: "where RPH3A binds available SNAP25, consequently restricting the assembly of SNARE complexes" should be either demonstrated or rephrased as no effect of trans or general SNARE complex formation is shown. 

      We agree. We have made the necessary adjustments in the text, (p 15, line 387-389).   

      (20) A scheme summarizing RPH3A's interaction with synaptic proteins and its effects on DCVs release, maybe even versus its effects on SVs release, should be considered as a figure or graphic abstract. 

      We have included a working model in Figure 7.  

      (21) Figure 4 logically should come after Figure 2 to summarize the fusion-related chapter before moving to neurite elongation. 

      We have placed Figure 4 after Figure 2 (now Figure 3).

      Reviewer #3 (Recommendations For The Authors): 

      One important finding of this study is that RPH3A downregulates neuron size, possibly by inhibiting DCV release. Additionally, the authors demonstrated that the number of DCVs is directly proportional to the number of DCVs per µm2, and that RPH3A KO reduces DCV clustering. This conclusion was drawn by comparing ChgB with NPY-pHluorin loading of the DCVs. However, this comparison is not valid as ChgB is expressed at an endogenous level and NPY-pHluorin is over-expressed. In the KO situation where DCV exocytosis is enhanced, the available endogenous ChgB may be depleted faster than the overexpressed NPY-pHluorin. Hoogstraaten et al. should either perform a study in which ChgB is overexpressed to test whether the difference in DCV remains or at least provide an alternative interpretation of their data.

      We thank the reviewer for this comment. The reviewer challenges one or two conclusions in our original manuscript (it is not entirely clear to what exactly “This conclusion” refers): (a) “the number of DCVs is directly proportional to the number of DCVs per µm2”, and (b) “that RPH3A KO reduces DCV clustering”. For (a), the reviewer probably means that the number of DCVs per neuron is directly proportional to the size of the neuron. The reviewer states that this (these) conclusion(s) are “not valid as ChgB is expressed at an endogenous level and NPY-pHluorin is over-expressed” because “endogenous ChgB may be depleted faster than the overexpressed NPY-pHluorin”. We have three arguments to conclude that faster depletion of ChgB cannot affect these two conclusions: (1) DCVs bud off from the Golgi with newly synthesized (fresh) ChgB. Whether or not a larger fraction of DCVs is released does not influence this initial ChgB loading into DCVs (together with over-expressed NPY-pHluorin); (2) in hippocampal neurons merely 1-6% of the total DCV pool undergoes exocytosis (the current study and also extensively demonstrated in Persoon et al., 2018). RPH3A KO neurons release a few percent more of the total DCV pool. Hence, “depletion of ChgB” is only marginally different between experimental groups; and (3) the proposed experiment overexpressing ChgB will not help scrutinize our current conclusions, as ChgB overexpression is known to affect DCV biogenesis and the total DCV pool, most likely much more than the few percent more release caused by RPH3A deficiency.

      Hoogstraaten et al. conducted a thorough analysis of the impact of RPH3A KO and its rescue using various mutants on dendrite and axon length (see Supplementary Figure 3). However, they did not test the effect of the ΔSNAP25 mutant. The authors demonstrated that this mutant is the least efficient in rescuing DCV exocytosis (Figure 4E). Hence the neurons expressing this mutant should have a similar size to the KO neurons. This finding would strongly support the argument that DCV exocytosis regulates neuron size. Otherwise, it would suggest that RPH3A may have a function in regulating exocytosis at the growth cones that is independent of SNAP25. Since the authors most probably have the data that allows them to measure the neuron size (acquired for Supplementary Figure 2), I suggest that they perform the required analysis. 

      We agree this is important and performed new experiments to determine the dendrite length of RPH3A WT, KO and KO neurons expressing the ΔSNAP25 mutant. We observed that the dendrite length of RPH3A KO neurons expressing the ΔSNAP25 mutant is indeed similar to KO neurons (new Figure S3C). Although not significant, we observe a clear trend towards bigger neurons compared to WT. This strengthens our conclusion that increased DCV exocytosis contributes to the observed increased neuronal size.

      The authors displayed the result of DCV exocytosis in two ways. One is by showing the number of exocytosis events the other is to display the proportion of DCVs that were secreted. They do the latter by dividing the secreted DCV by the total number of DCVs. These are visualized at the end of the experiment through NH4+ application. While this method works well for synaptic secretion as the marker of SV is localized to the SV membrane and remains at the synapse upon SV exocytosis, it cannot be applied in the same manner when it is the DCV content that is labeled as it is released upon secretion. Hence, the total pool of vesicles should be the number of DCV counted upon NH4+ application in addition to those that are secreted. This way of analyzing the total pool of DCV might also explain the difference in this pool size between KO neurons stimulated two times with 8 stimuli instead of one time with 16 stimuli (Sup Fig 2 C and D). This is an important point as it affects the conclusions drawn from Figure 2. 

      We thank the reviewer for this comment. We agree, and we have made the necessary adjustments throughout the manuscript.
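
      The adjusted calculation can be illustrated with a minimal sketch. Because NPY-pHluorin is a released cargo, vesicles that fused during stimulation are no longer counted upon NH4+ application, so the total pool is taken as the remaining puncta plus the fusion events. The function name and the example counts are hypothetical and are not taken from the manuscript.

```python
def released_fraction(fusion_events: int, remaining_after_nh4: int) -> float:
    """Fraction of the total DCV pool that was released.

    fusion_events: number of exocytosis events counted during stimulation.
    remaining_after_nh4: number of DCV puncta revealed by NH4+ application
        at the end of the experiment (vesicles that did not fuse).
    Both inputs are illustrative per-cell counts (an assumption).
    """
    if fusion_events < 0 or remaining_after_nh4 < 0:
        raise ValueError("counts must be non-negative")
    # Total pool = vesicles still present + vesicles already released.
    total_pool = fusion_events + remaining_after_nh4
    if total_pool == 0:
        raise ValueError("total pool is empty")
    return fusion_events / total_pool


# Hypothetical cell: 50 fusion events, 950 puncta remaining after NH4+.
fraction = released_fraction(50, 950)
```

      With these hypothetical counts the total pool is 1000 vesicles, whereas dividing by the NH4+ count alone would use 950 as the denominator and slightly overestimate the released fraction.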

      The kymogram of DCV exocytic events displayed in Figure 2D shows a majority of persistent (>20s long) events. This is strange as NPY-pHluorin corresponds to the released cargo. Previous work using the same labeling and stimulation technique showed that content release occurs in less than 10s (Baginska et al. 2023). The authors should comment on that difference. 

      In Baginska et al. (2023), the authors distinguished between persistent and transient events. The transient events are shorter than 10s for the 2x8 and 16x stimulation paradigms, whereas persistent events can last for more than 10s. In our study we did not make this distinction. However, in response to this reviewer, we have now quantified the fusion duration per cell. These new data show that the mean duration is similar between genotypes for both stimulation paradigms. We have added these new data (new Figure S2D-F).

      In Figures 1D and E, some puncta in the kymogram appeared to persist after bleaching. This raises questions about the effectiveness of the bleaching procedure for the FRAP experiment. 

      The reviewer is correct that NPY-pHluorin in Figure 1E (now Figure 2C) is not fully bleached. NPY-pHluorin was more resistant to bleaching than NPY-mCherry. However, we merely bleached the neurites to facilitate our analysis by reducing fluorescence of the stationary puncta without causing phototoxicity. Some remaining fluorescence after bleaching does not affect our conclusions in any way.

      In the discussion, the paragraph titled "RPH3A does not travel with DCVs in hippocampal neurons" is quite confusing and would benefit from a streamlined explanation. 

      We thank the reviewer for this comment. We made the necessary adjustments to make this paragraph clearer (p 14, line 339-351).

      First paragraph of page 8 "TeNT expression in KO neurons restored neurite length to WT levels. When compared to KO neurons without TeNT, neurite length was not significantly decreased but displayed a trend towards WT levels (Figure 3G, H)." These two sentences are confusing as they seem contradictory. 

      We agree that this conclusion was too strong. However, we do not see a contradiction. The significant effect between KO and control neurons on both axon and dendrite length is lost upon TeNT expression (which forms the basis for our conclusions cited by the reviewer, now Figure 6B, C). Although the difference between KO neurons +/- TeNT did not reach statistical significance, the (strong) trend is clearly in the same direction. We have refined our original conclusion in the revised manuscript (p 12, line 304-306).

      The data availability statement is missing. 

      We have added the data availability statement (p 21, line 571-572).

    1. Reviewer #2 (Public Review):

      Here I submit my previous review and a great deal of additional information following on from the initial review and the response by the authors.

      * Initial Review *

      Assessment:

      This manuscript is based upon the unprecedented identification of an apparently highly unusual trigeminal nuclear organization within the elephant brainstem, related to a large trigeminal nerve in these animals. The apparently highly specialized elephant trigeminal nuclear complex identified in the current study has been classified as the inferior olivary nuclear complex in four previous studies of the elephant brainstem. The entire study is predicated upon the correct identification of the trigeminal sensory nuclear complex and the inferior olivary nuclear complex in the elephant, and if this is incorrect, then the remainder of the manuscript is merely unsupported speculation. There are many reasons indicating that the trigeminal nuclear complex is misidentified in the current study, rendering the entire study, and associated speculation, inadequate at best, and damaging in terms of understanding elephant brains and behaviour at worst.

      Original Public Review:

      The authors describe what they assert to be a very unusual trigeminal nuclear complex in the brainstem of elephants, and based on this, follow with many speculations about how the trigeminal nuclear complex, as identified by them, might be organized in terms of the sensory capacity of the elephant trunk.

      The identification of the trigeminal nuclear complex/inferior olivary nuclear complex in the elephant brainstem is the central pillar of this manuscript from which everything else follows, and if this is incorrect, then the entire manuscript fails, and all the associated speculations become completely unsupported.

      The authors note that what they identify as the trigeminal nuclear complex has been identified as the inferior olivary nuclear complex by other authors, citing Shoshani et al. (2006; 10.1016/j.brainresbull.2006.03.016) and Maseko et al (2013; 10.1159/000352004), but fail to cite either Verhaart and Kramer (1958; PMID 13841799) or Verhaart (1962; 10.1515/9783112519882-001). These four studies are in agreement; the current study differs.

      Let's assume for the moment that the four previous studies are all incorrect and the current study is correct. This would mean that the entire architecture and organization of the elephant brainstem is significantly rearranged in comparison to ALL other mammals, including humans, previously studied (e.g. Kappers et al. 1965, The Comparative Anatomy of the Nervous System of Vertebrates, Including Man, Volume 1 pp. 668-695) and the closely related manatee (10.1002/ar.20573). This rearrangement necessitates that the trigeminal nuclei would have had to "migrate" and shorten rostrocaudally, specifically and only, from the lateral aspect of the brainstem where these nuclei extend from the pons through to the cervical spinal cord (e.g. the Paxinos and Watson rat brain atlases), to the spatially restricted ventromedial region of specifically and only the rostral medulla oblongata. According to the current paper the inferior olivary complex of the elephant is very small and located lateral to their trigeminal nuclear complex, and the region where the trigeminal nuclei are located by others appears to be just "lateral nuclei" with no suggestion of what might be there instead.

      Such an extraordinary rearrangement of brainstem nuclei would require a major transformation in the manner in which the mutations, patterning, and expression of genes and associated molecules during development occur. Such a major change is likely to lead to lethal phenotypes, making such a transformation extremely unlikely. Variations in mammalian brainstem anatomy are most commonly associated with quantitative changes rather than qualitative changes (10.1016/B978-0-12-804042-3.00045-2).

      The impetus for the identification of the unusual brainstem trigeminal nuclei in the current study rests upon a previous study from the same laboratory (10.1016/j.cub.2021.12.051) that estimated that the number of axons contained in the infraorbital branch of the trigeminal nerve that innervate the sensory surfaces of the trunk is approximately 400 000. Is this number unusual? In a much smaller mammal with a highly specialized trigeminal system, the platypus, the number of axons innervating the sensory surface of the platypus bill skin comes to 1 344 000 (10.1159/000113185). Yet, there is no complex rearrangement of the brainstem trigeminal nuclei in the brain of the developing or adult platypus (Ashwell, 2013, Neurobiology of Monotremes), despite the brainstem trigeminal nuclei being very large in the platypus (10.1159/000067195). Even in other large-brained mammals, such as large whales that do not have a trunk, the number of axons in the trigeminal nerve ranges between 400 000 and 500 000 (10.1007/978-3-319-47829-6_988-1). The lack of comparative support for the argument forwarded in the previous and current study from this laboratory, and that the comparative data indicates that the brainstem nuclei do not change in the manner suggested in the elephant, argues against the identification of the trigeminal nuclei as outlined in the current study. Moreover, the comparative studies undermine the prior claim of the authors, informing the current study, that "the elephant trigeminal ganglion ... point to a high degree of tactile specialization in elephants" (10.1016/j.cub.2021.12.051). While clearly the elephant has tactile sensitivity in the trunk, it is questionable as to whether what has been observed in elephants is indeed "truly extraordinary".

      But let's look more specifically at the justification outlined in the current study to support their identification of the unusually located trigeminal sensory nuclei of the brainstem.

      (1) Intense cytochrome oxidase reactivity
      (2) Large size of the putative trunk module
      (3) Elongation of the putative trunk module
      (4) Arrangement of these putative modules correspond to elephant head anatomy
      (5) Myelin stripes within the putative trunk module that apparently match trunk folds
      (6) Location apparently matches other mammals
      (7) Repetitive modular organization apparently similar to other mammals.
      (8) The inferior olive described by other authors lacks the lamellated appearance of this structure in other mammals

      Let's examine these justifications more closely.

      (1) Cytochrome oxidase histochemistry is typically used as an indicative marker of neuronal energy metabolism. The authors indicate, based on the "truly extraordinary" somatosensory capacities of the elephant trunk, that any nuclei processing this tactile information should be highly metabolically active, and thus should react intensely when stained for cytochrome oxidase. We are told in the methods section that the protocols used are described by Purkart et al (2022) and Kaufmann et al (2022). In neither of these cited papers is there any description, nor mention, of the cytochrome oxidase histochemistry methodology, thus we have no idea of how this histochemical staining was done. In order to obtain the best results for cytochrome oxidase histochemistry, the tissue is either processed very rapidly after buffer perfusion to remove blood or in recently perfusion-fixed tissue (e.g., 10.1016/0165-0270(93)90122-8). Given: (1) the presumably long post-mortem interval between death and fixation - "it often takes days to dissect elephants"; (2) subsequent fixation of the brains in 4% paraformaldehyde for "several weeks"; (3) The intense cytochrome oxidase reactivity in the inferior olivary complex of the laboratory rat (Gonzalez-Lima, 1998, Cytochrome oxidase in neuronal metabolism and Alzheimer's diseases); and (4) The lack of any comparative images from other stained portions of the elephant brainstem; it is difficult to support the justification as forwarded by the authors. It is likely that the histochemical staining observed is background reactivity from the use of diaminobenzidine in the staining protocol. Thus, this first justification is unsupported.

      Justifications (2), (3), and (4) are sequelae from justification (1). In this sense, they do not count as justifications, but rather unsupported extensions.

      (4) and (5) These are interesting justifications, as the paper has clear internal contradictions, and (5) is a sequela of (4). The reader is led to the concept that the myelin tracts divide the nuclei into sub-modules that match the folding of the skin on the elephant trunk. One would then readily presume that these myelin tracts are in the incoming sensory axons from the trigeminal nerve. However, the authors note that this is not the case: "Our observations on trunk module myelin stripes are at odds with this view of myelin. Specifically, myelin stripes show no tapering (which we would expect if axons divert off into the tissue). More than that, there is no correlation between myelin stripe thickness (which presumably correlates with axon numbers) and trigeminal module neuron numbers. Thus, there are numerous myelinated axons, where we observe few or no trigeminal neurons. These observations are incompatible with the idea that myelin stripes form an axonal 'supply' system or that their prime function is to connect neurons. What do myelin stripe axons do, if they do not connect neurons? We suggest that myelin stripes serve to separate rather than connect neurons." So, we are left with the observation that the myelin stripes do not pass afferent trigeminal sensory information from the "truly extraordinary" trunk skin somatic sensory system, and rather function as units that separate neurons - but to what end? It appears that the myelin stripes are more likely to be efferent axonal bundles leaving the nuclei (to form the olivocerebellar tract). This justification is unsupported.

      (6) The authors indicate that the location of these nuclei matches that of the trigeminal nuclei in other mammals. This is not supported in any way. In ALL other mammals in which the trigeminal nuclei of the brainstem have been reported they are found in the lateral aspect of the brainstem, bordered laterally by the spinal trigeminal tract. This is most readily seen and accessible in the Paxinos and Watson rat brain atlases. The authors indicate that the trigeminal nuclei are medial to the facial nerve nucleus, but in every other species the trigeminal sensory nuclei are found lateral to the facial nerve nucleus. This is most salient when examining a close relative, the manatee (10.1002/ar.20573), where the location of the inferior olive and the trigeminal nuclei matches that described by Maseko et al (2013) for the African elephant. This justification is not supported.

      (7) The dual to quadruple repetition of rostro-caudal modules within the putative trigeminal nucleus as identified by the authors relies on the fact that in the neurotypical mammal, there are several trigeminal sensory nuclei arranged in a column running from the pons to the cervical spinal cord; these include (nomenclature from Paxinos and Watson in roughly rostral to caudal order) the Pr5VL, Pr5DM, Sp5O, Sp5I, and Sp5C. But these nuclei are all located far from the midline and lateral to the facial nerve nucleus, unlike what the authors describe in the elephants. These rostrocaudal modules are expanded upon in Figure 2, and it is apparent from what is shown that the authors are attributing other brainstem nuclei to the putative trigeminal nuclei to confirm their conclusion. For example, what they identify as the inferior olive in figure 2D is likely the lateral reticular nucleus as identified by Maseko et al (2013). This justification is not supported.

      (8) In primates and related species, there is a distinct banded appearance of the inferior olive, but what has been termed the inferior olive in the elephant by other authors does not have this appearance, rather, and specifically, the largest nuclear mass in the region (termed the principal nucleus of the inferior olive by Maseko et al, 2013, but Pr5, the principal trigeminal nucleus in the current paper) overshadows the partial banded appearance of the remaining nuclei in the region (but also drawn by the authors of the current paper). Thus, what is at debate here is whether the principal nucleus of the inferior olive can take on a nuclear shape rather than evince a banded appearance. The authors of this paper use this variance as justification that this cluster of nuclei could not possibly be the inferior olive. Such a "semi-nuclear/banded" arrangement of the inferior olive is seen in, for example, giraffe (10.1016/j.jchemneu.2007.05.003), domestic dog, polar bear, and most specifically the manatee (a close relative of the elephant) (brainmuseum.org; 10.1002/ar.20573). This justification is not supported.

      Thus, all the justifications forwarded by the authors are unsupported. Based on methodological concerns, prior comparative mammalian neuroanatomy, and prior studies in the elephant and closely related species, the authors fail to support their notion that what was previously termed the inferior olive in the elephant is actually the trigeminal sensory nuclei. Given this failure, the justifications provided above that are sequelae also fail. In this sense, the entire manuscript and all the sequelae are not supported.

      What the authors have not done is to trace the pathway of the large trigeminal nerve in the elephant brainstem, as was done by Maseko et al (2013), which clearly shows the internal pathways of this nerve, from the branch that leads to the fifth mesencephalic nucleus adjacent to the periventricular grey matter, through to the spinal trigeminal tract that extends from the pons to the spinal cord in a manner very similar to all other mammals. Nor have they shown how the supposed trigeminal information reaches the putative trigeminal nuclei in the ventromedial rostral medulla oblongata. These are but two examples of many specific lines of evidence that would be required to support their conclusions. Clearly tract tracing methods, such as cholera toxin tracing of peripheral nerves cannot be done in elephants, thus the neuroanatomy must be done properly and with attention to details to support the major changes indicated by the authors.

      So what are these "bumps" in the elephant brainstem?

      Four previous authors indicate that these bumps are the inferior olivary nuclear complex. Can this be supported?

      The inferior olivary nuclear complex acts "as a relay station between the spinal cord (n.b. trigeminal input does reach the spinal cord via the spinal trigeminal tract) and the cerebellum, integrating motor and sensory information to provide feedback and training to cerebellar neurons" (https://www.ncbi.nlm.nih.gov/books/NBK542242/). The inferior olivary nuclear complex is located dorsal and medial to the pyramidal tracts (which were not labelled in the current study by the authors but are clearly present in Fig. 1C and 2A) in the ventromedial aspect of the rostral medulla oblongata. This is precisely where previous authors have identified the inferior olivary nuclear complex and what the current authors assign to their putative trigeminal nuclei. The neurons of the inferior olivary nuclei project, via the olivocerebellar tract to the cerebellum to terminate in the climbing fibres of the cerebellar cortex.

      Elephants have the largest (relative and absolute) cerebellum of all mammals (10.1002/ar.22425); this cerebellum contains 257 × 10⁹ neurons (10.3389/fnana.2014.00046; three times more than the entire human brain, 10.3389/neuro.09.031.2009). Each of these neurons appears to be more structurally complex than the homologous neurons in other mammals (10.1159/000345565; 10.1007/s00429-010-0288-3). In the African elephant, the neurons of the inferior olivary nuclear complex are described by Maseko et al (2013) as being both calbindin and calretinin immunoreactive. Climbing fibres in the cerebellar cortex of the African elephant are clearly calretinin immunopositive and also are likely to contain calbindin (10.1159/000345565). Given this, would it be surprising that the inferior olivary nuclear complex of the elephant is enlarged enough to create a very distinct bump in exactly the same place where these nuclei are identified in other mammals?

      What about the myelin stripes? These are most likely to be the origin of the olivocerebellar tract and probably only have a coincidental relationship to the trunk. Thus, given what we know, the inferior olivary nuclear complex as described in other studies, and the putative trigeminal nuclear complex as described in the current study, is the elephant inferior olivary nuclear complex. It is not what the authors believe it to be, and they do not provide any evidence that discounts the previous studies. The authors are quite simply put, wrong. All the speculations that flow from this major neuroanatomical error are therefore science fiction rather than useful additions to the scientific literature.

      What do the authors actually have?

      The authors have interesting data, based on their Golgi staining and analysis, of the inferior olivary nuclear complex in the elephant.

      * Review of Revised Manuscript *

      Assessment:

      There is a clear dichotomy between the authors and this reviewer regarding the identification of specific structures, namely the inferior olivary nuclear complex and the trigeminal nuclear complex, in the brainstem of the elephant. The authors maintain the position that in the elephant alone, irrespective of all the published data on other mammals and previously published data on the elephant brainstem, these two nuclear complexes are switched in location. The authors maintain that their interpretation is correct, this reviewer maintains that this interpretation is erroneous. The authors expressed concern that the remainder of the paper was not addressed by the reviewer, but the reviewer maintains that these sequelae to the misidentification of nuclear complexes in the elephant brainstem renders any of these speculations irrelevant as the critical structures are incorrectly identified. It is this reviewer's opinion that this paper is incorrect. I provide a lot of detail below in order to provide support to the opinion I express.

      Public Review of Current Submission:

      As indicated in my previous review of this manuscript (see above), it is my opinion that the authors have misidentified, and indeed switched, the inferior olivary nuclear complex (IO) and the trigeminal nuclear complex (Vsens). It is this specific point only that I will address in this second review, as this is the crucial aspect of this paper - if the identification of these nuclear complexes in the elephant brainstem by the authors is incorrect, the remainder of the paper does not have any scientific validity.

      The authors, in their response to my initial review, claim that I "bend" the comparative evidence against them. They further claim that as all other mammalian species exhibit a "serrated" appearance of the inferior olive, and as the elephant does not exhibit this appearance, that what was previously identified as the inferior olive is actually the trigeminal nucleus and vice versa.

      For convenience, I will refer to IOM and VsensM as the identification of these structures according to Maseko et al (2013) and other authors and will use IOR and VsensR to refer to the identification forwarded in the study under review.

      The IOM/VsensR certainly does not have a serrated appearance in elephants. Indeed, from the plates supplied by the authors in response (Referee Fig. 2), the cytochrome oxidase image supplied and the image from Maseko et al (2013) show a very similar appearance. There is no doubt that the authors are identifying structures that closely correspond to those provided by Maseko et al (2013). It is solely a contrast in what these nuclear complexes are called and the functional sequelae of the identification of these complexes (are they related to the trunk sensation or movement controlled by the cerebellum?) that is under debate.

      Elephants are part of the Afrotheria, thus the most relevant comparative data to resolve this issue will be the identification of these nuclei in other Afrotherian species. Below I provide images of these nuclear complexes, labelled in the standard nomenclature, across several Afrotherian species.

      (A) Lesser hedgehog tenrec (Echinops telfairi)

      Tenrec brains are the most intensively studied of the Afrotherian brains, with these extensive neuroanatomical studies undertaken primarily by Heinz Künzle. Below I append images (coronal sections stained with cresyl violet) of the IO and Vsens (labelled in the standard mammalian manner) in the lesser hedgehog tenrec. It should be clear that the inferior olive is located in the ventral midline of the rostral medulla oblongata (just like the rat) and that this nucleus is not distinctly serrated. The Vsens is located in the lateral aspect of the medulla skirted laterally by the spinal trigeminal tract (Sp5). These images and the labels indicating structures correlate precisely with those provided by Künzle (1997, 10.1016/S0168-0102(97)00034-5), see his Figure 1K,L. Thus, in the first case of a related species, there is no serrated appearance of the inferior olive, the location of the inferior olive is confirmed through connectivity with the superior colliculus (a standard connection in mammals) by Künzle (1997), and the location of Vsens is what is considered to be typical for mammals. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report.

      Review image 1.

      (B) Giant otter shrew (Potomogale velox)

      The otter shrews are close relatives of the Tenrecs. Below I append images of cresyl violet (left column) and myelin (right column) stained coronal sections through the brainstem with the IO, Vsens and Sp5 labelled as per standard mammalian anatomy. Here we see hints of the serration of the IO as defined by the authors, but we also see many myelin stripes across the IO. Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report.

      Review image 2.

      (C) Four-toed sengi (Petrodromus tetradactylus)

      The sengis are close relatives of the Tenrecs and otter shrews, these three groups being part of the Afroinsectiphilia, a distinct branch of the Afrotheria. Below I append images of cresyl violet (left column) and myelin (right column) stained coronal sections through the brainstem with the IO, Vsens and Sp5 labelled as per standard mammalian anatomy. Here we see vague hints of the serration of the IO (as defined by the authors), and we also see many myelin stripes across the IO. Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report.

      Review image 3.

      (D) Rock hyrax (Procavia capensis)

      The hyraxes, along with the sirens and elephants form the Paenungulata branch of the Afrotheria. Below I append images of cresyl violet (left column) and myelin (right column) stained coronal sections through the brainstem with the IO, Vsens and Sp5 labelled as per the standard mammalian anatomy. Here we see hints of the serration of the IO (as defined by the authors), but we also see evidence of a more "bulbous" appearance of subnuclei of the IO (particularly the principal nucleus), and we also see many myelin stripes across the IO. Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report.

      Review image 4.

      (E) West Indian manatee (Trichechus manatus)

      The sirens are the closest extant relatives of the elephants in the Afrotheria. Below I append images of cresyl violet (top) and myelin (bottom) stained coronal sections (taken from the University of Wisconsin-Madison Brain Collection, https://brainmuseum.org, and while quite low in magnification they do reveal the structures under debate) through the brainstem with the IO, Vsens and Sp5 labelled as per standard mammalian anatomy. Here we see the serration of the IO (as defined by the authors). Vsens is located laterally and skirted by the Sp5. This is in agreement with the authors, as they propose that ONLY the elephants show the variations they report.

      Review image 5.

      These comparisons and the structural identification, with which the authors agree as they only distinguish the elephants from the other Afrotheria, demonstrate that the appearance of the IO can be quite variable across mammalian species, including those with a close phylogenetic affinity to the elephants. Not all mammal species possess a "serrated" appearance of the IO. Thus, it is more than just theoretically possible that the IO of the elephant appears as described prior to this study.

      So what about elephants? Below I append a series of images from coronal sections through the African elephant brainstem stained for Nissl, myelin, and immunostained for calretinin. These sections are labelled according to standard mammalian nomenclature. In these complete sections of the elephant brainstem, we do not see a serrated appearance of the IOM (as described previously and in the current study by the authors). Rather the principal nucleus of the IOM appears to be bulbous in nature. In the current study, no image of myelin staining in the IOM/VsensR is provided by the authors. However, in the images I provide, we do see the reported myelin stripes in all stains - agreement between the authors and reviewer on this point. The higher magnification image to the bottom left of the plate shows one of the IOM/VsensR myelin stripes immunostained for calretinin, and within the myelin stripes axons immunopositive for calretinin are seen (labelled with an arrow). The climbing fibres of the elephant cerebellar cortex are similarly calretinin immunopositive (10.1159/000345565). In contrast, although not shown at high magnification, the fibres forming the Sp5 in the elephant (in the Maseko description, unnamed in the description of the authors) show no immunoreactivity to calretinin.

      Review image 6.

      Peripherin Immunostaining

      In their revised manuscript the authors present immunostaining of peripherin in the elephant brainstem. This is an important addition (although it does replace the only staining of myelin provided by the authors which is unusual as the word myelin is in the title of the paper) as peripherin is known to specifically label peripheral nerves. In addition, as pointed out by the authors, peripherin also immunostains climbing fibres (Errante et al., 1998). The understanding of this staining is important in determining the identification of the IO and Vsens in the elephant, although it is not ideal for this task as there is some ambiguity. Errante and colleagues (1998; Fig. 1) show that climbing fibres are peripherin-immunopositive in the rat. But what the authors do not evaluate is the extensive peripherin staining in the rat Sp5 in the same paper (Errante et al, 1998, Fig. 2). The image provided by the authors of their peripherin immunostaining (their new Figure 2) shows what I would call the Sp5 of the elephant to be strongly peripherin immunoreactive, just like the rat shown in Errante et al (1998), and moreover in the precise position of the rat Sp5! This makes sense as this is where the axons subserving the "extraordinary" tactile sensitivity of the elephant trunk would be found (in the standard model of mammalian brainstem anatomy). Interestingly, the peripherin immunostaining in the elephant is clearly lamellated...this coincides precisely with the description of the trigeminal sensory nuclei in the elephant by Maseko et al (2013) as pointed out by the authors in their rebuttal. Errante et al (1998) also point out peripherin immunostaining in the inferior olive, but according to the authors this is only "weakly present" in the elephant IOM/VsensR. This latter point is crucial. Surely if the elephant has an extraordinary sensory innervation from the trunk, with 400 000 axons entering the brain, the VsensR/IOM should be highly peripherin-immunopositive, including the myelinated axon bundles?! In this sense, the authors argue against their own interpretation - either the elephant trunk is not a highly sensitive tactile organ, or the VsensR is not the trigeminal nuclei it is supposed to be.

      Summary:

      (1) Comparative data of species closely related to elephants (Afrotherians) demonstrates that not all mammals exhibit the "serrated" appearance of the principal nucleus of the inferior olive.

      (2) The location of the IO and Vsens as reported in the current study (IOR and VsensR) would require a significant, and unprecedented, rearrangement of the brainstem in the elephants independently. I argue that the underlying molecular and genetic changes required to achieve this would be so extreme that it would lead to lethal phenotypes. Arguing that the "switcheroo" of the IO and Vsens does occur in the elephant (and no other mammals) and thus doesn't lead to lethal phenotypes is a circular argument that cannot be substantiated.

      (3) Myelin stripes in the subnuclei of the inferior olivary nuclear complex are seen across all related mammals as shown above. Thus, the observation made in the elephant by the authors in what they call the VsensR, is similar to that seen in the IO of related mammals, especially when the IO takes on a more bulbous appearance. These myelin stripes are the origin of the olivocerebellar pathway, and are indeed calretinin immunopositive in the elephant as I show.

      (4) What the authors see aligns perfectly with what has been described previously, the only difference being the names that nuclear complexes are being called. But identifying these nuclei is important, as any functional sequelae, as extensively discussed by the authors, are entirely dependent upon accurately identifying these nuclei.

      (5) The peripherin immunostaining scores an own goal - if peripherin is marking peripheral nerves (as the authors and I believe it is), then why is the VsensR/IOM only "weakly positive" for this stain? This either means that the "extraordinary" tactile sensitivity of the elephant trunk is non-existent, or that the authors have misinterpreted this staining. That there is extensive staining in the fibre pathway dorsal and lateral to the IOR (which I call the spinal trigeminal tract) supports the idea that the authors have misinterpreted their peripherin immunostaining.

      (6) Evolutionary expediency. The authors argue that what they report is an expedient way in which to modify the organisation of the brainstem in the elephant to accommodate the "extraordinary" tactile sensitivity. I disagree. As pointed out in my first review, the elephant cerebellum is very large and composed of huge numbers of morphologically complex neurons. The inferior olivary nuclei in all mammals studied in detail to date give rise to the climbing fibres that terminate on the Purkinje cells of the cerebellar cortex. It is more parsimonious to argue that, in alignment with the expansion of the elephant cerebellum (for motor control of the trunk), the inferior olivary nuclei (specifically the principal nucleus) have had additional neurons added to accommodate this cerebellar expansion. Such an addition of neurons to the principal nucleus of the inferior olive could readily lead to the loss of the serrated appearance of the principal nucleus of the inferior olive, and would require far fewer modifications in the developmental genetic program that forms these nuclei. This type of quantitative change appears to be the primary way in which structures are altered in the mammalian brainstem.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Common comments

      (1) Significance of zero mutation rate

      Reviewers asked why we included a mutation rate even though setting the mutation rate to zero doesn’t change results. We think that including a non-zero mutation rate makes our results more generalisable, and thus is a strength rather than a weakness. To better motivate this choice, we have added a sentence to the beginning of Results:

      (2) Writing the mu=0 case first

      Reviewers suggested that we should first focus on the mu=0 case, and then generalize the result. The suggestions are certainly good. However, given the large amount of work involved in a re-organization, we have decided to adhere to our current narrative. That said, we now only include equations where mu=0 in the main text, and have moved the case of nonzero mutation rate to Supplementary Information.

      (3) Making equations more accessible

      We have taken three steps to make equations more readable.

      ● Equations in the main text correspond to the case of zero-mutation rate.

      ● The original section on equation derivation is now in a box in the main text so that readers have the choice of skipping it but interested readers can still get a gist of where equations came from.

      ● We have provided a much more detailed interpretation of the equation (see page 10).

      (4) Validity of the Gaussian approximation

      Reviewers raised concerns about the validity of the Gaussian approximation of the F frequency 𝑓(𝜏). The fact that our calculations closely match simulations suggests that this approximation is reasonable. Still, we added a discussion about the validity of this approximation in Box 1.

      We also added a figure to SI covering various cases of initial S and F sizes. This figure shows that when either initial S or initial F is small, the distribution of 𝑓(𝜏) is not normal. However, if initial S and F are both on the order of hundreds, then the distribution of 𝑓(𝜏) is approximately Gaussian.
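
      The dependence on initial sizes described above can be checked with a short simulation. The following is a minimal sketch, not the authors' code: it assumes a zero mutation rate and a pure-birth (Yule) process for each population, and the parameter values (r, omega, tau) and function names are hypothetical choices for illustration. It reports the skewness of 𝑓(𝜏), which would be close to zero for a Gaussian.

```python
import numpy as np

def yule_final_size(n0, rate, tau, size, rng):
    """Final size of a pure-birth (Yule) population started from n0 > 0 cells.

    Each founder lineage is geometric with success probability exp(-rate * tau),
    so the total is n0 plus a negative-binomial number of additional cells.
    """
    p = np.exp(-rate * tau)
    return n0 + rng.negative_binomial(n0, p, size=size)

def f_skewness(s0, f0, r=0.5, omega=0.03, tau=4.8, n_rep=20000, seed=0):
    """Skewness of the F frequency after one maturation period (mu = 0).

    r, omega and tau are illustrative values, not taken from the manuscript.
    """
    rng = np.random.default_rng(seed)
    S = yule_final_size(s0, r, tau, n_rep, rng)
    F = yule_final_size(f0, r + omega, tau, n_rep, rng)
    f = F / (S + F)
    z = (f - f.mean()) / f.std()
    return float(np.mean(z ** 3))

# Small versus large initial population sizes.
print("S0=9,   F0=1  :", f_skewness(9, 1))
print("S0=500, F0=500:", f_skewness(500, 500))
```

      Under these assumptions, the first case would be expected to show a clearly larger skew than the second, in line with the statement that the approximation needs both initial sizes to be on the order of hundreds.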

      Public Reviews:

      Summary:

      The authors demonstrate with a simple stochastic model that the initial composition of the community is important in achieving a target frequency during the artificial selection of a community.

      Strengths:

      To my knowledge, the intra-collective selection during artificial selection has not been seriously theoretically considered. However, in many cases, the species dynamics during the incubation of each selection cycle are important and relevant to the outcome of the artificial selection experiment. Stochasticity from birth and death (demographic stochasticity) plays a big role in these species' abundance dynamics. This work uses a simple framework to tackle this idea meticulously.

      This work may or may not be hysteresis (path dependency). If this is true, maybe it would be nice to have a discussion paragraph talking about how this may be the case. Then, this work would even attract the interest of people studying dynamic systems.

      We have added this clarification in the main text:

      “Note that here, selection outcome is path-dependent in the sense of being sensitive to initial conditions. This phenomenon is distinct from hysteresis where path-dependence results from whether a tuning parameter is increased or decreased.”

      Weaknesses:

      (1) Connecting structure and function

      Most of the artificial selection literature selects communities based on collective function. In this paper, the authors instead select for a target composition. Although there is a schematic cartoon illustrating the relationship between collective function (y-axis) and the community composition in the main Figure 1, there is no explicit explanation or justification of what may be the origin of this relationship. I think giving the readers a naïve idea about how this structure-function relationship arises in the introduction section would help. This is because the conclusion of this paper is that the intra-collective selection makes it hard to artificially select a community that has an intermediate frequency of f (or s). If there is really evidence or theoretical derivation from this framework that indeed the highest function comes from the intermediate frequency of f, then the impact of this paper would increase because the conclusions of this stochastic model could allude to the reasons for the prevalent failures of artificial selection in literature.

      We have added this to introduction: “This is a common quest: whenever a collective function depends on both populations, collective function is maximised, by definition, at an intermediate frequency (e.g. too little of either population will hamper function [23]).”

      (2) Explain intra-collective and inter-collective selection better for readers.

      The abstract, the introduction, and the result section use the terms intra-collective and inter-collective selection without much explanation. For the wide readership of eLife, a clear definition in the beginning would help the audience grasp the importance of this paper, because these concepts are at the core of this work.

      This is a great point. We have added in Abstract:

      “Such collective selection is dictated by two opposing forces: during collective maturation, intra-collective selection acts like a waterfall, relentlessly driving the S-frequency to lower values, while during collective reproduction, inter-collective selection resembles a rafter striving to reach the target frequency. Due to this model structure, maintaining a target frequency requires the continued action of inter-collective selection.”

      and in Introduction

      “A selection cycle consists of three stages (Fig. 1). During collective maturation, intra-collective selection favors fast-growing individuals within a collective. At the end of maturation, inter-collective selection acts on collectives and favors those achieving the target composition. Finally during collective reproduction, offspring collectives sample stochastically from the parents, a process dominated by genetic drift.”
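
      The three stages quoted above can be sketched as a single function. This is an illustrative sketch only, not the authors' implementation: it assumes a zero mutation rate, pure-birth growth during maturation, selection of the single Adult closest to the target, and binomial sampling of N0 cells at reproduction. The names `grow` and `selection_cycle` and all parameter values are hypothetical.

```python
import numpy as np

def grow(n0, rate, tau, rng):
    """Pure-birth growth of each population in an array of initial counts."""
    n0 = np.asarray(n0)
    p = np.exp(-rate * tau)
    # negative_binomial needs n > 0, so empty populations are masked afterwards.
    extra = rng.negative_binomial(np.maximum(n0, 1), p)
    return np.where(n0 > 0, n0 + extra, 0)

def selection_cycle(f_counts, n0, target, rng, r=0.5, omega=0.03, tau=4.8):
    """One cycle: maturation, inter-collective selection, reproduction.

    f_counts holds the number of F cells in each Newborn (each has n0 cells).
    Returns the F counts of the next Newborns and the selected Adult frequency.
    """
    f_counts = np.asarray(f_counts)
    # Stage 1: maturation; F grows faster than S (intra-collective selection).
    S = grow(n0 - f_counts, r, tau, rng)
    F = grow(f_counts, r + omega, tau, rng)
    f_adult = F / (S + F)
    # Stage 2: inter-collective selection; keep the Adult closest to the target.
    best = int(np.argmin(np.abs(f_adult - target)))
    # Stage 3: reproduction; each offspring samples n0 cells from the parent.
    newborns = rng.binomial(n0, f_adult[best], size=f_counts.size)
    return newborns, float(f_adult[best])

rng = np.random.default_rng(0)
counts = np.full(100, 100)  # 100 Newborns, each with 100 F cells out of 1000
for cycle in range(5):
    counts, f_selected = selection_cycle(counts, n0=1000, target=0.05, rng=rng)
    print(cycle, round(f_selected, 4))
```

      Selecting only the top Adult is one of several possible selection rules and was chosen here to keep the sketch short.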

      (3) Achievable target frequency strongly depending on the degree of demographic stochasticity.

      I would expect that the experimentalists would find these results interesting and would want to consider these results during their artificial selection experiments. The main Figure 4 indicates that the Newborn size N0 is a very important factor to consider during the artificial selection experiment. This would be equivalent to how much bottleneck is imposed on the artificial selection process in every iteration step (i.e., the ratio of serial dilution experiment). However, with a low population size, all target frequencies can be achieved, and therefore in these regimes, the initial frequency now does not matter much. It would be great for the authors to provide what the N0 parameter actually means during the artificial selection experiments. Maybe relative to some other parameter in the model. I know this could be very hard. But without this, the main result of this paper (initial frequency matters) cannot be taken advantage of by the experimentalists.

      We have added to SI an analytical approximation for $\breve{N}_0$, the Newborn size below which all target frequencies can be achieved.

      Also, we have added lines indicating $\breve{N}_0$ in Fig. 4a.

      (4) Consideration of environmental stochasticity.

      The success (gold area of Figure 2d) in this framework mainly depends on the size of the demographic stochasticity (birth-only model) during the intra-collective selection. However, during experiments, a lot of environmental stochasticity appears to be occurring during artificial selection. This may be out of the scope of this study. But it would definitely be exciting to see how much environmental stochasticity relative to the demographic stochasticity (variation in the Gaussian distribution of F and S) matters in succeeding in achieving the target composition from artificial selection.

      You are correct that our work considers only demographic stochasticity.

      Indeed, considering other types of stochasticity will be an exciting future research direction. We added in the main text:

      “Overall our model considers mutational stochasticity, as well as demographic stochasticity in terms of stochastic birth and stochastic sampling of a parent collective by offspring collectives. Other types of stochasticity, such as environmental stochasticity and measurement noise, are not considered and require future research.”

      (5) Assumption about mutation rates

      If setting the mutation rates to zero does not change the result of the simulations and the conclusion, what is the purpose of having the mutation rates \mu? Also, is the unidirectional (S -> F -> FF) mutation realistic? I didn't quite understand how the mutations could fit into the story of this paper.

      This is a great point. We have added this to the beginning of Results to better motivate our study:

      “We will start with a complete model where S mutates to F at a nonzero mutation rate µ. We made this choice because it is more challenging to attain or maintain the target frequency when the abundance of fast-growing F is further increased via mutations. This scenario is encountered in biotechnology: an engineered pathway will slow down growth, and breaking the pathway (and thus faster growth) is much easier than the other way around. When the mutation rate is set to zero, the same model can be used to capture collectives of two species with different growth rates.”

      See Point 1 of Common comments.

      (6) Minor points

      In Figure 3b, it is not clear to me how the frequency difference for the Intra-collective and the Inter-collective selection is computed.

      We added a description to the caption of Fig. 3b.

      In Figure 5b, the gold region (success) near the FF is not visible. Maybe increase the size of the figure or have an inset for zoom-in. Why is the region not as big as the bottom gold region?

      We increased the resolution of Fig 5b so that the gold region near FF is more visible.

      We have added Fig 5c and the following explanation to the main text:

      “From numerical simulations, we identified two accessible regions: a small region near FF and a band region spanning from S to F (gold in Fig. 5b i). Intuitively, the rate at which FF grows faster than S+F is greater than the rate at which F grows faster than S (see section VIII in Supplementary Information). Thus, the problem can initially be reduced to a two-population problem (i.e. FF versus F+S; Fig. 5c left), and then expanded to a three-population problem (Fig. 5c right).”

      Recommendations For The Authors

      Since the conclusion of the model greatly depends on the noise (variation) of F and S in the Gaussian distribution, it would be nice to have a plot where the y-axis is the variation in terms of frequency and the x-axis is the s_0 or f_0 (frequency). In the plot, I would love to see how the variation in the frequency depends on the initial frequency of S and F. Maybe this is just trivial.

      In the SI, we added Fig6a, as per your request. Previous Fig6 became Fig6b.

      Reviewer #2 (Public review):

      The authors provide an analytical framework to model the artificial selection of the composition of communities composed of strains growing at different rates. Their approach takes into account the competition between the targeted selection at the level of the meta-community and the selection that automatically favors fast-growing cells within each replicate community. Their main finding is a tipping point or path-dependence effect, whereby compositions dominated by slow-growing types can only be reached by community-level selection if the community does not start and never crosses into a range of compositions dominated by fast growers during the dynamics.

      These results seem to us both technically correct and interesting. We commend the authors on their efforts to make their work reproducible even when it comes to calculations via extensive appendices, though perhaps a table of contents and a short description of these appendices at the start of SI would help navigate them.

      Thank you for the suggestion. We have added a paragraph at the beginning of SI.

      The main limitation in the current form of the article is that it could clarify how its assumptions and findings differ from and improve upon the rest of the literature:

      -  Many studies discuss the interplay between community-level evolution and species- or strain-level evolution. But "evolution" can be a mix of various forces, including selection, drift/randomness, and mutation/innovation.

      - This work's specificity is that it focuses strictly on constant community-level selection versus constant strain-level selection, all other forces being negligible (neither stochasticity nor innovation/mutation matter at either level, as we try to clarify now).

      Note that intra-collective selection is not strictly “constant” in the sense that selection favoring F is the strongest at intermediate F frequency (Fig 3). However, we think that you mean that intra- and inter-collective selection are present in every cycle, and this is correct for our case, and for community selection in general.

      -  Regarding constant community-level selection, it is only briefly noted that "once a target frequency is achieved, inter-collective selection is always required to maintain that frequency due to the fitness difference between the two types" [pg. 3 §2]. In other words, action from the selector is required indefinitely to maintain the community in the desired state. This assumption is found in a fraction of the literature, but is still worth clarifying from the start as it can inform the practical applicability of the results.

      This is a good point. We have added to abstract:

      “Such collective selection is dictated by two opposing forces: during collective maturation, intra-collective selection acts like a waterfall, relentlessly driving the S-frequency to lower values, while during collective reproduction, inter-collective selection resembles a rafter striving to reach the target frequency. Due to this model structure, maintaining a target frequency requires the continued action of inter-collective selection.”

      - More importantly, strain-level evolution also boils down here to pure selection with a constant target, which is less usual in the relevant literature. Here, (1) drift from limited population sizes is very small, with no meaningful counterbalancing of selection, (2) pure exponential regime with constant fitness, no interactions, no density- or frequency-dependence, (3) there is no innovation in the sense that available types are unchanging through time (no evolution of traits such as growth rate or interactions) and (4) all the results presented seem unchanged when mutation rate mu = 0 (as noted in Appendix III), meaning that the conclusions are not "about" mutation in any meaningful way.

      With regard to point (1), Figure 4a (reproduced below) shows how Newborn size affects the region of achievable targets. Indeed at large Newborn size (e.g. 5000 and above), no target frequency is achievable (since drift is too small to generate sufficient inter-community variation and consequently all communities are dominated by fast-growing F). However at Newborn size of for example 1000, there are two regions of accessible target frequencies. At smaller Newborn size, all target frequencies become achievable due to drift becoming sufficiently strong.
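
      The link between Newborn size and drift strength described above can be illustrated numerically. The sketch below is not from the manuscript: it assumes that Newborns are formed by binomial sampling of N0 cells from a parent with F frequency `f_parent`, and the value 0.3 and the function names are hypothetical.

```python
import numpy as np

def newborn_frequency_sd(f_parent, n0):
    """Standard deviation of Newborn F frequency under binomial sampling."""
    return float(np.sqrt(f_parent * (1.0 - f_parent) / n0))

def empirical_sd(f_parent, n0, n_newborns=100_000, seed=0):
    """Same quantity estimated from simulated Newborns."""
    rng = np.random.default_rng(seed)
    freqs = rng.binomial(n0, f_parent, size=n_newborns) / n0
    return float(freqs.std())

# Drift shrinks as 1/sqrt(N0): larger Newborns vary less in composition.
for n0 in (100, 1000, 5000):
    print(n0, newborn_frequency_sd(0.3, n0), empirical_sd(0.3, n0))
```

      Under this assumption the spread among Newborns falls as the inverse square root of N0, which is consistent with the statement that large Newborn sizes leave too little inter-community variation for selection to act on.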

      With regard to points (2) and (3), we have added to Introduction

      “To enable the derivation of an analytical expression, we have made the following simplifications.

      First, growth is always exponential, without complications such as resource limitation, ecological interactions between the two populations, or density-dependent growth. Thus, the exponential growth equation can be used. Second, we consider only two populations (genotypes or species): the fast-growing F population with size F and the slow-growing S population with size S. We do not consider a spectrum of mutants or species, since with more than two populations, an analytical solution becomes very difficult.”

      With regard to point (4), we view this as a strength rather than a weakness. We have added the following to the beginning of Results and Discussions:

      “We will start with a complete model where S mutates to F at a nonzero mutation rate µ. We made this choice because it is more challenging to attain or maintain the target frequency when the abundance of fast-growing F is further increased via mutations.”

      “When the mutation rate is set to zero, the same model can be used to capture collectives of two species with different growth rates.”

      See Point 1 of Common comments.

      - Furthermore, the choice of mutation mechanism is peculiar, as it happens only from slow to fast grower: more commonly, one assumes random non-directional mutations, rather than purely directional ones from less fit to fitter (which is more of a "Lamarckian" idea). Given that mutation does not seem to matter here, this choice might create unnecessary opposition from some readers or could be considered as just one possibility among others.

      We have added the following justification:

      “This scenario is encountered in biotechnology: an engineered pathway will slow down growth, and breaking the pathway (and thus faster growth) is much easier than the other way around.”

      It would be helpful to have all these points stated clearly so that it becomes easy to see where this article stands in an abundant literature and contributes to our understanding of multi-level evolution, and why it may have different conclusions or focus than others tackling very similar questions.

      Finally, a microbial context is given to the study, but the assumptions and results are in no way truly tied to that context, so it should be clear that this is just for flavor.

      We have deleted “microbial” from the title, and revised our abstract:

      Recommendations For The Authors

      (1) More details concerning our main remark above:

      - The paragraph discussing refs [24, 33] is not very clear in how they most importantly differ from this study. Our impression is that the resource aspect is not very important for instance, and the main difference is that these other works assume that strains can change in their traits.

      We are fairly sure that resource depletion is important in Rainey group’s study, as the attractor only evolved after both strains grew fast enough to deplete resources by the end of maturation. Indeed, evolution occurred in interaction coefficients which dictate the competition between strains for resources.

      Regardless, you raised an excellent point. As discussed earlier, we have added the following:

      “To enable the derivation of an analytical expression, we have made the following simplifications.

      First, growth is always exponential, without complications such as resource limitation, ecological interactions between the two populations, or density-dependent growth. Thus, the exponential growth equation can be used. Second, we consider only two populations (genotypes or species): the fast-growing F population with size F and the slow-growing S population with size S. We do not consider a spectrum of mutants or species, since with more than two populations, an analytical solution becomes very difficult.”

      - We would advise the main text to focus on mu = 0, and only say in discussion that results can be generalized.

      Your suggestion is certainly good. However, given the large amount of work involved in a reorganisation, we have decided to adhere to our current narrative. As discussed earlier, we have instead added this at the beginning of Results to help orient readers:

      “We will start with a complete model where S mutates to F at a nonzero mutation rate µ. We made this choice because it is more challenging to attain or maintain the target frequency when the abundance of fast-growing F is further increased via mutations.”

      “When the mutation rate is set to zero, the same model can be used to capture collectives of two species with different growth rates.”

      (2) We think the material on pg. 5 "Intra-collective evolution is the fastest at intermediate F frequencies, creating the "waterfall" phenomenon", although interesting, could be presented in a different way. The mathematical details on how to find the probability distribution of the maximum of independent random variables (including Equation 1) will probably be skipped by most of the readers (for experienced theoreticians, it is standard content; for experimentalists, it is not the most relevant), as such I would recommend displacing them to SM and report only the important results.

      This is an excellent suggestion. We have put a sketch of our calculations in a box in the main text to help orient interested readers. As before, details are in SI.

      Similarly, Equations 2, 3, and 4 are hard to read given the large amount of parameters and the low amount of simplification. Although exploring the effect of the different parameters through Figures 3 and 4 is useful, I think the role of the equations should be reconsidered:

      i. Is it possible to rewrite them in terms of effective variables in a more concise way?

      See Point 3 of Common comments.

      ii. Is it possible to present extreme/particular cases in which they are easier to interpret?

      We have focused on the case where the mutation rate is zero. This makes the mathematical expressions much simpler (see above).

      (3) Is it possible to explain in more detail why the distribution of f_{k+1} conditional on f_k^* is well approximated by a Gaussian? Also, have you explored to what extent the results would change if this were not true (in light of the few universal classes for the maximum of independent variables)?

      Despite the appeal to the CLT and the histograms in the Appendix suggesting that the distribution looks a bit like a Gaussian at a certain scale, fluctuations on that scale are not necessarily what is relevant for the results - a rapid (and maybe wrong) attempt at a characteristic function calculation suggests that in your case, one does not obtain convergence to Gaussians unless we renormalize by S(t=0) and F(t=0), so it seems there is a justification missing in the text as is for the validity of this approximation (or that it is simply assumed).

      See point 4 of Common comments.

      Reviewer #3 (Public Reviews):

      The authors address the process of community evolution under collective-level selection for a prescribed community composition. They mostly consider communities composed of two types that reproduce at different rates, and that can mutate one into the other. Due to such differences in 'fitness' and to the absence of density dependence, within-collective selection is expected to always favour the fastest grower, but the collective-level selection can oppose this tendency, to a certain extent at least. By approximating the stochastic within-generation dynamics and solving it analytically, the authors show that not only high frequencies of fast growers can be reproducibly achieved, aligned with their fitness advantage. Small target frequencies can also be maintained, provided that the initial proportion of fast growers is sufficiently small. In this regime, similar to the 'stochastic corrector' model, variation upon which selection acts is maintained by a combination of demographic stochasticity and of sampling at reproduction. These two regions of achievable target compositions are separated by a gap, encompassing intermediate frequencies that are only achievable when the bottleneck size is small enough or the number of communities is (disproportionately) larger.

      A similar conclusion, that stochastic fluctuations can maintain the system over evolutionary time far from the prevalence of the faster-growing type, is then confirmed by analyzing a three-species community, suggesting that the qualitative conclusions of this study are generalizable to more complex communities.

      I expect that these results will be of broad interest to the community of researchers who strive to improve community-level selection, but are often limited to numerical explorations, with prohibitive costs for a full characterization of the parameter space of such embedded populations. The realization that not all target collective functions can be as easily achieved and that they should be adapted to the initial conditions and the selection protocol is also a sobering message for designing concrete applications.

      A major strength of this work is that the qualitative behaviour of the system is captured by an analytically solvable approximation so that the extent of the 'forbidden region' can be directly and generically related to the parameters of the selection protocol.

      Thanks so much for these positive comments.

      I however found the description of the results too succinct and I think that more could be done to unpack the mathematical results in a way that is understandable to a broader audience. Moreover, the phenomenon the authors characterize is of purely ecological nature. Here, mutations of the growth rate are, in my understanding, neither necessary (non-trivial equilibria can be maintained also when \mu =0) nor sufficient (community-level selection is necessary to keep the system far from the absorbing state) for the phenomenon described. Calling this dynamics community evolution reflects a widespread ambiguity, and is not ascribable just to this work. I find that here the authors have the opportunity to make their message clearer by focusing on the case where the 'mutation' rate \mu vanishes (Equations 39 & 40 of the SI) - which is more easily interpretable, at least in some limits - while they may leave the more general equations 3 & 4 in the SI.

      See points 1-4 of Common comments.

      Combined with an analysis of the deterministic equations, that capture the possibility of maintaining high frequencies of fast growers, the authors could elucidate the dynamics that are induced by the presence of a second level of selection, and speculate on what would be the result of real open-ended evolution (not encompassed by the simple 'switch mutations' generally considered in evolutionary game theory), for instance discussing the invasibility (or not) of mutant types with slightly different growth rates.

      Indeed, evolution is not restricted to two types. However, our main goal here is to derive an analytical expression, and it was difficult for even two types. For three-type collectives, we had to resort to simulations. Investigating the case where fitness effects of mutations are continuously distributed is beyond the scope of this study.

      The single most important model hypothesis that I would have liked to be discussed further is that the two types do not interact. Species interactions are not only essential to achieve inheritance of composition in the course of evolution but are generally expected to play a key role even on ecological time scales. I hope the authors plan to look at this in future work.

      In our system, S and F do interact in a competitive fashion: even though S and F are not competing for nutrients (which are always in excess), they are competing for space. This is because a fixed number of cells is transferred to the next cycle. Thus, the presence of F will, for example, reduce the chance of S being propagated. We have added this clarification to our main text:

      “Note that even though S and F do not compete for nutrients, they compete for space: because the total number of cells transferred to the next cycle is fixed, an overabundance of one population will reduce the likelihood of the other being propagated.”

      Recommendations For The Authors

      I felt the authors could put some additional effort into making their theoretical results meaningful for a population of readers who, though not as highly mathematically educated as they are, can nonetheless appreciate the implications of simple relations or scaling. Below, you find some suggestions:

      (1) In order to make it clear that there is a 'natural' high-frequency equilibrium that can be reached even in the absence of selection, the authors could examine first the dynamics of the deterministic system in the absence of mutations, and use its equilibria to elucidate the combined role of the 'fitness' difference \omega and of the generation duration \tau in setting its value. The fact that these parameters always occur in combination (when there are no mutations) is a general and notable feature of the stochastic model as well. Moreover, this model would justify why you only focus on decreasing the frequency in the new generation.

      Note that the ‘natural’ high-frequency equilibrium in the absence of collective selection is when fast grower F becomes fixed in the population. Following your suggestion, we have introduced two parameters 𝑅τ and 𝑊τ to reflect the coupling between ‘fitness’ and ‘generation duration’:

      (2) Since the phenomenon described in the paper is essentially ecological in nature (as the author states, it does not change significantly if the 'mutation rate' \mu is set to zero), I would put in the main text Equations 39 & 40 of the SI in order to improve intelligibility.

      See Point 2 at the beginning of this letter.

      These equations can be discussed in some detail, especially in the limit of small f^*_k, where I think it is worth discussing the different dependence of the mean and the variance of the frequency distribution on the system's parameters.

      This is a great suggestion. We have added the following:

      “In the limit of small , Equation (3) becomes f while Equation (4) becomes . Thus, both Newborn size (N_0) and fold-change in F/S during maturation (W_τ) are important determinants of selection progress.”

      (3) I would have appreciated an explanation in words of what are the main conceptual steps involved in attaining Equation 2, the underlying hypotheses (notably on community size and distributions), and the expected limits of validity of the approximation.

      See points 3 and 4 at the beginning of this letter.

      (4) I think that some care needs to be put into explaining where extreme value statistics is used, and why is the median of the conditional distribution the most appropriate statistics to look at for characterizing the evolutionary trajectory (which seems to me mostly reliant on extreme values).

      Great point! We added an explanation of the use of the median value in Box 1, and also added Figure 7 to the SI to explain it.

      Showing in a figure the different distributions you are considering (for instance, plotting the conditional distribution for one generation in the trajectories displayed in Figure 2) would be useful to understand what information \bar f provides on a sequence of collective generations, where in principle there may be memory effects.

      Thanks for this suggestion. We have added panel d to Fig. 2 to illustrate the shape and position of the F frequency distributions at each step of the first two selection cycles.

      (5) Similarly, I do not understand why selecting the 5% best communities should push the system's evolution towards the high-frequency solution, instead of just slowing down the improvement (unless you are considering the average composition of the top best communities - which should be justified). I think that such sensitivity to the selection intensity should be appropriately referenced and discussed in the main text, as it is a parameter that experimenters are naturally led to manipulate.

      In the main text, we have added this explanation:

      “In contrast with findings from an earlier study [23], choosing top 1 is more effective than the less stringent “choosing top 5%”. In the earlier study, variation in the collective trait is partly due to nonheritable factors such as random fluctuations in Newborn biomass. In that context, a less stringent selection criterion proved more effective, as it helped retain collectives with favorable genotypes that might have exhibited suboptimal collective traits due to unfavorable nonheritable factors. However, since this study excludes nonheritable variations in collective traits, selecting the top 1 collective is more effective than selecting the top 5% (see Fig. 11 in Supplementary Information).”
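
      The two selection criteria compared above can be sketched as follows. This is an illustration only: the function `select_adults`, the ranking by absolute distance to the target frequency, and the example values are assumptions and do not reproduce the simulation code of the study.

```python
import math


def select_adults(adult_f_freqs, target, top_fraction=None):
    """Rank Adults by how close their F frequency is to the target.

    With top_fraction=None only the single best Adult is returned
    ("top 1"); otherwise the best ceil(top_fraction * g) Adults are kept.
    """
    ranked = sorted(adult_f_freqs, key=lambda f: abs(f - target))
    if top_fraction is None:
        return ranked[:1]
    n_keep = max(1, math.ceil(top_fraction * len(ranked)))
    return ranked[:n_keep]


# Hypothetical Adult F frequencies at the end of one cycle.
adults = [0.50, 0.32, 0.10, 0.29, 0.41, 0.36]
print(select_adults(adults, target=0.30))                    # top 1
print(select_adults(adults, target=0.30, top_fraction=0.5))  # less stringent
```

      A less stringent criterion keeps more Adults as parents, which only helps when part of the trait variation is nonheritable; the quoted passage notes that this is not the case in the present model.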

      (6) Equation 1 could be explained in simpler terms as the product between the probability that one collective reaches the transmitted value times the probability that all others do worse than that. The current formulation is unclear, perhaps just a matter of English formulation.

      We have revised our description to state:

      “Equation (1) can be described as the product of two probability terms: (i) describes the probability density that any one of the g Adult collectives achieves f given , and (ii) describes the probability that all other g – 1 collectives achieve frequencies above f and are thus not selected.”
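
      The product structure described in this passage is that of the density of the lowest of g independent draws. The sketch below checks this numerically under two loud assumptions: the Adult frequencies are independent draws from a common distribution, and a uniform distribution on [0, 1] stands in for the actual conditional distribution, which is not reproduced here. All function names are hypothetical.

```python
import random


def selected_density(f, g, pdf, cdf):
    """Density that the lowest of g independent draws equals f.

    Term (i): g * pdf(f), any one of the g collectives achieves f.
    Term (ii): (1 - cdf(f)) ** (g - 1), all others lie above f.
    """
    return g * pdf(f) * (1.0 - cdf(f)) ** (g - 1)


def uniform_pdf(f):
    return 1.0 if 0.0 <= f <= 1.0 else 0.0


def uniform_cdf(f):
    return min(max(f, 0.0), 1.0)


def mean_from_density(g, steps=10000):
    """Mean of the selected value via midpoint integration of f * density."""
    h = 1.0 / steps
    mids = ((i + 0.5) * h for i in range(steps))
    return sum(m * selected_density(m, g, uniform_pdf, uniform_cdf) * h for m in mids)


def mean_from_simulation(g, trials=20000, seed=0):
    """Mean of the lowest of g uniform draws, estimated by direct sampling."""
    rng = random.Random(seed)
    return sum(min(rng.random() for _ in range(g)) for _ in range(trials)) / trials


print(mean_from_density(10), mean_from_simulation(10))
```

      For the uniform stand-in, both estimates should be close to 1/(g + 1), the known mean of the minimum of g uniform draws.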

      (7) I think that the discussion of the dependence of the boundaries of the 'waterfall' region with the difference in growth rate \omega is important and missing, especially if one wants to consider open-ended evolution of the growth rate - which can occur at steps of different magnitude.

      We added a new chapter and figure in supplementary information on the threshold values when \omega varies. As expected, smaller \omega enlarges the success area.

      We have also added a new figure panel to show how maturation time affects selection efficacy.

      (8) Notations are a bit confusing and could be improved. First of all, in most equations in the main text and SI, what is initially introduced as \omega appears as s. This is confusing because the letter s is also used for the frequency of the slow type.

      The letter S is used to denote an attribute of cells (S cells), the type of cells (Equations 1-3 of the SI) and the number of these cells in the population, sometimes with different meanings in the same sentence. This is confusing, and I suggest referring to slow cells or fast cells instead (or at least to S-cells and F-cells), and keeping S and F as variables for the number of cells of the two types.

      All typos related to the notation have been fixed. We use S and F as types, and S and F (italic) as population numbers.

      (9) On page 3, when introducing the sampling of newborns as ruled by a binomial distribution, the information that you are just transmitting one collective is needed, while it is conveyed later.

      We have added this emphasis:

      “At the end of a cycle, a single Adult with the highest function (with F frequency f closest to the target frequency ) is chosen to reproduce g Newborn collectives each with N_0 cells (‘Selection’ and ‘Reproduction’ in Fig. 1).”

      (10) I found that the abstract talks too early about the 'waterfall' phenomenon. As this is a concept introduced here, I suggest the authors first explain what it is, then use the term. It is a useful metaphor, but it should not obscure the more formal achievements of the paper.

      We feel that the “waterfall” analogy offers a gentle helping hand to orient those who have not thought much about the phenomenon. We view the abstract as an opportunity to attract readership, and thus the more accessible the better.

      (11) In the SI there are numerous typos and English language issues. I suggest the authors read carefully through it, and add line numbers to the next version so that more detailed feedback is possible.

      Thank you for going through the SI. We have gone through it and fixed the problems.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We sincerely thank and express our appreciation to each of the reviewers for their thorough critique of our manuscript.

      Reviewer #1 (Recommendations For The Authors):

      1. The analysis of whole study comes from only 4 cells from L2/3 of ferret visual cortex; however, it is well known that there is a high level of functional heterogeneity within the cortical neurons. Do those four neurons have similar or different response properties? If the four neurons are functionally different, the weak or no correlation may result from heterogenous distribution pattern of mitochondria associated with heterogenous functionality.

      This is an important consideration and often a limitation of CLEM studies. While cortical neurons do exhibit a high degree of functional heterogeneity (similar to spine activity), the 4 cells examined all had selective (OSI > 0.4) somatic responses to oriented gratings, although they differed in their exact orientation preference. Due to experimental limitations of recording from a large population of dendritic spines, we did not characterize other response properties for which their sensitivity might differ. We did not consider orientation preference a metric of study, but instead characterized the difference in preference from the somatic output, allowing comparisons across spines. In addition, our measurements were limited to proximal, basal dendrites rather than any location in the dendritic tree. Nonetheless, we attempted to address this concern by examining spines with functionally heterogeneous visual responses within single cells, as reported in our manuscript: mitochondrial volume within a 1 µm radius was correlated with difference in orientation preference relative to the soma across all 4 cells (mean r = 0.49 +/- 0.22 s.d.), suggesting that cell-to-cell variability has a minimal impact on our main conclusions.

      Even with our limited measurements, there is a large amount of functional heterogeneity in dendritic spine responses (Extended Data Figure 2, Scholl et al. 2021), far greater than the small differences in somatic responses of these 4 cells (Figure 3, Scholl et al. 2021). Moreover, the individual dendrites from these 4 cells exhibited substantial heterogeneity in the distribution of mitochondria. We cannot rule out whether heterogeneity at various scales may obscure certain relationships or result in the weak correlations we observed. We also note that future advancements in volume electron microscopy should allow for greater sample sizes that can better address the role of functional (and structural) heterogeneity. We have added text in the Discussion section about the potential structure-function relationships that might be obscured or revealed by neuron heterogeneity.

      1. The authors argued that "mitochondria are not necessarily positioned to support the energy needs of strong spines." However, the overall energy needs of a spine depend not only on the synaptic strength but also on the frequency of synaptic activity. Is there a correlation between the mitochondria volume around a spine and the overall activity of the spine? This data needs to be analyzed to confirm the distribution of mitochondria is independent of local energy needs.

      The reviewer is correct, but our experimental paradigm was not optimally designed to measure the ‘frequency’ of synaptic activity in vivo. This could have been accomplished with flashed gratings or, perhaps, presenting drifting gratings at different temporal frequencies. For spines tuned to higher temporal frequencies in V1, we may expect greater energy needs, although as the reviewer suggests, energy needs will depend on synapse (and bouton) size. Because we are not able to directly measure activity frequency as carefully or beautifully as can be done ex vivo or in nerve fibers, we do not feel confident in attempting such analysis in this study. Instead, based on previous studies, we assumed that larger synapses might be able to transmit higher frequencies, and thus have higher energy demands. We believe future in vivo studies will more directly measure synaptic frequency for comparison with mitochondria.

      We have added text in the Discussion about this caveat and potential future experiments.

      1. In the results section, the authors briefly mentioned that "We also considered other spine response properties related to tuning preference; specifically, orientation selectivity and response amplitude at the preferred orientation. For either metric, we observed no relationship to mitochondria within 1 μm radius (selectivity: 1 μm: r = -0.081, p = 0.269, n = 60; max response amplitude: 1 μm: r = -0.179, p = 0.078, n = 64) but did see a weak, significant relationship to both at a 5 μm radius (selectivity: r = 0.175, p = 0.027, n = 121; max response amplitude: r =-0.166, p = 0.030, n = 129)." Here only statistic results were given while the data were not presented in the figure illustration. Based on the methods and Figure 3B, it seems that the preferred orientations were calculated based on the vector summation. How did the authors calculate the "response amplitude at the preferred orientation"? This needs to be clarified. In addition, given the huge variety of orientation selectivity, using the response amplitude at the preferred orientation may not be the best parameter to correlate with the mitochondria volume which is indicative of energy needs. It might be necessary to include the baseline activity without visual stimulation and the average response for visual stimuli of different orientations in the analysis.

      We apologize for this oversight, as the details are present in our previous study (Scholl et al., 2021). Response amplitude and preferred orientation were calculated from a Gaussian curve fitting procedure with specific parameters describing those exact values (see Scholl et al. 2021 or Scholl et al. 2013). Only spines with selective responses (vector strength index > 0.1) and passing our SNR criterion were used for these analyses. We have now added this information to the Methods section and referred to it in the Results. With respect to the reviewer’s other concern, we also examined the average response amplitude (across all visual stimuli). There, we found no relationship with the volume of mitochondria within 1 or 5 microns of a spine; however, because spines differ greatly in their selectivity (range = 0 – 0.8), the average response may not be an appropriate metric to compare across spines.

      1. A continuation from the former point, do the spines with similar preferred orientation to the somatic Ca signal have similar Ca signal strength, orientation selectivity index and other characteristics to the spines with different preferred orientation? As shown in the examples (Figure 3B), the spine on the right with different orientation preference compared with its soma has considerably larger response in non-preferred orientation compared to the spine on the left. Thus, the overall activity of the spine on the right may be higher than the spine which has similar preferred orientation to the soma. The authors showed that a positive correlation between difference in orientation preference and mitochondria volume (Figure 3C). Could this be simply due to higher spine activity for non-preferred orientation or spontaneous activity? Thus, the mitochondria might be positioned to support spines with higher overall activity rather than diverse response property.

      The reviewer brings up an interesting consideration. We examined the response properties of spines co-tuned (∆θpref < 22.5 deg) and differentially-tuned (∆θpref > 67.5 deg) to the soma. The orientation selectivity was not different between the two groups (p = 0.12, Wilcoxon ranksum test), although there was a small trend towards co-tuned inputs being more selective. We found that calcium response amplitudes for the preferred stimulus were also not different (p = 0.58, Wilcoxon ranksum test). These analyses are now included as a sentence in the Results.

      We agree with the reviewer that higher spontaneous activity in non-preferred spines may help explain the mitochondrial relationship we observe. However, our current dataset does not have sufficiently long recordings to measure spontaneous synaptic activity. Further, when considering a non-anesthetized preparation, spontaneous activity is highly dependent on brain state and an animal’s self-driven brain activity, which all must be experimentally controlled or measured to accurately address this.

      1. In addition, the information about the orientation selectivity of the soma is also missing. Do the four cells shown here all have similar level of orientation selectivity? Or some have relative weak orientation selectivity in the soma?

      Yes, all 4 cells have a similar OSI (range = 0.4 – 0.57, mean = 0.46 +/- 0.08 s.d.). This has been added to the Results section.

      1. This study focused on only a fraction of spines that are (1) responsive (2) osi > 0.1. However, in theory energy consumption is also related to non-responsive spines and spines with weak orientation tuning. What is the percentage of tuned and untuned spines? What's the correlation of mitochondria volume and spine activity level for untuned spines? I also recommend including the non-responsive spines into the analysis. For example, for each mitochondrion calculate the averaged overall activity of spines within certain distance from the mitochondrion, including the non-responsive spines. I would predict there may be more active spines and higher overall spine activity of dentritic segments near a mitochondrion than segments far from a mitochondrion.

      A majority of spines were tuned for orientation (~91%), although we specifically chose to only analyze data from spines with verifiable, independent calcium events. All analyses except those involving measurements of orientation preference use all dendritic spines (i.e. tuned and untuned). We have clarified this in the Results.

      These other ‘silent’ (i.e. without resolvable visual activity) spines may significantly contribute to the energy demands of a dendrite too, as our methods (GCaMP6s expression) likely only capture synaptic events driving Ca2+ influx through NMDA receptors or VGCCs. We expect that glutamate imaging (e.g. iGluSnFR) may open the door to additional analyses to fully characterize the functional relationship between spines and mitochondria.

      1. The correlation coefficient for mitochondria volume and difference in orientation preference is relatively low (r=0.3150). With such weak correlation, the explanatory power of this data is limited.

      We agree that while the correlation is significant, it is not particularly strong. To better represent the noise surrounding this measurement, we performed a bootstrap correlation analysis, sampling with replacement (1 micron: mean r = 0.31 +/- 0.11 s.e., 5 micron: mean r = 0.02 +/- 0.10 s.e.). We now include this in the Results.
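
      A bootstrap of a correlation coefficient, as described above, can be sketched as follows. This is a generic illustration and not the analysis code of the study: the function names, the use of Pearson's r, the number of resamples, and the example data are all assumptions.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation; returns None if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def bootstrap_correlation(x, y, n_boot=1000, seed=0):
    """Resample (x, y) pairs with replacement; return mean r and its s.e."""
    rng = random.Random(seed)
    n = len(x)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson_r([x[i] for i in idx], [y[i] for i in idx])
        if r is not None:  # skip degenerate resamples
            rs.append(r)
    mean_r = sum(rs) / len(rs)
    # The standard error is the standard deviation of the bootstrap estimates.
    se_r = math.sqrt(sum((r - mean_r) ** 2 for r in rs) / (len(rs) - 1))
    return mean_r, se_r


# Hypothetical data: noisy positive relationship.
data_rng = random.Random(1)
x = [data_rng.random() for _ in range(60)]
y = [0.5 * a + data_rng.random() for a in x]
print(bootstrap_correlation(x, y))
```

      Resampling pairs rather than individual values keeps each spine's two measurements together, so the spread of the resampled coefficients reflects the uncertainty in the correlation itself.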

      1. Why do the numbers of spines in different figures vary? For example, n=60 for 1micron in Figure 3, 54 in Figure 3c, 31 in Figure 4b, 51 in Figure 4e and so on.

      We apologize for the lack of clarity. Each analysis placed different requirements on the data. For example, orientation preference was computed only for selective (OSI > 0.1) spines (Fig. 3c), but this requirement did not apply to comparisons with selectivity or response amplitude (Fig. 3d). Similarly, as stated in the Results and Methods, measurements of local heterogeneity require a minimum number of neighboring spines (n > 2), limiting the number of usable spines for analysis (Fig. 4). We have clarified this in the text.

      1. In Figure 6a, the sample sizes of mito+ spines and mito- spines are extremely unbalanced, which affects the stat power of the analysis. I recommend performing a randomization test.

      We thank the reviewer for this suggestion. We ran permutation tests to compare the similarity in mean values between equally sampled values from each distribution. These tests supported our original analysis and conclusions. We have added these tests to the Results.
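
      For readers unfamiliar with randomization tests on unbalanced groups, a generic label-shuffling permutation test on the difference of means is sketched below. It is not the authors' exact procedure, which compared equally sampled values from each distribution; the function name, the test statistic, and the example values are assumptions.

```python
import random


def permutation_test_mean_diff(group_a, group_b, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Group labels are shuffled while the (possibly unequal) group sizes
    are preserved, so no distributional assumption is needed.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)


# Hypothetical, strongly unbalanced groups (10 versus 150 values).
small_group = [10.0 + 0.1 * i for i in range(10)]
large_group = [0.01 * i for i in range(150)]
print(permutation_test_mean_diff(small_group, large_group))
```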

      1. Ca signals are approximations of electrical signals. How well are spinal calcium signals correlated to synaptic strength and local depolarization? This should be put into discussion.

      There is unlikely to be a simple, direct relationship between spine calcium signal and synaptic strength or membrane depolarization, and this has never been addressed in vivo. Koester and Johnston (2005) performed paired recordings in slice and showed that single presynaptic action potentials producing successful transmission generate widely different calcium amplitudes (Fig. 3). Another study from Sobczyk, Scheuss, and Svoboda (2005) used two-photon glutamate uncaging on single spines and showed that the evoked micro-EPSCs are uncorrelated with the spine calcium signal amplitude. We have added a note about this in the Discussion.

      1. In Figure 4i, the negative correlation may depend on the 4 data points on the right side. How influential are those data points?

      Spearman’s correlation coefficient analysis is robust to outliers and it is highly unlikely these datapoints are critical with our sample size (n > 100 spines).

      1. Raw data of Ca responses were missing.

      Some data has been published with the parent publication (Scholl et al., 2021). As spine imaging data is difficult to obtain and highly unique, we prefer to provide raw data directly upon reasonable request to the corresponding author.

      1. What is the temporal frequency of the drifting grating? Was it fixed or the speed of the grating was fixed?

      This was fixed to 4 Hz and this is now included in the Methods.

      Reviewer #2 (Recommendations For The Authors):

      1. Most of the measurements were based on the distance from the base of the spine neck, and "only on spines with measurable mitochondrial volume at each radius" were analyzed. To better understand the causality, it may also be interesting to have an analysis based on the distance from mitochondria. Would the result be different if the measurements are not 1µm / 5µm from spine but 1µm / 5µm from mitochondria? (e.g. total spine volume in 1µm / 5µm from mitochondria).

      In fact, our first iteration of this study focused on exactly this metric: measuring the distance to the nearest mitochondrion. However, after lengthy discussions between the authors, we ultimately decided this metric was inferior to a volumetric one. Our decision was based on several factors: (1) distance to a mitochondrion is ill defined (e.g. distance to the mitochondrion’s center or nearest membrane edge?), (2) the total amount of mitochondrial volume within a dendritic shaft should allow the greatest amount of energetic support (e.g. more cristae for ATP production, greater capacity for calcium buffering), and (3) we would not account for the geometry of individual mitochondria or their placement near a spine (e.g. when 2 different mitochondria are next to the same spine). We have added further clarification of our reasoning to the Results.

      Nonetheless, we present the reviewer some of our original analyses correlating distance to mitochondria (from the base of the spine and including the spine neck length):

      Author response image 1.

      Here, we examined the relationship to spine head volume, spine-soma orientation preference difference, and the local orientation preference heterogeneity. No relationship showed any significant correlation. Again, this may not be surprising given the drawbacks of measuring ‘distance to mitochondria’.

      1. Is there a selection criterion for the spine for the analysis? Are filopodia spines excluded in the analysis?

      Spines were analyzed regardless of structural classification; however, they were only analyzed if they had a synaptic density with synaptic vesicle accumulation. In our dataset (including those visualized in vivo and reconstructed from the EM volume) we observed no filopodia.

      1. The result states that "56.8% of spines had no mitochondria volume within 1 μm and 12.1% of spines had none within 5 μm.". In other words, around 43% of spines had mitochondria within 1 μm. It would be interesting to show whether there is a correlation between mitochondria size and spine density.

      We agree that this is an interesting measurement. It has been reported that mitochondrial unit length along the dendrite co-varies with linear synapse density in the neocortical distal dendrites of mice (Turner et al., 2022). This was specifically true in distal portions of dendrites more than 60 µm from the soma, because mitochondria volume increases as a function of distance roughly up to this point, then remains relatively constant beyond this distance.

      To investigate this possibility, we calculated the local spine density around an individual spine and compared it to the mitochondria volume within 1 or 5 µm. We found no evidence of a correlation between local spine density and the volume of mitochondria (1 µm: Spearman r = -0.07447, p = 0.2859; 5 µm: r = -0.04447, p = 0.3141). However, the majority of our measurements are more proximal than 60 µm (our median distance of all spines = 49.4 µm, max = 114 µm) and this may be one reason why we observe no correlation.

      1. In Figure 3B, the drifting grating directions are examined from 0 to 315 degrees in the experiment. However, in Figure 3C and 3D, the spine-soma difference of orientation preference was limited to 0 to 90 degree in the graph. Is the graph trimmed, or is there a cause that limits the spine-soma difference of orientation preference to 90?

      Ferret visual cortical neurons are highly sensitive to grating direction and the responses are fit by a double Gaussian curve which estimates the ‘orientation preference’ (0-180 deg). We then calculated the absolute difference in orientation preference and wrapped that value in circular space so the maximum difference possible is 90 deg (e.g. 135 deg -> 45 deg).
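
      The wrapping step described above can be written as a short function. This is a minimal sketch: the function name is hypothetical, and it assumes both preferences are given in degrees on the 0-180 orientation axis.

```python
def orientation_difference(pref_a_deg, pref_b_deg):
    """Absolute difference between two orientation preferences.

    Orientation is circular with a period of 180 deg, so the difference
    is wrapped so that it can never exceed 90 deg.
    """
    diff = abs(pref_a_deg - pref_b_deg) % 180.0
    return min(diff, 180.0 - diff)


print(orientation_difference(135.0, 0.0))   # the example from the text
print(orientation_difference(10.0, 170.0))  # preferences on either side of 0/180
```

      The first call reproduces the example in the response (a raw difference of 135 deg wraps to 45 deg).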

      1. In Figure 4D-F, how is the temporal correlation of calcium activity determined? Is it based on stimulated activity or basal activity? A brief explanation may be helpful to the readers. Also, scale bars could be added to Fig 4D.

      Temporal correlation is computed as the signal correlation between 2 spines over the entire imaging session at that field of view. Specifically, we measured the Pearson correlation between each spine’s ∆F/F trace. To measure the local spatiotemporal correlation, we computed correlations between all neighboring spines within 5 microns and took the average of those values. We have clarified this in the Results section.
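
      The local correlation measure described above can be sketched as follows. The function names, the one-dimensional spine positions along the dendrite, and the example traces are assumptions for illustration and are not the analysis code of the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length traces (None if flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def local_correlation(index, traces, positions_um, radius_um=5.0):
    """Mean correlation between one spine and its neighbors within radius_um.

    traces holds one dF/F trace per spine; positions_um holds each spine's
    position along the dendrite (assumed one-dimensional here).
    """
    rs = []
    for j, trace in enumerate(traces):
        if j == index or abs(positions_um[j] - positions_um[index]) > radius_um:
            continue
        r = pearson_r(traces[index], trace)
        if r is not None:
            rs.append(r)
    return sum(rs) / len(rs) if rs else None


# Hypothetical traces for four spines; the last spine has no close neighbors.
traces = [
    [0.0, 1.0, 0.0, 2.0, 0.0],
    [0.0, 2.0, 0.0, 4.0, 0.0],
    [0.0, -1.0, 0.0, -2.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 1.0],
]
positions_um = [0.0, 2.0, 4.0, 20.0]
print(local_correlation(0, traces, positions_um))
```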

      1. Figure 3C and Figure 4D displayed a significant correlation in 1µm range and such correlation drastically diminished once the criterion changed to 5µm range. It would be interesting to include the criterion of intermediate ranges. It would be interesting to see if there is a trend or tendency or if there is a "cut-off" limit.

      We agree with the reviewer that the drastic change in the correlations between 1 and 5 µm range was surprising to see. While these volumetric measurements are time consuming, we returned to our data and measured an intermediate point of 3 µm. Investigating relationships reported in our study, we found no significant trends for spine-soma similarity (Spearman’s r = -0.011, p = 0.54) or local heterogeneity (Spearman’s r = 0.11, p = 0.23). This suggests that a potential ‘critical distance’ might be less than 3 µm; however, far more additional measurements and analyses would be needed to attempt to identify exactly what this distance is.

      1. In Figure 5, it is shown that spines having mitochondrion in the head or neck are larger. However, only 10 spines are found with mitochondria inside. In the current dataset, are mitochondria abundantly found in large spines? Further analysis or justification would be informative to address this.

      In our dataset, mitochondria were found in ~5% of all spines. Spines with mitochondria have a median volume of approximately 0.6 µm3, roughly twice as large as those without mitochondria, as the reviewer suggests. In the entire population of spines without mitochondria, a volume of 0.6 µm3 represents roughly the 82nd percentile. In other words, of the total population of 157 spines without mitochondria, only 29 had equal or greater volume than the median spine with a mitochondrion. We believe this trend is clearly shown in Figure 5A and is supported by our analysis, including new permutation tests suggested by Reviewer 1.
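
      The percentile quoted above follows directly from the two counts given in the response, as this short check shows (the variable names are hypothetical).

```python
n_without_mito = 157  # spines without mitochondria (from the response)
n_at_or_above = 29    # of those, spines at or above the 0.6 µm3 median volume

# Share of mitochondria-free spines that fall below that volume.
percentile = 100.0 * (n_without_mito - n_at_or_above) / n_without_mito
print(f"{percentile:.1f}")  # prints 81.5
```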

      Reviewer #3 (Recommendations For The Authors):

      1. The authors state that their unsupervised method "quickly and accurately identified mitochondria," but the methods section only says that segmentations were proofread. Was every segmentation examined and judged to be accurate, or was only a subset of the 324 mitochondria checked?

      After deep learning-based extraction, each mitochondrion segmentation was manually proofread. For each dendrite segment, this was ~10-20 mitochondria, so it did not take long to manually inspect and edit each mitochondrion segmentation.

      1. The EM image of the mitochondrion in the spine head in Figure 2C is low resolution and does not apply to the bulk of the data. Images more representative of the analyzed data should be added to supplement the cartoons.

      Our primary rationale for including this specific image was to show that the mitochondria located within spines are small and round, and to include a view of the synapse as well as the mitochondrion. We now include enlarged and additional EM images in Figure 1C.

      1. The majority of spines did not have any mitochondria within a 1 micron radius and were excluded from the correlation analyses, so most of the conclusions are based on a minority of spines. It would be interesting to see comparisons between spines with and without nearby mitochondria. Correlations between the absolute distance to any mitochondrion, synapse size, and mismatch to soma orientation would be especially interesting.

      The reviewer brings up a good point. It is true that many spines were excluded from our analysis based on the fact that they did not have nearby mitochondria within 1 or 5 µm (56.8% of spines had no mitochondria volume within 1 μm and 12.1% of spines had none within 5 μm). We compared the distributions of synapse size, mismatch to soma, and orientation selectivity of two groups of spines – those with at least some mitochondria within 1 µm (n = 65) versus spines without any mitochondria within 5 µm (n = 19).

      We found no difference between the two groups in the distributions of spine volume (1 µm: median = 0.29 µm3, IQR = 0.41 µm3; no mitochondria within 5 µm: median = 0.40 µm3, IQR = 0.37 µm3; p = 0.67) or PSD area (1 µm: median = 0.26 µm2, IQR = 0.33 µm2; no mitochondria within 5 µm: median = 0.31 µm2, IQR = 0.36 µm2; p = 0.49). For functional measures, we also saw no difference in orientation selectivity (1 µm: median = 0.29, IQR = 0.28; no mitochondria within 5 µm: median = 0.28, IQR = 0.15; p = 0.74) or mismatch to soma orientation (1 µm: median = 0.54 deg, IQR = 0.86 deg; no mitochondria within 5 µm: median = 0.46 deg, IQR = 0.47 deg; p = 0.75). We now include these analyses in the Results.

      We also looked at the absolute distances to mitochondria and did not find any significant relationships to spine head volume, spine-soma orientation preference difference, or the local orientation preference heterogeneity (see our response to reviewer #2 for more information).

      1. In Figure 1A the mitochondria appear to be taking up a substantial fraction of the dendritic shaft diameter, even for distal dendrites. It would be useful to know the absolute diameter of the dendrites and mitochondria, given that this is not rodent data and there is no reference for either in the ferret.

      We agree with the reviewer’s point, although we would like to remind the reviewer that these are basal dendrites of layer 2/3 cells. Basal dendrites tend to be thinner than apical branches. Interestingly, in some cases, the dendrite even swells to accommodate a mitochondrion. We did not incorporate this measurement in our study because it is not trivial; dendrite diameter is variable and dendrites are not perfect cylinders. Although we did not make precise measurements across our dendrites, the diameter is comparable to what has been seen in mouse cortex (Turner et al., 2022), roughly 500-1000 nm, but as small as 100 nm at some pinch points. In terms of mitochondria, many were roughly spherical or oblong, therefore the maximum diameters we report are roughly similar to, if not a bit larger than, those of the cross-sectional diameter.

      1. As a rule, PSD area is correlated with spine volume, which makes the observation that spines with mitochondria have larger volume but not PSD area surprising. With n=10 it is difficult to draw conclusions, but it would be interesting to know the PSD area-to-volume ratio of other spines of the same volume and synapse size.

      We were also somewhat surprised to see this, but exactly as the reviewer mentioned, we believe it to be a limitation of the sample size. The difference in volume was large enough to be detected despite a small sample size. We saw a trend towards larger synapses when spines have mitochondria (the median was approximately 60% larger), and we would expect with a larger comparison that PSD area would be significantly greater in spines with mitochondria.

      We calculated the PSD area-to-spine head volume ratio for spines with or without mitochondria. Spines with mitochondria had a significantly lower ratio compared to those without (Mann-Whitney test, p = 0.0056, mito - = 0.78, n = 10; mito + = 0.53, n = 157). As the reviewer mentions, it is somewhat difficult to draw a conclusion from this, but it appears that the PSD does not scale with the increased spine head size.
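      As an illustration of the comparison described above, the following is a minimal sketch of a PSD area-to-head volume ratio followed by a two-sided Mann-Whitney test. It is not the analysis code used in the study: the function names and the numeric values are hypothetical, and the p-value uses the normal approximation with a tie correction rather than an exact test.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney test (normal approximation, tie-corrected).

    Returns the U statistic for sample x and the approximate p-value.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(var_u)
    return u1, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical (PSD area in um^2, head volume in um^3) pairs, for illustration only.
with_mito = [a / v for a, v in [(0.30, 0.60), (0.25, 0.50), (0.40, 0.70)]]
without_mito = [a / v for a, v in [(0.20, 0.25), (0.30, 0.40), (0.15, 0.20), (0.35, 0.45)]]

u, p = mann_whitney_u(with_mito, without_mito)
print(f"U = {u:.1f}, approximate p = {p:.3f}")
```

      With samples as small as the one discussed here, an exact test from an established statistics package would be preferable to the normal approximation used in this sketch.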

      Author response image 2.

      The only way to definitively address this is to increase the sample size, which is becoming easier to achieve with the progression of volume EM imaging and analysis techniques in recent times. We look forward to addressing this in the future.

      1. Nothing is made of the significant fact that these data come from the visual system of a carnivore, not a mouse. Consideration of differences in visual physiology between rodents and carnivores would be worthwhile to put the function of these dendrites in context.

      We thank the reviewer for this consideration and have added text to the Discussion.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      The authors aimed to investigate the contribution of antigenic drift in the HA and NA genes of seasonal influenza A(H3N2) virus to their epidemic dynamics. Analyzing 22 influenza seasons before the COVID-19 pandemic, the study explored various antigenic and genetic markers, comparing them against indicators characterizing the epidemiology of annual outbreaks. The central findings highlight the significant influence of genetic distance on A(H3N2) virus epidemiology and emphasize the role of A(H1N1) virus incidence in shaping A(H3N2) epidemics, suggesting subtype interference as a key factor. 

      Major Strengths: 

      The paper is well-organized, written with clarity, and presents a comprehensive analysis. The study design, incorporating a span of 22 seasons, provides a robust foundation for understanding influenza dynamics. The inclusion of diverse antigenic and genetic markers enhances the depth of the investigation, and the exploration of subtype interference adds valuable insights. 

      Major Weaknesses: 

      While the analysis is thorough, some aspects require deeper interpretation, particularly in the discussion of certain results. Clarity and depth could be improved in the presentation of findings. Furthermore, the evolving dynamics of H3N2 predominance post-2009 need better elucidation.  

      Reviewer #2 (Public Review): 

      Summary: This paper aims to achieve a better understanding of how the antigenic or genetic compositions of the dominant influenza A viruses in circulation at a given time are related to key features of seasonal influenza epidemics in the US. To this end, the authors analyze an extensive dataset with a range of statistical, data science and machine learning methods. They find that the key drivers of influenza A epidemiological dynamics are interference between influenza A subtypes and genetic divergence, relative to the previous one or two seasons, in a broader range of antigenically related sites than previously thought. 

      Strengths: A thorough investigation of a large and complex dataset. 

      Weaknesses: The dataset covers a 21-year period which is substantial by epidemiological standards, but quite small from a statistical or machine learning perspective. In particular, it was not possible to follow the usual process and test predictive performance of the random forest model with an independent dataset. 

      Reviewer #3 (Public Review): 

      Summary: 

      This paper explores the relationships among evolutionary and epidemiological quantities in influenza, using a wide range of datasets and features, and using both correlations and random forests to examine, primarily, what are the drivers of influenza epidemics. It's a strong paper representing a thorough and fascinating exploration of potential drivers, and it makes a trove of relevant data readily available to the community. 

      Strengths: 

      This paper makes links between epidemiological and evolutionary data for influenza. Placing each in the context of the other is crucial for understanding influenza dynamics and evolution and this paper does a thorough job of this, with many analyses and nuances. The results on the extent to which evolutionary factors relate to epidemic burden, and on interference among influenza types, are particularly interesting. The github repository associated with the paper is clear, comprehensive, and well-documented. 

      Weaknesses: 

      The format of the results section can be hard to follow, and we suggest improving readability by restructuring and simplifying in some areas. There are a range of choices made about data preparation and scaling; the authors could explore sensitivity of the results to some of these. 

      Response to public reviews

      We appreciate the positive comments from the reviewers and have implemented or responded to all of the reviewers’ recommendations.

      In response to Reviewer 1, we expand on the potential drivers and biological implications of the findings pointed out in their specific recommendations. For example, we now explicitly mention that antigenically distinct 3c.2a and 3c.3a viruses began to co-circulate in 2012 and underwent further diversification during subsequent seasons in our study. We note that, after the 2009 A(H1N1) pandemic, the mean fraction of influenza positive cases typed as A(H3N2) in A(H3N2) dominant seasons is lower compared to A(H3N2) dominant seasons prior to 2009. We propose that the weakening of A(H3N2) predominance may be linked to the diversification of A(H3N2) viruses during the 2010s, wherein multiple antigenically distinct clades with similar fitness circulated in each season, as opposed to a single variant with high fitness.

      In response to Reviewer 2, we agree that it would be ideal and best practice to measure model performance with an independent test set, but our dataset includes only ~20 seasons. Predictions of independent test sets of 2-3 seasons had unstable performance, which indicates we do not have sufficient power to measure model performance with a test set this small. In the revised manuscript, we provide more justification and clarification of our methodology. Instead of testing model performance on an independent test set, we use leave-one-season-out cross-validation to train models and measure model performance, wherein each “assessment” set contains one season of data (predicted by the model), and the corresponding “analysis” set (“fold”) contains the remaining seasons. This approach is roughly analogous to splitting data into training and test sets, but all seasons are used at some point in the training of the model (Kuhn & Johnson, 2019).

      In response to Reviewer 3, we follow the reviewer’s advice to put the Methods section before the Results section. Concerning Reviewer 3’s question about the sensitivity of our results to data preparation and rescaling, we provide more justification and clarification of our methodology in the revised manuscript. In our study, we adjust influenza type/subtype incidences for differences in reporting between the pre- and post-2009 pandemic periods and across HHS regions. We adjust for differences in reporting between the pre- and post-2009 periods because the US CDC and WHO increased laboratory testing capacity in response to the 2009 A(H1N1) pandemic, which led to substantial, long-lasting improvements to influenza surveillance that are still in place today. Figure 1 - figure supplement 2 shows systematic increases in influenza test volume in all HHS regions after the 2009 pandemic. Given the substantial increase in test volume after 2009, we opted to keep the time trend adjustment for the pre- and post-2009 pandemic periods and evaluate whether adjusting for regional reporting differences affects our results. When estimating univariate correlations between various A(H3N2) epidemic metrics and evolutionary indicators, we found qualitatively equivalent results when adjusting for both pre- and post-2009 pandemic reporting and regional reporting versus only adjusting for the pre- and post-2009 pandemic reporting.

      Reviewer #1 (Recommendations For The Authors): 

      Specific comments: 

      (1) Line 155-156. Request for a reference for: "Given that protective immunity wanes after 1-4 years" 

      We now include two references (He et al. 2015 and Wraith et al. 2022), which were cited at the beginning of the introduction when referring to the duration of protective immunity for antigenically homologous viruses. (Lines 640-642 in revised manuscript)

      (2) Line 162-163: Request a further explanation of the negative correlation between seasonal diversity of HA and NA LBI values and NA epitope distance. Clarify biological implications to aid reader understanding. 

      In the revised manuscript we expand on the biological implications of A(H3N2) virus populations characterized by high antigenic novelty and low LBI diversity.

      Lines 649-653:

      “The seasonal diversity of HA and NA LBI values was negatively correlated with NA epitope distance (Figure 2 – figure supplements 5 – 6), with high antigenic novelty coinciding with low genealogical diversity. This association suggests that selective sweeps tend to follow the emergence of drifted variants with high fitness, resulting in seasons dominated by a single A(H3N2) variant rather than multiple cocirculating clades.”

      (3) Figure S3 legend t-2 may be marked as t-1. 

      Thank you for catching this. We have fixed this typo. Note: Figure S3 is now Figure 2 – figure supplement 5.

      (4) Lines 201-214. The key takeaways from the analysis of subtype dominance are ultimately not clear. It also misses the underlying dynamics that H3N2 predominance following an evolutionary change has waned since 2009.

      In the revised manuscript we elaborate on key takeaways concerning the relationship between antigenic drift and A(H3N2) dominance. We also add a caveat noting that A(H3N2) predominance is weaker during the post-2009 period, which may be linked to the diversification of A(H3N2) lineages after 2012. We do not know of a reference that links the diversification of A(H3N2) viruses in the 2010s to a particular evolutionary change. Therefore, we do not attribute the diversification of A(H3N2) viruses to a specific evolutionary change in A(H3N2) variants circulating at the time (A/Perth/16/2009-like strains (PE09)). Instead, we allude to the potential role of A(H3N2) diversification in creating multiple co-circulating lineages that may have less of a fitness advantage.

      Lines 681-703:

      “We explored whether evolutionary changes in A(H3N2) may predispose this subtype to dominate influenza virus circulation in a given season. A(H3N2) subtype dominance – the proportion of influenza positive samples typed as A(H3N2) – increased with H3 epitope distance (t – 2) (R² = 0.32, P = 0.05) and N2 epitope distance (t – 1) (R² = 0.34, P = 0.03) (regression results: Figure 4; Spearman correlations: Figure 3 – figure supplement 1). Figure 4 illustrates this relationship at the regional level across two seasons in which A(H3N2) was nationally dominant, but where antigenic change differed. In 2003-2004, we observed widespread dominance of A(H3N2) viruses after the emergence of the novel antigenic cluster, FU02 (A/Fujian/411/2002-like strains). In contrast, there was substantial regional heterogeneity in subtype circulation during 2007-2008, a season in which A(H3N2) viruses were antigenically similar to those circulating in the previous season. Patterns in type/subtype circulation across all influenza seasons in our study period are shown in Figure 4 – figure supplement 1. As observed for the 2003-2004 season, widespread A(H3N2) dominance tended to coincide with major antigenic transitions (e.g., A/Sydney/5/1997 (SY97) seasons, 1997-1998 to 1999-2000; A/California/7/2004 (CA04) season, 2004-2005), though this was not universally the case (e.g., A/Perth/16/2009 (PE09) season, 2010-2011).

      After the 2009 A(H1N1) pandemic, A(H3N2) dominant seasons still occurred more frequently than A(H1N1) dominant seasons, but the mean fraction of influenza positive cases typed as A(H3N2) in A(H3N2) dominant seasons was lower compared to A(H3N2) dominant seasons prior to 2009. Antigenically distinct 3c.2a and 3c.3a viruses began to co-circulate in 2012 and underwent further diversification during subsequent seasons in our study (https://nextstrain.org/seasonal-flu/h3n2/ha/12y@2024-05-13) (Dhanasekaran et al., 2022; Huddleston et al., 2020; Yan et al., 2019). The decline in A(H3N2) predominance during the post-2009 period may be linked to the genetic and antigenic diversification of A(H3N2) viruses, wherein multiple lineages with similar fitness co-circulated in each season.”

      (5) Line 253-255: It would be beneficial to provide a more detailed interpretation of the statement that "pre-2009 seasonal A(H1N1) viruses may limit the circulation of A(H3N2) viruses to a greater extent than A(H1N1)pdm09 viruses." Elaborate on the cause-and-effect relationship within this statement.

      In the revised manuscript we suggest that seasonal A(H1N1) viruses may interfere with the circulation of A(H3N2) viruses to a greater extent than A(H1N1)pdm09 viruses, because seasonal A(H1N1) viruses and A(H3N2) are more closely related, and thus may elicit stronger cross-reactive T cell responses.

      Lines 738-745:

      “The internal gene segments NS, M, NP, PA, and PB2 of A(H3N2) viruses and pre-2009 seasonal A(H1N1) viruses share a common ancestor (Webster et al., 1992) whereas A(H1N1)pdm09 viruses have a combination of gene segments derived from swine and avian reservoirs that were not reported prior to the 2009 pandemic (Garten et al., 2009; Smith et al., 2009). Non-glycoprotein genes are highly conserved between influenza A viruses and elicit cross-reactive antibody and T cell responses (Grebe et al., 2008; Sridhar, 2016). Because pre-2009 seasonal A(H1N1) viruses and A(H3N2) are more closely related, we hypothesized that seasonal A(H1N1) viruses could potentially limit the circulation of A(H3N2) viruses to a greater extent than A(H1N1)pdm09 viruses, due to greater T cell-mediated cross-protective immunity.”

      (6) In the results section, many statements report statistical results of correlation analyses. Consider providing further interpretations of these results, such as the implications of nonsignificant correlations and how they support or contradict the hypothesis or previous studies. For example, the statement on line 248 regarding the lack of significant correlation between influenza B epidemic size and A(H3N2) epidemic metrics would benefit from additional discussion on what this non-significant correlation signifies and how it relates to the hypothesis or previous research. 

      In the Discussion section, we suggest that the lack of an association between influenza B circulation and A(H3N2) epidemic metrics is due to few T and B cell epitopes shared between influenza A and B viruses (Terajima et al., 2013).

      Lines 1005-1007 in revised manuscript (Lines 513-515 in original manuscript): 

      “Overall, we did not find any indication that influenza B incidence affects A(H3N2) epidemic burden or timing, which is not unexpected, given that few T and B cell epitopes are shared between the two virus types (Terajima et al., 2013).”

      Minor comments: 

      (1) Line 116-122: Include a summary statistical description of all collected data sets, detailing the number of HA and NA sequence data and their sources. Briefly describe subsampled data sets, specifying preferences (e.g., the number of HA or NA sequence data collected from each region). 

      In our revised manuscript we now include supplementary tables that summarize the number of A/H3 and A/N2 sequences in each subsampled dataset, aggregated by world region, for all seasons combined (Figure 2 - table supplements 1 - 2). We also include supplementary figures showing the number of sequences collected in each month and each season in North America versus the other nine world regions combined (Figure 2 - figure supplements 1 - 2). Subsampled datasets are plotted individually in the figures below but individual time series are difficult to discern due to minor differences in sequence counts across the datasets.

      (2) Figure 7A: Due to space limitations, consider rounding numbers on the x-axis to whole numbers for clarity. 

      Thank you for this suggestion. In the revised manuscript we round numbers in the axes of Figure 7A (Figure 9A in the revised manuscript) so that the axes are less crowded.

      (3) Figure 4C & Figure 4D: Note that Region 10 (purple) data were unavailable for seasons before 2009 (lines 1483-1484). Label each region on the map with its respective region number (1 to 10) and indicate this in the legend for easy identification. 

      In our original submission, the legend for Figure 4 included “Data for Region 10 (purple) were not available for seasons prior to 2009” at the end of the caption. We have moved this sentence, as well as other descriptions that apply to both C and D, so that they follow the sentence “C-D. Regional patterns of influenza type and subtype incidence during two seasons when A(H3N2) was nationally dominant.”

      In our revised manuscript, Figure 4, and Figure 4 - figure supplement 1 (Figure S10 in original submission) include labels for each HHS region.

      We did not receive specific recommendations from Reviewer #2. However, our responses to Reviewer #3 address the study’s weaknesses mentioned by Reviewer #2.

      Reviewer #3 (Recommendations For The Authors): 

      This paper explores the relationships among evolutionary and epidemiological quantities in influenza, using a wide range of datasets and features, and using both correlations and random forests to examine, primarily, what are the drivers of influenza epidemics. 

      This is a workhorse of a paper, in the volumes of data that are analyzed and the extensive analysis that is done. The data that are provided are a treasure trove resource for influenza modelers and for anyone interested in seeing influenza surveillance data in the context of evolution, and evolutionary information in the context of epidemiology. 

      L53 - end of sentence "and antigenic drift": not sure this fits, explain? I thought this sentence was in contrast to antigenic drift.

      Thank you for catching this. We did not intend to include “and antigenic drift” at the end of this sentence and have removed it (Line 59).

      Para around L115: would using primarily US data be a limitation, because it's global immunity that shapes success of strains? Or, how much does each country's immunity and vaccination and so on actually shape what strains succeed there, compared to global/international factors? 

      The HA and NA phylogenetic trees in our study are enriched with US sequences because our study focuses on epidemiological dynamics in the US, and we wanted to prioritize A(H3N2) viruses that the US human population encountered in each season. We agree with the reviewer that the world population may be the right scale to understand how immunity, acquired by vaccination or natural infection, may shape the emergence and success of new lineages that will go on to circulate globally. However, our study assesses the overall impact of antigenic drift on regional A(H3N2) epidemic dynamics in the US. In other words, our driving question is whether we can predict the population-level impact of an A(H3N2) variant in the US, conditional on this particular lineage having established in the US and circulating at relatively high levels. We do not assess the global or population-level factors that may influence which A(H3N2) virus lineages are successful in a given location or season.

      We have added a clarifying sentence to the end of the Introduction to narrow the scope of the paper for the reader. 

      Line 114-116: “Rather than characterize in situ evolution of A(H3N2) lineages circulating in the U.S., we study the epidemiological impacts of antigenic drift once A(H3N2) variants have arrived on U.S. soil and managed to establish and circulate at relatively high levels.”

      In the Results section, I found the format hard to follow, because of the extensive methodological details, numbers with CIs and long sentences. Sentences sometimes included the question, definitions of variables, and lists. For example at line 215 we have: "Next, we tested for associations between A(H3N2) evolution and epidemic timing, including onset week, defined as the winter changepoint in incidence [16], and peak week, defined as the first week of maximum incidence; spatiotemporal synchrony, measured as the variation (standard deviation, s.d.) in regional onset and peak timing; and epidemic speed, including seasonal duration and the number of weeks from onset to peak (Table 2, Figure S11)". I would suggest putting the methods section first, using shorter sentences, separating lists from the question being asked, and stating what was found without also putting in all the extra detail. Putting the methods section before the results might reduce the sense that you have to explain what you did and how in the results section too.

      Thank you for suggesting how to improve the readability of the Results section. In the revised manuscript, we follow the reviewer’s advice to put the Methods section before the Results section. Although eLife formatting requirements specify the order: Introduction, Results, Discussion, and Methods, the journal allows for the Methods section to follow the Introduction when it makes sense to do so. We agree with the reviewer that putting the Methods section before the Results section makes our results easier to follow because we no longer need to introduce methodological details at the beginning of each set of results.

      L285 in the RF you remove variables without significant correlations with the target variables, but isn't one of the aims of RF to uncover relationships where a correlation might not be evident, and in part to reveal combinations of features that give the targeted outcome? Also with the RF, I am a bit concerned that you could not use the leave-one-out approach because it was "unstable" - presumably that means that you obtain quite different results if you leave out a season. How robust are these results, and what are the most sensitive aspects? Are the same variables typically high in importance if you leave out a season, for example? What does the scatterplot of observed vs predicted epidemic size (as in Fig 7) look like if each prediction is for the one that was left out (i.e. from a model trained on all the rest)? In my experience, where the RF is "unstable", that can look pretty terrible even if the model trained on all the data looks great (as does Figure 7). In any case I think it's worth discussing sensitivity.

      (1) In response to the reviewer’s first question, we explain our rationale for not including all candidate predictors in random forest and penalized regression models. 

      Models trained with different combinations of predictors can have similar performance, and these combinations of predictors can include variables that do not necessarily have strong univariate associations with the target variable. The performance of random forest and LASSO regression models is not sensitive to redundant or irrelevant predictors (see Figure 10.2 in Kuhn & Johnson, 2019). However, if our goal is variable selection rather than strictly model performance, it is considered best practice to remove collinear, redundant, and/or irrelevant variables prior to training models (see section 11.3 in Kuhn & Johnson, 2019). In both random forest and LASSO regression models, if there are highly collinear variables that are useful for predicting the target variable, the predictor chosen by the model becomes a random selection. In random forest models, these highly collinear variables will be used in all splits across the forest of decision trees, and this redundancy dilutes variable importance scores. Thus, failing to minimize multicollinearity prior to model training could result in some variables having low rankings and the appearance of being unimportant, because their importance scores are overshadowed by those of the highly correlated variables. Our rationale for preprocessing predictor data follows the philosophy of Kuhn & Johnson, 2019, who recommend including the minimum possible set of variables that does not compromise model performance. Even if a particular model is insensitive to extra predictors, Kuhn and Johnson explain that “removing predictors can reduce the cost of acquiring data or improve the throughput of the software used to make predictions.”

      In the revised manuscript, we include more details about our steps for preprocessing predictor data. We also follow the reviewer’s suggestion to include all evolutionary predictors in variable selection analyses, regardless of whether they have strong univariate correlations with target outcomes, because the performance of random forest and LASSO regression models is not affected by redundant predictors. 

      Including additional predictors in our variable selection analyses does not change our conclusions. As reported in our original manuscript, predictors with strong univariate correlations with various epidemic metrics were the highest ranked features in both random forest and LASSO regression models.

      Lines 523-563:

      “Preprocessing of predictor data: The starting set of candidate predictors included all viral fitness metrics: genetic and antigenic distances between current and previously circulating strains and the standard deviation and Shannon diversity of H3 and N2 LBI values in the current season. To account for potential type or subtype interference, we included A(H1N1) or A(H1N1)pdm09 epidemic size and B epidemic size in the current and prior season and the dominant IAV subtype in the prior season (Lee et al., 2018). We included A(H3N2) epidemic size in the prior season as a proxy for prior natural immunity to A(H3N2). To account for vaccine-induced immunity, we considered four categories of predictors and included estimates for the current and prior seasons: national vaccination coverage among adults (18-49 years coverage × ≥ 65 years coverage), adjusted A(H3N2) vaccine effectiveness (VE), a combined metric of vaccination coverage and A(H3N2) VE (18-49 years coverage × ≥ 65 years coverage × VE), and H3 and N2 epitope distances between naturally circulating A(H3N2) viruses and the U.S. A(H3N2) vaccine strain in each season. We could not include a predictor for vaccination coverage in children or consider clade-specific VE estimates, because these data were not available for most seasons in our study.

      Random forest and LASSO regression models are not sensitive to redundant (highly collinear) features (Kuhn & Johnson, 2019), but we chose to downsize the original set of candidate predictors to minimize the impact of multicollinearity on variable importance scores. For both types of models, if there are highly collinear variables that are useful for predicting the target variable, the predictor chosen by the model becomes a random selection (Kuhn & Johnson, 2019). In random forest models, these highly collinear variables will be used in all splits across the forest of decision trees, and this redundancy dilutes variable importance scores (Kuhn & Johnson, 2019). We first confirmed that none of the candidate predictors had zero variance or near-zero variance. Because seasonal lags of each viral fitness metric are highly collinear, we included only one lag of each evolutionary predictor, with a preference for the lag that had the strongest univariate correlations with various epidemic metrics. We checked for multicollinearity among the remaining predictors by examining Spearman’s rank correlation coefficients between all pairs of predictors. If a particular pair of predictors was highly correlated (Spearman’s 𝜌 > 0.8), we retained only one predictor from that pair, with a preference for the predictor that had the strongest univariate correlations with various epidemic metrics. Lastly, we performed QR decomposition of the matrix of remaining predictors to determine if the matrix is full rank and identify sets of columns involved in linear dependencies. This step did not eliminate any additional predictors, given that we had already removed pairs of highly collinear variables based on Spearman correlation coefficients. 

      After these preprocessing steps, our final set of model predictors included 21 variables, including 8 viral evolutionary indicators: H3 epitope distance (t – 2), HI log2 titer distance (t – 2), H3 RBS distance (t – 2), H3 non-epitope distance (t – 2), N2 epitope distance (t – 1), N2 non-epitope distance (t – 1), and H3 and N2 LBI diversity (s.d.) in the current season; 6 proxies for type/subtype interference and prior immunity: A(H1N1) and B epidemic sizes in the current and prior season, A(H3N2) epidemic size in the prior season, and the dominant IAV subtype in the prior season; and 7 proxies for vaccine-induced immunity: A(H3N2) VE in the current and prior season, H3 and N2 epitope distances between circulating strains and the vaccine strain in each season, the combined metric of adult vaccination coverage × VE in the current and prior season, and adult vaccination coverage in the prior season.”
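      The pairwise collinearity filter described in the quoted preprocessing text can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the predictor names and values are hypothetical, the absolute value of Spearman's ρ is compared against the 0.8 threshold, and "strongest univariate correlation" is reduced to a single target variable rather than several epidemic metrics.

```python
from itertools import combinations
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def drop_collinear(predictors, target, threshold=0.8):
    """For each highly correlated pair, keep the predictor that is more
    strongly correlated with the target (assumption: a single target)."""
    strength = {name: abs(spearman(vals, target)) for name, vals in predictors.items()}
    kept = set(predictors)
    for a, b in combinations(sorted(predictors), 2):
        if a in kept and b in kept:
            if abs(spearman(predictors[a], predictors[b])) > threshold:
                kept.discard(a if strength[a] < strength[b] else b)
    return sorted(kept)

# Hypothetical predictors measured over six seasons, for illustration only.
predictors = {
    "pred_a": [1, 2, 3, 4, 5, 6],
    "pred_b": [1, 2, 3, 4, 6, 5],  # nearly the same ordering as pred_a
    "pred_c": [3, 1, 6, 2, 5, 4],
}
target = [2, 1, 4, 3, 6, 5]

selected = drop_collinear(predictors, target)
print(selected)
```

      Here `pred_a` and `pred_b` exceed the threshold, so only the one with the stronger rank correlation to the target is retained. The QR-decomposition check for linear dependencies mentioned in the text is a separate step and is not shown.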

      (2) Next, we clarify our model training methodology to address the reviewer’s second point about using a leave-one-out cross-validation approach.

      We believe the reviewer is mistaken; we use a leave-one-season-out validation approach which lends some robustness to the predictions. In our original submission, we stated “We created each forest by generating 3,000 regression trees from 10 repeats of a leave-one-season-out (jackknife) cross-validated sample of the data. Due to the small size of our dataset, evaluating the predictive accuracy of random forest models on a quasi-independent test set produced unstable estimates.” (Lines 813-816 in the original manuscript)

      To clarify, we use leave-one-season-out cross-validation to train models and measure model performance, wherein each “assessment” set contains one season of data (predicted by the model), and the corresponding “analysis” set (“fold”) contains the remaining seasons. This approach is roughly analogous to splitting data into training and test sets, but all seasons are used at some point in the training of the model (see Section 3.4 in Kuhn & Johnson, 2019). To reduce noise, we generated 10 bootstrap resamples of each fold and averaged the RMSE and R² values of model predictions from resamples.
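
      A minimal sketch of this resampling scheme, assuming each row is a hypothetical `(season, x, y)` tuple and using a trivial mean-only model (`fit_mean`) as a stand-in for the random forest. None of these names come from the authors' repository.

```python
import random


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)) ** 0.5


def loso_folds(rows):
    """Yield (held_out_season, analysis_rows, assessment_rows), one fold per season."""
    for held_out in sorted({season for season, _, _ in rows}):
        analysis = [r for r in rows if r[0] != held_out]
        assessment = [r for r in rows if r[0] == held_out]
        yield held_out, analysis, assessment


def loso_cv_rmse(rows, fit, n_resamples=10, seed=0):
    """Average RMSE over bootstrap resamples of every leave-one-season-out fold."""
    rng = random.Random(seed)
    scores = []
    for _, analysis, assessment in loso_folds(rows):
        observed = [y for _, _, y in assessment]
        for _ in range(n_resamples):
            # Bootstrap: sample the analysis set with replacement.
            resample = rng.choices(analysis, k=len(analysis))
            predict = fit(resample)
            scores.append(rmse(observed, [predict(x) for _, x, _ in assessment]))
    return sum(scores) / len(scores)


def fit_mean(train_rows):
    """Hypothetical stand-in model: always predicts the training mean of y."""
    mean_y = sum(y for _, _, y in train_rows) / len(train_rows)
    return lambda x: mean_y


rows = [(s, float(i), 2.0) for s in ("1997-1998", "1998-1999", "1999-2000") for i in range(4)]
print(loso_cv_rmse(rows, fit_mean))
```

      Every season is held out exactly once, so each observation contributes to both training and assessment across folds, but never within the same fold.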

      Although it would be ideal and best practice to measure model performance with an independent test set, our dataset includes only ~20 seasons. We found that predictions of independent test sets of 2-3 seasons had unstable performance, which indicates we do not have sufficient power to measure model performance with a test set this small. Further, we suspect that large antigenic jumps in a small subset of seasons further contribute to variation in prediction accuracy across randomly selected test sets. Our rationale for using cross-validation instead of an independent test set is best described in Section 4.3 of Kuhn and Johnson’s book “Applied Predictive Modeling” (Kuhn & Johnson, 2013):

      “When the number of samples is not large, a strong case can be made that a test set should be avoided because every sample may be needed for model building. Additionally, the size of the test set may not have sufficient power or precision to make reasonable judgements. Several researchers (Molinaro 2005; Martin and Hirschberg 1996; Hawkins et al. 2003) show that validation using a single test set can be a poor choice. Hawkins et al. (2003) concisely summarize this point: “holdout samples of tolerable size [...] do not match the cross-validation itself for reliability in assessing model fit and are hard to motivate.” Resampling methods, such as cross-validation, can be used to produce appropriate estimates of model performance using the training set. These are discussed in length in Sect. 4.4. Although resampling techniques can be misapplied, such as the example shown in Ambroise and McLachlan (2002), they often produce performance estimates superior to a single test set because they evaluate many alternate versions of the data.”

      In our revised manuscript, we provide additional clarification of our methods (Lines 574-590):

      “We created each forest by generating 3,000 regression trees. To determine the best performing model for each epidemic metric, we used leave-one-season-out (jackknife) cross-validation to train models and measure model performance, wherein each “assessment” set is one season of data predicted by the model, and the corresponding “analysis” set contains the remaining seasons. This approach is roughly analogous to splitting data into training and test sets, but all seasons are used at some point in the training of each model (Kuhn & Johnson, 2019). Due to the small size of our dataset (~20 seasons), evaluating the predictive accuracy of random forest models on a quasi-independent test set of 2-3 seasons produced unstable estimates. Instead of testing model performance on an independent test set, we generated 10 bootstrap resamples (“repeats”) of each analysis set (“fold”) and averaged the predictions of models trained on resamples (Kuhn & Johnson, 2013, 2019). For each epidemic metric, we report the mean root mean squared error (RMSE) and R² of predictions from the best tuned model. We used permutation importance (N = 50 permutations) to estimate the relative importance of each predictor in determining target outcomes. Permutation importance is the decrease in prediction accuracy when a single feature (predictor) is randomly permuted, with larger values indicating more important variables. Because many features were collinear, we used conditional permutation importance to compute feature importance scores, rather than the standard marginal procedure (Altmann et al., 2010; Debeer & Strobl, 2020; Strobl et al., 2008; Strobl et al., 2007).”
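
      To make the definition of permutation importance concrete, the sketch below implements the standard marginal procedure, which is simpler than the conditional variant the authors report using. The function name, the dict-per-row data layout, and the toy model are hypothetical; loss of accuracy is measured here as the increase in RMSE.

```python
import random


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)) ** 0.5


def permutation_importance(predict, X, y, n_permutations=50, seed=0):
    """Mean increase in RMSE when each feature is shuffled on its own.

    predict: callable taking one row (dict of feature -> value).
    X:       list of rows; y: list of observed targets.
    """
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importance = {}
    for feature in X[0]:
        increases = []
        for _ in range(n_permutations):
            shuffled = [row[feature] for row in X]
            rng.shuffle(shuffled)
            # Replace only this feature's column; all other columns stay intact.
            permuted = [dict(row, **{feature: v}) for row, v in zip(X, shuffled)]
            increases.append(rmse(y, [predict(r) for r in permuted]) - baseline)
        importance[feature] = sum(increases) / n_permutations
    return importance


X = [{"a": float(i), "b": float((i * 7) % 10)} for i in range(10)]
y = [2.0 * row["a"] for row in X]
print(permutation_importance(lambda row: 2.0 * row["a"], X, y))
```

      In this toy case the model ignores feature `b`, so shuffling it costs nothing. With collinear predictors the marginal procedure can inflate the importance of correlated features, which is the motivation the text gives for the conditional version.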

      (3) In response to the reviewer’s question about the sensitivity of results when one season is left out, we clarify that the variable importance scores in Figure 8 and model predictions in Figure 9 were generated by models tuned using leave-one-season-out cross-validation. 

      As explained above, in our leave-one-season-out cross-validation approach, each “assessment” set contains one season of data predicted by the model, and the corresponding “analysis” set (“fold”) contains the remaining seasons. We generated predictions of epidemic metrics and variable importance rankings by averaging the model output of 10 bootstrap resamples of each cross-validation fold. 

      In Lines 791-806, we describe which epidemic metrics have the highest prediction accuracy and report that random forest models tend to underpredict most epidemic metrics in seasons with high antigenic novelty:

      “We measured correlations between observed values and model-predicted values at the HHS region level. Among the various epidemic metrics, random forest models produced the most accurate predictions of A(H3N2) subtype dominance (Spearman’s 𝜌 = 0.95, regional range = 0.85 – 0.97), peak incidence (𝜌 = 0.91, regional range = 0.72 – 0.95), and epidemic size (𝜌 = 0.9, regional range = 0.74 – 0.95), while predictions of effective Rt and epidemic intensity were less accurate (𝜌 = 0.81, regional range = 0.65 – 0.91; 𝜌 = 0.78, regional range = 0.63 – 0.92, respectively) (Figure 9). Random forest models tended to underpredict most epidemic targets in seasons with substantial H3 antigenic transitions, in particular the SY97 cluster seasons (1998-1999, 1999-2000) and the FU02 cluster season (2003-2004) (Figure 9). 

      For epidemic size and peak incidence, seasonal predictive error – the root-mean-square error (RMSE) across all regional predictions in a season – increased with H3 epitope distance (epidemic size, Spearman’s 𝜌 = 0.51, P = 0.02; peak incidence, 𝜌 = 0.63, P = 0.004) and N2 epitope distance (epidemic size, 𝜌 = 0.48, P = 0.04; peak incidence, 𝜌 = 0.48, P = 0.03) (Figure 9 – figure supplements 1 – 2). For models of epidemic intensity, seasonal RMSE increased with N2 epitope distance (𝜌 = 0.64, P = 0.004) but not H3 epitope distance (𝜌 = 0.06, P = 0.8) (Figure 9 – figure supplements 1 – 2). Seasonal RMSE of effective Rt and subtype dominance predictions did not correlate with H3 or N2 epitope distance (Figure 9 – figure supplements 1 – 2).”
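
      The seasonal predictive error defined above (RMSE across all regional predictions in a season) can be computed as in this short sketch. The record layout and the toy numbers are hypothetical and are not values from the study.

```python
from collections import defaultdict
from math import sqrt


def seasonal_rmse(records):
    """Group regional predictions by season and return one RMSE per season.

    records: iterable of (season, observed, predicted), one entry per region.
    """
    squared_errors = defaultdict(list)
    for season, observed, predicted in records:
        squared_errors[season].append((observed - predicted) ** 2)
    return {s: sqrt(sum(v) / len(v)) for s, v in squared_errors.items()}


records = [
    ("1998-1999", 3.0, 1.0),   # region 1
    ("1998-1999", 5.0, 7.0),   # region 2
    ("2003-2004", 2.0, 2.0),   # region 1
]
print(seasonal_rmse(records))
```

      The resulting per-season values are what would then be correlated against H3 or N2 epitope distance.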

      I think the competition (interference) results are really interesting, perhaps among the most interesting aspects of this work. 

      Thank you! We agree that our finding that subtype interference has a greater impact than viral evolution on A(H3N2) epidemics is one of the more interesting results in the study.

      Have you seen the paper by Barrat-Charlaix et al? They found that LBI was not good predicting frequency dynamics (see https://pubmed.ncbi.nlm.nih.gov/33749787/); instead, LBI was high for sequences like the consensus sequence, which was near to future strains. LBI also was not positively correlated with epidemic impact in Figure S7.

      The local branching index (LBI) measures the rate of recent phylogenetic branching and approximates relative fitness among viral clades, with high LBI values representing greater fitness (Neher et al. 2014).

      Two of this study’s co-authors (John Huddleston and Trevor Bedford) are also co-authors of Barrat-Charlaix et al. 2021. Barrat-Charlaix et al. 2021 assessed the performance of LBI in predicting the frequency dynamics and fixation of individual amino acid substitutions in A(H3N2) viruses. Our study is not focused on predicting the future success of A(H3N2) clades or the frequency dynamics or probability of fixation of individual substitutions. Instead, we use the standard deviation and Shannon diversity of LBI values in each season as a proxy for genealogical (clade-level) diversity. We find that, at a seasonal level, low diversity of H3 or N2 LBI values in the current season correlates with greater epidemic intensity, higher transmission rates, and shorter seasonal duration.

      In the Discussion we provide an explanation for these correlation results (Lines 848-857): 

      “The local branching index (LBI) is traditionally used to predict the success of individual clades, with high LBI values indicating high viral fitness (Huddleston et al., 2020; Neher et al., 2014). In our epidemiological analysis, low diversity of H3 or N2 LBI in the current season correlated with greater epidemic intensity, higher transmission rates, and shorter seasonal duration. These associations suggest that low LBI diversity is indicative of a rapid selective sweep by one successful clade, while high LBI diversity is indicative of multiple co-circulating clades with variable seeding and establishment times over the course of an epidemic. A caveat is that LBI estimation is more sensitive to sequence sub-sampling schemes than strain-level measures. If an epidemic is short and intense (e.g., 1-2 months), a phylogenetic tree with our sub-sampling scheme (50 sequences per month) may not incorporate enough sequences to capture the true diversity of LBI values in that season.”

      Figure 1 - LBI goes up over time. Is that partly to do with sampling? Overall how do higher sampling volumes in later years impact this analysis? (though you choose a fixed number of sequences so I guess you downsample to cope with that). I note that LBI is likely to be sensitive to sequencing density. 

      Thank you for pointing this out. We realized that increasing LBI Shannon diversity over the course of the study period was indeed an artefact of increasing sequence volume over time. Our sequence subsampling scheme involves selecting a random sample of up to 50 viruses per month, with up to 25 viruses selected from North America (if available) and the remaining sequences evenly divided across nine other global regions. In early seasons of the study (late 1990s/early 2000s), sampling was often too sparse to meet the 25 viruses/month threshold for North America or for the other global regions combined (H3: Figure 2 - figure supplement 1; N2: Figure 2 - figure supplement 2). Ecological diversity metrics are sensitive to sample size, which explains why LBI Shannon diversity appeared to steadily increase over time in our original submission. In our revised manuscript, we correct for uneven sample sizes across seasons before estimating Shannon diversity and clarify our methodology. 

      Lines 443-482: 

      “Clade growth: The local branching index (LBI) measures the relative fitness of co-circulating clades, with high LBI values indicating recent rapid phylogenetic branching (Huddleston et al., 2020; Neher et al., 2014). To calculate LBI for each H3 and N2 sequence, we applied the LBI heuristic algorithm as originally described by Neher et al., 2014 to H3 and N2 phylogenetic trees, respectively. We set the neighborhood parameter 𝜏 to 0.4 and only considered viruses sampled between the current season 𝑡 and the previous season 𝑡 – 1 as contributing to recent clade growth in the current season 𝑡.  

      Variation in the phylogenetic branching rates of co-circulating A(H3N2) clades may affect the magnitude, intensity, onset, or duration of seasonal epidemics. For example, we expected that seasons dominated by a single variant with high fitness might have different epidemiological dynamics than seasons with multiple co-circulating clades with varying seeding and establishment times. We measured the diversity of clade growth rates of viruses circulating in each season by measuring the standard deviation (s.d.) and Shannon diversity of LBI values in each season. Given that LBI measures relative fitness among co-circulating clades, we did not compare overall clade growth rates (e.g., mean LBI) across seasons.

      Each season’s distribution of LBI values is right-skewed and does not follow a normal distribution. We therefore bootstrapped the LBI values of each season in each replicate dataset 1000 times (1000 samples with replacement) and estimated the seasonal standard deviation of LBI from resamples, rather than directly from observed LBI values. We also tested the seasonal standard deviation of LBI from log transformed LBI values, which produced qualitatively equivalent results to bootstrapped LBI values in downstream analyses.

      As an alternative measure of seasonal LBI diversity, we binned raw H3 and N2 LBI values into categories based on their integer values (e.g., an LBI value of 0.5 is assigned to the (0,1] bin) and estimated the exponential of the Shannon entropy (Shannon diversity) of LBI categories (Hill, 1973; Shannon, 1948). The Shannon diversity of LBI considers both the richness and relative abundance of viral clades with different growth rates in each season and is calculated as follows:

      $$ {}^{1}D = \exp\left(-\sum_{i=1}^{R} p_i \ln p_i\right) $$

      where $^{q}D$ is the effective number of categories or Hill numbers of order $q$ (here, clades with different growth rates), with $q$ defining the sensitivity of the true diversity to rare versus abundant categories (Hill, 1973). exp is the exponential function, $p_i$ is the proportion of LBI values belonging to the $i$th category, and $R$ is richness (the total number of categories). Shannon diversity $^{1}D$ ($q = 1$) estimates the effective number of categories in an assemblage using the geometric mean of their proportional abundances $p_i$ (Hill, 1973).

      Because ecological diversity metrics are sensitive to sampling effort, we rarefied H3 and N2 sequence datasets prior to estimating Shannon diversity so that seasons had the same sample size. For each season in each replicate dataset, we constructed rarefaction and extrapolation curves of LBI Shannon diversity and extracted the Shannon diversity estimate of the sample size that was twice the size of the reference sample size (the smallest number of sequences obtained in any season during the study) (iNEXT R package) (Chao et al., 2014). Chao et al. found that their diversity estimators work well for rarefaction and short-range extrapolation when the extrapolated sample size is up to twice the reference sample size. For H3, we estimated seasonal diversity using replicate datasets subsampled to 360 sequences/season; for N2, datasets were subsampled to 230 sequences/season.”

      Estimating the Shannon diversity of LBI from datasets with even sampling across seasons removes the previous secular trend of increasing LBI diversity over time (Figure 2 in revised manuscript).
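
      The binning and Shannon diversity calculation quoted above can be sketched in a few lines. This is an illustration only: the function name and toy LBI values are hypothetical, and the rarefaction and extrapolation step (done with the iNEXT R package in the manuscript) is not reproduced here.

```python
from collections import Counter
from math import ceil, exp, log


def shannon_diversity(lbi_values):
    """Exponential of the Shannon entropy of integer-binned LBI values.

    A value in the half-open interval (k - 1, k] is assigned to bin k,
    so 0.5 falls in the (0, 1] bin.
    """
    bins = Counter(ceil(v) for v in lbi_values)
    n = sum(bins.values())
    entropy = -sum((count / n) * log(count / n) for count in bins.values())
    return exp(entropy)


# Four equally populated bins behave like four "effective" categories.
print(shannon_diversity([0.5, 1.5, 2.5, 3.5]))
# All values in one bin give an effective number of categories of 1.
print(shannon_diversity([0.2, 0.7, 1.0]))
```

      Because the result is an effective number of categories, it grows with the number of sequences sampled when rare bins are undersampled, which is why equalizing sample sizes across seasons matters.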

      Figure 3 - I wondered what about the co-dominant times? 

      In Figure 3, orange points correspond to seasons in which A(H3N2) and A(H1N1) were codominant. We are not sure of the reviewer’s specific question concerning codominant seasons, but if it concerns whether antigenic drift is linked to epidemic magnitude among codominant seasons alone, we cannot perform separate regression analyses for these seasons because there are only two codominant seasons during the 22-season study period.

      Figure 4 - Related to drift and epidemic size, dominance, etc. -- when is drift measured, and (if it's measured in season t), would larger populations create more drift, simply by having access to more opportunity (via a larger viral population size)? This is a bit 'devil's advocate' but what if some epidemiological/behavioural process causes a larger and/or later peak, and those gave rise to higher drift?

      Seasonal drift is measured as the genetic or antigenic distance between viruses circulating during season t and viruses circulating in the prior season (𝑡 – 1) or two seasons ago (𝑡 – 2).

      Concerning the question about whether larger human populations lead to greater rates of antigenic drift, phylogeographic studies have repeatedly found that East-South-Southeast Asia are the source populations for A(H3N2) viruses (Bedford et al., 2015; Lemey et al., 2014), in part because these regions have tropical or subtropical climates and larger human populations, which enable year-round circulation and higher background infection rates. Larger viral populations (via larger host population sizes) and uninterrupted transmission may increase the efficiency of selection and the probability of strain survival and global spread (Wen et al., 2016). After A(H3N2) variants emerge in East-South-Southeast Asia and spread to other parts of the world, A(H3N2) viruses circulate via overlapping epidemics rather than local persistence (Bedford et al., 2015; Rambaut et al., 2008). Each season, A(H3N2) outbreaks in the US (and other temperate regions) are seeded by case importations from outside the US, genetic diversity peaks during the winter, and a strong genetic bottleneck typically occurs at the end of the season (Rambaut et al., 2008).

      Due to their faster rates of antigenic evolution, A(H3N2) viruses undergo more rapid clade turnover and dissemination than A(H1N1) and B viruses, despite similar global migration networks across A(H3N2), A(H1N1), and B viruses (Bedford et al., 2015). Bedford et al. speculate that there is typically little geographic differentiation in A(H3N2) viruses circulating in each season because A(H3N2) viruses tend to infect adults, and adults are more mobile than children. Compared to A(H3N2) viruses, A(H1N1) and B viruses tend to have greater genealogical diversity, geographic differentiation, and longer local persistence times (Bedford et al., 2015; Rambaut et al., 2008). Thus, some A(H1N1) and B epidemics are reseeded by viruses that have persisted locally since prior epidemics (Bedford et al., 2015).

      Theoretical models have shown that epidemiological processes can influence rates of antigenic evolution (Recker et al., 2007; Wen et al., 2016; Zinder et al., 2013), though the impact of flu epidemiology on viral evolution is likely constrained by the virus’s intrinsic mutation rate. 

      In conclusion, larger host population sizes and flu epidemiology can indeed influence rates of antigenic evolution. However, given that our study is US-centric and focuses on A(H3N2) viruses, these factors are likely not at play in our study, due to intrinsic biological characteristics of A(H3N2) viruses and the geographic location of our study.

      We have added a clarifying sentence to the end of the Introduction to narrow the scope of the paper for the reader.

      Lines 114-116: “Rather than characterize in situ evolution of A(H3N2) lineages circulating in the U.S., we study the epidemiological impacts of antigenic drift once A(H3N2) variants have arrived on U.S. soil and managed to establish and circulate at relatively high levels.”

      Methods -- 

      L 620 about rescaling and pre- vs post-pandemic times : tell us more - how has reporting changed? could any of this not be because of reporting but because of NPIs or otherwise? Overall there is a lot of rescaling going on. How sensitive are the results to it? 

      it would be unreasonable to ask for a sensitivity analysis for all the results for all the choices around data preparation, but some idea where there is a reason to think there might be a dependence on one of these choices would be great.

      In response to the 2009 A(H1N1) pandemic, the US CDC and WHO increased laboratory testing capacity and strengthened epidemiological networks, leading to substantial, long-lasting improvements to influenza surveillance that are still in place today (https://www.cdc.gov/flu/weekly/overview.htm). At the beginning of the COVID-19 pandemic, influenza surveillance networks were quickly adapted to detect and understand the spread of SARS-CoV-2. The 2009 pandemic occurred over a time span of less than one year, and strict non-pharmaceutical interventions (NPIs), such as lockdowns and mask mandates, were not implemented. Thus, we attribute increases in test volume during the post-2009 period to improved virologic surveillance and laboratory testing capacity rather than changes in care-seeking behavior. In the revised manuscript, we include a figure (Figure 1 - figure supplement 2) that shows systematic increases in test volume in all HHS regions after the 2009 pandemic.

      Given the substantial increase in influenza test volume after 2009, we opted to keep the time trend adjustment for the pre- and post-2009 pandemic periods and evaluate whether adjusting for regional reporting differences affects our results. When estimating univariate correlations between various A(H3N2) epidemic metrics and evolutionary indicators, we found qualitatively equivalent results for Spearman correlations and regression models, when adjusting for the pre- and post-2009 pandemic time periods and regional reporting versus only adjusting for the pre-/post-2009 pandemic time periods. Below, we share adjusted versions of Figure 3 (regression results) and Figure 3 - figure supplement 1 (Spearman correlations). Each figure only adjusts for differences in pre- and post-2009 pandemic reporting.

      Author response image 1.

      Adjustment for pre- and post-2009 pandemic only

      Author response image 2.

      Adjustment for pre- and post-2009 pandemic only

      L635 - Why discretize the continuous LBI distribution and then use Shannon entropy when you could just use the variance and/or higher moments? (or quantiles)? Similarly, why not use the duration of the peak, rather than Shannon entropy? (though there, because presumably data are already binned weekly, and using duration would involve defining start and stop times, it's more natural than with LBI)

      We realize that we failed to mention in the methods that we calculated the standard deviation of LBI in each season, in addition to the exponential of the Shannon entropy (Shannon diversity) of LBI. Both the Shannon diversity of LBI values and the standard deviation of LBI values were negatively correlated with effective Rt and epidemic intensity and positively correlated with seasonal duration. The two measures were similarly correlated with effective Rt and epidemic intensity (Figure 3 - figure supplements 2 - 3), while the Shannon diversity of LBI had slightly stronger correlations with seasonal duration than s.d. LBI (Figure 5). Thus, both measures of LBI diversity appear to capture potentially biologically important heterogeneities in clade growth rates.
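
      For the standard deviation measure, the manuscript excerpt quoted earlier describes bootstrapping each season's right-skewed LBI values 1000 times. A minimal sketch follows, assuming the seasonal estimate is the mean of the resample standard deviations; that summary, the function name, and the toy values are assumptions, not details confirmed by the text.

```python
import random
from statistics import mean, stdev


def bootstrap_sd(values, n_boot=1000, seed=0):
    """Mean of the sample standard deviations across bootstrap resamples.

    Each resample draws len(values) observations with replacement.
    Requires at least two observations.
    """
    rng = random.Random(seed)
    n = len(values)
    return mean(stdev(rng.choices(values, k=n)) for _ in range(n_boot))


# Hypothetical right-skewed LBI values for one season.
skewed = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.9, 1.8, 3.5, 7.0]
print(bootstrap_sd(skewed, n_boot=200, seed=1))
```

      Fixing the seed makes the estimate reproducible, which helps when comparing seasons across replicate datasets.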

      Separately, we use the inverse Shannon entropy of the incidence distribution to measure the spread of an A(H3N2) epidemic during the season, following the methods of Dalziel et al. 2018. The peak of an epidemic is a single time point at which the maximum incidence occurs. We have not encountered “the duration of the peak” before in epidemiology terminology, and, to our knowledge, there is not a robust way to measure the “duration of a peak,” unless one were to measure the time span between multiple points of maximum incidence or designate an arbitrary threshold for peak incidence that is not strictly the maximum incidence. Given that Shannon entropy is based on the normalized incidence distribution over the course of the entire influenza season (week 40 to week 20), it does not require designating an arbitrary threshold to describe epidemic intensity.

      L642 - again why normalize epidemic intensities, and how sensitive are the results to this? I would imagine given that the RF results were unstable under leave-one-out analysis that some of those results could be quite sensitive to choices of normalization and scaling.

      Epidemic intensity, defined as the inverse Shannon entropy of the incidence distribution, measures the spread of influenza cases across the weeks in a season. Following Dalziel et al. 2018, we estimated epidemic intensity from normalized incidence distributions rather than raw incidences so that epidemic intensity is invariant under differences in reporting rates and/or attack rates across regions and seasons. If we were to use raw incidences instead, HHS regions or seasons could have the appearance of greater or lower epidemic intensity (i.e., incidence concentrated within a few weeks or spread out over several weeks), due to differences in attack rates or test volume, rather than fundamental differences in the shapes of their epidemic curves. In other words, epidemic intensity is intended to measure the shape and spread of an epidemic, regardless of the actual volume of cases in a given region or season.

      In the methods section, we provide further clarification for why epidemic intensities are based on normalized incidence distributions rather than raw incidences.

      Lines 206-209: “Epidemic intensity is intended to measure the shape and spread of an epidemic, regardless of the actual volume of cases in a given region or season. Following the methodology of Dalziel et al. 2018, epidemic intensity values were normalized to fall between 0 and 1 so that epidemic intensity is invariant to differences in reporting rates and/or attack rates across regions and seasons.”  
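
      The invariance argument can be checked numerically. The sketch below computes the inverse Shannon entropy of a normalized weekly incidence distribution and then rescales a set of values to [0, 1]. Min-max rescaling is an assumption about how the normalization was done, and the function names and toy incidence curves are hypothetical.

```python
from math import log


def inverse_entropy(weekly_incidence):
    """Inverse Shannon entropy of the normalized weekly incidence distribution."""
    total = sum(weekly_incidence)
    # Weeks with zero incidence contribute nothing to the entropy.
    p = [x / total for x in weekly_incidence if x > 0]
    entropy = -sum(q * log(q) for q in p)
    if entropy == 0:
        raise ValueError("all incidence falls in one week; inverse entropy is undefined")
    return 1.0 / entropy


def rescale_01(values):
    """Min-max rescale to [0, 1] (assumed normalization, see lead-in)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale identical values")
    return [(v - lo) / (hi - lo) for v in values]


# Multiplying incidence by a constant (e.g., a higher reporting rate)
# leaves the normalized distribution, and hence the intensity, unchanged.
print(inverse_entropy([1, 2, 3]), inverse_entropy([10, 20, 30]))
# A sharply peaked season is more intense than a flat one.
print(inverse_entropy([1, 50, 1]), inverse_entropy([17, 18, 17]))
```

      Only the shape of the epidemic curve enters the calculation, which is the property the response relies on when comparing regions with different test volumes.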

      L643 - more information about what goes into Epidemia (variables, priors) such that it's replicable/understandable without the code would be good. 

      We now include additional information concerning the epidemic models used to estimate Rt, including all model equations, variables, and priors (Lines 210-276 in Methods).

      L667 did you do breakpoint detection? Why linear models? Was log(incidence) used? 

      In our original submission, we estimated epidemic onsets using piecewise regression models (Lines 666-674 in original manuscript), which model non-linear relationships with breakpoints by iteratively fitting linear models (Muggeo, 2003). Piecewise regression falls under the umbrella of parametric methods for breakpoint detection.

      We did not include results from linear models fit to log(incidence) or GLMs with Gaussian error distributions and log links, for two reasons. First, models fit to log-transformed data require non-zero values as inputs. Although breakpoint detection does not necessarily require weeks of zero incidence leading up to the start of an outbreak, limiting the time period for breakpoint detection to weeks with non-zero incidence (so that we could use log-transformed incidence) substantially pushed back onset estimates relative to our previous, more biologically plausible estimates of epidemic onset weeks. Second, as an alternative to limiting the dataset to weeks with non-zero incidence, we tried adding a small positive number to weekly incidences so that we could fit models to log-transformed incidence for the whole time period spanning epidemic week 40 (the start of the influenza season) to the first week of maximum incidence. Fitting models to log-transformed incidences produced unrealistic breakpoint locations, potentially because log transformations 1) linearize data, and 2) stabilize variance by reducing the impact of extreme values. Due to the short time span used for breakpoint detection, log transforming incidence diminishes abrupt changes in incidence at the beginning of outbreaks, making it difficult for models to estimate biologically plausible breakpoint locations. Log transformations of incidence may be more useful when analyzing time series spanning multiple seasons, rather than short time spans with sharp changes in incidence (i.e., the exponential growth phase of a single flu outbreak).

      As an alternative to piecewise regression, our revised manuscript also estimates epidemic onsets using a Bayesian ensemble algorithm that accounts for the time series nature of incidence data and allows for complex, non-linear trajectories interspersed with change points (BEAST - a Bayesian estimator of Abrupt change, Seasonal change, and Trend; Zhao et al., 2019). Although a few regional onset times differed across the two methods, our conclusions did not change concerning correlations between viral fitness and epidemic onset timing.

      We have rewritten the methods section for estimating epidemic onsets to clarify our methodology and to include the BEAST method (Lines 292-308):

      “We estimated the regional onsets of A(H3N2) virus epidemics by detecting breakpoints in A(H3N2) incidence curves at the beginning of each season. The timing of the breakpoint in incidence represents epidemic establishment (i.e., sustained transmission) rather than the timing of influenza introduction or arrival (Charu et al., 2017). We used two methods to estimate epidemic onsets: 1) piecewise regression, which models non-linear relationships with break points by iteratively fitting linear models to each segment (segmented R package) (Muggeo, 2008; Muggeo, 2003), and 2) a Bayesian ensemble algorithm (BEAST – a Bayesian estimator of Abrupt change, Seasonal change, and Trend) that explicitly accounts for the time series nature of incidence data and allows for complex, non-linear trajectories interspersed with change points (Rbeast R package) (Zhao et al., 2019). For each region in each season, we limited the time period of breakpoint detection to epidemic week 40 to the first week of maximum incidence and did not estimate epidemic onsets for regions with insufficient signal, which we defined as fewer than three weeks of consecutive incidence and/or greater than 30% of weeks with missing data. We successfully estimated A(H3N2) onset timing for most seasons, except for three A(H1N1) dominant seasons: 2000-2001 (0 regions), 2002-2003 (3 regions), and 2009-2010 (0 regions). Estimates of epidemic onset weeks were similar when using piecewise regression versus the BEAST method, and downstream analyses of correlations between viral fitness indicators and onset timing produced equivalent results. We therefore report results from onsets estimated via piecewise regression.”
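
      To illustrate the idea behind breakpoint-based onset detection, the sketch below uses a brute-force search: it tries every admissible split, fits a separate least-squares line to each side, and keeps the split with the smallest total squared error. This is a deliberate simplification and not the iterative, continuity-constrained algorithm of the segmented R package, nor the BEAST method. The function names and toy series are hypothetical, and the input is assumed to be already restricted to the window from epidemic week 40 to the first week of maximum incidence.

```python
def line_sse(x, y):
    """Sum of squared residuals of the least-squares line through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


def find_breakpoint(weeks, incidence, min_points=3):
    """Return the first week of the second segment at the best two-line split."""
    best_week, best_sse = None, float("inf")
    for k in range(min_points, len(weeks) - min_points + 1):
        sse = line_sse(weeks[:k], incidence[:k]) + line_sse(weeks[k:], incidence[k:])
        if sse < best_sse:
            best_week, best_sse = weeks[k], sse
    return best_week


# Toy series: no incidence through week 9, then steady growth from week 10.
weeks = list(range(20))
incidence = [0.0] * 10 + [10.0 * (w - 9.5) for w in range(10, 20)]
print(find_breakpoint(weeks, incidence))
```

      Working on the raw scale keeps the abrupt change at outbreak onset visible to the fit, consistent with the response's point that log transformation flattens this signal over short windows.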

      L773 national indicators -- presumably this is because you don't have regional-level information, but it might be worth saying that earlier so it doesn't read like there are other indicators now, called national indicators, that we should have heard of 

      In the revised manuscript, we move a paragraph that was at the beginning of the Results to the beginning of the Methods.

      Lines 123-132: 

      “Our study focuses on the impact of A(H3N2) virus evolution on seasonal epidemics from seasons 1997-1998 to 2018-2019 in the U.S.; whenever possible, we make use of regionally disaggregated indicators and analyses. We start by identifying multiple indicators of influenza evolution each season based on changes in HA and NA. Next, we compile influenza virus subtype-specific incidence time series for U.S. Department of Health and Human Services (HHS) regions and estimate multiple indicators characterizing influenza A(H3N2) epidemic dynamics each season, including epidemic burden, severity, type/subtype dominance, timing, and the age distribution of cases. We then assess univariate relationships between national indicators of evolution and regional epidemic characteristics. Lastly, we use multivariable regression models and random forest models to measure the relative importance of viral evolution, heterosubtypic interference, and prior immunity in predicting regional A(H3N2) epidemic dynamics.”

      In Lines 484-487 in the Methods, we now mention that measures of seasonal antigenic and genetic distance are at the national level. 

      “For each replicate dataset, we estimated national-level genetic and antigenic distances between influenza viruses circulating in consecutive seasons by calculating the mean distance between viruses circulating in the current season 𝑡 and viruses circulating during the prior season (𝑡 – 1 year; one season lag) or two prior seasons ago (𝑡 – 2 years; two season lag).”
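
      A minimal sketch of a seasonal mean distance, assuming sequences are represented as equal-length aligned strings of epitope-site residues and that distance is the Hamming distance. The representation, function names, and toy sequences are hypothetical; antigenic (HI titer) distances would need a different distance function.

```python
def hamming(a, b):
    """Number of positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_seasonal_distance(current, previous):
    """Mean pairwise distance between viruses in season t and an earlier season."""
    total = sum(hamming(a, b) for a in current for b in previous)
    return total / (len(current) * len(previous))


season_t = ["AKT", "AKS"]      # toy epitope-site strings, season t
season_t_minus_1 = ["AKT"]     # toy epitope-site strings, season t - 1
print(mean_seasonal_distance(season_t, season_t_minus_1))
```

      Passing the sequences from season 𝑡 – 2 as the second argument gives the two-season lag.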

      L782 Why Beta regression and what is "the resampled dataset" ? 

      Beta regression is appropriate for models of subtype dominance, epidemic intensity, and age-specific proportions of ILI cases because these data are continuous and restricted to the interval (0, 1) (Ferrari & Cribari-Neto, 2004). “The resampled dataset” refers to the “1000 bootstrap replicates of the original dataset (1000 samples with replacement)” mentioned in Lines 777-778 of the original manuscript. 

      In the revised manuscript, we include more background information about Beta regression models, and explicitly mention that regression models were fit to 1000 bootstrap replicates of the original dataset.

      Lines 503-507: 

      “For subtype dominance, epidemic intensity, and age-specific proportions of ILI cases, we fit Beta regression models with logit links. Beta regression models are appropriate when the variable of interest is continuous and restricted to the interval (0, 1) (Ferrari & Cribari-Neto, 2004). For each epidemic metric, we fit the best-performing regression model to 1000 bootstrap replicates of the original dataset.”
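
      To make the model structure concrete, here is a minimal Python sketch of the Beta regression log-likelihood in the mean-precision parametrization of Ferrari & Cribari-Neto (2004), with a logit link, together with bootstrap resampling. The function names, the single-predictor form, and the example rows are assumptions for illustration; the sketch does not perform the optimization step and is not the authors' implementation.

```python
import math
import random


def inv_logit(eta: float) -> float:
    """Inverse of the logit link: maps a linear predictor to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def beta_logpdf(y: float, mu: float, phi: float) -> float:
    """Log density of Beta(mu * phi, (1 - mu) * phi) at y in (0, 1)."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))


def log_likelihood(ys, xs, b0, b1, phi):
    """Beta regression log-likelihood with logit(mu_i) = b0 + b1 * x_i."""
    return sum(beta_logpdf(y, inv_logit(b0 + b1 * x), phi)
               for y, x in zip(ys, xs))


def bootstrap_replicates(rows, n_rep=1000, seed=0):
    """Resample the rows with replacement, n_rep times."""
    rng = random.Random(seed)
    return [rng.choices(rows, k=len(rows)) for _ in range(n_rep)]


# Hypothetical (predictor, proportion) pairs; not data from the study.
rows = [(0.1, 0.22), (0.4, 0.35), (0.9, 0.61), (1.3, 0.78)]
replicates = bootstrap_replicates(rows)
```

      In a full analysis, the log-likelihood would be maximized separately on each of the 1000 replicates to obtain a distribution of coefficient estimates.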

      The github is clear, comprehensive and well-documented, at least at a brief glance. 

      Thank you! At the time of resubmission, our GitHub repository is updated to incorporate feedback from the reviewers.

      References

      Altmann, A., Tolosi, L., Sander, O., & Lengauer, T. (2010). Permutation importance: a corrected feature importance measure. Bioinformatics, 26(10), 1340-1347. https://doi.org/10.1093/bioinformatics/btq134

      Barrat-Charlaix, P., Huddleston, J., Bedford, T., & Neher, R. A. (2021). Limited Predictability of Amino Acid Substitutions in Seasonal Influenza Viruses. Mol Biol Evol, 38(7), 2767-2777. https://doi.org/10.1093/molbev/msab065

      Bedford, T., Riley, S., Barr, I. G., Broor, S., Chadha, M., Cox, N. J., Daniels, R. S., Gunasekaran, C. P., Hurt, A. C., Kelso, A., Klimov, A., Lewis, N. S., Li, X., McCauley, J. W., Odagiri, T., Potdar, V., Rambaut, A., Shu, Y., Skepner, E., . . . Russell, C. A. (2015). Global circulation patterns of seasonal influenza viruses vary with antigenic drift. Nature, 523(7559), 217-220. https://doi.org/10.1038/nature14460

      Chao, A., Gotelli, N. J., Hsieh, T. C., Sander, E. L., Ma, K. H., Colwell, R. K., & Ellison, A. M. (2014). Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species diversity studies. Ecological Monographs, 84(1), 45-67. https://doi.org/10.1890/13-0133.1

      Charu, V., Zeger, S., Gog, J., Bjornstad, O. N., Kissler, S., Simonsen, L., Grenfell, B. T., & Viboud, C. (2017). Human mobility and the spatial transmission of influenza in the United States. PLoS Comput Biol, 13(2), e1005382. https://doi.org/10.1371/journal.pcbi.1005382

      Dalziel, B. D., Kissler, S., Gog, J. R., Viboud, C., Bjornstad, O. N., Metcalf, C. J. E., & Grenfell, B. T. (2018). Urbanization and humidity shape the intensity of influenza epidemics in U.S. cities. Science, 362(6410), 75-79. https://doi.org/10.1126/science.aat6030

      Debeer, D., & Strobl, C. (2020). Conditional permutation importance revisited. BMC Bioinformatics, 21(1), 307. https://doi.org/10.1186/s12859-020-03622-2  

      Dhanasekaran, V., Sullivan, S., Edwards, K. M., Xie, R., Khvorov, A., Valkenburg, S. A., Cowling, B. J., & Barr, I. G. (2022). Human seasonal influenza under COVID-19 and the potential consequences of influenza lineage elimination. Nat Commun, 13(1), 1721. https://doi.org/10.1038/s41467-022-29402-5

      Ferrari, S., & Cribari-Neto, F. (2004). Beta Regression for Modelling Rates and Proportions. Journal of Applied Statistics, 31(7), 799-815. https://doi.org/10.1080/0266476042000214501  

      Garten, R. J., Davis, C. T., Russell, C. A., Shu, B., Lindstrom, S., Balish, A., Sessions, W. M., Xu, X., Skepner, E., Deyde, V., Okomo-Adhiambo, M., Gubareva, L., Barnes, J., Smith, C. B., Emery, S. L., Hillman, M. J., Rivailler, P., Smagala, J., de Graaf, M., . . . Cox, N. J. (2009). Antigenic and genetic characteristics of swine-origin 2009 A(H1N1) influenza viruses circulating in humans. Science, 325(5937), 197-201. https://doi.org/10.1126/science.1176225

      Grebe, K. M., Yewdell, J. W., & Bennink, J. R. (2008). Heterosubtypic immunity to influenza A virus: where do we stand? Microbes Infect, 10(9), 1024-1029. https://doi.org/10.1016/j.micinf.2008.07.002

      Hill, M. O. (1973). Diversity and Evenness: A Unifying Notation and Its Consequences. Ecology, 54(2), 427-432. https://doi.org/10.2307/1934352

      Huddleston, J., Barnes, J. R., Rowe, T., Xu, X., Kondor, R., Wentworth, D. E., Whittaker, L., Ermetal, B., Daniels, R. S., McCauley, J. W., Fujisaki, S., Nakamura, K., Kishida, N., Watanabe, S., Hasegawa, H., Barr, I., Subbarao, K., Barrat-Charlaix, P., Neher, R. A., & Bedford, T. (2020). Integrating genotypes and phenotypes improves long-term forecasts of seasonal influenza A/H3N2 evolution. Elife, 9, e60067. https://doi.org/10.7554/eLife.60067

      Kuhn, M., & Johnson, K. (2013). Applied predictive modeling (Vol. 26). Springer.

      Kuhn, M., & Johnson, K. (2019). Feature engineering and selection: A practical approach for predictive models. Chapman and Hall/CRC. 

      Lee, E. C., Arab, A., Goldlust, S. M., Viboud, C., Grenfell, B. T., & Bansal, S. (2018). Deploying digital health data to optimize influenza surveillance at national and local scales. PLoS Comput Biol, 14(3), e1006020. https://doi.org/10.1371/journal.pcbi.1006020

      Lemey, P., Rambaut, A., Bedford, T., Faria, N., Bielejec, F., Baele, G., Russell, C. A., Smith, D. J., Pybus, O. G., Brockmann, D., & Suchard, M. A. (2014). Unifying viral genetics and human transportation data to predict the global transmission dynamics of human influenza H3N2. PLoS Pathog, 10(2), e1003932. https://doi.org/10.1371/journal.ppat.1003932

      Muggeo, V. (2008). Segmented: An R Package to Fit Regression Models With Broken-Line Relationships. R News, 8, 20-25. 

      Muggeo, V. M. (2003). Estimating regression models with unknown break-points. Stat Med, 22(19), 3055-3071. https://doi.org/10.1002/sim.1545

      Neher, R. A., Russell, C. A., & Shraiman, B. I. (2014). Predicting evolution from the shape of genealogical trees. Elife, 3, e03568. https://doi.org/10.7554/eLife.03568  

      Rambaut, A., Pybus, O. G., Nelson, M. I., Viboud, C., Taubenberger, J. K., & Holmes, E. C. (2008). The genomic and epidemiological dynamics of human influenza A virus. Nature, 453(7195), 615-619. https://doi.org/10.1038/nature06945

      Recker, M., Pybus, O. G., Nee, S., & Gupta, S. (2007). The generation of influenza outbreaks by a network of host immune responses against a limited set of antigenic types. Proceedings of the National Academy of Sciences, 104(18), 7711-7716. https://doi.org/10.1073/pnas.0702154104

      Shannon, C. E. (1948). A mathematical theory of communication. The Bell system technical journal, 27(3), 379-423. 

      Smith, G. J., Vijaykrishna, D., Bahl, J., Lycett, S. J., Worobey, M., Pybus, O. G., Ma, S. K., Cheung, C. L., Raghwani, J., Bhatt, S., Peiris, J. S., Guan, Y., & Rambaut, A. (2009). Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature, 459(7250), 1122-1125. https://doi.org/10.1038/nature08182  

      Sridhar, S. (2016). Heterosubtypic T-Cell Immunity to Influenza in Humans: Challenges for Universal T-Cell Influenza Vaccines. Front Immunol, 7, 195. https://doi.org/10.3389/fimmu.2016.00195

      Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9, 307. https://doi.org/10.1186/1471-2105-9-307  

      Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics, 8, 25. https://doi.org/10.1186/1471-2105-8-25

      Terajima, M., Babon, J. A., Co, M. D., & Ennis, F. A. (2013). Cross-reactive human B cell and T cell epitopes between influenza A and B viruses. Virol J, 10, 244. https://doi.org/10.1186/1743-422X-10-244

      Webster, R. G., Bean, W. J., Gorman, O. T., Chambers, T. M., & Kawaoka, Y. (1992). Evolution and ecology of influenza A viruses. Microbiological Reviews, 56(1), 152-179. https://doi.org/10.1128/mr.56.1.152-179.1992

      Wen, F., Bedford, T., & Cobey, S. (2016). Explaining the geographical origins of seasonal influenza A (H3N2). Proc Biol Sci, 283(1838). https://doi.org/10.1098/rspb.2016.1312

      Yan, L., Neher, R. A., & Shraiman, B. I. (2019). Phylodynamic theory of persistence, extinction and speciation of rapidly adapting pathogens. Elife, 8. https://doi.org/10.7554/eLife.44205  

      Zhao, K., Wulder, M. A., Hu, T., Bright, R., Wu, Q., Qin, H., Li, Y., Toman, E., Mallick, B., Zhang, X., & Brown, M. (2019). Detecting change-point, trend, and seasonality in satellite time series data to track abrupt changes and nonlinear dynamics: A Bayesian ensemble algorithm. Remote Sensing of Environment, 232, 111181. https://doi.org/10.1016/j.rse.2019.04.034

      Zinder, D., Bedford, T., Gupta, S., & Pascual, M. (2013). The Roles of Competition and Mutation in Shaping Antigenic and Genetic Diversity in Influenza. PLOS Pathogens, 9(1). https://doi.org/10.1371/journal.ppat.1003104

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Summary: 

      The study by Klug et al. investigated the pathway specificity of corticostriatal projections, focusing on two cortical regions. Using a G-deleted rabies system in D1-Cre and A2a-Cre mice to retrogradely deliver channelrhodopsin to cortical inputs, the authors found that M1 and MCC inputs to direct and indirect pathway spiny projection neurons (SPNs) are both partially segregated and asymmetrically overlapping. In general, corticostriatal inputs that target indirect pathway SPNs are likely to also target direct pathway SPNs, while inputs targeting direct pathway SPNs are less likely to also target indirect pathway SPNs. Such asymmetric overlap of corticostriatal inputs has important implications for how the cortex itself may determine striatal output. Indeed, the authors provide behavioral evidence that optogenetic activation of M1 or MCC cortical neurons that send axons to either direct or indirect pathway SPNs can have opposite effects on locomotion and different effects on action sequence execution. The conclusions of this study add to our understanding of how cortical activity may influence striatal output and offer important new clues about basal ganglia function. 

      The conceptual conclusions of the manuscript are supported by the data, but the details of the magnitude of afferent overlap and causal role of asymmetric corticostriatal inputs on behavioral outcomes were not yet fully resolved. 

      We appreciate the reviewer’s thoughtful understanding and acknowledgment that the conceptual conclusion of asymmetric projections from the cortex to the striatum is well supported by our data. We also recognize the importance of further elucidating the extent of afferent overlap and the causal contributions of asymmetric corticostriatal inputs to behavioral outcomes. However, we respectfully note that current technical limitations pose significant challenges to addressing these questions with high precision.

      In response to the reviewer’s comments, we have now clarified the sample sizes, added the appropriate statistical analyses, and elaborated on the experimental design so that our conclusions are presented more transparently and are more accessible to the reader.

      After virally labeling either direct pathway (D1) or indirect pathway (D2) SPNs to optogenetically tag pathway-specific cortical inputs, the authors report that a much larger number of "non-starter" D2-SPNs from D2-SPN labeled mice responded to optogenetic stimulation in slices than "non-starter" D1 SPNs from D1-SPN labeled mice did. Without knowing the relative number of D1 or D2 SPN starters used to label cortical inputs, it is difficult to interpret the exact meaning of the lower number of responsive D2-SPNs in D1 labeled mice (where only ~63% of D1-SPNs themselves respond) compared to the relatively higher number of responsive D1-SPNs (and D2-SPNs) in D2 labeled mice. While relative differences in connectivity certainly suggest that some amount of asymmetric overlap of inputs exists, differences in infection efficiency and ensuing differences in detection sensitivity in slice experiments make determining the degree of asymmetry problematic. 

      Thank you for highlighting this point. As it lies at the core of our manuscript, we agree that it is essential to present it clearly and convincingly. As shown by the statistics (Fig. 2B-F), non-starter D1- and D2-SPNs appear to receive fewer projections from D1-projecting cortical neurons (Input D1-record D1, 0.63; Input D1-record D2, 0.40) compared to D2-projecting cortical neurons (Input D2 - record D1, 0.73; Input D2 -record D2, 0.79).

      While it is not technically feasible to quantify the number of infected cells in brain slices following electrophysiological recordings, we addressed this limitation by collecting data from multiple animals and restricting recordings to cells located within the injection sites. In Figure 2D, we used 7 mice in the D1-projecting to D1 EGFP(+) group, 8 mice in the D1-projecting to D2 EGFP(-) group, 10 mice in the D2-projecting to D2 EGFP(+) group, and 8 mice in the D2-projecting to D1 EGFP(-) group. In Figure 2G, the group sizes were as follows: 8 mice in the D1-projecting to D2 EGFP(+) group, 7 mice in the D1-projecting to D1 EGFP(-) group, 8 mice in the D2-projecting to D1 EGFP(+) group, and 10 mice in the D2-projecting to D2 EGFP(-) group. In both panels, connection ratios were compared using Fisher’s exact test. Comparisons were then made across experimental groups. Furthermore, as detailed in our Methods section (page 20, line 399-401), we assessed cortical expression levels prior to performing whole-cell recordings. Taken together, these precautions help ensure that the calculated connection ratios are unlikely to be confounded by differences in infection efficiency.
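
      For readers unfamiliar with the test named above, the following minimal Python sketch computes a two-sided Fisher’s exact p-value for a 2×2 table of responsive versus non-responsive cells in two groups. The function name and the toy counts are assumptions for illustration; they are not the cell counts from Figure 2, which are not listed in this response.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k: int) -> float:
        # Hypergeometric probability of k counts in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(k) for k in range(lo, hi + 1)]
    # Sum all tables that are no more likely than the observed one.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))


# Hypothetical toy table: rows = groups, columns = (responsive, non-responsive).
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

      The small relative tolerance guards against floating-point ties when deciding which tables count as at least as extreme as the observed one.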

      It is also unclear if retrograde labeling of D1-SPN- vs D2-SPN- targeting afferents labels the same densities of cortical neurons. This gets to the point of specificity in the behavioral experiments. If the target-based labeling strategies used to introduce channelrhodopsin into specific SPN afferents label significantly different numbers of cortical neurons, might the difference in the relative numbers of optogenetically activated cortical neurons itself lead to behavioral differences? 

      Thank you for bringing this concern to our attention. While optogenetic manipulation has become a widely adopted tool in functional studies of neural circuits, it remains subject to several technical limitations due to the nature of its implementation. Factors such as opsin expression efficiency, optic fiber placement, light intensity, stimulation spread, and other variables can all influence the specificity and extent of neuronal activation or inhibition. As such, rigorous experimental controls are essential when interpreting the outcomes of optogenetic experiments.

      In our study, we verified both the expression of channelrhodopsin in D1- or D2-projecting cortical neurons and the placement of the optic fiber following the completion of behavioral testing. To account for variability, we compared the behavioral effects of optogenetic stimulation within the same animals, stimulated versus non-stimulated conditions, as shown in Figures 3 and 4. Moreover, Figure S3 includes important controls that rule out the possibility that the behavioral effects observed were due to direct activation of D1- or D2-SPNs in striatum or to light alone in the cortex.

      An additional point worth emphasizing is that the behavioral effects observed in the open field and ICSS tests cannot be attributed to differences in the number of neurons activated. Specifically, activation of D1-projecting cortical neurons promoted locomotion in the open field, whereas activation of D2-projecting cortical neurons did not. However, in the ICSS test, activation of both D1- and D2-projecting cortical neurons reinforced lever pressing. Given that only D1-SPN activation, but not D2-SPN activation, supports ICSS behavior, these effects are unlikely to result merely from differences in the number of neurons recruited.

      This rationale underlies our use of multiple behavioral paradigms to examine the functions of D1- and D2-projecting cortical neurons. By assessing behavior across distinct tasks, we aimed to approach the question from multiple angles and reduce the likelihood of spurious or confounding effects influencing our interpretation.

      In general, the manuscript would also benefit from more clarity about the statistical comparisons that were made and sample sizes used to reach their conclusions.

      We thank the reviewer for the valuable suggestion to improve the manuscript. In response, we have made the following changes and provided additional clarification:

      (1) In Figure 2D, we used 7 mice in the D1-projecting to D1 EGFP(+) group, 8 mice in the D1-projecting to D2 EGFP(-) group, 10 mice in the D2-projecting to D2 EGFP(+) group, and 8 mice in the D2-projecting to D1 EGFP(-) group. In Figure 2G, the group sizes were as follows: 8 mice in the D1-projecting to D2 EGFP(+) group, 7 mice in the D1-projecting to D1 EGFP(-) group, 8 mice in the D2-projecting to D1 EGFP(+) group, and 10 mice in the D2-projecting to D2 EGFP(-) group. In both panels, connection ratios were compared using Fisher’s exact test.

      (2) In Figure 3, we reanalyzed the data in panels O, P, R, and S using permutation tests to assess whether each individual group exhibited a significant ICSS learning effect. The figure legend has been revised accordingly as follows:

      (O-P) D1-SPN (red) but not D2-SPN stimulation (black) drives ICSS behavior in both the DMS (O: D1, n = 6, permutation test, slope = 1.5060, P = 0.0378; D2, n = 5, permutation test, slope = -0.2214, P = 0.1021; one-tailed Mann Whitney test, Day 7 D1 vs. D2, P = 0.0130) and the DLS (P: D1, n = 6, permutation test, slope = 28.1429, P = 0.0082; D2, n = 5, permutation test, slope = -0.3429, P = 0.0463; one-tailed Mann Whitney test, Day 7 D1 vs. D2, P = 0.0390). *, P < 0.05. (Q) Timeline of helper virus injections, rabies-ChR2 injections and optogenetic stimulation for ICSS behavior. (R-S) Optogenetic stimulation of the cortical neurons projecting to either D1- or D2-SPNs induces ICSS behavior in both the MCC (R: MCC-D1, n = 5, permutation test, Day1-Day7, slope = 2.5857, P = 0.0034; MCC-D2, n = 5, Day2-Day7, permutation test, slope = 1.4229, P = 0.0344; no significant effect on Day7, MCC-D1 vs. MCC-D2,  two-tailed Mann Whitney test, P = 0.9999) and the M1 (S: M1-D1, n = 5, permutation test, Day1-Day7, slope = 1.8214, P = 0.0259; M1-D2, n = 5, Day1-Day7, permutation test, slope = 1.8214, P = 0.0025; no significant effect on Day7, M1-D1 vs. M1-D2, two-tailed Mann Whitney test, P = 0.3810). n.s., not statistically significant.

      (3) In Figure 4, we have added a comparison against a theoretical percentage change of zero to better evaluate the net effect of each manipulation. The results showed that in Figure 4D, optogenetic stimulation of D1-projecting MCC neurons significantly increased the pressing rate, whereas stimulation of D2-projecting MCC neurons did not (MCC-D1: n = 8, one-sample two-tailed t-test, t = 2.814, P = 0.0131; MCC-D2: n = 7, t = 0.8481, P = 0.4117). In contrast, in Figure 4H, optogenetic stimulation of both D1- and D2-projecting M1 neurons significantly increased the sequence press rate (M1-D1: n = 6, one-sample two-tailed Wilcoxon signed-rank test, P = 0.0046; M1-D2: n = 7, P = 0.0479).
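
      The permutation test on the learning slope mentioned in item (2) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the response does not specify the permutation scheme, so the sketch shuffles the daily values across days and compares the ordinary least-squares slope with the observed one. The helper names and the press counts are hypothetical and are not data from the study.

```python
import random


def ols_slope(xs, ys) -> float:
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def permutation_slope_test(days, values, n_perm=10000, seed=0):
    """One-sided p-value for a positive slope, shuffling values across days."""
    rng = random.Random(seed)
    observed = ols_slope(days, values)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if ols_slope(days, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical daily press counts for one group (not data from the study).
days = [1, 2, 3, 4, 5, 6, 7]
presses = [2, 3, 5, 4, 8, 9, 12]
slope, p_value = permutation_slope_test(days, presses)
```

      Because the null distribution is built from the data themselves, this kind of test avoids distributional assumptions, which is useful with group sizes of five or six animals.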

      Reviewer #2 (Public Review):

      Summary: 

      Klug et al. use monosynaptic rabies tracing of inputs to D1- vs D2-SPNs in the striatum to study how separate populations of cortical neurons project to D1- and D2-SPNs. They use rabies to express ChR2, then patch D1-or D2-SPNs to measure synaptic input. They report that cortical neurons labeled as D1-SPN-projecting preferentially project to D1-SPNs over D2-SPNs. In contrast, cortical neurons labeled as D2-SPN-projecting project equally to D1- and D2-SPNs. They go on to conduct pathway-specific behavioral stimulation experiments. They compare direct optogenetic stimulation of D1- or D2-SPNs to stimulation of MCC inputs to DMS and M1 inputs to DLS. In three different behavioral assays (open field, intra-cranial self-stimulation, and a fixed ratio 8 task), they show that stimulating MCC or M1 cortical inputs to D1-SPNs is similar to D1-SPN stimulation, but that stimulating MCC or M1 cortical inputs to D2-SPNs does not recapitulate the effects of D2-SPN stimulation (presumably because both D1- and D2-SPNs are being activated by these cortical inputs). 

      Strengths: 

      Showing these same effects in three distinct behaviors is strong. Overall, the functional verification of the consequences of the anatomy is very nice to see. It is a good choice to patch only from mCherry-negative non-starter cells in the striatum.

      Thank you for your profound understanding and appreciation of our manuscript’s design and the methodologies employed. In the realm of neuroscience, quantifying synaptic connections is a formidable challenge. While the roles of the direct and indirect pathways in motor control have long been explored, the mechanism by which upstream cortical inputs govern these pathways remains shrouded in mystery at the circuitry level.

      In the ‘Go/No-Go’ model, the direct and indirect pathways operate antagonistically; in contrast, the ‘Co-activation’ model suggests that they work cooperatively to orchestrate movement. These distinct theories raise a compelling question: Do these two pathways receive inputs from the same upstream cortical neurons, or are they modulated by distinct subpopulations? Answering this question could provide vital clues as to whether these pathways collaborate or operate independently.

      Previous studies have revealed both differences and similarities in the cortical inputs to direct and indirect pathways at the population level. Our investigation goes further, asking whether a single cortical input drives both pathways simultaneously or whether each pathway is regulated by a distinct cortical subpopulation. To address this, we employed rabies virus–mediated retrograde tracing from D1- or D2-SPNs and recorded non-starter SPNs to determine if they receive the same inputs as the starter SPNs. This approach allowed us to calculate the connection ratio and estimate the probable connection properties.

      Weaknesses: 

      One limitation is that all inputs to SPNs are expressing ChR2, so they cannot distinguish between different cortical subregions during patching experiments. Their results could arise because the same innervation patterns are repeated in many cortical subregions or because some subregions have preferential D1-SPN input while others do not.

      Thank you for raising this thoughtful concern. It is indeed not feasible to restrict ChR2 expression to a specific cortical region using the first-generation rabies-ChR2 system alone. A more refined approach would involve injecting Cre-dependent TVA and RG into the striatum of D1- or A2A-Cre mice, followed by rabies-Flp infection. Subsequently, a Flp-dependent ChR2 virus could be injected into the MCC or M1 to selectively label D1- or D2-projecting cortical neurons. This strategy would allow for more precise targeting and address many of the current limitations.

      However, a significant challenge lies in the cytotoxicity associated with rabies virus infection. Neuronal health begins to deteriorate substantially around 10 days post-infection, which provides an insufficient window for robust Flp-dependent ChR2 expression. We have tested several new rabies virus variants with extended survival times (Chatterjee et al., 2018; Jin et al., 2024), but unfortunately, they did not perform adequately in the corticostriatal systems we examined.

      In our experimental design, the aim is to delineate the connectivity probabilities from cortical neurons to D1- or D2-SPNs. The hypotheses we considered include the possibility that similar innervation patterns occur across multiple cortical subregions, that some subregions show preferential input to D1-SPNs while others do not, or a combination of both scenarios. This led us to perform a series of behavioral tests using optogenetic activation of the D1- or D2-projecting cortical populations to see which could be the case.

      In the cortical areas we examined, MCC and M1, the behavioral results were consistent with our electrophysiological results. Specifically, when we stimulated the D1-projecting cortical neurons in either MCC or M1, mice exhibited facilitated locomotion in the open field test, similar to the effect of activating D1-SPNs in the striatum alone (MCC: Fig 3C & D vs. I; M1: Fig 3F & G vs. L). Conversely, stimulation of D2-projecting MCC or M1 cortical neurons resulted in behavioral effects that appeared to combine characteristics of both D1- and D2-SPN activation in the striatum (MCC: Fig 3C & D vs. J; M1: Fig 3F & G vs. M). Similar results were observed in the ICSS test. Our interpretation of these results is that the activation of D1-projecting neurons in the cortex induces behavioral changes akin to D1 neuron activation, while activation of D2-projecting neurons in the cortex leads to a combined effect of both D1 and D2 neuron activation. This suggests that at least the cortical regions we tested follow the hypothesis we proposed.

      There are also some caveats with respect to the efficacy of rabies tracing. Although they only patch non-starter cells in the striatum, only 63% of D1-SPNs receive input from D1-SPN-projecting cortical neurons. It's hard to say whether this is "high" or "low," but one question is how far from the starter cell region they are patching. Without this spatial indication of where the cells that are being patched are relative to the starter population, it is difficult to interpret if the cells being patched are receiving cortical inputs from the same neurons that are projecting to the starter population. Convergence of cortical inputs onto SPNs may vary with distance from the starter cell region quite dramatically, as other mapping studies of corticostriatal inputs have shown specialized local input regions can be defined based on cortical input patterns (Hintiryan et al., Nat Neurosci, 2016, Hunnicutt et al., eLife 2016, Peters et al., Nature, 2021).

      This is a valid concern regarding anatomical studies. Investigating cortico-striatal connectivity at the single-cell level remains technically challenging due to current methodological limitations. At present, we rely on rabies virus-mediated trans-synaptic retrograde tracing to identify D1- or D2-projecting cortical populations. This anatomical approach is coupled with ex vivo slice electrophysiology to assess the functional connectivity between these projection-defined cortical neurons and striatal SPNs. This enables us to quantify connection ratios, for example, the proportion of D1-projecting cortical neurons that functionally synapse onto non-starter D1-SPNs.

      To ensure the robustness of our conclusions, it is essential that both the starter cells and the recorded non-starter SPNs receive comparable topographical input from the cortex and other brain regions. Therefore, we carefully designed our experiments so that all recorded cells were located within the injection site, were mCherry-negative (i.e., non-starter cells), and were surrounded by ChR2-mCherry-positive neurons. This configuration ensured that the distance between recorded and starter cells did not exceed 100 µm, maintaining close anatomical proximity and thereby preserving the likelihood of shared cortical innervation within the examined circuitry.

      These methodological details are also described in the section on ex vivo brain slice electrophysiology, specifically in the Methods section, lines 396–399:

      “D1-SPNs (eGFP-positive in D1-eGFP mice, or eGFP-negative in D2-eGFP mice) or D2-SPNs (eGFP-positive in D2-eGFP mice, or eGFP-negative in D1-eGFP mice) that were ChR2-mCherry-negative, but in the injection site and surrounded by cells expressing ChR2-mCherry were targeted for recording.”

      This experimental strategy was implemented to control for potential spatial biases and to enhance the interpretability of our connectivity measurements.

      A caveat for the optogenetic behavioral experiments is that these optogenetic experiments did not include fluorophore-only controls.

      Thank you for bringing this to our attention. A fluorophore-only control is indeed a valuable negative control, commonly used to rule out effects caused by light exposure independent of optogenetic manipulation. In this study, however, comparisons were made between light-on and light-off conditions within the same animal. This within-subject design, as employed in recent studies (Geddes et al., 2018; Zhu et al., 2025), is considered sufficient to isolate the effects of optogenetic manipulation.

      Furthermore, as shown in Figure S3, we conducted an additional control experiment in which optogenetic stimulation was applied to M1, while ensuring that ChR2 expression was restricted to the striatum via targeted viral infection. This approach serves as a functional equivalent to the control you suggested. Importantly, we observed no effects that could be attributed solely to light exposure, further supporting the conclusion that the observed outcomes in our main experiments are due to targeted optogenetic manipulation, rather than confounding effects of illumination.

      Lastly, by employing an in-animal comparison, measuring changes between stimulated and non-stimulated trials, we account for subject-specific variability and strengthen the interpretability of our findings.

      Another point of confusion is that other studies (Cui et al, J Neurosci, 2021) have reported that stimulation of D1-SPNs in DLS inhibits rather than promotes movement.

      Thank you for bringing the study by Cui and colleagues to our attention. While that study has generated some controversy, other independent investigations have demonstrated that activation of D1-SPNs in DLS facilitates locomotion and lever-press behaviors (Dong et al., 2025; Geddes et al., 2018; Kravitz et al., 2010).

      Nevertheless, the discrepancy is worth clarifying. The differences in behavioral outcomes observed between our study and that of Cui et al. may be attributable to several methodological factors, including differences in both the stereotaxic targeting coordinates and the optical fiber specifications used for stimulation.

      Specifically, in our experiments, the dorsomedial striatum (DMS) was targeted at coordinates AP +0.5 mm, ML ±1.5 mm, DV –2.2 mm, and the DLS at AP +0.5 mm, ML ±2.5 mm, DV –2.2 mm. In contrast, Cui et al. targeted the DMS at AP +0.9 mm, ML ±1.4 mm, DV –3.0 mm and the DLS at AP +0.7 mm, ML ±2.3 mm, DV –3.0 mm. These coordinates correspond to sites that are slightly more rostral and ventral compared to our own. Even subtle differences in anatomical targeting can result in activation of distinct neuronal subpopulations, which may account for the differing behavioral effects observed during optogenetic stimulation.

      In addition, the optical fibers used in the two studies varied considerably. We employed fibers with a 200 µm core diameter and a numerical aperture (NA) of 0.37, whereas Cui et al. used fibers with a 250 µm core diameter and a higher NA of 0.66. The combination of a larger core and higher NA in their setup implies a broader spatial spread and deeper tissue penetration of light, likely resulting in activation of a larger neural volume. This expanded volume of stimulation may have engaged additional neural circuits not recruited in our experiments, further contributing to the divergent behavioral outcomes. Taken together, these differences in targeting and photostimulation parameters are likely key contributors to the distinct effects reported between the two studies.
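
      As a rough worked example of the geometric argument above, the following Python sketch converts each fiber's numerical aperture into an emission half-angle using NA = n·sin(θ) and estimates the width of the resulting light cone at a fixed depth. The tissue refractive index of 1.36, the 500 µm depth, and the function names are assumptions chosen for illustration; the sketch ignores scattering and absorption, which strongly shape light spread in real tissue.

```python
import math


def half_angle_deg(na: float, n_medium: float) -> float:
    """Half-angle of the emission cone, from NA = n * sin(theta)."""
    return math.degrees(math.asin(na / n_medium))


def geometric_spot_diameter_um(core_um, na, n_medium, depth_um):
    """Diameter of the geometric light cone at a given depth (no scattering)."""
    theta = math.asin(na / n_medium)
    return core_um + 2.0 * depth_um * math.tan(theta)


N_TISSUE = 1.36  # assumed refractive index of brain tissue

# Core diameters and NAs are the values quoted in the paragraph above.
for label, core, na in [("this study", 200.0, 0.37), ("Cui et al.", 250.0, 0.66)]:
    angle = half_angle_deg(na, N_TISSUE)
    spot = geometric_spot_diameter_um(core, na, N_TISSUE, depth_um=500.0)
    print(f"{label}: half-angle {angle:.1f} deg, spot at 500 um {spot:.0f} um")
```

      Under these assumptions the higher-NA, larger-core fiber yields a wider cone, consistent with the qualitative point that it illuminates a broader region.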

      Reviewer #3 (Public Review): 

      In the manuscript by Klug and colleagues, the investigators use a rabies virus-based methodology to explore potential differences in connectivity from cortical inputs to the dorsal striatum. They report that the connectivity from cortical inputs onto D1 and D2 MSNs differs in terms of their projections onto the opposing cell type, and use these data to infer that there are differences in cross-talk between cortical cells that project to D1 vs. D2 MSNs. Overall, this manuscript adds to the overall body of work indicating that there are differential functions of different striatal pathways which likely arise at least in part by differences in connectivity that have been difficult to resolve due to difficulty in isolating pathways within striatal connectivity and several interesting and provocative observations were reported. Several different methodologies are used, with partially convergent results, to support their main points.

      However, I have significant technical concerns about the manuscript as presented that make it difficult for me to interpret the results of the experiments. My comments are below.

      Major:

      There is generally a large caveat to the rabies studies performed here, which is that both TVA and the ChR2-expressing rabies virus have the same fluorophore. It is thus essentially impossible to determine how many starter cells there are, what the efficiency of tracing is, and which part of the striatum is being sampled in any given experiment. This is a major caveat given the spatial topography of the cortico-striatal projections. Furthermore, the authors make a point in the introduction about previous studies not having explored absolute numbers of inputs, yet this is not at all controlled in this study. It could be that their rabies virus simply replicates better in D1-MSNs than D2-MSNs. No quantifications are done, and these possibilities do not appear to have been considered. Without a greater standardization of the rabies experiments across conditions, it is difficult to interpret the results.

      We thank the reviewer for raising these questions, which merit further discussion.

      Firstly, the primary aim of our study is to investigate the connectivity of the corticostriatal pathway. Given the current technical limitations, it is not feasible to trace all the striatal SPNs connected to a single cortical neuron. Therefore, we approached this from the opposite direction, starting from D1- or D2-SPNs to retrogradely label upstream cortical neurons, and then identifying their connected SPNs via functional synaptic recordings. To achieve this, we employed the only available transsynaptic retrograde method: rabies virus-mediated tracing. Because we crossed D1- or D2-GFP mice with D1- or A2A-Cre mice to identify SPN subtypes during electrophysiological recordings, the conventional rabies-GFP system could not be used to distinguish starter cells without conflicting with the GFP labeling of SPNs. To overcome this, we tagged ChR2 expression with mCherry. In this setup, we recorded from mCherry-negative D1- or D2-SPNs within the injection site and surrounded by mCherry-positive neurons. This ensures that the recorded neurons are topographically matched to the starter cell population and receive input from the same cortical regions. We acknowledge that TVA-only and ChR2-expressing cells are both mCherry-positive and therefore indistinguishable in our system. As such, mCherry-positive cells likely comprise a mixture of starter cells and TVA-only cells, representing a somewhat broader population than starter cells alone. Nevertheless, by restricting recordings to mCherry-negative SPNs within the injection site, we ensure that our conclusions about functional connectivity remain valid and aligned with the primary objective of this study.

      Secondly, if rabies virus replication were significantly more efficient in D1-SPNs than in D2-SPNs, this would likely result in a higher observed connection probability in the D1-projecting group. However, we used consistent genetic strategies across all groups: D1-SPNs were defined as GFP-positive in D1-GFP mice and GFP-negative in D2-GFP mice, with D2-SPNs defined analogously. Recordings from both D1- and D2-SPNs were performed using the same methodology and under the same injection conditions within the same animals. This internal control helps mitigate the possibility that differential rabies infection efficiency biased our results.

      With these experimental safeguards in place, we found that 40% of D2-SPNs received input from D1-SPN-projecting cortical neurons, while 73% of D1-SPNs received input from D2-SPN-projecting cortical neurons. Although the ideal scenario would involve an even larger sample size to refine these estimates, the technical demands of post-rabies-infection electrophysiological recordings inherently limit throughput. Nonetheless, our approach represents the most feasible and accurate method currently available, and provides a significant advance in characterizing the functional connectivity within corticostriatal circuits.

      The authors claim using a few current clamp optical stimulation experiments that the cortical cells are healthy, but this result was far from comprehensive. For example, membrane resistance, capacitance, general excitability curves, etc are not reported. In Figure S2, some of the conditions look quite different (e.g., S2B, input D2-record D2, the method used yields quite different results that the authors write off as not different). Furthermore, these experiments do not consider the likely sickness and death that occurs in starter cells, as has been reported elsewhere. The health of cells in the circuit is overall a substantial concern that alone could invalidate a large portion, if not all, of the behavioral results. This is a major confound given those neurons are thought to play critical roles in the behaviors being studied. This is a major reason why first-generation rabies viruses have not been used in combination with behavior, but this significant caveat does not appear to have been considered, and controls e.g., uninfected animals, infected with AAV helpers, etc, were not included.

      We understand and appreciate the reviewer’s concern regarding the potential cytotoxicity of rabies virus infection. Indeed, this is a critical consideration when interpreting functional connectivity data. We have tested several newer rabies virus variants reported to support extended survival times (Chatterjee et al., 2018; Jin et al., 2024), but unfortunately, these variants did not perform reliably in the corticostriatal circuits we examined.

      Given these limitations, we relied on the rabies virus approach originally developed by Osakada et al. (Osakada et al., 2011), which demonstrated that neurons infected with rabies virus expressing ChR2 remain both viable and functional up to at least 10 days post-infection (Fig. 3, cited below). In our own experiments, we further validated the health and viability of cortical neurons, the presynaptic partners of SPNs, particularly around day 7 post-infection.

      To minimize the risk of viral toxicity, we performed ex vivo slice recordings within a conservative time window, between 4 and 8 days after infection, when the health of labeled neurons is well maintained. Moreover, the recorded SPNs were consistently mCherry-negative, indicating they were not directly infected by rabies virus, thus further reducing the likelihood of recording from compromised cells.

      Taken together, these steps help ensure that our synaptic recordings reflect genuine functional connectivity, rather than artifacts of viral toxicity. We hope this clarifies the rationale behind our experimental design.

      For the behavioral tests, including a naïve uninfected group and an AAV helper virus-only group as negative controls could be beneficial to isolate the specific impact of rabies virus infection. However, our primary focus is on the optogenetic activation of selected presynaptic inputs to D1- or D2-SPNs. Therefore, comparing stimulated versus non-stimulated trials within the same animal offers more direct and relevant results for our study objectives.

      It is also important to note that the ICSS test is particularly susceptible to the potential cytotoxic effects of rabies virus, as it spans a relatively extended period, from Day 4 to Day 12 post-infection. To mitigate this issue, we focused our analysis on the first 7 days of ICSS testing, thereby keeping the behavioral observations within 10 days post-rabies injection. This approach minimizes potential confounds from rabies-induced neurotoxicity while still capturing the relevant behavioral dynamics. Accordingly, we have revised Figure 3 and updated the statistical analyses to reflect this adjustment.

      The overall purity (e.g., EnvA-pseudotyping efficiency) of the RABV prep is not shown. If there was a virus that was not well EnvA-pseudotyped and thus could directly infect cortical (or other) inputs, it would degrade specificity.

      We agree that anatomical specificity is crucial for accurately labeling inputs to defined SPN populations in our study. The rabies virus strain employed here has been rigorously validated for its specificity in numerous previous studies from our group and others (Aoki et al., 2019; Klug et al., 2018; Osakada et al., 2011; Smith et al., 2016; Wall et al., 2013; Wickersham et al., 2007). For example, in a recent study by Aoki et al. (Aoki et al., 2019), we tested the same rabies virus strain by co-injecting the glycoprotein-deleted rabies virus and the TVA-expressing helper virus, without the glycoprotein-expressing AAV, into the SNr. As shown in Figure S1 (related to Figure 2), GFP expression was restricted to starter cells within the SNr, with no evidence of transsynaptic labeling in upstream regions such as the striatum, EPN, GPe, or STN (see panels F–H). These findings provide strong evidence that the rabies virus used in our experiments is properly pseudotyped and exhibits high specificity for starter cell labeling without off-target spread.

      We appreciate the reviewer’s emphasis on specificity, and we hope this clarification further supports the reliability of our anatomical tracing approach.

      While most of the study focuses on the cortical inputs, in slice recordings, inputs from the thalamus are not considered, yet likely contribute to the observed results. Related to this, in in vivo optogenetic experiments, technically, if the thalamic or other inputs to the dorsal striatum project to the cortex, their method will not only target cortical neurons but also terminals of other excitatory inputs. If this cannot be ruled out, stating that the authors are able to selectively activate the cortical inputs to one or the other population should be toned down.

      We agree with the reviewer that the thalamus is also a significant source of excitatory input to the striatum. However, current techniques do not allow for precise and exclusive labeling of upstream neurons in a given brain region, such as the cortex or thalamus. This technical limitation indeed makes it difficult to definitively determine whether inputs from these regions follow the same projection rules. Despite this, our findings show that stimulation of defined cortical populations, specifically, D1- or D2-projecting neurons in MCC and M1, elicits behavioral outcomes that closely mirror those observed in our ex vivo slice recordings, providing strong support for the cortical origin of the effects we observed.

      In our in vivo optogenetic experiments, we acknowledge that stimulating a specific cortical region may also activate axonal terminals from rabies-infected cortical or thalamic neurons. While somatic stimulation is generally more effective than terminal stimulation, we recognize the possibility that terminals on non-rabies-traced cortical neurons could be activated through presynaptic connections. To address this, we considered the findings of a previous study (Cruikshank et al., 2010), which demonstrated that while brief optogenetic stimulation (0.05 ms) of thalamo-cortical terminals can elicit a few action potentials in postsynaptic cortical neurons, sustained terminal stimulation (500 ms) also results in only transient postsynaptic firing rather than prolonged activation (Fig. 3C, cited below). This suggests that cortical neurons exhibit only short-lived responses to continuous presynaptic stimulation of thalamic origin.

      In comparison, our behavioral paradigms employed prolonged optogenetic stimulation protocols (20 Hz, 10 ms pulses for 15 s in the open-field test, 1 s in ICSS, and 8 s in FR4/8), which more closely resemble sustained stimulation conditions. Given these parameters and the robust behavioral responses observed, we infer that the effects are primarily mediated by activation of rabies-labeled, ChR2-expressing D1- or D2-projecting cortical neurons rather than by indirect activation through thalamic input.
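
      To make the stimulation parameters concrete, the short sketch below converts each protocol (20 Hz, 10 ms pulses, with the train durations quoted above) into a pulse count, total light-on time, and duty cycle. The function name `pulse_train` and the dictionary of protocols are hypothetical helpers introduced only for this illustration.

```python
def pulse_train(freq_hz, pulse_ms, duration_s):
    """Return (number of pulses, total light-on time in s, duty cycle) for one train."""
    n_pulses = round(freq_hz * duration_s)
    light_on_s = n_pulses * pulse_ms / 1000.0
    duty_cycle = light_on_s / duration_s
    return n_pulses, light_on_s, duty_cycle


# Train durations (s) for the three behavioral paradigms, as stated in the text.
PROTOCOLS = {"open field": 15, "ICSS": 1, "FR4/8": 8}

for name, duration_s in PROTOCOLS.items():
    n_pulses, light_on_s, duty = pulse_train(freq_hz=20, pulse_ms=10, duration_s=duration_s)
    print(f"{name}: {n_pulses} pulses, {light_on_s:.1f} s light on, duty cycle {duty:.0%}")
```

      Each train therefore delivers light for 20% of its duration, with between 20 and 300 pulses depending on the paradigm.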

      We appreciate the reviewer’s valuable comment, and we have now incorporated this point into the revised manuscript (page 13, line 265 to 275) to more clearly address the potential contribution of thalamic inputs in our experimental design.

      The statements about specificity of connectivity are not well-founded. It may be that in the specific case where they are assessing outside of the area of injections, their conclusions may hold (e.g., excitatory inputs onto D2s have more inputs onto D1s than vice versa). However, how this relates to the actual site of injection is not clear. At face value, if such a connectivity exists, it would suggest that D1-MSNs receive substantially more overall excitatory inputs than D2s. It is thus possible that this observation would not hold over other spatial intervals. This was not explored and thus the conclusions are over-generalized. e.g., the distance from the area of red cells in the striatum to recordings was not quantified, what constituted a high level of cortical labeling was not quantified, etc. Without more rigorous quantification of what was being done, it is difficult to interpret the results. 

      We sincerely thank the reviewer for the thoughtful comments and critical insights into our interpretation of connectivity data. These concerns are valid and provide an important opportunity to clarify and reinforce our experimental design and conclusions.

      Firstly, as described in our previous response, all patched neurons were carefully selected to be within the injection site and in close proximity to ChR2-mCherry-positive cells. Specifically, the estimated distance from each recorded neuron to the nearest starter cells did not exceed 100 µm. This design choice was made to minimize variability associated with spatial distance or heterogeneity in viral expression, thereby allowing for a more consistent sampling of putatively connected neurons.

      Secondly, quantifying both the number of starter and input neurons would, in principle, provide a more comprehensive picture of connectivity. However, given the technical limitations of the current approach, particularly when combining rabies tracing with functional recordings, it is not feasible to obtain such precise cell counts. Instead, we focused on connection ratios derived from targeted electrophysiological recordings, which offer a reliable and practical means of estimating connectivity within these defined circuits.

      Thirdly, regarding the potential influence of rabies-labeled neurons beyond the immediate recording site: while we acknowledge that rabies tracing labels a broad set of upstream neurons, our analysis was confined to a well-defined and localized area. The analogy we find helpful here is that of a spotlight - our recordings were restricted to the illuminated region directly under the beam, where the projection pattern is fixed and interpretable, regardless of what lies outside that area. Although we cannot fully account for all possible upstream connections, our methodology was designed to minimize variability and maintain consistency in the region of interest, which we believe supports the robustness of our conclusions in the ex vivo slice recording experiment.

      We hope this additional explanation addresses the reviewer’s concerns and helps clarify the rationale of our experimental strategy.

      The results in figure 3 are not well controlled. The authors show contrasting effects of optogenetic stimulation of D1-MSNs and D2-MSNs in the DMS and DLS, results which are largely consistent with the canon of basal ganglia function. However, when stimulating cortical inputs, stimulating the inputs from D1-MSNs gives the expected results (increased locomotion) while stimulating putative inputs to D2-MSNs had no effect. This is not the same as showing a decrease in locomotion - showing no effect here is not possible to interpret.

      We apologize for any confusion and appreciate the opportunity to clarify this point. Our electrophysiological recordings demonstrated that D1-projecting cortical neurons preferentially innervate D1-SPNs in the striatum, whereas D2-projecting cortical neurons provide input to both D1- and D2-SPNs, without a clear preference. These synaptic connectivity patterns are further supported by our behavioral experiments: optogenetic stimulation of D1-projecting neurons in cortical areas such as MCC and M1 led to behavioral effects consistent with direct D1-SPN activation. In contrast, stimulation of D2-projecting cortical neurons produced behavioral outcomes that appeared to reflect a mixture of both D1- and D2-SPN activation.

      We acknowledge that interpreting negative behavioral findings poses inherent challenges, as it is difficult to distinguish between a true lack of effect and insufficient experimental manipulation. To mitigate this, we ensured that all animals included in the analysis exhibited appropriate viral expression and correctly placed optic fibers in the targeted regions. These controls help to confirm that the observed behavioral effects - or lack thereof - are indeed due to the activation of the intended neuronal populations rather than technical artifacts such as weak expression or fiber misplacement.

      As shown in Author response image 1 below, our verification of virus expression and fiber positioning confirms effective targeting in MCC and M1 of A2A-Cre mice. Therefore, we interpret the negative behavioral outcomes as meaningful consequences of specific neural circuit activation.

      Author response image 1.

      Confocal image from A2A-Cre mouse showing targeted optogenetic stimulation of D2-projecting cortical neurons in MCC or M1. ChR2-mCherry expression highlights D2-projecting neurons, selectively labeled via rabies-mediated tracing. Optic fiber placement is confirmed above the cortical region of interest. Image illustrates robust expression and anatomical specificity necessary for pathway-selective stimulation in behavioral assays.

      In light of their circuit model, the result showing that inputs to D2-MSNs drive ICSS is confusing. How can the authors account for the fact that these cells are not locomotor-activating, stimulation of their putative downstream cells (D2-MSNs) does not drive ICSS, yet the cortical inputs drive ICSS? Is the idea that these inputs somehow also drive D1s? If this is the case, how do D2s get activated, if all of the cortical inputs tested net activate D1s and not D2s? Same with the results in figure 4 - the inputs and putative downstream cells do not have the same effects. Given the potential caveats of differences in viral efficiency, spatial location of injections, and cellular toxicity, I cannot interpret these experiments.

      We apologize for any confusion in our previous explanation. In our behavioral experiments, the primary objective was to determine whether activation of D1- or D2-projecting cortical neurons would produce behavioral outcomes distinct from those observed with pure D1 or D2 activation.

      Our findings show that stimulation of D1-projecting cortical neurons produced behavioral effects closely resembling those of selective D1 activation in both open field and ICSS tests. This is consistent with our slice recording data, which revealed that D1-projecting cortical neurons exhibit a higher connection probability with D1-SPNs than with D2-SPNs.

      In contrast, interpreting the effects of D2-projecting cortical neuron stimulation is inherently more nuanced. In the open field test, activation of these neurons did not significantly modulate locomotion. This could reflect a balanced influence of D1 activation, which facilitates movement, and D2 activation, which suppresses it - resulting in a net neutral behavioral outcome. In the ICSS test, the absence of a strong reinforcement effect typically associated with D2 activation, combined with partial reinforcement likely due to concurrent D1 activation, suggests that stimulation of D2-projecting neurons produces a mixed behavioral signal. This outcome supports the interpretation that these neurons synapse onto both D1- and D2-SPNs, leading to a blended behavioral response that differs from selective D1 or D2 activation alone.

      Together, these two behavioral assays offer complementary perspectives, providing a more complete view of how projection-specific cortical inputs influence striatal output and behavior.

      In Figure 4 of the current manuscript (as cited below), we show that optogenetic activation of MCC neurons projecting to D1-SPNs facilitates sequence lever pressing, whereas activation of MCC neurons projecting to D2-SPNs does not induce significant behavioral changes. Conversely, activation of M1 neurons projecting to either D1- or D2-SPNs enhances lever pressing sequences. These observations align with our prior findings (Geddes et al., 2018; Jin et al., 2014), where we demonstrated that in the striatum, D1-SPN activation facilitates ongoing lever pressing, whereas D2-SPN activation is more involved in suppressing ongoing actions and promoting transitions between sub-sequences, as shown in Fig. 4 of Geddes et al. (2018) and Jin et al. (2014) and in Fig. 5K of Jin et al. (2014). Taken together, the facilitation of lever pressing by D1-projecting MCC and M1 neurons is consistent with their preferential connectivity to D1-SPNs and their established behavioral role.

      What is particularly intriguing, though admittedly more complex, is the behavioral divergence observed upon activation of D2-SPN-projecting cortical neurons. Activation of D2-projecting MCC neurons does not alter lever pressing, possibly reflecting a counterbalancing effect from concurrent D1- and D2-SPN activation. In contrast, stimulation of D2-projecting M1 neurons facilitates lever pressing, albeit less robustly than their D1-projecting counterparts. This discrepancy may reflect regional differences in striatal targets, DMS for MCC versus DLS for M1, as also supported by our open field test results. Furthermore, our recent findings (Zhang et al., 2025) show that synaptic strength from Cg to D2-SPNs is stronger than to D1-SPNs, whereas the M1 pathway exhibits the opposite pattern. These data suggest that beyond projection ratios, synaptic strength also shapes cortico-striatal functional output. Thus, stronger D2-SPN synapses in the DMS may offset D1-SPN activation during MCC-D2 stimulation, dampening the increase in lever pressing. Conversely, weaker D2 synapses in the DLS may permit M1-D2 projections to facilitate behavior more readily.

      In summary, the behavioral outcomes of our optogenetic manipulations support the proposed asymmetric cortico-striatal connectivity model. While the effects of D2-projecting neurons are not uniform, they reflect varying balances of D1 and D2-SPN influence, which further underscores the asymmetrical connections of cortical inputs to the striatum.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors): 

      (1) What are the sample sizes for Fig S2? Some trends that are listed as nonsignificant look like they may just be underpowered. Related to this point, S2C indicates that PPR is statistically similar in all conditions. The traces shown in Figure 2 suggest that PPR is quite different in "Input D1"- vs "Input D2" projections. If there is indeed no difference, the exemplar traces should be replaced with more representative ones to avoid confusion. 

      Thank you for your suggestion. The sample size reported in Figure S2 corresponds to the neurons identified as connected in Figure 2. The representative traces shown in Figure 2 were selected based on their close alignment with the amplitude statistics and are intended to reflect typical responses. Given this, it is appropriate to retain the current examples as they accurately illustrate the underlying data.

      (2) Previous studies have described that SPN-SPN collateral inhibition is also asymmetric, with D2->D1 SPN connectivity stronger than the other direction. While cortical inputs to D2-SPNs may also strongly innervate D1-SPNs, it would be helpful to speculate on how collateral inhibition may further shape the biases (or lack thereof) reported here. 

      This would indeed be an interesting topic to explore. SPN-SPN mutual inhibition and/or interneuron inhibition may also play a role in the functional organization and output of the striatum. In the present study, we focused on the primary layer of cortico-striatal connectivity to examine how cortical neurons selectively connect to the striatal direct and indirect pathways, as these pathways have been shown to have distinct yet cooperative functions. To achieve this, we applied a GABAA receptor inhibitor to isolate only excitatory synaptic currents in SPNs, yielding the relevant results.

      To investigate additional circuit organization involving SPN-SPN mutual inhibition, the currently available technique would be single-cell-initiated rabies tracing. This approach would help identify the starter SPN and the upstream SPNs that provide input to the starter cell, thereby offering a clearer understanding of the local circuit.

      (3) In Fig 3N-S there are no stats confirming that optogenetic stimulation does indeed increase lever pressing in each group (though it obviously looks like it does). It would be helpful to add statistics for this comparison, in addition to the between-group comparisons that are shown. 

      We thank the reviewer for this thoughtful suggestion. To assess whether optogenetic stimulation increases lever pressing in each group shown in Figures 3O, 3P, 3R, and 3S, we employed a permutation test (10,000 permutations). This non-parametric statistical method does not rely on assumptions about the underlying data distribution and is particularly appropriate for our analysis given the relatively small sample sizes.
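
      For readers unfamiliar with the approach, a permutation test on a learning-curve slope can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used for Figure 3: it assumes that the test statistic is the least-squares slope of presses against training day, that the null distribution is built by shuffling the press counts across days, and that the test is one-tailed. The function names and the press counts are hypothetical.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def permutation_slope_test(days, presses, n_perm=10_000, seed=0):
    """One-tailed permutation p-value for a positive slope of presses over days."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = ols_slope(days, presses)
    shuffled = list(presses)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any day-to-presses relationship
        # Small tolerance so permutations that tie the observed slope are counted.
        if ols_slope(days, shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


days = [1, 2, 3, 4, 5, 6, 7]
presses = [2, 3, 5, 4, 8, 9, 12]  # hypothetical values, not data from the study
slope, p = permutation_slope_test(days, presses)
print(f"slope = {slope:.4f}, one-tailed permutation P = {p:.4f}")
```

      Because the null distribution is generated from the data themselves, no assumption about normality is needed, which is the property the response relies on for small samples.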

      Additionally, in response to Reviewer 3’s concern regarding the potential cytotoxicity of rabies virus affecting behavioral outcomes during in vivo optogenetic stimulation experiments, we focused our analysis on Days 1 through 7 of the ICSS test. This time window remains within 10 days post-rabies infection, a period during which previous studies have reported minimal cytopathic effects (Osakada et al., 2011).

      Accordingly, we have updated Figure 3N-S and revised the associated statistical analyses in the figure legend as follows:

      (O-P) D1-SPN (red) but not D2-SPN stimulation (black) drives ICSS behavior in both the DMS (O: D1, n = 6, permutation test, slope = 1.5060, P = 0.0378; D2, n = 5, permutation test, slope = -0.2214, P = 0.1021; one-tailed Mann-Whitney test, Day 7 D1 vs. D2, P = 0.0130) and the DLS (P: D1, n = 6, permutation test, slope = 28.1429, P = 0.0082; D2, n = 5, permutation test, slope = -0.3429, P = 0.0463; one-tailed Mann-Whitney test, Day 7 D1 vs. D2, P = 0.0390). *, P < 0.05. (Q) Timeline of helper virus injections, rabies-ChR2 injections and optogenetic stimulation for ICSS behavior. (R-S) Optogenetic stimulation of the cortical neurons projecting to either D1- or D2-SPNs induces ICSS behavior in both the MCC (R: MCC-D1, n = 5, permutation test, Day1-Day7, slope = 2.5857, P = 0.0034; MCC-D2, n = 5, Day2-Day7, permutation test, slope = 1.4229, P = 0.0344; no significant effect on Day7, MCC-D1 vs. MCC-D2, two-tailed Mann-Whitney test, P = 0.9999) and the M1 (S: M1-D1, n = 5, permutation test, Day1-Day7, slope = 1.8214, P = 0.0259; M1-D2, n = 5, Day1-Day7, permutation test, slope = 1.8214, P = 0.0025; no significant effect on Day7, M1-D1 vs. M1-D2, two-tailed Mann-Whitney test, P = 0.3810). n.s., not statistically significant.

      We believe this updated analysis and additional context further strengthen the validity of our conclusions regarding the reinforcement effects.

      (4) Line 206: mice were trained for "a few more days" is not a very rigorous description. It would be helpful to state the range of additional days of training. 

      We thank the reviewer for the suggestion. In accordance with the Methods section, we have now specified in the main text (line 207) that training continued for 4 additional days.

      (5) In Fig 4D,H, the statistical comparison is relative modulation (% change) by stimulation of D1- vs D2- projecting inputs. Please show statistics comparing the effect of stimulation on lever presses for each individual condition. For example, is the effect of MCC-D2 stimulation in panel D negative or not significant? 

      Thank you for your suggestion. Below are the statistical results, which we have also incorporated into the figure legend for clarity. To assess the net effects of each manipulation, we compared the observed percentage changes with a theoretical value of zero.

      In Figure 4D, optogenetic stimulation of D1-projecting MCC neurons significantly increased the pressing rate (MCC-D1, n = 8, one-sample two-tailed t-test, t = 2.814, P = 0.0131), whereas stimulation of D2-projecting MCC neurons did not produce a significant effect (MCC-D2, n = 7, one-sample two-tailed t-test, t = 0.8481, P = 0.4117).

      In contrast, Figure 4H shows that optogenetic stimulation of both D1- and D2-projecting M1 neurons significantly increased the sequence press rate (M1-D1, n = 6, one-sample two-tailed Wilcoxon signed-rank test, P = 0.0046; M1-D2, n = 7, one-sample two-tailed Wilcoxon signed-rank test, P = 0.0479).

      These analyses help clarify the distinct behavioral effects of manipulating different corticostriatal projections.
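
      As an illustration of the comparison against a theoretical value of zero, the sketch below computes a one-sample t statistic and its degrees of freedom using only the Python standard library. The function name `one_sample_t` and the percentage-change values are hypothetical and are not data from the study; converting t to a P value requires the t distribution, which the standard library does not provide (SciPy's `scipy.stats.ttest_1samp` is one option).

```python
import math
import statistics


def one_sample_t(values, popmean=0.0):
    """Return (t statistic, degrees of freedom) for H0: mean(values) == popmean."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    sem = sd / math.sqrt(n)  # standard error of the mean
    return (mean - popmean) / sem, n - 1


# Hypothetical percentage changes in press rate for eight animals.
pct_change = [12.0, 30.0, -5.0, 22.0, 18.0, 40.0, 9.0, 15.0]
t_stat, df = one_sample_t(pct_change, popmean=0.0)
print(f"t({df}) = {t_stat:.3f}")
```

      A positive t indicates a mean increase relative to zero; for non-normal data, the Wilcoxon signed-rank test used for Figure 4H plays the same role.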

      (6) Are data in Fig 1G-H from a D1- or A2a- cre mouse? 

      The data in Fig 1G-H are from a D1-Cre mouse.

      (7) In Fig S3 it looks like there may actually be an effect of 20 Hz stimulation of D2-SPNs. Though it probably doesn't affect the interpretation. 

      As indicated by the statistics, there is a slight, but not statistically significant, decrease in locomotion when 20 Hz stimulation is delivered to the motor cortex with ChR2 expression in D2-SPNs in the striatum.

      Reviewer #2 (Recommendations For The Authors): 

      The rabies tracing is referred to on several occasions as "new" but the reference papers are from 2011, 2013, and 2018. It is unclear what is new about the system used in the paper and what new feature is relevant to the experiments that were performed. Either clarify or remove "new" terminology. 

      Thank you for bringing this to our attention. We have revised the relevant text accordingly at line 20 in the Abstract, line 31 in the In Brief, line 69 in the Introduction, line 83 in the Results, and line 226 in the Discussion to improve clarity and accuracy.

      In Figure 2 D and G, D1 eGFP (+) and D2 eGFP(-) are plotted separately. These are the same cell type; therefore it may work best to combine that data. This could also be done for 'input to D2- Record D2' in panel D as well as 'input D1-Record D2' and 'input D2-Record D1' in panel G. Combining the information in panel D and G and comparing all 4 conditions to each other would give a better understanding of the comparison of functional connectivity between cortical neurons and D1 and D2 SPNs. 

      We thank the reviewer for the thoughtful suggestion. While presenting single bars for each condition (e.g., ‘input D1 - record D1’) might improve visual simplicity, it would obscure an important aspect of our experimental design. Specifically, we aimed to highlight that the comparisons between D1- and D2-projecting neurons to D1 and D2 SPNs were counterbalanced within the same animals - not just across different groups. By showing both D1-eGFP(+) and D2-eGFP(-), or vice versa, within each group and at similar proportions, we provide a more complete picture of the internal control built into our design. This format helps assure readers that our conclusions are not biased by group-level differences but are supported by within-subject comparisons. We therefore believe that the current presentation better communicates the rigor and balance of our experimental approach.

      The findings in Figure 2 are stated as D1 projecting excitatory inputs have a higher probability of targeting D1 SPNs while D2 projecting excitatory inputs target both D1 SPNs and D2 SPNs. It may be more clear to say that some cortical neurons project specifically to D1 SPNs while other cortical neurons project to both D1 and D2 SPNs equally. A better summary diagram could also help with clarity. 

      Thank you for bringing this up. The data we present reflect the connection probabilities of D1- or D2-projecting cortical neurons to D1 or D2 SPNs. One possible interpretation, as the reviewer suggests, is that a subset of cortical neurons preferentially targets D1 SPNs, while others exhibit more balanced projections to both D1 and D2 SPNs. However, we cannot rule out alternative explanations - for example, that some D2-projecting neurons preferentially target D2 SPNs, or that the observed differences arise from the overall proportions of D1- and D2-projecting cortical neurons connecting to each striatal subtype.

      There are multiple possible patterns of connectivity that could give rise to the observed differences in connection ratios. Based on our current data, we can confidently conclude the existence of asymmetric cortico-striatal projections to the direct and indirect pathways, but the precise nature of this asymmetry will require further investigation.

      Figure 4 introduces the FR8 task, but there are similar takeaways to the findings from Figure 3. Is there another justification for the FR8 task or interesting way of interpreting that data that could add richness to the manuscript?

      The FR8 task is a self-initiated operant sequence task that relies on motor learning mechanisms, whereas the open field test solely assesses spontaneous locomotion. Furthermore, the sequence task enables us to dissect the functional role of specific neuronal populations in the initiation, maintenance, and termination of sequential movements through closed-loop optogenetic manipulations integrated into the task design. These methodological advantages underscore the rationale for including Figure 4 in the manuscript, as it highlights the unique insights afforded by this experimental paradigm.

      I am somewhat surprised to see that D1-SPN stimulation in DLS gave the results in Figure 3 F and P, as mentioned in the public review. These contrast with some previous results (Cui et al, J Neurosci, 2021). Any explanation? Would be useful to speculate or compare parameters as this could have important implications for DLS function.

      Thank you for raising this point. While Cui’s study has generated some debate, several independent investigations have consistently demonstrated that stimulation of D1-SPNs in the dorsolateral striatum (DLS) facilitates local motion and lever-press behaviors (Dong et al., 2025; Geddes et al., 2018; Kravitz et al., 2010). These findings support the functional role of D1-SPNs in promoting movement and motivated actions.

      The differences in behavioral outcomes observed between our study and that of Cui et al. may stem from several methodological factors, particularly related to anatomical targeting and optical stimulation parameters.

      Specifically, our experiments targeted the DMS at AP +0.5 mm, ML ±1.5 mm, DV –2.2 mm, and the DLS at AP +0.5 mm, ML ±2.5 mm, DV –2.2 mm. In contrast, Cui’s study targeted the DMS at AP +0.9 mm, ML ±1.4 mm, DV –3.0 mm, and the DLS at AP +0.7 mm, ML ±2.3 mm, DV –3.0 mm. These differences indicate that their targeting was slightly more rostral and more ventral than ours, which could have led to stimulation of distinct neuronal populations within the striatum, potentially accounting for variations in behavioral effects observed during optogenetic activation.
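
      As a minimal illustration of the coordinate comparison above, the following Python sketch computes the per-axis offsets between the two sets of targets. The coordinates are those quoted in this response; the names `Target`, `offset`, `OURS`, and `CUI` are hypothetical, ML is treated as a magnitude because both studies targeted bilaterally, and the sketch assumes that both studies report coordinates against the same reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Stereotaxic coordinate in mm (negative DV = deeper)."""
    ap: float
    ml: float
    dv: float


def offset(ours: Target, theirs: Target) -> dict:
    """Return (theirs - ours) for each axis, rounded to 0.1 mm."""
    return {
        "ap": round(theirs.ap - ours.ap, 1),
        "ml": round(theirs.ml - ours.ml, 1),
        "dv": round(theirs.dv - ours.dv, 1),
    }


# Coordinates quoted in the response; ML is given as a magnitude.
OURS = {"DMS": Target(0.5, 1.5, -2.2), "DLS": Target(0.5, 2.5, -2.2)}
CUI = {"DMS": Target(0.9, 1.4, -3.0), "DLS": Target(0.7, 2.3, -3.0)}

for region in ("DMS", "DLS"):
    # Positive AP offset = more rostral; negative DV offset = more ventral.
    print(region, offset(OURS[region], CUI[region]))
```

      A positive AP offset and a negative DV offset for both regions correspond to the "more rostral and more ventral" description given above.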

      In addition, the optical fibers used in the two studies differed markedly. We employed optical fibers with a 200 µm core diameter and a numerical aperture (NA) of 0.37. Cui’s study used fibers with a larger core diameter (250 µm) and a higher NA (0.66), which would produce a broader spread and deeper penetration of light. This increased photostimulation volume may have recruited a more extensive network of neurons, possibly including off-target circuits, thus influencing the behavioral outcomes in a manner not seen in our more spatially constrained stimulation paradigm.
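
      To give a rough sense of how core diameter and NA affect light spread, the sketch below uses a purely geometric cone model: the half-angle of the emitted cone is asin(NA / n), and the beam radius grows linearly with depth. This is an illustrative assumption, not the model used in either study; it ignores scattering and absorption, and the tissue refractive index (`TISSUE_INDEX = 1.36`) and the function names are assumed values chosen for the example.

```python
import math

# Assumed refractive index of brain tissue (illustrative, not from the response).
TISSUE_INDEX = 1.36


def half_angle_deg(na: float, n_medium: float = TISSUE_INDEX) -> float:
    """Half-angle of the emission cone in the medium, in degrees."""
    return math.degrees(math.asin(na / n_medium))


def beam_radius_um(core_diameter_um: float, na: float, depth_um: float,
                   n_medium: float = TISSUE_INDEX) -> float:
    """Geometric beam radius at a given depth below the fiber tip (no scattering)."""
    theta = math.asin(na / n_medium)
    return core_diameter_um / 2 + depth_um * math.tan(theta)


# Fiber parameters quoted in the response: (core diameter in um, NA).
FIBERS = {"this study": (200.0, 0.37), "Cui et al.": (250.0, 0.66)}

for label, (core, na) in FIBERS.items():
    print(label, round(half_angle_deg(na), 1), round(beam_radius_um(core, na, 500.0)))
```

      Under these assumptions the larger-core, higher-NA fiber illuminates a wider cross-section at any given depth, in line with the qualitative argument above.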

      Taken together, these methodological differences, both in anatomical targeting and optical stimulation parameters, likely contribute to the discrepancies in behavioral results observed between the two studies. Our findings, consistent with other independent reports, support the role of D1-SPNs in facilitating movement and reinforcement behaviors under more controlled and localized stimulation conditions.

      Reviewer #3 (Recommendations For The Authors): 

      Minor: 

      The authors repeatedly state that they are using a new rabies virus system, but the system has been in widespread use for 16 years, including in the exact circuits the authors are studying, for over a decade. I would not consider this new. 

      Thank you for bringing this to our attention. We have revised the relevant text accordingly at line 20 in the Abstract, line 31 in the In Brief, line 69 in the Introduction, line 83 in the Results, and line 226 in the Discussion to improve clarity and accuracy.

      Figure 2G, how many mice were used for recordings?

      In Fig. 2G, we used 8 mice in the D1-projecting to D2 EGFP(+) group, 7 mice in the D1-projecting to D1 EGFP(-) group, 8 mice in the D2-projecting to D1 EGFP(+) group, and 10 mice in the D2-projecting to D2 EGFP(-) group.

      The amplitude of inputs was not reported in figure 2. This is important, as the strength of the connection matters. This is reported in Figure S2, but how exactly this relates to the presence or absence of connections should be made clearer.

      The amplitude data presented in Figure S2 summarize all recorded currents from confirmed connections, as detailed in the Methods section. A connection is defined by the presence of a detectable and reliable postsynaptic current with an onset latency of less than 10 ms following laser stimulation.
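
      The connection criterion described above can be sketched as a simple classifier. Only the 10 ms latency limit comes from the text; the amplitude threshold (`min_amplitude_pa`) and the reliability threshold (`min_success_fraction`) are hypothetical stand-ins for "detectable" and "reliable", and the function name and trial format are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# One trial: (peak current amplitude in pA, onset latency in ms or None if no response).
Trial = Tuple[float, Optional[float]]


def is_connected(trials: Sequence[Trial],
                 max_latency_ms: float = 10.0,        # from the text
                 min_amplitude_pa: float = 5.0,       # hypothetical detection threshold
                 min_success_fraction: float = 0.5    # hypothetical reliability threshold
                 ) -> bool:
    """Return True if enough trials show a detectable current with short onset latency."""
    if not trials:
        return False
    responsive = [
        latency is not None
        and latency < max_latency_ms
        and abs(amplitude) >= min_amplitude_pa
        for amplitude, latency in trials
    ]
    return sum(responsive) / len(trials) >= min_success_fraction
```

      In such a scheme, only cells classified as connected would contribute to an amplitude summary, while both connected and unconnected cells would enter the connection-probability estimate.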

      References in the reply-to-review comments:

      Aoki, S., Smith, J.B., Li, H., Yen, X.Y., Igarashi, M., Coulon, P., Wickens, J.R., Ruigrok, T.J.H., and Jin, X. (2019). An open cortico-basal ganglia loop allows limbic control over motor output via the nigrothalamic pathway. Elife 8, e49995.

      Chatterjee, S., Sullivan, H.A., MacLennan, B.J., Xu, R., Hou, Y.Y., Lavin, T.K., Lea, N.E., Michalski, J.E., Babcock, K.R., Dietrich, S., et al. (2018). Nontoxic, double-deletion-mutant rabies viral vectors for retrograde targeting of projection neurons. Nat Neurosci 21, 638-646.

      Cruikshank, S.J., Urabe, H., Nurmikko, A.V., and Connors, B.W. (2010). Pathway-Specific Feedforward Circuits between Thalamus and Neocortex Revealed by Selective Optical Stimulation of Axons. Neuron 65, 230-245.

      Dong, J., Wang, L.P., Sullivan, B.T., Sun, L.X., Smith, V.M.M., Chang, L.S., Ding, J.H., Le, W.D., Gerfen, C.R., and Cai, H.B. (2025). Molecularly distinct striatonigral neuron subtypes differentially regulate locomotion. Nat Commun 16, 2710.

      Geddes, C.E., Li, H., and Jin, X. (2018). Optogenetic Editing Reveals the Hierarchical Organization of Learned Action Sequences. Cell 174, 32-43.

      Jin, L., Sullivan, H.A., Zhu, M., Lavin, T.K., Matsuyama, M., Fu, X., Lea, N.E., Xu, R., Hou, Y.Y., Rutigliani, L., et al. (2024). Long-term labeling and imaging of synaptically connected neuronal networks in vivo using double-deletion-mutant rabies viruses. Nat Neurosci 27, 373-383.

      Jin, X., Tecuapetla, F., and Costa, R.M. (2014). Basal ganglia subcircuits distinctively encode the parsing and concatenation of action sequences. Nat Neurosci 17, 423-430.

      Klug, J.R., Engelhardt, M.D., Cadman, C.N., Li, H., Smith, J.B., Ayala, S., Williams, E.W., Hoffman, H., and Jin, X. (2018). Differential inputs to striatal cholinergic and parvalbumin interneurons imply functional distinctions. Elife 7, e35657.

      Kravitz, A.V., Freeze, B.S., Parker, P.R.L., Kay, K., Thwin, M.T., Deisseroth, K., and Kreitzer, A.C. (2010). Regulation of parkinsonian motor behaviours by optogenetic control of basal ganglia circuitry. Nature 466, 622-626.

      Osakada, F., Mori, T., Cetin, A.H., Marshel, J.H., Virgen, B., and Callaway, E.M. (2011). New Rabies Virus Variants for Monitoring and Manipulating Activity and Gene Expression in Defined Neural Circuits. Neuron 71, 617-631.

      Smith, J.B., Klug, J.R., Ross, D.L., Howard, C.D., Hollon, N.G., Ko, V.I., Hoffman, H., Callaway, E.M., Gerfen, C.R., and Jin, X. (2016). Genetic-Based Dissection Unveils the Inputs and Outputs of Striatal Patch and Matrix Compartments. Neuron 91, 1069-1084.

      Wall, N.R., De La Parra, M., Callaway, E.M., and Kreitzer, A.C. (2013). Differential Innervation of Direct- and Indirect-Pathway Striatal Projection Neurons. Neuron 79, 347-360.

      Wickersham, I.R., Lyon, D.C., Barnard, R.J.O., Mori, T., Finke, S., Conzelmann, K.K., Young, J.A.T., and Callaway, E.M. (2007). Monosynaptic restriction of transsynaptic tracing from single, genetically targeted neurons. Neuron 53, 639-647.

      Zhang, B.B., Geddes, C.E., and Jin, X. (2025). Complementary corticostriatal circuits orchestrate action repetition and switching. Sci Adv, in press.

      Zhu, Z.G., Gong, R., Rodriguez, V., Quach, K.T., Chen, X.Y., and Sternson, S.M. (2025). Hedonic eating is controlled by dopamine neurons that oppose GLP-1R satiety. Science 387, eadt0773.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This valuable study builds on previous work by the authors by presenting a potentially key method for correcting optical aberrations in GRIN lens-based microendoscopes used for imaging deep brain regions. By combining simulations and experiments, the authors show that the obtained field of view is significantly increased with corrected, versus uncorrected microendoscopes. The evidence supporting the claims of the authors is solid, although some aspects of the manuscript should be clarified and missing information provided. Because the approach described in this paper does not require any microscope or software modifications, it can be readily adopted by neuroscientists who wish to image neuronal activity deep in the brain.

      We thank the Referees for their interest in the paper and for the constructive feedback. We have taken the time necessary to address all of their comments, acquiring new data and performing additional analyses. With the inclusion of these new results, we modified four main figures (Figures 1, 6, 7, and 8), added three new Supplementary Figures (Supplementary Figures 1, 2, and 3), and significantly edited the text. Based on the additional work suggested by the Referees, we believe that we have improved our manuscript, provided missing information, and clarified the aspects of the manuscript to which the Referees drew our attention.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Referee’s comment: Sattin, Nardin, and colleagues designed and evaluated corrective microlenses that increase the useable field of view of two long (>6mm) thin (500 um diameter) GRIN lenses used in deep-tissue two-photon imaging. This paper closely follows the thread of earlier work from the same group (e.g. Antonini et al, 2020; eLife), filling out the quiver of available extended-field-of-view 2P endoscopes with these longer lenses. The lenses are made by a molding process that appears practical and easy to adopt with conventional two-photon microscopes.

      Simulations are used to motivate the benefits of extended field of view, demonstrating that more cells can be recorded, with less mixing of signals in extracted traces, when recorded with higher optical resolution. In vivo tests were performed in the piriform cortex, which is difficult to access, especially in chronic preparations.

      The design, characterization, and simulations are clear and thorough, but not exhaustive (see below), and do not break new ground in optical design or biological application. However, the approach shows much promise, including for applications not mentioned in the present text such as miniaturized GRIN-based microscopes. Readers will largely be interested in this work for practical reasons: to apply the authors' corrected endoscopes.

      Strengths:

      The text is clearly written, the ex vivo analysis is thorough and well-supported, and the figures are clear. The authors achieved their aims, as evidenced by the images presented, and were able to make measurements from large numbers of cells simultaneously in vivo in a difficult preparation.

      Weaknesses:

      Referee’s comment: (1) The novelty of the present work over previous efforts from the same group is not well explained. What needed to be done differently to correct these longer GRIN lenses?

      We thank the Referee for the positive evaluation of our work. The optical properties of GRIN lenses depend on the geometrical and optical features of the specific GRIN lens type considered, i.e. its diameter, length, numerical aperture, pitch, and radial modulation of the refractive index. Our approach is based on the addition of a corrective optical element at the back end of the GRIN lens to compensate for aberrations that light encounters as it travels through the GRIN lens. The corrective optical element must, therefore, be specifically tailored to the GRIN lens type whose aberrations we aim to correct. The novelty of the present article lies in the successful execution of the ray-trace simulations and two-photon lithography fabrication of corrective optical elements necessary to achieve aberration correction in the two novel and long GRIN lens types, i.e. NEM-050-25-15-860-S-1.5p and NEM-050-23-15-860-S-2.0p (GRIN length, 6.4 mm and 8.8 mm, respectively). Our previous work (Antonini et al. eLife 2020) demonstrated aberration correction with GRIN lenses shorter than 4.1 mm. The design and fabrication of a single corrective optical element suitable to enlarge the field-of-view (FOV) in these longer GRIN lenses is not obvious, especially because longer GRIN lenses are affected by stronger aberrations. To better clarify this point, we revised the Introduction at page 5 (lines 3-10 from bottom) as follows:

      “Recently, a novel method based on 3D microprinting of polymer optics was developed to correct for GRIN aberrations by placing specifically designed aspherical corrective lenses at the back end of the GRIN lens 7. This approach is attractive because it is built-in on the GRIN lens and corrected microendoscopes are ready-to-use, requiring no change in the optical set-up. However, previous work demonstrated the feasibility of this method only for GRIN lenses of length < 4.1 mm 7, which are too short to reach the most ventral regions of the mouse brain. The applicability of this technology to longer GRIN lenses, which are affected by stronger optical aberrations 19, remained to be proven.”

      (2) Some strong motivations for the method are not presented. For example, the introduction (page 3) focuses on identifying neurons with different coding properties, but this can be done with electrophysiology (albeit with different strengths and weaknesses). Compared to electrophysiology, optical methods more clearly excel at genetic targeting, subcellular measurements, and molecular specificity; these could be mentioned.

      Thank you for the comment. We added a paragraph in the Introduction (page 3, lines 2-8), as suggested by the Reviewer:

      “High resolution 2P fluorescence imaging of the awake brain is a fundamental tool to investigate the relationship between the structure and the function of brain circuits 1. Compared to electrophysiological techniques, functional imaging in combination with genetically encoded indicators allows monitoring the activity of genetically targeted cell types, access to subcellular compartments, and tracking the dynamics of many biochemical signals in the brain (2). However, a critical limitation of multiphoton microscopy lies in its limited (< 1 mm) penetration depth in scattering biological media 3”.

      Another example, in comparing microfabricated lenses to other approaches, an unmentioned advantage is miniaturization and potential application to mini-2P microscopes, which use GRIN lenses.

      We added the concept suggested by the Reviewer in the Discussion (page 21, lines 4-7 from bottom). The text now reads:

      “Another advantage of long corrected microendoscopes described here over adaptive optics approaches is the possibility to couple corrected microendoscopes with portable 2P microscopes 42-44, allowing high resolution functional imaging of deep brain circuits on an enlarged FOV during naturalistic behavior in freely moving mice”.

      (3) Some potentially useful information is lacking, leaving critical questions for potential adopters:

      How sensitive is the assembly to decenter between the corrective optic and the GRIN lens?

      Following the Referee’s comment, we conducted new optical simulations to evaluate the decrease in optical performance of the corrected endoscopes as a function of the radial shift of the corrective lens from the optical axis of the GRIN rod (decentering, new Supplementary Figure 3), using light rays passing either off- or on-axis. For off-axis rays, we found that the Strehl ratio remained above 0.8 (Maréchal criterion) for positive translations in the range 6-11.5 microns and 16-50 microns for the 6.4 mm- and the 8.8 mm-long corrected microendoscope, respectively, while the Strehl ratio decreased below 0.8 for negative translations of amplitude ~ 5 microns. Please note that for the most marginal rays, a negative translation produces a mismatch between the corrective microlens and the GRIN lens such that the light rays no longer pass through the corrective lens. In contrast, rays passing near the optical axis were still focused by the corrected probe with Strehl ratio above 0.8 in a range of radial shifts of -40 – 40 microns for both microendoscope types. Altogether, these novel simulations suggest that a decentering of < 5 microns between the corrective microlens and the GRIN lens does not substantially affect the optical properties of the corrected endoscopes. These new results are now displayed in Supplementary Figure 3 and described on page 7 (lines 3-5 from bottom).
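
      For readers less familiar with the 0.8 threshold used above, the sketch below relates the Strehl ratio to RMS wavefront error using the extended Maréchal approximation, S ≈ exp(-(2πσ)²), with σ expressed in waves. This is a generic textbook approximation offered for orientation only; it is not the ray-trace computation behind the reported simulations, and the function names are hypothetical.

```python
import math


def strehl_from_rms(rms_waves: float) -> float:
    """Extended Marechal approximation: S ~ exp(-(2*pi*sigma)^2), sigma in waves."""
    return math.exp(-(2.0 * math.pi * rms_waves) ** 2)


def rms_from_strehl(strehl: float) -> float:
    """Invert the approximation: RMS wavefront error (in waves) for a given Strehl ratio."""
    return math.sqrt(-math.log(strehl)) / (2.0 * math.pi)


WAVELENGTH_NM = 920.0  # simulation wavelength quoted in the response

sigma_waves = rms_from_strehl(0.8)
# RMS wavefront error at the 0.8 threshold, in waves and in nm at 920 nm.
print(sigma_waves, sigma_waves * WAVELENGTH_NM)
```

      Under this approximation, a Strehl ratio of 0.8 corresponds to an RMS wavefront error of roughly 0.075 waves, i.e. about 69 nm at 920 nm.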

      What is the yield of fabrication and of assembly?

      The fabrication yield using molding was ~ 90% (N > 30 molded lenses). The main limitation of this procedure was the formation of air bubbles between the mold negative and the glass coverslip. Molded lenses were visually inspected with a stereomicroscope and discarded if air bubbles had formed.

      The assembly yield, i.e. correct positioning of the GRIN lens with respect to the coverslip, was 100 % (N = 27 endoscopes).

      We added this information in the Methods at page 29 (lines 1-12), as follows:

      “After UV curing, the microlens was visually inspected at the stereomicroscope. In case of formation of air bubbles, the microlens was discarded (yield of the molding procedure: ~ 90 %, N > 30 molded lenses). The coverslip with the attached corrective lens was sealed to a customized metal or plastic support ring of appropriate diameter (Fig. 2C). The support ring, the coverslip and the aspherical lens formed the upper part of the corrected microendoscope, to be subsequently coupled to the proper GRIN rod (Table 2) using a custom-built opto-mechanical stage and NOA63 (Fig. 2C) 7. The GRIN rod was positioned perpendicularly to the glass coverslip, on the other side of the coverslip compared to the corrective lens, and aligned to the aspherical lens perimeter (Fig. 2C) under the guidance of a wide field microscope equipped with a camera. The yield of the assembly procedure for the probes used in this work was 100 % (N = 27 endoscopes). For further details on the assembly of corrected microendoscope see (7)”.

      Supplementary Figure 1: Is this really a good agreement between the design and measured profile? Does the figure error (~10 um in some cases on average) noticeably degrade the image?

      As the Reviewer correctly noticed, the discrepancy between the simulated profile and the experimentally measured profile can be up to 5-10 microns at specific radial positions. This discrepancy could be due to issues with: (i) the fabrication of the microlens; (ii) the experimental measurement of the lens profile with the stylus profilometer. To discriminate between these two possibilities, we asked what the expected optical properties of the corrected endoscope would be if the corrective lens had the experimentally measured (not the simulated) profile. To this aim, we performed new optical simulations of the point spread function (PSF) of the corrected probe using, as corrective microlens profile, the average, experimentally measured, profile of a fabricated corrective lens. For both microendoscope types, we first fitted the mean experimentally measured profile of the fabricated lens with the aspherical function reported in equation (1) of the main text:

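      $$z(r) = h + \frac{c\,r^{2}}{1 + \sqrt{1 - (1 + k)\,c^{2} r^{2}}} + \sum_{i=1}^{n} \alpha_{i}\, r^{2i}$$

      where:

      - $r$ is the radial distance from the optical axis;

      - $c$ is equal to $1/R$, where $R$ is the radius of curvature;

      - $k$ is the conic constant;

      - $\alpha_{1} - \alpha_{n}$ are asphericity coefficients;

      - $h$ is the height of the microlens profile on-axis.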
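
      The parameters listed above (radial distance, curvature c = 1/R, conic constant, asphericity coefficients, and on-axis height) match a standard even-asphere description of a lens profile. The sketch below evaluates such a profile under that assumption; the exact form, the coefficient indexing (the i-th coefficient multiplying r to the power 2i), the function name, and every numerical value in the example are illustrative assumptions, not the fitted values or code used in this work.

```python
import math
from typing import Sequence


def asphere_profile(r: float, c: float, k: float,
                    alphas: Sequence[float], h: float = 0.0) -> float:
    """Even-asphere profile: on-axis height + conic term + even polynomial terms.

    r      radial distance from the optical axis (mm)
    c      curvature, 1/R (1/mm)
    k      conic constant
    alphas asphericity coefficients; alphas[i] multiplies r**(2*(i+1))
    h      profile height on-axis (mm)
    """
    radicand = 1.0 - (1.0 + k) * c**2 * r**2
    if radicand < 0.0:
        raise ValueError("r lies outside the domain of the conic term")
    conic = c * r**2 / (1.0 + math.sqrt(radicand))
    polynomial = sum(a * r ** (2 * (i + 1)) for i, a in enumerate(alphas))
    return h + conic + polynomial


# Hypothetical parameters, chosen only to exercise the function.
for r_mm in (0.0, 0.1, 0.2):
    print(r_mm, asphere_profile(r_mm, c=2.0, k=-1.0, alphas=[0.5, -1.0], h=0.05))
```

      With k = 0 and no polynomial terms the function reduces to a spherical cap, which provides a simple sanity check on the conic term.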

      The fitting values of the parameters of equation (1) for the two lenses are reported for the Referee’s inspection here below (variables describing distances are expressed in mm):

      Author response table 1.

      Fitting values for the parameters of Equation (1) describing the profile of corrective microlens replicas measured with the stylus profilometer. Distances are expressed in mm.

      We then assumed that the profiles of the corrective microlenses were equal to the mean experimentally measured profiles and used the aspherical fitting functions in the optical simulations to compute the performance of corrected microendoscopes. For both microendoscope types, we found that the Strehl ratio was lower than 0.35, well below the theoretical diffraction-limited threshold of 0.8 (Maréchal criterion) at moderate distances from the optical axis (68 μm-94 μm and 67 μm-92 μm on the focal plane in the object space, after the front end of the GRIN lens, for the 6.4 mm- and the 8.8 mm-long corrected microendoscope, respectively, Author response image 1A, C), and the PSF was strongly distorted (Author response image 1B, D).

      Author response image 1.

      Simulated optical performance of corrected probes with profiles of corrective microlenses equal to the mean experimentally measured profiles of fabricated corrective lenses. A) The Strehl ratio for the 6.4 mm-long corrected microendoscope with measured microlens profile (black dots) is computed on-axis (distance from the center of the FOV d = 0 µm) and at two radial distances off-axis (d = 68 μm and 94 μm on the focal plane in the object space) and compared to the Strehl ratio of the uncorrected (red line) and corrected (blue line) microendoscopes. B) Lateral (x,y) and axial (x,z) fluorescence intensity (F) profiles of simulated PSFs on-axis (left) and off-axis (right, at the indicated distance d computed on the focal plane in the object space) for the 6.4 mm-long corrected microendoscope with measured microlens profile. C) Same as in (A) for the 8.8 mm-long corrected microendoscope (off-axis d = 67 μm and 92 μm on the focal plane in the object space). D) Same as in (B) for the 8.8 mm-long corrected microendoscope.

      These simulated findings are in contrast with the experimentally measured optical properties of our corrected endoscopes (Figure 3). In other words, these novel simulated results show that the experimentally measured profiles of the corrective lenses are incompatible with the experimental measurements of the optical properties of the corrected endoscopes. Therefore, our experimental recording of the lens profile shown in Supplementary Figure 1 of the first submission (now Supplementary Figure 4) should be used only as a coarse measure of the lens shape and cannot be used to precisely compare simulated lens profiles with measured lens profiles.

      How do individual radial profiles compare to the presented means?

      We provide below a modified version of Supplementary Figure 4 (Supplementary Figure 1 in the first submission), where individual profiles measured with the stylus profilometer and the mean profile are displayed for both microendoscope types (Author response image 2). In the manuscript (Supplementary Figure 4), we suggest continuing to show mean profiles ± standard errors of the mean, as we did in the original submission.

      Author response image 2.

      Characterization of polymeric corrective lens replicas. A) Stylus profilometer measurements were performed along the radius of the corrective polymer microlens replica for the 6.4 mm-long corrected microendoscope. Individual measured profiles (grey solid lines) obtained from n = 3 profile measurements on m = 3 different corrective lens replicas, plus the mean profile (black solid line) are displayed. B) Same as (A) for the 8.8 mm-long microendoscope.

      What is the practical effect of the strong field curvature? Are the edges of the field, which come very close to the lens surface, a practical limitation?

      A first practical effect of the field curvature is that structures at different z coordinates are sampled. The observed field curvature of corrected endoscopes may therefore impact imaging in brain regions characterized by strong axially organized anatomy (e.g., the pyramidal layer of the hippocampus), but would not significantly affect imaging in regions with homogeneous cell density within the axial extension of the field curvature (< 170 µm, see more details below). A second consequence of the field curvature, as the Referee correctly points out, is that cells at the border of the FOV are closer to the front end of the GRIN lens. In measurements of subresolved fluorescent layers (Figure 3A-D), we observed that the field curvature extends in the axial direction to ~ 110 μm and ~170 μm for the 6.4 mm- and the 8.8 mm-long microendoscopes, respectively. Considering that the nominal working distances on the object side of the 6.4 mm- and the 8.8 mm-long microendoscopes were, respectively, 210 μm and 178 μm (Table 3), structures positioned at the very edge of the FOV were ~ 100 μm and ~ 8 μm away from the GRIN front end for the 6.4 mm-long and for the 8.8 mm-long probe, respectively. Previous studies have shown that brain tissue within 50-100 μm from the GRIN front end may show signs of tissue reaction to the implant (Curreli et al. PLOS Biology 2022, Attardo et al. Nature 2015). Therefore, structures at the very edge of the FOV of the 8.8 mm-long endoscopes, but not those at the edge of the 6.4 mm-long endoscopes, may be within the volume showing tissue reaction. We added a paragraph in the text to discuss these points (page 18 lines 10-14).
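
      The edge-clearance arithmetic above can be reproduced with the short sketch below. The working distances and field-curvature extents are the values quoted in this response; the function names, the dictionary `PROBES`, and the choice of 100 μm (the upper bound of the reported 50-100 μm range) as a strict threshold are assumptions made for illustration.

```python
def edge_clearance_um(working_distance_um: float, field_curvature_um: float) -> float:
    """Distance between structures at the FOV edge and the GRIN front end."""
    return working_distance_um - field_curvature_um


def within_reaction_zone(clearance_um: float, zone_um: float = 100.0) -> bool:
    """True if the clearance falls strictly inside the assumed tissue-reaction zone."""
    return clearance_um < zone_um


# (nominal working distance, axial extent of field curvature), both in um.
PROBES = {"6.4 mm": (210.0, 110.0), "8.8 mm": (178.0, 170.0)}

for name, (wd, curvature) in PROBES.items():
    clearance = edge_clearance_um(wd, curvature)
    print(name, clearance, within_reaction_zone(clearance))
```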

      The lenses appear to be corrected for monochromatic light; high-performance microscopes are generally achromatic. Is the bandwidth of two-photon excitation sufficient to warrant optimization over multiple wavelengths?

      Thanks for this comment. All optical simulations described in the first submission were performed at a fixed wavelength (λ = 920 nm). Following the Referee’s request, we explored the effect of changing wavelength on the Strehl ratio using new optical simulations. We found that the Strehl ratio remains > 0.8 at least within ± 10 nm from λ = 920 nm (new Supplementary Figure 1A-D, left panels), which covers the limited bandwidth of our femtosecond laser. Moreover, these simulations demonstrate that, on a much wider wavelength range (800 - 1040 nm), high Strehl ratio is obtained, but at different z planes (new Supplementary Figure 1A-D, right panels). This means that the corrective lens is working as expected also for wavelengths which are different from 920 nm, with different wavelengths having the most enlarged FOV located at different working distances. These new results are now described on page 7 (lines 8-10).

      GRIN lenses are often used to access a 3D volume by scanning in z (including in this study). How does the corrective lens affect imaging performance over the 3D field of view?

      The optical simulations we did to design the corrective lenses were performed maximizing aberration correction only in the focal plane of the endoscope. Following the Referee’s comment, we explored the effect of aberration correction outside the focal plane using new optical simulations. In corrected endoscopes, we found that for off-axis rays (radial distance from the optical axis > 40 μm) the Strehl ratio was > 0.8 (Maréchal criterion) in a larger volume compared to uncorrected endoscopes (new Supplementary Figure 2), demonstrating that the aberration correction method developed in this study does extend beyond the focal plane for short distances. For example, at a radial distance of ~ 90 μm from the optical axis, the axial range in which the Strehl ratio was > 0.8 in corrected endoscopes was 28 μm and 19 μm for the 6.4 mm- and the 8.8 mm-long microendoscope, respectively. These new results are now described on page 7 (lines 10-19).

      (4) The in vivo images (Figure 7D) have a less impressive resolution and field than the ex vivo images (Figure 4B), and the reason for this is not clear. Given the difference in performance, how does this compare to an uncorrected endoscope in the same preparation? Is the reduced performance related to uncorrected motion, field curvature, working distance, etc?

      In comparing images in Figure 4B with images shown in Figure 7D, the following points should be considered:

      (1) Figure 4B is a maximum fluorescence intensity projection of multiple axial planes of a z-stack acquired through a thin brain slice (slice thickness: 50 µm) using 8 frame averages for each plane. In contrast, images in Figure 7D are median projection of a t-series acquired on a single plane in the awake mouse at 30 Hz resonant scanning imaging (8 min, 14,400 frames).

      (2) Images of the fixed brain slice in Figure 4B were acquired at 1024 pixels x 1024 pixels resolution, nominal pixel size 0.45 µm/pixel, and with objective NA = 0.50, whereas in vivo images in Figure 7D were acquired at 512 pixels x 512 pixels resolution, nominal pixel size 0.72 - 0.84 µm/pixel, and with objective NA = 0.45.

      (3) In the in vivo preparation (Figure 7D), excitation and emission light travel through > 180 µm of scattering and absorbing brain tissue, reducing spatial resolution and the SNR of the collected fluorescence signal.

      (4) By shifting the sample in the x, y plane, in Figure 4B we could choose a FOV containing homogeneously stained cells. x, y shifting and selecting across multiple FOVs was not possible in vivo, as the GRIN lens was cemented on the animal skull.

      (5) Images in Figure 7D were motion corrected, but we cannot exclude that part of the decrease in resolution observed in Figure 7D when compared to images in Figure 4B is due to incomplete correction of motion artifacts.

      For all the reasons listed above, we believe that lower resolution and contrast are to be expected in images recorded in vivo (Figure 7D) compared to images acquired in fixed tissue (Figure 4B).

      Regarding how images from uncorrected and corrected endoscopes compare in vivo, we think that this comparison is better performed in fixed tissue (Figure 4) or in simulated calcium data (Figures 5-6) rather than in in vivo recordings (Figure 7). In fact, in the brain of living mice, motion artifacts, changes in fluorophore expression level, and variation in the optical properties of the brain (e.g., the presence of a blood vessel over the FOV) may make the comparison of images acquired with uncorrected and corrected microendoscopes difficult, requiring a large number of animals to cancel out the contributions of these factors. Comparing optical properties in fixed tissue is, in contrast, devoid of these confounding factors. Moreover, the major advantage of quantifying how the optical properties of uncorrected and corrected endoscopes affect the ability to extract information about neuronal activity in simulated calcium data is that, under simulated conditions, we can count on a known ground truth as reference (e.g., how many neurons are in the FOV, where they are, and what their electrical activity is). This is clearly not possible in the in vivo recordings.

      Regarding Figure 7, there is no analysis of the biological significance of the calcium signals or even a description of where olfactory stimuli were presented.

      We appreciate the Reviewer pointing out the lack of detailed analysis regarding the biological significance of the calcium signals and the presentation of olfactory stimuli in Figure 7. Our initial focus was on demonstrating the effectiveness of the optimized GRIN lenses for imaging deep brain areas like the piriform cortex, with an emphasis on the improved signal-to-noise ratio (SNR) these lenses provide. However, we agree that including more context about the experimental conditions would enhance the manuscript. To address this point, we added a new panel (Figure 7F) showing calcium transients aligned with the onset of olfactory stimulus presentations, which are now indicated by shaded light blue areas. Additionally, we have specified the timing of each stimulus presented in Figure 7E. This revision allows readers to better understand the relationship between the calcium signals and the olfactory stimuli.

      The timescale of jGCaMP8f signals in Figure 7E is uncharacteristically slow for this indicator (compared to Zhang et al 2023 (Nature)), though perhaps this is related to the physiology of these cells or the stimuli.

      Regarding the timescale of the calcium signals observed in Figure 7E, we apologize for the confusion caused by a mislabeling we inserted in the original manuscript. The experiments presented in Figure 7 were conducted using jGCaMP7f, not jGCaMP8f as previously stated (both indicators were used in this study but in separate experiments). We have corrected this error in the Results section (caption of Figure 7D, E). It is important to note that jGCaMP7f has a longer half-decay time compared to jGCaMP8f, which could in part account for the slower decay kinetics observed in our data. Furthermore, the prolonged calcium signals can be attributed to the physiological properties of neurons in the piriform cortex. Upon olfactory stimulation, these neurons often fire multiple action potentials, resulting in extended calcium transients that can last several seconds. This sustained activity has been documented in previous studies, such as Roland et al. (eLife 2017, Figure 1C therein) in anesthetized animals and Wang et al. (Neuron 2020, Figure 1E therein) in awake animals, which report similar durations for calcium signals.

      (5) The claim of unprecedented spatial resolution across the FOV (page 18) is hard to evaluate and is not supported by references to quantitative comparisons. The promises of the method for future studies (pages 18-19) could also be better supported by analysis or experiment, but these are minor and to me, do not detract from the appeal of the work.

      GRIN lens-based imaging of piriform cortex in the awake mouse had already been done in Wang et al., Neuron 2020. The GRIN lens used in that work was NEM-050-50-00920-S-1.5p (GRINTECH, length: 6.4 mm; diameter: 0.5 mm), similar to the one that we used to design the 6.4 mm-long corrected microendoscope. Here we used a microendoscope specifically designed to correct off-axis aberrations and enlarge the FOV, in order to maximize the number of neurons recorded with the highest possible spatial resolution, while keeping the tissue invasiveness to the minimum. Following the Referee’s comments, we revised the sentence at page 19 (lines 6-8 from bottom) as follows:

      “We used long corrected microendoscopes to measure population dynamics in the olfactory cortex of awake head-restrained mice with unprecedented combination of high spatial resolution across the FOV and minimal invasiveness(17)”.

      (6) The text is lengthy and the material is repeated, especially between the introduction and conclusion. Consolidating introductory material to the introduction would avoid diluting interesting points in the discussion.

      We thank the Reviewer for this comment. As suggested, we edited the Introduction and shortened the Discussion.

      Reviewer #2 (Public review):

      In this manuscript, the authors present an approach to correct GRIN lens aberrations, which primarily cause a decrease in signal-to-noise ratio (SNR), particularly in the lateral regions of the field-of-view (FOV), thereby limiting the usable FOV. The authors propose to mitigate these aberrations by designing and fabricating aspherical corrective lenses using ray trace simulations and two-photon lithography, respectively; the corrective lenses are then mounted on the back aperture of the GRIN lens.

      This approach was previously demonstrated by the same lab for GRIN lenses shorter than 4.1 mm (Antonini et al., eLife, 2020). In the current work, the authors extend their method to a new class of GRIN lenses with lengths exceeding 6 mm, enabling access to deeper brain regions, such as the most ventral regions of the mouse brain. Specifically, they designed and characterized corrective lenses for GRIN lenses measuring 6.4 mm and 8.8 mm in length. Finally, they applied these corrected long micro-endoscopes to perform high-precision calcium signal recordings in the olfactory cortex.

      Compared with alternative approaches using adaptive optics, the main strength of this method is that it does not require hardware or software modifications, nor does it limit the system's temporal resolution. The manuscript is well-written, the data are clearly presented, and the experiments convincingly demonstrate the advantages of the corrective lenses.

      The implementation of these long corrected micro-endoscopes, demonstrated here for deep imaging in the mouse olfactory bulb, will also enable deep imaging in larger mammals such as rats or marmosets.

      We thank the Referee for the positive comments on our study. We address the points indicated by the Referee in the “Recommendation to the authors” section below.

      Reviewer #3 (Public review):

      Summary:

      This work presents the development, characterization, and use of new thin microendoscopes (500µm diameter) whose accessible field of view has been extended by the addition of a corrective optical element glued to the entrance face. Two micro endoscopes of different lengths (6.4mm and 8.8mm) have been developed, allowing imaging of neuronal activity in brain regions >4mm deep. An alternative solution to increase the field of view could be to add an adaptive optics loop to the microscope to correct the aberrations of the GRIN lens. The solution presented in this paper does not require any modification of the optical microscope and can therefore be easily accessible to any neuroscience laboratory performing optical imaging of neuronal activity.

      Strengths:

      (1) The paper is generally clear and well-written. The scientific approach is well structured and numerous experiments and simulations are presented to evaluate the performance of corrected microendoscopes. In particular, we can highlight several consistent and convincing pieces of evidence for the improved performance of corrected micro endoscopes:

      a) PSFs measured with corrected micro endoscopes 75µm from the centre of the FOV show a significant reduction in optical aberrations compared to PSFs measured with uncorrected micro endoscopes.

      b) Morphological imaging of fixed brain slices shows that optical resolution is maintained over a larger field of view with corrected micro endoscopes compared to uncorrected ones, allowing neuronal processes to be revealed even close to the edge of the FOV.

      c) Using synthetic calcium data, the authors showed that the signals obtained with the corrected microendoscopes have a significantly stronger correlation with the ground truth signals than those obtained with uncorrected microendoscopes.

      (2) There is a strong need for high-quality micro endoscopes to image deep brain regions in vivo. The solution proposed by the authors is simple, efficient, and potentially easy to disseminate within the neuroscience community.

      Weaknesses:

      (1) Many points need to be clarified/discussed. Here are a few examples:

      a) It is written in the methods: “The uncorrected microendoscopes were assembled either using different optical elements compared to the corrected ones or were obtained from the corrected probes after the mechanical removal of the corrective lens.”

      This is not very clear: the uncorrected microendoscopes are not simply the unmodified GRIN lenses?

      We apologize for not being clear enough on this point. Uncorrected microendoscopes are not simply unmodified GRIN lenses; rather, they are GRIN lenses attached to a round glass coverslip (thickness: 100 μm). The glass coverslip was included in ray-trace optical simulations of the uncorrected system, and this is the reason why commercial GRIN lenses and corresponding uncorrected microendoscopes have different working distances, as reported in Tables 2-3. To make the text clearer, we added the following sentence at page 27 (last 4 lines):

      “To evaluate the impact of corrective microlenses on the optical performance of GRIN-based microendoscopes, we also simulated uncorrected microendoscopes composed of the same optical elements of corrected probes (glass coverslip and GRIN rod), but in the absence of the corrective microlens”.

      b) In the results of the simulation of neuronal activity (Figure 5A, for example), the neurons in the center of the FOV have a very large diameter (of about 30µm). This should be discussed.

      Thanks for this comment. In synthetic calcium imaging t-series, cell radii were randomly sampled from a Gaussian distribution with mean = 10 µm and standard deviation (SD) = 3 µm. Both values were estimated from the literature (ref. no. 28: Suzuki & Bekkers, Journal of Neuroscience, 2011) as described in the Methods (page 35). In the image shown in Figure 5A, neurons near the center of the FOV have a radius of ~ 20 µm, corresponding to the right tail of the distribution (mean + 3SD = 19 µm). It is also important to note that, for corrected microendoscopes, neurons in the central portion of the FOV appear larger than cells located near the edges of the FOV, because the magnification depends on the distance from the optical axis (see Figure 3E, F) and near the center the magnification is > 1 for both microendoscope types.
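      As an illustration of the sampling described above, the sketch below draws cell radii from a Gaussian distribution with the stated mean and SD and estimates how often a radius beyond mean + 3SD is expected. It is only a sketch: the number of samples, the random seed, and the absence of any truncation of the distribution are assumptions, and the function name is hypothetical.

```python
import numpy as np

MEAN_RADIUS_UM = 10.0  # mean radius stated in the response
SD_RADIUS_UM = 3.0     # standard deviation stated in the response

def sample_cell_radii(n_cells: int, seed: int = 0) -> np.ndarray:
    """Draw radii from an untruncated Gaussian (truncation, if any, is not modeled)."""
    rng = np.random.default_rng(seed)
    return rng.normal(MEAN_RADIUS_UM, SD_RADIUS_UM, size=n_cells)

radii = sample_cell_radii(100_000)             # sample size is an arbitrary choice
upper_tail_um = MEAN_RADIUS_UM + 3 * SD_RADIUS_UM  # 19 µm, as quoted in the text
fraction_above = float(np.mean(radii > upper_tail_um))
print(f"Fraction of radii above {upper_tail_um:.0f} µm: {fraction_above:.4f}")
```

      Radii this large are rare under the assumed distribution, which is consistent with only a few such neurons appearing in a given simulated FOV.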

      Also, why is the optical resolution so low on these images?

      Images shown in Figure 5 are median fluorescence intensity projections of 5-minute-long simulated t-series. Simulated calcium data were generated with pixel size 0.8 μm/pixel and frame rate 30 Hz, similarly to in vivo recordings. In the simulations, pixels not belonging to any cell soma were assigned a value of background fluorescence randomly sampled from a normal distribution with mean and standard deviation estimated from experimental data, as described in the Methods section (page 37). To simulate activity, the mean spiking rate of neurons was set to 0.3 Hz, thus in a large fraction of frames neurons do not show calcium transients. Therefore, the median fluorescence intensity value of somata will be close to their baseline fluorescence value (F0). Since in simulations F0 values (~ 45-80 a.u.) were not much higher than the background fluorescence level (~ 45 a.u.), this may generate the appearance of a low-contrast image in Figure 5A. Finally, we suspect that PDF rendering also contributed to degrading the quality of those images. We will now submit high-resolution images alongside the PDF file.
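      The argument that a median projection stays close to baseline for sparsely active cells can be sketched numerically. The example below is not the simulation pipeline of the study: the frame rate, duration, and mean spiking rate are taken from the text, whereas the transient amplitude, the decay time constant, the noiseless trace, and the Bernoulli spiking model are assumptions made for illustration.

```python
import numpy as np

FRAME_RATE_HZ = 30.0   # from the response text
DURATION_S = 5 * 60    # 5-minute t-series, from the response text
MEAN_RATE_HZ = 0.3     # mean spiking rate, from the response text
F0 = 60.0              # baseline chosen within the ~45-80 a.u. range quoted above
AMPLITUDE = 60.0       # assumed transient amplitude (hypothetical)
TAU_S = 0.5            # assumed decay time constant (hypothetical)

def simulate_trace(seed: int = 0) -> np.ndarray:
    """Noiseless fluorescence trace: baseline plus exponentially decaying transients."""
    rng = np.random.default_rng(seed)
    n_frames = int(FRAME_RATE_HZ * DURATION_S)
    # Bernoulli approximation of Poisson spiking: p = rate / frame rate per frame.
    spikes = rng.random(n_frames) < MEAN_RATE_HZ / FRAME_RATE_HZ
    # Exponential kernel, truncated after five time constants.
    t = np.arange(0.0, 5 * TAU_S, 1.0 / FRAME_RATE_HZ)
    kernel = np.exp(-t / TAU_S)
    transient = np.convolve(spikes.astype(float), kernel)[:n_frames]
    return F0 + AMPLITUDE * transient

trace = simulate_trace()
median_value = float(np.median(trace))  # value that a median projection would show
peak_value = float(trace.max())         # value reached during transients
print(f"baseline={F0:.1f}, median={median_value:.1f}, peak={peak_value:.1f}")
```

      With these assumed parameters the median remains within a few percent of F0 even though the peaks are much higher, so a median projection mostly reflects baseline fluorescence.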

      c) It seems that we can't see the same neurons on the left and right panels of Figure 5D. This should be discussed.

      The Referee is correct. When we intersected the simulated 3D volume of ground truth neurons with the focal surface of microendoscopes, the center of the FOV for the 8.8 mm-long corrected microendoscope was located at a larger depth than the FOV of the 8.8 mm-long uncorrected microendoscope. This effect was due to the larger field curvature of corrected 8.8 mm-long endoscopes compared to 8.8 mm-long uncorrected endoscopes. This is the reason why different neurons were displayed for uncorrected and corrected endoscopes in Figure 5D. We added this explanation in the text at page 37 (lines 1-4). The text reads:

      “Due to the stronger field curvature of the 8.8 mm-long corrected microendoscope (Figure 1C) compared to 8.8 mm-long uncorrected microendoscopes, the center of the corrected imaging focal surface resulted at a larger depth in the simulated volume compared to the center of the uncorrected focal surface(s). Therefore, different simulated neurons were sampled in the two cases”.

      d) It is not very clear to me why in Figure 6A, F the fraction of adjacent cell pairs that are more correlated than expected increases as a function of the threshold on peak SNR. The authors showed in Supplementary Figure 3B that the mean purity index increases as a function of the threshold on peak SNR for all micro endoscopes. Therefore, I would have expected the correlation between adjacent cells to decrease as a function of the threshold on peak SNR. Similarly, the mean purity index for the corrected short microendoscope is close to 1 for high thresholds on peak SNR: therefore, I would have expected the fraction of adjacent cell pairs that are more correlated than expected to be close to 0 under these conditions. It would be interesting to clarify these points.

      Thanks for raising this point. We defined the fraction of adjacent cell pairs more correlated than expected as the number of adjacent cell pairs more correlated than expected divided by the number of adjacent cell pairs. The reason why this fraction rises as a function of the SNR threshold is shown in Supplementary Figure 2 in the first submission (now Supplementary Figure 5). There, we separately plotted the number of adjacent cell pairs more correlated than expected (numerator) and the number of adjacent cell pairs (denominator) as a function of the SNR threshold. For both microendoscope types, we observed that the denominator decreased more rapidly with the peak SNR threshold than the numerator. Therefore, the fraction of adjacent cell pairs more correlated than expected increases with the peak SNR threshold.

      To understand why the denominator decreases with SNR threshold, it should be considered that, due to the deterioration of spatial resolution and attenuation of fluorescent signal collection as a function of the radial distance from the optical axis (see for example fluorescent film profiles in Figure 3A, C), increasing the threshold on the peak SNR of extracted calcium traces implies limiting cell detection to those cells located within a smaller distance from the center of the FOV. This information is shown in Figure 5C, F.

      In the manuscript text, this point is discussed at page 12 (lines 1-3 from bottom) and page 13 (lines 1-4):

      “The fraction of pairs of adjacent cells (out of the total number of adjacent pairs) whose activity correlated significantly more than expected increased as a function of the SNR threshold for corrected and uncorrected microendoscopes of both lengths (Fig. 6A, F). This effect was due to a larger decrease of the total number of pairs of adjacent cells as a function of the SNR threshold compared to the decrease in the number of pairs of adjacent cells whose activity was more correlated than expected (Supplementary Figure 5)”.
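      The arithmetic behind this explanation can be illustrated with a toy example. The counts below are hypothetical and are not data from the study; they only show that a ratio increases when its denominator shrinks faster than its numerator.

```python
def correlated_fraction(n_correlated_pairs: int, n_adjacent_pairs: int) -> float:
    """Fraction of adjacent pairs that are more correlated than expected."""
    if n_adjacent_pairs <= 0:
        raise ValueError("at least one adjacent pair is required")
    return n_correlated_pairs / n_adjacent_pairs

# Hypothetical counts at increasing peak-SNR thresholds (illustration only).
snr_thresholds = [10, 15, 20]
correlated_pairs = [40, 30, 24]    # numerator: decreases slowly
adjacent_pairs = [400, 200, 120]   # denominator: decreases quickly

fractions = [
    correlated_fraction(num, den)
    for num, den in zip(correlated_pairs, adjacent_pairs)
]
for thr, frac in zip(snr_thresholds, fractions):
    print(f"SNR threshold {thr}: fraction = {frac:.2f}")
```

      Both counts fall as the threshold increases, yet the fraction grows because the total number of adjacent pairs drops faster.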

      e) Figures 6C, H: I think it would be fairer to compare the uncorrected and corrected endomicroscopes using the same effective FOV.

      To address the Reviewer’s concern, we repeated the linear regression of purity index as a function of the radial distance using the same range of radial distances for the uncorrected and corrected case of both microendoscope types. Below, we provide an updated version of Figure 6C, H for the referee’s perusal. Please note that the maximum value displayed on the x-axis of both graphs now corresponds to the smaller of the two maximum radial distance values obtained in the uncorrected and corrected case (maximum radial distance displayed: 151.6 µm and 142.1 μm for the 6.4 mm- and the 8.8 mm-long GRIN rod, respectively). Using the same effective FOV, we found that the purity index drops significantly more rapidly with the radial distance for uncorrected microendoscopes compared to the corrected ones, similarly to what was observed in the original version of Figure 6. The values of the linear regression parameters and the statistical significance of the difference between the slopes in the uncorrected and corrected cases are stated in the Author response image 3 caption below for both microendoscope types. In the manuscript, we suggest continuing to show data corresponding to all detected cells, as we did in the original submission.

      Author response image 3.

      Linear regression of purity index as a function of the radial distance. A) Purity index of extracted traces with peak SNR > 10 was estimated using a GLM of ground truth source contributions and plotted as a function of the radial distance of cell identities from the center of the FOV for n = 13 simulated experiments with the 6.4 mm-long uncorrected (red) and corrected (blue) microendoscope. Black lines represent the linear regression of data ± 95% confidence intervals (shaded colored areas). Maximum value of radial distance displayed: 151.6 µm. Slopes ± standard error (s.e.): uncorrected, (-0.0015 ± 0.0002) µm<sup>-1</sup>; corrected, (-0.0006 ± 0.0001) µm<sup>-1</sup>. Uncorrected, n = 991; corrected, n = 1156. Statistical comparison of slopes, p < 10<sup>-10</sup>, permutation test. B) Same as (A) for n = 15 simulated experiments with the 8.8 mm-long uncorrected and corrected microendoscope. Maximum value of radial distance displayed: 142.1 µm. Slopes ± s.e.: uncorrected, (-0.0014 ± 0.0003) µm<sup>-1</sup>; corrected, (-0.0010 ± 0.0002) µm<sup>-1</sup>. Uncorrected, n = 718; corrected, n = 1328. Statistical comparison of slopes, p = 0.0082, permutation test.
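      A slope comparison of this kind can be sketched as follows. The response does not describe the permutation scheme, so the one below (shuffling group labels across the pooled data points, which assumes the two groups share the same regression line under the null hypothesis) is an assumption, and the data are synthetic. The generating slopes are borrowed from the caption purely as illustrative values; sample sizes, noise level, and function names are hypothetical.

```python
import numpy as np

def slope(x: np.ndarray, y: np.ndarray) -> float:
    """Slope of the least-squares line y = a*x + b."""
    return float(np.polyfit(x, y, 1)[0])

def permutation_test_slopes(x_a, y_a, x_b, y_b, n_perm=1000, seed=0):
    """Two-sided permutation test on the difference between two regression slopes."""
    rng = np.random.default_rng(seed)
    observed = slope(x_a, y_a) - slope(x_b, y_b)
    x = np.concatenate([x_a, x_b])
    y = np.concatenate([y_a, y_b])
    n_a = len(x_a)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(x))  # shuffle group membership of the pooled points
        a, b = idx[:n_a], idx[n_a:]
        diff = slope(x[a], y[a]) - slope(x[b], y[b])
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (count + 1) / (n_perm + 1)

# Synthetic purity-index data (illustration only, not the study's data).
rng = np.random.default_rng(1)
r_unc = rng.uniform(0.0, 150.0, 300)  # radial distance in µm
r_cor = rng.uniform(0.0, 150.0, 300)
purity_unc = 1.0 - 0.0015 * r_unc + rng.normal(0.0, 0.05, 300)
purity_cor = 1.0 - 0.0006 * r_cor + rng.normal(0.0, 0.05, 300)

slope_difference, p_value = permutation_test_slopes(r_unc, purity_unc, r_cor, purity_cor)
print(f"slope difference = {slope_difference:.5f} per µm, p = {p_value:.4f}")
```

      With `n_perm` permutations the smallest attainable p-value is 1 / (`n_perm` + 1), so reporting very small p-values requires a correspondingly large number of permutations.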

      f) Figure 7E: Many calcium transients have a strange shape, with a very fast decay following a plateau or a slower decay. Is this the result of motion artefacts or analysis artefacts?

      Thank you for raising this point about the unusual shapes of the calcium transients in Figure 7E. The observed rapid decay following a plateau or a slower decay is indeed a result of how the data were presented in the original submission. Our experimental protocol consisted of 22 s-long trials with an inter-trial interval of 10 s (see Methods section, page 44). In the original figure, data from multiple trials were concatenated, which led to artefactual time courses and apparent discontinuities in the calcium signals. To resolve this issue, we revised Figure 7E to accurately represent individual concatenated trials. We also added a new panel (please see new Figure 7F) showing examples of single cell calcium responses in individual trials without concatenation, with annotations indicating the timing and identity of presented olfactory stimuli.

      Also, the duration of many calcium transients seems to be long (several seconds) for GCaMP8f. These points should be discussed.

      Author response: regarding the timescale of the calcium signals observed in Figure 7E, we apologize for the confusion caused by a mislabeling we inserted in the manuscript. The experiments presented in Figure 7 were conducted using jGCaMP7f, not jGCaMP8f as previously stated (both indicators were used in this study, but in separate experiments). We have corrected this error in the Results section (caption of Figure 7D, E). It is important to note that jGCaMP7f has a longer half-decay time compared to jGCaMP8f, which could in part account for the slower decay kinetics observed in our data. Furthermore, the prolonged calcium signals can be attributed to the physiological properties of neurons in the piriform cortex. Upon olfactory stimulation, these neurons often fire multiple action potentials, resulting in extended calcium transients that can last several seconds. This sustained activity has been documented in previous studies, such as Roland et al. (eLife 2017, Figure 1C therein) in anesthetized animals and Wang et al. (Neuron 2020, Figure 1E therein) in awake animals, which report similar durations for calcium signals. We cite these references in the text. We believe that these revisions and clarifications address the Reviewer's concern and enhance the overall clarity of our manuscript.

      g) The authors do not mention the influence of the neuropil on their data. Did they subtract the neuropil's contribution to the signals from the somata? It is known from the literature that the presence of the neuropil creates artificial correlations between neurons, which decrease with the distance between the neurons (Grødem, S., Nymoen, I., Vatne, G.H. et al. An updated suite of viral vectors for in vivo calcium imaging using intracerebral and retro-orbital injections in male mice. Nat Commun 14, 608 (2023). https://doi.org/10.1038/s41467-023-36324-3; Keemink SW, Lowe SC, Pakan JMP, Dylda E, van Rossum MCW, Rochefort NL. FISSA: A neuropil decontamination toolbox for calcium imaging signals. Sci Rep. 2018 Feb 22;8(1):3493. doi: 10.1038/s41598-018-21640-2. PMID: 29472547; PMCID: PMC5823956)

      This point should be addressed.

      We apologize for not being clear enough in our previous version of the manuscript. The neuropil was subtracted from calcium traces both in simulated and experimental data. Please note that instead of using the term “neuropil”, we used the word “background”. We decided to use the more general term “background” because it also applies to the case of synthetic calcium t-series, where neurons were modeled as spheres devoid of processes. The background subtraction is described in the Methods on page 39:

      “F(t) was computed frame-by-frame as the difference between the average signal of pixels in each ROI and the background signal. The background was calculated as the average signal of pixels that: i) did not belong to any bounding box; ii) had intensity values higher than the mean noise value measured in pixels located at the corners of the rectangular image, which do not belong to the circular FOV of the microendoscope; iii) had intensity values lower than the maximum value of pixels within the boxes”.
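      The three background criteria quoted above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the analysis code of the study: the size of the corner regions used to estimate noise, the boolean mask representation of ROIs and bounding boxes, and all function names are hypothetical.

```python
import numpy as np

def background_value(frame: np.ndarray, box_mask: np.ndarray, corner_size: int = 8) -> float:
    """Mean of pixels that satisfy criteria (i)-(iii) from the quoted Methods text."""
    c = corner_size  # assumed size of the corner patches lying outside the circular FOV
    corners = np.concatenate([
        frame[:c, :c].ravel(), frame[:c, -c:].ravel(),
        frame[-c:, :c].ravel(), frame[-c:, -c:].ravel(),
    ])
    noise_mean = corners.mean()        # mean noise value measured in the image corners
    box_max = frame[box_mask].max()    # maximum value of pixels within the boxes
    candidates = (~box_mask) & (frame > noise_mean) & (frame < box_max)
    if not candidates.any():
        raise ValueError("no pixel satisfies the background criteria")
    return float(frame[candidates].mean())

def roi_trace(movie: np.ndarray, roi_mask: np.ndarray, box_mask: np.ndarray) -> np.ndarray:
    """F(t): mean ROI signal minus the background signal, computed frame by frame."""
    return np.array([f[roi_mask].mean() - background_value(f, box_mask) for f in movie])

# Synthetic frame: circular FOV at 45 a.u., corners at 5 a.u., one bright ROI at 100 a.u.
yy, xx = np.mgrid[:64, :64]
fov = (yy - 32) ** 2 + (xx - 32) ** 2 <= 30 ** 2
frame = np.where(fov, 45.0, 5.0)
frame[30:34, 30:34] = 100.0
roi_mask = np.zeros((64, 64), dtype=bool)
roi_mask[30:34, 30:34] = True
box_mask = np.zeros((64, 64), dtype=bool)
box_mask[28:36, 28:36] = True          # bounding box around the ROI

movie = np.stack([frame, frame, frame])
trace = roi_trace(movie, roi_mask, box_mask)
print(trace)
```

      In this synthetic case the background equals the FOV fluorescence outside the bounding box, so the extracted signal is the ROI mean minus that value in every frame.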

      h) Also, what are the expected correlations between neurons in the pyriform cortex? Are there measurements in the literature with which the authors could compare their data?

      We appreciate the reviewer's interest in the correlations between neurons in the piriform cortex. The overall low correlations between piriform neurons we observed (Figure 8) are consistent with a published study describing ‘near-zero noise correlations during odor inhalation’ in the anterior piriform cortex of rats, based on extracellular recordings (Miura et al., Neuron 2013). However, to the best of our knowledge, measurements directly comparable to ours have not been described in the literature. Recent analyses of the correlations between piriform neurons were restricted to odor exposure windows, with the goal of quantifying odor-specific activation patterns (e.g. Roland et al., eLife 2017; Bolding et al., eLife 2017; Pashkovski et al., Nature 2020; Wang et al., Neuron 2020). Here, we used correlation analyses to characterize the technical advancement of the optimized GRIN lens-based endoscopes. We showed that correlations of pairs of adjacent neurons were independent of radial distance (Figure 8B), highlighting homogeneous spatial resolution in the field of view.

      (2) The way the data is presented doesn't always make it easy to compare the performance of corrected and uncorrected lenses. Here are two examples:

      a) In Figures 4 to 6, it would be easier to compare the FOVs of corrected and uncorrected lenses if the scale bars (at the centre of the FOV) were identical. In this way, the neurons at the centre of the FOV would appear the same size in the two images, and the distances between the neurons at the centre of the FOV would appear similar. Here, the scale bar is significantly larger for the corrected lenses, which may give the illusion of a larger effective FOV.

      We appreciate the Referee’s comment. Below, we explain why we believe that the way we currently present imaging data in the manuscript is preferable:

      (1) current figures show images of the acquired FOV as they are recorded from the microscope (raw data), without rescaling. In this way, we exactly show what potential users will obtain when using a corrected microendoscope.

      (2) In the current version of the figures, the fact that the pixel size is not homogeneous across the FOV, nor equal between uncorrected and corrected microendoscopes, is initially shown in Figure 3E, F and then explicitly stated throughout the manuscript when images acquired with a corrected microendoscope are shown.

      (3) Rescaling images acquired with the corrected endoscopes gives the impression that the acquisition parameters were different between acquisitions with the corrected and uncorrected microendoscopes, which was not the case.

      Importantly, the larger FOV of the corrected microendoscope, which is one of the important technological achievements presented in this study, can be appreciated in the images regardless of the presentation format.

      b) In Figures 3A-D it would be more informative to plot the distances in microns rather than pixels. This would also allow a better comparison of the micro endoscopes (as the pixel sizes seem to be different for the corrected and uncorrected micro endoscopes).

      The Referee is correct that the pixel size is different between the corrected and uncorrected probes. This is because of the different magnification factor introduced by the corrective microlens, as described in Figure 3E, F. The rationale for showing images in Figure 3A-D in pixels rather than microns is the following:

      (1) Optical simulations in Figure 1 suggest that a corrective optical element is effective in compensating for some of the optical aberrations in GRIN microendoscopes.

      (2) After fabricating the corrective optical element (Figure 2), in Figure 3A-D we conduct a preliminary analysis of the effect of the corrective optical element on the optical properties of the GRIN lens. We observed that the microfabricated optical element corrected for some aberrations (e.g., astigmatism), but also that the microfabricated optical element was characterized by significant field curvature. This can be appreciated showing distances in pixels.

      (3) The observed field curvature and the aspherical profile of the corrected lens prompted us to characterize the magnification factor of the corrected endoscopes as a function of the radial distance. We found that the magnification factor changed as a function of the radial distance (Figure 3E-F) and that pixel size was different between uncorrected and corrected endoscopes. We also observed that, in corrected endoscopes, pixel size was a function of the radial distance (Figure 3E-F).

      (4) Once all of the above was established and quantified, we assigned precise pixel size to images of uncorrected and corrected endoscopes and we show all following images of the study (Figure 3G on) using a micron (rather than pixel) scale.

      (3) There seems to be a discrepancy between the performance of the long lenses (8.8 mm) in the different experiments, which should be discussed in the article. For example, the results in Figure 4 show a considerable enlargement of the FOV, whereas the results in Figure 6 show a very moderate enlargement of the distance at which the Pearson's correlation with the first ground truth emitter starts to drop.

      Thanks for raising this point and helping us clarify data presentation. Images in Figure 4B are average z-projections of z-stacks acquired through a fixed mouse brain slice, and they were taken with the purpose of showing all the neurons that could be visualized from the same sample using an uncorrected and a corrected microendoscope. In Figure 4B, all illuminated neurons are visible regardless of whether they were imaged with high axial resolution (e.g., < 10 µm as defined in Figure 3J) or poor axial resolution. In contrast, in Figure 6J we evaluated the correlation between the calcium trace extracted from a given ROI and the real activity trace of the first simulated ground truth emitter for that specific ROI. The moderate increase in the correlation for the corrected microendoscope compared to the uncorrected microendoscope (Figure 6J) is consistent with the moderate improvement in the axial resolution of the corrected probe compared to the uncorrected probe at intermediate radial distances (60-100 µm from the optical axis, see Figure 3J). We added a paragraph in the Results section (page 14, lines 8-18) to summarize the points described above.

      a) There is also a significant discrepancy between measured and simulated optical performance, which is not discussed. Optical simulations (Figure 1) show that the useful FOV (defined as the radius for which the size of the PSF along the optical axis remains below 10µm) should be at least 90µm for the corrected microendoscopes of both lengths. However, for the long microendoscopes, Figure 3J shows that the axial resolution at 90µm is 17µm. It would be interesting to discuss the origin of this discrepancy: does it depend on the microendoscope used?

      As the Reviewer correctly pointed out, the size of simulated PSFs at a given radial distance (e.g., 90 µm) tends to be generally smaller than that of the experimentally measured PSFs. This might be due to multiple reasons:

      (1) simulated PSFs are excitation PSFs, i.e. they describe the intensity spatial distribution of focused excitation light. On the contrary, measured PSFs result from the excitation and emission process, thus they are also affected by aberrations of light emitted by fluorescent beads and collected by the microscope.

      (2) in the optical simulations, the Zemax file of the GRIN lenses contained first-order aberrations. High-order aberrations were therefore not included in simulated PSFs.

      (3) intrinsic variability of experimental measurements (e.g., intrinsic variability of the fabrication process, alignment of the microendoscope to the optical axis of the microscope, the distance between the GRIN back end and the objective…) is not considered in the simulations.

      We added a paragraph in the Discussion section (page 17, lines 9-18) summarizing the abovementioned points.

      Are there inaccuracies in the construction of the aspheric corrective lens or in the assembly with the GRIN lens? If there is variability between different lenses, how are the lenses selected for imaging experiments?

      The fabrication yield, i.e. the yield of generating the corrective lenses, using molding was ~ 90% (N > 30 molded lenses). The main limitation of this procedure was the formation of air bubbles between the mold negative and the glass coverslip. Molded lenses were visually inspected with the stereoscope and, in case of air bubble formation, they were discarded.

      The assembly yield, i.e. the yield of correct positioning of the GRIN lens with respect to the coverslip, was 100 % (N = 27 endoscopes).

      We added this information in the Methods at page 29 (lines 1-12), as follows:

      “After UV curing, the microlens was visually inspected at the stereomicroscope. In case of formation of air bubbles, the microlens was discarded (yield of the molding procedure: ~ 90 %, N > 30 molded lenses). The coverslip with the attached corrective lens was sealed to a customized metal or plastic support ring of appropriate diameter (Fig. 2C). The support ring, the coverslip and the aspherical lens formed the upper part of the corrected microendoscope, to be subsequently coupled to the proper GRIN rod (Table 2) using a custom-built opto-mechanical stage and NOA63 (Fig. 2C) 7. The GRIN rod was positioned perpendicularly to the glass coverslip, on the other side of the coverslip compared to the corrective lens, and aligned to the aspherical lens perimeter (Fig. 2C) under the guidance of a wide field microscope equipped with a camera. The yield of the assembly procedure for the probes used in this work was 100 % (N = 27 endoscopes). For further details on the assembly of corrected microendoscope see(7)”.

      Reviewer #1 (Recommendations for the authors):

      (1) Page 4, what is meant by "ad-hoc" in describing software control?

      With “ad-hoc” we meant “specifically designed”. We revised the text to make this clear.

      (2) It was hard to tell how the PSF was modeled for the simulations (especially on page 34, describing the two spherical shells of the astigmatic PSF and ellipsoids modeled along them). Images or especially videos that show the modeling would make this easier to follow.

      Simulated calcium t-series were generated following previous work by our group (Antonini et al., eLife 2020), as stated in the Methods on page 37 (line 5). In Figure 4A of Antonini et al. eLife 2020, we provided a schematic to visually describe the procedure of simulated data generation. In the present paper, we decided not to include a similar drawing and cite the eLife 2020 article to avoid redundancy.

      (3) Some math symbols are missing from the methods in my version of the text (page 36/37).

      We apologize for the inconvenience. This issue arose in the PDF conversion of our Word document and we did not spot it at the time of submission. We will now make sure the PDF version of our manuscript correctly reports symbols and equations.

      (4) The Z extent of stacks (i.e. number of steps) used to generate images in Figure 4 is missing.

      We thank the Reviewer for the comment and we have now revised the caption of Figure 4 and the Methods section as follows:

      “Figure 4. Aberration correction in long GRIN lens-based microendoscopes enables high-resolution imaging of biological structures over enlarged FOVs. A) jGCaMP7f-stained neurons in a fixed mouse brain slice were imaged using 2PLSM (λexc = 920 nm) through an uncorrected (left) and a corrected (right) microendoscope based on the 6.4 mm-long GRIN rod. Images are maximum fluorescence intensity (F) projections of a z-stack acquired with a 5 μm step size. Number of steps: 32 and 29 for uncorrected and corrected microendoscope, respectively. Scale bars: 50 μm. Left: the scale applies to the entire FOV. Right, the scale bar refers only to the center of the FOV; off-axis scale bar at any radial distance (x and y axes) is locally determined multiplying the length of the drawn scale bar on-axis by the corresponding normalized magnification factor shown in the horizontal color-coded bar placed below the image (see also Fig. 3, Supplementary Table 3, and Materials and Methods for more details). B) Same results for the microendoscope based on the 8.8 mm-long GRIN rod. Number of steps: 23 and 31 for uncorrected and corrected microendoscope, respectively”.

      We also modified the text in the Methods (page 35, lines 1-2):

      “(1024 pixels x 1024 pixels resolution; nominal pixel size: 0.45 µm/pixel; axial step: 5 µm; number of axial steps: 23-32; frame averaging = 8)”.

      (5) Overall, the text is wordy and a bit repetitive and could be cut down significantly in length without loss of clarity. This is true throughout, but especially when comparing the introduction and discussion.

      We edited the text (Discussion and Introduction), as suggested by the Reviewer.

      (6) Although I don't think it's necessary, I would advise including comparison data with an uncorrected endoscope in the same in vivo preparation.

      We thank the Referee for the suggestion. Below, we list the reasons why we decided not to perform the comparison between the uncorrected and corrected endoscopes in the in vivo preparation:

      (1) We believe that the comparison between uncorrected and corrected endoscopes is better performed in fixed tissue (Figure 4) or in simulated calcium data (Figure 5-6) than in in vivo recordings (Figure 7). In the brain of living mice, motion artifacts, changes in fluorophore expression level, and variation in the optical properties of the brain (e.g., the presence of a blood vessel over the FOV) may make the comparison of images acquired with uncorrected and corrected microendoscopes difficult, requiring a large number of animals to cancel out the contributions of all these factors. Comparing optical properties in fixed tissue is, in contrast, devoid of these confounding factors.

      (2) A major advantage of quantifying how the optical properties of uncorrected and corrected endoscope impact on the ability to extract information about neuronal activity in simulated calcium data is that, under simulated conditions, we can count on a known ground truth as reference (e.g., how many neurons are in the FOV, where they are, and which is their electrical activity). This is clearly not possible under in vivo conditions.

      (3) The proposed experiment requires imaging the awake mouse with a corrected microendoscope, then anesthetizing the animal to carefully remove the corrective microlens using forceps, and finally repeating the optical recordings in the awake mouse with the uncorrected microendoscope. Although this is feasible (we performed the proposed experiment in Antonini et al. eLife 2020 using a 4.1 mm-long microendoscope), the success rate of these experiments is low. This is because the mechanical force applied on top of the microendoscope to remove the corrective microlens may move the GRIN lens inside the brain, in both vertical and horizontal directions. This can randomly result in a change of the focal plane, death or damage of the cells, tissue inflammation, and bleeding. From our own experience, the number of animals needed for this experiment is expected to be high.

      Reviewer #2 (Recommendations for the authors):

      Below, I provide a few minor corrections and suggestions for the authors to consider before final submission.

      (1) Page 5: when referring to Table 1 maybe add "Table 1 and Methods".

      Following the Reviewer’s comment, we revised the text at page 6 (lines 4-5 from bottom) as follows:

      “(see Supplementary Table 1 and Materials and Methods for details on simulation parameters)”.

      (2) Page 8: "We set a threshold of 10 µm on the axial resolution to define the radius of the effective FOV (corresponding to the black triangles in Fig. 3I, J) in uncorrected and corrected microendoscopes. We observed an enlargement of the effective FOV area of 4.7 times and 2.3 times for the 6.4 mm-long micro endoscope and the 8.8 mm-long micro endoscope, respectively (Table 1). These findings were in agreement with the results of the ray-trace simulations (Figure 1) and the measurement of the subresolved fluorescence layers (Figure 3A-D)." I could not find the information given in this paragraph, specifically:

      a) Upon examining the black triangles in Figure 3I and J, the enlargement of the effective FOV does not appear to be 4.7 and 2.3 times.

      In Figure 3I, J, black triangles mark the intersections between the curves fitting the data and the threshold of 10 µm on the axial resolution. The values on the x-axis corresponding to the intersections (Table 1, “Effective FOV radius”) represent the estimated radius of the effective FOV of the probes, i.e. the radius within which the microendoscope has spatial resolution below the threshold of 10 μm. The ratios of the effective FOV radii are 2.17 and 1.53 for the 6.4 mm- and the 8.8 mm-long microendoscope, respectively, which correspond to 4.7 and 2.3 times larger FOV (Table 1). To make this point clearer, we modified the indicated sentence as follows (page 10, lines 3-11 from bottom):

      “We set a threshold of 10 µm on the axial resolution to define the radius of the effective FOV (corresponding to the black triangles in Fig. 3I, J) in uncorrected and corrected microendoscopes. We observed a relative increase of the effective FOV radius of 2.17 and 1.53 for the 6.4 mm- and the 8.8 mm-long microendoscope, respectively (Table 1). This corresponded to an enlargement of the effective FOV area of 4.7 times and 2.3 times for the 6.4 mm-long microendoscope and the 8.8 mm-long microendoscope, respectively (Table 1). These findings were in agreement with the results of the ray-trace simulations (Figure 1) and the measurement of the subresolved fluorescence layers (Figure 3A-D)."
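
      The link between the radius ratios and the area enlargements quoted above is simple arithmetic: for a circular FOV, the area scales with the square of the radius. A minimal sketch of that check follows; the function and variable names are illustrative and only the four numerical values come from the response.

```python
# Ratios of corrected to uncorrected effective FOV radius, as quoted in the response.
radius_ratio = {"6.4 mm": 2.17, "8.8 mm": 1.53}


def area_enlargement(ratio: float) -> float:
    """Area of a circular FOV scales with the square of its radius."""
    return ratio ** 2


for probe, ratio in radius_ratio.items():
    # Rounded to one decimal, these reproduce the 4.7x and 2.3x area values.
    print(f"{probe}-long probe: radius x{ratio:.2f} -> area x{area_enlargement(ratio):.1f}")
```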

      b) I do not understand how the enlargements in Figure 3I and J align with the ray trace simulations in Figure 1, indicating an enlargement of 5.4 and 5.6.

      In Figure 1C, E of the first submission we showed the Strehl ratio of focal spots focused after the microendoscope, in the object plane, as a function of radial distance from the optical axis of focal spots focused in the focal plane at the back end of the GRIN rod (“Objective focal plane” in Figure 1A, B), before the light has traveled along the GRIN lens. After reading the Referee’s comment, we realized this choice does not facilitate the comparison between Figure 1 and Figure 3I, J. We therefore decided to modify Figure 1C, E by showing the Strehl ratio of focal spots focused after the microendoscope as a function of their radial distance from the optical axis in the object plane (where the Strehl ratio is computed), after the light has traveled through the GRIN lens (radial distances are still computed on a plane, not along the curved focal surface represented by the “imaging plane” in Figure 1A, B). Computing radial distances in the object space, we found that the relative increase in the radius of the FOV due to the correction of aberrations was 3.50 and 3.35 for the 6.4 mm- and the 8.8 mm-long microendoscope, respectively. We also revised the manuscript text accordingly (page 7, lines 6-8):

      “The simulated increase in the radius of the diffraction-limited FOV was 3.50 times and 3.35 times for the 6.4 mm-long and 8.8 mm-long probe, respectively (Fig. 1C, E)”. We believe this change should facilitate the comparison of the data presented in Figure 1 and Figure 3.

      Moreover, in comparing results in Figure 1 and Figure 3, it is important to keep in mind that:

      (1) the definitions of the effective FOV radius were different in simulations (Figure 1) and real measurements (Figure 3). In simulations, we considered a theoretical criterion (Maréchal criterion) and set the lower threshold for a diffraction-limited FOV to a Strehl ratio value of 0.8. In real measures, the effective FOV radius obtained from fluorescent bead measurements was defined based on the empirical criterion of setting the upper threshold for the axial resolution to 10 µm.

      (2) the Zemax file of the GRIN lenses contained low-order aberrations and not high-order aberrations.

      (3) the small variability in some of the experimental parameters (e.g., the distance between the GRIN back end and the focusing objective) was not reflected in the simulations.

      Given the reasons listed above, it is expected that the predictions of the simulations do not perfectly match the experimental measurements and tend to indicate larger improvements from aberration correction than those measured experimentally.
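
      For readers less familiar with the Strehl threshold of 0.8 used in the simulations, it can be related to a residual RMS wavefront error through the widely used extended Maréchal approximation, S ≈ exp(-(2πσ)²), with σ expressed in waves. The sketch below is a generic textbook relation, not part of our Zemax workflow; the function names are illustrative.

```python
import math


def strehl_from_rms(rms_waves: float) -> float:
    """Extended Marechal approximation: S ~ exp(-(2*pi*sigma)^2), sigma in waves."""
    return math.exp(-(2.0 * math.pi * rms_waves) ** 2)


def rms_for_strehl(strehl: float) -> float:
    """Invert the approximation: RMS wavefront error (in waves) for a target Strehl ratio."""
    return math.sqrt(-math.log(strehl)) / (2.0 * math.pi)


wavelength_nm = 920.0  # excitation wavelength quoted in the response
limit_waves = rms_for_strehl(0.8)
print(f"Strehl 0.8 corresponds to an RMS error of about {limit_waves:.3f} waves "
      f"({limit_waves * wavelength_nm:.0f} nm at {wavelength_nm:.0f} nm)")
```

      Under this approximation, the classical tolerance of one fourteenth of a wave RMS gives a Strehl ratio slightly above 0.8, which is why 0.8 is commonly taken as the boundary of diffraction-limited performance.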

      c) Finally, how can the enlargement in Figure 3I be compared to the measurements of the sub-resolved fluorescence layers in Figures 3A-D? Could the authors please clarify these points?

      When comparing measurements of subresolved fluorescent films and beads it is important to keep in mind that the two measures have different purposes and spatial resolution. We used subresolved fluorescent films to visualize the shape and extent of the focal surface of microendoscopes in a continuous way along the radial dimension (in contrast to bead measurements that are quantized in space). This approach comes at the cost of spatial resolution, as we are using fluorescent layers, which are subresolved in the axial but not in the radial dimension. Therefore, fluorescent film profiles are not used in our study to extract relevant quantitative information about effective FOV enlargement or spatial resolution of corrected microendoscopes. In contrast, to quantitatively characterize axial and lateral resolutions we used measurements of 100 nm-diameter fluorescent beads (therefore subresolved in the x, y, and z dimensions) located at different radial distances from the center of the FOV, using a much smaller nominal pixel size compared to the fluorescent films (beads, lateral resolution: 0.049 µm/pixel, axial resolution: 0.5 µm/pixel; films, lateral resolution: 1.73 µm/pixel, axial resolution: 2 µm/pixel).

      (3) On page 15, the statement "significantly enlarge the FOV" should be more specific by providing the actual values for the increase. It would also be good to mention that this is not a xy lateral increase; rather, as one moves further from the center, more of the imaged cells belong to axially different planes.

      The values of the experimentally determined FOV enlargements (4.7 times and 2.3 times for 6.4 mm- and 8.8 mm-long microendoscope, respectively) are provided in Table 1 and are now referenced on page 10. Following the Referee’s request, we added the following sentence in the discussion (page 18, lines 10-14) to underline that the extended FOV samples on different axial positions because of the field curvature effect:

      “It must be considered, however, that the extended FOV achieved by our aberration correction method was characterized by a curved focal plane. Therefore, cells located in different radial positions within the image were located at different axial positions and cells at the border of the FOV were closer to the front end of the microendoscope”.

      (4) On page 36, most of the formulas appear to be corrupted. This may have occurred during the conversion to the merged PDF. Please verify this and check for similar problems in other equations throughout the text as well.

      We apologize for the inconvenience. This issue arose in the PDF conversion of our Word document and we did not spot it upon submission. We will now make sure the PDF version of our manuscript correctly reports symbols and equations.

      (5) In the discussion, the authors could potentially add comments on how the verified performance of the corrective lenses depends on the wavelength and mention the range within which the wavelength can be changed without the need to redesign a new corrective lens.

      Following this comment and those of the other Reviewers, we explored the effect of changing wavelength on the Strehl ratio using new Zemax simulations. We found that the Strehl ratio remains > 0.8 within at least ± 10 nm of λ = 920 nm (new Supplementary Figure 1A-D, left panels), which covers the limited bandwidth of our femtosecond laser. Moreover, these simulations demonstrate that, over a much wider wavelength range (800 - 1040 nm), a high Strehl ratio is obtained but at different z planes (new Supplementary Figure 1A-D, right panels). These new results are now described on page 7 (lines 8-10).

      (6) Also, they could discuss if and how the corrective lens could be integrated into fiberscopes for freely moving experiments.

      Following the Referee’s suggestion, we added a short text in the Discussion (page 21, lines 4-7 from bottom). It reads:

      “Another advantage of long corrected microendoscopes described here over adaptive optics approaches is the possibility to couple corrected microendoscopes with portable 2P microscopes(42-44), allowing high resolution functional imaging of deep brain circuits on an enlarged FOV during naturalistic behavior in freely moving mice”.

      (7) Finally, since the main advantage of this approach is its simplicity, the authors should also comment on or outline the steps to follow for potential users who are interested in using the corrective lenses in their systems.

      Thanks for this comment. The Materials and Methods section of this study and that of Antonini et al. eLife 2020 describe in detail the experimental steps that potential users need to reproduce the corrective lenses and apply them to their own experimental configuration.

      Reviewer #3 (Recommendations for the authors):

      (1) Suggestions for improved or additional experiments, data, or analyses, and Recommendations for improving the writing and presentation:

      See Public Review.

      Please see our point-by-point response above.

      (2) Minor corrections on text and figures: a) Figure 6A: is the fraction of cells expressed in %?

      Author response: yes, that is correct. Thank you for spotting it. We added the “%” symbol to the y label.

      b) Figure 8A, left: The second line is blue and not red dashed. In addition, it could be interesting to also show a line corresponding to the 0 value.

      Thank you for the suggestions. We modified Figure 8 according to the Referee’s comments.

      c) Some parts of equation (1) and some variables in the Material and Methods section are missing

      We apologize for the inconvenience. This issue arose in the PDF conversion of our Word document and we did not spot it upon submission. We will now make sure the PDF version of our manuscript correctly reports symbols and equations.

      d) In the methods, the authors mention a calibration ruler with ticks spaced every 10 µm along two orthogonal directions and refer to the following product: 4-dot calibration slide, Cat. No. 1101002300142, Motic, Hong Kong. However, this product does not seem to correspond to a calibration ruler.

      We double-checked. The catalog number 1101002300142 is correct and product details can be found at the following link:

      https://moticmicroscopes.com/products/calibration-slide-4-dots-1101002300142?srsltid=AfmBOorGYx9PcXtAlIMmSs_tEpxS4nX21qIcV8Kfn4qGwizQK3LYOQn3

    1. Author Response

      The following is the authors’ response to the original reviews.

      We are grateful to the reviewers for their appreciation of our study and thoughtful comments. In response to the main concern raised by all reviewers regarding the potential influences of external noise factors on intuitive inference, such as external disturbances or imperfect observations, we have conducted three new experiments suggested by the reviewers. These experiments were designed to: (1) assess the influence of external forces on humans’ judgments by implementing a wall to block wind disturbances from one direction, (2) examine human accuracy in predicting the landing position of a falling ball when its trajectory is obscured, and (3) evaluate the effect of object geometry on human judgment of stability. The findings from these experiments consistently support our proposal of a stochastic world model of gravity embedded in the human mind. We have also addressed the remaining comments from the reviewers point by point.

      Reviewer #1 (Recommendations For The Authors):

      As mentioned in the public review, I did not find it entirely convincing that the study shows evidence for a Gaussian understanding of gravity. There are two studies that would bolster this claim: 1. Replicate experiment 1, but also ask people to infer whether there was a hidden force. If people are truly representing gravity as proposed in the paper, you should get no force inferences. However, if the reason the Gaussian gravity model works is that people infer unseen forces, this should come out clearly in this study.

      Author response image 1.

      Wall experiment to test the impact of external forces on the measurement of stochastic gravity. (a) Experimental setting. We replicated the original setup with the addition of a wall implemented on one side. Left: the overall experimental scene; Right, the scene shown to participants. (b) Human behaviors. Three participants conducted this experiment, and their responses consistently showed normal distributions without any skewness, suggesting that their judgments were not affected by the presence of the wall. These results support our claim that humans’ judgments on stability were not affected by potential concerns regarding external forces.

      R1: We thank the reviewer for this suggestion. To directly test whether participants’ judgments were influenced by their implicit assumptions about external forces, we duplicated the original experimental setup with the addition of a wall implemented on one side (Supplementary Figure 4A). Before the start of the experiment, we explicitly informed the participants that the wall was designed to block wind, ensuring that any potential wind forces from the direction of the wall would not influence the collapse. If participants’ judgments were affected by external noise, we would expect to observe a skewed angle distribution. Contrary to this prediction, our results showed a normal distribution across all three participants tested (1 female; ages: 24-30), similar to the experiment without the wall (Supplementary Figure 4B). Therefore, the stochastic nature of intuitive inference on objects’ stability is embedded in the mind, not shaped by external forces or explicit instructions.

      This new experiment has been added to the revised manuscript:

      Line 166-168: “…, and remained unchanged with the addition of a wall on one side to block potential external disturbances from wind (Supplementary Figure 4).”
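
      The prediction tested here is that a one-sided external influence would skew the distribution of judged collapse angles, whereas a symmetric distribution has a skewness close to zero. A minimal sketch of such a check is shown below; it uses the standard moment-based skewness, and the two samples are hypothetical values chosen for illustration, not data from the experiment.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (zero for a symmetric sample)."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5


# Hypothetical collapse-direction angles in degrees (NOT data from the study).
symmetric = [-20, -10, -5, 0, 5, 10, 20]
pushed_one_way = [-5, 0, 0, 5, 10, 30, 60]

print(f"symmetric sample skewness: {sample_skewness(symmetric):.3f}")
print(f"one-sided sample skewness: {sample_skewness(pushed_one_way):.3f}")
```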

      (2) Similarly, you can imagine a simple study where you drop an object behind a floating occluder and you check where people produce an anticipatory fixation (i.e., where do they think the object will come out?). If people have a stochastic representation of gravity, this should be reflected in their fixations. But my guess is that everyone will look straight down.

      Author response image 2.

      Trajectory experiment to test the stochastic nature of gravity represented in the mind. (a) Experiment design. In this experiment, participants were required to use a mouse to determine the landing point of a parabolic trajectory (marked by the green dot), obscured by a grey rectangle. Note that the parabolic trajectory was determined only by gravity, and no external disturbances were introduced. The parameters used in this experiment are detailed in the upper right corner. (b) Predictive errors from three participants. The predictive errors from all three participants conform to Gaussian distributions with non-negligible variances. These results suggest the notion of an inherent stochastic property of gravity represented in the mind.

      R2: We thank the reviewer for suggesting this thought experiment. However, when predicting the landing point of a falling object, participants may rely more on learned knowledge that an unimpeded object continues to fall in a straight line, rather than drawing on their intuitive physics. To avoid this potential confounding factor, we designed a similar experiment where participants were asked to predict the landing point of a parabolic trajectory, obscured by an occluder (Author response image 2A). In each trial, participants used a mouse (clicking the left button) to predict the landing point of each parabolic trajectory, and there were 100 trials in total. This design not only limits the impact of direct visual cues but also actively engages the mental simulation of intuitive physics. All three participants (1 female; ages: 24-30) were unable to accurately predict the landing points of the trajectories, and the predictive errors conformed to Gaussian distributions with different variances (Author response image 2B). Therefore, this new experiment confirms the stochastic nature of intuitive physics.
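
      To illustrate how a Gaussian-distributed gravity direction would translate into Gaussian-like landing errors, the following Monte Carlo sketch tilts the gravity vector by an angle drawn from a normal distribution and computes where a horizontally launched ball lands. It is purely illustrative: the launch height, speed, and angular spread are assumed values, not the parameters of the experiment, and it does not model participants' responses.

```python
import math
import random
import statistics


def landing_x(theta, height=1.0, v0=2.0, g=9.81):
    """Horizontal landing position of a ball launched horizontally from `height`
    when gravity is tilted by `theta` radians away from the vertical."""
    t = math.sqrt(2.0 * height / (g * math.cos(theta)))  # time to reach y = 0
    return v0 * t + 0.5 * g * math.sin(theta) * t ** 2


def landing_errors(sigma, n=10000, seed=0):
    """Landing errors relative to exactly vertical gravity, with theta ~ N(0, sigma)."""
    rng = random.Random(seed)
    reference = landing_x(0.0)
    return [landing_x(rng.gauss(0.0, sigma)) - reference for _ in range(n)]


errors = landing_errors(sigma=0.05)  # assumed spread of 0.05 rad
print(f"mean error = {statistics.fmean(errors):.4f}, sd = {statistics.stdev(errors):.4f}")
```

      For small tilt angles the landing error is approximately the launch height multiplied by the tilt angle, so under these assumptions the error distribution inherits the Gaussian shape of the angle distribution.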

      (3) I believe the correct alternative model should be the one that has uncertainty over unseen forces, which better captures current proposals in the field, and controls for the amount of uncertainty in the models.

      R3: We thank the reviewers for the above-mentioned suggestions, and the findings from these two new experiments reinforce our proposal regarding the inherent stochastic characteristic of how the mind represents gravity.

      (4) I was not convinced that the RL framework was set up correctly to tackle the questions it claims to tackle. What this shows is that you can evolve a world model with Gaussian gravity in a setup that has no external perturbations. That does not imply that that is how humans evolved their intuitive physics, particularly when creatures have evolved in a world full of external perturbations. Showing that when (1) there are hidden perturbations, and (2) these perturbations are learnable, but (3) the model nonetheless just learns stochastic gravity, would be a more convincing result.

      R4: We completely agree with the reviewer that the RL framework serves primarily as a theoretical model to explain the stochastic nature of the world model on gravity, rather than as a demonstration of the developmental origins of intuitive physics abilities. The genesis of such abilities is multifaceted and unlikely to be fully replicated through a simple simulation like RL. Therefore, the purpose of incorporating the RL framework in our study is to demonstrate that external perturbations are not necessary for the development of a stochastic representation of gravity. In fact, introducing additional external noise into the RL framework likely heightens the uncertainty in learning gravity’s direction, potentially amplifying, rather than diminishing, the stochastic nature of mental gravity.

      In revision, we have clarified the role of the RL framework:

      Line 265-277: “While the cognitive impenetrability and the self-consistency observed in this study, without resorting to an external perturbation, favor the stochastic model over the deterministic one, the origin of this stochastic feature of the world model is unclear.

      Here we used a reinforcement learning (RL) framework to unveil this origin, because our intelligence emerges and evolves under the constraints of the physical world. Therefore, the stochastic feature may emerge as a biological agent interacts with the environment, where the mismatches between external feedback from the environment and internal expectations from the world model are in turn used to fine-tune the world model (Friston et al., 2021; MacKay, 1956; Matsuo et al., 2022). Note that a key aspect of the framework is determining whether the stochastic nature of the world model on gravity emerges through this interaction, even in the absence of external noise.”

      (5) Some comments on the writing:

      The word 'normality' is used to refer to people's judgments about whether a tower collapsed looked 'normal'. I was a bit confused by this because normality can also mean 'Gaussian' and the experiments are also sampling from Gaussian distributions. There were several points where it took me a second to figure out which sense of 'normality' the paper was using. I would recommend using a different term.

      R5: We are sorry for the confusion. In revision, the term “normality” has been replaced with “confidence level about normal trajectory”.

      (6) One small comment is that Newton's laws are not a faithful replica of the "physical laws of the world" they are a useful simplification that only works at certain timescales. I believe some people propose Newtonian physics as a model of intuitive physics in part because it is a rapid and useful approximation of complex physical systems, and not because it is an untested assumption of perfect correspondence.

      R6: We are sorry for the inaccurate expression. We have revised our statements in the manuscript Line 15-16: “We found that the world model on gravity was not a faithful replica of the physical laws, but instead encoded gravity’s vertical direction as a Gaussian distribution.”

      (7) Line 49-50: Based on Fig 1d, lower bound of possible configurations for 10 blocks is ~17 in log-space, which is about 2.5e7. But the line here says it's 3.72e19, which is much larger. Sorry if I am missing something.

      R7: We thank the reviewer for pointing out this error. We re-calculated the number of possible configurations using formula (3) in the appendix, and the number of configurations with 10 blocks is:

      Thus,

      This estimated number is much larger than that in our previous calculation, which has been corrected in the revised text.

      Line 827-829: “d) The lower bound of configurations’ possible number and the number of blocks in a stack followed an exponential relationship with a base of 10. The procedure can create at least 1.14×10^50 configurations for stacks consisting of 10 blocks.”

      Line 49-50: “… but the universal cardinality of possible configurations is at least 1.14×10^50 (Supplementary Figure 1), …”

      Line 1017-1018: “… the number of configurations can be estimated with formula (9), which is 1.14×10^50.”

      (8) Lines 77-78: "A widely adopted but not rigorously tested assumption is that the world model in the brain is a faithful replica of the physical laws of the world." This risks sounding like you are asserting that colleagues in the field do not rigorously test their models. I think you meant to say that they did not 'directly test', rather than 'rigorously test'. If you meant rigorous, you might want to say more to justify why you think past work was not rigorous.

      R8: We apologize for the inappropriate wording. The sentence has been revised, and the motivation is now explained more fully in the revised text:

      Line 76-92: “A prevailing theory suggests that the world model in the brain accurately mirrors the physical laws of the world (Allen et al., 2020; Battaglia et al., 2013; Zhou et al., 2022). For example, the direction of gravity encoded in the world model, a critical factor in stability inference, is assumed to be straight downward, aligning with its manifestation in the physical world. To explain the phenomenon that tall and thin objects are subjectively perceived as more unstable compared to short and fat ones (Supplementary Figure 2), external noise, such as imperfect perception and assumed external forces, is introduced to influence the output of the model. However, when the brain actively transforms sensory data into cognitive understanding, these data can become distorted (Kriegeskorte and Douglas, 2019; Naselaris et al., 2011), hereby introducing uncertainty into the representation of gravity’s direction. In this scenario, the world model inherently incorporates uncertainty, eliminating the need for additional external noise to explain the inconsistency between subjective perceptions of stability and the actual stability of objects. Note that this distinction of these two theories is nontrivial: the former model implies a deterministic representation of the external world, while the latter suggests a stochastic approach.”

      (9) Lines 79-84 States that past models encode gravity downward. It then says that alternatively there is consensus that the brain uses data from sensory organs and adds meaning to them. I think there might be a grammatical error here because I did not follow why saying there is 'consensus' on something is a theoretical alternative. I also had trouble following why those two statements are in opposition. Is any work on physics engines claiming the brain does not take data from sensory organs and add meaning to them?

      R9: We are sorry for the confusion. Here we intend to contrast the deterministic model (i.e., the uncertainty comes from outside the model) with the stochastic model (i.e., the uncertainty is inherently built into the model). In revision, we have clarified the intention. For details, please see R8.

      (10) Lines 85-88: Following on the sentence above, you then conclude that the representation of the world may therefore not be the same as reality. I did not understand why this followed. It seems you are saying that, because the brain takes data from sensory organs, therefore its representations may differ from reality.

      R10: Again, we are sorry about the confusion. Please see the revised text in R8.

      (11) Lines 190-191: I had trouble understanding this sentence. I believe you are missing an adjective to clarify that participants were more inclined to judge taller stacks as more likely to collapse.

      R11: We are sorry for the confusion. What we intended to state here is that participants’ judgment was biased, showing a tendency to predict a collapse for stacks regardless of their actual stability. We have revised this confusing sentence in the revision. Line 202–204: “However, the participants showed an obvious bias towards predicting a collapse for stacks regardless of their actual stability, as the dots in Fig 2b are more concentrated on the lower side of the diagonal line.”

      (12) Line 201: I don't think it's accurate to say that MGS "perfectly captured participants' judgments" unless the results are actually perfect.

      R12: We agree, and in revision we have toned down the statement. Line 213–214: “…, the MGS, in contrast to the NGS, more precisely reflected participants’ judgments of stability …”

      Reviewer #2 (Recommendations For The Authors):

      I think this is an impressive set of experiments and modeling work. The paper is nicely written and I appreciate the poetic license the authors took at places in the manuscript. I only have clarification points and suggest a simple experiment that could lend further support to their conclusions.

      (1) In my opinion, the impact of this work is twofold. First, the suggestion that gravity is represented as a distribution of the world and not a result of (inferred) external perturbations. Second, that the distribution is advantageous as it balances speed and accuracy, and lessens computational processing demands (i.e., number of simulations). The second point here is contingent on the first point, which is really only supported by the RL model and potentially the inverted scene condition. I am somewhat surprised that the RL model does not converge on a width much smaller than ~20 degrees after 100,000 simulations. From my understanding, it was provided feedback with collapses based on natural gravity (deterministically downward). Why is learning so slow and the width so large? Could it be the density of the simulated world model distribution? If the model distribution of Qs was too dense, then Q-learning would take forever. If the model distribution was too sparse, then its final estimate would hit a floor of precision. Could the authors provide more details on the distribution of the Qs for the RL model?

      Author response image 3.

      RL learning curves as a function of θ angle with different sampling densities and learning rates. Learning rates were adjusted to low (a), intermediate (b) and high (c) settings, while sampling densities were chosen at four levels: 5x5, 11x11, 31x31, and 61x61 shown from the left to the right. Two key observations emerged from the simulations as the reviewer predicted. First, higher learning rates resulted in a more rapid decline in learning curves but introduced larger variances. Second, increased sampling density necessitated more iterations for convergence. Note that in all simulations, we limited the iterations to 1,000 times (as opposed to 100,000 times reported in the manuscript) to demonstrate the trend without excessive computational demands.

      R1: To illustrate the distribution of the Q-values for the RL model, we re-ran the RL model with various learning rates and sampling densities (Author response image 3). These results support the reviewer’s prediction that higher learning rates resulted in a more rapid decline in learning curves but introduced larger variances, and increased sampling density requires more iterations for convergence.

      This simulation also elucidates the slower learning observed in the experiment described in the text, where the force sphere was divided into 61x61 angle pairs, and the learning rate was set to 0.15. This set of parameters ensured convergence within a reasonably brief timeframe while maintaining high-resolution force assessments.
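
      The trade-off between sampling density and iteration count can be made concrete with a little arithmetic. The sketch below is not the authors' RL model. It assumes, purely for illustration, that each training configuration updates one uniformly chosen cell of the N×N angle grid and that values follow a generic incremental update Q ← Q + α(r − Q). The function names are hypothetical.

```python
def expected_visits_per_cell(iterations: int, grid_side: int) -> float:
    """Average number of updates per cell of a grid_side x grid_side grid,
    assuming one uniformly chosen cell is updated per iteration (an assumption)."""
    return iterations / (grid_side * grid_side)


def residual_initial_weight(learning_rate: float, updates: int) -> float:
    """Weight still carried by a cell's initial value after `updates` steps of
    Q <- Q + learning_rate * (r - Q)."""
    return (1.0 - learning_rate) ** updates


if __name__ == "__main__":
    # Grid sizes taken from Author response image 3; 100,000 from the manuscript.
    for side in (5, 11, 31, 61):
        visits = expected_visits_per_cell(100_000, side)
        print(f"{side}x{side}: {visits:.1f} expected updates per cell")
```

      Under these assumptions a 61x61 grid has 3,721 cells, so 100,000 configurations give each cell roughly 27 updates on average, whereas 1,000 iterations give fewer than one. A higher learning rate shrinks the residual weight of the initial value faster, at the cost of noisier estimates.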

      In addition, the width of the Gaussian distribution is mainly determined by the complexity of the stacks. As shown in Figure 3c and Supplementary Figure 9, stacks with fewer blocks (i.e., less complex) produced a larger width, whereas those with more blocks resulted in a narrower spread. In the study, we used a collection of stacks varying from 2 to 15 blocks to simulate the range of stacks humans typically encounter in daily life.

      In revision, we have incorporated these insights suggested by the reviewer to clarify the performance of the RL framework:

      Line 634-639: “The angle density and learning rate are two factors that affect the learning speed. A larger angle density prolongs the time to reach convergence but enables a more detailed force space; a higher learning rate accelerates convergence but incurs larger variance during training. To balance speed and convergence, we utilized 100,000 configurations for the training.”

      Line 618-619: “…, separately divided them into 61 sampling angles across the spherical force space (i.e., the angle density).”

      (2) Along similar lines, the authors discuss the results of the inverted scene condition as reflecting cognitive impenetrability. However, do they also interpret it as support for an intrinsically noisy distribution of gravity? I would be more convinced if they created a different scene that could have the possibility of affecting the direction of an (inferred) external perturbation - a previously held explanation of the noisy world model. For example, a relatively simple experiment would be to have a wall on one side of the scene such that an external perturbation would be unlikely to be inferred from that direction. In the external perturbation account, phi would then be affected resulting in a skewed distribution of angle pairs. However, in the authors' stochastic world model phi would remain unaffected resulting in the same uniform distribution of phi the authors observed. In my opinion, this would provide more compelling evidence for the stochastic world model.

      Author response image 4.

      Wall experiment to test the impact of external forces on the measurement of stochastic gravity. (a) Experimental setting. We replicated the original setup with the addition of a wall implemented on one side. Left: the overall experimental scene; Right: the scene shown to participants. (b) Human behaviors. Three participants conducted this experiment, and their responses consistently showed normal distributions without any skewness, suggesting that their judgments were not affected by the presence of the wall. These results support our claim that humans’ judgments on stability were not affected by potential concerns regarding external forces.

      R2: We thank the reviewer for this suggestion. Following it, we designed an experiment with the addition of a wall implemented on one side (Supplementary figure 4A). Before the start of the experiment, we explicitly informed the participants that the wall was designed to block wind, so that no wind force from the direction of the wall could influence the collapse trajectory of the configurations. Participants were asked to judge whether the trajectory was normal. If participants’ judgments were influenced by external noise, we would expect to observe a skewed angle distribution. However, our results still showed a normal distribution across all participants tested, consistent with the experiment without the wall (Supplementary figure 4B). This experiment suggests that the stochastic nature of intuitive inference on objects’ stability is embedded in the mind, rather than shaped by external forces or explicit instructions.

      We revised the original manuscript and added this new experiment:

      Line 166-168: “…, and remained unchanged with the addition of a wall on one side to block potential external disturbances from wind (Supplementary Figure 4).”
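
      The response does not state how skewness was assessed. One common summary, shown here only as a sketch, is the standardized third central moment, which is zero for a symmetric sample and moves away from zero when one tail is heavier. The function name is hypothetical, and any angles passed to it would be illustrative rather than the participants' data.

```python
def sample_skewness(values):
    """Standardized third central moment (population form).

    Returns 0 for perfectly symmetric data, a positive value for a heavier
    right tail and a negative value for a heavier left tail.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    second_moment = sum((v - mean) ** 2 for v in values) / n
    if second_moment == 0:
        raise ValueError("skewness is undefined for constant data")
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / second_moment ** 1.5
```

      In the external-perturbation account, responses on the side away from the wall would be over-represented, which this statistic would register as a non-zero value. With only a few participants, such a summary is descriptive rather than a formal test.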

      (3) I didn't completely follow the authors' explanation for the taller objects illusion. On lines 229-232, the authors state that deviations from gravity's veridical direction are likely to accumulate with the height of the objects. Is this because, in the stochastic world model account, each block gets its own gravity vector that is sampled from the distribution? The authors should clarify this more explicitly. If this is indeed the author's claim, then it would seem that it could be manipulated by varying the dimensions of the blocks (or whatever constitutes an object).

      R3: We are sorry for the confusion caused by the use of the term ‘accumulate’. In the study, there is only one gravity vector sampled from the distribution for the entire structure, rather than each block having a unique gravity vector. The height illusion is attributed to the fact that the center of gravity in taller objects is more susceptible to influence when gravity deviates slightly from a strictly downward direction. This is especially true for objects consisting of multiple blocks stacked atop one another. In revision, we have removed the confusing term ‘accumulate’ for clarification.

      Line 242-244: “…, because the center of gravity in taller objects is more susceptible to influence when gravity deviates slightly from a strictly downward direction during humans’ internal simulations.”
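
      A minimal sketch can illustrate why a single sampled gravity direction penalizes tall objects more. It treats the object as one uniform rigid block and the tilt as one-dimensional, both simplifications that are not taken from the manuscript. The spread `sigma_deg` is a free parameter chosen by the caller, and the function names are hypothetical.

```python
import math


def critical_tilt(base_width: float, height: float) -> float:
    """Tilt of gravity from the vertical (radians) beyond which a uniform
    rectangular block tips: the line of gravity through its centre of mass
    then falls outside the base."""
    return math.atan2(base_width / 2.0, height / 2.0)


def topple_probability(base_width: float, height: float, sigma_deg: float) -> float:
    """P(|tilt| > critical tilt) when tilt ~ Normal(0, sigma_deg) in one dimension."""
    z = math.degrees(critical_tilt(base_width, height)) / sigma_deg
    # Two-sided Gaussian tail: 2 * (1 - Phi(z)) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))
```

      For a fixed base width, increasing the height lowers the critical tilt, so the same distribution of gravity directions yields a larger probability of predicted collapse without any per-block accumulation of error.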

      (4) The authors refer to the RL simulations as agent-environment interactions, but in reality, the RL model does not interact with the blocks. Would experience-dependent or observation be more apropos?

      R4: We completely agree. Indeed, the RL model did not manipulate stacks; rather, it updated its knowledge of natural gravity based on the discrepancies between the RL model’s predictions and observed outcomes. In revision, we have removed the confusing term ‘agent-environment interactions’ and clarified its intended meaning.

      Line 19-22: “Furthermore, a computational model with reinforcement learning revealed that the stochastic characteristic likely originated from experience-dependent comparisons between predictions formed by internal simulations and the realities observed in the external world, …”

      Reviewer #3 (Public Review):

      (1) In spite of the fact that the Mental Gravity Simulation (MGS) seems to predict the data of the two experiments, it is an untenable hypothesis. I give the main reason for this conclusion by illustrating a simple thought experiment. Suppose you ask subjects to determine whether a single block (like those used in the simulations) is about to fall. We can think of blocks of varying heights. No matter how tall a block is, if it is standing on a horizontal surface it will not fall until some external perturbation disturbs its equilibrium. I am confident that most human observers would predict this outcome as well. However, the MGS simulation would not produce this outcome. Instead, it would predict a non-zero probability of the block to tip over. A gravitational field that is not perpendicular to the base has the equivalent effect of a horizontal force applied on the block at the height corresponding to the vertical position of the center of gravity. Depending on the friction determined by the contact between the base of the block and the surface where it stands there is a critical height where any horizontal force being applied would cause the block to fall while pivoting about one of the edges at the base (the one opposite to where the force has been applied). This critical height depends on both the size of the base and the friction coefficient. For short objects this critical height is larger than the height of the object, so that object would not fall. But for taller blocks, this is not the case. Indeed, the taller the block the smaller the deviation from a vertical gravitational field is needed for a fall to be expected. The discrepancy between this prediction and the most likely outcome of the simple experiment I have just outlined makes the MGS model implausible. Note also that a gravitational field that is not perpendicular to the ground surface is equivalent to the force field experienced by the block while standing on an inclined plane. For small friction values, the block is expected to slide down the incline; therefore, another prediction of this MGS model is that when we observe an object on a surface exerting negligible friction (think of a puck on ice) we should expect that object to spontaneously move. But of course, we don't, as we do not expect tall objects that are standing to suddenly fall if left unperturbed. In summary, a stochastic world model cannot explain these simple observations.
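
      The critical-height argument in this comment can be written out for an idealized case. The derivation below assumes a uniform rectangular block of base width $w$ and height $h$, a Coulomb friction coefficient $\mu$, and an effective gravity of magnitude $g$ tilted by an angle $\theta$ from the surface normal. These symbols are introduced here for illustration and do not appear in the review.

```latex
\begin{align*}
\text{slides when} \quad & g\sin\theta > \mu\, g\cos\theta \iff \tan\theta > \mu, \\
\text{tips when} \quad & \tan\theta > \frac{w/2}{h/2} = \frac{w}{h}, \\
\text{tips before sliding when} \quad & \frac{w}{h} < \mu \iff h > \frac{w}{\mu}.
\end{align*}
```

      In this idealization the critical height is $w/\mu$: it grows with the base width and shrinks as friction increases. A block shorter than this slides rather than tips as the tilt grows, and a taller block tips at a tilt of $\arctan(w/h)$, which decreases with height.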

      Author response image 5.

      Differentiating Subjectivity from Objectivity. In both Experiment 1 (a) and Experiment 2 (b), participants were instructed to determine which shape appeared most stable. Objectively, in the absence of external forces, all shapes possess equal stability. Yet, participants typically perceived the shape on the left as the most stable because of its larger base area. The discrepancy between objective realities and subjective feelings, as we propose, is attributed to the human mind representing gravity’s direction as a Gaussian distribution, rather than as a singular value pointing directly downward.

      R1: We agree with the reviewer that objects will remain stable until disturbed by external forces. However, in many cases, there is a clear discrepancy between objective realities and subjective feelings. For example, the electromagnetic waves associated with purple and red are the farthest apart in the visible spectrum, yet purple and red are the closest colors in perceptual color space. Similarly, as shown in Supplementary Figure 4, in reality all shapes possess equal stability in the absence of external forces. Yet, humans typically perceive the shape on the left as more stable because of its larger base area. In this study, we tried to explore the mechanism underlying this discrepancy by proposing that the human mind represents gravity’s direction as a Gaussian distribution, rather than as a singular value pointing directly downward.

      In revision, we have clarified the rationale of this study:

      Line 76-98: “A prevailing theory suggests that the world model in the brain accurately mirrors the physical laws of the world (Allen et al., 2020; Battaglia et al., 2013; Zhou et al., 2022). For example, the direction of gravity encoded in the world model, a critical factor in stability inference, is assumed to be straight downward, aligning with its manifestation in the physical world. To explain the phenomenon that tall and thin objects are subjectively perceived as more unstable compared to short and fat ones (Supplementary Figure 2), external noise, such as imperfect perception and assumed external forces, is introduced to influence the output of the model. However, when the brain actively transforms sensory data into cognitive understanding, these data can become distorted (Kriegeskorte and Douglas, 2019; Naselaris et al., 2011), hereby introducing uncertainty into the representation of gravity’s direction. In this scenario, the world model inherently incorporates uncertainty, eliminating the need for additional external noise to explain the inconsistency between subjective perceptions of stability and the actual stability of objects. Note that this distinction of these two theories is nontrivial: the former model implies a deterministic representation of the external world, while the latter suggests a stochastic approach. Here, we investigated these two alternative hypotheses regarding the construction of the world model in the brain by examining how gravity’s direction is represented in the world model when participants judged object stability.”

      (2) The question remains as to how we can interpret the empirical data from the two experiments and their agreement with the predictions of the stochastic world model if we assume that the brain has internalized a vertical gravitational field. First, we need to look more closely at the questions posed to the subjects in the two experiments. In the first experiment, subjects are asked about how "normal" a fall of a block construction looks. Subjects seem to accept 50% of the time a fall is normal when the gravitational field is about 20 deg away from the vertical direction. The authors conclude that according to the brain, such an unusual gravitational field is possible. However, there are alternative explanations for these findings that do not require a perceptual error in the estimation of the direction of gravity. There are several aspects of the scene that may be misjudged by the observer. First, the 3D interpretation of the scene and the 3D motion of the objects can be inaccurate. Indeed, the simulation of a normal fall uploaded by the authors seems to show objects falling in a much weaker gravitational field than the one on Earth since the blocks seem to fall in "slow motion". This is probably because the perceived height of the structure is much smaller than the simulated height. In general, there are even more severe biases affecting the perception of 3D structures that depend on many factors, for instance, the viewpoint.

      R2: We thank the reviewer for highlighting several potential confounding factors in our study. We address each of these concerns point-by-point:

      (a) Misinterpretation of the 3D scene and motion. In Response Figure 4 shown above, there is no 3D structure, yet participants’ judgment on stability still deviated from objective realities. In addition, 3D motion was introduced to aid in understanding the stacks’ 3D structure. Previous studies without 3D motion have reported similar findings (Allen et al., 2020). Therefore, regardless of whether objects are presented in 2D or 3D, or as static or moving displays, humans’ judgment on object stability appears consistent.

      (b) Errors in perceived height. While there might be discrepancies between perceived and simulated heights, such errors are systematic across all conditions. Therefore, they may affect the width of the Gaussian distribution but do not fundamentally alter its existence.

      (c) The viewpoint. In one experiment, we inverted gravity’s direction to point upward, diverging from common daily experience. Despite this change in viewpoint, the Gaussian distribution was still observed. That is, the viewpoint does not appear to be a key factor in how gravity’s direction is represented as a Gaussian distribution in our mental world.

      In summary, both our study and previous studies (Allen et al., 2020; Battaglia et al., 2013) agree that humans’ subjective assessments of objects’ stability deviate from actual stability due to noise in mental simulation. Unlike previous studies, we suggest that this noise is intrinsic, rather than stemming from external forces or imperfect observations.

      (3) Second, the distribution of weight among the objects and the friction coefficients acting between the surfaces are also unknown parameters. In other words, there are several parameters that depend on the viewing conditions and material composition of the blocks that are unknown and need to be estimated. The authors assume that these parameters are derived accurately and only that assumption allows them to attribute the observed biases to an error in the estimate of the gravitational field. Of course, if the direction of gravity is the only parameter allowed to vary freely then it is no surprise that it explains the results. Instead, a simulation with a tilted angle of gravity may give rise to a display that is interpreted as rendering a vertical gravitational field while other parameters are misperceived. Moreover, there is an additional factor that is intentionally dismissed by the authors that is a possible cause of the fall of a stack of cubes: an external force. Stacks that are initially standing should not fall all of a sudden unless some unwanted force is applied to the construction. For instance, a sudden gust of wind would create a force field on a stack that is equivalent to that produced by a tilted gravitational field. Such an explanation would easily apply to the findings of the second experiment. In that experiment subjects are explicitly asked if a stack of blocks looks "stable". This is an ambiguous question because the stability of a structure is always judged by imagining what would happen to the structure if an external perturbation is applied. The right question should be: "do you think this structure would fall if unperturbed". However, if stability is judged in the face of possible external perturbations then a tall structure would certainly be judged as less stable than a short structure occupying the same ground area. This is what the authors find. What they consider as a bias (tall structures are perceived as less stable than short structures) is instead a wrong interpretation of the mental process that determines stability. If subjects are asked the question "Is it going to fall?" then tall stacks of sound structure would be judged as stable as short stacks, just more precarious.

      R3: Indeed, the external forces suggested by the reviewer certainly influence judgments of objects’ stability. The critical question, however, is whether humans’ judgments on objects’ stability accurately mirror the actual stability of objects in the absence of external forces. To address this question, we designed two new experiments.

      Experiment 1: we duplicated the original experimental setup with the addition of a wall implemented on one side (Supplementary Figure 4A). We explicitly informed the participants that the wall could block wind, ensuring that no potential wind from the direction of the wall could influence the configuration. If participants’ judgments were affected by external noise, we would expect to observe a skewed angle distribution. Contrary to this prediction, our results showed a normal distribution across all three participants (Age: 25-30, two females), which is similar to the experiment without the wall (Supplementary Figure 4B).

      Author response image 6.

      Wall experiment to test the impact of external forces on the measurement of stochastic gravity. (a) Experimental setting. We replicated the original setup with the addition of a wall implemented on one side. Left: the overall experimental scene; Right: the scene shown to participants. (b) Human behaviors. Three participants conducted this experiment, and their responses consistently showed normal distributions without any skewness, suggesting that their judgments were not affected by the presence of the wall. These results support our claim that humans’ judgments on stability were not affected by potential concerns regarding external forces.

      Experiment 2: The second experiment adopted another paradigm to test the hypothesis of stochastic mental simulation. When humans infer the landing point of a parabolic trajectory that is obscured by an occluder (Author response image 2A), the stochastic mental simulation predicts that their responses follow a Gaussian distribution. If, however, their judgments were influenced by external noise, the landing points would not be Gaussian. The experiment consisted of 100 trials; in each trial, participants predicted the landing point of the trajectory by clicking the left mouse button. All three participants (1 female; ages: 24-30) were unable to predict the landing points accurately, and their predictive errors conformed to Gaussian distributions with different variances (Author response image 2B). Therefore, this new experiment confirms the stochastic nature of intuitive physics.

      Author response image 7.

      Trajectory experiment to test the stochastic nature of gravity represented in the mind. (a) Experiment design. In this experiment, participants were required to use a mouse to determine the landing point of a parabolic trajectory (marked by the green dot), obscured by a grey rectangle. Note that the parabolic trajectory was determined only by gravity, and no external disturbances were introduced. The parameters used in this experiment are detailed in the upper right corner. (b) Predictive errors from three participants. The predictive errors from all three participants conform to Gaussian distributions with non-negligible variances. These results support the notion of an inherent stochastic property of gravity represented in the mind.
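
      One way to picture a stochastic mental simulation in this task is to sample a gravity direction from a Gaussian and compute where the projectile would land. The sketch below rests on assumptions that are not from the response: a launch from the origin onto flat ground, a two-dimensional tilt, and illustrative values for launch speed, launch angle and spread. All names are hypothetical.

```python
import math
import random

G = 9.81  # magnitude of gravity in m/s^2


def landing_x(speed: float, launch_deg: float, tilt_rad: float = 0.0) -> float:
    """Horizontal landing position on y = 0 for a projectile launched from the
    origin, with gravity rotated by tilt_rad away from straight down."""
    angle = math.radians(launch_deg)
    vx, vy = speed * math.cos(angle), speed * math.sin(angle)
    gx, gy = G * math.sin(tilt_rad), G * math.cos(tilt_rad)
    flight_time = 2.0 * vy / gy  # non-zero root of vy*t - 0.5*gy*t**2 = 0
    return vx * flight_time + 0.5 * gx * flight_time ** 2


def simulated_errors(n: int, sigma_deg: float, seed: int = 0) -> list:
    """Landing errors for n simulated trials with tilt ~ Normal(0, sigma_deg)."""
    rng = random.Random(seed)
    true_x = landing_x(5.0, 45.0)  # illustrative launch: 5 m/s at 45 degrees
    return [
        landing_x(5.0, 45.0, math.radians(rng.gauss(0.0, sigma_deg))) - true_x
        for _ in range(n)
    ]
```

      For small tilts the landing error is, to first order, proportional to the tilt, so a Gaussian spread in gravity direction would yield approximately Gaussian landing errors in this toy setting. This shows only that the stochastic account is compatible with such errors, not that it is their sole possible source.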

      (4) The RL model used as a proof of concept for how the brain may build a stochastic prior for the direction of gravity is based on very strong and unverified assumptions. The first assumption is that the brain already knows about the force of gravity, but it lacks knowledge of the direction of this force of gravity. The second assumption is that before learning the brain knows the effect of a gravitational field on a stack of blocks. How can the brain simulate the effect of a non-vertical gravitational field on a structure if it has never observed such an event?

      R4: We agree with the reviewer that the RL framework serves primarily as a theoretical model to explain the stochastic nature of the world model on gravity, rather than as a demonstration of the developmental origins of intuitive physics abilities. The genesis of such abilities is multifaceted and unlikely to be fully replicated through a simple simulation like RL. Therefore, the purpose of incorporating the RL framework in our study is to demonstrate that external perturbations are not necessary for the development of a stochastic representation of gravity.

      In revision, we have clarified the role of the RL framework:

      Line 265-277: “While the cognitive impenetrability and the self-consistency observed in this study, without resorting to an external perturbation, favor the stochastic model over the deterministic one, the origin of this stochastic feature of the world model is unclear.

      Here we used a reinforcement learning (RL) framework to unveil this origin, because our intelligence emerges and evolves under the constraints of the physical world. Therefore, the stochastic feature may emerge as a biological agent interacts with the environment, where the mismatches between external feedback from the environment and internal expectations from the world model are in turn used to fine-tune the world model (Friston et al., 2021; MacKay, 1956; Matsuo et al., 2022). Note that a key aspect of the framework is determining whether the stochastic nature of the world model on gravity emerges through this interaction, even in the absence of external noise.”

      (5) The third assumption is that from the visual input, the brain is able to figure out the exact 3D coordinates of the blocks. This has been proven to be untrue in a large number of studies. Given these assumptions and the fact that the only parameters the RL model modifies through learning specify the direction of gravity, I am not surprised that the model produces the desired results.

      Author response image 8.

      Perceptual uncertainty in 3D stack structures. (a) Experimental design. Pairs of stacks with similar block placements were presented sequentially to participants, who were instructed to judge whether the two stacks were identical and to rate their confidence in this judgment. Each stack was presented on the screen for 2 seconds. (b) Behavioral performance. Three participants (2 males, age range: 24-30) were recruited for the experiment. Confidence that a pair of stacks was unchanged decreased rapidly even when each block had a very small displacement, suggesting that humans can keenly perceive small changes in configurations. The x-axis denotes the difference in block placement between stacks, with the maximum value (0.4) corresponding to the length of a block’s short side. The y-axis denotes participants’ confidence in reporting no change. The red curve shows the average confidence level across 4 runs, while the yellow curves show the confidence level of each run.

      R5: Indeed, uncertainty is inevitable when perceiving the external world, because our perception is not a faithful replica of external reality. A more critical question pertains to the accuracy of our perception in representing the 3D coordinates of a stack’s blocks. To address this question, we designed a straightforward experiment (Author response image 5a), where participants were instructed to determine whether a pair of stacks were identical. The position of each block was randomly changed horizontally. We found that all participants were able to accurately identify even minor positional variations in the 3D structure of the stacks (Author response image 5b). This level of perceptual precision is adequate for locating the difference between predictions from mental simulations and actual observations of the external world.

      (6) Finally, the argument that the MGS is more efficient than the NGS model is based on an incorrect analysis of the results of the simulation. It is true that 80% accuracy is reached faster by the MGS model than the 95% accuracy level is reached by the NGS model. But the question is: how fast does the NGS model reach 80% accuracy (before reaching the plateau)?

      R6: Yes. The NGS model achieved 80% accuracy as rapidly as the MGS model. However, the NGS model required a significantly longer period to reach the plateau crucial for decision-making. In revision, this information is now included.

      Line 348-350: “…, while the initial growth rates of both models were comparable, the MGS reached the plateau crucial for decision-making sooner than the NGS.”

      We greatly appreciate the thorough and insightful review provided by all three reviewers, which has considerably improved our manuscript, especially in terms of clarity in the presentation of the approach and further validation of the robustness implications of our results.

      References:

      Allen KR, Smith KA, Tenenbaum JB. 2020. Rapid trial-and-error learning with simulation supports flexible tool use and physical reasoning. Proceedings of the National Academy of Sciences 117:29302–29310.

      Battaglia PW, Hamrick JB, Tenenbaum JB. 2013. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences 110:18327–18332.

      Friston K, Moran RJ, Nagai Y, Taniguchi T, Gomi H, Tenenbaum J. 2021. World model learning and inference. Neural Networks 144:573–590.

      Kriegeskorte N, Douglas PK. 2019. Interpreting encoding and decoding models. Current opinion in neurobiology 55:167–179.

      MacKay DM. 1956. The epistemological problem for automata. Automata Studies (AM-34), Volume 34. Princeton University Press. pp. 235–252.

      Matsuo Y, LeCun Y, Sahani M, Precup D, Silver D, Sugiyama M, Uchibe E, Morimoto J. 2022. Deep learning, reinforcement learning, and world models. Neural Networks.

      Naselaris T, Kay KN, Nishimoto S, Gallant JL. 2011. Encoding and decoding in fMRI. Neuroimage 56:400–410.

      Zhou L, Smith K, Tenenbaum J, Gerstenberg T. 2022. Mental Jenga: A counterfactual simulation model of physical support.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment:

      This study presents an important finding on the implicit and automatic emotion perception from biological motion (BM). The evidence supporting the claims of the authors is solid, although inclusion of a larger number of samples and more evidence for the discrepancy between Intact and local emotional BMs would have strengthened the study. The work will be of broad interest to perceptual and cognitive neuroscience.

      We express our sincere gratitude for the positive and constructive evaluation of our manuscript. We have now included more participants and conducted a replication experiment to strengthen our results.

      Reviewer #1 (Public Review):

      Summary:

      Tian et al. investigated the effects of emotional signals in biological motion on pupil responses. In this study, subjects were presented with point-light biological motion stimuli with happy, neutral, and sad emotions. Their pupil responses were recorded with an eye tracker. Throughout the study, emotion type (i.e., happy/sad/neutral) and BM stimulus type (intact/inverted/non-BM/local) were systematically manipulated. For intact BM stimuli, happy BM induced a larger pupil diameter than neutral BM, and neutral BM also induced a larger pupil diameter than sad BM. Importantly, the diameter difference between happy and sad BM correlated with the autistic trait of individuals. These effects disappeared for the inverted BM and non-BM stimuli. Interestingly, both happy and sad emotions show superiority in pupil diameter.

      Strengths:

      (1) The experimental conditions and results are very easy to understand.

      (2) The writing and data presentation are clear.

      (3) The methods are sound. I have no problems with the experimental design and results.

      Weaknesses:

      (1) My main concern is the interpretation of the intact and local condition results. The processing advantage of happy emotion is not surprising given a number of existing studies. However, the only difference here seems to be the smaller (or larger) pupil diameter for sad compared to neutral in the intact (or local, respectively) condition. The current form only reports this effect but lacks in-depth discussions and explanations as to why this is the case.

      Thank you for pointing this out; we apologize for not making this point clear. It has long been documented that pupil size reflects the degree of cognitive effort and attention input (Joshi & Gold, 2019; van der Wel & van Steenbergen, 2018), and indexes the noradrenaline activity in emotion processing structures such as the amygdala (Dal Monte et al., 2015; Harrison et al., 2006; Liddell et al., 2005). Accordingly, we proposed that the smaller pupil response observed under the sad condition as compared to the neutral condition arises because the sad biological motion (BM) could be less efficient in attracting visual attention and evoking emotional arousal. In line with this, it has been found that infants looked more at the neutral point-light walker when it was displayed in a pair with the sad walker (Ogren et al., 2019), suggesting that the sad BM is less effective in capturing visual attention than the neutral BM. Besides, neural studies have revealed that, compared with other emotions (anger, happiness, disgust, and fear), the processing of sad emotion failed to evoke heightened activities in any emotionally relevant brain regions including the amygdala, the extrastriate body area (EBA) and the fusiform body area (FBA) (Peelen et al., 2007). The current study echoes these previous findings by demonstrating a disadvantage for intact sad BM in evoking pupil responses. Notably, different from the intact sad BM, the local sad BM instead induced stronger pupil responses than the neutral local BM. This distinctive pupil modulation effect observed in intact and local sad BM could be explained by a multi-level emotion processing model of BM. Specifically, even though both the intact and local BM conveyed important life information (Chang & Troje, 2008, 2009; Simion et al., 2008), the latter is deprived of the global form feature. 
Hence, the processing of emotions in local BM may occur at a more basic and preliminary level, responding to the general affective salient emotion information (happy and sad) without detailed analysis. In fact, a similar dissociated emotion processing phenomenon has been observed in another important type of emotional signal with analogous function (i.e., facial expression). For example, happy and fearful faces elicited differential amygdala activations when perceived consciously. However, they elicited comparable amygdala activations when suppressed (Williams et al., 2004). Moreover, it has been proposed that there exist two parallel routes for facial expression processing: a quick but coarse subcortical route that detects affective salient information without detailed analysis, and a fine-grained but slow cortical route that discriminates the exact emotion type. Similarly, the dissociated emotion processing in local and intact BM may function in the same manner, with the former serving as a primary emotion detection mechanism and the latter serving as a detailed emotion discrimination mechanism. Still, future studies adopting more diverse experimental paradigms and neuroimaging techniques are needed to further investigate this issue. We have added these points and more thoroughly discussed the potential mechanism in the revised text (see lines 329-339, 405-415, 418-420).

      References:

      Chang, D. H. F., & Troje, N. F. (2008). Perception of animacy and direction from local biological motion signals. Journal of Vision, 8(5), 3. https://doi.org/10.1167/8.5.3

      Chang, D. H. F., & Troje, N. F. (2009). Characterizing global and local mechanisms in biological motion perception. Journal of Vision, 9(5), 8–8. https://doi.org/10.1167/9.5.8

      Dal Monte, O., Costa, V. D., Noble, P. L., Murray, E. A., & Averbeck, B. B. (2015). Amygdala lesions in rhesus macaques decrease attention to threat. Nature Communications, 6(1). https://doi.org/10.1038/ncomms10161

      Harrison, N. A., Singer, T., Rotshtein, P., Dolan, R. J., & Critchley, H. D. (2006). Pupillary contagion: central mechanisms engaged in sadness processing. Social Cognitive and Affective Neuroscience, 1(1), 5–17. https://doi.org/10.1093/scan/nsl006

      Joshi, S., & Gold, J. I. (2019). Pupil size as a window on neural substrates of cognition. Trends in Cognitive Sciences, 24(6), 466–480. https://doi.org/10.31234/osf.io/dvsme

      Liddell, B. J., Brown, K. J., Kemp, A. H., Barton, M. J., Das, P., Peduto, A., Gordon, E., & Williams, L. M. (2005). A direct brainstem–amygdala–cortical ‘alarm’ system for subliminal signals of fear. NeuroImage, 24(1), 235–243.

      Ogren, M., Kaplan, B., Peng, Y., Johnson, K. L., & Johnson, S. P. (2019). Motion or emotion: infants discriminate emotional biological motion based on low-level visual information. Infant Behavior and Development, 57, 101324. https://doi.org/10.1016/j.infbeh.2019.04.006

      Peelen, M. V., Atkinson, A. P., Andersson, F., & Vuilleumier, P. (2007). Emotional modulation of body-selective visual areas. Social Cognitive and Affective Neuroscience, 2(4), 274–283. https://doi.org/10.1093/scan/nsm023

      Simion, F., Regolin, L., & Bulf, H. (2008). A predisposition for biological motion in the newborn baby. Proceedings of the National Academy of Sciences, 105(2), 809–813. https://doi.org/10.1073/pnas.0707021105

      van der Wel, P., & van Steenbergen, H. (2018). Pupil dilation as an index of effort in cognitive control tasks: a review. Psychonomic Bulletin & Review, 25(6), 2005–2015. https://doi.org/10.3758/s13423-018-1432-y

      Williams, M. A., Morris, A. P., McGlone, F., Abbott, D. F., & Mattingley, J. B. (2004). Amygdala responses to fearful and happy facial expressions under conditions of binocular suppression. Journal of Neuroscience, 24(12), 2898-2904.

      (2) I also found no systematic discussion and theoretical contributions regarding the correlation with the autistic traits. If the main point of this paper is to highlight an implicit and objective behavioral marker of the autistic trait, more interpretation and discussion of the links between the results and existing findings in ASD are needed.

      We thank the reviewer for this insightful suggestion. The perception of biological motion (BM) has long been considered an important hallmark of social cognition. Many studies have reported that individuals with social cognitive deficits (e.g., ASD) were impaired in BM perception (Blake et al., 2003; Freitag et al., 2008; Klin et al., 2009; Nackaerts et al., 2012). More recently, it has been pointed out that the extraction of more complex social information (e.g., emotions, intentions) from BM, as compared to basic BM recognition, could be more effective in detecting ASD (Federici et al., 2020; Koldewyn et al., 2009; Parron et al., 2008; Todorova et al., 2019). Specifically, a meta-analysis found that the effect size nearly doubled when the task required emotion recognition as compared to simple perception/detection (Todorova et al., 2019). However, for high-functioning ASD individuals, it has been reported that they showed comparable performance with the control group in explicitly labelling BM emotions, while their responses were rather delayed (Mazzoni et al., 2021). This suggested that ASD individuals could adopt compensatory strategies to complete the explicit BM labelling task, while their automatic behavioural responses remained impaired. This highlights the importance of using more objective measures that do not rely on active reports to investigate the intrinsic perception of emotions from BM and its relationship with ASD-related social deficits. The current study thus introduced the pupil size measurement to this field, and we combined it with the passive viewing task to investigate the more automatic aspect of BM emotion processing. More importantly, in addition to individuals with diagnosed ASD, the non-clinical general population also manifests autistic tendencies that follow a normal distribution and demonstrate substantial heritability (Hoekstra et al., 2007). 
Here, we focused on the autistic tendencies in the general population, and our results showed that pupil modulations by BM emotions were indicative of individual autistic traits. Specifically, passively viewing the happy BMs evoked larger pupil responses than the sad BMs, while such emotional modulation diminished with the increase of autistic tendencies. More detailed test-retest examination further illustrated such a correlation was driven by the general diminishment in pupil modulation effects by emotional BM (happy or sad) for individuals with high autistic tendencies. This finding demonstrated that the automatic emotion processing of BM stimuli was impaired in individuals with high autistic tendencies, lending support to previous studies (Hubert et al., 2006; Nackaerts et al., 2012; Parron et al., 2008). This indicated the utility of emotional BM stimuli and pupil measurement in identifying ASD-related tendencies in both clinical and non-clinical populations. We have added these points to the revised text (see lines 347-375).

      References:

      Blake, R., Turner, L. M., Smoski, M. J., Pozdol, S. L., & Stone, W. L. (2003). Visual recognition of biological motion is impaired in children with autism. Psychological Science, 14(2), 151–157. https://doi.org/10.1111/1467-9280.01434

      Federici, A., Parma, V., Vicovaro, M., Radassao, L., Casartelli, L., & Ronconi, L. (2020). Anomalous perception of biological motion in autism: a conceptual review and meta-analysis. Scientific Reports, 10(1). https://doi.org/10.1038/s41598-020-61252-3

      Freitag, C. M., Konrad, C., Häberlen, M., Kleser, C., von Gontard, A., Reith, W., Troje, N. F., & Krick, C. (2008). Perception of biological motion in autism spectrum disorders. Neuropsychologia, 46(5), 1480–1494. https://doi.org/10.1016/j.neuropsychologia.2007.12.025

      Hoekstra, R. A., Bartels, M., Verweij, C. J. H., & Boomsma, D. I. (2007). Heritability of autistic traits in the general population. Archives of Pediatrics & Adolescent Medicine, 161(4), 372. https://doi.org/10.1001/archpedi.161.4.372

      Hubert, B., Wicker, B., Moore, D. G., Monfardini, E., Duverger, H., Fonséca, D. D., & Deruelle, C. (2006). Brief report: recognition of emotional and non-emotional biological motion in individuals with autistic spectrum disorders. Journal of Autism and Developmental Disorders, 37(7), 1386–1392. https://doi.org/10.1007/s10803-006-0275-y

      Klin, A., Lin, D. J., Gorrindo, P., Ramsay, G., & Jones, W. (2009). Two-year-olds with autism orient to non-social contingencies rather than biological motion. Nature, 459(7244), 257–261. https://doi.org/10.1038/nature07868

      Koldewyn, K., Whitney, D., & Rivera, S. M. (2009). The psychophysics of visual motion and global form processing in autism. Brain, 133(2), 599–610. https://doi.org/10.1093/brain/awp272

      Mazzoni, N., Ricciardelli, P., Actis-Grosso, R., & Venuti, P. (2021). Difficulties in recognising dynamic but not static emotional body movements in autism spectrum disorder. Journal of Autism and Developmental Disorders, 52(3), 1092–1105. https://doi.org/10.1007/s10803-021-05015-7

      Nackaerts, E., Wagemans, J., Helsen, W., Swinnen, S. P., Wenderoth, N., & Alaerts, K. (2012). Recognizing biological motion and emotions from point-light displays in autism spectrum disorders. PLoS ONE, 7(9), e44473. https://doi.org/10.1371/journal.pone.0044473

      Parron, C., Da Fonseca, D., Santos, A., Moore, D. G., Monfardini, E., & Deruelle, C. (2008). Recognition of biological motion in children with autistic spectrum disorders. Autism, 12(3), 261–274. https://doi.org/10.1177/1362361307089520

      Todorova, G. K., Hatton, R. E. M., & Pollick, F. E. (2019). Biological motion perception in autism spectrum disorder: a meta-analysis. Molecular Autism, 10(1). https://doi.org/10.1186/s13229-019-0299-8

      Reviewer #2 (Public Review):

      Summary:

      Through a series of four experiments, Yuan, Wang and Jiang examined pupil size responses to emotion signals in point-light motion stimuli. Experiment 1 examined upright happy, sad and neutral point-light biological motion (BM) walkers. The happy BM induced a significantly larger pupil response than the neutral, whereas the sad BM evoked a significantly smaller pupil size than the neutral BM. Experiment 2 examined inverted BM walkers. Experiment 3 examined BM stimuli with acceleration removed. No significant effects of emotion were found in either Experiment 2 or Experiment 3. Experiment 4 examined scrambled BM stimuli, in which local motion features were preserved while the global configuration was disrupted. Interestingly, the scrambled happy and sad BM led to significantly greater pupil size than the scrambled neutral BM at a relatively early time, while no significant difference between the scrambled happy and sad BM was found. Thus, the authors argue that these results suggest multi-level processing of emotions in life motion signals.

      Strengths:

      The experiments were carefully designed and well-executed, with point-light stimuli that eliminate many potential confounding effects of low-level visual features such as luminance, contrast, and spatial frequency.

      Weaknesses:

      Correlation results with limited sample size should be interpreted with extra caution.

      Thanks for pointing this out. To strengthen the correlation results, we have conducted a replication experiment (Exp.1b) and added a test-retest examination to further assess the reliability of our measurements. Specifically, a new group of 24 participants (16 females, 8 males) were recruited to perform the identical experiment procedure as in Experiment 1. Then, after at least seven days, they were asked to return to the lab for a retest. The results successfully replicated the previously reported main effect of emotional condition in both the first test (F(2, 46) = 12.0, p < .001, ηp2 = 0.34, Author response image 1A) and the second test (F(2, 46) = 14.8, p < .001, ηp2 = 0.39, Author response image 1B). The happy BM induced a significantly larger pupil response than the neutral BM (First Test: t(23) = 2.60, p = .022, Cohen’s d = 0.53, 95% CI for the mean difference = [0.02, 0.14], Holm-corrected, p = .048 after Bonferroni correction, Author response image 1A; Second Test: t(23) = 3.36, p = .005, Cohen’s d = 0.68, 95% CI for the mean difference = [0.06, 0.24], Holm-corrected, p = .008 after Bonferroni correction, Author response image 1B). On the contrary, the sad BM induced a significantly smaller pupil response than the neutral BM (First Test: t(23) = -2.77, p = .022, Cohen’s d = 0.57, 95% CI for the mean difference = [-0.19, -0.03], Holm-corrected, p = .033 after Bonferroni correction; Second Test: t(23) = -3.19, p = .005, Cohen’s d = 0.65, 95% CI for the mean difference = [-0.24, -0.05], Holm-corrected, p = .012 after Bonferroni correction, Author response image 1B). 
Besides, the happy BM induced significantly larger pupil response than the sad BM (first test: t(23) = 4.23, p < .001, Cohen’s d = 0.86, 95% CI for the mean difference = [0.10, 0.28], Holm-corrected, p < .001 after Bonferroni correction, Author response image 1A; second test: t(23) = 4.26, p < .001, Cohen’s d = 0.87, 95% CI for the mean difference = [0.15, 0.44], Holm-corrected, p < .001 after Bonferroni correction, Author response image 1B). The results of the cluster-based permutation analysis were also similar (see Supplementary Material for more details).
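
      The relationship between the reported test statistics, effect sizes, and corrected p-values can be illustrated with a short sketch. It is not the analysis code used in the study. It assumes that the reported Cohen's d is the paired-samples form d_z = t / sqrt(n) with n = 24, that partial eta squared is derived from F and its degrees of freedom, and that three pairwise comparisons were corrected together. The raw p-values below are approximations obtained by dividing the reported Bonferroni-adjusted values by three (with .001 used as a placeholder for "p < .001"); all function names are hypothetical.

```python
import math


def cohens_dz(t_value, n):
    """Paired-samples effect size, d_z = t / sqrt(n)."""
    return t_value / math.sqrt(n)


def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


if __name__ == "__main__":
    # First test: happy vs. neutral, sad vs. neutral, happy vs. sad.
    # Approximate raw p-values (assumed, see lead-in).
    raw_p = [0.016, 0.011, 0.001]
    print("d_z (happy vs. neutral):", round(cohens_dz(2.60, 24), 2))
    print("partial eta squared:", round(partial_eta_squared(12.0, 2, 46), 2))
    print("Holm:", [round(p, 3) for p in holm(raw_p)])
    print("Bonferroni:", [round(p, 3) for p in bonferroni(raw_p)])
```

      Under these assumptions, the Holm adjustment assigns the same corrected value to the two smaller effects because of the monotonicity step, which is why two comparisons can share one Holm-corrected p-value while differing under Bonferroni.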

      Author response image 1.

      Normalized mean pupil responses in the replication experiment (Experiment 1b) of Experiment 1a and its retest, using the neutral condition as baseline, plotted against happy and sad conditions. (A) In the first test, the group average pupil response to happy intact BM is significantly larger than that to sad and neutral BM, while the pupil response induced by sad BM is significantly smaller than that evoked by neutral BM, replicating the results of Experiment 1a. (B) Moreover, such results were similarly found in the second test.

      Notably, we successfully replicated the negative correlation between the happy over sad dilation effect and individual autistic traits in the first test (r(23) = -0.46, p = .023, 95% CI = [-0.73, -0.07], Author response image 2A). No other significant correlations were found (see Author response image 2B-C). Moreover, in the second test, such a correlation was similarly found and was even stronger (r(23) = -0.61, p = .002, 95% CI = [-0.81, -0.27], Author response image 2D). We have also performed a test-retest reliability analysis on the happy over sad pupil dilation effect and the AQ score. The results showed robust correlations. See Author response table 1 for more details.

      Author response table 1.

      Reliability of pupil size and AQ indices.

      Importantly, in the second test, we also observed a significant negative correlation between AQ and the happy minus neutral pupil dilation effect (r(23) = -0.44, p = .032, 95% CI = [-0.72, -0.04], Author response image 2E), and a significant positive correlation between the sad minus neutral pupil size and AQ (r(23) = 0.50, p = .014, 95% CI = [0.12, 0.75], Author response image 2F). This indicated that the overall correlation between the happy over sad dilation effect and AQ was driven by both the diminished happy dilation effect and the sad constriction effect. Overall, our replication experiment consistently found a significant negative correlation between AQ and the happy over sad dilation effect in both the test and the retest. Moreover, it revealed that this effect reflected both a negative correlation between AQ and the happy-neutral pupil response and a positive correlation between AQ and the sad-neutral pupil response, demonstrating a general impairment in BM emotion perception (happy or sad) for individuals with high autistic tendencies. This also indicated the utility of adopting a test-retest pupil examination to more precisely detect individual autistic tendencies. We have added these points in the revised text (see lines 135-173, lines 178-180).
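
      As a hedged illustration of how a confidence interval for a correlation coefficient can be obtained, the sketch below uses the Fisher z-transformation. It is an assumption that this standard method matches the intervals reported above; the function name is hypothetical, and n = 24 is taken from the sample size stated for the replication experiment.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, confidence=0.95):
    """Confidence interval for a Pearson correlation via the Fisher z-transformation."""
    z = math.atanh(r)                    # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)          # standard error of z
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Back-transform the interval bounds to the correlation scale.
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


if __name__ == "__main__":
    for r in (-0.46, -0.61, -0.44, 0.50):
        lower, upper = pearson_ci(r, n=24)
        print(f"r = {r:+.2f}: 95% CI = [{lower:.2f}, {upper:.2f}]")
```

      Because the transformation is nonlinear, the resulting interval is asymmetric around r, which is expected for correlations far from zero in small samples.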

      Author response image 2.

      Correlation results for pupil modulation effects and AQ scores in the replication experiment (Experiment 1b) of Experiment 1a and its retest. (A) We replicated the negative correlation between the happy over sad pupil dilation effect and AQ in the first test. (B-C) No other significant correlations were found. (D) In the second test, the negative correlation between the happy over sad pupil dilation effect and AQ was similarly observed and even stronger. (E-F) Moreover, the happy vs. neutral pupil dilation effect and the sad vs. neutral pupil constriction effect respectively correlate with AQ in the second test.

      It would be helpful to add discussions as a context to compare the current results with pupil size reactions to emotion signals in picture stimuli.

      Thanks for this thoughtful comment. The modulation of emotional information on pupil responses has been mostly investigated using picture stimuli. Bradley et al. (2008) first demonstrated that humans showed larger pupil responses towards emotional images as compared to neutral images, while no difference was observed between the positive and negative images. This was regarded as the result of increased sympathetic activity induced by emotional arousal that is independent of the emotional valence. Similar results have been replicated with different presentation durations, repetition settings, and tasks (Bradley & Lang, 2015; Snowden et al., 2016). However, the emotional stimuli adopted in these studies were mostly complicated scene images that conveyed rather general emotional information. When it comes to the specific emotion cues (e.g., fear, anger, happy, sad) delivered by our conspecifics through biologically salient signals (e.g., faces, gestures, voices), the results are mixed. Some studies demonstrated that fearful, disgusted, and angry static faces induced larger pupil sizes than the neutral face, while sad and happy faces failed to induce such pupil dilatory effects (Burley et al., 2017). In contrast, other studies observed larger pupil responses for happy faces as compared to sad and fearful faces (Aktar et al., 2018; Burley & Daughters, 2020; Jessen et al., 2016). These conflicting results could be due to the low-level confounds of emotional faces (e.g., eye size) (Carsten et al., 2019; Harrison et al., 2006). Similar to faces, BM also conveys salient cues concerning the emotional states of our interactive partners. However, BM stimuli are highly simplified and free of various irrelevant visual confounders (e.g., body shape). Here, we reported that the happy BM induced a stronger pupil response than the neutral and sad BM, lending support to the happy dilation effect observed with faces (Burley & Daughters, 2020; Prunty et al., 2021). 
Moreover, it helps ameliorate the concern regarding the low-level confounding factors by identifying similar pupil modulations in another type of social signal with distinctive perceptual features. We have added these points to the revised text (see lines 301-321).

      References:

      Aktar, E., Mandell, D. J., de Vente, W., Majdandžić, M., Oort, F. J., van Renswoude, D. R., Raijmakers, M. E. J., & Bögels, S. M. (2018). Parental negative emotions are related to behavioral and pupillary correlates of infants’ attention to facial expressions of emotion. Infant Behavior and Development, 53, 101–111. https://doi.org/10.1016/j.infbeh.2018.07.004

      Bradley, M. M., & Lang, P. J. (2015). Memory, emotion, and pupil diameter: repetition of natural scenes. Psychophysiology, 52(9), 1186–1193. https://doi.org/10.1111/psyp.12442

      Bradley, M. M., Miccoli, L., Escrig, M. A., & Lang, P. J. (2008). The pupil as a measure of emotional arousal and autonomic activation. Psychophysiology, 45(4), 602–607. https://doi.org/10.1111/j.1469-8986.2008.00654.x

      Burley, D. T., & Daughters, K. (2020). The effect of oxytocin on pupil response to naturalistic dynamic facial expressions. Hormones and Behavior, 125, 104837. https://doi.org/10.1016/j.yhbeh.2020.104837

      Burley, D. T., Gray, N. S., & Snowden, R. J. (2017). As far as the eye can see: relationship between psychopathic traits and pupil response to affective stimuli. PLOS ONE, 12(1), e0167436. https://doi.org/10.1371/journal.pone.0167436

      Carsten, T., Desmet, C., Krebs, R. M., & Brass, M. (2019). Pupillary contagion is independent of the emotional expression of the face. Emotion, 19(8), 1343–1352. https://doi.org/10.1037/emo0000503

      Harrison, N. A., Singer, T., Rotshtein, P., Dolan, R. J., & Critchley, H. D. (2006). Pupillary contagion: central mechanisms engaged in sadness processing. Social Cognitive and Affective Neuroscience, 1(1), 5–17. https://doi.org/10.1093/scan/nsl006

      Jessen, S., Altvater-Mackensen, N., & Grossmann, T. (2016). Pupillary responses reveal infants’ discrimination of facial emotions independent of conscious perception. Cognition, 150, 163–169. https://doi.org/10.1016/j.cognition.2016.02.010

      Prunty, J. E., Keemink, J. R., & Kelly, D. J. (2021). Infants show pupil dilatory responses to happy and angry facial expressions. Developmental Science, 25(2). https://doi.org/10.1111/desc.13182

      Snowden, R. J., O’Farrell, K. R., Burley, D., Erichsen, J. T., Newton, N. V., & Gray, N. S. (2016). The pupil’s response to affective pictures: role of image duration, habituation, and viewing mode. Psychophysiology, 53(8), 1217–1223. https://doi.org/10.1111/psyp.12668

      Overall, I think this is a well-written paper with solid experimental results that support the claim of the authors, i.e., the human visual system may process emotional information in biological motion at multiple levels. Given the key role of emotion processing in normal social cognition, the results will be of interest not only to basic scientists who study visual perception, but also to clinical researchers who work with patients of social cognitive disorders. In addition, this paper suggests that examining pupil size responses could be a very useful methodological tool to study brain mechanisms underlying emotion processing.

      Reviewer #3 (Public Review):

      Summary:

      The overarching goal of the authors was to understand whether emotional information conveyed through point-light biological motion can trigger automatic physiological responses, as reflected in pupil size.

      Strengths:

      This manuscript has several noticeable strengths: it addresses an intriguing research question that fills a gap in the existing literature, offers a clear and accurate presentation of the current literature, and conducts a series of experiments and control experiments with adequate sample size. Yet, it also entails several noticeable limitations - especially in the study design and statistical analyses.

      Weaknesses:

      (1) Study design:

      (1.1) Dependent variable:

      Emotional attention is known to modulate both microsaccades and pupil size. Given the existing pupillometry data that the authors have collected, it would be both possible and valuable to determine whether the rate of microsaccades is also influenced by emotional biological motion.

      We thank the reviewer for this advice. Microsaccades function as a mechanism to maintain visibility by continuously shifting the retinal image to overcome visual adaptation (Martinez-Conde et al., 2006). Moreover, they have been found to be sensitive to attention processes (Baumeler et al., 2020; Engbert & Kliegl, 2003b; Meyberg et al., 2017), and could reflect the activity of the superior colliculus (SC) and other related brain areas (Martinez-Conde et al., 2009, 2013). Previous studies have found that, compared with neutral and pleasant images, unpleasant images significantly inhibit early microsaccade rates (Kashihara, 2020; Kashihara et al., 2013). This is regarded as the result of retaining previous crucial information at the sacrifice of updating new visual input. We agree with the reviewer that it would be valuable to investigate whether emotional information conveyed by BM could modulate microsaccades. However, it should be noted that our data collection and experimental design are not optimized for this purpose. This is because we recorded only the left eye’s data, while many methodological studies have questioned the reliability of using only one eye’s data to analyze microsaccades (Fang et al., 2018; Hauperich et al., 2020; Nyström et al., 2017) and suggested that microsaccades should be defined by spontaneous binocular eye movement (Engbert & Kliegl, 2003a, 2003b). Besides, according to Kashihara et al. (2013), participants showed differential microsaccade rates after the stimuli disappeared so as to maintain the previously observed different emotional information. However, in the current study, we discarded the data after the stimuli disappeared, making it impossible to analyze microsaccades in that period. Despite these disadvantages, we have attempted to analyze the microsaccade rate during the stimuli presentation using only the left eye’s data. Specifically, we applied the algorithm developed by Otero-Millan et al. 
(2014) (minimum duration = 6 ms, maximum amplitude = 1.5 degrees, maximum velocity = 150 degrees/sec) to the left eye’s data from 100 ms before to 4000 ms after stimulus onset. Subsequently, we calculated the microsaccade rates using a moving window of 100 ms (stepped in 1 ms) (Engbert & Kliegl, 2003b; Kashihara et al., 2013). The microsaccade rate displayed a typical curve, with suppression shortly after stimulus appearance (inhibition phase), followed by an increased rate of microsaccade occurrence (rebound phase). The cluster-based permutation analysis was then applied to explore the modulation of BM emotions on microsaccade rates. However, no significant differences among different emotional conditions (happy, sad, neutral) were found for the four experiments.
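
      The moving-window rate computation described above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes microsaccade onsets have already been detected and are given in milliseconds relative to stimulus onset, it pools events across trials, and it uses a half-open window centered on each time point. The function and parameter names are hypothetical.

```python
from bisect import bisect_left


def microsaccade_rate(trials, t_start=-100, t_end=4000, window_ms=100, step_ms=1):
    """Return (window_centers_ms, rates_hz) averaged over trials.

    trials: list of lists, each holding microsaccade onset times (ms) for one trial.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    # Pool and sort all onsets so each window can be counted by binary search.
    onsets = sorted(t for trial in trials for t in trial)
    half = window_ms // 2
    window_s = window_ms / 1000.0
    centers, rates = [], []
    for center in range(t_start + half, t_end - half + 1, step_ms):
        # Count onsets in the half-open interval [center - half, center + half).
        first = bisect_left(onsets, center - half)
        last = bisect_left(onsets, center + half)
        count = last - first
        centers.append(center)
        # Events per second, averaged across trials.
        rates.append(count / (len(trials) * window_s))
    return centers, rates


if __name__ == "__main__":
    # Two toy trials with made-up onset times, for illustration only.
    example_trials = [[200, 230], [1000]]
    centers, rates = microsaccade_rate(example_trials)
    print(len(centers), "windows; peak rate:", max(rates), "Hz")
```

      Restricting window centers to positions where the full window fits inside the analyzed epoch avoids underestimating the rate at the edges, at the cost of losing half a window at each end.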

      Author response image 3.

      Time-series change in the microsaccade rates to happy, sad, and neutral BM in Experiments 1-4. Solid lines represent microsaccade rates under each emotional condition as a function of time (happy: red; sad: blue; neutral: gray); shaded areas represent the SEM between participants. No significant differences were found after cluster-based permutation correction for the four experiments.

      It is important to note that the microsaccade rate analysis was conducted on only the left eye’s data and that the experiment design is not optimized for this analysis; thus, extra caution should be exercised in interpreting the results. Still, we consider it innovative and important to combine the microsaccade index with the pupil size to holistically investigate the processing of emotional information in BM, and future studies adopting more suitable recording techniques and experiment designs are needed to further probe this issue. We have discussed this issue in the revised text (see lines 339-344).

      References:

      Baumeler, D., Schönhammer, J. G., & Born, S. (2020). Microsaccade dynamics in the attentional repulsion effect. Vision Research, 170, 46–52. https://doi.org/10.1016/j.visres.2020.03.009

      Engbert, R., & Kliegl, R. (2003a). Binocular coordination in microsaccades. In The Mind’s Eye (pp. 103–117). Elsevier. https://doi.org/10.1016/b978-044451020-4/50007-4

      Engbert, R., & Kliegl, R. (2003b). Microsaccades uncover the orientation of covert attention. Vision Research, 43(9), 1035–1045. https://doi.org/10.1016/s0042-6989(03)00084-1

      Fang, Y., Gill, C., Poletti, M., & Rucci, M. (2018). Monocular microsaccades: do they really occur? Journal of Vision, 18(3), 18. https://doi.org/10.1167/18.3.18

      Hauperich, A.-K., Young, L. K., & Smithson, H. E. (2020). What makes a microsaccade? a review of 70 years research prompts a new detection method. Journal of Eye Movement Research, 12(6). https://doi.org/10.16910/jemr.12.6.13

      Kashihara, K. (2020). Microsaccadic modulation evoked by emotional events. Journal of Physiological Anthropology, 39(1). https://doi.org/10.1186/s40101-020-00238-6

      Kashihara, K., Okanoya, K., & Kawai, N. (2013). Emotional attention modulates microsaccadic rate and direction. Psychological Research, 78(2), 166–179. https://doi.org/10.1007/s00426-013-0490-z

      Martinez-Conde, S., Macknik, S. L., Troncoso, X. G., & Dyar, T. A. (2006). Microsaccades counteract visual fading during fixation. Neuron, 49(2), 297–305. https://doi.org/10.1016/j.neuron.2005.11.033

      Martinez-Conde, S., Macknik, S. L., Troncoso, X. G., & Hubel, D. H. (2009). Microsaccades: a neurophysiological analysis. Trends in Neurosciences, 32(9), 463–475. https://doi.org/10.1016/j.tins.2009.05.006

      Martinez-Conde, S., Otero-Millan, J., & Macknik, S. L. (2013). The impact of microsaccades on vision: towards a unified theory of saccadic function. Nature Reviews Neuroscience, 14(2), 83–96. https://doi.org/10.1038/nrn3405

      Meyberg, S., Sinn, P., Engbert, R., & Sommer, W. (2017). Revising the link between microsaccades and the spatial cueing of voluntary attention. Vision Research, 133, 47–60. https://doi.org/10.1016/j.visres.2017.01.001

      Nyström, M., Andersson, R., Niehorster, D. C., & Hooge, I. (2017). Searching for monocular microsaccades – a red herring of modern eye trackers? Vision Research, 140, 44–54. https://doi.org/10.1016/j.visres.2017.07.012

      Otero-Millan, J., Castro, J. L. A., Macknik, S. L., & Martinez-Conde, S. (2014). Unsupervised clustering method to detect microsaccades. Journal of Vision, 14(2), 18–18. https://doi.org/10.1167/14.2.18

      (1.2) Stimuli:

      It appears that the speed of the emotional biological motion stimuli mimics the natural pace of the emotional walker. What is the average velocity of the biological motion stimuli for each condition?

      Thanks for pointing out this issue. The neutral and emotional (sad or happy) BM stimuli are equal in walking speed (one step per second, 1 Hz). We have also computed their physical velocity by calculating the Euclidean distance in pixel space of each key point between adjacent frames (Poyo Solanas et al., 2020). The velocity was 5.76 pixels/frame for the happy BM, 4.14 pixels/frame for the neutral BM, and 3.21 pixels/frame for the sad BM. This difference in velocity profile is considered an important signature for conveying emotional information, as the happy walker is characterized by a larger step pace and longer arm swing, whereas the sad walker exhibits a slouching gait with short, slow strides and smaller arm movements (Barliya et al., 2012; Chouchourelou et al., 2006; Halovic & Kroos, 2018; Roether et al., 2009). More importantly, our current results cannot be explained by the differences in velocity. This is because the inverted emotional BM with identical velocity characteristics failed to induce any modulations on pupil responses. Furthermore, the local sad and happy BM differed the most in velocity, while they induced similar modulations on pupil sizes. We have added these points in the revised text (see lines 254-257, 484-491).
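
      The velocity measure described above (Euclidean distance of each key point between adjacent frames) can be sketched as follows. This is a minimal illustration, assuming that displacements are averaged over all key points and all adjacent frame pairs; the function name `mean_keypoint_velocity` and the input format are hypothetical, and the exact aggregation used in the study is not specified here.

```python
import math


def mean_keypoint_velocity(frames):
    """Mean key-point displacement in pixels/frame.

    frames: list of frames, each a list of (x, y) pixel coordinates with
    the same key-point order in every frame (assumed input format).
    Displacements are averaged over key points and adjacent frame pairs.
    """
    if len(frames) < 2:
        raise ValueError("at least two frames are required")
    total, n = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        if len(prev) != len(curr):
            raise ValueError("every frame must contain the same key points")
        for (x0, y0), (x1, y1) in zip(prev, curr):
            # Euclidean distance travelled by this key point in one frame.
            total += math.hypot(x1 - x0, y1 - y0)
            n += 1
    return total / n


# Hypothetical walker with two key points over three frames.
example_frames = [
    [(0, 0), (10, 10)],
    [(3, 4), (10, 10)],
    [(6, 8), (10, 13)],
]
velocity = mean_keypoint_velocity(example_frames)
```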

      References:

      Barliya, A., Omlor, L., Giese, M. A., Berthoz, A., & Flash, T. (2012). Expression of emotion in the kinematics of locomotion. Experimental Brain Research, 225(2), 159–176. https://doi.org/10.1007/s00221-012-3357-4

      Chouchourelou, A., Matsuka, T., Harber, K., & Shiffrar, M. (2006). The visual analysis of emotional actions. Social Neuroscience, 1(1), 63–74. https://doi.org/10.1080/17470910600630599

      Halovic, S., & Kroos, C. (2018). Not all is noticed: kinematic cues of emotion-specific gait. Human Movement Science, 57, 478–488. https://doi.org/10.1016/j.humov.2017.11.008

      Poyo Solanas, M., Vaessen, M. J., & de Gelder, B. (2020). The role of computational and subjective features in emotional body expressions. Scientific Reports, 10(1). https://doi.org/10.1038/s41598-020-63125-1

      Roether, C. L., Omlor, L., Christensen, A., & Giese, M. A. (2009). Critical features for the perception of emotion from gait. Journal of Vision, 9(6), 15–15. https://doi.org/10.1167/9.6.15

      When the authors used inverted biological motion stimuli, they didn't observe any modulation in pupil size. Could there be a difference in microsaccades when comparing inverted emotional biological motion stimuli?

      Thanks for this consideration. Both microsaccades and pupil size can provide valuable insights into the underlying neural dynamics of attention and cognitive control (Baumeler et al., 2020; Engbert & Kliegl, 2003; Meyberg et al., 2017). Notably, previous studies have shown that microsaccades and pupil size can be similar and highly correlated in reflecting various cognitive processes, such as multisensory integration, inhibitory control, and cognitive load (Krejtz et al., 2018; Wang et al., 2017; Wang & Munoz, 2021). Moreover, the generation of both microsaccades and pupil responses involves shared neural circuits, including the midbrain structure superior colliculus (SC) and the noradrenergic system (Hafed et al., 2009; Hafed & Krauzlis, 2012; Wang et al., 2012). However, pupil size could be more sensitive than microsaccade rates in contexts such as affective priming (Krejtz et al., 2020) and decision formation (Strauch et al., 2018). In addition, numerous previous studies have shown that inversion significantly disrupts the perception of emotions from BM (Atkinson et al., 2007; Dittrich et al., 1996; Spencer et al., 2016; Yuan et al., 2022, 2023). Overall, microsaccade rates are unlikely to show significant differences when comparing inverted emotional biological motion stimuli. Besides, we have attempted to analyze the microsaccade rate in the inverted BM situation, and our results showed no significant differences (see also Point 1.1, Author response image 3). Still, future studies are needed that combine the microsaccade index and pupil size to provide a thorough understanding of BM emotion processing. We have discussed this issue in the revised text (see lines 339-344).

      References:

      Atkinson, A. P., Tunstall, M. L., & Dittrich, W. H. (2007). Evidence for distinct contributions of form and motion information to the recognition of emotions from body gestures. Cognition, 104(1), 59–72. https://doi.org/10.1016/j.cognition.2006.05.005

      Baumeler, D., Schönhammer, J. G., & Born, S. (2020). Microsaccade dynamics in the attentional repulsion effect. Vision Research, 170, 46–52. https://doi.org/10.1016/j.visres.2020.03.009

      Dittrich, W., Troscianko, T., Lea, S., & Morgan, D. (1996). Perception of emotion from dynamic point-light displays represented in dance. Perception, 25(6), 727–738. https://doi.org/10.1068/p250727

      Engbert, R., & Kliegl, R. (2003). Microsaccades uncover the orientation of covert attention. Vision Research, 43(9), 1035–1045. https://doi.org/10.1016/s0042-6989(03)00084-1

      Hafed, Z. M., Goffart, L., & Krauzlis, R. J. (2009). A neural mechanism for microsaccade generation in the primate superior colliculus. Science, 323(5916), 940–943. https://doi.org/10.1126/science.1166112

      Hafed, Z. M., & Krauzlis, R. J. (2012). Similarity of superior colliculus involvement in microsaccade and saccade generation. Journal of neurophysiology, 107(7), 1904-1916.

      Krejtz, K., Duchowski, A. T., Niedzielska, A., Biele, C., & Krejtz, I. (2018). Eye tracking cognitive load using pupil diameter and microsaccades with fixed gaze. Plos One, 13(9), e0203629. https://doi.org/10.1371/journal.pone.0203629

      Krejtz, K., Żurawska, J., Duchowski, A., & Wichary, S. (2020). Pupillary and microsaccadic responses to cognitive effort and emotional arousal during complex decision making. Journal of Eye Movement Research, 13(5). https://doi.org/10.16910/jemr.13.5.2

      Meyberg, S., Sinn, P., Engbert, R., & Sommer, W. (2017). Revising the link between microsaccades and the spatial cueing of voluntary attention. Vision Research, 133, 47–60. https://doi.org/10.1016/j.visres.2017.01.001

      Spencer, J. M. Y., Sekuler, A. B., Bennett, P. J., Giese, M. A., & Pilz, K. S. (2016). Effects of aging on identifying emotions conveyed by point-light walkers. Psychology and Aging, 31(1), 126–138. https://doi.org/10.1037/a0040009

      Strauch, C., Greiter, L., & Huckauf, A. (2018). Pupil dilation but not microsaccade rate robustly reveals decision formation. Scientific Reports, 8(1). https://doi.org/10.1038/s41598-018-31551-x

      Wang, C.-A., Blohm, G., Huang, J., Boehnke, S. E., & Munoz, D. P. (2017). Multisensory integration in orienting behavior: pupil size, microsaccades, and saccades. Biological Psychology, 129, 36–44. https://doi.org/10.1016/j.biopsycho.2017.07.024

      Wang, C.-A., Boehnke, S. E., White, B. J., & Munoz, D. P. (2012). Microstimulation of the monkey superior colliculus induces pupil dilation without evoking saccades. Journal of Neuroscience, 32(11), 3629–3636. https://doi.org/10.1523/jneurosci.5512-11.2012

      Wang, C.-A., & Munoz, D. P. (2021). Differentiating global luminance, arousal and cognitive signals on pupil size and microsaccades. European Journal of Neuroscience, 54(10), 7560–7574. https://doi.org/10.1111/ejn.15508

      Yuan, T., Ji, H., Wang, L., & Jiang, Y. (2022). Happy is stronger than sad: emotional information modulates social attention. Emotion. https://doi.org/10.1037/emo0001145

      Yuan, T., Wang, L., & Jiang, Y. (2023). Cross-channel adaptation reveals shared emotion representation from face and biological motion. In Emotion (p. In Press).

      (2) Statistical analyses

      (2.1) Multiple comparisons:

      There are many posthoc comparisons throughout the manuscript. The authors should consider correction for multiple comparisons. Take Experiment 1 for example, it is important to note that the happy over neutral BM effect and the sad over neutral BM effect are no longer significant after Bonferroni correction, which is worth noting.

      Thanks for this suggestion. In our original analysis, we applied the Holm post-hoc correction for multiple comparisons. The Holm correction is a step-down correction method that is more powerful and less conservative than the Bonferroni correction. We have now conducted the stricter Bonferroni post-hoc correction. In Experiment 1, the happy over neutral and happy over sad BM effects are still significant after the Bonferroni post-hoc correction (happy vs. neutral: p = .036; happy vs. sad: p = .009), and the sad over neutral comparison remains marginally significant after the Bonferroni post-hoc correction (p = .071). Importantly, the test-retest replication experiment also yielded significant results for the comparisons between happy and neutral (First Test: p = .022, Holm-corrected, p = .048, Bonferroni-corrected; Second Test: p = .005, Holm-corrected, p = .008, Bonferroni-corrected), sad and neutral (First Test: p = .022, Holm-corrected, p = .033, Bonferroni-corrected; Second Test: p = .005, Holm-corrected, p = .012, Bonferroni-corrected, Author response image 1B), and happy and sad BM (First Test: p < .001, Holm-corrected, p < .001, Bonferroni-corrected; Second Test: p < .001, Holm-corrected, p < .001, Bonferroni-corrected). These results provide support for the replicability and consistency of the reported significant contrasts. See also Point 2.3.

      In Experiment 4, the significance levels of all comparisons remained the same after Bonferroni post-hoc correction (happy vs. neutral: p = .011; sad vs. neutral: p = .007; happy vs. sad: p = 1.000). We have now added these results in the main text (See lines 119, 122, 124, 143, 145, 148, 150, 153, 155, 248, 251, 254).
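
      For readers less familiar with the two procedures, the difference between the Bonferroni and Holm adjustments can be sketched as below. This is a generic textbook illustration rather than the software routine used for the reported analyses; the function names and the uncorrected p-values in the example are hypothetical.

```python
def bonferroni(pvals):
    """Multiply every p-value by the number of comparisons (capped at 1)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Holm step-down adjustment.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone in rank order.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical uncorrected p-values for three pairwise comparisons.
raw = [0.01, 0.02, 0.04]
p_bonferroni = bonferroni(raw)  # each value multiplied by 3
p_holm = holm(raw)              # multipliers 3, 2, 1 with a running maximum
```

      Because the Holm multipliers never exceed the number of comparisons, a Holm-adjusted p-value is never larger than the corresponding Bonferroni-adjusted one, which is why a contrast can be significant under Holm but not under Bonferroni.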

      (2.2) The authors present the correlation between happy over sad dilation effect and the autistic traits in Experiment 1, but do not report such correlations in Experiments 2-4. Did the authors collect the Autistic Quotient measure in Experiments 2-4? It would be informative if the authors could demonstrate the reproducibility (or lack thereof) of this happy-sad index in Experiments 2-4.

      We apologize for not making this clear. We did collect AQ scores in Experiments 2-4. However, it should be pointed out that the happy over sad pupil dilation effect was only observed in Experiment 1. Moreover, we again observed this happy over sad pupil dilation effect, as well as its correlation with AQ, in the replication experiment (Experiment 1b). In contrast, no significant correlations between AQ and the happy-sad pupil index were found in Experiments 2-4; see Author response image 4 for more details. We have reported these correlations in the main text (see lines 157-173, 190-194, 212-216, 257-262).

      Author response image 4.

      Correlations between the happy over sad pupil dilation effect and AQ scores. (A)  The happy over sad pupil dilation effect correlated negatively with individual autistic scores. (B-C) Such correlation was similarly observed in the test and retest of the replication experiment. (D-F) No such correlations were found for the inverted, nonbiological, and local BM stimuli.

      (2.3) The observed correlation between happy over sad dilation effect and the autistic traits in Experiment 1 seems rather weak. It could be attributed to the poor reliability of the Autistic Quotient measure or the author-constructed happy-sad index. Did the authors examine the test-retest reliability of their tasks or the Autistic Quotient measure?

      Thanks for this suggestion. We have now conducted a test-retest replication study to further confirm the observed significant correlations. Specifically, we recruited a new group of 24 participants (16 females, 8 males) to perform the identical procedure as in Experiment 1, and they were asked to return to the lab for a retest after at least seven days. We’ve replicated the significant main effect of emotional conditions in both the first test (F(2, 46) = 12.0, p < .001, ηp2 = 0.34) and the second test (F(2, 46) = 14.8, p < .001, ηp2 = 0.39). Besides, we also replicated the happy minus neutral pupil dilation effect (First Test: t(23) = 2.60, p = .022, Cohen’s d = 0.53, 95% CI for the mean difference = [0.02, 0.14], Holm-corrected, p = .048 after Bonferroni correction; Second Test: t(23) = 3.36, p = .005, Cohen’s d = 0.68, 95% CI for the mean difference = [0.06, 0.24], Holm-corrected, p = .008 after Bonferroni correction), and the sad minus neutral pupil constriction effect (First Test: t(23) = -2.77, p = .022, Cohen’s d = 0.57, 95% CI for the mean difference = [-0.19, -0.03], Holm-corrected, p = .033 after Bonferroni correction; Second Test: t(23) = -3.19, p = .005, Cohen’s d = 0.65, 95% CI for the mean difference = [-0.24, -0.05], Holm-corrected, p = .012 after Bonferroni correction). Additionally, the happy BM still induced a significantly larger pupil response than the sad BM (first test: t(23) = 4.23, p < .001, Cohen’s d = 0.86, 95% CI for the mean difference = [0.10, 0.28], Holm-corrected, p < .001 after Bonferroni correction; second test: t(23) = 4.26, p < .001, Cohen’s d = 0.87, 95% CI for the mean difference = [0.15, 0.44], Holm-corrected, p < .001 after Bonferroni correction).

      Notably, we’ve successfully replicated the negative correlation between the happy over sad dilation effect and individual autistic traits (r(23) = -0.46, p = .023, 95% CI for the mean difference = [-0.73, -0.07]). Such a correlation was similarly found and was even stronger in the retest (r(23) = -0.61, p = .002, 95% CI for the mean difference = [-0.81, -0.27]). A test-retest reliability analysis was conducted on the happy over sad pupil dilation effect and the AQ score. The results showed robust correlations (r(happy-sad pupil size)= 0.56; r(AQ)= 0.90) and strong test-retest reliabilities (α(happy-sad pupil size)= 0.60; α(AQ)= 0.82). We have added these results to the main text (see lines 135-173). See also Response to Reviewer #2 Response 1 for more details.

      (2.4) Relatedly, the happy over sad dilation effect is essentially a subtraction index. Without separately presenting the pupil size correlation with happy and sad BM in supplemental figures, it becomes challenging to understand what's primarily driving the observed correlation.

      Thanks for pointing this out. We have now presented the separate correlations between AQ and the pupil response towards happy and sad BM in Experiment 1 (see Author response image 5A), and the test-retest replication experiment of Experiment 1 (see Author response image 5B-C). No significant correlations were found. This is potentially because the raw pupil response is a mixed result of BM perception and emotion perception, while the variations in pupil sizes across emotional conditions could more faithfully reflect individual sensitivities to emotions in BM (Burley et al., 2017; Pomè et al., 2020; Turi et al., 2018).  

      Author response image 5.

      No significant correlations between AQ and pupil response towards happy and sad intact BM were found in Experiment 1a and the test-retest replication experiment (Experiment 1b).

      To probe what's primarily driving the observed correlation between happy-sad pupil size and AQ, we instead used the neutral as the baseline and separately correlated AQ with the happy-neutral and the sad-neutral pupil modulation effects. No significant correlation was found in Experiment 1a (Author response image 6A-B) and the first test of the replication experiment (Experiment 1b) (Author response image 6C-D). Importantly, in the second test of the replication experiment, we found a significant negative correlation between AQ and the happy-neutral pupil size (r(23) = -0.44, p = .032, 95% CI for the mean difference = [-0.72, -0.04], Author response image 6E), and a significant positive correlation between AQ and the sad-neutral pupil size (r(23) = 0.50, p = .014, 95% CI for the mean difference = [0.12, 0.75], Author response image 6F). This suggested that the overall correlation between AQ and the happy over sad dilation effect was driven by diminished pupil modulations towards both the happy and sad BM for high AQ individuals, demonstrating a general deficiency in BM emotion perception (happy or sad) among individuals with high autistic tendencies. It further revealed the potential of adopting a test-retest pupil examination to more precisely detect individual autistic tendencies. We have reported these results in the main text (see lines 166-173).
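
      The logic of splitting the subtraction index can be made explicit: since (happy − sad) = (happy − neutral) − (sad − neutral), the covariance of AQ with the happy-sad index equals its covariance with the happy-neutral effect minus its covariance with the sad-neutral effect. The sketch below checks this identity numerically; the function names and all data values are hypothetical and are not taken from the study.

```python
def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def decompose_index_covariance(aq, happy, sad, neutral):
    """Return cov(AQ, happy-sad), cov(AQ, happy-neutral), cov(AQ, sad-neutral)."""
    happy_sad = [h - s for h, s in zip(happy, sad)]
    happy_neutral = [h - n for h, n in zip(happy, neutral)]
    sad_neutral = [s - n for s, n in zip(sad, neutral)]
    return (
        covariance(aq, happy_sad),
        covariance(aq, happy_neutral),
        covariance(aq, sad_neutral),
    )


# Hypothetical data for five participants (illustration only).
aq = [10, 15, 20, 25, 30]
happy = [0.30, 0.25, 0.20, 0.12, 0.10]
sad = [-0.20, -0.15, -0.12, -0.05, -0.02]
neutral = [0.02, 0.01, 0.03, 0.00, 0.02]

total, part_happy, part_sad = decompose_index_covariance(aq, happy, sad, neutral)
# total equals part_happy - part_sad up to floating-point error
```

      Under this identity, a negative AQ association with the happy-neutral effect and a positive one with the sad-neutral effect both push the happy-sad association in the negative direction.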

      Author response image 6.

      Correlation results for pupil modulations and AQ scores. (A-B) In Experiment 1a, no significant correlation was observed between AQ and the happy pupil modulation effect, or between AQ and the sad pupil modulation effect. (C-D) Similarly, no significant correlations were found in the first test of the replication experiment (Experiment 1b). (E-F) Importantly, in the second test of Experiment 1b, the happy vs. neutral pupil dilation effect was negatively correlated with AQ, and the sad vs. neutral pupil constriction effect was positively correlated with AQ.

      References:

      Burley, D. T., Gray, N. S., & Snowden, R. J. (2017). As Far as the Eye Can See: Relationship between Psychopathic Traits and Pupil Response to Affective Stimuli. PLOS ONE, 12(1), e0167436. https://doi.org/10.1371/journal.pone.0167436

      Pomè, A., Binda, P., Cicchini, G. M., & Burr, D. C. (2020). Pupillometry correlates of visual priming, and their dependency on autistic traits. Journal of vision, 20(3), 3-3.

      Turi, M., Burr, D. C., & Binda, P. (2018). Pupillometry reveals perceptual differences that are tightly linked to autistic traits in typical adults. eLife, 7. https://doi.org/10.7554/elife.32399

      (2.5) For the sake of transparency, it is important to report all findings, not just the positive results, throughout the paper.

      Thanks for this suggestion. We have now reported all the correlation results between AQ and pupil modulation effects (happy-sad, happy-neutral, sad-neutral) in the main text (see lines 130-131, 157-162, 166-170, 190-194, 212-216, 257-262). Given that no significant correlations were observed between AQ and the raw pupil responses across the four experiments, we reported their correlations with AQ in the supplementary material. We have stated this point in the main text (see lines 132-134).

      (3) Structure

      (3.1) The Results section immediately proceeds to the one-way repeated measures ANOVA. This section could be more reader-friendly by including a brief overview of the task procedures and variables, e.g., shifting Fig. 3 to this section.

      Thanks for this advice. We have now added a brief overview of the task procedures and variables and we have also shifted the figure position (see lines 101-103).

      Reviewer #1 (Recommendations For The Authors):

      (1) I suggest that the authors first explain the task (i.e., Fig. 3) at the beginning of the results. And it seems more appropriate to show the time course figures (Fig. 2) before the bar plots (Fig. 1). If I understand correctly, the bar plots reflect the averaged data from the time course plots. Also, please clearly state the time window used to average the data. The results of the correlation analysis can be displayed in the last step.

      Thanks for this suggestion. We have now added a concise explanation of the task at the beginning of the results (see lines 101-103). We have also adjusted the figure positions and adjusted the order of our results according to the reviewer’s suggestion. The time window we used to average the data was from the onset of the stimuli until the end of the stimuli presentation. We have now clearly stated these issues in the revised text (see lines 111-112).

      (2) According to the above, I think a more reasonable arrangement should be Fig. 3, 2, and 1.

      Thanks for this suggestion. We have adjusted the figure positions accordingly.

      (3) Please include each subject's data points in the bar plots in Fig. 1.

      We have now presented each subject’s individual data point in the bar plot.

      (4) Lines 158-160 and 199-202 report interaction effects of the two-way ANOVA. This is good, but the direction of interaction effect should also be reported.

      We thank the reviewer for this suggestion. We have now reported the direction of the interaction effect. The significant interaction observed across Experiment 1 and Experiment 2 was mainly due to the diminished emotional modulation for inverted BM. The significant interaction across Experiment 1 and Experiment 3 was similarly caused by the lack of emotional modulation for nonbiological stimuli. The significant interaction across Experiment 1 and Experiment 4 could be primarily attributed to the disappearance of the pupil modulation difference between happy and sad local BM. We have specified these points in the revised text (see lines 198-199, 219-220, 267-269).

      Reviewer #3 (Recommendations For The Authors):

      (1) Number of experiments:

      As stated in the Methods section, this study seems to consist of five experiments (120/24=5) according to the description below. However, the current manuscript only reports findings from four of these experiments. Can the authors clarify on this matter?

      "A total of 120 participants (44 males, 76 females) ranging from 18 to 29 years old (M ± SD = 23.1 ± 2.5) were recruited, with 24 in each experiment."

      We apologize for not making it clear. This referred to a pure behavior explicit emotion classification experiment (N=24) that served as a prior test to confirm that the local BM stimuli conveyed recognizable emotional information. We have now more carefully stated this issue in the revised text, see lines 456-458.

      (2) Emotion processing mechanism of BM

      "Mechanism" is a very strong word, suggesting a causal relationship. In the setting of a passive viewing task that lacks any behavioral report, it is possible that the observed changes in pupil size could be epiphenomenal, rather than serving as the underlying mechanism.

      Thanks for this suggestion. We have now either changed “mechanism” into “phenomenon” or deleted it. We have also carefully discussed the potential implications for future studies to incorporate various behavioral, physiological, and neural indexes to yield more robust causal evidence and unveil the potential mechanism underlying the observed multi-level BM emotion processing phenomenon.

      (3) Data sharing

      The authors could improve their efforts in promoting data transparency to ensure a comprehensive view of the results. This implies sharing deidentified raw data instead of summary data in an Excel spreadsheet.

      Thanks for this suggestion. We have now uploaded the deidentified raw data. (https://doi.org/10.57760/sciencedb.psych.00125).

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This valuable work provides new insights into history-dependent biases in human perceptual decision-making. It provides compelling behavioral and MEG evidence that humans adapt their history-dependent biases to the correlation structure of uncertain sensory environments. Further neural data analyses would strengthen some of the findings, and the studied bias would be more accurately framed as a stimulus- or outcome-history bias than a choice-history bias because tested subjects are biased not by their previous choice, but by the previous feedback (indicating the category of the previous stimulus).

      Thank you for your constructive evaluation of our manuscript. We have followed your suggestion to frame the studied bias as ‘stimulus history bias’. We now use this term whenever referring to our current results. Please note that we instead use the generic term ‘history bias’ when referring to the history biases studied in the previous literature on this topic in general. This is because these biases were dependent on previous choice(s), previous stimuli, or previous outcomes, or combinations of some (or all) of these factors. We have also added several of your suggested neural data analyses so as to strengthen the support for our conclusions, and we have elaborated on the Introduction so as to clarify the gaps in the literature that our study aims to fill. Our revisions are detailed in our replies below. We also took the liberty to reply to some points in the Public Review, which we felt called for clarification of the main aims (and main contribution) of our study.

      Reviewer #1 (Public Review):

      This paper aims to study the effects of choice history on action-selective beta band signals in human MEG data during a sensory evidence accumulation task. It does so by placing participants in three different stochastic environments, where the outcome of each trial is either random, likely to repeat, or likely to alternate across trials. The authors provide good behavioural evidence that subjects have learnt these statistics (even though they are not explicitly told about them) and that they influence their decision-making, especially on the most difficult trials (low motion coherence). They then show that the primary effect of choice history on lateralised beta-band activity, which is well-established to be linked to evidence accumulation processes in decision-making, is on the slope of evidence accumulation rather than on the baseline level of lateralised beta.

      The strengths of the paper are that it is: (i) very well analysed, with compelling evidence in support of its primary conclusions; (ii) a well-designed study, allowing the authors to investigate the effects of choice history in different stochastic environments.

      Thank you for pointing out these strengths of our study.

      There are no major weaknesses to the study. On the other hand, investigating the effects of choice/outcome history on evidence integration is a fairly well-established problem in the field. As such, I think that this provides a valuable contribution to the field, rather than being a landmark study that will transform our understanding of the problem.

      Your evaluation of the significance of our work made us realize that we may have failed to bring across the main gaps in the literature that our current study aimed to fill. We have now unpacked this in our revised Introduction.

      Indeed, many previous studies have quantified history-dependent biases in perceptual choice. However, the vast majority of those studies used tasks without any correlation structure; only a handful of studies have quantified history biases in tasks entailing structured environments, as we have done here (Abrahamyan et al., 2016; Kim et al., 2017; Braun et al., 2018; Hermoso-Mendizabal et al., 2020). The focus on correlated environments matters from an ecological perspective, because (i) natural environments are commonly structured rather than random (a likely reason for history biases being so prevalent in the first place), and (ii) history biases that change flexibly with the environmental structure are a hallmark of adaptive behavior. Critically, the few previous studies that have used correlated environments and revealed flexible/adaptive history biases were purely behavioral. Ours is the first to characterize the neural correlates of adaptive history biases.

      Furthermore, although several previous studies have identified neural correlates of history biases in standard perceptual choice tasks in unstructured environments (see (Talluri et al., 2021) for a brief overview), most have focused on static representations of the bias in ongoing activity preceding the new decision; only a single monkey physiology study has tested for both a static bias in the pre-stimulus activity and a dynamic bias building up during evidence accumulation (Mochol et al., 2021). Ours is the first demonstration of a dynamic bias during evidence accumulation in the human brain.

      The authors have achieved their primary aims and I think that the results support their main conclusions. One outstanding question in the analysis is the extent to which the source-reconstructed patches in Figure 2 are truly independent of one another (as often there is 'leakage' from one source location into another, and many of the different ROIs have quite similar overall patterns of synchronisation/desynchronisation.).

      We do not assume (and nowhere state) that the different ROIs are “truly independent” of one another. In fact, patterns of task-related power modulations of neural activity would be expected to be correlated between many visual and action-related cortical areas even without leakage (due to neural signal correlations). So, one should not assume independence even for intracortically recorded local field potential data, fMRI data, or other data with minimal spatial leakage effects. That said, we agree that filter leakage will add a (trivial) component to the similarity of power modulations across ROIs, which can and should be quantified with the analysis you propose.

      A possible way to investigate this further would be to explore the correlation structure of the LCMV beamformer weights for these different patches, to ask how similar/dissimilar the spatial filters are for the different reconstructed patches.

      Thank you for suggesting this analysis, which provides a very useful context for interpreting the pattern of results shown in our Figure 2. We have now computed (Pearson) correlation coefficients of the LCMV beamformer weights across the regions of interest. The results are shown in the new Figure 2 – figure supplement 1. This analysis provided evidence for minor leakage between the source estimates for neighboring cortical regions (filter correlations ≤ 0.22 on average across subjects) and negligible leakage for more distant regions. We now clearly state this when referring to Figure 2.
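      For readers who wish to run a comparable check on their own data, the filter-correlation analysis described above can be sketched roughly as follows. This is a minimal illustration on simulated numbers, not the code used for the paper: the array layout (one filter vector per ROI), the function names, and the use of absolute values for the summary are all assumptions made for the example.

```python
import numpy as np


def filter_weight_correlations(weights):
    """Pairwise Pearson correlations between ROI spatial filters.

    weights: array of shape (n_rois, n_sensors); each row is one
    (hypothetical) ROI-level LCMV filter for a single subject.
    """
    weights = np.asarray(weights, dtype=float)
    # np.corrcoef treats each row as one variable -> (n_rois, n_rois) matrix.
    return np.corrcoef(weights)


def mean_abs_offdiagonal(corr):
    """Mean absolute correlation over all distinct ROI pairs (assumed summary)."""
    rows, cols = np.triu_indices_from(corr, k=1)
    return float(np.mean(np.abs(corr[rows, cols])))


# Simulated example: four ROIs, 200 sensors (illustrative sizes only).
rng = np.random.default_rng(0)
filters = rng.standard_normal((4, 200))
# Mimic leakage between two "neighboring" ROIs by mixing their filters.
filters[1] = 0.5 * filters[1] + 0.5 * filters[0]

corr = filter_weight_correlations(filters)
summary = mean_abs_offdiagonal(corr)
```

      In this toy example, the mixed pair of filters shows a clearly higher correlation than the unmixed pairs, which is the pattern one would read as leakage between neighboring regions.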

      That said, we would also like to clarify our reasoning behind Figure 2. Our common approach to these source-reconstructed MEG data is to focus on the differences, rather than the similarities between ROIs, because the differences cannot be accounted for by leakage. Our analyses show clearly distinct, and physiologically plausible functional profiles across ROIs (motion coherence encoding in visual regions, action choice coding in motor regions), in line with other work using our general approach (Wilming et al., 2020; Murphy et al., 2021; Urai and Donner, 2022).

      Most importantly, our current analyses focus on the impact of history bias on the build-up of action-selective activity in downstream, action-related areas; and we chose to focus on M1 only in order to avoid hard-to-interpret comparisons between neighboring action-related regions. Figure 2 is intended as a demonstration of the data quality (showing sensible signatures for all ROIs) and as a context for the interpretation of our main neural results from M1 shown in the subsequent figures. So, all our main conclusions are unaffected by leakage between ROIs.

      We have now clarified these points in the paper.

      Reviewer #2 (Public Review):

      In this work, the authors use computational modeling and human neurophysiology (MEG) to uncover behavioral and neural signatures of choice history biases during sequential perceptual decision-making. In line with previous work, they see neural signatures reflecting choice planning during perceptual evidence accumulation in motor-related regions, and further show that the rate of accumulation responds to structured, predictable environments suggesting that statistical learning of environment structure in decision-making can adaptively bias the rate of perceptual evidence accumulation via neural signatures of action planning. The data and evidence show subtle but clear effects, and are consistent with a large body of work on decision-making and action planning.

      Overall, the authors achieved what they set out to do in this nice study, and the results, while somewhat subtle in places, support the main conclusions. This work will have impact within the fields of decision-making and motor planning, linking statistical learning of structured sequential effects in sense data to evidence accumulation and action planning.

      Strengths:

      • The study is elegantly designed, and the methods are clear and generally state-of-the-art

      • The background leading up to the study is well described, and the study itself conjoins two bodies of work - the dynamics of action-planning processes during perceptual evidence accumulation, and the statistical learning of sequential structure in incoming sense data

      • Careful analyses effectively deal with potential confounds (e.g., baseline beta biases)

      Thank you for pointing out these strengths of our study.

      Weaknesses:

      • Much of the study is primarily a verification of what was expected based on previous behavioral work, with the main difference (if I'm not mistaken) being that subjects learn actual latent structure rather than expressing sequential biases in uniform random environments.

      As we have stated in our reply to the overall assessment above, we realize that we may have failed to clearly communicate the novelty of our current results, and we have revised our Introduction accordingly. It is true that most previous studies of history biases in perceptual choice have used standard tasks without across-trial correlation structure. Only a handful of studies have quantified history biases in tasks entailing structured environments that varied from one condition to the next (Abrahamyan et al., 2016; Kim et al., 2017; Braun et al., 2018; Hermoso-Mendizabal et al., 2020), and showed that history biases change flexibly with the environmental structure. Our current work adds to this emerging picture, using a specific task setting analogous to one of these previous studies done in rats (Hermoso-Mendizabal et al., 2020).

      Critically, all the previous studies that have revealed flexible/adaptive history biases in correlated environments were purely behavioral. Ours is the first to characterize the neural correlates of adaptive history biases. And it is also the very first demonstration of a dynamic history-dependent bias (i.e., one that gradually builds up during evidence accumulation) in the human brain.

      Whether this difference - between learning true structure or superstitiously applying it when it's not there - is significant at the behavioral or neural level is unclear. Did the authors have a hypothesis about this distinction? If the distinction is not relevant, is the main contribution here the neural effect?

      We are not quite sure what exactly you mean with “is significant”, so we will reply to two possible interpretations of this statement.

      The first is that you may be asking for evidence for any difference between the estimated history biases in the structured (i.e., Repetitive, Alternating) vs. the unstructured (i.e., Neutral) environments used in our experiment. We do, in fact, provide quantitative comparisons between the history biases in the structured and Neutral environments at the behavioral level. Figure 1D and Figure 1 – figure supplement 2A and accompanying text show a robust and statistically significant difference in history biases. Specifically, the previous stimulus weights differ between each of the biased environments and the Neutral environment, and the weights shifted in the expected, opposite directions for the two structured environments, indicating a tendency to repeat the previous stimulus category in Repetitive and vice versa in Alternating (Figure 1D). Going further, we also demonstrate that the adjustment of the history bias is behaviorally relevant in that it improves performance in the two structured environments, but not in the unstructured environment (Figure 1F and Figure 1 – figure supplement 2A and figure supplement 3).

      The second is that you refer to the question of whether the history biases are generated via different computations in structured vs. random environments. Indeed, this is a very interesting and important question. We cannot answer this question based on the available results, because we here used a statistical (i.e., descriptive) model. Addressing this question would require developing and fitting a generative model of the history bias and comparing the inferred latent learning processes between environments. This is something we are doing in ongoing work.

      • The key effects (Figure 4) are among the more statistically on-the-cusp effects in the paper, and the Alternating group in 4C did not reliably go in the expected direction. This is not a huge problem per se, but does make the key result seem less reliable given the clear reliability of the behavioral results

      The model-free analyses in Figures 3C and 4B, C from the original version of our manuscript were never intended to demonstrate the “key effects”, but only as a supplement to the results from the model-based analyses in Figures 3C and 4D, E in our current version of the manuscript. The latter show the “key effects” because they are a direct demonstration of the shaping of build-up of action-selective activity by history bias.

      To clarify this, we now decided to focus Figures 3 and 4 on the model-based analyses only. This decision was further supported by noticing a confound in our model-independent analyses in new control analyses prompted by Reviewer #3.

      Please note that the alternating bias in the Alternating environment is also less strong at the behavioral level compared to the bias in the Repetitive condition (see Figure 1D). A possible explanation is that a sequence of repetitive stimuli produces stronger prior expectations (for repetition) than an equally long sequence of alternating stimuli (Meyniel et al., 2016). This might also induce the bias to repeat the previous stimulus category in the Neutral condition (Figure 1D). Moreover, this intrinsic repetition bias might counteract the bias to alternate the previous stimulus category in Alternating.

      • The treatment of "awareness" of task structure in the study (via informal interviews in only a subsample of subjects) is wanting

      Agreed. We have now removed this statement from Discussion.

      Reviewer #3 (Public Review):

      This study examines how the correlation structure of a perceptual decision making task influences history biases in responding. By manipulating whether stimuli were more likely to be repetitive or alternating, they found evidence from both behavior and a neural signal of decision formation that history biases are flexibly adapted to the environment. On the whole, these findings are supported across an impressive range of detailed behavioral and neural analyses. The methods and data from this study will likely be of interest to cognitive neuroscience and psychology researchers. The results provide new insights into the mechanisms of perceptual decision making.

      The behavioral analyses are thorough and convincing, supported by a large number of experimental trials (~600 in each of 3 environmental contexts) in 38 participants. The psychometric curves provide clear evidence of adaptive history biases. The paper then goes on to model the effect of history biases at the single trial level, using an elegant cross-validation approach to perform model selection and fitting. The results support the idea that, with trial-by-trial accuracy feedback, the participants adjusted their history biases due to the previous stimulus category, depending on the task structure in a way that contributed to performance.

      Thank you for these nice words on our work.

      The paper then examines MEG signatures of decision formation, to try to identify neural signatures of these adaptive biases. Looking specifically at motor beta lateralization, they found no evidence that starting-level bias due to the previous trial differed depending on the task context. This suggests that the adaptive bias unfolds in the dynamic part of the decision process, rather than reflecting a starting level bias. The paper goes on to look at lateralization relative to the chosen hand as a proxy for a decision variable (DV), whose slope is shown to be influenced by these adaptive biases.

      This analysis of the buildup of action-selective motor cortical activity would be easier to interpret if its connection with the DV was more explicitly stated. The motor beta is lateralized relative to the chosen hand, as opposed to the correct response which might often be the case. It is therefore not obvious how the DV behaves in correct and error trials, which are combined together here for many of the analyses.

      We have now unpacked the connection of the action-selective motor cortical activity and decision variable in the manuscript, as follows:

      “This signal, referred to as ‘motor beta lateralization’ in the following, has been shown to exhibit hallmark signatures of the DV, specifically: (i) selectivity for choice and (ii) ramping slope that depends on evidence strength (Siegel et al., 2011; Murphy et al., 2021; O’Connell and Kelly, 2021).”

      Furthermore, we have added a figure of the time course of the motor beta lateralization separately for correct and error trials, locked to both stimulus onset and to motor response (Figure 2 – figure supplement 2). This signal reached statistical significance earlier for correct than error trials, and during the stimulus interval it ramped to a larger (i.e., more negative) amplitude for correct trials (Figure 2 – figure supplement 2, left). But the signal was indistinguishable in amplitude between correct and error trials around the time of the motor response (Figure 2 – figure supplement 2, right). This pattern matches what would be expected for a neural signature of the DV, because errors are more frequently made on weak-evidence trials than correct choices and because even for matched evidence strength, the DV builds up more slowly before error trials in accumulator models (Ratcliff and McKoon, 2008).

      --

      As you will see, all three reviewers found your work to provide valuable insights into history-dependent biases during perceptual decision-making. During consultation between reviewers, there was agreement that what is referred to as a choice-history bias in the current version of the manuscript should rather be framed as a stimulus- or outcome-history bias (despite the dominant use of the term 'choice-history' bias in the existing literature), and the reviewers pointed toward further analyses of the neural data which they thought would strengthen some of the claims made in the preprint. We hope that these comments will be useful if you wish to revise your preprint.

      We are pleased to hear that the reviewers think our work provides valuable insights into history-dependent biases in perceptual decision-making. We thank you for your thoughtful and constructive evaluation of our manuscript.

      We have followed your suggestion to frame the studied bias as ‘stimulus history bias’. We now use this term whenever referring to our current results. Please note that we instead use the generic term ‘history bias’ when referring to the history biases studied in the previous literature on this topic in general. This is because these biases were dependent on previous choice(s), previous stimuli, or previous outcomes, or combinations of some (or all) of these factors.

      We have also performed several of your suggested neural data analyses so as to strengthen the support for our conclusions.

      Reviewer #1 (Recommendations For The Authors):

      One suggestion is to explore the correlation structure of the LCMV beamformer weights for the regions of interest in the study, for the reasons outlined in my public review.

      Again, thank you for suggesting this analysis, which provides a very useful context for interpreting the pattern of results shown in our Figure 2. We have now computed (Pearson) correlation coefficients of the LCMV beamformer weights across the regions of interest. The results are shown in the new Figure 2 – figure supplement 1. This analysis provided evidence for minor leakage between the source estimates for neighboring cortical regions (filter correlations ≤ 0.22 on average across subjects) and negligible leakage for more distant regions. We now clearly state this when referring to Figure 2.

      That said, we would also like to clarify our reasoning behind Figure 2. Our common approach to these source-reconstructed MEG data is to focus on the differences, rather than the similarities between ROIs, because the differences cannot be accounted for by leakage. Our analyses show clearly distinct, and physiologically plausible functional profiles across ROIs (motion coherence encoding in visual regions, action choice coding in motor regions), in line with other work using our general approach (Wilming et al., 2020; Murphy et al., 2021; Urai and Donner, 2022).

      Most importantly, our current analyses focus on the impact of history bias on the build-up of action-selective activity in downstream, action-related areas; and we chose to focus on M1 only in order to avoid hard-to-interpret comparisons between neighboring action-related regions. Figure 2 is intended as a demonstration of the data quality (showing sensible signatures for all ROIs) and as a context for the interpretation of our main neural results from M1 shown in the subsequent figures. So, all our main conclusions are unaffected by leakage between ROIs.

      We have now also clarified these points in the paper.

      I also wondered if the authors had considered:

      (i) the extent to which the bias changes across time, as the transition probabilities are being learnt across the experiment? given that these are not being explicitly instructed to participants, is any modelling possible of how the transition structure is itself being learnt over time, and whether this makes predictions of either behaviour or neural signals?

      We refer to this point in the Discussion. The learning of the transition probabilities can and should be addressed. This requires generative models that capture the learning of the transition structure over time (Yu and Cohen, 2009; Meyniel et al., 2016; Glaze et al., 2018; Hermoso-Mendizabal et al., 2020).

      The fact that our current statistical modeling approach successfully captures the bias adjustment between environments implies that the learning must be sufficiently fast. Tracking this process explicitly would be an exciting and important endeavor for the future. We think it is beyond the scope of the present study focusing on the trial-by-trial effect of history bias (however generated) on the build-up of action-selective activity.

      (ii) neural responses at the time of choice outcome - given that so much of the paper is about the update of information in different statistical environments, it seems a shame that no analyses are included of feedback processing, how this differs across the different environments, and how might be linked to behavioural changes at the next trial.

      We agree that the neural responses to feedback are a very interesting topic. We are currently analyzing these in another ongoing project on (outcome) history bias in a foraging task. We will consider re-analyzing the feedback component of the current data set as part of this new study as well.

      However, this is distinct from the main question that is in the focus of our current paper – which, as elaborated above, is important to answer: whether and how adaptive history biases shape the dynamics of action-selective cortical activity in the human brain. While interesting and important, neural responses to feedback were not part of this question. So, we prefer to keep the focus of our paper on our original question.

      Reviewer #2 (Recommendations For The Authors):

      Minor:

      -pg. 7: "inconstant"

      -some citations (e.g., Barbosa 2020) are missing from the bibliography

      Thank you for pointing this out. We have fixed these.

      -figure S2 is very useful! could probably go in main text.

      We agree that this figure is important. But, after careful consideration, we decided to show it in the Supplement (now Figure 1 – figure supplement 2) for two reasons. First, we wanted to put the reader’s focus on the stimulus weights, because it is those weights that are flexibly adjusted to the statistics of the environment, rather than the choice weights, which seem less adaptive (i.e., stereotypical across environments) and idiosyncratic. Second, plotting the previous stimulus weights only enabled us to add the individual weights in the Neutral condition, which would have been too cluttered to add to figure S2.

      For these reasons, we feel that this Figure is more suitable for expert readers with a special interest in the details of the behavioral analyses and would be better placed in the Supplement. These readers will certainly be able to find and interpret that information in the Supplement.

      Reviewer #3 (Recommendations For The Authors):

      I would suggest that a more in depth description of the previous literature that explains exactly how the features of the lateralized beta--as it is formulated here-- reflect the decision variable would assist with the readers' understanding. A demonstration of how the lateralized beta behaves under different coherence conditions, or for corrects vs errors, for example, might be helpful for readers.

      We now provide a more detailed description of how/why the motor beta lateralization is a valid proxy of DV in the revised paper.

      We have demonstrated the dependence of the ramping of the motor beta lateralization on the motion coherence using a regression model with current signed motion coherence as well as single trial bias as regressors. The beta weights describing the impact of the signed motion coherence on the amplitude as well as on the slope of the motor beta lateralization are shown in Figure 4G (now 4E). As expected, stronger motion coherence induces a steeper downward slope of the motor beta lateralization.
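      To make the structure of this regression concrete, a minimal sketch is given below. It regresses a single-trial slope measure on signed motion coherence and a single-trial bias term using ordinary least squares on simulated data. The function name, the coherence levels, the simulated effect sizes and their signs are assumptions for illustration only; they are not the values or the code from our analysis.

```python
import numpy as np


def fit_slope_regression(slope, signed_coherence, single_trial_bias):
    """Ordinary least squares: slope ~ intercept + coherence + bias."""
    slope = np.asarray(slope, dtype=float)
    design = np.column_stack([
        np.ones_like(slope),                          # intercept
        np.asarray(signed_coherence, dtype=float),    # current evidence
        np.asarray(single_trial_bias, dtype=float),   # history-dependent bias
    ])
    beta, *_ = np.linalg.lstsq(design, slope, rcond=None)
    return {"intercept": beta[0], "coherence": beta[1], "bias": beta[2]}


# Simulated trials with made-up coherence levels and effect sizes.
rng = np.random.default_rng(1)
n_trials = 2000
signed_coherence = rng.choice([-0.8, -0.4, -0.2, 0.0, 0.2, 0.4, 0.8], size=n_trials)
single_trial_bias = rng.standard_normal(n_trials)
slope = (0.1 - 0.8 * signed_coherence - 0.5 * single_trial_bias
         + 0.1 * rng.standard_normal(n_trials))

weights = fit_slope_regression(slope, signed_coherence, single_trial_bias)
```

      Because both regressors enter the same model, the weight on the bias term reflects its contribution over and above that of the current evidence, which is the logic behind treating the model-based results as the most direct test.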

      Furthermore, we have added a figure of the time course of the motor beta lateralization separately for correct and error trials, locked to both stimulus onset and to motor response (Figure 2 – figure supplement 2). This signal reached statistical significance earlier for correct than error trials, and during the stimulus interval it ramped to a larger (i.e., more negative) amplitude for correct trials (Figure 2 – figure supplement 2, left). But the signal was indistinguishable in amplitude between correct and error trials around the time of the motor response (Figure 2 – figure supplement 2, right). This pattern matches what would be expected for a neural signature of the DV, because errors are more frequently made on weak-evidence trials than correct choices and because even for matched evidence strength, the DV builds up more slowly before error trials in accumulator models (Ratcliff and McKoon, 2008).

      Finally, please note that our previous studies have demonstrated that the time course of the beta lateralization during the trial closely tracks the time course of a normative model-derived DV (Murphy et al., 2021) and that the motor beta ramping slope is parametrically modulated by motion coherence (de Lange et al., 2013), which is perfectly in line with the current results.

      Along similar lines, around figures 3c and 4B, some control analyses may be helpful to clarify whether there are differences between the groups of responses consistent and inconsistent with the previous trial (e.g. correctness, coherence) that differ between environments, and also could influence the lateralized beta.

      Thank you for pointing us to this important control analysis. We have done this, and indeed, it identified accuracy and motion strength as possible confounds (Author response image 1). Specifically, proportion correct as well as motion coherence were larger for consistent vs. inconsistent conditions in Repetitive and vice versa in Alternating. Those differences in accuracy and coherence might indeed influence the slope of the motor beta lateralization that our model-free analysis had identified, rendering the resulting difference between consistent and inconsistent difficult to interpret unambiguously in terms of bias. Thus, we have decided to drop the consistency (i.e., model-independent) analysis and focus completely on the model-based analyses.

      Author response image 1.

      Proportion correct and motion coherence split by environment and consistency of current choice and previous stimulus. In the Repetitive environment (Rep.), accuracy and motion coherence are larger for current choice consistent vs. inconsistent with previous stimulus category and vice versa in the Alternating environment (Alt.).

      Importantly, this decision has no implications for the conclusions of our paper: The model-independent analyses in the original versions of Figures 3 and 4 were only intended as a supplement to the most conclusive and readily interpretable results from the model-based analyses (now in Figs. 3C and 4D, E). The latter are the most direct demonstration of a shaping of build-up of action-selective activity by history bias, and they are unaffected by these confounds.

      In addition, I wondered whether the bin subsampling procedure to match trial numbers for choice might result in unbalanced coherences between the up and down choices.

      The subsampling itself did not cause any unbalanced coherences between the up and down choices, which we now show in Figure 4 – figure supplement 1. There was only a slight imbalance in coherences between up and down choices before the subsampling, which then carried over into the subsampled trials; the coherences were distributed equally before and after the subsampling.
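      The kind of check described here can be sketched as follows: match the number of trials per choice by random subsampling, then compare the coherence per choice before and after. This is an illustrative sketch on simulated trials. The function names, the coding of choices as 0 and 1, and the coherence levels are assumptions; the sketch is shown for a single set of trials, and in a binned analysis it would be applied within each bin.

```python
import numpy as np


def subsample_to_match(choice, rng):
    """Indices of a random subsample with equal counts of choice 0 and 1."""
    choice = np.asarray(choice)
    idx_down = np.flatnonzero(choice == 0)
    idx_up = np.flatnonzero(choice == 1)
    n_keep = min(idx_down.size, idx_up.size)
    keep = np.concatenate([
        rng.choice(idx_down, size=n_keep, replace=False),
        rng.choice(idx_up, size=n_keep, replace=False),
    ])
    return np.sort(keep)


def mean_coherence_by_choice(coherence, choice):
    """Mean coherence separately for each choice (0 = down, 1 = up)."""
    coherence = np.asarray(coherence, dtype=float)
    choice = np.asarray(choice)
    return {c: float(coherence[choice == c].mean()) for c in (0, 1)}


# Simulated trials: choice is independent of coherence, with more "up" choices.
rng = np.random.default_rng(2)
n_trials = 600
coherence = rng.choice([0.0, 0.2, 0.4, 0.8], size=n_trials)
choice = (rng.random(n_trials) < 0.6).astype(int)

keep = subsample_to_match(choice, rng)
before = mean_coherence_by_choice(coherence, choice)
after = mean_coherence_by_choice(coherence[keep], choice[keep])
```

      Comparing `before` and `after` per choice indicates whether the subsampling step itself shifted the coherence distribution, as opposed to inheriting an imbalance that was already present.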

      Also, please note that the purpose of this analysis was to make the neural bias directly “visible” in the beta lateralization data, rather than just regression weights. The issue does not pertain to the critical single-trial regression analysis, which yielded consistent results.

      References

      Abrahamyan A, Silva LL, Dakin SC, Carandini M, Gardner JL (2016) Adaptable history biases in human perceptual decisions. Proceedings of the National Academy of Sciences 113:E3548–E3557.

      Braun A, Urai AE, Donner TH (2018) Adaptive History Biases Result from Confidence-weighted Accumulation of Past Choices. The Journal of Neuroscience:2189–17.

      de Lange FP, Rahnev DA, Donner TH, Lau H (2013) Prestimulus Oscillatory Activity over Motor Cortex Reflects Perceptual Expectations. Journal of Neuroscience 33:1400–1410.

      Glaze CM, Filipowicz ALS, Kable JW, Balasubramanian V, Gold JI (2018) A bias–variance trade-off governs individual differences in on-line learning in an unpredictable environment. Nat Hum Behav 2:213–224.

      Hermoso-Mendizabal A, Hyafil A, Rueda-Orozco PE, Jaramillo S, Robbe D, de la Rocha J (2020) Response outcomes gate the impact of expectations on perceptual decisions. Nat Commun 11:1057.

      Kim TD, Kabir M, Gold JI (2017) Coupled Decision Processes Update and Maintain Saccadic Priors in a Dynamic Environment. The Journal of Neuroscience 37:3632–3645.

      Meyniel F, Maheu M, Dehaene S (2016) Human Inferences about Sequences: A Minimal Transition Probability Model (Gershman SJ, ed). PLOS Computational Biology 12:e1005260.

      Mochol G, Kiani R, Moreno-Bote R (2021) Prefrontal cortex represents heuristics that shape choice bias and its integration into future behavior. Current Biology 31:1234-1244.e6.

      Murphy PR, Wilming N, Hernandez-Bocanegra DC, Prat-Ortega G, Donner TH (2021) Adaptive circuit dynamics across human cortex during evidence accumulation in changing environments. Nat Neurosci 24:987–997.

      O’Connell RG, Kelly SP (2021) Neurophysiology of Human Perceptual Decision-Making. Annu Rev Neurosci 44:495–516.

      Ratcliff R, McKoon G (2008) The Diffusion Decision Model: Theory and Data for Two-Choice Decision Tasks. Neural Computation 20:873–922.

      Siegel M, Engel AK, Donner TH (2011) Cortical Network Dynamics of Perceptual Decision-Making in the Human Brain. Frontiers in Human Neuroscience 5 Available at: http://journal.frontiersin.org/article/10.3389/fnhum.2011.00021/abstract [Accessed April 8, 2017].

      Talluri BC, Braun A, Donner TH (2021) Decision making: How the past guides the future in frontal cortex. Current Biology 31:R303–R306.

      Urai AE, Donner TH (2022) Persistent activity in human parietal cortex mediates perceptual choice repetition bias. Nat Commun 13:6015.

      Wilming N, Murphy PR, Meyniel F, Donner TH (2020) Large-scale dynamics of perceptual decision information across human cortex. Nat Commun 11:5109.

      Yu A, Cohen JD (2009) Sequential effects: Superstition or rational behavior. Advances in neural information processing systems 21:1873–1880.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We have specifically addressed the points of uncertainty highlighted in eLife's editorial assessment, which concerned the lack of low-level acoustics control, limitations of experimental design, and in-depth analysis. Regarding “the lack of low-level acoustics control, limitations of experimental design”, in response to Reviewer #1, we clarify that our study aimed to provide a broad perspective, which includes both auditory and higher-level processes, on the similarities and distinctions in processing natural speech and music within an ecological context. Regarding “the lack of in-depth analysis”, in response to Reviewers #1 and #2, we have clarified that while model-based analyses are valuable, they pose fundamental challenges when comparing speech and music. Non-acoustic features inherently differ between speech and music (such as phonemes and pitch), making direct comparisons reliant on somewhat arbitrary choices. Our approach mitigates this challenge by analyzing the entire neural signal, thereby avoiding potential pitfalls associated with encoding models of non-comparable features. Finally, we provide some additional analyses suggested by the Reviewers.

      We sincerely appreciate your thoughtful and thorough consideration throughout the review process.

      eLife assessment

      This study presents valuable intracranial findings on how two important types of natural auditory stimuli - speech and music - are processed in the human brain, and demonstrates that speech and music largely share network-level brain activities, thus challenging the domain-specific processing view. The evidence supporting the claims of the authors is solid but somewhat incomplete since although the data analysis is thorough, the results are robust and the stimuli have ecological validity, important considerations such as low-level acoustics control, limitations of experimental design, and in-depth analysis, are lacking. The work will be of broad interest to speech and music researchers as well as cognitive scientists in general.

      Reviewer #1 (Public Review):

      Summary:

      In this study, the authors examined the extent to which the processing of speech and music depends on neural networks that are either specific to a domain or general in nature. They conducted comprehensive intracranial EEG recordings on 18 epilepsy patients as they listened to natural, continuous forms of speech and music. This enabled an exploration of brain activity at both the frequency-specific and network levels across a broad spectrum. Utilizing statistical methods, the researchers classified neural responses to auditory stimuli into categories of shared, preferred, and domain-selective types. It was observed that a significant portion of both focal and network-level brain activity is commonly shared between the processing of speech and music. However, neural responses that are selectively responsive to speech or music are confined to distributed, frequency-specific areas. The authors highlight the crucial role of using natural auditory stimuli in research and the need to explore the extensive spectral characteristics inherent in the processing of speech and music.

      Strengths:

      The study's strengths include its high-quality sEEG data from a substantial number of patients, covering a majority of brain regions. This extensive cortical coverage grants the authors the ability to address their research questions with high spatial resolution, marking an advantage over previous studies. They performed thorough analyses across the entire cortical coverage and a wide frequency range of neural signals. The primary analyses, including spectral analysis, temporal response function calculation, and connectivity analysis, are presented straightforwardly. These analyses, as well as figures, innovatively display how neural responses, in each frequency band and region/electrode, are 'selective' (according to the authors' definition) to speech or music stimuli. The findings are summarized in a manner that efficiently communicates information to readers. This research offers valuable insights into the cortical selectivity of speech and music processing, making it a noteworthy reference for those interested in this field. Overall, this research offers a valuable dataset and carries out extensive yet clear analyses, amounting to an impressive empirical investigation into the cortical selectivity of speech and music. It is recommended for readers who are keen on understanding the nuances of selectivity and generality in the processing of speech and music to refer to this study's data and its summarized findings.

      Weaknesses:

      The weakness of this study, in my view, lies in its experimental design and reasoning:

      (1) Despite using longer stimuli, the study does not significantly enhance ecological validity compared to previous research. The analyses treat these long speech and music stimuli as stationary signals, overlooking their intricate musical or linguistic structural details and temporal variation across local structures like sentences and phrases. In previous studies, short, less ecological segments of music were used, maintaining consistency in content and structure. However, this study, despite employing longer stimuli, does not distinguish between neural responses to the varied contents or structures within speech and music. Understanding the implications of long-term analyses, such as spectral and connectivity analyses over extended periods of around 10 minutes, becomes challenging when they do not account for the variable, sometimes quasi-periodical or even non-periodical, elements present in natural speech and music. When contrasting this study with prior research and highlighting its advantages, a more balanced perspective would have been beneficial in the manuscript.

      Regarding ecological validity, we respectfully hold a differing perspective from the reviewer. In our view, a one-second music stimulus lacks ecological validity, as real-world music always extends much beyond such a brief duration. While we acknowledge the trade-off in selecting longer stimuli, limiting the diversity of musical styles, we maintain that only long stimuli afford participants an authentic musical listening experience. Conversely, shorter stimuli may lead participants to merely "skip through" musical excerpts rather than engage in genuine listening.

      Regarding the critique that we "did not distinguish between neural responses to the varied contents or structures within speech and music," we partly concur. Our TRF (temporal response function) analyses incorporate acoustic content, particularly the acoustic envelope, thereby addressing this concern to some extent. However, it is accurate to note that we did not model non-acoustic features. In acknowledging this limitation, we would like to share an additional thought with the reviewer regarding model comparison for speech and music. Specifically, comparing results from a phonetic (or syntactic) model of speech to a pitch-melodic (or harmonic) model for music is not straightforward, as these models operate on fundamentally different dimensions. In other words, while assuming equivalence between phonemes and pitches may be a reasonable assumption, it in essence relies on a somewhat arbitrary choice. Consequently, comparing and interpreting neuronal population coding for one or the other model remains problematic. In summary, because the models for speech and music are different (except for acoustic models), direct comparison is challenging, although still commendable and of interest.

      Finally, we did take into account the reviewer’s remark and did our best to give a more balanced perspective of our approach and previous studies in the discussion.

      “While listening to natural speech and music rests on cognitively relevant neural processes, our analytical approach, extending over a rather long period of time, does not allow to directly isolate specific brain operations. Computational models -which can be as diverse as acoustic (Chi et al., 2005), cognitive (Giordano et al., 2021), information-theoretic (Di Liberto et al., 2020), or self-supervised neural network (Donhauser & Baillet, 2019 ; Millet et al., 2022) models- are hence necessary to further our understanding of the type of computations performed by our reported frequency-specific distributed networks. Moreover, incorporating models accounting for musical and linguistic structure can help us avoid misattributing differences between speech and music driven by unmatched sensitivity factors (e.g., arousal, emotion, or attention) as inherent speech or music selectivity (Mas-Herrero et al., 2013; Nantais & Schellenberg, 1999).”

      (2) In contrast to previous studies that employed short stimulus segments along with various control stimuli to ensure that observed selectivity for speech or music was not merely due to low-level acoustic properties, this study used longer, ecological stimuli. However, the control stimuli used in this study, such as tone or syllable sequences, do not align with the low-level acoustic properties of the speech and music stimuli. This mismatch raises concerns that the differences or selectivity between speech and music observed in this study might be attributable to these basic acoustic characteristics rather than to more complex processing factors specific to speech or music.

      We acknowledge the reviewer's concern. Indeed, speech and music differ on various levels, including acoustic and cognitive aspects, and our analyses do not explicitly distinguish them. The aim of this study was to provide an overview of the similarities and differences between natural speech and music processing, in an ecological context. Future work is needed to further explore the different hierarchical levels or networks composing such listening experiences. Of note, however, we report whole-brain results with high spatial resolution (thanks to iEEG recordings), enabling the distinction between auditory, superior temporal gyrus (STG), and higher-level responses. Our findings clearly highlight that both auditory and higher-level regions predominantly exhibit shared responses, challenging the interpretation that our results can be attributed solely to differences in 'basic acoustic characteristics'.

      We have now more clearly pointed out this reasoning in the results section:

      “The spatial distribution of the spectrally-resolved responses corresponds to the network typically involved in speech and music perception. This network encompasses both ventral and dorsal auditory pathways, extending well beyond the auditory cortex and, hence, beyond auditory processing that may result from differences in the acoustic properties of our baseline and experimental stimuli.”

      (3) The concept of selectivity - shared, preferred, and domain-selective - increases the risks of potentially overgeneralized interpretations and theoretical inaccuracies. The authors' categorization of neural sites/regions as shared, preferred, or domain-selective regarding speech and music processing essentially resembles a traditional ANOVA test with post hoc analysis. While this categorization gives meaningful context to the results, the mere presence of significant differences among control stimuli, a segment of speech, and a piece of music does not necessarily imply that a region is specifically selective to a type of stimulus like speech. The manuscript's narrative might lead to an overgeneralized interpretation that their findings apply broadly to speech or music. However, identifying differences in neural responses to a few sets of specific stimuli in one brain region does not robustly support such a generalization. This is because speech and music are inherently diverse, and specificity often relates more to the underlying functions than to observed neural responses to a limited number of examples of a stimulus type. See the next point.

      Exactly! Here, we present a precise operational definition of these terms, implemented with clear and rigorous statistical methods. It is important to note that in many cognitive neuroscience studies, the term "selective" is often used without a clear definition. By establishing operational definitions, we identified three distinct categories based on statistical testing of differences from baseline and between conditions. This approach provides a framework for more accurate interpretation of experimental findings, as now better outlined in the introduction:

      “Finally, we suggest that terms should be operationally defined based on statistical tests, which results in a clear distinction between shared, selective, and preferred activity. That is, be A and B two investigated cognitive functions, “shared” would be a neural population that (compared to a baseline) significantly and equally contributes to the processing of both A and B; “selective” would be a neural population that exclusively contributes to the processing of A or B (e.g. significant for A but not B); and “preferred” would be a neural population that significantly contributes to the processing of both A and B, but more prominently for A or B (Figure 1A).”
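      The decision rule in this definition can be written out as a small sketch. It is only an illustration, not the code used in the study: the function name, argument names, and the "unclassified" fallback label are assumptions introduced here. The requirement that a "selective" channel also show a significant difference between domains follows the description of the spectral analyses given later in this response.

```python
def categorize_channel(sig_a, sig_b, sig_diff, a_stronger):
    """Label one channel (in one frequency band) from three test outcomes.

    sig_a      -- True if domain A (e.g. speech) differs significantly from baseline
    sig_b      -- True if domain B (e.g. music) differs significantly from baseline
    sig_diff   -- True if A and B differ significantly from each other
    a_stronger -- True if the response to A is the stronger of the two
    All names are hypothetical; the inputs are assumed to be already
    corrected for multiple comparisons.
    """
    if sig_a and sig_b:
        if not sig_diff:
            return "shared"  # both contribute, and equally so
        return "preferred_A" if a_stronger else "preferred_B"
    if sig_diff and sig_a:
        return "selective_A"  # only A departs from baseline
    if sig_diff and sig_b:
        return "selective_B"  # only B departs from baseline
    # Assumption: anything else is left without a label.
    return "unclassified"


if __name__ == "__main__":
    print(categorize_channel(True, True, False, True))   # both significant, no difference
    print(categorize_channel(True, False, True, True))   # only A significant
```

      Under this rule, a heightened response to one domain is "preferred" as long as the other domain also departs from baseline; "selective" requires the other domain to be indistinguishable from baseline.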

      Regarding the risk of over-generalization, we want to clarify that our manuscript does not claim that a specific region or frequency band is selective to speech or music. As indeed we focus on testing excerpts of speech and music, we employ the reverse logical reasoning: "if 10 minutes of instrumental music activates a region traditionally associated with speech selectivity, we can conclude that this region is NOT speech-selective." Our conclusions revolve around the absence of selectivity rather than the presence of selective areas or frequency bands. In essence, "one counterexample is enough to disprove a theory." We now further elaborated on this point in the discussion section:

      “In this context, in the current study we did not observe a single anatomical region for which speech-selectivity was present, in any of our analyses. In other words, 10 minutes of instrumental music was enough to activate cortical regions classically labeled as speech (or language) -selective. On the contrary, we report spatially distributed and frequency-specific patterns of shared, preferred, or selective neural responses and connectivity fingerprints. This indicates that domain-selective brain regions should be considered as a set of functionally homogeneous but spatially distributed voxels, instead of anatomical landmarks.”

      (4) The authors' approach, akin to mapping a 'receptive field' by correlating stimulus properties with neural responses to ascertain functional selectivity for speech and music, presents issues. For instance, in the cochlea, different stimuli activate different parts of the basilar membrane due to the distinct spectral contents of speech and music, with each part being selective to certain frequencies. However, this phenomenon reflects the frequency selectivity of the basilar membrane - an important function, not an inherent selectivity for speech or music. Similarly, if cortical regions exhibit heightened responses to one type of stimulus over another, it doesn't automatically imply selectivity or preference for that stimulus. The explanation could lie in functional aspects, such as a region's sensitivity to temporal units of a specific duration, be it music, speech, or even movie segments, and its role in chunking such units (e.g., around 500 ms), which might be more prevalent in music than in speech, or vice versa in the current study. This study does not delve into the functional mechanisms of how speech and music are processed across different musical or linguistic hierarchical levels but merely demonstrates differences in neural responses to various stimuli over a 10-minute span.

      We completely agree with the last statement, as our primary goal was not to investigate the functional mechanisms underlying speech and music processing. However, the finding of a substantial portion of the cortical network as being shared between the two domains constrains our understanding of the underlying common operations. Regarding the initial part of the comment, we would like to clarify that in the framework we propose, if cortical regions show heightened responses to one type of stimulus over another, this falls into the ‘preferred’ category. The ‘selective’ (exclusive) category, on the other hand, would require that the region be unresponsive to one of the two stimuli.

      Reviewer #2 (Public Review):

      Summary:

      The study investigates whether speech and music processing involve specific or shared brain networks. Using intracranial EEG recordings from 18 epilepsy patients, it examines neural responses to speech and music. The authors found that most neural activity is shared between speech and music processing, without specific regional brain selectivity. Furthermore, domain-selective responses to speech or music are limited to frequency-specific coherent oscillations. The findings challenge the notion of anatomically distinct regions for different cognitive functions in the auditory process.

      Strengths:

      (1) This study uses a relatively large corpus of intracranial EEG data, which provides high spatiotemporal resolution neural recordings, allowing for more precise and dynamic analysis of brain responses. The use of continuous speech and music enhances ecological validity compared to artificial or segmented stimuli.

      (2) This study uses multiple frequency bands in addition to just high-frequency activity (HFA), which has been the focus of many existing studies in the literature. This allows for a more comprehensive analysis of neural processing across the entire spectrum. The heterogeneity across different frequency bands also indicates that different frequency components of the neural activity may reflect different underlying neural computations.

      (3) This study also adds empirical evidence towards distributed representation versus domain-specificity. It challenges the traditional view of highly specialized, anatomically distinct regions for different cognitive functions. Instead, the study suggests a more integrated and overlapping neural network for processing complex stimuli like speech and music.

      Weaknesses:

      While this study is overall convincing, there are still some weaknesses in the methods and analyses that limit the implication of the work.

      The study's main approach, focusing primarily on the grand comparison of response amplitudes between speech and music, may overlook intricate details in neural coding. Speech and music are not entirely orthogonal with each other at different levels of analysis: at the high-level abstraction, these are two different categories of cognitive processes; at the low-level acoustics, they overlap a lot; at intermediate levels, they may also share similar features. The selected musical stimuli, incorporating both vocals and multiple instrumental sounds, raise questions about the specificity of neural activation. For instance, it's unclear if the vocal elements in music and speech engage identical neural circuits. Additionally, the study doesn't adequately address whether purely melodic elements in music correlate with intonations in speech at a neural level. A more granular analysis, dissecting stimuli into distinct features like pitch, phonetics, timbre, and linguistic elements, could unveil more nuanced shared, and unique neural processes between speech and music. Prior research indicates potential overlap in neural coding for certain intermediate features in speech and music (Sankaran et al. 2023), suggesting that a simple averaged response comparison might not fully capture the complexity of neural encoding. Further delineation of phonetic, melodic, linguistic, and other coding, along with an analysis of how different informational aspects (phonetic, linguistic, melodic, etc) are represented in shared neural activities, could enhance our understanding of these processes and strengthen the study's conclusions.

      We appreciate the reviewer's acknowledgment that delving into the intricate details of neural coding of speech and music was beyond the scope of this work. To address some of the more precise issues raised, we have clarified in the manuscript that our musical stimuli do not contain vocals and are purely instrumental. We apologize if this was not clear initially.

      “In the main experimental session, patients passively listened to ~10 minutes of storytelling (577 secs, La sorcière de la rue Mouffetard; Gripari, 2004) and ~10 minutes of instrumental music (580 secs, Reflejos del Sur; Oneness, 2006) separated by 3 minutes of rest.”

      Furthermore, we now acknowledge the importance of modeling melodic, phonetic, or linguistic features in the discussion, and we have referenced the work of Sankaran et al. (2023) and McCarty et al. (2023) in this regard. However, we would like to share an additional thought with the reviewer regarding model comparison for speech and music. Specifically, comparing results from a phonetic (or syntactic) model of speech to a pitch-melodic (or harmonic) model for music is not straightforward, as these models operate on fundamentally different dimensions. In other words, while assuming equivalence between phonemes and pitches may be a reasonable assumption, it in essence relies on a somewhat arbitrary choice. Consequently, comparing and interpreting neuronal population coding for one or the other model remains problematic. In summary, because the models for speech and music are different (except for acoustic models), direct comparison is challenging, although still commendable and of interest.

      “These selective responses, not visible in primary cortical regions, seem independent of both low-level acoustic features and higher-order linguistic meaning (Norman-Haignere et al., 2015), and could subtend intermediate representations (Giordano et al., 2023) such as domain-dependent predictions (McCarty et al., 2023; Sankaran et al., 2023).”

      References:

      McCarty, M. J., Murphy, E., Scherschligt, X., Woolnough, O., Morse, C. W., Snyder, K., Mahon, B. Z., & Tandon, N. (2023). Intraoperative cortical localization of music and language reveals signatures of structural complexity in posterior temporal cortex. iScience, 26(7), 107223.

      Sankaran, N., Leonard, M. K., Theunissen, F., & Chang, E. F. (2023). Encoding of melody in the human auditory cortex. bioRxiv. https://doi.org/10.1101/2023.10.17.562771

      The paper's emphasis on shared and overlapping neural activity, as observed through sEEG electrodes, provides valuable insights. It is probably true that domain-specificity for speech and music does not exist at such a macro scale. However, it's important to consider that each electrode records from a large neuronal population, encompassing thousands of neurons. This broad recording scope might mask more granular, non-overlapping feature representations at the single neuron level. Thus, while the study suggests shared neural underpinnings for speech and music perception at a macroscopic level, it cannot definitively rule out the possibility of distinct, non-overlapping neural representations at the microscale of local neuronal circuits for features that are distinctly associated with speech and music. This distinction is crucial for fully understanding the neural mechanisms underlying speech and music perception that merit future endeavors with more advanced large-scale neuronal recordings.

      We appreciate the reviewer's concern, but we do not view this as a weakness for our study's purpose. Every method inherently has limitations, and intracranial recordings currently offer the best possible spatial specificity and temporal resolution for studying the human brain. Studying cell assemblies thoroughly in humans is ethically challenging, and examining speech and music in non-human primates or rats raises questions about cross-species analogy. Therefore, despite its limitations, we believe intracranial recording remains the best option for addressing these questions in humans.

      Regarding the granularity of neural representation, while understanding how computations occur in the central nervous system is crucial, we question whether the single neuron scale provides the most informative insights. Single neurons seem more variable (e.g., in terms of cell type or layer affiliation) than the local circuitry they contribute to, which appears to be the brain's building block (e.g., the laminar organization; see Mendoza-Halliday et al., 2024). Additionally, the population dynamics of these functional modules appear crucial for cognition and behavior (Safaie et al. 2023; Buzsáki and Vöröslakos, 2023). Therefore, we emphasize the need for multi-scale research, as we believe that a variety of approaches will compensate for each other's individual weaknesses. We clarified this in the introduction:

      “This approach rests on the idea that the canonical computations that underlie cognition and behavior are anchored in population dynamics of interacting functional modules (Safaie et al. 2023; Buzsáki and Vöröslakos, 2023) and bound to spectral fingerprints consisting of network- and frequency-specific coherent oscillations (Siegel et al., 2012).”

      Importantly, we focus on the macro-scale and conclude that, at the anatomical region level, no speech or music selectivity can be observed during natural stimulation. This is stated in the discussion, as follows:

      “In this context, in the current study we did not observe a single anatomical region for which speech-selectivity was present, in any of our analyses. In other words, 10 minutes of instrumental music was enough to activate cortical regions classically labeled as speech (or language) -selective. On the contrary, we report spatially distributed and frequency-specific patterns of shared, preferred, or selective neural responses and connectivity fingerprints. This indicates that domain-selective brain regions should be considered as a set of functionally homogeneous but spatially distributed voxels, instead of anatomical landmarks.”

      References:

      Mendoza-Halliday, D., Major, A.J., Lee, N. et al. A ubiquitous spectrolaminar motif of local field potential power across the primate cortex. Nat Neurosci (2024).

      Safaie, M., Chang, J.C., Park, J. et al. Preserved neural dynamics across animals performing similar behaviour. Nature 623, 765–771 (2023).

      Buzsáki, G., & Vöröslakos, M. (2023). Brain rhythms have come of age. Neuron, 111(7), 922-926.

      While classifying electrodes into 3 categories provides valuable insights, it may not fully capture the complexity of the neural response distribution to speech and music. A more nuanced and continuous approach could reveal subtler gradations in neural response, rather than imposing categorical boundaries. This could be done by computing continuous metrics, like unique variances explained by each category, or ratio-based statistics, etc. Incorporating such a continuum could enhance our understanding of the neural representation of speech and music, providing a more detailed and comprehensive picture of cortical processing.

      To clarify, the metrics we are investigating (coherence, power, linear correlations) are continuous. Additionally, we conduct a comprehensive statistical analysis of these results. The statistical testing, which includes assessing differences from baseline and between the speech and music conditions using a statistical threshold, yields three categories. Of note, ratio-based statistics (a continuous metric) are provided in Figures S9 and S10 (Figures S8 and S9 in the original version of the manuscript).
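      As a rough illustration of how a thresholded test turns a continuous metric into a per-channel decision, the sketch below runs a paired permutation test with max-statistic correction across channels. It is a plain NumPy re-implementation written for this illustration, not the routine used in the study; the function name, the sign-flipping scheme, the input layout (observations by channels), and the default parameters are assumptions.

```python
import numpy as np


def paired_tmax_permutation(x, y, n_permutations=1000, seed=0):
    """Paired permutation test with max-|t| correction over channels.

    x, y -- arrays of shape (n_observations, n_channels) holding the same
            continuous metric (e.g. power) under two conditions, paired by row.
    Returns the observed t-values and family-wise corrected p-values.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    n_obs = diff.shape[0]

    def t_stat(d):
        # One-sample t-statistic of the paired differences, per channel.
        return d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(n_obs))

    t_obs = t_stat(diff)

    # Null distribution: randomly flip the sign of each paired difference
    # and keep only the largest |t| across channels for each permutation.
    max_null = np.empty(n_permutations)
    for i in range(n_permutations):
        signs = rng.choice([-1.0, 1.0], size=(n_obs, 1))
        max_null[i] = np.abs(t_stat(diff * signs)).max()

    # Corrected p-value: share of null maxima at least as large as the
    # observed |t| (the +1 terms count the observed data as one permutation).
    exceed = (max_null[None, :] >= np.abs(t_obs)[:, None]).sum(axis=1)
    p_corrected = (1 + exceed) / (n_permutations + 1)
    return t_obs, p_corrected


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.normal(size=(30, 5))
    b = rng.normal(size=(30, 5))
    a[:, 0] += 3.0  # strong simulated effect on the first channel only
    t_values, p_values = paired_tmax_permutation(a, b)
    print(p_values < 0.05)
```

      Comparing each channel against the distribution of the maximum statistic is what controls the family-wise error rate; the resulting Boolean outcomes (domain versus baseline, domain versus domain) are the kind of input from which the three categories are derived.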

      Reviewer #3 (Public Review):

      Summary:

      Te Rietmolen et al. investigated the selectivity of cortical responses to speech and music stimuli using neurosurgical stereo EEG in humans. The authors address two basic questions: 1. Are speech and music responses localized in the brain or distributed; 2. Are these responses selective and domain-specific or rather domain-general and shared? To investigate this, the study proposes a nomenclature of shared responses (speech and music responses are not significantly different), domain selective (one domain is significantly different from baseline and the other is not), domain preferred (both are significantly different from baseline but one is larger than the other and they differ significantly from each other). The authors employ this framework using neural responses across the spectrum (rather than focusing on high gamma), providing evidence for a low level of selectivity across spectral signatures. To investigate the nature of the underlying representations they use encoding models to predict neural responses (low and high frequency) given a feature space of the stimulus envelope or peak rate (by time delay) and find stronger encoding for both in the low-frequency neural responses. The top encoding electrodes are used as seeds for a pair-wise connectivity (coherence) in order to repeat the shared/selective/preferred analysis across the spectra, suggesting low selectivity. Spectral power and connectivity are also analyzed on the level of the regional patient population to rule out (and depict) any effects driven by a select few patients. Across analyses the authors consistently show a paucity of domain selective responses and when evident these selective responses were not represented across the entire cortical region. The authors argue that speech and music mostly rely on shared neural resources.

      Strengths:

      I found this manuscript to be rigorous, providing compelling and clear evidence of shared neural signatures for speech and music. The use of intracranial recordings provides an important spatial and temporal resolution that lends itself to the power, connectivity, and encoding analyses. The statistics and methods employed are rigorous and reliable, estimated based on permutation approaches, and cross-validation/regularization was employed and reported properly. The analysis of measures across the entire spectrum in power, coherence, and encoding models provides a comprehensive view of responses that no doubt will benefit the community as an invaluable resource. Analysis of the level of patient population (feasible with their high N) per region also supports the generalizability of the conclusions across a relatively large cohort of patients. Last but not least, I believe the framework of selective, preferred, and shared is a welcome lens through which to investigate cortical function.

      Weaknesses:

      I did not find methodological weaknesses in the current version of the manuscript. I do believe that it is important to highlight that the data is limited to passively listening to naturalistic speech and music. The speech and music stimuli are not completely controlled with varying key acoustic features (inherent to the different domains). Overall, I found the differences in stimulus and lack of attentional controls (passive listening) to be minor weaknesses that would not dramatically change the results or conclusions.

      Thank you for this positive review of our work. We added these points as limitations and future directions in the discussion section:

      “Finally, in adopting here a comparative approach of speech and music – the two main auditory domains of human cognition – we only investigated one type of speech and of music also using a passive listening task. Future work is needed to investigate for instance whether different sentences or melodies activate the same selective frequency-specific distributed networks and to what extent these results are related to the passive listening context compared to a more active and natural context (e.g. conversation).”

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) The concepts of activation and deactivation within the study's context of selectivity are not straightforward to comprehend. It would be beneficial for the authors to provide more detailed explanations of how these phenomena relate to the selectivity of neural responses to speech and music. Such elaboration would aid readers in better understanding the nuances of how certain brain regions are selectively activated or deactivated in response to different auditory stimuli.

      The reviewer is right that the reported results are quite complex to interpret. The concepts of activation and deactivation are generally complex to comprehend as they are in part defined by an approach (e.g., method and/or metric) and the scale of observation (Pfurtscheller et al., 1999). The power (or the magnitude) of a time-frequency estimate is by definition a positive value. Deactivation (or desynchronization) is therefore related to the comparison used (e.g., baseline, control, condition). This is further complicated by the scale of the measurement. For instance, during a simple limb movement, some brain areas in the sensorimotor cortex are activated, yet this phenomenon is accompanied at a finer scale by some desynchronization of the mu-activity, and such desynchronization is a relative measure (e.g., before/after motor movement). At a broader scale it is not rare to see some form of balance between brain networks, some being ‘inhibited’ to let others be activated, like the default mode network versus sensory-motor networks. In our case, when estimating selective responses, it is the strength of the signal that matters. The type of selectivity is then defined by the sign/direction of the comparison/subtraction. We now provide additional details about the sign of selectivity between domains and frequencies in the Methods and Results sections:

      Methods:

      “In order to explore the full range of possible selective, preferred, or shared responses, we considered both responses greater and smaller than the baseline. Indeed, as neural populations can synchronize or desynchronize in response to sensory stimulation, we estimated these categories separately for significant activations and significant deactivations compared to baseline.”

      Results:

      “We classified, for each canonical frequency band, each channel into one of the categories mentioned above, i.e. shared, selective, or preferred (Figure 1A), by examining whether speech and/or music differ from baseline and whether they differ from each other. We also considered both activations and deactivations, compared to baseline, as both index a modulation of neural population activity, and have been linked with cognitive processes (Pfurtscheller & Lopes da Silva, 1999; Proix et al., 2022). However, because our aim was not to interpret specific increase or decrease with respect to the baseline, we here simply consider significant deviations from the baseline. In other words, when estimating selectivity, it is the strength of the response that matters, not its direction (activation, deactivation).”

      “Both domains displayed a comparable percentage of selective responses across frequency bands (Figure 4, first values of each plot). When considering separately activation (Figure 2) and deactivation (Figure 3) responses, speech and music showed complementary patterns: for low frequencies (<15 Hz) speech selective (and preferred) responses were mostly deactivations and music responses activations compared to baseline, and this pattern reversed for high frequencies (>15 Hz).”

      References:

      J.P. Lachaux, J. Jung, N. Mainy, J.C. Dreher, O. Bertrand, M. Baciu, L. Minotti, D. Hoffmann, P. Kahane, Silence Is Golden: Transient Neural Deactivation in the Prefrontal Cortex during Attentive Reading, Cerebral Cortex, Volume 18, Issue 2, February 2008, Pages 443–450.

      Pfurtscheller, G., & Da Silva, F. L. (1999). Event-related EEG/MEG synchronization and desynchronization: basic principles. Clinical neurophysiology, 110(11), 1842-1857

      (2) The manuscript doesn't easily provide information about the control conditions, yet the conclusion significantly depends on these conditions as a baseline. It would be beneficial if the authors could clarify this information for readers earlier and discuss how their choice of control stimuli influences their conclusions.

      We added information in the Results section about the baseline conditions:

      “[...] with respect to two baseline conditions, in which patients passively listened to more basic auditory stimuli: one in which patients passively listened to pure tones (each 30 ms in duration), the other in which patients passively listened to isolated syllables (/ba/ or /pa/, see Methods).”

      Of note, while the choice of different ‘basic auditory stimuli’ as baseline can change the reported results in regions involved in low-level acoustic analysis (auditory cortex), it will have no impact on the results observed in higher-level regions, which also predominantly exhibit shared responses. We have now pointed out this reasoning more clearly in the Results section:

      “The spatial distribution of the spectrally-resolved responses corresponds to the network typically involved in speech and music perception. This network encompasses both ventral and dorsal auditory pathways, extending well beyond the auditory cortex and, hence, beyond auditory processing that may result from differences in the acoustic properties of our baseline and experimental stimuli.”

      (3) The spectral analyses section doesn't clearly explain how the authors performed multiwise correction. The authors' selectivity categorization appears similar to ANOVAs with posthoc tests, implying the need for certain corrections in the p values or categorization. Could the authors clarify this aspect?

      We apologize that this was not in the original version of the manuscript. In the spectral analyses, the selectivity categorization depended on both (1) the difference between each domain and the baseline, and (2) the difference between domains. Channels were marked as selective when (1) there was a significant difference between domains and (2) only one domain significantly differed from the baseline. All differences were estimated using paired-sample permutation tests based on the t-statistic from the mne-python library (Gramfort et al., 2014) with 1000 permutations and the built-in tmax method to correct for multiple comparisons over channels (Nichols & Holmes, 2002; Groppe et al. 2011). We have now explained more clearly how we controlled the family-wise error in the Methods section:

      “For each frequency band and channel, the statistical difference between conditions was estimated with paired-sample permutation tests based on the t-statistic from the mne-python library (Gramfort et al., 2014) with 1000 permutations and the tmax method to control the family-wise error rate (Nichols and Holmes 2002; Groppe et al. 2011). In tmax permutation testing, the null distribution is estimated by, for each channel (i.e. each comparison), swapping the condition labels (speech vs music or speech/music vs baseline) between epochs. After each permutation, the most extreme t-score over channels (tmax) is selected for the null distribution. Finally, the t-scores of the observed data are computed and compared to the simulated tmax distribution, as in parametric hypothesis testing. Because the chance of obtaining a large tmax (i.e. a false discovery) increases with the number of comparisons, the test automatically becomes more conservative when more comparisons are made, thereby correcting for multiple comparisons across channels.”
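      The tmax procedure described above can be illustrated with a short, self-contained Python sketch. It is a plain-Python illustration under stated assumptions, not the mne-python routine the authors used: the function names are hypothetical, the input is assumed to be a list of per-epoch paired differences (one value per channel), and swapping condition labels within a pair is implemented as flipping the sign of that epoch's difference for all channels at once.

```python
import math
import random

def t_stat(values):
    """One-sample t-statistic of a list of paired differences."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean / math.sqrt(var / n)

def tmax_permutation_test(diffs, n_permutations=1000, seed=0):
    """diffs[e][c] is the paired difference for epoch e and channel c.

    Returns the observed t-score per channel and p-values corrected for
    multiple comparisons across channels with the tmax method.
    """
    rng = random.Random(seed)
    n_channels = len(diffs[0])
    observed = [t_stat([row[c] for row in diffs]) for c in range(n_channels)]
    null_tmax = []
    for _ in range(n_permutations):
        # Swapping labels within an epoch flips the sign of its difference;
        # the same flip is applied to every channel of that epoch.
        signs = [rng.choice((-1, 1)) for _ in diffs]
        t_perm = [
            t_stat([s * row[c] for s, row in zip(signs, diffs)])
            for c in range(n_channels)
        ]
        # Keep only the most extreme t-score over channels.
        null_tmax.append(max(abs(t) for t in t_perm))
    p_values = [
        (1 + sum(m >= abs(t) for m in null_tmax)) / (n_permutations + 1)
        for t in observed
    ]
    return observed, p_values
```

      Because every channel is compared with the same distribution of maxima, adding channels can only push the null distribution toward larger values, which is why the correction becomes more conservative as the number of comparisons grows.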

      References:

      Gramfort, A., Luessi, M., Larson, E., Engemann, D. A., Strohmeier, D., Brodbeck, C., Parkkonen, L., & Hämäläinen, M. S. (2014). MNE software for processing MEG and EEG data. NeuroImage, 86, 446–460.

      Groppe, D. M., Bickel, S., Dykstra, A. R., Wang, X., Mégevand, P., Mercier, M. R., Lado, F. A., Mehta, A. D., & Honey, C. J. (2017). iELVis: An open source MATLAB toolbox for localizing and visualizing human intracranial electrode data. Journal of Neuroscience Methods, 281, 40–48.

      Nichols, T. E., & Holmes, A. P. (2002). Nonparametric permutation tests for functional neuroimaging: a primer with examples. Human Brain Mapping, 15(1), 1–25.

      Reviewer #2 (Recommendations For The Authors):

      Other suggestions:

      (1) The authors need to provide more details on how the sEEG electrodes were localized and selected. Are all electrodes included or only the ones located in the gray matter? If all electrodes were used, how to localize and label the ones that are outside of gray matter? In Figures 1C & 1D it seems that a lot of the electrodes were located in depth locations, how were the anatomical labels assigned for these electrodes?

      We apologize that this was not clear in the original version of the manuscript. Our electrode localization procedure was based on several steps described in detail in Mercier et al., 2022. Once electrodes were localized in a post-implant CT scan and the coordinates projected onto the pre-implant MRI, we were able to obtain the necessary information regarding brain tissues and anatomical regions. That is, first, the segmentation of the pre-implant MRI with SPM12 provided both the tissue probability maps (i.e. gray, white, and cerebrospinal fluid (CSF) probabilities) and the indexed-binary representations (i.e., either gray, white, CSF, bone, or soft tissues) that allowed us to dismiss electrodes outside of the brain and select those in the gray matter. Second, the individual's brain was co-registered to a template brain, which allowed us to back-project atlas parcels onto the individual’s brain and assign anatomical labels to each electrode. The result of this procedure allowed us to group channels by anatomical parcels as defined by the Brainnetome atlas (Figure 1D), which informed the analyses presented in section Population Prevalence (Methods, Figures 4, 9-10, S4-5). Because this study relies on stereotactic EEG, and not electrocorticography, recording sites include both gyri and sulci, while depth structures were not retained.

      We have now updated the “General preprocessing related to electrodes localisation” section in the Methods. The relevant part now states:

      “To precisely localize the channels, a procedure similar to the one used in the iELVis toolbox and in the FieldTrip toolbox was applied (Groppe et al., 2017; Stolk et al., 2018). First, we manually identified the location of each channel centroid on the post-implant CT scan using the Gardel software (Medina Villalon et al., 2018). Second, we performed volumetric segmentation and cortical reconstruction on the pre-implant MRI with the FreeSurfer image analysis suite (documented and freely available for download online http://surfer.nmr.mgh.harvard.edu/). This segmentation of the pre-implant MRI with SPM12 provides us with both the tissue probability maps (i.e. gray, white, and cerebrospinal fluid (CSF) probabilities) and the indexed-binary representations (i.e., either gray, white, CSF, bone, or soft tissues). This information allowed us to reject electrodes not located in the brain. Third, the post-implant CT scan was coregistered to the pre-implant MRI via a rigid affine transformation and the pre-implant MRI was registered to MNI152 space, via a linear and a non-linear transformation from SPM12 methods (Penny et al., 2011), through the FieldTrip toolbox (Oostenveld et al., 2011). Fourth, applying the corresponding transformations, we mapped channel locations to the pre-implant MRI brain that was labeled using the volume-based Human Brainnetome Atlas (Fan et al., 2016).”

      Reference:

      Mercier, M. R., Dubarry, A.-S., Tadel, F., Avanzini, P., Axmacher, N., Cellier, D., Vecchio, M. D., Hamilton, L. S., Hermes, D., Kahana, M. J., Knight, R. T., Llorens, A., Megevand, P., Melloni, L., Miller, K. J., Piai, V., Puce, A., Ramsey, N. F., Schwiedrzik, C. M., … Oostenveld, R. (2022). Advances in human intracranial electroencephalography research, guidelines and good practices. NeuroImage, 260, 119438.

      (2) From Figures 5 and 6 (and also S4, S5), is it true that aside from the shared response, lower frequency bands show more music selectivity (blue dots), while higher frequency bands show more speech selectivity (red dots)? I am curious how the authors interpret this.

      The reviewer is right in noticing the asymmetric selective response to music and speech in lower and higher frequency bands. However, while this effect is apparent in the analyses in which we inspected stronger synchronization (activation) compared to baseline (Figures 2 and S1), the pattern appears to reverse when examining deactivation compared to baseline (Figures 3 and S2). In other words, there seems to be an overall stronger deactivation for speech in the lower frequency bands and a relatively stronger deactivation for music in the higher frequency bands.

      We now provide additional details about the sign of selectivity between domains and frequencies in the Results section:

      “Both domains displayed a comparable percentage of selective responses across frequency bands (Figure 4, first values of each plot). When considering separately activation (Figure 2) and deactivation (Figure 3) responses, speech and music showed complementary patterns: for low frequencies (<15 Hz) speech selective (and preferred) responses were mostly deactivations and music responses activations compared to baseline, and this pattern reversed for high frequencies (>15 Hz).”

      Note, however, that this pattern of results depends on only a small number of patients: when ignoring regional selective responses that are driven by as few as 2 to 4 patients, the pattern disappears (Figures 5-6). More precisely, ignoring regions explored by a small number of patients almost completely removes the selective responses for both speech and music. For this reason, we do not feel confident interpreting the possible asymmetry between low and high frequency bands in how speech and music are encoded (activation or deactivation).

      Minor:

      (1) P9 L234: Why only consider whether these channels were unresponsive to the other domain in the other frequency bands? What about the responsiveness to the target domain?

      We thank the reviewer for their interesting suggestion. The primary objective of the cross-frequency analyses was to determine whether domain-selective channels for a given frequency band remain unresponsive (i.e. exclusive) to the other domain across frequency bands, or whether the observed selectivity is confined to specific frequency ranges (i.e. frequency-specific). In other words, does a given channel exclusively respond to one domain and never, in whichever frequency band, to the other domain? The idea behind this question is that, for a channel to be selectively involved in the encoding of one domain, it does not necessarily need to be sensitive to all timescales underlying that domain as long as it remains unresponsive to any timescale in the other domain. However, if the channel is sensitive to information that unfolds slowly in one domain and faster in the other domain, then the channel is no longer globally domain selective, but the selectivity is frequency-specific to each domain.

      The proposed analyses answer a slightly different, albeit also meaningful, question: how many frequencies (or frequency bands) do selective responses span? From the results presented below, the reviewer can appreciate the overall steep decline in selective responses beyond a single frequency band, with only a few channels remaining selectively responsive across at most four frequency bands. That is, selective responses globally span one frequency band.

      Author response image 1.

      Cross-frequency channel selective responses. The top figure shows the results for the spectral analyses (baselined against the tones condition, including both activation and deactivation). The bottom figure shows the results for the connectivity analyses. For each plot, the first (leftmost) value corresponds to the percentage (%) of channels displaying a selective response in a specific frequency band. In the next value, we remove the channels that no longer respond selectively to the target domain for the following frequency band. The black dots at the bottom of the graph indicate which frequency bands were successively included in the analysis.
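      The successive exclusion described in this caption can be sketched as a running intersection over frequency bands. The sketch below is a hypothetical illustration, not the authors' code: the function name, the band labels, and the representation of selective channels as sets of channel indices are all assumptions.

```python
def cross_frequency_selectivity(selective_by_band, n_channels, band_order):
    """Percentage of channels that remain selective as bands are added.

    selective_by_band maps a band name to the set of channel indices that
    respond selectively to the target domain in that band.
    """
    surviving = None
    percentages = []
    for band in band_order:
        channels = set(selective_by_band[band])
        # First band: all selective channels. Later bands: keep only the
        # channels that are still selective in the newly included band.
        surviving = channels if surviving is None else surviving & channels
        percentages.append(100.0 * len(surviving) / n_channels)
    return percentages
```

      With illustrative input such as `{"delta": {0, 1, 2, 3}, "theta": {1, 2, 5}, "alpha": {2}}` and 10 channels in total, the curve can only decrease or stay flat as bands are added, which is the steep decline the response refers to.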

      (2) P21 L623: "Population prevalence." The subsection title should be in bold.

      Done.

      Reviewer #3 (Recommendations For The Authors):

      The authors chose to use pure tone and syllables as baseline, I wonder if they also tried the rest period between tasks and if they could comment on how it differed and why they chose pure tones, (above and beyond a more active auditory baseline).

      This is an interesting suggestion. The reason for not using the baseline between speech and music listening (or right after) is that it will be strongly influenced by the previous stimulus. Indeed, after listening to the story it is likely that patients keep thinking about the story for a while. Similarly after listening to some music, the music remains in “our head” for some time.

      This is why we did not use rest but other auditory stimulation paradigms. Concerning the choice of pure tones and syllables, these happen to be used for clinical purposes to assess functioning of auditory regions. They also corresponded to a passive listening paradigm, simply with more basic auditory stimuli. We clarified this in the Results section:

      “[...] with respect to two baseline conditions, in which patients passively listened to more basic auditory stimuli: one in which patients passively listened to pure tones (each 30 ms in duration), the other in which patients passively listened to isolated syllables (/ba/ or /pa/, see Methods).”

      Discussion - you might want to address phase information in contrast to power. Your encoding models map onto low-frequency (bandpassed) activity which includes power and phase. However, the high-frequency model includes only power. The model comparison is not completely fair and may drive part of the effects in Figure 7a. I would recommend discussing this, or alternatively ruling out the effect with modeling power separately for the low frequency.

      We thank the reviewer for their recommendation. First, we would like to emphasize that the signal extraction techniques we used are those most frequently reported in previous papers (e.g. Ding and Simon, 2012; Di Liberto et al., 2015; Mesgarani and Chang, 2012).

      Low-frequency (LF) phase and high-frequency (HFa) amplitude are also known to track acoustic rhythms in the speech signal in a joint manner (Zion-Golumbic et al., 2013; Ding et al., 2016). This is possibly due to the fact that HFa amplitude and LF phase dynamics have a somewhat similar temporal structure (see Lakatos et al., 2005; Canolty and Knight, 2010).

      Still, the reviewer is correct in pointing out the somewhat unfair model comparison and we appreciate the suggestion to rule out a potential confound. We now report, in Supplementary Figure S8, a model comparison for LF amplitude vs. HFa amplitude to complement the findings displayed in Figure 7A. Overall, the reviewer can appreciate that using LF amplitude or phase does not change the results: LF (amplitude or phase) always captures acoustic features better than HFa amplitude.

      Author response image 2.

      TRF model comparison of low-frequency (LF) amplitude and high-frequency (HFa) amplitude. Models were investigated to quantify the encoding of the instantaneous envelope and the discrete acoustic onset edges (peakRate) by either the low frequency (LF) amplitude or the high frequency (HFa) amplitude. The ‘peakRate & LF amplitude’ model significantly captures the largest proportion of channels, and is, therefore, considered the winning model. Same conventions as in Figure 7A.

      References:

      Canolty, R. T., & Knight, R. T. (2010). The functional role of cross-frequency coupling. Trends in Cognitive Sciences, 14(11), 506–515.

      Di Liberto, G. M., O’Sullivan, J. A., & Lalor, E. C. (2015). Low-frequency cortical entrainment to speech reflects phoneme-level processing. Current Biology, 25(19), 2457–2465.

      Ding, N., & Simon, J. Z. (2012). Emergence of neural encoding of auditory objects while listening to competing speakers. Proceedings of the National Academy of Sciences, 109(29), 11854-11859.

      Ding, N., Melloni, L., Zhang, H., Tian, X., & Poeppel, D. (2016). Cortical tracking of hierarchical linguistic structures in connected speech. Nature Neuroscience, 19(1), 158–164.

      Zion Golumbic, E. M., Ding, N., Bickel, S., Lakatos, P., Schevon, C. A., McKhann, G. M., ... & Schroeder, C. E. (2013). Mechanisms underlying selective neuronal tracking of attended speech at a “cocktail party”. Neuron, 77(5), 980–991.

      Lakatos, P., Shah, A. S., Knuth, K. H., Ulbert, I., Karmos, G., & Schroeder, C. E. (2005). An oscillatory hierarchy controlling neuronal excitability and stimulus processing in the auditory cortex. Journal of Neurophysiology, 94(3), 1904–1911.

      Mesgarani, N., & Chang, E. F. (2012). Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 485(7397), 233-236.

      Similarly, the coherence analysis is affected by both power and phase, which are not dissociated; i.e., if the authors wished, they could repeat the coherence analysis with phase coherence (normalizing by the amplitude). Alternatively, this issue could be addressed in the discussion above.

      We agree with the Reviewer. We have now better clarified our choice in the Methods section:

      “Our rationale for using coherence as the functional connectivity metric was threefold. First, coherence analysis considers both magnitude and phase information. While the absence of dissociation can be criticized, signals with higher amplitude and/or SNR lead to better time-frequency estimates (which is not the case with a metric that would focus on phase only and therefore would be more likely to include estimates of various SNR). Second, we chose a metric that allows direct comparison between frequencies. Because the phase angle changes more quickly at high frequencies, phase alignment/synchronization is less likely than at lower frequencies. Third, we intended to align with previous work, which, for the most part, used the measure of coherence, most likely for the reasons explained above.”
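      The first point of this rationale can be illustrated with a small Python sketch that contrasts coherence with a phase-only metric. It is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the inputs are assumed to be complex spectral coefficients of two channels at one frequency (one value per epoch), and the phase-locking value stands in for "a metric that would focus on phase only".

```python
import cmath
import math

def coherence(x, y):
    """Magnitude of the cross-spectrum normalized by both auto-spectra."""
    n = len(x)
    cross = sum(a * b.conjugate() for a, b in zip(x, y)) / n
    power_x = sum(abs(a) ** 2 for a in x) / n
    power_y = sum(abs(b) ** 2 for b in y) / n
    return abs(cross) / math.sqrt(power_x * power_y)

def phase_locking_value(x, y):
    """Phase-only consistency: every epoch counts equally, whatever its amplitude."""
    unit_vectors = [
        cmath.exp(1j * (cmath.phase(a) - cmath.phase(b))) for a, b in zip(x, y)
    ]
    return abs(sum(unit_vectors) / len(unit_vectors))

# Three high-amplitude epochs with identical phase, plus one very weak epoch
# in which the two channels are in anti-phase.
x = [1 + 0j, 1 + 0j, 1 + 0j, 0.01 + 0j]
y = [1 + 0j, 1 + 0j, 1 + 0j, -0.01 + 0j]
coh = coherence(x, y)
plv = phase_locking_value(x, y)
```

      In this toy case the weak epoch barely changes the coherence, because each epoch is weighted by its amplitude, whereas the phase-only value drops to one half because the weak epoch carries the same weight as the strong ones.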

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      This valuable study by Wu and Zhou combined neurophysiological recordings and computational modelling to investigate the neural mechanisms that underpin the interaction between sensory evaluation and action selection. The neurophysiological results suggest non-linear modulation of decision-related LIP activity by action selection, but some further analysis would be helpful in order to understand whether these results can be generalised to LIP circuitry or might be dependent on specific spatial task configurations. The authors present solid computational evidence that this might be due to projections from choice target representations. These results are of interest for neuroscientists investigating decision-making.

      Strengths:

      Wu and Zhou combine awake behaving neurophysiology for a sophisticated, flexible visual-motion discrimination task and a recurrent network model to disentangle the contribution of sensory evaluation and action selection to LIP firing patterns. The correct saccade response direction for preferred motion direction choices is randomly interleaved between contralateral and ipsilateral response targets, which allows the dissociation of perceptual choice from saccade direction.

      The neurophysiological recordings from area LIP indicate non-linear interaction between motion categorisation decisions and saccade choice direction.

      The careful investigation of a recurrent network model suggests that feedback from choice target representations to an earlier sensory evaluation stage might be the source for this non-linear modulation and that it is an important circuit component for behavioural performance.

      The paper presents a possible solution to a central controversy about the role of LIP in perceptual decision-making, but see below.

      Weaknesses:

      The paper presents a possible solution to a central controversy about the role of LIP in perceptual decision-making. However, the authors could be more clear and upfront about their interpretational framework and potential alternative interpretations.

      Centrally, the authors' model and experimental data appear to test only that LIP carries out sensory evaluation in its RFs. The model explicitly parks the representation of choice targets outside the "LIP" module receiving sensory input. The feedback from this separate target representation then provides the non-linear modulation that matches the neurophysiology. However, they ignore the neurophysiological results that LIP neurons can also represent motor planning to a saccade target.

      The neurophysiological results with a modulation of the direction tuning by choice direction (contralateral vs ipsilateral) are intriguing. However, the evaluation of the neurophysiological results is difficult, because some of the information needed to exclude alternative explanations is missing. It would be good to see the actual distributions and sizes of the RFs, which were determined based on visual responses, not with a delayed saccade task. There might be, for example, a simple spatial configuration (e.g., RF and preferred choice target in the same (contralateral) hemifield) for which there is an increase in firing. It is a shame that we do not see what these neurons would do if only a choice target were put in the RF, as has been done in so many previous LIP experiments. The authors also exclude some spatial task configurations (vertical direction decisions), which makes it difficult to judge whether these data and models can be generalised. The whole section is difficult to follow, partly also because it appears to mix reporting results with interpretation (e.g. "feedback").

      The model and its investigation is very interesting and thorough, but given the neurophysiological literature on LIP, it is not clear that the target module would need to be in a separate brain area, but could be local circuitry within LIP between different neuron types.

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, the authors recorded activity in the posterior parietal cortex (PPC) of monkeys performing a perceptual decision-making task. The monkeys were first shown two choice dots of two different colors. Then, they saw a random dot motion stimulus. They had to learn to categorize the direction of motion as referring to either the right or left dot. However, the rule was based on the color of the dot and not its location. So, the red dot could either be to the right or left, but the rule itself remained the same. It is known from past work that PPC neurons would code the learned categorization. Here, the authors showed that the categorization signal depended on whether the executed saccade was in the same hemifield as the recorded PPC neuron or in the opposite one. That is, if a neuron categorized the two motion directions such that it responded stronger for one than the other, then this differential motion direction coding effect was amplified if the subsequent choice saccade was in the same hemifield. The authors then built a computational RNN to replicate the results and make further tests by simulated "lesions".

      Strengths:

      Linking the results to RNN simulations and simulated lesions.

      Weaknesses:

      Potential interpretational issues due to a lack of evidence on what happens at the time of the saccades.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) The neurophysiological results with a modulation of the direction tuning by choice direction are intriguing. However, the evaluation of the neurophysiological results is difficult because some of the information needed to exclude alternative explanations is missing.

      We thank the reviewer for the helpful comments. We have addressed this point in detail in the following response.

      (a) Clearly state in the results how the response field ("RF"), where the stimulus was placed, was mapped. The methods give an "MGS" task ("i.e., spatial selectivity during stimulus presentation and delay") rather than the standard delayed saccade task, and also state: "while for those neurons which did not show a clear RF during the MGS task, we presented motion stimuli in the positions (always in the visual field contralateral to the recorded hemisphere) in which neurons exhibited the strongest response to the motion stimuli." All this sounds more like a sensory receptive field, not an eye movement response field. What was the exact task and criterion?

      We agree with the reviewer that the original description of how we mapped the response fields (RFs) of LIP neurons lacked sufficient detail. In this study, we used the memory-guided saccade (MGS) task to map the RFs of all isolated LIP neurons. Both MGS and delayed saccade tasks are commonly used to map a neuron's response field in previous decision-making studies.

      In the MGS task, monkeys initially fixate on the center of the screen. Subsequently, a dot randomly flashes at one of the eight possible locations surrounding the fixation dot with an eccentricity of 8 degrees, requiring the monkeys to memorize the location of the flashed dot. After a delay of 1000 ms, the monkeys are instructed to saccade to the remembered location once the fixation dot disappears. The MGS task is a standard behavioral task for mapping visual, memory, and motor RFs, particularly in brain regions involved in eye movement planning and control, such as LIP, FEF, and the superior colliculus.

      We believe the reviewer's confusion may stem from whether we mapped the visual, memory, or motor RFs of LIP neurons in the current study, as these "RFs" are not always consistent across individual neurons. In our study, we primarily mapped the visual and memory RFs of each LIP neuron by analyzing their activity during both the target presentation and delay periods. To focus on sensory evaluation-related activity, we presented the visual motion stimulus within the visual-memory RF of each neuron. For neurons that did not show a significant visual-memory RF, we used a different approach: we tested the neurons with the main task by altering the spatial configuration of the task stimuli to identify the visual field that elicited the strongest response when the motion stimulus was presented within it. This approach was used to guide the placement of the stimulus during the recording sessions.

      Following the reviewer’s suggestion, we have added the following clarification to the results section to better describe how we mapped the RF of LIP neurons:

      ‘We used the memory-guided saccade (MGS) task, which is commonly employed in LIP studies, to map the receptive fields (RFs) of all isolated LIP neurons. Specifically, we mapped both the visual and memory RFs of each neuron by analyzing their activity during the target presentation and delay periods of the MGS task (see Methods).’

      (b) l.85 / l126: What do you mean by "orthogonal to the axis of the neural RF" - was the RF shape asymmetric, if so how did you determine this? OR do you mean the motion direction axis? Please explain.

      We realized that the original description of this point may have been unclear and could lead to confusion. The axis of the neural RF refers to the line connecting the center of the RF (which coincides with the center of the motion stimulus) to the fixation dot. We have revised this sentence in the revised manuscript as follows:

      ‘To examine the neural activity related to the evaluation of stimulus motion, we presented the motion stimuli within the RF of each neuron, while positioning the saccade targets at locations orthogonal to the line connecting the center of the RF (which also marks the center of the motion stimulus) and the fixation dot.’

      (c) Behavioural task. Figure 1 - are these example sessions? Please state this clearly. Can you show the examples (psychometric function and reaction times) separated for trials where the correct choice direction aligned with the motion preference (within 90 degrees) and those where it did not?

      Figure 1 shows the averaged behavioral results from all recording sessions. We have added this detail in the revised legend of Figure 1.

      We are uncertain about the reviewer’s reference to the “correct choice direction aligning with the motion preference,” as the term “motion preference” is specific to the neuronal response, which differs between the neurons recorded simultaneously with a multichannel recording probe.

      Nonetheless, following the reviewer’s suggestion, we grouped the trials of one example single-channel recording session into two groups based on the relationship between the saccade direction and the preferred motion direction of the identified LIP neuron. Both the RT and the performance accuracy for this example session are shown in the following figure.

      Author response image 1.

      Give also the performance averaged across all sites included in this study and range. If performance does differ for different configurations, please show that the main modulatory effect does not align with this distinction.

      To clarify this point, we have plotted performance accuracy and RTs for horizontal, oblique, and vertical target position configurations separately, which are shown for both monkeys in the following figures. We did not observe any systematic influences of task configurations on the monkeys' performance accuracy. While the RTs did differ across different configurations, we believe these differences are likely attributable to several factors, such as varying levels of familiarity introduced by our training process and the intrinsic RT difference between different saccade directions.

      Author response image 2.

      (d) Show the distribution of RF positions and the direction preferences for the recording sites included in the quantitative analysis of this study. (And if available, separately those excluded).

      Following the reviewer’s suggestion, we have plotted the centers of the RFs for all neurons with identifiable RFs, categorizing them by their preferred motion directions. To determine each neuron’s RF, we analyzed the average firing rates from both the target presentation and delay periods during each trial of the memory-guided saccade (MGS) task. The RF centers of neurons with significant RFs were determined through a two-step process. First, we selected neurons that exhibited significant RFs in the MGS task based on the following criteria: 1) there must be a significant activity difference between the eight target locations, and 2) the mean activity during the selected periods should be significantly greater than the baseline activity during the fixation period. Second, we fitted the activity data from the eight conditions to a Gaussian distribution, using the center of the fitted distribution as the RF center. A significant proportion of neurons from both monkeys that exhibited significant responses to motion stimuli did not exhibit notable RFs based on our current method. The following figures show the distributions of RFs and motion direction preferences for all LIP neurons with identifiable RFs, separately for each monkey. Since this is not the focus of the current study, we are not planning to include this result in the revised manuscript.

      Author response image 3.
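
      The second step described above (fitting activity across the eight target locations and taking the fitted center as the RF center) can be sketched as follows. This is a minimal illustration, assuming the eight targets lie at equally spaced directions and that tuning is modeled as a baseline plus a Gaussian of angular distance. The function name `fit_rf_center`, the grid-search ranges, and the synthetic firing rates are hypothetical and are not taken from the analysis code used in the study.

```python
import numpy as np


def wrap_deg(d):
    """Wrap angular differences into the interval [-180, 180)."""
    return (d + 180.0) % 360.0 - 180.0


def fit_rf_center(angles_deg, rates, centers=None, widths=None):
    """Fit rates ~ baseline + amp * exp(-d^2 / (2 sigma^2)) by grid search.

    For every candidate center and width, baseline and amplitude are solved
    by linear least squares; the candidate with the smallest squared error
    wins. Returns (center_deg, sigma_deg, baseline, amplitude).
    """
    angles_deg = np.asarray(angles_deg, dtype=float)
    rates = np.asarray(rates, dtype=float)
    if centers is None:
        centers = np.arange(0.0, 360.0, 1.0)   # assumed 1-degree resolution
    if widths is None:
        widths = np.arange(20.0, 121.0, 5.0)   # assumed plausible widths
    best = None
    for mu in centers:
        d = wrap_deg(angles_deg - mu)
        for sigma in widths:
            g = np.exp(-d ** 2 / (2.0 * sigma ** 2))
            design = np.column_stack([np.ones_like(g), g])
            coef, *_ = np.linalg.lstsq(design, rates, rcond=None)
            if coef[1] <= 0:                   # require an excitatory bump
                continue
            sse = float(np.sum((design @ coef - rates) ** 2))
            if best is None or sse < best[0]:
                best = (sse, float(mu), float(sigma), coef[0], coef[1])
    if best is None:
        raise ValueError("no Gaussian with positive amplitude fits the data")
    return best[1:]


# Synthetic example: eight target directions, hypothetical mean rates (spikes/s).
target_angles = np.arange(0.0, 360.0, 45.0)
example_rates = 5.0 + 20.0 * np.exp(-wrap_deg(target_angles - 100.0) ** 2 / (2 * 40.0 ** 2))
center_deg, sigma_deg, baseline, amplitude = fit_rf_center(target_angles, example_rates)
```

      A grid search is used here only to keep the sketch dependency-free; a nonlinear least-squares routine would serve equally well.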

      (e) Following on from d), was there a systematic relationship between RF position or direction preference and modulation by choice direction? For instance could the responses be simply explained by an increase in modulation for choices into the same (contralateral) hemifield as where the stimulus was placed?

      The reviewer raised a good point. To address whether there was a systematic relationship between RF position or direction preference and modulation by choice direction, we calculated a modulation index for each neuron to quantify the influence of saccade direction on neuronal responses to motion stimuli. We then plotted the modulation index against the RF position for each LIP neuron, as shown below:

      Author response image 4.

      As shown in the figures above, neurons with RFs farther from the horizontal meridian were more likely to exhibit stronger modulation by the saccade direction, while neurons with RFs closer to the horizontal meridian showed inconsistent and weaker modulation. This is because when an RF was on the horizontal meridian, the saccade directions were aligned with the vertical axis (with no contralateral or ipsilateral directions). This is consistent with the finding in Figure S3 that there were no significant differences in direction selectivity between the CT and IT conditions in the data sessions where the saccade targets were aligned close to the vertical direction. Since fewer than half of the identified neurons showed clear receptive fields using our method, the figure above did not include all the neurons used in the analysis in the manuscript. Therefore, we chose not to include this figure in the revised manuscript.

      Additionally, we quantified the relationship between the modulation index and direction preference for neurons in sessions where the monkeys’ saccades were aligned to either horizontal or oblique directions. As shown in the following figure, no systematic relationship was found between direction preference and modulation by the choice direction for LIP neurons at the population level.

      Author response image 5.

      We have added this result as Figure S2 in the revised manuscript.

      Notably, the observed modulation of saccade direction on LIP neurons’ response to motion stimuli cannot be simply explained by saccade direction selectivity. We presented two further lines of evidence to rule out this possibility in the original manuscript. First, the modulation effect we observed was nonlinear; specifically, the firing rate of neurons increased for the preferred motion direction but decreased for the non-preferred motion direction (Figure 2I and Figure S1A-D). This phenomenon is unlikely to be attributed to a linear gain modulation driven by saccade directions. Second, we plotted the averaged neural activity for contralateral and ipsilateral saccade directions separately, and found that LIP neurons showed similar levels of activity between the two saccade directions (revised Figure 2L).

      Additionally, we added a paragraph in the Methods section to describe the way we calculated modulation index as follows:

      “We have calculated a modulation index for each neuron to reflect the influence of saccade direction on neuron’s response to visual stimuli. The modulation index is calculated as:

      where represents the average firing rate from 50 ms to 250 ms after sample onset for all contralateral saccade trials with a neuron’s preferred moving direction of visual stimuli. The naming conventions are the same for , , and . An MI value between 0 and 1 indicates higher modulation in contralateral saccade trials, and an MI value between -1 and 0 indicates higher modulation in ipsilateral saccade trials.”
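
      As an illustration of an index with the stated range and sign convention, a minimal sketch might take a normalized contrast between direction selectivity in contralateral and ipsilateral saccade trials. The functional form below is an assumption chosen for illustration, not the formula from the Methods; the function name and argument names (`r_cp`, `r_cn`, `r_ip`, `r_in` for contralateral-preferred, contralateral-nonpreferred, ipsilateral-preferred, and ipsilateral-nonpreferred mean rates) are likewise hypothetical.

```python
def modulation_index(r_cp, r_cn, r_ip, r_in):
    """Hypothetical contrast index bounded in [-1, 1].

    Each argument is a mean firing rate (spikes/s) in the 50-250 ms window
    after sample onset. Positive values mean larger direction selectivity in
    contralateral saccade trials; negative values mean larger selectivity in
    ipsilateral saccade trials. This form is an assumption for illustration.
    """
    ds_contra = abs(r_cp - r_cn)   # direction selectivity, contralateral trials
    ds_ipsi = abs(r_ip - r_in)     # direction selectivity, ipsilateral trials
    total = ds_contra + ds_ipsi
    if total == 0:
        return 0.0                 # no selectivity in either condition
    return (ds_contra - ds_ipsi) / total


# Example with made-up rates: stronger selectivity on contralateral trials.
example_mi = modulation_index(30.0, 10.0, 25.0, 15.0)
```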

      Please split Figures 2G,H,I J,K, by whether the RF was located contralaterally or ipsilaterally. If there are only a small number of ipsilateral RFs, please show these examples, perhaps in an appendix.

      This is a reasonable suggestion; however, it is not applicable to our study. Among all the neurons included in our analysis, only one neuron from each monkey exhibited ipsilateral receptive fields (RFs). Therefore, we believe it may not be necessary to plot the result for this outlier.

      (f) Were the choice targets always equi-distant from the stimulus and at what distance was this? Please give quantitative details in methods.

      The reviewer is correct that the choice targets were always equidistant from the stimulus. The distance between the motion stimulus and the target was typically 12-15 degrees. We have added the details in the revised Methods section as follows:

      ‘Therefore, the two saccade targets were equidistant from the stimulus, with the distance typically ranging from 12 to 15 degrees.’

      (2) For Figure 3E, how do you explain that there is an up regulation of for contralateral choices before the stimulus onset, i.e. before the animal can make a decision? Is this difference larger for error trials?

      This is a good question, which we have attempted to clarify in the revised manuscript. We believe that the observed upregulation in neural activity for contralateral choices may reflect the monkeys’ internal choice bias or expectation (choice between two motion directions) prior to stimulus presentation, which could influence their subsequent decisions. In Figure 3E, we calculated the r-decision to assess the correlation between the neuron’s direction selectivity and the monkeys’ decisions on motion stimuli, separately for contralateral and ipsilateral choice conditions. The increased r-decision during the pre-stimulus period indicates stronger neural activity for trials in which the monkeys later reported that the upcoming stimulus was in the preferred direction, and weaker activity for trials where the stimulus was judged to be in the non-preferred direction. This correlation was more pronounced for contralateral choices than for ipsilateral ones. It is important to note that while the monkeys cannot predict the upcoming stimulus direction with greater-than-chance accuracy, these results suggest that pre-stimulus neural activity in LIP is correlated with the monkeys’ eventual decision for that trial. Furthermore, LIP neural activity was more strongly correlated with the monkeys’ decisions in the contralateral choice condition compared to the ipsilateral one.

      Additionally, we clarify that the r-decision was calculated using both correct and error trials. When comparing Figure 2J with Figure 2K, the correlation between neural activity and the monkeys’ upcoming decision during the pre-stimulus period was most prominent in low- and zero-coherence trials, where the monkeys either made more errors or based decisions on guesswork. We infer that the monkeys' confidence in these decisions was likely lower compared to high-coherence trials. Thus, the decision process appears to be influenced by pre-stimulus neural activity, particularly in low-coherence and zero-coherence trials.

      Although it is unclear precisely what covert process this pre-stimulus activity reflects, similar patterns of choice-predictive pre-stimulus activity have been observed in LIP and other brain areas (Shadlen, M.N. and Newsome, W.T., 2001; Coe, B. et al., 2002; Basso, M.A. and Wurtz, R.H., 1998; Williams, Z.M. et al., 2003). We have clarified this point in the revised manuscript, including a revision of the relevant sentence in the Results section for clarity, shown as follows:

      “Furthermore, we used partial correlation analysis to examine decision- and stimulus-related components of DS (i.e., r-decision and r-stimulus, Figure 3E and 3F) using all four coherence levels. The decision-related component of LIP DS was significantly greater in the CT condition than in the IT condition (Figure 3E; nested ANOVA: P = 1.07e-6, F = 25.72), and this difference emerged even before motion stimulus onset. This suggests that the LIP DS was more closely correlated with monkeys’ decisions in the CT condition than in the IT condition. The upregulation in r-decision for contralateral choices may reflect the monkeys’ internal choice bias or expectation (choice between two motion directions) prior to stimulus presentation, which could influence their subsequent decisions more in the CT condition.”
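
      The partial correlation mentioned above separates the decision-related component (r-decision, controlling for the stimulus) from the stimulus-related component (r-stimulus, controlling for the decision). A minimal sketch using the standard first-order partial correlation formula is given below. The variable coding (+1/-1 for direction and choice), the simulated neuron, and the function name `partial_corr` are assumptions for illustration and do not reproduce the analysis pipeline of the study.

```python
import numpy as np


def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z.

    Uses r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)).
    """
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Simulated trials (hypothetical): the firing rate follows the decision,
# and the decision agrees with the stimulus on about 80% of trials.
rng = np.random.default_rng(0)
stimulus = rng.choice([-1.0, 1.0], size=2000)
flipped = rng.random(2000) < 0.2
decision = np.where(flipped, -stimulus, stimulus)
rate = 10.0 + 3.0 * decision + rng.normal(0.0, 1.0, size=2000)

r_decision = partial_corr(rate, decision, stimulus)  # decision-related part
r_stimulus = partial_corr(rate, stimulus, decision)  # stimulus-related part
```

      In this simulated case the rate depends only on the decision, so r-decision is large while r-stimulus stays near zero.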

      (3) Figure 2K: what is the very large condition-independent contribution? It almost seems as most of what these neurons code for is neither saccade or motion related.

      The condition-independent contribution is the time-dependent component that is unrelated to saccade, motion, or their interaction. Our findings are consistent with previous methodological studies, where this time-dependent component was shown to account for a significant portion of the variance in population activity (Kobak, D. et al., 2016).

      (4) Abstract:

      a) "We found that the PPC activity related to monkeys' abstract decisions about visual stimuli was nonlinearly modulated by monkeys' following saccade choices directing outside each neuron's response field."

      This sentence is not clear/precise in two regards:

      Should "directing" be "directed"?

      Also, it is not just saccades directed outside the RF, but towards the contralateral hemifield.

      We thank the reviewer for the suggestion. We agree that ‘directing’ should be ‘directed’ and revised it accordingly. However, we do not believe that ‘directed outside each neuron's response field’ should be replaced with “towards the contralateral hemifield”. There are two major reasons. First, the modulation effect was identified as the difference between contralateral and ipsilateral saccade directions. We cannot conclude that the modulation mainly happened in the contralateral saccade direction. Second, we used ‘directed outside each neuron's response field’ to emphasize that this modulation cannot be simply explained by saccade direction selectivity, whereas ‘towards the contralateral hemifield’ cannot fulfill this purpose.

      (b) " Recurrent neural network modeling indicated that the feedback connections, matching the learned stimuli-response associations during the task, mediated such feedback modulation."

      - should be "that feedback connection .... might mediate". A model can only ever give a possible explanation.

      Thanks for the help on the writing again! We have revised this sentence as follows: “Recurrent neural network modeling indicated that the feedback connections, matching the learned stimuli-response associations during the task, might mediate such feedback modulation.”

      (c) "thereby increasing the consistency of flexible decisions." I am not sure what is really meant by increasing the consistency of flexible decisions? More correct or more the same?

      We apologize for the confusion. In the manuscript, "decision consistency" refers to the degree of agreement in the model's decisions under specific conditions. A higher decision consistency indicates that the model is more likely to produce the same choice when it encounters a stimulus in that condition. We have incorporated your suggestion and revised this sentence to “thereby increasing the reliability of flexible decisions”. We also clarified the definition of consistency in the main text as follows:

      “These disrupted patterns of saccade DS observed in the target module following projection-specific inactivation aligned with the decreased decision consistency of RNNs, where decision consistency reflects the degree of agreement in the model's choices under specific task conditions. This suggests a diminished reliance on sensory input and an increased dependence on internal noise in the decision-making process.”
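
      To make the notion of decision consistency concrete, one simple way to quantify agreement among repeated choices under a single task condition is the fraction of trials that match the most frequent choice. This is a hedged sketch under that assumption; the exact metric used for the RNNs is not specified here, and the function name `decision_consistency` and the choice labels are hypothetical.

```python
from collections import Counter


def decision_consistency(choices):
    """Fraction of trials agreeing with the most frequent choice.

    `choices` is a sequence of choice labels from repeated trials of one
    task condition. The value is 1.0 when every trial gives the same choice
    and approaches 1 / (number of options) when choices are evenly split.
    """
    if len(choices) == 0:
        raise ValueError("at least one trial is required")
    counts = Counter(choices)
    return counts.most_common(1)[0][1] / len(choices)


# Example with made-up model outputs for one motion condition.
example_consistency = decision_consistency(["left"] * 9 + ["right"])
```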

      (5) Results: headers should be changed to reflect the actual results, not the interpretation:

      "Nonlinear feedback modulation of saccade choice on visual motion selectivity in LIP"

      "Feedback modulation specifically impacted the decision-correlated activity in LIP"

      These first parts of the results describe neurophysiological modulations of LIP activity, the source cannot be known from the presented data alone. I thought that this feedback is suggested by the modelling results in the last part of the results. It is confusing to the reader that the titles already refer to the source of the modulation as "feedback". The titles should more accurately describe what is found, not pre-judge the interpretation.

      We thank the reviewer for those valuable suggestions. We have updated the subtitles to: “Nonlinear modulation of saccade choice on visual motion selectivity in LIP” and “Decision-correlated but not stimulus-correlated activity was modulated in LIP.”

      (6) page 8, l366-380. Can you link the statements more directly to panels in Figure 6. For Figure 6H-K, it needs to be clarified that the headers for 6D-G also apply to H-K.

      We have added headers for Figure 6H-K in the revised version, and revised the corresponding results section as follows.

      ‘We further examined how the energy landscape in the 1-D subspace changed in relation to task difficulty (motion coherence). Consistent with prior findings, trials with lower decision consistency (trials using lower motion coherence) exhibited shallower attractor basins at the time of decision for all types of RNNs (Fig. 6H-K). However, both the depth and the positional separation of attractor basins in the network dynamics significantly decreased for all non-zero motion coherence levels after the ablation of all feedback connections (comparing Figure 6I with Figure 6H; P(depth) = 5.20e-25, F = 122.80; P(position) = 1.82e-27, F = 137.75; two-way ANOVA). Notably, this reduction in basin depth and separation was more pronounced in the specific group compared to the nonspecific groups after ablating the feedback connections (comparing Figure 6J with Figure 6K; P(depth) = 2.65e-13, F =57.35; P(position) = 3.73e-14, F = 61.79; two-way ANOVA). These results might underlie the computational mechanisms that explain the observed reduction in the decision consistency of RNNs following projection-specific inactivation: the shallower and closer attractor basins after ablating feedback connections resulted in less consistent decisions. This happened because the variability in neural activity made it more likely for population activity to stochastically shift out of the shallower basins and into nearby alternative ones.’

      (7) line 556-557: Please provide a reference or data for the assertion that nearby recording sites in LIP (100 microns apart) have similar RFs.

      The reviewer raised an interesting question that we are unable to address in depth with the current data, as we lack information on the specific cortical location for each recording session. In the original manuscript, we suggested that nearby recording sites in LIP have similar receptive fields (RFs), based on both our own experience with LIP recordings and previous studies. Specifically, we observed that neurons recorded within a single penetration using a single-channel electrode typically exhibited similar RFs. Similarly, the majority of neurons recorded from the same multichannel linear probe within a single session also showed comparable RFs. Additionally, several studies (both electrophysiological and fMRI) have reported topographic organization of RFs in LIP (Gaurav H. Patel et al., 2010; S. Ben Hamed et al., 2001; Gene J. Blatt et al., 1990).

      (8) Line 568, Methods: a response criterion of a maximum firing rate of 2 spikes/s seems very low, especially for LIP. How do the results change if this lifted to something more realistic like 5 spikes/s or 10 spikes/s?

      We chose this criterion to ensure we included as many neurons as possible in our analysis. To further clarify, we have plotted the distribution of maximum firing rates across all neurons. Based on our findings, raising this criterion is unlikely to affect the results, as the majority of neurons exhibit maximum firing rates well above 5 spikes/s, and many exceed 10 spikes/s. We hope this explanation addresses the concern.

      Author response image 6.

      Reviewer #2 (Recommendations For The Authors):

      In this manuscript, the authors recorded activity in the posterior parietal cortex (PPC) of monkeys performing a perceptual decision-making task. The monkeys were first shown two choice dots of two different colors. Then, they saw a random dot motion stimulus. They had to learn to categorize the direction of motion as referring to either the right or left dot. However, the rule was based on the color of the dot and not its location. So, the red dot could either be to the right or left, but the rule itself remained the same. It is known from past work that PPC neurons would code the learned categorization. Here, the authors showed that the categorization signal depended on whether the executed saccade was in the same hemifield as the recorded PPC neuron or in the opposite one. That is, if a neuron categorized the two motion directions such that it responded stronger for one than the other, then this differential motion direction coding effect was amplified if the subsequent choice saccade was in the same hemifield. The authors then built a computational RNN to replicate the results and make further tests by simulated "lesions".

      The data are generally interesting, and the manuscript is generally well written (but see some specific comments below on where I was confused). However, I'm still not sure about the conclusions. The way the experiment is setup, the "contra" saccade target is essentially in the same hemifield as the motion patch stimulus. Given that the RF's can be quite large, isn't it important to try to check whether the saccade itself contributed to the effects? i.e. if the RF is on the left side, and the "contra" saccade is to the left, then even if it is orthogonal to the location of the stimulus motion patch itself, couldn't the saccade still be part of a residual edge of the RF? This could potentially contribute to elevating the firing rate on the preferred motion direction trials. I think it would help to align the data on saccade onset to see what happens. It would also help to have fully mapped the neurons' movement fields by asking the monkeys to generate saccades to all screen locations in the monitor. The authors mention briefly that they used a memory-guided saccade task to map RF's, but it is also important to map with a visual target. And, in any case, it would be important to show the mapping results aligned on saccade onset.

      Another comment is that the authors might want to mention this other recent related paper by the Pack group: https://www.biorxiv.org/content/10.1101/2023.08.03.551852v2.full.pdf

      We thank the reviewer for the comments and realize that we did not explain our results clearly in the original manuscript. We agree with the reviewer that saccade direction selectivity might be a confounding factor for the modulation of LIP neurons’ responses to visual motion stimuli by saccade choice direction, because the RFs of LIP neurons might be large and the saccade target might be presented within the edge of the RFs. However, we believe that the observed modulation of saccade direction on LIP neurons’ response to motion stimuli cannot be simply explained by saccade direction selectivity. We presented several pieces of evidence to rule out this possibility. First, the modulation effect we observed was not linear; specifically, the firing rate of neurons increased for the preferred motion direction but decreased for the non-preferred motion direction (Figure 2I and Figure S1A-D). This phenomenon is unlikely to be attributed to a linear gain modulation driven by saccade directions. Second, we plotted the averaged neural activity for contralateral and ipsilateral saccade directions separately, aligned the activity to either motion stimulus onset or saccade onset, and found that LIP neurons showed similar levels of activity between the contralateral and ipsilateral directions (revised Figure 2L), which is not consistent with obvious saccade direction selectivity.

      To better control for this confound, we have added figures plotting the mean neural activity aligned to saccade onset for both contralateral and ipsilateral saccades, which are now included in the revised main Figure 2. These figures are presented in the detailed response below. Additionally, we have revised the corresponding results section to clarify our points, as outlined below:

      “Figure 2A-2F shows three example LIP neurons that exhibited significant motion coherence correlated DS. Surprisingly, LIP neurons showed greater DS in the CT condition than in the IT condition, even though the same motion stimuli were used in the same spatial location for both conditions. The averaged population activity showed this DS difference between CT and IT conditions for all four coherence levels (Figure 2G, 2H). During presentation of their preferred motion direction, LIP neurons showed significantly elevated activity in the CT relative to the IT at all coherence levels (Figure S1A, S1B, nested ANOVA: P(high) = 0.0326, F = 4.65; P(medium) = 0.0088, F = 7.03; P(low) = 0.0076, F = 7.32; P(zero) = 0.0124, F = 6.4), and a trend toward lower activity to the nonpreferred direction for CT vs. IT (Figure S1C, S1D, nested ANOVA: P(high) = 0.0994, F = 2.75; P(medium) = 0.0649, F = 3.12; P(low) = 0.0311, F = 4.73; P(zero) = 0.0273, F = 4.96). Most of the LIP neurons (48 of 83) showed such opposing trends in activity modulation between the preferred and nonpreferred directions (Figure 2I). These results indicated a nonlinear modulation of saccade choice on motion DS in LIP, aligned precisely with the response property of each neuron. This is unlikely to be driven by a linear gain modulation of saccade direction selectivity. Receiver operating characteristic (ROC) analysis further confirmed significantly greater motion DS in the CT condition than in the IT condition (Figure 2J and 2K; nested ANOVA: P(high) = 5.0e-4, F = 12.44; P(medium) = 9.53e-6, F = 20.91; P(low) = 9.33e-7, F = 26.03; P(zero) = 2.56e-8, F = 34.3). Such DS differences were observed even before stimulus onset. Moreover, LIP neurons exhibited similar levels of mean activity between different saccade directions (CT vs. IT) before monkeys’ saccade choice (Figure 2L), further supporting that saccade direction selectivity did not significantly contribute to the observed modulation of LIP neurons’ responses to motion stimuli.”

      We also thank the reviewer for pointing out this relevant study, which we had missed. We have added the suggested reference in the revised discussion section as follows:

      ‘A recent study demonstrated that neurons in the middle temporal area responded more strongly to motion stimuli when monkeys saccaded toward their RFs in a standard decision task with a fixed mapping between motion stimuli and saccade directions. This modulation emerged through the training process and contributed causally to the monkeys' following saccade choices. Consistently, we found that the response of LIP neurons to motion stimuli was more strongly correlated with the monkeys' decisions in the CT condition (saccades toward RFs) than in the IT condition, in a more flexible decision task. Together, these results suggest that the modulation of action selection on sensory processing may be a general process in perceptual decision-making. However, the observed modulation of saccade direction on LIP neurons' responses to motion stimuli cannot be simply explained by saccade direction selectivity. Several lines of evidence argue against this possibility. First, the modulation effect was nonlinear; specifically, neuronal firing rates increased for preferred motion directions but decreased for non-preferred directions (Figure 2I and Figure S1). This pattern is unlikely to be driven by a linear gain modulation based on saccade directions. Second, we found that LIP neurons exhibited similar levels of activity in both the CT and IT conditions (Figure 2L), which is inconsistent with the presence of clear saccade direction selectivity.’

      Some more specific comments are below:

      - I had a bit of a hard time with the abstract. It does not appear to be crystal clear to me, and it is the first thing that I am reading after the title. For example, if there is a claim about both perceptual decision-making and later target selection, then I feel that the task should be explained a bit more clearly than saying "flexible decision" task. Also, "..modulated by monkeys' following saccade choices directing outside each neuron's response field" was hard to read. It needs to be rewritten. Maybe just say "...modulated by the subsequent eye movement choices, even when these eye movement choices always directed the eyes away from the recorded neuron's response field". Also, I don't fully understand what "selectivity-specific feedback" means. Then, the concept of "consistency" in flexible decisions is brought up, again without much context. The above are examples of why I had a hard time with the abstract.

      We realize that our original statement may have been unclear and potentially caused confusion for the readers. Following the reviewer’s suggestions, we have revised the abstract as follows:

      ‘Neural activity in the primate brain correlates with both sensory evaluation and action selection aspects of decision-making. However, the intricate interaction between these distinct neural processes and their impact on decision behaviors remains unexplored. Here, we examined the interplay of these decision processes in posterior parietal cortex (PPC) when monkeys performed a flexible decision task, in which they chose between two color targets based on a visual motion stimulus. We found that the PPC activity related to monkeys’ abstract decisions about visual stimuli was nonlinearly modulated by their subsequent saccade choices, which were directed outside each neuron’s response field. Recurrent neural network modeling indicated that the feedback connections, matching the learned stimuli-response associations during the task, might mediate such feedback modulation. Further analysis on network dynamics revealed that selectivity-specific feedback connectivity intensified the attractor basins of population activity underlying saccade choices, thereby increasing the reliability of flexible decisions. These results highlight an iterative computation between different decision processes, mediated primarily by precise feedback connectivity, contributing to the optimization of flexible decision-making.’

      Specifically, selectivity-specific feedback refers to the feedback connections with positive or negative weights between selectivity-matched and selectivity-nonmatched unit pairs, respectively.

      Regarding "decision consistency," we define it as the degree to which the model’s decisions remain congruent under specific conditions. A higher level of decision consistency indicates that the model is more likely to produce the same choice each time it is presented with a stimulus under those conditions; in other words, it reflects decision reliability. We have revised the corresponding results section to make these concepts clearer.

      - Line 69: I'm not fully sure, but I think that some people might suggest that superior colliculus is also involved in the sensory aspect of the evaluation. But, I guess the sentence itself is correct as you write it. So, I don't think anyone should argue with it. However, if someone does argue with it, then they would flag the next sentence, since if the colliculus does both, then do the sensory and motor parts really employ distinct neural processes? Anyway, I think this is very minor.

      This is an interesting point. We have also noticed a recent study that demonstrates that the superior colliculus is causally involved in the sensory aspect of decision-making, specifically in visual categorization. However, the study also distinguishes between neural activity related to categorical decisions and that related to saccade planning. This suggests that the sensory and motor aspects of decision-making likely involve distinct neural processing, even within the same brain region—potentially reflecting separate populations of neurons. Therefore, we stand by our statement in the ‘next sentence’.

      - Line 79-80: you might want to look at this work because I feel that it is relevant to cite here: https://www.biorxiv.org/content/10.1101/2023.08.03.551852v2

      We have discussed this reference in the revised discussion section of the manuscript; please refer to the response above.

      - For a result like that shown in Fig. 2, I feel that it is important to show RF mapping with a saccade task alone. i.e. for the same neurons, have a monkey make a delayed visually guided saccade task to all possible locations on the display, and demonstrate that there is no modulation by saccades to the targets. Otherwise, the result in Fig. 2 could reflect first an onset response by a motion, and then the saccade-related response that would happen anyway, even without the decision task. So, I feel that now, it is not entirely clear whether the result reflects this so-called feedback modulation, or whether simply planning the saccade to the target itself activates the neurons. With large RF's, this is a distinct possibility in my opinion.

      - Line 174: this would also be predicted if the neuron's were responding based on the saccade target plan independent of the motion stimulus

      - On a related note, I would recommend plotting all data also aligned on saccade onset. This can help establish what the cause of the effects described is

      We understand the reviewer’s concern that the modulation might be related to saccade planning, and we acknowledge that the original manuscript might not have adequately addressed this potential confound. Unfortunately, we did not map the LIP neurons' receptive fields (RFs) using a saccade-only task. However, as mentioned earlier, we believe that the modulation of LIP neurons' responses to motion stimuli based on saccade choice direction cannot be simply attributed to saccade direction selectivity. Several lines of evidence support this conclusion. First, the modulation we observed was nonlinear: the firing rate of neurons increased for the preferred motion direction but decreased for the non-preferred motion direction (Figure 2I and Figure S1A-D). This pattern is inconsistent with a simple linear gain modulation driven by saccade direction selectivity. Second, we directly compared LIP neuronal activity for contralateral and ipsilateral target conditions, and found no significant differences between the two. This suggests that saccade direction selectivity is unlikely to be the primary contributor to the observed modulation. In the revised figure, we added a plot (Figure 2L) that aligns neural activity to saccade onset, in addition to the original alignment to motion stimulus onset (Figure S1E). This new analysis further supports our interpretation.

      Author response image 7.

      - Even when reading the simulation results, I'm still not 100% sure I understand what is meant by this idea of "consistency" of flexible decision-making

      We addressed this issue in our reply to a previous comment; please refer to the response above.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We would like to thank you and the two reviewers for their constructive feedback on our manuscript entitled: "Substrate evaporation drives collective construction in termites".

      Here, we submit a revised version in which, we believe, we fill in the missing details identified by the reviewers and clarify the presentation of our results.

      From the eLife assessment we can identify a few main points that the reviewers found unclear or not well developed in our previous manuscript:

      • Insufficient details about computer simulation models. Is the match between simulations and experiments qualitative or quantitative?

      • Request for clarifications related to the wall stimulus: is evaporation stronger at the high-curvature wall corners or similar along all the wall edge? Why is there less consistency in the experimental results with the wall stimulus, with a minority of wall experiments in which something different happens?

      • Quantitative estimation of the humidity gradients in our experimental setup.

      • "Confirmation" that termites can sense humidity gradients of magnitude and scale comparable with those encountered in our experiments.

      • Request for additional background information about the considered termite species and their construction habits.

      The reviewers also made a number of interesting suggestions and other comments:

      • Suggestion of possible explanations and interpretations for a purported discrepancy with a previous work by Calovi and collaborators.

      • Suggestion of alternative experimental approaches (array of probes, alternative experimental setups).

      We address all these points below.

      Details about computer simulation models

      There are two different types of computer simulations in our experiments: 1. simulations of evaporation on the initial structure, and 2. simulations of structure growth based on curvature.

      1) Simulations of evaporation. We recall that these simulations rely on the hypothesis that humidity transport is diffusive, that is, the evaporation rate is proportional to the humidity gradient. New details on the implementation of these diffusive simulations are now added in section S.VI. We also adapted figures 4A and 4B, which are now expressed in units more comparable to the expected humidity field in experiments. Essentially, we show that the model underestimates the absolute magnitude of the humidity gradient |∇ℎ| in our setup while it correctly predicts the relative importance of the same field across the topography.

      First, it is instructive to report the value of |∇ℎ| predicted by diffusive simulations with the bottom boundary at 100% humidity (like the clay disk) and the top boundary of the simulation box at 70%, like our experimental room. Note that, at a given temperature, relative humidity and absolute humidity are proportional, so we will assume here that temperature is constant and always refer to relative humidity. Thus, the humidity gradient will be measured in mm−1, exactly like curvature. One then has:

      • flat disk, |∇ℎ| ∼0.01mm−1

      • wall tips, |∇ℎ| ∼0.13mm−1

      • wall top edge, |∇ℎ| ∼0.1mm−1

      • pillar tips, |∇ℎ| ∼0.19mm−1

      First, we remark that the value of |∇ℎ| on the flat portion of the disk is about 10 times smaller than the estimate |∇ℎ|0 ∼0.15mm−1 of the same quantity in our experiments, which is now given in the manuscript and discussed in a specific paragraph below. This discrepancy arises because our simulations overestimate the size of the diffusive region (i.e. the simulation box), which is set to 18mm, while we expect the diffusive layer to be much thinner (i.e. 𝛿 ∼2mm). As in all diffusive problems, the gradient of the diffusing field at any point of the bottom boundary (i.e. on the clay surface) depends on the distance of that point from the top boundary, where the value of the field is imposed: the closer the boundaries, the stronger the gradient. Note also that, in principle, the size of the simulation box affects not only the overall magnitude of the humidity gradient but also its shape. However, in our simulations the topographic cues are only 30% closer to the top boundary than the flat bottom surface, yet the local gradient there is 10 to 20 times larger. This suggests that the ’curvature’ effect is much stronger than the ’distance’ effect, and supports the conclusion that our approximation does not significantly affect the estimation of the relative importance of the humidity gradient at the bottom surface. We therefore conclude that our diffusive simulations do not provide a correct estimate of the order of magnitude of |∇ℎ|, but capture its relative variations across the topography well.

      2) Structure growth based on curvature. As observed by the reviewer, the dynamical simulations included here refer to a model that was developed in a previous study, so we chose not to include all the details of the simulations in the present one. At this stage, that model is still phenomenological: for example, we cannot provide a physical estimate of the dimensionless parameter 𝑑, which controls the typical size of the structure produced by the simulations. Thus, in principle, the comparison with real experiments can only be "qualitative". Indeed, pushing such a comparison further is not necessarily of interest, given the minimal, mean-field character of our model and the extreme complexity of the natural system studied here. However, our experimental setup was specifically designed to overcome this limit, by using topographies where the curvature cues were modulated in an almost discrete way, with flat regions and regions where curvature is strong ’for termites’, i.e. the curvature radius is of the order of termite body size. Our experimental results support this choice, because deposition patterns also show an almost ’discrete’ shape, with specific regions attracting most of the deposition actions. Thus, we claim that the significance of the agreement is strong, and we suggest that when stimuli and response both behave in a quasi-discrete manner, the difference between qualitative and quantitative is not well defined. Finally, we recall that in all the discussion above curvature and humidity gradient can be exchanged, as we already pointed out in the manuscript. Consistently, the humidity gradient shows a strong variation between the curved regions and the flat ones.

      Results with the wall stimulus

      One important point coming out of the reviews is that we did not clearly present the results with the wall stimulus. These concerns are best summarized by a comment from reviewer 2, who states: “evaporation rates seem inconclusive in the wall geometry, yet the termites still deposit material at the high-curvature wall corners”.

      We acknowledge that the interpretation of the results of experiments with the wall stimulus must address three key points: 1- Salt deposition experiments are inconclusive in showing variation of the evaporation rate across the top of the wall; 2- A portion (4/11) of termite experiments do not show a clear pellet deposition pattern by termites; 3- Conversely, in the remaining portion (7/11), most experiments still show a clear pellet deposition on the corners of the wall, in spite of small differences in evaporation between the corners and the top edge (like in our Fig. 3B). These points are now addressed in the manuscript and discussed below.

      The variation of the humidity gradient between the corners of the wall and the wall’s top edge is relatively small, while both are regions of relatively high curvature and higher evaporation compared to the flat surface of the clay disk. We now report precise values of the humidity gradient from numerical simulations, as discussed above. These indicate that the humidity gradient at the wall corners and upper edge is respectively 10 and 7 times larger than on the flat bottom, but evaporation at the wall tips is only 0.3 times larger than on the wall upper edge.

      Experiments with the saline solution qualitatively confirm the same result of an evaporation pattern more evenly distributed on the wall stimulus (point 1) than on the pillars.

      Taken together, these results might explain why not all wall experiments end up with depositions at the tips (point 2): simply, in the wall experiments the relative importance of the deposition cue between tips and wall upper edge is not high enough to always guide termite behavior in a deterministic way.

      But we should also point to the fact that the evaporation simulations presented in figure 4 and the experiments with the saline solution both reflect the humidity field on the clay templates before termite construction has started. As soon as termites start adding pellets to the wall, effectively starting to build a pillar, the humidity gradient will be reinforced at the locations of pellet deposition, and a self-reinforcing process is initiated, similar to our dynamical simulations based on local curvature. This explains why eventually termite activity can result in clear and localized depositions (point 3) also with the wall stimulus.

      Incidentally, we would like to include here another consideration: the nests of Coptotermes termites comprise a “scaffold” with multiple interconnected pillars. In other termite genera, the prevalent nest structure is one made of surfaces rather than pillars (such as in Nasutitermes, Apicotermes, Psammotermes, or again some fungus-growing structures in Macrotermes and Synacanthotermes). The fact that the wall stimulus presents some potential to stimulate construction everywhere on its edge is intriguing, as it might provide some cues on the construction of different nest architectures.

      Quantitative estimation of the humidity gradient in our setup

      The moisture gradients in our experiments and simulations were only presented in a non-quantitative manner, because we were mainly interested in identifying locations of high and low evaporation. However, by combining the scaling arguments already discussed in S.IX with the results of our evaporation simulations, one can produce a lower bound for the magnitude of the humidity gradient |∇ℎ|, predict its higher values at key positions in our setup, and compare it with the humidity variations experienced by termites in their natural environment. These considerations are now included in the manuscript and discussed below.

      First, we define a reference value |∇ℎ|0 for the humidity gradient on the (flat) clay disk, which can be estimated using the boundary layer thickness 𝛿 ∼2mm (see section IX.A of the SI) and the variation of relative humidity Δℎ between the clay disk surface and the exterior, which was Δℎ =30% (the difference between the fully wetted substrate and room air humidity at 70% saturation). Note that |∇ℎ|0 constitutes a lower bound for the expected values of the humidity gradient in our setup, as confirmed by our experiments with saline solution. We can then write:

      |∇ℎ|0 ∼ Δℎ/𝛿 = 0.3/2mm = 0.15mm−1
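      The scaling estimate can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a one-dimensional linear (purely diffusive) humidity profile between two boundaries at fixed humidity; the function and variable names are illustrative and are not taken from the simulation code. It also applies the same formula to the 18mm simulation box to show why a thicker diffusive region gives a weaker gradient on the flat disk.

```python
def humidity_gradient(delta_h: float, thickness_mm: float) -> float:
    """|grad h| in mm^-1 for a humidity drop delta_h (as a fraction) across thickness_mm.

    ASSUMPTION: a planar, steady-state diffusive layer, so the profile is linear.
    """
    if thickness_mm <= 0:
        raise ValueError("thickness_mm must be positive")
    return delta_h / thickness_mm


DELTA_H = 1.00 - 0.70          # wet clay surface (100%) minus room air (70%), from the text
BOUNDARY_LAYER_MM = 2.0        # boundary layer thickness delta, from the text
SIM_BOX_MM = 18.0              # height of the simulation box, from the text
CURVATURE_ENHANCEMENT = 10.0   # "at least 10 times larger" at curved regions, from the text

# Lower bound on the flat disk, using the thin physical boundary layer.
grad_lower_bound = humidity_gradient(DELTA_H, BOUNDARY_LAYER_MM)
# Same formula with the much thicker simulation box.
grad_sim_flat = humidity_gradient(DELTA_H, SIM_BOX_MM)
# Order-of-magnitude value at highly curved regions.
grad_upper_estimate = CURVATURE_ENHANCEMENT * grad_lower_bound

print(f"|grad h|_0 with 2 mm layer : {grad_lower_bound:.3f} mm^-1")
print(f"flat disk with 18 mm box   : {grad_sim_flat:.3f} mm^-1")
print(f"ratio (layer vs box)       : {grad_lower_bound / grad_sim_flat:.1f}")
print(f"curved regions (x10)       : {grad_upper_estimate:.2f} mm^-1")
```

      Under these assumptions the thin-layer estimate is 0.15mm−1, nine times the value obtained with the 18mm box, and the enhanced value at curved regions is of order 1mm−1.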

      Next, the results of the diffusive simulations shown in figures 4A and 4B indicate that the humidity gradient at highly curved regions of the topographic cues is at least 10 times larger than |∇ℎ|0, which allows us to estimate an upper bound for |∇ℎ| in our experimental setup, say |∇ℎ|𝑚𝑎𝑥 ∼1mm−1.

      Humidity sensing capabilities of termites

      Our hypothesis that humidity gradients could guide termite building behavior implicitly assumes that termites can sense humidity gradients comparable with those existing in our experiments.

      Humidity is important to all termites because of their small size and unsclerotized body. Coptotermes termites in particular are wetwood termites that can only survive in high-humidity environments such as moist wood or soil. It is well documented that Coptotermes termites (like other termites and cockroaches) have humidity receptors in their antennae, and behavioral studies indicate that they can discriminate between chambers with different humidity content.

      For example, a study by Gautam and Henderson (2011, Environmental Entomology, 40:1232) provided chambers with different relative humidity and, after 12 hours, almost all termites were in the highest humidity chamber (98% RH), leaving the other chambers with 75% or less RH empty. These results (which are also similar to other results testing termite response to chambers with different soil moisture) indicate that, given a sufficient amount of time, termites can detect a difference of humidity from 75% to 98% over a spatial scale of centimeters.

      The quantitative estimation of the humidity gradient described above indicates that in our experimental setup termites can experience humidity variations of 15% over a distance of only 1mm or even less, while the length of a single termite antenna is about 1.5 mm.

      In other words, the humidity gradients that we estimate for our experiments are well above those that termites were able to discriminate in previous experiments. Future experiments should aim to test the exact limits of resolution of the humidity-sensing ability of termites (e.g. in an environment where humidity is close to 100% everywhere), and the mechanisms by which they sense the gradient (e.g. comparing information from the two antennae, or integrating humidity information over time).
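      The comparison with the choice-chamber results can be made explicit with a short calculation. This is only an illustrative sketch: the text gives the chamber spacing as "centimeters", so the 10mm value below is an assumption introduced for the example, and the variable names are hypothetical.

```python
ARENA_GRADIENT = 0.15        # fraction of RH per mm (15% over 1 mm), from the text
ANTENNA_LENGTH_MM = 1.5      # approximate length of one antenna, from the text

CHAMBER_RH_HIGH = 0.98       # highest-humidity chamber, from the text
CHAMBER_RH_LOW = 0.75        # lower-humidity chambers, from the text
CHAMBER_SCALE_MM = 10.0      # ASSUMPTION: "centimeters" taken as 1 cm; not stated in the text

# Mean gradient termites resolved between chambers (fraction of RH per mm).
chamber_gradient = (CHAMBER_RH_HIGH - CHAMBER_RH_LOW) / CHAMBER_SCALE_MM
# Humidity difference along a single antenna in the arena.
drop_along_antenna = ARENA_GRADIENT * ANTENNA_LENGTH_MM

print(f"chamber gradient       : {chamber_gradient:.3f} mm^-1")
print(f"arena gradient         : {ARENA_GRADIENT:.3f} mm^-1")
print(f"arena / chamber        : {ARENA_GRADIENT / chamber_gradient:.1f}")
print(f"RH drop along antenna  : {100 * drop_along_antenna:.1f}%")
```

      A larger assumed chamber spacing would make the chamber gradient smaller still, so the qualitative conclusion does not depend on the exact value chosen.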

      By definition, |∇ℎ|0 corresponds to a variation of humidity between a fully saturated atmosphere (i.e. 100%), comparable to the nest interior, and a "humid" atmosphere (i.e. 70%) comparable to the natural environment where termites live (say the nest exterior), occurring over a distance (2mm) which is comparable with their body size.

      We can then conclude that even the lower bound |∇ℎ|0 of the humidity gradient corresponds to a variation in atmosphere to which termites must be accustomed, i.e. nest interior vs nest exterior, happening across one body length. If we add that the upper bound |∇ℎ|𝑚𝑎𝑥 is one order of magnitude higher, it appears extremely unlikely that they could not detect these gradients.

      Additional background information about our considered termite species and their construction habits

      We have now added some details about the life history and nesting habits of termites in the Coptotermes genus in a new paragraph in section SI. Essentially, these are wetwood termites that nest in moist wood or soil, and their nests present a typical structure comprising a scaffold of interconnected pillars (we now show a picture of a typical structure from one of our lab-reared colonies).

      After the initial submission of our manuscript we have also obtained a more precise taxonomic identification of the termites we used, which indicated that our termites are better identified as Coptotermes gestroi than Coptotermes formosanus. The two species are extremely close and can also interbreed in the areas where they co-occur, but in this case C. gestroi is a better match. Hence, we have amended the name in the manuscript and in the supplementary material.

      Differences with previous results by Calovi and collaborators

      We believe that there is no real discrepancy between our results and those described by Calovi et al. (2019, Phil. Trans. Roy. Soc. B 374:20180374). What they measure, termite aggregation and activity, is similar to what we also observe in our experiments: termites aggregate in concave regions, such as at the base of the wall in our experiments, and they collect pellets at the locations that they visit more often. And, above all, we observe that concavities promote digging activity, which in turn promotes aggregation, as already observed in previous studies like Green et al. (2017, Proc. Roy. Soc. B 284:20162730). The main difference is that in our analyses we treat separately the three measurements of termite occupancy, pellet collection and pellet deposition, and in this way we identify a role of convexity for pellet deposition.

      It is possible that, apart from the differences in language and interpretations between our study and the study by Calovi, there were also real differences in termite building behavior between the two studies that we couldn’t fully appreciate from our own reading of the article by Calovi, but which the reviewer has spotted. The reviewer makes a very interesting suggestion that some of these differences might be due to the different humidity level used in our experiment, compared to the experiment by Calovi and collaborators. Room humidity was high, at around 70% in our experiments. The humidity in Calovi’s experiments was possibly even higher as they performed their experiments in a closed box, but we could not find precise reported information on the humidity level in their publication.

      Given that it is not clear that the building behavior in our experiments was qualitatively different from the building behavior in Calovi and collaborators’ experiments, and given that we don’t know the precise humidity value used in Calovi’s experiments (plus, we worked on different termite species that could have different sensitivity to humidity), we decided that, based on the information that we have, we could not meaningfully expand our discussion of similarities and differences with Calovi’s study in our manuscript.

      It is clear, though, and we completely agree with the referee on this point, that in light of Calovi’s and our own new results, it would now be extremely interesting if future experiments could characterize termite construction activity across a range of finely controlled air humidity values. Anecdotally, in preliminary experiments we did include some trials in which termites were hosted in a completely closed box, and we observed much reduced construction activity in those conditions. However, the fact that we could not easily track termite activity and pellet collections / depositions in those conditions (because of the box), together with the fact that the building activity itself was reduced, led us to converge on the open arena experiments that we describe here.

      Suggestion of alternative experimental approaches

      One reviewer made interesting suggestions for alternative experiments, including using an array of humidity probes for measuring humidity, or a different experimental setup analogous to those used in previous experiments by Bardunias and collaborators. It is often the case that only at the end of a series of experiments do we identify an alternative, and possibly better, way of doing the same experiment. In future, if we have the opportunity to run other similar experiments again, we will likely experiment with these suggestions. When we first designed our own experiments, one of our priorities was to be able to film all termites in the arena at all times, so that potentially we could also study individual termite behavior and task specialization. This partly constrained the type of experimental setups that we could use.

      One aspect that clearly emerged from our work and from the revision process is that any future experiments related to this topic should achieve a very precise control of air humidity, and test a wider range of stimuli of more varied and controlled size, humidity and curvature. Since our own experiments were conducted, three of us have moved to different institutions, which imposes practical constraints for us on working on the same termites in a similar way, but the suggestions from the reviewers will be helpful as we are planning our future research.

      We hope that the explanations above and the details that we have changed in the manuscript itself have helped to clarify the unclear aspects of our study.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers for their time and thoughtful comments. We believe that the further analyses suggested have made the results clearer and more robust. Below, we briefly highlight the key points addressed in the revision and the new evidence supporting them. Then, we address each reviewer’s critiques point-by-point.

      - Changes in variability with respect to time/experience

      Both reviewers #1 and #3 asked whether the variability in grid properties observed was dependent on time or experience. This is an important point, given that such a dependence on time could lead to interesting hypotheses about the underlying dynamics of the grid code. However, in the new analyses we performed, we do not observe changes in grid variability within a session (Fig S5 of the revised manuscript), suggesting that the grid variability seen is constant within the timescale of the data set.

      - The assumption of constant grid parameters in the literature

      Reviewer #2 pointed out that it had been appreciated by experimentalists that grid properties are variable within a module. We agree that we may have overstated the universality of this assumption in the original manuscript, and we have toned down the language in the revision. However, we note that many previous theoretical studies assumed these properties to be constant within a given module. We provide some examples below, and have added evidence of this assertion, with citations to the theoretical literature, to the revised manuscript.

      - Additional sources of variability

      Reviewer #3 pointed out additional sources that might explain the variability observed in the paper (beyond time and experience). These sources include field width, border location, and the impact of conjunctive cells. We have run additional analyses and have found no significant impact on the observed variability from any of these factors. We believe that these are important controls, and have added them to the manuscript (Fig S4-S7 of the revised manuscript).

      - Analysis of computational models

      Reviewer #3 noted that our results could be strengthened by performing similar analyses on the output of computational models of grid cells. This is a good idea. We have now measured the variability of grid properties in a recent normative recurrent neural network (RNN) model that develops grid cells when trained to perform path integration (Sorscher et al., 2019). This model has been shown to develop signatures of a 2D toroidal attractor (Sorscher et al., 2023) and achieves a high accuracy on a simple path integration task. Interestingly, the units with the greatest grid scores also exhibit a range of grid spacings and grid orientations (Fig S8 of the revised manuscript). Furthermore, by decreasing the amount of sparsity (through decreasing the weight decay regularization), we found an increase in the variability of the grid properties. This analysis demonstrates a heretofore unknown similarity between the RNN models trained to perform path integration and recorded grid cells from MEC. It additionally provides a framework for computational analysis of the emergence of grid property variability.

      Reviewer #1:

      (1) Is the variability in grid spacing and orientation that the authors found intrinsically organized or is it shaped by experience? Previous research has shown that grid representations can be modified through experience (e.g., Boccara et al., Science 2019). To understand the dynamics of the network, it would be important to investigate whether robust variability exists from the beginning of the task period (recording period) or whether variability emerges in an experience-dependent manner within a session.

      This is an interesting question that was not addressed in the paper. To test this, we performed additional analysis to resolve whether the variability changes across a session.

      Using a sliding window, we have measured changes in variability with respect to recording time (Fig S5A). To this end, we compute grid orientation and spacing over a time-window whose length is half the total length of the recording. From the population distribution of orientation and spacing values, we compute the standard deviation as a measure of variability. We repeat the same procedure, sliding the window forward until the variability for the second half of the recording is computed.

      We applied this approach to recording ID R12 (the same as in Figs 2-4) given that this recording session was significantly longer than the rest (nearly two hours). Results are shown in Fig S5B-C. For both orientation and spacing, no changes of variability with respect to time can be observed. Similar results were found for other modules (see caption of Fig S5 for statistics).
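      The sliding-window procedure described above can be sketched as follows. This is a minimal illustration rather than the analysis code used for Fig S5: `estimate_property` is a hypothetical user-supplied function that returns one value per cell (e.g. grid spacing or orientation) from the data in a given time window, and the number of window positions is an assumed parameter.

```python
import numpy as np


def sliding_window_variability(estimate_property, session_length, n_steps=10):
    """Across-cell standard deviation of a grid property in half-session windows.

    estimate_property(t_start, t_end) is a hypothetical callable returning one
    value per cell computed only from data in [t_start, t_end).
    """
    window = session_length / 2.0
    # Window starts slide from 0 (first half) to session_length / 2 (second half).
    starts = np.linspace(0.0, session_length - window, n_steps)
    # np.std uses the population definition (ddof=0) by default.
    stds = np.array([np.std(estimate_property(t0, t0 + window)) for t0 in starts])
    return starts, stds


# Synthetic usage: per-cell spacings that do not depend on time (not real data).
rng = np.random.default_rng(0)
synthetic_spacing = rng.normal(50.0, 2.0, size=100)


def synthetic_estimate(t_start, t_end):
    return synthetic_spacing


starts, stds = sliding_window_variability(synthetic_estimate, session_length=7200.0, n_steps=5)
print(starts)
print(stds)
```

      With a time-independent synthetic estimate, the returned variability is identical in every window, which is the pattern one would expect if variability does not change within a session.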

      We also note that the rats were already familiarized with the environment for 10-20 sessions prior to the recordings, so there may not be further learning during the period of the grid cell recordings. No changes in variability can be seen in Rat R across days (e.g., in Fig 5B R12 and R22 have similar distributions of variability). However, we note that it may be possible that there are changes in grid properties at time-scales greater than the recordings.

      (2) It is important to consider the optimal variability size. The larger the variability, the better it is for decoding. On the other hand, as the authors state in the Discussion, it is assumed that variability does not exist in the continuous attractor model. Although this study describes that it does not address how such variability fits the attractor theory, it would be better if more detailed ideas and suggestions were provided as to what direction the study could take to clarify the optimal size of variability.

      We appreciate this suggestion and agree that more discussion is warranted on how our results can be reconciled with previously observed attractor dynamics. To explore this, we studied the recurrent neural network (RNN) model from Sorscher et al. (2019), which develops grid responses when trained on path integration. This network has previously been found to develop signatures of toroidal topology (Sorscher et al., 2023), yet we find that its grid responses also contain heterogeneity in grid properties (Fig S8). By decreasing the strength of the weight decay regularization (which leads to denser connectivity in the recurrent layer), we find an increase in the grid property variability. Interestingly, decreasing the weight decay regularization has previously been found to lead to weaker grid responses and to a worse ability of the RNN to perform path integration in environments larger than the one it was trained on. This approach not only provides preliminary evidence for our claim that too much variability can lead to weaker continuous attractor structure, but also provides a modeling framework with which future work can explore this question in more detail. We have added discussion of this issue to the manuscript text (Discussion).

      Reviewer #2:

      (1) Even though theoreticians might have gotten the mistaken impression that grid cells are highly regular, this might be due to an overemphasis on regularity in a subset of papers. Most experimentalists working with grid cells know that many if not most grid cells show high variability of firing fields within a single neuron, though this analysis focuses on between neurons. In response to this comment, the reviewers should tone down and modify their statements about what are the current assumptions of the field (and if possible provide a short supplemental section with direct quotes from various papers that have made these assumptions).

      We agree that some experimentalists are aware of variability in the recorded grid response patterns and that this work may not come as a complete surprise to them. We have toned down our language in the Introduction, changing “our results challenge a long-held assumption” to “our results challenge a frequently made assumption in the theoretical literature”. Additionally, we have added a caveat that “experimentalists have been aware” of the observed variability in grid properties.

      We would like to emphasize that the lack of work carefully examining the robustness of this variability has prevented a firm understanding of whether this is an inherent property of grid cells or due to measurement noise. The impact of this can be seen in theoretical neuroscience work where a considerable number of articles (including recent publications) start with the assumption that all grid cells within a module have identical properties, with the exception of phase shift and noise. We have now cited a number of these papers in the Introduction, to provide specific references. To further illustrate the pervasiveness of this assumption being explicitly made in theoretical neuroscience, below we provide quotes from a few important papers:

      “Cells with a common spatial period also share a common grid orientation; their responses differ only by spatial translations, or different preferred firing phases, with respect to their common response period” (Sreenivasan and Fiete, 2011)

      “Grid cells are organized into discrete modules; within each module, the spatial scale and orientation of the grid lattice are the same, but the lattice for different cells is shifted in space.” (Stemmler et al., 2015)

      “Recently, it was shown that grid cells are organized in discrete modules within which cells share the same orientation and periodicity but vary randomly in phase” (Wei et al., 2015)

      “...cells within one module have receptive fields that are translated versions of one another, and different modules have firing lattices of different scales and orientations” (Dorrell et al., 2023)

      In these works, this assumption is used to derive properties relating to the computational properties of grid cells (e.g., error correction, optimal scaling between grid spacings in different modules).

      In addition, since grid cells are assumed to be identical in the computational neuroscience community, there has been little work on quantifying how much variability a given model produces. This makes it challenging to understand how consistent different models are with our observations. This is illustrated in our analysis of a recent recurrent neural network (RNN) model of grid cells (Fig S8), which does exhibit variability.

      (2) The authors state that "no characterization of the degree and robustness of variability in grid properties within individual modules has been performed." It is always dangerous to speak in absolute terms about what has been done in scientific studies. It is true that few studies have had the number of grid cells necessary to make comparisons within and between modules, but many studies have clearly shown the distribution of spacing in neuronal data (e.g. Hafting et al., 2005; Barry et al., 2007; Stensola et al., 2012; Hardcastle et al., 2015) so the variability has been visible in the data presentations. Also, most researchers in the field are well aware that highly consistent grid cells are much rarer than messy grid cells that have unevenly spaced firing fields. This doesn't hurt the importance of the paper, but they need to tone down their statements about the lack of previous awareness of variability (specific locations are noted in the specific comments).

      We have toned down our language in the Introduction. However, we note that our point that no detailed analysis had been done on measuring the robustness of this variability stands. Thus, for the general community, it has not been clear whether this previously observed variability is noise or a real feature of the grid code.

      (3) The methods section needs to have a separate subheading entitled "How grid cells were assigned to modules" that clearly describes how the grid cells were assigned to a module (i.e. was this done by Gardner et al., or done as part of this paper's post-processing?).

      We thank the reviewer for pointing out this missing information. We have added a new subsection in the Materials and Methods section, entitled “Grid module classification” to clarify how the grid cells are assigned to modules. In short, this was done by Gardner et al. (2022) using an unsupervised clustering approach that was viewed as enabling a less biased identification of modules. We did not perform any additional processing steps on module identity.

      Reviewer #3:

      (1) One possible explanation of the dispersion in lambda (not in theta) could be variability in the typical width of the field. For a fixed spacing, wider fields might push the six fields around the center of the autocorrelogram toward the outside, depending on the details of how exactly the position of these fields is calculated. We recommend authors show that lambda does not correlate with field width, or at least that the variability explained by field width is smaller than the overall lambda variability.

      We agree that this option had not been carefully ruled out by our previous analyses. To tackle this question, we compute the field width of a given cell using the value at the minima of its spatial autocorrelogram (Fig S4A-B). For all cells in recording ID R12, there is a non-significant negative linear correlation between grid field width and between-cell variability (Fig S4C). The width of the field explains 4% of the variability, as indicated by the R<sup>2</sup> value of the linear fit. Similar results were found for all other modules (see caption of Fig S4C for statistics). Therefore, we do not think that grid field width explains spacing variability.
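
      The "variability explained" figure quoted in this response is the R<sup>2</sup> of a least-squares line. The following is a minimal Python sketch of that calculation, assuming one field-width value and one variability value per cell. The function name `r_squared` and its arguments are illustrative assumptions, not the authors' code.

```python
from statistics import mean


def r_squared(x, y):
    """R^2 of the least-squares line y ~ a + b*x.

    For a single predictor, R^2 equals the squared Pearson correlation,
    so it can be computed directly from the sums of squares.
    x, y: equal-length sequences (hypothetical per-cell field widths and
    per-cell variability values).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    return (sxy * sxy) / (sxx * syy)
```

      Under this reading, an R<sup>2</sup> of 0.04 means that a straight-line dependence on field width accounts for 4% of the variance in the second quantity.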

      (2) An alternative explanation could be related to what happens at the borders. The authors tackle this issue in Figure S2 but introduce a different way of measuring lambda based on three fields, which in our view is not optimal. We recommend showing that the dispersions in lambda and theta remain invariant as one removes the border-most part of the maps but estimating lambda through the autocorrelogram of the remaining part of the map. Of course, there is a limit to how much can be removed before measures of lambda and theta become very noisy.

      We have performed additional analysis to explore the role of borders in grid property variability. To do so, we have followed the suggestion by the reviewer and have re-analyzed grid properties from the autocorrelogram when the border-most part of the maps is removed (Fig S6A-B). For all modules, we do not see any changes in variability (computed as the standard deviation of the population distribution) for either orientation or spacing. As predicted by the reviewer, after removing about 25% of the border-most part of the environment we start seeing changes in variability, as measures of theta and lambda become noisy and computed over a smaller spatial range. This result holds for all other modules (Fig S6C-D).

      (3) A third possibility is slightly more tricky. Some works (for example Kropff et al, 2015) have shown that fields anticipate the rat position, so every time the rat traverses them they appear slightly displaced opposite to the direction of movement. The amount of displacement depends on the velocity. Maps that we construct out of a whole session should be deformed in a perfectly symmetric way if rats traverse fields in all directions and speeds. However, if the cell is conjunctive, we would expect a deformation mainly along the cell's preferred head direction. Since conjunctive cells have all possible preferred directions, and many grid cells are not conjunctive at all, this phenomenon could create variability in theta and lambda that is not a legitimate one but rather associated with the way we pool data to construct maps. To rule out this possibility, we recommend the authors study the variability in theta and lambda of conjunctive vs non-conjunctive grid cells. If the authors suspect that this phenomenon could explain part of their results, they should also take into account the findings of Gerlei and colleagues (2020) from the Nolan lab, that add complexity to this issue.

      We appreciate the reviewer pointing out the possible role conjunctive cells may play. To investigate how conjunctive cells may affect the observed grid property variability, we have performed additional analyses taking into account if the grid cells included in the study are conjunctive. Comparing within- and between-cell variability of conjunctive vs. non-conjunctive cells in recording R12, we do not see any qualitative differences for either orientation or spacing (Fig S7A-B). When excluding conjunctive cells from the between-variability comparison, we do not see any significant difference compared to when these cells are included (Fig S7C-D). As such, it does not appear that conjunctive cells are the source of variability in the population.

      We further note that the number of putative conjunctive cells varied across modules and recordings. For instance, in recordings Q1 and Q2, Gardner et al. (2022) reported 3 (out of 97) and 1 (out of 66) conjunctive cells, respectively. Given that we see variability robustly across recordings (Fig 5), we do not believe that conjunctive cells can explain the presence of variability we observe.

      (4) The results in Figure 6 are correct, but we are not convinced by the argument. The fact that grid cells fire in the same way in different parts of the environment and in different environments is what gives them their appeal as a platform for path integration since displacement can be calculated independently of the location of the animal. Losing this universal platform is, in our view, too much of a price to pay when the only gain is the possibility of decoding position from a single module (or non-adjacent modules) which, as the authors discuss, is probably never the case. Besides, similar disambiguation of positions within the environment would come for free by adding to the decoding algorithm spatial cells (non-hexagonal but spatially stable), which are ubiquitous across the entorhinal cortex. Thus, it seems to us that - at least along this line of argumentation - with variability the network is losing a lot but not gaining much.

      We agree that losing the continuous attractor network (CAN) structure and the ability to path integrate would be a very large loss. However, we do not believe that the variability we observe necessarily destroys either the CAN or path integration. We argue this for two reasons. First, the data we analyzed [from Gardner et al. (2022)] is exactly the data set that was found to have toroidal topology and was therefore viewed as consistent with a major prediction of CANs. Thus, the amount of variability in grid properties does not rule out the underlying presence of a continuous attractor. Second, path integration may still be possible with grid cells that have variable properties. To illustrate this, we analyzed data from the recurrent neural network (RNN) model of Sorscher et al. (2019), which was trained explicitly on path integration, and found that the grid representations that emerged had variability in spacing and orientation (see point #6 below).

      (5) In Figure 4 one axis has markedly lower variability. Is this always the same axis? Can the authors comment more on this finding?

      We agree that in Fig 4 the first axis has lower variability. We believe that this is specific to the module R12 and does not reflect any differences in axis or bias in the methods used to compute the axis metrics. To test this, we have performed the same analyses for other modules, finding that other recordings do not exhibit the same bias. Results for the modules with the most cells are shown below (Author response image 1).

      Author response image 1.

      Grid properties along Axis 1 are not less variable for many recorded grid modules. Same as Fig. 4C-D, but for four other recorded modules. Note that the variability along each axis is similar.

      (6) The paper would gain in depth if maps coming out of different computational models could be analyzed in the same way.

      We agree with the reviewer that examining computational models using the same approach would strengthen our results and we appreciate the suggestion. To address this, we have analyzed the results from a previous normative model for grid cells [Sorscher et al. (2019)] that trained a recurrent neural network (RNN) model to perform path integration and found that units developed grid-cell-like responses. These models have been found to exhibit signatures of toroidal attractor dynamics [Sorscher et al. (2023)] and exhibit a diversity of responses beyond pure grid cells, making them a good starting point for understanding whether models of MEC may contain uncharacterized variability in grid properties.

      We find that RNN units in these normative models exhibit similar amounts of variability in grid spacing and orientation as observed in the real grid cell recordings (Fig S8A-D). This provides additional evidence that this variability may be expected from a normative framework, and that the variability does not destroy the ability to path integrate (which the RNN is explicitly trained to perform).

      The RNN model offers possibilities to assess what might cause this variability. While we leave a detailed investigation of this to future work, we varied the weight decay regularization hyper-parameter. This value controls how sparse the weights in the hidden recurrent layer are. Large weight decay regularization strength encourages sparser connectivity, while small weight decay regularization strength allows for denser connectivity. We find that increasing this penalty (and enforcing sparser connectivity) decreases the variability of grid properties (Fig S8E-F). This suggests that the observed variability in the Gardner et al. (2022) data set could be due to the fact that grid cells are synaptically connected to other, non-grid cells in MEC.

      (7) Similarly, it would be very interesting to expand the study with some other data to understand if between-cell delta_theta and delta_lambda are invariant across environments. In a related matter, is there a correlation between delta_theta (delta_lambda) for the first vs for the second half of the session? We expect there should be a significant correlation, it would be nice to show it.

      We agree this would be interesting to examine. For this analysis, it is essential to have a large number of grid cells, and we are not aware of other published data sets with comparable cell numbers using different environments.

      Using a sliding window analysis, we have characterized changes in variability with respect to the recording time (Figure S5A). To do so, we compute grid orientation and spacing over a time-window whose length is half of the total length of the recording. From the population distribution of orientation and spacing values, we compute the standard deviation as a measure of between-cell variability. We repeat the same procedure, sliding the window forward until the variability for the second half of the recording is computed.

      We applied this approach to recording ID R12 (the same as in Figs 2-4) given that this recording session was significantly longer than the rest (almost two hours). Results are shown in Fig S5 B-C. For both orientation and spacing, no systematic changes of variability with respect to time were observed. Similar results were found for other modules (see caption of Fig S5 for statistics).
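
      The sliding-window procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `estimate_property` is a hypothetical callback that returns one value per cell (for example, grid spacing) computed from the data in a given time window, and the use of the population standard deviation rather than the sample standard deviation is also an assumption.

```python
import math
from statistics import pstdev


def sliding_window_variability(duration, step, estimate_property):
    """Between-cell variability in sliding windows of half the recording.

    duration: total recording length (any time unit).
    step: how far the window slides each time (same unit).
    estimate_property(t_start, t_end): hypothetical callback returning a
        list with one value per cell for the window [t_start, t_end].
    Returns a list of (window_start, standard_deviation) pairs.
    """
    if duration <= 0 or step <= 0:
        raise ValueError("duration and step must be positive")
    window = duration / 2.0
    # Number of window positions from the first half up to the second half.
    n_windows = int(math.floor((duration - window) / step + 1e-9)) + 1
    results = []
    for i in range(n_windows):
        t_start = i * step
        values = estimate_property(t_start, t_start + window)
        # Spread of the population distribution = between-cell variability.
        results.append((t_start, pstdev(values)))
    return results
```

      If `step` does not divide the half-length evenly, the last window in this sketch stops short of the exact second half; a systematic drift in the returned standard deviations would indicate that variability changes over the recording.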

      We also note that the rats were already familiarized with the environment for 10-20 sessions prior to the recordings, so there may not be further learning during the period of the grid cell recordings. No changes in variability can be seen in Rat R across days (e.g., in Fig 5B R12 and R22 have similar distributions of variability). However, we note that it may be possible that there are changes in grid properties at time-scales greater than the recordings.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this manuscript entitled "Hexokinase regulates Mondo-mediated longevity via the PPP and organellar dynamics", Laboy and colleagues investigated upstream regulators of MML-1/Mondo, a key transcription factor that regulates aging and metabolism, using the nematode C. elegans and cultured mammalian cells. By performing a targeted RNAi screen for genes encoding enzymes in glucose metabolism, the authors found that two hexokinases, HXK-1 and HXK-2, regulate nuclear localization of MML-1 in C. elegans. The authors showed that knockdown of hxk-1 and hxk-2 suppressed longevity caused by germline-deficient glp-1 mutations. The authors demonstrated that genetic or pharmacological inhibition of hexokinases decreased nuclear localization of MML-1, via promoting mitochondrial β-oxidation of fatty acids. They found that genetic inhibition of hxk-2 changed the localization of MML-1 from the nucleus to mitochondria and lipid droplets by activating pentose phosphate pathway (PPP). The authors further showed that the inhibition of PPP increased the nuclear localization of mammalian MondoA in cultured human cells under starvation conditions, suggesting the underlying mechanism is evolutionarily conserved. This paper provides compelling evidence for the mechanisms by which novel upstream metabolic pathways regulate MML-1/Mondo, a key transcription factor for longevity and glucose homeostasis, through altering organelle communications, using two different experimental systems, C. elegans and mammalian cells. This paper will be of interest to a broad range of biologists who work on aging, metabolism, and transcriptional regulation. 

      Reviewer #2 (Public Review):

      Raymond Laboy et al. explored how the transcriptional Mondo/Max-like complex (MML-1/MXL-2) is regulated by glucose metabolic signals using a germline-removal longevity model. They believed that MML-1/MXL-2 integrated multiple longevity pathways through nutrient sensing and therefore screened the glucose metabolic enzymes that regulated MML-1 nuclear localization. Hexokinase 1 and 2 were identified as the most vigorous regulators, which function through mitochondrial beta-oxidation and the pentose phosphate pathway (PPP), respectively. MML-1 localized to mitochondria associated with lipid droplets (LD), and MML-1 nuclear localization was correlated with LD size and metabolism. Their findings are interesting and may help us to further explore the mechanisms in multiple longevity models; however, the study is not complete and the working model remains obscure. For example, the exact metabolites that account for the direct regulation of MML-1 were not identified, and more detailed studies of the related cellular processes are needed.

      The identification of responsible metabolites is necessary since multiple pieces of evidence from the study suggest that lipid rather than glucose metabolites may be more likely to be the direct regulators of MML-1, and that HXK regulates MML-1 indirectly by affecting lipid metabolism: 1) inhibiting the PPP is sufficient to rescue MML-1 function independent of G6P levels; 2) HXK-1 regulates MML-1 by increasing fatty acid beta-oxidation; 3) LD size correlates with MML-1 nuclear localization and LD metabolism can directly regulate MML-1. The identification of metabolites will be helpful for understanding the mechanism.

      Beta-oxidation and the PPP are involved in the regulation of MML-1 by HXK-1 and HXK-2, respectively. But how these two pathways participate in the regulation is not clear. Is it the beta-oxidation rate or the intermediate metabolites that matters? As for the PPP, it provides substrates for nucleotide synthesis and also its product NADPH is essential for redox balance. Is one of the metabolites or the NADPH levels involved in MML-1 regulation? More studies are needed to provide answers to these concerns. 

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Following are my comments that the authors may want to address to further improve this excellent paper.

      Major comments 

      (1) Although the authors provided evidence that hexokinases in glucose metabolism are associated with germline-deficient glp-1(-) mutants, they did not mention why they focused on glp-1(-) mutants rather than other longevity mutants. In their previous study (Nakamura et al., 2016), they showed that MML-1 is required for multiple longevity pathways in C. elegans, including reduced mitochondrial respiration and insulin/IGF-1 signaling. Please discuss why the authors focused on glp-1(-) mutants in this paper. It will be even better if the authors test the roles of hexokinases in some other longevity regimens. 

      Many thanks for this astute comment. Previously we had shown that mml-1 is required for glp-1, daf-2, and isp-1 longevity, and Johnson et al. had shown a requirement for eat-2, hence the idea that MML-1 is a convergent transcription factor. We first focused on glp-1 because that was the starting point of our screen, and the result was clear and simple: hexokinases regulate MML‑1 nuclear localization and activity in glp-1 and are required for longevity. Naturally, the question arises: do hexokinases behave like MML-1 as convergent longevity regulators across pathways? To address this, we examined the interaction of hxk-1 and hxk-2 with isp-1, daf-2, and raga-1. Specifically, we now show that:

      A. Like glp-1(e2141) mutants, isp-1(qm150) mutants stimulate MML-1 nuclear localization, and the hexokinases are required for isp-1 longevity (Figure 1G-H).

      B. daf-2(e1370) mutants do not further stimulate MML-1 nuclear localization beyond basal levels, yet MML-1 is strongly required for daf-2 longevity (Nakamura et al., 2016, Supplementary Figure 1L-M). However, the hexokinases are not required for daf-2 longevity (Supplementary Figure 1M), suggesting that the signaling pathway is wired differently in daf-2, and that other pathways regulate MML-1 activity.

      C. raga-1(ok701) mutants stimulate MML-1 nuclear localization and mml-1 is required for raga-1 longevity, suggesting that MML-1 acts downstream of TORC1 signaling (Supplementary Figure 1N-O). However, hexokinases are not required for raga-1 longevity, suggesting that raga-1 acts downstream or parallel to hexokinase signaling (Supplementary Figure 1P).

      D. We performed untargeted metabolomics in glp-1, daf-2, and mml-1 single and double mutants and observed that hexose phosphates, which have been shown to regulate MML-1 human homologs MondoA/ChREBP, were differentially regulated between mutants.

      Author response image 1.

      E. Altogether these experiments reveal that though MML-1 promotes longevity in most pathways, the hexokinases are only required in some (glp-1, isp-1), but not others (raga-1, daf-2). Furthermore, strong MML-1 nuclear localization is often but not always associated with longevity (e.g. daf-2), and the wiring of the signaling pathway is different for various longevity regimens. Consistently, mTOR and Insulin signaling are more functionally linked and therefore may show a more similar genetic profile. Differences in hexose phosphate between glp-1 and daf-2 could explain why MML-1 requires hexokinase function in glp-1 to promote longevity but not in daf-2. However, considerably more work is required to rigorously validate this hypothesis.

      (2) In figure 5, the authors investigated whether the association between PPP and MML‑1/MondoA, tested in C. elegans, is conserved in mammals under starvation conditions. The authors should clarify why they tested the MondoA localization upon starvation in cultured human cells. This comment is related to my comment #1 as the authors could determine the roles of hexokinases under dietary restriction (DR)-conditions or in DR-mimetic in eat-2(-) mutants. 

      In this case, the actual translatability to a worm longevity pathway was not our goal. Rather, we examined MondoA in cell culture under conditions with contrasting MondoA subcellular localization: in high-glucose media MondoA showed cytosolic/nuclear localization, whereas under starvation conditions it was cytosolic. We then showed that, similar to our data in worms, PPP inhibition with 6-AN induced MondoA nuclear localization and activity. We now mention this rationale in the results section, lines 352-356.

      (3) In figure 2, the authors showed that HXK-2 regulates mitochondrial localization of MML-1, and HXK-1 regulates nuclear localization of MML-1 through mitochondrial β-oxidation in glp‑1(-) mutants. Can the authors test whether mitochondrial β-oxidation affects the effects of hxk RNAi on longevity of glp-1(-) mutants? 

      Excellent suggestion. We tried to test this idea and found that acs-2 RNAi alone abolished glp-1 longevity, making epistasis experiments difficult to interpret. This is consistent with published data showing that glp-1 longevity requires NHR-49, a transcription factor that regulates mitochondrial β‑oxidation and drives acs-2 expression (Ratnappan et al., 2014). It could well be that β‑oxidation inhibition promotes MML-1 nuclear localization but abolishes lifespan extension because of epistatic effects on other transcription factors or processes. Further investigation, which goes beyond the scope of the paper, would be required to elucidate the exact mechanism.

      (4) The authors showed that 2-deoxy-glucose, which decreases the activity of HXK, decreased the nuclear localization of MML-1, and this is consistent with their genetic data. Based on these data, 2-deoxy-glucose is expected to decrease longevity. Interestingly, however, 2-deoxy-glucose has been reported to increase lifespan by restricting glucose, whereas extra glucose intake decreases lifespan in C. elegans, shown by multiple research groups, including M. Ristow, C. Kenyon, and S.J.V. Lee labs. This is seemingly paradoxical and worth discussing with key references, especially because MondoA and Chrebp are known as glucose-responsive transcription factors. 

      Thank you for this important comment. 2-DG has been shown to extend lifespan by suppressing glucose metabolism at concentrations ranging from 0.1 to 5 mM, whereas higher concentrations ranging from 20 to 50 mM had the opposite effect, decreasing lifespan (Schulz et al., 2007). We tested 50 mM 2-DG and observed decreased MML-1 nuclear localization, which is consistent with the previous data showing decreased longevity. We now raise this point in the discussion, suggesting that mild inhibition of glucose metabolism has beneficial effects on longevity, while strong suppression shortens lifespan (lines 411-414).

      Minor comments 

      (1) The current Introduction does not include an explicit statement that MML-1 and MondoA are homologs. Please clarify this, as naive readers may be confused.

      Thank you for pointing this out. We now say in the intro that MondoA and MML-1 are homologs (lines 59-60).

      (2) In figure 1, the effects of hxk-3 on nuclear localization of MML-1 is small compared to those of hxk-1 and hxk-2. Please add speculation about why HXK-3 has different roles in nuclear localization of MML-1 compared to HXK-1 and HXK-2. 

      According to GExplore 1.4 (Hutter & Suh, 2016), hxk-3 expression declines during larval development and is low in the adult. Perhaps it has little effect in the young adult, and the other hexokinases suffice to support MML-1 nuclear localization. It also remains possible that hxk-3 is not required in glp-1 but is required in other longevity pathways.

      (3) The authors tested the effects of genetic inhibition of hxk-1 and hxk-2 on the regulation of MML-1 localization and lifespan of glp-1(-) mutants by using RNAi. I wonder whether the authors can perform the experiments with hxk-1 or hxk-2 loss (or reduction) of function mutants. If they cannot, please discuss the reason and the limitations of RNAi. 

      This is an important point raised by the reviewer. We found that RNAi was most effective for phenotypes related to MML-1 nuclear localization and longevity, likely because it results in acute knockdown. We also showed that pharmacological inhibition of hexokinase function with 3BrP and 2‑DG (Supplementary Figure 1B and 1C) and of the PPP with 6-AN (Figure 3B) gave results consistent with our RNAi observations.

      We generated hexokinase KO mutants by deleting the coding sequence of each hexokinase by CRISPR/Cas9. First, we measured the expression of each hexokinase isozyme in each mutant. Notably, the hxk-1(syb1271) null mutant had higher expression of hxk-2 and hxk-3, hxk-2(syb1261) did not significantly affect the expression of hxk-1 and hxk-3, and hxk-3(syb1267) had a mild increase in hxk-2 expression. We followed up on hxk-1(syb1271) and hxk-2(syb1261) and crossed these mutants with our MML-1::GFP reporter. We observed a modest but significant reduction in MML-1 nuclear localization in both strains. The effect of RNAi is much stronger than that of the null mutants, potentially due to a compensatory upregulation of the other hexokinases in the mutants that we do not observe with RNAi (Supplementary Figure 1D-E). An alternative explanation is that there is a threshold in the effects of hexokinase function on MML-1 nuclear localization. We tried to generate a hxk-1; hxk-2 double mutant, but it was lethal, and we therefore did not pursue this further.

      Author response image 2.

      (4) Please correct minor typos throughout the manuscript. Following are some examples.

      - On page 4, line 111, please correct "Supplementary Figure D-E" to "Supplementary Figure 1D-E".

      - On page 9, line 272, please correct "3A-B" to "4A-B". 

      - On page 9, line 275, please correct "S4" to "4". 

      - On page 10, line 309, please correct "4A" to "4B" 

      Corrected.

      (5) In Fig. 3E, please add the information about the scale bars in figure legends.

      Corrected.

      Reviewer #2 (Recommendations For The Authors):

      Here are some detailed suggestions for the authors:

      (1) Since MML-1/MXL-2 complex functions in multiple longevity models, e.g. DR, ILS, what are the roles of HXK-1 and HXK-2 in these models? 

      We now show that although mml-1 is required in most longevity pathways, hxk-1 and hxk-2 are required in some pathways (glp-1, isp-1) but not others (daf-2, raga-1). See above for more details.

      (2) As for the metabolite screening, the lipid metabolic genes can be included. In addition to the above reasons, a previous study found that the mml-1 mRNA levels and MML-1 GFP nuclear localization were both increased in the glp-1 model, while mml-1 mRNA levels were unaffected by hxk knockdown, suggesting that more pathways are involved.

      We agree with the reviewer that understanding what metabolites regulate MML-1 nuclear localization and activity is an important, yet challenging question. Our studies demonstrate a role of glucose metabolism, in particular hexokinase, in this process, consistent with hexose-p being activators of MondoA. Our data also suggest that mechanisms beyond hexose-p regulate MML-1, since knockdown of the PPP components stimulates MML-1 even when hxk-2 is depleted and G6P is low, and inhibition of the PPP with 6-AN stimulates MondoA nuclear localization under starvation conditions in mammalian cell culture. We tested redox regulation, nucleoside, and lipid metabolism as candidate processes (see below). Notably, our data suggest this other mechanism is tied to lipid metabolism through droplet size, since various perturbations that impact LD size and number (atgl-1, dgat-2, tkt-1, Figure 4) affected MML-1 nuclear localization. It remains an open question whether MML-1 is regulated by other metabolites through a ligand-protein interaction or not. We cannot exclude that beyond lipid droplet regulation, specific lipids, other metabolites, or metabolic modules linked to the PPP might regulate MML-1 nuclear localization and activity.

      We employed genetic manipulation and pharmacological inhibition to understand the upstream signals that regulate MML-1. These approaches will not be sufficient to determine whether other metabolite(s) are involved in MML-1/MondoA translocation to the nucleus through a direct interaction. Novel technologies that determine protein-metabolite interactions (e.g. MIDAS) will help us answer this question in future work, but this goes beyond the scope of this paper. As a compromise, we discuss possible metabolites that may orchestrate this, based on our observations of MML‑1 subcellular localization at LD/mitochondria (including PPP and TCA cycle intermediates).

      (3) Line 238, it should be "NADPH". 

      Corrected.

      (4) RNAi targeting enzymes of different branches of PPP can be performed

      In our initial screen, we examined the effect of various enzymes of the PPP on MML-1 nuclear localization (Figure 1A, Supplementary Table S1) and found that knockdown of enzymes in both the oxidative phase (PGDH/T25B9.9) and non-oxidative phase (transketolase/TKT-1) affects MML-1 nuclear localization. In line with this, 6-AN treatment, which affects the oxidative phase, also stimulated MML‑1 nuclear localization (Figure 3B). We also observed that knockdown of enzymes involved in ribose 5P conversion to ribose, ribose 1P, and phosphoribosyl pyrophosphate, an intermediate in nucleotide biosynthesis, decreased MML-1 nuclear localization (rpia-1, F07A11.5, Y43F4B.5, R151.2; Supplementary Table S1). Whether MML‑1/MondoA responds to the nucleotide pool remains elusive.

      (5) As for PPP, these are many possibilities that can be tested. For example, as PPP supplies NADPH for oxidative balance, does MML-1 respond to ROS? Also, it appears the genes in the non-oxidative arm of PPP regulate MML-1, so is nucleotide synthesis involved? 

      Thank you for the suggestion. We tested other enzymes involved in NADPH production from the folate cycle and observed a mild but significant reduction of MML-1 nuclear localization upon dao-3i (Supplementary Table S1). Moreover, we tested whether MML-1 nuclear localization is responsive to ROS. While paraquat exposure induced oxidative stress, as measured by the transcriptional reporter gst‑4p::GFP (Supplementary Figure 3A), it did not significantly affect MML-1 nuclear localization (Supplementary Figure 3B). Therefore, we think it less likely that NADPH production acting through redox regulation is the main effect.

      We also tried supplementation with some of the metabolite outputs of PPP including ribose, ribulose, and xylulose, as well as nucleosides (see below), but saw no effect on MML-1 nuclear localization. We agree that further studies are required to pinpoint whether there is another metabolic moiety regulating MML-1 at the protein-ligand level, but this goes beyond the scope of the current investigation.

      Author response image 2.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      “EGFRvIII is mainly associated with the classical subtype, so the mesenchymal subtype might be unexpected here. This could be commented on.” 

      We acknowledge that EGFRvIII is most often associated with the classical subtype of glioblastoma and agree that mesenchymal subtype classification may be unexpected given the use of her4.1:EGFRvIII as a driver in our model. We would like to highlight the fact that our brain tumors do also express certain markers associated with the classical subtype including neural precursor and neural stem cell markers like sox2, ascl1b, and gli2 (Supplementary Fig 4, 5; Supplementary Table 1-3). However, our transcriptomic data was not found to significantly enrich for classical subtype gene expression, compared to normal brains. This could be due to a significant contribution of normal brain tissue to our analyses (bulk tumor burdened brains were harvested for RNA sequencing), as well as the significant contribution of mesenchymal subtype signatures and/or inflammatory gene expression in our brain tumor-positive samples. Because signatures associated with inflammation consist of some of the most highly upregulated genes in our samples, this could potentially dilute out and/or lessen alternative subtype and/or signature gene expression. Importantly, it is now widely appreciated that patient tumors simultaneously consist of heterogeneous tumor cells reflecting multiple molecular subtypes (Couturier et al., 2020; Darmanis et al., 2017; Neftel et al., 2019), providing glioblastoma with a high level of phenotypic plasticity. We also demonstrate that the contribution of additional drivers not always present with EGFRvIII in patient glioblastoma enhances primary brain tumors in vivo. This result is consistent with more aggressive glioblastomas seen in patients with EGFRvIII variants and TP53 loss-of-function mutations (Ruano et al., 2009). It will therefore be interesting in the future to consider how single or multiple driver mutations contribute to subtype-specific gene expression in our model, as well as histopathology, relative to patients. We have included some of these discussion points in our revised manuscript.

      “Some more histologic characterization of the tumors would be helpful. Are they invasive, do larger tumors show necrosis and microvascular proliferation? This would help with understanding the full potential of the new model.”

      We have updated our manuscript to include more histopathological characterization and images (Supplementary Fig 2).

      “Current thinking in established glioblastoma is that the M1/M2 designations for macrophages are not relevant, with microglia macrophage populations showing a mixture of pre- and anti-inflammatory features. Ideally, there would be a much more detailed characterization of the intratumoral microglia/macrophage population here, as single markers can’t be relied upon.”

      We performed additional gene set enrichment analyses (GSEA) using our sequencing datasets and compared p53EPS gene expression to M1/M2 macrophage expression signatures and expression signatures from MCSF-stimulated macrophages at early and late (M2 polarized) time-points. From this analysis, we detected enrichment for markers of both pro- and anti-inflammatory features, although with stronger and significant enrichment for gene expression signatures associated with classical pro-inflammatory M1 macrophages. We have included these GSEA plots and gene set enrichment lists as supplementary materials (Supplementary Fig 6, Supplementary Table 6). We also performed GSEA against a broad curated set of immunologic gene sets (C7: immunologic signature gene sets, Molecular Signatures Database, (Liberzon et al., 2011)) and have included the list of signatures and enrichment scores as a supplementary table (Supplementary Table 6).
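
      For readers less familiar with GSEA, the following is a minimal sketch of the weighted running-sum enrichment score that underlies the method. It is not the analysis pipeline used for the manuscript: the function name, the gene names, and the ranking scores are illustrative assumptions, and the permutation step needed to obtain a normalized enrichment score and p-value is omitted.

```python
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score for a single gene set.

    ranked_genes: gene names sorted by decreasing ranking metric
    scores:       ranking metric for each gene, in the same order
    gene_set:     collection of gene names making up the signature
    p:            weight exponent applied to the ranking metric
    """
    scores = np.asarray(scores, dtype=float)
    in_set = np.array([g in gene_set for g in ranked_genes], dtype=bool)
    n_miss = len(ranked_genes) - int(in_set.sum())
    weights = np.abs(scores) ** p * in_set
    if weights.sum() == 0 or n_miss == 0:
        raise ValueError("gene set must have weighted hits and at least one miss")
    # Step up at each gene in the set (weighted by its score),
    # step down by a constant amount at every other gene.
    hit_step = weights / weights.sum()
    miss_step = (~in_set) / n_miss
    running = np.cumsum(hit_step - miss_step)
    # The enrichment score is the largest deviation of the running sum from zero.
    return float(running[np.argmax(np.abs(running))])

# Hypothetical toy ranking: six genes ordered from most up- to most down-regulated.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
metric = [3, 2, 1, -1, -2, -3]
print(enrichment_score(genes, metric, {"g1", "g2"}))  # set concentrated at the top
print(enrichment_score(genes, metric, {"g5", "g6"}))  # set concentrated at the bottom
```

      A signature concentrated among the most upregulated genes gives a positive score, and one concentrated among the most downregulated genes gives a negative score.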

      “Phagocytosis could have anti-tumor effects through removal of live cancer cells or could be cancer-promoting if apoptotic cells are being rapidly cleared with concomitant activation of an immunosuppressive phenotype in the phagocytes (ie. efferocytosis).” 

      We looked at efferocytosis-associated gene expression in our sequencing dataset (124 “efferocytosis” genes, GeneCards), and while we detected upregulation of certain genes associated with efferocytosis in p53EPS brains, we did not detect significant enrichment for the entire gene set. Furthermore, we did not detect up-regulation of key efferocytosis receptors including Axl and Tyro3 (Supplementary Table 1, 2), compared to normal brains. While efferocytosis may contribute to tumor growth and evolution, this GSEA combined with our functional data supporting an inhibitory role for phagocytes in p53EPS tumor initiation and engraftment following transplantation (Fig 4, Fig 5, Supplementary Fig 7), suggests that efferocytosis is not a major driver of tumor formation in our model. However, how efferocytosis affects tumor progression in our model and/or relapse following therapy will be an interesting feature to explore in the future using temporal manipulations of phagocytes and/or treatments with chemical inhibitors.

      Author response image 1.

      Gene Set Enrichment Analysis (GSEA) for efferocytosis-associated gene expression (124 “efferocytosis” genes in GeneCards) in tp53EPS tumor brains, compared to normal zebrafish brains. Normalized enrichment score (NES) and p-value are indicated. 

      “Do the irf7/8 and chlodronate experiments distinguish between effects on microglia/macrophages and dendritic cells?”

      In addition to microglia/macrophages, the IRF8 transcription factor has been shown to control survival and function of dendritic cells (Sichien et al., 2016). Clodronate treatments are also used to deplete both macrophages and dendritic cells in vivo. Therefore, we cannot distinguish the effects of these manipulations in our experiments and have updated our manuscript throughout to reflect this.

      Reviewer #2:

      “The authors state that oncogenic MAPK/AKT pathway activation drives glial-derived tumor formation. It would be important to include a wild-type or uninjected control for the pERK and pAKT staining shown in Fig1 I-K to aid in the interpretation of these results. Likewise, quantification of the pERK and pAKT staining would be useful to demonstrate the increase over WT, and would also serve to facilitate comparison with the similar staining in the KPG model (Supp Fig 2D).”

      We have updated Fig 1 and Supplementary Fig 3D (formerly Fig 2D), to include histology from tumor-free uninjected control animals, as well as quantifications of p-ERK and p-AKT staining to highlight increased MAPK/AKT signaling pathway activation in our tumor model.  

      “The authors use a transplantation assay to further test the tumorigenic potential of dissociated cells from glial-derived tumors. Listing the percentage of transplants that generate fluorescent tumor would be helpful to fully interpret these data. Additionally, it was not clear based on the description in the results section that the transplantation assay was an “experimental surrogate” to model the relapse potential of the tumor cell. This is first mentioned in the discussion. The authors may consider adding a sentence for clarity earlier in the manuscript as it helps the reader better understand the logic of the assay.” 

      We have clarified in the text the percentage of transplants that generated fluorescent tumor (16-25%, n=3 independent screens). This is also represented in Fig 5C,D. We also added text when introducing the transplantation assay, explaining that transplantation is frequently used as an experimental surrogate to assess relapse potential, and that our objective was to assess tumor cell propagation in the context of specific manipulations within the TME.

      “The authors nicely show high levels of immune cell infiltration and associations between microglia/macrophages and tumor cells. However, a quantification of the emergence of macrophages over time in relation to tumor initiation and growth would provide significant support to the observations of tumor suppressive activity of the phagocytes. Along these lines, the inclusion of a statement about when leukocytes emerge during normal development would be informative for those not familiar with the zebrafish model.”

      In zebrafish, microglia colonize the neural retina by 48 hpf, and the optic tectum by 84 hpf (Herbomel et al., 2001), prior to when we typically observe lesions in our p53EPS brains. To validate the emergence of microglia prior to tumor formation in p53EPS, we have now used live confocal imaging through the brains of uninjected control and p53EPS injected zebrafish at 5, 7 and 9 dpf. As expected, microglia were present throughout the cephalic region and in the brain at 5 dpf (120 hpf). At this stage, p53EPS injected zebrafish brains displayed mosaic cellular expression of her4.1:mScarlet; however, cells were sparse and diffuse, and no large intensely fluorescent tumor-like clusters were detected (n=12/12 tumor negative). At 7 dpf, microglia were observed in the brains of control and p53EPS zebrafish; however, at this stage we detected clusters of her4.1:mScarlet+ cells (n=5/9), indicative of tumor formation. Lesions were found to be surrounded and/or infiltrated by mpeg:EGFP+ microglia. Finally, at 9 dpf her4.1:mScarlet+ expression became highly specific to tumor lesions, and these lesions were associated with mpeg:EGFP+ microglia/macrophages (n=8/8 of tumor-positive zebrafish). These descriptions, along with representative images, have been added to Figure 3.

      “From the data provided in Figure 4G and Supp Fig 7b, the authors suggest that “increased p53EPS tumor initiation following Irf gene knock-down is a consequence of irf7 and irf8 loss-of-function in the TME.” Given the importance of the local microenvironment highlighted in this study, spatial information on the form of in situ hybridization to identify the relevant location of the expression change would be important to support this conclusion.”

      We performed fluorescent in situ hybridization (using HCR RNA-FISH, Molecular Instruments) on whole mount control and irf7 CRISPR-injected p53EPG animals (her4.1:EGFRvIII + her4.1:PI3KCAH1047R + her4.1:GFP; GFP was used in this case because of probe availability).

      Representative confocal projections through tumors, as well as single optical sections are presented and discussed in Figure 4, highlighting the location of irf7 expression change following gene knock-down. We found significant irf7 signal in and surrounding p53EPS tumors at early stages of tumor formation. This expression was reduced and/or lost following irf7 CRISPR gene targeting, consistent with RT-PCR data (Supplementary Fig 7).

      “The authors used neutral red staining that labels lysosomal-rich phagocytes to assess enrichment at the early stages of tumor initiation. The images in Figure 3 panel A should be labeled to denote the uninjected controls to aid in the interpretation of the data. In Supplemental Figure 6, the neutral red staining in the irf8 CRISPR-injected larvae looks to be increased, counter to the quantification. Can the authors comment if the image is perhaps not representative?”

      We have updated Figure 3 and Supplementary Figure 6 to aid in the interpretation of our results. In Fig 3A, we used tumor-negative controls from our injected cohorts. This was done to control for exogenous transgene presence and/or over-expression prior to (or in the absence of) malignant transformation. In Supplementary Fig 6, our images are representative, but we have now used unprocessed images with arrowheads to highlight neutral-red positive foci for clarity. In our original manuscript the images contained software-generated markers, which could have obscured and/or confused the neutral red staining we were trying to highlight.

      Recommendations For the Authors:

      Reviewer #1: 

      “The PI 3-kinase does a lot more than just activating mTOR and Akt – I would suggest modifying that sentence in the introduction.”

      We have adjusted text in the introduction to reflect the broad role for PI3K signaling.

      Reviewer #2:

      “In Supplemental Fig 1, it would be helpful for the authors to provide a co-stain, such as DAPI to label all nuclei, which would allow the reader to assess the morphology of the cells in the context of the surrounding tissue.”

      We have included brightfield images in Supplementary Fig 1, that together with her4.1:mScarlet fluorescence, should help readers assess tumor location and morphology in the context of surrounding tissue. Tumor cell morphology at high-resolution can be visualized in Fig 3, Movie 1 and Movie 2.

      “The authors state that oncogenic MAPK/AKT pathway activation drives glial-derived tumor formation. The authors may consider testing if the addition of an inhibitor of MAPK signaling may prevent or decrease the formation of glial-derived tumors in this context to further support their results.” 

      To further assess the role for MAPK activation, we decided to test the effect of 50 µM AZD6244 MAPK inhibitor following transplantation of dissociated primary p53EPS cells into syngeneic CG1 strain zebrafish embryos, as previously described (Modzelewska et al., 2016). Following 5 days of drug treatments, we did not detect significant differences in tumor engraftment or in tumor size between DMSO control and AZD6244-treated cohorts, suggesting that MAPK inhibition is not sufficient to prevent p53EPS engraftment and growth in our model. In the future, assessments of on-target drug effects, possible resistance mechanisms, and/or testing MAPK inhibitors in combination with other targeted agents including Akt and/or mTOR inhibitors (Edwards et al., 2006; McNeill et al., 2017; Schreck et al., 2020) will enhance our understanding of potential therapeutic strategies.

      Author response image 2.

      Dorsal views of 8 dpf zebrafish larvae engrafted with her4.1:mScarlet+ p53EPS tumor cells following treatment from 3-8 dpf with 0.1% DMSO (control) or 50 µM AZD6244. Tumor cell injections were performed at 2 dpf into syngeneic CG1 strain embryos. The percentage of total animals with persisting engraftment following drug treatments, as well as tumor size (microns squared, quantified using Carl Zeiss ZEN software) are shown for control and AZD6244 treated larvae.

      “Have the authors tested if EGFR and PI3KCA driven by other neural promoters produce similar results, or not? This would help support the specificity of her4.1 neural progenitors and glia as the cell of origin in this model.”

      At this time, we have not tested other neural promoters. However, previous reports describe a zebrafish zic4-driven glioblastoma model with mesenchymal-like gene expression (Mayrhofer et al., 2017), supporting neural progenitors as a cell of origin. In the future it will be interesting to test sox2, nestin, and gfap promoters to further define and support her4.1-expressing neural progenitors and glia as the cell of origin in our model.

      “Other leukocyte populations, such as neutrophils, can also respond to inflammatory cues. Can the authors comment if neutrophils are also observed in the TME?”

      We performed initial assessments of neutrophils in the TME using our expression datasets as well as her4.1:EGFRvIII + her4.1:PI3KCAH1047R co-injection into Tg(mpx:EGFP) strain zebrafish. We observed tumor formation without significant infiltration of mpx:EGFP+ neutrophils. Future investigations will be important to assess differences in the contributions of different myeloid-derived lineages in the TME of p53EPS, as well as how heterogeneity may be altered depending on different oncogenic drivers and/or stage of tumor progression, as seen in human glioblastoma (Friedmann-Morvinski and Hambardzumyan, 2023). We have added text in the discussion section of our manuscript to indicate the possibility of neutrophils and/or other immune cell types contributing to p53EPS tumor biology.

      Author response image 3.

      Control-injected tumor-negative and tumor-positive Tg(mpx:EGFP) zebrafish at 10 dpf. Tg(mpx:EGFP) strain embryos were injected at the one-cell stage with her4.1:EGFRvIII + her4.1:PI3KCAH1047R + her4.1:mScarlet.

      “It is not clear if the transcriptomics data has been deposited in a publicly available database, such as the Gene Expression Omnibus (GEO). Sharing of these data would be a benefit to the field and facilitate use in other studies.”

      We have uploaded all transcriptomic data to GEO under accession GSE246295.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We thank all three Reviewers for their comments and have revised the manuscript accordingly.

      Reviewer #1 (Public Review):

      The main objective of this paper is to report the development of a new intramuscular probe that the authors have named Myomatrix arrays. The goal of the Myomatrix probe is to significantly advance the current technological ability to record the motor output of the nervous system, namely fine-wire electromyography (EMG). Myomatrix arrays aim to provide large-scale recordings of multiple motor units in awake animals under dynamic conditions without undue movement artifacts and maintain long-term stability of chronically implanted probes. Animal motor behavior occurs through muscle contraction, and the ultimate neural output in vertebrates is at the scale of motor units, which are bundles of muscle fibers (muscle cells) that are innervated by a single motor neuron. The authors have combined multiple advanced manufacturing techniques, including lithography, to fabricate large and dense electrode arrays with mechanical features such as barbs and suture methods that would stabilize the probe's location within the muscle without creating undue wiring burden or tissue trauma. Importantly, the fabrication process they have developed allows for rapid iteration from design conception to a physical device, which allows for design optimization of the probes for specific muscle locations and organisms. The electrical output of these arrays is processed through a variety of means to try to identify single motor unit activity. At the simplest, the approach is to use thresholds to identify motor unit activity. Of intermediate data analysis complexity is the use of principal component analysis (PCA, a linear second-order regression technique) to disambiguate individual motor units from the wide field recordings of the arrays, which benefits from the density and numerous recording electrodes. At the highest complexity, they use spike sorting techniques that were developed for Neuropixels, a large-scale electrophysiology probe for cortical neural recordings. 
      Specifically, they use an estimation code called kilosort, which ultimately relies on clustering techniques to separate the multi-electrode recordings into individual spike waveforms.

      The biggest strength of this work is the design and implementation of the hardware technology. It is undoubtedly a major leap forward in our ability to record the electrical activity of motor units. The myomatrix arrays trounce fine-wire EMGs when it comes to the quality of recordings, the number of simultaneous channels that can be recorded, their long-term stability, and resistance to movement artifacts.

      The primary weakness of this work is its reliance on kilosort in circumstances where most of the channels end up picking up the signal from multiple motor units. As the authors quite convincingly show, this setting is a major weakness for fine-wire EMG. They argue that the myomatrix array succeeds in isolating individual motor unit waveforms even in that challenging setting through the application of kilosort.

      Although the authors call the estimated signals as well-isolated waveforms, there is no independent evidence of the accuracy of the spike sorting algorithm. The additional step (spike sorting algorithms like kilosort) to estimate individual motor unit spikes is the part of the work in question. Although the estimation algorithms may be standard practice, the large number of heuristic parameters associated with the estimation procedure are currently tuned for cortical recordings to estimate neural spikes. Even within the limited context of Neuropixels, for which kilosort has been extensively tested, basic questions like issues of observability, linear or nonlinear, remain open. By observability, I mean in the mathematical sense of well-posedness or conditioning of the inverse problem of estimating single motor unit spikes given multi-channel recordings of the summation of multiple motor units. This disambiguation is not always possible. kilosort's validation relies on a forward simulation of the spike field generation, which is then truth-tested against the sorting algorithm. The empirical evidence is that kilosort does better than other algorithms for the test simulations that were performed in the context of cortical recordings using the Neuropixels probe. But this work has adopted kilosort without comparable truth-tests to build some confidence in the application of kilosort with myomatrix arrays.

      Kilosort was developed to analyze spikes from neurons rather than motor units and, as Reviewer #1 correctly points out, despite a number of prior validation studies the conditions under which Kilosort accurately identifies individual neurons are still incompletely understood. Our application of Kilosort to motor unit data therefore demands that we explain which of Kilosort’s assumptions do and do not hold for motor unit data and explain how our modifications of the Kilosort pipeline account for important differences between neural and muscle recordings. We summarize these points below and have included them in the revised manuscript.

      Additionally, both here and in the revised paper we emphasize that while the presented spike sorting methods (thresholding, PCA-based clustering, and Kilosort) robustly extract motor unit waveforms, spike sorting of motor units is still an ongoing project. Our future work will further elaborate how differences between cortical and motor unit data should inform approaches to spike sorting as well as develop simulated motor unit datasets that can be used to benchmark spike sorting methods.

      For our current revision, we have added detailed discussion (see “Data analysis: spike sorting”) of the risks and benefits of our use of Kilosort to analyze motor unit data, in each case clarifying how we have modified the Kilosort code with these issues in mind:

      “Modification of spatial masking: Individual motor units contain multiple muscle fibers (each of which is typically larger than a neuron’s soma), and motor unit waveforms can often be recorded across spatially distant electrode contacts as the waveforms propagate along muscle fibers. In contrast, Kilosort - optimized for the much more local signals recorded from neurons - uses spatial masking to penalize templates that are spread widely across the electrode array. Our modifications to Kilosort therefore include ensuring that Kilosort searches for motor unit templates across all (and only) the electrode channels inserted into a given muscle. In the GitHub repository linked above, this is accomplished by setting parameter nops.sigmaMask to infinity, which effectively eliminates spatial masking in the analysis of the 32 unipolar channels recorded from the injectable Myomatrix array schematized in Supplemental Figure 1g. In cases including chronic recording from mice where only a single 8-contact thread is inserted into each muscle, a similar modification can be achieved with a finite value of nops.sigmaMask by setting parameter NchanNear, which represents the number of nearby EMG channels to be included in each cluster, to equal the number of unipolar or bipolar data channels recorded from each thread. Finally, note that in all cases Kilosort parameter NchanNearUp (which defines the maximum number of channels across which spike templates can appear) must be reset to be equal to or less than the total number of Myomatrix data channels.”

      “Allowing more complex spike waveforms: We also modified Kilosort to account for the greater duration and complexity (relative to neural spikes) of many motor unit waveforms. In the code repository linked above, Kilosort 2.5 was modified to allow longer spike templates (151 samples instead of 61), more spatiotemporal PCs for spikes (12 instead of 6), and more left/right eigenvector pairs for spike template construction (6 pairs instead of 3). These modifications were crucial for improving sorting performance in the nonhuman primate dataset shown in Figure 3, and in a subset of the rodent datasets (although they were not used in the analysis of mouse data shown in Fig. 1 and Supplemental Fig. 2a-f).”
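
      The parameter choices described in the two quoted passages above can be collected in one place, as in the sketch below. This is only an illustrative summary in Python, not code from the Kilosort or Myomatrix repositories (Kilosort 2.5 itself is written in MATLAB). The names sigmaMask, NchanNear, and NchanNearUp come from the text; the function name and the keys template_samples, n_spatiotemporal_pcs, and n_eigenvector_pairs are hypothetical labels for settings that the text describes only by value.

```python
import math

def myomatrix_sort_overrides(n_channels, single_thread=False, complex_waveforms=True):
    """Collect the Kilosort 2.5 setting changes described in the text.

    n_channels:        number of Myomatrix data channels in the muscle being sorted
    single_thread:     True when one thread (e.g. 8 contacts) sits in each muscle
    complex_waveforms: True to allow longer, more complex spike templates
    """
    overrides = {}
    if single_thread:
        # Keep a finite spatial mask but let every channel on the thread
        # contribute to each cluster.
        overrides["NchanNear"] = n_channels
    else:
        # An infinite mask width effectively removes spatial masking.
        overrides["sigmaMask"] = math.inf
    # Templates may not span more channels than were actually recorded.
    overrides["NchanNearUp"] = n_channels
    if complex_waveforms:
        # Hypothetical key names; values are those quoted in the text
        # (defaults: 61 samples, 6 PCs, 3 eigenvector pairs).
        overrides["template_samples"] = 151
        overrides["n_spatiotemporal_pcs"] = 12
        overrides["n_eigenvector_pairs"] = 6
    return overrides

# 32 unipolar channels from an injectable array, with extended templates.
print(myomatrix_sort_overrides(32))
# A single 8-contact thread per muscle, default template complexity.
print(myomatrix_sort_overrides(8, single_thread=True, complex_waveforms=False))
```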

      Furthermore, as the paper on the latest version of kilosort, namely v4, discusses, differences in the clustering algorithm is the likely reason for kilosort4 performing more robustly than kilosort2.5 (used in the myomatrix paper). Given such dependence on details of the implementation and the use of an older kilosort version in this paper, the evidence that the myomatrix arrays truly record individual motor units under all the types of data obtained is under question.

      We chose to modify Kilosort 2.5, which has been used by many research groups to sort spike features, rather than the just-released Kilosort 4.0. Although future studies might directly compare the performance of these two versions on sorting motor unit data, we feel that such an analysis is beyond the scope of this paper, which aims primarily to introduce our electrode technology and demonstrate that a wide range of sorting methods (thresholding, PCA-based waveform clustering, and Kilosort) can all be used to extract single motor units. Additionally, note that because we have made several significant modifications to Kilosort 2.5 as described above, it is not clear what a “direct” comparison between different Kilosort versions would mean, since the procedures we provide here are no longer identical to version 2.5.
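
      To illustrate the two simpler sorting approaches named above (thresholding and PCA-based waveform clustering), the following is a minimal sketch on synthetic single-channel data. It is not the analysis code used in the paper: the waveform shapes, noise level, threshold, and the simple two-cluster split are all assumptions chosen to keep the example short.

```python
import numpy as np

def detect_spikes(signal, threshold, refractory):
    """Return sample indices of peaks that follow upward threshold crossings."""
    above = signal > threshold
    onsets = np.flatnonzero(above[1:] & ~above[:-1]) + 1
    peaks, last = [], -refractory
    for onset in onsets:
        if onset - last < refractory:
            continue  # still inside the previous spike
        peak = onset + int(np.argmax(signal[onset:onset + refractory]))
        peaks.append(peak)
        last = peak
    return np.array(peaks, dtype=int)

def extract_snippets(signal, peaks, half_width):
    """Cut a fixed window around each peak, dropping peaks too near the edges."""
    keep = peaks[(peaks >= half_width) & (peaks + half_width < len(signal))]
    return keep, np.stack([signal[p - half_width:p + half_width + 1] for p in keep])

def pca_project(snippets, n_components=2):
    """Project mean-centered snippets onto their leading principal components."""
    centered = snippets - snippets.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def two_means(values, n_iter=20):
    """Split 1-D values into two clusters with a basic k-means loop."""
    centers = np.array([values.min(), values.max()], dtype=float)
    for _ in range(n_iter):
        labels = (np.abs(values - centers[0]) > np.abs(values - centers[1])).astype(int)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = values[labels == k].mean()
    return labels

def demo(seed=0):
    """Two synthetic units firing alternately on one noisy channel."""
    rng = np.random.default_rng(seed)
    signal = rng.normal(0.0, 0.02, size=21000)
    t = np.arange(-30, 31)
    unit_a = 1.0 * np.exp(-t**2 / (2 * 3.0**2))   # small, narrow waveform
    unit_b = 2.0 * np.exp(-t**2 / (2 * 6.0**2))   # large, broad waveform
    for k in range(10):
        a, b = 500 + 2000 * k, 1500 + 2000 * k
        signal[a - 30:a + 31] += unit_a
        signal[b - 30:b + 31] += unit_b
    peaks = detect_spikes(signal, threshold=0.5, refractory=50)
    peaks, snippets = extract_snippets(signal, peaks, half_width=20)
    labels = two_means(pca_project(snippets, 1)[:, 0])
    return peaks, labels

peaks, labels = demo()
print(len(peaks), labels)
```

      Real recordings differ from this toy case in several ways (multiple channels, bipolar waveforms, overlapping spikes), and those differences are part of why template-based methods such as Kilosort are also used.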

      There is an older paper with a similar goal to use multi-channel recording to perform source-localization that the authors have failed to discuss. Given the striking similarity of goals and the divergence of approaches (the older paper uses a surface electrode array), it is important to know the relationship of the myomatrix array to the previous work. Like myomatrix arrays, the previous work also derives inspiration from cortical recordings, in that case it uses the approach of source localization in large-scale EEG recordings using skull caps, but applies it to surface EMG arrays. Ref: van den Doel, K., Ascher, U. M., & Pai, D. K. (2008). Computed myography: three-dimensional reconstruction of motor functions from surface EMG data. Inverse Problems, 24(6), 065010.

      We thank the Reviewer for pointing out this important prior work, which we now cite and discuss in the revised manuscript under “Data analysis: spike sorting” [lines 318-333]:

      “Our approach to spike sorting shares the same ultimate goal as prior work using skin-surface electrode arrays to isolate signals from individual motor units but pursues this goal using different hardware and analysis approaches. A number of groups have developed algorithms for reconstructing the spatial location and spike times of active motor units (Negro et al. 2016; van den Doel, Ascher, and Pai 2008) based on skin-surface recordings, in many cases drawing inspiration from earlier efforts to localize cortical activity using EEG recordings from the scalp (Michel et al. 2004). Our approach differs substantially. In Myomatrix arrays, the close electrode spacing and very close proximity of the contacts to muscle fibers ensure that each Myomatrix channel records from a much smaller volume of tissue than skin-surface arrays. This difference in recording volume in turn creates different challenges for motor unit isolation: compared to skin-surface recordings, Myomatrix recordings include a smaller number of motor units represented on each recording channel, with individual motor units appearing on a smaller fraction of the sensors than typical in a skin-surface recording. Because of this sensor-dependent difference in motor unit source mixing, different analysis approaches are required for each type of dataset. Specifically, skin-surface EMG analysis methods typically use source-separation approaches that assume that each sensor receives input from most or all of the individual sources within the muscle as is presumably the case in the data. In contrast, the much sparser recordings from Myomatrix are better decomposed using methods like Kilosort, which are designed to extract waveforms that appear only on a small, spatially-restricted subset of recording channels.”

      The incompleteness of the evidence that the myomatrix array truly measures individual motor units is limited to the setting where multiple motor units have similar magnitude of signal in most of the channels. In the simpler data setting where one motor dominates in some channel (this seems to occur with some regularity), the myomatrix array is a major advance in our ability to understand the motor output of the nervous system. The paper is a trove of innovations in manufacturing technique, array design, suture and other fixation devices for long-term signal stability, and customization for different muscle sizes, locations, and organisms. The technology presented here is likely to achieve rapid adoption in multiple groups that study motor behavior, and would probably lead to new insights into the spatiotemporal distribution of the motor output under more naturally behaving animals than is the current state of the field.

      We thank the Reviewer for this positive evaluation and for the critical comments above.

      Reviewer #2 (Public Review):

      Motoneurons constitute the final common pathway linking central impulse traffic to behavior, and neurophysiology faces an urgent need for methods to record their activity at high resolution and scale in intact animals during natural movement. In this consortium manuscript, Chung et al. introduce high-density electrode arrays on a flexible substrate that can be implanted into muscle, enabling the isolation of multiple motor units during movement. They then demonstrate these arrays can produce high-quality recordings in a wide range of species, muscles, and tasks. The methods are explained clearly, and the claims are justified by the data. While technical details on the arrays have been published previously, the main significance of this manuscript is the application of this new technology to different muscles and animal species during naturalistic behaviors. Overall, we feel the manuscript will be of significant interest to researchers in motor systems and muscle physiology, and we have no major concerns. A few minor suggestions for improving the manuscript follow.

      We thank the Reviewer for this positive overall assessment.

      The authors perhaps understate what has been achieved with classical methods. To further clarify the novelty of this study, they should survey previous approaches for recording from motor units during active movement. For example, Pflüger & Burrows (J. Exp. Biol. 1978) recorded from motor units in the tibial muscles of locusts during jumping, kicking, and swimming. In humans, Grimby (J. Physiol. 1984) recorded from motor units in toe extensors during walking, though these experiments were most successful in reinnervated units following a lesion. In addition, the authors might briefly mention previous approaches for recording directly from motoneurons in awake animals (e.g., Robinson, J. Neurophys. 1970; Hoffer et al., Science 1981).

      We agree and have revised the manuscript to discuss these and other prior uses of traditional EMG, including here [lines 164-167]:

      “The diversity of applications presented here demonstrates that Myomatrix arrays can obtain high-resolution EMG recordings across muscle groups, species, and experimental conditions including spontaneous behavior, reflexive movements, and stimulation-evoked muscle contractions. Although this resolution has previously been achieved in moving subjects by directly recording from motor neuron cell bodies in vertebrates (Hoffer et al. 1981; Robinson 1970; Hyngstrom et al. 2007) and by using fine-wire electrodes in moving insects (Pfluger 1978; Putney et al. 2023), both methods are extremely challenging and can only target a small subset of species and motor unit populations. Exploring additional muscle groups and model systems with Myomatrix arrays will allow new lines of investigation into how the nervous system executes skilled behaviors and coordinates the populations of motor units both within and across individual muscles…”

      For chronic preparations, additional data and discussion of the signal quality over time would be useful. Can units typically be discriminated for a day or two, a week or two, or longer?

      A related issue is whether the same units can be tracked over multiple sessions and days; this will be of particular significance for studies of adaptation and learning.

      Although the yields of single units are greatest in the 1-2 weeks immediately following implantation, in chronic preparations we have obtained well-isolated single units up to 65 days post-implant. Anecdotally, in our chronic mouse implants we occasionally see motor units on the same channel across multiple days with similar waveform shapes and patterns of behavior-locked activity. However, because data collection for this manuscript was not optimized to answer this question, we are unable to verify whether these observations actually reflect cross-session tracking of individual motor units. For example, in all cases animals were disconnected from data collection hardware in between recording sessions (which were often separated by multiple intervening days) preventing us from continuously tracking motor units across long timescales. We agree with the reviewer that long-term motor unit tracking would be extremely useful as a tool for examining learning and plan to address this question in future studies.

      We have added a discussion of these issues to the revised manuscript [lines 52-59]:

      “…These methods allow the user to record simultaneously from ensembles of single motor units (Fig. 1c,d) in freely behaving animals, even from small muscles including the lateral head of the triceps muscle in mice (approximately 9 mm in length with a mass of 0.02 g 23). Myomatrix recordings isolated single motor units for extended periods (greater than two months, Supp. Fig. 3e), although highest unit yield was typically observed in the first 1-2 weeks after chronic implantation. Because recording sessions from individual animals were often separated by several days during which animals were disconnected from data collection equipment, we are unable to assess based on the present data whether the same motor units can be recorded over multiple days.”

      Moreover, we have revised Supplemental Figure 3 to show an example of single motor units recorded >2 months after implantation:

      Author response image 1.

      Longevity of Myomatrix recordings. In addition to isolating individual motor units, Myomatrix arrays also provide stable multi-unit recordings of comparable or superior quality to conventional fine wire EMG…. (e) Although individual motor units were most frequently recorded in the first two weeks of chronic recordings (see main text), Myomatrix arrays also isolate individual motor units after much longer periods of chronic implantation, as shown here where spikes from two individual motor units (colored boxes in bottom trace) were isolated during locomotion 65 days after implantation. This bipolar recording was collected from the subject plotted with unfilled black symbols in panel (d).

      It appears both single-ended and differential amplification were used. The authors should clarify in the Methods which mode was used in each figure panel, and should discuss the advantages and disadvantages of each in terms of SNR, stability, and yield, along with any other practical considerations.

      We thank the reviewer for the suggestion and have added text to all figure legends clarifying whether each recording was unipolar or bipolar.

      Is there likely to be a motor unit size bias based on muscle depth, pennation angle, etc.?

      Although such biases are certainly possible, the data presented here are not well-suited to answering these questions. For chronic implants in small animals, the target muscles (e.g. triceps in mice) are so small that the surgeon often has little choice about the site and angle of array insertion, preventing a systematic analysis of this question. For acute array injections in larger animals such as rhesus macaques, we did not quantify the precise orientation of the arrays (e.g. with ultrasound imaging) or the muscle fibers themselves, again preventing us from drawing strong conclusions on this topic. This question is likely best addressed in acute experiments performed on larger muscles, in which the relative orientations of array threads and muscle fibers can be precisely imaged and systematically varied to address this important issue.

      Can muscle fiber conduction velocity be estimated with the arrays?

      We sometimes observe fiber conduction delays up to 0.5 msec as the spike from a single motor unit moves from electrode contact to electrode contact, so spike velocity could be easily estimated given the known spatial separation between electrode contacts. However (closely related to the above question) this will only provide an accurate estimate of muscle fiber conduction velocity if the electrode contacts are arranged parallel to fiber direction, which is difficult to assess in our current dataset. If the arrays are not parallel, this computation will produce an overestimate of conduction velocity, as in the extreme case where a line of electrode contacts arranged perpendicular to the fiber direction might have identical spike arrival times, and therefore appear to have an infinite conduction velocity. Therefore, although Myomatrix arrays can certainly be used to estimate conduction velocity, such estimates should be performed in future studies only in settings where the relative orientation of array threads and muscle fibers can be accurately measured.
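
      The geometry behind this caveat can be sketched numerically. The following is a minimal illustration, not the authors' analysis: the 2 mm contact spacing, the 4 m/s fiber velocity, and both function names are hypothetical choices made for the example. It assumes a spike travelling along the fiber at constant velocity v, so that two contacts separated by d on a thread oriented at angle θ to the fiber see a delay of d·cos(θ)/v.

      ```python
      import math

      def apparent_velocity_m_per_s(spacing_mm, delay_ms):
          """Naive estimate: contact spacing divided by inter-contact spike delay.

          mm/ms is numerically equal to m/s. A zero delay yields infinity.
          """
          if delay_ms == 0:
              return math.inf
          return spacing_mm / delay_ms

      def inter_contact_delay_ms(spacing_mm, fiber_velocity_m_per_s, angle_deg):
          """Delay between two contacts when the thread lies at angle_deg to the fiber.

          Only the component of the spacing along the fiber contributes to the delay.
          """
          along_fiber_mm = spacing_mm * math.cos(math.radians(angle_deg))
          return along_fiber_mm / fiber_velocity_m_per_s

      # Hypothetical values, for illustration only.
      SPACING_MM = 2.0
      TRUE_VELOCITY_M_PER_S = 4.0

      for angle in (0, 30, 60):
          delay = inter_contact_delay_ms(SPACING_MM, TRUE_VELOCITY_M_PER_S, angle)
          estimate = apparent_velocity_m_per_s(SPACING_MM, delay)
          print(f"angle={angle:2d} deg  delay={delay:.3f} ms  estimate={estimate:.2f} m/s")
      ```

      Under these assumptions the naive estimate matches the true velocity only when the contacts run parallel to the fiber. It grows as 1/cos(θ) with misalignment (doubling at 60°), and in the perpendicular limit the delay vanishes and the estimate diverges.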

      The authors suggest their device may have applications in the diagnosis of motor pathologies. Currently, concentric needle EMG to record from multiple motor units is the standard clinical method, and they may wish to elaborate on how surgical implantation of the new array might provide additional information for diagnosis while minimizing risk to patients.

      We thank the reviewer for the suggestion and have modified the manuscript’s final paragraph accordingly [lines 182-188]:

      “Applying Myomatrix technology to human motor unit recordings, particularly by using the minimally invasive injectable designs shown in Figure 3 and Supplemental Figure 1g,i, will create novel opportunities to diagnose motor pathologies and quantify the effects of therapeutic interventions in restoring motor function. Moreover, because Myomatrix arrays are far more flexible than the rigid needles commonly used to record clinical EMG, our technology might significantly reduce the risk and discomfort of such procedures while also greatly increasing the accuracy with which human motor function can be quantified. This expansion of access to high-resolution EMG signals – across muscles, species, and behaviors – is the chief impact of the Myomatrix project.”

      Reviewer #3 (Public Review):

      This work provides a novel design of implantable and high-density EMG electrodes to study muscle physiology and neuromotor control at the level of individual motor units. Current methods of recording EMG using intramuscular fine-wire electrodes do not allow for isolation of motor units and are limited by the muscle size and the type of behavior used in the study. The authors of Myomatrix arrays had set out to overcome these challenges in EMG recording and provided compelling evidence to support the usefulness of the new technology.

      Strengths:

      They presented convincing examples of EMG recordings with high signal quality using this new technology from a wide array of animal species, muscles, and behavior.

      • The design included suture holes and pull-on tabs that facilitate implantation and ensure stable recordings over months.

      • Clear presentation of specifics of the fabrication and implantation, recording methods used, and data analysis.

      We thank the Reviewer for these comments.

      Weaknesses:

      The justification for the need to study the activity of isolated motor units is underdeveloped. The study could be strengthened by providing example recordings from studies that try to answer questions where isolation of motor unit activity is most critical. For example, there is immense value for understanding muscles with smaller innervation ratio which tend to have many motor neurons for fine control of eyes and hand muscles.

      We thank the Reviewer for the suggestion and have modified the manuscript accordingly [lines 170-174]:

      “…how the nervous system executes skilled behaviors and coordinates the populations of motor units both within and across individual muscles. These approaches will be particularly valuable in muscles in which each motor neuron controls a very small number of muscle fibers, allowing fine control of oculomotor muscles in mammals as well as vocal muscles in songbirds (Fig. 2g), in which most individual motor neurons innervate only 1-3 muscle fibers (Adam et al. 2021).”

      Reviewer #1 (Recommendations for The Authors):

      I would urge the authors to consider a thorough validation of the spike sorting piece of the workflow. Barring that weakness, this paper has the potential to transform motor neuroscience. The validation efforts of kilosort in the context of Neuropixels might offer a template for how to convince the community of the accuracy of myomatrix arrays in disambiguating individual motor unit waveforms.

      I have a few minor detailed comments, that the authors may find of some use. My overall comment is to commend the authors for the precision of the work as well as the writing. However, exercising caution associated with kilosort could truly elevate the paper by showing where there is room for improvement.

      We thank the Reviewer for these comments - please see our summary of our revisions related to Kilosort in our reply to the public reviews above.

      L6-7: The relationship between motor unit action potential and the force produced is quite complicated in muscle. For example, recent work has shown how decoupled the force and EMG can be during nonsteady locomotion. Therefore, it is not a fully justified claim that recording motor unit potentials will tell us what forces are produced. This point relates to another claim made by the authors (correctly) that EMG provides better quality information about muscle motor output in isometric settings than in more dynamic behaviors. That same problem could also apply to motor unit recordings and their relationship to muscle force. The relationship is undoubtedly strong in an isometric setting. But as has been repeatedly established, the electrical activity of muscle is only loosely related to its force output and lacks in predictive power.

      This is an excellent point, and our revised manuscript now addresses this issue [lines 174-176]:

      “…Of further interest will be combining high-resolution EMG with precise measurement of muscle length and force output to untangle the complex relationship between neural control, body kinematics, and muscle force that characterizes dynamic motor behavior. Similarly, combining Myomatrix recordings with high-density brain recordings….”

      L12: There is older work that uses an array of skin mounted EMG electrodes to solve a source location problem, and thus come quite close to the authors' stated goals. However, the authors have failed to cite or provide an in-depth analysis and discussion of this older work.

      As described above in the response to Reviewer 1’s public review comments, we now cite and discuss these papers.

      L18-19: "These limitations have impeded our understanding of fundamental questions in motor control, ..." There are two independently true statements here. First is that there are limitations to EMG based inference of motor unit activity. Second is that there are gaps in the current understanding of motor unit recruitment patterns and modification of these patterns during motor learning. But the way the first few paragraphs have been worded makes it seem like motor unit recordings is a panacea for these gaps in our knowledge. That is not the case for many reasons, including key gaps in our understanding of how muscle's electrical activity relates to its force, how force relates to movement, and how control goals map to specific movement patterns. This manuscript would in fact be strengthened by acknowledging and discussing the broader scope of gaps in our understanding, and thus more precisely pinpointing the specific scientific knowledge that would be gained from the application of myomatrix arrays.

      We agree and have revised the manuscript to note this complexity (see our reply to this Reviewer’s other comment about muscle force, above).

      L140-143: The estimation algorithms yield potential spikes, but without validation of the sorting algorithms it is not justifiable to conclude that the Myomatrix arrays have already provided information about individual motor units.

      Please see our replies to Reviewer #1's public comments (above) regarding motor unit spike sorting.

      L181-182: "These methods allow very fine pitch escape routing (<10 µm spacing), alignment between layers, and uniform via formation." I find this sentence hard to understand. Perhaps there is some grammatical ambiguity?

      We have revised this passage as follows [lines 194-197]:

      "These methods allow very fine pitch escape routing (<10 µm spacing between the thin “escape” traces connecting electrode contacts to the connector), spatial alignment between the multiple layers of polyimide and gold that constitute each device, and precise definition of “via” pathways that connect different layers of the device.”

      L240: What is the rationale for choosing this frequency band for the filter?

      Individual motor unit waveforms have peak energy at roughly 0.5-2.0 kHz, although units recorded at very high SNR often have voltage waveform features at higher frequencies. The high- and lowpass cutoff frequencies should reflect this, although there is nothing unique about the 350 Hz and 7,000 Hz cutoffs we describe, and in all recordings similar results can be obtained with other choices of low/high frequency cutoffs.
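
      To make the role of the two cutoffs concrete, here is a minimal band-pass sketch. It is not the authors' filter, whose design is not described in this reply: it uses a brick-wall frequency-domain mask built on a naive DFT, and the 30 kHz sampling rate, the two-tone test signal, and all function names are assumptions chosen for illustration. Only the 350 Hz and 7,000 Hz cutoffs come from the text above.

      ```python
      import cmath
      import math

      def dft(samples):
          """Naive O(N^2) discrete Fourier transform (fine for short illustrative signals)."""
          n = len(samples)
          return [
              sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
              for k in range(n)
          ]

      def inverse_dft_real(spectrum):
          """Inverse DFT, returning the real part of each reconstructed sample."""
          n = len(spectrum)
          return [
              sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
              for t in range(n)
          ]

      def brickwall_bandpass(samples, fs_hz, low_hz, high_hz):
          """Zero every frequency bin outside [low_hz, high_hz], then transform back."""
          n = len(samples)
          spectrum = dft(samples)
          for k in range(n):
              # Bins above n/2 hold negative frequencies; fold them to their magnitude.
              freq_hz = min(k, n - k) * fs_hz / n
              if not (low_hz <= freq_hz <= high_hz):
                  spectrum[k] = 0
          return inverse_dft_real(spectrum)

      # Hypothetical test signal: a slow 100 Hz component plus a 1 kHz component.
      FS_HZ = 30_000
      N = 300
      times = [i / FS_HZ for i in range(N)]
      slow = [math.sin(2 * math.pi * 100 * t) for t in times]
      fast = [math.sin(2 * math.pi * 1000 * t) for t in times]
      mixed = [s + f for s, f in zip(slow, fast)]

      filtered = brickwall_bandpass(mixed, FS_HZ, 350, 7000)
      residual = max(abs(y - f) for y, f in zip(filtered, fast))
      print(f"max deviation from the 1 kHz component: {residual:.2e}")
      ```

      In this toy signal the 100 Hz component lies below the 350 Hz cutoff and is removed, while the 1 kHz component, which sits in the band where motor unit waveforms are said to have peak energy, passes through. Shifting the cutoffs moderately would leave this result unchanged, consistent with the point that the specific values are not unique.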

      L527-528: There are some key differences between the electrode array design presented here and traditional fine-wire EMG in terms of features used to help with electrode stability within the muscle. A barb-like structure is formed in traditional fine-wire EMG by bending the wire outside the canula of the needle used to place it within the muscle. But when the wire is pulled out, it is common for the barb to break off and be left behind. This is because of the extreme (thin) aspect ratio of the barb in fine wire EMG and low-cycle fatigue fracture of the wire. From the schematic shown here, the barb design seems to be stubbier and thus less prone to breaking off. This raises the question of how much damage is inflicted during the pull-out and the associated level of discomfort to the animal as a result. The authors should present a more careful statement and documentation with regard to this issue.

      We have updated the manuscript to highlight the ease of inserting and removing Myomatrix probes, and to clarify that in over 100 injectable insertions/removals there have been zero cases of barbs (or any other part of the devices) breaking off within the muscle [lines 241-249]:

      “…Once the cannula was fully inserted, the tail was released, and the cannula slowly removed. After recording, the electrode and tail were slowly pulled out of the muscle together. Insertion and removal of injectable Myomatrix devices appeared to be comparable or superior to traditional fine-wire EMG electrodes (in which a “hook” is formed by bending back the uninsulated tip of the recording wire) in terms of both ease of injection, ease of removal of both the cannula and the array itself, and animal comfort. Moreover, in over 100 Myomatrix injections performed in rhesus macaques, there were zero cases in which Myomatrix arrays broke such that electrode material was left behind in the recorded muscle, representing a substantial improvement over traditional fine-wire approaches, in which breakage of the bent wire tip regularly occurs (Loeb and Gans 1986).”

      Reviewer #2 (Recommendations For The Authors):

      The Abstract states the device records "muscle activity at cellular resolution," which could potentially be read as a claim that single-fiber recording has been achieved. The authors might consider rewording.

      The Reviewer is correct, and we have removed the word “cellular”.

      The supplemental figures could perhaps be moved to the main text to aid readers who prefer to print the combined PDF file.

      After finalizing the paper, we will upload all main-text and supplemental figures as a single PDF on bioRxiv for readers who prefer a single file. However, given that the supplemental figures provide more technical and detailed information than the main-text figures, for the paper on the eLife site we prefer the current eLife format in which supplemental figures are associated with individual main-text figures online.

      Reviewer #3 (Recommendations For The Authors):

      • The work could be strengthened by showing examples of simultaneous recordings from different muscles.

      Although Myomatrix arrays can indeed be used to record simultaneously from multiple muscles, in this manuscript we have decided to focus on high-resolution recordings that maximize the number of recording channels and motor units obtained from a single muscle. Future work from our group will introduce larger Myomatrix arrays optimized for recording from many muscles simultaneously.

      • The implantation did not include mention of testing the myomatrix array during surgery by using muscle stimulation to verify correct placement and connection.

      As the Reviewer points out, electrical stimulation is a valuable tool for confirming successful EMG placement. However, we did not use this approach in the current study, relying instead on anatomical confirmation of muscle targeting (e.g. intrasurgical and postmortem inspection in rodents) and on implanting large, easy-to-target arm muscles (in primates) where the risk of mis-targeting is extremely low. Future studies will examine both electrical stimulation and ultrasound methods for confirming the placement of Myomatrix arrays.

      References cited above

      Adam, I., A. Maxwell, H. Rossler, E. B. Hansen, M. Vellema, J. Brewer, and C. P. H. Elemans. 2021. 'One-to-one innervation of vocal muscles allows precise control of birdsong', Curr Biol, 31: 3115-24 e5.

      Hoffer, J. A., M. J. O'Donovan, C. A. Pratt, and G. E. Loeb. 1981. 'Discharge patterns of hindlimb motoneurons during normal cat locomotion', Science, 213: 466-7.

      Hyngstrom, A. S., M. D. Johnson, J. F. Miller, and C. J. Heckman. 2007. 'Intrinsic electrical properties of spinal motoneurons vary with joint angle', Nat Neurosci, 10: 363-9.

      Loeb, G. E., and C. Gans. 1986. Electromyography for Experimentalists, First edi (The University of Chicago Press: Chicago, IL).

      Michel, C. M., M. M. Murray, G. Lantz, S. Gonzalez, L. Spinelli, and R. Grave de Peralta. 2004. 'EEG source imaging', Clin Neurophysiol, 115: 2195-222.

      Negro, F., S. Muceli, A. M. Castronovo, A. Holobar, and D. Farina. 2016. 'Multi-channel intramuscular and surface EMG decomposition by convolutive blind source separation', J Neural Eng, 13: 026027.

      Pfluger, H. J., and M. Burrows. 1978. 'Locusts use the same basic motor pattern in swimming as in jumping and kicking', Journal of Experimental Biology, 75: 81-93.

      Putney, Joy, Tobias Niebur, Leo Wood, Rachel Conn, and Simon Sponberg. 2023. 'An information theoretic method to resolve millisecond-scale spike timing precision in a comprehensive motor program', PLOS Computational Biology, 19: e1011170.

      Robinson, D. A. 1970. 'Oculomotor unit behavior in the monkey', J Neurophysiol, 33: 393-403.

      van den Doel, Kees, Uri M Ascher, and Dinesh K Pai. 2008. 'Computed myography: three-dimensional reconstruction of motor functions from surface EMG data', Inverse Problems, 24: 065010.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations For The Authors):

      1. Experiments regarding the inducible expression of MukBEF: The authors should provide western blots or rt-qPCR for MukBEF expression at 40 min and 2H.

      We now provide a western blot of MukB in non-induced and induced conditions as Figure 1-figure supplement 1D.

      1. Experiments with RiTer and LiTer constructs:

      a. Authors compare the mukB deletion against wild type (Fig. 2C). It would be additionally informative if these comparisons are made for matP deletion and wild type as well. This will strengthen the conclusion that long-range interactions in ter do increase in the absence of matP.

      We agree that the matP mutant may help the reader to compare the effect of the translocation in different backgrounds and have added it to the figure. This strengthens the conclusion that long-range interactions in ter do increase in the absence of matP in a rearranged chromosome, as observed in the WT configuration (Lioy et al., 2018).

      b. Additionally, in Fig. 2C, it appears that there is some decrease in long-range interactions in the absence of mukB in ter1 (Riter). Is this a significant change?

      The change observed is not significant. The results shown in Fig. 2C have been obtained using a 3C approach, which generated slightly more variability than Hi-C. Furthermore, we measured the range of contacts for the segment corresponding to Ter1 in RiTer (matS12-matS28), in different genetic contexts and different configurations. The results show that this level of variation is not significant (see graph below reporting two independent experiments).

      Author response image 1.

      Range of interactions measured on the interval matS12-matS18 in different genetic contexts and different configurations (MG1655 WT(1 and 2), ∆mukB, RiTer, RiTer ∆mukB).

      1. Experiments with various matS organizations: These experiments are interesting and an important part of the paper. However, it is rather hard to visualize the chromosome conformations in the strains after transposition. To aid the reader (particularly with panel E), authors can provide schematics of the chromosome conformations and anticipated/ observed chromosomal interactions. Circular interaction plots would be useful here.

      We thank the reviewer for this interesting remark. We have tried in the past to represent these interactions using a circular representation (see, for example, the website of Ivan Junier: https://treetimc.github.io/circhic/index.html). However, this representation is not easy for nonspecialists to interpret, especially in strains with a rearranged chromosome configuration. Nonetheless, we have added graphical circular representations of the chromosome configurations to help the reader.

      1. ChIP experiments:

      a. This section of the manuscript needs to be further strengthened. It is not clear whether the ChIP signal observed is significant (for example, at T10 or T20 min, the peak value does not appear to go above 1.1-fold). Can the authors be sure that this small increase is not simply a consequence of an increase in copy number of the loci around the origin, as replication has initiated?

      The basal value of the ChIP on the non-replicated sequences (between 0-3.5 Mb for 10 minutes and 0-3 Mb for 20 minutes) is 0.8 and 0.7, respectively, whereas the mean value of the replicated sequence is 1.6 and 1.45. The enrichment observed for these two time points is therefore about 2-fold, not 1.1-fold, and it is 4-fold for T40min. These values were obtained by dividing the number of normalized reads in the ChIP (the number of reads at each position divided by the total number of reads) by the normalized reads of the input. Therefore, the increase in copy number is accounted for in the calculation. Furthermore, we added a supplementary figure (Figure Sup9) in which we performed a ChIP without tags on synchronized cells, and in this case, we did not observe any enrichment triggered by replication.
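
      The normalization described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the read counts are invented, the function name is hypothetical, and both tracks are assumed to cover the same bins with no empty input bins.

      ```python
      def normalized_enrichment(chip_reads, input_reads):
          """Per-position ChIP/input ratio after scaling each track by its total reads.

          Assumes both lists cover the same positions and every input count is non-zero.
          """
          chip_total = sum(chip_reads)
          input_total = sum(input_reads)
          return [
              (chip / chip_total) / (inp / input_total)
              for chip, inp in zip(chip_reads, input_reads)
          ]

      # Hypothetical read counts over four bins; the last two bins are already
      # replicated, so their copy number (and input coverage) is doubled.
      input_reads = [100, 100, 200, 200]

      # Case 1: the ChIP signal simply follows copy number.
      copy_number_only = normalized_enrichment([50, 50, 100, 100], input_reads)

      # Case 2: the ChIP signal rises beyond copy number on the replicated bins.
      true_enrichment = normalized_enrichment([50, 50, 200, 200], input_reads)

      print(copy_number_only)
      print(true_enrichment)
      print(true_enrichment[2] / true_enrichment[0])
      ```

      In the first case the ratio is flat at about 1 everywhere, because the doubled copy number appears in both the ChIP and the input and cancels. In the second case the unreplicated bins fall below 1 and the replicated bins rise above it, with a 2-fold contrast between them, which has the same form as the basal-versus-replicated comparison reported above.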

      b. Authors make a conclusion that MukB loads behind the replication fork. However, the time resolution of the presented experiments is not sufficient to be certain of this. Authors would need to perform more time-resolved experiments for the same.

      Reviewer 1 is correct; we attempted to discriminate whether the observed enrichment is (i) associated with the replication fork since we observed a decrease in the center of the enrichment at oriC as the maximum enrichment moves away with the replication fork after 20 and 40 minutes, or (ii) associated with the newly replicated sequence. To investigate this, we attempted to induce a single round of replication by shifting the cells back to 40°C after 10 minutes at 30°C. Unfortunately, replication initiation is not immediately halted by shifting the cells to 40°C, and we were unable to induce a single round of replication. To clarify our conclusions, we modified our manuscript to read:

      “Altogether, these findings indicate that MukBEF is loaded into regions newly replicated either at the replication fork or even further behind it, except in the Ter region from which it would be excluded.”

      c. Authors conclude that in the LiTer7 strain, MukB signal is absent from Ter2. However, when compared with the ChIP profiles by eye across panels in A and B, this does not seem to be significant. In the same results sections, authors state that there is a 3-fold increase in MukB signal in other regions. The corresponding graph does not show the same.

      Rather than relying solely on the enrichment levels, which can be challenging to compare across different strains due to slight variations in replication levels, we believe there is a clear disruption in this profile that corresponds to the Ter2 sequence. Furthermore, this discontinuity in enrichment relative to the replication profile is also observable in the WT configuration. At T40min, MukB ChIPseq signals halt at the Ter boundary, even though Ter is actively undergoing replication, as evidenced by observations in the input data.

      Regarding the fold increase of MukB, Reviewer 1 is correct; we overestimated this enrichment in the text and have now corrected it.

      d. Authors should provide western blot of MukB-Flag.

      We have added Supplementary Figure 1D, which contains a Western blot of MukB-Flag.

      1. The bioinformatic analysis of matS site distribution is interesting, but this is not followed upon. The figure (Fig 5) is better suited in the supplement and used only as a discussion point.

      We acknowledge the reviewer's point, but we used this section to attempt to extend our findings to other bacteria and emphasize the observation that even though a few matS sites are necessary to inhibit MukBEF, the Ter domains are large and centered on dif even in other bacteria.

      1. The discussion section is lacking many references and key papers have not been cited (paragraph 1 of discussion for example has no references).

      The possibility that SMC-ScpAB and MukBEF can act independently of replication has been suggested previously, but this work is not cited or discussed. Similarly, there is some evidence for SMC-ScpAB association with newly replicated DNA (PMID 21923769).

      We have added references to the suggested paragraph and highlighted the fact that MukBEF's activity independent of replication was already known. However, we believe that the situation is less clear for SMC-ScpAB in B. subtilis or C. crescentus. Similarly, we found no clear evidence that SMC-ScpAB is associated with newly replicated DNA in the referenced studies.

      To clarify and enrich the discussion section, we have added a paragraph that provides perspective on the loading mechanisms of SMC-ScpAB and MukBEF.

      1. There are minor typographical errors that should be corrected. Some are highlighted here:

      a. Abstract: L5: "preferentially 'on' instead of 'in'"

      b. Introduction: Para 1 L8: "features that determine"

      c. Introduction: Para 2 L1: please check the phrasing of this line

      d. Results section 2: L1: Ter "MD" needs to be explained

      e. Page 8: Para 2: L6: "shows that 'a'"

      g. Page 13: Para 2: "MukBEF activity...". This sentence needs to be fixed.

      i. Figure 4: "input" instead of "imput"

      We thank Reviewer 1 for pointing out all these grammatical or spelling mistakes. We have corrected them all.

      f. Page 12: Para 2: "Xer" instead of "XDS"?

      We added a reference to clarify the term.

      h. Methods: ChIP analysis: Authors state "MatP peaks", however, reported data is for MukB

      This description pertains to the matP peak detection shown in Supplementary Figure 3. We have incorporated this clarification into the text.

      j. Supplementary figure legends need to be provided (currently main figure legends appear to be pasted twice)

      Supplementary figure legends are provided at the end of the manuscript, and we have edited the manuscript to remove one copy of the figure legends.

      k. Authors should ensure sequencing data are deposited in an appropriate online repository and an accession number is provided.

      We waited for the appropriate timing in the editing process to upload our data, which we have now done. Additionally, we have added a data availability section to the manuscript, including sequence references on the NCBI.

      Reviewer #2 (Recommendations For The Authors):

      The authors largely avoid speculation on what might be the physiological relevance of the exclusion of MukBEF (and Smc-ScpAB) from the replication termination region (and the coordination with DNA replication). At this stage it would be helpful to present possible scenarios even if not yet supported by data. The authors should for example consider the following scenario: loop extrusion of a dif site in a chromosome dimer followed by dimer resolution by dif recombination leads to two chromosomes that are linked together by MukBEF (equivalent to cohesin holding sister chromatids together in eukaryotes but without a separase). This configuration (while rare) will hamper chromosome segregation. Is MatP particularly important under conditions of elevated levels of chromosome dimers? Could this even be experimentally tested? Other scenarios might also be entertained.

      Although we prefer to avoid speculation, we agree that proposing some hypotheses may help the reader. To do so, we have added a few sentences at the end of our discussion: “We may speculate, based on in vitro observations (Kumar et al., 2022), that MukBEF could interfere with TopIV activity and delay potential chromosome decatenation. Another possibility is that chromosome dimers resolved at the dif site may become trapped in loops formed by MukBEF, thus delaying segregation. But none of these possible scenarios are supported by data yet, and a major challenge for the future is to determine whether and how MukBEF may interfere with one or both of these processes.”

      The manuscript text is well written. However, the labeling of strains in figures and text is sometimes inconsistent which can be confusing (LiTer Liter liter; e.g Riter Fig 2C). For consistency, always denote the number of matS sites in LiTer strains and also in the RiTer strain. The scheme denoting LiTer and RiTer strains should indicate the orientation of DNA segments so it is clear that the engineering does not involve inversion (correct?). Similarly: Use uniform labelling for time points: see T40mn vs 40mn vs T2H vs 2H

      We have reviewed the manuscript to standardize our labeling. Additionally, we have included a schema in Figure 2, indicating the matS numbers at the Ter border to emphasize that the transposition events do not involve inversion.

      matS sites do not have identical sequences and bind different levels of MatP (suppl fig 3). Does this possibly affect the interpretation of some of the findings (when altering few or only a single matS site). Maybe a comment on this possibility can be added.

      We agree with the referee; we do not want to conclude too strongly about the impact of matS density, so we have added this sentence at the end of the section titled 'matS Determinants to Prevent MukBEF Activity':

      “Altogether, assuming that differences in the matS sequences do not modify MatP's ability to bind to the chromosome and affect its capacity to inhibit MukBEF, these results suggested that the density of matS sites in a small chromosomal region has a greater impact than dispersion of the same number of matS sites over a larger segment”

      Figure 5: show selected examples of matS site distribution in addition to the averaged distribution (as in supplemental figure)?

      Figure 5 shows the median of the matS distribution based on the matS positions of 16 species as displayed in the supplementary figure. We believe that this figure is interesting as it represents the overall matS distribution across the Enterobacterales, Pasteurellales, and Vibrionales.

      How do authors define 'background levels' (page 9) in their ChIP-Seq experiments? Please add a definition or reword.

      We agree that the term 'background level' here could be confusing, so we have modified it to 'basal level' to refer to the non-replicating sequence. The background level can be observed in Supplementary Figure 9 in the ChIP without tags, and, on average, the background level is 1 throughout the entire chromosome in these control experiments.

      This reviewer would naively expect the normalized ChIP-Seq signals to revolve around a ratio of 1 (Fig. 4)? They do in one panel (Figure 4B) but not in the others (Figure 4A). Please provide an explanation.

      We thank the referee for this pertinent observation. An error was made during the smoothing of the data in Figure 4A, which resulted in an underestimation of the input values. This mistake does not alter the profile of the ChIP (it's a division by a constant) and our conclusions. We provide a revised version of the figure.

      Inconsistent axis labelling: e.g Figure 4

      Enterobacterals should be Enterobacterales (?)

      KB should be kb

      MB should be Mb

      Imput should be Input

      FlaG should be Flag

      We have made the suggested modifications to the text.

      'These results unveiled that fluorescent MukBEF foci previously observed associated with the Ori region were probably not bound to DNA' Isn't the alternative scenario that MukBEF bound to distant DNA segments colocalize an equally likely scenario? Please rephrase.

      Since we lack evidence regarding what triggers the formation of a unique MukB focus associated with the origin and what this focus could represent, we have removed this sentence.

      Reviewer #3 (Recommendations For The Authors):

      The text is well-written and easy to follow, but I would suggest several improvements to make things clearer:

      1. Many plots are missing labels or legends. (I) All contact plots such as Fig. 1C should have a color legend. It is not clear how large the signal is and whether the plots are on the same scale. (II) Ratiometric contact plots such as in Fig. 1D should indicate what values are shown. Is this a log ratio?

      As indicated in the materials and methods section, the ratio presented in this manuscript was calculated for each point on the map by dividing the number of contacts in one condition by the number of contacts in the other condition. The Log2 of the ratio was then plotted using a Gaussian filter.
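
      As an illustration only, the calculation described above could be sketched as follows. This is not the authors' MATLAB code: the function names, the pseudocount, the kernel width, and the choice to smooth after taking the log are all assumptions made for the example.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_rows(matrix, kernel):
    """Convolve every row with the kernel, renormalizing weights at the edges."""
    radius = len(kernel) // 2
    smoothed = []
    for row in matrix:
        n = len(row)
        new_row = []
        for j in range(n):
            acc, wsum = 0.0, 0.0
            for k, w in enumerate(kernel):
                idx = j + k - radius
                if 0 <= idx < n:
                    acc += w * row[idx]
                    wsum += w
            new_row.append(acc / wsum)
        smoothed.append(new_row)
    return smoothed

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def log2_ratio_map(contacts_a, contacts_b, pseudocount=1.0, sigma=1.0, radius=2):
    """Bin-wise log2(a / b) followed by a separable 2-D Gaussian filter.

    The pseudocount (an assumption, not taken from the manuscript) avoids
    division by zero in empty bins; with pseudocount=0 every bin of
    contacts_b must be non-zero.
    """
    ratio = [[math.log2((a + pseudocount) / (b + pseudocount))
              for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(contacts_a, contacts_b)]
    kernel = gaussian_kernel(sigma, radius)
    rows_done = smooth_rows(ratio, kernel)
    return transpose(smooth_rows(transpose(rows_done), kernel))
```

      In this sketch, a map with twice as many contacts in every bin yields a value of 1 everywhere, and two identical maps yield 0, so the colour scale of such a plot can be read directly as a fold change in powers of two.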

      2. Genotypes and strain names are often inconsistent. Sometimes ΔmukB, ΔmatP, ΔmatS is used, other times it is just mukB, matP, matS; There are various permutations of LiTer, Liter, liter etc.

      These inconsistencies have been corrected.

      3. The time notation is unconventional. I recommend using 0 min, 40 min, 120 min etc. instead of T0, T40mn, T2H.

      As requested, we have standardized and used conventional annotations.

      4. A supplemental strain table listing detailed genotypes would be helpful.

      A strain table has been added, along with a second table recapitulating the positions of matS in the different strains.

      5. Fig. 1A: Move the IPTG labels to the top? It took me a while to spot them.

      We have moved the labels to the top of the figure and increased the font size to make them more visible.

      6. Fig 1C: Have these plots been contrast adjusted? If so, this should be indicated. The background looks very white and the transitions from diagonal to background look quite sharp.

      No, these matrices haven't been contrast-adjusted. They were created in MATLAB, then exported as TIFF files and directly incorporated into the figure. Nevertheless, we noticed that the color code of the matrix in Figure 3 was different and subsequently adjusted it to achieve uniformity across all matrices.

      7. Fig 1C: What is the region around 3 Mb and 4 Mb? It looks like the contacts there are somewhat MukBEF-independent.

      The referee is right. In the presence of the plasmid pPSV38 (carrying the MukBEF operon or not), we repeatedly observed an increase in long-range contacts around 3 Mb. The origin of these contacts is unknown.

      8. Fig 1D: Have the log ratios been clipped at -1 and 1 or was some smoothing filter applied? I would expect the division of small and noisy numbers in the background region to produce many extreme values. This does not appear to be the case.

      The referee is right, dividing two matrices generates a ratio with extreme values. To avoid this, the Log2 of the ratio is plotted with a Gaussian filter, as described before (Lioy et al., 2018).

      9. Fig 1E: I recommend including a wild-type reference trace as a point of reference.

      We have added the WT profile to the figure.

      10. Fig 2: I feel the side-by-side cartoon from Supplemental Fig. 2A could be included in the main figure to make things easier to grasp.

      We added a schematic representation of the chromosome configuration on top of the matrices to aid understanding.

      11. Fig. 2C: One could put both plots on the same y-axis scale to make them comparable.

      We have modified the axes as required.

      12. Fig. 3C: The LiTer4 ratio plot has two blue bands in the 3-4.5 Mb region. I was wondering what they might be. These long-range contacts seem to be transposition-dependent and suppressed by MatP, is that correct?

      The referee is right. This indicates that in the absence of MatP, one part of the Ter was able to interact with a distal region of the chromosome, albeit with a low frequency. The origin is not yet known.

      13. Fig. 3E: It is hard to understand what is a strain label and what is the analyzed region of interest. The plot heading and figure legend say Ter2 (but then, there are different Ter2 variants), some labels say Ter, others say Ter2, sometimes it doesn't say anything, some labels say ΔmatS or ΔmatP, others say matS or matP, and so on.

      We have unified our notation and added more description to the legend to clarify this figure:

      “Ter” corresponds to the range of contacts over the entire Ter region, in the WT strain (WT Ter) or in the ΔmatP strain (ΔmatP Ter). The column WT matSX-Y corresponds to the range of contacts between the designated matS sites in the WT configuration. This portion of the Ter can be compared with the same Ter segment in the transposed strain (Ter2). Additionally, the matS20-28 segment corresponds to Ter2 in LiTer9, just as matS22-28 corresponds to Ter2 in LiTer7, and matS25-28 to Ter2 in LiTer4. The range of contacts of this segment was also measured in a ΔmatP or ΔmatS background.”

      14. Fig. 4 and p.9: "Normalized ChIP-seq experiments were performed by normalizing the quantity of immuno-precipitated fragments to the input of MukB-Flag and then divide by the normalized ChIP signals at t0 to measure the enrichment trigger by replication."

      This statement and the ChIP plots in Fig. 4A are somewhat puzzling. If the data were divided by the ChIP signal at t0, as stated in the text, then I would expect the first plot (t0) to be a flat line at value 1. This is not the case. I assume that normalized ChIP is shown without the division by t0, as stated in the figure legend.

      The referee is right. This sentence has been corrected, and as described in the Methods section, Figure 4 shows the ChIP normalized by the input.

      If that's true and the numbers were obtained by dividing read-count adjusted immunoprecipitate by read-count adjusted input, then I would expect an average value of 1. This is also not the case. Why are the numbers so low? I think this needs some more details on how the data was prepared.

      The referee is right; we thank him for this remark. Our data are processed using the following method: the value of each read is divided by the total number of reads. A sliding window of 50 kb is applied to these normalized values to smooth the data. Then, the resulting signal from the ChIP is divided by the resulting signal from the input. This is what is shown in Figure 4. Unfortunately, for some of our results, the sliding window was not correctly applied to the input data. This did not alter the ChIP profile but did affect the absolute values. We have resolved this issue and corrected the figure.
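
      As a minimal sketch of the processing described above (not the authors' actual pipeline), the same depth normalization and the same smoothing window can be applied to both the ChIP and the input before dividing. The sketch assumes reads have already been counted per fixed-size bin, that the 50-kb window is expressed as a number of bins, and that the chromosome is treated as circular; the function names are hypothetical.

```python
def depth_normalize(counts):
    """Divide each binned read count by the total number of reads."""
    total = sum(counts)
    return [c / total for c in counts]

def sliding_mean(values, window):
    """Moving average over `window` bins, wrapping around a circular chromosome."""
    n = len(values)
    half = window // 2
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k in range(-half, -half + window):
            acc += values[(i + k) % n]
        smoothed.append(acc / window)
    return smoothed

def chip_over_input(chip_counts, input_counts, window):
    """Smoothed, depth-normalized ChIP signal divided by the equally processed input.

    Every input bin is assumed to be covered, otherwise the division fails.
    """
    chip = sliding_mean(depth_normalize(chip_counts), window)
    inp = sliding_mean(depth_normalize(input_counts), window)
    return [c / i for c, i in zip(chip, inp)]
```

      When both tracks go through identical steps, a ChIP sample with no enrichment relative to the input gives a flat profile at 1, which is the behaviour the reviewer expected to see.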

      Another potential issue is that it's not clear what the background signal is and whether it is evenly distributed. The effect size is rather small. Negative controls (untagged MukB for each timepoint) would help to estimate the background distribution, and calibrator DNA could be used to estimate the signal-to-background ratio. There is the danger that the apparent enrichment of replicated DNA is due to increased "stickiness" rather than increased MukBEF binding. If any controls are available, I would strongly suggest to show them.

      To address this remark, a ChIP experiment with a non-tagged strain under comparable synchronization conditions has been performed. The results are presented as Supplementary Figure 9; they reveal that the enrichment shown in Figure 4 is not attributable to nonspecific antibody binding or 'stickiness'.

      15. Fig. 4A, B: The y-axes on the right are unlabeled and the figure legends mention immunoblot analysis, which is not shown.

      We labeled the y-axes as 'anti-Flag ChIP/input' and made corrections to the figure legend.

      16. Fig. 4B: This figure shows a dip in enrichment at the Ter2 region of LiTer7, which supports the authors' case. Having a side-by-side comparison with WT at 60 min would be good, as this time point is not shown in Fig. 4A.

      Cell synchronization can be somewhat challenging, and we have observed that the timing of replication restart can vary depending on the genetic background of the cells. This delay is evident in the case of LiTer7. To address this, we compared LiTer7 after 60 minutes to the wild type strain (WT) after 40 minutes of replication. Even though the duration of replication is 20 minutes longer in LiTer7, the replication profiles of these two strains under these two different conditions (40 minutes and 60 minutes) are comparable and provide a better representation of similar replication progression.

      17. Fig. 4C: Highlighting the position of the replication origin would help to interpret the data.

      We have highlighted the oriC position with a red dashed line.

      18. Fig. 4C: One could include a range-of-contact plot that compares the three conditions (similar to Fig. 1E).

      We have added this quantification to Supplemental Figure 8.

      19. Supplemental Fig. 2A: In the LiTer15 cartoon, the flanking attachment sites do not line up. Is this correct? I would also recommend indicating the direction of the Ter1 and Ter2 regions before and after recombination.

      In this configuration, attB and attR, as well as attL and attB', should be aligned but the remaining attR attL may not. We have corrected this misalignment. To clarify the question of sequence orientation, we have included in the figure legend that all transposed sequences maintain their original orientation.

      20. Supplemental Fig. 3: One could show where the deleted matS sites are.

      We added red asterisks to the ChIP representation to highlight the positions of the missing matS.

      21. Supplemental Fig. 3B: The plot legend is inconsistent with panel A (What is "WT2")?

      We have corrected it.

      22. Supplemental Fig. 3C: The E-value notation is unusual. Is this 8.9 x 10^-61?

      The value is 8.9 x 10^-61; we modified the annotation.

      23. Abstract: "While different features for the activity of the bacterial canonical SMC complex, SmcScpAB, have been described in different bacteria, not much is known about the way chromosomes in enterobacteria interact with their SMC complex, MukBEF."

      Could this be more specific? What features are addressed in this manuscript that have been described for Smc-ScpAB but not MukBEF? Alternatively, one could summarize what MukBEF does to capture the interest of readers unfamiliar with the topic.

      We modified these first sentences.

      24. p.5 "was cloned onto a medium-copy number plasmid under control of a lacI promoter" Is "lacI promoter" correct? My understanding is that the promoter of the lacI gene is constitutive, whereas the promoter of the downstream lac operon is regulated by LacI. I would recommend providing an annotated plasmid sequence in supplemental material to make things clearer.

      We modified it and replaced “lacI promoter” with the correct annotation, pLac.

      25. p. 5 heading "MukBEF activity does not initiate at a single locus" and p. 6 "Altogether, the results indicate that the increase in contact does not originate from a specific position on the chromosome but rather appears from numerous sites". Although this conclusion is supported by the follow-up experiments, I felt it is perhaps a bit too strong at this point in the text. Perhaps MukBEF loads slowly at a single site, but then moves away quickly? Would that not also lead to a flat increase in the contact plots? One could consider softening these statements (at least in the section header), and then be more confident later on.

      We used 'indicate' and 'suggesting' at the end of this results section, and we feel that we have not overreached in our conclusions at this point. While it's true that we can consider other hypotheses, we believe that, at this stage, our suggestion that MukBEF is loaded over the entire chromosome is the simplest and most likely explanation.

      26. p.7: "[these results] also reveal that MukBEF does not translocate from the Ori region to the terminus of the chromosome as observed with Smc-ScpAB in different bacteria."

      This isn't strictly true for single molecules, is it? Some molecules might translocate from Ori to Ter. Perhaps clarify that this is about the bulk flux of MukBEF?

      At this point, our conclusion that MukBEF does not travel from the ori to Ter is global and refers to the results described in this section. However, the referee is correct in pointing out that we cannot exclude the possibility that in a WT configuration (without a Ter in the middle of the right replicore), a specific MukBEF complex can be loaded near Ori and travel all along the chromosome until the Ter. To clarify our statement, we have revised it to 'reveal that MukBEF does not globally translocate from the Ori region to the terminus of the chromosome.' This change is intended to highlight the fact that we are drawing a general conclusion about the behavior of MukBEF and to facilitate its comparison with Smc-ScpAB in B. subtilis.

      27. p. 10: The section title "Long-range contacts correlate with MukBEF binding" and the concluding sentence "Altogether, these results indicate that MukBEF promotes long-range DNA contacts independently of the replication process even though it binds preferentially in newly replicated regions" seem to contradict each other. I would rephrase the title as "MukBEF promotes long-range contacts in the absence of replication" or similar.

      We agree with this suggestion and have used the proposed title.

      28. p. 13: I recommend reserving the name "condensin" for the eukaryotic condensin complex and using "MukBEF" throughout.

      We used MukBEF throughout.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary: 

      Beyond what is stated in the title of this paper, not much needs to be summarized. eIF2A in HeLa cells promotes translation initiation of neither the main ORFs nor short uORFs under any of the conditions tested. 

      Strengths: 

      Very comprehensive, in fact, given the huge amount of purely negative data, an admirably comprehensive and well-executed analysis of the factor of interest. 

      Weaknesses: 

      The study is limited to the HeLa cell line, focusing primarily on KO of eIF2A and neglecting the opposite scenario, higher eIF2A expression which could potentially result in an increase in non-canonical initiation events. 

      We thank the reviewer for the positive evaluation. As suggested by the reviewer in the detailed recommendations, we will clarify in the title, abstract and text that our conclusions are limited to HeLa cells. Furthermore, as suggested we will test the effect of eIF2A overexpression on the luciferase reporter constructs, and will upload a revised manuscript.

      Reviewer #2 (Public review):

      Summary 

      Roiuk et al describe a work in which they have investigated the role of eIF2A in translation initiation in mammals without much success. Thus, the manuscript focuses on negative results. Further, the results, while original, are generally not novel, but confirmatory, since related claims have been made before independently in different systems, with the Gaikwad et al study recently published in eLife being the most relevant.

      Despite this, we find this work highly important. This is because of a massive wealth of unreliable information and speculations regarding the role of eIF2A in translation, arising from a series of artifacts that began at the moment of eIF2A discovery. This, in combination with its unfortunate naming (eIF2A is often mixed up with the alpha subunit of eIF2, eIF2S1), has generated widespread confusion among researchers who are not experts in eukaryotic translation initiation. Given this, it is not only justifiable but critical to make independent efforts to clear up this confusion and I very much appreciate the authors' efforts in this regard.

      Strengths 

      The experimental investigation described in this manuscript is thorough, appropriate and convincing. 

      Weaknesses 

      However, we are not entirely satisfied with the presentation of this work which we think should be improved. 

      We thank the reviewer for the positive evaluation. We will revise the manuscript according to the reviewer's suggestions made in the detailed recommendations.

      Reviewer #3 (Public review):

      Summary: 

      This is a valuable study providing solid evidence that the putative non-canonical initiation factor eIF2A has little or no role in the translation of any expressed mRNAs in cultured human (primarily HeLa) cells. Previous studies have implicated eIF2A in GTP-independent recruitment of initiator tRNA to the small (40S) ribosomal subunit, a function analogous to canonical initiation factor eIF2, and in supporting initiation on mRNAs that do not require scanning to select the AUG codon or that contain near-cognate start codons, especially upstream ORFs with non-AUG start codons, and may use the cognate elongator tRNA for initiation. Moreover, the detected functions for eIF2A were limited to, or enhanced by, stress conditions where canonical eIF2 is phosphorylated and inactivated, suggesting that eIF2A provides a back-up function for eIF2 in such stress conditions. CRISPR gene editing was used to construct two different knockout cell lines that were compared to the parental cell line in a large battery of assays for bulk or gene-specific translation in both unstressed conditions and when cells were treated with inhibitors that induce eIF2 phosphorylation. None of these assays identified any effects of eIF2A KO on translation in unstressed or stressed cells, indicating little or no role for eIF2A as a back-up to eIF2 and in translation initiation at near-cognate start codons, in these cultured cells. 

      The study is very thorough and generally well executed, examining bulk translation by puromycin labeling and polysome analysis and translational efficiencies of all expressed mRNAs by ribosome profiling, with extensive utilization of reporters equipped with the 5'UTRs of many different native transcripts to follow up on the limited number of genes whose transcripts showed significant differences in translational efficiencies (TEs) in the profiling experiments. They also looked for differences in translation of uORFs in the profiling data and examined reporters of uORF-containing mRNAs known to be translationally regulated by their uORFs in response to stress, going so far as to monitor peptide production from a uORF itself. The high precision and reproducibility of the replicate measurements instil strong confidence that the myriad of negative results they obtained reflects the lack of eIF2A function in these cells rather than data that would be too noisy to detect small effects of the eIF2A mutations. They also tested and found no evidence for a recent claim that eIF2A localizes to the cytoplasm in stress and exerts a global inhibition of translation. Given the numerous papers that have been published reporting functions of eIF2A in specific and general translational control, this study is important in providing abundant, high-quality data to the contrary, at least in these cultured cells.

      Strengths: 

      The paper employed two CRISPR knock-out cell lines and subjected them to a combination of high-quality ribosome profiling experiments, interrogating both main coding sequences and uORFs throughout the translatome, which was complemented by extensive reporter analysis, and cell imaging in cells both unstressed and subjected to conditions of eIF2 phosphorylation, all in an effort to test previous conclusions about eIF2A functioning as an alternative to eIF2. 

      Weaknesses: 

      There is some question about whether their induction of eIF2 phosphorylation using tunicamycin was extensive enough to state forcefully that eIF2A has little or no role in the translatome when eIF2 function is strongly impaired. Also, similar conclusions regarding the minimal role of eIF2A were reached previously for a different human cell line from a study that also enlisted ribosome profiling under conditions of extensive eIF2 phosphorylation; although that study lacked the extensive use of reporters to confirm or refute the identification by ribosome profiling of a small group of mRNAs regulated by eIF2A during stress. 

      We thank the reviewer for the positive evaluation. We will revise the manuscript according to the detailed recommendations. Regarding the two points mentioned here:

      (1) The reason eIF2alpha phosphorylation does not appear to increase appreciably is that, unfortunately, the antibody is very poor. That the Integrated Stress Response (ISR) is induced by our treatment can be seen, for instance, from the strong increase in ATF4 protein levels (in the very same samples where eIF2alpha phosphorylation does not increase much, in Suppl. Fig. 5E). We will strengthen the conclusion that the ISR is indeed activated with additional experiments/data as suggested by the reviewer.

      (2) We agree that our results are in line with results from the previous study mentioned by the reviewer, so we will revise the manuscript to mention this other study more extensively in the discussion.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) I suggest to state (already in the abstract, but perhaps also even in the title, definitely in the rest of the paper) that this analysis is limited to the HeLa cell line. 

      As suggested, we have now specified in both the title and the abstract that the work is done in HeLa cells.

      (2) In my view, it is a pity that the authors - given the tools are available - did not check the impact of high eIF2A levels on expression of individual mRNAs under normal and stress conditions. I am not suggesting to repeat ribo-seq in this setup, it would be too much to ask for, but re-examining some of the many reporters the authors generated with eIF2A overexpressed may point to some function, e.g. increased number of non-canonical initiation events (non-AUG-initiated)? If anything, the use of HeLa and the primary focus on eIF2A KO neglecting the prospective impact of eIF2A overexpression should be mentioned as two main limitations of this study. 

      We thank the reviewer for the good suggestion to test our synthetic reporters with eIF2A overexpression. New Suppl. Fig. 4G now shows that overexpression of eIF2A does not affect translation of synthetic reporters carrying an ATG start codon in different initiation contexts, or carrying near-cognate start codons, in agreement with a lack of effect on translation which we previously observed with loss of eIF2A.

      (3) Ribo-seq with eIF2A. Did the authors focus on ORFs that are known, or whose isoforms are known, to be non-AUG initiated? Would the loss of eIF2A decrease FPs in their CDSes under at least some conditions?

      We have now assessed the read distribution on the eIF4G2 transcript in both the control and tunicamycin conditions (Author response image 1). In our hands, eIF4G2 is one of the best examples of non-AUG initiation in human cells, since the main coding sequence starts with GTG and the CDS is well translated. Nonetheless, we do not observe any significant changes in read distribution (panels A-B) or overall translation efficiency of eIF4G2 upon eIF2A loss (panels C-D).

      Author response image 1.

      (A-B) Average read occupancy on the eIF4G2 (ENST00000339995) transcript in DMSO-treated (panel A, n=3) or tunicamycin-treated samples (panel B, n=2) derived from either control (black) or eIF2A-KO (red) HeLa cells. Read counts were normalized to sequencing depth and averaged between either 3 (DMSO-treated) or 2 (tunicamycin-treated) replicates. Graphs were then smoothed with a sliding window of 3 nt. (C-D) The total number of reads mapping to the eIF4G2 CDS, normalized to library sequencing depth per replicate, was quantified. No significant difference between control and eIF2A-KO cells was observed in either DMSO-treated (panel C) or tunicamycin-treated (panel D) cells. Significance by unpaired, two-sided t-test. ns = not significant.
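
      For readers who want to reproduce this kind of occupancy profile, the steps in the legend (depth normalization, replicate averaging, 3-nt smoothing) could be sketched as below. This is an assumed reconstruction, not the authors' script: the reads-per-million scaling, the edge handling of the window, and all function names are choices made for the example.

```python
def normalize_to_depth(counts, library_size, scale=1_000_000):
    """Express per-nucleotide read counts per `scale` mapped reads."""
    return [c * scale / library_size for c in counts]

def average_replicates(profiles):
    """Position-wise mean across replicate profiles of equal length."""
    return [sum(values) / len(values) for values in zip(*profiles)]

def smooth(profile, window=3):
    """Centered moving average; the window is truncated at transcript ends."""
    half = window // 2
    smoothed = []
    for i in range(len(profile)):
        chunk = profile[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def occupancy_profile(replicate_counts, library_sizes, window=3):
    """Depth-normalize each replicate, average them, then smooth."""
    normalized = [normalize_to_depth(counts, size)
                  for counts, size in zip(replicate_counts, library_sizes)]
    return smooth(average_replicates(normalized), window)
```

      With this ordering, a replicate sequenced twice as deeply contributes the same normalized signal as a shallower one before averaging, so no single library dominates the plotted trace.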

      Thank you for giving me the opportunity to review this article.

      Reviewer #2 (Recommendations for the authors):

      While some of our suggestions below may be considered subtle, in our opinion they are important and it would be good if the authors consider them for their revision, we also have a couple of technical suggestions. 

      (1) Abstract. 

      The authors failed to identify the role of eIF2A in translation initiation and have provided compelling evidence that eIF2A is not involved in recognition of non-AUG codons as start codons nor in recruitment of initiator tRNA during stress conditions which are two activities most commonly misattributed to eIF2A. However, they have not exhausted all possible potential functions of eIF2A, see below, it is also possible that eIF2A may have a role not yet suggested by anyone and it may function in translation initiation in special circumstances that have not been tested yet. The authors indeed discuss such possibility in the Discussion section. Given that there is genetic evidence (that is unaffected by biochemical impurities) linking eIF2A to other initiation factors (5B and 4E), we are not yet convinced that eIF2A does not have any role in translation initiation and therefore we find the last sentence of the abstract premature. We suggest to soften this statement into something like this: whether eIF2A has any role in translation remains unknown, it may even have a role in a different aspect of RNA Biology. 

      We agree with the reviewer. We changed the last sentence of the abstract to read as follows:

      “It is possible that eIF2A plays a role in translation regulation in specific conditions that we have not tested here, or that it plays a role in a different aspect of RNA biology.”

      (2) Recently eIF2A has been implicated in ribosomal frameshifting, see Wei et al 2023 DOI: 10.1016/j.celrep.2023.112987 

      Could authors look into PEG10 mRNA ribosome profile to see if there are detectable statistically significant changes in footprint density downstream of frameshift site between WT and eIF2A KOs? It is likely that the coverage will be insufficient to give a definitive answer, but it is worth checking, it would be a pity to miss it.

      We thank the reviewer for this suggestion. We have now looked at the distribution of ribosome footprints on the PEG10 transcript variant that is expressed in HeLa cells (ENST00000482108) and indeed observe coverage downstream of the annotated stop codon, consistent with a frameshifting event that results in an extended protein isoform being translated. Visual assessment of the read distribution between the main ORF and the "ORF extension" does not show a substantial difference between control and eIF2A knock-out cells (Author response image 2A-B). Additionally, we quantified the ratio of reads mapping to the PEG10 ORF upstream of the slippery site versus those mapping downstream, extending into the predicted longer protein. Nonetheless, we could not detect significant changes between control and eIF2A-KO cells in either tested condition (Author response image 2C-D).

      Author response image 2.

      (A-B) Average read occupancy on the PEG10 (ENST00000482108) transcript in DMSO-treated (panel A, n=3) or tunicamycin-treated samples (panel B, n=2) derived from either control (black) or eIF2A-KO (red) HeLa cells is shown. Read counts were normalized to sequencing depth and averaged between either 3 (DMSO-treated) or 2 (tunicamycin-treated) replicates. Graphs were then smoothed with a sliding window of 3 nt. (C-D) The ratio of reads mapping to the ORF upstream of the slippery site to reads mapping to the predicted extended protein downstream of the slippery site is shown. Read counts were normalized to the sequencing depth. Neither DMSO-treated samples (panel C) nor tunicamycin-treated samples (panel D) showed a significant difference between control and eIF2A-KO cells. Significance by unpaired, two-sided t-test. ns = not significant.
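
      The quantification in panels C-D can be illustrated with a short sketch. It is a hypothetical reconstruction under stated assumptions: coverage is given as a per-nucleotide list of read counts, the coordinates are 0-based half-open intervals, and the test statistic is the pooled-variance (Student) form, since the legend does not say whether equal variances were assumed.

```python
import math
from statistics import mean, variance

def upstream_downstream_ratio(coverage, orf_start, slippery_site, extension_end):
    """Reads in the ORF before the slippery site divided by reads after it."""
    upstream = sum(coverage[orf_start:slippery_site])
    downstream = sum(coverage[slippery_site:extension_end])
    return upstream / downstream

def student_t(sample_a, sample_b):
    """Pooled-variance t statistic and degrees of freedom for two unpaired samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled = ((n_a - 1) * variance(sample_a)
              + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, n_a + n_b - 2
```

      Because both numerator and denominator come from the same library, the ratio for a given sample does not change with sequencing depth. The two-sided p-value would then be read from a t distribution with the returned degrees of freedom, for example with a statistics package; with only two or three replicates per group, such a test has limited power to detect small shifts.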

      (3) Introduction 

      Given the volume of unreliable claims regarding eIF2A in the literature and the overall confusion it is very difficult (may even be impossible) to write a clear coherent introduction into the topic. Nonetheless, there are a few points that need to be taken into account.

      The authors state that eIF2A is capable of recruiting initiator tRNA, citing Zoll et al 2002. This activity was later shown to be a biochemical artefact (which was most likely reproduced by Kim et al 2018): the eIF2A fraction was contaminated with eIF2D, which does bind tRNAs in a GTP-independent manner. eIF2A purified from RRL separates from initiator tRNA binding activity, see Dmitriev et al 2010 DOI: 10.1074/jbc.M110.119693. This point is also relevant to the second paragraph of Discussion, it should be acknowledged that it has been shown previously that eIF2A does not bind the initiator tRNA.

      We appreciate the advice provided by the reviewer. We have modified both the introduction and the 2nd paragraph of the discussion to reflect that the tRNA-binding activity is due to contaminating eIF2D rather than eIF2A.

      In many cases the authors describe certain claims as facts even though they refute them themselves. For example 

      "Such eIF2A-driven non-AUG initiation events were shown to play a crucial role in different aspects of cell physiology and disease progression: cellular adaptation during the integrated stress response (Chen et al., 2019; Starck et al., 2016)"  While non-AUG initiation events do play crucial roles in different aspects of cell physiology (reviewed in Andreev et al 2023 doi: 10.1186/s13059-022-02674-2), eIF2A has nothing to do with it, as the authors show themselves. Therefore different language should be used, e.g., "eIF2A has been suggested (or proposed or reported) to be responsible for non-AUG initiation events that were shown to play ..."

      The word "shown" is used in many other instances for the claims that the authors refute. "Shown" is only appropriate for strong evidence that leaves little doubt. 

      We agree with the reviewer and made the suggested changes in the text.

      (4) Supplementary Fig. 1. 

      Panel C is used to argue that eIF2A has a higher concentration in the cytoplasm than in the nucleus; perhaps it is worth explaining how this conclusion was drawn. If levels in the cytoplasm are comparable to GAPDH and Tubulin but lower than c-Myc in the nucleus, does it really mean that there is less eIF2A in the nucleus than in the cytoplasm? This is not obvious to us. Also, presumably WCL stands for Whole Cell Lysate; it would be nice to introduce this abbreviation somewhere. 

      To compare levels of eIF2A in the nuclear and cytosolic fractions, we lysed the two fractions in equal volumes of buffer (i.e. the cytosolic fraction was extracted in 200 µl of hypotonic buffer, and the nuclear fraction was extracted in 200 µl of cell extraction buffer). This assures that per microliter of lysate we have the same number of "cytosols" or nuclei. Hence, equal intensity bands in the cytosolic and nuclear fractions would mean that half of the protein is in the nucleus and half is in the cytosol. We originally described this in the Methods section, but now also mention it in the Results and in the figure legend.

      We replaced WCL with "whole cell" in the figure. 
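
      The equal-volume argument above reduces to simple arithmetic on band intensities. The sketch below is only an illustration of that reasoning; the function name is hypothetical, and it assumes band intensity is proportional to protein amount and that both fractions were extracted in the same volume.

```python
def nuclear_fraction(cytosolic_intensity, nuclear_intensity):
    """Fraction of the protein that is nuclear, given band intensities from
    cytosolic and nuclear lysates prepared in equal volumes (assumption)."""
    total = cytosolic_intensity + nuclear_intensity
    if total <= 0:
        raise ValueError("band intensities must sum to a positive value")
    return nuclear_intensity / total
```

      With equal bands the function returns 0.5, matching the statement that half of the protein would then be nuclear and half cytosolic.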

      (5) The differential translation analysis is described very briefly "To obtain values of translation efficiency, log2 fold changes, and adjusted p values the DESeq2 software package was used". Was TE calculated based on ribosome footprint to RNA-seq ratios? How exactly was DESeq2 used here? TE measured in this way spuriously correlates with RNA-seq values, see Larsson et al 2010 DOI: 10.1073/pnas.1006821107; perhaps it would be worth assessing differential translation with anota2seq (Oertlin et al 2019 doi: 10.1093/nar/gkz223)? Anota2seq avoids calculating the ratios and enables comprehensive analysis of differential translation, including detection of buffered translation, which might be the case here, while avoiding artefacts that may arise from varying RNA levels.  

      We have now specified in more detail in the Methods section how we analyzed the data. Indeed, DESeq2 was used on translation efficiency values, which we calculated as the ratio of ribosome footprints to RNA-seq reads. 

      As suggested, we have now also performed the analysis using anota2seq (Suppl. Fig. 3C) and this analysis identified zero transcripts that are translationally regulated, in agreement with our analysis.
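
      For readers unfamiliar with the quantity under discussion, translation efficiency as a footprint-to-RNA-seq ratio can be sketched as below. This is not the DESeq2 or anota2seq procedure, both of which are R packages with their own statistical models; the function names, the counts-per-million scaling, and the pseudocount are assumptions made for illustration.

```python
import numpy as np

def log2_te(ribo_counts, rna_counts, ribo_depth, rna_depth, pseudocount=0.5):
    """log2 of (footprint CPM / RNA-seq CPM) per transcript.

    The pseudocount (assumption) avoids division by zero for unexpressed genes.
    """
    ribo_cpm = (np.asarray(ribo_counts, dtype=float) + pseudocount) * 1e6 / ribo_depth
    rna_cpm = (np.asarray(rna_counts, dtype=float) + pseudocount) * 1e6 / rna_depth
    return np.log2(ribo_cpm / rna_cpm)

def log2_fold_change_te(te_knockout, te_control):
    """Difference of log2 TE values, i.e. log2 of the TE ratio KO/control."""
    return np.asarray(te_knockout, dtype=float) - np.asarray(te_control, dtype=float)
```

      Because the RNA-seq signal sits in the denominator, noise in RNA levels propagates into TE, which is the concern about ratio-based approaches raised in the comment above.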

      (6) Section "eIF2a-inactivating stresses do not redirect tRNA delivery function to eIF2A." 

      The description of the ISR mechanism is a bit inaccurate. Strictly speaking, eIF2alpha phosphorylation does not inactivate eIF2alpha. It results in formation of a very stable eIF2*GDP*eIF2B complex, thus severely depleting eIF2B, which serves as a GEF for eIF2. This in turn reduces the ternary complex (eIF2*GTP*tRNAi) concentration since there is no free eIF2B to exchange GDP for GTP. Without getting into much detail, we think it would be more accurate to say that eIF2alpha phosphorylation leads to ternary complex depletion instead of saying that stress inactivates eIF2alpha. 

      We agree with the reviewer - we were trying to use simple, compact wording. We have now reworded the section title to "No detectable role for eIF2A in translation when eIF2 is inhibited" and rephrased the subsequent text to be correct.

      Also the subtitle uses eIF2a with small a that stands for alpha which potentially could lead to substantial confusion since in this case the difference between eIF2alpha and eIF2A is only in capitalisation of the last letter, many text-mining engines such as modern LLMs may not be able to pick the differences. Perhaps it would be better to refer to eIF2alpha by the HGNC approved name of its gene - eIF2S1 to avoid further confusions. For clarity it may be stated at the beginning that eIF2S1 is commonly known as eIF2alpha. 

      We thank the reviewer for this point. We have removed all instances of eIF2a (with lowercase a) from the manuscript to avoid this source of confusion. At the first mention of eIF2α we also added the official HGNC gene name. However, we prefer to use eIF2α instead of eIF2S1 because people outside the translation field tend to know the subunit as eIF2α, and we think it is important that people outside the translation field also read this manuscript, since some of the questionable papers on eIF2A come from labs working at the interface between translation and other fields.

      Minor 

      Introduction 

      (7) "uses the CAT anticodon" change CAT to CAU 

      We corrected CAT to CAU.

      (8) "In the canonical initiation pathway", change "canonical" to "most common", canonical is somewhat a judgemental statement that originates in theology. Same applies to numerous occurrences of "canonical AUG", simply using "AUG" would be simpler and more accurate as you will avoid giving impression that there are "non-canonical AUGs".  

      Done.

      (9) "eIF2A was initially considered to be a functional analogue of prokaryotic IF2 (Merrick and Anderson, 1975), however later this role was reassigned to the above-mentioned heterotrimeric factor eIF2 (a,b,g) (Levin et al., 1973)." - there is a chronological contradiction within this sentence, the initial consideration is attributed to 1975 while its later reassignment to 1973. 

      We are grateful to the reviewer for spotting this mistake. There was a citation problem; we fixed it and now cite the correct paper for the initial discovery of eIF2A (Shafritz et al 1970, PMID 5472357).

      (10) "On the other hand, studies on the role of eIF2A on viral IRES translation have arrived at conflicting results." Remove "On the other hand" since conflicting results have been mentioned above. In fact the entire sentence is somewhat redundant given prior "For example, eIF2A has been studied in the context of internal ribosome entry sites (IRES), where it was found to act both as a suppressor and an activator of IRES-mediated initiation."  

      We have rewritten the paragraph to make it more coherent.

      (11) Fig. 1C-D uses the abbreviation CHX for cycloheximide; this needs to be mentioned in the legend or elsewhere in the text. Otherwise CHX may not be clear to a reader uninitiated in ribosome profiling. 

      We now mention in the figure legend that CHX stands for cycloheximide and indicate that it was used as a negative control to block translation. 

      (12) Page 7, section "Ribosome profiling reveals a few eIF2A-dependent transcripts" 

      In this section you describe ribosome profiling experiments and identify a few transcripts whose translation seems to be changing based on ribosome profiling data. Then you attempt to verify them using gene expression reporters and reasonably suggest that these are false positives. In essence this section argues that there are no eIF2A-dependent transcripts; therefore the title of this subsection is misleading, and it makes sense to rename it so that it better reflects the content of this section. 

      We agree and have renamed the section to "Ribosome profiling identifies no eIF2A-dependent transcripts".

      (13) Page 8, top. Rephrase "To do this, we performed ribosome profiling on control and eIF2A-KO cells, which sequences the mRNA footprints protected by ribosomes."  

      Fixed.

      (14) Page 10, bottom. "Several studies have reported that eIF2A can delivery alternative initiator tRNAs to uORFs with near-cognate start codons". Change "delivery" to "deliver". 

      Thanks for spotting it. We corrected it to “deliver”.

      (15) Page 13 "This suggests that, as in non-stressed conditions, eIF2A has a minimal effect on global translation also when eIF2a activity is low." - rephrase to avoid impression that eIF2alpha activity is low in normal conditions, also please see comment #6 above. 

      We fixed this sentence to read: “This suggests that, as in non-stressed conditions, eIF2A has a minimal effect on global translation also when the integrated stress response is active.”

      Reviewer #3 (Recommendations for the authors):

      - The experimental data in Fig. S5E do not support the claim of increased eIF2 phosphorylation on TM treatment; although, comparing Fig. S5A with Fig. 1B supports a marked reduction in bulk translation and the reporter data in Fig. 4A show the expected induction of the uORF-containing reporters by TM. Because these are the conditions employed for ribosome profiling in stress conditions shown in Fig. 4B, it would be reassuring to document TM-induced translational efficiencies of ATF4 and the other known mRNAs resistant to eIF2 phosphorylation in the ribosome profiling data, including gene browser images of the replicate experiments. If the induction of TEs by TM for such mRNAs was not robust, it would be valuable to repeat the analysis using arsenite (SA) treatment, which produces a greater inhibition of bulk translation. 

      Unfortunately, the eIF2alpha antibody is not very good and also detects the nonphosphorylated protein, causing high background and poor apparent induction in response to tunicamycin. That the ISR was activated is visible from the induction of ATF4, assessed by western blot in Suppl. Fig. 5E. To ensure that our ribosome profiling libraries also recorded the activation of the ISR, we built single-gene plots for ATF4 in both control and eIF2A-KO HeLa cells. As shown in Author response image 3A and B, tunicamycin treatment led to the induction of ATF4 in both cell lines. This can also be seen from the 4-fold induction in ATF4 translation efficiency in response to tunicamycin in both WT and eIF2A-KO cells (Author response image 3C). Additionally, we checked that another marker induced by tunicamycin, HSPA5, is also translationally upregulated in both cell lines, as is the downstream target of ATF4, PPP1R15B (Author response image 3C). 

      Author response image 3.

      (A-B) Average read occupancy on the ATF4 (ENST00000674920) transcript in DMSO-treated (n=3) or tunicamycin-treated (n=2) samples derived from either control (panel A) or eIF2A-KO (panel B) HeLa cells is shown. Read counts were normalized to sequencing depth and averaged across the 3 (DMSO-treated) or 2 (tunicamycin-treated) replicates. Graphs were then smoothed with a sliding window of 3 nt. (C) Scatter plot of log2(fold change) of translation efficiency (TM/DMSO) for control cells on the x-axis versus eIF2A-KO cells on the y-axis. The induction of ATF4 and of its downstream target PPP1R15B is shown, as is the upregulation of HSPA5 translation, the other hallmark of ER stress induced by tunicamycin treatment.

      - It should be pointed out in the text that in both published studies being cited here of cells lacking eIF2A, that by Gaikwad et al. on a yeast eIF2A deletion mutant, and that by Ichihara et al. on human HEK293 CRISPR KO cells, the analyses included stress conditions in which eIF2 phosphorylation is induced (amino acid starvation or SA treatment, respectively), as was conducted here.  

      Good point - we added this information into the introduction: 

      "Furthermore, loss of eIF2A in several systems did not recapitulate these effects on non-AUG initiation in either non-stressed or stress conditions (caused either by amino acid depletion or sodium arsenate treatment) (Gaikwad et al., 2024; Ichihara et al., 2021)."

      - The Ichihara et al. (2021) study just mentioned reached some of the same conclusions for HEK cells obtained here by conducting ribosome profiling in untreated and SA-treated cells, finding only 1 mRNA (untreated) or four mRNAs (SA-treated cells) that showed significantly reduced TEs in the eIF2A knockout vs. parental cells. It seems appropriate for the authors to expand their treatment of this prior work by summarizing its findings in some detail and also noting how their study goes beyond this previous one. 

      We have added a paragraph to the discussion pointing out that our data agree fully with Ichihara et al. (2021), and that Ichihara et al. (2021) also found only very few mRNAs that change in TE upon loss of eIF2A in either non-stressed or stressed conditions.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1

      Summary:

      In this paper, the authors performed molecular dynamics (MD) simulations to investigate the molecular basis of the association of alpha-synuclein chains under molecular crowding and salt conditions. Aggregation of alpha-synuclein is linked to the pathogenesis of Parkinson's disease, and the liquid-liquid phase separation (LLPS) is considered to play an important role in the nucleation step of the alpha-synuclein aggregation. This paper re-tuned the Martini3 coarse-grained force field parameters, which allows long-timescale MD simulations of intrinsically disordered proteins with explicit solvent under diverse environmental perturbation. Their MD simulations showed that alpha-synuclein does not have a high LLPS-forming propensity, but the molecular crowding and salt addition tend to enhance the tendency of droplet formation and therefore modulate the alpha-synuclein aggregation. The MD simulation results also revealed important intra- and inter-molecule conformational features of the alpha-synuclein chains in the formed droplets and the key interactions responsible for the stability of the droplets. These MD simulation data add biophysical insights into the molecular mechanism underlying the association of alpha-synuclein chains, which is important for understanding the pathogenesis of Parkinson's disease.

      Strengths:

      (1) The re-parameterized Martini 3 coarse-grained force field enables the large-scale MD simulations of the intrinsically disordered proteins with explicit solvent, which will be useful for a more realistic description of the molecular basis of LLPS.

      (2) This paper showed that molecular crowding and salt contribute to the modulation of the LLPS through different means. The molecular crowding minimally affects surface tension, but adding salt increases surface tension. It is also interesting to show that the aggregation pathway involves the disruption of the intra-chain interactions arising from C-terminal regions, which potentially facilitates the formation of inter-chain interactions.

      We thank the reviewer for pointing out the strengths of our study.

      Weaknesses:

      (1) Although the authors emphasized the advantage of the Martini3 force field for its explicit description of solvent, the whole paper did not discuss the water's role in the aggregation and LLPS.

      We thank the reviewer for pointing this out. We agree that we have not explored or discussed the role of water in αS aggregation or LLPS; we intend to explore it in detail in a separate study. However, we have updated the “Discussion” section with the following lines to convey to readers the importance of water in the aggregation and LLPS of αS.

      Page 24: “The significance of the solvent in alpha-synuclein (αS) aggregation remains underexplored. Recent studies [26, 55] underscore the pivotal role of water as a solvent in LLPS. It suggests that comprehending the solvent’s role, particularly water, is essential for attaining a deeper grasp of the thermodynamic and physical aspects of αS LLPS and aggregation. By delving into the solvent’s contribution, researchers can uncover additional factors influencing αS aggregation. Such insights hold the potential to advance our comprehension of protein aggregation phenomena, crucial for devising strategies to address diseases linked to protein misfolding and aggregation, notably Parkinson’s disease. Future investigations focusing on elucidating the interplay between αS, solvent (especially water), and other environmental elements could yield valuable insights into the mechanisms underlying LLPS and aggregation. Ultimately, this could aid in the development of therapeutic interventions or preventive measures for Parkinson’s and related diseases.”

      (2) This paper discussed the effects of crowders and salt on the surface tension of the droplets.

      The calculation of the surface tension relies on the droplet shape. However, for the formed clusters in the MD simulations, the typical size is <10, which may be too small to rigorously define the droplet shape. As shown in previous work cited by this paper [Benayad et al., J. Chem. Theory Comput. 2021, 17, 525−537], the calculated surface tension becomes stable when the chain number is larger than 100.

      We appreciate the insightful feedback from the reviewer. However, we would like to emphasize that the αS droplets exhibit a highly liquid-like behavior, characterized by frequent exchanges of chains between the dense and dilute phases, alongside a slow aggregation process. In the study by Benayad et al. (2020, JCTC) [ref. 30], FUS-LCD was the protein of choice, at concentrations in the mM range. FUS-LCD is known to undergo very rapid LLPS at concentrations lower than 100 μM, whereas for αS the critical concentration for LLPS is 500 μM and aggregation is slower than for FUS. Moreover, the diffusion constant of αS inside newly formed droplets (before any liquid-to-solid phase transition has occurred) has been estimated to be 0.23-0.58 μm²/s (Ray et al, 2020, Nat. Comm.), while the diffusion constant of FUS-LCD inside LLPS droplets has been estimated to be 0.17 μm²/s (Murthy et al. 2023, Nat. Struct. and Mol. Biol.). These values indicate that αS forms droplets that are less viscous than those formed by FUS-LCD. This dynamic nature impedes the formation of large droplets in the simulations, making it challenging to rigorously calculate surface tension from the interfacial width, which in turn requires computing g(r) between water and the droplet.

      Furthermore, it's essential to note that our primary aim in calculating surface tension was not to determine its absolute value. Rather, we aimed to compare surface tensions obtained for the three distinct environments explored in this study. Hence, our primary objective is to compare the distributions of surface tensions rather than focusing solely on the mean values obtained. The distributions shown in Figure 4a clearly show a trend which we have stated in the article.

      (3) In this work, the Martini 3 force field was modified by rescaling the LJ parameters \epsilon and \sigma with a common factor \lambda. It has not been very clearly described in the manuscript why these two different parameters can be rescaled by a common factor and why it is necessary to tune both parameters, instead of just tuning the coefficient \epsilon as was done in a previous work [Larsen et al., PLoS Comput Biol 16: e1007870].

      We thank the reviewer for the comment. We think that the distance of the first hydration layer should also have an impact on aggregation/LLPS. Here we scale both epsilon and sigma. A higher epsilon for water-protein interactions means that more energy is required to remove water molecules (dehydration) when a chain goes from the dilute to the dense phase. A higher sigma, on the other hand, means that the hydration shell will sit at a larger distance, making dehydration easier. Moreover, tuning both (whether by the same or different factors) required a change in the overall protein-water interaction of only 1%, and hence only a minimal change in force field parameters (compared to the case where only epsilon is tuned, which required a 6-10% change in epsilon from its original values). Thus we think that optimizing both epsilon and sigma is one way of tuning water-protein interactions that requires minimal retuning of Martini 3. Whether a single scaling parameter is good enough requires further exploration and is outside the scope of the current study. More importantly, scaling epsilon and sigma separately would introduce another free parameter into the system, and the fewer the free parameters, the better. For this study, a single parameter sufficed, as depicted in Figure 9. To inform the readers of why we chose to scale both sigma and epsilon, we have added the following in the main text:

      Page 25-26: “Increasing the ϵ value of water-protein interactions results in a higher energy demand for removing water molecules (dehydration) as a chain transitions from the dilute to the dense phase. Conversely, a higher σ value implies that the hydration shell will be at a greater distance, facilitating dehydration if a chain moves into the dilute phase. Therefore, adjusting water-protein interactions based on the protein’s single-chain behavior may not significantly influence the protein’s phase behavior. Furthermore, fine-tuning both ϵ and σ parameters only requires a minimal change in the overall protein-water interaction (1%). As a result, this adjustment minimally alters the force field parameters.”
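
      The effect of rescaling both Lennard-Jones parameters by a common factor can be illustrated with the standard 12-6 potential. This is a sketch only: the numerical values of ϵ and σ below are placeholders rather than Martini 3 parameters, and λ = 1.01 is used solely to mirror the 1% change mentioned in the response.

```python
def lennard_jones(r, epsilon, sigma):
    """Standard 12-6 Lennard-Jones energy at separation r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

def scale_lj(epsilon, sigma, lam):
    """Rescale both LJ parameters by one common factor lam."""
    return lam * epsilon, lam * sigma

def lj_minimum(epsilon, sigma):
    """Position and depth of the LJ minimum: r_min = 2^(1/6) * sigma, U = -epsilon."""
    return 2.0 ** (1.0 / 6.0) * sigma, -epsilon

# Placeholder water-protein parameters (hypothetical units and values).
eps0, sig0 = 2.0, 0.5
eps1, sig1 = scale_lj(eps0, sig0, lam=1.01)
r_min0, u_min0 = lj_minimum(eps0, sig0)
r_min1, u_min1 = lj_minimum(eps1, sig1)
```

      With λ > 1 the well becomes deeper (stronger water-protein attraction) and its minimum shifts outward (hydration shell at a larger distance), which are the two opposing effects described in the quoted passage.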

      (4) Both the sizes and volume fractions of the crowders can affect the protein association. It will be interesting to perform MD simulations by adding crowders with various sizes and volume fractions. In addition, in this work, the crowders were modelled by fullerenes, which contribute to protein aggregation mainly by entropic means as discussed in the manuscript. It is not very clear how the crowder effect is sensitive to the chemical nature of the crowders (e.g., inert crowders with excluded volume effect or crowders with non-specific attractive interactions with proteins, etc) and therefore the force field parameters.

      We thank the reviewer for suggesting a potential future direction. In this investigation our focus was to simulate only inert crowders, to ensure that only the entropic effect of the crowders is explored. Although this study focuses on the factors that enable αS to form aggregates or undergo LLPS under different environmental conditions, it would be interesting to explore systematically the mechanism of action of crowders of varying shapes, sizes and interactions. We have therefore added the following lines to the “Discussion” section to let readers know that this is a prospect for future investigation.

      Page 22: “Under physiological conditions, crowding effects emerge prominently. While crowders are commonly perceived to be inert, as has been considered in this investigation, the morphology, dimensions, and chemical interactions of crowding agents with αS in both dilute and dense phases may potentially exert considerable influence on its LLPS. Hence, a comprehensive understanding through systematic exploration is another avenue that warrants extensive investigation.”

      Reviewer #1 (Recommendations For The Authors):

      (1) Figure S1. The title of the figure and the description in the figure caption are inconsistent?

      We thank the reviewer for the comment and we have updated the article with the correct caption.

      (2) Page 14, line 3, the authors may want to provide more descriptions of the "ms1", "ms2", and "ms3" for better understanding.

      We are grateful to the reviewer for pointing this out. We have added a line describing in brief what “ms1”, “ms2” and “ms3” represent. It reads “Subsequent to the investigation, we utilize three representative conformations, each corresponding to one of the macrostates. We designate these macrostates as 1 (ms1), 2 (ms2), and 3 (ms3) (Figure S7)” (Page 28)

      (3) Page 20, the authors may want to briefly explain how the normalized Shannon entropy was calculated.

      We thank the reviewer for pointing this out. This is plain Shannon Entropy and the word “normalized” should not have been there. To avoid confusion we have provided the equation we have used to calculate the Shannon entropy (Eq 8) (Page 21).
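
      Since Eq 8 is not reproduced in this response, the following is only a sketch of the textbook Shannon entropy, H = -Σ p_i log p_i, and not necessarily the exact form used in the manuscript. The function name and the choice of logarithm base are assumptions.

```python
import numpy as np

def shannon_entropy(probabilities, base=np.e):
    """Plain (unnormalized) Shannon entropy of a discrete distribution.

    Input weights are renormalized to sum to 1; zero-probability states
    contribute nothing, following the convention 0 * log(0) = 0.
    """
    p = np.asarray(probabilities, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(base))
```

      A "normalized" entropy would additionally divide by the logarithm of the number of states, which is the step the response clarifies was not applied.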

      Reviewer #2 (Public Review):

      In the manuscript "Modulation of α-Synuclein Aggregation Amid Diverse Environmental Perturbation", Wasim et al describe coarse-grained molecular dynamics (cgMD) simulations of α-Synuclein (αS) at several concentrations and in the presence of molecular crowding agents or high salt. They begin by benchmarking their cgMD against all-atom simulations by Shaw. They then carry out 2.4-4.3 µs cgMD simulations under the above-noted conditions and analyze the data in terms of protein structure, interaction network analysis, and extrapolated fluid mechanics properties. This is an interesting study because a molecular scale understanding of protein droplets is currently lacking, but I have a number of concerns about how it is currently executed and presented.

      We thank the reviewer for finding our study interesting.

      (1) It is not clear whether the simulations have reached a steady state. If they have not, it invalidates many of their analysis methods and conclusions.

      We have used the last 1 μs (1.5-2.5 μs) from each simulation for further analysis in this study. To assess whether the simulations have reached a steady state, we plot the time profile of the protein concentration in the dilute phase for all three cases.

      Author response image 1.

      Except for the scenario with only αS (Figures a and b), all cases show very steady concentrations across various sections of the trajectory (Figures c-f). The larger sudden fluctuations observed in Figures a and b arise because αS alone undergoes very slow spontaneous aggregation and the dense phase itself is very fluxional, so that the transfer of a few chains between the dense and dilute phases registers as a large fluctuation in the protein concentration in the dilute phase. In the other two scenarios (Figures c-f), aggregation is accelerated by the presence of crowders/salt, which causes larger aggregates to form. The addition or removal of one or two chains therefore does not significantly affect the concentration, and we do not see such sudden large jumps. In summary, the large jumps seen in Figures a and b are due to the slow, fluxional aggregation of pure αS and to finite size effects. However, as these are still only fluctuations, we posit that the systems have reached steady states. This claim is further supported by the following figure, in which the time profiles of a few useful system-wide macroscopic properties show no change between 1.5 and 2.5 µs.

      We also have added a brief discussion in the Methods section (Page 29-30) with these figures in the Supplementary Information.

      Author response image 2.

      “In this study, we utilized the final 1 µs from each simulation for further analysis. To ascertain whether the simulations have achieved a steady state, we plotted the time profile of protein concentration in the dilute phase for all three cases. Except for minor intermittent fluctuation involving only αS in neat water (Figures S8a and S8b), the remaining cases exhibit notably stable concentrations throughout various segments of the trajectory (Figures S8 c-f). The relatively higher fluctuations observed in Figures S8a and b stem from the slow, spontaneous aggregation of αS alone, compounded by the inherently ambiguous nature of the dense phase.

      Consequently, the addition or removal of a few chains from the dense to the dilute phase results in significant fluctuations in protein concentration within the dilute phase. Conversely, in the other two scenarios (Figures S8c-f), aggregation is expedited by the presence of crowders/salt, leading to the formation of larger aggregates. Consequently, the addition or removal of one or two chains has negligible impact on concentration, thereby mitigating sudden large jumps. In summary, the conspicuous jumps depicted in Figures S8a and b arise from the gradual, fluctuating aggregation of pure αS and finite size effects. However, since these remain within the realm of fluctuations, we assert that the systems have indeed reached steady states. This assertion is bolstered by the subsequent figure, where the time profile of several pertinent system-wide macroscopic properties reveals no discernible change between 1.5-2.5 µs (Figures S9).”
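
      One simple way to quantify the kind of steady-state check described above is to compare block averages over the analysis window and to estimate the net drift from a linear fit. This is a generic sketch, not the analysis performed by the authors; the function names and the choice of four blocks are assumptions.

```python
import numpy as np

def block_means(series, n_blocks=4):
    """Mean of the observable in consecutive blocks of the analysis window."""
    blocks = np.array_split(np.asarray(series, dtype=float), n_blocks)
    return np.array([block.mean() for block in blocks])

def relative_drift(time, series):
    """Net change over the window from a linear fit, relative to the mean.

    Values near zero suggest fluctuations around a steady level rather
    than a systematic trend.
    """
    time = np.asarray(time, dtype=float)
    series = np.asarray(series, dtype=float)
    slope, _intercept = np.polyfit(time, series, 1)
    return slope * (time[-1] - time[0]) / series.mean()
```

      Applied to the dilute-phase concentration between 1.5 and 2.5 µs, similar block means and a small relative drift would be consistent with the steady-state claim, although large intermittent jumps in small systems can still inflate the block-to-block spread.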

      (2) The benchmarking used to validate their cgMD methods is very minimal and fails to utilize a large amount of available all-atom simulation and experimental data.

      We disagree with the reviewer on this point. We have cited multiple previous studies [26, 27] that chose Rg as the metric for benchmarking coarse-grained models and used a reference (experimental or otherwise) to tune Martini force fields. Notable works in which Rg was used as a benchmark during the development of new coarse-grained force fields include Dignon et al. (PLoS Comp. Biol.) [ref. 25], Regy et al (Protein Science. 2021) [ref. 26], Joseph et al. (Nature Computational Science. 2021) [ref. 27] and Tesei et al (Open Research Europe, 2022) [ref. 28]. From a polymer physics perspective, tuning water-protein interactions simply changes the solvent characteristics for the biopolymer, and Rg has generally been considered a suitable metric for coarse-grained models. Moreover, we try to match the distribution of Rg rather than only its mean value. This suggests that, at the single-molecule level, cgMD simulations at the optimum strength of water-protein interactions allow the protein to sample the conformations present in the reference ensemble. We use the extensively sampled 70 μs all-atom data from DE Shaw Research to obtain the reference Rg distribution. We also perform a cross-validation by comparing the fraction of bound states in all-atom and cgMD dimer simulations, which agree well with each other at optimum water-protein interactions. To help readers understand the rationale behind choosing Rg, we have added a section to the Methods (Page 25) that explains why Rg is plausibly a good metric for tuning water-protein interactions in Martini 3, at least when dealing with IDPs.
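
      Matching an Rg distribution, rather than only its mean, can be sketched as below. The response does not state which similarity measure was used, so the histogram overlap here is an assumption chosen for simplicity, as are the function names and the bin count.

```python
import numpy as np

def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration of one conformation (N x 3 array)."""
    coords = np.asarray(coords, dtype=float)
    masses = np.ones(len(coords)) if masses is None else np.asarray(masses, dtype=float)
    center = np.average(coords, axis=0, weights=masses)
    sq_dist = ((coords - center) ** 2).sum(axis=1)
    return float(np.sqrt(np.average(sq_dist, weights=masses)))

def histogram_overlap(sample_a, sample_b, bins=50):
    """Overlap of two normalized histograms on a shared grid.

    Returns 1 for identical distributions and 0 for disjoint ones.
    """
    lo = min(np.min(sample_a), np.min(sample_b))
    hi = max(np.max(sample_a), np.max(sample_b))
    pa, edges = np.histogram(sample_a, bins=bins, range=(lo, hi), density=True)
    pb, _ = np.histogram(sample_b, bins=bins, range=(lo, hi), density=True)
    return float(np.minimum(pa, pb).sum() * (edges[1] - edges[0]))
```

      In such a scheme, one would compute Rg per frame for the coarse-grained ensemble at each candidate scaling factor and keep the factor whose distribution overlaps most with the all-atom reference.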

      Our optimized model is further supported by the FRET experiments by Ray et al. [6]. They found that interchain NAC-NAC interactions drive LLPS. Residue level contact maps obtained from our simulations also show decreased intrachain NAC-NAC interactions with an increased interchain NAC-NAC interactions inside the droplet. This corroborates well with the experimental observations and furthermore validates the metrics we have used for optimization of the water-protein interactions. However the comparison with the FRET data by Ray et al. was not present earlier and we have added the following lines in the updated draft.

      Page 17: “Thus we observed that increased inter-chain NAC-NAC regions facilitate the formation of αS droplets which also have previously been seen from FRET experiments on αS LLPS droplets[6].”

      (3) They also miss opportunities to compare their simulations to experimental data on aSyn protein droplets.

      We thank the reviewer for pointing this out. We have tried to compare the results from our simulations to existing experimental FRET data on αS. Please see the previous response where we have described our comparison with FRET observations.

      (4) Aspects such as network analysis are not contextualized by comparison to other protein condensed phases.

      For a proper comparison between other protein condensed phases, we would require the position phase space of such condensates which is not readily available. Therefore we tried to explain it in a simpler manner to paint a picture of how αS forms an interconnecting network inside the droplet phase.

      (5) Data are not made available, which is an emerging standard in the field.

      We thank the reviewer for mentioning this. We have provided the trajectories between 1.5-2.5 μs, which we used for the analysis presented in the article, via a zenodo repository along with other relevant files related to the simulations (https://zenodo.org/records/10926368).

      Firstly, it is not clear that these systems are equilibrated or at a steady state (since protein droplets are not really equilibrium systems). The authors do not present any data showing time courses that indicate the system to be reaching a steady state. This is problematic for several of their data analysis procedures, but particularly in determining free energy of transfer between the condensed and dilute phases based on partitioning.

      We have addressed this concern as stated previously in the response. We have updated the article accordingly.

      Secondly, the benchmarking that they perform against the 73 µs all-atom simulation of aSyn monomer by Shaw and coworkers provides only very crude validation of their cgMD models based on reproducing Rg for the monomer. The authors should make more extensive comparisons to the specific conformations observed in the DE Shaw work. Shaw makes the entire trajectory publicly available. There are also a wealth of experimental data that could be used for validation with more molecular detail. See, for example, NMR and FRET data used to benchmark Monte Carlo simulations of aSyn monomer (as well as extensive comparisons to the Shaw MD trajectory) in Ferrie et al.: A Unified De Novo Approach for Predicting the Structures of Ordered and Disordered Proteins, J. Phys. Chem. B 124 5538-5548 (2020)

      DOI:10.1021/acs.jpcb.0c02924

      I note that NMR measurements of aSyn in liquid droplets are available from Vendruscolo: Observation of an α-synuclein liquid droplet state and its maturation into Lewy body-like assemblies, Journal of Molecular Cell Biology, Volume 13, Issue 4, April 2021, Pages 282-294, https://doi.org/10.1093/jmcb/mjaa075.

      In addition, there are FRET studies by Maji: Spectrally Resolved FRET Microscopy of α-Synuclein Phase-Separated Liquid Droplets, Methods Mol Biol 2023:2551:425-447. doi: 10.1007/978-1-0716-2597-2_27.

      So the authors are missing opportunities to better validate the simulations and place their structural understanding in greater context. This is just based on my own quick search, so I am sure that additional and possibly better experimental comparisons can be found.

      We have performed a comparison with existing FRET measurements by Ray et al. (2020), as discussed in a previous response, and have updated the article accordingly. The DOI (10.1007/978-1-0716-2597-2_27) provided by the reviewer is, however, for a book on methods to characterize protein aggregates and does not contain any information regarding the observations from FRET experiments. The other DOI (https://doi.org/10.1093/jmcb/mjaa075), for the article from the Vendruscolo group, does not contain information directly relevant to this study. Moreover, NMR measurements cannot be predicted from cgMD since full atomic resolution is lost upon coarse-graining of the protein. A past literature survey by the authors found very little scientific literature on molecular-level characterization of αS LLPS droplets.

      Thirdly, the small-world network analysis is interesting, but hard to contextualize. For instance, the 8 Å cutoff used seems arbitrary. How does changing the cutoff affect the value of S determined? Also, how does the value of S compare to other condensed phases like crystal packing or amyloid forms of aSyn?

      The 8 Å cutoff is actually arbitrary since a distance based clustering always requires a cutoff which is empirically decided. However 8 Å is quite large compared to other cutoffs used for distance based clustering. For example in ref 26, 5 Å was used as a cutoff for calculation of protein clusters. Larger cutoffs will lead to sparser network structures. However we used the same cutoff for all distance based clustering which makes the networks obtained comparable. We wanted to perform a comparison among the networks formed by αS under different environmental conditions.

      Fourthly, I see no statement on data availability. The emerging standard in the computational field is to make all data publicly available through Github or some similar mechanism.

      We thank the reviewer for pointing this out and we have provided the raw data between 1.5-2.5 μs for each scenario along with other relevant files via a zenodo repository (https://zenodo.org/records/10926368).

      Finally, on page 16, they discuss the interactions of aSyn(95-110), but the sequence that they give is too long (seeming to contain repeated characters, but also not accurate). aSyn(95-110) = VKKDQLGKNEEGAPQE. Presumably this is just a typo, but potentially raises concerns about the simulations (since without available data, one cannot check that the sequence is accurate) and data analysis elsewhere.

      This indeed is a typographical error. We have updated the article with the correct sequence. The validity of the simulations can be verified from the data we have shared via the zenodo repository (https://zenodo.org/records/10926368).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1:

      Mehrdad Kashefi et al. investigated whether future reaches can be planned while the execution of the current reach is being controlled. Through a series of experiments employing a novel sequential arm reaching paradigm they developed, the authors made several findings: 1) participants demonstrate the capability to plan future reaches in advance, thereby accelerating the execution of the reaching sequence; 2) the planning processes for future movements are not independent of one another, yet they do not form a single chunk either; 3) interaction among these planning processes optimizes the current movement for the movement that comes after it.

      The question of this paper is very interesting, and the conclusions of this paper are well supported by data. However, certain aspects require further clarification and expansion.

      We thank reviewer one for their evaluation of the work.

      (1) The question of this study is whether future reach plans are available during an ongoing reach. In the abstract, the authors summarized that "participants plan at least two future reaches simultaneously with an ongoing reach and that the planning processes of the two future reaches are not independent of one another" and showed the evidence in the next sentences. However, the evidence concerns the relationship between the ongoing reach and future plans, not the relationship between future plans (Line 52-55), whereas the last sentence (Line 55-58) mentions only interactions between future plans. There are some discrepancies between sentences. Could you make the abstract clearer by mentioning interference 1) between the ongoing movement and future plans and 2) between future plans?

      We thank the reviewer for their comment. We have separated the longer sentence in the original abstract into two shorter ones. This should clarify that the two pieces of evidence pertain to the interaction of planning processes.

      (2) I understood the ongoing reach and future reaches are not independent from the results of first experiment (Figure 2). A target for the current reach is shown at Horizon 1, on the other hand, in Horizon 2, a current and a future target are shown on the screen. Inter-reach-interval was significantly reduced from H1 to H2 (Figure 2). The authors insist that "these results suggest that participants can plan two targets (I guess +1 and +2) ahead of the current reach (I guess +0)". But I think these results suggest that participants can plan a target (+1) ahead of the current reach (+0) because participants could see the current (+0) and a future target (+1) in H2. Could the authors please clarify this point?

      We thank the reviewer for raising this point. Our conclusion that “participants can plan two targets ahead of the current reach” is supported by the reduction in Inter-Response Interval (IRI) observed when comparing H2 to H3 in the 75 ms Dwell time condition. Specifically, on average, participants were 16 ms faster when they could see two future targets on the screen (H3) than when they could see only one (H2). To clarify this in the paper, we have revised the wording in line 124 to explicitly state that the conclusion pertains to the 75 ms Dwell time condition. Additionally, we emphasize that the strongest evidence for planning two future targets comes from the experiment shown in Figure 3.

      (3) Movement correction for a jump of the +1 target takes longer in H3 than in H2 (Figure 4). Does this perturbation have any effect on reaching for the +2 target? If the +1 jump doesn't affect reaching for the +2 target, then, combined with the result that a jump of the +2 target didn't affect the movement time of the +1 target (Figure 3C), the perturbation (target jump) only affects the movement directly perturbed. Is this interpretation correct? If so, do these results argue against the idea that future reaches are planned as a motor chunk? I would like to know the authors' thoughts about this.

      In the experiment presented in Figure 4, once we jumped the +1 target, the reach to that target was changed and participants replanned a corrective movement to the new location of the +1 target. This was usually followed by a longer-than-usual pause at the new location of the +1 target before resuming the sequence and finishing the trial. Consequently, in these jump trials, it was impossible to compare the +2 reach to no-jump trials, as the normal sequence of movement was disrupted, and the reach to the +2 target originated from a different starting location. Nevertheless, we addressed the possibility that the two future reaches were planned as a chunk by the analysis shown in Figure 5: there we showed that a displacement of the +2 target did not influence the reach to the +1 target, indicating that the movement plans could be updated independently.

      (4) Any discussion about Saccade position (Figure 7)?

      We thank reviewer 1 for this important comment. The following discussion section is added for the gaze position results.

      In our sequence task, participants switched their gaze location only once per reach, suggesting that information about the location of the next target is perceived parafoveally (Figure 7A). This observation aligns with previous studies (Clavagnier et al., 2007; González-Alvarez et al., 2007; Sivak and MacKenzie, 1990) that found participants keep their visual attention on the current sequence item and can perceive the location of spatial targets even when foveal vision is occluded. However, when comparing gaze locations for conditions Horizon >1, we observed that participants systematically biased their gaze location based on the sequence context. The gaze position shifted toward the next target, potentially allowing for more accurate location estimation (Figures 7C-D). Notably, changes in gaze location were observed even in Horizon 2, despite no changes in the curvature of hand movements in this horizon (Figure 6B). This suggests that information about the next target may first be available in the circuitry that controls eye movements and later in the cortical areas that control voluntary upper limb movements. Further control studies are required to investigate this hypothesis.

      Reviewer #2:

      Summary:

      In this work, Kashefi et al. investigate the planning of sequential reaching movements and how the additional information about future reaches affects planning and execution. This study, carried out with human subjects, extends a body of research in sequential movements to ask important questions: How many future reaches can you plan in advance? And how do those future plans interact with each other?

      The authors designed several experiments to address these questions, finding that information about future targets makes reaches more efficient in both timing and path curvature. Further, with some clever target jump manipulations, the authors show that plans for a distant future reach can influence plans for a near future reach, suggesting that the planning for multiple future reaches is not independent. Lastly, the authors show that information about future targets is acquired parafoveally--that is, subjects tend to fixate mainly on the target they are about to reach to, acquiring future target information by paying attention to targets outside the fixation point.

      The study opens up exciting questions about how this kind of multi-target planning is implemented in the brain. As the authors note in the manuscript, previous work in monkeys showed that preparatory neural activity for a future reaching movement can occur simultaneously with a current reaching movement, but that study was limited to the monkey only knowing about two future targets. It would be quite interesting to see how neural activity partitions preparatory activity for a third future target, given that this study shows that the third target's planning may interact with the second target's planning.

      Strengths:

      A major strength of this study is that the experiments and analyses are designed to answer complementary questions, which together form a relatively complete picture of how subjects act on future target information. This complete description of a complex behavior will be a boon to future work in understanding the neural control of sequential, compound movements.

      We thank the reviewer for their thorough reading of our work.

      Weaknesses:

      I found no real glaring weaknesses with the paper, though I do wish that there had been some more discussion of what happens to planning with longer dwell times in target. In the later parts of the manuscript, the authors mention that the co-articulation result (where reaches are curved to make future target acquisition more efficient) was less evident for longer dwell times, likely because for longer dwell times, the subject needs to fully stop in target before moving to the next one. This result made me wonder if the future plan interaction effect (tested with the target jumps) would have been affected by dwell time. As far as I can tell, the target jump portion only dealt with the shorter dwell times, but if the authors had longer dwell time data for these experiments, I would appreciate seeing the results and interpretations.

      We thank the reviewer for raising this point. In our time (Figure 2) and curvature analysis (Figure 6), we collected data with five levels of the horizon and three levels of dwell time to explore the space of parameters and to see if there is any interaction between dwell time and the horizon of planning the future targets. A priori, we expected that the full stop in each target imposed by the 400 ms dwell time would be long enough to remove any effect of future targets on how the current move is executed. In line with our initial hypothesis, the systematic curvature of reaches based on the future target was smaller in longer dwell times (Figure 6E). Nevertheless, we observed a significant curvature even in the 400 ms dwell time. Based on this observation, we expect that running the jump experiments (Figures 4 and 5) with longer dwell times will lead to the same pattern of results but with a smaller effect size, since longer dwells break the interdependence of sequence elements (Kalidindi & Crevecoeur, 2023). In the end, for the jump experiments, we limited our experimental conditions to the fastest dwell time (75 ms dwell) since we were conceptually interested in situations where movements in the sequence are maximally dependent on each other.

      Beyond this, the authors also mentioned in the results and discussion the idea of "neural resources" being assigned to replan movements, but it's not clear to me what this might actually mean concretely. I wonder if the authors have a toy model in mind for what this kind of resource reassignment could mean. I realize it would likely be quite speculative, but I would greatly appreciate a description or some sort of intuition if possible.

      Our use of the term "neural resources" is inspired by classic psychology literature on how cognitive resources such as attention and working memory are divided between multiple sequence components. Early studies on working memory suggest that human participants can retain and manipulate a fixed number of abstract items in working memory (Miller, 1956). However, more recent literature postulates that a specific number of items does not limit working memory, rather, it is limited by a finite attentional resource that is softly allocated to task items.

      Here we borrowed the same notion of soft distribution of resources for the preparation of multiple sequence items. A large portion of our observation in this paper and also previous work on sequence production can be explained by a simple model that assumes one central planning resource that is “softly” divided between sequence elements when participants see future items of the sequence (Author Response Image 1). The first sequence element receives the majority of the resources and is planned the most. The rest of the sequence receives the remaining planning resources in an exponentially decaying manner for preparation of the movement during the execution of the ongoing movement. Once the ongoing movement is over, the resource is then transferred to the next sequence item and this process is repeated until the sequence is over. Assignment of planning resources to future items explains why participants are faster when seeing future items (Figure 2). But this comes with a cost – if the ongoing movement is perturbed, the replanning process is delayed since some of the resources are occupied by future planning (Figure 4). This naturally leads to the question of how this resource allocation is implemented in neural tissue. To address this, we are conducting the same sequence task with the horizon in non-human primates (NHPs), and the investigation of these neural implementation questions will be the focus of future studies.

      Author response image 1.

      Basic diagram showing a soft distribution of a limited planning resource. The diagram shows a Horizon 3 condition in which two future reaches (+1 and +2) are planned while executing a movement (+0). The majority of resources is assigned to the execution of the ongoing movement while the rest is distributed for planning future movements. Once the movement is over, the chain of preparation and execution moves forward.
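      The soft allocation described above can be sketched as a toy calculation. The exponential form follows the description in the response, but the function name, the decay constant, and the normalization to one unit of resource are assumptions made purely for illustration; no parameter values are given in the response.

```python
import math

def allocate_planning_resource(n_visible, decay=1.0):
    """Split one unit of planning resource across the visible sequence items.

    Item 0 is the ongoing movement; items 1, 2, ... are future reaches.
    Shares decay exponentially with position and sum to 1.
    """
    if n_visible < 1:
        raise ValueError("at least one item must be visible")
    raw = [math.exp(-decay * k) for k in range(n_visible)]
    total = sum(raw)
    return [r / total for r in raw]

# Share of the resource given to each item for Horizon 1, 2 and 3.
for horizon in (1, 2, 3):
    shares = allocate_planning_resource(horizon)
    print(horizon, [round(s, 3) for s in shares])
```

      In this sketch the share left for the ongoing movement shrinks as the horizon grows, which is one way to picture why replanning after a perturbation could be slower when more future targets are visible (Figure 4).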

      Recommendations for the author:

      Reviewer #1

      We thank reviewer one for these comments regarding the clarity and consistency of figures and terminology.

      (1) Figure 3. Are "+1 Move" in Fig. 3B and "+ 1 Movement" in Fig. 3C as same as "E + 1" in Fig. 3A? Also does "Dwell" in Fig. 3B mean same as "+1 Dwell" in Fig. 3C? Consistent terminology would help readers to understand the figure.

      “+1 Move” in Figure 3B is the same as +1 movement in Figure 3C. “Dwell” in Figure 3B is the same as +1 Dwell in Figure 3C. We changed the figure for more consistency.

      (2) Figure 3. A typo in the second-to-last line of the legend, "pre-jump target for no-jump and jump and condition". The second "and" isn't necessary.

      The typo is corrected. Thank you.

      (3) Figure 4C. Is "Movement time" equivalent with "E + 1"?

      “Movement time” is equivalent to E+1 only in no-jump conditions. When the jump occurs, Movement time contains all the

      (4) Figure 6B. Is the gray circle in between the graph and target positions there by mistake?

      We fixed this typo. Thank you.

      (5) Figure 6E. It's hard to distinguish H2-H5 from the color differences.

      We changed the H5 to full white with a black stroke to improve the contrast. Thank you.

      (6) Figure 7A. Blue dots are almost invisible.

      We added a black stroke to blue circles for more visibility. Thank you.

      Reviewer #2

      I found this manuscript to be engaging and well written--many of the questions I had while reading were answered promptly in the next section. As such, my comments are mostly minor and primarily geared towards improving clarity in the manuscript.

      (1) One major recurring confusion I had while reading the manuscript was how to think about H1, H2, and H3. It was clearly explained in the text, and the explanations of the results were generally clear once I read through it all, but I found it strangely confusing at times when trying to interpret the figures for myself (e.g., in H2, 2 targets are on screen, but the second target can only be planned during the reach toward the first target). This confusion may just be me reading the manuscript over two days, but I wonder if it could be made clearer with some semantic iconography associated with each horizon added to the later figures alongside the H labels. As one option, perhaps the planning timeline part of Fig 1D could be simplified and shrunk down to make an icon for each horizon that clearly shows when planning overlaps for each horizon.

      (Please see the response to point #2 below)

      (2) Regarding Fig 1D: I like this figure, but it's unclear to me how the exact preparation and execution times are determined. Is this more of a general schematic of overlaps, or is there specific information about timing in here?

      We thank reviewer 2 for their important feedback. The role of Figure 1D was to summarize the timing of the experiments for different horizons. That is, to clarify the relative timing of the targets appearing on the screen (shown with a small circle above the horizontal line) and targets being captured by participants (the ticks and their associated number on the line). Execution is shown as the time interval that the hand is moving between the targets and planning is the potential planning time for participants from the target appearing on the screen until initiation of the reach to that target. We added the relevant parts of Figure 1D to the subplots for each subsequent experiment, to summarize the timing of other experiments and their analyses. For the experiments with target jump, a small vertical arrow shows the time of the target jump relative to other events.

      However, this figure will be less useful if the connection between the timing dots and ticks is not communicated. We agree that in the original manuscript, this important figure was only briefly explained in the caption of Figure 1. We expanded the explanation in the caption of Figure 1 and referenced the dots and ticks in the main text.

      (3) Fig 6B - for some reason I got confused here: I thought the central target in this figure was the start target, and it took me embarrassingly long to figure out that the green target was the start target. This is likely because I'm used to seeing center-out behavioral figures. Incidentally, I wasn't confused by 7c (in fact, seeing 7c is what made me understand 6b), so maybe the solution is to clearly mark a directionality to the reach trajectories, or to point an arrow at the green target like in previous figures. Also, the bottom left gray target in the figure blends into the graph on the left--I didn't notice it until rereading. Because there's white space between that target and the green one, it might be good to introduce some white space to separate the graph from the targets more. The target arrangement makes more sense in panel C, but by the time I got there, I had already been a bit confused.

      Thanks for raising this point. As shown in Figure 6C, we used the reach to the +1 target for the curvature analysis. The confusion about Figure 6B is probably due to the reach trajectories continuing after the +1 target. That also explains why Figure 7C seemed more straightforward. To solve this issue, we modified Figure 6B such that the reaches are shown with full opacity right up to the +1 target and then with more transparency. We believe this change focuses the reader's attention on the reach initiated from the +0 target to the +1 target.

      As for the gray target in Figure 6B, we originally had the gray target as it is a potential start location for the reach to the +0 target, and for having similar visuals between the plots. The gray target is now removed from Figure 6B.

      (4) Line 253 - I'm not sure I understand the advantage over simple averaging that the authors mention here--would be nice to get a bit more intuition.

      Thanks for raising this point. We used a two-factor model in our analysis, with each factor representing the angle of the last and next target, respectively. Both factors had five levels: -120, -60, 0, 60, and 120 degrees relative to the +1 reach. In a balanced two-factor design, where each combination of factor levels has an equal number of trials, using a linear model and simple averaging would yield equivalent results. However, when the number of trials for the combinations of the two factors is unbalanced, simple averaging can lead to misleading differences in the levels of the second factor. Additionally, the linear model allows us to investigate potential interactions between the two factors, which is not possible with simple averaging.
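      The difference between simple averaging and a linear model under an unbalanced design can be shown with a small noise-free example. The sketch below is a simplification that uses two levels per factor instead of five, and the effect sizes and cell counts are invented for illustration; they are not the experimental values.

```python
import numpy as np

# Hypothetical additive effects: the last-target factor matters,
# the next-target factor has no true effect at all.
effect_last = np.array([0.0, 10.0])
effect_next = np.array([0.0, 0.0])

# Unbalanced design: number of trials in each (last, next) combination.
cell_counts = {(0, 0): 9, (0, 1): 1, (1, 0): 1, (1, 1): 9}

last, nxt, y = [], [], []
for (i, j), n in cell_counts.items():
    last += [i] * n
    nxt += [j] * n
    y += [effect_last[i] + effect_next[j]] * n
last, nxt, y = np.array(last), np.array(nxt), np.array(y)

# Simple averaging over the next-target factor ignores the imbalance.
naive_diff = y[nxt == 1].mean() - y[nxt == 0].mean()

# Linear model with an intercept and one column per factor.
X = np.column_stack([np.ones(len(y)), last, nxt])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
model_diff = coef[2]

print("difference from simple averaging:", naive_diff)
print("next-target effect from linear model:", model_diff)
```

      Because the two factors are correlated in the trial counts, simple averaging attributes part of the last-target effect to the next target, whereas the least-squares fit separates the two. Adding an interaction column to `X` would be the corresponding way to test for interactions.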

      (5) Fig 7a - I would have liked to see the traces labeled in figure (i.e. hand trajectory vs. eye trajectory)

      Hand and eye trajectories are now labeled in the figure.

      (6) Fig 7c - very minor, but the hexagon of targets is rotated 30 degrees from all previous hexagons shown (also, this hex grid target arrangement can't lead to the trajectory shown in 7a, so it can't be that this was a different experimental grid). I'm guessing this was a simple oversight.

      We used the same grid in the eye-tracking experiment. The targets are now rotated to visually match the previous plots. Thank you for raising this point.

      Reference

      Clavagnier, S., Prado, J., Kennedy, H., & Perenin, M.-T. (2007). How humans reach: distinct cortical systems for central and peripheral vision. The Neuroscientist: A Review Journal Bringing Neurobiology, Neurology and Psychiatry, 13(1), 22–27.

      González-Alvarez, C., Subramanian, A., & Pardhan, S. (2007). Reaching and grasping with restricted peripheral vision. Ophthalmic & Physiological Optics: The Journal of the British College of Ophthalmic Opticians, 27(3), 265–274.

      Kalidindi, H. T., & Crevecoeur, F. (2023). Task dependent coarticulation of movement sequences (p.2023.12.15.571847). https://doi.org/10.1101/2023.12.15.571847

      Miller, G. A. (1956). The magical number seven plus or minus two: some limits on our capacity for processing information. Psychological Review, 63(2), 81–97.

      Sivak, B., & MacKenzie, C. L. (1990). Integration of visual information and motor output in reaching and grasping: the contributions of peripheral and central vision. Neuropsychologia, 28(10), 1095–1116.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This useful study could potentially represent a step forward towards personalized medicine by combining cell-based data and a prior-knowledge network to derive Boolean-based predictive logic models to uncover altered protein/signaling networks within cancer cells. However, the level of evidence supporting the conclusions is inadequate, and further validation of the reported approach is required. If properly validated, these findings could be of interest to medical biologists working in the field of cancer and would inform drug development and treatment choices in the field of oncology.

      We thank the editor and the reviewer for their constructive comments, which helped us to improve our story. We have now performed new analyses and experiments to further support our proposed approach.

      Public Reviews:

      Reviewer #1 (Public Review):

      (1) The authors deploy a combination of their own previously developed computational methods and databases (SIGNOR and CellNOptR) to model the FLT3 signaling landscape in AML and identify synergistic drug combinations that may overcome the resistance of AML cells harboring ITD mutations in the TKI domain of FLT3 to FLT3 inhibitors. I did not closely evaluate the details of these computational models since they are outside of my area of expertise and have been previously published. The manuscript has significant issues with data interpretation and clarity, as detailed below, which, in my view, call into question the main conclusions of the paper.

      The authors train the model by including perturbation data where TKI-resistant and TKI-sensitive cells are treated with various inhibitors and the activity (i.e. phosphorylation levels) of the key downstream nodes are evaluated. Specifically, in the Results section (p. 6) they state "TKIs sensitive and resistant cells were subjected to 16 experimental conditions, including TNFa and IGF1 stimulation, the presence or absence of the FLT3 inhibitor, midostaurin, and in combination with six small-molecule inhibitors targeting crucial kinases in our PKN (p38, JNK, PI3K, mTOR, MEK1/2 and GSK3)". I would appreciate more details on which specific inhibitors and concentrations were used for this experiment. More importantly, I was very puzzled by the fact that this training dataset appears to contain, among other conditions, the combination of midostaurin with JNK inhibition, i.e. the very combination of drugs that the authors later present as being predicted by their model to have a synergistic effect. Unless my interpretation of this is incorrect, it appears to be a "self-fulfilling prophecy", i.e. an inappropriate use of the same data in training and verification/test datasets.

      We thank the reviewer for this comment. We have now extensively revised Figure 2B and edited the text to clarify and better describe the experimental conditions of our multiparametric analysis. As the reviewer stated, we have used different combinations of drugs, including midostaurin and a JNK inhibitor, to generate two cell-specific predictive models recapitulating the main signal transduction events downstream of FLT3 that occur in resistant (FLT3ITD-TKD) and sensitive (FLT3ITD-JMD) cells. These experiments were performed by treating cells at very early time points to obtain a picture of the signaling response of FLT3-ITD positive cells. Indeed, we measured the phosphorylation level of signaling proteins because at these early time points (90 minutes) we do not expect a modulation of crucial downstream phenotypes, including apoptosis or proliferation. To infer perturbations impacting the apoptosis or proliferation phenotypes, we applied a computational two-step strategy:

      (1) We extracted key regulators of ‘apoptosis’ and ‘proliferation’ hallmarks from SIGNOR database.

      (2) We applied our recently developed ProxPath algorithm to retrieve significant paths linking nodes of our two optimized models to ‘proliferation’ and ‘apoptosis’ phenotypes.

      This allowed us to evaluate in silico the “proliferation” and “apoptosis” rate upon inactivation of each node of the network. With the proposed approach, we identified JNK as a potential drug target to use in combination with FLT3 inhibition to restore sensitivity (i.e. in silico inducing apoptosis and reducing proliferation) of FLT3 ITD-TKD cells. We here want to stress once more that although the first piece of information (the effect of JNK and FLT3 inhibition on sentinel readouts) was provided in the training dataset, the second piece of information (the effect of this treatment on the entire model and, as a consequence, on the cellular phenotype) was purely the result of our computational models. As such, we hope that the reviewer will agree that this could not represent a “self-fulfilling prophecy”.
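      The general idea of evaluating a phenotype in silico after inactivating nodes of a Boolean logic model can be sketched as follows. This is a generic toy network, not the optimized FLT3 models, CellNOptR, or the ProxPath algorithm: the node names, the wiring, and the synchronous update scheme are all hypothetical and chosen only to illustrate the principle.

```python
def simulate(rules, state, knocked_out=(), max_steps=50):
    """Synchronously update a Boolean network until it reaches a fixed point."""
    state = dict(state)
    for node in knocked_out:
        state[node] = False
    for _ in range(max_steps):
        new_state = {}
        for node in state:
            if node in knocked_out:
                new_state[node] = False          # inhibited nodes stay off
            elif node in rules:
                new_state[node] = rules[node](state)
            else:
                new_state[node] = state[node]    # input nodes keep their value
        if new_state == state:
            return state
        state = new_state
    raise RuntimeError("no fixed point reached")

# Hypothetical wiring, for illustration only.
rules = {
    "kinase_a": lambda s: s["receptor"],
    "kinase_b": lambda s: s["receptor"] or s["stress"],
    "survival": lambda s: s["kinase_a"] or s["kinase_b"],
    "apoptosis": lambda s: not s["survival"],
}
initial = {
    "receptor": True, "stress": True, "kinase_a": False,
    "kinase_b": False, "survival": False, "apoptosis": False,
}

# Evaluate the phenotype node after each hypothetical perturbation.
for knocked_out in [(), ("receptor",), ("kinase_b",), ("receptor", "kinase_b")]:
    final = simulate(rules, initial, knocked_out)
    print(knocked_out, "apoptosis:", final["apoptosis"])
```

      In this toy wiring the phenotype node is not among the perturbed or measured nodes; its state follows from propagating each perturbation through the whole network, which is the distinction the response draws between training readouts and model-derived phenotype predictions.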

      That said, we understand that this aspect was not clearly defined in the manuscript. For this reason, we have now 1) extensively revised Figure 2B; 2) edited the text (pg. 6) to clarify the purpose and the results of our approach; and 3) described in further detail (pg. 16-18) the experimental conditions of our multiparametric analysis.
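
      The two-step strategy described above can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: the three-node network, the update rules and the signed phenotype links are hypothetical, and the code does not reproduce the optimized models, the SIGNOR annotations or the ProxPath algorithm. It only shows the general idea of computing a steady state after node inactivation and reading out a phenotype score.

```python
# Toy Boolean network with hypothetical rules (NOT the optimized FLT3-ITD models).
RULES = {
    "FLT3": lambda s: s["FLT3"],                  # input node, keeps its value
    "JNK": lambda s: s["JNK"],                    # input node, keeps its value
    "SURVIVAL": lambda s: s["FLT3"] or s["JNK"],  # hypothetical redundancy
}

# Hypothetical signed links from model nodes to phenotypes:
# +1 means the active node promotes the phenotype, -1 means it inhibits it.
PHENOTYPE_LINKS = {
    "apoptosis": {"SURVIVAL": -1},
    "proliferation": {"SURVIVAL": +1},
}


def steady_state(initial, knocked_out=(), max_steps=50):
    """Synchronously update the network until a fixed point is reached."""
    state = dict(initial)
    for node in knocked_out:
        state[node] = False  # in silico inactivation
    for _ in range(max_steps):
        new_state = {
            node: (False if node in knocked_out else bool(rule(state)))
            for node, rule in RULES.items()
        }
        if new_state == state:
            return state
        state = new_state
    raise RuntimeError("no fixed point reached within max_steps")


def phenotype_scores(state):
    """Sum the signed contributions of active nodes for each phenotype."""
    return {
        phenotype: sum(sign * int(state[node]) for node, sign in links.items())
        for phenotype, links in PHENOTYPE_LINKS.items()
    }


initial = {"FLT3": True, "JNK": True, "SURVIVAL": False}
for knocked_out in [(), ("FLT3",), ("FLT3", "JNK")]:
    print(knocked_out, phenotype_scores(steady_state(initial, knocked_out)))
```

      In this toy network, inactivating FLT3 alone leaves the phenotype scores unchanged, whereas inactivating FLT3 and JNK together changes them; this behaviour is a property of the hypothetical rules chosen here and is not evidence about the real models.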

      (2) My most significant criticism is that the proof-of-principle experiment evaluating the combination effects of midostaurin and SP600125 in FLT3-ITD-TKD cell line model does not appear to show any synergism, in my view. The authors' interpretation of the data is that the addition of SP600125 to midostaurin rescues midostaurin resistance and results in increased apoptosis and decreased viability of the midostaurin-resistant cells. Indeed, they write on p.9: "Strikingly, the combined treatment of JNK inhibitor (SP600125) and midostaurin (PKC412) significantly increased the percentage of FLT3ITD-TKD cells in apoptosis (Fig. 4D). Consistently, in these experimental conditions, we observed a significant reduction of proliferating FLT3ITD-TKD cells versus cells treated with midostaurin alone (Fig. 4E)." However, looking at Figs 4D and 4E, it appears that the effects of the midostaurin/SP600125 combination are virtually identical to SP600125 alone, and midostaurin provides no additional benefit. No p-values are provided to compare midostaurin+SP600125 to SP600125 alone but there seems to be no appreciable difference between the two by eye. In addition, the evaluation of synergism (versus additive effects) requires the use of specialized mathematical models (see for example Duarte and Vale, 2022). That said, I do not appreciate even an additive effect of midostaurin combined with SP600125 in the data presented.

      We agree with the reviewer that the JNK inhibitor and midostaurin have neither a synergistic nor an additive effect, and we have now revised the text accordingly. Whether FLT3ITD-TKD AML cells benefit from midostaurin treatment is actively debated in the scientific community. In a recently published retrospective study by K. Dohner et al. (Rücker et al., 2022), the authors investigated the prognostic and predictive impact of the FLT3-ITD insertion site (IS) in 452 patients randomized within the RATIFY trial, which evaluated midostaurin in addition to intensive chemotherapy. Their study clearly showed that “Midostaurin exerted a significant benefit only for JMDsole” patients. In agreement with this result, we have demonstrated that midostaurin treatment had no effect on apoptosis of blasts derived from FLT3ITD-TKD patients (Massacci et al., 2023). On the other hand, we and others observed that midostaurin triggers apoptosis in FLT3ITD-TKD cells to a lesser extent than in FLT3ITD-JMD cells (Arreba-Tutusaus et al., 2016). The data presented here (Fig. 4) and our previously published papers (Massacci et al., 2023; Pugliese et al., 2023) indicate that hitting cell cycle regulators (WEE1, CDK7, JNK) induces a significant apoptotic response in TKI-resistant FLT3ITD-TKD cells. Prompted by the reviewer's comment, we have now revised the text and discussion (pg. 9; 14), highlighting the crucial role of JNK in apoptosis induction.
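
      As an illustration of the kind of reference model the reviewer alludes to, the sketch below uses Bliss independence, one commonly used null model for drug combinations. It is offered as an assumption-laden example only: it is not the analysis performed in the manuscript or in the cited reference, and the numbers are hypothetical rather than taken from Fig. 4.

```python
def bliss_expected(effect_a, effect_b):
    """Expected fractional effect of a combination if the two drugs act independently.

    Effects are fractions in [0, 1], e.g. the fraction of apoptotic cells.
    """
    for effect in (effect_a, effect_b):
        if not 0.0 <= effect <= 1.0:
            raise ValueError("effects must be fractions between 0 and 1")
    return effect_a + effect_b - effect_a * effect_b


def bliss_excess(observed_combination, effect_a, effect_b):
    """Observed minus expected effect: > 0 suggests synergy, < 0 suggests antagonism."""
    return observed_combination - bliss_expected(effect_a, effect_b)


# Hypothetical example: drug A alone 10%, drug B alone 60%, combination 60%.
print(bliss_expected(0.10, 0.60))
print(bliss_excess(0.60, 0.10, 0.60))
```

      With these hypothetical values, the combination does not exceed the Bliss expectation, which is the pattern one would expect when one of the two agents contributes little on its own.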

      (3) In my view, there are significant issues with clarity and detail throughout the manuscript. For example, additional details and improved clarity are needed, in my view, with respect to the design and readouts of the signaling perturbation experiments (Methods, p. 15 and Fig 2B legend). For example, the Fig 2B legend states: "Schematic representation of the experimental design: FLT3 ITD-JMD and FLT3 ITD-JMD cells were cultured in starvation medium (w/o FBS) overnight and treated with selected kinase inhibitors for 90 minutes and IGF1 and TNFa for 10 minutes. Control cells are starved and treated with PKC412 for 90 minutes, while "untreated" cells are treated with IGF1 100ng/ml and TNFa 10ng/ml with PKC412 for 90 minutes.", which does not make sense to me. The "untreated" cells appear to be treated with more agents than the control cells. The logic behind cytokine stimulation is not adequately explained and it is not entirely clear to me whether the cytokines were used alone or in combination. Fig 2B is quite confusing overall, and it is not clear to me what the horizontal axis (i.e. columns of "experimental conditions", as opposed to "treatments") represents. The Method section states "Key cell signaling players were analyzed through the X-Map Luminex technology: we measured the analytes included in the MILLIPLEX assays" but the identities of the evaluated proteins are not given in the Methods. At the same time, the Results section states "TKIs sensitive and resistant cells were subjected to 16 experimental conditions" but these conditions do not appear to be listed (except in Supplementary data; and Fig 2B lists 9 conditions, not 16). In my subjective view, the manuscript would benefit from a clearer explanation and depiction of the experimental details and inhibitors used in the main text of the paper, as opposed to various Supplemental files/Figures. 
      The lack of clarity on what exactly were the experimental conditions makes the interpretation of Fig 2 very challenging. In the same vein, in the PCA analysis (Fig 2C) there seems to be no reference to the cytokine stimulation status while the authors claim that PC2 stratifies cells according to IGF1 vs TNFalpha. There are numerous other examples of incomplete or confusing legends and descriptions which, in my view, need to be addressed to make the paper more accessible.

      We thank the reviewer for his/her comment. We have now extensively revised the text of the manuscript (pg. 6), Fig. 2B (now Fig. 2C) and the methods (pg. 16-18) to improve the clarity of our manuscript and make the take-home messages more accessible. We believe that the revised text and Figure 2 better explain our strategy and clarify the experimental setup: we added details on the choice of the experimental conditions and proposed a better graphic representation of the analysis.

      (4) I am not sure that I see significant value in the patient-specific logic models because they are not supported by empirical evidence. Treating primary cells from AML patients with relevant drug combinations would be a feasible and convincing way to validate the computational models and evaluate their potential benefit in the clinical setting.

      We thank the reviewer for this comment. We have now performed additional experiments in a small cohort of FLT3-ITD positive patient-derived primary blasts. Specifically, we treated blasts from 2 FLT3ITD-TKD patients and 3 FLT3ITD-JMD+TKD patients with PKC412 (100 nM) and/or SP600125 (10 μM; JNK inhibitor). After 24h of treatment we measured the apoptotic rate. As shown below and in the new Fig. 4F (see pg. 10, main text), midostaurin triggers higher levels of apoptosis in FLT3ITD-JMD+TKD blasts than in FLT3ITD-TKD blasts. Importantly, treatment with the JNK inhibitor SP600125 alone triggers apoptosis in FLT3ITD-TKD blasts, validating the crucial role of JNK in FLT3ITD-TKD cell survival and TKI resistance. The combined treatment of midostaurin and SP600125 increases the percentage of apoptotic cells as compared to midostaurin treatment alone, but to a lesser extent than single-agent treatment. This result is in agreement with the current debate in the scientific community on the actual beneficial effect of midostaurin treatment in FLT3ITD-TKD AML patients.

      Author response image 1.

      Primary samples from AML patients with the FLT3ITD-TKD mutation (n=2, yellow bars) or the FLT3ITD-JMD/TKD mutation (n=3, blue bars) were exposed to midostaurin (PKC412, 100 nM), the JNK inhibitor SP600125 (10 µM), or their combination for 48 hours. The specific cell death of gated AML blasts was calculated to account for treatment-unrelated spontaneous cell death. The bars on the graph represent the mean values with standard errors.
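
      The legend above does not state which formula was used to correct for spontaneous cell death. As a hedged illustration, the sketch below implements one widely used definition of specific cell death; treating it as the formula applied here is an assumption, and the input percentages are hypothetical.

```python
def specific_cell_death(treated_dead_pct, spontaneous_dead_pct):
    """Treatment-specific cell death (%), corrected for spontaneous death.

    Assumed formula (not confirmed by the source):
        100 * (treated - spontaneous) / (100 - spontaneous)
    where both inputs are percentages of dead cells, and 'spontaneous'
    is measured in the untreated control.
    """
    if not 0.0 <= spontaneous_dead_pct < 100.0:
        raise ValueError("spontaneous death must be in the range [0, 100)")
    return 100.0 * (treated_dead_pct - spontaneous_dead_pct) / (100.0 - spontaneous_dead_pct)


# Hypothetical example: 60% dead cells after treatment, 20% dead in the control.
print(specific_cell_death(60.0, 20.0))
```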

      Reviewer #2 (Public Review):

      Summary:

      This manuscript by Latini et al describes a methodology to develop Boolean-based predictive logic models that can be applied to uncover altered protein/signalling networks in cancer cells and discover potential new therapeutic targets. As a proof-of-concept, they have implemented their strategy on a hematopoietic cell line engineered to express one of two types of FLT3 internal tandem mutations (FLT3-ITD) found in patients, FLT3-ITD-TKD (which are less sensitive to tyrosine kinase inhibitors/TKIs) and FLT3-ITD-JMD (which are more sensitive to TKIs).

      Strengths:

      This useful work could potentially represent a step forward towards personalised targeted therapy, by describing a methodology using Boolean-based predictive logic models to uncover altered protein/signalling networks within cancer cells. However, the weaknesses highlighted below severely limit the extent of any conclusions that can be drawn from the results.

      Weaknesses:

      While the highly theoretical approach proposed by the authors is interesting, the potential relevance of their overall conclusions is severely undermined by a lack of validation of their predicted results in real-world data. Their predictive logic models are built upon a set of poorly explained initial conditions, drawn from data generated in vitro from an engineered cell line, and no attempt was made to validate the predictions in independent settings. This is compounded by a lack of sufficient experimental detail or clear explanations at different steps. These concerns considerably temper one's enthusiasm about the conclusions that could be drawn from the manuscript.

      We thank the reviewer for the thorough review and kind comments about our manuscript. We hope the changes and new data we provide further strengthen it in his or her eyes.

      Some specific concerns include:

      (1) It remains unclear how robust the logic models are, or conversely, how affected they might be by specific initial conditions or priors that are chosen. The authors fail to explain the rationale underlying their input conditions at various points. For example: - at the start of the manuscript, they assert that they begin with a pre-PKN that contains "76 nodes and 193 edges", though this is then ostensibly refined with additional new edges (as outlined in Fig 2A). However, why these edges were added, nor model performance comparisons against the basal model are presented, precluding an evaluation of whether this model is better.

      We understand the reviewer’s concern. We have now complemented the manuscript with an extended version of the proposed modelling strategy, offering a detailed description of the pipeline and the rationale behind each choice (Supplementary material, pg. 14-19). Furthermore, we have also linked the manuscript to a GitHub repository where users can follow and reproduce each step of the pipeline (https://github.com/SaccoPerfettoLab/FLT3-ITD_driven_AML_Boolean_models).

      • At a later step (relevant to Fig S4 and Fig 3), they develop separate PKNs, for each of the mutation models, that contain "206 [or] 208 nodes" and "756 [or] 782 edges", without explaining how these seemingly arbitrary initial conditions were arrived at. Their relation to the original parameters in the previous model is also not investigated, raising concerns about model over-fitting and calling into question the general applicability of their proposed approach. The authors need to provide a clearer explanation of the logic underlying some of these initial parameter selections, and also investigate the biological/functional overlap between these sets of genes (nodes).

      We thank the reviewer for raising this question. Very briefly, the proposed optimization strategy belongs to a branch of modelling in which the predictive model is, indeed, driven by the data (Blinov and Moraru, 2012). From this point of view, the aim of the optimization is to fit the experimental data as well as possible. To achieve this, we followed standard practices (Dorier et al., 2016; Traynard et al., 2017). To address the concern “calling into question the general applicability of their proposed approach”, we compared the activity status of nodes in the models with ‘real data’ extracted from cell lines and patients’ samples to confirm the robustness and scalability of the strategy (please see below, response to point 3 pg. 9).

      Finally, as mentioned in the previous point, we have now provided detailed supplementary material describing all the aspects mentioned by the reviewer: the step-by-step changes in the PKN, the choice of the parameters and other details can be found in the new text and are also available in the GitHub repository (https://github.com/SaccoPerfettoLab/FLT3-ITD_driven_AML_Boolean_models).

      (2) There is concern about the underlying experimental data underpinning the models that were generated, further compounded by the lack of a clear explanation of the logic. For example, data concerning the status of signalling changes as a result of perturbation appears to be generated from multiplex LUMINEX assays using phosphorylation-specific antibodies against just 14 "sentinel" proteins. However, very little detail is provided about the rationale underlying how these 14 were chosen to be "sentinels" (and why not just 13, or 15, or any other number, for that effect?). How reliable are the antibodies used to query the phosphorylation status? What are the signal thresholds and linear ranges for these assays, and how would these impact the performance/reliability of the logic models that are generated from them?

      We thank the reviewer for this comment as it gives us the opportunity to clarify and better explain the criteria behind the experimental data generation.

      Overall, we revised the main text at page 6 and Figure 2B to improve the clarity of our experimental design. Specifically, the sentinels were chosen because they are direct or indirect downstream effectors of the perturbations, and they were conceived to serve both as a benchmarking system for the study and as a readout of the global perturbation of the system. To clarify this aspect, we have added a small network (compressed PKN) in Figure 2B to show that the proteins (green nodes) we chose to measure in the LUMINEX multiplex assay are “sentinels” of the activity of almost all the pathways included in the prior knowledge network. Moreover, we expanded the methods section “Multiparametric experiment of signaling perturbation” (pg. 16-18), where we added details about the antibodies used in the assay, paired with the target phosphosites and their functional role (Table 3). We also better specified the filtering process based on the number of beads detected for each antibody used (pg. 18). Regarding the reliability of the measurements, the quality of the perturbation data greatly impacts the performance of the logic models. xMAP technology has already been used by the scientific community to generate highly reproducible and reliable multiparametric datasets for model training (Terfve et al., 2012). Additionally, we checked that for each sentinel we could measure a fully active state, a fully inactive state and intermediate states. The modulation of individual analytes is displayed in Figure S3.

      Author response image 2.

      Partial Figure showing the normalization of analyte activity through Hill curves. Experimental data were normalized and scaled from 0 to 1 using analyte-specific Hill functions. Raw data are reported as triangles, normalized data as squares. The partial Figure shows three plots of the FLT3 ITD-JMD data (complete Figure in Supplementary material, Fig S3).
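
      The normalization described in this legend can be sketched as follows. This is a minimal example assuming the standard Hill form; the parameter names `k` (signal at half-maximal response) and `n` (Hill coefficient) and the raw values are hypothetical, and the way the analyte-specific parameters were actually chosen is not reproduced here.

```python
def hill_normalize(raw_signal, k, n):
    """Scale a non-negative raw signal to the interval [0, 1) with a Hill function.

    k: raw signal giving a normalized value of 0.5 (hypothetical, analyte-specific)
    n: Hill coefficient controlling the steepness (hypothetical, analyte-specific)
    """
    if raw_signal < 0 or k <= 0 or n <= 0:
        raise ValueError("raw_signal must be >= 0 and k, n must be > 0")
    return raw_signal**n / (k**n + raw_signal**n)


# Hypothetical raw intensities for a single analyte
raw_values = [0.0, 50.0, 100.0, 400.0]
print([round(hill_normalize(x, k=100.0, n=2.0), 3) for x in raw_values])
```

      Because the function is monotonic, the ranking of conditions is preserved while low and high signals are compressed towards 0 and 1, which suits the training of logic models.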

      (3) In addition, there are publicly available quantitative proteomics datasets from FLT3-mutant cell lines and primary samples treated with TKIs. At the very least, these should have been used by the authors to independently validate their models, selection of initial parameters, and signal performance of their antibody-based assays, to name a few unvalidated, yet critical, parameters. There is an overwhelming reliance on theoretical predictions without taking advantage of real-world validation of their findings. For example, the authors identified a set of primary AML samples with relevant mutations (Fig 5) that could potentially have provided a valuable experimental validation platform for their predictions of effective drug combination. Yet, they have performed Boolean simulations of the predicted effects, a perplexing instance of adding theoretical predictions on top of a theoretical prediction!

      Additionally, there are datasets of drug sensitivity on primary AML samples where mutational data is also known (for example, from the BEAT-AML consortia), that could be queried for independent validation of the authors' models.

      We thank the reviewer for this comment, which helped us to significantly strengthen our story. Prompted by his/her comment, we have now queried three different datasets for independent validation of our logic models. Specifically, we have taken advantage of quantitative phosphoproteomic datasets of FLT3-ITD cell lines treated with TKIs (Massacci et al., 2023), phosphoproteomic data of FLT3-ITD positive patient-derived primary blasts (Kramer et al., 2022) and drug sensitivity data on primary FLT3-ITD positive AML samples (BEAT-AML consortia).

      • Comparison with phosphoproteomic data of FLT3-ITD cell lines treated with TKIs (Massacci et al., 2023)

      Here, we compared the steady state of our model upon FLT3 inhibition with the phosphoproteomic data describing the modulation of 16,319 phosphosites in FLT3-ITD BaF3 cells (FLT3ITD-TKD and FLT3ITD-JMD) upon TKI treatment (i.e. quizartinib, a highly selective FLT3 inhibitor). As shown in the table below and new Figure S5A, the activation status of the nodes in the two generated models is highly comparable with the level of regulatory phosphorylations reported in the reference dataset. Briefly, to determine the agreement between each model and the independent dataset, we focused on the phosphorylation level of specific residues that (i) regulate the functional activity of sentinel proteins (denoted in the ‘Mode of regulation’ column) and (ii) were measured in this work to train the model. We then cross-referenced the sentinel protein status in the FLT3 inhibition simulation (as denoted in the 'Model simulation of FLT3 inhibition' column) with the functional impact of phosphorylation measured in the Massacci et al. dataset (as denoted in the 'Functional impact in quizartinib dataset' column). Points of congruence were summarized in the 'Consensus' column. As an example, if the phosphorylation level of an activating residue decreases (e.g., Y185 of Mapk1), we can conclude that the protein is inhibited (‘Down-reg’), and this is coherent with the model simulation in which Mapk1 is ‘Inactive’.

      Author response image 3.
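
      The cross-referencing rule described above for this comparison can be written as a short sketch. The labels mirror the column values quoted in the text ('Down-reg', 'Inactive'); the remaining labels ('Up-reg', 'Active', "activating", "inhibiting", "up", "down") and the function names are assumptions introduced for illustration.

```python
def functional_impact(mode_of_regulation, phospho_change):
    """Translate a phosphosite change into its functional impact on the protein.

    mode_of_regulation: "activating" or "inhibiting" (role of the residue)
    phospho_change: "up" or "down" (change in phosphorylation upon treatment)
    """
    if mode_of_regulation not in ("activating", "inhibiting"):
        raise ValueError("unknown mode of regulation")
    if phospho_change not in ("up", "down"):
        raise ValueError("unknown phosphorylation change")
    residue_activates = mode_of_regulation == "activating"
    phospho_increases = phospho_change == "up"
    # More phosphorylation on an activating site, or less on an inhibiting
    # site, both mean the protein is functionally up-regulated.
    return "Up-reg" if residue_activates == phospho_increases else "Down-reg"


def consensus(model_state, impact):
    """True when the simulated node state agrees with the measured functional impact."""
    expected_impact = {"Active": "Up-reg", "Inactive": "Down-reg"}
    return expected_impact[model_state] == impact


# Example from the text: an activating residue (Y185 of Mapk1) loses
# phosphorylation, and the model simulation reports Mapk1 as 'Inactive'.
impact = functional_impact("activating", "down")
print(impact, consensus("Inactive", impact))
```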

      • Comparison with phosphoproteomic data of FLT3-ITD patient-derived primary blasts (Kramer et al., 2022)

      Using the same criteria, we extended our validation efforts by comparing the activity status of the proteins in the “untreated” simulation (i.e. reproducing the tumorigenic state where FLT3, IGF1R and TNFR are set to be active) with their phosphorylation levels in the dataset by Kramer et al. (Kramer et al., 2022). Briefly, this dataset gathers phosphoproteomic data from a cohort of 44 AML patients, and we restricted the analysis to the 11 FLT3-ITD-positive patients. Importantly, all these patients carry the ITD mutation in the juxtamembrane domain (JMD), thus allowing a comparison exclusively with the FLT3 ITD-JMD specific Boolean model.

      The results are shown in the heatmap below. Each cell in the heatmap reports the phosphorylation level of sentinel proteins’ residues in the indicated patient (red and blue indicate up- or down-regulated phosphoresidues, respectively). Patients were clustered according to Pearson correlation. We observed a good level of agreement between the patients’ phosphoproteomic data and our model (reported in the column “Tumor simulation steady state”) for a subset of patients highlighted within the black rectangle. However, for the remaining patients, the level of agreement is poor. The main reason is that our work focuses on FLT3-ITD signaling, and a systematic translation of the Boolean modeling approach to the entire cohort of AML patients would require including the impact of other driver mutations in the network. This is a current and future line of investigation of our group. We have revised the discussion, taking this result into consideration.

      Author response image 4.

      • Comparison with drug sensitivity data on primary FLT3-ITD positive AML samples (BEAT-AML consortia)

      Here we took advantage of the Beat AML programme, which includes a cohort of 672 tumour specimens collected from 562 patients. The BEAT AML consortium provides whole-exome sequencing, RNA sequencing and analyses of ex vivo drug sensitivity for this large cohort of patient-derived primary blasts. We focused on the drug sensitivity screening of 134 patients carrying the typical FLT3-ITD mutation in the JMD region. Unfortunately, the ITD insertion in the TKD region is less well characterized, and additional in-depth sequencing studies are required to identify FLT3ITD-TKD positive blasts in this cohort. Next, we focused on those compounds hitting nodes present in the FLT3ITD-JMD Boolean model. Specifically, we selected drugs inhibiting FLT3, PI3K, mTOR, JNK and p38, and we calculated the average IC50 of FLT3ITD-JMD patient-derived primary blasts for each drug. These results are reported as a bar graph in the new Fig. S5B and below (upper panel) and were compared with the apoptotic and proliferation rates measured in the in silico simulation of the FLT3ITD-JMD Boolean model. Drug sensitivity screening on primary FLT3ITD-JMD blasts revealed that inhibition of FLT3, PI3K and mTOR induces cell death at low drug concentrations, in contrast with JNK and p38 inhibitors, which show higher IC50 values. These observations are consistent with our simulation results for the FLT3ITD-JMD model. As expected, in silico inhibition of FLT3 greatly impacts apoptosis and proliferation. Additionally, in silico suppression of mTOR and, to a lesser extent, PI3K and p38 affects apoptosis and proliferation. Of note, JNK inhibition does not seem to affect the viability of FLT3ITD-JMD cells either in silico or in vitro.

      Author response image 5.
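
      The per-drug averaging step described above for the drug sensitivity comparison can be sketched as follows. The records, drug labels and IC50 values are hypothetical placeholders and are not taken from the Beat AML dataset; the sketch uses an arithmetic mean, which is an assumption about how the "average IC50" was computed.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (drug, patient, IC50 in µM) records, NOT real Beat AML values
RECORDS = [
    ("FLT3 inhibitor", "patient_1", 0.1),
    ("FLT3 inhibitor", "patient_2", 0.3),
    ("JNK inhibitor", "patient_1", 8.0),
    ("JNK inhibitor", "patient_2", 12.0),
]


def mean_ic50_by_drug(records):
    """Average the IC50 values across patients for each drug."""
    grouped = defaultdict(list)
    for drug, _patient, ic50 in records:
        grouped[drug].append(ic50)
    return {drug: mean(values) for drug, values in grouped.items()}


for drug, average in mean_ic50_by_drug(RECORDS).items():
    print(f"{drug}: mean IC50 = {average:.2f} µM")
```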

      Altogether these publicly available datasets independently validate our models, strengthening the reliability and robustness of our approach.

      We have now revised the main text (pg. 8; 9) and added a new Figure (Fig. S5) in the supplementary material; we collected the results of the analysis in Table S6.

      (4) There are additional examples of insufficient experimental detail that preclude a fuller appreciation of the relevance of the work. For example, it is alluded that RNA-sequencing was performed on a subset of patients, but the entire methodological section detailing the RNA-seq amounts to just 3 lines! It is unclear which samples were selected for sequencing nor where the data has been deposited (or might be available for the community - there are resources for restricted/controlled access to deidentified genomics/transcriptomics data).

      We apologize for the lack of description regarding the RNA sequencing of patient samples. We have now added details of this approach in the methods section (pg. 24) and clearly explained in the text how we selected the patients for the analysis. Additionally, the data have now been deposited in the GEO database (accession number: GSE247483).

      The sentences we have rephrased are below:

      “We analyzed the mutational and expression profiles of 262 genes (Table S7) relevant to hematological malignancies in a cohort of 14 FLT3-ITD positive de novo AML patients (Fig. 5A, panel a). Since follow-up clinical data were available for 10 out of 14 patients (Fig. 5B, Table S9), we focused on this subset of patients. Briefly, the classification of these 10 patients according to their ITD localization (see Methods) was as follows: 8 patients with FLT3ITD-JMD, 4 with FLT3ITD-JMD+TKD, and 2 with FLT3ITD-TKD (Fig. 5A, panel b). The specific insertion sites of the ITD in the patient cohort are shown in Table S8.”

      Similarly, in the "combinatory treatment inference" methods, it states "...we computed the steady state of each cell line best model....." and "Then we inferred the activity of "apoptosis" and "proliferation" phenotypes", without explaining the details of how these were done. The outcomes of these methods are directly relevant to Fig 4, but with such sparse methodological detail, it is difficult to independently assess the validity of the presented data.

      Overall, the theoretical nature of the work is hampered by real-world validation, and insufficient methodological details limit a fuller appreciation of the overall relevance of this work.

      We thank the reviewer for the insightful feedback regarding the methodology in our paper. Regarding ‘real-world validation’, we have replied extensively to this issue in point 3 (pg. 9-14 of this document). As for the ‘insufficient methodological details’, we have made substantial improvements to enhance clarity and reproducibility, which encompass: (i) revisions in the main text and in the Materials and Methods section; (ii) a detailed explanation of each step and decision taken, which can be accessed both as an extended Materials and Methods section (Supplementary material, pg. 14-19) and through our GitHub repository (https://github.com/SaccoPerfettoLab/FLT3-ITD_driven_AML_Boolean_models). We sincerely hope this addition addresses the concerns and facilitates a more thorough and independent assessment of our work.

      Reviewer #3 (Public Review):

      Summary:

      The paper "Unveiling the signaling network of FLT3-ITD AML improves drug sensitivity prediction" reports the combination of prior knowledge signaling networks, multiparametric cell-based data on the activation status of 14 crucial proteins emblematic of the cell state downstream of FLT3 obtained under a variety of perturbation conditions and Boolean logic modeling, to gain mechanistic insight into drug resistance in acute myeloid leukemia patients carrying the internal tandem duplication in the FLT3 receptor tyrosine kinase and predict drug combinations that may reverse pharmacoresistant phenotypes. Interestingly, the utility of the approach was validated in vitro, and also using mutational and expression data from 14 patients with FLT3-ITD positive acute myeloid leukemia to generate patient-specific Boolean models.

      Strengths:

      The model predictions were positively validated in vitro: it was predicted that the combined inhibition of JNK and FLT3, may reverse resistance to tyrosine kinase inhibitors, which was confirmed in an appropriate FLT3 cell model by comparing the effects on apoptosis and proliferation of a JNK inhibitor and midostaurin vs. midostaurin alone.

      Whereas the study does have some complexity, readability is enhanced by the inclusion of a section that summarizes the study design, plus a summary Figure. Availability of data as supplementary material is also a high point.

      We thank the reviewer for his/her constructive comments about our manuscript. We believe that our story has been significantly strengthened by the changes and new data we provided.

      Weaknesses:

      (1) Some aspects of the methodology are not properly described (for instance, no methodological description has been provided regarding the clustering procedure that led to Figs. 2C and 2D).

      We apologize for the lack of a proper description of the methodology. We have extensively revised the methods section and worked to improve its clarity. We have now added a description of the clustering procedures used for the new Fig. S2D and Fig. S2E in the methods section (pg. 19).

      It is not clear in the manuscript whether the patients gave their consent to the use of their data in this study, or the approval from an ethical committee. These are very important points that should be made explicit in the main text of the paper.

      We thank the reviewer for this comment. We have now added the following sentence (pg. 24): “Peripheral blood (PB) samples from 14 AML patients were obtained upon patient’s informed consent.”

      The authors claim that some of the predictions of their models were later confirmed in the follow-up of some of the 14 patients, but it is not crystal clear whether the models helped the physicians to make any decisions on tailored therapeutic interventions, or if this has been just a retrospective exercise and the predictions of the models coincide with (some of) the clinical observations in a rather limited group of patients. Since the paper presents this as additional validation of the models' ability to guide personalized treatment decisions, it would be very important to clarify this point and expand the presentation of the results (comparison of observations vs. model predictions).

      As described in the introduction section, this study was inspired by an urgent clinical problem in AML research: patients carrying the ITD in the TKD domain of the FLT3 receptor display poor prognosis and do not respond to the current therapy, midostaurin (which, on the other hand, is effective in patients with the ITD in the JMD domain).

      To fill this gap, we gathered a team of 18 participants, of whom 7 have a clinical background and expertise in the diagnosis, treatment and management of AML patients and 5 are experts in Boolean modeling. The scope of the project is the development of a computational approach to identify possible alternative solutions for FLT3ITD-TKD AML patients, generating future lines of investigation. Drug combinations are currently under investigation as a potential means of avoiding drug resistance and achieving more effective and durable treatment responses. However, it is impractical to test for potential synergistic properties among all available drugs using empirical experiments alone. With our approach, we developed models that recreate in silico the main differences in the signaling of sensitive and resistant cells to support the prioritization of novel therapies. Prompted by the reviewer's suggestions, we have now extended the validation of our models through comparison with publicly available cell line and patient-derived datasets. We have also confirmed our results by performing in vitro experiments in patient-derived primary blasts treated with midostaurin and/or the JNK inhibitor. Importantly, we have already demonstrated that hitting cell cycle regulators in FLT3ITD-TKD cells can be an effective approach to kill resistant leukemia cells (Massacci et al., 2023; Pugliese et al., 2023). We are aware that changing clinical practice and patient therapies requires a proper clinical study, which goes far beyond the scope of this manuscript.

      However, we hope that our results can soon be translated from “bench to bedside”. Importantly, we believe that our study can open lines of investigation aimed at applying our approach to identify promising therapeutic strategies in other clinical settings.

      Recommendations for the authors

      The reviewers have highlighted significant issues regarding the inadequate level of evidence to support some of the conclusions, plus lack of an exhaustive methodological description that may jeopardize reproducibility.

      We hope that the editor and the reviewers will appreciate the extensive revision we made and new data and analysis we provided to strengthen our story.

      Reviewer #1 (Recommendations For The Authors):

      (1) In Fig 2D the hierarchical tree is off-set in relation to the treatment symbols and names in the middle of the Figure. In addition, I do not see FLT3i combination with JNKi in the JMD cells (perhaps, a coloring error?).

      We thank the reviewer for this observation. We have revised the hierarchical tree, which is now in Figure S2D: we aligned the tree with the symbols and names and corrected the colouring error for the FLT3i+JNKi sample in JMD cells.

      (2) Midostaurin and PKC412 refer to the same drug and are used interchangeably in the manuscript. Using one name consistently would improve readability.

      We have now improved the readability of the text and the Figures by choosing “Midostaurin” when we refer to the FLT3 inhibitor.

      (3) It is not clear to me why the FLT3-ITD-JMD cells are not presented in Fig. 4B. Perhaps their values are 0? In that case, the readability would be improved by including a thin blue line representing zero values. Additionally, on p.8 the authors state "Interestingly, in the FLT3ITDTKD model, the combined inhibition of JNK and FLT3, exclusively, in silico restores the TKI sensitivity, as revealed by the evaluation of the apoptosis and proliferation levels (Fig. 4B-C)." but Fig. 4C shows no differential effects of JNK inhibition in sensitive versus resistant cells.

      To address the reviewer's point, we have added a thin blue line representing the zero values of the FLT3ITD-JMD cells to the simulation results in Figure 4B. Regarding Figure 4C, the reviewer is right that there is no difference in proliferation between sensitive and resistant cells upon JNKi and FLT3i co-inhibition. However, proliferation levels are lower in both cell lines than in the “untreated” condition. Indeed, the simulation suggests that combining JNK and FLT3 inhibition reverts the resistant phenotype, lowering the proliferation rate of the resistant cells to TKI-sensitive levels.
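
      To make the logic of such an in silico comparison concrete, the following is a minimal sketch of a synchronous Boolean simulation. It is a toy only: the four nodes, their wiring, and the `resistant` switch are hypothetical choices made for illustration and are not the trained models described in the manuscript.

```python
# Toy Boolean network. Node names and update rules are hypothetical
# and are NOT taken from the authors' trained models.

def step(state, inhibited, resistant):
    """Perform one synchronous update; inhibited nodes are clamped to False."""
    nxt = {
        "FLT3": True,  # assumed constitutively active unless drugged
        "JNK": True,   # assumed constitutively active unless drugged
        # Assumption: in resistant cells JNK can sustain survival without FLT3.
        "SURVIVAL": state["FLT3"] or (resistant and state["JNK"]),
        "APOPTOSIS": not state["SURVIVAL"],
    }
    for node in inhibited:
        nxt[node] = False
    return nxt


def simulate(inhibited, resistant, n_steps=10):
    """Run the toy network from an untreated start and return the final state."""
    state = {"FLT3": True, "JNK": True, "SURVIVAL": True, "APOPTOSIS": False}
    for node in inhibited:
        state[node] = False
    for _ in range(n_steps):
        state = step(state, inhibited, resistant)
    return state


for label, resistant in (("sensitive", False), ("resistant", True)):
    for drugs in (set(), {"FLT3"}, {"FLT3", "JNK"}):
        final = simulate(drugs, resistant)
        print(label, sorted(drugs), "apoptosis =", final["APOPTOSIS"])
```

      In this toy wiring, FLT3 inhibition alone triggers apoptosis only in the sensitive variant, whereas the combined inhibition does so in both, which is the qualitative pattern the response describes.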

      Reviewer #2 (Recommendations For The Authors):

      I have addressed a number of concerns in the public review. Much better effort needs to be made to provide sufficient methodological detail (to permit independent validation by a sufficiently capable and motivated party) and explain the rationale of important parameter selections. Furthermore, I urge the authors to take advantage of the plethora of publicly available real-world data to validate their predicted outcomes.

      We are grateful to the reviewer for the careful review. All the aspects raised have been discussed in the specific sections of the public review. In summary, we have provided more methodological detail by revising the text and the methods section, by adding a new step-by-step description of the modelling strategy, the parameters and the criteria adopted in each phase (supplementary methods), and by referring to the entire code developed. Prompted by the reviewer's suggestions, we have performed a novel and extensive comparison of our model with three different publicly available datasets. This analysis significantly strengthens our story, and a new supplementary Figure (Fig. S5) summarizes our findings (pg. 9-14 of this document).

      Reviewer #3 (Recommendations For The Authors):

      (1) At first sight, the distribution of the data points in the PCA space does not really seem to speak of nice clustering. Have the authors computed any clustering validation metric to assess if their clustering strategy is adequate and how informative the results are? Further analysis of this point of the article is precluded by the absence of a clear methodological description.

      Here we used PCA to obtain a global view of our complex multiparametric data. We have now reworked the PCA to improve its readability. As shown in the new Figure 2D, the PCA showed that the activity level of sentinel proteins stratifies cells according to FLT3 activation status (component 1: presence vs absence of FLT3i) and cytokine stimulation (component 2: IGF1 vs TNFα). We have now added new experimental details on this part in the methods section (pg. 19) and we deposited the code used for the clustering strategy on the GitHub repository (https://github.com/SaccoPerfettoLab/FLT3ITD_driven_AML_Boolean_models).
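
      For readers wondering what a clustering validation metric of the kind the reviewer mentions could look like, below is a minimal sketch of the mean silhouette coefficient computed on two-dimensional coordinates. This response does not state which metric, if any, was computed; the coordinates and group labels in the example are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient with Euclidean distance (pure Python)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point itself contributes a distance of zero to the sum).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical (PC1, PC2) coordinates for four samples.
coords = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
groups = ["FLT3i", "FLT3i", "no FLT3i", "no FLT3i"]
print(silhouette_score(coords, groups))
```

      Values close to 1 indicate compact, well-separated groups, values near 0 indicate overlapping groups, and negative values suggest that samples sit closer to another group than to their own.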

      (2) Whereas scientists and medical professionals who work in the field of oncology may be familiar with some of the abbreviations used here, it would be good for improved readability by a more general audience to make sure that all the abbreviations (e.g., TKI) are properly defined the first time that they appear in the text.

      We thank the reviewer for this observation. To improve the readability of the text, we properly defined all the abbreviations in their first appearance, and we added the “Abbreviation” paragraph at page 15 of the manuscript to summarize them all.

      (3) How were the concentrations of the combined treatments chosen in the cell assays used as validation?

      We thank the reviewer for giving us the chance to clarify this point. We have expanded the Methods with additional information about the treatments used in the validations. We detailed the SP600125 IC50 evaluation and usage in our cell lines (pg.22): IC50 values are approximately 1.5 µM in FLT3-ITD mutant cell lines; the SP600125 treatment affects cell viability, reaching a plateau phase of cell death at about 2 µM. We used the minimal dose of SP600125 (10 µM) to properly inhibit JNK (Kim et al., 2010; Moon et al., 2009).

      We also specified (pg.22) that the concentration of Midostaurin was chosen based on previously published work (Massacci et al., 2022): FLT3 ITD-TKD cells treated with 100 nM Midostaurin show a lower apoptotic rate and higher cell viability compared to FLT3 ITD-JMD cells.

      The concentrations of SB203580 and UO126 were chosen based on previous data available in the lab and on set-up experiments (pg.22).

      (4) The authors say that "we were able to derive patient-specific signaling features and enable the identification of potential tailored treatments restoring TKI resistance" and that "our predictions were confirmed by follow-up clinical data for some patients". However, the results section on this part of the manuscript is rather scarce (the main text should be much more descriptive about the results summarized in Fig. 5, which are not self-explanatory).

      We thank the reviewer for this observation. We have now expanded the text to provide a more comprehensive description of the results about personalized Boolean model generation and usage and the content presented in Fig. 5 (pg.10-12).

      (5) I do not really agree with the final conclusion about this paper being "the proof of concept that our personalized informatics approach described here is clinically valid and will enable us to propose novel patient-centered targeted drug solutions". First, the clinical data used here belongs to a rather low number of patients. Second, as mentioned before, it is not clear if the models have been used to make any prospective decision or if this conclusion is drawn from an in vitro assay plus a retrospective analysis on a limited number of patients. Moreover, a description of the results and the discussion of the part of the manuscript dealing with patient-specific models is rather scarce, and it is difficult to see how the authors support their conclusions. Also, the statement "In principle, the generalization of our strategy will enable to obtain a systemic perspective of signaling rewiring in different cancer types, driving novel personalized approaches" may be a bit overoptimistic if one considers that so far, the approach has only been applied to a single type of drug-resistant cancer.

      We thank the reviewer for this comment. We agree with the reviewer that the clinical data we used come from a rather small number of patients. However, during the revision we have worked extensively to support the clinical relevance of our models and our discoveries. Specifically, we have compared our Boolean logic models with two different publicly available datasets on phosphoproteomics and drug sensitivity of FLT3ITD-JMD and FLT3ITD-TKD cell lines and blasts (Fig. S5 and answer to reviewer 2, point 3). Importantly, these datasets independently validated our models, highlighting that our approach has translational value. Additionally, we have performed novel experiments measuring the apoptotic rate of patient-derived primary blasts upon pharmacological suppression of JNK (Fig. 4H, pg. 10 of main text). Our data highlight that our approach has the potential to suggest novel effective treatments.

      That said, we have now revised the discussion to avoid overstatements.

      References

      Arreba-Tutusaus, P., Mack, T.S., Bullinger, L., Schnöder, T.M., Polanetzki, A., Weinert, S., Ballaschk, A., Wang, Z., Deshpande, A.J., Armstrong, S.A., Döhner, K., Fischer, T., Heidel, F.H., 2016. Impact of FLT3-ITD location on sensitivity to TKI-therapy in vitro and in vivo. Leukemia 30, 1220–1225. https://doi.org/10.1038/leu.2015.292

      Blinov, M.L., Moraru, I.I., 2012. Logic modeling and the ridiculome under the rug. BMC Biol 10, 92. https://doi.org/10.1186/1741-7007-10-92

      Dorier, J., Crespo, I., Niknejad, A., Liechti, R., Ebeling, M., Xenarios, I., 2016. Boolean regulatory network reconstruction using literature based knowledge with a genetic algorithm optimization method. BMC Bioinformatics 17, 410. https://doi.org/10.1186/s12859-016-1287-z

      Kramer, M.H., Zhang, Q., Sprung, R., Day, R.B., Erdmann-Gilmore, P., Li, Y., Xu, Z., Helton, N.M., George, D.R., Mi, Y., Westervelt, P., Payton, J.E., Ramakrishnan, S.M., Miller, C.A., Link, D.C., DiPersio, J.F., Walter, M.J., Townsend, R.R., Ley, T.J., 2022. Proteomic and phosphoproteomic landscapes of acute myeloid leukemia. Blood 140, 1533–1548. https://doi.org/10.1182/blood.2022016033

      Massacci, G., Venafra, V., Latini, S., Bica, V., Pugliese, G.M., Graziosi, S., Klingelhuber, F., Krahmer, N., Fischer, T., Mougiakakos, D., Boettcher, M., Perfetto, L., Sacco, F., 2023. A key role of the WEE1-CDK1 axis in mediating TKI-therapy resistance in FLT3-ITD positive acute myeloid leukemia patients. Leukemia 37, 288–297. https://doi.org/10.1038/s41375-022-01785-w

      Pugliese, G.M., Venafra, V., Bica, V., Massacci, G., Latini, S., Graziosi, S., Fischer, T., Mougiakakos, D., Boettcher, M., Perfetto, L., Sacco, F., 2023. Impact of FLT3-ITD location on cytarabine sensitivity in AML: a network-based approach. Leukemia 37, 1151–1155. https://doi.org/10.1038/s41375-023-01881-5

      Rücker, F.G., Du, L., Luck, T.J., Benner, A., Krzykalla, J., Gathmann, I., Voso, M.T., Amadori, S., Prior, T.W., Brandwein, J.M., Appelbaum, F.R., Medeiros, B.C., Tallman, M.S., Savoie, L., Sierra, J., Pallaud, C., Sanz, M.A., Jansen, J.H., Niederwieser, D., Fischer, T., Ehninger, G., Heuser, M., Ganser, A., Bullinger, L., Larson, R.A., Bloomfield, C.D., Stone, R.M., Döhner, H., Thiede, C., Döhner, K., 2022. Molecular landscape and prognostic impact of FLT3-ITD insertion site in acute myeloid leukemia: RATIFY study results. Leukemia 36, 90–99. https://doi.org/10.1038/s41375-021-01323-0

      Terfve, C., Cokelaer, T., Henriques, D., MacNamara, A., Goncalves, E., Morris, M.K., van Iersel, M., Lauffenburger, D.A., Saez-Rodriguez, J., 2012. CellNOptR: a flexible toolkit to train protein signaling networks to data using multiple logic formalisms. BMC Syst Biol 6, 133. https://doi.org/10.1186/1752-0509-6-133

      Traynard, P., Tobalina, L., Eduati, F., Calzone, L., Saez-Rodriguez, J., 2017. Logic Modeling in Quantitative Systems Pharmacology. CPT Pharmacometrics Syst. Pharmacol. 6, 499–511. https://doi.org/10.1002/psp4.12225

    1. Author Response

      The following is the authors’ response to the original reviews.

      We thank you for the time you took to review our work and for your feedback!

      The major changes to the manuscript are:

      1) Prompted by multiple reviewers, we have replaced the statistical analysis in Figure 1L with a bootstrap analysis, added an ANOVA (in Table S1), and have also added the same analysis with mice as a statistical unit as Figure S4J to the manuscript.

      2) In response to reviewer 1, comment 3, we have replaced the response latency maps previously shown in Figures 3B, 3C, 3E and 3F with response amplitude maps.

      3) In response to reviewer 2, comment 1, we have added a variant of the response traces shown in Figures 3B, 3C, 3E and 3F with mice as the statistical unit as Figures S2C and S2D.

      4) In response to reviewer 2, public review, we have added data from additional experiments as Figures S6F-S6H, that control for the effect of a saline injection.

      A detailed point-by-point response to all reviewer concerns is provided in the following.  

      Reviewer #1 (Public Review):

      The authors present a study of visuo-motor coupling primarily using wide-field calcium imaging to measure activity across the dorsal visual cortex. They used different mouse lines or systemically injected viral vectors to allow imaging of calcium activity from specific cell-types with a particular focus on a mouse-line that expresses GCaMP in layer 5 IT (intratelencephalic) neurons. They examined the question of how the neural response to predictable visual input, as a consequence of self-motion, differed from responses to unpredictable input. They identify layer 5 IT cells as having a different response pattern to other cell-types/layers in that they show differences in their response to closed-loop (i.e. predictable) vs open-loop (i.e. unpredictable) stimulation whereas other cell-types showed similar activity patterns between these two conditions. They analyze the latencies of responses to visuomotor prediction errors obtained by briefly pausing the display while the mouse is running, causing a negative prediction error, or by presenting an unpredicted visual input causing a positive prediction error. They suggest that neural responses related to these prediction errors originate in V1; however, I would caution against overinterpretation of this finding as judging the latency of slow calcium responses in wide-field signals is very challenging and this result was not statistically compared between areas. Surprisingly, they find that presentation of a visual grating actually decreases the responses of L5 IT cells in V1. They interpret their results within a predictive coding framework that the last author has previously proposed. The response pattern of the L5 IT cells leads them to propose that these cells may act as 'internal representation' neurons that carry a representation of the brain's model of its environment, though this is rather speculative. They subsequently examine the responses of these cells to anti-psychotic drugs (e.g. clozapine) with the reasoning that a leading theory of schizophrenia is a disturbance of the brain's internal model and/or a failure to correctly predict the sensory consequences of self-movement. They find that anti-psychotic drugs strongly enhance responses of L5 IT cells to locomotion while having little effect on other cell-types. Finally, they suggest that anti-psychotics reduce long-range correlations between (predominantly) L5 cells and reduce the propagation of prediction errors to higher visual areas and suggest this may be a mechanism by which these drugs reduce hallucinations/psychosis.

      This is a large study containing a screening of many mouse-lines/expression profiles using wide-field calcium imaging. Wide-field imaging has its caveats, including a broad point-spread function of the signal and susceptibility to hemodynamic artifacts, which can make interpretation of results difficult. The authors acknowledge these problems and directly address the hemodynamic occlusion problem. It was reassuring to see supplementary 2-photon imaging of somata to complement this data-set, even though this is rather briefly described in the paper. Overall the paper's strengths are its identification of a very different response profile in the L5 IT cells compared to other layers/cell-types, which suggests an important role for these cells in handling integration of self-motion generated sensory predictions with sensory input. The interpretation of the responses to anti-psychotic drugs is more speculative but the result appears robust and provides an interesting basis for further studies of this effect with more specific recording techniques and possibly behavioral measures.

      We thank the reviewer for the feedback and the help with improving the manuscript. We agree that the findings presented in this study are merely a starting point. The two questions we are currently pursuing in follow-up work are:

      1) Do the findings generalize to all known antipsychotic drugs?

      2) What is the mechanism by which these drugs induce a decorrelation of activity, specifically in layer 5 neurons?

      But we suspect these questions will take at least a few more years of research to answer.

      Reviewer #2 (Public Review):

      Summary:

      This work investigates the effects of various antipsychotic drugs on cortical responses during visuomotor integration. Using wide-field calcium imaging in a virtual reality setup, the researchers compare neuronal responses to self-generated movement during locomotion-congruent (closed loop) or locomotion-incongruent (open loop) visual stimulation. Moreover, they probe responses to unexpected visual events (halt of visual flow, sudden-onset drifting grating). The researchers find that, in contrast to a variety of excitatory and inhibitory cell types, genetically defined layer 5 excitatory neurons distinguish between the closed and the open loop condition and exhibit activity patterns in visual cortex in response to unexpected events, consistent with unsigned prediction error coding. Motivated by the idea that prediction error coding is aberrant in psychosis, the authors then inject the antipsychotic drug clozapine, and observe that this intervention specifically affects closed loop responses of layer 5 excitatory neurons, blunting the distinction between the open and closed loop conditions. Clozapine also leads to a decrease in long-range correlations between L5 activity in different brain regions, and similar effects are observed for two other antipsychotics, aripiprazole and haloperidol, but not for the stimulant amphetamine. The authors suggest that altered prediction error coding in layer 5 excitatory neurons due to reduced long-range correlations in L5 neurons might be a major effect of antipsychotic drugs and speculate that this might serve as a new biomarker for drug development.

      Strengths:

      • Relevant and interesting research question:

      The distinction between expected and unexpected stimuli is blunted in psychosis but the neural mechanisms remain unclear. Therefore, it is critical to understand whether and how antipsychotic drugs used to treat psychosis affect cortical responses to expected and unexpected stimuli. This study provides important insights into this question by identifying a specific cortical cell type and long-range interactions as potential targets. The authors identify layer 5 excitatory neurons as a site where functional effects of antipsychotic drugs manifest. This is particularly interesting as these deep layer neurons have been proposed to play a crucial role in computing the integration of predictions, which is thought to be disrupted in psychosis. This work therefore has the potential to guide future investigations on psychosis and predictive coding towards these layer 5 neurons, and ultimately improve our understanding of the neural basis of psychotic symptoms.

      • Broad investigation of different cell types and cortical regions:

      One of the major strengths of this study is its quasi-systematic approach towards cell types and cortical regions. By analysing a wide range of genetically defined excitatory and inhibitory cell types, the authors were able to identify layer 5 excitatory neurons as exhibiting the strongest responses to unexpected vs. expected stimuli and being the most affected by antipsychotic drugs. Hence, this quasi-systematic approach provides valuable insights into the functional effects of antipsychotic drugs on the brain, and can guide future investigations towards the mechanisms by which these medications affect cortical neurons.

      • Bridging theory with experiments

      Another strength of this study is its theoretical framework, which is grounded in the predictive coding theory. The authors use this theory as a guiding principle to motivate their experimental approach connecting visual responses in different layers with psychosis and antipsychotic drugs. This integration of theory and experimentation is a powerful approach to tie together the various findings the authors present and to contribute to the development of a coherent model of how the brain processes visual information both in health and in disease.

      Weaknesses:

      • Unclear relevance for psychosis research

      From the study, it remains unclear whether the findings might indeed be able to normalise altered predictive coding in psychosis. Psychosis is characterised by a blunted distinction between predicted and unpredicted stimuli. The results of this study indicate that antipsychotic drugs further blunt the distinction between predicted and unpredicted stimuli, which would suggest that antipsychotic drugs would deteriorate rather than ameliorate the predictive coding deficit found in psychosis. However, these findings were based on observations in wild-type mice at baseline. Given that antipsychotics are thought to have little effect in health but potent antipsychotic effects in psychosis, it seems possible that the presented results might be different in a condition modelling a psychotic state, for example after a dopamine-agonistic or an NMDA-antagonistic challenge. Therefore, future work in models of psychotic states is needed to further investigate the translational relevance of these findings.

      • Incomplete testing of predictive coding interpretation

      While the investigation of neuronal responses to different visual flow stimuli is interesting, it remains open whether these responses indeed reflect internal representations in the framework of predictive coding. While the responses are consistent with internal representation as defined by the researchers, i.e., unsigned prediction error signals, an alternative interpretation might be that responses simply reflect sensory bottom-up signals that are more related to some low-level stimulus characteristics than to prediction errors. Moreover, this interpretational uncertainty is compounded by the fact that the used experimental paradigms were not suited to test whether behaviour is impacted as a function of the visual stimulation, which makes it difficult to assess what the internal representation of the animal actually was. For these reasons, the observed effects might reflect simple bottom-up sensory processing alterations and not necessarily have any functional consequences. While this potential alternative explanation does not detract from the value of the study, future work would be needed to explain the effect of antipsychotic drugs on responses to visual flow. For example, experimental designs that systematically vary the predictive strength of coupled events or that include a behavioural readout might be more suited to draw conclusions about whether antipsychotic drugs indeed alter internal representations.

      • Methodological constraints of experimental design

      While the study findings provide valuable insights into the potential effects of antipsychotic drugs, it is important to acknowledge that there may be some methodological constraints that could impact the interpretation of the results. More specifically, the experimental design does not include a negative control condition or different doses. These conditions would help to ensure that the observed effects are not due to unspecific effects related to injection-induced stress or time, and not confined to a narrow dose range that might or might not reflect therapeutic doses used in humans. Hence, future work is needed to confirm that the observed effects indeed represent specific drug effects that are relevant to antipsychotic action.

      Conclusion:

      Overall, the results support the idea that antipsychotic drugs affect neural responses to predicted and unpredicted stimuli in deep layers of cortex. Although some future work is required to establish whether this observation can indeed be explained by a drug-specific effect on predictive coding, the study provides important insights into the neural underpinnings of visual processing and antipsychotic drugs, which is expected to guide future investigations on the predictive coding hypothesis of psychosis. This will be of broad interest to neuroscientists working on predictive coding in health and in disease.

      We thank the reviewer for the feedback and the help with improving the manuscript.

      Regarding the concern of a lack of a negative control, we have repeated the correlation measurement experiments in a cohort of Tlx3-Cre x Ai148 mice that received injections of saline. This analysis is now shown in Figure S6F-S6H. Saline injections did not change correlations in L5 IT neurons. Combined with the absence of changes in the L5 IT correlation structure following amphetamine injections (Figures 7G – 7I), this suggests that unspecific effects related to stress of injection, or simply time, cannot explain the observed decorrelation effect of the antipsychotic drugs.

      And we fully agree, a lot more work is needed to confirm that the observed effects are specific and relevant to antipsychotic action.

      Reviewer #3 (Public Review):

      The study examines how different cell types in various regions of the mouse dorsal cortex respond to visuomotor integration and how antipsychotic drugs impact these responses. Specifically, in contrast to most cell types, the authors found that activity in Layer 5 intratelencephalic neurons (Tlx3+) and Layer 6 neurons (Ntsr1+) differentiated between open loop and closed loop visuomotor conditions. Focussing on Layer 5 neurons, they found that the activity of these neurons also differentiated between negative and positive prediction errors during visuomotor integration. The authors further demonstrated that the antipsychotic drugs reduced the correlation of Layer 5 neuronal activity across regions of the cortex, and impaired the propagation of visuomotor mismatch responses (specifically, negative prediction errors) across Layer 5 neurons of the cortex, suggesting a decoupling of long-range cortical interactions.

      The data when taken as a whole demonstrate that visuomotor integration in deeper cortical layers is different than in superficial layers and is more susceptible to disruption by antipsychotics. Whilst it is already known that deep layers integrate information differently from superficial layers, this study provides more specific insight into these differences. Moreover, this study provides a first step into understanding the potential mechanism by which antipsychotics may exert their effect.

      Whilst the paper has several strengths, the robustness of its conclusions is limited by its questionable statistical analyses. A summary of the paper's strengths and weaknesses follows.

      Strengths:

      The authors perform an extensive investigation of how different cortical cell types (including Layer 2/3, 4, 5, and 6 excitatory neurons, as well as PV, VIP, and SST inhibitory interneurons) in different cortical areas (including primary and secondary visual areas as well as motor and premotor areas) respond to visuomotor integration. This investigation provides strong support to the idea that deep layer neurons are indeed unique in their computational properties. This large data set will be of considerable interest to neuroscientists interested in cortical processing.

      The authors also provide several lines of evidence that visuomotor information is differentially integrated in deep vs. superficial layers. They show that this is true across experimental paradigms of visuomotor processing (open loop, closed loop, mismatch, drifting grating conditions) and experimental manipulations, with the demonstration that Layer 5 visuomotor integration is more sensitive to disruption by the antipsychotic drug clozapine, compared with cortex as a whole.

      The study further uses multiple drugs (clozapine, aripiprazole and haloperidol) to bolster its conclusion that antipsychotic drugs disrupt correlated cortical activity in Layer 5 neurons, and further demonstrates that this disruption is specific to antipsychotics, as the psychostimulant amphetamine shows no such effect.

      In widefield calcium imaging experiments, the authors effectively control for the impact of hemodynamic occlusions in their results, and try to minimize this impact using a crystal skull preparation, which performs better than traditional glass windows. Moreover, they examine key findings in widefield calcium imaging experiments with two-photon imaging.

      Weaknesses:

      A critical weakness of the paper is its statistical analysis. The study does not use mice as its independent unit for statistical comparisons but rather relies on other definitions, without appropriate justification, which results in an inflation of sample sizes. For example, in Figure 1, independent samples are defined as locomotion onsets, leading to sample sizes of approx. 400-2000 despite only using 6 mice for the experiment. This is only justified if the data from locomotion onsets within a mouse is actually statistically independent, which the authors do not test for, and which seems unlikely. With such inflated sample sizes, it becomes more likely to find spurious differences between groups as significant. It also remains unclear how many locomotion onsets come from each mouse; the results could be dominated by a small subset of mice with the most locomotion onsets. The more disciplined approach to statistical analysis of the dataset is to average the data associated with locomotion onsets within a mouse, and then use the mouse as an independent unit for statistical comparison. A second example, for instance, is in Figure 2L, where the independent statistical unit is defined as cortical regions instead of mice, with the left and right hemispheres counting as independent samples; again this is not justified. Is the activity of cortical regions within a mouse and across cortical hemispheres really statistically independent? The problem is apparent throughout the manuscript and for each data set collected. An additional statistical issue is that it is unclear if the authors are correcting for the use of multiple statistical tests (as in for example Figure 1L and Figure 2B,D). In general, the use of statistics by the authors is not justified in the text.

      Finally, it is important to note that whilst the study demonstrates that antipsychotics may selectively impact visuomotor integration in L5 neurons, it does not show that this effect is necessary or sufficient for the action of antipsychotics; though this is likely beyond the scope of the study it is something for readers to keep in mind.

      We thank the reviewer for the feedback and the help with improving the manuscript.

      Regarding the concerns of statistical analysis, this may partially be a misunderstanding. We apologize for the lack of clarity. For example, the data in Figures 1F-1K is indeed shown as averaged over locomotion onsets, but there is no statistical analysis performed in these panels. The unit for the statistical analysis shown in Figure 1L is brain area (not locomotion onset). A central tenet of the analysis shown in Figures 1L and 2 is that the effect of differential activation during closed and open loop locomotion onsets is not specific to visual areas of cortex. In visual areas of cortex, one would expect to find a difference. In essence, the surprising finding here is the lack of a difference in other cell types but L5 IT neurons. Thus, in the analyses of those figure panels we are testing whether the effect is present on average across all cortical areas. Hence, we chose the statistical unit of Figure 1L to be cortical areas, not mice. We have added the same analysis with mice as a statistical unit as Figure S4J.
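
      The choice of statistical unit discussed here can be illustrated with a short sketch in which trial-level responses are first averaged within each mouse, so that every mouse contributes exactly one value per condition. The data layout, mouse identifiers and response values below are hypothetical and are not taken from the study.

```python
from statistics import mean


def per_mouse_means(trials):
    """Collapse (mouse_id, response) pairs to one mean response per mouse."""
    by_mouse = {}
    for mouse_id, response in trials:
        by_mouse.setdefault(mouse_id, []).append(response)
    return {mouse: mean(values) for mouse, values in by_mouse.items()}


def paired_differences(closed_trials, open_trials):
    """Closed-minus-open difference for each mouse present in both conditions."""
    closed = per_mouse_means(closed_trials)
    opened = per_mouse_means(open_trials)
    shared = sorted(set(closed) & set(opened))
    return {mouse: closed[mouse] - opened[mouse] for mouse in shared}


# Hypothetical onset responses (arbitrary units); mouse "m1" has the most onsets.
closed_trials = [("m1", 2.0), ("m1", 4.0), ("m1", 3.0), ("m2", 1.0), ("m3", 2.0)]
open_trials = [("m1", 1.0), ("m1", 1.0), ("m2", 1.0), ("m3", 1.5)]

diffs = paired_differences(closed_trials, open_trials)
print(diffs)  # one value per mouse: n = 3 mice rather than 9 onsets
```

      Averaging within mice first prevents a mouse with many locomotion onsets from dominating the comparison, which is the concern the reviewer raises.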

      Reviewer #1 (Recommendations For The Authors):

      I have a few concerns and questions that I would like to see addressed:

      1) Figure 1L - the statistics are a little unusual here as the errors are across visual areas rather than across mice or hemispheres. This isn't ideal as ideally, we want to generalize the results across animals, not areas, and the results seem to be driven mostly by V1/RSC. I would like to see comparisons using mice as the statistical unit either in an ANOVA with areas as factors or post-hoc comparisons per area.

      Based on the assumption that visual cortex should respond to visual stimuli, we would have expected to find a difference between closed and open loop locomotion onset responses in all cell types in visual areas of cortex (a closed loop locomotion onset being the combination of locomotion and visual flow onset, while an open loop locomotion onset lacks the visual flow component). Thus, the first surprise was that in most cell types we found very little difference between these two locomotion onset types. Conversely, in Tlx3-positive L5 IT neurons the difference was apparent well outside of the visual areas of cortex (even though the difference was indeed strongest in V1/RSC). To quantify the extent to which closed and open loop locomotion onsets result in different activity patterns across dorsal cortex we performed the analyses shown in Figures 1L and 2. To make the point that the effect was observable on average across cortical areas, we used cortical area as a unit in Figure 1L. We have added the analysis shown in Figure 1L with mice as the statistical unit as Figure S4J and have added the ANOVA information to Table S1, as suggested.

      2) The reduction of activity of L5 IT cells in V1 after the presentation of gratings is curious. The authors suggest it might have been due to one population of cells tuned for the orientation of the presented grating suppressing the remaining cells leading to an aggregate negative response. However, they also observed this negative response in the 2p signal for individual somata. Presumably in the 2p data they could check their hypothesis - is there a group of cells that were tuned for the grating? Is it possible that for some reason the L5 IT cells in the 2p were not being activated by the grating because of their RF locations? How large were the gratings - I didn't see this in the methods section?

      We can certainly identify neurons that selectively increase activity to one particular grating. See Author response image 1, for vertical and horizontal gratings. The gratings were presented full-field on a toroidal screen that surrounded the mouse (240 degrees horizontal and 100 degrees vertical coverage of the visual field). This covered a large fraction of the field of view of the mouse. While we did not map receptive fields of individual neurons in this study, it is unlikely that the receptive fields of the neurons recorded were outside the stimulated area. We have made this clearer in the manuscript.

      Author response image 1.

      The population L5 IT neuron response to full-field drifting grating stimuli was a decrease of activity, yet there were increasing responses in a subset of neurons. (A) Heatmap of responses of all L5 IT neuron somata recorded with two-photon imaging in 7 Tlx3-Cre x Ai148 mice to drifting gratings of vertical orientation, sorted by their response. Data were sorted on odd trials and plotted on even trials to avoid regression to the mean artifacts. Dashed black box marks the top 10% responsive neurons. The data are a subset of the data shown in Figure S3D. (B) As in A, but for responses to drifting gratings of horizontal orientation. (C) Responses of top 10% vertical grating responsive neurons (dashed black box in A) to vertical (orange) or horizontal gratings (green). Neurons were selected on odd trials, and the average response of even trials is shown. (D) As in A, but sorted to the response of horizontal drifting gratings. (E) As in D, but for the horizontal grating stimulus. (F) As in C, but for the top 10% horizontal grating responsive neurons.
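      The odd/even trial split described in this legend can be sketched as follows. This is an illustrative example on simulated noise, not the analysis code used for the figure; the function name `top_fraction_response` and the array layout (neurons x trials) are assumptions.

```python
import numpy as np


def top_fraction_response(responses, fraction=0.1, cross_validate=True):
    """Mean response of the most responsive neurons.

    responses: array of shape (n_neurons, n_trials) (hypothetical layout).
    Neurons are ranked on odd trials (1st, 3rd, ...). With cross_validate=True
    the reported response comes from the even trials, otherwise from the same
    trials that were used for ranking.
    """
    responses = np.asarray(responses, dtype=float)
    select_trials = responses[:, 0::2]  # odd trials when counting from 1
    report_trials = responses[:, 1::2] if cross_validate else select_trials
    order = np.argsort(select_trials.mean(axis=1))[::-1]  # most responsive first
    n_top = max(1, int(round(fraction * responses.shape[0])))
    return report_trials[order[:n_top]].mean()


# Pure noise: no neuron has a true response.
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, size=(1000, 20))

biased = top_fraction_response(noise, cross_validate=False)
held_out = top_fraction_response(noise, cross_validate=True)
```

      On pure noise, selecting and reporting on the same trials yields an apparent positive response in the top 10% of neurons, while reporting on held-out trials yields a value close to zero. This is the regression-to-the-mean artifact that the split is meant to avoid.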

      3) I would caution against over-interpretation of latencies from wide-field GCaMP activity (Figure 3). A weaker response in a smaller population of neurons that has the same latency as a strong response in a large population of neurons will appear to have different latencies when convolved with the GCaMP kernel. Also there doesn't appear to be any statistical support for different latencies in different cortical areas. Either this should be correctly treated (ideally with linear mixed effects models to account for the increased correlation within animals) or the latency conclusions should be removed from the manuscript (my recommendation).

      We suspect that by “latency conclusions” the reviewer means “latency analysis”. The only time we mention latency differences is to state that: “In C57BL/6 mice that expressed GCaMP brain wide, both visuomotor mismatch and grating stimuli resulted in increases of activity that were strongest and appeared first in visual regions of dorsal cortex (Figures 3A-3C).”

      Nevertheless, we agree with the reviewer that response latency and response amplitude are not independent in our measurements and have replaced the latency plots in Figures 3B, 3C, 3E and 3F with average response maps.
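      The dependence between response amplitude and apparent latency raised in comment 3 can be illustrated with a small simulation. The kernel shape (rise time constant of 0.1 s, decay time constant of 1 s), the fixed detection threshold, and all names below are assumptions chosen for illustration. They do not describe the actual indicator kinetics or the latency analysis in the manuscript.

```python
import numpy as np

dt = 0.01                           # time step in seconds
t = np.arange(0.0, 5.0, dt)         # time axis
kernel_t = np.arange(0.0, 3.0, dt)
# Assumed indicator kernel: difference of exponentials (slow decay, fast rise).
kernel = np.exp(-kernel_t / 1.0) - np.exp(-kernel_t / 0.1)

onset_time = 1.0                    # true onset, identical for both responses
onset_index = int(round(onset_time / dt))


def threshold_latency(amplitude, threshold=0.1):
    """Time after the true onset at which the filtered signal crosses a fixed threshold."""
    activity = np.zeros_like(t)
    activity[onset_index:] = amplitude                    # step increase in activity
    fluorescence = np.convolve(activity, kernel)[: t.size] * dt
    crossed = np.nonzero(fluorescence > threshold)[0]
    if crossed.size == 0:
        return None                                       # threshold never reached
    return t[crossed[0]] - onset_time


latency_strong = threshold_latency(amplitude=1.0)
latency_weak = threshold_latency(amplitude=0.3)
```

      Both simulated responses start at the same time, yet the weaker one crosses the fixed threshold later. Under these assumptions, a threshold-based latency estimate therefore confounds onset time with response amplitude.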

      4) Given that the data is baseline corrected, is it possible that the effects of the anti-psychotic drugs on L5 IT cells were due to a change in the baseline activity of this population?

      While we do find a small increase in average activity as a result of antipsychotic drug injections (Author response image 2), these effects are much smaller than those on locomotion onset responses.

      Author response image 2.

      On average, activity was increased in dorsal cortex after administration of antipsychotic drugs. Average calcium activity over the entire recording session before (naïve) and after (antipsy.) the administration of antipsychotic drugs. Colored lines indicate paired data for individual mice (Blue: 5 mice that had received clozapine, green: 3 mice that had received aripiprazole, red: 3 mice that had received haloperidol).

      To illustrate that the clozapine induced change in locomotion related activity cannot be explained by baseline activity differences, we have replotted the responses shown in Figures 4D and 4E, S3B, S5F without baseline subtraction (Author response image 3).

      Author response image 3.

      Antipsychotic drug injection only modestly shifts the baseline before locomotion onsets. (A) Average response expressed as F/F0 (wherein F0 was defined as the median of a recording session) during closed (solid line, 1101 onsets) and open loop (dashed line, 348 onsets) locomotion onsets in 5 Tlx3-Cre x Ai148 mice that expressed GCaMP6 in L5 IT neurons. Shading indicates SEM over onsets. Dashed horizontal line marks a value of F/F0 of 1.005 for comparison with panel B. Underlying data were the same as in Figures 4D and 4E. (B) As in A, but after a single intraperitoneal injection of the drug clozapine and for 707 closed and 350 open loop locomotion onsets. (C) Average response expressed as F/F0 (wherein F0 was defined as the median of a recording session) of L5 somata in V1, recorded with two-photon imaging in 7 Tlx3-Cre x Ai148 mice that expressed GCaMP6 in L5 IT neurons, during either closed (solid) or open loop (dashed) locomotion onsets. Shading indicates SEM over 8434 neurons. Dashed horizontal line marks a value of F/F0 of 1.045 for comparison with panel D. Underlying data were the same as in Figure S3B. (D) As in C, but for the 3 Tlx3 x Ai148 mice that had received a single intraperitoneal injection of clozapine. Underlying data were from Figure S5F.

      5) Figure 5/Figure S6 - Do the results really reflect an effect of distance or is it driven by areas from different hemispheres? Does the result hold if they factor out the effect of hemisphere or calculate the results within hemisphere?

      The effect appears qualitatively unchanged when we exclude interhemispheric connections from the analysis (Author response image 4).

      Author response image 4.

      As in Figures 6D-6F, but with the exclusion of interhemispheric connections. The decorrelation effect appears qualitatively unchanged.

      Reviewer #2 (Recommendations For The Authors):

      In addition to my public review, I only have one statistics-related and a few minor editing suggestions for the abstract. I hope that these might help the authors to improve their manuscript.

      1) It seems that the researchers are combining observations across different subjects, as seen in Figure 1F-L as well as in all of the other figures. While this has been a common practice in their field, it is now widely recognized that this approach can result in biased statistical inferences since it violates the assumptions of most statistical tests (see this recent discussion: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7906290/). As such, it may be beneficial for the authors to consider utilizing statistical tests that are designed to accurately deal with hierarchical data sets, like linear mixed models or hierarchical bootstrap, to confirm their key results. Additionally or alternatively, presenting data grouped by subject would help demonstrate the consistency of their findings across subjects.

      Please note, in Figures 1F-1K, there are no statistical tests – but the data are indeed averaged over locomotion onsets across all mice. We could use hierarchical sampling to calculate a bootstrap estimate of the mean response curves and show those instead, but that is also not standard practice in the field. We suspect this is also not what the reviewer is suggesting. In Figure 1L, the unit is indeed brain areas (see also our response to comment 1 of reviewer 1), but it is not areas x mice (i.e., the analysis is not hierarchical).

      We have now added a supplementary panel (Figure S4J) that shows the data of Figure 1L with mouse as the statistical unit (note, this is also not hierarchical). We have replaced the statistical test with bootstrapping, as the reviewer suggests. This information can be found in Table S1.

      In Figures 2B and 2D, we have replaced the statistical test with hierarchical bootstrap, and updated the corresponding information in Table S1.

      For Figure 3, in which we show mismatch and grating onset responses averaged using onsets as the base unit, we have added supplementary panels (Figure S2) that show the same analysis using mice as the statistical unit. This did not change any of the conclusions. Note, there was no statistical testing in Figure 3.

      For the decorrelation effect of the different antipsychotic drugs that we show in Figures 6 and 7 the statistical unit is mice x region pairs (that is, while the structure is hierarchical, all mice contribute the same number of pairs). Our data are underpowered to use hierarchical bootstrap for testing the drug effects individually. However, if we combine all antipsychotic drug data (clozapine, aripiprazole, and haloperidol) we reach the same conclusions with hierarchical bootstrap as with the statistical tests (ttest and ranksum) used in the paper (Author response image 5).

      Author response image 5.

      Hierarchical bootstrap of the combined distribution of correlation values shown in Figures 6F, 7C and 7F did not change the conclusion that administration of antipsychotic drugs reduces L5 IT neuron correlations. Statistical comparisons using hierarchical bootstrap: Short-range vs no change, p < 0.001; long-range vs no change, p < 0.001; short-range vs long-range, p < 0.05.
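      For readers unfamiliar with the method, a hierarchical bootstrap over mice and observations can be sketched as follows. This is a generic illustration under assumed names (`hierarchical_bootstrap_means`, a dictionary keyed by mouse) and simulated data. It is not the implementation used for the analyses above, and the one-sided p-value at the end is only one common way to summarize the bootstrap distribution.

```python
import numpy as np


def hierarchical_bootstrap_means(data_by_mouse, n_boot=1000, seed=0):
    """Bootstrap the grand mean while respecting the mouse -> observation hierarchy.

    data_by_mouse: dict mapping a mouse id to a 1D array of observations
    (hypothetical layout). In each iteration, mice are resampled with
    replacement, then observations are resampled with replacement within
    each sampled mouse.
    """
    rng = np.random.default_rng(seed)
    mice = list(data_by_mouse)
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        sampled = rng.choice(len(mice), size=len(mice), replace=True)
        resampled = []
        for index in sampled:
            values = np.asarray(data_by_mouse[mice[index]], dtype=float)
            resampled.append(rng.choice(values, size=values.size, replace=True))
        boot_means[b] = np.concatenate(resampled).mean()
    return boot_means


# Simulated correlation changes for 5 mice, each with 20 region pairs.
rng = np.random.default_rng(1)
toy = {f"mouse{i}": rng.normal(-0.2, 0.1, size=20) for i in range(5)}

boot = hierarchical_bootstrap_means(toy, n_boot=1000)
# Fraction of bootstrap means that are not below zero ("no change").
p_value = np.mean(boot >= 0)
```

      Resampling mice first keeps the between-animal variability in the bootstrap distribution, which is what pooled tests on individual observations ignore.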

      2) Given the impressive amount of data, I found it sometimes a little difficult to follow the manuscript. The authors might want to consider including a high-level overview of their results and rationales at the end of the introduction, and start each Results subsection with a sentence referring back to that high-level overview ("To test whether X, we did Y and present it in this section.")

      We have attempted to improve the writing along these lines.

      3) Some suggestions that might further improve the clarity of writing.

      Abstract: Does the brain really distinguish between different "activity patterns", or would externally-generated and self-generated "stimuli" be a slightly more accurate term to describe the observed alterations in schizophrenia?

      We would argue that (outside of sensory organs) the brain only has access to activity patterns, not stimuli directly. We would prefer to keep the phrasing with activity patterns here.

      Line 12: It might be easier to follow if the authors explicitly related that sentence back to the previous sentence "their ability to identify self-generated activity patterns" -> "their ability to distinguish between externally and self/internally generated ..."

      Absolutely correct – we have improved the writing here.

      Line 14: It remains unclear how visuomotor integration relates to the problem of distinguishing between self- and externally generated stimuli.

      We have attempted to expand on this in the abstract.

      Line 26: it remains unclear how the results support the activation of "internal representations" as this term has not been defined previously

      We have removed “internal representation” from the abstract.

      Results, line 80ff: I was confused by the description of all the different investigated cell types, as the first figure panels then only talk about brain wide and L5. Maybe the authors might find that shortening this with a reference to the methods might improve the flow.

      We have moved the list of cell types and mouse lines to the methods, as suggested.  

      Reviewer #3 (Recommendations For The Authors):

      The authors should strongly consider reassessing their statistics as outlined in the Public Review.

      Specifically:

      1) They should justify their definition of independent statistical unit; if this is not the mouse, they should justify why another definition (i.e. locomotion onset) is used, and show that their defined statistical unit achieves the requirements of being statistically independent (i.e. variance of the unit within a mouse is statistically indistinguishable from variance found between mice; more formally they could calculate the intraclass correlation (ICC)).

      We assume the reviewer is referring mainly to Figure 1 and therein to panel 1L.

      Since we did not perform statistical tests on the calcium traces, we are not sure why we would need to justify the choice of the unit we were showing. Moreover, Figure S2 shows the data of the V1 ROI averaged over mice to address this concern. As also mentioned in our response to reviewer 2, we have extended Figure S2 with the mouse-averaged traces of the V1 ROI data shown in main Figure 3.
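      The intraclass correlation (ICC) that the reviewer mentions in comment 1 can be computed from a one-way ANOVA decomposition. The sketch below assumes a balanced design (the same number of observations per mouse) and uses the one-way random-effects form of the ICC. The function name `icc_oneway` and the toy values are hypothetical.

```python
import numpy as np


def icc_oneway(groups):
    """One-way random-effects ICC for a balanced design.

    groups: array of shape (n_mice, k), with k observations per mouse
    (hypothetical layout). Values near 1 indicate that observations within
    a mouse are strongly correlated; values near or below 0 indicate little
    clustering by mouse.
    """
    groups = np.asarray(groups, dtype=float)
    n, k = groups.shape
    grand_mean = groups.mean()
    group_means = groups.mean(axis=1)
    # Between-mouse and within-mouse mean squares.
    ms_between = k * np.sum((group_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((groups - group_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Observations cluster tightly within each mouse.
clustered = [[1.0, 1.1, 0.9], [2.0, 2.1, 1.9], [3.0, 3.1, 2.9]]
# Every mouse spans the same range of values, so mouse identity is uninformative.
unclustered = [[1.0, 2.0, 3.0], [2.0, 3.0, 1.0], [3.0, 1.0, 2.0]]

icc_clustered = icc_oneway(clustered)
icc_unclustered = icc_oneway(unclustered)
```

      A high ICC means that observations within a mouse are not independent, in which case treating each observation as an independent sample overstates the effective sample size.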

      3) They should justify the statistical tests they use and whether they corrected for multiple comparisons; why for example was an ANOVA not used for Figure 1L and Figure 2B,D?

      We did not rely on ANOVA statistics for Figure 1L because we were mainly interested in carving out that Tlx3- (and Ntsr1-) positive mice inhabit a unique space when comparing the similarity of activity during closed and open loop locomotion onsets. We appreciate the reviewer taking a slightly different point of view on the data and now additionally report the ANOVA test result in Table S1. We have also opted to replace the statistical test in Figure 1L with bootstrapping. Lastly, we added Figure S4J which now shows the data in Figure 1L but with mice as the statistical unit.

      With similar logic, in Figure 2, we were not interested in comparing how the correlation of activity in cortical regions with locomotion behavior evolves over regions within a visuomotor feedback condition (closed loop, open loop or dark) but rather how a given region compares across feedback conditions.

      Still, we have opted to replace the statistical test in Figures 2B and 2D with hierarchical bootstrap, as also suggested by reviewer #2, comment 1. This did not change the significance indicator bars. We have accordingly updated Table S1 in which we report the full statistics.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Response to eLife Assessment:

      We sincerely appreciate your recognition of the novelty and potential significance of our study, and we are grateful for your constructive and valuable comments.

      With regard to your concern that cast immobilization (CI) may itself act as a stressor—potentially influencing skeletal muscle, brown adipose tissue (BAT), and locomotor energy expenditure—we fully recognize this as a highly important issue. In our study, we sought to interpret the findings in light of oxygen consumption and activity data; however, it is inherently difficult to disentangle systemic stress responses and the increased energetic costs associated with CI. We have therefore revised the manuscript to explicitly acknowledge this point as a limitation, and to identify it as a subject for future investigation.

      We also greatly value your suggestion concerning the potential involvement of branched-chain amino acids (BCAAs) derived from adipose tissue in BAT thermogenesis. While our present work primarily focused on muscle-derived amino acids, previous studies have reported that impaired BCAA catabolism in white adipose tissue (WAT) is associated with elevated circulating BCAA levels and metabolic dysfunction [1]. Thus, the possibility that adipose tissue contributes to the BCAA pool used by BAT cannot be disregarded. We fully agree that directly addressing this possibility would be highly valuable, and in future work we plan to locally administer isotope-labeled BCAAs into skeletal muscle or adipose tissue and analyze their contribution to circulating BCAA levels and BAT utilization. Although such experiments could not be performed within the timeframe of this resubmission, we have explicitly stated this limitation in the revised manuscript.

      In summary, we have revised the text to acknowledge the limitations highlighted in your comments and to better clarify future research directions. We believe these revisions more accurately position our current study within the broader context. Once again, we are deeply grateful for your recognition of the originality of our work and for your constructive guidance in refining it.

      Response to Reviewers:

      We sincerely appreciate the reviewers’ thoughtful evaluations and constructive comments, and we are grateful for their recognition of the novelty and significance of our study.

      Response to Reviewer 1:

      We thank the reviewer for the detailed and thoughtful comments regarding the potential systemic effects of CI, including stress responses, energy balance, and tissue wasting. These factors are indeed critical when interpreting our findings, and we agree that CI is not merely a passive loss-of-function model but also introduces stress-related influences.

      The principal aim of our study was to investigate the “physiological compensatory mechanisms” that are triggered by loss of muscle function induced by CI. Although CI inevitably elicits systemic metabolic alterations—including stress-related responses—our study is, to our knowledge, the first to demonstrate that a compensatory thermogenic pathway, mediated by the supply of amino acids from skeletal muscle to BAT, is activated under such conditions. We regard this as the central novelty of our work, and it is consistent with the reviewer’s observation that CI results in a “gain of function.”

      Our intention is not to exclude stress as a contributing factor. Rather, we emphasize that under physiological stress conditions requiring BAT thermogenesis—such as reduced energy stores or decreased heat production from skeletal muscle—amino acid supply from muscle to BAT is induced. Importantly, this mechanism is not unique to CI, as we have confirmed similar metabolic crosstalk under acute cold exposure.

      At the same time, we acknowledge that our current data do not allow us to conclude that “stress is not a primary driver” of BAT thermogenesis induced by CI. Chronic stress induced by CI appeared to be limited in our study (Fig. 2_figure supplement 2), but we cannot fully exclude stress-related effects. Accordingly, we now describe the potential triggers of BAT thermogenesis in the manuscript as either decreased body temperature due to muscle functional loss or stress, explicitly noting in the Discussion that stress and reductions in energy reserves may both contribute, as the reviewer suggested. We also modified the original overstatement that “suppression of muscle thermogenesis induces hypothermia,” and now limit the description to the observed phenomenon that “CI-induced restriction of muscle activity leads to reduced cold tolerance,” while recognizing that multiple factors—including stress, substrate availability, and BAT functional capacity—may underlie this effect.

      We further appreciate the reviewer’s comment regarding the energetic burden imposed by CI. The cast weighed less than 2 g (5–10% of body weight), and thus increased locomotor costs cannot be excluded. However, locomotor activity during the dark phase was reduced by approximately 50%, making the net energetic effect difficult to quantify. In the manuscript, we now present oxygen consumption data and restrict our description to “an increase in oxygen consumption per body weight.” Moreover, as food intake remained almost unchanged compared with controls, the animals appear to have compensated for additional energetic demands, supporting the interpretation that the observed effects were not solely attributable to starvation.

      We also find the reviewer’s suggestion—that CI induces BAT overactivation but impairs its functional capacity—extremely important. Indeed, although CI increased thermogenic gene expression in BAT, body temperature maintenance was impaired. We interpret this reduction in thermoregulation as reflecting decreased heat production from skeletal muscle; however, as the reviewer noted, under prolonged CI, depletion of energy stores could further prevent BAT from fully exerting its thermogenic function.

      We have clarified in the revised Discussion that BAT activation under CI is transient, and that long-term outcomes may be influenced by contributions from other thermogenic organs, and that we recognize the impact of energy depletion as an important issue to be addressed in future studies. We also agree that detailed analyses of metabolic changes and BCAA dynamics following prolonged CI will be an important next step.

      Regarding the reviewer’s concern about potential anesthesia effects on acute cold exposure experiments, we confirmed that body temperature had returned to baseline one hour before testing, and that mice displayed spontaneous feeding and grooming behaviors, which suggested adequate recovery. Moreover, the differences observed compared with sham-anesthetized controls support our interpretation that the results reflect CI-specific effects. Nonetheless, we acknowledge this potential confounding factor as an additional limitation.

      Response to Reviewer 2:

      We thank the reviewer for the constructive comments and clear summary of our findings. We fully agree that the impact of immobilization on skeletal muscle and BAT function under cold exposure represents a key future direction. In the present study, we performed acute cold exposure following short-term immobilization and assessed UCP1 expression and metabolic changes in BAT. However, we acknowledge that we did not fully examine coordinated functional adaptations between skeletal muscle and BAT under cold stress. In particular, how skeletal muscle–derived amino acid supply and IL-6–dependent mechanisms operate during cold exposure remains unresolved. We have therefore noted this explicitly as a limitation and highlighted it as a focus for future work. Going forward, we plan to investigate muscle–BAT metabolic crosstalk and IL-6 signaling in detail under cold conditions to clarify whether the observed responses are specific to CI or represent more general physiological adaptations.

      (1) Herman MA, She P, Peroni OD, Lynch CJ, Kahn BB. Adipose tissue branched chain amino acid (BCAA) metabolism modulates circulating BCAA levels. J Biol Chem. 2010;285(15):11348-56. doi:10.1074/jbc.M109.075184.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Heat production mechanisms are flexible, depending on a wide variety of genetic, dietary, and environmental factors. The physiology associated with each mechanism is important to understand since loss of flexibility is associated with metabolic decline and disease. The phenomenon of compensatory heat production has been described in some detail in publications and reviews, notably by modifying BAT-dependent thermogenesis (for example by deleting UCP1 or impairing lipolysis, cited in this paper). These authors chose to eliminate exercise as an alternative means of maintaining body temperature. To do this, they cast either one or both mouse hindlimbs. This paper is set up as an evaluation of a loss of function of muscle on the functionality of BAT.

      Strengths:

      The study is supported by a variety of modern techniques and procedures.

      Weaknesses:

      The authors show that cast immobilization (CI) does not work as a (passive) loss of function, instead, this procedure produces a dramatic gain of function, putting the animal under considerable stress, inducing β-adrenergic effectors, increased oxygen consumption, and IL6 expression in a variety of tissues, together with commensurate cachectic effects on muscle and fat. The BAT is put under considerable stress, super-induced but relatively poor functioning. Thus within hours and days of CI, there is massive muscle loss (leading to high circulating BCAAs), and loss of lipid reserves in adipose and liver. The lipid cycle that maintains BAT thermogenesis is depleted and the mouse is unable to maintain body temperature.

      I cannot agree with these statements in the Discussion:  

      "We have here shown that cast immobilization suppressed skeletal muscle thermogenesis, resulting in failure to maintain core body temperature in a cold environment."

      This result could also be attributed to high stress and decreased calorie reserves. Note also: CI suppresses 50% of locomotor activity, but the actual work done by the mouse carrying bilateral casts is not taken into account.

      We thank the reviewer for raising this issue. As the reviewer suggests, we also consider that cold intolerance resulting from cast immobilization may be attributed to high stress levels, decreased calorie reserves, or reduced systemic locomotor activity. Indeed, reductions in visceral adipose tissue weight and increases in lipid utilization were observed in the early phase of cast immobilization (Fig.2G and 2F). This suggests that the depletion of calorie reserves induced by stress may affect cold intolerance in cast immobilized mice (Fig.1A-1B). On the other hand, the experiment shown in Fig.1C involved acute cold exposure of mice 2 h after cast immobilization. This result suggests that, even before the depletion of energy stores by immobilization of skeletal muscle, cast immobilization may cause cold intolerance in mice. In addition, as the reviewer suggests, cast immobilization may result in BAT thermogenesis and cachectic effects on muscle and fat. However, circulating corticosterone concentrations and hypothalamic CRH gene expression are not significantly altered after cast immobilization (Figure 2_figure supplement 2D-F). This raises questions about the contribution of stress to the changes in the systemic energy metabolism in this model. As such, we responded to the reviewer's comments by revising this statement at the beginning of the ‘Discussion’ section and adding a discussion on page 16, in addition to the existing discussion on pages 17–18.

      Furthermore, to respond as best we could to the reviewer's comments, we performed additional experiments using the restraint stress model (Figure 7). We found that short-term restraint stress may recruit substrate supply from skeletal muscle for BAT thermogenesis via Il6 gene expression. Based on these data, we speculate that the interaction between BAT and skeletal muscle amino acid metabolism may operate under various physiological stress conditions, including infection and exercise, as well as skeletal muscle immobilization, stress, and cold exposure. This interaction may play a significant role in regulating body temperature and energy metabolism. We are currently investigating the effects of sympathetic activation on skeletal muscle amino acid metabolism and systemic thermoregulation via IL-6 secretion from skeletal muscle using a new model. These data will be reported in a subsequent report.

      "Thermoregulatory system in endotherms cannot be explained by thermogenesis based on muscle contraction alone, with nonshivering thermogenesis being required as a component of the ability to tolerate cold temperatures in the long term."

      This statement is correct, and it clearly showcases how difficult it is to interpret results using this CI strategy. The question to the author is- which components of muscle thermogenesis are actually inhibited by CI, and what is the relative heat contribution?

      We appreciate the reviewer raising this important issue. Addressing it would have required measurements of skeletal muscle temperature and electromyography in mice with cast immobilization, but we were unable to perform these measurements. We have therefore described the point raised by the reviewer on page 18 as a limitation of this study.

      In our additional experiments, we found that several genes that are usually activated in skeletal muscle during cold exposure are repressed in mice with cast immobilization (Figure 1_figure supplement 1_G-1K). Skeletal muscle is an important thermogenic organ. Although the role of the sarcolipin gene in non-shivering thermogenesis is well understood, the primary regulator of thermogenesis in metabolism and shivering remains unclear. In the future, we would like to use models in which key signals for energy metabolism are inhibited, such as muscle-specific PGC-1α-deficient mice and muscle-specific AMPK-deficient mice, to clarify important factors in skeletal muscle thermogenesis. We expect this approach to enable us to analyze the relative thermal contributions of each component of the heat production process in skeletal muscle, which has proven difficult in immobilized muscle models.

      This conclusion is overinterpreted:

      "In conclusion, we have shown that cast immobilization induced thermogenesis in BAT that was dependent on the utilization of free amino acids derived from skeletal muscle, and that muscle-derived IL-6 stimulated BCAA metabolism in skeletal muscle. Our findings may provide new insights into the significance of skeletal muscle as a large reservoir of amino acids in the regulation of body temperature".

      In terms of the production of the article - the data shown in the heat maps has oddly obscure log10 dimensions. The changes are minimal, approx. 1.5x increase/decrease and therefore significance would be key to reporting these data. Fig.3C heatmap is not suitable. What are the 6 lanes to each condition? Overall, this has little/no information.

      Rather than cherry-picking for a few genes, the results could be made more rigorous using RNA-seq profiling of BAT and muscle tissues.

      We agree that this is an important point. Indeed, our model of skeletal muscle immobilization reveals only modest changes in the metabolomics and gene expression analyses. We consider this to be a weakness of the study. However, the interactive thermogenic system that we discovered between skeletal muscle and BAT may also function under other conditions, such as acute stress and cold exposure. We should investigate this further in future models involving such dramatic metabolic changes. In fact, it has been shown that the levels of several metabolites are significantly altered in BAT after acute cold exposure.[1] Therefore, we have revised the conclusion of this section and added this point, as stated on page 18. We also performed an enrichment analysis on the metabolomics data in BAT following cast immobilization and included the results in Figure 2_figure Supplement 1A. In addition, we excluded the heatmap from Fig. 3C of the pre-revision manuscript, as advised by the reviewer. Although we excluded the results in Figure 3C, we consider Figure 3_figure supplement_1 to be sufficient for the text.

      In addition, we agree with the reviewer's remarks on our gene expression analysis. In this study, we were unable to examine RNA-seq profiling of BAT and muscle tissue. Therefore, we have described this as a limitation of the study on page 20. However, we are interested in investigating the effect of IL-6 derived from skeletal muscle on RNA-seq profiling of skeletal muscle and BAT. We will conduct future RNA-seq analyses of BAT and skeletal muscle, using models of skeletal muscle immobilization, acute cold exposure and restraint stress.

      Reviewer #2 (Public Review):

      Summary:

      In this study, the authors identified a previously unrecognized organ interaction where limb immobilization induces thermogenesis in BAT. They showed that limb immobilization by cast fixation enhances the expression of UCP1 as well as amino acid transporters in BAT, and amino acids are supplied from skeletal muscle to BAT during this process, likely contributing to increased thermogenesis in BAT. Furthermore, the experiments with IL-6 knockout mice and IL-6 administration to these mice suggest that this cytokine is likely involved in the supply of amino acids from skeletal muscle to BAT during limb immobilization.

      Strengths:

      The function of BAT plays a crucial role in the regulation of an individual's energy and body weight. Therefore, identifying new interventions that can control BAT function is not only scientifically significant but also holds substantial promise for medical applications. The authors have thoroughly and comprehensively examined the changes in skeletal muscle and BAT under these conditions, convincingly demonstrating the significance of this organ interaction.

      Weaknesses:

      Through considerable effort, the authors have demonstrated that limb-immobilized mice exhibit changes in thermogenesis and energy metabolism dynamics at their steady state. However, the impact of immobilization on the function of skeletal muscle and BAT during cold exposure has not been thoroughly analyzed.

      Reviewer #3 (Public Review):

      Summary:

      In this manuscript, the authors show that impairment of hind limb muscle contraction by cast immobilization suppresses skeletal muscle thermogenesis and activates thermogenesis in brown fat. They also propose that free BCAAs derived from skeletal muscle are used for BAT thermogenesis, and identify IL-6 as a potential regulator.

      Strengths:

      The data support the conclusions for the most part.

      Weaknesses:

      The data provided in this manuscript are largely descriptive. It is therefore difficult to assess the potential significance of the work. Moreover, many of the described effects are modest in magnitude, questioning the overall functional relevance of this pathway. There are no experiments that directly test whether BCAAs derived from adipose tissue are used for thermogenesis, which would require more robust tracing experiments. In addition, the rigor of the work should be improved. It is also recommended to put the current work in the context of the literature.

      We appreciate the reviewer's valuable feedback. As the reviewer pointed out, many of the effects described in this study are modest in magnitude. This reflects a limitation of our study, which used skeletal muscle immobilization as a model. To clarify the overall functional relevance of this pathway, we therefore plan to use alternative models in which BAT thermogenesis and systemic cachectic effect are more strongly induced. We have added this point to the 'Conclusion' section on page 18.

      In addition, previous findings reported that mitochondrial BCAA catabolism in brown adipocytes promotes systemic BCAA clearance, suggesting that BCAAs may be supplied to BAT from other organs during BAT thermogenesis.[5] However, as the reviewer rightly pointed out, the current study did not directly investigate whether BCAAs derived from adipose tissue contribute to thermogenic processes. In light of this, we have revised the manuscript to include a statement in the limitations section on page 20 that addresses this point. 

      Metabolomic analysis of white adipose tissue (WAT) following skeletal muscle immobilization revealed alterations in amino acid concentrations in WAT in response to cast immobilization (Author response image 1A). Notably, levels of BCAAs in WAT remained largely unchanged at 24 hours after cast immobilization, but increased significantly by day 7 (Author response image 1B). At the 24-hour time point, when BAT thermogenesis is known to be activated, WAT weight was found to be reduced (Fig. 2H). Gene expression analysis of amino acid metabolism-related genes in WAT at this time point revealed a modest upregulation of several genes (Author response image 1C). Furthermore, a slight increase in the uptake of [<sup>3</sup>H] leucine into WAT was observed following immobilization (Fig. 3C). Collectively, these findings suggest that BCAAs within WAT may be primarily metabolized locally rather than being mobilized and supplied to BAT. In addition, given the relatively low levels of BCAAs per tissue mass and the limited capacity for BCAA uptake in WAT compared to other tissues, we consider it unlikely that WAT serves as a major reservoir of BCAAs.

      Author response image 1.

      (A) Amino acids in epididymal white adipose tissue (eWAT) of IL-6 KO (–/–) and WT (+/+) mice without (control) or with bilateral cast immobilization for the indicated times. Results are presented as heat maps of the log10 value of the fold change relative to control WT mice and are means of four mice in each group. (B) BCAA concentrations in eWAT of IL-6 KO and WT mice without (control) or with bilateral cast immobilization for 1 or 7 days. (n = 4 per group) (C) RT and real-time PCR analysis of the expression of SLC1A5, SLC7A1, SLC38A2, SLC43A1, BCAT2 and BCKDHA genes in eWAT of mice without (control) or with bilateral cast immobilization for 24 h. (n = 6 per group). All data other than in (A) are means ± SEM. *p < 0.05, **p < 0.01, ***p < 0.001 as determined by Dunnett's test (B) or by the unpaired t test (C).

      Reviewer #1 (Recommendations for the authors): 

      • Gypsum is an irrelevant label. Label consistently, with a procedure acronym, like CI or Imm.

      We apologize for any confusion that our notation may have caused. We corrected all labels relating to the skeletal muscle immobilization model in mice to 'Imm'.

      There are many grammatical errors and typos. Search for an example on Fudure1. The sense of some sentences is enough to obscure their meaning.

      We appreciate the reviewer's points. We have checked the article for grammatical and typographical errors, correcting them where necessary.

      • Figures 6E and F need to be re-annotated in the legend and on figures.

      Following the peer reviewer's advice, we have re-annotated the Figure legends of this result.

      Reviewer #2 (Recommendations for the authors): 

      (1) It is difficult to understand how the data presented in Supplemental Table 1 were obtained. This appears to be data showing that the skeletal muscle weight of the hind limbs in mice accounts for 40 to 50% of the total skeletal muscle weight. How did the authors calculate the muscle weight? Specifically, how did they measure the weight of muscles that are neither in the hind limbs nor in the forelimbs ("Other")? Was this estimated from whole-body CT or MRI data?

      In the legend, it mentions "the posterior cervical region," but what exactly was measured in the posterior cervical region? The methods for this data should be clearly described.

      We appreciate the reviewers' comments. We apologize for any confusion caused by inadequate explanation of this data. This data was obtained by removing skeletal muscle from the posterior cervical region and measuring the weight of the wet tissue. We have taken care to remove most of the skeletal muscle, but some will remain. However, we do not believe that these errors are significant enough to alter the interpretation of the results. This has now been added to the 'Methods' section on page 21.

      (2) Through considerable effort, the authors have demonstrated that limb-immobilized mice exhibit changes in thermogenesis and energy metabolism dynamics at their steady state. However, it remains unclear why limb-immobilized mice have reduced tolerance to cold exposure. Was there any change in the abundance of energy metabolism-related genes during cold exposure between the immobilized and control mice? For example, if the gene expression of UCP1 and UCP2, which are typically upregulated in brown adipose tissue (BAT) and skeletal muscle during cold exposure, was suppressed in the immobilized mice, it might explain their reduced cold tolerance. Thus, the changes in the response of skeletal muscle and BAT to cold exposure between immobilized and control mice should also be analyzed.

      We thank the reviewer for the constructive comments. We consider the main weakness of this study to be the fact that we were unable to measure the temperature and electromyography (EMG) of the skeletal muscles of the cast-immobilized mice. Following the reviewers' advice, we analyzed the expression levels of several genes related to heat production or energy metabolism (Ucp1, Ucp2, Ucp3, Sln and Ppargc1a) in BAT and skeletal muscle of cast-immobilized mice after acute cold exposure (Figure 1_figure supplement 1G-1K). The results showed that the expression of several genes that are usually increased in BAT and skeletal muscle during cold exposure was repressed in cast-immobilized mice. Notably, cast immobilization did not induce the UCP2 and PGC-1α genes at room temperature, and their upregulation during cold exposure was also suppressed in cast-immobilized mice. UCP2 is known to alter its expression in relation to energy metabolism, but it is unclear whether it regulates energy metabolism.[2] Additionally, UCP2 is understood not to play a role in thermogenesis, and the function of UCP2 in skeletal muscle remains unclear.[3] On the other hand, PGC-1α is widely recognized as a transcriptional coactivator that regulates various metabolic processes, including thermogenesis.[4] In our study, we found that the amounts of metabolites in the TCA cycle and the expression of the PGC-1α gene decreased rapidly in immobilized skeletal muscle. This suggests that the metabolic rate is reduced in immobilized skeletal muscle (Figure 1_figure supplement 2A and 2F). In endothermic animals, energy expenditure in skeletal muscle plays a significant role in maintaining body temperature during both activity and rest. Hence, it is assumed that the reduced metabolic rate in skeletal muscle significantly impacts the maintenance of body temperature in cold conditions. Further investigation is required into the function of these genes in skeletal muscle thermogenesis, but we believe the additional data suggest that the loss of muscle function due to immobilization affects the maintenance of body temperature under cold conditions. These results are discussed further on page 15.

      Reviewer #3 (Recommendations for the authors): 

      There are also more specific concerns related to the data supporting the claims.

      (1) The relevance of increasing thermogenesis in BAT after cast immobilization is unclear, as adult humans have very little BAT. Thermogenesis gene and protein expression should be measured in white adipose tissue.

      We would like to thank the reviewers for highlighting this important issue. We agree with the reviewer's comments. We did not observe significant changes in UCP1 expression in the subcutaneous adipose tissue of the inguinal region following skeletal muscle immobilization. We suspect that this is because skeletal muscle immobilization in mice did not exert a strong enough effect to induce browning of white adipose tissue. The ability of immobilizing skeletal muscle to activate thermogenesis in brown or beige adipocytes in adults remains unclear. We have therefore noted this limitation in our study in line 6.

      Additionally, in this study, we aimed to clarify the role of skeletal muscle as an amino acid reservoir under metabolic stress conditions that increase BAT thermogenesis. To this end, we employed models of skeletal muscle immobilization, acute cold exposure, and restraint stress. We also intend to analyze the metabolic interactions between beige adipose tissue and skeletal muscle in more detail using models that induce browning, such as exercise or cold acclimation.

      (2) In Figures 1E-G, there is no significant difference in UCP1 levels relative to the control, but body temperature is lowered from day 2 to day 7. How do the authors explain this?

      This is an important point. We consider the decrease in body temperature of mice following cast immobilization at room temperature to be the result of a reduction in systemic locomotor activity.

      (3) The small induction of PGC1a seen at 10 hours goes away after day 3. Why is this?

      This is an important point. Our investigation showed that the norepinephrine concentration in BAT and blood of cast-immobilized mice tends to increase, peaking at 24 hours of immobilization (Fig. 1H and Figure 2_figure supplement 2D), and then gradually returns to baseline. We speculate that this transient activation of the sympathetic nervous system may affect the expression of PGC1α in BAT. Additionally, although thermogenesis in BAT temporarily increases after skeletal muscle immobilization, studies from other research groups suggest that long-term skeletal muscle immobilization (two weeks) may increase non-shivering thermogenesis in skeletal muscle via high expression of SLN.[6] Therefore, we hypothesize that thermogenic mechanisms other than BAT might be involved during prolonged cast immobilization. We have added a discussion of these topics on page 16.

      (4) The metabolic cage data are marked in multiple places as significant, but the effect size is extremely small. Please describe how significance was calculated (Figure 5 supplement 1B, E, F).

      This is a valid point. This data was statistically analyzed using daily averages, with the results then being compiled. However, the figure was amended because it was not appropriate to use the original to demonstrate significant differences.

      (5) How does IL-6 increase BCAA levels in muscle?

      This is an important point. We are also investigating this issue with great interest. In future, we will use RNA-seq profiling to investigate the mechanism by which IL-6 regulates amino acid metabolism in skeletal muscle. This point was added as a limitation of the study on page 19.

      (6) What is the mechanism behind the elevated IL-6 levels after cast immobilization?

      We appreciate the reviewer's points. Since IL-6 gene expression in skeletal muscle increases in response to acute cold exposure and acute stress, we hypothesize that IL-6 is regulated by β-adrenergic effectors. In our preliminary experiments, stimulation with norepinephrine or with clenbuterol, a β2-adrenergic receptor agonist, appeared to increase IL-6 gene expression and the intracellular free BCAA concentration in cultured mouse muscle cells (Author response image 2A-2D). Going forward, we plan to conduct further studies using a mouse model in which the sympathetic nervous system is activated by administering LPS intracerebroventricularly, as well as using muscle-specific β2-adrenergic receptor knockout mice.

      References:

      (1) Okamatsu-Ogura, Y., et al. UCP1-dependent and UCP1-independent metabolic changes induced by acute cold exposure in brown adipose tissue of mice. Metabolism. 2020 113: 154396 doi: 10.1016/j.metabol.2020.154396.

      (2) Patrick Schrauwen and Matthijs Hesselink, UCP2 and UCP3 in muscle controlling body metabolism., J Exp Biol. 2002 Aug;205(Pt 15):2275-85. doi: 10.1242/jeb.205.15.2275.

      (3) C Y Zhang, et al., Uncoupling protein-2 negatively regulates insulin secretion and is a major link between obesity, beta cell dysfunction, and type 2 diabetes., Cell. 2001 Jun 15;105(6):745-55. doi: 10.1016/s0092-8674(01)00378-6.

      (4) Christophe Handschin and Bruce M Spiegelman, Peroxisome proliferator-activated receptor gamma coactivator 1 coactivators, energy homeostasis, and metabolism., Endocr Rev. 2006 Dec;27(7):728-35. doi: 10.1210/er.2006-0037.

      (5) Yoneshiro, et al., BCAA catabolism in brown fat controls energy homeostasis through SLC25A44. Nature. 2019 572(7771): 614-619 doi: 10.1038/s41586-019-1503-x.

      (6) Shigeto Tomiya, et al., Cast immobilization of hindlimb upregulates sarcolipin expression in atrophied skeletal muscles and increases thermogenesis in C57BL/6J mice., Am J Physiol Regul Integr Comp Physiol. 2019 Nov 1;317(5):R649-R661. doi: 10.1152/ajpregu.00118.2019.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Summary:

      The authors had previously found that brief social isolation could increase the activity of these neurons, and that manipulation of these neurons could alter social behavior in a social rank-dependent fashion. This manuscript explored which of the outputs were responsible for this, identifying the central nucleus of the amygdala as the key output region. The authors identified some discrete behavior changes associated with these outputs, and found that during photostimulation of these outputs, neuronal activity appeared altered in 'social response' neurons.

      Strengths:

      Rigorous analysis of the anatomy. Careful examination of the heterogenous effects on cell activity due to stimulation, linking the physiology with the behavior via photostimulation during recording in vivo.

      Weaknesses:

      (1) There are some clear imbalances in the sample size across the different regions parsed. The CeA has a larger sample size, likely in part to the previous work suggesting differential effects depending on social rank/dominance. Given the potential variance, it may be hard to draw conclusions about the impact of stimulation across different social ranks for other groups.

      While it may be difficult to draw conclusions about the impact of stimulation across different social ranks, we believe that the dominance-induced variance in our dataset reveals key insights into how social history may affect the function of these circuits. However, we do recognize that there are imbalances in sample size across the different circuits that we probed. To test whether we could detect a significant effect in our DRN<sup>DAT</sup>-CeA:ChR2 group with a sample size matched to the DRN<sup>DAT</sup>-BLP:ChR2 group (the lowest sample size of the three circuits probed), we subsampled and ran tests for statistical significance using the following MATLAB code:

      Author response image 1.

      We found that out of 1000 subsamples, we detected a statistically significant effect 40.5% of the time (Author response image 2A). This suggests that the optogenetic effect exists, though it is moderate and is variable across mice (as explained by the significant correlation between social rank and optogenetic effect).
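
      Because the MATLAB code referred to above survives here only as an image placeholder, the following is a minimal Python sketch of the same idea and not the authors' code: draw 14 animals without replacement from a larger group, run a paired t-test on OFF versus ON scores, repeat 1000 times, and report the fraction of draws with p < 0.05. The scores are synthetic placeholders, and the names `t_two_sided_p`, `paired_t_p`, and `subsample_fraction_significant` are hypothetical. To stay within the standard library, the p-value is obtained by numerically integrating the Student t density rather than by calling a statistics package.

      ```python
      import math
      import random
      import statistics


      def t_two_sided_p(t, df, steps=400):
          """Two-sided Student t p-value via Simpson's rule (stdlib only).

          Substituting x = sqrt(df) * tan(theta) turns the tail integral into
          an integral of cos(theta) ** (df - 1) over a bounded interval.
          """
          theta0 = math.atan(abs(t) / math.sqrt(df))
          h = (math.pi / 2 - theta0) / steps
          f = lambda th: math.cos(th) ** (df - 1)
          area = f(theta0) + f(math.pi / 2)
          for i in range(1, steps):
              area += (4 if i % 2 else 2) * f(theta0 + i * h)
          area *= h / 3
          norm = math.gamma((df + 1) / 2) / (math.sqrt(math.pi) * math.gamma(df / 2))
          return min(1.0, 2 * norm * area)


      def paired_t_p(off, on):
          """p-value of a paired t-test on matched OFF/ON scores."""
          diffs = [b - a for a, b in zip(off, on)]
          n = len(diffs)
          mean = statistics.fmean(diffs)
          sd = statistics.stdev(diffs)
          if sd == 0:  # degenerate case: all differences identical
              return 1.0 if mean == 0 else 0.0
          return t_two_sided_p(mean / (sd / math.sqrt(n)), n - 1)


      def subsample_fraction_significant(off, on, k, n_iter=1000, alpha=0.05, seed=0):
              """Fraction of random size-k subsamples with paired-t p < alpha."""
              rng = random.Random(seed)
              hits = 0
              for _ in range(n_iter):
                  idx = rng.sample(range(len(off)), k)  # without replacement
                  p = paired_t_p([off[i] for i in idx], [on[i] for i in idx])
                  hits += p < alpha
              return hits / n_iter


      if __name__ == "__main__":
          # Synthetic placeholder scores for 29 animals (NOT data from the study).
          rng = random.Random(1)
          off = [rng.gauss(1.0, 0.3) for _ in range(29)]
          on = [x + rng.gauss(0.15, 0.3) for x in off]
          print(subsample_fraction_significant(off, on, k=14))
      ```

      The returned fraction is an empirical estimate of how often the effect would be detected at the smaller sample size, which is how the percentages quoted above can be read.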

      To test whether these inconsistent effects may be an effect of variance induced by social rank, we wrote the following MATLAB code to maintain the distribution of social rank in our subsamples:

      Author response image 2.

      P-values from subsampling analysis show a moderately reproducible social preference effect in DRN<sup>DAT</sup>-CeA:ChR2 mice, but not in DRN<sup>DAT</sup>-BNST:ChR2 mice. (A-D) Histograms showing distribution of paired t-test p-values comparing OFF and ON social preference scores (as shown in Figure 4A-I) in subsampled groups (to match the sample size of the DRN<sup>DAT</sup>-BLP:ChR2 group). (A) 14 DRN<sup>DAT</sup>-CeA:ChR2 mice were randomly subsampled, a paired t-test was performed, and the resulting p-values were binned and plotted. (B) Same as (A), but ensuring that the proportion of subordinate, intermediate, and dominant mice in the subsampled groups were the same as the original distribution. (C) Same as (A), but with DRN<sup>DAT</sup>-BNST:ChR2 mice. (D) Same as (B), but with DRN<sup>DAT</sup>-BNST:ChR2 mice.

      Author response image 3.

      We found that out of 1000 subsamples, we detected a statistically significant effect 45.5% of the time when we maintained the original distribution of social rank in DRN<sup>DAT</sup>-CeA:ChR2 mice (Author response image 2B). This suggests that reducing the sample size to N=14 reduces the statistical power and indeed can make an effect harder to reliably detect. The reviewer is correct in saying that sample imbalance may skew conclusions. However, given the rank-dependent optogenetic effect on social preference seen in DRN<sup>DAT</sup>-CeA:ChR2 mice (N=29 mice, p=0.002, Figure 4H) that is notably absent in DRN<sup>DAT</sup>-BLP:ChR2 mice (N=14 mice, p=0.806, Figure 4I), we hypothesize that we would not see a significant effect of photoactivating the DRN<sup>DAT</sup>-BLP circuit on social preference, even with a larger sample size. While we acknowledge there may be evidence that there could be an effect in the DRN<sup>DAT</sup>-BLP projection, this analysis reveals that this effect is not as robust as the effect we see in the DRN<sup>DAT</sup>-CeA projection, which is the focus of this study. An in-depth exploration of the DRN<sup>DAT</sup> projection to the BLP is certainly warranted in future studies.

      Interestingly, the same analysis approach applied to DRN<sup>DAT</sup>-BNST:ChR2 mice suggests a reliably negative result, with subsampling only resulting in a significant result 1.1% of the time (Author response image 2C) and 1.7% of the time if maintaining the original rank distribution (Author response image 2D).
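
      The rank-preserving variant of the subsampling described above can be sketched in the same hedged way. This is an illustration in Python, not the authors' MATLAB code: the function name `stratified_subsample`, the rank labels, and the 10/9/10 split are all hypothetical. Each rank group receives a share of the 14 slots proportional to its size in the full cohort (rounded by largest remainder), and animals are then drawn without replacement within each group.

      ```python
      import random
      from collections import Counter, defaultdict


      def stratified_subsample(labels, k, rng):
          """Pick k indices so that label proportions match the full group.

          Quotas are rounded with the largest-remainder method, so the
          per-label counts always sum to exactly k.
          """
          by_label = defaultdict(list)
          for i, lab in enumerate(labels):
              by_label[lab].append(i)
          n = len(labels)
          quota = {lab: k * len(idx) / n for lab, idx in by_label.items()}
          take = {lab: int(q) for lab, q in quota.items()}
          leftover = k - sum(take.values())
          by_remainder = sorted(quota, key=lambda l: quota[l] - take[l], reverse=True)
          for lab in by_remainder[:leftover]:
              take[lab] += 1
          chosen = []
          for lab, idx in by_label.items():
              chosen.extend(rng.sample(idx, take[lab]))  # without replacement
          return chosen


      if __name__ == "__main__":
          # Hypothetical rank labels for 29 animals (NOT the study's distribution).
          ranks = ["subordinate"] * 10 + ["intermediate"] * 9 + ["dominant"] * 10
          picked = stratified_subsample(ranks, 14, random.Random(0))
          print(Counter(ranks[i] for i in picked))
      ```

      The selected indices can then be passed to whatever paired test is applied to the OFF and ON scores, so that each subsample reflects the rank composition of the full group.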

      (2) It is somewhat unclear why only the 'social object ratio' was used to assess the effects versus more direct measurements of social behavior.

      We decided to use ‘social:object ratio’ because we felt that this measurement more directly supported our claim of increased social preference through optogenetic manipulation. However, in the revised manuscript, we have included direct measurements of social behavior (Figure 4—figure supplement 1) and have updated the legend to reflect this addition (lines 1679-1684; 1698-1708).

      (3) Somewhat related, while it is statistically significant, it is unclear if the change seen in face investigation is biologically significant; on average, it looks like a few-seconds difference, and that was not modulated by social rank.

      While the effect size is relatively small (4.19 seconds, 2.32% of the session), we believe we should report any statistically significant findings we discover. However, due to the small effect size, we have de-emphasized our claims regarding this finding in the text (line 172).

      (4) There are several papers studying these neurons that have explored behaviors examined here, as well as the physiological connectivity that are not cited that would provide important context for this work. In particular, multiple groups have found a dopamine-mediated IPSP in the BNST, in contrast to this work. There are technical differences that may drive these differences, but not addressing them is a major weakness.

      In the revised text, we have cited the groups who have found different dopamine-mediated effects in the ovBNST (specifically Krawczyk et al., 2011, Maracle et al., 2018, and Yu et al., 2021) and reconciled these results with those from our study (lines 422-432).

      (5) The inclusion of some markers for receptors for some of these outputs is interesting, and the authors suggest that this may be important, but this is somewhat disconnected from the rest of the work performed.

      We agree that we cannot make any causal signaling mechanism claims with the current downstream receptor RNA expression data (and we are careful in avoiding making those claims in the text), but we include these data to offer a potential mechanism and hope that these descriptive data will be useful to the field for follow up studies.

      Reviewer #2 (Public review):

      Summary:

      The authors perform a series of studies to follow up on their previous work, which established a role for dorsal raphe dopamine neurons (DRN) in the regulation of social-isolation-induced rebound in mice. In the present study, Lee et al. use a combination of modern circuit tools to investigate putatively distinct roles of DRN dopamine transporter-containing (DAT) projections to the bed nucleus of the stria terminalis (BNST), central amygdala (CeA), and posterior basolateral amygdala (BLP). Notably, they reveal that optogenetic stimulation of distinct pathways confers specific behavioral states, with DRNDAT-BLP driving aversion, DRNDAT-BNST regulating non-social exploratory behavior, and DRNDAT-CeA promoting sociability. A combination of electrophysiological studies and in situ hybridization studies reveal heterogenous dopamine and neuropeptide expression and different firing properties, providing further evidence of pathway-specific neural properties. Lastly, the authors combine optogenetics and calcium imaging to resolve social encoding properties in the DRNDAT-CeA pathway, which correlates observed social behavior to socially engaged neural ensembles.

      Collectively, these studies provide an interesting way of dissecting out separable features of a complex multifaceted social-emotional state that accompanies social isolation and the perception of 'loneliness.' The main conclusions of the paper provide an important and interesting set of findings that increase our understanding of these distinct DRN projections and their role in a range of social (e.g., prosocial, dominance), non-social, and emotional behaviors. However, as noted below, the examination of these circuits within a homeostatic framework is limited given that a number of the datasets did not include an isolated condition. The DRNDAT-CeA pathway was investigated with respect to social homeostatic states in the present study for some of the datasets.

      Strengths: 

      (1) The authors perform a comprehensive and elegant dissection of the anatomical, behavioral, molecular, and physiological properties of distinct DRN projections relevant to social, non-social, and emotional behavior, to address multifaceted and complex features of social state.

      (2) This work builds on prior findings of isolation-induced changes in DRN neurons and provides a working framework for broader circuit elements that can be addressed across the social homeostatic state.

      (3) This work characterizes a broader circuit implicated in social isolation and provides a number of downstream targets to explore, setting a nice foundation for future investigation.

      (4) The studies account for social rank and anxiety-like behavior in several of the datasets, which are an important consideration to the interpretation of social motivation states, especially in male mice with respect to dominance behavior.

      Weaknesses:

      (1) The conceptual framework of the study is based on the premise of social isolation and perceived 'loneliness' under the framework of social homeostasis, analogous to hunger. In this framework, social isolation should provoke an aversive state and compensatory social contact behavior. In the authors' prior work, they demonstrate synaptic changes in DRN neurons and social rebound following acute social isolation. Thus, the prediction would be that downstream projections also would show state-dependent changes as a function of social housing conditions (e.g., grouped vs. isolated). In the current paper, a social isolation condition was not included for the majority of the studies conducted (e.g., Figures 1-6 do not include an isolated condition, Figures 7-8 do include an isolated condition). Thus, while Figures 1-6 add a very interesting and compelling set of data that is of high value to the social behavior field with respect to social and emotional processing and general circuit characterization, these studies do not directly investigate the impacts of dynamic social homeostatic state. The main claim of the paper, including the title (e.g., separable DRN projections mediate facets of loneliness-like state), abstract, intro, and discussion, presents this work under the framework of dynamic social homeostatic states, which should be interpreted with caution, as the majority of the work in the paper did not include a social isolation comparison.

      In previous studies, loneliness-like phenotypes have been characterized across species as having the key dimensions of an aversive state that increases prosociality[1–5]. These two features are amplified by photostimulation of DRN DA neurons and, as we show in this manuscript, are separable across the different projections to each target, which allows us to distinctly mimic different aspects of the constellation of features we characterize as “loneliness.”

      However, we agree with the reviewer, and we do not intend to imply that the mouse currently feels lonely. Indeed, isolating the animals would occlude our ability to see photostimulation-induced mimicry of specific features of the loneliness-like phenotype, and this is precisely why we did not isolate animals for our ChR2 gain-of-function experiments. To address the reviewers’ concern, we will change the title of our manuscript from a claim of “mediating” (which we agree would rely more heavily on actual, ethologically induced loneliness) to one of “mimicry” of (photostimulation-induced) behaviors associated with a loneliness-like phenotype. We have changed language regarding this claim throughout our manuscript (Lines 1, 83, 285, 369).

      For the ChR2 experiments in particular, we intended the optogenetic manipulation to be a gain-of-function one to test the hypothesis that activation of these circuits is sufficient to recapitulate different facets of a loneliness-like state (i.e. prosociality, aversion, and increased exploratory behavior). As such, that is why we only included group-housed conditions for these experiments—to mimic the phenotype of social isolation without social isolation. To test the necessity of these circuits in mediating different facets of a loneliness-like state, we agree that silencing the studied projections in an isolated state is critical, which is what we show in Figure 8. We agree that the addition of an isolated condition to understand the circuit-specific impact of dynamic social homeostatic state is important (particularly through in vivo recordings of these specific circuits during relevant behaviors), and would be a great follow-up to this study.

      (2) In Figure 1, the authors confirm collaterals in the BNST and CeA via anatomical tracing studies. The goal of the optogenetic studies is to dissociate the functional/behavioral roles of distinct projections. However, one limitation of optogenetic projection targeting is the possibility of back-propagating action potentials (stimulation of terminals in one region may back-propagate to activate cell bodies, and then afferent projections to other regions), and/or stimulation of fibers of passage. Therefore, one limitation in the dataset for the optogenetic stimulation studies is the possibility of non-specific unintended activation of projections other than those intended (e.g., DRN<sup>DAT</sup>-CeA). This can be dealt with by administering lidocaine to prevent back-propagating action potentials.

      While back-propagating action potentials are potentially confounding for the manipulation techniques presented in this paper, we do show circuit-specific optogenetic behavioral effects despite significant collateralization (specifically between DRN<sup>DAT</sup> neurons projecting to the CeA and BNST; Figure 1H), suggesting circuit-specificity. Namely, we see that stimulation of DRN<sup>DAT</sup> terminals in CeA promotes social preference (Figure 4E,K) whereas stimulation of DRN<sup>DAT</sup> terminals in BNST promotes rearing (exploratory) behavior (Figure 3G). There is a non-negligible chance that we are stimulating DRN<sup>DAT</sup> fibers of passage, which we have addressed in a caveat disclaimer included in the revised discussion (lines 345-347).

      (3) It is unclear from the text, but in the subjects' section of the methods, it appears that only male animals were included in the study, with no mention of female subjects. It should be clear to the reader that this was conducted in males only if that is the case, with consideration or discussion of female subjects and sex as a biological variable.

      In the revised manuscript, we have included discussion about sex as a biological variable (lines 342-345).

      (4) Averaged data are generally reported throughout the study in the form of bar graphs, across most figures. Individual data points would increase the transparency of the data.

      In an effort to increase the transparency of the data, we have prepared source data for each data panel in the final version of the manuscript and will upload it to eLife.  

      REFERENCES

      (1) Cacioppo, J.T., Hughes, M.E., Waite, L.J., Hawkley, L.C., and Thisted, R.A. (2006). Loneliness as a specific risk factor for depressive symptoms: cross-sectional and longitudinal analyses. Psychol Aging 21, 140–151. https://doi.org/10.1037/0882-7974.21.1.140.

      (2) Cacioppo, S., Capitanio, J.P., and Cacioppo, J.T. (2014). Toward a Neurology of Loneliness. Psychol Bull 140, 1464–1504. https://doi.org/10.1037/a0037618.

      (3) Baumeister, R.F., and Leary, M.R. (1995). The need to belong: Desire for interpersonal attachments as a fundamental human motivation. Psychological Bulletin 117, 497–529. https://doi.org/10.1037/0033-2909.117.3.497.

      (4) Niesink, R.J., and Van Ree, J.M. (1982). Short-term isolation increases social interactions of male rats: A parametric analysis. Physiology & Behavior 29, 819–825. https://doi.org/10.1016/0031-9384(82)90331-6.

      (5) Panksepp, J., and Beatty, W.W. (1980). Social deprivation and play in rats. Behavioral & Neural Biology 30, 197–206. https://doi.org/10.1016/S0163-1047(80)91077-8.

      Reviewer #3 (Public review):

      Summary:

      The authors investigated the role of dopaminergic neurons (dopamine transporter expressing, DAT) in the dorsal raphe nucleus (DRN) in regulating social and affective behavior through projections to the central nucleus of the amygdala (CeA), bed nucleus of the stria terminalis (BNST), and the posterior subdivision of the basolateral amygdala. The largest effect observed was in the DRN-DAT projections to the CeA. Augmenting previously published results from this group (Matthews et al., 2016), the comprehensive behavioral analysis relative to social dominance, gene expression analysis, electrophysiological profiling, and in vivo imaging provides novel insights into how DRN-DAT projections to the CeA influence the engagement of social behavior in the contexts of group-housed and socially isolated mice.

      Strengths:

      Correlational analysis with social dominance is a nice addition to the study. The overall computational analyses performed are well-designed and rigorous.

      Weaknesses: 

      (1) Analysis of dopamine receptor expression did not include Drd3, Drd4, or Drd5 which may provide more insights into how dopamine modulates downstream targets. This is particularly relevant to the BNST projection in which the densest innervation did not robustly co-localize with the expression of either Drd1 or Drd2. It is also possible that dopamine release from DRN-DAT neurons in any or all of these structures modulates neurotransmitter release from inputs to these regions that contain D2 receptors on their terminals.

      Although we find that there is more Vipr2 and Npbwr1 expression compared to Drd1 and Drd2 expression in ovBNST, we still do find that a substantial proportion of cells in ovBNST express dopamine receptors (particularly D2 dopamine receptors, as shown in Figure 5C). In our revised manuscript, we have discussed potential functional mechanisms through D3, D4, and D5 dopamine receptors, as well as pre-synaptic dopamine receptor expression (lines 459-461).

      (2) Although not the focus of this study, without pharmacological blockade of dopamine receptors, it is not possible to assess what the contribution of dopamine is to the behavioral outcomes. Given the co-release of glutamate and GABA from these neurons, it is possible that dopamine plays only a marginal role in the functional connectivity of DRN-DAT neurons.

      While we agree with the reviewer’s comments, we are careful to avoid making claims about dopamine-mediated physiological and behavioral effects of DRN<sup>DAT</sup> neurons (even though these neurons are genetically identified through the expression of dopamine transporter [DAT]), as mentioned in lines 222-228 of the text.

      (3) Photostimulation parameters used during the behavioral studies (8 pulses of light delivered at 30 Hz for several minutes) could lead to confounding results limiting data interpretation. As shown in Figure 6J, 8 pulses of light delivered at 30 Hz result in a significant attenuation of the EPSC amplitude in the BLP and CeA projection. Thus, prolonged stimulation could lead to significant synaptic rundown resulting in an overall suppression of connectivity in the later stages of the behavioral analyses.

      Despite attenuation of EPSC amplitude in BLP and CeA projections and potential synaptic rundown, we still observe significant behavioral effects through optogenetic manipulation of these circuits (increasing the likelihood of capturing a ‘true positive’ rather than a ‘false negative’ effect). In general, we attempt to reduce the duty cycle by sparingly delivering trains of optogenetic stimulation (eight 5-ms pulses every 5 seconds). Additionally, in the real-time place preference task where stimulation of the DRN<sup>DAT</sup>-BLP projection significantly reduces the time spent in the “ON” chamber, stimulation is only delivered when the mouse is in the “ON” compartment of the apparatus. However, we acknowledge the reviewer’s concern: EPSC attenuation and potential synaptic rundown may explain the robust place avoidance effects in DRN<sup>DAT</sup>-BLP:ChR2 mice in the first half of the session (Figure 2G). Importantly, we show in our previously published work (Matthews et al., 2016, Cell; Figure 3) through fast-scan cyclic voltammetry (FSCV) that dopamine transients were consistently recorded in response to eight pulses of 30 Hz DRN<sup>TH</sup> stimulation delivered every 5 seconds in the BNST, though less consistently in the CeA.
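
      As a rough illustration of the duty cycle implied by the stimulation parameters quoted above (eight 5-ms pulses at 30 Hz, one train every 5 seconds), the arithmetic can be sketched as follows. This is a minimal sketch, not the authors' analysis code: the function names are hypothetical, and it assumes pulse onsets are evenly spaced at 1/30 s within a train. In the real-time place preference task the effective value would be lower still, since light is delivered only while the mouse is in the “ON” compartment.

```python
# Stimulation parameters as quoted in the response (assumed exact for this sketch).
PULSES_PER_TRAIN = 8
PULSE_WIDTH_S = 0.005   # 5-ms light pulses
PULSE_RATE_HZ = 30.0    # pulse rate within a train
TRAIN_PERIOD_S = 5.0    # one train every 5 seconds


def light_on_per_train(n_pulses: int, pulse_width_s: float) -> float:
    """Total time the light is on during one train, in seconds."""
    return n_pulses * pulse_width_s


def train_duration(n_pulses: int, pulse_width_s: float, rate_hz: float) -> float:
    """Time from onset of the first pulse to offset of the last pulse.

    Assumes pulse onsets are evenly spaced at 1 / rate_hz.
    """
    return (n_pulses - 1) / rate_hz + pulse_width_s


def duty_cycle(n_pulses: int, pulse_width_s: float, period_s: float) -> float:
    """Fraction of each train period during which the light is on."""
    return light_on_per_train(n_pulses, pulse_width_s) / period_s


if __name__ == "__main__":
    on_time = light_on_per_train(PULSES_PER_TRAIN, PULSE_WIDTH_S)
    span = train_duration(PULSES_PER_TRAIN, PULSE_WIDTH_S, PULSE_RATE_HZ)
    dc = duty_cycle(PULSES_PER_TRAIN, PULSE_WIDTH_S, TRAIN_PERIOD_S)
    print(f"light on per train: {on_time * 1e3:.1f} ms")
    print(f"train duration:     {span * 1e3:.1f} ms")
    print(f"duty cycle:         {dc * 100:.2f} %")
```

      Under these assumptions the light is on for 40 ms in every 5-second period, a duty cycle of under 1%, with each train spanning roughly a quarter of a second.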

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This useful manuscript shows a set of interesting data including the first cryo-EM structures of human PIEZO1 as well as structures of disease-related mutants in complex with the regulatory subunit MDFIC, which generate different inactivation phenotypes. The molecular basis of PIEZO channel inactivation is of great interest due to its association with several pathologies. This manuscript provides some structural insights that may help to ultimately build a molecular picture of PIEZO channel inactivation. While the structures are of use and clear conformational differences can be seen in the presence of the auxiliary subunit MDFIC, the strength of the evidence supporting the conclusions of the paper, especially the proposed role for pore lipids in inactivation, is incomplete and there is a lack of data to support them.

      We thank the editors and reviewers for taking the time and effort to review our manuscript. The evidence supporting the key role of pore lipids in hPIEZO1 inactivation is as follows. i. Compared with wild-type hPIEZO1, the hydrophobic acyl chain tails of the pore lipids are retracted from the hydrophobic pore region in the slower inactivating mutant hPIEZO1-A1988V (Fig. 7a-b). ii. Previous electrophysiological functional studies revealed that substituting this hydrophobic pore formed by I2447, V2450, and F2454 with a hydrophilic pore prolongs the inactivation time for both PIEZO1 and PIEZO2 channels (PMID: 30628892). iii. In the structure of the HX channelopathy mutant R2456H, the interaction between the hydrophilic phosphate head group of the pore lipids and R2456 is disrupted, remodeling the blade and pore module and resulting in a significantly slower inactivation rate. iv. The interaction between pore lipids and lipidated MDFIC stabilizes the pore lipids to reseal the pore upon activation of the hPIEZO1-MDFIC complex.

      According to previously proposed models for the role of pore lipids in mechanosensitive ion channels, such as MscS (PMID: 33568813), MS K2P (PMID: 25500157) and OSCA channels (PMID: 37402734), the pore lipids seal the channel pores in the closed state and can be removed in the open state by membrane deformation induced by mechanical force, in keeping with the force-from-lipids principle. Therefore, in our putative model, the pore lipids seal the hydrophobic pore of hPIEZO1 in the closed state. Upon activation of hPIEZO1, the pore lipids retract from the hydrophobic pore and interact with multi-lipidated MDFIC, stabilizing the inactivated state. The mild channelopathy mutants cause the pore lipids to retract from the hydrophobic pore, making the channel harder to close upon activation. For the severe channelopathy mutant, the interaction between the pore lipids and R2456 is disrupted, resulting in loss of the pore lipids and significantly slower inactivation. We fully understand the concern about the role of pore lipids in our proposed model. Therefore, we have toned down our putative model.

      Public Reviews:  

      Reviewer #1 (Public review):  

      Summary:  

      This manuscript by Shan, Guo, Zhang, Chen et al., shows a raft of interesting data including the first cryo-EM structures of human PIEZO1. Clearly, the molecular basis of PIEZO channel inactivation is of great interest and as such this manuscript provides some valuable extra information that may help to ultimately build a molecular picture of PIEZO channel inactivation. However, the current manuscript does not provide any compelling evidence for a detailed mechanism of PIEZO inactivation.

      Strengths:

      This manuscript documents the first cryo-EM structures of human PIEZO1 and the gain of function mutants associated with hereditary anaemia. It is also the first evidence showing that PIEZO1 gain of function mutants are also regulated by the auxiliary subunit MDFIC.

      We thank reviewer #1 for the encouragement.

      Weaknesses:

      While the structures are interesting and clear differences can be seen in the presence of the auxiliary subunit MDFIC, the major conclusions and central tenets of the paper, especially a role for pore lipids in inactivation, lack data to support them. The post-translational modification of the PIEZO1 auxiliary subunit MDFIC is not modelled as a covalent interaction.

      We fully understand the concern about the role of pore lipids in our proposed model. Therefore, we have toned down our putative model.

      The lipid densities of the post-translational modification of the PIEZO1 auxiliary subunit MDFIC are shown below. As the lipid densities are not well resolved, we use only single-chain lipids to represent them. The lipidation of MDFIC was established in the paper that identified MDFIC.

      Author response image 1.

      Reviewer #2 (Public review):

      Summary:

      Mechanically activated PIEZO ion channels have been widely studied for their role in mechanosensory processes like touch sensation and red blood cell volume regulation. PIEZO in vivo roles are further exemplified by the presence of gain-of-function (GOF) or loss-of-function (LOF) mutations in humans that lead to disease pathologies. Hereditary xerocytosis (HX) is one such disease, caused by GOF mutations in human PIEZO1 that are characterized by slow inactivation kinetics (inactivation being the ability of a channel to close in the presence of a stimulus). But how these mutations alter PIEZO1 inactivation or even the underlying mechanisms of channel inactivation remains unknown. Recently, MDFIC (myoblast determination family inhibitor proteins) was shown to directly interact with mouse PIEZO1 as an auxiliary subunit to prolong inactivation and alter gating kinetics. Furthermore, while lipids are known to play a role in the inactivation and gating of other mechanosensitive channels, whether this mechanism is conserved in PIEZO1 is unknown. Thus, the structural basis for the PIEZO1 inactivation mechanism, and whether lipids play a role in these mechanisms, represent important outstanding questions in the field and have strong implications for human health and disease.

      To get at these questions, Shan et al. use cryogenic electron microscopy (Cryo-EM) to investigate the molecular basis underlying differences in inactivation and gating kinetics of PIEZO1 and human disease-causing PIEZO1 mutations. Notably, the authors provide the first structure of human PIEZO1 (hPIEZO1), which will facilitate future studies in the field. They reveal that hPIEZO1 has a more flattened shape than mouse PIEZO1 (mPIEZO1) and has lipids that insert into the hydrophobic pore region. To understand how PIEZO1 GOF mutations might affect this structure and the underlying mechanistic changes, they solve structures of hPIEZO1 as well as two HX-causing mild GOF mutations (A1988V and E756del) and a severe GOF mutation (R2456H). Unable to glean too much information due to poor resolution of the mutant channels, the authors also attempt to resolve MDFIC-bound structures of the mutants. These structures show that MDFIC inserts into the pore region of hPIEZO1, similar to its interaction with mPIEZO1, and results in a more curved and contracted state than hPIEZO1 on its own. The authors use these structures to hypothesize that differences in curvature and pore lipid position underlie the differences in inactivation kinetics between wild-type hPIEZO1, hPIEZO1 GOF mutations, and hPIEZO1 in complex with MDFIC.

      Strengths:

      This is the first human PIEZO1 structure. Thus, these studies become the stepping stone for future investigations to better understand how disease-causing mutations affect channel gating kinetics.

      We thank reviewer #2 for the positive comments.

      Weaknesses:

      Many of the hypotheses made in this manuscript are not substantiated with data and are extrapolated from mid-resolution structures.

      We fully understand the concern about the role of pore lipids in our proposed model. Therefore, we have toned down our putative model.

      Reviewer #3 (Public review):

      Summary:

      In this manuscript, the authors used structural biology approaches to determine the molecular mechanism underlying the inactivation of the PIEZO1 ion channel. To this end, the authors presented structures of human PIEZO1 and its slow-inactivating mutants. The authors also determined the structures of these PIEZO1 constructs in complexes with the auxiliary subunit MDFIC, which substantially slows down PIEZO1 inactivation. From these structures, the authors suggested an anti-correlation between the inactivation kinetics and the resting curvature of PIEZO1 in detergent. The authors also observed a unique feature of human PIEZO1 in which the lipid molecules plugged the channel pore. The authors proposed that these lipid molecules could stabilize human PIEZO1 in a prolonged inactivated state.

      We thank reviewer #3 for the summary.

      Strengths:

      Notably, this manuscript reported the first structures of a human PIEZO1 channel, its channelopathy mutants, and their complexes with MDFIC. The evidence that lipid molecules could occupy the channel pore of human PIEZO1 is solid. The authors' proposals to correlate PIEZO1 resting curvature and pore-resident lipid molecules with the inactivation kinetics are novel and interesting.

      Thanks for the positive comments.

      Weaknesses:

      However, in my opinion, additional evidence is needed to support the authors' proposals.

      (1) The authors determined the apo structure of human PIEZO1, which showed a more flattened architecture than that of the mouse PIEZO1. Functionally, the inactivation kinetics of human PIEZO1 is faster than its mouse counterpart. From this observation (and some subsequent observations such as the complex with MDFIC), the authors proposed the anti-correlation between curvature and inactivation kinetics. However, the comparison between human and mouse PIEZO1 structure might not be justified. For example, the human and mouse structures were determined in different detergent environments, and the choice of detergent could influence the resting curvature of the PIEZO structures.

      We apologize for the misleading statement about the anti-correlation between curvature and inactivation kinetics of PIEZOs. Based on structural studies and electrophysiological assays, we cannot conclude that the difference in curvature between mPIEZO1 and hPIEZO1 is related to their inactivation kinetics. What we intend to report is the structural difference between mPIEZO1 and hPIEZO1. To avoid this misleading impression, we have revised the manuscript.

      Regarding the concern about detergent, we cannot fully exclude its influence on the curvature of PIEZOs. However, previously reported structures of mPIEZO1 (PDB: 7WLT, 5Z10, 6B3R) were determined in different detergent environments or in a lipid bilayer, yet the curvature of mPIEZO1 is similar across them, as shown below. Considering the high sequence similarity between mPIEZO1 and hPIEZO1, we hypothesize that the curvature of both hPIEZO1 and mPIEZO1 may be unaffected by the detergent.

      Author response image 2.

      Overall structural comparison of curved mPIEZO1 in the lipid bilayer (PDB: 7WLT), mPIEZO1 in CHAPS (PDB: 6B3R) and mPIEZO1 in digitonin (PDB: 5Z10).

      (2) Related to point 1), the 3.7 Å structure of the A1988V mutant presented by the authors showed a similar curvature to the WT but has slower inactivation kinetics.

      Based on the structural comparison between hPIEZO1 and its A1988V mutant, the retraction of pore lipids from the hydrophobic center pore in hPIEZO1-A1988V is mainly responsible for its slower inactivation kinetics.

      (3) Related to point 1), the authors stated that human PIEZO1 might not share the same mechanism as mouse PIEZO1 due to its unique properties. For example, MDFIC only modifies the curvature of human PIEZO1, and lipid molecules were only observed in the pore of the human PIEZO1. Therefore, it may not be justified to draw any conclusions by comparing the structures of PIEZO1 from humans and mice.

      Thanks for the constructive suggestion. To avoid this misleading comparison, we have revised the manuscript.

      (4) Related to point 1), it is well established that PIEZO1 opening is associated with a flattened structure. If the authors' proposal were true, in which a more flattened structure led to faster inactivation, we would have the following prediction: more opening is associated with faster inactivation. In this case, we would expect a pressure-dependent increase in the inactivation kinetics.

      Could the authors provide such evidence, or provide other evidence along this direction?

      We appreciate the reviewer’s comment. We are not claiming a relationship between the flattened structure and activation/inactivation. We only present the results of the structure of wild-type/mutant PIEZO1.

      (5) In Figure S2, the authors showed representative experiments of the inactivation kinetics of PIEZO1 using whole-cell poking. However, poking experiments have high cell-to-cell variability.

      The authors should also show statistics of experiments obtained from multiple cells.

      We now show the statistics of representative electrophysiology experiments obtained from multiple cells in Figure S2.

      (6) In Figure 2 and Figure 5, when the authors show the pore diameter, it could be helpful to also show the side chain densities of the pore lining residues.

      We appreciate the reviewer’s suggestion. The side chains of the pore-lining restriction residues are now shown in Figure 2 and Figure 5, and the densities of the pore domain are shown in Figures S4 and S14. Interestingly, the pore-lining restriction residues of mPIEZO1 and hPIEZO1 are highly conserved.

      (7) The authors observed pore-plugging lipids in slow inactivating conditions such as channelopathy mutations or in complex with MDFIC. The authors propose that these lipid molecules stabilize a "deep resting state" of PIEZO1, making it harder to open and harder to inactivate once opened. This will lead to the prediction that the slow-inactivating conditions will lead to a higher activation threshold, such as the mid-point pressure in the activation curve. Is this true?

      Yes, it is true. In Figure S2, the MDFIC-induced slow-inactivation conditions in hPIEZO1-MDFIC, hPIEZO1-A1988V-MDFIC, hPIEZO1-E756del-MDFIC and hPIEZO1-R2456H-MDFIC result in larger half-activation thresholds than hPIEZO1, hPIEZO1-A1988V, hPIEZO1-E756del and hPIEZO1-R2456H, respectively.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I document the major issues below:

      (1) Mouse vs Human inactivation

      Line 21- "than the slower inactivating curved mouse PIEZO1 (mPIEZO1)."

      Where is the data in this paper or any other paper that human PIEZO1 inactivates faster than mouse PIEZO1? This is central to the way the authors present the paper. In fact, the tau quoted for the hPIEZO1 of ~10 ms is similar to that often measured for mPIEZO1. The reference in the discussion for mouse vs human inactivation times is a review of mechanotransduction. Either the authors need to directly compare the tau of mP1 vs hP1 or quote the relevant primary literature if it exists.

      As measured in HEK-P1KO cells transfected with mPIEZO1, the inactivation time of mPIEZO1 is 13 ± 1 ms (PMID: 29261642) at -80 mV.

      The tau is also voltage-dependent. At -60 mV, the tau is beyond 20 ms for mPIEZO1 (PMID: 20813920), whereas for hPIEZO1 it is still around 10 ms.

      (2) MDFIC-lipidation

      Without seeing the PDB or EMDB I can't guarantee this, but from Figure 6d it seems like the S-acylation in the distal C-terminus of MDFIC is not modelled as a covalent interaction; these lipids are covalently added to the Cys residues in S-acylation via zDHHC enzymes. This should be modelled correctly.

      Thanks for this suggestion. As the lipid densities of the post-translational modification of the PIEZO auxiliary subunit MDFIC are not well resolved, we use only single-chain lipids to represent them. The lipidation of MDFIC was established in the paper that identified MDFIC (PMID: 37590348).

      (3) Pore lipids and inactivation

      The lipids close to the pore are interesting and the density for a lipid is also seen in the mouse MDFIC-PIEZO1 complex from Zhou, Ma et al, 2023. However, there is no data provided by the authors that the lipid is functionally relevant to anything. There is not even a correlation with inactivation in Figure 7. P1+MDFIC inactivates slowest yet the lipids are present within the pore. Second, there is no evidence for what these structures are: closed, or inactivated? In fact, the Xiao lab is now interpreting the 7WLU structure as inactivated.

      The evidence supporting the key role of pore lipids in hPIEZO1 inactivation is as follows. i. Compared with wild-type hPIEZO1, the hydrophobic acyl chain tails of the pore lipids are retracted from the hydrophobic pore region in the slower inactivating mutant hPIEZO1-A1988V (Fig. 7a-b). ii. Previous electrophysiological functional studies revealed that substituting this hydrophobic pore formed by I2447, V2450, and F2454 with a hydrophilic pore prolongs the inactivation time for both PIEZO1 and PIEZO2 channels (PMID: 30628892). iii. In the structure of the HX channelopathy mutant R2456H, the interaction between the hydrophilic phosphate head group of the pore lipids and R2456 is disrupted, remodeling the blade and pore module and resulting in a significantly slower inactivation rate. iv. The interaction between pore lipids and lipidated MDFIC stabilizes the pore lipids to reseal the pore upon activation of the hPIEZO1-MDFIC complex. Overall, the pore lipid is involved in inactivation, and we have toned down the statement.

      (4) Cytosolic plug

      There is additional cytosolic density for the human PIEZO1 that the authors intimate could be from a different binding partner. Is it possible to refine this density? Is it from the PIEZO1-tag? At the very least a little more information about this density should be given if it is going to be mentioned like this.

      Our purification result shows that the protein is tag-free. We are also curious about the extra cytosolic density, but we do not know what it is.

      (5) Reduced sensitivity of PIEZO1 in the presence of MDFIC and its regulatory mechanism

      This was reported in the first article; however, no data is presented by the authors to support MDFIC increasing the mechanical energy required to open PIEZO1. The sentence in the discussion, "MDFIC enables hPIEZO1 to respond to different forces by modifying the pore module through lipid interactions.", is not supported by any functional data and seems to be an over-interpretation of the structures.

      We appreciate this suggestion. The half-activation thresholds of hPIEZO1 and hPIEZO1-MDFIC were measured to be 7 μm and 9 μm, respectively (Fig. S2). In addition, the mechanically activated current amplitude of hPIEZO1-MDFIC is extremely small compared with that of WT, which reaches the nA level (Fig. S2). Therefore, the less mechanosensitive hPIEZO1-MDFIC may require more mechanical energy to open than PIEZO1 WT.

      (6) Both referencing of the PIEZO1 literature and prose could be improved.

      Thanks for the suggestion. We have improved the referencing and prose.

      Reviewer #2 (Recommendations for the authors):

      (1) The authors speculate that the difference in curvature between human and mouse PIEZO1 results in its fast inactivation but do not provide experimental evidence to support this idea. This claim would have been bolstered by showing that the GOF human mutations have a more curved structure, but these proved too structurally unstable to be solved at high resolution. However, the authors state that the 3.7 angstrom map solved for hPIEZO1-A1988V does have an overall similar architecture as wild-type hPIEZO1; thus, contradicting their hypothesis.

      We apologize for the misleading statement. In our revised manuscript, we do not claim a relationship between the flattened structure and activation/inactivation. We only present the results of the structure of wild-type/mutant PIEZO1.

      The structure comparison between the A1988V mutant and WT shows a similar architecture but a different occupancy pattern of pore lipids. Therefore, we suggested that the A1988V mutant has slightly slower inactivation kinetics, mainly due to the exit of pore lipids from the pore.

      (2) The authors show that interaction with MDFIC alters hPIEZO1 structure to be more curved and use this to support their idea that changing the curvature of the protein underlies the prolonged inactivation kinetics. It has been previously shown that MDFIC does not change the structure of mPIEZO1 but does alter its inactivation and gating kinetics. How does this discrepancy fit into the inactivation model proposed by the authors? Similarly, their claim that MDFIC slows hPIEZO1 inactivation and weakens mechanosensitivity just by affecting the pore module and changing blade curvature is made based on observation and no experimental data to test it.

      We have revised the manuscript to avoid a misleading statement about the relationship between the curvature and the inactivation kinetics of hPIEZO1. Previously reported evidence shows that substitution of the hydrophobic pore, formed by I2447, V2450, and F2454, with a hydrophilic pore prolongs the inactivation time for both PIEZO1 and PIEZO2 channels (PMID: 30628892). In addition, the severe HX channelopathy mutant R2456H, wherein the interaction between the hydrophilic phosphate head group and R2456 is disrupted, leads to remodeling of the blade and pore module. Indeed, our observation is limited and further experiments will be performed to support our model.

      (3) How does their model fit in cell types that have PIEZO1 (or GOF mutant PIEZO1) but not MDFIC?

      In cell types that have PIEZO1 or GOF mutant PIEZO1 but not MDFIC, PIEZO1 or GOF mutant PIEZO1 may have a faster inactivation rate than when bound to MDFIC. This is supported by the observation that overexpressed PIEZOs exhibit faster inactivation kinetics than those in some native cell types with MDFIC expression (PMID: 20813920, 30132757).

      (4) Figure S2 is missing quantification of the electrophysiology data. The authors should show summary data in addition to their representative traces including the Imax for all conditions, tau for data shown in b, and sample size for all conditions, and related statistics. The text claims that MDFIC decreases mechanosensitivity (line 156) but there is no data to support this.

      For the electrophysiological assay in Figure S2, we referred to previously reported mPIEZO1 mutants (PMID: 23487776, 28716860). We confirmed that the slower inactivation phenotypes of these mutations of hPIEZO1 are similar to those of mPIEZO1.

      The half-activation thresholds of hPIEZO1 and hPIEZO1-MDFIC were measured to be 7 μm and 9 μm, respectively. This tendency toward an increased half-activation threshold of hPIEZO1 upon binding MDFIC is also seen in the electrophysiological results for the hPIEZO1 channelopathy mutants.

      (5) In line 144, the authors mention that they were able to validate the MDFIC density with multilipidated cysteines on the C-terminal amphipathic helix, but they do not show the density with fitted lipids. While individual densities for some of the lipids are shown in extended Figure 12, it would be helpful to include a figure where they show the map for MDFIC with fitted lipids in it.

      Thanks for the valuable suggestion. As the lipid densities of the post-translational modification of the PIEZO auxiliary subunit MDFIC are not well resolved, we use only single-chain lipids to represent them. The lipidation of MDFIC was established in the paper that identified MDFIC.

      (6) The authors show that R2456 interacts with a lipid at the pore module and hypothesize that this underlies the fast inactivation of hPIEZO1. While they did not obtain a high-resolution structure of this mutant, this hypothesis could be tested by substituting R for side chains with different charges and performing electrophysiology to determine the effects on inactivation.

      Thanks for the constructive suggestion. We will perform the electrophysiology assay for R2456 mutants with different side chains.

      (7) Figure 4 shows the overall structure of hPIEZO1 GOF mutations A1988V and E756del in complex with MDFIC. Other than showing an overall similar structure to wildtype hPIEZO1, the authors do not show how the human mutations A1988V alter the structure of the protein at the site of change. Understanding how these mutations affect the local architecture of the protein has important relevance for human physiology.

      As the GOF channelopathy mutant hPIEZO1-A1988V is structurally unstable, the density at the site of A1988V is too weak to resolve the local interactions in the structure of the hPIEZO1-A1988V mutant.

      Minor comment:

      In general, the manuscript will benefit from heavy copy editing. For example, the word cartoon is misspelled in many of the figure legends.

      We apologize for the mistake. The manuscript has been checked and revised.

      Reviewer #3 (Recommendations for the authors):

      Some portions of this manuscript were not well written. For example, at the end of the 3rd paragraph in the introduction, the authors talked about HX mutations and their correlation with malaria infection and plasma iron. This is irrelevant information and will only distract the readers. It would be ideal if the authors could go through the entire manuscript and improve its clarity.

      Thanks for the suggestion. We have revised the sentences about HX mutations as suggested and improved the entire manuscript.

    1. Author response:

      We were delighted by the reviewers' general comments. We thank the reviewers for their thoughtful reviews, constructive criticism, and analysis suggestions. We have carefully addressed each of their points during the revision of the manuscript.

      Unfortunately, after the paper was submitted to eLife, the first author, who ran all the analyses, left academia. We have now realized that we do not currently have sufficient resources to perform all additional analyses requested by the reviewers.

      The following is the authors’ response to the original reviews:

      Public Reviews:

      Reviewer #1 (Public Review):

      This study uses MEG to test for a neural signature of the trial history effect known as 'serial dependence.' This is a behavioral phenomenon whereby stimuli are judged to be more similar than they really are, in feature space, to stimuli that were relevant in the recent past (i.e., the preceding trials). This attractive bias is prevalent across stimulus classes and modalities, but a neural source has been elusive. This topic has generated great interest in recent years, and I believe this study makes a unique contribution to the field. The paper is overall clear and compelling, and makes effective use of data visualizations to illustrate the findings. Below, I list several points where I believe further detail would be important to interpreting the results. I also make suggestions for additional analyses that I believe would enrich understanding but are inessential to the main conclusions.

      (1) In the introduction, I think the study motivation could be strengthened, to clarify the importance of identifying a neural signature here. It is clear that previous studies have focused mainly on behavior, and that the handful of neuroscience investigations have found only indirect signatures. But what would the type of signature being sought here tell us? How would it advance understanding of the underlying processes, the function of serial dependence, or the theoretical debates around the phenomenon?

      Thank you for pointing this out. Our MEG study was designed to address two questions: 1) we asked whether we could observe a direct neural signature of serial dependence, and 2) if so, whether this signature occurs at the encoding or post-encoding stage of stimulus processing in working memory. This second question directly concerns the current theoretical debate on serial dependence.

      Previous studies have found only indirect signatures of serial dependence such as reactivations of information from the previous trial or signatures of a repulsive bias, which were in contrast to the attractive bias in behavior. Thus, it remained unclear whether an attractive neural bias can be observed as a direct reflection of the behavioral bias. Moreover, previous studies observed the neuronal repulsion during early visual processes, leading to the proposal that neural signals become attracted only during later, post-encoding processes. However, these later processing stages were not directly accessible in previous studies. To address these two questions, we combined MEG recordings with an experimental paradigm with two items and a retro-cue. This design allowed us to record neural signals during separable encoding and post-encoding task phases and so to pinpoint the task phase at which a direct neural signature of serial dependence occurred that mirrored the behavioral effect.

      We have slightly modified the Introduction to strengthen the study motivation.

      (1a) As one specific point of clarification, on p. 5, lines 91-92, a previous study (St. John-Saaltink et al.) is described as part of the current study motivation, stating that "as the current and previous orientations were either identical or orthogonal to each other, it remained unclear whether this neural bias reflected an attraction or repulsion in relation to the past." I think this statement could be more explicit as to why/how these previous findings are ambiguous. The St. John-Saaltink study stands as one of very few that may be considered to show evidence of an early attractive effect in neural activity, so it would help to clarify what sort of advance the current study represents beyond that.

      Thank you for this comment. In the study by St. John-Saaltink et al. (2016), two gratings oriented at 45° and 135° were always presented to either the left or right side of a central fixation point in a trial (90° orientation difference). As only the left/right position of the 45° and 135° gratings varied across trials, the target stimulus in the current trial was either the same or differed by exactly 90° from the previous trial. In consequence, this study could not distinguish whether the observed bias was attractive or repulsive, which concerned both the behavioral effect and the V1 signal. Furthermore, the bias in the V1 signal was partially explained by the orientation that was presented at the same position in the previous trial, which could reflect a reactivation of the previous orientation rather than an actual altered orientation.

      We have changed the Introduction accordingly.

      References:

      St. John-Saaltink E, Kok P, Lau HC, de Lange FP (2016) Serial Dependence in Perceptual Decisions Is Reflected in Activity Patterns in Primary Visual Cortex. Journal of Neuroscience 36: 6186–6192.

      (1b) The study motivation might also consider the findings of Ranieri et al. (2022, J. Neurosci), Fornaciai, Togoli, & Bueti (2023, J. Neurosci), and Lou & Collins (2023, J. Neurosci), who all test various neural signatures of serial dependence.

      Thank you. As all listed findings showed neural signatures revealing a reactivation of the previous stimulus or a response during the current trial, we have added them to the paragraph in the Introduction referring to this class of evidence for the neural basis for serial dependence.

      (2) Regarding the methods and results, it would help if the initial description of the reconstruction approach, in the main text, gave more context about what data is going into reconstruction (e.g., which sensors), a more conceptual overview of what the 'reconstruction' entails, and what the fidelity metric indexes. To me, all of that is important to interpreting the figures and results. For instance, when I first read, it was unclear to me what it meant to "reconstruct the direction of S1 during the S2 epoch" (p. 10, line 199)? As in, I couldn't tell how the data/model knows which item it is reconstructing, as opposed to just reporting whatever directional information is present in the signal.

      (2a) Relatedly, what does "reconstruction strength" reflect in Figure 2a? Is this different than the fidelity metric? Does fidelity reflect the strength of the particular relevant direction, or does it just mean that there is a high level of any direction information in the signal? Please explain in the main text what reconstruction strength and fidelity are.

      Thank you for pointing this out. We applied the inverted encoding model method to MEG data from all active sensors (271) within defined time-windows of 100 ms length. MEG data was recorded in two sessions on different days. Specifically, we constructed an encoding model with 18 motion direction-selective channels. Each channel was designed to show peak sensitivity to a specific motion direction, with gradually decreasing sensitivity to less similar directions. In a training step, the encoding model was fitted to the MEG data of one session to obtain a weight matrix that indicates how well the sensor activity can be explained by the modeled direction. In the testing step, the weight matrix was inverted and applied to the MEG data of the other session, resulting in a response profile of ‘reconstruction strengths’, i.e., how strongly each motion direction was present in a trial. When a specific motion direction was present in the MEG signal, the reconstruction strengths peaked at that specific direction and decreased with increasing direction difference. If no information was present, reconstruction strengths were comparable across all modeled directions, i.e., the response profile was flat. To integrate response profiles across trials, single trial profiles were aligned to a common center direction (i.e., 180°) and then averaged.

      To quantify the accuracy of each IEM reconstruction, i.e., how well the response profile represents a specific motion direction relative to all other directions, we computed the ‘reconstruction fidelity’. Fidelity was obtained by projecting the polar vector of the reconstruction at every direction angle (in steps of 1°) onto the common center (180°) and averaging across all direction angles (Rademaker et al., 2019; Sprague, Ester & Serences, 2016). As such, ‘reconstruction fidelity’ is a summary metric with fidelity greater than zero indicating an accurate reconstruction.
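
      The fidelity computation described above can be illustrated with a short sketch. The function name and the toy profiles are assumptions made for illustration; the projection onto 180° and the averaging over 1° steps follow the description.

```python
import numpy as np

def reconstruction_fidelity(profile, center_deg=180.0):
    """Mean projection of a response profile onto the center direction.

    `profile` holds reconstruction strengths on an evenly spaced grid
    covering 0-360 deg (e.g. 360 values in 1 deg steps).
    """
    profile = np.asarray(profile, dtype=float)
    angles = np.arange(profile.size) * (360.0 / profile.size)
    projection = profile * np.cos(np.deg2rad(angles - center_deg))
    return float(projection.mean())

angles = np.arange(360)
flat = np.ones(360)                                 # no direction information
peaked = 1.0 + np.cos(np.deg2rad(angles - 180.0))   # peak at the common center

print(abs(reconstruction_fidelity(flat)) < 1e-12)   # True
print(round(reconstruction_fidelity(peaked), 6))    # 0.5
```

      A flat profile yields a fidelity of zero because the projections cancel around the circle, whereas a profile peaking at the center yields a positive value.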

      How does the model know which direction to reconstruct? Our modelling procedure was informed about the stimulus in question during both the training and the testing step. Specifically, we informed our model during the training step about, e.g., the current S2. Then, we fit the model to training data from the S2 epoch and applied it to testing data from the S2 epoch. Crucially, during the testing step the motion direction in question, i.e., current S2, becomes relevant again. For example, when S2 was 120°, the reconstructions were shifted by 60° in order to align with the common center, i.e., 180°. In addition, we also tested whether we could reconstruct the motion direction of S1 during the S2 epoch. Here, we again used the MEG data from the S2 epoch but now for S1 training, i.e., the model was informed about the S1 direction. Accordingly, the recentering step during testing was done with regard to the S1 direction. Similarly, we also reconstructed the motion direction of the previous target (i.e., the previous S1 or S2), e.g., during the S2 epoch.
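
      The recentering step can be sketched as a circular shift. This is an illustrative example under stated assumptions: profiles are taken to be sampled in 1° steps, and `recenter` and `bump` are hypothetical helper names, not functions from the study's code.

```python
import numpy as np

def recenter(profile, direction_deg, center_deg=180):
    """Circularly shift a 360-point profile (1 deg steps) so that the
    labeled direction lands on the common center."""
    shift = int(round(center_deg - direction_deg))
    return np.roll(profile, shift)

def bump(peak_deg):
    """Toy single-trial profile peaking at `peak_deg` (illustration only)."""
    return np.exp(np.cos(np.deg2rad(np.arange(360) - peak_deg)))

# A 120 deg trial is shifted by +60 deg, a 300 deg trial by -120 deg, etc.
trial_dirs = [120, 300, 40]
aligned = np.stack([recenter(bump(d), d) for d in trial_dirs])
mean_profile = aligned.mean(axis=0)
print(int(np.argmax(mean_profile)))   # 180
```

      Which direction label is passed to `recenter` (S1, S2 or the previous target) determines which item the averaged profile is aligned to, which corresponds to the distinction drawn in the paragraph above.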

      Together, the multi-variate pattern of MEG activity across all sensors during the S2 epoch could contain information about the currently presented direction of S2, the direction of the preceding S1 and the direction of the target stimulus from the previous trial (i.e., either previous S1 or previous S2) at the same time. An important exception from this regime was the cross-reconstruction analysis (Appendix 1—figure 2). Here we trained the encoding model on the currently relevant item (S1 during the S1 epoch, S2 during the S2 epoch and the cued item during the retro-cue epoch) of one MEG session and reconstructed the previous target on the other MEG session.

      Finally, to examine shifts of the neural representation, single-trial reconstructions were assigned to two groups, those with a previous target that was oriented clockwise (CW) in relation to the currently relevant item and those with a previous target that was oriented counter-clockwise (CCW). The CCW reconstructions were flipped along the direction space, hence, a negative deviation of the maximum of the reconstruction from 180° indicated an attraction toward the previous target, whereas a positive deviation indicated a repulsion. Those reconstructions were then first averaged within each possible motion direction and then across them to account for different presentation numbers of the directions, resulting in one reconstruction per participant, epoch and time point. To examine systematic shifts, we then tested if the maximum of the reconstruction was systematically different from the common center (180°). For display purposes, we subtracted the reconstructed maximum from 180° to compute the direction shifts. A positive shift thus reflected attraction and a negative shift reflected repulsion.
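
      A sketch of the flipping and shift computation described above, on toy profiles. The 1° grid, the helper names and the toy data are assumptions; the averaging within each motion direction before averaging across directions is omitted for brevity.

```python
import numpy as np

CENTER = 180  # common center in degrees (from the text)

def flip_about_center(profile):
    """Mirror a 360-point profile (1 deg steps) around 180 deg,
    so that new[theta] == old[(360 - theta) % 360]."""
    return np.roll(profile[::-1], 1)

def direction_shift(profiles, previous_is_ccw):
    """Flip the CCW trials, average, and return 180 deg minus the peak.
    Positive values indicate attraction, negative values repulsion."""
    flipped = [flip_about_center(p) if ccw else p
               for p, ccw in zip(profiles, previous_is_ccw)]
    mean_profile = np.mean(flipped, axis=0)
    return CENTER - int(np.argmax(mean_profile))

def bump(peak_deg):
    """Toy aligned reconstruction peaking at `peak_deg` (illustration only)."""
    return np.exp(np.cos(np.deg2rad(np.arange(360) - peak_deg)))

# Toy data: the CW trial peaks at 177 deg and the CCW trial at 183 deg,
# i.e. both are pulled toward their respective previous target.
shift = direction_shift([bump(177), bump(183)], [False, True])
print(shift)   # 3
```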

      We have updated the Results accordingly.

      References:

      Rademaker RL, Chunharas C, Serences JT (2019) Coexisting representations of sensory and mnemonic information in human visual cortex. Nature Neuroscience. 22: 1336-1344.

      Sprague TC, Ester EF, Serences JT (2016) Restoring Latent Visual Working Memory Representations in Human Cortex. Neuron. 91: 694-707

      (3) Then in the Methods, it would help to provide further detail still about the IEM training/testing procedure. For instance, it's not entirely clear to me whether all the analyses use the same model (i.e., all trained on stimulus encoding) or whether each epoch and timepoint is trained on the corresponding epoch and timepoint from the other session. This speaks to whether the reconstructions reflect a shared stimulus code across different conditions vs. that stimulus information about various previous and current trial items can be extracted if the model is tailored accordingly.

      As reported above, our modeling procedure was informed about the same stimulus during both the training and the testing step, except for the cross-reconstruction analysis.

      Regarding the training and testing data, the model was always trained on data from one session and tested on data from the other session, so that each MEG session served once as the training data set and once as the test data set; hence, training and test data were independent. Importantly, training and testing were always performed in an epoch- and time point-specific way: For example, the model that was trained on the first 100-ms time bin from the S1 epoch of the first MEG session was tested on the first 100-ms time bin from the S1 epoch of the second MEG session.

      Specifically, when you say "aim of the reconstruction" (p. 31, line 699), does that simply mean the reconstruction was centered in that direction (that the same data would go into reconstructing S1 or S2 in a given epoch, and what would differentiate between them is whether the reconstruction was centered to the S1 or S2 direction value)?

      As reported above, during testing the reconstruction was centered at the currently relevant direction. The encoding model was trained with the direction labels of S1, S2 or the target item, corresponding to the currently relevant direction, i.e., S1 in S1 epochs, S2 in S2 epochs and target item (S1 or S2) in the retro-cue epoch. The only exception was the reconstruction of S1 during the S2 epoch. Here the encoding model was trained on the S1 direction, but with data from the S2 epoch and then applied to the S2 epoch data and recentered to the S1 direction. So here, S1 and S2 were indeed trained and tested separately for the same epoch.

      (4) I think training and testing were done separately for each epoch and timepoint, but this could have important implications for interpreting the results. Namely if the models are trained and tested on different time points, and reference directions, then some will be inherently noisier than others (e.g., delay period more so than encoding), and potentially more (or differently) susceptible to bias. For instance, the S1 and S2 epochs show no attractive bias, but they may also be based on more high-fidelity training sets (i.e., encoding), and therefore less susceptible to the bias that is evident in the retrocue epoch.

      Thanks for pointing this out. Training and testing were performed in an epoch- and time point-specific way. Thus, potential differences in the signal-to-noise ratio between different task phases could cause quality differences between the corresponding reconstructed MEG signals. However, we did not observe such differences. Instead, we found comparable time courses of the reconstruction fidelities and the averaged reconstruction strengths between epochs (Figure 2b and 2c, respectively). Fig. 2b, e.g., shows that reconstruction fidelity for motion direction stimuli built up slowly during the stimulus presentation, reaching its maximum only after stimulus offset. This observation may contrast with other stimulus materials that show faster build-ups, such as the orientation of a Gabor.

      We agree with the reviewer that, regardless of the comparable but not perfectly equal reconstruction fidelities, there are good arguments to assume that the neural representation of the stimulus during its encoding is typically less noisy than during its post-encoding processing and that this difference could be one of the reasons why serial dependence emerged in our study only during the retro-cue epoch. However, the argument could also be reversed: a biased representation, which represents a small and hard-to-detect neural effect, might be easier to observe for less noisy data. So, the fact that we found a significant bias only during the potentially “noisier” retro-cue epoch makes the effect even more noteworthy.

      We mentioned the limitation related to our stimulus material already at the end of the Discussion. We have now added a new paragraph to the Discussion to address the two opposing lines of reasoning.  

      (4) I believe the work would benefit from a further effort to reconcile these results with previous findings (i.e., those that showed repulsion, like Sheehan & Serences), potentially through additional analyses. The discussion attributes the difference in findings to the "combination of a retro-cue paradigm with the high temporal resolution of MEG," but it's unclear how that explains why various others observed repulsion (thought to happen quite early) that is not seen at any stage here. In my view, the temporal (as well as spatial) resolution of MEG could be further exploited here to better capture the early vs. late stages of processing. For instance, by separately examining earlier vs. later time points (instead of averaging across all of them), or by identifying and analyzing data in the sensors that might capture early vs. late stages of processing. Indeed, the S1 and S2 reconstructions show subtle repulsion, which might be magnified at earlier time points but then shift (toward attraction) at later time points, thereby counteracting any effect. Likewise, the S1 reconstruction becomes biased during the S2 epoch, consistent with previous observations that the SD effects grow across a WM delay. Maybe both S1 and S2 would show an attractive bias emerging during the later (delay) portion of their corresponding epoch? As is, the data nicely show that an attractive bias can be detected in the retrocue period activity, but they could still yield further specificity about when and where that bias emerges.

      We are grateful for this suggestion. Before going into detail, we would like to explain our motivation for choosing the present analysis approach that included averaging time points within an epoch of interest.

      Our aim was to detect a neuronal signature of serial dependence which is manifested as an attractive shift of about 3.5° within the 360° direction space. To be able to detect such a small effect in the neural data and given the limited resolution of the reconstruction method and the noisy MEG signals, we needed to maximize the signal-to-noise ratio. A common way to achieve this is to average data points. In our study, participants performed 1022 trials. We down-sampled the MEG data from the recorded sampling rate of 1200 Hz to 10 Hz (one data point per 100 ms) for the estimation of reconstruction fidelity, and we calculated the final neural shift estimates by averaging across time points that showed a robust reconstruction fidelity and thus represented interpretable data points.
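
      One way to go from 1200 Hz to one data point per 100 ms is to average non-overlapping bins of 120 samples. The sketch below assumes bin averaging; the response does not state how the down-sampling was implemented, so this is an illustration rather than the method used, and `bin_average` is a hypothetical name.

```python
import numpy as np

def bin_average(data, fs_in=1200, fs_out=10):
    """Average non-overlapping bins along the last (time) axis.

    With fs_in=1200 and fs_out=10, each bin spans 120 samples (100 ms).
    Trailing samples that do not fill a complete bin are dropped.
    """
    data = np.asarray(data, dtype=float)
    factor = fs_in // fs_out
    n_bins = data.shape[-1] // factor
    trimmed = data[..., : n_bins * factor]
    return trimmed.reshape(*data.shape[:-1], n_bins, factor).mean(axis=-1)

# Two seconds of simulated data from 271 sensors at 1200 Hz.
rng = np.random.default_rng(0)
raw = rng.normal(size=(271, 2400))
binned = bin_average(raw)
print(binned.shape)   # (271, 20)
```

      Averaging 120 independent noise samples reduces the noise standard deviation by a factor of about 11 (the square root of 120), which illustrates why averaging raises the signal-to-noise ratio at the cost of temporal resolution.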

      Our procedure to maximize the signal-to-noise ratio was successful as we were able to reliably reconstruct the presented and remembered motion direction in all epochs (Figure 1a and 1b in the manuscript). However, the reconstruction did not work equally well for all time points within each epoch. In particular, there were time points with a non-significant reconstruction fidelity. In consequence, for the much smaller neural shift effect we did not expect to observe reliable time-resolved results, i.e., when considering each time point separately. Instead, we used the reconstruction results to define the time window in order to calculate the neural shift, i.e., we averaged across all time points with a significant reconstruction fidelity.

      Author response image 1 depicts the neural shift separately for each time point during the retro-cue epoch. Importantly, the gray parts of the time courses indicate time points where the reconstruction of the presented or cued stimulus was not significant. This means that the reconstructed maxima at those time points were very variable/unreliable and therefore the neural shifts were hardly interpretable.

      Author response image 1.

      Time courses of the reconstruction shift reveal a tendency for an attractive bias during the retrocue phase. Time courses of the neural shift separately for each time point during the S1 (left panel), S2 (middle panel) and retro-cue epochs (right panel). Gray lines indicate time points with non-significant reconstruction fidelities and therefore very variable and non-interpretable neural reconstruction shifts. The colored parts of the lines correspond to the time periods of significant reconstruction fidelities with interpretable reconstruction shifts. Error bars indicate the middle 95% of the resampling distribution. Time points with less than 5% (equaling p < .05) of the resampling distribution below 0° are indicated by a colored circle. N = 10.

      First, the time courses in Author response image 1 show that the neural bias varied considerably between subjects at given time points, as revealed by the resampling distributions. In this resampling procedure, we drew 10 participants with replacement in each of 10,000 iterations and calculated the reconstruction shift based on the mean reconstruction of the resampled participants. The observed variability stresses the necessity to average the values across all time points that showed a significant reconstruction fidelity to increase the signal-to-noise ratio.
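
      The resampling procedure can be sketched as a participant-level bootstrap. The code below is a minimal illustration on simulated reconstructions; the function name, the 1° grid and the simulated peak locations are assumptions, while the number of participants (10), the number of iterations (10,000), the middle 95% interval and the proportion of the distribution below 0° follow the description.

```python
import numpy as np

def bootstrap_shift(recons, n_iter=10_000, seed=0):
    """Resample participants with replacement and recompute the shift.

    recons: array (participants, 360) of aligned reconstructions, 1 deg steps.
    Returns the resampled shifts, the middle 95% interval, and the
    proportion of resampled shifts below 0 deg.
    """
    recons = np.asarray(recons, dtype=float)
    rng = np.random.default_rng(seed)
    n = recons.shape[0]
    shifts = np.empty(n_iter)
    for i in range(n_iter):
        sample = recons[rng.integers(0, n, size=n)]       # draw n participants
        shifts[i] = 180 - np.argmax(sample.mean(axis=0))  # positive = attraction
    interval = np.percentile(shifts, [2.5, 97.5])
    p_below_zero = float(np.mean(shifts < 0))
    return shifts, interval, p_below_zero

# Simulated participants whose reconstructions peak slightly below 180 deg.
rng = np.random.default_rng(1)
peaks = 180 - rng.normal(loc=3.5, scale=3.0, size=10)
angles = np.arange(360)
recons = np.exp(np.cos(np.deg2rad(angles[None, :] - peaks[:, None])))

shifts, interval, p = bootstrap_shift(recons)
print(interval, p)
```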

      Second, despite this high variability/low signal-to-noise ratio, Author response image 1 (right panel) shows that our choice for this procedure was sensible as it revealed a clear tendency of an attractive shift at almost all time points from 300 through 1500 ms after retro-cue onset with only a few individual time-points showing a significant effect (uncorrected for multiple comparisons). It is worth mentioning that this time course did not overlap with the time course of previous target cross-reconstruction (Appendix 1—figure 2, right panel), as there was no significant target cross-reconstruction during the retro-cue epoch with an almost flat profile around zero. Also, there was no overlap with previous target decoding in the retro-cue epoch (Figure 5 in the manuscript). Here, the previous target was reactivated significantly only at early time points of 200 and 300 ms post cue onset (i.e., at time points with a non-significant reconstruction fidelity and therefore no interpretable neural shift), while the nominally highest values of the attractive neural shift were visible at later time points that also showed a significant reconstruction fidelity (Figure 2b in the manuscript).

      Third, Author response image 1 (left and middle panel) shows the time courses of the neural shift during the S1 and S2 epochs. While no neural shift could be observed for S1, during the S2 epoch the time-resolved analysis indicated an initial attractive shift followed by a (nonsignificant) tendency for a repulsive shift. After averaging neural shifts across time points with a significant reconstruction fidelity, there was no significant effect with an overall tendency for repulsion, as reported in the paper. The attractive part of the neural shift during the S2 epoch was nominally strongest at very early time points (at 100-300 ms after S2 onset) and overlapped perfectly with the reactivation of the previous target as shown by the cross-reconstruction analysis (Appendix 1—figure 2, middle panel). This overlap suggests that the neural attractive shift did not reflect an actual bias of the early S2 representation, but rather a consequence of the concurrent reactivation of the previous target in the same neural code as the current representation. Finally, this neural attractive shift during S2 presentation did not correlate with the behavioral error (single trial-wise correlation: no significant time points during S2 epoch) or the behavioral bias (subject-wise correlation). In contrast, for the retro-cue epoch, we observed a significant correlation between the neural attractive shift and behavior.

      Together, the time-resolved results show a clear tendency for an attractive neural bias during the retro-cue phase, thus supporting our interpretation that the attractive shift during the retro-cue phase reflects a direct neuronal signature of serial dependence. However, these additional analyses also demonstrated a large variability between participants and across time points, warranting a cautious interpretation. We conclude that our initial approach of averaging across time points was an appropriate way of reducing the high level of noise in the data and revealed the reported significant and robust attractive neural shift in the retrocue phase.

      (5) A few other potentially interesting (but inessential) considerations: A benchmark property of serial dependence is its feature-specificity, in that the attractive bias occurs only between current and previous stimuli that are within a certain range of similarity to each other in feature space. I would be very curious to see if the neural reconstructions manifest this principle - for instance, if one were to plot the trialwise reconstruction deviation from 0, across the full space of current-previous trial distances, as in the behavioral data. Likewise, something that is not captured by the DoG fitting approach, but which this dataset may be in a position to inform, is the commonly observed (but little understood) repulsive effect that appears when current and previous stimuli are quite distinct from each other. As in, Figure 1b shows an attractive bias for direction differences around 30 degrees, but a repulsive one for differences around 170 degrees - is there a corresponding neural signature for this component of the behavior?

      We appreciate the reviewer's idea to split the data. However, given that our results strongly relied on the inclusion of all data points, i.e., including all distances in motion direction between the current S1, S2 or target and the previous target and requiring data averaging, we are concerned that our study was vastly underpowered to be able to inform whether the attractive bias occurs only within a certain range of inter-stimulus similarity. To address this important question, future studies would require neural measurements with much higher signal-to-noise-ratio than the present MEG recordings with two sessions per participant and 1022 trials in total.

      Reviewer #2 (Public Review):

      Summary:

      The study aims to probe the neural correlates of visual serial dependence - the phenomenon that estimates of a visual feature (here motion direction) are attracted towards the recent history of encoded and reported stimuli. The authors utilize an established retro-cue working memory task together with magnetoencephalography, which allows neural representations of motion direction to be probed during encoding and retrieval (retro-cue) periods of each trial. The main finding is that neural representations of motion direction are not systematically biased during the encoding of motion stimuli, but are attracted towards the motion direction of the previous trial's target during the retrieval (retro-cue period), just prior to the behavioral response. By demonstrating a neural signature of attractive biases in working memory representations, which align with attractive behavioral biases, this study highlights the importance of post-encoding memory processes in visual serial dependence.

      Strengths:

      The main strength of the study is its elegant use of a retro-cue working memory task together with high temporal resolution MEG, making it possible to probe neural representations related to stimulus encoding and working memory. The behavioral task elicits robust behavioral serial dependence and replicates previous behavioral findings by the same research group. The careful neural decoding analysis benefits from a large number of trials per participant, considering the slow-paced nature of the working memory paradigm. This is crucial in a paradigm with considerable trial-by-trial behavioral variability (serial dependence biases are typically small, relative to the overall variability in response errors). While the current study is broadly consistent with previous studies showing that attractive biases in neural responses are absent during stimulus encoding (previous studies reported repulsive biases), to my knowledge it is the first study showing attractive biases in current stimulus representations during working memory. The study also connects to previous literature showing reactivations of previous stimulus representations, although the link between reactivations and biases remains somewhat vague in the current manuscript. Together, the study reveals an interesting avenue for future studies investigating the neural basis of visual serial dependence.

      Weaknesses:

      (1) The main weakness of the current manuscript is that the authors could have done more analyses to address the concern that their neural decoding results are driven by signals related to eye movements. The authors show that participants' gaze position systematically depended on the current stimuli's motion directions, which together with previous studies on eye movement-related confounds in neural decoding justifies such a concern. The authors seek to rule out this confound by showing that the consistency of stimulus-dependent gaze position does not correlate with (a) the neural reconstruction fidelity and (b) the repulsive shift in reconstructed motion direction. However, neither of these controls directly addresses the concern. If I understand correctly, the metric quantifying the consistency of stimulus-dependent gaze position (Figure S3a) only considers gaze angle and not gaze amplitude. Furthermore, it does not consider gaze position as a function of continuous motion direction, but instead treats motion directions as categorical variables. Therefore, assuming an eye movement confound, it is unclear whether the gaze consistency metric should strongly correlate with neural reconstruction fidelity, or whether there are other features of eye movements (e.g., amplitude differences across participants, and tuning of gaze in the continuous space of motion directions) which would impact the relationship with neural decoding. Moreover, it is unclear whether the consistency metric, which does not consider history dependencies in eye movements, should correlate with attractive history biases in neural decoding. It would be more straightforward if the authors would attempt to (a) directly decode stimulus motion direction from x-y gaze coordinates and relate this decoding performance to neural reconstruction fidelity, and (b) investigate whether gaze coordinates themselves are history-dependent and are attracted to the average gaze position associated with the previous trials' target stimulus. If the authors could show that (b) is not the case, I would be much more convinced that their main finding is not driven by eye movement confounds.

      The reviewer is correct that our eye-movement analysis approach considered gaze angle (direction) and not gaze amplitude. We considered gaze direction to be the more important feature to control for when investigating the neural basis of serial dependence that manifests, given the stimulus material used in our study, as a shift/deviation of angle/direction of a representation towards the previous target motion direction. To directly relate gaze direction and MEG data to each other, we matched the temporal resolution of the eye-tracking data to that of the MEG data. Specifically, our analysis procedure of gaze direction provided a measure indicating the extent to which the variance of the gaze directions was reduced compared with random gaze direction patterns, in relation to the specific stimulus direction within each 100 ms time bin. Importantly, this procedure was able to reveal not only systematic gaze directions that were in accordance with the stimulus direction or the opposite direction, but also picked up all stimulus-related gaze directions, even if the relation differed across participants or time.

      Our analysis approach was highly sensitive to detect stimulus-related gaze directions during all task phases (Appendix 1—figure 3). As expected, we found systematic gaze directions when S1 and S2 were presented on the screen, and they were reduced thereafter, indicating a clear relationship between stimulus presentation and eye movement. Systematic gaze directions were also present in the retro-cue phase where no motion direction was presented. Here they showed a clearly different temporal dynamic as compared to the S1 and S2 phases. They appeared at later time points and with a higher variability between participants, indicating that they coincided with retrieving the target motion direction from working memory.

      To relate gaze directions with MEG results, we calculated Spearman rank correlations. We found that there was no systematic relationship at any time point between the stimulus-related reconstruction fidelity and the amount of stimulus-related gaze direction. Moreover, the correlation varied strongly from time point to time point, revealing its random nature. In addition to the lack of significant correlations, we observed clearly distinct temporal profiles for gaze direction (Appendix 1—figure 3a and Appendix 1—figure 3b) and the reconstruction fidelities (Figure 2b in the manuscript, Appendix 1—figure 3c), in particular in the critical retro-cue phase.
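
      The per-time-bin correlation described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' analysis code; the function names and the nested-list layout (`fidelity[t][p]` and `gaze[t][p]` for time bin `t` and participant `p`) are assumptions made for the example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_time_course(fidelity, gaze):
    """One across-participant Spearman coefficient per time bin."""
    return [spearman(f_t, g_t) for f_t, g_t in zip(fidelity, gaze)]
```

      A time course of coefficients that fluctuates around zero without a stable sign would correspond to the pattern reported in the response; the sketch does not compute p-values.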

      We favored this analysis approach over one that directly decoded stimulus motion direction from x-y gaze coordinates, as we considered it hardly feasible to compute an inverted encoding model with only two eye-tracker channels as an input (in comparison to 271 MEG sensors), and to our knowledge, this has not been done before. Other decoding methods have previously been applied to x-y gaze coordinates. However, in contrast to the inverted encoding model, they did not provide a measure of the representation shift which would be crucial for our investigation of serial dependence.

      We appreciate the suggestion to conduct additional analyses on eye tracking data (including different temporal and spatial resolution and different features) and their relation to MEG data. However, the first author, who ran all the analyses, has in the meantime left academia. Unfortunately, we currently do not have sufficient resources to perform additional analyses.

      While the presented eye movement control analysis makes us confident that our MEG finding was not crucially driven by stimulus-related gaze directions, we agree with the reviewer that we cannot completely exclude that other eye movement-related features could have contributed to our MEG findings. However, we would like to stress that whatever the main source of the observed MEG effect was (shift of the neuronal stimulus representation, (other) features of gaze movement, or shift of the neuronal stimulus representation that leads to systematic gaze movement), our study still provided clear evidence that serial dependence emerged at a later post-encoding stage of object processing in working memory. This central finding of our study is hard to observe with behavioral measures alone and is not affected by the possible effects of eye movements.

      We have slightly modified our conclusion in the Results and Appendix 1. Please see also our response to comment 1 from reviewer 3.

      (2) I am not convinced by the across-participant correlation between attractive biases in neural representations and attractive behavioral biases in estimation reports. One would expect a correlation with the behavioral bias amplitude, which is not borne out. Instead, there is a correlation with behavioral bias width, but no explanation of how bias width should relate to the bias in neural representations. The authors could be more explicit in their arguments about how these metrics would be functionally related, and why there is no correlation with behavioral bias amplitude.

      We are grateful for this suggestion. We correlated the individual neuronal shift with the two individual parameter fits of the behavior shift, i.e., amplitude (a) and tuning width (w). We found a significant correlation between the individual neural bias and the w parameter (r = .70, p = .0246) but not with the a parameter (r = -.35, p = .3258) during the retro-cue period (Appendix 1—figure 1). This indicates that a broader tuning width of the individual bias (as reflected by a smaller w parameter) was associated with a stronger individual neural attraction.

      It is important to note that for the calculation of the neural shift, all trials entered the analysis to increase the signal-to-noise ratio, i.e., it included many trials where current and previous targets were separated by, e.g., 100° or more. These trials were unlikely to produce serial dependence. Subjects with a more broadly tuned serial dependence had more inter-item differences that showed a behavioral attraction and therefore more trials affected by serial dependence that entered the calculation of the neural shift. In contrast, individual differences in the amplitude (a) parameter were most likely too small, and a higher individual amplitude, unlike a broader tuning width, did not increase the number of trials affected by serial dependence. Amplitude differences were therefore unlikely to affect the neural bias strongly enough to produce a significant correlation.
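
      The argument about tuning width can be illustrated with a small sketch. The response does not state the functional form of the behavioral fit; the code below assumes the first-derivative-of-Gaussian curve that is common in the serial dependence literature, parametrized so that a smaller `w` gives a broader curve (as described above). The parameter values, the threshold, and the grid of direction differences are illustrative assumptions, not values from the study.

```python
import math

# Normalisation constant chosen so that `a` equals the peak height of the curve.
C = math.sqrt(2.0) / math.exp(-0.5)


def dog_bias(delta_deg, a, w):
    """Predicted bias (deg) for a previous-minus-current difference (deg)."""
    return delta_deg * a * w * C * math.exp(-((w * delta_deg) ** 2))


def peak_location(w):
    """Direction difference (deg) at which the attractive bias is largest."""
    return 1.0 / (math.sqrt(2.0) * w)


def n_affected(differences, a, w, threshold_deg):
    """Count direction differences whose predicted |bias| exceeds a threshold."""
    return sum(1 for d in differences if abs(dog_bias(d, a, w)) > threshold_deg)


# Illustrative grid of direction differences (assumption for this example).
differences = list(range(-170, 181, 10))

# Same amplitude, different tuning width: the broader curve (smaller w)
# covers more of the grid, so more trials carry a noticeable bias.
broad = n_affected(differences, a=2.0, w=0.02, threshold_deg=0.5)
narrow = n_affected(differences, a=2.0, w=0.05, threshold_deg=0.5)
```

      Under these assumptions, changing `a` rescales the curve without moving its peak, whereas changing `w` changes how many direction differences fall within the attractive range.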

      We have added this explanation to Appendix 1.  

      (3) The sample size (n = 10) is definitely at the lower end of sample sizes in this field. The authors collected two sessions per participant, which partly alleviates the concern. However, given that serial dependencies can be very variable across participants, I believe that future studies should aim for larger sample sizes.

      We want to express our appreciation for raising this issue. We apologize that we did not explicitly explain and justify the choice of the sample size used in our paper, in particular as we had in fact performed a formal a-priori power analysis.

      At the time of the sample size calculation, there were no comparable EEG or MEG studies to inform our power calculation. Thus, we based our calculation merely on the behavioral effect reported in the literature and, in particular, observed in a behavioral study from our lab that included four different experiments with overall more than 100 participants with 1632 trials each (see Fischer et al., 2020), in which the behavioral serial dependence effect (target vs. nontarget) was very robust. Based on the contrast between target and non-target with an effect size of 1.359 in Experiment 1, a power analysis with 80% desired power led to a small, estimated sample size of 6 subjects.

      However, we expected that the detection of the neural signature of this effect would require more participants. Therefore, we based our power calculation on a much smaller behavioral effect, i.e. the modulation of serial dependence by the context-feature congruency that we observed in our previous study (Fischer et al., 2020). In particular, we focused on Experiment 1 of the previous study that used color as the feature for retro-cueing, as we planned to use exactly the same paradigm for the MEG study. In contrast to the serial dependence effect, its modulation by color resulted in a more conservative power estimate: Based on an effect size of 0.856 in that experiment, a sample size of n = 10 should yield a power of 80% with two MEG sessions per subject.

      At the time when we conducted our study, two other studies were published that investigated serial dependence on the neural level. Both studies included a smaller number of data points than our study: Sheehan & Serences (2022) recorded about 840 trials in each of 6 participants, resulting in fewer data points both on the participant and on the trial level. Hajonides et al. (2023) measured 20 participants with 400 trials each, again resulting in fewer datapoints than our study (10 participants with 1022 trials each). Taken together, our a-priori sample size estimation resulted in comparable if not higher power as compared to other similar studies, making us feel confident that the estimated sample was sufficient to yield reliable results.

      We have now included this description and the results of this power analysis in the Materials and Methods section.

      Despite this, we fully agree with the reviewer that our study would profit from higher power. With the knowledge of the results from this study, future projects should attempt to substantially increase the signal-to-noise ratio, in particular by increasing the number of trials, in order to observe, e.g., robust time-resolved effects (see our comments to reviewer 1).

      References:

      Fischer C, Czoschke S, Peters B, Rahm B, Kaiser J, Bledowski C (2020) Context information supports serial dependence of multiple visual objects across memory episodes. Nature Communications 11: 1932.

      Sheehan TC, Serences JT (2022) Attractive serial dependence overcomes repulsive neuronal adaptation. PLOS Biology 20: e3001711.

      Hajonides JE, Van Ede F, Stokes MG, Nobre AC, Myers NE (2023) Multiple and Dissociable Effects of Sensory History on Working-Memory Performance. Journal of Neuroscience 43: 2730–2740.

      (4) It would have been great to see an analysis in source space. As the authors mention in their introduction, different brain areas, such as PPC, mPFC, and dlPFC have been implicated in serial biases. This begs the question of which brain areas contribute to the serial dependencies observed in the current study. For instance, it would be interesting to see whether attractive shifts in current representations and pre-stimulus reactivations of previous stimuli are evident in the same or different brain areas.

      We appreciate this suggestion. As mentioned above, we currently do not have sufficient resources to perform an MEG source analysis.

      Reviewer #3 (Public Review):

      Summary:

      This study identifies the neural source of serial dependence in visual working memory, i.e., the phenomenon that recall from visual working memory is biased towards recently remembered but currently irrelevant stimuli. Whether this bias has a perceptual or post-perceptual origin has been debated for years - the distinction is important because of its implications for the neural mechanism and ecological purpose of serial dependence. However, this is the first study to provide solid evidence based on human neuroimaging that identifies a post-perceptual memory maintenance stage as the source of the bias. The authors used multivariate pattern analysis of magnetoencephalography (MEG) data while observers remembered the direction of two moving dot stimuli. After one of the two stimuli was cued for recall, decoding of the cued motion direction re-emerged, but with a bias towards the motion direction cued on the previous trial. By contrast, decoding of the stimuli during the perceptual stage was not biased.

      Strengths:

      The strengths of the paper are its design, which uses a retrospective cue to clearly distinguish the perceptual/encoding stage from the post-perceptual/maintenance stage, and the rigour of the careful and well-powered analysis. The study benefits from high within-participant power through the use of sensitive MEG recordings (compared to the more common EEG), and the decoding and neural bias analyses are done with care and sophistication, with appropriate controls to rule out confounds.

      Weaknesses:

      A minor weakness of the study is the remaining (but slight) possibility of an eye movement confound. A control analysis shows that participants make systematic eye movements that are aligned with the remembered motion direction during both the encoding and maintenance phases of the task. The authors go some way to show that this eye gaze bias seems unrelated to the decoding of MEG data, but in my opinion do not rule it out conclusively. They merely show that the strength of the gaze bias and the strength of MEG-based decoding/neural bias are uncorrelated across the 10 participants. Therefore, this argument seems to rest on a null result from an underpowered analysis.

      Both our MEG and our eye-movement analyses were sensitive enough to robustly pick up stimulus-related effects, for presented as well as remembered motion directions. When relating both signals to each other by correlating MEG reconstruction strength with gaze direction, we found a null effect, as pointed out by the reviewer. Importantly, there was also a null effect when the shift of the reconstruction (representing our main finding) was correlated with gaze direction. Furthermore, an examination of the individual time courses of gaze direction and individual MEG reconstruction strength revealed that the lack of a relationship between MEG and gaze data did not rest on a singular observation but was present across all time points. Moreover, the correlation varied strongly from time point to time point, revealing its random nature and indicating that there was no hint of a pattern that just failed to reach significance. Taking these observations together, our MEG findings were unlikely to be explained by eye position.

      Nevertheless, we agree with the reviewer that there is a general problem of interpreting a null effect with a limited number of observations (and an analysis approach that focused on one out of many possible features of the gaze movement). Thus, we admit that there is a (slight) possibility that eye movements contributed to the observed MEG effects. This possibility, however, did not affect our novel finding that serial dependence occurred during the post-encoding stage of object processing in working memory.

      Please see also our response to point 1 from reviewer 2.

      Impact:

      This important study contributes to the debate on serial dependence with solid evidence that biased neural representations emerge only at a relatively late post-perceptual stage, in contrast to previous behavioural studies. This finding is of broad relevance to the study of working memory, perception, and decision-making by providing key experimental evidence favouring one class of computational models of how stimulus history affects the processing of the current environment.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Minor concerns:

      The significance statement opens "Our perception is biased towards sensory input from the recent past." This is a semantic point, but it seems a somewhat odd statement, given there is so much debate about whether serial dependence is perceptual vs. decisional, and that the current work indeed claims that it emerges at a late, post-encoding stage.

      Thank you for this point. We agree. “Visual cognition is biased towards sensory input from the recent past.” would be a more appropriate statement. According to the Journal's guidelines, however, the paragraph with the Significance Statement will not be included in the final manuscript.

      It would be preferable for data and code to be available at review so that reviewers might verify some procedural points for clarity.

      Code and preprocessed data used for the presented analyses are now available on OSF via http://osf.io/yjc93/. Due to storage limitations, only the preprocessed MEG data for the main IEM analyses focusing on the current direction are uploaded. For access to additional data, please contact the authors.

      For instance, I could use some clarification on the trial sequence. The methods first say the direction was selected randomly, but then later say each direction occurred equally often, and there were restrictions on the relationships between current and previous trial items. So it seems it couldn't have truly been random direction selection - was the order selected randomly from a predetermined set of possibilities?

      For the S1/S2 stimuli in a trial, the dots moved fully coherently in a direction randomly drawn from a pool of directions between 5° and 355° spaced 10° from one another, therefore avoiding cardinal directions. Across trials, there was a predetermined set of possible differences in motion direction between the current and the previous target. This set included 18 motion direction differences, ranging from -170° to 180°, in steps of 10°. Trial sequences were balanced in a way that each of these differences occurred equally often during a MEG session.
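
      The direction pool and the signed difference between consecutive targets can be written down compactly. The following is a minimal sketch rather than the experiment code; the function names and the sign convention (previous minus current) are assumptions, and the balancing of trial sequences is not reproduced.

```python
import random

# Directions from 5 to 355 deg in 10-deg steps; none falls on a cardinal axis.
DIRECTION_POOL = list(range(5, 356, 10))


def signed_difference(previous_deg, current_deg):
    """Previous minus current direction, wrapped to the interval (-180, 180]."""
    diff = (previous_deg - current_deg) % 360
    return diff - 360 if diff > 180 else diff


def draw_direction(rng):
    """Draw one motion direction uniformly from the pool."""
    return rng.choice(DIRECTION_POOL)


rng = random.Random(0)  # fixed seed, only to make the example reproducible
previous_target = draw_direction(rng)
current_target = draw_direction(rng)
difference = signed_difference(previous_target, current_target)
```

      Because all pool entries are offset by 5° from multiples of 10°, every difference between two pool directions is itself a multiple of 10°.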

      I could also use some additional assurance the sample size (participants or data points) is sufficient for the analysis approach deployed here.

      We performed a formal a-priori power analysis to justify our choice of the sample size. Please see our response to reviewer 2, point 3, where we explained the procedure of the a-priori power analysis in detail. We have now included this description and the results of this power analysis in the Materials and Methods.

      Did you consider a decoding approach, instead of reconstruction, to test what information predominates the signal, in an unbiased way?

      Thank you for this argument. We believe that our analysis approach based on the inverted encoding model is unbiased, since we first assessed, via reconstruction, whether the MEG signal contained information about the presented and remembered motion direction. Only in the next step did we test whether this reconstructed signal showed an offset and, if so, whether this offset was biased towards or away from the previous target. A decoding approach aims to answer classification questions and is not suitable to reveal the actual shifts of the neural information. In our study, we could decode, e.g., the current direction or the previous target, but this would not answer the question of whether and at which stage of object processing the current representation was biased towards the past. Moreover, in a decoding approach to reveal which information predominates in the signal, we would have to classify different options (e.g. current information vs previous), thereby biasing the possible set of results more than in our chosen analysis.

      I think the claim of a "direct" neural signature may come off as an overstatement when the spatial and temporal aspects of the attractive bias are still so coarsely specified here.

      Thank you for pointing this out. We agree that the term “direct neural signature” can be seen as an overstatement when it is interpreted to indicate a narrowly defined activity of a brain region (ideally via “direct” invasive recordings) that reflects serial dependence. Our definition of the term “direct” referred to the observation of an attractive shift in a neural representation of the current target motion direction item towards the previous target. This was in contrast to previous “indirect” evidence for the neural basis of serial dependence based on either repulsive shifts of neural representations that were opposite to the attractive bias in behavior or on a reactivation of previous information in the current trial without presenting evidence for the actual neural shift. With this definition in mind, we consider the title of our study a valid description of our findings.

      Reviewer #2 (Recommendations For The Authors):

      I was wondering why the authors chose a bootstrap test for their neural bias analysis instead of a permutation test, similar to the one they used for their behavioral analysis. As far as I know, bootstrap tests do not provide guaranteed type-1 error rate control. The procedure for the permutation test would be quite straightforward here, randomly permuting the sign of each participant's neural shift and recording the group-average shift in a permutation distribution. This test seems more adequate and more consistent with the behavioral analysis.

      Thank you for this comment. We adopted a resampling approach (bootstrapping) that was similar to that of Ester et al. (2020), who also investigated categorical biases and also applied a reconstruction method (Inverted Encoding Model) to assess the significance of a bias of the reconstructed orientation against zero in a certain direction. The bootstrapping method relied on a) detecting an offset against zero and b) evaluating the robustness of the observed effect across participants. In contrast, a permutation approach, as suggested by the reviewer, assesses whether an empirical neural shift is more extreme than the permutation distribution. The permutation approach seems more suited to assess the magnitude of the shift, which in our study was not a priority. Therefore, we reasoned that bootstrapping for our inference statistics was better suited to assess the direction of the neural shift and its robustness across participants.
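
      The two resampling schemes discussed in this exchange can be sketched side by side. This is an illustration under stated assumptions, not the authors' implementation: the sign-flip test follows the reviewer's description, while the bootstrap shown here (resampling participants with replacement and counting group means at or below zero) is one plausible reading of the procedure. Function names, the number of resamples, and the seed are hypothetical.

```python
import itertools
import random


def sign_flip_p_value(shifts):
    """Exact one-sided sign-flip test: share of sign patterns whose
    group-average shift is at least as large as the observed one."""
    n = len(shifts)
    observed = sum(shifts) / n
    count = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=n):
        total += 1
        permuted = sum(s * x for s, x in zip(signs, shifts)) / n
        if permuted >= observed - 1e-12:  # tolerance for float rounding
            count += 1
    return count / total


def bootstrap_p_value(shifts, n_resamples=10000, seed=0):
    """Share of bootstrap group means (participants resampled with
    replacement) that are at or below zero, i.e. not attractive."""
    rng = random.Random(seed)
    n = len(shifts)
    not_positive = 0
    for _ in range(n_resamples):
        sample = [shifts[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            not_positive += 1
    return not_positive / n_resamples
```

      With ten participants the sign-flip test can enumerate all 2^10 = 1024 sign patterns, so its smallest attainable one-sided p-value is 1/1024.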

      We have added this additional information to the Materials and Methods.

      References:

      Ester EF, Sprague TC, Serences JT (2020) Categorical biases in human occipitoparietal cortex. Journal of Neuroscience 40:917–931.

      The manuscript could be improved by more clearly spelling how the training and testing data were labelled, particularly for the reactivation analyses. If I understood correctly, in the first reactivation analysis the authors train and test on current trial data, but label both training and testing data according to the previous trial's motion direction. In the second analysis, they label the training data according to the current motion direction, but label the testing data according to the previous motion direction. Is that correct?

      Yes, this is correct. Please see also our response to reviewer 1, points 2 and 3, for a detailed description.

      I was surprised to see that the shift in the reconstructed direction is about three times larger than the behavioral attraction bias. Would one not expect these to be comparable in magnitude? It would be helpful to address and discuss this in the discussion section.

      Thank you for pointing this out. We agree with the reviewer that as both measures are expressed in the same unit (degrees of angle), one would expect that their magnitudes should be directly comparable. However, we speculate that these magnitudes inform only about the direction of the bias and their significant difference from zero, thus they operate on different scales and are not directly comparable. For example, Hallenbeck et al. (2022) showed that fMRI-based reconstructed orientation bias and behavioral bias correlated at both the individual and the group level, despite strong magnitude differences. This is in line with our observation and supports the speculation that the magnitudes of neural and behavioral biases operate on different scales and, thus, are not directly comparable.

      We have updated the Discussion accordingly.

      References:

      Hallenbeck GE, Sprague TC, Rahmati M, Sreenivasan KK, Curtis CE (2022) Working memory representations in visual cortex mediate distraction effects. Nature Communications 12: 471.

      Reviewer #3 (Recommendations For The Authors):

      (1) It may be worth showing that the gaze bias towards the current/cued stimulus is not biased towards the previous target. One option might be to run the same analysis pipeline used for the MEG decoding but on the eye-tracking data. Another could be to remove all participants with significant gaze bias, but given the small sample size, this might not be feasible.

      We appreciate this suggestion. However, as mentioned above, we currently do not have sufficient resources to conduct additional analyses on the eye tracking data.

      (2) Minor typo: Figure 3c - bias should be 11.7º, not -11.7º.

      Corrected. Thank you!

      Note on data/code availability: The authors state that preprocessed data and analysis code will be made available on publication, but are not available yet.

      Code and preprocessed data used for the present analyses are now available on OSF via http://osf.io/yjc93/. Due to storage limitations, only the preprocessed MEG data for the main IEM analyses focusing on the current direction are uploaded. For access to additional data, please contact the authors.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study provides valuable information on the mechanism of PepT2 through enhanced-sampling molecular dynamics, backed by cell-based assays, highlighting the importance of protonation of selected residues for the function of a proton-coupled oligopeptide transporter (hsPepT2). The molecular dynamics approaches are convincing, but with limitations that could be addressed in the manuscript, including lack of incorporation of a protonation coordinate in the free energy landscape, possibility of protonation of the substrate, errors with the chosen constant pH MD method for membrane proteins, dismissal of hysteresis emerging from the MEMENTO method, and the likelihood of other residues being affected by peptide binding. Some changes to the presentation could be considered, including a better description of pKa calculations and the inclusion of error bars in all PMFs. Overall, the findings will appeal to structural biologists, biochemists, and biophysicists studying membrane transporters.

      We would like to express our gratitude to the reviewers for providing their feedback on our manuscript, and also for recognising the variety of computational methods employed, the amount of sampling collected and the experimental validation undertaken. Following the individual reviewer comments, as addressed point-by-point below, we have prepared a revised manuscript, but before that we address some of the comments made above in the general assessment:

      • “lack of incorporation of a protonation coordinate in the free energy landscape”.

      We acknowledge that it would of course be highly desirable to treat protonation state changes explicitly and fully coupled to conformational changes. However, at this point in time, evaluating such a free energy landscape is not computationally feasible (especially considering that the non-reactive approach taken here already amounts to almost 1 ms of total sampling time). Previous reports in the literature tend to focus on either simpler systems or a reduced subset of a larger problem. As we were trying to obtain information on the whole transport cycle, we decided to focus here on non-reactive methods.

      • “possibility of protonation of the substrate”.

      The reviewers are correct in pointing out this possibility, which we had not discussed explicitly in our manuscript. Briefly, while we describe a mechanism in which protonation of only protein residues (with an unprotonated ligand) can account for driving all the necessary conformational changes of the transport cycle, there is some evidence for a further intermediate protonation site in our data (as we commented on in the first version of the manuscript as well), which may or may not be the substrate itself. A future explicit treatment of the proton movements through the transporter, once it becomes computationally tractable, will have to include the substrate as a possible protonation site; for the present moment, we have amended our discussion to alert the reader to the possibility that the substrate could be an intermediate to proton transport. This has repercussions for our study of the E56 pKa value, where – if protons reside with a significant population at the substrate C-terminus – our calculated shift in pKa upon substrate binding could be an overestimate, although we would qualitatively expect the direction of shift to be unaffected. However, we also anticipate that treating this potential coupling explicitly would make convergence of any CpHMD calculation impractical to achieve, and thus for now a semi-quantitative conclusion may be all that can be obtained.

      • “errors with the chosen constant pH MD method for membrane proteins”.

      We acknowledge that – as reviewer #1 has reminded us – the AMBER implementation of hybrid-solvent CpHMD is not rigorous for membrane proteins, and as such we have added a cautionary note to our paper. We also explain how the use of the ABFE thermodynamic cycle calculations helps to validate the CpHMD results in a completely orthogonal manner (we have promoted this validation, which was in the supplementary figures, into the main text in the revised version). We therefore remain reasonably confident in the results presented with regard to the reported pKa shift of E56 upon substrate binding, and suggest that if the impact of neglecting the membrane in the implicit-solvent stage of CpHMD is significant, then there is likely an error cancellation when considering shifts induced by the incoming substrate.
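
      The link between binding free energies and a pKa shift that underlies this kind of cross-check can be stated in a few lines. The sketch below uses the generic thermodynamic-cycle relation (the difference in deprotonation free energy between holo and apo states equals the difference in binding free energy between the deprotonated and protonated states); it is not the authors' workflow. The function names, the default temperature, and the units (kcal/mol) are assumptions for the example.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # kcal / (mol K)


def pka_shift_from_binding(dg_bind_protonated, dg_bind_deprotonated,
                           temperature_k=298.15):
    """pKa(holo) - pKa(apo) from the two binding free energies (kcal/mol).

    Closing the thermodynamic cycle gives
    dG_deprot(holo) - dG_deprot(apo) = dG_bind(deprot) - dG_bind(prot),
    and one pKa unit corresponds to RT * ln(10).
    """
    rt_ln10 = GAS_CONSTANT_KCAL * temperature_k * math.log(10)
    return (dg_bind_deprotonated - dg_bind_protonated) / rt_ln10


def protonated_fraction(ph, pka):
    """Henderson-Hasselbalch fraction of the protonated form at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))
```

      In this convention a substrate that binds more favourably to the protonated state yields a positive shift, i.e. a higher pKa in the substrate-bound state.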

      • “dismissal of hysteresis emerging from the MEMENTO method”.

      We have shown in our method design paper how the use of the MEMENTO method drastically reduces hysteresis compared to steered MD for path generation, and find this improvement again for PepT2 in this study. We address reviewer #3’s concern about our presentation on this point by revising our introduction of the MEMENTO method, as detailed in the response below.

      • “the likelihood of other residues being affected by peptide binding”.

      In this study, we have investigated in detail the involvement of several residues in proton-coupled di-peptide transport by PepT2. Short of the potential intermediate protonation site mentioned above, the set of residues we investigate form a minimal set of sorts within which the important driving forces of alternating access can be rationalised.  We have not investigated in substantial detail here the residues involved in holding the peptide in the binding site, as they are well studied in the literature and ligand promiscuity is not the problem of interest here. It remains entirely possible that further processes contribute to the mechanism of driving conformational changes by involving other residues not considered in this paper. We have now made our speculation that an ensemble of different processes may be contributing simultaneously more explicit in our revision, but do not believe any of our conclusions would be affected by this.

      As for the additional suggested changes in presentation, we provide the requested details on the CpHMD analysis. Furthermore, we use the convergence data presented separately in figures S12 and S16 to include error bars on our 1D-reprojections of the 2D-PMFs in figures 3, 4 and 5. (Note that we have opted to not do so in figures S10 and S15 which collate all 1D PMF reprojections for the OCC ↔ OF and OCC ↔ IF transitions in single reference plots, respectively, to avoid overcrowding those necessarily busy figures). We have also changed the colour schemes of these plots in our revision to improve accessibility. We have additionally taken the opportunity to fix some typos and to clarify some other statements throughout the manuscript, beyond the requests from the reviewers.

      Reviewer #1 (Public Review):

      The authors have performed all-atom MD simulations to study the working mechanism of hsPepT2. It is widely accepted that conformational transitions of proton-coupled oligopeptide transporters (POTs) are linked with gating hydrogen bonds and salt bridges involving protonatable residues, whose protonation triggers gate openings. Through unbiased MD simulations, the authors identified extra-cellular (H87 and D342) and intra-cellular (E53 and E622) triggers. The authors then validated these triggers using free energy calculations (FECs) and assessed the engagement of the substrate (Ala-Phe dipeptide). The linkage of substrate release with the protonation of the ExxER motif (E53 and E56) was confirmed using constant-pH molecular dynamics (CpHMD) simulations and cell-based transport assays. An alternating-access mechanism was proposed. The study was largely conducted properly, and the paper was well-organized. However, I have a couple of concerns for the authors to consider addressing.

      We would like to note here that it may be slightly misleading to the reader to state that “The linkage of substrate release with the protonation of the ExxER motif (E53 and E56) was confirmed using constant-pH molecular dynamics (CpHMD) simulations and cell-based transport assays.” The cell-based transport assays confirmed the importance of the extracellular gating trigger residues H87, S321 and D342 (as mentioned in the preceding sentence), not of the substrate-protonation link as this line might be understood to suggest.

      (1) As a proton-coupled membrane protein, the conformational dynamics of hsPepT2 are closely coupled to protonation events of gating residues. Instead of using semi-reactive methods like CpHMD or reactive methods such as reactive MD, where the coupling is accounted for, the authors opted for extensive non-reactive regular MD simulations to explore this coupling. Note that I am not criticizing the choice of methods, and I think those regular MD simulations were well-designed and conducted. But I do have two concerns.

      a) Ideally, proton-coupled conformational transitions should be modelled using a free energy landscape with two or more reaction coordinates (or CVs), with one describing the protonation event and the other describing the conformational transitions. The minimum free energy path then illustrates the reaction progress, such as OCC/H87D342-  →  OCC/H87HD342H →  OF/H87HD342H as displayed in Figure 3.

      We concur with the reviewer that the ideal way of describing the processes studied in our paper would be as higher-dimensional free energy landscapes obtained from a simulation method that can explicitly model proton-transfer processes. Indeed, it would have been particularly interesting and potentially informative with regard to the movement of protons down into the transporter in the OF → OCC → IF sequence of transitions. As we note in our discussion on the H87→E56 proton transfer:

      “This could be investigated using reactive MD or QM/MM simulations (both approaches have been employed for other protonation steps of prokaryotic peptide transporters, see Parker et al. (2017) and Li et al. (2022)).  However, the putative path is very long (≈ 1.7 nm between H87 and E56) and may or may not involve a large number of intermediate protonatable residues, in addition to binding site water. While such an investigation is possible in principle, it is beyond the scope of the present study.” 

      Given that even sampling the proton transfer step itself in an essentially static protein conformation would push the boundaries of what has been achieved in the field, we believe that, with the current state of the art, a fully coupled investigation of large-scale conformational changes and proton-transfer reactions is not yet feasible in a practical time frame. We already note this limitation when we say that:

      “The question of whether proton binding happens in OCC or OF warrants further investigation, and indeed the co-existence of several mechanisms may be plausible here”. 

      Nonetheless, we are actively exploring approaches to treat uptake and movement of protons explicitly for future work.

      In our revision, we have expanded on our discussion of the reasoning behind employing a non-reactive approach and the limitations that imposes on what questions can be answered in this study.

      Without including the protonation as a CV, the authors tried to model the free energy changes from multiple FECs using different charge states of H87 and D342. This is a practical workaround, and the conclusion drawn (the OCC→ OF transition is downhill with protonated H87 and D342) seems valid. However, I don't think the OF states with different charge states (OF/H87D342-, OF/H87HD342-, OF/H87D342H, and OF/H87HD342H) are equally stable, as plotted in Figure 3b. The concern extends to other cases like Figures 4b, S7, S10, S12, S15, and S16. While it may be appropriate to match all four OF states in the free energy plot for comparison purposes, the authors should clarify this to ensure readers are not misled.

      The reviewer is correct in their assessment that the aligning of PMFs in these figures is arbitrary; no relative free energies of the PMFs to each other can be estimated without explicit free energy calculations at least of protonation events at the end state basins. The PMFs in our figures are merely superimposed for illustrating the differences in shape between the obtained profiles in each condition, as discussed in the text, and we now make this clear in the appropriate figure captions.
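      The point about arbitrary alignment can be illustrated with a minimal sketch (not the plotting code used for our figures; the function name, basin limits and profiles are hypothetical). A PMF is only defined up to an additive constant, so shifting each profile to zero in a chosen reference basin allows shapes to be compared but says nothing about the relative stability of the different protonation states.

```python
import numpy as np

def align_pmf(cv, pmf, basin):
    """Shift a 1D PMF so that its minimum inside `basin` = (lo, hi) is zero.

    The shift is a plotting convention only: it carries no information
    about the relative free energy of different protonation states.
    """
    cv = np.asarray(cv, dtype=float)
    pmf = np.asarray(pmf, dtype=float)
    lo, hi = basin
    mask = (cv >= lo) & (cv <= hi)
    if not mask.any():
        raise ValueError("basin contains no CV grid points")
    return pmf - pmf[mask].min()

# Two hypothetical profiles that differ only by an unknown constant offset
cv = np.linspace(0.0, 1.0, 11)
profile_a = 10.0 * (cv - 0.2) ** 2
profile_b = profile_a + 7.5  # offset is not determined by umbrella sampling

aligned_a = align_pmf(cv, profile_a, basin=(0.0, 0.4))
aligned_b = align_pmf(cv, profile_b, basin=(0.0, 0.4))
```

      After alignment the two hypothetical profiles coincide, which is exactly why superimposed PMFs should be read for their shape and not for their vertical offset.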

      b) Regarding the substrate impact, it appears that the authors assumed fixed protonation states. I am afraid this is not necessarily the case. Variations in PepT2 stoichiometry suggest that substrates likely participate in proton transport, like the Phe-Ala (2:1) and Phe-Gln (1:1) dipeptides mentioned in the introduction. And it is not rigorous to assume that the N- and C-termini of a peptide do not protonate/deprotonate when transported. I think the authors should explicitly state that the current work and the proposed mechanism (Figure 8) are based on the assumption that the substrates do not uptake/release proton(s).

      This is indeed an assumption inherent in the current work. While we do “speculate that the proton movement processes may happen as an ensemble of different mechanisms, and potentially occur contemporaneously with the conformational change” we do not in the previous version indicate explicitly that this may involve the substrate. We make clear the assumption and this possibility in the revised version of our paper. Indeed, as we discuss, there is some evidence in our PMFs of an additional protonation site not considered thus far, which may or may not be the substrate. We now make note of this point in the revised manuscript.

      As for what information can be drawn from the given experimental stoichiometries, we note in our paper that “a 2:1 stoichiometry was reported for the neutral di-peptide D-Phe-L-Ala and 3:1 for anionic D-Phe-L-Glu. (Chen et al., 1999) Alternatively, Fei et al. (1999) have found 1:1 stoichiometries for either of D-Phe-L-Gln (neutral), D-Phe-L-Glu (anionic), and D-Phe-L-Lys (cationic).” 

      We do not assume that it is our place to arbitrate among the apparent discrepancies in the experimental data here, although we believe that our assumed 2:1 stoichiometry is additionally “motivated also by our computational results that indicate distinct and additive roles played by two protons in the conformational cycle mechanism”.

      (2) I have more serious concerns about the CpHMD employed in the study.

      a) The CpHMD in AMBER is not rigorous for membrane simulations. The underlying generalized Born model fails to consider the membrane environment when updating charge states. In other words, the CpHMD places a membrane protein in a water environment to judge if changes in charge states are energetically favorable. While this might not be a big issue for peripheral residues of membrane proteins, it is likely unphysical for internal residues like the ExxER motif. As I recall, the developers have never used the method to study membrane proteins themselves. The only CpHMD variant suitable for membrane proteins is the membrane-enabled hybrid-solvent CpHMD in CHARMM. While I do not expect the authors to redo their CpHMD simulations, I do hope the authors recognize the limitations of their method.

      We discuss the limitations of the AMBER CpHMD implementation in the revised version. However, despite that, we believe we have in fact provided sufficient grounds for our conclusion that substrate binding affects ExxER motif protonation in the following way.

      In addition to CpHMD simulations, we establish the same effect via ABFE calculations, where the substrate affinity is different at the E56 deprotonated vs protonated protein. This was figure S20 before, though in the revised version we have moved this piece of validation into a new panel of figure 6 in the main text, since it becomes more important with the CpHMD membrane problem in mind. Since the ABFE calculations are conducted with an all-atom representation of the lipids and the thermodynamic cycle closes well, it would appear that if the chosen CpHMD method has a systematic error of significant magnitude for this particular membrane protein system, there may be the benefit of error cancellation. While the calculated absolute pKa values may not be reliable, the difference made by substrate binding appears to be so, as judged by the orthogonal ABFE technique.
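      The thermodynamic-cycle argument can be sketched as follows (a minimal illustration, not our analysis code; the function name and the numerical values are hypothetical). Closing the cycle between the apo/holo and protonated/deprotonated states gives ΔpKa = (ΔG_bind,deprot − ΔG_bind,prot) / (RT ln 10), so a substrate that binds more strongly to the protonated protein raises the pKa of the titratable residue.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol K)

def delta_pka(dg_bind_deprot, dg_bind_prot, temperature=298.15):
    """pKa(holo) - pKa(apo) from two binding free energies (kcal/mol).

    dg_bind_deprot: binding free energy to the deprotonated protein
    dg_bind_prot:   binding free energy to the protonated protein
    """
    rt_ln10 = R_KCAL * temperature * math.log(10.0)
    return (dg_bind_deprot - dg_bind_prot) / rt_ln10

# Hypothetical numbers: binding is 1.5 kcal/mol more favourable
# when the residue is protonated, so the pKa shifts upwards.
shift = delta_pka(dg_bind_deprot=-6.0, dg_bind_prot=-7.5)
print(f"pKa shift on binding: {shift:+.2f} units")
```

      Only the difference between the two binding free energies enters, which is why this estimate is independent of any systematic error in the absolute pKa values.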

      Although the reviewer does “not expect the authors to redo their CpHMD simulations”, we consider that it may be helpful to the reader to share in this response some results from trials using the continuous, all-atom constant pH implementation that has recently become available in GROMACS (Aho et al 2022, https://pubs.acs.org/doi/10.1021/acs.jctc.2c00516) and can be used rigorously with membrane proteins, given its all-atom lipid representation.

      Unfortunately, when trying to titrate E56 in this CpHMD implementation, we found few protonation-state transitions taking place, and the system often got stuck in protonation state–local conformation coupled minima (which need to interconvert through rearrangements of the salt bridge network involving slow side-chain dihedral rotations in E53, E56 and R57). Author response image 1 shows this for the apo OF state, and Author response image 2 shows how noisy attempts at pKa estimation from this data turn out to be, necessitating the use of a hybrid-solvent method.

      Author response image 1.

      All-atom CpHMD simulations of apo-OF PepT2. Red indicates protonated E56, blue is deprotonated.

      Author response image 2.

      Difficulty in calculating the E56 pKa value from the noisy all-atom CpHMD data shown in Author response image 1.

      b) It appears that the authors did not make the substrate (Ala-Phe dipeptide) protonatable in holo-simulations. This oversight prevents a complete representation of ligand-induced protonation events, particularly given that the substrate ion pairs with hsPepT2 through its N- & C-termini. I believe it would be valuable for the authors to acknowledge this potential limitation. 

      In this study, we implicitly assumed from the outset that the substrate does not get protonated, which, as in our response to the comment above, we now acknowledge explicitly. This potential limitation for the available mechanisms for proton transfer also applies to our investigation of the ExxER protonation states. In particular, a semi-grand canonical ensemble that takes into account the possibility of substrate C-terminus protonation may also sample states in which the substrate is protonated and oriented away from R57, thus leaving the ExxER salt bridge network in an apo-like state. The consequence would be that while the direction of the shift in the E56 pKa value will be the same, our CpHMD may overestimate its magnitude. It would thus be interesting to make the C-terminus protonatable to obtain better quantitative estimates of the E56 pKa shift (as is indeed true in general for any other protonatable residue in the protein, though the effects are usually assumed to be negligible). We do note, however, that convergence of the CpHMD simulations would be much harder if the slow degree of freedom of substrate reorientation (which in our experience takes 10s to 100s of nanoseconds in this binding pocket) needs to be implicitly equilibrated upon protonation state transitions. We discuss such considerations in the revised paper.

      Reviewer #2 (Public Review):

      This is an interesting manuscript that describes a series of molecular dynamics studies on the peptide transporter PepT2 (SLC15A2). They examine, in particular, the effect on the transport cycle of protonation of various charged amino acids within the protein. They then validate their conclusions by mutating two of the residues that they predict to be critical for transport in cell-based transport assays. The study suggests a series of protonation steps that are necessary for transport to occur in PepT2. Comparison with bacterial proteins from the same family shows that while the overall architecture of the proteins and likely mechanism are similar, the residues involved in the mechanism may differ. 

      Strengths: 

      This is an interesting and rigorous study that uses various state-of-the-art molecular dynamics techniques to dissect the transport cycle of PepT2 with nearly 1ms of sampling. It gives insight into the transport mechanism, investigating how the protonation of selected residues can alter the energetic barriers between various states of the transport cycle. The authors have, in general, been very careful in their interpretation of the data. 

      Weaknesses: 

      Interestingly, they suggest that there is an additional protonation event that may take place as the protein goes from occluded to inward-facing but they have not identified this residue.

      We have indeed suggested that there may be an additional protonation site involved in the conformational cycle that we have not been able to capture, which – as we discuss in our paper – might be indicated by the shapes of the OCC ↔ IF PMFs given in Figure S15. One possibility is for this to be the substrate itself (see the response to reviewer #1 above) though within the scope of this study the precise pathway by which protons move down the transporter and the exact ordering of conformational change and proton transfer reactions remains a (partially) open question. We acknowledge this, denote it with question marks in the mechanistic overview we give in Figure 8 and also “speculate that the proton movement processes may happen as an ensemble of different mechanisms, and potentially occur contemporaneously with the conformational change”.

      Some things are a little unclear. For instance, where does the state that they have defined as occluded sit on the diagram in Figure 1a? - is it truly the occluded state as shown on the diagram or does it tend to inward- or outward-facing?

      Figure 1a is a simple schematic overview intended to show which structures of PepT2 homologues are available to use in simulations. This was not meant to be a quantitative classification of states. Nonetheless, we can note that the OCC state we derived has extra- and intracellular gate opening distances (as measured by the simple CVs defined in the methods and illustrated in Figure 2a) that indicate full gate closure at both sides. In particular, although it was derived from the IF state via biased sampling, the intracellular gate opening distance in the OCC state used for our conformational change enhanced sampling was comparable to that of the OF state (i.e., full closure of the gate), see Figure S2b and the grey bars therein. Therefore, we would schematically classify the OCC state to lie at the center of the diagram in Figure 1a. Furthermore, it is largely stable over triplicates of 1 μs-long unbiased MD, where in 2/3 replicates the gates remain stable, and in the remaining replicate there is partial opening of the intracellular gate (as shown in Figure 2 b/c under the “apo standard” condition). We comment on this in the main text by saying that “The intracellular gate, by contrast, is more flexible than the extracellular gate even in the apo, standard protonation state”, and link it to the lower barrier for transition to IF than to OF. We did this by saying that “As for the OCC↔OF transitions, these results explain the behaviour we had previously observed in the unbiased MD of Figure 2c.” We acknowledge this was not sufficiently clear and have added details to the latter sentence to help clarify better the nature of the occluded state.

      The pKa calculations and their interpretation are a bit unclear. Firstly, it is unclear whether they are using all the data in the calculations of the histograms, or just selected data and if so on what basis was this selection done. Secondly, they dismiss the pKa calculations of E53 in the outward-facing form as not being affected by peptide binding but say that E56 is when there seems to be a similar change in profile in the histograms.

      In our manuscript, we have provided two distinct analyses of the raw CpHMD data. Firstly, we analysed the data by the replicates in which our simulations were conducted (Figure 6, shown as bar plots with mean from triplicates ± standard deviation), where we found that only the effect on E56 protonation was distinct, lying beyond the combined error bars. This analysis uses the full amount of sampling conducted for each replicate. However, since we found that the range of pKa values estimated from 10 ns/window chunks was larger than the error bars obtained from the replicate analysis (Figures S17 and S18), we sought to verify our conclusion by pooling all chunk estimates and plotting histograms (Figure S19). We recover from those the effect of substrate binding on the E56 protonation state in both the OF and OCC states. However, as the reviewer has pointed out (something we did not discuss in our original manuscript), there is a shift in the pKa of E53 in the OF state only. In fact, the trend is also apparent in the replicate-based analysis of Figure 6, though here the larger error bars overlap. In our revision, we added more details of these analyses for clarity (including more detailed figure captions regarding the data used in Figure 6) as well as a discussion of the partial effect on the E53 pKa value. 
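      The difference between the two analyses can be sketched as follows (a minimal illustration under stated assumptions, not the scripts used in the study; the function names, the linearised Hill fit and all numbers are hypothetical). A pKa is obtained from the deprotonated fractions across the pH windows, once per chunk, and the chunk estimates are then summarised either per replicate (mean ± standard deviation) or pooled.

```python
import numpy as np

def fit_pka(ph, frac_deprot, eps=1e-3):
    """Estimate pKa and Hill coefficient from a titration curve.

    Uses the linearised Hill equation
        log10(f / (1 - f)) = n * (pH - pKa),
    where f is the deprotonated fraction in each pH window.
    """
    ph = np.asarray(ph, dtype=float)
    f = np.clip(np.asarray(frac_deprot, dtype=float), eps, 1.0 - eps)
    y = np.log10(f / (1.0 - f))
    hill_n, intercept = np.polyfit(ph, y, 1)
    return -intercept / hill_n, hill_n

def summarize(chunk_pkas_by_replicate):
    """Return (mean, SD) over replicate means and the pooled (min, max)."""
    rep_means = np.array([np.mean(c) for c in chunk_pkas_by_replicate])
    pooled = np.concatenate(
        [np.asarray(c, dtype=float) for c in chunk_pkas_by_replicate]
    )
    return rep_means.mean(), rep_means.std(ddof=1), pooled.min(), pooled.max()

# Synthetic titration curve with a known pKa, to check the fit
ph_windows = np.arange(4.0, 9.5, 0.5)
synthetic = 1.0 / (1.0 + 10.0 ** (6.5 - ph_windows))
pka_est, hill_est = fit_pka(ph_windows, synthetic)

# Hypothetical chunk estimates: identical replicate means, wide chunk spread
mean, sd, lo, hi = summarize([[6.0, 7.0], [6.5, 6.5], [7.0, 6.0]])
```

      In the hypothetical example the replicate means agree exactly while the individual chunks span a full pKa unit, which is the situation that motivates inspecting the pooled histograms in addition to the replicate-based error bars.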

      We do not believe, however, that our key conclusions are negatively affected. If anything, a further effect on the E53 pKa which we had not previously commented on (since we saw the evidence as weaker, pertaining to only one conformational state) would strengthen the case for an involvement of the ExxER motif in ligand coupling.

      Reviewer #3 (Public Review):

      Summary: 

      Lichtinger et al. have used an extensive set of molecular dynamics (MD) simulations to study the conformational dynamics and transport cycle of an important member of the proton-coupled oligopeptide transporters (POTs), namely SLC15A2 or PepT2. This protein is one of the most well-studied mammalian POT transporters that provides a good model with enough insight and structural information to be studied computationally using advanced enhanced sampling methods employed in this work. The authors have used microsecond-level MD simulations, constant-pH MD, and alchemical binding free energy calculations along with cell-based transport assay measurements; however, the most important part of this work is the use of enhanced sampling techniques to study the conformational dynamics of PepT2 under different conditions. 

      The study attempts to identify links between conformational dynamics and chemical events such as proton binding, ligand-protein interactions, and intramolecular interactions. The ultimate goal is of course to understand the proton-coupled peptide and drug transport by PepT2 and homologous transporters in the solute carrier family. 

      Some of the key results include:

      (1) Protonation of H87 and D342 initiates the occluded (Occ) to the outward-facing (OF) state transition. 

      (2) In the OF state, through engaging R57, substrate entry increases the pKa value of E56 and thermodynamically facilitates the movement of protons further down. 

      (3) E622 is not only essential for peptide recognition but also its protonation facilitates substrate release and contributes to the intracellular gate opening. In addition, cell-based transport assays show that mutation of residues such as H87 and D342 significantly decreases transport activity as expected from simulations. 

      Strengths: 

      (1) This is an extensive MD-based study of PepT2, which is beyond the typical MD studies both in terms of the sheer volume of simulations as well as the advanced methodology used. The authors have not limited themselves to one approach and have appropriately combined equilibrium MD with alchemical free energy calculations, constant-pH MD, and geometry-based free energy calculations. Each of these 4 methods provides a unique insight regarding the transport mechanism of PepT2.

      (2) The authors have not limited themselves to computational work and have performed experiments as well. The cell-based transport assays clearly establish the importance of the residues that have been identified as significant contributors to the transport mechanism using simulations.

      (3) The conclusions made based on the simulations are mostly convincing and provide useful information regarding the proton pathway and the role of important residues in proton binding, protein-ligand interaction, and conformational changes.

      Weaknesses: 

      (1) Some of the statements made in the manuscript are not convincing and do not abide by the standards that are mostly followed in the manuscript. For instance, on page 4, it is stated that "the K64-D317 interaction is formed in only ≈ 70% of MD frames and therefore is unlikely to contribute much to extracellular gate stability." I do not agree that 70% is negligible. Particularly, Figure S3 does not include the time series so it is not clear whether the 30% of the time where the salt bridge is broken is in the beginning or the end of simulations. For instance, it is likely that the salt bridge is not initially present and then it forms very strongly. Of course, this is just one possible scenario but the point is that Figure S3 does not rule out the possibility of a significant role for the K64-D317 salt bridge. 

      The reviewer is right to point out that the statement and Figure S3 as they were do not adequately support our decision to exclude the K64-D317 salt-bridge in our further investigations. The violin plot shown in Figure S3, visualised as pooled data from unbiased 1 μs triplicates, did indeed not rule out a scenario where the salt bridge only formed late in our simulations (or only in some replicates), but then is stable. Therefore, in our revision, we include the appropriate time-series of the salt bridge distances, showing how K64-D317 is initially stable but then falls apart in replicate 1, and is transiently formed and disengaged across the trajectories in replicates 2 and 3. We have also remade the data for this plot as we discovered a bug in the relevant analysis script that meant the D170-K642 distance was not calculated accurately. The results are however almost identical, and our conclusions remain.
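      Why a pooled occupancy alone cannot settle this question can be illustrated with a minimal sketch (not the analysis script used in the study; the 0.4 nm cutoff, the function names and the synthetic distance traces are hypothetical). Two trajectories with the same overall occupancy of about 70% can have very different time courses, which only a time-resolved analysis reveals.

```python
import numpy as np

def occupancy(distances, cutoff=0.4):
    """Fraction of frames in which the salt-bridge distance is below cutoff (nm)."""
    return float(np.mean(np.asarray(distances, dtype=float) < cutoff))

def block_occupancy(distances, n_blocks=10, cutoff=0.4):
    """Occupancy in consecutive time blocks, to expose drifts along the run."""
    blocks = np.array_split(np.asarray(distances, dtype=float), n_blocks)
    return np.array([occupancy(b, cutoff) for b in blocks])

frames = np.arange(1000)
# Hypothetical replicate A: bridge formed for the first 70% of frames, then lost
dist_a = np.where(frames < 700, 0.3, 0.8)
# Hypothetical replicate B: bridge forms and breaks transiently throughout
dist_b = np.where(frames % 10 < 7, 0.3, 0.8)

print(occupancy(dist_a), occupancy(dist_b))  # identical pooled occupancy
print(block_occupancy(dist_a))               # stable, then broken
print(block_occupancy(dist_b))               # uniformly intermittent
```

      Both synthetic traces give the same pooled value, but the block-wise view separates a bridge that is lost for good from one that flickers, which is the distinction the added time series are intended to show.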

      (2) Similarly, on page 4, it is stated that "whether by protonation or mutation - the extracellular gate only opens spontaneously when both the H87 interaction network and D342-R206 are perturbed (Figure S5)." I do not agree with this assessment. The authors need to be aware of the limitations of this approach. Consider "WT H87-prot" and "D342A H87-prot": when D342 residue is mutated, in one out of 3 simulations, we see the opening of the gate within 1 us. When D342 residue is not mutated we do not see the opening in any of the 3 simulations within 1 us. It is quite likely that if rather than 3 we have 10 simulations or rather than 1 us we have 10 us simulations, the 0/3 to 1/3 changes significantly. I do not find this argument and conclusion compelling at all.

      If the conclusions were based on that alone, then we would agree.  However, this section of work covers merely the observations of the initial unbiased simulations which we go on to test/explore with enhanced sampling in the rest of the paper, and which then lead us to the eventual conclusions.

      Figure S5 shows the results from triplicate 1 μs-long trajectories as violin-plot histograms of the extracellular gate opening distance, also indicating the first and final frames of the trajectories as connected by an arrow for orientation – a format we chose for intuitively comparing 48 trajectories in one plot. The reviewer reads the plot correctly when they analyse the “WT H87-prot” vs “D342A H87-prot” conditions. In the former case, no spontaneous opening in unbiased MD is taking place, whereas when D342 is mutated to alanine in addition to H87 protonation, we see spontaneous transition in 1 out of 3 replicates.  However, the reviewer does not seem to interpret the statement in question in our paper (“the extracellular gate only opens spontaneously when both the H87 interaction network and D342-R206 are perturbed”) in the way we intended it to be understood. We merely want to note here a correlation in the unbiased dataset we collected at this stage, and indeed the one spontaneous opening in the case comparison picked out by the reviewer is in the condition where both the H87 interaction network and D342-R206 are perturbed. In noting this we do not intend to make statistically significant statements from the limited dataset. Instead, we write that “these simulations show a large amount of stochasticity and drawing clean conclusions from the data is difficult”. We do however stand by our assessment that from this limited data we can “already appreciate a possible mechanism where protons move down the transporter pore” – a hypothesis we investigate more rigorously with enhanced sampling in the rest of the paper. We have revised the section in question to make clearer that the unbiased MD is only meant to give an initial hypothesis here to be investigated in more detail in the following sections. In doing so, we also incorporate, as we had not done before, the case (not picked out by the reviewer here but concerning the same figure) of S321A & H87 prot. In the third replicate, this shows partial gate opening towards the end of the unbiased trajectory (despite D342 not being affected), highlighting further the stochastic nature that makes even clear correlative conclusions difficult to draw.
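      The limited statistical weight of a 0/3 versus 1/3 comparison can be made concrete with a minimal sketch (an illustration only, not an analysis from the paper; the function name is hypothetical and Fisher's exact test is our choice for the example). For such small counts the two-sided p-value is 1, consistent with treating the unbiased runs as hypothesis-generating only.

```python
from math import comb

def fisher_two_sided(k1, n1, k2, n2):
    """Two-sided Fisher exact p-value for k1/n1 versus k2/n2 events.

    Sums the hypergeometric probabilities of all tables (with fixed
    margins) that are no more likely than the observed one.
    """
    total_k = k1 + k2
    total_n = n1 + n2
    denom = comb(total_n, n1)

    def prob(x):
        return comb(total_k, x) * comb(total_n - total_k, n1 - x) / denom

    lo = max(0, n1 - (total_n - total_k))
    hi = min(n1, total_k)
    p_obs = prob(k1)
    # Small tolerance so that ties with the observed table are included
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

# Gate opening observed in 0 of 3 versus 1 of 3 replicates
p_value = fisher_two_sided(0, 3, 1, 3)
print(p_value)
```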

      (3) While the MEMENTO methodology is novel and interesting, the method is presented as flawless in the manuscript, which is not true at all. It is stated on Page 5 with regards to the path generated by MEMENTO that "These paths are then by definition non-hysteretic." I think this is too big of a claim to say the paths generated by MEMENTO are non-hysteretic by definition. This claim is not even mentioned in the original MEMENTO paper. What is mentioned is that linear interpolation generates a hysteresis-free path by definition. There are two important problems here: (a) MEMENTO uses the linear interpolation as an initial step but modifies the intermediates significantly later so they are no longer linearly interpolated structures and thus the path is no longer hysteresis-free; (b) a more serious problem is the attribution of by-definition hysteresis-free features to the linearly interpolated states. This is based on conflating the hysteresis-free and unique concepts. The hysteresis in MD-based enhanced sampling is related to the presence of barriers in orthogonal space. For instance, one may use a non-linear interpolation of any type and get a unique pathway, which could be substantially different from the one coming from the linear interpolation. None of these paths will be hysteresis-free necessarily once subjected to MD-based enhanced sampling techniques.

      We certainly do not intend to claim that the MEMENTO method is flawless. The concern the reviewer raises around the statement "These paths are then by definition non-hysteretic" is perhaps best addressed by a clarification of the language used and considering how MEMENTO is applied in this work. 

      Hysteresis in the most general sense denotes the dependence of a system on its history, or – more specifically – the lagging behind of the system state with regards to some physical driver (for example the external field in magnetism, whence the term originates). In the context of biased MD and enhanced sampling, hysteresis commonly denotes the phenomenon where a path created by a biased dynamics method along a certain collective variable lags behind in phase space in slow orthogonal degrees of freedom (see Figure 1 in Lichtinger and Biggin 2023, https://doi.org/10.1021/acs.jctc.3c00140). When used to generate free energy profiles, this can manifest as starting state bias, where the conformational state that was used to seed the biased dynamics appears lower in free energy than alternative states. Figure S6 shows this effect on the PepT2 system for both steered MD (heavy atom RMSD CV) + umbrella sampling (tip CV) and metadynamics (tip CV). There is, in essence, a coupled problem: without an appropriate CV (which we did not have to start with here), path generation that is required for enhanced sampling displays hysteresis, but the refinement of CVs is only feasible when paths connecting the true phase space basins of the two conformations are available. MEMENTO helps solve this issue by reconstructing protein conformations along morphing paths which perform much better than steered MD paths with respect to giving consistent free energy profiles (see Figure S7 and the validation cases in the MEMENTO paper), even if the same CV is used in umbrella sampling. 

      There are still differences between replicates in those PMFs, indicating slow conformational flexibility propagated from end-state sampling through MEMENTO. We use this to refine the CVs further with dimensionality reduction (see the Method section and Figure S8), before moving to 2D-umbrella sampling (figure 3). Here, we think, the reviewer’s point has merit. The MEMENTO paths are ‘non-hysteretic by definition’ with respect to given end states in the sense that they connect (by definition) the correct conformations at both end-states (unlike steered MD), which in enhanced sampling manifests as the absence of the strong starting-state bias we had previously observed (Figure S7 vs S6). They are not, however, hysteresis-free with regard to how representative of the end-state conformational flexibility the structures given to MEMENTO really were, which is where the iterative CV design and combination of several MEMENTO paths in 2D-PMFs comes in. 

      We also cannot make a direct claim about whether in the transition region the MEMENTO paths might be separated from the true (lower free energy) transition paths by slow orthogonal degrees of freedom, which may conceivably result in overestimated barrier heights separating two free energy basins. We cannot guarantee that this is not the case, but neither in our MEMENTO validation examples nor in this work have we encountered any indications of a problem here.

      We hope that the reviewer will be satisfied by our revision, where we replace the wording in question by a statement that the MEMENTO paths do not suffer from hysteresis that is otherwise incurred as a consequence of not reaching the correct target state in the biased run (in some orthogonal degrees of freedom).

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors): 

      Figure S1: it would be useful to label the panels.

      We have now done this.

      At the bottom of page 4, it is written that "the extracellular gate only opens spontaneously when both the H87 interaction network and D342-R206 are perturbed (Figure S5)." But it is hard to interpret that from the figure.  

      See also our response to reviewer #3. We have revised the wording of this statement, and also highlight in Figure S5 the crucial runs we are referring to, in order to make them easier to discern.

      At the bottom of page 5, and top of page 6, there is a lot of "other" information shown, which is inserted for the record - this is a bit glossed over and hard to follow.

      The “other” information refers to further conditions we had calculated PMFs for and that gave some insight, but which were secondary for drawing our key conclusions. We thank the reviewer for their feedback that this section needs clarification. We have revised this paragraph to make it easier to follow and to better highlight the conclusions we draw from the data.

      In Figure 7 it looks as though the asterisks have shifted.

      We are indebted to the reviewer for spotting this error: the asterisks are indeed shifted one bar to the right of their intended position. The revised version fixes this issue.

      Reviewer #3 (Recommendations For The Authors):

      Minor points: In Figure 1a, The 7PMY label and arrow are slightly misplaced.

      Figure 1a is a schematic diagram to show the available structures of PepT2 homologues (see also the response to reviewer #2 above). The 7PMY label placement is intentional to indicate a partially occluded inwards-facing state. As we write in the figure caption: “Intermediate positions between states indicate partial gate opening”.

    1. Author Response

      The following is the authors’ response to the latest reviews.

      A revised version of the manuscript models "slope-based" excitability changes in addition to "threshold-based" changes. This serves to address the above concern that as constructed here changes in excitability threshold are not distinguishable from changes in input. However, it remains unclear what the model would do should only a subset of neurons receive a given, fixed input. In that case, are excitability changes sufficient to induce drift? This remains an important question that is not addressed by the paper in its current form.

      Thank you for this important point. In the simulation of two memories (Fig. S6), we stimulated half of the neural population for each of the two memories. We therefore also showed that drift happens when only a subset of neurons is stimulated.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Current experimental work reveals that brain areas implicated in episodic and spatial memory have a dynamic code, in which activity representing familiar events/locations changes over time. This paper shows that such reconfiguration is consistent with underlying changes in the excitability of cells in the population, which ties these observations to a physiological mechanism.

      Delamare et al. use a recurrent network model to consider the hypothesis that slow fluctuations in intrinsic excitability, together with spontaneous reactivations of ensembles, may cause the structure of the ensemble to change, consistent with the phenomenon of representational drift. The paper focuses on three main findings from their model: (1) fluctuations in intrinsic excitability lead to drift, (2) this drift has a temporal structure, and (3) a readout neuron can track the drift and continue to decode the memory. This paper is relevant and timely, and the work addresses questions of both a potential mechanism (fluctuations in intrinsic excitability) and purpose (time-stamping memories) of drift.

      The model used in this study consists of a pool of 50 all-to-all recurrently connected excitatory neurons with weights changing according to a Hebbian rule. All neurons receive the same input during stimulation, as well as global inhibition. The population has heterogeneous excitability, and each neuron's excitability is constant over time apart from a transient increase on a single day. The neurons are divided into ensembles of 10 neurons each, and on each day, a different ensemble receives a transient increase in the excitability of each of its neurons, with each neuron experiencing the same amplitude of increase. Each day for four days, repetitions of a binary stimulus pulse are applied to every neuron.
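      The excitability protocol summarized above can be sketched as follows. This is a minimal illustration, not the authors' code: the pool size, ensemble size, and number of days come from the description above, while the uniform baseline distribution, the function name `excitability_schedule`, and the rule that day d boosts ensemble d are assumptions made for the example.

```python
import random

N_NEURONS = 50       # pool size given in the review
ENSEMBLE_SIZE = 10   # neurons per ensemble given in the review
N_DAYS = 4           # stimulation days given in the review


def excitability_schedule(amplitude, seed=0):
    """Return (baseline, schedule) where schedule[day][neuron] is the excitability.

    Assumption: baseline excitability is drawn uniformly from [0, 1);
    the manuscript's actual distribution is not reproduced here.
    """
    rng = random.Random(seed)
    baseline = [rng.random() for _ in range(N_NEURONS)]
    schedule = []
    for day in range(N_DAYS):
        row = list(baseline)
        # Assumption: on day d, ensemble d (a contiguous block) is boosted.
        for i in range(day * ENSEMBLE_SIZE, (day + 1) * ENSEMBLE_SIZE):
            row[i] += amplitude  # same amplitude for every neuron in the ensemble
        schedule.append(row)
    return baseline, schedule
```

      With 50 neurons and four days, one block of 10 neurons never receives a boost under this assumed assignment.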

      The modeling choices focus on the parameter of interest (the excitability), and other details are generally kept as straightforward as possible. That said, I wonder if certain aspects may be overly simple. The extent of the work already performed, however, does serve the intended purpose, and so I think it would be sufficient for the authors to comment on these choices rather than to take more space in this paper to actually implement these choices. What might happen were more complex modeling choices made? What is the justification for the choices that are made in the present work?

      The two specific modeling choices I question are (1) the excitability dynamics and (2) the input stimulus. The ensemble-wide synchronous and constant-amplitude excitability increase, followed by a return to baseline, seems to be a very simplified picture of the dynamics of intrinsic excitability. At the very least, justification for this simplified picture would benefit the reader, and I would be interested in the authors' speculation about how a more complex and biologically realistic dynamics model might impact the drift in their network model. Similarly, the input stimulus being binary means that, on the single-neuron level, the only type of drift that can occur is a sort of drop-in/drop-out drift; this choice excludes the possibility of a neuron maintaining significant tuning to a stimulus but changing its preferred value. How would the use of a continuous input variable influence the results?

      (1) In our model, neurons tend to compete for allocation to the memory ensemble: neurons with higher excitability tend to be preferentially allocated and neurons with lower excitability do not respond to the stimulus. Because relative, but not absolute, excitability biases this competition, we suggest that the exact distribution of excitability would not impact the results qualitatively. On the other hand, the results might vary if excitability were considered dependent on the activity of the neurons, as previously reported experimentally (Cai 2016, Rachid 2016, Pignatelli 2019). An increase in excitability following neural activity might induce higher correlation among ensembles on consecutive days, decreasing the drift.

      (2) We thank the reviewer for this very good point. Indeed, two recent studies (Geva 2023, Khatib 2023) have highlighted distinct mechanisms for a drift of the mean firing rate and the tuning curve. We extended the last part of the discussion to include this point: “Finally, we intended to model drift in the firing rates, as opposed to a drift in the tuning curve of the neurons. Recent studies suggest that drifts in the mean firing rate and tuning curve arise from two different mechanisms [33, 34]. Experience drives a drift in neurons’ tuning curve while the passage of time drives a drift in neurons’ firing rate. In this sense, our study is consistent with these findings by providing a possible mechanism for a drift in the mean firing rates of the neurons driven by a dynamical excitability. Our work suggests that drift can depend on any experience having an impact on excitability dynamics such as exercise as previously shown experimentally [9, 35] but also neurogenesis [9, 31, 36], sleep [37] or increase in dopamine level [38]”

      Result (1): Fluctuations in intrinsic excitability induce drift

      The two choices highlighted above appear to lead to representations that never recruit the neurons in the population with the lowest baseline excitability (Figure 1b: it appears that only 10 neurons ever show high firing rates) and produce networks with very strong bidirectional coupling between this subset of neurons and weak coupling elsewhere (Figure 1d). This low recruitment rate may not necessarily be problematic, but it stands out as a point that should at least be commented on. The fact that only 10 neurons (20% of the population) are ever recruited in a representation also raises the question of what would happen if the model were scaled up to include more neurons.

      This is a very good point. To test how the model depends on the network size, we plotted the drift index against the size of the ensemble. With this current implementation, we did not observe a significant correlation between the drift rate and size of the initial ensemble (Figure S2).

      Author response image 1.

      The rate of the drift does not depend on the size of the engram. Drift rate against the size of the original engram. Each dot shows one simulation (Methods). n = 100 simulations.

      Result (2): The observed drift has a temporal structure

      The authors then demonstrate that the drift has a temporal structure (i.e., that activity is informative about the day on which it occurs), with methods inspired by Rubin et al. (2015). Rubin et al. (2015) compare single-trial activity patterns on a given session with full-session activity patterns from each session. In contrast, Delamare et al. here compare full-session patterns with baseline excitability (E = 0) patterns. This point of difference should be motivated. What does a comparison to this baseline excitability activity pattern tell us? The ordinal decoder, which decodes the session order, gives very interesting results: that an intermediate amplitude E of excitability increase maximizes this decoder's performance. This point is also discussed well by the authors. As a potential point of further exploration, the use of baseline excitability patterns in the day decoder had me wondering how the ordinal decoder would perform with these baseline patterns.

      This is a good point. Here, we aimed to dissociate the role of excitability from that of the recurrent currents. We introduced a time decoder that compares the pattern with baseline excitability (E = 0), in order to test whether the temporal information was encoded in the ensemble, i.e., in the recurrent weights. By contrast, because the neural activity is by construction biased towards excitability, a time decoder performed on the full session would work in a trivial way.

      Result (3): A readout neuron can track drift

      The authors conclude their work by connecting a readout neuron to the population with plastic weights evolving via a Hebbian rule. They show that this neuron can track the drifting ensemble by adjusting its weights. These results are shown very neatly and effectively and corroborate existing work that they cite very clearly.

      Overall, this paper is well-organized, offers a straightforward model of dynamic intrinsic excitability, and provides relevant results with appropriate interpretations. The methods could benefit from more justification of certain modeling choices, and/or an exploration (either speculative or via implementation) of what would happen with more complex choices. This modeling work paves the way for further explorations of how intrinsic excitability fluctuations influence drifting representations.

      Reviewer #2 (Public Review):

      In this computational study, Delamare et al identify slow neuronal excitability as one mechanism underlying representational drift in recurrent neuronal networks and show that the drift is informative about the temporal structure of the memory and when it was formed. The manuscript is very well written and addresses a timely as well as important topic in current neuroscience, namely the mechanisms that may underlie representational drift.

      The study is based on an all-to-all recurrent neuronal network with synapses following Hebbian plasticity rules. On the first day, a cue-related representation is formed in that network and on the next 3 days it is recalled spontaneously or due to a memory-related cue. One major observation is that representational drift emerges day-by-day based on intrinsic excitability with the most excitable cells showing highest probability to replace previously active members of the assembly. By using a day decoder, the authors state that they can infer the order in which the reactivation of cell assemblies happened but only if the excitability state was not too high. By applying a read-out neuron, the authors observed that this cell can track the drifting ensemble which is based on changes of the synaptic weights across time. The few questions which emerged and could be addressed either theoretically or in the discussion are as follows:

      1. Would similar results be obtained if, instead of all-to-all recurrent connections, more realistic connectivity profiles such as those estimated for CA1 and CA3 had been modeled?

      This is a very interesting point. We performed further simulations to show that the results are not dependent on the exact structure of the network. In particular, we show that all-to-all connectivity is not required to observe a drift of the ensemble. We found similar results when the recurrent weights matrix was made sparse (Fig. S4a-c, Methods). Similarly to all-to-all connectivity, we found that the ensemble is informative about its temporal history (Fig. S4d) and that an output neuron can decode the ensemble continuously (Fig. S4e).

      Author response image 2.

      Sparse recurrent connectivity shows similar drifting behavior as all-to-all connectivity. The same simulation protocol as Fig. 1 was used while the recurrent weights matrix was made 50% sparse (Methods). a) Firing rates of the neurons across time. The red traces correspond to neurons belonging to the first assembly, namely those that have a firing rate higher than the active threshold after the first stimulation. The black bars show the stimulation and the dashed line shows the active threshold. b) Recurrent weights matrices after each of the four stimuli show the drifting assembly. c) Correlation of the patterns of activity between the first day and every other day. d) Student's t-test t-value of the ordinal time decoder, for the real (blue) and shuffled (orange) data and for different amplitudes of excitability E. e) Center of mass of the distribution of the output weights (Methods) across days. c-e) Data are shown as mean ± s.e.m. for n = 10 simulations.
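      One way a 50% sparse recurrent weights matrix could be obtained is sketched below. The caption only states the sparsity level; removing each connection independently with probability 0.5, forbidding self-connections, and the function name `sparsify` are all assumptions of this example rather than details taken from the Methods.

```python
import random


def sparsify(weights, sparsity=0.5, seed=0):
    """Return a copy of a square weight matrix with a random subset of entries zeroed.

    Assumptions: each off-diagonal connection is removed independently with
    probability `sparsity`, and self-connections are always removed.
    """
    rng = random.Random(seed)
    n = len(weights)
    masked = [row[:] for row in weights]  # leave the input matrix untouched
    for i in range(n):
        for j in range(n):
            if i == j or rng.random() < sparsity:
                masked[i][j] = 0.0
    return masked
```

      Because removal is random per connection, the realized sparsity is only approximately 50%; drawing an exact half of the connections would be an equally plausible reading.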

      2. How does the number of excited cells that could potentially contribute to an engram influence the representational drift and the decoding quality?

      This is indeed a very good question. We did not observe a significant correlation between the drift rate and size of the initial ensemble (Fig. S2).

      Author response image 3.

      The rate of the drift does not depend on the size of the engram. Drift rate against the size of the original engram. Each dot shows one simulation (Methods). n = 100 simulations.

      3. How does the rate of the drift influence the quality of readout from the readout neuron?

      We thank the reviewer for this interesting question. We introduced a measure of the “read-out quality” and plotted this value against the rate of the drift. We found a small correlation between the two quantities. Indeed, the read-out quality decreases with the rate of the drift.

      Author response image 4.

      The quality of the read-out decreases with the rate of the drift. Read-out quality computed on the firing rate of the output neuron against the rate of the drift (Methods). Each dot shows one simulation. n = 100 simulations.

      Reviewer #3 (Public Review):

      The authors explore an important question concerning the underlying mechanism of representational drift, which despite intense recent interest remains obscure. The paper explores the intriguing hypothesis that drift may reflect changes in the intrinsic excitability of neurons. The authors set out to provide theoretical insight into this potential mechanism.

      They construct a rate model with all-to-all recurrent connectivity, in which recurrent synapses are governed by a standard Hebbian plasticity rule. This network receives a global input, constant across all neurons, which can be varied with time. Each neuron also is driven by an "intrinsic excitability" bias term, which does vary across cells. The authors study how activity in the network evolves as this intrinsic excitability term is changed.

      They find that after initial stimulation of the network, those neurons where the excitability term is set high become more strongly connected and are in turn more responsive to the input. Each day the subset of neurons with high intrinsic excitability is changed, and the network's recurrent synaptic connectivity and responsiveness gradually shift, such that the new high intrinsic excitability subset becomes both more strongly activated by the global input and also more strongly recurrently connected. These changes result in drift, reflected by a gradual decrease across time in the correlation of the neuronal population vector response to the stimulus.
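      The drift measure described above, the correlation between the population vector on the first day and on each later day, can be illustrated with a toy example. The activity patterns below are invented for illustration (an active block that shifts by one neuron per day) and are not simulation output; the helper names are likewise hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length population vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def drift_curve(patterns):
    """Correlation of every day's pattern with the first day's pattern."""
    return [pearson(patterns[0], p) for p in patterns]


# Toy patterns (assumed): six neurons, the active ensemble shifts each day.
DAYS = [
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
]
curve = drift_curve(DAYS)
```

      In this toy case the correlation with day 1 falls monotonically as the ensemble turns over, which is the signature of drift referred to in the text.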

      The authors are able to build a classifier that decodes the "day" (i.e. which subset of neurons had high intrinsic excitability) with perfect accuracy. This is despite the fact that the excitability bias during decoding is set to 0 for all neurons, and so the decoder is really detecting those neurons with strong recurrent connectivity, and in turn strong responses to the input. The authors show that it is also possible to decode the order in which different subsets of neurons were given high intrinsic excitability on previous "days". This second result depends on the extent by which intrinsic excitability was increased: if the increase in intrinsic excitability was either too high or too low, it was not possible to read out any information about past ordering of excitability changes.

      Finally, using another Hebbian learning rule, the authors show that an output neuron, whose activity is a weighted sum of the activity of all neurons in the network, is able to read out the activity of the network. Specifically, this means that although the set of neurons most active in the network changes, the output neuron always maintains a higher firing rate than a neuron with randomly shuffled synaptic weights, because the output neuron continuously updates its weights to sample from the highly active population at any given moment. Thus, the output neuron can read out a stable memory despite drift.
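      The tracking behavior described here can be sketched with a generic Hebbian readout. This is an illustration under stated assumptions, not the manuscript's learning rule: the update (presynaptic rate times output rate), the sum-to-one normalization, the learning rate, and the toy activity patterns are all chosen for the example. Weights frozen after day 1 stand in, deterministically, for the shuffled-weight control mentioned above.

```python
def track_readout(patterns, steps_per_day=20, eta=0.5):
    """Train output weights day by day; return the weights at the end of each day."""
    n = len(patterns[0])
    w = [1.0 / n] * n  # start from uniform output weights
    history = []
    for rates in patterns:
        for _ in range(steps_per_day):
            y = sum(wi * ri for wi, ri in zip(w, rates))         # output firing rate
            w = [wi + eta * ri * y for wi, ri in zip(w, rates)]  # Hebbian: pre * post
            total = sum(w)
            w = [wi / total for wi in w]  # assumed normalization keeps weights bounded
        history.append(list(w))
    return history


def readout(weights, rates):
    """Output neuron activity as a weighted sum of network activity."""
    return sum(wi * ri for wi, ri in zip(weights, rates))


# Toy drifting ensemble (assumed): the active block shifts by one neuron per day.
DAYS = [
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
]
history = track_readout(DAYS)
tracked = readout(history[-1], DAYS[-1])  # weights that kept learning
frozen = readout(history[0], DAYS[-1])    # weights frozen after day 1
```

      In this sketch the continuously updated weights follow the active ensemble, whereas the frozen weights no longer overlap with it on the last day. Tracking relies on consecutive days sharing some active neurons, since the update is gated by the output rate.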

      Strengths:

      The authors are clear in their description of the network they construct and in their results. They convincingly show that when they change their "intrinsic excitability term", upon stimulation, the Hebbian synapses in their network gradually evolve, and the combined synaptic connectivity and altered excitability result in drifting patterns of activity in response to an unchanging input (Fig. 1, Fig. 2a). Furthermore, their classification analyses (Fig. 2) show that information is preserved in the network, and their readout neuron successfully tracks the active cells (Fig. 3). Finally, the observation that only a specific range of excitability bias values permits decoding of the temporal structure of the history of intrinsic excitability (Fig. 2f and Figure S1) is interesting, and as the authors point out, not trivial.

      Weaknesses:

      1. The way the network is constructed, there is no formal difference between what the authors call "input", Δ(t), and what they call "intrinsic excitability" Ɛ_i(t) (see Equation 3). These are two separate terms that are summed (Eq. 3) to define the rate dynamics of the network. The authors could have switched the names of these terms: Δ(t) could have been considered a global "intrinsic excitability term" that varied with time and Ɛ_i(t) could have been the external input received by each neuron i in the network. In that case, the paper would have considered the consequence of "slow fluctuations of external input" rather than "slow fluctuations of intrinsic excitability", but the results would have been the same. The difference is therefore semantic. The consequence is that this paper is not necessarily about "intrinsic excitability", rather it considers how a Hebbian network responds to changes in excitatory drive, regardless of whether those drives are labeled "input" or "intrinsic excitability".

      This is a very good point. We performed further simulations to model “slope-based”, instead of “threshold-based”, changes in excitability (Fig. S5a, Methods). In this new definition of excitability, we changed the slope of the activation function, which is initially sampled from a random distribution. By introducing a varying excitability, we found results very similar to those obtained when excitability was varied as the threshold of the activation function (Fig. S5b-d). We also found similarly that the ensemble is informative about its temporal history (Fig. S5e) and that an output neuron can decode the ensemble continuously (Fig. S5f).

      Author response image 5.

      Change of excitability as a variable slope of the input-output function shows similar drifting behavior as considering a change in the threshold. The same simulation protocol as Fig. 1 was used while the excitability changes were modeled as a change in the activation function slope (Methods). a) Schema showing two different ways of defining excitability, as a threshold (top) or slope (bottom) of the activation function. Each line shows one neuron and darker lines correspond to neurons with increased excitability. b) Firing rates of the neurons across time. The red traces correspond to neurons belonging to the first assembly, namely those that have a firing rate higher than the active threshold after the first stimulation. The black bars show the stimulation and the dashed line shows the active threshold. c) Recurrent weights matrices after each of the four stimuli show the drifting assembly. d) Correlation of the patterns of activity between the first day and every other day. e) Student's t-test t-value of the ordinal time decoder, for the real (blue) and shuffled (orange) data and for different amplitudes of excitability E. f) Center of mass of the distribution of the output weights (Methods) across days. d-f) Data are shown as mean ± s.e.m. for n = 10 simulations.
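      The distinction between the two definitions of excitability can be made concrete with a small sketch. It assumes a sigmoid activation function, which may differ from the one in the manuscript, and the function names are hypothetical. The point illustrated is the reviewer's: a threshold change is exactly equivalent to adding a constant to the input, whereas a slope change is not equivalent to any single constant input offset.

```python
from math import exp


def rate(drive, threshold=0.0, slope=1.0):
    """Assumed sigmoid rate function: slope scales the drive, threshold shifts it."""
    return 1.0 / (1.0 + exp(-slope * (drive - threshold)))


def equivalent_input_shift(drive, slope):
    """Input offset that reproduces a slope change at one drive level (threshold 0).

    Solves drive + shift == slope * drive for shift.
    """
    return slope * drive - drive


# Threshold-based excitability: lowering the threshold by e equals adding e to the input.
same_as_input = abs(rate(0.3 + 0.5) - rate(0.3, threshold=-0.5)) < 1e-12

# Slope-based excitability: the offset needed to mimic it depends on the drive itself.
shift_low = equivalent_input_shift(0.5, slope=2.0)
shift_high = equivalent_input_shift(1.0, slope=2.0)
```

      Because `shift_low` and `shift_high` differ, no fixed external input could stand in for the slope change across drive levels, which is why the slope-based variant addresses the semantic concern.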

      2. Given how the learning rule that defines input to the readout neuron is constructed, it is trivial that this unit responds to the most active neurons in the network, more so than a neuron assigned random weights. What would happen if the network included more than one "memory"? Would it be possible to construct a readout neuron that could classify two distinct patterns? Along these lines, what if there were multiple, distinct stimuli used to drive this network, rather than the global input the authors employ here? Does the system, as constructed, have the capacity to provide two distinct patterns of activity in response to two distinct inputs?

      This is an interesting point. In order to model multiple memories, we introduced non-uniform feedforward inputs, defining different “contexts” (Methods). We adapted our model so that two contexts target two random sub-populations in the network. We also introduced a second output neuron to decode the second memory. The simulation protocol was adapted so that each of the two contexts is stimulated every day (Fig. S6a). We found that the network is able to store two ensembles that drift independently (Fig. S6 and S7a). We were also able to decode temporal information from the patterns of activity of both ensembles (Fig. S7b). Finally, both memories could be decoded independently using two output neurons (Fig. S7c and d).

      Author response image 6.

      Two distinct ensembles can be encoded and drift independently. a) and b) Firing rates of the neurons across time. The red traces in panel b) correspond to neurons belonging to the first assembly and the green traces to the second assembly on the first day. They correspond to neurons having a firing rate higher than the active threshold after the first stimulation of each assembly. The black bars show the stimulation and the dashed line shows the active threshold. c) Recurrent weights matrices after each of the eight stimuli showing the drifting of the first (top) and second (bottom) assembly.

      Author response image 7.

      The two ensembles are informative about their temporal history and can be decoded using two output neurons. a) Correlation of the patterns of activity between the first day and every other day, for the first assembly (red) and the second assembly (green). b) Student's t-test t-value of the ordinal time decoder, for the first (red, left) and second ensemble (green, right) for different amplitudes of excitability E. Shuffled data are shown in orange. c) Center of mass of the distribution of the output weights (Methods) across days for the first (W_1^out, red) and second (W_2^out, green) ensemble. a-c) Data are shown as mean ± s.e.m. for n = 10 simulations. d) Output neurons' firing rates across time for the first ensemble (Y_1, top) and the second ensemble (Y_2, bottom). The red and green traces correspond to the real output. The dark blue, light blue and yellow traces correspond to the cases where the output weights were randomly shuffled for every time point after presentation of the first, second and third stimulus, respectively.

      Impact:

      Defining the potential role of changes in intrinsic excitability in drift is fundamental. Thus, this paper represents a potentially important contribution. Unfortunately, given the way the network employed here is constructed, it is difficult to tease apart the specific contribution of changing excitability from changing input. This limits the interpretability and applicability of the results.

    1. Author response:

      The following is the authors’ response to the original reviews.

      In addition to our responses to reviewer suggestions below, a minor bug in the calculation of CAIS was brought to our attention by a reader of our preprint. We have corrected this bug and rerun analyses, whose results became slightly stronger as noise was removed. While we were doing that, someone pointed out to us that our equations were almost the same as Kullback-Leibler divergence, which explains why our metric performed so well. We have made the numerically trivial (see before vs. after figure below) mathematical change to use Kullback-Leibler divergence instead, and now have a better story, with a solid basis in information theory, as to why CAIS works.

      Author response image 1.
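      For readers unfamiliar with the quantity, the Kullback-Leibler divergence between observed synonymous codon frequencies and those expected from genomic GC content alone can be sketched as below. This is a textbook calculation for a single amino acid, not the CAIS formula itself: how the manuscript weights and combines amino acids is not reproduced, and the function names, the equal split of GC between G and C, and the example frequencies are assumptions.

```python
from math import log


def expected_codon_freqs(codons, gc):
    """Expected relative frequencies of synonymous codons given genomic GC content.

    Assumption: nucleotides are independent, with G and C each at gc/2
    and A and T each at (1 - gc)/2.
    """
    p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    raw = {c: p[c[0]] * p[c[1]] * p[c[2]] for c in codons}
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}  # normalize within the amino acid


def kl_divergence(observed, expected):
    """D_KL(observed || expected) in nats; zero-frequency codons contribute nothing."""
    return sum(o * log(o / expected[c]) for c, o in observed.items() if o > 0)


ALANINE = ["GCT", "GCC", "GCA", "GCG"]
expected = expected_codon_freqs(ALANINE, gc=0.5)
unbiased = {c: 0.25 for c in ALANINE}                          # matches expectation
biased = {"GCT": 0.4, "GCC": 0.4, "GCA": 0.1, "GCG": 0.1}    # invented example usage
```

      Here `kl_divergence(unbiased, expected)` is zero because usage matches the GC-based expectation, while `kl_divergence(biased, expected)` is positive, reflecting departure from that expectation.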

      Unfortunately, we discovered a second bug that caused our PIC correction code to fail to perform the needed correction for phylogenetic confounding. The previously reported correlation between CAIS (or ENC) and body mass no longer survives PIC-correction. We have therefore removed this analysis from the manuscript. Our story now stands more on the theoretical basis of CAIS and ENC, and less on post facto validation, than it previously did. We now also present CAIS and ENC on a more equal footing. ENC results are slightly stronger, while CAIS has the complementary advantage of correcting for amino acid frequencies.

      The work involved in these changes, as well as some of the responses to reviews below, justifies changing the second author into a co-first author, and adding an additional coauthor (Hanon McShea) who discovered the second bug.

      Reviewer #1 (Public Review): 

      In this manuscript, the authors propose a new codon adaptation metric, Codon Adaptation Index of Species (CAIS), which they present as an easily obtainable proxy for effective population size. To permit between-species comparisons, they control for both amino acid frequencies and genomic GC content, which distinguishes their approach from existing ones. Having confirmed that CAIS negatively correlates with vertebrate body mass, as would be expected if small-bodied species with larger effective populations experience more efficient selection on codon usage, they then examine the relationship between CAIS and intrinsic structural disorder in proteins. 

      The idea of a robust species-level measure of codon adaptation is interesting. If CAIS is indeed a reliable proxy for the effectiveness of selection, it could be useful to analyze species without reliable life history- or mutation rate data (which will apply to many of the genomes becoming available in the near future). 

      A key question is whether CAIS, in fact, measures adaptation at the codon level. Unfortunately, CAIS is only validated indirectly by confirming a negative correlation with body mass. As a result, the observations about structural disorder are difficult to evaluate. 

      As discussed in the preamble above, we have replaced the body mass validation with a stronger theoretical basis in information theory.

      A potential problem is that differences in GC between species are not independent of life history. Effective population size can drive compositional differences due to the effects of GC-biased gene conversion (gBGC). As noted by Galtier et al. (2018), genomic GC correlates negatively with body mass in mammals and birds. It would therefore be important to examine how gBGC might affect CAIS, and to what extent it could explain the relationship between CAIS and body mass. 

      Suppose that gBGC drives an increase in GC that is most pronounced at 3rd codon positions in high-recombination regions in small-bodied species. In this case, could observed codon usage depart more strongly from expectations calculated from overall genomic GC in small vertebrates compared to large ones? The authors also report that correcting for local intergenic GC was unsuccessful, based on the lack of a significant negative relationship with body mass (Figure 3D). In principle, this could also be consistent with local GC providing a relatively more appropriate baseline in regions with high recombination rates. Considering these scenarios would clarify what exactly CAIS is capturing.

      Figure 3 (previously Supplementary Figures S5A and S5B) shows that CAIS is negligibly correlated with %GC (not robust to multiple comparisons correction), and ENC not at all. We believe this is evidence against the possibility brought up by the reviewer, i.e. that Ne might affect gBGC (and hence global %GC). This relationship, if present, could act as a confounding effect, but it is not present within our species dataset. 

      Note that we expect our genomic-GC-based codon usage expectations to reflect unchecked gBGC in an average genomic region, independently of whether that species has high or low Ne. Our working model is that non-selective forces, including gBGC as well as conventional mutation biases, vary among species, and that they, rather than selection, determine each species’ genome-wide %GC. By correcting for genome-wide %GC, CAIS and ENC correct for both mutation bias and gBGC, in order to isolate the effects of selection.

      This argument, based on an average genomic region, is vulnerable to gene-rich genomic regions having differentially higher recombination rates and hence GC-biased gene conversion. However, we do not see the expected positive correlation between |𝐥𝐨𝐜𝐚𝐥 𝐆𝐂 - global GC| and CAIS (see new Figure 5), again suggesting that gene conversion strength is not a confounding factor acting on CAIS.

      Given claims about "exquisitely adapted species", the case for using CAIS as a measure of codon adaptation would also be stronger if a relationship with gene expression could be demonstrated. RSCU is expected to be higher in highly expressed genes. Is there any evidence that the equivalent GC-controlled measure behaves similarly?

      Correlations with gene expression are outside the scope of the current work, which is focused on producing and exploiting a single value of codon adaptation per species. It is indeed possible that our general approach of using Kullback-Leibler divergence to correct for genomic %GC could be useful in future work investigating differences among genes.  
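
      As a rough illustration of correcting for genomic %GC with Kullback-Leibler divergence, the sketch below builds GC-only expected frequencies for a set of synonymous codons and measures how far observed frequencies depart from them. It is a minimal sketch, not the CAIS definition used in the manuscript: the function names, the assumption that G and C (and likewise A and T) are equally frequent, and the example frequencies are all hypothetical.

```python
import math

def expected_synonymous_freqs(codons, gc):
    """Expected frequencies of synonymous codons under a GC-only null.

    Assumption (illustrative): each G or C has probability gc/2 and each
    A or T has probability (1 - gc)/2, independently at every position.
    """
    def nt_prob(nt):
        return gc / 2 if nt in "GC" else (1 - gc) / 2

    raw = {c: math.prod(nt_prob(nt) for nt in c) for c in codons}
    total = sum(raw.values())
    # Normalise within the synonymous set so frequencies sum to 1.
    return {c: p / total for c, p in raw.items()}

def kl_divergence(observed, expected):
    """Kullback-Leibler divergence D(observed || expected), in nats."""
    return sum(o * math.log(o / expected[c]) for c, o in observed.items() if o > 0)

# Hypothetical example: lysine (AAA, AAG) in a genome with 40% GC.
expected = expected_synonymous_freqs(["AAA", "AAG"], gc=0.4)
observed = {"AAA": 0.3, "AAG": 0.7}  # made-up observed frequencies
divergence = kl_divergence(observed, expected)
```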

      The manuscript is overall easy to follow, though some additional context may be helpful for the general reader. A more detailed discussion of how this work compares to the approach taken by Galtier et al. (2018), which accounted for GC content and gBGC when examining codon preferences, would be appropriate, for example. In addition, it would have been useful to mention past work that has attempted to explicitly quantify selection on codon usage. 

      One key difference between our work and that of Galtier et al. 2018 is that our approach does not rely on identifying specific codon preferences as a function of species. Our approach might therefore be robust to scenarios where different genes have different codon preferences (see Gingold et al. 2014 https://doi.org/10.1016/j.cell.2014.08.011). At a high level, our results are in broad agreement with those of Galtier et al., 2018, who found that gBGC affected all animal species, regardless of Ne, and who like us, found that the degree of selection on codon usage depended on Ne.

      Reviewer #2 (Public Review): 

      ## Summary 

      The goal of the authors in this study is to develop a more reliable approach for quantifying codon usage such that it is more comparable across species. Specifically, the authors wish to estimate the degree of adaptive codon usage, which is potentially a general proxy for the strength of selection at the molecular level. To this end, the authors created the Codon Adaptation Index for Species (CAIS) that controls for differences in amino acid usage and GC% across species. Using their new metric, the authors find a previously unobserved negative correlation between the overall adaptiveness of codon usage and body size across 118 vertebrates. As body size is negatively correlated with effective population size and thus the general strength of natural selection, the negative correlation between CAIS and body size is expected. The authors argue this was previously unobserved due to failures of other popular metrics such as Codon Adaptation Index (CAI) and the Effective Number of Codons (ENC) to adequately control for differences in amino acid usage and GC content across species. Most surprisingly, the authors also find a positive relationship between CAIS and the overall "disorderedness" of a species' protein domains. As some of these results are unexpected, which is acknowledged by the authors, I think it would be particularly beneficial to work with some simulated datasets. I think CAIS has the potential to be a valuable tool for those interested in comparing codon adaptation across species in certain situations. However, I have certain theoretical concerns about CAIS as a direct proxy for the efficiency of selection $sN_e$ when the mutation bias changes across species.

      ## Strengths 

      (1) I appreciate that the authors recognize the potential issues of comparing CAI when amino acid usage varies and correct for this in CAIS. I think this is sometimes an under-appreciated point in the codon usage literature, as CAI is a relative measure of codon usage bias (i.e. only considers synonyms). However, the strength of natural selection on codon usage can potentially vary across amino acids, such that comparing mean CAI between protein regions with different amino acid biases may result in spurious signals of statistical significance (see Cope et al. Biochimica et Biophysica Acta - Biomembranes 2018 for a clear example of this).

      We now cite Cope et al. as an example of how amino acid composition can act as a confounding factor.

      (2) The authors present numerous analyses using both ENC and mean CAI as a comparison to CAIS, helping give a sense of how CAIS corrects for some of the issues with these other metrics. I also enjoyed that they examined the previously unobserved relationship between codon usage bias and body size, which has bugged me ever since I saw Kessler and Dean 2014. The result comparing protein disorder to CAIS was particularly interesting and unexpected.

      Unfortunately, our previous PIC correction code was buggy, and in fact the relationship with body size does not survive PIC correction (although it is strong prior to PIC correction). We have therefore removed it from the paper. However, the more novel result on protein disorder remains strong.

      (3) The CAIS metric presented here is generally applicable to any species that has an annotated genome with protein-coding sequences. 

      ## Weaknesses 

      (1) The main weakness of this work is that it lacks simulated data to confirm that it works as expected. This would be particularly useful for assessing the relationship between CAIS and the overall effect of protein structure disorder, which the authors acknowledge is an unexpected result. I think simulations could also allow the authors to assess how their metric performs in situations where mutation bias and natural selection act in the same direction vs. opposite directions. Additionally, although I appreciate their comparisons to ENC and mean CAI, a comparison to other popular codon metrics for calculating the overall adaptiveness of a genome (e.g. dos Reis et al.'s $S$ statistic, which is a function of tRNA Adaptation Index (tAI) and ENC) may be more appropriate. Even if results are similar to $S$, CAIS has a noted advantage that it doesn't require identifying tRNA gene copy numbers or abundances, which I think are generally less readily available than genomic GC% and protein-coding sequences.

      The main limitation of dos Reis’s test in our view is that, like the better versions of CAI, it requires comparable orthologs across species. See also the discussion below re the benefits of a proteome-wide approach. We now also note the advantage of not needing tRNA gene copy numbers and abundances.

      Simulated datasets would be great, but we think it a nice addition rather than must-have, in particular because we are skeptical about whether our understanding of all relevant processes is good enough such that simulations would add much to our more heuristic argument along the lines of Figure 2. E.g. the complications of Gingold et al. 2014 cited above are pertinent, but incorporating them would make simulations quite involved. Instead, we now have a stronger theoretical justification for CAIS grounded in information theory. We have significantly expanded discussion of Figure 2 to give a clearer idea of the conceptual underpinnings of CAIS and ENC.

      The authors mention the selection-mutation-drift equilibrium model, which underlies the basic ideas of this work (e.g. higher $N_e$ results in stronger selection on codon usage), but a more in-depth framing of CAIS in terms of this model is not given. I think this could be valuable, particularly in addressing the question "are we really estimating what we think we're estimating?" 

      Let's take a closer look at the formulation for RSCUS. From here on out, subscripts will only be used to denote the codon and it will be assumed that we are only considering the case of r = genome for some species s.

      I think what the authors are attempting to do is "divide out" the effects of mutation bias (as given by $E_i$), such that only the effects of natural selection remain, i.e. deviations from the expected frequency based on mutation bias alone represent adaptive codon usage. Consider Gilchrist et al. MBE 2015, which says that the expected frequency of codon $i$ at selection-mutation-drift equilibrium in gene $g$ for an amino acid with $N_a$ synonymous codons is

      where $\Delta M$ is the mutation bias, $\Delta\eta$ is the strength of selection scaled by the strength of drift, and $\phi_g$ is the gene expression level of gene $g$. In this case, $\Delta M$ and $\Delta\eta$ reflect the strength and direction of mutation bias and natural selection relative to a reference codon, for which $\Delta M = \Delta\eta = 0$. Assuming the selection-mutation-drift equilibrium model is generally adequate to model the true codon usage patterns in a genome (as I do and I think the authors do, too), the $E_{i,g}$ could be considered the expected observed frequency of codon $i$ in gene $g$, $E[O_{i,g}]$.

      Let’s re-write the $E_i$ in the form of Gilchrist et al., such that it is a function of mutation bias $\Delta M$. For simplicity we will consider just the two codon case and assume the amino acid sequence is fixed. Assuming GC% is at equilibrium, the terms $g_r$ and $1 - g_r$ can be written as

      where $\mu_{x \to y}$ is the mutation rate from nucleotide $x$ to $y$. As described in Gilchrist et al. MBE 2015 and Shah and Gilchrist PNAS 2011, the mutation bias . This can be expressed in terms of the equilibrium GC content by recognizing that

      As we are assuming the amino acid sequence is fixed, the probability of observing a synonymous codon i at an amino acid becomes just a Bernoulli process. 

      If we do this, then 

      Recall that in the Gilchrist et al. framework, the reference codon has $\Delta M_{NNG,NNG} = 0 \implies e^{-\Delta M_{NNG,NNG}} = 1$. Thus, we have recovered the Gilchrist et al. model from the formulation of $E_i$ under the assumption that natural selection has no impact on codon usage and codon NNG is the pre-defined reference codon. To see this, plug in 0 for $\Delta\eta$ in equation (1).

      We can then calculate the expected RSCUS using equation (1) (using notation $E[O_i]$) and equation (6) for the two codon case. For simplicity, assume we are only considering a gene of average expression (defined as ). Assume in this case that NNG is the reference codon ($\Delta M_{NNG} = \Delta\eta_{NNG} = 0$).

      This shows that the expected value of RSCUS for a two-codon amino acid is expected to increase as the strength of selection $\Delta\eta$ increases, which is desired. Note that $\Delta\eta$ in Gilchrist et al. is formulated in terms of selection *against* a codon relative to the reference, such that a negative value represents that a codon is favored relative to the reference. If $\Delta\eta = 0$ (i.e. selection does not favor either codon), then $E[RSCUS] = 1$. Also note that the expected RSCUS does not remain independent of the mutation bias. This means that even if $sN_e$ (i.e. the strength of natural selection) does not change between species, changes to the strength and direction of mutation bias across species could impact RSCUS. Assuming my math is right, I think one needs to be cautious when interpreting CAIS as representative of the differences in the efficiency of selection across species except under very particular circumstances. One such case could be when it is known that mutation bias varies little across the species of interest. Looking at the species used in this manuscript, most of them have a GC content ranging around 0.41, so I suspect their results are okay. 

      Although I have not done so, I am sure this could be extended to the 4 and 6 codon amino acids. 
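
      The two-codon argument above can be checked numerically. The sketch below assumes a multinomial-logit form of the selection-mutation-drift model, in which the non-reference codon has expected frequency $e^{-\Delta M - \Delta\eta\phi} / (1 + e^{-\Delta M - \Delta\eta\phi})$, and takes the mutation-only expectation $E_i$ to be the same expression with $\Delta\eta = 0$. The function name, the default $\phi = 1$, and the parameter values are hypothetical choices made for illustration.

```python
import math

def expected_rscus(delta_m, delta_eta, phi=1.0):
    """Expected RSCUS of the non-reference codon of a two-codon amino acid.

    Assumes the reference codon has delta_m = delta_eta = 0 (illustrative
    two-codon model; phi is the expression level of the gene considered).
    """
    # Expected frequency under mutation bias alone (plays the role of E_i).
    e_null = math.exp(-delta_m) / (1 + math.exp(-delta_m))
    # Expected observed frequency with selection included.
    x = -delta_m - delta_eta * phi
    e_obs = math.exp(x) / (1 + math.exp(x))
    return e_obs / e_null

no_selection = expected_rscus(delta_m=0.3, delta_eta=0.0)
favoured_a = expected_rscus(delta_m=0.3, delta_eta=-0.5)   # codon favoured
favoured_b = expected_rscus(delta_m=-0.3, delta_eta=-0.5)  # same selection, opposite mutation bias
```

      Under these assumptions the ratio equals 1 without selection, exceeds 1 when the codon is favoured, and takes different values for the same $\Delta\eta$ when only $\Delta M$ changes, which is the dependence on mutation bias that the reviewer describes.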

      We thank Reviewer 2 for explicitly laying out the math that was implicit in our Figures 1 and 2. While we keep our more heuristic presentation, our revised manuscript now more clearly acknowledges that the per-site codon adaptation bias depicted in Figure 1 has limited sensitivity to s*Ne. The reason we believe our approach worked despite this is that we think the phenomenon is driven by what is shown in Figure 2. That is, where Ne makes a difference is by determining the proteome-wide fraction of codons subject to significant codon adaptation, rather than by determining the strength of codon adaptation at any particular site or gene. We have made multiple changes to the text to make this point clearer.

      Another minor weakness of this work is that although the method is generally applicable to any species with an annotated genome and the code is publicly available, the code itself contains hard-coded values for GC% and amino acid frequencies across the 118 vertebrates. The lack of a more flexible tool may make it difficult for less computationally-experienced researchers to take advantage of this method. 

      Genome-wide %GC values are hard-coded because they were taken from the previous study of James et al. (2023) https://doi.org/10.1093/molbev/msad073. As summarized in the manuscript, genome-wide %GC was a byproduct of a scan of all six reading frames across genic and intergenic sequences available from NCBI with access dates between May and July 2019. The more complicated code used to calculate the intergenic %GC, and the code used to calculate amino acid frequencies, are located at https://github.com/MaselLab/CodonAdaptation-Index-of-Species. Luckily, someone else just wrote a simpler end-to-end pipeline for us, on the basis of our preprint. We now note this in the Acknowledgements, and link to it: https://github.com/gavinmdouglas/handy_pop_gen/blob/main/CAIS.py.

    1. Author response:

      The following is the authors’ response to the original reviews.

      (1) Combined Public Reviews:

      Strengths:

      This work investigates the role of DNAH3 in sperm mobility and male infertility and utilised gold-standard molecular biology techniques, showing strong evidence of its role in male infertility. All aspects of the study design and methods are well described and appropriate to address the main question of the manuscript. The conclusions drawn are consistent with the analyses conducted and supported by the data.

      We extend our sincere gratitude to the expert reviewers for their valuable comments and insightful suggestions.

      Weaknesses:

      (1.1) The manuscript lacks a comparison with previous studies on DNAH3 in the Discussion section.

      We thank the reviewers for their comments.

      Recently, Meng et al. identified bi-allelic variants in DNAH3 from patients diagnosed with asthenoteratozoospermia, revealing multiple morphological defects and a disrupted "9+2" arrangement in the patients' sperm (https://doi.org/10.1093/hropen/hoae003, PMID: 38312775). Furthermore, they generated Dnah3 KO mice, which were infertile and exhibited moderate morphological abnormalities with a normally structured "9+2" microtubule arrangement. In our study, we observed similar differences between the phenotypes of DNAH3-deficient patients and Dnah3 KO mice. These findings indicate that DNAH3 may play crucial yet distinct roles in human and mouse male reproduction. Additionally, our TEM analysis demonstrated a notable absence of IDAs in sperm from both DNAH3-deficient patients and Dnah3 KO mice, resembling the findings of Meng et al. To further investigate, we conducted immunofluorescent staining and western blotting to assess the levels of IDA-associated proteins (DNAH1, DNAH6 and DNALI1) and ODA-associated proteins (DNAH8, DNAH17 and DNAI1) in sperm samples from both our DNAH3-deficient patients and Dnah3 KO mice. Our data revealed a reduction in IDA-associated protein levels and comparable ODA-associated protein levels in comparison to normal controls and WT mice, respectively, thus corroborating the TEM observations. These results suggest that DNAH3 is involved in sperm flagellar development in humans and mice, specifically through its role in the assembly of IDAs.

      Intriguingly, in our study, none of the patients with DNAH3 deficiency reported experiencing any of the principal symptoms associated with PCD. Additionally, our Dnah3 KO mice exhibited normal ciliary development in the lung, brain, eye, and oviduct. Similarly, Meng et al. did not mention any PCD symptoms in their DNAH3-deficient patients, and their Dnah3 KO mice also demonstrated normal ciliary morphology in the trachea and brain. These combined observations suggest that DNAH3 may play a more significant role in sperm flagellar development than in other motile cilia functions. Given that DNAH3 is expressed in ciliary tissues, its role in these tissues remains intriguing and could be elucidated through sequencing of larger cohorts of individuals with PCD.

      We have added this discussion in lines 267 to 283 and lines 300 to 303.

      (1.2) The variants of DNAH3 in four infertile men were identified through whole-exome sequencing. Providing an overview of the WES data would be beneficial to offer additional insights into whether other variants may contribute to the infertility. This could also help explain why ICSI only works for two out of four patients with DNAH3 variants.

      We thank the reviewer for the helpful suggestions.

      We have deposited the raw whole-exome sequencing data in the National Genomics Data Center (NGDC) (https://ngdc.cncb.ac.cn/, accession number: HRA007467). The clean reads, sequencing depth, sequencing coverage, and mapping quality of the WES on the patients are listed below (Table R1). A summary of WES has been presented in Table S1.

      Author response table 1.

      Quality of whole exome sequencing on infertile men.

      The variants identified through WES were annotated and filtered using Exomiser. Next, the variants were screened to obtain candidate variants based on the following criteria: (1) the allele frequency in the East Asian population was less than 1% in any database, including the ExAC Browser, gnomAD, and the 1000 Genomes Project; (2) the variants affected coding exons or canonical splice sites; (3) the variants were predicted to be possibly pathogenic or damaging.

      Following filtering and screening, the numbers of candidate variants obtained were as follows: Patient 1: 98, Patient 2: 101, Patient 3: 67, and Patient 4: 91 (Table S1). Subsequently, we utilized the Human Protein Atlas (HPA) database (https://www.proteinatlas.org/) and Mouse Genome Informatics (MGI) database (https://informatics.jax.org/) to analyze the expression patterns of corresponding genes. Variants whose corresponding genes were not expressed in the human or mouse testis were excluded from further consideration. We also consulted the OMIM database and reviewed relevant literature to exclude variants associated with diseases unrelated to male infertility. Additionally, considering the assumption of a recessive inheritance pattern, we excluded all monoallelic variants. Ultimately, only bi-allelic variants in DNAH3 (NG_052617.1, NM_017539.2, NP_060009.1) remained, suggesting that they are the pathogenic variants responsible for the infertility of the patients (Table S1). These DNAH3 variants were verified by Sanger sequencing on DNA from the patients' families.

      We have added the overview of the WES in Table S1 and supplemented the analysis process of WES data in lines 100 to 106 and lines 348 to 360.

      Additionally, we did not identify any pathogenic variants associated with fertilization failure or early embryonic development in the two patients with failed ICSI outcomes. Therefore, these different ICSI outcomes might be attributed to additional unexplained factors from the female partners.
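
      The screening logic described above can be sketched as a simple filter. This is a minimal illustration, not the pipeline used in the study: the `Variant` fields, the consequence labels, and the placeholder gene names are hypothetical, and the OMIM and literature review step is omitted because it relies on manual curation.

```python
from dataclasses import dataclass

# Consequence labels treated as affecting coding exons or canonical splice
# sites. The labels themselves are hypothetical, chosen for illustration.
CODING_OR_SPLICE = {"missense", "nonsense", "frameshift", "splice_site"}

@dataclass
class Variant:
    gene: str
    max_east_asian_af: float  # highest East Asian allele frequency in any database
    consequence: str
    predicted_damaging: bool
    testis_expressed: bool    # gene expressed in human or mouse testis
    biallelic: bool

def is_candidate(v: Variant) -> bool:
    """Return True only if the variant passes every screening criterion."""
    return (
        v.max_east_asian_af < 0.01             # criterion (1): rarer than 1%
        and v.consequence in CODING_OR_SPLICE  # criterion (2)
        and v.predicted_damaging               # criterion (3)
        and v.testis_expressed                 # expression filter (HPA / MGI)
        and v.biallelic                        # recessive model: drop monoallelic
    )

# Toy records with placeholder gene names (not data from the study).
variants = [
    Variant("GENE_A", 0.0002, "missense", True, True, True),
    Variant("GENE_B", 0.0500, "missense", True, True, True),     # too common
    Variant("GENE_C", 0.0001, "intronic", True, True, True),     # not coding/splice
    Variant("GENE_D", 0.0003, "frameshift", True, True, False),  # monoallelic
]
candidates = [v.gene for v in variants if is_candidate(v)]
```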

      (1.3) Quantification of images would help substantiate the conclusions, particularly in Figures 2, 3, 4, and 6. Improved images in Figures 3A, 4B, and 4C, would help increase confidence in the claims made.

      We thank the reviewer for the valuable suggestions. We presume that the reviewer means quantification of the images in Figure S6, not Figure 6.

      We have compiled statistics for results shown in Figures 2, 3, 4, and S6. Specifically:

      - The percentages of abnormal flagellar morphology in normal controls and patients, associated with the observations in Figure 2A, have been shown in Figure S1A.

      - The percentages of aberrant axonemal ultrastructure in different cross-sections of sperm from normal controls and patients, corresponding to the findings in Figure 3A, have been presented in Figure S1B.

      - The percentages of abnormal flagellar morphology in WT mice and Dnah3 KO mice have been shown in Figure S7A.

      - The percentages of aberrant axonemal arrangement in different cross-sections of sperm from WT mice and Dnah3 KO mice, corresponding to the findings in Figure 4B, have been presented in Figure S7C.

      - The percentages of microtubule doublets presenting IDAs in sperm from WT mice and Dnah3 KO mice, related to Figure 4B, have been detailed in Figure S7D.

      - The percentages of malformed mitochondria in the midpiece of sperm from WT mice and Dnah3 KO mice, associated with the observations in Figure 4C, have been presented in Figure S7E.

      Moreover, we have revised Figures 3A, 4B, and 4C by replacing the unclear TEM images.

      (2) Reviewer #1 (Recommendations for The Authors):

      (2.1) Please add reference(s) that support what is claimed in lines 83-84.

      We are very grateful for the reviewer's careful comments. We have added a reference describing the homology and expression of DNAH3.

      (2.2) In line 286, change "suggested" to "suggest".

      Thanks for the reviewer's comments. We have corrected the grammar.

      (2.3) Please add reference(s) that support what is claimed in lines 359-360.

      According to the reviewer’s suggestions, we have included references detailing the STA-PUT velocity sedimentation for isolation of single human and mouse testicular cells.

      (2.4) In line 365, change "in" to "into".

      Thanks for the reviewer’s careful comments. We have corrected this word.

      (2.5) In Figure 7, I suggest changing "patients" to "wife or partners of patient". Given that the results are indeed from the spouses of the infertile men, I suggest making this small change to keep the consistency and clarity of what the authors did.

      In response to the reviewer’s kind suggestions, we have replaced “Patient” with “partners of Patient” and revised Figure 7.

      (3) Reviewer #2 (Recommendations for The Authors):

      (3.1) A summary of the WES data would be needed (i.e. number of reads, mapping quality, etc). As mentioned in the public review, it would be beneficial to present a summary of all variants identified in the data and clarify whether DNAH3 is the only gene that contains variants and whether these variants have been validated.

      Many thanks for the reviewer’s kind suggestions.

      The clean reads, sequencing depth, sequencing coverage, and mapping quality of the WES on the patients are listed (see Author response table 1). A summary of WES has been presented in Table S1.

      The variants identified through WES were annotated and filtered using Exomiser. Next, the variants were screened to obtain candidate variants based on the following criteria: (1) the allele frequency in the East Asian population was less than 1% in any database, including the ExAC Browser, gnomAD, and the 1000 Genomes Project; (2) the variants affected coding exons or canonical splice sites; (3) the variants were predicted to be possibly pathogenic or damaging.

      Following filtering and screening, the numbers of candidate variants obtained were as follows: Patient 1: 98, Patient 2: 101, Patient 3: 67, and Patient 4: 91 (Table S1). Subsequently, we utilized the Human Protein Atlas (HPA) database (https://www.proteinatlas.org/) and Mouse Genome Informatics (MGI) database (https://informatics.jax.org/) to analyze the expression patterns of corresponding genes. Variants whose corresponding genes were not expressed in the human or mouse testis were excluded from further consideration. We also consulted the OMIM database and reviewed relevant literature to exclude variants associated with diseases unrelated to male infertility. Additionally, considering the assumption of a recessive inheritance pattern, we excluded all monoallelic variants. Ultimately, only bi-allelic variants in DNAH3 (NG_052617.1, NM_017539.2, NP_060009.1) remained, suggesting that they are the pathogenic variants responsible for the infertility of the patients (Table S1). These DNAH3 variants were verified by Sanger sequencing on DNA from the patients' families.

      We have added the overview of the WES in Table S1 and supplemented the analysis process of WES data in lines 100 to 106 and lines 348 to 360.

      (3.2) It would be beneficial to the scientific community if the raw data of WES could be uploaded to a public data repository, such as GEO.

      According to the reviewer's suggestion, we have deposited the raw whole-exome sequencing data in the National Genomics Data Center (NGDC) (https://ngdc.cncb.ac.cn/, accession number: HRA007467) and described its availability in the "Data Availability" section.

      (3.3) In line 115, it is not clear how the prediction was made. Clarifying them by adding citations or describing methods that predict these pathways/functions would help strengthen it.

      Thanks for the reviewer's comments.

      SIFT, PolyPhen-2, MutationTaster and CADD assess the deleteriousness of genetic variants by considering genomic features and evolutionary constraint of the surrounding sequence, or the structural and chemical property alterations caused by the amino acid substitutions. We have added websites and references for these tools in the manuscript (lines 116 to 118).

      Here are the principles of these tools.

      - SIFT considers the position at which the change occurred and the type of amino acid change to predict whether an amino acid substitution in a protein will affect protein function [https://sift.bii.a-star.edu.sg/, PMID: 12824425].

      - PolyPhen-2 predicts the impact of an amino acid substitution on a human protein by considering several features, including sequence, phylogenetic, and structural information [http://genetics.bwh.harvard.edu/pph2/, PMID: 20354512].

      - MutationTaster utilizes a Bayes classifier to predict the functional consequences of amino acid substitutions, intronic and synonymous changes, short insertions/deletions (indels), etc. [https://www.mutationtaster.org/, PMID: 24681721].

      - The CADD scores are based on diverse genomic features derived from surrounding sequence context, gene model annotations, evolutionary constraint, epigenetic measurements, and functional predictions [https://cadd.gs.washington.edu/, PMID: 30371827].

      (4) Reviewer #3 (Recommendations for The Authors):

      (4.1) Please ensure that all gene names used in your manuscript have been approved by the HUGO nomenclature committee. For example, "c.3590C>T (p.P1197L)" should be described as "c.3590C>T (Pro1197Leu)".

      In response to the reviewer's suggestion, we have revised all gene and variant names according to the HUGO nomenclature committee and the HGVS Variant Nomenclature Committee, respectively.

      (4.2) For Table 1, the authors should provide the rates of abnormal sperm morphologies using the sperm cells from normal male controls.

      Thanks for the reviewer’s careful comments. Consistent with the WHO laboratory manual (World Health Organization. WHO laboratory manual for the examination and processing of human semen. World Health Organization, 2021.), our routine semen analysis establishes 4% as the minimum rate of sperm with normal morphology but does not define the maximum rate of various tail defects. However, we reviewed the routine semen analysis on the normal controls in our study, and the approximate distribution of sperm with various flagellar morphologies in the normal controls was as follows: normal flagella, 78.6%; absent flagella, 1.7%; short flagella, 0.6%; coiled flagella, 12.5%; bent flagella, 7.9%; irregular flagella, 1.8%.

      (4.3) In Table 2, "Mutation Tester" or "Mutation Taster"?

      We thank the reviewer for the comment. It should be "MutationTaster", and we have corrected this mistake in Table 2 and the manuscript.

      (4.4) In Figure 2B, the bars for patient 1 should be aligned.

      Following the reviewer's valuable suggestion, we have ensured consistent scale bar alignment in Figure 2B and implemented this alignment throughout all other figures.

      (4.5) In Figure 3A, what about the ultrastructure for sperm heads in DNAH3 deficient sperm cell? The authors previously mentioned abnormalities in sperm head morphologies (Figure 2B) in patients with DNAH3 mutations.

      We thank the reviewers for their kind comments. A small fraction of abnormal sperm heads from our patients was captured under TEM, presenting as round heads with loose chromatin (Author response image 1).

      Author response image 1.

      Ultrastructure of sperm heads from DNAH3-deficient infertile men. TEM analysis revealed a fraction of round heads with loose chromatin in patients harboring DNAH3 variants. Scale bars, 200 nm.

      (4.6) In Figure S6, the authors should provide the rates of abnormal sperm morphologies for Dnah3 KO male mice.

      In response to the reviewer's valuable suggestion, we have quantified morphological defects in spermatozoa from both Dnah3 KO and WT mice. Compared to about 17% morphological abnormalities in sperm from WT mice, the morphological abnormalities in sperm from Dnah3 KO mice were about 37%. The results are presented in the revised Figure S7.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Major changes in the revised manuscript include:

      (1) The distinction between condition-dependent versus condition-independent variation in neural activity has been clarified. 

      (2) Principal angle calculations have been added. 

      (3) Neurons modulated during action execution but not during action observation have been analyzed to compare and contrast with mirror neurons. 

      (4) Canonical correlation analysis has been extended to three dimensions. 

      (5) Speculations have been moved to and modified in the Discussion. 

      (6) Computational details have been expanded in the Methods.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary and strengths. This paper starts with an exceptionally fair and balanced introduction to a topic, the mirror neuron literature, which is often debated and prone to controversies even in the choice of the terminology. In my opinion, the authors did an excellent job in this regard, and I really appreciated it. Then, they propose a novel method to look at population dynamics to compare neural selectivity and alignment between execution and observation of actions performed with different types of grip.

      Thank you.

      Weakness.

      Unfortunately, the goal and findings within this well-described framework are less clear to me. The authors aimed to investigate, using a novel analytic approach, whether and to what extent a match exists between population codes and neural dynamics when a monkey performs an action or observes it performed by an experimenter. This motivation stems from the fact that the general evidence in the literature is that the match between visual and motor selectivity of mirror neuron responses is essentially at a chance level. While the approach devised by the authors is generally well-described and understandable, the main result obtained confirms this general finding of a lack of matching between the two contexts in 2 out of the three monkeys. Nevertheless, the authors claim that the patterns associated with execution and observation can be re-aligned with canonical correlation, indicating that these distinct neural representations show dynamical similarity that may enable the nervous system to recognize particular actions. This final conclusion is hardly acceptable to me, and constitutes my major concern, at least without a more explicit explanation: how do we know that this additional operation can be performed by the brain?

      Point taken.  In the Discussion, we now have clarified that this is our speculation rather than a conclusion and we also offer an alternative interpretation (lines 724 to 744):

      “One classic interpretation of similar latent dynamics in the PM MN population during execution and observation would be that this similarity provides a means for the brain to recognize similar movements performed by the monkey during execution and by the experimenter during observation. Through some process akin to a communication subspace (Semedo et al., 2019), brain regions beyond PM might recognize the correspondence between the latent dynamics of the executed and observed actions.

      Alternatively, given that observation of another individual can be considered a form of social interaction, PM MN population activity during action observation, rather than representing movements made by another individual similar to one’s own movements, instead may represent different movements one might execute oneself in response to those made by another individual (Ninomiya et al., 2020; Bonini et al., 2022; Ferrucci et al., 2022; Pomper et al., 2023). This possibility is consistent with the finding that the neural dynamics of PM MN populations are more similar during observation of biological versus non-biological movements than during execution versus observation (Albertini et al., 2021). Though neurons active only during observation of others (AO units) have been hypothesized to drive observation activity in MNs, the present AO populations were too small to analyze with the approaches we applied here.  Nevertheless, the similar relative organization of the execution and observation population activity in PM MNs revealed here by alignment of their latent dynamics through CCA could constitute a correspondence between particular movements that might be made by the subject in response to particular movements made by the other individual, i.e. responsive movements which would not necessarily be motorically similar to the observed movements.”

      Is this a computational trick to artificially align something that is naturally non-aligned, or can it capture something real and useful? 

      We feel this is more than a trick.  In the Introduction, we now have clarified (lines 166 to 170):

      “Such alignment would indicate that the relationships among the trajectory segments in the execution subspace are similar to the relationships among the trajectory segments in the observation subspace, indicating a corresponding structure in the latent dynamic representations of execution and observation movements by the same PM MN population.”

      In the Results we give the following example (lines 446 to 455):

      “Such alignment would indicate that neural representations of trials involving the four objects bore a similar relationship to one another in neural space during execution and observation, even though they occurred in different subspaces.  For example, the trajectories of PMd+M1 neuron populations recorded from two different monkeys during center-out reaching movements could be aligned well (Safaie et al., 2023).  CCA showed, for example, that in both brains the neural trajectory for the movement to the target at 0° was closer to the trajectory for movement to the target at 45° than to the trajectory for the movement to the target at 180°. Relationships among these latent dynamic representations of the eight movements thus were similar even though the neural populations were recorded from two different monkeys.”

      And in the Discussion we now compare (lines 677 to 686):

      “Corresponding neural representations of action execution and observation during task epochs with higher neural firing rates have been described previously in PMd MNs and in PMv MNs using representational similarity analysis RSA (Papadourakis and Raos, 2019).  And during force production in eight different directions, neural trajectories of PMd neurons draw similar “clocks” during execution, cooperative execution, and passive observation (Pezzulo et al., 2022).  Likewise in the present study, despite execution and observation trajectories progressing through largely distinct subspaces, in all three monkeys execution and observation trajectory segments showed some degree of alignment, particularly the Movement and Hold segments (Figure 8C), indicating similar relationships among the latent dynamic representations of the four RGM movements during execution and observation.”

      Based on the accumulated evidence on space-constrained coding of others' actions by mirror neurons (e.g., Caggiano et al. 2009; Maranesi et al. 2017), recent evidence also cited by the authors (Pomper et al. 2023), and the most recent views supported even by the first author of the original discovery (i.e., Vittorio Gallese, see Bonini et al. 2022 on TICS), it seems that one of the main functions of these cells, especially in monkeys, might be to prepare actions and motor responses during social interaction rather than recognizing the actions of others - something that visual brain areas could easily do better than motor ones in most situations. In this perspective, and given the absence of causal evidence so far, the lack of visuo-motor congruence is a potentially relevant feature of the mechanism rather than something to be computationally cracked at all costs. 

      We agree that this perspective provides a valuable interpretation of our findings.  In the Discussion, we have added the following paragraph (lines 730 to 744):

      “Alternatively, given that observation of another individual can be considered a form of social interaction, PM MN population activity during action observation, rather than representing movements made by another individual similar to one’s own movements, instead may represent different movements one might execute oneself in response to those made by another individual (Ninomiya et al., 2020; Bonini et al., 2022; Ferrucci et al., 2022; Pomper et al., 2023). This possibility is consistent with the finding that the neural dynamics of PM MN populations are more similar during observation of biological versus non-biological movements than during execution versus observation (Albertini et al., 2021). Though neurons active only during observation of others (AO units) have been hypothesized to drive observation activity in MNs, the present AO populations were too small to analyze with the approaches we applied here.  Nevertheless, the similar relative organization of the execution and observation population activity in PM MNs revealed here by alignment of their latent dynamics through CCA could constitute a correspondence between particular movements that might be made by the subject in response to particular movements made by the other individual, i.e. responsive movements which would not necessarily be motorically similar to the observed movements.”

      Specific comments on Results/Methods: 

      I can understand, based on the authors' hypothesis, that they employed an ANOVA to preliminarily test whether and which of the recorded neurons fit their definition of "mirror neurons". However, given the emphasis on the population level, and the consolidated finding of highly different execution and observation responses, I think it could be interesting to apply the same analysis on (at least also) the whole recorded neuronal population, without any preselection-based on a single neuron statistic. Such preselection of mirror neurons could influence the results of EXE-OBS comparisons since all the neurons activated only during EXE or OBS are excluded. Related to this point, the authors could report the total number of recorded neurons per monkey/session, so that also the fraction of neurons fitting their definition of mirror neuron is explicit. 

      We are aware that a number of recent studies from other laboratories already have analyzed the entire population of neurons during execution versus observation, without selectively analyzing neurons active during both execution and observation (Jiang et al., 2020; Albertini et al., 2021). However, our focus lies not in how the entire PM neural population encodes execution versus observation, but in the differential activity of the mirror neuron subpopulation in these two contexts.  Our new Table 2 presents the numbers of mirror neurons (MN), action execution only neurons (AE), action observation only neurons (AO), and neurons not significantly task-related during either execution or observation (NS).  Although we often recorded substantial numbers of AE neurons, very few AO neurons were found in our recordings.  In analyzing the AE subpopulation, we found unexpected differences in canonical correlation alignment between and within the MN and AE neuron populations. In view of the editors’ comment that “…the reviewers provided several specific recommendations of new analyses to include. However, now the paper feels extremely long…”, we have chosen to focus on comparing AE neurons with MNs.  

      Furthermore, the comparison of the dynamics of the classification accuracy in figures 4 and 5, and therefore the underlying assumption of subspaces shift in execution and observation, respectively, reveal substantial similarities between monkeys despite the different contexts, which are clearly greater than the similarities among neural subspaces shifts across task epochs: to me, this suggests that the main result is driven by the selected neural populations in different monkeys/implants rather than by an essential property of the neuronal dynamics valid across animals. Could the author comment on this issue? This could easily explain the "strange" result reported in figure 6 for monkey T. 

      We have taken the general approach of emphasizing findings common across individual animals, but also reporting individual differences.  We have added the following in the Discussion (lines 645 to 654):

      “We did not attempt to classify neurons in our PM MN populations as strictly congruent, broadly congruent, or non-congruent.  Nevertheless, the minimal overlap we found in instantaneous execution and observation subspaces would be consistent with a low degree of congruence in our PM MN populations.  Monkey T was an exception in this regard, particularly during one session, showing a considerable degree of overlap between execution and observation subspaces, not unlike the shared subspace found in other studies that identified orthogonal execution and observation subspaces as well (Jiang et al., 2020).  Although our microelectrode arrays were placed in similar cortical locations in the three monkeys, by chance monkey T’s PM MN population may have included a substantial proportion of congruent neurons.”

      Reviewer #2 (Public Review): 

      In this work, the authors set out to identify time-varying subspaces in the premotor cortical activity of monkeys as they executed/observed a reach-grasp-hold movement of 4 different objects. Then, they projected the neural activity to these subspaces and found evidence of shifting subspaces in the time course of a trial in both conditions, executing and observing. These shifting subspaces appear to be distinct in execution and observation trials. However, correlation analysis of neural dynamics reveals the similarity of dynamics in these distinct subspaces. Taken together, Zhao and Schieber speculate that the condition-dependent activity studied here provides a representation of movement that relies on the actor. 

      This work addresses an interesting question. The authors developed a novel approach to identify instantaneous subspaces and decoded the object type from the projected neural dynamics within these subspaces. As interesting as these results might be, I have a few suggestions and questions to improve the manuscript: 

      (1) Repeating the analyses in the paper, e.g., in Fig5, using non-MN units only or the entire population, and demonstrating that the results are specific to MNs would make the whole study much more compelling. 

      We have added analyses of those non-MNs modulated significantly during action execution but not during observation, which we refer to as AE neurons.  The additional findings from these analyses are spread throughout the manuscript:

      Lines 284-293:

      “We also examined the temporal progression of the instantaneous subspace of AE neurons.  As would be expected given that AE neurons were not modulated significantly during observation trials, in the observation context AE populations had no gradual changes in principal angle (Figure 4 – figure supplement 3).  During execution, however, Figure 4I-L show that the AE populations had a pattern of gradual decrease in principal angle similar to that found in the MN population (Figure 4A-D).  After the instruction onset, the instantaneous subspace shifted quickly away from that present at time I and progressed gradually toward that present at times G and M, only shifting toward that present at time H after movement onset.  As for the PM MN populations, the condition-dependent subspace of the PM AE populations shifted progressively over the time course of execution RGM trials.” 

      Lines 411-419:

      “During execution trials, classification accuracy for AE populations (Figure 6I-L) showed a time course quite similar to that for MN populations, though amplitudes were lower overall, most likely because of the smaller population sizes. During observation, AE populations showed only low-amplitude, short-lived peaks of classification accuracy around times I, G, M, and H (Figure 6 – figure supplement 1).  Given that individual AE neurons showed no statistically significant modulation during observation trials, even these small peaks might not have been expected.  Previous studies have indicated, however, that neurons not individually related to task events nevertheless may contribute to a population response (Shenoy et al., 2013; Cunningham and Yu, 2014; Gallego et al., 2017; Jiang et al., 2020).”

      Lines 495-508:

      “Although MNs are known to be present in considerable numbers in both the primary motor cortex and premotor cortex (see Introduction), most studies of movement-related cortical activity in these areas make no distinction between neurons with activity only during action execution (AE neurons) and those with activity during both execution and observation (MNs).  This reflects an underlying assumption that during action execution, mirror neurons function in parallel with AE neurons, differing only during observation.  We therefore tested the hypothesis that MN and AE neuron execution trajectory segments from the same session would align well.  Figure 8C (blue) shows the mean CCs between MN and AE execution trajectory segments across 8 alignments (MN/AE; 2 R, 3 T, 3 F), which reached the highest values for the Hold segments .  All three of these coefficients were substantially lower than those for the MN execution vs. observation alignments given above.  Surprisingly, the alignment of AE neuron execution trajectory segments with those of the simultaneously recorded MN population was weaker than the alignment of MN trajectories during execution vs. observation.

      Did these differences in MN:1/2, MN:E/O, and MN/AE alignment result from consistent differences in their respective patterns of co-modulation, or from greater trial-by-trial variability in the patterns of co-modulation among MNs during observation than during execution, and still greater variability among AE neurons during execution?  The bootstrapping approach we used for CCA (see Methods) enabled us to evaluate the consistency of relationships among trajectory segments across repeated samplings of trials recorded from the same neuron population in the same session and in the same context (execution or observation).  We therefore performed 500 iterations of CCA between two different random samples of MN execution (MN:E/E), MN observation (MN:O/O), or AE execution (AE:E/E) trajectory segments from a given session (2 R, 3 T, 3 F). This within-group alignment of MN execution trajectory segments from the same session (Figure 8D, MN:E/E, gray, Hold: () was as strong as between-session alignment (Figure 8C, MN:1/2, black).  But within-group alignment of MN observation trajectory segments (Figure 8D, MN:O/O, orange, Hold: () was lower than that found with MN execution segments (Figure 8C, MN:E/O, red, .  Likewise, within-group alignment of AE neuron trajectory segments (Figure 8D, AE:E/E, light blue, Hold: () was lower than their alignment with MN execution segments (Figure 8C, MN/AE, blue, Hold: ().  Whereas MN execution trajectories were relatively consistent within sessions, MN observation trajectories and AE execution trajectories were less so.”

      And in the Discussion we now suggest (lines 682 to 698):

      “Based on the assumption that AE neurons and MNs function as a homogenous neuron population during action execution, we had expected AE and MN execution trajectory segments to align closely.  During execution trials, the progression of instantaneous condition-dependent subspaces and of classification accuracy in AE populations was quite similar to that in MN populations.  We were surprised to find, therefore, that alignment between execution trajectory segments from AE populations and from the simultaneously recorded MN populations was even lower than alignment between MN execution and observation segments (Figure 8C, blue versus red).  Moreover, whereas within-group alignment of MN execution trajectory segments was high, within-group alignment of AE neuron execution trajectory segments was low (Figure 8D, gray versus light blue).  These findings indicate that the predominant patterns of co-modulation among MNs during execution are quite consistent within sessions, but the patterns of comodulation among AE neurons are considerably more variable.  Together with our previous finding that modulation of MNs leads that of non-mirror neurons in time, both at the single neuron level and at the population level (Mazurek and Schieber, 2019), this difference in consistency versus variability leads us to speculate that during action execution, while MNs carry a consistent forward model of the intended movement, AE neurons carry more variable feedback information.”

      (2) The method presented here is similar and perhaps related to principal angles (https://doi.org/10.2307/2005662). It would be interesting to confirm these results with principal angles. For instance, instead of using the decoding performance as a proxy for shifting subspaces, principal angles could directly quantify the 'shift' (similar to Gallego et al, Nat Comm, 2018). 

      Point taken.  We now have calculated the principal angles as a function of time and present them as a new section of the Results including new Figure 4 (lines 237 to 293). 

      “Instantaneous subspaces shift progressively during both execution and observation 

      We identified an instantaneous subspace at each one millisecond time step of RGM trials.  At each time step, we applied PCA to the 4 instantaneous neural states (i.e. the 4 points on the neural trajectories representing trials involving the 4 different objects each averaged across 20 trials per object, totaling 80 trials), yielding a 3-dimensional subspace at that time (see Methods).  Note that because these 3-dimensional subspaces are essentially instantaneous, they capture the condition-dependent variation in neural states, but not the common, condition-independent variation.  To examine the temporal progression of these instantaneous subspaces, we then calculated the principal angles between each 80-trial instantaneous subspace and the instantaneous subspaces averaged across all trials at four behavioral time points that could be readily defined across trials, sessions, and monkeys: the onset of the instruction (I), the go cue (G), the movement onset (M), and the beginning of the final hold (H).  This process was repeated 10 times with replacement to assess the variability of the principal angles.  The closer the principal angles are to 0°, the closer the two subspaces are to being identical; the closer to 90°, the closer the two subspaces are to being orthogonal.  

      Figure 4A-D illustrate the temporal progression of the first principal angle of the mirror neuron population in the three sessions (red, green, and blue) from monkey R during execution trials. As illustrated in Figure 4 – figure supplement 1 (see also the related Methods), in each session all three principal angles, each of which could range from 0° to 90°, tended to follow a similar time course.  In the Results we therefore illustrate only the first (i.e. smallest) principal angle.  Solid traces represent the mean across 10-fold cross validation using the 80-trial subsets of all the available trials; shading indicates ±1 standard deviation.  As would be expected, the instantaneous subspace using 80 trials approaches the subspace using all trials at each of the four selected times—I, G, M, and H—indicated by the relatively narrow trough dipping toward 0°.  Of greater interest are the slower changes in the first principal angle in between these four time points.  Figure 4A shows that after instruction onset (I) the instantaneous subspace shifted quickly away from the subspace at time I, indicated by a rapid increase in principal angle to levels not much lower than what might be expected by chance alone (horizontal dashed line). In contrast, throughout the remainder of the instruction and delay epochs (from I to G), Figure 4B and C show that the 80-trial instantaneous subspace shifted gradually and concurrently, not sequentially, toward the all-trial subspaces that would be reached at the end of the delay period (G) and then at the onset of movement (M), indicated by the progressive decreases in principal angle. As shown by Figure 4D, shifting toward the H subspace did not begin until the movement onset (M). 
      To summarize, these changes in principal angles indicate that after shifting briefly toward the subspace present at the time the instruction appeared (I), the instantaneous subspace shifted progressively throughout the instruction and delay epochs toward the subspace that would be reached at the time of the go cue (G), then further toward that at the time of movement onset (M), and only thereafter shifted toward the instantaneous subspace that would be present at the time of the hold (H).

      Figure 4E-H show the progression of the first principal angle of the mirror neuron population during observation trials.  Overall, the temporal progression of the MN instantaneous subspace during observation was similar to that found during execution, particularly around times I and H.  The decrease in principal angle relative to the G and M instantaneous subspaces during the delay epoch was less pronounced during observation than during execution.  Nevertheless, these findings support the hypothesis that the condition-dependent subspace of PM MNs shifts progressively over the time course of RGM trials during both execution and observation, as illustrated schematically in Figure 1A.

      We also examined the temporal progression of the instantaneous subspace of AE neurons.  As would be expected given that AE neurons were not modulated significantly during observation trials, in the observation context AE populations had no gradual changes in principal angle (Figure 4 – figure supplement 3).  During execution, however, Figure 4I-L show that the AE populations had a pattern of gradual decrease in principal angle similar to that found in the MN population (Figure 4A-D).  After the instruction onset, the instantaneous subspace shifted quickly away from that present at time I and progressed gradually toward that present at times G and M, only shifting toward that present at time H after movement onset.  As for the PM MN populations, the condition-dependent subspace of the PM AE populations shifted progressively over the time course of execution RGM trials.”

      The related Methods are now described in subsection “Subspace Comparisons—Principal Angles”
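
      As a rough illustration of the principal-angle comparison described above, the following Python sketch computes the principal angles between two subspaces with NumPy. The function name `principal_angles_deg` and the input layout (basis vectors stored as columns) are assumptions made for illustration and are not taken from the manuscript; the QR-plus-SVD route is a standard way to obtain principal angles and may differ from the authors' implementation.

```python
import numpy as np

def principal_angles_deg(A, B):
    """Principal angles (degrees, smallest first) between span(A) and span(B).

    A, B: arrays of shape (n_neurons, k) whose columns span each subspace
    (hypothetical layout, e.g. k = 3 for the instantaneous subspaces).
    """
    # Orthonormalize each basis so that singular values become cosines.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles,
    # returned in descending order, so the angles come out ascending.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    cosines = np.clip(cosines, -1.0, 1.0)  # guard against rounding error
    return np.degrees(np.arccos(cosines))
```

      Under these assumptions, the first returned value corresponds to the first (smallest) principal angle: values near 0° indicate nearly identical subspaces and values near 90° indicate nearly orthogonal ones.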

      Relatedly, why is the decoding of the 'object type' used to establish the progressive shifting of the subspaces? I would be interested to see the authors' argument. 

      We have clarified the reason for our decoding analysis as follows (lines 295 to 297):

      “The progressive changes in principal angles do not capture another important aspect of condition-dependent neural activity.  The neural trajectories during trials involving different objects separated increasingly as trials progressed in time.”

      And… (lines 332 to 348):

      “Decodable information changes progressively during both execution and observation 

      As RGM trials proceeded in time, the condition-dependent neural activity of the PM MN population thus changed in two ways.  First, the instantaneous condition-dependent subspace shifted, indicating that the patterns of firing-rate co-modulation among neurons representing the four different RGM movements changed progressively, both during execution and during observation.  Second, as firing rates generally increased, the neural trajectories representing the four RGM movements became progressively more separated, more so during execution than during observation. 

      To evaluate the combined effects of these two progressive changes, we clipped 100 ms single-trial trajectory segments beginning at times I, G, M, or H, and projected these trajectory segments from individual trials into the instantaneous 3D subspaces at 50 ms time steps.  At each of these time steps, we trained a separate LSTM decoder to classify individual trials according to which of the four objects was involved in that trial.  We expected that the trajectory segments would be classified most accurately when projected into instantaneous subspaces near the time at which the trajectory segments were clipped.  At other times we reasoned that classification accuracy would depend both on the similarity of the current instantaneous subspace to that found at the clip time as evaluated by the principal angle (Figure 4), and on the separation of the four trajectories at the clip time (Figure 5).”
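
      The projection-and-classification step quoted above can be sketched as follows. This is only an illustration under stated assumptions: a simple nearest-centroid classifier is swapped in for the LSTM decoder used in the manuscript, and all function names, array shapes, and the train/test split are hypothetical.

```python
import numpy as np

def project_segments(segments, W_t):
    """Project single-trial segments into one instantaneous subspace.

    segments: (n_trials, n_time, n_neurons); W_t: (n_neurons, 3).
    Returns an array of shape (n_trials, n_time, 3).
    """
    return segments @ W_t

def nearest_centroid_accuracy(train_X, train_y, test_X, test_y):
    """Stand-in classifier (NOT the LSTM decoder of the manuscript)."""
    classes = np.unique(train_y)
    centroids = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every test trial to every centroid.
    dists = ((test_X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    predicted = classes[dists.argmin(axis=1)]
    return float((predicted == test_y).mean())

def accuracy_over_time(segments, labels, W, train_idx, test_idx):
    """Classification accuracy of the same segments in each subspace of W.

    W: (n_steps, n_neurons, 3), one instantaneous subspace per time step.
    """
    accuracy = np.empty(len(W))
    for k, W_t in enumerate(W):
        # Flatten each projected segment into one feature vector per trial.
        Z = project_segments(segments, W_t).reshape(len(segments), -1)
        accuracy[k] = nearest_centroid_accuracy(
            Z[train_idx], labels[train_idx], Z[test_idx], labels[test_idx]
        )
    return accuracy
```

      In this sketch, accuracy would be expected to peak when the segments are projected into a subspace close to the one in which the object-related variation actually lies, and to fall toward chance in subspaces orthogonal to it.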

      The object type should be much more decodable during movement or hold, than instruction, which is probably why the chance-level decoding performance (horizontal lines) is twice the instruction segment for the movement segment. 

      Indeed, the object type is more decodable during the movement and hold than during instruction or delay epochs.

      (3) Why aren't execution and observation subspaces compared together directly? Especially given that there are both types of trials in the same session with the same recorded population of neurons. Using instantaneous subspaces, or the principal angles between manifolds during exec trials vs obs trials.

      Point taken.  We now have added a comparison of the execution and observation subspaces using the principal angles between instantaneous subspaces (lines 421 to 436):

      “Do PM mirror neurons progress through the same subspaces during execution and observation?

      Having found that PM mirror neuron populations show similar progressive shifts in their instantaneous neural subspace during execution and observation of RGM trials, as well as similar changes in decodable information, we then asked whether this progression passes through similar subspaces during execution and observation.  To address this question, we first calculated the principal angles between the instantaneous mirror-neuron execution subspace at selected times I, G, M, or H and the entire time series of instantaneous mirror-neuron observation subspaces (Figure 7A-D).  Conversely, we calculated the principal angles between the instantaneous observation subspaces at selected times I, G, M, or H and the entire time series of instantaneous execution subspaces (Figure 7E-H).  Although the principal angles were slightly smaller than might be expected from chance alone, indicating some minimal overlap of execution and observation instantaneous subspaces, the instantaneous observation subspaces did not show any progressive shift toward the I, G, M, or H execution subspace (Figure 7A-D), nor did the instantaneous execution subspaces shift toward the I, G, M, or H observation subspace (Figure 7E-H).”

      (4) The definition of the instantaneous subspaces is a critical point in the manuscript. I think it is slightly unclear: based on the Methods section #715-722 and the main text #173-#181, I gather that the subspaces are based on trial averaged neural activity for each of the 4 objects, separately. So for each object and per timepoint, a vector of size (1, n) -n neurons- is reduced to a vector of (1, 2 or 3 -the main text says 2, methods say 3-) which would be a single point in the low-d space. Is this description accurate? This should be clarified in the manuscript.  

      In the Methods, we now have clarified (lines 849 to 859):

      “Instantaneous subspace identification 

      Instantaneous neural subspaces were identified at 1 ms intervals.  At each 1 ms time step, the N-dimensional neural firing rates from trials involving the four different objects (sphere, button, coaxial cylinder, and perpendicular cylinder) were averaged separately, providing four points in the N-dimensional space representing the average neural activity for trials involving the different objects at that time step.  PCA then was performed on these four points.  Because three dimensions capture all the variance of four points, three principal component dimensions fully defined each instantaneous subspace.  Each instantaneous 3D subspace can be considered a filter described by a matrix, W, that can project high-dimensional neural activity into a low-dimensional subspace, with the time series of instantaneous subspaces, W_i, forming a time series of filters (Figure 1B).”
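
      A minimal sketch of this identification step, assuming the trial-averaged firing rates are already arranged as an array of shape (4 objects, time steps, N neurons), might look like the following. The function name, the array layout, and the use of an SVD to carry out the PCA are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def instantaneous_subspaces(rates):
    """PCA on the four object-averaged neural states at every time step.

    rates: (n_objects, n_time, n_neurons) trial-averaged firing rates
    (hypothetical layout; n_objects = 4, n_neurons >= 3 assumed).
    Returns W of shape (n_time, n_neurons, n_objects - 1), where W[t]
    holds an orthonormal basis of the instantaneous subspace at step t.
    """
    n_objects, n_time, n_neurons = rates.shape
    n_dims = n_objects - 1  # four points span at most three dimensions
    W = np.empty((n_time, n_neurons, n_dims))
    for t in range(n_time):
        X = rates[:, t, :]
        # Centering across objects removes the condition-independent part.
        Xc = X - X.mean(axis=0)
        # Rows of Vt are principal axes, ordered by explained variance.
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        W[t] = Vt[:n_dims].T
    return W
```

      Each `W[t]` then plays the role of the filter W described above: multiplying neural activity of shape (..., N) by `W[t]` gives its three-dimensional projection at that time step.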

      (5) Isn't the process of projecting segments of neural dynamics and comparing the results equivalent to comparing the projection matrices in the first place? If so, that might have been a more intuitive avenue to follow. 

      As described in more detail in our responses to item 2, above, we have added analyses of principal angles to compare the projection matrices directly.  However, “the process of projecting segments of neural dynamics and comparing the results” incorporates the progressively increasing separation of the trajectory segments and hence is not simply equivalent to comparing the subspaces with principal angles.

      (6) Lines #385-#389: This process seems unnecessarily complicated. Also, given the number of trials available, this sometimes doesn't make sense. E.g. Monkey R exec has only 8 trials of one of the objects, so bootstrapping 20 trials 500 times would be spurious. Why not, as per Gallego et al, Nat Neurosci 2020 and Safaie et al, Nat 2023 which are cited, concatenate the trials? 

      In the Methods we now clarify that (lines 953 to 969):

      “To provide an estimate of variability, we used a bootstrapping approach to CCA.  From each of two data sets we randomly selected 20 trials involving each target object (totaling 80 trials) with replacement, clipped trajectory segments from each of those trials for 100 ms (100 points at 1 ms intervals) after the instruction onset, go cue, movement onset, or beginning of the final hold, and performed CCA as described above. (Note that because session 1 from monkey R included only 8 button trials (Table 1), we excluded this session from CCA analyses.)  With 500 iterations, we obtained a distribution of the correlation coefficients (CCs) between the two data sets in each of the three dimensions of the aligned subspace, which permitted statistical comparisons. We then used this approach to evaluate alignment of latent dynamics between different sessions (e.g. execution trials on two different days), between different contexts (e.g. execution and observation), and between different neural populations (e.g. MNs and AE neurons). This bootstrapping approach further enabled us to assess the consistency of relationships among neural trajectories within a given group, i.e. the same neural population during the same context (execution or observation) in the same session, by drawing two separate random samples of 80 trials from the same population, context, and session (Figure 8D), which would not have been possible had we concatenated trajectory segments from all trials in the session (Gallego et al., 2020; Safaie et al., 2023).”
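
      The bootstrapped CCA described above could be sketched along the following lines. Several details are assumptions made for illustration and are not stated in the response: the sampled trials are averaged per object and the four resulting 100-point segments are concatenated in time before CCA, canonical correlations are computed by the standard QR-plus-SVD method, and all function and argument names are hypothetical.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between X and Y (rows are time points)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the two column spaces (full column rank assumed).
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def sample_trajectory(segments, labels, rng, n_per_object=20):
    """Draw trials with replacement per object, average, and concatenate.

    segments: (n_trials, n_time, n_dims); labels: (n_trials,) object ids.
    Returns an array of shape (n_objects * n_time, n_dims).
    """
    parts = []
    for obj in np.unique(labels):
        idx = np.flatnonzero(labels == obj)
        pick = rng.choice(idx, size=n_per_object, replace=True)
        parts.append(segments[pick].mean(axis=0))
    return np.concatenate(parts, axis=0)

def bootstrap_cca(seg_a, lab_a, seg_b, lab_b, n_iter=500, seed=0):
    """Distribution of canonical correlations over bootstrap iterations."""
    rng = np.random.default_rng(seed)
    n_cc = min(seg_a.shape[2], seg_b.shape[2])
    ccs = np.empty((n_iter, n_cc))
    for i in range(n_iter):
        A = sample_trajectory(seg_a, lab_a, rng)
        B = sample_trajectory(seg_b, lab_b, rng)
        ccs[i] = canonical_correlations(A, B)
    return ccs
```

      Passing two different data sets corresponds to the between-session, between-context, or between-population comparisons, whereas passing the same data set twice draws two independent random samples from one group, which corresponds to the within-group comparison.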

      And we report results that could not have been obtained by concatenating all the trials (lines 522 to 541):

      “Did these differences in MN:1/2, MN:E/O, and MN/AE alignment result from consistent differences in their respective patterns of co-modulation, or from greater trial-by-trial variability in the patterns of co-modulation among MNs during observation than during execution, and still greater variability among AE neurons during execution?  The bootstrapping approach we used for CCA (see Methods) enabled us to evaluate the consistency of relationships among trajectory segments across repeated samplings of trials recorded from the same neuron population in the same session and in the same context (execution or observation).  We therefore performed 500 iterations of CCA between two different random samples of MN execution (MN:E/E), MN observation (MN:O/O), or AE execution (AE:E/E) trajectory segments from a given session (2 R, 3 T, 3 F). This within-group alignment of MN execution trajectory segments from the same session (Figure 8D, MN:E/E, gray, Hold: () was as strong as between-session alignment (Figure 8C, MN:1/2, black).  But within-group alignment of MN observation trajectory segments (Figure 8D, MN:O/O, orange, Hold: () was lower than that found with MN execution segments (Figure 8C, MN:E/O, red, .  Likewise, within-group alignment of AE neuron trajectory segments (Figure 8D, AE:E/E, light blue, Hold: () was lower than their alignment with MN execution segments (Figure 8C, MN/AE, blue, Hold: ().  Whereas MN execution trajectories were relatively consistent within sessions, MN observation trajectories and AE execution trajectories were less so.”

      Because only 8 button trials were available in Session 1 from Monkey R, we excluded this session from the CCA analyses.  Sessions 2 and 3 from monkey R provide valid results, however.  For example, we now state explicitly (lines 468 to 472):

      “As a positive control, we first aligned MN execution trajectory segments from two different sessions in the same monkey (which we abbreviate as MN:1/2).  The 2 sessions in monkey R provided only 1 possible comparison, but the 3 sessions in monkeys T and F each provided 3 comparisons.  For each of these 7 comparisons, we found the bootstrapped average of CC1, of CC2, and of CC3.”

      (7) Related to the CCA analysis, what behavioural epoch has been used here, the same as the previous analyses, i.e. 100 ms? How many datapoints is that in time? Given that CCA is essentially a correlation value, too few datapoints make it rather meaningless. If that's the case, I encourage using, let's say, one combined window from I and G until movement, and one window of movement and hold, such that they are both easier to interpret. Indeed, low values of exec-exec in CC2 compared to Gallego et al, Nat Neurosci, 2020 might be a sign of a methodological error. 

      In the Methods described for CCA, we now have clarified that (lines 953 to 961):

      “To provide an estimate of variability, we used a bootstrapping approach to CCA.  From each of two data sets we randomly selected 20 trials involving each target object (totaling 80 trials) with replacement, clipped trajectory segments from each of those trials for 100 ms (100 points at 1 ms intervals) after the instruction onset, go cue, movement onset, or beginning of the final hold, and performed CCA as described above. (Note that because session 1 from monkey R included only 8 button trials (Table 1), we excluded this session from CCA analyses.)  With 500 iterations, we obtained a distribution of the correlation coefficients (CCs) between the two data sets in each of the three dimensions of the aligned subspace, which permitted statistical comparisons.”

      And in the Results we report that (lines 475 to 480):

      “The highest values for MN:1/2 correlations were obtained for the Movement trajectory segments .  These values indicate consistent relationships among the Movement neural trajectory segments representing the four different RGM movements from session to session, as would have been expected from previous studies (Gallego et al., 2018; Gallego et al., 2020; Safaie et al., 2023).”

      Reviewer #3 (Public Review): 

      Summary: 

      In their study, Zhao et al. investigated the population activity of mirror neurons (MNs) in the premotor cortex of monkeys either executing or observing a task consisting of reaching to, grasping, and manipulating various objects. The authors proposed an innovative method for analyzing the population activity of MNs during both execution and observation trials. This method made it possible to isolate the condition-dependent variance in neural data and to study its temporal evolution over the course of single trials. The method proposed by the authors consists of building a time series of "instantaneous" subspaces with single time step resolution, rather than a single subspace spanning the entire task duration. As these subspaces are computed on an instant time basis, projecting neural activity from a given task time into them results in latent trajectories that capture condition-dependent variance while minimizing the condition-independent one. The authors then analyzed the time evolution of these instantaneous subspaces and revealed that a progressive shift is present in subspaces of both execution and observation trials, with slower shifts during the grasping and manipulating phases compared to the initial preparation phase. Finally, they compared the instantaneous subspaces between execution and observation trials and observed that neural population activity did not traverse the same subspaces in these two conditions. However, they showed that these distinct neural representations can be aligned with Canonical Correlation Analysis, indicating dynamic similarities of neural data when executing and observing the task. The authors speculated that such similarities might facilitate the nervous system's ability to recognize actions performed by oneself or another individual. 

      Strengths: 

      Unlike other areas of the brain, the analysis of neural population dynamics of premotor cortex MNs is not well established. Furthermore, analyzing population activity recorded during non-trivial motor actions, distinct from the commonly used reaching tasks, serves as a valuable contribution to computational neuroscience. This study holds particular significance as it bridges both domains, shedding light on the temporal evolution of the shift in neural states when executing and observing actions. The results are moderately robust, and the proposed analytical method could potentially be used in other neuroscience contexts. 

      Weaknesses: 

      While the overall clarity is satisfactory, the paper falls short in providing a clear description of the mathematical formulas for the different methods used in the study. 

      We have added the various mathematical formulas in the Methods.

      For Cumulative Separation (lines 864 to 871): 

      “To quantify the separation between the four trial-averaged trajectory segments involving the different objects in a given instantaneous subspace, we then calculated their cumulative separation (𝐶𝑆) as: 

      𝐶𝑆 = (1/𝑇) Σ<sub>t=1</sub><sup>T</sup> Σ<sub>i=1</sub><sup>3</sup> Σ<sub>j=i+1</sub><sup>4</sup> d<sub>ij</sub>(t)

      where d<sub>ij</sub>(t) is the 3-dimensional Euclidean distance between the i<sup>th</sup> and j<sup>th</sup> trajectories at time point 𝑡. We summed the 6 pairwise distances between the 4 trajectory segments across time points and normalized by the number of time points, 𝑇 = 100.  The larger the 𝐶𝑆, the greater the separation of the trajectory segments.”
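
      As a worked illustration of the cumulative separation just described, the following sketch sums the 6 pairwise Euclidean distances at each time point and divides by the number of time points. The function name and the toy trajectories are assumptions made for the example; they are not data from the study.

```python
from itertools import combinations
from math import dist


def cumulative_separation(trajectories):
    """Cumulative separation of trajectory segments in a 3D subspace.

    trajectories: list of segments, each a list of T points (x, y, z).
    With 4 segments there are 6 pairwise distances per time point.
    """
    n_timepoints = len(trajectories[0])
    total = 0.0
    for traj_i, traj_j in combinations(trajectories, 2):
        # Sum the distance between the two segments over all time points.
        total += sum(dist(p, q) for p, q in zip(traj_i, traj_j))
    return total / n_timepoints


# Toy example: four stationary "trajectories" of T = 100 points each.
corners = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
trajectories = [[corner] * 100 for corner in corners]
cs = cumulative_separation(trajectories)
# Three pairs are 1 apart and three are sqrt(2) apart at every time point,
# so cs equals 3 + 3 * sqrt(2).
```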

      For principal angles (lines 877 to 884): 

      For example, given the 3-dimensional instantaneous subspace at the time of movement onset, W<sub>M</sub> and at any other time, W<sub>i</sub>, we calculated their 3x3 inner product matrix and performed singular value decomposition to obtain:

      W<sub>M</sub><sup>T</sup>W<sub>i</sub> = P<sub>M</sub>𝐶P<sub>i</sub><sup>T</sup>

      where 3x3 matrices P<sub>M</sub> and P<sub>i</sub> define new manifold directions which successively minimize the 3 principal angles specific to the two subspaces being compared. The elements of diagonal matrix 𝐶 then are the ranked cosines of the principal angles, 𝜃𝑖 , ordered from smallest to largest: 

      𝐶 = diag(cos 𝜃<sub>1</sub>, cos 𝜃<sub>2</sub>, cos 𝜃<sub>3</sub>)
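
      The principal-angle calculation can be sketched with NumPy as below. This is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, and each subspace is assumed to be given as an N x 3 matrix with orthonormal columns, as PCA would provide.

```python
import numpy as np


def principal_angles_deg(W_a, W_b):
    """Principal angles (degrees) between two subspaces, smallest first.

    W_a, W_b: (N x 3) arrays whose columns are orthonormal bases.
    The singular values of the inner product matrix are the cosines of
    the principal angles, returned by NumPy in descending order.
    """
    cosines = np.linalg.svd(W_a.T @ W_b, compute_uv=False)
    cosines = np.clip(cosines, -1.0, 1.0)  # guard against rounding error
    return np.degrees(np.arccos(cosines))


# Toy example in a 6-dimensional space: the two subspaces share two axes
# and differ in the third, which is orthogonal between them.
identity = np.eye(6)
W_a = identity[:, [0, 1, 2]]
W_b = identity[:, [0, 1, 3]]
angles = principal_angles_deg(W_a, W_b)
```

      In this toy case the first two angles are 0° (shared directions) and the third is 90° (orthogonal directions), matching the interpretation given in the Results.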

      For CCA (lines 945 to 952): 

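      “CCA was performed as follows: The original latent dynamics, L<sub>A</sub> and L<sub>B</sub>, first were transformed and decomposed as L<sub>A</sub><sup>T</sup> = Q<sub>A</sub>R<sub>A</sub> and L<sub>B</sub><sup>T</sup> = Q<sub>B</sub>R<sub>B</sub>.  The first m = 3 column vectors of each 𝑄𝑖 provide an orthonormal basis for the column vectors of L<sub>i</sub><sup>T</sup> (where 𝑖 = 𝐴, 𝐵).  Singular value decomposition on the inner product matrix of  𝑄𝐴 and 𝑄𝐵 then gives Q<sub>A</sub><sup>T</sup>Q<sub>B</sub> = USV<sup>T</sup>, and new manifold directions that maximize pairwise correlations are provided by M<sub>A</sub> = R<sub>A</sub><sup>−1</sup>U and M<sub>B</sub> = R<sub>B</sub><sup>−1</sup>V.  We then projected the original latent dynamics into the new, common subspace: L̃<sub>A</sub><sup>T</sup> = L<sub>A</sub><sup>T</sup>M<sub>A</sub> and L̃<sub>B</sub><sup>T</sup> = L<sub>B</sub><sup>T</sup>M<sub>B</sub>.  Pairwise correlation coefficients between the aligned latent dynamics sorted from largest to smallest then are given by the elements of the diagonal matrix S.”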
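
      A compact way to see how such an alignment works is the standard QR-plus-SVD formulation of CCA, sketched below with NumPy. The function name, the mean-centering step, and the toy data are assumptions made for this example and are not taken from the manuscript.

```python
import numpy as np


def cca_align(L_a, L_b):
    """Canonical correlation analysis via QR and SVD.

    L_a, L_b: (m x T) latent dynamics (m dimensions, T time points).
    Returns the canonical correlations (largest first) and the two
    aligned trajectories as (T x m) arrays.
    """
    # Transpose to (T x m) and centre each dimension (an assumption here).
    X = (L_a - L_a.mean(axis=1, keepdims=True)).T
    Y = (L_b - L_b.mean(axis=1, keepdims=True)).T
    Q_a, R_a = np.linalg.qr(X)  # reduced QR: Q is (T x m), R is (m x m)
    Q_b, R_b = np.linalg.qr(Y)
    U, S, Vt = np.linalg.svd(Q_a.T @ Q_b)
    M_a = np.linalg.solve(R_a, U)     # equivalent to inv(R_a) @ U
    M_b = np.linalg.solve(R_b, Vt.T)  # equivalent to inv(R_b) @ V
    return S, X @ M_a, Y @ M_b


# Toy example: the second data set is a linear re-mixing of the first, so
# the relationships among its dimensions are preserved exactly.
rng = np.random.default_rng(0)
L_a = rng.standard_normal((3, 100))
mixing = np.array([[0.0, -2.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 0.0, 3.0]])
L_b = mixing @ L_a
ccs, aligned_a, aligned_b = cca_align(L_a, L_b)
```

      Because the toy data sets differ only by an invertible linear transformation, all three correlation coefficients come out at essentially 1; unrelated data would give lower values.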

      Moreover, it was not immediately clear why the authors did not consider a (relatively) straightforward metric to quantify the progressive shift of the instantaneous subspaces, such as computing the angle between consecutive subspaces, rather than choosing a (in my opinion) more cumbersome metric based on classification of trajectory segments representing different movements. 

      Point taken.  We now have calculated the principal angles as a function of time and present them as a new section of the Results including new figure 4 (lines 237 to 293). 

      “Instantaneous subspaces shift progressively during both execution and observation 

      We identified an instantaneous subspace at each one millisecond time step of RGM trials.  At each time step, we applied PCA to the 4 instantaneous neural states (i.e. the 4 points on the neural trajectories representing trials involving the 4 different objects each averaged across 20 trials per object, totaling 80 trials), yielding a 3-dimensional subspace at that time (see Methods).  Note that because these 3-dimensional subspaces are essentially instantaneous, they capture the condition-dependent variation in neural states, but not the common, condition-independent variation.  To examine the temporal progression of these instantaneous subspaces, we then calculated the principal angles between each 80-trial instantaneous subspace and the instantaneous subspaces averaged across all trials at four behavioral time points that could be readily defined across trials, sessions, and monkeys: the onset of the instruction (I), the go cue (G), the movement onset (M), and the beginning of the final hold (H).  This process was repeated 10 times with replacement to assess the variability of the principal angles.  The closer the principal angles are to 0°, the closer the two subspaces are to being identical; the closer to 90°, the closer the two subspaces are to being orthogonal.  

      Figure 4A-D illustrate the temporal progression of the first principal angle of the mirror neuron population in the three sessions (red, green, and blue) from monkey R during execution trials. As illustrated in Figure 4 – figure supplement 1 (see also the related Methods), in each session all three principal angles, each of which could range from 0° to 90°, tended to follow a similar time course.  In the Results we therefore illustrate only the first (i.e. smallest) principal angle.  Solid traces represent the mean across 10-fold cross validation using the 80-trial subsets of all the available trials; shading indicates ±1 standard deviation.  As would be expected, the instantaneous subspace using 80 trials approaches the subspace using all trials at each of the four selected times—I, G, M, and H—indicated by the relatively narrow trough dipping toward 0°.  Of greater interest are the slower changes in the first principal angle in between these four time points.  Figure 4A shows that after instruction onset (I) the instantaneous subspace shifted quickly away from the subspace at time I, indicated by a rapid increase in principal angle to levels not much lower than what might be expected by chance alone (horizontal dashed line). In contrast, throughout the remainder of the instruction and delay epochs (from I to G), Figure 4B and C show that the 80-trial instantaneous subspace shifted gradually and concurrently, not sequentially, toward the all-trial subspaces that would be reached at the end of the delay period (G) and then at the onset of movement (M), indicated by the progressive decreases in principal angle. As shown by Figure 4D, shifting toward the H subspace did not begin until the movement onset (M). 
      To summarize, these changes in principal angles indicate that after shifting briefly toward the subspace present at the time the instruction appeared (I), the instantaneous subspace shifted progressively throughout the instruction and delay epochs toward the subspace that would be reached at the time of the go cue (G), then further toward that at the time of movement onset (M), and only thereafter shifted toward the instantaneous subspace that would be present at the time of the hold (H).

      Figure 4E-H show the progression of the first principal angle of the mirror neuron population during observation trials.  Overall, the temporal progression of the MN instantaneous subspace during observation was similar to that found during execution, particularly around times I and H.  The decrease in principal angle relative to the G and M instantaneous subspaces during the delay epoch was less pronounced during observation than during execution.  Nevertheless, these findings support the hypothesis that the condition-dependent subspace of PM MNs shifts progressively over the time course of RGM trials during both execution and observation, as illustrated schematically in Figure 1A.

      We also examined the temporal progression of the instantaneous subspace of AE neurons.  As would be expected given that AE neurons were not modulated significantly during observation trials, in the observation context AE populations had no gradual changes in principal angle (Figure 4 – figure supplement 3).  During execution, however, Figure 4I-L show that the AE populations had a pattern of gradual decrease in principal angle similar to that found in the MN population (Figure 4A-D).  After the instruction onset, the instantaneous subspace shifted quickly away from that present at time I and progressed gradually toward that present at times G and M, only shifting toward that present at time H after movement onset.  As for the PM MN populations, the condition-dependent subspace of the PM AE populations shifted progressively over the time course of execution RGM trials.”

      The related Methods are now described in subsection “Subspace Comparisons—Principal Angles”

      Specific comments: 

      In the methods, it is stated that instantaneous subspaces are found with 3 PCs. Why does it say 2 here?  

      We now have clarified this (lines 295 to 310):

      “The progressive changes in principal angles do not capture another important aspect of condition-dependent neural activity.  The neural trajectories during trials involving different objects separated increasingly as trials progressed in time.  To illustrate this increasing separation, we clipped 100 ms segments of high-dimensional MN population trial-averaged trajectories beginning at times I, G, M, and H, for trials involving each of the four objects.  We then projected the set of four object-specific trajectory segments clipped at each time into each of the four instantaneous 3D subspaces at times I, G, M, and H.  This process was repeated separately for execution trials and for observation trials.  

      For visualization, we projected these trial-averaged trajectory segments from an example session into the PC1 vs PC2 planes (which consistently captured > 70% of the variance) of the I, G, M, or H instantaneous 3D subspaces.  In Figure 5, the trajectory segments for each of the four objects (sphere – purple, button – cyan, coaxial cylinder – magenta, perpendicular cylinder – yellow) sampled at different times (rows) have been projected into each of the four instantaneous subspaces defined at different times (columns).  Rather than appearing knotted as in Figure 3, these short trajectory segments are distinct when projected into each instantaneous subspace.”

      And in the legend for Figure 5 we now clarify that:

      “Each set of these four segments then was projected into the PC1 vs PC2 plane of the instantaneous 3D subspace present at four different times (columns: I, G, M, H).”

      Another doubt on how instantaneous subspaces are computed: in the methods you state that you apply PCA on trial-averaged activity at each 50ms time step. From the next sentence, I gather that you apply PCA on an Nx4 data matrix (N being the number of neurons, and 4 being the trial-averaged activity of the four objects) every 50 ms. Is this right? It would help to explicitly specify the dimensions of the data matrix that goes into PCA computation. 

      We apologize for this confusion.  Although the LSTM decoding was performed in 50 ms time steps, the instantaneous subspaces were calculated at 1 ms intervals. In the Methods we now have clarified (lines 849 to 859):

      “Instantaneous subspace identification 

      Instantaneous neural subspaces were identified at 1 ms intervals.  At each 1 ms time step, the N-dimensional neural firing rates from trials involving the four different objects— sphere, button, coaxial cylinder, and perpendicular cylinder—were averaged separately, providing four points in the N-dimensional space representing the average neural activity for trials involving the different objects at that time step.  PCA then was performed on these four points.  Because three dimensions capture all the variance of four points, three principal component dimensions fully defined each instantaneous subspace.  Each instantaneous 3D subspace can be considered a filter described by a matrix, W, that can project high-dimensional neural activity into a low-dimensional subspace, with the time series of instantaneous subspaces, W_i, forming a time series of filters (Figure 1B).”
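
      The identification of one instantaneous subspace can be sketched as PCA on four N-dimensional points. The code below is a minimal illustration with hypothetical names and random stand-in firing rates; it computes PCA through a singular value decomposition of the mean-centered 4 x N matrix.

```python
import numpy as np


def instantaneous_subspace(mean_rates):
    """PCA on the four object-averaged neural states at one time step.

    mean_rates: (4 x N) array, one row per object, N neurons.
    Returns W (N x 3, orthonormal columns) and all singular values.
    """
    centered = mean_rates - mean_rates.mean(axis=0, keepdims=True)
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    W = Vt[:3].T  # first three principal axes as columns
    return W, singular_values


# Stand-in data: 4 objects x 30 neurons of made-up firing rates.
rng = np.random.default_rng(0)
mean_rates = rng.uniform(0.0, 50.0, size=(4, 30))
W, singular_values = instantaneous_subspace(mean_rates)

# W acts as a filter: it projects N-dimensional activity into 3 dimensions.
projected = (mean_rates - mean_rates.mean(axis=0, keepdims=True)) @ W
```

      After mean-centering, four points span at most three dimensions, so the fourth singular value is numerically zero, which is why three principal components capture all of the variance.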

      It would help to include some equations in the methods section related to the LSTM decoding. Just to make sure I understood correctly: after having identified the instantaneous subspaces (every 50 ms), you projected the Instruction, Go, Movement, and Holding segments from individual trials (each containing 100 samples, since they are sampled from a 100ms window) onto each instantaneous subspace. So you have four trajectories for each subspace. In the methods, it is stated that a single LSTM classifier is trained for each subspace. Do you also have a separate classifier for each trajectory segment? What is used as input to the classifier? Each trajectory segment should be a 100x3 matrix once projected in an instantaneous subspace. Is that what (each of) the LSTMs take as input? And lastly, what is the LSTM trained to predict exactly? Just a label indicating the type of object that was manipulated in that trial? I apologize if I overlooked any detail, but I believe a clearer explanation of the LSTM, preferably with mathematical formulas, would greatly help readers understand this section. 

      LSTM decoding is not readily described with a set of equations.  However, we have expanded our description to provide the information requested (lines 910 to 937):

      “Decodable information—LSTM

      As illustrated schematically in Figure 1B, the same segment of high-dimensional neural activity projected into different instantaneous subspaces can generate low-dimensional trajectories of varying separation.  The degree of separation among the projected trajectory segments will depend, not only on their separation at the time when the segments were clipped, but also on the similarity of the subspaces into which the trajectory segments are projected.  To quantify the combined effects of trajectory separation and projection into different subspaces, we projected high-dimensional neural trajectory segments (each including 100 points at 1 ms intervals) from successful trials involving each of the four different target objects into time series of 3-dimensional instantaneous subspaces at 50 ms intervals. In each of these instantaneous subspaces, the neural trajectory segment from each trial thus became a 100 point x 3 dimensional matrix.  For each instantaneous subspace in the time series, we then trained a separate long short-term memory (LSTM, (Hochreiter and Schmidhuber, 1997)) classifier to attribute each of the neural trajectories from individual trials to one of the four target object labels: sphere, button, coaxial cylinder, or perpendicular cylinder. Using MATLAB’s Deep Learning Toolbox, each LSTM classifier had 3 inputs (instantaneous subspace dimensions), 20 hidden units in the bidirectional LSTM layer, and a softmax layer preceding the classification layer which had 4 output classes (target objects). The total number of successful trials available in each session for each object is given in Table 1.  To avoid bias based on the total number of successful trials, we used the minimum number of successful trials across the four objects in each session, selecting that number from the total available randomly with replacement. 
Each LSTM classifier was trained with MATLAB’s adaptive moment estimation (Adam) optimizer on 40% of the selected trials, and the remaining 60% were decoded by the trained classifier.  The success of this decoding was used as an estimate of classification accuracy from 0 (no correct classifications) to 1 (100% correct classifications). This process was repeated 10 times and the mean ± standard deviation across the 10 folds was reported as the classification accuracy at that time.  Classification accuracy of trials projected into each instantaneous subspace at 50 ms intervals was plotted as a function of trial time.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      Here are some more specific comments. 

      Abstract. Line 41. "same action" is not justified: there is plenty of evidence showing that the action does not need to be the same (or even to be an action at all). Rephrasing or substituting with "similar" is necessary, especially in light of the subsequent sentence (which is totally correct). 

      Thank you for pointing this out.  As recommended, we have changed “same” to “similar” (lines 40 to 41):  

      “Many neurons in the premotor cortex show firing rate modulation whether the subject performs an action or observes another individual performing a similar action.”

      Introduction. A relevant, missing reference in the otherwise exhaustive introduction is Albertini et al. 2021 J Neurophysiol, showing that neural dynamics and similarities between biological and nonbiological movements in premotor areas are greater than those between the same executed and observed movements. 

      Thank you for pointing out this important finding.  After revision, we felt it was now cited most appropriately in the revised Discussion as follows (lines 730 to 736):

      “Alternatively, given that observation of another individual can be considered a form of social interaction, PM MN population activity during action observation, rather than representing movements made by another individual similar to one’s own movements, instead may represent different movements one might execute oneself in response to those made by another individual (Ninomiya et al., 2020; Bonini et al., 2022; Ferrucci et al., 2022; Pomper et al., 2023). This possibility is consistent with the finding that the neural dynamics of PM MN populations are more similar during observation of biological versus non-biological movements than during execution versus observation (Albertini et al., 2021)."

      In Line 85, the sentence about Papadourakis and Raos 2019 has to be generalized to PMv, as they show that the proportion of congruent MNs is at chance in both PMd and PMv. 

      Point taken.  We have rephrased this sentence as follows (lines 88 to 89): 

      “And in both PMv and PMd, the proportion of congruent neurons may not be different from that expected by chance alone (Papadourakis and Raos, 2019).”

      Lines 122-132. The initial sentence was unclear to me at first glance. I was wondering how subspaces could be "at other times over the course of the trial" if they are instantaneous. I could imagine that the subspaces referred to corresponding behavioral intervals of execution and observation conditions (and this may be what they will later call "condition dependent" activity), but nevertheless, they could hardly be understood as "instantaneous". I grasped the author's idea only when reading the results, with the statement "no-time dependent variance is captured". The idea is to take a static snapshot of the evolution of population activity at each checkpoint (i.e. I, G, M, and H): I suggest clarifying this point immediately in the introduction to improve readability. 

      We have clarified this point by adding two paragraphs to the Introduction, first defining condition-independent versus condition-dependent variance and then explaining the use of instantaneous subspaces (lines 125 to 153):

      “A relevant but often overlooked aspect of such dynamics in neuron populations active during both execution and observation has to do with the distinction between condition-independent and condition-dependent variation in neuronal activity (Kaufman et al., 2016; Rouse and Schieber, 2018).  The variance in neural activity averaged across all the conditions in a given task context is condition-independent.  For example, in an 8-direction center-out reaching task, averaging a unit’s firing rate as a function of time across all 8 directions may show an initially low firing rate that increases prior to movement onset, peaks during the movement, and then declines during the final hold, irrespective of the movement direction.  Subtracting this condition-independent activity from the unit’s firing rate during each trial gives the remaining variance, and averaging separately across trials in each of the 8 directions then averages out noise variance, leaving the condition-dependent variance that represents the unit’s modulation among the 8 directions (conditions). Alternatively, condition-independent, condition-dependent, and noise variance can be partitioned through demixed principal component analysis (Kobak et al., 2016; Gallego et al., 2018).  The extent to which neural dynamics occur in a subspace shared by execution and observation versus subspaces unique to execution or observation may differ for the condition-independent versus condition-dependent partitions of neural activity.  Here, we tested the hypothesis that the condition-dependent activity of PM mirror neuron populations progresses through distinct subspaces during execution versus observation, which would indicate distinct patterns of co-modulation amongst mirror neurons during execution versus observation.

      Because of the complexity of condition-dependent neural trajectories for movements involving the hand, we developed a novel approach.  Rather than examining trajectories over the entire time course of behavioral trials, we identified time series of instantaneous PM mirror neuron subspaces covering the time course of behavioral trials. We identified separate time series for execution trials and for observation trials, both involving four different reach-grasp-manipulation (RGM) movements.  Given that each subspace in these time series is instantaneous (a snapshot in time), it captures condition-dependent variance in the neural activity among the four RGM movements while minimizing condition-independent (time-dependent) variance.”
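
      The partition into condition-independent and condition-dependent activity described in the quoted paragraphs can be illustrated with a small sketch. It assumes, as a simplification, that firing rates have already been averaged across trials within each condition; the function name and the synthetic 8-direction data are hypothetical.

```python
import numpy as np


def partition_rates(trial_avg):
    """Split one unit's trial-averaged firing rates into two parts.

    trial_avg: (C x T) array, C conditions by T time points.
    The condition-independent part is the mean time course across
    conditions; the condition-dependent part is what remains.
    """
    condition_independent = trial_avg.mean(axis=0)           # shape (T,)
    condition_dependent = trial_avg - condition_independent  # shape (C, T)
    return condition_independent, condition_dependent


# Synthetic unit in an 8-direction task: a shared time course that rises
# and falls, plus a cosine-tuned offset for each direction.
time = np.linspace(0.0, 1.0, 5)
shared_profile = 10.0 + 20.0 * np.sin(np.pi * time)
directions_deg = np.arange(8) * 45.0
tuning = 5.0 * np.cos(np.radians(directions_deg))
rates = shared_profile[np.newaxis, :] + tuning[:, np.newaxis]

ci, cd = partition_rates(rates)
```

      By construction the condition-dependent part averages to zero across conditions at every time point, so an analysis restricted to it ignores the shared time course.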

      Results. 

      Regarding the execution-observation alignment, as explained in my initial comment, it does not sound convincing. Applying a CCA to align EXE and OBS activities (which the authors had just shown being essentially not aligned), even separately for each epoch segment (line 396), seems to be a trick to show that they nonetheless share some similarities. Couldn't this be applied to any pairs of differently encoded conditions to create some sort of artificial link between them? Is the similarity in the neural data or rather in the method used to realign them? 

      CCA would not align arbitrary sets of neural data.  The similarity is in the data, not in the method.  For example, in an 8-direction center-out task, the neural representation of movement to the 45° target is between the neural representations of the 0° and the 90° targets.  If the same is true in a second data set, then CCA will give high correlation coefficients.  But if in the second data set the neural representation of the 45° target is between the 135° and 180° targets, CCA will give low correlation coefficients. 

      In the end, what does this tell us about the brain? 

      In the Introduction we now clarify that (lines 166 to 170):

      “Such alignment would indicate that the relationships among the trajectory segments in the execution subspace are similar to the relationships among the trajectory segments in the observation subspace, indicating a corresponding structure in the latent dynamic representations of execution and observation movements by the same PM MN population.”

      And in the Results (lines 449 to 455):

      “For example, the trajectories of PMd+M1 neuron populations recorded from two different monkeys during center-out reaching movements could be aligned well (Safaie et al., 2023).  CCA showed, for example, that in both brains the neural trajectory for the movement to the target at 0° was closer to the trajectory for movement to the target at 45° than to the trajectory for the movement to the target at 180°. Relationships among these latent dynamic representations of the eight movements thus were similar even though the neural populations were recorded from two different monkeys.”

      In relation to Figure 8 (lines 461 to 467)

      “But when both sets of trajectory segments are projected into another common subspace identified with CCA, as shown in Figure 8B, a similar relationship among the neural representations of the four movements during execution and observation is revealed.  In both behavioral contexts the neural representation of movements involving the sphere (purple) is now closest to the representation of movements involving the coaxial cylinder (magenta) and farthest from that of movements involving the button (cyan). The two sets of trajectory segments are more or less “aligned.”

      And in the Discussion (lines 665 to 674):

      “Corresponding neural representations of action execution and observation during task epochs with higher neural firing rates have been described previously in PMd MNs and in PMv MNs using representational similarity analysis (RSA) (Papadourakis and Raos, 2019).  And during force production in eight different directions, neural trajectories of PMd neurons draw similar “clocks” during execution, cooperative execution, and passive observation (Pezzulo et al., 2022).  Likewise in the present study, despite execution and observation trajectories progressing through largely distinct subspaces, in all three monkeys execution and observation trajectory segments showed some degree of alignment, particularly the Movement and Hold segments (Figure 12A), indicating similar relationships among the latent dynamic representations of the four RGM movements during execution and observation.”

      Concerning the discussion, I would like to reconsider it after having seen the authors' response to the comments above and to my general concern about the relevance of the findings from the neurophysiological point of view. 

      Certainly, please do.

      Reviewer #2 (Recommendations For The Authors): 

      Here are a few issues that I want to bring to the authors' attention (in no particular order): 

      • I am not clear on what is meant by "condition-dependent". Is the condition exec vs obs, or the object types? 

      In the Introduction, we now clarify (lines 125 to 144): 

      “A relevant but often overlooked aspect of such dynamics in neuron populations active during both execution and observation has to do with the distinction between condition-independent and condition-dependent variation in neuronal activity (Kaufman et al., 2016; Rouse and Schieber, 2018).  The variance in neural activity averaged across all the conditions in a given task context is condition-independent.  For example, in an 8-direction center-out reaching task, averaging a unit’s firing rate as a function of time across all 8 directions may show an initially low firing rate that increases prior to movement onset, peaks during the movement, and then declines during the final hold, irrespective of the movement direction.  Subtracting this condition-independent activity from the unit’s firing rate during each trial gives the remaining variance, and averaging separately across trials in each of the 8 directions then averages out noise variance, leaving the condition-dependent variance that represents the unit’s modulation among the 8 directions (conditions). Alternatively, condition-independent, condition-dependent, and noise variance can be partitioned through demixed principal component analysis (Kobak et al., 2016; Gallego et al., 2018).  The extent to which neural dynamics occur in a subspace shared by execution and observation versus subspaces unique to execution or observation may differ for the condition-independent versus condition-dependent partitions of neural activity.  Here, we tested the hypothesis that the condition-dependent activity of PM mirror neuron populations progresses through distinct subspaces during execution versus observation, which would indicate distinct patterns of co-modulation amongst mirror neurons during execution versus observation.”
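
      As an illustration only (not the analysis code used in the study), the partition described in the passage above can be sketched in Python with NumPy. The function name `partition_variance`, the array layout (trials by time for a single unit), and the use of the mean of the per-condition averages as the condition-independent component are assumptions made for this example.

```python
import numpy as np

def partition_variance(rates, labels):
    """Split one unit's firing rates into three additive parts.

    rates  : (n_trials, n_time) firing rate of a single unit (assumed layout)
    labels : (n_trials,) condition index of each trial (e.g. 0..7 for 8 directions)
    """
    rates = np.asarray(rates, dtype=float)
    labels = np.asarray(labels)
    conditions = np.unique(labels)

    # Trial-averaged rate for each condition: (n_conditions, n_time).
    cond_means = np.stack([rates[labels == c].mean(axis=0) for c in conditions])

    # Condition-independent part: the average across all conditions.
    cond_indep = cond_means.mean(axis=0)

    # Condition-dependent part: what remains of each condition average.
    cond_dep = cond_means - cond_indep

    # Noise: single-trial deviation from the average of that trial's condition.
    row = np.searchsorted(conditions, labels)
    noise = rates - cond_means[row]
    return cond_indep, cond_dep, noise
```

      By construction the condition-dependent traces sum to zero across conditions at every time point, so any time course shared by all conditions is removed from them.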

      And in the Results, we have added a new Figure 3 to illustrate condition-independent versus condition-dependent activity using an example from the present data sets (lines 208 to 236): 

      “Condition-dependent versus condition-independent neural activity in PM MNs

      Whereas a large fraction of condition-dependent neural variance during reaching movements without grasping can be captured in a two-dimensional subspace (Churchland et al., 2012; Ames et al., 2014), condition-dependent activity in movements that involve grasping is more complex (Suresh et al., 2020). In part, this may reflect the greater complexity of controlling the 24 degrees of freedom in the hand and wrist as compared to the 4 degrees of freedom in the elbow and shoulder (Sobinov and Bensmaia, 2021).  Figure 3 illustrates this complexity in a PM MN population during the present RGM movements.  Here, PCA was performed on the activity of a PM MN population across the entire time course of execution trials involving all four objects.  The colored traces in Figure 3A show neural trajectories averaged separately across trials involving each of the four objects and then projected into the PC1 vs PC2 plane of the total neural space.  Most of the variance in these four trajectories is comprised of a shared rotational component.  The black trajectory, obtained by averaging trajectories from trials involving all four objects together, represents this condition-independent (i.e. independent of the object involved) activity.  The condition-dependent (i.e. dependent on which object was involved) variation in activity is reflected by the variation in the colored trajectories around the black trajectory.  The condition-dependent portions can be isolated by subtracting the black trajectory from each of the colored trajectories. The resulting four condition-dependent trajectories have been projected into the PC1 vs PC2 plane of their own common subspace in Figure 3B.  Rather than exhibiting a simple rotational motif, these trajectories appear knotted. To better understand how these complex, condition-dependent trajectories progress over the time course of RGM trials, we chose to examine time series of instantaneous subspaces.”

      While there is an emphasis on the higher complexity of manipulating objects compared to just reaching movements in the Abstract, the majority of the analysis relates to the instruction, movement initiation, and grasp, and there is no specific analyses looking at manipulation and how those presumably more complex dynamics compare to the reaching dynamics, and how they differ from reaching in the mirror neurons. 

      We have clarified that (lines 178 to 187):

      “Because we chose to study relatively naturalistic movements, the reach, grasp, and manipulation components were not performed separately, but rather in a continuous fluid motion during the movement epoch of the task sequence (Figure 2B).  In previous studies involving a version of this task without separate instruction and delay epochs, we have shown that joint kinematics, EMG activity, and neuron activity in the primary motor cortex, all vary throughout the movement epoch in relation to both reach location and object grasped, with location predominating early in the movement epoch and object predominating later (Rouse and Schieber, 2015, 2016a, b).  The present task, however, did not dissociate the reach, the hand shape used to grasp the object, and the manipulation performed on the object.”

      • The analysis in Fig3C,D is interesting, however, in my opinion, requires control. For instance, what would these values look like if you projected the segments to a subspace defined by the activity during the entire length of the trial, or if you projected the activity during intertrials, just to get a sense of how meaningful these values are? 

      This material is now presented in Figure 5 – figure supplement 1.  In the legend to this figure supplement, we have clarified that (lines 327 to 328):

      “CS values, which we use only to characterize the phenomenon of trajectory separation,….”

      • MN is used (#85) before definition (#91). Similar for RGM, I believe. 

      Thanks for catching this problem.  We have now defined these abbreviations at first use as follows:

      In lines 89 to 92:

      “Though many authors apply the term mirror neurons strictly to highly congruent neurons, here we will refer to all neurons modulated during both contexts—execution and observation—as mirror neurons (MNs).”

      And in lines 148 to 150:

      “We identified separate time series for execution trials and for observation trials, both involving four different reach-grasp-manipulation (RGM) movements.”

      • I believe in the Intro when presenting the three hypotheses, there is a First, and a Third, but no Second. 

      We have revised this part of the Introduction without numbering our hypotheses as follows (lines 145 to 173):

      “Because of the complexity of condition-dependent neural trajectories for movements involving the hand, we developed a novel approach.  Rather than examining trajectories over the entire time course of behavioral trials, we identified time series of instantaneous PM mirror neuron subspaces covering the time course of behavioral trials. We identified separate time series for execution trials and for observation trials, both involving four different reach-grasp-manipulation (RGM) movements.  Given that each subspace in these time series is instantaneous (a snapshot in time), it captures condition-dependent variance in the neural activity among the four RGM movements while minimizing condition-independent (time-dependent) variance.

      We then tested the hypothesis that the condition-dependent subspace shifts progressively over the time course of behavioral trials (Figure 1A) by calculating the principal angles between four selected instantaneous subspaces that occurred at times easily defined in each behavioral trial—instruction onset (I), go cue (G), movement onset (M), and the beginning of the final hold (H)—and every other instantaneous subspace in the time series.  Initial analyses showed that condition-dependent neural trajectories for the four RGM movements tended to separate increasingly over the course of behavioral trials.  We therefore additionally examined the combined effects of i) the progressively shifting subspaces and ii) the increasing trajectory separation, by decoding neural trajectory segments sampled for 100 msec after times I, G, M, and H and projected into the time series of instantaneous subspaces (Figure 1B).

      Finally, we used canonical correlation to ask whether the prevalent patterns of mirror neuron co-modulation showed similar relationships among the four RGM movements during execution and observation (Figure 1C).  Such alignment would indicate that the relationships among the trajectory segments in the execution subspace are similar to the relationships among the trajectory segments in the observation subspace, indicating a corresponding structure in the latent dynamic representations of execution and observation movements by the same PM MN population.  And finally, because we previously have found that during action execution the activity of PM mirror neurons tends to lead that of non-mirror neurons which are active only during action execution (AE neurons) (Mazurek and Schieber, 2019), we performed parallel analyses of the instantaneous state space of PM AE neurons.”

      • The use of the term 'instantaneous subspaces' in the abstract confused me initially, as I wasn't sure what it meant. It might be a good idea to define or rephrase it. 

      In the Abstract we now state (lines 51 to 52):

      “Rather than following neural trajectories in subspaces that contain their entire time course, we identified time series of instantaneous subspaces …”

      And in the Introduction, we have clarified (lines 145 to 153):

      “Because of the complexity of condition-dependent neural trajectories for movements involving the hand, we developed a novel approach.  Rather than examining trajectories over the entire time course of behavioral trials, we identified time series of instantaneous PM mirror neuron subspaces covering the time course of behavioral trials. We identified separate time series for execution trials and for observation trials, both involving four different reach-grasp-manipulation (RGM) movements.  Given that each subspace in these time series is instantaneous (a snapshot in time), it captures condition-dependent variance in the neural activity among the four RGM movements while minimizing condition-independent (time-dependent) variance.”

      And in the Methods (lines 849 to 859):

      “Instantaneous subspace identification 

      Instantaneous neural subspaces were identified at 1 ms intervals.  At each 1 ms time step, the N-dimensional neural firing rates from trials involving the four different objects— sphere, button, coaxial cylinder, and perpendicular cylinder—were averaged separately, providing four points in the N-dimensional space representing the average neural activity for trials involving the different objects at that time step.  PCA then was performed on these four points.  Because three dimensions capture all the variance of four points, three principal component dimensions fully defined each instantaneous subspace.  Each instantaneous 3D subspace can be considered a filter described by a matrix, 𝑊, that can project high-dimensional neural activity into a low-dimensional subspace, with the time series of instantaneous subspaces, 𝑊𝑖, forming a time series of filters (Figure 1B).”
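
      The step quoted above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `instantaneous_subspace`, the (4, N) input layout, and the use of an SVD to carry out the PCA are assumptions. It presumes at least four neurons and four object averages that are not degenerate.

```python
import numpy as np

def instantaneous_subspace(cond_means_t):
    """PCA on the four object-averaged population states at one time step.

    cond_means_t : (4, N) array, one row per object, N neurons (assumed layout)
    Returns W (N, 3), an orthonormal basis for the subspace, and the fraction
    of variance captured by each of the three principal components.
    """
    X = np.asarray(cond_means_t, dtype=float)

    # Centering across the four objects removes what they share at this instant.
    Xc = X - X.mean(axis=0)

    # Rows of Vt are the principal axes of the centered points.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Four centered points span at most three dimensions.
    n_dims = X.shape[0] - 1
    W = Vt[:n_dims].T
    var = s[:n_dims] ** 2
    return W, var / var.sum()

# Projecting any (T, N) segment of population activity is then: segment @ W
```

      Repeating the call at every time step yields the time series of filters, one `W` per step, referred to in the quoted Methods.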

      Reviewer #3 (Recommendations For The Authors): 

      (1) Page 4, lines 127-131. In the introduction, it was not immediately clear to me what you meant by 'separation' and 'decoding' of the projected neural activity. You do mention that you are separating/decoding trajectory segments representing different movements at the end of this paragraph, but at this point of the paper it was not very clear to me what those different movements were (I only understood that after reading the results section). I suggest briefly expanding on these concepts here. 

      To clarify these points in the Introduction, we have expanded exposition of these concepts (lines 145 to 163):

      “Because of the complexity of condition-dependent neural trajectories for movements involving the hand, we developed a novel approach.  Rather than examining trajectories over the entire time course of behavioral trials, we identified time series of instantaneous PM mirror neuron subspaces covering the time course of behavioral trials. We identified separate time series for execution trials and for observation trials, both involving four different reach-grasp-manipulation (RGM) movements.  Given that each subspace in these time series is instantaneous (a snapshot in time), it captures condition-dependent variance in the neural activity among the four RGM movements while minimizing condition-independent (time-dependent) variance.

      We then tested the hypothesis that the condition-dependent subspace shifts progressively over the time course of behavioral trials (Figure 1A) by calculating the principal angles between four selected instantaneous subspaces that occurred at times easily defined in each behavioral trial—instruction onset (I), go cue (G), movement onset (M), and the beginning of the final hold (H)—and every other instantaneous subspace in the time series.  Initial analyses showed that condition-dependent neural trajectories for the four RGM movements tended to separate increasingly over the course of behavioral trials.  We therefore additionally examined the combined effects of i) the progressively shifting subspaces and ii) the increasing trajectory separation, by decoding neural trajectory segments sampled for 100 msec after times I, G, M, and H and projected into the time series of instantaneous subspaces (Figure 1B).”

      (2) Page 6, line 175. In the methods, it is stated that instantaneous subspaces are found with 3 PCs. Why does it say 2 here? 

      Thank you for noticing this discrepancy.  In the Methods, we have clarified that the instantaneous subspaces are 3-dimensional (see our reply to the next comment), but in Figure 5 (previously Figure 3), for purposes of visualization, we are projecting trajectory segments into the PC1-PC2 plane (lines 295 to 308):

      “The progressive changes in principal angles do not capture another important aspect of condition-dependent neural activity.  The neural trajectories during trials involving different objects separated increasingly as trials progressed in time.  To illustrate this increasing separation, we clipped 100 ms segments of high-dimensional MN population trial-averaged trajectories beginning at times I, G, M, and H, for trials involving each of the four objects.  We then projected the set of four object-specific trajectory segments clipped at each time into each of the four instantaneous 3D subspaces at times I, G, M, and H.  This process was repeated separately for execution trials and for observation trials.  

      For visualization, we projected these trial-averaged trajectory segments from an example session into the PC1 vs PC2 planes (which consistently captured > 70% of the variance) of the I, G, M, or H instantaneous 3D subspaces.  In Figure 5, the trajectory segments for each of the four objects (sphere – purple, button – cyan, coaxial cylinder – magenta, perpendicular cylinder – yellow) sampled at different times (rows) have been projected into each of the four instantaneous subspaces defined at different times (columns).”

      And in the legend for Figure 5 we now clarify that:

      “Each set of these four segments then was projected into the PC1 vs PC2 plane of the instantaneous 3D subspace present at four different times (columns: I, G, M, H).”

      Another doubt on how instantaneous subspaces are computed: in the methods you state that you apply PCA on trial-averaged activity at each 50ms time step. From the next sentence, I gather that you apply PCA on an Nx4 data matrix (N being the number of neurons, and 4 being the trial-averaged activity of the four objects) every 50 ms. Is this right? It would help to explicitly specify the dimensions of the data matrix that goes into PCA computation. 

      Thank you for catching an error: The instantaneous subspaces were computed at 1 ms intervals. (It is the LSTM decoding that was done in 50 ms time steps).  We have clarified how the instantaneous subspaces were computed in the Methods (lines 849 to 859):

      “Instantaneous subspace identification 

      Instantaneous neural subspaces were identified at 1 ms intervals.  At each 1 ms time step, the N-dimensional neural firing rates from trials involving the four different objects— sphere, button, coaxial cylinder, and perpendicular cylinder—were averaged separately, providing four points in the N-dimensional space representing the average neural activity for trials involving the different objects at that time step.  PCA then was performed on these four points.  Because three dimensions capture all the variance of four points, three principal component dimensions fully defined each instantaneous subspace.  Each instantaneous 3D subspace can be considered a filter described by a matrix, 𝑊, that can project high-dimensional neural activity into a low-dimensional subspace, with the time series of instantaneous subspaces, 𝑊𝑖, forming a time series of filters (Figure 1B).”

      (3) Page 7, line 210-212. I am not sure if I missed it in the discussion, but have you speculated on why the greatest separation in observation trials was observed during the holding phase while in execution trials during the movement phase? 

      This was a consistent finding, and we therefore point it out as a difference between execution and observation.  Of course, this reflects greater condition-dependent variance in the PM MN population in the movement epoch than in the hold epoch during execution, whereas the reverse is true during observation.  We have no clear speculation as to why this occurs, however.

      (4) Figure 3. Add a legend with color scheme for each object in panels A and B. Also, please specify what metric is represented by the colorbar of panels C, D, E, F (write it down next to the colorbar itself and not just in the caption). 

      This is now Figure 5.  We have added a color legend for A and B.  Panels C, D, E, and F, now have been moved to Figure 5 – figure supplement 1, where we have indicated that the colorbar represents cumulative separation.

      (5) Page 9, line 228. I found the description of this decoding analysis a bit confusing initially (and perhaps still do), this should be clarified. 

      We have clarified our decoding analysis in the Methods (lines 910 to 937):

      “Decodable information—LSTM

      As illustrated schematically in Figure 1B, the same segment of high-dimensional neural activity projected into different instantaneous subspaces can generate low-dimensional trajectories of varying separation.  The degree of separation among the projected trajectory segments will depend, not only on their separation at the time when the segments were clipped, but also on the similarity of the subspaces into which the trajectory segments are projected.  To quantify the combined effects of trajectory separation and projection into different subspaces, we projected high-dimensional neural trajectory segments (each including 100 points at 1 ms intervals) from successful trials involving each of the four different target objects into time series of 3-dimensional instantaneous subspaces at 50 ms intervals. In each of these instantaneous subspaces, the neural trajectory segment from each trial thus became a 100 point x 3 dimensional matrix.  For each instantaneous subspace in the time series, we then trained a separate long short-term memory (LSTM, (Hochreiter and Schmidhuber, 1997)) classifier to attribute each of the neural trajectories from individual trials to one of the four target object labels: sphere, button, coaxial cylinder, or perpendicular cylinder. Using MATLAB’s Deep Learning Toolbox, each LSTM classifier had 3 inputs (instantaneous subspace dimensions), 20 hidden units in the bidirectional LSTM layer, and a softmax layer preceding the classification layer which had 4 output classes (target objects). The total number of successful trials available in each session for each object is given in Table 1.  To avoid bias based on the total number of successful trials, we used the minimum number of successful trials across the four objects in each session, selecting that number from the total available randomly with replacement. 
      Each LSTM classifier was trained with MATLAB’s adaptive moment estimation (Adam) optimizer on 40% of the selected trials, and the remaining 60% were decoded by the trained classifier.  The success of this decoding was used as an estimate of classification accuracy from 0 (no correct classifications) to 1 (100% correct classifications). This process was repeated 10 times and the mean ± standard deviation across the 10 folds was reported as the classification accuracy at that time.  Classification accuracy of trials projected into each instantaneous subspace at 50 ms intervals was plotted as a function of trial time.”
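
      The sampling and evaluation scheme in the quoted Methods (equal trial counts per object drawn with replacement, a 40%/60% train/test split, 10 repeats) can be sketched as below. This is a hedged illustration in Python rather than MATLAB, and a simple nearest-centroid classifier is swapped in for the bidirectional LSTM so that the example stays self-contained. All function names and array shapes are assumptions.

```python
import numpy as np

def balance_trials(labels, rng):
    """Draw, with replacement, the same number of trials for every object.
    The count is the minimum number of trials available across objects."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n_min = min(int(np.sum(labels == c)) for c in classes)
    picks = [rng.choice(np.flatnonzero(labels == c), size=n_min, replace=True)
             for c in classes]
    return np.concatenate(picks)

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Stand-in classifier (NOT an LSTM): assign each trial to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(classes[dists.argmin(axis=1)] == y_test))

def decode_in_subspace(segments, labels, W, n_repeats=10, train_frac=0.4, seed=0):
    """segments : (n_trials, 100, N) single-trial trajectory segments
    W        : (N, 3) basis of one instantaneous subspace
    Returns the mean and standard deviation of accuracy across repeats."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    # Each trial becomes a 100 x 3 matrix, flattened here for the stand-in classifier.
    projected = np.asarray(segments, dtype=float) @ W
    features = projected.reshape(len(projected), -1)

    scores = []
    for _ in range(n_repeats):
        idx = rng.permutation(balance_trials(labels, rng))
        n_train = int(round(train_frac * len(idx)))
        train, test = idx[:n_train], idx[n_train:]
        scores.append(nearest_centroid_accuracy(
            features[train], labels[train], features[test], labels[test]))
    return float(np.mean(scores)), float(np.std(scores))
```

      Because trials are drawn with replacement before the split, a given trial can land in both partitions; the sketch follows the quoted description literally on that point.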

      (6) Page 9, line 268. This might be trivial, but can you speculate on why the accuracy for Instruction segments had a lower peak compared to the rest of the segments? Is it because there is less 'distinct' information embedded in neural data about the type of object manipulated until you are actually reaching toward it or holding it? The latter seems straightforward, but the former not so much. 

      Thank you for asking this question.  We have added the following speculations (lines 592 to 604): 

      “Short bursts of “signal” related discharge are known to occur in a substantial fraction of PMd neurons beginning at latencies of ~60 ms following an instructional stimulus (Weinrich et al., 1984; Cisek and Kalaska, 2004).  Here we found that the instantaneous subspace shifted briefly toward the subspace present at the time of instruction onset (I), similarly during execution and observation.  This brief trough in principal angle (Figure 4A) and the corresponding peak in classification accuracy (Figure 7A) in part may reflect smoothing of firing rates with a 50 ms Gaussian kernel.  We speculate, however, that the early rise of this peak at the time of instruction onset also reflects the anticipatory activity often seen in PMd neurons in expectation of an instruction, which may not be entirely non-specific, but rather may position the neural population to receive one of a limited set of potential instructions (Mauritz and Wise, 1986). We attribute the relatively low amplitude of peak classification accuracy for Instruction trajectory segments to the likely possibility that only the last 40 ms of our 100 ms Instruction segments captured signal related discharge.”

      (7) Figure 8. Shouldn't the plots in panel A resemble those in Figure 3? Here you are projecting the hold trajectory segments into the subspace at time H, which should be the same as in Fig. 3A/B bottom right panel. 

      The previous Figure 8 is now Figure 8 panels A and B, and the previous Figure 3 is now Figure 5.  The data used in these two figures come from two different recording sessions in two different monkeys. The current Figure 8A,B uses data from monkey F, session 2; whereas Figure 5 uses data from monkey T, session 3, which we now state in the legend to each figure, respectively.  Consequently, the relative arrangement of the trajectory segments in the instantaneous subspace at time H differs.  The session used in Figure 8A,B, which we now show in three dimensions, better illustrates how CCA identifies a common subspace in which execution versus observation segments show alignment (Figure 8B) that was not evident in their original subspaces (Figure 8A).

      (8) Page 14, line 369. Are you computing CCA using only 2 components? I thought the subspaces were 3 dimensional. Why not align all three dimensions? 

      We have expanded this analysis to use all three dimensions, as illustrated in Figure 8 above.

      (9) Page 14, line 407. Does this mean that instantaneous subspaces between execution and observation trials are more similar to each other during the Movement and Holding phase? Is this related to the fact that in those moments there is a smaller progressive shift of the subspaces within execution and observation trials? 

      Our new analyses of principal angles (see our reply to your comment 11, below) show that the progressive shifting of the instantaneous subspace continues through the movement and hold epochs.  We now discuss this better alignment of the Movement and Hold trajectory segments as follows (lines 656 to 664):

      “Given the complexity of condition-dependent neural trajectories across the entire time course of RGM trials (Figure 3B), rather than attempting to align entire neural trajectories, we applied canonical correlation to trajectory segments clipped for 100 ms following four well defined behavioral events: Instruction onset, Go cue, Movement onset, and the beginning of the final Hold.  In all cases, alignment was poorest for Instruction segments, somewhat higher for Go segments, and strongest for Movement and Hold segments.  This progressive increase in alignment likely reflects a progressive increase in the difference between average neuron firing rates for trials involving different objects (Figure 6) relative to the trial-by-trial variance in firing rate for a given object.”
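
      For readers unfamiliar with canonical correlation, a minimal NumPy sketch of the computation is given below. It uses the standard QR-plus-SVD formulation and is not taken from the authors' code. The function name and the assumed inputs (two matrices of matching rows, for example the projected execution and observation segments stacked as time points by 3 dimensions) are illustrative assumptions. Both inputs must have more rows than columns and full column rank.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlation analysis via QR and SVD.

    X, Y : (n_samples, d) matrices with corresponding rows (assumed layout)
    Returns the canonical correlations (descending) and matrices A, B such that
    the columns of Xc @ A and Yc @ B are the paired canonical variates,
    where Xc and Yc are the column-centered inputs.
    """
    Xc = np.asarray(X, dtype=float) - np.mean(X, axis=0)
    Yc = np.asarray(Y, dtype=float) - np.mean(Y, axis=0)

    # Orthonormal bases for the column spaces of the centered data.
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)

    # Singular values of Qx' Qy are the canonical correlations.
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)

    # Map the canonical directions back to the original coordinates.
    A = np.linalg.solve(Rx, U)
    B = np.linalg.solve(Ry, Vt.T)
    return np.clip(s, 0.0, 1.0), A, B
```

      Correlations near 1 indicate that the two sets of segments share a similar arrangement after a linear transformation of each; values near 0 indicate little shared structure.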

      (10) page 15, line 431. Typo, it should be Table 3. 

      We have removed Table 3 which no longer applies.

      (11) A more general observation: did you try to compute another metric to assess the progressive shift of subspaces over time? I am thinking of something like computing the principal angles between consecutive subspaces. If it is true that the shifts happen over time, but it slows down during movement and hold, you should be able to conclude it from principal angles as well. Am I missing something? Is there any reason you went with classification accuracy instead of a metric like this?  

      Point taken.  We now have calculated the principal angles as a function of time and have presented them as a new section of the Results including new Figure 4 and Figure 4 – figure supplement 3 (lines 237 to 293). 

      “Instantaneous subspaces shift progressively during both execution and observation 

      We identified an instantaneous subspace at each one millisecond time step of RGM trials.  At each time step, we applied PCA to the 4 instantaneous neural states (i.e. the 4 points on the neural trajectories representing trials involving the 4 different objects each averaged across 20 trials per object, totaling 80 trials), yielding a 3-dimensional subspace at that time (see Methods).  Note that because these 3-dimensional subspaces are essentially instantaneous, they capture the condition-dependent variation in neural states, but not the common, condition-independent variation.  To examine the temporal progression of these instantaneous subspaces, we then calculated the principal angles between each 80-trial instantaneous subspace and the instantaneous subspaces averaged across all trials at four behavioral time points that could be readily defined across trials, sessions, and monkeys: the onset of the instruction (I), the go cue (G), the movement onset (M), and the beginning of the final hold (H).  This process was repeated 10 times with replacement to assess the variability of the principal angles.  The closer the principal angles are to 0°, the closer the two subspaces are to being identical; the closer to 90°, the closer the two subspaces are to being orthogonal.  

      Figure 4A-D illustrate the temporal progression of the first principal angle of the mirror neuron population in the three sessions (red, green, and blue) from monkey R during execution trials. As illustrated in Figure 4 – figure supplement 1 (see also the related Methods), in each session all three principal angles, each of which could range from 0° to 90°, tended to follow a similar time course.  In the Results we therefore illustrate only the first (i.e. smallest) principal angle.  Solid traces represent the mean across 10-fold cross validation using the 80-trial subsets of all the available trials; shading indicates ±1 standard deviation.  As would be expected, the instantaneous subspace using 80 trials approaches the subspace using all trials at each of the four selected times—I, G, M, and H—indicated by the relatively narrow trough dipping toward 0°.  Of greater interest are the slower changes in the first principal angle in between these four time points.  Figure 4A shows that after instruction onset (I) the instantaneous subspace shifted quickly away from the subspace at time I, indicated by a rapid increase in principal angle to levels not much lower than what might be expected by chance alone (horizontal dashed line). In contrast, throughout the remainder of the instruction and delay epochs (from I to G), Figure 4B and C show that the 80-trial instantaneous subspace shifted gradually and concurrently, not sequentially, toward the all-trial subspaces that would be reached at the end of the delay period (G) and then at the onset of movement (M), indicated by the progressive decreases in principal angle. As shown by Figure 4D, shifting toward the H subspace did not begin until the movement onset (M). 
      To summarize, these changes in principal angles indicate that after shifting briefly toward the subspace present at the time the instruction appeared (I), the instantaneous subspace shifted progressively throughout the instruction and delay epochs toward the subspace that would be reached at the time of the go cue (G), then further toward that at the time of movement onset (M), and only thereafter shifted toward the instantaneous subspace that would be present at the time of the hold (H).

      Figure 4E-H show the progression of the first principal angle of the mirror neuron population during observation trials.  Overall, the temporal progression of the MN instantaneous subspace during observation was similar to that found during execution, particularly around times I and H.  The decrease in principal angle relative to the G and M instantaneous subspaces during the delay epoch was less pronounced during observation than during execution.  Nevertheless, these findings support the hypothesis that the condition-dependent subspace of PM MNs shifts progressively over the time course of RGM trials during both execution and observation, as illustrated schematically in Figure 1A.

      We also examined the temporal progression of the instantaneous subspace of AE neurons.  As would be expected given that AE neurons were not modulated significantly during observation trials, in the observation context AE populations had no gradual changes in principal angle (Figure 4 – figure supplement 3).  During execution, however, Figure 4I-L show that the AE populations had a pattern of gradual decrease in principal angle similar to that found in the MN population (Figure 4A-D).  After the instruction onset, the instantaneous subspace shifted quickly away from that present at time I and progressed gradually toward that present at times G and M, only shifting toward that present at time H after movement onset.  As for the PM MN populations, the condition-dependent subspace of the PM AE populations shifted progressively over the time course of execution RGM trials.”
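
      The principal angles used in the passage above can be computed as in the following sketch, which assumes each subspace is given as an N x k matrix whose columns span it. The function name is hypothetical and the code is illustrative rather than the authors' implementation.

```python
import numpy as np

def principal_angles(W1, W2):
    """Principal angles, in degrees, between the column spaces of W1 and W2.

    W1, W2 : (N, k) matrices whose columns span each subspace
    Returns the angles in ascending order, so the first entry is the smallest.
    """
    # Orthonormalize each basis.
    Q1, _ = np.linalg.qr(np.asarray(W1, dtype=float))
    Q2, _ = np.linalg.qr(np.asarray(W2, dtype=float))

    # Singular values of Q1' Q2 are the cosines of the principal angles.
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
```

      Identical subspaces give angles of 0° and orthogonal subspaces give 90°, matching the interpretation stated in the quoted Results.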

      The related Methods are now described in the subsection “Subspace Comparisons—Principal Angles”.

      Is there any reason you went with classification accuracy instead of a metric like this? 

      We now point out that (lines 295 to 297):

      “The progressive changes in principal angles do not capture another important aspect of condition-dependent neural activity.  The neural trajectories during trials involving different objects separated increasingly as trials progressed in time.”

      And we further clarify this as follows (lines 331 to 348):

      “Decodable information changes progressively during both execution and observation 

      As RGM trials proceeded in time, the condition-dependent neural activity of the PM MN population thus changed in two ways.  First, the instantaneous condition-dependent subspace shifted, indicating that the patterns of firing-rate co-modulation among neurons representing the four different RGM movements changed progressively, both during execution and during observation.  Second, as firing rates generally increased, the neural trajectories representing the four RGM movements became progressively more separated, more so during execution than during observation. 

      To evaluate the combined effects of these two progressive changes, we clipped 100 ms single-trial trajectory segments beginning at times I, G, M, or H, and projected these trajectory segments from individual trials into the instantaneous 3D subspaces at 50 ms time steps.  At each of these time steps, we trained a separate LSTM decoder to classify individual trials according to which of the four objects was involved in that trial.  We expected that the trajectory segments would be classified most accurately when projected into instantaneous subspaces near the time at which the trajectory segments were clipped.  At other times we reasoned that classification accuracy would depend both on the similarity of the current instantaneous subspace to that found at the clip time as evaluated by the principal angle (Figure 4), and on the separation of the four trajectories at the clip time (Figure 5).”
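
      The projection-and-classification logic in the passage above can be illustrated with a small sketch. This is an assumption-laden toy, not the authors' pipeline: a nearest-centroid rule is deliberately swapped in for the LSTM decoder, the basis vectors are assumed to be orthonormal, and all names and labels are hypothetical.

```python
def project(segment, basis):
    """Project each time point (a vector of firing rates) onto the basis
    vectors of a subspace, giving low-dimensional coordinates per time point."""
    return [[sum(x * b for x, b in zip(point, axis)) for axis in basis]
            for point in segment]


def flatten(rows):
    return [value for row in rows for value in row]


def nearest_centroid(train_features, train_labels, sample):
    """Assign the sample to the class whose mean feature vector is closest.
    A simple stand-in for the LSTM decoder described in the text."""
    groups = {}
    for features, label in zip(train_features, train_labels):
        groups.setdefault(label, []).append(features)
    centroids = {
        label: [sum(column) / len(column) for column in zip(*members)]
        for label, members in groups.items()
    }
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(sample, centroids[label])),
    )


def classify_in_subspace(train_segments, train_labels, test_segment, basis):
    """Project all segments into one subspace, then classify the test segment."""
    train = [flatten(project(seg, basis)) for seg in train_segments]
    return nearest_centroid(train, train_labels, flatten(project(test_segment, basis)))
```

      In this toy setting, trials are classified correctly only when the subspace used for projection contains the dimension along which the trajectories separate, which mirrors the expectation that accuracy depends on both subspace similarity and trajectory separation.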

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers for their careful and overall positive evaluation of our work and the constructive feedback! To address the main concerns, we have:

      – Clarified a major misunderstanding of our instructions: Participants were only informed that they would receive different stimuli of medium intensity and were thus not aware that the stimulation temperature remained constant

      – Implemented a new analysis to evaluate how participants rated their expectation and pain levels in the control condition

      – Added a paragraph in the discussion in which we argue that our paradigm is comparable to previous studies

      Below, we provide responses to each of the reviewers’ comments on our manuscript.

      Reviewer #1 (Public Review):

      Summary:  

      In this important paper, the authors investigate the temporal dynamics of expectation of pain using a combined fMRI-EEG approach. More specifically, by modifying the expectations of higher or lower pain on a trial-to-trial basis, they report that expectations largely share the same set of activations before the administration of the painful stimulus, and that the coding of the valence of the stimulus is observed only after the nociceptive input has been presented. fMRI-informed EEG analysis suggested that the temporal sequence of information processing involved the Dorsolateral prefrontal cortex (DLPFC), the anterior insula, and the anterior cingulate cortex. The strength of evidence is convincing, and the methods are solid, but a few alternative interpretations about the findings related to the control group, as well as a more in-depth discussion on the correlations between the BOLD and EEG signals would strengthen the manuscript. 

      Thank you for your positive evaluation! In the revised version of the manuscript, we elaborated on the control condition and the BOLD-EEG correlations in more detail.

      Strengths:  

      In line with open science principles, the article presents the data and the results in a complete and transparent fashion. 

      From a theoretical standpoint, the authors make a step forward in our understanding of how expectations modulate pain by introducing a combination of spatial and temporal investigation. It is becoming increasingly clear that our appraisal of the world is dynamic, guided by previous experiences, and mapped on a combination of what we expect and what we get. New research methods, questions, and analyses are needed to capture these evolving processes.  

      Thank you very much for these positive comments!

      Weaknesses:  

      The control condition is not so straightforward. Across the manuscript it is defined as "no expectation", and in the legend of Figure 1 it is mentioned that the third state would be "no prediction". However, it is difficult to conceive that participants would not have any expectations or predictions. Indeed, in the description of the task it is mentioned that participants were instructed that they would receive stimuli during "intermediate sensitive states". The results of the pain scores and expectations might support the idea that the control condition is situated in between the placebo and nocebo conditions. However, since this control condition was not part of the initial conditioning, and participants had no reference to previous stimuli, one might expect that some ratings might have simply "regressed to the mean" for a lack of previous experience. 

      General considerations and reflections:  

      Inducing expectations in the desired direction is not a straightforward task, and results might depend on the exact experimental conditions and the comparison group. In this sense, the authors' choice of having 3 groups of positive, negative, and "neutral" expectations is to be praised. On the other hand, also control groups form their expectations, and this can constitute a confounder in every experiment using expectation manipulation, if not appropriately investigated. 

      Thank you for raising these important concerns! Firstly, as it seems that we did not explain the experimental procedure in a clear fashion, there appeared to be a general misunderstanding regarding our instructions. We want to emphasize that we did not tell participants that the stimulus intensity would always be the same, but that pain stimuli would be different temperatures of medium intensity. Furthermore, our instruction did not necessarily imply that our algorithm detected a state of medium sensitivity, but that the algorithm would not make any prediction, e.g., due to highly fluctuating states of pain sensitivity, or no clear-cut state of high or low pain sensitivity. We changed this in the Methods (ll. 556-560, 601-606, 612-614) and Results (ll. 181-192) sections of the manuscript to clarify these important features of our procedure.

      Then, we absolutely agree that participants explicitly and implicitly form expectations regarding all conditions over time, including the control condition. We carefully considered your feedback and rephrased the control condition, no longer framing it as eliciting “no expectations” but as “neutral expectations” in the revised version of the manuscript. This follows the more common phrasing in the literature and acknowledges that participants indeed build up expectations in the control condition. However, we do still think that we can meaningfully compare the placebo and nocebo condition to the control condition to investigate the neuronal underpinnings of expectation effects. Independently of whether participants build up an expectation of “medium” intensities in the control condition, which caused them to perceive stimuli in line with this expectation, or if they simply perceived the stimuli as they were (of medium intensity) with limited effects of expectations, the crucial difference to the placebo and nocebo conditions is that there was no alteration of perception due to previous experiences or verbal information and no shift of perception from the actual stimulus intensity towards any direction in the control condition. This allowed us to compare the neural basis of a modulation of pain perception in either direction to a condition in which this modulation did not take place. 

      Author response image 1.

      Variability within conditions over time. Relative variability index for expectation (left) and pain ratings (right) per condition and measurement block. 

      Lastly, we want to highlight that our finding of the control condition being rated in between the placebo and nocebo condition is in line with many previous studies that included similar control conditions and advanced our understanding of pain-related expectations (Bingel et al., 2011; Colloca et al., 2010; Shih et al., 2019). We thank the reviewer for the very interesting idea to evaluate the development of ratings in the control condition in more detail and added a new analysis to the manuscript in which we compared how much intra-subject variance there was within the ratings of each of the three conditions and how much this variance changed over time. For this aim, we computed the relative variability index (Mestdagh et al., 2018), a measure that quantifies intra-subject variation over multiple ratings, and compared it between the three conditions and the three measurement blocks. We observed differences in variance between conditions for both expectation (F(2,96) = 8.14, p < .001) and pain ratings (F(2,96) = 3.41, p = .037). For both measures, post-hoc tests revealed that there was significantly more variance in the placebo compared to the control condition (both p_holm < .05), but no difference between control and nocebo. The substantial and comparable variation in pain and expectation ratings in all three conditions (or at least between control and nocebo) shows that participants did not always expect and perceive the same intensity within conditions. Variance in expectation ratings decreased from the first block compared to the other two blocks (F(1.35,64.64) = 5.69, p = .012; both p_holm < .05), which was not the case for pain ratings. Most importantly, there was no interaction effect of block and condition for either expectation (F(2.65,127.06) = 0.40, p = .728) or pain ratings (F(4,192) = 0.48, p = .748), which implies that expectations were similarly dynamically updated in all conditions over the course of the experiment. This speaks against a “regression to the mean” in the control condition and shows that control ratings fluctuated from trial to trial. We included this analysis and a more in-depth discussion of the choice of conditions in the Results (ll. 219-232) and Discussion (ll. 452-486) sections of the revised manuscript.
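
      As an illustration of the kind of measure used in this variability analysis, the sketch below computes a relative variability value for one participant's ratings. It rests on stated assumptions rather than on the exact implementation used here: the index is taken to be the observed standard deviation divided by the largest standard deviation attainable for the same mean on a bounded scale, and the scale is assumed to run from 0 to 100. Function names are hypothetical.

```python
import math
from statistics import mean, variance


def max_variance(m, lo, hi, n):
    """Largest sample variance that n ratings bounded by [lo, hi] can have
    while keeping the mean m: as many ratings as possible sit at the bounds,
    with at most one leftover rating in between."""
    if n < 2 or m <= lo or m >= hi:
        return 0.0
    n_hi = math.floor(n * (m - lo) / (hi - lo))  # ratings placed at the upper bound
    n_lo = n - 1 - n_hi                          # ratings placed at the lower bound
    leftover = n * m - n_lo * lo - n_hi * hi     # single rating that preserves the mean
    return (n_lo * (lo - m) ** 2 + n_hi * (hi - m) ** 2 + (leftover - m) ** 2) / (n - 1)


def relative_variability(ratings, lo=0.0, hi=100.0):
    """Observed SD relative to the maximum possible SD given the mean (0 to 1)."""
    n = len(ratings)
    if n < 2:
        return 0.0
    mv = max_variance(mean(ratings), lo, hi, n)
    if mv == 0.0:
        return 0.0
    return math.sqrt(variance(ratings) / mv)
```

      Normalizing by the maximum attainable variability matters on bounded rating scales because ratings with a mean near either end of the scale cannot vary as much as ratings with a mean near the middle.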

      In addition, although fMRI is still (probably) the best available tool we have to understand the spatial representation of cortical processing, limitations about not only the temporal but even the spatial resolution should be acknowledged. Given the anatomical and physiological complexity of the cortical connections, as we know from the animal world, it is still well possible that subcircuits are activated also for positive and negative expectations, but cannot be observed due to the limitation of our techniques. Indeed, on an empirical/evolutionary basis it would remain unclear why we should have a system that waits for the valence of a stimulus to show differential responses. 

      We agree that the spatial resolution of fMRI is limited and that our signal is often not able to dissociate different subcircuits. Whether on this basis differential processes occurred cannot be observed in fMRI but is indeed possible. We now include this reasoning in our Discussion (ll. 373-377):

      “Importantly, the spatial resolution of fMRI is limited when it comes to discriminating whether the same pattern of activity is due to identical activation or to activation in different sub-circuits within the same area. Nonetheless, the overlap of areas is an indicator for similar processes involved in a more general preparation process.”

      Also, moving in a dimension of network and graph theory, one would not expect single areas to be responsible for distinct processes, but rather that they would integrate information in a shared way, potentially with different feedback and feedforward communications. As such, it becomes more difficult to assume the insula is a center for coding potential pain, perhaps more of a node in a system that signals potential dangers for the integrity of the body. 

      We appreciate the feedback on our interpretation of our results and agree that the overall network activity most likely determines how a large part of expectations and pain are coded. We therefore adjusted the Discussion, embedding the results in an interpretation considering networks (ll. 427-430, 432-435, 438-442).

      The authors analyze the EEG signal between 0.5 to 128 Hz, finding significant results in the correlation between single-trial BOLD and EEG activity in the higher gamma range (see Figure 6 panel C). It would be interesting to understand the rationale for including such high frequencies in the signal, and the interpretation of the significant correlation in the high gamma range. 

      On a technical level, we adapted our EEG processing pipeline from Hipp et al. (2011), who similarly investigated signals up to 128 Hz. Of note, the spectral smoothing was adjusted to match 3/4 octave, meaning that the frequency resolution at 128 Hz is rather broad and does not only contain oscillations at exactly 128 Hz. Gamma oscillations in general have repeatedly been reported in relation to pain and feedforward signals reflecting noxious information (e.g. Ploner et al., 2017; Strube et al., 2021). Strube et al. (2021) reported the highest effects of pain stimulus intensity and prediction error processing at high gamma frequencies (100 and 98 Hz, respectively). These findings could also serve as a basis to interpret our results in this frequency range: If anticipatory activation in the ACC is linked to high gamma oscillations, which appear to play an important role in feedforward signaling of pain intensity and prediction errors, this could indicate that later processing of intensity in this area is already pre-modulated before the stimulus actually occurs. Of note: although not significant, it looks as if the cluster extends further into pain processing on a descriptive level. We added additional explanation regarding the interpretation of the correlation in the Discussion (ll. 414-425):

      “The link between anticipatory activity in the ACC and EEG oscillatory activity was observed in the high gamma band, which is consistent with findings that demonstrate a connection between increased fMRI BOLD signals and a relative shift from lower to higher frequencies (Kilner et al., 2005). Gamma oscillations have been repeatedly reported in the context of pain and expectations and have been interpreted as reflecting feedforward signals of noxious information (e.g. Ploner et al., 2017; Strube et al., 2021). In combination with our findings, this might imply that high frequency oscillations may not only signal higher actual or perceived pain intensity during pain processing (Nickel et al., 2022; Ploner et al., 2017; Strube et al., 2021; Tu et al., 2016), but might also be instrumental in the transfer of directed expectations from anticipation into pain processing.”
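
      To give a rough sense of how broad a 3/4-octave band is at the highest analyzed frequency, the following sketch computes approximate band edges. It assumes that the smoothing band is centered on the nominal frequency on a logarithmic axis (3/8 octave on either side), which is an illustrative assumption rather than a statement about the exact filter implementation; the function name is hypothetical.

```python
def band_edges(center_hz, width_octaves=0.75):
    """Lower and upper edge of a band of the given width in octaves,
    assumed symmetric around center_hz on a log-frequency axis."""
    half = width_octaves / 2.0
    return center_hz * 2.0 ** (-half), center_hz * 2.0 ** half


low, high = band_edges(128.0)
print(f"{low:.1f} Hz to {high:.1f} Hz")
```

      Under this assumption, the band labeled 128 Hz spans roughly 99 to 166 Hz, so it overlaps with the high gamma frequencies around 100 Hz reported in earlier work.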

      Reviewer #2 (Public Review):  

      I think this is a very promising paper. The combination of EEG and fMRI is unique and original. However, I also have some suggestions that I think could help improve the manuscript. 

      This manuscript reports the findings of an EEG-fMRI study (n = 50) on the effects of expectations on pain. The combination of EEG with fMRI is extremely original and well-suited to study the transition from expectation to perception. However, I think that the current treatment of the data, as well as the way that the manuscript is currently written, does not fully capitalize on the potential of this unique dataset. Several findings are presented but there is currently no clear message coming out of this manuscript. 

      First, one positive point is that the experimental manipulation clearly worked. However, it should be noted that the instructions used are not typical of studies on placebo/nocebo. Participants were not told that the stimulations would be of higher/lower intensity. Rather, they were told that objective intensities were held constant, but that EEG recordings could be used to predict whether they would perceive the stimulus as more or less intense. I think that this is an interesting way to manipulate expectations, but there could have been more justification in the introduction for why the authors have chosen this unusual procedure. 

      Most importantly, we want to emphasize again that participants were not aware that the stimulation temperature was always the same but were informed that they would receive different stimuli of medium intensity. We now clarify this in the revised Results (ll. 190-192) and Methods (ll. 612-614) sections.

      While we agree that our procedure was not typical, we believe that the manipulation is nonetheless comparable to previous studies on pain-related expectations. To our knowledge, either expectations regarding a treatment that changes pain perception (treatment expectancy) or expectations regarding stimulus intensities (stimulus expectancy) are manipulated (see Atlas & Wager, 2014). In our study, participants received a cue that induced expectations in regard to a “treatment”, although in this case the “treatment” came from changes in their own brain activity. This is comparable to studies using TENS devices that are supposedly changing peripheral pain transmission (Skvortsova et al., 2020). Thus, although not typical, our paradigm could be classified as targeting treatment expectancies and allowed us to examine effects on a trial-by-trial level within subjects. We added a paragraph regarding the comparability of our paradigm with previous studies in the Discussion of the revised manuscript (ll. 452-464).

      Also, the introduction mentions that little is known about potential cerebral differences between expectations of high vs. low pain expectations. I think the fear conditioning literature could be cited here. Activations in ACC, SMA, Ins, parahippocampal gyrus, PAG, etc. are often associated with upcoming threat, whereas activations vmPFC/default mode network are associated with safety. 

      We thank you for your suggestion to add literature on fear conditioning. We agree there is some overlap between fear conditioning and expectation effects in humans, but we also believe there are fundamental differences regarding their underlying processes and paradigms. For example, the expectation effects are not driven by classical learning algorithms but act to a large extent as self-fulfilling prophecies (see e.g. Jepma et al., 2018). However, we now acknowledge the similarities between the modalities, e.g. in the recruitment of the insula and the vmPFC, in our Introduction (ll. 132-136).

      The fact that the authors didn't observe a clearer distinction between high and low expectations here could be related to their specific instructions that imply that the stimulus is the same and that it is the subjective perception that is expected to change. In any case, this is a relatively minor issue that is easy to address. 

      We apologize again for the lack of clarity in our instructions: Participants were unaware that they would receive the exact same stimulus. The clear effects of the different conditions on expectation and pain ratings also challenge the notion that participants always expected the same level of stimulation and/or perception. Additionally, if participants were indeed expecting a consistent level of intensity in all conditions, one would also assume to see the same anticipatory activation in the control condition as in the placebo and nocebo conditions, which is not the case. Thus, we respectfully disagree that the common effects might be explained by our instructions but would argue that they indeed reflect common (anticipatory) processes of positive and negative expectations.

      Towards the end of the introduction, the authors present the aims of the study in mainly exploratory terms: 

      (1) What are the differences between anticipation and perception? 

      (2) What regions display a difference between high and low expectations (high > low or low < high) vs. an effect of expectation regardless of the direction (high and low different than neutral)? 

      I think these are good questions, but the authors should provide more justification, or framework, for these questions. More specifically, what will they be able to conclude based on their observations? 

      For instance (note that this is just an example to illustrate my point. I encourage the authors to come up with their own framework/predictions) : 

      (1) Possibility #1: A certain region encodes expectations in a directed fashion (high > low) and that same region also responds to perception in the same direction (high > low). This region would therefore modulate pain by assimilating perception towards expectations. 

      (2) Possibility # 2: different regions are involved in expectation and perception. Perhaps this could mean that certain regions influence pain processing through descending facilitation for instance...  

      Thank you for pointing out that our hypotheses were not crafted carefully enough. We tried to give better explanations for the possible interpretations of our hypotheses. Additionally, we interpreted our results against the background of a broader framework for placebo and nocebo effects (predictive coding) to derive possible functions of the described brain areas. We embedded this in our Introduction (ll. 74-86, 158-175) and Discussion (ll. 384-388), interpreting the anticipatory activity and the activity during pain processing in the context of expectation formation as described in Büchel et al. (2014).

      Interpretation derived from our framework (ll. 384-388):

      e.g.: “Following the framework of predictive coding, our results would suggest that the DPMS is the network responsible for integrating ascending signals with descending signals in the pain domain and that this process is similar for positive and negative valences during anticipation of pain but differentiates during pain processing.”

      Regarding analyses, I think that examining the transition from expectations to perception is a strong angle of the manuscript given the EEG-fMRI nature of the study. However, I feel that more could have been done here. One problem is that the sequence of analyses starts by identifying an fMRI signal of interest and then attempts to find its EEG correlates. The problem is that the low temporal resolution of fMRI makes it difficult to differentiate expectation from perception, which doesn't make this analysis a good starting point in my opinion. Why not start by identifying an EEG signal that differentiates perception vs expectation, and then look for its fMRI correlates?  

      We appreciate your feedback on the transition from expectations to perception and also think that additional questions could be answered with our data set. However, based on the literature we had specific hypotheses regarding specific brain areas, and we therefore decided to start from the fMRI data with its superior spatial resolution, while EEG was used to focus on the temporal dynamics within the areas important for anticipatory processes. We share the view that many different approaches to analyzing our data are possible. On the other hand, identifying relevant areas based on EEG characteristics entails even more uncertainty due to the spatial filtering of the EEG signal. For the research question of this study, a more accurate evaluation of the involved areas and the related representation was more important. We therefore decided to only implement the procedure already present in the manuscript. 

      Finally, I found the hypotheses on "valenced" vs. "absolute" effects a little bit more difficult to follow. This is because "neutral" is not really neutral: it falls in between low and high. If I follow correctly, participants know that the temperature is always the same. Therefore, if they are told that the machine cannot predict whether their perception is going to be low or high, then it must be because it is likely to be in between. Ratings of expectation and pain ratings confirm that. The neutral condition is not "devoid" of expectations as the authors suggest.

      Therefore, it would make sense to look at regions with the following pattern low > neutral > high, or vice-versa, low < neutral < high. Low & high being different than neutral is more difficult to interpret. I don't think that you can say that it reflects "absolute" expectations because neutral is also the expectation of a medium temperature. Perhaps it reflects "certainty/uncertainty" or something like that, but it is not clear that it reflects "expectations". 

      Thank you for your valuable feedback! We considered your concerns about the interpretation of our results and completely agree that the control condition cannot be interpreted as void of expectations (ll. 119-123). We therefore evaluated the control condition in more detail in a separate analysis (ll. 219-232) and integrated a new assessment of the conditions into the Discussion (ll. 465-486). We changed the phrasing of our control condition to “neutral expectations”, as we agree that the control condition is not void of expectations and this phrasing is more in line with other studies (e.g. Colloca et al., 2010; Freeman et al., 2015; Schmid et al., 2015). We would argue that the neutral expectations can still be meaningfully compared to positive and negative expectations because only the latter shift expectations and perception in one direction. Thus, we changed our wording throughout the manuscript to acknowledge that we indeed did not test for general effects of expectations vs. no expectations, but for effects of directed expectations. Please also see our reasoning regarding the control condition in response to Reviewer 1, in which we addressed the interpretation of the control condition. We therefore still believe that the contrasts that we calculated between conditions are valid. The proposed new contrast largely overlaps with our differential contrast low>high and vice versa already reported in the manuscript (for additional results also see Supplements).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Figure 6, panel C. The figure mentions Anterior Cingulate Cortex R, whereas the legend mentions left ACC. Please check. 

      Thanks for catching this, we changed the figure legend accordingly.

      Reviewer #2 (Recommendations For The Authors):  

      - I don't think that activity during the rating of expectations is easily interpretable. I think I would recommend not reporting it. 

      The majority of participants completed the expectation rating relatively quickly (M = 2.17 s, SD = 0.35 s), which resulted in the overlap between the DLPFC EEG cluster and the expectation rating encompassing only a limited portion of the cluster (~ 1 s). We agree that this activity still is more difficult to interpret, yet we have decided to report it for reasons of completeness.

      - The effects on SIIPS are interesting. I think that it is fine to present them as a "validation" of what was observed with pain ratings, but it also seems to give a direction to the analyses that the authors don't end up following. For instance, why not try other "signatures" like the NPS or signatures of pain anticipation? Also, why not try to look at EEG correlates of SIIPS? I don't think that the authors "need" to do any of that, but I just wanted to let them know that SIIPS results may stir that kind of curiosity in the readers.  

      While this would be indeed very interesting, these additional analyses are not directly related to our current research question. We fear that too many analyses could be confusing for the readers. Nonetheless, we are grateful for your suggestion and will implement additional brain signatures in future studies. 

      - The shock was calibrated to be 60%. Why not have high (70%) and low (30%) conditions at equal distances from neutral, like 80% and 40% for instance? The current design makes it hard to distinguish high from control. Perhaps the "common" effects of high + low are driven by a deactivation for low (30%)?  

      We appreciate your feedback! We adjusted the temperature during the test phase to counteract the habituation that typically occurs with heat stimuli. We believe that this was a good measure, as participants rated the control condition at roughly VAS 50 (M = 51.40), which was our target temperature and would thus be equidistant from the VAS 70 and VAS 30 used during conditioning, when no habituation should have taken place yet. We further tested whether participants rated placebo and nocebo trials at equal distances from the control condition and found no bias for either of the conditions. To do this, we computed the individual placebo effect (control minus placebo) and nocebo effect (nocebo minus control) for each participant during the test phase and statistically compared whether they differed in terms of magnitude. There was no significant difference between placebo and nocebo effects for either expectation (placebo effect M = 14.25 vs. nocebo effect M = 17.22, t(49) = 1.92, p = .061) or pain ratings (placebo effect M = 6.52 vs. nocebo effect M = 5.40, t(49) = -1.11, p = .274). This suggests that our expectation manipulation resulted in comparable shifts in expectation and pain ratings away from the control condition for both the placebo and nocebo condition and thus argues against any bias of the conditioning temperatures. Please also note that the analysis of the common effects was masked for differences between the high and low conditions; therefore, the effects cannot be driven by one condition by itself.
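
      The comparison of placebo and nocebo effect magnitudes described above can be sketched as follows. This is an illustrative outline with hypothetical function names and made-up toy numbers, not the study data or the software used for the reported statistics. The sign convention (nocebo effect minus placebo effect) is an assumption inferred from the signs of the reported t-values.

```python
import math
from statistics import mean, stdev


def expectation_effects(control, placebo, nocebo):
    """Per-participant placebo effect (control minus placebo) and
    nocebo effect (nocebo minus control) from condition-wise mean ratings."""
    placebo_effect = [c - p for c, p in zip(control, placebo)]
    nocebo_effect = [n - c for n, c in zip(nocebo, control)]
    return placebo_effect, nocebo_effect


def paired_t(x, y):
    """Paired-samples t statistic for x minus y and its degrees of freedom."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Toy ratings for four hypothetical participants (not real data).
control = [50, 52, 48, 51]
placebo = [40, 45, 38, 44]
nocebo = [60, 58, 59, 57]

placebo_effect, nocebo_effect = expectation_effects(control, placebo, nocebo)
t_value, df = paired_t(nocebo_effect, placebo_effect)
print(f"t({df}) = {t_value:.2f}")
```

      A t-value close to zero in such a test indicates that ratings shifted away from the control condition by similar amounts in both directions.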

      - If I understand correctly, all fMRI contrasts were thresholded with FWE. This is fine, but very strict. The authors could have opted for FDR. Maybe I missed something here....  

      While it is true that FDR is the more liberal approach, it is not valid for spatially correlated fMRI data and is no longer available in SPM for the correction of multiple comparisons. The newly implemented topological peak-based FDR correction is comparable in sensitivity to the FWE correction (see Chumbley et al. BELEG). We opted for the slightly more conservative approach in our preregistration (p_FWE < .05); therefore, a change of the correction is not possible.

      Altogether, I think that this is a great study. The combination of EEG and fMRI is truly unique and affords many opportunities to examine the transition from expectations to perception. The experimental manipulation of expectations seems to have worked well, and there seem to be very promising results. However, I think that more could have been done. At least, I would recommend trying to give more of a theoretical framework to help interpret the results.  

      We are very grateful for your positive feedback. We took your suggestion seriously and tried to implement a more general framework from the literature (see Büchel et al., 2014) to provide a better explanation for our results.

      References

      Atlas, L. Y., & Wager, T. D. (2014). A meta-analysis of brain mechanisms of placebo analgesia: Consistent findings and unanswered questions. Handbook of Experimental Pharmacology, 225, 37–69. https://doi.org/10.1007/978-3-662-44519-8_3

      Bingel, U., Wanigasekera, V., Wiech, K., Ni Mhuircheartaigh, R., Lee, M. C., Ploner, M., & Tracey, I. (2011). The effect of treatment expectation on drug efficacy: Imaging the analgesic benefit of the opioid remifentanil. Science Translational Medicine, 3(70), 70ra14. https://doi.org/10.1126/scitranslmed.3001244

      Büchel, C., Geuter, S., Sprenger, C., & Eippert, F. (2014). Placebo analgesia: A predictive coding perspective. Neuron, 81(6), 1223–1239. https://doi.org/10.1016/j.neuron.2014.02.042

      Colloca, L., Petrovic, P., Wager, T. D., Ingvar, M., & Benedetti, F. (2010). How the number of learning trials affects placebo and nocebo responses. Pain, 151(2), 430–439. https://doi.org/10.1016/j.pain.2010.08.007

      Freeman, S., Yu, R., Egorova, N., Chen, X., Kirsch, I., Claggett, B., Kaptchuk, T. J., Gollub, R. L., & Kong, J. (2015). Distinct neural representations of placebo and nocebo effects. NeuroImage, 112, 197–207. https://doi.org/10.1016/j.neuroimage.2015.03.015

      Hipp, J. F., Engel, A. K., & Siegel, M. (2011). Oscillatory synchronization in large-scale cortical networks predicts perception. Neuron, 69(2), 387–396. https://doi.org/10.1016/j.neuron.2010.12.027

      Jepma, M., Koban, L., van Doorn, J., Jones, M., & Wager, T. D. (2018). Behavioural and neural evidence for self-reinforcing expectancy effects on pain. Nature Human Behaviour, 2(11), 838–855. https://doi.org/10.1038/s41562-018-0455-8

      Kilner, J. M., Mattout, J., Henson, R., & Friston, K. J. (2005). Hemodynamic correlates of EEG: A heuristic. NeuroImage, 28(1), 280–286. https://doi.org/10.1016/j.neuroimage.2005.06.008

      Nickel, M. M., Tiemann, L., Hohn, V. D., May, E. S., Gil Ávila, C., Eippert, F., & Ploner, M. (2022). Temporal-spectral signaling of sensory information and expectations in the cerebral processing of pain. Proceedings of the National Academy of Sciences of the United States of America, 119(1). https://doi.org/10.1073/pnas.2116616119

      Ploner, M., Sorg, C., & Gross, J. (2017). Brain Rhythms of Pain. Trends in Cognitive Sciences, 21(2), 100–110. https://doi.org/10.1016/j.tics.2016.12.001

      Schmid, J., Bingel, U., Ritter, C., Benson, S., Schedlowski, M., Gramsch, C., Forsting, M., & Elsenbruch, S. (2015). Neural underpinnings of nocebo hyperalgesia in visceral pain: A fMRI study in healthy volunteers. NeuroImage, 120, 114–122. https://doi.org/10.1016/j.neuroimage.2015.06.060

      Shih, Y.‑W., Tsai, H.‑Y., Lin, F.‑S., Lin, Y.‑H., Chiang, C.‑Y., Lu, Z.‑L., & Tseng, M.‑T. (2019). Effects of Positive and Negative Expectations on Human Pain Perception Engage Separate But Interrelated and Dependently Regulated Cerebral Mechanisms. Journal of Neuroscience, 39(7), 1261–1274. https://doi.org/10.1523/JNEUROSCI.2154-18.2018

      Skvortsova, A., Veldhuijzen, D. S., van Middendorp, H., Colloca, L., & Evers, A. W. M. (2020). Effects of Oxytocin on Placebo and Nocebo Effects in a Pain Conditioning Paradigm: A Randomized Controlled Trial. The Journal of Pain, 21(3-4), 430–439. https://doi.org/10.1016/j.jpain.2019.08.010

      Strube, A., Rose, M., Fazeli, S., & Büchel, C. (2021). The temporal and spectral characteristics of expectations and prediction errors in pain and thermoception. eLife, 10. https://doi.org/10.7554/eLife.62809

      Tu, Y., Zhang, Z., Tan, A., Peng, W., Hung, Y. S., Moayedi, M., Iannetti, G. D., & Hu, L. (2016). Alpha and gamma oscillation amplitudes synergistically predict the perception of forthcoming nociceptive stimuli. Human Brain Mapping, 37(2), 501–514. https://doi.org/10.1002/hbm.23048

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The current manuscript provides a timely contribution to the ongoing discussion about the mechanism of the apical sodium/bile acid transporters (ASBT). Recent structures of the mammalian ASBT transporters exhibited a substrate binding mode with few interactions with the core domain (classically associated with substrate binding), prompting an unusual proposal for the transport mechanism. Early structures of ASBT homologues from bacteria also exhibit unusual substrate binding in which the core substrate binding domain is less engaged than expected. Due to the ongoing questions of how substrate binding and mechanism are linked in these transporters, the authors set out to deepen our understanding of a model ASBT homologue from the bacterium N. meningitidis (ASBT-NM).

      The premise of the current paper is that the bacterial ASBT homologs are probably not physiological bile acid transporters, and that structural elucidation of a natively transported substrate might provide better mechanistic information. In the current manuscript, the authors revisit the first BASS homologue to be structurally characterized, ASBT-NM. Based on bacteriological assays in the literature, the authors identify the coenzyme A precursor pantoate as a more likely substrate for ASBT-NM than taurocholate, the substrate in the original structure. A structure of ASBT-NM with pantoate exhibits interesting differences in structure. The structures are complemented with MD simulations, and the authors propose that the structures are consistent with a classical elevator transport mechanism.

      The structural experiments are generally solid, although showing omit maps would bolster the identification of the substrate binding site.

      We have added an omit map in Fig S2.

      One shortcoming is that, although pantoate binding is observed, the authors do not show transport of this substrate, undercutting the argument that the pantoate structure represents binding of a "better" or more native substrate. Mechanistic proposals, like the proposed role of T112 in unlocking the transporter, would be much better supported by transport data.

      In the absence of being able to source radiolabelled pantoate at a reasonable cost, we decided to focus on binding studies, relying on the fact that pantoate/pyruvate uptake has been shown in other BASS transporters. While we agree that transport needs to be substantiated, our crystallographic and molecular dynamics studies combined provide a picture of sodium ions stabilising the substrate binding site to enable the binding of the substrate, which in turn induces further conformational changes. Such changes would be consistent with a mechanism of sodium driven transport with clear coupling of the sodium ions to substrate translocation. We are not saying this is a “better” substrate but rather that a substrate binding like this would be able to elicit the conformational changes necessary for transport – something that has been missing from previous studies.

      Reviewer #2 (Public Review):

      The manuscript starts with a demonstration of pantoate binding to ASBTnm using a thermostability assay and ITC, and follows with structure determinations of ASBTnm with or without pantoate. The structure of ASBTnm in the presence of pantoate pinpoints the binding site of pantoate to the "crossover" region formed by the partially unwound helices TMs 4 and 9. Binding of pantoate induces modest movements of side chain and backbone atoms at the crossover region that are consistent with providing coordination of the substrate. The structures also show movement of TM1 that opens the substrate binding site to the cytosol and mobility of loops between the TMs. MD simulations of the ASBT structure embedded in a lipid bilayer suggest a stabilizing effect of the two sodium ions that are known to co-transport with the substrate. A binding study on pantoate analogs further demonstrates the specificity of pantoate as a substrate.

      Weaknesses of the manuscript include the lack of a transport assay for pantoate and the lack of a demonstration that the observed conformational changes in TM1 and the loops are relevant to the binding or transport of pantoate.

      We agree that the manuscript would have been bolstered by transport data (see response to reviewer 1). The take-home message from the movement of TM1 and the loops is that they are flexible. TM1 probably does not move like this during the transport cycle, and we have avoided overplaying the significance of this movement. Instead, we have focussed on the conformational changes in the pantoate binding site. We have made an additional movie concentrating on the binding site and not including TM1.

      Overall, the structural, functional and computational studies are solid and rigorous, and the conclusions are well justified. In addition, the authors discussed the significance of the current study in a broader perspective relevant to recent structures of mammalian BASS members.

      Reviewer #3 (Public Review)

      The manuscript describes new ligand-bound structures within the larger bile acid sodium symporter family (BASS). This is the primary advance in the manuscript, together with molecular simulations describing how sodium and the bile acids sit in the structure when thermalized. What I think is fairly clear is that the ligands are more stable when the sodiums are present, with a marked reduction in RMSD over the course of repeated trajectories. This would be consistent with a transport model where sodium ions bind first, and then the bile acid binds, followed by a conformational change to another state where the ligands unbind.

      While the authors mention that BASS transporters are thought to undergo an elevator transport mechanism, this is not tested here. In my reading, all the crystal structures describe the same conformational state, and the simulations do not make an attempt to induce a transition on accessible simulation timescales. Instead, there is a morph between two states where different substrates are bound, which induces a conformational change that looks unrelated to the transport cycle.

      To make our conclusions clearer, we have added another movie showing a morph between the structure without substrate (instead of the structure with taurocholate, which we had been using as a representative of the unbound structure) and the structure with pantoate, and have omitted the panel domain including TM1. While both of these structures are inward-facing, there are significant conformational changes within TM4 that we have described in the article.

      Instead, the focus is on what kinds of substrates bind to this transporter, interrogating this with isothermal calorimetry together with mutations. With a Kd in the micromolar range, even the best binder, pantoate, actually isn't a particularly tight binder in the pharmaceutical sense. For a transporter, tight binding is not actually desirable, since the substrate needs to be able to leave after conformational change places it in a position accessible to the other side.

      As the referee points out, the Kd that we observe would be consistent with those for substrates of other transporters.

      There is one really important point that readers and authors should be aware of. In Figure 2A, the names are not consistent with the chemical structure. "-ate" denotes when a carboxylic acid is in the deprotonated form, creating a charged carboxylate. What is drawn is pantoic acid, ketopantoic acid, and pantothenic acid. Less importantly, the wedges and hashes for the methyl group are arguably not appropriate, since the carbon they are attached to is not a chiral center. For the crystallization, this makes no difference, since under near-neutral pH the carboxylic acid will spontaneously deprotonate, and the carboxylate form will be the most common. However, if the structures in Figure 2A were used for classical molecular simulation, that would be a big problem, since now that would be modeling the much rarer neutral form rather than the charged state. I am reasonably sure based on Figure 5 that the MD correctly modeled the deprotonated form with a carboxylate, but that is inconsistent with Figure 2A. Otherwise, the structure and simulation analysis falls into the mainstream of modern structural biology work.

      We have corrected the inconsistency of the protonation state in the naming of the molecular structures. Thank you for pointing this out – though the names represented the predominant form in solution, the more aesthetically pleasing protonated form got the better of us in our representations. The correct form was used in the MD.

      Reviewer #1 (Recommendations For The Authors):

      1) Omit maps (Fo-Fc) should be shown for pantoate and for the sodiums in the structure.

      This has been added to supplementary Figure 2.

      2) Line 86 - could you briefly describe the alternative mechanism proposed for the mammalian NTCPs?

      We have added an extra line to describe this deviation from the classical alternating access model.

      3) Line 124 - where is the lipid-like molecule, and does it interact with either the kinked helix or the substrate? A supplemental figure would be helpful.

      The lipid-like molecule lies between the substrate and the kinked helix, but doesn’t interact strongly with either. It would appear that the lipid binds in the crevice rather than causing the crevice. We add Author response image 1 here but have not added it to the supplementary figures. The maps and PDB file are available for download.

      Author response image 1.

      The 2mFo-DFc density is at 1σ, the mFo-DFc density is at 2.5σ.

      4) I notice that the apo and pantoate structures are crystallized in different space groups. How does this compare to the original TCH structure? Is there any chance that crystal packing is altering the TM1 geometry or loop 1?

      We cannot rule out an effect of the crystallisation conditions on the movement of TM1. We have now solved a number of different structures of ASBTNM and this is the first time we observe TM1 in this conformation. As stated above, we have refrained from overplaying the significance of the movement of TM1 to transport, other than to say that some adjustments need to be made to accommodate the pantoate.

      Reviewer #2 (Recommendations For The Authors):

      Minor comments:

      Pg 3, "... with a 5-fold inverted repeat...", Should be 2-fold?

      Changed, thank you.

      Reviewer #3 (Recommendations For The Authors):

      Is there any chance that the MD simulations (even in a reduced form) could be uploaded to Zenodo or a similar repository?

      We have taken up this suggestion and added the information in the paper: MD trajectories in the GROMACS XTC format were deposited in the OSF.io repository under DOI 10.17605/OSF.IO/KFDT5 under the open CC-BY Attribution 4.0 International license. The trajectories contain all atoms and were subsampled at 5-ns intervals. GROMACS run input files (TPR format) and initial coordinate files (GRO format) together with topology files (GROMACS format) are also included.

      Watch the "Å" symbol in Figures 5, S6, S7. This looks like they were made in matplotlib, and probably used something like: "$\AA$", which puts the symbol in math mode. This renders the Å symbol in italics. Matplotlib has gotten better UTF-8 support.

      Changed, thank you.
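
      The reviewer's suggestion can be sketched as follows: pass the Unicode character directly in the label string rather than a math-mode macro. This is a minimal, hypothetical example, not the authors' plotting script; the function names, the output path, and the choice of the non-interactive "Agg" backend are all assumptions.

```python
# U+00C5, LATIN CAPITAL LETTER A WITH RING ABOVE, used as plain text so
# that the label is not typeset in math mode.
ANGSTROM = "\u00c5"


def axis_label(quantity, unit=ANGSTROM):
    """Build a plain-text axis label such as 'RMSD (Å)'."""
    return f"{quantity} ({unit})"


def plot_rmsd(times_ns, rmsd_values, path="rmsd.png"):
    """Plot RMSD against time with an upright Å in the y-axis label.

    matplotlib is imported here so that axis_label() can be used
    without matplotlib installed.
    """
    import matplotlib
    matplotlib.use("Agg")  # render to file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(times_ns, rmsd_values)
    ax.set_xlabel("Time (ns)")
    ax.set_ylabel(axis_label("RMSD"))
    fig.savefig(path)
    plt.close(fig)


label = axis_label("RMSD")
```

      Calling `plot_rmsd([0, 5, 10], [1.2, 1.5, 1.4])` would write a figure whose y-axis label uses the same upright font as the rest of the text.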

      Your citation for LINCS duplicates the citation for PME. I think you want the Hess 1997 paper. 10.1002/(SICI)1096-987X(199709)18:12<1463::AID-JCC4>3.0.CO;2-H

      Changed, thank you.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer 1:

      (1) The notion of a “root” causal gene - which the authors define based on a graph theoretic notion of topologically sorting graphs - requires a graph that is directed and acyclic. It is the latter that constitutes an important weakness here - it simply is a large simplification of human biology to draw out a DAG including hundreds of genes and a phenotype Y and to claim that the true graph contains no cycles.

      We agree that real causal graphs in biology often contain cycles. We now include additional experimental results with cyclic directed graphs in the Supplementary Materials. RCSP outperformed the other algorithms even in this setting, but we caution the reader that the theoretical interpretation of the RCS score may not coincide with a root causal effect when cycles exist:

      “We also evaluated the algorithms on directed graphs with cycles. We generated a linear SEM over p + 1 = 1000 variables in . We sampled the coefficient matrix β from a Bernoulli (1/(p − 1)) distribution but did not restrict the non-zero coefficients to the upper triangular portion of the matrix. We then proceeded to permute the variable ordering and weight each entry as in the Methods for the DAG. We repeated this procedure 30 times and report the results in Supplementary Figure 3.

      RCSP again outperformed all other algorithms even in the cyclic case. The results suggest that conditioning on the surrogate ancestors also estimates the RCS well even in the cyclic case. However, we caution that an error term E<sub>i</sub> can affect the ancestors of when cycles exist. As a result, the RCS may not isolate the causal effect of the error term and thus not truly coincide with the notion of a root causal effect in cyclic causal graphs.”
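
      The difference between the DAG and cyclic settings can be illustrated with a short sketch. It is not the authors' simulation code: the function names are hypothetical, the edge-weighting and permutation steps from the Methods are omitted, and two conventions are assumptions (no self-loops, and `beta[i][j] != 0` meaning an edge from variable i to variable j).

```python
import random


def sample_coefficient_pattern(n_vars, prob, rng, upper_triangular_only):
    """Sample a 0/1 coefficient pattern with Bernoulli(prob) entries.

    upper_triangular_only=True restricts edges to i < j, which yields a DAG;
    False allows entries anywhere off the diagonal, so cycles can appear.
    """
    beta = [[0.0] * n_vars for _ in range(n_vars)]
    for i in range(n_vars):
        for j in range(n_vars):
            if i == j or (upper_triangular_only and i > j):
                continue
            if rng.random() < prob:
                beta[i][j] = 1.0
    return beta


def has_cycle(beta):
    """Detect a directed cycle with Kahn's topological-sort algorithm."""
    n = len(beta)
    indegree = [sum(1 for i in range(n) if beta[i][j] != 0) for j in range(n)]
    stack = [j for j in range(n) if indegree[j] == 0]
    visited = 0
    while stack:
        i = stack.pop()
        visited += 1
        for j in range(n):
            if beta[i][j] != 0:
                indegree[j] -= 1
                if indegree[j] == 0:
                    stack.append(j)
    # Any node never reaching in-degree zero lies on or behind a cycle.
    return visited < n


rng = random.Random(0)
n_vars = 50  # small illustrative size, not the 1000 variables used in the text
dag_pattern = sample_coefficient_pattern(n_vars, 1.0 / (n_vars - 2), rng, True)
free_pattern = sample_coefficient_pattern(n_vars, 1.0 / (n_vars - 2), rng, False)
```

      The upper-triangular pattern is acyclic by construction, whereas the unrestricted pattern may or may not contain a cycle in any single draw.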

      (2) I also encourage the authors to consider more carefully when graph structure learned from Perturb-seq can be ported over to bulk RNA-seq. Presumably this structure is not exactly correct - to what extent is the RCSP algorithm sensitive to false edges in this graph? This leap - from cell line to primary human cells - is also not modeled in the simulation. Although challenging - it would be ideal for the RCSP to model or reflect the challenges in correctly identifying the regulatory structure.

      We now include additional experimental results, where we gradually increased the incongruence between the DAG modeling the Perturb-seq and the DAG modeling the bulk RNA-seq using a mixture of graphs. The performance of RCSP degraded gradually, rather than abruptly, with increasing incongruence. We therefore conclude that RCSP is robust to differences between the causal graphs representing Perturb-seq and bulk RNA-seq:

      “We next assessed the performance of RCSP when the DAG underlying the Perturb-seq data differs from the DAG underlying the bulk RNA-seq data. We considered a mixture of two random DAGs in bulk RNA-seq, where one of the DAGs coincided with the Perturb-seq DAG and a second, alternate DAG did not. We instantiated and simulated samples from each DAG as per the previous subsection. We generated 0%, 25%, 50%, 75%, and 100% of the bulk RNA-seq samples from the alternate DAG, and the rest from the Perturb-seq DAG. We ideally would like to see the performance of RCSP degrade gracefully, as opposed to abruptly, as the percent of samples derived from the alternate DAG increases.

      We summarize results in Supplementary Figure 4. As expected, RCSP performed the best when we drew all samples from the same underlying DAG for Perturb-seq and bulk RNA-seq. However, the performance of RCSP also degraded slowly as the percent of samples from the alternate DAG increased. We conclude that RCSP can accommodate some differences between the underlying DAGs in Perturb-seq and bulk RNA-seq with only a mild degradation in performance.”
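
      The mixture design described above can be sketched as a simple allocation of bulk RNA-seq samples between the two DAGs. This is an illustrative assumption about how such a mixture could be set up, not the authors' code: the function name and the labels are hypothetical, and the sample size of 200 follows the figure quoted elsewhere in this response.

```python
import random


def mixture_assignments(n_samples, alternate_fraction, rng):
    """Label each bulk RNA-seq sample by the DAG it would be drawn from."""
    n_alternate = round(n_samples * alternate_fraction)
    labels = ["alternate"] * n_alternate
    labels += ["perturb_seq"] * (n_samples - n_alternate)
    rng.shuffle(labels)  # randomise sample order within the dataset
    return labels


rng = random.Random(0)
fractions = [0.0, 0.25, 0.5, 0.75, 1.0]
alternate_counts = [
    mixture_assignments(200, fraction, rng).count("alternate")
    for fraction in fractions
]
```

      Each sample would then be simulated from the SEM attached to its label, and the algorithm's performance recorded at every mixture fraction.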

      (3) It should also be noted that in most Perturb-seq experiments, the entire genome is not perturbed, and frequently important TFs (that presumably are very far “upstream” and thus candidate “root” causal genes) are not expressed highly enough to be detected with scRNA-seq. In that context - perhaps slightly modifying the language regarding RCSP’s capabilities might be helpful for the manuscript - perhaps it would be better to describe it as an algorithm for causal discovery among a set of genes that were perturbed and measured, rather than a truly complete search for causal factors. Perhaps more broadly it would also benefit the manuscript to devote slightly more text to describing the kinds of scenarios where RCSP (and similar ideas) would be most appropriately applied - perhaps a well-powered, phenotype annotated Perturb-seq dataset performed in a disease relevant primary cell.

      We now clarify that Perturb-seq can only identify root causal genes among the perturbed set of genes in the Discussion:

      “Modern genome-wide Perturb-seq datasets also adequately perturb and measure only a few thousand, rather than all, gene expression levels. RCSP can only identify root causal genes within this perturbed and measured subset.”

      We now also describe the scenario where RCSP can identify root causal genes well in the Introduction:

      “Experiments demonstrate marked improvements in performance, when investigators have access to a large bulk RNA-seq dataset and a genome-wide Perturb-seq dataset from a cell line of a disease-relevant tissue.”

      Reviewer 2:

      (1) The process from health-to-disease is not linear most of the time with many checks along the way that aim to prevent the disease phenotype. This leads to a non-deterministic nature of the path from health-to-disease. In other words, with the same root gene perturbations, and depending on other factors outside of gene expression, someone may develop a phenotype in a year, another in 10 years and someone else never. Claiming that this information is included in the error terms might not be sufficient to address this issue. The authors should discuss this limitation.

      The proposed approach accommodates the above non-deterministic nature. The error terms model factors that are outside of gene expression. We model the relation from gene expression to Y as probabilistic rather than deterministic because , where E<sub>Y</sub> introduces stochasticity. Thus, two individuals with the same instantiations of the root causes may develop disease differently. We now clarify this in Methods:

      “The error terms model root causes that are outside of gene expression, such as genetic variation or environmental factors. Moreover, the relation from gene expression to Y is stochastic because , where E<sub>Y</sub> introduces the stochasticity. Two individuals may therefore have the exact same error term values over but different instantiations of Y.”

      (2) The paper assumes that the network connectivity will remain the same after perturbation. This is not always true due to backup mechanisms in the cells. For example, suppose that a cell wants to create product P and it can do it through two alternative paths: Path #1: A → B → P, Path #2: A → C → P. Now suppose that path #1 is more efficient, so when B can be produced, path #2 is inactive. Once the perturbation blocks element B from being produced, the graph connectivity changes by activation of path #2. I did not see the authors taking this into consideration, which seems to be a major limitation in using Perturb-seq results to infer connectivities.

      We agree that backup mechanisms can exist and therefore now include additional experimental results, where we gradually increased the incongruence between the DAG modeling the Perturb-seq and the DAG modeling the bulk RNA-seq using a mixture of graphs. The performance of RCSP degraded gradually, rather than abruptly, with increasing incongruence. We therefore conclude that RCSP is robust to differences between the causal graphs representing Perturb-seq and bulk RNA-seq:

      “We next assessed the performance of RCSP when the DAG underlying the Perturb-seq data differs from the DAG underlying the bulk RNA-seq data. We considered a mixture of two random DAGs in bulk RNA-seq, where one of the DAGs coincided with the Perturb-seq DAG and a second, alternate DAG did not. We generated 0%, 25%, 50%, 75%, and 100% of the bulk RNA-seq samples from the alternate DAG, and the rest from the Perturb-seq DAG. We ideally would like to see the performance of RCSP degrade gracefully, as opposed to abruptly, as the percent of samples derived from the alternate DAG increases.

      We summarize results in Supplementary Figure 4. As expected, RCSP performed the best when we drew all samples from the same underlying DAG for Perturb-seq and bulk RNA-seq. However, the performance of RCSP also degraded slowly as the percent of samples from the alternate DAG increased. We conclude that RCSP can accommodate some differences between the underlying DAGs in Perturb-seq and bulk RNA-seq with only a mild degradation in performance.”

      (3) There is substantial system heterogeneity that may cause the same phenotype. This goes beyond the authors' claim that although the initial gene causes of a disease may differ from person to person, at some point they will all converge to changes in the same set of “root genes.” This is not true for many diseases, which are defined based on symptoms and lab tests at the patient level. You may have two completely different molecular pathologies that lead to the development of the same symptoms and test results. Breast cancer with its subtypes is a prime example of that. In theory, this issue could be addressed if there is infinite sample size. However, this assumption is largely violated in all existing biological datasets.

      The proposed method accommodates the above heterogeneity. We do not assume that the root causes affect the same set of root causal genes. Instead the root causes and root causal genes may vary from person to person. We write in the Introduction:

      “The problem is further complicated by the existence of complex disease, where a patient may have multiple root causal genes that differ from other patients even within the same diagnostic category... We thus also seek to identify patient-specific root causal genes in order to classify patients into meaningful biological subgroups each hopefully dictated by only a small group of genes.”

      The root causal genes may further affect different downstream genes at the patient-specific level. However, root causal genes tend to have many downstream effects, so that virtually every gene expression level becomes correlated with Y. We now clarify this by describing the omnigenic root causal model in the Introduction as follows:

      “Finally, application of the algorithm to two complex diseases with disparate pathogeneses recovers an omnigenic root causal model, where a small set of root causal genes drive pathogenesis but impact many downstream genes within each patient. As a result, nearly all gene expression levels are correlated with the diagnosis at the population level.”

      (4) Were the values of the synthetic variables Z-scored?

      Yes, all variables were z-scored. We now clarify this in Methods:

      “We also standardized all variables before running the regressions to prevent gaming of the marginal variances in causal discovery (Reisach et al., 2021; Ng et al., 2024).”
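
      Standardization of a single variable can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code; the function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a variable: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sd = pstdev(values)  # population SD; an assumption of this sketch
    if sd == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sd for v in values]


z_scores = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
```

      After this step every variable has unit marginal variance, so an algorithm cannot exploit differences in marginal variance to infer the causal order.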

      (5) The algorithm seems to require both RNA-seq and Perturb-seq data (Algorithm 1, page 14). Can it function with RNA-seq data only? What will be different in this case?

      The algorithm cannot function with observational bulk RNA-seq data only. We included Perturb-seq because causal discovery with observational RNA-seq data alone tends to be inaccurate and unstable, as highlighted by the results of CausalCell. We further emphasize that we do not rely on d-separation faithfulness in Methods, which is typically required for causal discovery from observational data alone:

      “We can also claim the backward direction under d-separation faithfulness. We however avoid making this additional assumption because real biological data may not arise from distributions obeying d-separation faithfulness in practice.”

      (6) Synthetic data generation: how many different graphs (SEMs) did they start from? (30?) How many samples per graph? Did they test different sample sizes?

      We now clarify that we generate 30 random SEMs, each associated with a DAG. We used 200 samples for the bulk RNA-seq to mimic a relatively large but common sample size. We also drew 200 samples for each perturbation or control in the Perturb-seq data. We did not consider multiple sample sizes due to the time required to complete each run. Instead, we focused on a typical scenario where investigators would apply RCSP. We now write the following in the Methods:

      “We drew 200 samples for the bulk RNA-seq data to mimic a large but common dataset size. We introduced knockdown perturbations in Perturb-seq by subtracting an offset of two in the softplus function: . We finally drew 200 samples for the control and each perturbation condition to generate the Perturb-seq data. We repeated the above procedure 30 times.” We also include the following in Results:

      “We obtained 200 cell samples from each perturbation, and another 200 controls without perturbations. We therefore generated a total of 2501 × 200 = 500,200 single cell samples for each Perturb-seq dataset. We simulated 200 bulk RNA-seq samples.”

      (7) The presentation of comparative results (Supplementary Figures 4 and 7) is not clear. No details are given on how these results were generated. (what does it mean “The first column denotes the standard deviation of the outputs for each algorithm?”) Why all other methods have higher SD differences than RCSP? Is it a matter of scaling? Shouldn’t they have at least some values near zero since the authors “added the minimum value so that all histograms begin at zero?”

      Each of these supplementary figures contains a 6 by 3 table of figures. By the first column, we mean column one (with rows 1 through 6) of each figure. The D-RCS and D-SD scores represent the standard deviations from zero of the RCS and SD scores of each gene, respectively. We can similarly compute the standard deviation of the outputs of the algorithms. We now clarify this in the Supplementary Materials:

      “The figure contains 6 rows and 3 columns. Similar to the D-RCS, we can compute the standard deviation of the output of each algorithm from zero for each gene. The first column in Supplementary Figure 7 denotes the histograms of these standard deviations across the genes.”

      Many histograms do not appear to start at zero because the bars are too small to be visible. We now clarify this in the Supplementary Materials as well:

      “Note that the bars at zero are not visible for many algorithms, since only a few genes attained standard deviations near the minimum.”
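
      The per-gene quantity behind these histograms can be sketched as follows. This is an illustrative reading of the description, not the authors' code: "standard deviation from zero" is taken to mean the root mean square of a gene's scores across samples, the histogram shift is taken to subtract the minimum so that values begin at zero, and all function names and example scores are hypothetical.

```python
import math


def sd_from_zero(scores):
    """Standard deviation measured around zero instead of around the mean."""
    return math.sqrt(sum(s * s for s in scores) / len(scores))


def per_gene_sd_from_zero(score_matrix):
    """Apply sd_from_zero to each gene (rows = samples, columns = genes)."""
    n_genes = len(score_matrix[0])
    return [
        sd_from_zero([row[g] for row in score_matrix])
        for g in range(n_genes)
    ]


def shift_to_zero(values):
    """Shift values so that the smallest one sits at zero."""
    lowest = min(values)
    return [v - lowest for v in values]


# Hypothetical algorithm outputs for four samples and two genes.
scores = [[0.0, 3.0], [0.0, -3.0], [0.0, 3.0], [0.0, -3.0]]
gene_sds = per_gene_sd_from_zero(scores)
shifted = shift_to_zero([2.0, 5.0, 3.0])
```

      A histogram of `gene_sds` across all genes corresponds to one panel in the first column of the figures described above.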

      (8) Why RCSP results are more like a negative binomial distribution and every other is kind of normal?

      All other methods have higher standard deviations than RCSP because they fail to compute an accurate measure of the root causal effect. Recall that, just like a machine has a few root causal problems, only a few root causal genes have large root causal effects under the omnigenic root causal model. The results of RCSP look more like a negative binomial distribution because most RCS scores are concentrated around zero and only a few RCS scores are large, consistent with the omnigenic root causal model. The other algorithms fail to properly control for the upstream genes and thus attain large standard deviations for nearly all genes. We now clarify these points in the Supplementary Materials as follows:

      “If an algorithm accurately identifies root causal genes, then it should only identify a few genes with large conditional root causal effects under the omnigenic root causal model. The RCSP algorithm had a histogram with large probability mass centered around zero with a long tail to the right. The standard deviations of the outputs of the other algorithms attained large values for nearly all genes. Incorporating feature selection and causal discovery with CausalCell introduced more outliers in the histogram of ANM. We conclude that only RCSP detected an omnigenic root causal model.”

      (9) What is the significance of genes changing expression “from left to right” in a UMAP plot? (e.g., Fig. 3h and 3g)

      The first UMAP dimension captured the variability of the RCS scores for most root causal genes. As a result, we could focus our analysis on the black cluster in Figure 3 (g) with large RCS scores in the subsequent pathway enrichment analysis summarized in Figure 3 (j). If two dimensions were involved, then we would need to analyze at least two clusters (e.g., black and pink), but this was not the case. We now clarify this in Results:

      “The RCS scores of most of the top genes exhibited a clear gradation increasing only from the left to the right hand side of the UMAP embedding; we plot an example in Figure 3 (h). We found three exceptions to this rule among the top 30 genes (example in Figure 3 (i) and see Supplementary Materials). RCSP thus detected genes with large RCS scores primarily in the black cluster of Figure 3 (g). Pathway enrichment analysis within this cluster alone yielded supra-significant results on the same pathway detected in the global analysis...”

      (10) The authors somewhat overstate the novelty of their algorithm. Representation of GRNs as causal graphs dates back to 2000 with the work of Nir Friedman in yeast. Other methods were developed more recently that look at regulatory network changes at the single sample level, of which the authors do not seem to be aware (e.g., Ellington et al, NeurIPS 2023 workshop GenBio and Bushur et al, 2019, Bioinformatics are two such examples). The methods they mention are for single cell data and they are not designed to connect single sample-level changes to a person’s phenotype. The RCS method needs to be put in the right background context in order to bring out what is really novel about it.

      We agree that many methods already exist for uncovering associational, predictive (Markov, neighborhood) and causal gene regulatory networks. We now cite the above papers. However, the novelty in our manuscript is not causal graph discovery, but rather estimation of root causal effects, detection of root causal genes, and the proposal of the omnigenic root causal model. We now clarify this in the Introduction:

      “Many algorithms focus on discovering associational or predictive relations, sometimes visually represented as gene regulatory networks (Costa et al., 2017; Ellington et al., 2023). Other methods even identify causal relations (Friedman et al., 2000; Wang et al., 2023; Wen et al., 2000; Buschur et al., 2000), but none pinpoint the first gene expression levels that ultimately generate the vast majority of pathogenesis. Simply learning a causal graph does not resolve the issue because causal graphs do not summarize the effects of unobserved root causes, such as unmeasured environmental changes or variants, that are needed to identify all root causal genes. We therefore define the Root Causal Strength (RCS) score...”

      Reviewer 3:

      (1) Several assumptions of the method are problematic. The most concerning is that the observational expression changes are all causally upstream of disease. There is work using Mendelian randomization (MR) showing that the opposite is more likely to be true: most differential expression in disease cohorts is a consequence rather than a cause of disease (Porcu et al., 2021). Indeed, the oxidative stress of AMD has known cellular responses including the upregulation of p53. The authors need to think carefully about how this impacts their framework. Can the theory say anything in this light? Simulations could also be designed to address robustness.

      Strictly speaking, we believe that differential expression in disease most likely has a cyclic causal structure: gene expression causes a diagnosis or symptom severity, and a diagnosis or symptom severity leads to treatments and other behavioral changes that perturb gene expression. For example, revTWMR in Porcu et al. (2021) uses trans-variants that are less likely to directly cause gene expression and instead directly cause a phenotype. However, TWMR as proposed in Porcu et al. (2019) instead uses cis-eQTLs and finds many putative causal relations from gene expression to phenotype. Thus, both causal directions likely hold.

      RCSP uses disease-relevant tissue believed to harbor gene expression levels that cause disease. However, RCSP theoretically cannot handle the scenario where Y is a non-sink vertex and is a parent of a gene expression level because modern Perturb-seq datasets usually do not perturb or measure Y. We therefore empirically investigated the degree of error by running experiments, where we set Y to a non-sink vertex, so that it can cause gene expression. We find that the performance of RCSP degrades considerably for gene expression levels that contain Y as a parent. Thus RCSP is sensitive to violations of the sink target assumption:

      “We finally considered the scenario where Y is a non-sink (or non-terminal) vertex. If Y is a parent of a gene expression level, then we cannot properly condition on the parents because modern Perturb-seq datasets usually do not intervene on Y or measure Y. We therefore empirically investigated the degradation in performance resulting from a non-sink target Y, in particular for gene expression levels where Y is a parent. We again simulated 200 samples from bulk RNA-seq and each condition of Perturb-seq with a DAG over 1000 vertices, an expected neighborhood size of 2 and a non-sink target Y. We then removed the outgoing edges from Y and resampled the DAG with a sink target. We compare the results of RCSP for both DAGs in gene expression levels where Y is a parent. We plot the results in Supplementary Figure 5. As expected, we observe a degradation in performance when Y is not terminal, where the mean RMSE increased from 0.045 to 0.342. We conclude that RCSP is sensitive to violations of the sink target assumption.”

      (2) A closely related issue is the DAG assumption of no cycles. This assumption is brought to bear because it is required for much classical causal machinery, but is unrealistic in biology where feedback is pervasive. How robust is RCSP to (mild) violations of this assumption? Simulations would be a straightforward way to address this.

      We agree that real causal graphs in biology often contain cycles. We now include additional experimental results with cyclic directed graphs in the Supplementary Materials. RCSP outperformed the other algorithms even in this setting, but we caution the reader that the theoretical interpretation of the RCS score may not coincide with a root causal effect when cycles exist:

      “We also evaluated the algorithms on directed graphs with cycles. We generated a linear SEM over p + 1 = 1000 variables in . We sampled the coefficient matrix β from a Bernoulli (1/(p − 1)) distribution but did not restrict the non-zero coefficients to the upper triangular portion of the matrix. We then proceeded to permute the variable ordering and weight each entry as in the Methods for the DAG. We repeated this procedure 30 times and report the results in Supplementary Figure 3.

      RCSP again outperformed all other algorithms even in the cyclic case. The results suggest that conditioning on the surrogate ancestors also estimates the RCS well even in the cyclic case. However, we caution that an error term E<sub>i</sub> can affect the ancestors of , when cycles exist. As a result, the RCS may not isolate the causal effect of the error term and thus not truly coincide with the notion of a root causal effect in cyclic causal graphs.”
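      The cyclic simulation described above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name `sample_cyclic_sem`, the weight range, the random signs, the removal of self-loops, and the rescaling that keeps the system stable are all assumptions made here so that the example runs. It uses 50 variables instead of 1000 for speed and omits the permutation step, since no triangular structure is imposed in the first place.

```python
import numpy as np

def sample_cyclic_sem(n_vars=50, n_samples=200, seed=0):
    """Sample from a linear SEM whose directed graph may contain cycles."""
    rng = np.random.default_rng(seed)
    p = n_vars - 1                      # the text uses p + 1 variables in total
    edge_prob = 1.0 / (p - 1)           # Bernoulli(1 / (p - 1)) edge indicators
    # No upper-triangular restriction, so cycles are allowed.
    adj = rng.random((n_vars, n_vars)) < edge_prob
    np.fill_diagonal(adj, False)        # assumption: no self-loops
    # Assumption: edge weights uniform in [0.25, 1] with a random sign.
    magnitudes = rng.uniform(0.25, 1.0, size=adj.shape)
    signs = rng.choice([-1.0, 1.0], size=adj.shape)
    beta = adj * magnitudes * signs
    # Assumption: shrink beta so its spectral radius stays below 1,
    # which guarantees that (I - beta) is invertible.
    radius = np.max(np.abs(np.linalg.eigvals(beta)))
    if radius > 0.9:
        beta = beta * (0.9 / radius)
    errors = rng.standard_normal((n_samples, n_vars))
    # Row-vector convention: X = X @ beta + E, hence X @ (I - beta) = E.
    data = np.linalg.solve((np.eye(n_vars) - beta).T, errors.T).T
    return beta, errors, data

beta, errors, data = sample_cyclic_sem()
print(data.shape)
```

      Solving the linear system directly gives the equilibrium of the cyclic model, so every sampled row satisfies the structural equations exactly even though no topological ordering exists.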

      (3) The authors spend considerable effort arguing that technical sampling noise in X can effectively be ignored (at least in bulk). While the mathematical arguments here are reasonable, they miss the bigger picture point that the measured gene expression X can only ever be a noisy/biased proxy for the expression changes that caused disease: 1) Those events happened before the disease manifested, possibly early in development for some conditions like neurodevelopmental disorders. 2) bulk RNA-seq gives only an average across cell-types, whereas specific cell-types are likely “causal.” 3) only a small sample, at a single time point, is typically available. Expression in other parts of the tissue and at different times will be variable.

      We agree that many other sources of error exist. The causal model of RNA-expression in Methods corresponds to a single snapshot in time for each sample. We now clarify this in the Methods as follows:

      “We represent a snapshot of a biological causal process using an SEM over obeying Equation (3).”

      We thus only detect the root causal genes in a single snapshot in time for each sample in bulk RNA-seq. If we cannot detect the root causal effect in a gene due to the signal washing out over time as in (1), or if the root causal effect in different cell types cancel each other out to exactly zero in bulk as in (2), then we cannot detect those root causal genes even with an infinite sample size.

      (4) While there are connections to the omnigenic model, the latter is somewhat misrepresented. The authors refer to the “core genes” of the omnigenic model as being at the end (longitudinal) of pathogenesis. The omnigenic model makes no statements about temporal ordering: in causal inference terminology the core genes are simply the direct causes of disease.

      We now clarify that we use the word pathogenesis to mean the causal cascade from root causes to the diagnosis. In this case, the direct causes of the diagnosis correspond to the end of pathogenesis, while the root causes correspond to the beginning. For example, if X<sub>1</sub> → X<sub>2</sub> → Y, with Y a diagnosis, then X<sub>1</sub> is a root causal gene while X<sub>2</sub> is a core (direct causal) gene. We now clarify this in the Introduction:

      “Root causes of disease correspond to the most upstream causes of a diagnosis with strong causal effects on the diagnosis. Pathogenesis refers to the causal cascade from root causes to the diagnosis. Genetic and non-genetic factors may act as root causes and affect gene expression as an intermediate step during pathogenesis. We introduce root causal gene expression levels – or root causal genes for short – that correspond to the initial changes to gene expression induced by genetic and non-genetic root causes that have large causal effects on a downstream diagnosis (Figure 1 (a)). Root causal genes differ from core genes that directly cause the diagnosis and thus lie at the end, rather than at the beginning, of pathogenesis (Boyle et al., 2017).”

      (5) A key observation underlying the omnigenic model is that genetic heritability is spread throughout the genome (and somewhat concentrated near genes expressed in disease relevant cell types). This implies that (almost) all expressed genes, or their associated (e)SNPs, are “root causes”.

      We now clarify that genetic heritability can be spread throughout the genome in the omnigenic root causal model as well in the Discussion:

      “Further, each causal genetic variant tends to have only a small effect on disease risk in complex disease because the variant can directly cause Y or directly cause any causal gene including those with small root causal effects on Y; thus, all error terms that cause Y can model genetic effects on Y. However, the root causal model further elaborates that genetic and non-genetic factors often combine to produce a few root causal genes with large root causal effects, where non-genetic factors typically account for the majority of the large effects in complex disease. Many variants may therefore cause many genes in diseases with only a few root causal genes.”

      We finally add Figure 5 into the Discussion as a concrete example illustrating the omnigenic root causal model:

      (6) The claim that root causal genes would be good therapeutic targets feels unfounded. If these are highly variable across individuals then the choice of treatment becomes challenging. By contrast the causal effects may converge on core genes before impacting disease, so that intervening on the core genes might be preferable. The jury is still out on these questions, so the claim should at least be made hypothetical.

      We clarify that we do not claim that root causal genes are better treatment targets than core genes in terms of magnitudes of causal effects on the phenotype. For example, in the common cold with a virus as the root cause, giving a patient an antiviral will eliminate fever and congestion, but so will giving a decongestant and an antipyretic. We only claim that treating root causal genes can eliminate disease near its pathogenic onset, just like giving an antiviral can eliminate the viral load and stop pathogenesis. We write the following in the Introduction:

      “Treating root causal genes can modify disease pathogenesis in its entirety, whereas targeting other causes may only provide symptomatic relief... Identifying root causal genes is therefore critical for developing treatments that eliminate disease near its pathogenic onset.”

      We also further clarify in the Discussion that root causal genes account for deleterious causal effects not captured by the diagnosis Y:

      “We finally emphasize that the root causal model accounts for all deleterious effects of the root causal genes, whereas the core gene model only captures the deleterious effects captured by the diagnosis Y. For example, the disease of diabetes causes retinopathy, but retinopathy is not a part of the diagnostic criteria of diabetes. As a result, the gene expression levels that cause retinopathy but not the diagnosis of diabetes are not core genes, even though they are affected by the root causal genes.”

      We do agree that root causal genes may differ substantially between patients, although it is unclear if the heterogeneity is too great to develop treatments.

      (7) The closest thing to a gold standard I believe we have for “root causal genes” is integration of molecular QTLs and GWAS, specifically coloc/MR. Here the “E” of RCSP are explicitly represented as SNPs. I don’t know if there is good data for AMD but there certainly is for MS. The authors should assess the overlap with their results. Another orthogonal avenue would be to check whether the root causal genes change early in disease progression.

      Colocalization and Mendelian randomization unfortunately cannot identify root causal effects because they all attempt, either heuristically (colocalization) or rigorously (MR), to identify variants that cause each gene expression level rather than variants that directly cause each gene expression level and thus make up the error terms. We therefore need new methods that can identify direct causal variants in order to assess overlap.

      We checked whether root causal genes change early in disease progression using knowledge of pathogenesis. In particular, oxidative stress induces pathogenesis in AMD, and RCSP identified root causal genes involved in oxidative stress in AMD:

      “The pathogenesis of AMD involves the loss of RPE cells. The RPE absorbs light in the back of the retina, but the combination of light and oxygen induces oxidative stress, and then a cascade of events such as immune cell activation, cellular senescence, drusen accumulation, neovascularization and ultimately fibrosis (Barouch et al., 2007). We therefore expect the root causal genes of AMD to include genes involved in oxidative stress during early pathogenesis. The gene MIPEP with the highest D-RCS score in Figure 3 (d) indeed promotes the maturation of oxidative phosphorylation-related proteins (Shi et al., 2011). The second gene SLC7A5 is a solute carrier that activates mTORC1 whose hyperactivation increases oxidative stress via lipid peroxidation (Nachef et al., 2021; Go et al., 2020). The gene HEATR1 is involved in ribosome biogenesis that is downregulated by oxidative stress (Turi et al., 2018). The top genes discovered by RCSP thus identify pathways known to be involved in oxidative stress.”

      Similarly, T cell infiltration across the blood brain barrier initiates pathogenesis in MS, and RCSP identified root causal genes involved in this infiltration:

      “Genes with the highest D-RCS scores included MNT, CERCAM and HERPUD2 (Figure 4 (d)). MNT is a MYC antagonist that modulates the proliferative and pro-survival signals of T cells after engagement of the T cell receptor (Gnanaprakasam et al., 2017). Similarly, CERCAM is an adhesion molecule expressed at high levels in microvessels of the brain that increases leukocyte transmigration across the blood brain barrier (Starzyk et al., 2000). HERPUD2 is involved in the endoplasmic-reticulum associated degradation of unfolded proteins (Kokame et al., 2000). Genes with the highest D-RCS scores thus serve key roles in known pathogenic pathways of MS.”

      (8) The available Perturb-seq datasets have limitations beyond the control of the authors. 1) The set of genes that are perturbed. The authors address this by simply sub-setting their analysis to the intersection of genes represented in the perturbation and observational data. However, this may mean that a true ancestor of X is not modeled/perturbed, limiting the formal claims that can be made. Additionally, some proportion of genes that are nominally perturbed show little to no actual perturbation effect (for example, due to poor guide RNA choice) which will also lead to missing ancestors.

      We now clarify that Perturb-seq can only identify root causal genes among the adequately perturbed set of genes in the Discussion:

      “Modern genome-wide Perturb-seq datasets also only adequately perturb and measure a few thousand, rather than all, gene expression levels. RCSP can only identify root causal genes within this perturbed and measured subset.”

      (9) The authors provide no mechanism for statistical inference/significance for their results at either the individual or aggregated level. While I am a proponent of using effect sizes more than p-values, there is still value in understanding how much signal is present relative to a reasonable null.

      We now explain that RCSP does not perform statistical inference in Methods because it is not clear how to define the appropriate cut-off for the RCS score under the null distribution:

      “We focus on statistical estimation rather than statistical inference because Φ<sub>i</sub> > 0 when E<sub>i</sub> causes Y under mild conditions, so we reject the null hypothesis that Φ<sub>i</sub> = 0 for many genes if many gene expression levels cause Y. However, just like a machine typically breaks down due to only one or a few root causal problems, we hypothesize that only a few genes have large RCS scores Φ<sub>i</sub> ≫ 0 even in complex disease.”

      (10) I agree with the authors that age coming out of a “root cause” is potentially encouraging. However, it is also quite different in nature to expression, including being “measured” exactly. Will RCSP be biased towards variables that have lower measurement error?

      We tested the above hypothesis by plotting sequencing depth against the D-RCS scores of each gene. We observed a small negative correlation between sequencing depth and D-RCS scores, indicating the D-RCS scores are slightly biased upwards with low sequencing depth. However, genes with the largest D-RCS scores exhibited a wide variety of sequencing depths in both MS and AMD, suggesting that sequencing depth has minimal effect on the largest D-RCS scores. We now explain these results for AMD in the Supplementary Materials:

      “Theorem 1 states that RCS scores may exhibit bias with insufficient sequencing depth. The genes with large D-RCS scores may therefore simply have low sequencing depths. To test this hypothesis, we plotted sequencing depth against D-RCS scores. Consistent with Theorem 1, we observed a small negative correlation between D-RCS and sequencing depth (ρ = −0.16, p = 2.04E-13), and D-RCS scores exhibited greater variability at the lowest sequencing depths (Supplementary Figure 8). However, genes with the largest D-RCS scores had mean sequencing depths interspersed between 20 and 3000. We conclude that genes with the largest D-RCS scores had a variety of sequencing depths ranging from low to high.”

      We also report the results for MS:

      “We plot sequencing depth against the D-RCS scores of each gene similar to the AMD dataset. We again observed a small negative correlation (ρ = −0.136, p < 2.2E-16), indicating that genes with low sequencing depths had slightly higher D-RCS scores on average (Supplementary Figure 12). However, genes with the largest D-RCS scores again had a variety of sequencing depths. We conclude that sequencing depth has minimal correlation with the largest D-RCS scores.”
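      The check described in this response, correlating sequencing depth with per-gene scores and then inspecting the depths of the top-scoring genes, might look roughly like the sketch below. The response does not state which correlation coefficient was used, so the rank-based coefficient here is an assumption, as are the function name `rank_correlation` and the synthetic `depth` and `score` arrays, which stand in for real data and do not reproduce any reported value.

```python
import numpy as np

def rank_correlation(x, y):
    """Pearson correlation of the ranks (Spearman-style; ties are not averaged)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Synthetic stand-ins (hypothetical): one mean depth and one score per gene.
rng = np.random.default_rng(0)
depth = rng.lognormal(mean=5.0, sigma=1.5, size=2000)
score = rng.exponential(1.0, size=2000) / np.log(depth + np.e)

rho = rank_correlation(depth, score)

# Depth range among the 30 genes with the largest scores.
top = np.argsort(score)[-30:]
print(rho, depth[top].min(), depth[top].max())
```

      A wide depth range among the top-scoring genes, alongside a weak overall correlation, is the pattern the authors describe as evidence that low depth alone does not explain the largest scores.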

      (11) Finally, it’s a stretch to call K562 cells “lymphoblasts.” They are more myeloid than lymphoid.

      We now clarify that K562 cells are undifferentiated blast cells that can be induced to differentiate into lymphoblasts in Results:

      “We next ran RCSP on 137 samples collected from CD4+ T cells of multiple sclerosis (MS; GSE137143) as well as Perturb-seq data of 1,989,578 undifferentiated blast cells that can be induced to differentiate into lymphoblasts, or the precursors of T cells and other lymphocytes.”

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This valuable study advances our understanding of the brain nuclei involved in rapid-eye movement (REM) sleep regulation. Using a combination of imaging, electrophysiology, and optogenetic tools, the study provides convincing evidence that inhibitory neurons in the preoptic area of the hypothalamus influence REM sleep. This work will be of interest to neurobiologists working on sleep and/or brain circuitry.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This paper identifies GABA cells in the preoptic hypothalamus which are involved in REM sleep rebound (the increase in REM sleep) after selective REM sleep deprivation. By calcium photometry, these cells are most active during REM, and show more calcium signals during REM deprivation, suggesting they respond to "REM pressure". Inhibiting these cells optogenetically diminishes REM sleep. The optogenetic and photometry work is carried out to a high standard, the paper is well-written, and the findings are interesting.

      We thank the reviewer for the detailed feedback and thoughtful comments on how to improve our manuscript. To address the reviewer’s concerns, we revised our discussion and added new data. Below, we address the concerns point by point.

      Points that could be addressed or discussed:

      (1) The circuit mechanism for REM rebound is not defined. How do the authors see REM rebound as working from the POAGAD2 cells? Although the POAGAD2 does project to the TMN, the actual REM rebound could be mediated by a projection of these cells elsewhere. This could be discussed.

      We demonstrate that POA GAD2→TMN cells become more frequently activated as the pressure for REMs builds up, whereas inhibiting these neurons during high REMs pressure leads to a suppression of the REMs rebound. It is not known how POA GAD2→TMN cells encode increased REMs pressure and subsequently influence the REMs rebound. REMs deprivation was shown to change the intrinsic excitability of hippocampal neurons and impact synaptic plasticity (McDermott et al., 2003; Mallick and Singh, 2011; Zhou et al., 2020). We speculate that increased REMs pressure leads to an increase in the excitability of POA→TMN neurons, reflected in the increased number of calcium peaks. The increased excitability of POA GAD2→TMN neurons in turn likely leads to stronger inhibition of downstream REM-off neurons. Consequently, as soon as REMs deprivation stops, there is an increased chance for entering REMs. The time course for how long it takes till the POA excitability resettles to its baseline consequently sets a permissive time window for increased amounts of REMs to recover its lost amount. For future studies, it would be interesting to map how quickly the excitability of POA neurons increases or decays as a function of the lost or recovered amount of REMs and unravel the cellular mechanisms underlying the elevated activity of POA GAD2→TMN neurons during high REMs pressure, e.g., whether changes in the expression of ion channels contribute to increased excitability of these neurons (Donlea et al., 2014). As we mentioned in the Discussion, the POA also projects to other REMs regulatory brain regions such as the vlPAG and LH. Therefore, it remains to be tested whether POA GAD2→TMN neurons also innervate these brain regions to potentially regulate REMs homeostasis. We explicitly state this now in the revised Discussion.

      (2) The "POAGAD2 to TMN" name for these cells is somewhat confusing. The authors chose this name because they approach the POAGAD2 cells via retrograde AAV labelling (rAAV injected into the TMN). However, the name also seems to imply that neurons (perhaps histamine neurons) in the TMN are involved in the REM rebound, but there is no evidence in the paper that this is the case. Although it is nice to see from the photometry studies that the histamine cells are selectively more active (as expected) in NREM sleep (Fig. S2), I could not logically see how this was a relevant finding to REM rebound or the subject of the paper. There are many other types of cells in the TMN area, not just histamine cells, so are the authors suggesting that these non-histamine cells in the TMN could be involved?

      We acknowledge that other types of neurons in the TMN may also be involved in the REMs rebound, and therefore inhibition of histamine neurons by POA GAD2 →TMN neurons may not be the sole source of the observed effect. To stress that other neurons within the TMN and/or brain regions may also contribute to the REMs rebound, we have revised the Results section.

      We performed complementary optogenetic inhibition experiments of TMN HIS neurons to investigate if suppression of these neurons is sufficient to promote REMs. We found that SwiChR++ mediated inhibition of TMN HIS neurons increased the amount of REMs compared with recordings without laser stimulation in the same mice and eYFP mice with laser stimulation. Thus, while TMN HIS neurons may not be the only downstream target of GABAergic POA neurons, these data suggest that they contribute to REMs regulation. We have incorporated these results in Fig. S4.

      We further investigated whether the activity of TMN HIS neurons changes between two REMs episodes. Assuming that REMs pressure inhibits the activity of REM-off histamine neurons, their firing rates should be highest right after REMs ends when REMs pressure is lowest, and progressively decay throughout the inter-REM interval, and reach their lowest activity right before the onset of REMs (Park et al., 2021), similar to the activity profile observed for vlPAG REM-off neurons (Weber et al., 2018). We indeed found that TMN HIS neurons display a gradual decrease in their activity throughout the inter-REM interval and thus potentially reflect the build up of REM pressure (Fig. S2F).

      (3) It is a puzzle why most of the neurons in the POA seem to have their highest activity in REM, as also found by Miracca et al 2022, yet presumably some of these cells are going to be involved in NREM sleep as well. Could the same POAGAD2-TMN cells identified by the authors also be involved in inducing NREM sleep-inhibiting histamine neurons (Chung et al). And some of these POA cells will also be involved in NREM sleep homeostasis (e.g. Ma et al Curr Biol)? Is NREM sleep rebound necessary before getting REM sleep rebound? Indeed, can these two things (NREM and REM sleep rebound) be separated?

      Previous studies have demonstrated that POA GABAergic neurons, including those projecting to the TMN, are involved in NREMs homeostasis (Sherin et al., 1998; Gong et al., 2004; Ma et al., 2019). Therefore, we predict that POA neurons that are involved in NREMs homeostasis are a subset of POA GAD2→TMN neurons in our manuscript.

      Using optrode recordings in the POA, we recently reported that 12.4% of neurons sampled have higher activity during NREMs compared with REMs; in contrast, 43.8% of neurons sampled have the highest activity during REMs compared with NREMs (Antila et al., 2022) indicating that the proportion of NREM max neurons is smaller compared with REM max neurons. These proportions of neurons are in agreement with previous results (Takahashi et al., 2009). Considering fiber photometry monitors the average activity of a population of neurons as opposed to individual neurons, it is possible that we recorded neural activity across heterogeneous populations and therefore our findings may disguise the neural activity of the low proportion of NREMs neurons. We previously reported the spiking activity of POA GAD2→TMN neurons at the single cell level (Chung et al., 2017). We have noted in the manuscript that while the activity of POA GAD2→TMN neurons is highest during REMs, the neural activity increases at NREMs → REMs transitions indicating these neurons also are active during NREMs.

      Using our REMs restriction protocol, we selectively restricted REMs leading to the subsequent rebound of REMs without affecting NREMs and consequently we did not find an increase in the amount of NREMs during the rebound or an increase in slow-wave activity, a key characteristic of sleep rebound that gradually dissipates during recovery sleep (Blake and Gerard, 1937; Williams et al., 1964; Rosa and Bonnet, 1985; Dijk et al., 1990; Neckelmann and Ursin, 1993; Ferrara et al., 1999). However, during total sleep deprivation when subjects are deprived of both NREMs and REMs, isolating NREMs and REMs rebound may not be attainable.

      (4) Is it possible to narrow down the POA area where the GAD2 cells are located more precisely?

      POA can be subdivided into anatomically distinct regions such as medial preoptic area, median preoptic area, ventrolateral preoptic area, and lateral preoptic area (MPO, MPN, VLPO, and LPO respectively). To quantify where the virus expressing GAD2 cells and optic fibers are located within the POA, we overlaid the POA coronal reference images (with red boundaries denoting these anatomically distinct regions) over the virus heat maps and optic fiber tracts from datasets used in Figure 1A. We found that virus expression and optic fiber tracts were located in the ventrolateral POA, lateral POA, and the lateral part of medial POA, and included this description in the text.

      Author response image 1.

      Location of virus expression (A) and optic fiber placement (B) within subregions of POA.

      (5) It would be ideal to further characterize these particular GAD2 cells by RT-PCR or RNA seq. Which other markers do they express?

      Single-cell RNA-sequencing of POA neurons has revealed an enormous level of molecular diversity, consisting of nearly 70 subpopulations based on gene expression of which 43 can be clustered into inhibitory neurons (Moffitt et al., 2018). One of the most studied subpopulations of POA sleep-active neurons contains the inhibitory neuropeptide galanin (Sherin et al., 1998; Gaus et al., 2002; Chung et al., 2017; Kroeger et al., 2018; Ma et al., 2019; Miracca et al., 2022). Galanin neurons have been demonstrated to innervate the TMN (Sherin et al., 1998), yet within the galanin neurons 7 distinct clusters exist based on unique gene expression (Moffitt et al., 2018). In addition to galanin, we have previously performed single-cell RNA-seq on POA GAD2→TMN neurons and identified additional neuropeptides such as cholecystokinin (CCK), corticotropin-releasing hormone (CRH), prodynorphin (PDYN), and tachykinin 1 (TAC1) as markers of subpopulations of GABAergic POA sleep-active neurons (Chung et al., 2017; Smith et al., 2023). Like galanin neurons, neurons expressing these neuropeptides can also be divided into multiple subtypes (Chen et al., 2017; Moffitt et al., 2018). Thus, while these molecular markers for POA neurons are immensely diverse, we agree that characterizing the molecular identity of POA GAD2→TMN neurons and investigating the functional relevance of these neuropeptides in the context of REMs homeostasis would enrich our understanding of a neural circuit involved in REMs homeostasis and can stand as a separate extension of this manuscript.

      Reviewer #2 (Public Review):

      Maurer et al investigated the contribution of GAD2+ neurons in the preoptic area (POA), projecting to the tuberomammillary nucleus (TMN), to REM sleep regulation. They applied an elegant design to monitor and manipulate the activity of this specific group of neurons: a GAD2-Cre mouse, injected with retrograde AAV constructs in the TMN, thereby presumably only targeting GAD2+ cells projecting to the TMN. Using this set-up in combination with technically challenging techniques including EEG with photometry and REM sleep deprivation, the authors found that this cell-type studied becomes active shortly (≈40sec) prior to entering REM sleep and remains active during REM sleep. Moreover, optogenetic inhibition of GAD2+ cells inhibits REM sleep by a third and also impairs the rebound in REM sleep in the following hour. Despite a few reservations or details that would benefit from further clarification (outlined below), the data makes a convincing case for the role of GAD2+ neurons in the POA projecting to the TMN in REM sleep regulation.

      We thank the reviewer for the thorough assessment of our study and the supportive comments. We have addressed the reviewer's concerns in the revised manuscript, and our point-by-point response is provided below.

      The authors found that optogenetic inhibition of GAD2+ cells suppressed REM sleep in the hour following the inhibition (e.g. Fig2 and Fig4). If the authors have the data available, it would be important to include the subsequent hours in the rebound time (e.g. from ZT8.5 to ZT24) to test whether REM sleep rebound remains impaired, or recovers, albeit with a delay.

      We thank the reviewer for this comment and agree that it would be interesting to know how REMs changes for a longer period of time throughout the rebound phase. For Fig. 2, we did not record the subsequent hours. For Fig 4, we recorded the subsequent rebound between ZT7.5 and 10.5. When we compare the REMs amount during this 4 hr interval, the SwiChR mice have less REMs compared with eYFP mice with marginal significance (unpaired t-test, p=0.0641). We also plotted the cumulative REMs amount during restriction and rebound phases, and found that the cumulative amount of REMs was still lower in SwiChR mice than eYFP mice at ZT 10.5 (Author response image 2). Therefore, it will be interesting to record for a longer period of time to test when the SwiChR mice compensate for all the REMs that was lost during the restriction period.

      Author response image 2.

      Cumulative amount of REMs during REMs deprivation and rebound combined with optogenetic stimulation in eYFP and SwiChR groups. These data are shown as bar graphs in Figure 4.
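
      As a minimal sketch of how a cumulative REMs curve of this kind can be computed, the Python example below accumulates per-bin REMs amounts and reports the running difference between two groups. The per-hour values, variable names, and bin size are hypothetical placeholders chosen for illustration; they are not data from the study.

```python
from itertools import accumulate


def cumulative_rem(rem_minutes_per_bin):
    """Return the running total of REMs across consecutive time bins."""
    return list(accumulate(rem_minutes_per_bin))


# Hypothetical per-hour REMs amounts in minutes (illustrative only,
# not values measured in the study).
eyfp = [3.0, 3.5, 4.0, 6.0, 5.5, 5.0]
swichr = [1.5, 2.0, 2.0, 4.5, 5.0, 5.0]

cum_eyfp = cumulative_rem(eyfp)
cum_swichr = cumulative_rem(swichr)

# Running REMs deficit of the SwiChR group relative to the eYFP group.
deficit = [e - s for e, s in zip(cum_eyfp, cum_swichr)]
```

      If the deficit is still positive at the last bin, the lost REMs has not yet been fully compensated within the recorded window.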

      REM sleep is under tight circadian control (e.g. Wurts et al., 2000 in rats; Dijk, Czeisler 1995 in humans). To contextualize the results, it would be important to mention that it is not clear if the role of the manipulated neurons in REM sleep regulation hold at other circadian times of the day.

      Author response image 3.

      Inhibiting POA GAD2→TMN neurons at ZT5-8 reduces REMs. (A) Schematic of optogenetic inhibition experiments. (B) Percentage of time spent in REMs, NREMs and wakefulness with laser in SwiChR++ and eYFP mice. Unpaired t-tests, p = 0.0013, 0.0469 for REMs and wake amount. (C) Duration of REMs, NREMs, and wake episodes. Unpaired t-tests, p = 0.0113 for NREMs duration. (D) Frequency of REMs, NREMs, and wake episodes. Unpaired t-tests, p = 0.0063, 0.0382 for REMs and NREMs frequency.

      REMs propensity is largest towards the end of the light phase (Czeisler et al., 1980; Dijk and Czeisler, 1995; Wurts and Edgar, 2000). As a control, we therefore performed the optogenetic inhibition experiments of POA GAD2→TMN neurons during ZT5-8 (Author response image 3). Similar to our results in Figure 2, we found that SwiChR-mediated inhibition of POA GAD2→TMN neurons attenuated REMs compared with eYFP laser sessions. These findings suggest that our results are consistent at other circadian times of the day.

      The effect size of the REM sleep deprivation using the vibrating motor method is unclear. In FigS4-D, the experimental mice reduce their REM sleep to 3% whereas the control mice spend 6% in REM sleep. In Fig4, mice are either subjected to REM sleep deprivation with the vibrating motor (controls), or REM sleep deprivations + optogenetics (experimental mice).

      The control mice (vibrating motor) in Fig4 spend 6% of their time in REM sleep, which is double the amount of REM sleep compared to the mice receiving the same treatment in FigS4-D. Can the authors clarify the origin of this difference in the text?

      The effect size for REM sleep deprivation is now added in the text.

      It is important to note that these figures analyze two different intervals of the REMs restriction. In Fig. S4D, we analyzed the total amount of REMs over the entire 6 hr restriction interval (ZT1.5-7.5). In Fig. 4, we analyzed the amount of REMs only during the last 3 hr of restriction (ZT4.5-7.5), as optogenetic inhibition was performed only during the last 3 hr, when REMs pressure is high. For the data in Fig. S4D, we compared the amount of REMs during ZT1.5-4.5 and ZT4.5-7.5 and found that the amount during ZT4.5-7.5 (4.46 ± 0.25%; mean ± s.e.m.) is indeed higher than during ZT1.5-4.5 (1.66 ± 0.62%) and is comparable to the amount of REMs during ZT4.5-7.5 in eYFP mice (5.95 ± 0.52%) in Fig. 4. We now clearly state in the manuscript at which time points we analyzed the amount, duration, and frequency of REMs.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) A few further citations suggested: Discussion "The TMN contains histamine producing neurons and antagonizing histamine neurons causes sleepiness..." It would be appropriate to cite Uygun DS et al 2016 J Neurosci (PMID: 27807161) here. Using the same HDC-Cre mice as used by Maurer et al., Uygun et al found that selectively increasing GABAergic inhibition onto histamine neurons produced NREM sleep.

      We apologize for omitting this important paper. In the revised manuscript, we added this citation.

      (2) Materials and Methods.

      Although the JAX numbers are given for the mouse lines based on researchers generously donating to JAX for others to use, please cite the papers corresponding to the GAD2-ires-Cre and HDC-ires-Cre mouse lines deposited at JAX.

      GAD2-ires-Cre was described in Taniguchi H et al., 2011, Neuron (PMID: 21943598).

      The construction of the HDC-ires-CRE line is described in Zecharia AY et al J Neurosci et al 2012 (PMID: 22993424).

      We have now added these important citations in the revised manuscript.

      (3) Similarly, for the viruses, please provide the citations for the AAV constructs that were donated to Addgene.

      We have now added these citations in the revised manuscript.

      Reviewer #2 (Recommendations For The Authors):

      The authors base their conclusions heavily on an optogenetic tool that inhibits the activity of GAD2+ neurons; however, it is not shown that these neurons are indeed inhibited as expected. An alternative approach to tackle this could be the application of a different technique to achieve the same output (e.g. chemogenetics). However, both experiments (confirmation of inhibition, or using a different technique) would require a significant amount of work and, given the numerous studies showing that these optogenetic tools tend to work, may not be necessary. Hence, the authors could instead cite a similar study that used a comparable construct and showed that this technique works (i.e. a similar retrograde optogenetic construct with Cre-dependent expression combined with electrophysiological recordings).

      This laser stimulation protocol was designed based on previous reports of sustained inhibition using the same inhibitory opsin and on our prior results, which recapitulate findings obtained with inhibitory chemogenetic techniques (Iyer et al., 2016; Kim et al., 2016; Wiegert et al., 2017; Stucynski et al., 2022). We have now added this description to the Results section.

      Fig1A - Right: the virus expression graphs are great and give a helpful insight into the variability. The image on the left (GCAMP+ cells) is less clear, the GCAMP+ cells don't differentiate well from the background. Perhaps the whole brain image with inset in POA can show the GCAMP expression more convincingly.

      We have added a histology picture showing the whole brain image with an inset in the POA in the updated Fig. 1A.

      Statistics: The table is very helpful. Based on the degrees of freedom, it seems that in some instances the stats are run on the recordings rather than on the individual mice (e.g. Fig1). It could be considered to use a mixed model where subjects are taken into account as a factor.

      Author response image 4.

      ΔF/F activity of POA GAD2→TMN neurons during NREMs. The duration of NREMs episodes was normalized in time, ranging from 0 to 100%. Shading, ± s.e.m. Pairwise t-tests with Holm-Bonferroni correction, p = 5.34e-4 between 80 and 100. Gray bar, intervals where ΔF/F activity was significantly different from baseline (0 to 20%, the first time bin). n = 10 mice.

      In Fig. 1E, we ran stats based on the recordings. In this data set, we ran stats based on the individual mice, and found that the activity also gradually increased throughout NREMs episodes.
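
      To make the distinction between recording-level and mouse-level statistics concrete, the following Python sketch first collapses repeated recordings into one mean value per mouse and then applies a Holm-Bonferroni adjustment to a set of p-values. It is an illustrative sketch only: the function names, mouse identifiers, and numbers are hypothetical, and it is not the analysis code used for the figures.

```python
from collections import defaultdict
from statistics import mean


def average_per_mouse(recordings):
    """Collapse (mouse_id, value) pairs into one mean value per mouse.

    Using one value per animal avoids treating several recordings
    from the same mouse as independent samples.
    """
    by_mouse = defaultdict(list)
    for mouse_id, value in recordings:
        by_mouse[mouse_id].append(value)
    return {mouse_id: mean(values) for mouse_id, values in by_mouse.items()}


def holm_bonferroni(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical recordings: two sessions for mouse "m1", one for "m2".
per_mouse = average_per_mouse([("m1", 1.0), ("m1", 3.0), ("m2", 5.0)])

# Hypothetical raw p-values from pairwise comparisons against baseline.
adjusted_p = holm_bonferroni([0.01, 0.04, 0.03])
```

      A mixed model with mouse as a random factor, as suggested by the reviewer, would be an alternative that keeps all recordings while still accounting for the repeated measures.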

      There is an effect of laser in Fig2 on REM sleep amount, as well as an interaction effect with virus injection (from the table). Therefore, it would be helpful for the reader to also show REM sleep data from the control group (laser stimulation but no active optogenetics construct) in Fig 2.

      To properly control for laser and virus effects, we performed the same laser stimulation experiments in eYFP control mice (expressing only eYFP without the optogenetic construct, SwiChR++), and the data are provided in Fig. 2C.

      Fig3B: At the start of the rebound of REM sleep, there is a massive amount of wakefulness, also reflected in the change of spectral composition. Could you comment on the text about what is happening here?

      We quantified the amount of wakefulness during the first hour of REMs rebound and found that indeed there is no significant difference in wakefulness between REM restriction and baseline control conditions (Fig. S4H). Therefore, while the representative image in Fig. 3B shows increased wakefulness at the beginning of REMs rebound, we do not think the overall amount of wakefulness is increased.

      Fig 4, supplementary data: it would be helpful for the reader to have mentioned in the text the effect size of the REM sleep restriction protocol (e.g. mean and standard deviation).

      Thank you for this suggestion. We have now added the effect size for the REM sleep restriction experiments in the main text.

      REM sleep restriction and photometry experiment: could be improved by adding within the main body of text that, in order to conduct the photometry experiment in the last hours of REM sleep deprivation, the manual REM sleep deprivation had to be applied, because the vibrating motor technique disturbed the photometry recordings.

      Thank you for this suggestion. We have added the description in the main text.

      Suggestion to build further on the already existing data (not for this paper): you have a powerful dataset to test whether REM sleep pressure builds up during wakefulness or NREM sleep, by correlating when your optogenetic treatment occurs (NREM or wakefulness), with the subsequent rebound in REM sleep (see also Endo et al., 1998; Benington and Heller, 1994; Franken 2001).

      We thank the reviewer for this excellent suggestion. We plan to carry out this experiment in the future.

      References

      Antila, H., Kwak, I., Choi, A., Pisciotti, A., Covarrubias, I., Baik, J., et al. (2022). A noradrenergic-hypothalamic neural substrate for stress-induced sleep disturbances. Proc. Natl. Acad. Sci. 119, e2123528119. doi: 10.1073/pnas.2123528119.

      Blake, H., and Gerard, R. W. (1937). Brain potentials during sleep. Am. J. Physiol.-Leg. Content 119, 692–703. doi: 10.1152/ajplegacy.1937.119.4.692.

      Chen, R., Wu, X., Jiang, L., and Zhang, Y. (2017). Single-Cell RNA-Seq Reveals Hypothalamic Cell Diversity. Cell Rep. 18, 3227–3241. doi: 10.1016/j.celrep.2017.03.004.

      Chung, S., Weber, F., Zhong, P., Tan, C. L., Nguyen, T., Beier, K. T., et al. (2017). Identification of Preoptic Sleep Neurons Using Retrograde Labeling and Gene Profiling. Nature 545, 477–481. doi: 10.1038/nature22350.

      Czeisler, C. A., Zimmerman, J. C., Ronda, J. M., Moore-Ede, M. C., and Weitzman, E. D. (1980). Timing of REM sleep is coupled to the circadian rhythm of body temperature in man. Sleep 2, 329–346.

      Dijk, D. J., Brunner, D. P., Beersma, D. G., and Borbély, A. A. (1990). Electroencephalogram power density and slow wave sleep as a function of prior waking and circadian phase. Sleep 13, 430–440. doi: 10.1093/sleep/13.5.430.

      Dijk, D. J., and Czeisler, C. A. (1995). Contribution of the circadian pacemaker and the sleep homeostat to sleep propensity, sleep structure, electroencephalographic slow waves, and sleep spindle activity in humans. J. Neurosci. Off. J. Soc. Neurosci. 15, 3526–3538. doi: 10.1523/JNEUROSCI.15-05-03526.1995.

      Donlea, J. M., Pimentel, D., and Miesenböck, G. (2014). Neuronal machinery of sleep homeostasis in Drosophila. Neuron 81, 860–872. doi: 10.1016/j.neuron.2013.12.013.

      Ferrara, M., De Gennaro, L., Casagrande, M., and Bertini, M. (1999). Auditory arousal thresholds after selective slow-wave sleep deprivation. Clin. Neurophysiol. Off. J. Int. Fed. Clin. Neurophysiol. 110, 2148–2152. doi: 10.1016/s1388-2457(99)00171-6.

      Gaus, S. E., Strecker, R. E., Tate, B. A., Parker, R. A., and Saper, C. B. (2002). Ventrolateral preoptic nucleus contains sleep-active, galaninergic neurons in multiple mammalian species. Neuroscience 115, 285–294. doi: 10.1016/S0306-4522(02)00308-1.

      Gong, H., McGinty, D., Guzman-Marin, R., Chew, K.-T., Stewart, D., and Szymusiak, R. (2004). Activation of c-fos in GABAergic neurones in the preoptic area during sleep and in response to sleep deprivation. J. Physiol. 556, 935–946. doi: 10.1113/jphysiol.2003.056622.

      Iyer, S. M., Vesuna, S., Ramakrishnan, C., Huynh, K., Young, S., Berndt, A., et al. (2016). Optogenetic and chemogenetic strategies for sustained inhibition of pain. Sci. Rep. 6, 30570. doi: 10.1038/srep30570.

      Kim, H., Ährlund-Richter, S., Wang, X., Deisseroth, K., and Carlén, M. (2016). Prefrontal Parvalbumin Neurons in Control of Attention. Cell 164, 208–218. doi: 10.1016/j.cell.2015.11.038.

      Kroeger, D., Absi, G., Gagliardi, C., Bandaru, S. S., Madara, J. C., Ferrari, L. L., et al. (2018). Galanin neurons in the ventrolateral preoptic area promote sleep and heat loss in mice. Nat. Commun. 9, 4129. doi: 10.1038/s41467-018-06590-7.

      Ma, Y., Miracca, G., Yu, X., Harding, E. C., Miao, A., Yustos, R., et al. (2019). Galanin Neurons Unite Sleep Homeostasis and α2-Adrenergic Sedation. Curr. Biol. CB 29, 3315-3322.e3. doi: 10.1016/j.cub.2019.07.087.

      Mallick, B. N., and Singh, A. (2011). REM sleep loss increases brain excitability: role of noradrenaline and its mechanism of action. Sleep Med. Rev. 15, 165–178. doi: 10.1016/j.smrv.2010.11.001.

      McDermott, C. M., LaHoste, G. J., Chen, C., Musto, A., Bazan, N. G., and Magee, J. C. (2003). Sleep deprivation causes behavioral, synaptic, and membrane excitability alterations in hippocampal neurons. J. Neurosci. Off. J. Soc. Neurosci. 23, 9687–9695. doi: 10.1523/JNEUROSCI.23-29-09687.2003.

      Miracca, G., Anuncibay-Soto, B., Tossell, K., Yustos, R., Vyssotski, A. L., Franks, N. P., et al. (2022). NMDA Receptors in the Lateral Preoptic Hypothalamus Are Essential for Sustaining NREM and REM Sleep. J. Neurosci. 42, 5389–5409. doi: 10.1523/JNEUROSCI.0350-21.2022.

      Moffitt, J. R., Bambah-Mukku, D., Eichhorn, S. W., Vaughn, E., Shekhar, K., Perez, J. D., et al. (2018). Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362. doi: 10.1126/science.aau5324.

      Neckelmann, D., and Ursin, R. (1993). Sleep stages and EEG power spectrum in relation to acoustical stimulus arousal threshold in the rat. Sleep 16, 467–477.

      Park, S.-H., Baik, J., Hong, J., Antila, H., Kurland, B., Chung, S., et al. (2021). A probabilistic model for the ultradian timing of REM sleep in mice. PLOS Comput. Biol. 17, e1009316. doi: 10.1371/journal.pcbi.1009316.

      Rosa, R. R., and Bonnet, M. H. (1985). Sleep stages, auditory arousal threshold, and body temperature as predictors of behavior upon awakening. Int. J. Neurosci. 27, 73–83. doi: 10.3109/00207458509149136.

      Sherin, J. E., Elmquist, J. K., Torrealba, F., and Saper, C. B. (1998). Innervation of histaminergic tuberomammillary neurons by GABAergic and galaninergic neurons in the ventrolateral preoptic nucleus of the rat. J. Neurosci. Off. J. Soc. Neurosci. 18, 4705–4721.

      Smith, J., Honig-Frand, A., Antila, H., Choi, A., Kim, H., Beier, K. T., et al. (2023). Regulation of stress-induced sleep fragmentation by preoptic glutamatergic neurons. Curr. Biol. CB, S0960-9822(23)01585-3. doi: 10.1016/j.cub.2023.11.035.

      Stucynski, J. A., Schott, A. L., Baik, J., Chung, S., and Weber, F. (2022). Regulation of REM sleep by inhibitory neurons in the dorsomedial medulla. Curr. Biol. CB 32, 37-50.e6. doi: 10.1016/j.cub.2021.10.030.

      Takahashi, K., Lin, J.-S., and Sakai, K. (2009). Characterization and mapping of sleep-waking specific neurons in the basal forebrain and preoptic hypothalamus in mice. Neuroscience 161, 269–292. doi: 10.1016/j.neuroscience.2009.02.075.

      Weber, F., Hoang Do, J. P., Chung, S., Beier, K. T., Bikov, M., Saffari Doost, M., et al. (2018). Regulation of REM and Non-REM sleep by periaqueductal GABAergic neurons. Nat. Commun. 9, 1–13. doi: 10.1038/s41467-017-02765-w.

      Wiegert, J. S., Mahn, M., Prigge, M., Printz, Y., and Yizhar, O. (2017). Silencing Neurons: Tools, Applications, and Experimental Constraints. Neuron 95, 504–529. doi: 10.1016/j.neuron.2017.06.050.

      Williams, H. L., Hammack, J. T., Daly, R. L., Dement, W. C., and Lubin, A. (1964). RESPONSES TO AUDITORY STIMULATION, SLEEP LOSS AND THE EEG STAGES OF SLEEP. Electroencephalogr. Clin. Neurophysiol. 16, 269–279. doi: 10.1016/0013-4694(64)90109-9.

      Wurts, S. W., and Edgar, D. M. (2000). Circadian and homeostatic control of rapid eye movement (REM) sleep: promotion of REM tendency by the suprachiasmatic nucleus. J. Neurosci. Off. J. Soc. Neurosci. 20, 4300–4310. doi: 10.1523/JNEUROSCI.20-11-04300.2000.

      Zhou, Y., Lai, C. S. W., Bai, Y., Li, W., Zhao, R., Yang, G., et al. (2020). REM sleep promotes experience-dependent dendritic spine elimination in the mouse cortex. Nat. Commun. 11, 4819. doi: 10.1038/s41467-020-18592-5.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We thank the three reviewers and the reviewing editor for their positive evaluation of our manuscript. We particularly appreciate that they unanimously consider our work as “important contributions to the understanding of how the CAF-1 complex works”, “The large amounts of data provided in the paper support the authors' conclusion very well” and “The paper effectively addresses its primary objective and is strong”. We also thank them for a careful reading and useful comments to improve the manuscript. We have built on these comments to provide an improved version of the manuscript, and address them point by point below.

      Reviewer #1 (Public Review):

      Summary:

      This paper makes important contributions to the structural analysis of the DNA replication-linked nucleosome assembly machine termed Chromatin Assembly Factor-1 (CAF-1). The authors focus on the interplay of domains that bind DNA, histones, and replication clamp protein PCNA.

      Strengths:

      The authors analyze soluble complexes containing full-length versions of all three fission yeast CAF-1 subunits, an important accomplishment given that many previous structural and biophysical studies have focused on truncated complexes. New data here supports previous experiments indicating that the KER domain is a long alpha helix that binds DNA. Via NMR, the authors discover structural changes at the histone binding site, defined here with high resolution. Most strikingly, the experiments here show that for the S. pombe CAF-1 complex, the WHD domain at the C-terminus of the large subunit lacks DNA binding activity observed in the human and budding yeast homologs, indicating a surprising divergence in the evolution of this complex. Together, these are important contributions to the understanding of how the CAF-1 complex works.

      Weaknesses:

      1. There are some aspects of the experimentation that are incompletely described:

      In the SEC data (Fig. S1C) it appears that Pcf1 in the absence of other proteins forms three major peaks. Two are labeled as "1a" (eluting at ~8 mL) and "1b" (~10-11 mL). It appears that Pcf1 alone or in complex with either or both of the other two subunits forms two different high molecular weight complexes (e.g. 4a/4b, 5a/5b, 6a/6b). There is also a third peak in the analysis of Pcf1 alone, which isn't named here, eluting at ~14 mL, overlapping the peaks labeled 2a, 4c, and 5c. The text describing these different macromolecular complexes seems incomplete (p. 3, lines 32-33): "When isolated, both Pcf2 and Pcf3 are monomeric while Pcf1 forms large soluble oligomers". Which of the three Pcf1-alone peaks are oligomers, and how do we know? What is the third peak? The gel analysis across these chromatograms should be shown.

      We thank the reviewer for their careful reading of the manuscript. Indeed, we plotted two curves in Figure S1C in a color that does not match the legend, leading to confusion. Curve 1, Pcf1 alone, depicted in red, should appear in pink as indicated in the legend and in the SDS-PAGE analysis below. Curve 1 exhibits two peaks, labeled as 1a and 1b. With an elution volume of 8.5 mL, close to the dead volume of the column, peak 1a corresponds to soluble oligomers, while peak 1b (10.4 mL) likely corresponds to monomeric Pcf1. Curve 5 (Pcf1 + Pcf2 mixture) was plotted in pink instead of purple, the color indicated in the legend. This curve consists of three distinct peaks (5a, 5b, and 5c). The SDS-PAGE analysis revealed the presence of oligomers of Pcf1-Pcf2 (5a, 8.3 mL), the Pcf1-Pcf2 complex (5b, 9.8 mL), and Pcf2 alone (5c, 13.6 mL).

      The color has now been corrected in the revised manuscript.

      More importantly, was a particular SEC peak of the three-subunit CAF-1 complex (i.e. 4a or 4b) characterized in the further experimentation, or were the data obtained from the input material prior to the separation of the different peaks? If the latter, how might this have affected the results? Do the forms inter-convert spontaneously?

      We conducted all structural analyses and DNA/PCNA interaction experiments (Figures 1-4, S1-S4) with freshly SEC-purified samples corresponding to the 4b peak (9.7 mL). Aliquots were flash-frozen with 50% glycerol for in vitro histone assembly assays (Figure 5).

      2. Given the strong structural prediction about the roles of residues L359 and F380 (Fig. 2f), these should be mutated to determine effects on histone binding.

      We are pleased that our structural predictions are considered strong. We agree that investigating the role of the L359 and F380 residues will be critical to further refine the binding interface between histone H3-H4 and CAF-1. An in vitro and in vivo analysis of such mutated forms, alongside the current Pcf1-ED mutant characterized in this article and additional potential mutated forms, has the potential to provide a better understanding of the dynamics of histone deposition by CAF-1. However, these additional approaches represent a further step toward deciphering this enigmatic dynamic.

      3. Could it be that the apparent lack of histone deposition by the delta-WHD mutant complex occurs because this mutant complex is unstable when added to the Xenopus extract?

      We cannot formally exclude this possibility, which could potentially apply to all mutated forms tested. However, in the absence of available antibodies against the fission yeast CAF-1 complex, we cannot test this hypothesis for technical reasons. Nevertheless, we feel reassured by the fact that the in vitro assays of nucleosome assembly are overall consistent with the in vivo assays. Indeed, all mutated forms tested that abolished or weakened nucleosome assembly, including the delta-WHD mutated form, also exhibited synthetic lethality or a growth defect in the absence of a functional HIRA pathway. This genetic synergy, which reflects defective histone deposition by CAF-1, is not specific to the fission yeast S. pombe and was previously reported in S. cerevisiae (Kaufman et al. MCB 1998; Krawitz et al. MCB 2002). This further supports the evolutionary conservation of this genetic assay as a readout for defective histone deposition by CAF-1.

      Reviewer #1 (Recommendations For The Authors):

      • p. 4: "An experimental molecular weight of 179 kDa was calculated using Small Angle X-ray Scattering (SAXS), consistent with a 1:1:1 stoichiometry (Figure S1e). These data are in agreement with a globular complex with a significant flexibility (Figure S1f)." There needs to be more description of the precision of the molecular weight measurement, and what aspects of these data indicate the flexibility.

      The molecular weight was estimated using the correlation volume (Vc) defined by Rambo & Tainer (Nature 2013, 496, 477-481). The estimated error with this method is around 10%. We added this information together with supporting arguments for the existence of flexibility: “An experimental molecular weight of 179 kDa was calculated using Small Angle X-ray Scattering (SAXS). Assuming an accuracy of around 10% with this method (Rambo and Tainer 2013), this value is consistent with a 1:1:1 stoichiometry for the CAF-1 complex (calculated MW 167 kDa) (Figure S1e). In addition, the position of the maximum for the dimensionless Kratky plot was slightly shifted to higher values on the y and x axes compared to the position of the expected maximum of the curve for a fully globular protein (Figure S1f).

      This shows that the complex was globular with a significant flexibility.”
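
      The consistency argument above is simple arithmetic, and the short Python sketch below spells it out using only the two masses and the approximate 10% accuracy quoted in the response. The helper name and the 2:2:2 comparison value are assumptions added for illustration; they do not appear in the manuscript.

```python
def within_relative_error(measured, expected, rel_error):
    """True if `expected` lies within measured * (1 +/- rel_error)."""
    return abs(measured - expected) <= rel_error * measured


measured_mw_kda = 179.0    # SAXS estimate quoted in the response
calculated_mw_kda = 167.0  # calculated mass of a 1:1:1 complex, as quoted

# Relative deviation of the measurement from the calculated mass.
relative_difference = abs(measured_mw_kda - calculated_mw_kda) / calculated_mw_kda

# Is a 1:1:1 complex compatible with the measurement at ~10% accuracy?
consistent_1_1_1 = within_relative_error(measured_mw_kda, calculated_mw_kda, 0.10)

# Hypothetical comparison: a 2:2:2 assembly would have twice the mass.
consistent_2_2_2 = within_relative_error(measured_mw_kda, 2 * calculated_mw_kda, 0.10)
```

      With these inputs the deviation is about 7%, which is inside the quoted accuracy for a 1:1:1 complex, whereas a doubled mass would fall well outside it.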

      • p. 6, lines 21-22: "In contrast, a large part of signals (338-396) did not vanish anymore upon addition of a histone complex preformed with two other histone chaperones known to compete with CAF-1 for histone binding..." Given the contrast made later with the 338-351 region which is insensitive to Asf1/Mcm2, it would be clearer for the reader to describe the Asf1/Mcm2-competed regions as residues 325-338 plus 352-396. Note that the numerical scale of residues doesn't line up perfectly with the data points in Figure 2d, and this should be fixed as well.

      We thank this reviewer for spotting this typographical error; we intended to write “In contrast, a large part of signals (348-396) did not vanish anymore…”. We modified the paragraph as suggested by the reviewer because we agree it is clearer for the reader: “In contrast, only a shorter fragment (338-347) vanished upon addition of Asf1-H3-H4-Mcm2(69-138), a histone complex preformed with two other histone chaperones, Asf1 and Mcm2, known to compete with CAF-1 for histone binding (Sauer et al. 2017) and whose histone binding modes are well established (Figure 2e) (Huang et al. 2015, Richet et al. 2015). This finding underscores a direct competition between residues (325-338) and (349-396) within the ED domain and Asf1/Mcm2 for histone binding.”

      The slight shift in the numerical scale Figure 2d was also corrected.

      • p. 8. Lines 22-24: "EMSAs with a double-stranded 40bp DNA fragment confirmed the homogeneity of the bound complex. When increasing the SpCAF-1 concentration, additional mobility shifts suggest, a cooperative DNA binding (Figure 3a)." I agree that the migration of the population is further retarded upon the addition of more protein. However, doesn't this negate the first sentence? That is, if multiple CAF-1 complexes can bind each dsDNA molecule, can these complexes be described as homogeneous?

      We fully agree with the reviewer's comment and have removed the notion of homogeneity from the first sentence. “EMSAs with a double-stranded 40bp DNA fragment showed the formation of a bound complex.”

      • Figure S2b Legend: "1H-15N HSQC spectra of Pcf1_ED (425-496)." The residue numbers should read 325-396.

      The typo has been corrected.

      • Is the title for Figure 5 correct?: "Figure 5: Rescue using Y340 and W348 in the ED domain, the intact KER DNA binding domain and the C-terminal WHD of Pcf1 in SpCAF-1 mediated nucleosome assembly." I don't see that any point mutation rescue experiments are done here.

      The title of Figure 5 has been changed to “Efficient nucleosome assembly by SpCAF-1 in vitro requires interactions with H3-H4, DNA and PCNA, and the C-terminal WHD domain”.

      • Figure S6C. I assume the top strain lacks the Pcf2-GFP but this should be stated explicitly.

      The following sentence has been added to the figure legend: “The top strain corresponds to a strain expressing wild-type and untagged Pcf2 as a negative control of GFP fluorescence”. Figure S6C has been modified accordingly to mention “Pcf2 (untagged)” so that this is stated explicitly.

      • Regarding point #3 in the public review, a simple initial test of this idea would be to determine if similar amounts of wt and mutant complexes can be immunoprecipitated at the endpoint of the assembly reactions.

      In the absence of available antibodies against the fission yeast CAF-1 complex, we cannot test this hypothesis for technical reasons. However, the in vitro assays of nucleosome assembly are overall consistent with the in vivo assays. Indeed, all mutated forms tested that abolished or weakened nucleosome assembly, including the delta-WHD mutated form, also exhibited synthetic lethality or a growth defect in the absence of a functional HIRA pathway. This genetic synergy, reflecting defective histone deposition by CAF-1, is not specific to the fission yeast S. pombe, as it was previously reported in S. cerevisiae (Kaufman et al. MCB 1998; Krawitz et al. MCB 2002), further supporting the evolutionary conservation of this genetic assay as a readout for defective histone deposition by CAF-1.

      • Foundational findings that should be cited: The role of PCNA in CAF-1 activity was first recognized by pioneering studies in the Stillman laboratory (PMID: 10052459, 11089978). The earliest recombinant studies of CAF-1 showed that the large subunit is the binding platform for the other two, showed that the KER and ED domains were required for histone deposition activity, and roughly mapped the p60-binding site on the large subunit (PMID: 7600578). Another early study roughly mapped the binding site for the third subunit and showed that biological effects of impairing the PCNA binding synergized with defects in the HIR pathway (PMID: 11756556), a genetic synergy first demonstrated in budding yeast (PMID: 9671489).

      We thank the reviewer for providing these important references, which are now cited in the manuscript. PMID: 10052459 and 11089978 are cited on page 2, lines 18 and 19; PMID: 7600578 on page 19, line 5; and PMID: 11756556 and 9671489 on page 18, line 2.

      Reviewer #2 (Public Review):

      Summary:

      The authors describe the structure-functional relationship of domains in S. pombe CAF-1, which promotes DNA replication-coupled deposition of histone H3-H4 dimer. The authors nicely showed that the ED domain with an intrinsically disordered structure binds to histone H3-H4, that the KER domain binds to DNA, and that, in addition to a PIP box, the KER domain also contributes to the PCNA binding. The ED and KER domains as well as the WHD domain are essential for nucleosome assembly in vitro. The ED, KER domains, and the PIP box are important for the maintenance of heterochromatin.

      Strengths:

      The combination of structural analysis using NMR and Alphafold2 modeling with biophysical and biochemical analysis provided strong evidence on the role of the different domain structures of the large subunit of SpCAF-1, spPCF-1 in the binding to histone H3-H4, DNA as well as PCNA. The conclusion was further supported by genetic analysis of the various pcf1 mutants. The large amounts of data provided in the paper support the authors' conclusion very well.

      Reviewer #2 (Recommendations For The Authors):

      The paper by Ochesenbein describes the structural and functional analysis of S. pombe CAF-1 complex critical for DNA replication-coupled histone H3/H4 deposition. By using structural, biophysical, and biochemical analyses combined with genetic methods, the authors nicely showed that a large subunit of SpCAF1, SpPCF-1, consists of 5 structured domains with four connecting IDR domains. The ED domain with IDR nature binds to histone H3-H4 dimer with the conformational change of the other domain(s). SpCAF-1 binds to dsDNA by using the KER domain, but not the WHD domain. The experiments have been done with great care and a large amount of the data are highly reliable. Moreover, the results are clearly presented and convincingly written. The conclusion in the paper is very solid and will be useful for researchers who work in the field of chromosome biology.

      Major points:

      1. DNA binding of the KER mutant shown in Figures S3h and S3i, which was measured by the EMSA, looks similar to that of wild-type control in Figure S3f, which is different from the data in Figures 3b and 3e measured by the MST. The authors need a more precise description of the EMSA result of the KER mutant shown in Figures 3 and S3. The quantification of the EMSA result would resolve the point (should be provided).

      As proposed by this reviewer, we performed quantification of all EMSA presented in Figure 3 and Figure S3. We quantified the signal of the free DNA band to calculate a percentage of bound DNA in each condition. All EMSA experiments were conducted in duplicate, allowing us to calculate an average value and standard deviation for each interaction. Representative curves and fitted values are reported below in the figure provided for the reviewer (panel a data for Pcf1_KER domain with two fitting models, panel b for the entire CAF-1 complexes and mutants, panel c for the isolated Pcf1_KER domains), all fitted values in panel d. Importantly, as illustrated in panel a, the complete model for a single interaction (complete KD model, dashed line curve) does not adequately fit the data. In contrast, a function incorporating cooperativity (Hill model) better accounts for the measured data (solid line curve). Consistently, we also used the Hill model to fit the binding curves measured with the MST technique. As also specified now in the text, the Hill model allows the determination of an EC50 value (concentration of protein resulting in the disappearance of half of the free DNA band intensity) and a Hill coefficient value (representing cooperativity during the interaction) for each curve.
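
      The quantification described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis script: the band intensities are invented placeholders, and the function names `fraction_bound` and `hill` are hypothetical. It converts the free-DNA band signal into a fraction of bound DNA, averages duplicates, and evaluates a Hill curve.

```python
from statistics import mean, stdev


def fraction_bound(free_intensity, free_intensity_no_protein):
    """Fraction of DNA bound, inferred from the loss of the free-DNA band."""
    return 1.0 - free_intensity / free_intensity_no_protein


def hill(protein_conc, ec50, h):
    """Hill model: fraction of DNA bound at a given protein concentration.

    ec50 is the protein concentration giving half-maximal binding and
    h is the Hill coefficient (h > 1 indicates positive cooperativity).
    """
    return protein_conc**h / (ec50**h + protein_conc**h)


# Hypothetical duplicate measurements of the free-DNA band (arbitrary units).
free_no_protein = 1000.0
duplicates = [620.0, 580.0]

bound = [fraction_bound(x, free_no_protein) for x in duplicates]
avg_bound = mean(bound)   # average of the duplicates
sd_bound = stdev(bound)   # sample standard deviation of the duplicates
```

      By construction, `hill(ec50, ec50, h)` returns 0.5 for any `h`, which matches the definition of EC50 as the concentration at which half of the free DNA band has disappeared; the Hill coefficient only changes how steeply the curve rises around that point.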

      We measure a value of 3.4 ± 0.4 μM for the EC50 of SpCAF-1 WT, which is higher than the value measured by MST (0.7 ± 0.1 μM). Higher values were also calculated for all mutants and isolated Pcf1_KER domains compared to MST. These discrepancies could arise from the fact that the DNA concentrations used in the two techniques were very different (20 nM for MST experiments and 1 μM for EMSA). Unlike the complete KD model, which includes the DNA concentration (considered here as the "receptor") in the calculation, the Hill model is fitted independently of this value. This model assumes that the “receptor” concentration is low compared to the KD. Here we calculate EC50 values on the same order of magnitude as the DNA concentration (low micromolar). The quantification obtained by EMSA is thus challenging to interpret. In contrast, values fitted from the MST measurements are more reliable, since the assumption of a low “receptor” concentration holds in that case.
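
      The dependence on DNA concentration can be illustrated with a short sketch. It assumes a simple 1:1, non-cooperative binding model (which the authors note does not fit their data well), so it only shows the direction of the effect and is not a reanalysis of it. The function names and the KD value are hypothetical.

```python
import math


def bound_fraction_kd(protein_total, dna_total, kd):
    """Fraction of DNA bound for 1:1 binding, accounting for protein depletion.

    Solves the quadratic mass-action equation for the complex concentration.
    All concentrations must be in the same unit (here µM).
    """
    s = protein_total + dna_total + kd
    complex_conc = (s - math.sqrt(s * s - 4.0 * protein_total * dna_total)) / 2.0
    return complex_conc / dna_total


def apparent_ec50(dna_total, kd):
    """Total protein concentration at which half of the DNA is bound (1:1 model)."""
    return kd + dna_total / 2.0


KD = 0.7  # µM, hypothetical value chosen only for illustration

ec50_low_dna = apparent_ec50(0.02, KD)   # 20 nM DNA, as in the MST setup
ec50_high_dna = apparent_ec50(1.0, KD)   # 1 µM DNA, as in the EMSA setup
```

      With these illustrative numbers, the half-saturation point moves from about 0.71 µM at 20 nM DNA to 1.2 µM at 1 µM DNA, even though the underlying KD is unchanged. An EC50 read from a Hill fit can therefore overestimate the KD once the DNA concentration approaches it.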

      Therefore, although measurements of EC50 and Hill coefficient from EMSA are reproducible, they may be confusing for quantifying apparent affinity values through EC50. Nevertheless, this quantitative analysis of EMSA, requested by the reviewer, has highlighted an interesting characteristic of the KER mutant that is consistent across both methods: even though the EMSA pointed out by the reviewer (Figures S3h and S3i compared to the wild-type control in Figure 3d and Figure S3f) show similar EC50 values, the binding cooperativity is different. Binding curves for the KER mutants are no longer cooperative (Hill coefficient ~1), and this is observed for all KER curves (isolated Pcf1_KER domain and the entire SpCAF-1 complex) with both methods, EMSA and MST. We thus decided to emphasize this characteristic of the KER mutant in the text (page 9 line 30-32). “Importantly, this mutant also shows a lower binding cooperativity for DNA binding, as estimated by the Hill coefficient value close to 1, compared to values around 3 for the WT and other mutants.”

      Since EMSA quantifications did not show a loss of “affinity” (as measured by the EC50 value) for the KER* mutants compared to the WT, contrary to MST measurements, and because the DNA concentration was close to the measured EC50, we consider that EC50 values calculated by EMSA do not represent a KD value. If we added this quantification, we would have to discuss this point in detail. Thus, for the sake of clarity, we prefer to present the EMSA measurements in the manuscript as illustrations and qualitative validations of the interaction, without including the quantification.

      Author response image 1.

      Quantitative analysis of interaction with DNA by EMSA. a: quantification of the amount of bound DNA for the Pcf1_KER domain (blue points with error bars). The fit with a KD model is shown as a dashed line, and the fit with a Hill model with a solid line. b: Examples of quantifications and fits (Hill model) for reconstituted SpCAF-1 WT and mutants. c: Examples of quantifications and fits (Hill model) for Pcf1_KER domains WT and mutant. d: EC50 values and Hill coefficients obtained for all EMSA experiments presented in Figure 3 and S3.

      1. As with the cooperative DNA binding of CAF-1, it is very important to show the stoichiometry of CAF-1 to the DNA or the site size. Given a long alpha-helix of the KER domain with biased charges, it is also interesting to show a model of how the dsDNA binds to the long helix with a cooperative binding property (this is not essential but would be helpful if the authors discuss it).

      We agree that having a molecular model for the binding of the KER helix to DNA would be especially interesting, but at this point, considering the accuracy of the tools currently at our disposal for predicting DNA-protein interactions, such a model would remain highly speculative.

      1. Figure 5 shows nucleosome assembly by SpCAF-1. SpCAF-1-PIP* mutant produced a product with faster mobility than the control at 2 h incubation. How much amounts of SpCAF-1 was added in the reaction seems to be critical. At least a few different concentrations of proteins should be tested.

      The slightly faster migration of the SpCAF-1-PIP* product is not systematically reproduced, and we observed in several experiments that the band corresponding to supercoiled DNA migrated slightly above or below the one for the complementation by the SpCAF-1-WT (see Author response image 2 below). Thus, this indicates that after 2 hours of incubation the supercoiling achieved with the SpCAF-1-PIP* mutant is comparable to that achieved with the SpCAF-1-WT. To further document whether the WT and the PIP* mutant are similar or not, we monitored differences in their nucleosome assembly efficiency by testing their ability to produce supercoiled DNA over a shorter time, after a 45-minute incubation. Under these conditions, we reproducibly detected supercoiled forms at earlier times with SpCAF-1-WT when compared to SpCAF-1-PIP* (see Figure 5 and Author response image 2). These observations indicate that mutation in the PIP motif of Pcf1 affects the rate of supercoiling in a distinct manner when compared to the other mutations that dramatically impair SpCAF-1 capacity to promote supercoiling.

      Author response image 2.

      Minor points:

      1. Page 8, line 26 or Table 1 legend: Please explain what "EC50" is.

      The definition of EC50, together with a reference paper for the Hill model, has been added in the text page 8 lines 23-26, “The curves were fitted with a Hill model (Tso et al. 2018) with an EC50 value of 0.7 ± 0.1 µM (effective concentration at which a 50% signal is observed) and a cooperativity (Hill coefficient, h) of 2.7 ± 0.2, in line with a cooperative DNA binding of SpCAF-1.”, in the Table 1 figure legend and in the method section (page 26).

      1. Page 13, lines 9, 11: "Xenopus" should be italicized.

      This is corrected

      1. Page 14, second half: In S. pombe, the pcf1 deletion mutant is not lethal. It is helpful to mention the phenotype of the deletion mutant a bit more when the authors described the genetic analysis of various pcf1 mutants.

      This point has been added on page 15, line 1.

      1. Figure 1d and Figure S2a: Captions and labels on the X and Y axes are overlapped or misplaced.

      This is corrected

      1. Figure 5: Please add a schematic figure of the assay to explain how one can check the nucleosome assembly by looking at the form I, supercoiled DNAs.

      A new panel has been added to Figure 5. This scheme depicts the supercoiling assay where supercoiled DNA (form I) is used as an indication of efficient nucleosome assembly. The figure legend has also been modified accordingly.

      Reviewer #3 (Public Review):

      Summary:

      The study conducted by Ouasti et al. is an elegant investigation of fission yeast CAF-1, employing a diverse array of technologies to dissect its functions and their interdependence. These functions play a critical role in specifying interactions vital for DNA replication, heterochromatin maintenance, and DNA damage repair, and their dynamics involve multiple interactions. The authors have extensively utilized various in vitro and in vivo tools to validate their model and emphasize the dynamic nature of this complex.

      Strengths:

      Their work is supported by robust experimental data from multiple techniques, including NMR and SAXS, which validate their molecular model. They conducted in vitro interactions using EMSA and isothermal microcalorimetry, in vitro histone deposition using Xenopus high-speed egg extract, and systematically generated and tested various genetic mutants for functionality in in vivo assays. They successfully delineated domain-specific functions using in vitro assays and could validate their roles to a large extent using genetic mutants. One significant revelation from this study is the unfolded nature of the acidic domain, observed to fold when binding to histones. Additionally, the authors also elucidated the role of the long KER helix in mediating DNA binding and enhancing the association of CAF-1 with PCNA. The paper effectively addresses its primary objective and is strong.

      Weaknesses:

      A few relatively minor unresolved aspects persist, which, if clarified or experimentally addressed by the authors, could further bolster the study.

      1. The precise function of the WHD domain remains elusive. Its deletion does not result in DNA damage accumulation or defects in heterochromatin maintenance. This raises questions about the biological significance of this domain and whether it is dispensable. While in vitro assays revealed defects in chromatin assembly using this mutant (Figure 5), confirming these phenotypes through in vivo assays would provide additional assurance that the lack of function is not simply due to the in vitro system lacking PTMs or other regulatory factors.

      Our work demonstrates that the WHD domain is important for CAF-1 function during DNA replication. Indeed, the deletion of this domain leads to synthetic lethality when combined with mutation of the HIRA complex, as observed for a null pcf1 mutant, indicating a severe loss of function in the absence of the WHD domain. We propose that these genetic interactions, previously reported in S. cerevisiae (Kaufman et al. MCB 1998; Krawitz et al. MCB 2002), are indicative of defective histone deposition by CAF-1. Moreover, our work establishes that this domain is dispensable to prevent DNA damage accumulation and to maintain silencing at centromeric heterochromatin, indicating that the WHD domain specifies CAF-1 functions. Our work further demonstrates that, in contrast to the S. cerevisiae and human WHD domain, the S. pombe counterpart exhibits no DNA binding activity. We thus agree that the WHD domain may contribute to nucleosome assembly in vivo via PTMs or interactions with regulatory factors that may be lacking in in vitro systems. However, addressing these aspects deserves further investigation beyond the scope of this article.

      1. The observation of increased Pcf2-gfp foci in pcf1-ED cells, particularly in mono-nucleated (G2phase) and bi-nucleated cells with septum marks (S-phase), might suggest the presence of replication stress. This could imply incomplete replication in specific regions, leading to the persistence of Caf1-ED-PCNA factories throughout the cell cycle. To further confirm this, detecting accumulated single-stranded DNA (ssDNA) regions outside of S-phase using RPA as an ssDNA marker could be informative.

      We cannot formally exclude that cells expressing the Pcf1-ED mutated form exhibit incomplete replication in specific regions, an aspect that would require careful investigation. However, the microscopy analysis (Fig. 6c and S6c) of this mutant showed no alteration in cell morphology, including the absence of elongated cells compared to wild type, a hallmark of checkpoint activation caused by ssDNA (Enoch et al. Genes & Dev 1992). Therefore, investigating the consequences of the interplay between the binding of CAF-1 to PCNA and histones on the dynamics of DNA replication is of particular interest but out of the scope of the current manuscript.

      1. Moreover, considering the authors' strong assertion of histone binding defects in ED through in vitro assays (Figure 2d and S2a), these claims could be further substantiated, especially considering that some degree of histone deposition might still persist in vivo in the ED mutant (Figure 7d, viable though growth defective double ED*+hip1D mutants). For example, the approach, akin to the one employed in Fig. 6a (FLAG-IPs of various Pcf1-FLAG-tagged mutants), could also enable a comparison of the association of different mutants with histones and PCNA, providing a more thorough validation of their findings.

      We have provided in the current manuscript data establishing how Pcf1 mutated forms interacted with PCNA (Fig. 6a, 6b). Regarding the interactions with histone H3-H4, the approach based on immunoprecipitation using various Pcf1-FLAG tagged mutants has been unsuccessful in our hands. Indeed, we were unable to obtain robust and reproducible interactions between Pcf1 or its various mutated forms and H3-H4. This is likely because Co-IP approaches do not probe for direct interactions. Indirect interactions between Pcf1 and H3-H4 are potentially bridged by additional factors, including the two other subunits of CAF-1, Pcf2 and Pcf3, or Asf1. Therefore, we are not in a position to address in vivo the direct interactions between Pcf1 and histone H3-H4.

      1. It would be valuable for the authors to speculate on the necessity of having disordered regions in CAF1. Specifically, exploring the overall distribution of these domains within disordered/unfolded structures could provide insightful perspectives. Additionally, it's intriguing to note that the significant disparities observed among mutants (ED*, PIP*, and KER*) in in vitro assays seem to become more generic in vivo, except for the indispensability of the WHD-domain. Could these disordered regions potentially play a crucial role in the phase separation of replication factories? Considering these questions could offer valuable insights into the underlying mechanisms at play.

      We agree that the potential mechanistic role of partial disorder in CAF-1 is particularly interesting. Disordered regions of human CAF-1 have been reported to form nuclear bodies with liquid-liquid phase separation properties to maintain HIV latency (Ma et al EMBO J. 2021). As suggested, this raises the question of how disordered domains of Pcf1 could promote phase separation for replication factories, if such a phenomenon happens in vivo. Moreover, numerous factors of the replisome also harbor disordered regions (Bedina, A. et al, 2013. Intrinsically Disordered Proteins in Replication Process. InTech. doi: 10.5772/51673), adding complexity in disentangling such questions experimentally. We have added these elements at the end of the discussion in the revised manuscript (page 20, lines 23-29). “Such plasticity and cross-talks provided by structurally disordered domains might be key for the multivalent CAF-1 functions. Human CAF-1 has been reported to form nuclear bodies with liquid-liquid phase separation properties to maintain HIV latency (Ma et al. 2021). This raises the question of a potential role of the disordered domains of Pcf1, together with other replisome factors harbouring such disordered regions (Bedina 2013), in promoting phase separation of replication factories, if such a phenomenon happens in vivo. Further studies will be needed to tackle these questions.”

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Yamanaka et al.'s research investigates the impact of volatile organic compounds (VOCs), particularly diacetyl, on gene expression changes. By inhibiting histone deacetylase (HDAC) enzymes, the authors were able to observe changes in the transcriptome of various models, including cell lines, flies, and mice. The study reveals that HDAC inhibitors not only reduce cancer cell proliferation but also provide relief from neurodegeneration in fly Huntington's disease models. Although the findings are intriguing, the research falls short in providing a thorough analysis of the underlying mechanisms.

      HDAC inhibitors have been previously shown to induce gene expression changes as well as control cell division and demonstrated to work on disease models. The authors demonstrate diacetyl as a prominent HDAC inhibitor. Though the demonstration of diacetyl is novel, several similar molecules have been used before.

      In this manuscript we are not trying to understand the mechanisms by which HDAC inhibitors affect Huntington’s disease or cancer, since these have been studied in detail before and are outside the scope of this manuscript. Our focus is to demonstrate that volatile odorants commonly found in the environment can inhibit HDACs, alter gene expression, and have downstream physiological effects. To the best of our knowledge this unusual effect of odorants has not been systematically described before.

      Reviewer #2 (Public Review):

      The study by Sachiko et al. presents strong evidence that implicates environmental volatile odorants, particularly diacetyl, in an alternate role as an inhibitor of HDAC proteins and gene expression. HDACs are histone deacetylases that generally have a repressive role in gene expression. In this paper the authors test the hypothesis that diacetyl, which is a compound emitted by rotting food sources, can diffuse through the blood-brain barrier and cell membranes to directly modulate HDAC activity and alter gene expression in a neural activity-independent manner. This work is significant because the authors also link modulation of HDAC activity by diacetyl exposure to transcriptional and cellular responses to present it as a potential therapeutic agent for neurological diseases, such as inhibition of neuroblastoma and neurodegeneration.

      The authors first demonstrate that exposure to diacetyl, and some other odorants, inhibits deacetylation activity of specific HDAC proteins using in vitro assays, and increases acetylation of specific histones in cultured cells. Consistent with a role for diacetyl in HDAC inhibition, the authors find dose-dependent alterations in gene expression in different fly and mice tissues in response to diacetyl exposure. In flies they first identify a decrease in the expression of chemosensory receptors in olfactory neurons after exposure to diacetyl. Subsequently, they also observe large gene expression changes in the lungs, brain, and airways in mice. In flies, some of the gene expression changes in response to diacetyl are partially reversible and show an overlap with genes that alter expression in response to treatment with other HDAC inhibitors. Given the use of HDAC inhibitors as chemotherapy agents and treatment methods for cancers and neurodegenerative diseases, the authors hypothesize that diacetyl as an HDAC inhibitor can also serve similar functions. Indeed, they find that exposure of mice to diacetyl leads to a decrease in the brain expression of many genes normally upregulated in neuroblastomas, and selectively inhibited proliferation of cell lines which are derived from neuroblastomas. To test the potential for diacetyl in treatment of neurodegenerative diseases, the authors use the fly Huntington's disease model, utilizing the overexpression of Huntingtin protein with expanded poly-Q repeats in the photoreceptor rhabdomeres which leads to their degeneration. Exposing these flies to diacetyl significantly decreases the loss of rhabdomeres, suggesting a potential for diacetyl as a therapeutic agent for neurodegeneration.

      The findings are very intriguing and highlight environmental chemicals as potent agents which can alter gene expression independent of their action through chemosensory receptors.

      We thank the reviewer for the encouraging comments.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      1) The results section for figure 1 seems poorly written with errors in figure citations. Please rewrite this section.

      We thank the reviewer for pointing it out and have now rewritten the results section as well as made concomitant changes in the introduction to address this comment.

      2) Discussion could be more focused and could speculate mechanistic details of HDAC inhibitors in rescue of neurodegeneration.

      We have added information about the mechanistic role of HDAC inhibition in the rescue of neurodegeneration. “Exposure to diacetyl volatiles in the fly model of Huntington’s disease reduces cell degeneration, as has been previously observed with orally administered HDAC inhibitors like sodium butyrate and SAHA in this genetic model (27). Previous studies indicate that the inhibition of HDACs counters the acetyltransferase inhibitory activity of the polyglutamine-domain of the human Htt protein which binds to p300, P/CAF and CBP (27).”

      A few minor comments are:

      1) Figure 1 is not properly cited in the text (e.g., line 137: it is not relevant to Fig 1B but to 1C)

      We thank the referee for pointing out our error and have now corrected it.

      2) Some abbreviations were not expanded at first use, which made the statements difficult to understand (e.g., line 51: VOC; line 111: Or)

      We have now defined abbreviations the first time they appear in the manuscript.

      3) Line 98- What was the unit when you mention 0.01%?

      We have added (v/v) in the text to represent the standard volume / total volume. We have also described it in the method section.

      4) Line 138- there is no comparative study done with b-HB, but the authors have claimed it was comparable. If it’s from a previous study, a relative comparative statement could be given.

      We apologize for the confusion. We have added the IC50 values previously reported for b-hydroxybutyrate, “IC50 for HDAC1: 5.3 mM and HDAC3 2.4 mM”, as shown in reference #21.

      5) In lines 146-150, more details of what are the compounds and how similar they are to diacetyl could be added

      We have added representative structures and names for the chemicals tested in Figure 1C.

      6) In line 160, Why specifically they increase H3K14 acetylation?

      This observed increase in H3K9 (not H3K14) acetylation levels is identical to what has previously been reported for b-hydroxybutyrate. We have added a sentence pointing out this similarity: “preferential acetylation of H3K9 was also observed in HEK293 cells with b-hydroxybutyrate (reference #21)”.

      7) In line 317, How HDAC inhibitors reverse the PolyQ disorder? What is its mechanism? Can at least discuss in the discussion section.

      Our assay is based on a previous publication using the Drosophila model (Ref #27), which evaluated the mechanisms in detail. We have now added a section in the Discussion describing the past findings. “Exposure to diacetyl volatiles in the fly model of Huntington’s disease reduces cell degeneration, as has been previously observed with orally administered HDAC inhibitors like sodium butyrate and SAHA in this genetic model (27). Previous studies indicate that the inhibition of HDACs counters the acetyltransferase inhibitory activity of the polyglutamine-domain of the Htt protein which binds to p300, P/CAF and CBP (27).”

      8) In figures, 1C and 1D, proper labeling of drug molecules is missing. Check 1D- Could have included Diacetyl for comparison, Where is the uninhibited control (negative)?

      We have added the name of the chemical compounds to Figure 1C and 1D. Each compound tested has a separate blank control, which forms the basis for calculation of the percentage inhibition. The negative control is therefore part of each column.

      Reviewer #2 (Recommendations For The Authors):

      As specific feedback for the authors, I have a few questions/recommendations about the main point of the paper:

      a. Throughout the manuscript, the authors demonstrate gene expression differences in different tissues in flies and mice in response to exposure to diacetyl using both transgenic reporter expression and RNAseq. The authors mention they were able to show that these gene expression changes are independent of neural activity, yet I am not sure which experiment specifically demonstrates this. How do the authors know that these changes in gene expression are due to diacetyl reaching the brain after passing the blood-brain barrier and not due to changes in gene expression with olfactory circuit activity? I acknowledge that proving that the gene expression differences are independent of neural activity is difficult, but one question is whether inhibiting neural activity results in changes in the expression of overlapping genes in the same direction. Or, for example, if one inhibits neural activity in Gr21a neurons, do they reversibly shut down expression of the receptor after a few days? Is this true for other ORs or specific to Gr21a and Gr63a?

      While it is difficult to completely rule out contributions of the olfactory effects in the brain, we also report differential gene expression in the lungs of mice where we do not expect olfactory circuit activity (Fig 3D-G). The overlap in DEGs is highly statistically significant between the organs suggesting at least some commonality in mechanism (Fig 5D). We recently evaluated a Drosophila tissue that does not express odorant receptors or connections, the ovaries, and also found substantial evidence of diacetyl-exposed modulation of genes. While the data are intended for a different publication, we found up to 123 up and 61 downregulated DEGs (FDR cutoff <0.05 and log2 fold change cutoff of 1 and -1). These data should also be viewed together with the in vitro HDAC inhibition data and the increased histone acetylation seen in cell lines.
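
      The thresholds quoted above (FDR < 0.05 and a log2 fold change cutoff of 1 and -1) can be expressed as a simple filter. This is a hedged sketch: the gene names and values are invented, `classify_deg` is a hypothetical helper, and treating the fold-change cutoff as inclusive is an assumption not stated in the response.

```python
def classify_deg(log2_fold_change, fdr, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Label a gene as 'up', 'down', or 'ns' (not significant).

    Assumes an inclusive fold-change cutoff (>= 1 or <= -1) and a strict
    FDR cutoff (< 0.05); the original analysis may differ in these details.
    """
    if fdr >= fdr_cutoff:
        return "ns"
    if log2_fold_change >= lfc_cutoff:
        return "up"
    if log2_fold_change <= -lfc_cutoff:
        return "down"
    return "ns"


# Hypothetical results table: gene -> (log2 fold change, FDR)
results = {
    "geneA": (2.3, 0.001),
    "geneB": (-1.4, 0.020),
    "geneC": (0.4, 0.001),   # significant but below the fold-change cutoff
    "geneD": (3.0, 0.200),   # large change but not significant
}

labels = {gene: classify_deg(lfc, fdr) for gene, (lfc, fdr) in results.items()}
n_up = sum(1 for label in labels.values() if label == "up")
n_down = sum(1 for label in labels.values() if label == "down")
```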

      b. Is diacetyl detected by any chemosensory receptors in flies or mice? RNA profiles from these receptor mutants can be used to distinguish whether the gene expression changes are occurring due to neural activity or direct ability of diacetyl to alter HDAC activity. One relatively simple experiment would be to test whether differentially expressed genes in the orco mutant antennae overlap at all with antennal RNA profiles from diacetyl exposed flies.

      Diacetyl can be detected by multiple chemosensory receptors in flies and mice. In flies the Gr21a+Gr63a complex expressing neurons are inhibited by diacetyl as indicated, and Or9a, Or43b, Or59b, Or67a, and Or85b are activated receptors (Hallem, Cell, 2006). It would be an extremely resource- and time-consuming process to create and evaluate single mutants or combinations of mutants as suggested. In response to the previous point, we noted examples of tissues without olfactory receptors or olfactory circuits showing DEGs upon diacetyl exposure.

      As suggested by the referee, we compared DEGs from RNASeq data of Orco mutant antenna (N=2 replicates) generated for another project. There is very little overlap between antennal DEGs from Orco and the diacetyl-exposed flies (labelled in the chart as d4on_up and d4on_down). These data suggest that large-scale silencing of antennal neurons in Orco mutants does not alter expression of the same genes as altered by exposure to diacetyl.

      Author response image 1.

      c. The comparison of DEGs from individuals exposed to diacetyl versus the other two HDAC inhibitors shows some overlap. The overlap is greater for DEGs shared between the two HDAC inhibitors. Yet, there is still a substantial number of genes that are unique to diacetyl exposure. For example, if you compare SB to VA exposure, each condition has about 150-200 genes uniquely misexpressed for each condition with about 55 genes shared. However, the number of uniquely misexpressed genes is over 600 for diacetyl exposed individuals, with only 30 and 100 genes shared with either SB and VA respectively. I would have expected a higher overlap in DEGs if these compounds all inhibit similar HDACs. Do they inhibit different HDACs? Can this explain the significant number of uniquely misexpressed genes in each condition?

      It is difficult to judge the significance of overlap between DEG sets simply by evaluating numbers (the genome has around 13,000 genes) without statistical analysis, which we noted in the text. “A pairwise analysis using the Fisher’s exact test of each gene set revealed a statistically significant overlap of diacetyl-induced genes with SB-induced genes (p=6x10⁻¹¹) and with VA-induced genes (p=2x10⁻⁶⁵) (Figure 4F).”
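
      To illustrate why raw overlap counts can mislead, the one-sided Fisher's exact test for enrichment of an overlap is equivalent to an upper-tail hypergeometric probability and can be computed directly. The sketch below uses invented gene-set sizes and a hypothetical function name; it does not attempt to reproduce the p-values reported in the manuscript.

```python
from math import comb


def overlap_p_value(universe, size_a, size_b, overlap):
    """One-sided Fisher's exact (hypergeometric) p-value, P(X >= overlap).

    universe: number of genes considered (e.g. all genes in the genome)
    size_a, size_b: sizes of the two gene sets
    overlap: number of genes shared by both sets
    """
    total = comb(universe, size_b)
    upper = min(size_a, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: two gene sets drawn from a 13,000-gene genome.
# By chance alone, about 700 * 250 / 13000 (roughly 13.5) shared genes are expected.
p_enriched = overlap_p_value(13000, 700, 250, 100)
p_chance_level = overlap_p_value(13000, 700, 250, 13)
```

      In this made-up example an overlap of 100 genes is far above the chance expectation and gives a very small p-value, whereas an overlap of 13 genes is what chance alone would produce. A modest-looking overlap can therefore still be highly significant when both sets are small relative to the genome.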

      We have also further clarified in the text “This highly significant overlap among upregulated genes lends further support to our model that diacetyl vapors act as an HDAC inhibitor in vivo. As expected, each of the 3 treatments also modulated a substantial number of unique genes (Figure 4G,H), suggesting that differences in delivery format (oral vs vapor delivery), molecular structure and inhibition profile across the repertoire of HDACs may contribute to differences in gene regulation.”

      d. The authors show changes in RNA profiles in response to diacetyl exposure in different tissues and suggest that these are due to changes in histone acetylation without direct comparison of genes that show up or down regulation with acetylation patterns. They do show in the beginning that diacetyl inhibits HDAC function in vitro and in cell culture. Yet it is critical that they also show a general increase in acetylation levels within tissues profiled for RNA. Additional experiments profiling chromatin and histone acetylation patterns in the tissues where RNA is profiled from would strengthen the argument of the paper.

      We agree with the referee’s suggestion and appreciate it. However, given the heterogeneity of the cell types, and therefore of the histone marks in chromatin, within the tissues that we analyzed, we estimate that it will require substantial effort to purify or enrich specific cell populations before performing ChIP-seq. Future studies will examine correlations between up- and down-regulated genes and histone acetylation patterns in these cells. This effort will require significant resources and time, which we feel are outside the scope of this manuscript.

      e. The rhabdomere experiments might benefit from a negative control. Can the authors expose the flies to another volatile and show neurodegeneration is not affected?

      We exposed the negative control group to headspace odorants of paraffin oil which is a mixture of hydrocarbons.

      f. The same is true for the initial HDAC activity profiles from Figure 1. Can the authors show an HDAC activity that is not affected by diacetyl exposure?

      We exposed the negative control group to headspace odorants of paraffin oil, which is a mixture of hydrocarbons. Diacetyl shows very little inhibition (average inhibition = 7.69%; N = 2) of purified human HDAC4 when tested at a concentration of 15 mM.

      g. One point that might require some explanation in the discussion is why diacetyl exposure only increases acetylation of certain histones but not others in Figure 2, especially given that many HDACs are inhibited by diacetyl in Figure 1.

      Please see response to comment #6, Reviewer 1.

      h. Figure S1C is missing descriptions of what different histogram colors signify.

      We apologize for the oversight and have now indicated it in the Figure legend.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank you for the time you took to review our work and for your feedback! The main changes to the manuscript are: 

      (1) We have added additional analysis of running onsets in closed and open loop conditions for audiomotor (Figure 2H) and visuomotor (Figure 3H) coupling.  

      (2) We have also added analysis of running speed and pupil dilation upon mismatch presentation (Figures S2A and S2B, S4A and S4B, and S5A and S5B).

      (3) We have expanded on the discussion of the nature of differences between audiomotor and visuomotor mismatches.

      Reviewer #1:

      The manuscript presents a short report investigating mismatch responses in the auditory cortex, following previous studies focused on the visual cortex. By correlating the mouse locomotion speed with acoustic feedback levels, the authors demonstrate excitatory responses in a subset of neurons to halts in expected acoustic feedback. They show a lack of responses to mismatch in the visual modality. A subset of neurons show enhanced mismatch responses when both auditory and visual modalities are coupled to the animal's locomotion. 

      While the study is well-designed and addresses a timely question, several concerns exist regarding the quantification of animal behavior, potential alternative explanations for recorded signals, correlation between excitatory responses and animal velocity, discrepancies in reported values, and clarity regarding the identity of certain neurons. 

      Strengths: 

      (1) Well-designed study addressing a timely question in the field. 

      (2) Successful transition from previous work focused on the visual cortex to the auditory cortex, demonstrating generic principles in mismatch responses. 

      (3) The correlation between mouse locomotion speed and acoustic feedback levels provides evidence for a prediction signal in the auditory cortex. 

      (4) Coupling of visual and auditory feedback shows putative multimodal integration in the auditory cortex. 

      Weaknesses: 

      (1) Lack of quantification of animal behavior upon mismatches, potentially leading to alternative interpretations of recorded signals. 

      (2) Unclear correlation between excitatory responses and animal velocity during halts, particularly in closed-loop versus playback conditions. 

      (3) Discrepancies in reported values in a few figure panels raise questions about data consistency and interpretation. 

      (4) Ambiguity regarding the identity of the [AM+VM] MM neurons. 

      The manuscript is a short report following up on a series of papers focusing on mismatch responses between sensory inputs and predicted signals. While previous studies focused on the visual modality, here the authors moved to the auditory modality. By pairing mouse locomotion speed to the sound level of the acoustic feedback, they show that a subpopulation of neurons displays excitatory responses to halts in the (expected) acoustic feedback. These responses were lower in the open-loop state, when the feedback was uncorrelated to the animal locomotion. 

      Overall it is a well-designed study, with a timely and well-posed question. I have several concerns regarding the nature of the MM responses and their interpretations. 

      - One lacks quantification of the animal behavior upon mismatches. Behavioral responses may trigger responses in the mouse auditory cortex, and this would be an alternative explanation to the recorded signals. 

      What is the animal speed following closed-loop halts (we only have these data for the playback condition)? 

      We have quantified the running speed of the mouse following audiomotor and visuomotor mismatches. We found no evidence of a change in running speed. We have added this to Figures S2A and S4A, respectively.

      Is there any pupillometry to quantify possible changes in internal states upon halts (both closed-loop and playback)?

      The term 'internal state' may be somewhat ambiguous in this context. We assume the reviewer is asking whether we have any evidence for possible neuromodulatory changes. We know that there are noradrenergic responses in visual cortex to visuomotor mismatches (Jordan and Keller, 2023), but no cholinergic responses (Yogesh and Keller, 2023). Pupillometry, however, is likely not always sensitive enough to pick up these responses. With very strong neuromodulatory responses (e.g. to air puffs, or other startling stimuli), pupil dilation is of course detected, but this effect is likely at best threshold linear. Looking at changes in pupil size following audiomotor and visuomotor mismatch responses, we found no evidence of a change. We have added this to Figures S2B and S4B, respectively. Note, we suspect this is also strongly experience-dependent. The first audio- or visuomotor mismatch the mouse encounters is likely a more salient stimulus (to the rest of the brain, not necessarily to auditory or visual cortex), than the following ones.  

      These quantifications must be provided for the auditory mismatches but also for the VM or [AM+VM] mismatches.  

      During the presentation of multimodal mismatches [AM + VM], mice did not exhibit significant changes in running speed or pupil diameter. These data have been now added to Figures S5A and S5B.

      - AM MM neurons supposedly receive a (excitatory) locomotion-driven prediction signal. Therefore the magnitude of the excitation should depend on the actual animal velocity. Does the halt-evoked response in a closed loop correlate with the animal speed during the halt? Is the correlation less in the playback condition? 

      This is indeed what one would expect. We fear, however, that we don’t have sufficient data to address this question properly. Moreover, there is an important experimental caveat that makes the interpretation of the results difficult. In addition to the sound we experimentally couple to the locomotion speed of the mouse, the mouse self-generates sound by running (the treadmill rotating, changes to the airflow of the air-supported treadmill, footsteps, etc.). These sources of sound all also correlate in intensity with running speed. Thus, it is not entirely clear how our increase in sound amplitude with increasing running speed relates to the increase in self-generated sounds on the treadmill. This is one of the key reasons we usually do this type of experiment in the visual system where experimental control of visual flow feedback (in a given retinotopic location) is straightforward. 

      Having said that, if we look at how mismatch responses change as a function of locomotion speed across the entire population of neurons, there appears to be no systematic change with running speed (and the effects are highly dependent on the speed bins we choose). However, just looking at the most audiomotor mismatch responsive neurons, we find a trend for increased responses with increasing running speed (Author response image 1). We analyzed the top 5% of cells that showed the strongest response to mismatch (MM) and divided the MM trials into three groups based on running speed: slow (10-20 cm/s), middle (20-30 cm/s), and fast (>30 cm/s). Given the fact that we have on average 14 mismatch events in total per neuron, we don’t have sufficient data to analyze this.

      Author response image 1.

      The average response of strongest AM MM responders to AM mismatches as a function of running speed (data are from 51 cells, 11 fields of view, 6 mice). 
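
      The speed-binned analysis described above can be sketched as follows. This is an illustrative sketch, not the code used to produce the figure. The bin edges come from the text, but the handling of the bin boundaries (a trial at exactly 30 cm/s is counted as fast, and trials below 10 cm/s are dropped) and the name `mean_response_by_speed` are assumptions.

```python
from statistics import mean

# Speed bins in cm/s as given in the response text; boundary handling is assumed.
SPEED_BINS = {
    "slow": (10.0, 20.0),
    "middle": (20.0, 30.0),
    "fast": (30.0, float("inf")),
}


def mean_response_by_speed(trials):
    """Average mismatch response per speed bin.

    `trials` is an iterable of (running_speed_cm_s, response_dff) pairs.
    Trials slower than the lowest bin edge are ignored. Bins without any
    trials map to None rather than to a misleading zero.
    """
    grouped = {name: [] for name in SPEED_BINS}
    for speed, response in trials:
        for name, (low, high) in SPEED_BINS.items():
            if low <= speed < high:
                grouped[name].append(response)
                break
    return {
        name: (mean(values) if values else None)
        for name, values in grouped.items()
    }
```

      With roughly 14 mismatch events per neuron split across three bins, each bin holds only a handful of trials, which is why the resulting averages are sensitive to where the bin edges are placed.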

      Values in Figure 2H are way higher than what can be observed in Figures 2C, and D. Could you explain the mismatch in values? Same for 3H and 4F. 

      In Figure 2H (now Figure S2F), we display responses from 4755 individual neurons. Since most recorded neurons did not exhibit significant responses to mismatch presentations, their responses cluster around zero, significantly contributing to the final average shown in panel D. To clarify how individual neurons contribute to the overall population activity, we have added a histogram showing the distribution of neurons responding to audiomotor mismatch and sound playback halts. We hope this addition clarifies how individual neuron responses affect the final population activity.
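
      The arithmetic behind this dilution effect can be made concrete with a small sketch. The numbers below are invented for illustration and are not taken from the dataset; `summarize_population` and its `threshold` argument are hypothetical names.

```python
from statistics import mean


def summarize_population(responses, threshold):
    """Return (population mean, fraction of responders, mean of responders).

    A neuron counts as a responder when its response exceeds `threshold`.
    """
    responders = [r for r in responses if r > threshold]
    fraction = len(responders) / len(responses)
    responder_mean = mean(responders) if responders else 0.0
    return mean(responses), fraction, responder_mean


# Invented example: 50 neurons respond with 10% dF/F, 950 do not respond at all.
example = [10.0] * 50 + [0.0] * 950
population_mean, fraction, responder_mean = summarize_population(example, 1.0)
print(population_mean, fraction, responder_mean)  # 0.5 0.05 10.0
```

      In this toy case the population average is twenty times smaller than the response of the individual responsive neurons, so values in a per-neuron scatter plot can be much larger than those in a population-average trace.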

      Furthermore, neurons exhibiting suppression upon closed-loop halts (Figure 2C) show changes in deltaF/F of the same order of magnitude as the AM MM neurons (with excitatory responses). I cannot picture where these neurons are found in the scatter plot of Figure 2H. 

      This is caused by a ceiling effect. While we could adjust the scale of the heat map to capture neurons with very high responses (e.g. [-50 50], Author response image 2), doing so would obscure the response dynamics of most neurons. Note that the number of neurons on the y-axis far exceeds the resolution of this figure and thus there are also aliasing issues that mask the strong responses. 

      Author response image 2.

      Responses of all L2/3 ACx neurons to audiomotor mismatches. Same as Figure 2C with different color scale [-50 50] which does not capture most of the neural activity.  

      - Are [AM+VM] MM neurons AM neurons? 

      Many of the [AM+VM] and [AM] neurons overlap, but they are not exactly the same population. This is partially visible in Figure 4F. There is a subset of neurons (13.7%; red dots, Figure 4F) that selectively responded to the concurrent [AM+VM] mismatch, while a different subset of neurons (11.2%; yellow dots, Figure 4F) selectively responded to the mismatches in isolation. The [VM] response contributes only little to the sum of the two responses [AM] + [VM].

      Please do not use orange in Figure 4F, it is perceptually too similar to red. 

      We have now changed it to yellow. 

      Reviewer #2 (Public Review): 

      In this study, Solyga and Keller use multimodal closed-loop paradigms in conjunction with multiphoton imaging of cortical responses to assess whether and how sensorimotor prediction errors in one modality influence the computation of prediction errors in another modality. Their work addresses an important open question pertaining to the relevance of non-hierarchical (lateral cortico-cortical) interactions in predictive processing within the neocortex. 

      Specifically, they monitor GCaMP6f responses of layer 2/3 neurons in the auditory cortex of head-fixed mice engaged in VR paradigms where running is coupled to auditory, visual, or audio-visual sensory feedback. The authors find strong auditory and motor responses in the auditory cortex, as well as weak responses to visual stimuli. Further, in agreement with previous work, they find that the auditory cortex responds to audiomotor mismatches in a manner similar to that observed in visual cortex for visuomotor mismatches. Most importantly, while visuomotor mismatches by themselves do not trigger significant responses in the auditory cortex, simultaneous coupling of audio-visual inputs to movement non-linearly enhances mismatch responses in the auditory cortex. 

      Their results thus suggest that prediction errors within a given sensory modality are non-trivially influenced by prediction errors from another modality. These findings are novel, interesting, and important, especially in the context of understanding the role of lateral cortico-cortical interactions and in outlining predictive processing as a general theory of cortical function. 

      In its current form, the manuscript lacks sufficient description of methodological details pertaining to the closed-loop training and the overall experimental design. In several scenarios, while the results per se are convincing and interesting, their exact interpretation is challenging given the uncertainty about the actual experimental protocols (more on this below). Second, the authors are laser-focused on sensorimotor errors (mismatch responses) and focus almost exclusively on what happens when stimuli deviate from the animal's expectations. 

      While the authors consistently report strong running-onset responses (during open-loop) in the auditory cortex in both auditory and visual versions of the task, they do not discuss their interpretation in the different task settings (see below), nor do they analyze how these responses change during closed-loop i.e. when predictions align with sensory evidence. 

      However, I believe all my concerns can be easily addressed by additional analyses and incorporation of methodological details in the text. 

      Major concerns: 

      (1) Insufficient analysis of audiomotor mismatches in the auditory cortex: 

      Lack of analysis of the dependence of audiomotor mismatches on the running speed: it would be helpful if the authors could clarify whether the observed audiomotor mismatch responses are just binary or scale with the degree of mismatch (i.e. running speed). Along the same lines, how should one interpret the lack of dependence of the playback halt responses on the running speed? Shouldn't we expect that during playback, the responses of mismatch neurons scale with the running speed? 

      Regarding the scaling of AM mismatch responses with running speed, please see our response to reviewer 1 above to the same question. 

      Regarding the playback halt response and dependence on running speed, we would not expect there to be a dependence. The playback halt response (by design) measures the strength of the sensory response to a cessation of a stimulus (think OFF response). These typically are less strong in cortex than the corresponding ON responses but need to be controlled for (else a mismatch response might just be an OFF response – the prediction error is quantified as the difference between AM mismatch response and playback halt response). Given that sound onset responses only have a small dependence on running state, we would similarly expect sound offset (playback halt) responses to exhibit only minimal dependence on running state. 
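
      The control described here, quantifying the prediction error as the mismatch response minus the playback halt response, amounts to a simple subtraction per neuron. The sketch below assumes trial-wise responses are already available as lists of dF/F values; the function name `prediction_error` is hypothetical and this is not the analysis code from the manuscript.

```python
from statistics import mean


def prediction_error(mismatch_responses, playback_halt_responses):
    """Difference between the mean AM mismatch response and the mean
    playback halt response of one neuron (both in dF/F).

    Subtracting the playback halt response removes the purely sensory
    OFF component that a sound halt evokes on its own.
    """
    return mean(mismatch_responses) - mean(playback_halt_responses)
```

      A neuron whose mismatch response is no larger than its playback halt response yields a value near zero, meaning its activity can be explained by the sound offset alone.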

      Slow temporal dynamics of audiomotor mismatches: despite the transient nature of the mismatches (1s), auditory mismatch responses last for several seconds. They appear significantly slower than previous reports for analogous visuomotor mismatches in V1 (by the same group, using the same methods) and even in comparison to the multimodal mismatches within this study (Figure 4C). What might explain this sustained activity? Is it due to a sustained change in the animal's running in response to the auditory mismatch? 

      This is correct, neither AM nor [AM+VM] mismatch responses return to baseline in the 3 seconds following onset. VM mismatch responses in visual cortex also do not return to baseline in that time window (see e.g. Figure 1E in (Attinger et al., 2017), or Figure 1F in (Zmarz and Keller, 2016)). What the origin or computational significance of this sustained calcium response is, we do not know. In intracellular signals, we do not see this sustained response (Jordan and Keller, 2020). Also peculiar is indeed the fact that in the case of AM mismatch the sustained response is similar in strength to the initial response. But also here, why this would be the case, we do not know. It is conceivable that the initial and the sustained calcium responses have different origins. If the sustained response amplitude is all or nothing, the fact that the AM mismatch response is the smallest of the three could explain why sustained and initial responses are closer than for [AM+VM] or VM (in visual cortex) mismatch responses. All sustained responses appear to be roughly 1% dF/F. There are no apparent changes in running speed or pupil dilation that would correlate with the sustained activity (new panel A in Figure S2).

      (2) Insufficient analysis and discussion of running onset responses during audiomotor sessions: The authors report strong running-onset responses during open-loop in identified mismatch neurons. They also highlight that these responses are in agreement with their model of subtractive prediction error, which relies on subtracting the bottom-up sensory evidence from top-down motor-related predictions. I agree, and, thus, assume that running-onset responses during the open loop in identified 'mismatch' neurons reflect the motor-related predictions of sensory input that the animal has learned to expect. If this is true, one would expect that such running-onset responses should dampen during closed-loop, when sensory evidence matches expectations and therefore cancels out this prediction. It would be nice if the authors test this explicitly by analyzing the running-related activity of the same neurons during closed-loop sessions. 

      Thank you for the suggestion. We now show running onset responses in both closed and open loop conditions for audiomotor and visuomotor coupling (new Figures 2H and 3H). In closed loop, we observe only a transient running onset response. In the open loop condition, running onset responses are sustained. For the visuomotor coupling, running onset responses are sustained in both closed and open loop conditions. This would be consistent with a slightly delayed cancellation of sound and motor related inputs in the audiomotor closed loop condition but not otherwise. 

      (3) Ambiguity in the interpretation of responses in visuomotor sessions. 

      Unlike for auditory stimuli, the authors show that there are no obvious responses to visuomotor mismatches or playback halts in the auditory cortex. However, the interpretation of these results is somewhat complicated by the uncertainty related to the training history of these mice. Were these mice exclusively trained on the visuomotor version of the task or also on the auditory version? I could not find this info in the Methods. From the legend for Figure 4D, it appears that the same mice were trained on all versions of the task. Is this the case? If yes, what was the training sequence? Were the mice first trained on the auditory and then the visual version? 

      The training history of the animals is important to outline the nature of the predictions and mismatch responses that one should expect to observe in the auditory cortex during visuomotor sessions.

      Depending on whether the mice in Figure 3 were trained on visual only or both visual and auditory tasks, the open-loop running onset responses may have different interpretations. 

      a) If the mice were trained only on the visual task, how should one interpret the strong running onset responses in the auditory cortex? Are these sensorimotor predictions (presumably of visual stimuli) that are conveyed to the auditory cortex? If so, what may be their role? 

      b) If the mice were also trained on the auditory version, then a potential explanation of the running-onset responses is that they are audiomotor predictions lingering from the previously learned sensorimotor coupling. In this case, one should expect that in the visual version of the task, these audiomotor predictions (within the auditory cortex) would not get canceled out even during the closed-loop periods. In other words, mismatch neurons should constantly be in an error state (more active) in the closed-loop visuomotor task. Is this the case?

      If so, how should one then interpret the lack of a 'visuomotor mismatch' aligned to the visual halts, over and above this background of continuous errors? 

      As such, the manuscript would benefit from clearly stating in the main text the experimental conditions such as training history, and from discussing the relevant possible interpretations of the responses. 

      Mice were not trained on either audiomotor or visuomotor coupling and were reared normally. Prior to the recording day, the mice were habituated to running on the air-supported treadmill without any coupling for up to 5 days. On the first recording day, the mice experienced all three types of sessions (audiomotor, visuomotor, or combined coupling) in a random order for the first time. We have clarified this in the methods. 

      Regarding the question of how one should interpret the strong running onset responses in the auditory cortex, this is complicated by the fact that – unless mice are raised visually or auditorily deprived – they always have life-long experience with visuomotor or audiomotor coupling. The visuomotor coupling they experience in VR is geometrically matched to what they would experience by moving in the real world. For the audiomotor coupling the exact relationship is less clear, but there is a diverse set of sound sources that scale in loudness with increasing running speed. Hence running onset responses reflect either such learned associations (as the reviewer also speculates), or spurious input. Rearing mice without coupling between movement and visual feedback does not abolish movement related responses in visual cortex (Attinger et al., 2017); to the contrary, it enhances them considerably. We suspect this reflects visual cortex being recruited for other functions in the absence of visual input. But given the data we have, we cannot distinguish the different possible sources of running related responses. It is very likely that any “training” related effect we could achieve in a few hours pales in comparison to the life-long experience the mouse has in the world.

      Regarding the lack of a 'visuomotor mismatch' aligned to the visual halts, we are not sure we understand. Our interpretation is that there are no (or only a very small - we speculate that any nonzero VM mismatch response is just inherited from visual cortex) VM mismatch responses in auditory cortex above chance. Our data are consistent with the interpretation that there is no opposition of bottom up visual and top down motor related input in auditory cortex, hence no VM mismatch responses (independent of how strong the top-down motor related input is). This is of course not surprising – this is more of a sanity check and becomes relevant in the context of interpreting AM+VM responses. 

      (4) Ambiguity in the interpretation of responses in multimodal versus unimodal sessions. 

      The authors show that multimodal (auditory + visual) mismatches trigger stronger responses than unimodal mismatches presented in isolation (auditory only or visual only). Further, they find that even though visual mismatches by themselves do not evoke a significant response, co-presentation of visual and auditory stimuli non-linearly augments the mismatch responses suggesting the presence of nonhierarchical interactions between various predictive processing streams. 

      In my opinion, this is an important result, but its interpretation is nuanced given insufficient details about the experimental design. It appears that responses to unimodal mismatches are obtained from sessions in which only one stimulus is presented (unimodal closed-loop sessions). Is this actually the case? An alternative and perhaps cleaner experimental design would be to create unimodal mismatches within a multimodal closed-loop session while keeping the other stimulus still coupled to the movement. 

      This is correct, unimodal mismatches were acquired in unimodal coupling. Testing unimodal mismatch responses in multimodally coupled VR is an interesting idea that we had initially even pursued. However, halting visual flow in a condition of coupling of both visual flow and sound amplitude to running speed has an additional complication. Introducing an audiomotor mismatch in this coupling inherently also creates an audiovisual (AV) mismatch, and the same applies to visuomotor mismatches, which cause a concurrent visuoaudio (VA) mismatch (Author response image 3). This assumes that there are cross modal predictions from visual cortex to auditory cortex as there are from auditory cortex to visual cortex (Garner and Keller, 2022). There are interesting differences between the different types of mismatches, but with all the necessary passive controls this quickly exceeded the amount of data we could reasonably acquire for this paper. This remains an interesting question for future research.

      Author response image 3.

      Rationale of unimodal mismatches introduced within multimodal paradigm. 

      Given the current experiment design (if my assumption is correct), it is unclear if the multimodal potentiation of mismatch responses is a consequence of nonlinear interactions between prediction/error signals exchanged across visual and auditory modalities. Alternatively, could this result from providing visual stimuli (coupled or uncoupled to movement) on top of the auditory stimuli? If it is the latter, would the observed results still be evidence of non-hierarchical interactions between various predictive processing streams? 

      Mice are not in complete darkness during the AM mismatch experiments (the VR is off, but there is low ambient light in the experimental rooms primarily from computer screens), so we can rule out the possibility that the difference comes from having “no” visual input during AM mismatch responses. Addressing the question of whether it is this particular stimulus that causes the increase would require an experiment in which we couple sound amplitude but keep visual flow open loop. We did not do this, but also think this is highly unlikely. However, as described above, we did do an experiment in which we coupled both sound amplitude and visual flow to running, and then either halted visual flow, or sound amplitude, or both. Comparing the [AM+VM] and [AM+AV] mismatch responses, we find that [AM+VM] responses are larger than [AM+AV] responses as one would expect from an interaction between [AM] and [VM] responses (Author response image 4). Finally, the conclusion that there are nonhierarchical interactions of prediction error computations holds either way – if any visual stimulus (either visuomotor mismatch, or visual flow responses) influences audiomotor mismatch responses, this is evidence of non-hierarchical interactions.

      Author response image 4.

      Average population response of all L2/3 neurons to concurrent [AM + VM] or [AM+AV] mismatch. Gray shading indicates the duration of the stimulus.

      Along the same lines, it would be interesting to analyze how the coupling of visual as well as auditory stimuli to movement influences responses in the auditory cortex in closed-loop in comparison to auditory-only sessions. Also, do running onset responses change in open-loop in multimodal vs. unimodal playback sessions?

      We agree, and this is why we started out doing the experiments described above. We stopped with this, however, because it quickly became a combinatorial nightmare. We will leave addressing the question of how different types of coupling influence responses in auditory cortex to brave future neuroscientists.

      Regarding the question of running onset responses, in both the multimodal and auditory only paradigms, running onset responses are transient; bottom-up sensory evidence is quickly subtracted from top-down motor-related prediction (Author response image 5). While there appears to be a small difference in the dynamics of running onset responses between these two paradigms, it was not significant. Note, we also have much less data than we would like here for this type of analysis. 

      Author response image 5.

      Running onset responses recorded in unimodal and multimodal closed loop sessions (1903 neurons, 16 fields of view, 8 mice)

      We also compared running onsets in open loop sessions and did not find any significant differences between unimodal and multimodal sessions (Author response image 6). We found only six sessions in which animals performed at least two running onsets in each session type, therefore, we do not have enough data to include it in the manuscript. 

      Author response image 6.

      Running onset responses recorded within unimodal and multimodal open loop sessions (659 cells, 6 fields of view, 5 mice).

      Minor concerns and comments:

      (1) Rapid learning of audiomotor mismatches: It is interesting that auditory mismatches are present even on day 1 and do not appear to get stronger with learning (same on day 2). The authors comment that this could be because the coupling is learned rapidly (line 110). How does this compare to the rate at which visuomotor coupling is learned? Is this rapid learning also observable in the animal's behavior i.e. is there a change in running speed in response to the mismatch? 

      In the visual system this is a bit more complicated. If you look at visuomotor mismatch responses in a normally reared mouse, responses are present from the first mismatch (as far as we can tell given the inherently small dataset with just one response per mouse). However, this is of course confounded by the fact that a normally reared mouse has visuomotor coupling throughout life from eye-opening. Raising mice in complete darkness, we have shown that approximately 20 min of coupling are sufficient to establish visuomotor mismatch responses (Attinger et al., 2017).

      Regarding the behavioral changes that correlate with learning, we are not sure what the reviewer would expect. We cannot detect a change in mismatch responses and hence would also not expect to see a change in behavior.

      (2) The authors should clarify whether the sound and running onset responses of the auditory mismatch neurons in Figure 2E were acquired during open-loop. This is most likely the case, but explicitly stating it would be helpful. 

      Both responses were measured in isolation (i.e. VR off, just sound and just running onset), not in an open-loop session. We have clarified in the figure legend that these are the same data as in Figure 1H and N. 

      (3) In lines 87-88, the authors state 'Visual responses also appeared overall similar but with a small increase in strength during running ...'. This statement would benefit from clarification. From Figure S1 it appears that when the animal is sitting there are no visual responses in the auditory cortex. But when the animal is moving, small positive responses are present. Are these actually 'visual' responses - perhaps a visual prediction sent from the visual cortex to the auditory cortex that is gated by movement? If so, are they modulated by features of visual stimuli eg. contrast, intensity? Or, do these responses simply reflect motor-related activity (running)? Would they be present to the same extent in the same neurons even in the dark? 

      This was wrong indeed - we have rephrased the statement as suggested. Regarding the source of visual responses, we use the term “visual response” operationally here agnostic to what pathway might be driving it (i.e. it could be a prediction triggered by visual input). 

      We did not test if recorded visual responses are modulated by contrast or intensity. However, testing whether they are would not help us distinguish whether the responses are ‘visual’ or ‘visual predictions’. Finally, regarding the question about whether they are motor-related responses, this might be a misunderstanding. These are responses to visual stimuli while the mouse is already running (i.e. there is no running onset), hence we cannot test whether these responses are present in the dark (this would be the equivalent of looking at random triggers in the dark while the mouse is running).  

      (4) The authors comment in the text (lines 106-107) about cessation of sound amplitude during audiomotor mismatches as being analogous to halting of visual flow in visuomotor mismatches. However, sound amplitude versus visual flow are quite different in nature. In the visuomotor paradigm, the amount of visual stimulation (photons per unit time) does not necessarily change systematically with running speed. Whereas, in the audiomotor paradigm, the SNR of the stimulus itself changes with running speed which may impact the accuracy of predictions. On a broader note, under natural settings, while the visual flow is coupled to movement, sound amplitude may vary more idiosyncratically with movement. 

      This is a question of coding space. The coding space of visual cortex of the mouse is probably visual flow (or change in image) not number of photons. This already starts in the retina. The demonstration of this is quite impressive. A completely static image on the retina will fade to zero response (even though the number of photons remains constant). This is also why most visual physiologists use dynamic stimuli – e.g. drifting gratings, not static gratings – to map visual responses in visual cortex. If responses were linear in number of photons, this would make less of a difference. The correspondence we make is between visual flow (which we assume is the main coding space of mouse V1 – this is not established fact, but probably implicitly the general consensus of the field) and sound amplitude. Responses in auditory cortex are probably more linear in sound amplitude than visual cortex responses are linear in number of photons, but whether that is the correct coding space is still unclear, and as far as we can tell there is no clear consensus in the field. We did consider coupling running speed to frequency, which may work as well, but given the possible equivalence (as argued above) and the fact that we could see similar responses with sound amplitude coupling we did not explore frequency coupling. 

      If visual speed is the coding space of V1, SNR should behave equivalently in both cases. 

      Perhaps such differences might explain why unlike in the case of visual cortex experiments, running speed does not affect the strength of playback responses in the auditory cortex. 

      Possible, but the more straightforward framing of this point is that sensory responses are enhanced by running in visual cortex while they are not in auditory cortex. A playback halt response (by design) is just a sensory response. Why running does not generally increase sensory responses in auditory cortex (L2/3 neurons), but does so in visual cortex, would be the more general version of the same question.

      We fear we have no intelligent answer to this question.  

      Reviewer #3 (Public Review): 

      This study explores sensory prediction errors in the sensory cortex. It focuses on the question of how these signals are shaped by non-hierarchical interactions, specifically multimodal signals arising from same-level cortical areas. The authors used 2-photon imaging of mouse auditory cortex in head-fixed mice that were presented with sounds and/or visual stimuli while moving on a ball. First, responses to pure tones, visual stimuli, and movement onset were characterized. Then, the authors made the running speed of the mouse predictive of sound intensity and/or visual flow. Mismatches were created through the interruption of sound and/or visual flow for 1 second while the animal moved, disrupting the expected sensory signal given the speed of movement. As a control, the same sensory stimuli triggered by the animal's movement were presented to the animal decoupled from its movement. The authors suggest that auditory responses to the unpredicted silence reflect mismatch responses. That these mismatch responses were enhanced when the visual flow was congruently interrupted, indicates the cross-modal influence of prediction error signals. 

      This study's strengths are the relevance of the question and the design of the experiment. The authors are experts in the techniques used. The analysis explores neither the full power of the experimental design nor the population activity recorded with 2-photon, leaving open the question of to what extent what the authors call mismatch responses are not sensory responses to sound interruption. The auditory system is sensitive to transitions and indeed responses to the interruption of the sound are similar in quality, if not quantity, in the predictive and the control situation. The pattern they observe is different from the visuomotor mismatch responses the authors found in V1 (Keller et al., 2012), where the interruption of visual flow did not activate neuronal activity in the decoupled condition. 

      Just to add brief context to this. The reviewer is correct here, the (Keller et al., 2012) paper reports finding no responses to playback halt. However, this was likely a consequence of indicator sensitivity (these experiments were done with what now seems like a pre-historic version of GCaMP). Experiments performed with more modern indicators do find playback halt responses in visual cortex (see e.g. (Zmarz and Keller, 2016)). 

      The auditory system is sensitive to transitions, also those to silence. See the work of the Linden or the Barkat labs on on-off responses, and also that of the Mesgarani lab (Khalighinejad et al., 2019) on responses to transitions 'to clean' (Figure 1c) in the human auditory cortex. Since the responses described in the current work are modulated by movement and the relationship between movement and sound is more consistent during the coupled sessions, this could explain the difference in response size between coupled and uncoupled sessions. There is also the question of learning. Prediction signals develop over a period of several days and are frequency-specific (Schneider et al., 2018). From a different angle, in Keller et al. 2012, mismatch responses decrease over time as one might expect from repetition.

      Also for brief context, this might be a misconception. We don’t find a decrease of mismatch responses in the (Keller et al., 2012) paper – we assume what the reviewer is referring to is the fact that mismatch responses decrease in open-loop conditions (they normally do not in closed-loop conditions). This is the behavior one would expect if the mouse learns that movement no longer predicts visual feedback. 

      It would help to see the responses to varying sound intensity as a function of previous intensity, and to plot the interruption response as a function of both transition and movement in both conditions. 

      Given the large populations of neurons recorded and the diversity of the responses, from clearly negative to clearly positive, it would be interesting to understand better whether the diversity reflects the diversity of sounds used or a diversity of cell types, or both. 

      Comments and questions: 

      Does movement generate a sound and does this change with the speed of movement? It would be useful to have this in the methods. 

      There are three ways to interpret the question – below the answers to all three:

      (1) Running speed is experimentally coupled to sound amplitude of a tone played through a loudspeaker. Tone amplitude is scaled with running speed of the mouse in a closed loop fashion. We assume this is not what the reviewer meant, as this is described in the methods (and the results section). 

      (2) Movements of the mouse naturally generate sounds (footsteps, legs moving against fur, etc.). Most of these sounds trivially scale with the frequency of leg movements – we assume this is also not what the reviewer meant.

      (3) Finally, there are experimental sounds related to the rotation speed of the air supported treadmill that increase with running speed of the mouse. We have added this to the methods as suggested. 

      Figures 1a and 2a. The mouse is very hard to see. Focus on mouse, objective, and sensory stimuli? The figures are generally very clear though. 

      We have enlarged the mouse as suggested. 

      1A-K was the animal running while these responses were measured? 

      We did not restrict this analysis to running or sitting and pooled responses over both conditions.  We have made this more explicit in the results section.  

      Data in Figure 1: Since the modulation of sensory responses by movement is relevant for the mismatch responses, I would move this analysis from S1 to Figure 1 and analyze the responses more finely in terms of running speed relative to sound and gratings. I would include here a more thorough analysis of the responses to 8kHz at varying intensities, for example in the decoupled sessions. Does the response adapt? Does it follow the intensity? 

      We agree that these are interesting questions, but they do not directly pertain to our conclusions here. The key point Figure S1 addresses is whether auditory responses are generally enhanced by running (as they are e.g. in visual cortex) – the answer, on average, is no. We have tried emphasizing this more, but it changes the flow of the paper away from our main message, hence we have left the panels in the supplements. 

      Regarding the 8kHz modulation, there is a general increase of the suppression of activity with increasing sound amplitude (Author response image 7 and Author response image 8). But due to the continuously varying amplitude of the stimulus, we do not have sufficient data (or do not know how to with the data we have) to address questions of adaptation. We assume there is some form of adaptation. However, either way, we don’t see how this would change our conclusions. 

      Author response image 7.

      Neural activity as a function of sound level in an AM open loop session. 

      Author response image 8.

      The average sound evoked population response of all ACx layer 2/3 neurons to 60 dB or 75 dB 8 kHz pure tones. Stimulus duration was 1 s (gray shading).

      2C-D why not talk of motor modulation? Paralleling what happens in response to auditory and visual stimuli? 

      This is correct, a mismatch response (we use mismatch here to operationally describe the stimulus – not the interpretation) can be described either as a prediction error (this is the interpretation) or a stimulus specific motor modulation. Note, the key here is “stimulus specific”. It is stimulus specific as there is an approximately 3x change between mismatch and playback halt (the same sensory stimulus with and without locomotion), but basically no change for sound onsets (Figure S1). Having said that, one explanation (prediction error) has predictive power (and hence is testable – see e.g. (Vasilevskaya et al., 2023) for an extensive discussion on exactly this argument for mismatch responses in visual cortex), while the other does not (a “stimulus specific” motor modulation has no predictive value or computational theory behind it and is simply a description). Thus, we choose to interpret it as a prediction error. Note, this finding does not stand in isolation and many of the testable predictions of the predictive processing interpretation have turned out to be correct (see e.g. (Keller and Mrsic-Flogel, 2018) for a review). 

      Note, we try to only use the interpretation of “prediction error” when motivating why we do the experiments, and in the discussion, but not directly in the description of the results (e.g. in Figure 2).  

      How does the mismatch affect the behavior of the mouse? Does it stop running? This could also influence the size of the response. 

      We quantified animal behavior during audiomotor mismatches and did not find any significant acceleration or slowing down upon mismatch events. Thus, neural responses recorded during AM mismatches are unlikely to be explained by changes in animal behavior. These data have been added in Figure S2A and Figure S4A.

      Figure 3. What about neurons that were positively modulated by both grating and movement? How do these neurons respond to the mismatch? 

      Neurons positively modulated by both grating and movement were slightly more responsive to MM than the rest of the population, though this difference was not significant (Author response image 9). This is also visible in Figure 3G – the high VM mismatch responsive neurons are randomly distributed in regard to correlation with running speed and visual flow speed. 

      Author response image 9.

      Responses to visuomotor mismatches of neurons positively modulated by grating and movement and of the remainder of the population.

      Line 176. The authors say 'Thus, in the case of a [AM + VM] mismatch both the halted visual flow and the halted sound amplitude are predicted by running speed' but the mismatch (halted flow and amplitude) is not predicted by the speed, correct? Please rephrase. 

      Thank you for pointing this out – this was indeed phrased incorrectly. We have corrected this. 

      How was the sound and/or visual flow interruption triggered? Did the animal have to run at a minimum speed in order for it to happen?

      Sound and visual flow interruptions were triggered randomly, independent of the animal's running speed. However, for the analysis, only MM presentations during which animals were running at a speed of at least 0.3 cm/s were included. The 0.3 cm/s was simply the (arbitrary) threshold we used to determine if the mouse was running. In a completely stationary mouse a mismatch event will not have any effect (sound amplitude/visual flow speed are already at 0). This is described in the methods section.
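      The inclusion criterion described above can be sketched as follows. This is a minimal illustration and not the analysis code used for the paper: the function name, the example speeds, and the choice to evaluate running speed as a single value per mismatch event are all assumptions made for the example. Only the 0.3 cm/s threshold comes from the text.

```python
# Hypothetical sketch of the trial-inclusion step; not the authors' code.
RUNNING_THRESHOLD_CM_S = 0.3  # threshold stated in the response above


def select_mismatch_trials(speeds_cm_s, threshold=RUNNING_THRESHOLD_CM_S):
    """Return the indices of mismatch events kept for analysis.

    speeds_cm_s: one running-speed value per mismatch event (assumed to be
    a single summary value per event; how speed is summarized is not
    specified in the text).
    """
    return [i for i, speed in enumerate(speeds_cm_s) if speed >= threshold]


# Made-up speeds for five randomly triggered mismatch events.
example_speeds = [0.0, 0.29, 0.3, 5.2, 12.8]
kept_trials = select_mismatch_trials(example_speeds)
print(kept_trials)
```

      Events are triggered regardless of speed and filtered only afterwards, so the filter simply discards events that occurred while the mouse was stationary, when a mismatch has no sensory consequence.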

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The authors addressed how long-range interactions between boundary elements are established and influence their function in enhancer specificity. Briefly, the authors placed two different reporters separated by a boundary element. They inserted this construct ectopically ~140 kb away from an endogenous locus that contains the same boundary element. The authors used expression patterns driven by nearby enhancers as an output to determine which enhancers the reporters interact with. They complemented this analysis with 3D DNA contact mapping. The authors found that the orientation of the boundary element determined which enhancers each reporter interacted with. They proposed that the 3D interaction topology, whether being circular or stem configuration, distinguished whether the interaction was cohesin mediated or through an independent mechanism termed pairing.

      Strengths:

      The transgene expression assays are built upon prior knowledge of the enhancer activities. The 3D DNA contacts confirm that transgene expression correlates with the contacts. Using 4 different orientations covers all combinations of the reporter genes and the boundary placement.

      Weaknesses:

      The interpretation of the data as a refutation of loop extrusion playing a role in TAD formation is not warranted, as the authors did not deplete the loop extruders to show that what they measure is independent.

      (1.1) To begin with, our findings do not exclude the possibility that cohesin loop extrusion has some sort of role in the formation or maintenance of TADs in flies or other aspects of chromosome structure.  On the other hand, it clearly is not determinative in defining the end-points of TADs or in generating the resulting topology (stem-loop or circle-loop).  Our main point, which we feel we have established unequivocally, is that it can’t explain many essential features of TADs or chromosome loops (see below) in Drosophila.  This reviewer agrees with this point in their next paragraph (below).  We also think that the loop extrusion model’s general acceptance as THE driving force behind TAD formation in mammals is unwarranted and not fully consistent with the available data, as explained below.

      As to the reviewer’s specific point regarding depletion of loop extruders, we first note that completely eliminating factors encoding cohesin subunits in fly embryos isn’t readily feasible.  As cohesin is essential starting at the beginning of embryonic development, and is maternally deposited, knockdowns/depletions would likely be incomplete and there would always be some remaining activity.  As long as there is some residual activity—and no disruption in TAD formation is observed—this experimental test would be a failure.  In addition, any defects that are observed might arise not from a failure in TAD formation via loop extrusion but rather because the rapid mitotic cycles would be disrupted.  A far better approach would be to deplete/knockdown cohesin subunits in tissue culture cells, as there is no requirement for the cells to undergo embryonic development.  Moreover, since cell division is relatively slow, the depletion would likely eliminate much if not all of the activity before a checkpoint is reached.

      While a drastic depletion of cohesin is not feasible in our model organism, we would draw the reviewer’s attention to an experiment of this type which has already been done in mammalian tissue culture cells by Goel et al. (Goel et al. 2023).  Unlike most Hi-C studies in mammals, the authors used region capture MicroC (RCMC).  In contrast to published genome-wide mammalian MicroC experiments (c.f., (Hsieh et al. 2020; Krietenstein et al. 2020)) which require large bin sizes to visualize mammalian “TADs,” the resolution of the experiments in Goel et al. (Goel et al. 2023) is similar to the resolution in our MicroC experiments (200-400 bp).  A MicroC contact map from Goel et al. shows the Ppm1g locus on chromosome 5 before and after Rad21 depletion.  The contact map visualizes a 250 kb DNA segment, which is only slightly larger than the ~230 kb DNA segment in Fig. 2C in our paper.

      In this experiment, there was a 97% reduction in the amount of Rad21.  However, as can be seen by comparing the contact profiles above and below the diagonal, there is little or no difference in TAD organization after cohesin depletion when individual TADs are visualized with a bin size of 250 bp.  These results would indicate that mammalian TADs do not require cohesin.

      Note also that the weak 45° stripes connecting different TADs (c.f. blue/green arrowheads) are still present after Rad21 depletion.  In the most popular version of the loop extrusion model, cohesin loads at a site(s) somewhere in the TAD-to-be, and then extrudes both strands until it bumps into CTCF roadblocks.  As illustrated in Figure Sup 2, this mechanism generates a vertical stripe originating at the cohesin loading site and extending until cohesin bumps into the left or right roadblock, at which point the stripe transitions into a 45° stripe that ends when cohesin bumps into the other roadblock.  While 45° stripes are visible, there is no hint of a vertical stripe.  This suggests that the mechanism for generating stripes, if it is an active mechanism (rather than passive diffusion), may be quite different.  The 45° stripes must be generated by a factor(s) that is anchored to one (blue arrowhead) or both (green arrowhead) boundaries.  In addition, this factor, whatever it is, is not cohesin.  The reason for this is that the 45° stripes are present both before and after Rad21 depletion.  Moreover, if one were to imagine that the stripes represent a process involved in TAD formation, this process does not require cohesin (see Goel et al 2023).

      It is worth noting another observation that is inconsistent with the cohesin loop extrusion/CTCF roadblock model for TAD formation/maintenance.  CTCF is not found at all of the TAD boundaries in this 250 kb DNA region.  This would suggest that there are other DNA binding proteins that have chromosomal architectural functions besides CTCF.  In flies, many of the chromosomal architectural proteins are, like CTCF, polydactyl zinc finger (PZF) proteins (Bonchuk et al. 2021; Bonchuk et al. 2022; Fedotova et al. 2017).  These include Su(Hw), CTCF, Pita, Zipic and CLAMP.  The PZF family in flies is quite large.  There are ~250 different PZF genes, and since only a handful of these have been characterized, it seems likely that additional members of this family will have architectural functions.  Thus far, only one boundary protein, CTCF, has received attention in studies on mammalian chromosome architecture.  As the mammalian genome is much larger and more complicated than the fly genome, it is difficult to believe that CTCF is the sole chromosomal architectural protein in mammals.  In this respect, it is worth noting that there are ~800 members of the PZF family in mammalian genomes (Fedotova et al. 2017).

      Goel et al. (Goel et al. 2023) did observe alterations in the contact profiles after Rad21 depletion when they visualized the Ppm1g region at much lower resolution (bin sizes of 5 kb and 1 kb). The 5 kb bin size visualizes a region of ~1.2 Mb, while the 1 kb bin size visualizes a region that spans ~800 kb.  These large triangular units do not correspond to the individual TADs seen when Goel et al. visualized the Ppm1g locus at 250 bp resolution. 

      Nor do they correspond to TADs in Fig. 2 of our paper.  Instead they represent TAD neighborhoods which likely consist of 20-30 or more individual TADs.  Consequently the alterations in contact patterns seen after Rad21 depletion are occurring at the level of TAD neighborhoods.  This can be seen by comparing pixel density inside the blue lines before (above the diagonal) and after Rad21 depletion (below the diagonal) (Goel et al 2023).  The more distant contacts between individual TADs within this neighborhood are preferentially reduced by Rad21 depletion (the region below and to the left of the double arrowhead).  By contrast, the TADs themselves are unaffected, as are contacts between individual TADs and their immediate neighbors (see purple and light green asterisk).  The other interesting feature is the loss of contacts between what appear to be partially overlapping neighborhoods.  This loss of neighborhood-to-neighborhood contacts can be seen in the region located between the green and blue lines.  The neighborhood that appears to partially overlap the Ppm1g neighborhood is outlined in purple.

      It is worth noting that, with the exception of the high resolution experiments in Goel et al., all of the other studies on cohesin (and CTCF) have examined the effects on contact maps within (and between) large neighborhoods (bin sizes >1 kb).  In most cases, these large neighborhoods are likely to be composed of many individual TADs like those seen in Goel et al. and in Fig. 2 of our paper.  We also observe larger neighborhoods in the fly genome, though they do not appear to be as large as those in mammals.  Our experiments do not address what role cohesin might have in facilitating contacts between more distant TADs located within the same neighborhoods, or between TADs in different neighborhoods, or whether loop extrusion is involved.

      We would also note that the Drosophila DNA segment in Fig. 2C contains 35 different genes, while the mammalian DNA segment shown in Fig. 1 has only 9.  Thus, in this part of the fly genome, Pol II genes are more densely packed than in the mammalian DNA segment.  Much of the fly genome is also densely packed, and the size of individual TADs will likely be smaller, on average, than in mammals.  Nevertheless, the MicroC profiles are not all that different.  As is also common in flies, each TAD in the Ppm1g region only encompasses one or two genes.  Note also that there are no volcano triangles with plumes as would be predicted for TADs that have a stem-loop topology.

      In fact, as shown in Author response image 1, the high-resolution contact profile for the Ppm1g region shows a strong resemblance to that observed for the fly Abd-B regulatory domains.  These regulatory domains are part of a larger neighborhood that encompasses the abd-A and Abd-B genes and their regulatory domains.

      Author response image 1.

      Abd-B regulatory domains

      As the authors show, the single long DNA loop mediated by cohesin loop extrusion connecting the ectopic and endogenous boundary is clearly inconsistent with the results; therefore the main conclusion of the paper, that the 3D topology of the boundary elements is a consequence of pairing, is strong. However, loop extrusion and pairing are not mutually exclusive models for the formation of TADs. Loop-extruding cohesin complexes need not make a 140 kb loop; multiple smaller loops could bring together the two boundary elements, which are then held together by pairing proteins that can make circular topologies.

      (1.2) In the pairing model, distant boundaries bump into each other (by random walks or partially constrained walks), and if they are “compatible” they pair with each other, typically in an orientation-dependent manner.  As an alternative, the reviewer argues that cohesin need not make one large 140 kb loop.  Instead it could generate a series of smaller loops (presumably corresponding to the intervening TADs).  These smaller loops would bring homie in the transgene in close proximity to the eve locus so that it could interact with the endogenous homie and nhomie elements in the appropriate orientation, and in this way only one of the reporters would be ultimately activated.

      There are two problems with the idea that cohesin-dependent loop extrusion brings transgene homie into contact with homie/nhomie in the eve locus by generating a series of small loops (TADs).  The first is the very large distances over which specific boundary:boundary pairing interactions can occur.  The second is that boundary:boundary pairing interactions can take place not only in cis, but also in trans.

      We illustrate these points with several examples. 

      Fujioka et al. 2016, Fig 7 shows an experiment in which attP sites located ~2 Mb apart were used to insert two different transgenes, one containing a lacZ reporter and the other containing the eve anal plate enhancer (AP) (Fujioka et al. 2016).  If the lacZ reporter and the AP transgenes also contain homie, the AP enhancer can activate lacZ expression (panel A,).  On the other hand, if one of the transgenes has lambda DNA instead of homie, no regulatory interactions are observed (panel A,).  In addition, as is the case in our experiments using the -142 kb platform, orientation matters.  In the combination on the top left, the homie boundary is pointing away from both the lacZ reporter and the AP enhancer.  Since homie pairs with itself head-to-head, pairing brings the AP enhancer into contact with the lacZ reporter.  A different result is obtained for the transgene pair in panel A on the top right.  In this combination, homie is pointing away from the lacZ reporter, while it is pointing towards the AP enhancer.  As a consequence, the reporter and enhancer are located on opposite sides of the paired homie boundaries, and in this configuration they are unable to interact with each other.

      On the top left of panel B, the homie element in the AP enhancer transgene was replaced by a nhomie boundary oriented so that it is pointing towards the enhancer.  Pairing of homie and nhomie head-to-tail brings the AP enhancer in the nhomie transgene into contact with the lacZ reporter in the homie transgene, and it activates reporter expression.  Finally, like homie, nhomie pairs with itself head-to-head, and when the nhomie boundaries are pointing towards both the AP enhancer and the lacZ reporter, reporter expression is turned on.
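      The orientation rules in these four transgene combinations can be summarized in a small logical model. This is only an illustrative sketch of the reasoning above, not a description of any published analysis: the function names and the "towards"/"away" encoding are assumptions introduced here, and the orientation of homie relative to lacZ in the panel B homie:nhomie example is assumed to be the same as in panel A (pointing away from the reporter).

```python
# Illustrative model only; all names are hypothetical.
PAIRING_BOUNDARIES = {"homie", "nhomie"}


def pairing_mode(boundary_a, boundary_b):
    """Return how two elements pair, or None if they do not pair."""
    if boundary_a not in PAIRING_BOUNDARIES or boundary_b not in PAIRING_BOUNDARIES:
        return None  # e.g. lambda DNA in place of a boundary
    # Self-pairing (homie:homie, nhomie:nhomie) is head-to-head;
    # homie:nhomie pairing is head-to-tail.
    return "head-to-head" if boundary_a == boundary_b else "head-to-tail"


def enhancer_contacts_reporter(reporter_boundary, reporter_side,
                               enhancer_boundary, enhancer_side):
    """Predict whether pairing places enhancer and reporter on the same side.

    *_side is "towards" if the boundary points towards its reporter or
    enhancer, and "away" if it points away from it.
    """
    mode = pairing_mode(reporter_boundary, enhancer_boundary)
    if mode is None:
        return False
    if mode == "head-to-head":
        # Parallel alignment: same relative position ends up on the same side.
        return reporter_side == enhancer_side
    # Head-to-tail alignment: opposite relative positions end up together.
    return reporter_side != enhancer_side


# The combinations discussed in the text (reporter first, enhancer second).
print(enhancer_contacts_reporter("homie", "away", "homie", "away"))
print(enhancer_contacts_reporter("homie", "away", "homie", "towards"))
print(enhancer_contacts_reporter("homie", "away", "nhomie", "towards"))
print(enhancer_contacts_reporter("nhomie", "towards", "nhomie", "towards"))
print(enhancer_contacts_reporter("homie", "away", "lambda", "away"))
```

      Under these assumptions the model predicts reporter activation for the first, third and fourth combinations and none for the second and for the lambda DNA control, matching the outcomes described above.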

      Long distance boundary-dependent pairing interactions by the bithorax complex Mcp boundary have also been reported in several papers.  Fig. 6 from Muller et al. (Muller et al. 1999) shows the pattern of regulatory interactions (in this case PRE-dependent “pairing-sensitive silencing”) between transgenes that have a mini-white reporter, the Mcp and scs’ boundaries and a PRE that is located close to Mcp.  In this experiment flies carrying transgenes inserted at the indicated sites on the left and right arms of the 3rd chromosome were mated in pairwise combinations, and their trans-heterozygous progeny examined for pairing-sensitive silencing of the mini-white reporter.

      Two examples of long-distance pairing-sensitive silencing mediated by Mcp/scs’ are shown in Fig. 5b from Muller et al. 1999.  The transgene inserts in panel A are w#12.43 and ff#10.5.  w#12.43 is inserted close to the telomere of 3R at 99B.  ff#10.5 is inserted closer to the middle of 3R at 91A.  The estimated distance between them is 11.3 Mb.  The transgene inserts in panel B are ff#10.5 and ff#11.102.  ff#11.102 is inserted at 84D, and the distance between them is 11 Mb.  Normally, the eye color phenotype of the mini-white reporter is additive: homozygous inserts have twice as dark an eye color as hemizygous inserts, while in trans-heterozygous flies the eye color would be the sum of the two different transgenes.  However, when a PRE is present and the transgenes can pair, silencing is observed.  In panel A, the trans-heterozygous combination has a lighter eye color than either of the parents.  In panel B, the trans-heterozygous combination is darker than one of the parents (ff#10.5) but much lighter than the other (ff#11.102).

      All ten of the transgenes tested were able to engage in long distance (>Mbs) trans regulatory interactions; however, likely because of how the chromosome folds on the Mb scale (e.g., the location of meta-loops: see #2.1 and Author response image 3), not all of the possible pairwise silencing interactions are observed.  The silencing interactions shown in Muller et al. are between transgenes inserted on different homologs.  Mcp/scs'-dependent silencing interactions can also occur in cis. Moreover, just like the homie and nhomie experiments described above, Muller et al. (Muller et al. 1999) found that Mcp could mediate long-distance activation of mini-white and yellow by their respective enhancers.

      The pairing-sensitive activity of the PRE associated with the Mcp boundary is further enhanced when the mini-white transgene has the scs boundary in addition to Mcp and scs’.  In the experiment shown in Fig. 8 from Muller et al. 1999, the pairing-sensitive silencing interactions of the Mcp/scs’/scs transgene are between transgenes inserted on different chromosomes.  Panel A shows pairing-sensitive silencing between w#15.60, which is on the X chromosome, and w#15.102, which is on the 2nd chromosome.  Panel B shows pairing-sensitive silencing between the 2nd chromosome insert w#15.60 and a transgene, w#15.48, which is inserted on the 3rd chromosome.

      The long-distance trans and cis interactions described here are not unique to homie, nhomie, Mcp, scs’, or scs.  Precisely analogous results have been reported by Sigrist and Pirrotta (Sigrist and Pirrotta 1997) for the gypsy boundary when the bxd PRE was included in the mini-white transgene.  Also like the Mcp-containing transgenes in Muller et al. (Muller et al. 1999), Sigrist and Pirrotta observed pairing-sensitive silencing between gypsy bxd PRE mini-white transgenes inserted on different chromosomes.  Similar long-distance (Mb) interactions have been reported for Fab-7 (Bantignies et al. 2003; Li et al. 2011).  In addition, there are examples of “naturally occurring” long-distance regulatory and/or physical interactions.  One would be the regulatory/physical interactions between the p53 enhancer upstream of reaper and Xrp1 which was described by Link et al. (Link et al. 2013).  Another would be the nearly 60 meta-loops identified by Mohana et al. (Mohana et al. 2023).

      Like homie at -142 kb, the regulatory interactions (pairing-sensitive silencing and enhancer activation of reporters) reported in Muller et al. (Muller et al. 1999) involve direct physical interactions between the transgenes.  Vazquez et al. (Vazquez et al. 2006) used the lacI/lacO system to visualize contacts between distant scs/Mcp/scs’-containing transgenes in imaginal discs.  As indicated in Vazquez et al. 2006, Table 3 lines #4-7, when both transgenes had Mcp and were inserted on the same chromosome, they colocalized in trans-heterozygotes (single dot) in 94% to 97% of the disc nuclei in the four pairwise combinations they tested.  When the transgenes both lacked Mcp (Vazquez et al. 2006, Table 3 #1), colocalization was observed in 4% of the nuclei.  When scs/Mcp/scs’-containing transgenes on the 2nd and 3rd chromosome were combined (Vazquez et al. 2006, Table 3 #8), colocalization was observed in 96% of the nuclei.  They also showed that four different scs/Mcp/scs’ transgenes (two at the same insertion site but on different homologs, and two at different sites on different homologs) colocalized in 94% of the eye imaginal disc nuclei (Vazquez et al. 2006, Table 3 #9).  These pairing interactions were also found to be stable over several hours.  Similar colocalization experiments together with 3C were reported by Li et al. (Li et al. 2011).

      The de novo establishment of trans interactions between compatible boundary elements has been studied by Lim et al. (Lim et al. 2018).  These authors visualized transvection (enhancer activation of an MS2 loop reporter in trans) mediated by the gypsy insulator, homie and Fab-8 in NC14 embryos.  When both transgenes shared the same boundary element, transvection/physical pairing was observed in a small subset of embryos.  The interactions took place after a delay and increased in frequency as the embryo progressed into NC14.  As expected, transvection was specific: it was not observed when the transgenes had different boundaries.  For homie it was also orientation-dependent: it was observed when homie was oriented in the same direction in both transgenes, but not when homie was oriented in opposite directions in the two transgenes.

      While one could imagine that loop extrusion-dependent compaction of the chromatin located between eve and the transgene at -142 kb into a series of small loops (the intervening TADs) might be able to bring homie in the transgene close to homie/nhomie in the eve locus, there is no cohesin-based loop extrusion scenario that would bring transgenes inserted at sites 6 Mb, 11 Mb, on different sides of the centromere, or at opposite ends of the 3rd chromosome together so that the distant boundaries recognize their partners and physically pair with each other.  Nor is there a plausible cohesin-based loop extrusion mechanism that could account for the fact that most of the documented long-distance interactions involve transgenes inserted on different homologs.  This is not to mention the fact that long-distance interactions are also observed between boundary-containing transgenes inserted on different chromosomes.

      In fact, given these results, one would logically come to precisely the opposite conclusion.  If boundary elements inserted Mbs apart, on different homologs and on different chromosomes can find each other and physically pair, it would be reasonable to think that the same mechanism (likely random collisions) is entirely sufficient when they are only 142 kb apart.

      Yet another reason to doubt the involvement or need for cohesin-dependent loop extrusion in bringing the transgene homie in contact with the eve locus comes from the studies of Goel et al. (Goel et al. 2023).  They show that cohesin has no role in the formation of TADs in mammalian tissue culture cells.  So if TADs in mammals aren’t dependent on cohesin, there would not be a good reason to think at this point that the loops (TADs) that are located between eve and the transgene are generated by, or even strongly dependent on, cohesin-dependent loop extrusion.

      It is also important to note that even if loop extrusion were to contribute to chromatin compaction in this context and make the looping interactions that lead to orientation-specific pairing more efficient, the role of loop extrusion in this model is not determinative of the outcome; it is merely a general compaction mechanism.  This is a far cry from the popular concept of loop extrusion as being THE driving force determining chromosome topology at the TAD level.

      Reviewer #2 (Public Review):

      In Bing et al, the authors analyze micro-C data from NC14 fly embryos, focusing on the eve locus, to assess different models of chromatin looping. They conclude that fly TADs are less consistent with conventional cohesin-based loop extrusion models and instead rely more heavily on boundary-boundary pairings in an orientation-dependent manner.

      Overall, I found the manuscript to be interesting and thought-provoking. However, this paper reads much more like a perspective than a research article. Considering eLIFE is aimed at the general audience, I strongly suggest the authors spend some time editing their introduction to the most salient points as well as organizing their results section in a more conventional way with conclusion-based titles. It was very difficult to follow the authors' logic throughout the manuscript as written. It was also not clear as written which experiments were performed as part of this study and which were reanalyzed but published elsewhere. This should be made clearer throughout.

      It has been shown several times that Drosophila Hi-C maps do not contain all of the features (frequent corner peaks, stripes, etc.) observed when compared to mammalian cells. Considering these features are thought to be products of extrusion events, it is not an entirely new concept that Drosophila domains form via mechanisms other than extrusion.

      (2.1) While there are differences between the Hi-C contact profiles in flies and mammals, these differences likely reflect in large part the bin sizes used to visualize contact profiles.  With the exception of Goel et al. (Goel et al. 2023), most of the mammalian Hi-C studies have been low resolution restriction enzyme-based experiments, and required bin sizes of 1 kb or greater to visualize what are labeled as “TADs.”  In fact, as shown by experiments in Goel et al., these are not actually TADs, but rather a conglomeration of multiple TADs into a series of TAD neighborhoods.  The same is true for the MicroC experiments of Krietenstein et al. and Hsieh et al. on human and mouse tissue culture cells (Hsieh et al. 2020; Krietenstein et al. 2020).  This is shown in Author response image 2.  In this image, we have compared the MicroC profiles generated from human and mouse tissue culture cells with fly MicroC profiles at different levels of resolution.

      For panels A-D, the genomic DNA segments shown are approximately 2.8 Mb, 760 kb, 340 kb, and 190 kb.  For panels E-H, the genomic DNA segments shown are approximately 4.7 Mb, 870 kb, 340 kb and 225 kb.  For panels I-L, the genomic DNA segments shown are approximately 3 Mb, 550 kb, 290 kb and 175 kb.

      As reported for restriction enzyme-based Hi-C experiments, a series of stripes and dots are evident in mammalian MicroC profiles.  In the data from Krietenstein et al., two large TAD “neighborhoods” are evident with a bin size of 5 kb, and these are bracketed by 45° stripes (A: black arrows).  At 1 kb (panel B), the 45° stripe bordering the neighborhood on the left no longer defines the edge of the neighborhood (blue arrow: panel B), and both stripes become discontinuous (fuzzy dots).  At 500 bp (panel C) and 200 bp (panel D) bin sizes, the stripes largely disappear (black arrows) even though they were the most prominent feature in the TAD landscape with large bin sizes.  At 200 bp, the actual TADs (as opposed to the forest) are visible, but weakly populated.  There are no stripes, and only one of the TADs has an obvious “dot” (green asterisk: panel C).

      Author response image 2.

      Mammalian MicroC profiles at different bin sizes.

      Large TAD neighborhoods bordered by stripes are also evident in the Hsieh et al. data set in Author response image 2 panels E and F (black arrows in E and F and green arrow in F).  At 400 bp resolution (panel G), the narrow stripe in panel F (black arrows) becomes much broader, indicating that it is likely generated by interactions across one or two small TADs that can be discerned at 200 bp resolution.  The same is true for the broad stripe indicated by the green arrows in panels F, G and H.  This stripe arises from contacts between the TADs indicated by the red bar in panels G and H and the TADs to the other side of the volcano triangle with a plume (blue arrow in panel H).  As in flies, we would expect that this volcano triangle topped by a plume corresponds to a stem-loop.  However, the resolution is poor at 200 bp, and the profiles of the neighboring TADs are not very distinct.

      For the fly data set, stripes can be discerned when analyzed at 800 bp resolution (see arrows in Author response image 3);  however, these stripes are flanked by regions of lower contact, and represent TAD-TAD interactions.  At 400 bp, smaller neighborhoods can be discerned, and these neighborhoods exhibit a complex pattern of interaction with adjacent neighborhoods.  With bin sizes of 200 bp, individual TADs are observed, as are TAD-TAD interactions like those seen near eve.  Some of the TADs have dots at their apex, while others do not—much like what is seen in the mammalian MicroC studies.

      Author response image 3.

      Mammalian MicroC profiles at different bin sizes.

      Stripes: As illustrated in Author response image 2 A-D and E-H, the continuous stripes seen in low resolution mammalian studies (>1 kb bins) would appear to arise from binning artefacts.  At high resolution where single TADs are visible, the stripes seem to be generated by TAD-TAD interactions, and not by some type of “extrusion” mechanism.  This is most clearly seen for the volcano with plume TAD in Author response image 2 G and H.  While stripes in Author response image 2 disappear at high resolution, this is not always true.  There are stripes that appear to be “real” in Goel et al. 2023 for the TADs in the Ppm1g region, and in Author response image 1 for the Abd-B regulatory domain TADs.  Since the stripes in the Ppm1g region are unaffected by Rad21 depletion, some other mechanism must be involved (cf. Shidlovskii et al. 2021).
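
The effect of bin size described above can be illustrated with a toy calculation. The sketch below is not the analysis behind the Author response images; it builds a hypothetical contact map at 200 bp per bin in which one anchor bin contacts two separate blocks of bins (standing in for two distinct TAD-TAD interactions), then sums the map into 1 kb bins. The map size, the block positions, and the helper names `rebin` and `occupied_columns` are all assumptions made purely for illustration.

```python
def rebin(matrix, factor):
    """Sum a square contact matrix into bins `factor` times larger."""
    n = len(matrix) // factor  # any incomplete trailing bin is dropped
    out = [[0] * n for _ in range(n)]
    for i in range(n * factor):
        for j in range(n * factor):
            out[i // factor][j // factor] += matrix[i][j]
    return out


def occupied_columns(matrix, row):
    """Return the column indices that hold any contact in the given row."""
    return [j for j, value in enumerate(matrix[row]) if value > 0]


# Hypothetical 40 x 40 map at 200 bp per bin (8 kb in total).
SIZE = 40
fine = [[0] * SIZE for _ in range(SIZE)]

# Fine bin 2 (the "anchor") contacts two separate blocks of bins,
# with a gap of four empty bins (18-21) between the blocks.
for j in list(range(10, 18)) + list(range(22, 30)):
    fine[2][j] = 1
    fine[j][2] = 1  # contact maps are symmetric

fine_cols = occupied_columns(fine, 2)      # anchor row at 200 bp
coarse = rebin(fine, 5)                    # 200 bp -> 1 kb bins
coarse_cols = occupied_columns(coarse, 0)  # anchor now falls in coarse bin 0

print("200 bp bins:", fine_cols)    # two runs separated by empty bins
print("1 kb bins:  ", coarse_cols)  # the runs merge into one unbroken run
```

In this toy map, the gap between the two contact blocks is visible at 200 bp but is absorbed into neighboring bins at 1 kb, so two discrete interactions read as a single continuous stripe. Real maps also differ in read depth and normalization, which this sketch ignores.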

      Dots: The high resolution images of mammalian MicroC experiments in Author response image 2D and H show that, like Drosophila (Author response image 3L), mammalian TADs don’t always have a “dot” at the apex of the triangle.  This is not surprising.  In the MicroC procedure, fixed chromatin is digested to mononucleosomes with MNase.  Since most TAD boundaries in flies, and presumably also in mammals, are relatively large (150-400 bp) nuclease hypersensitive regions, extensive MNase digestion will typically reduce the boundary element sequences to oligonucleotides.

      In flies, the only known sequences (at least to date) that end up giving dots (like those seen in Author response image 1) are bound by a large (>1,000 kd) GAF-containing multiprotein complex called LBC.  In the Abd-B region of BX-C, LBC binds to two ~180 bp sequences in Fab-7 (dHS1 and HS3: Kyrchanova et al. 2018; Wolle et al. 2015), and to the centromere proximal (CP) side of Fab-8.  The LBC elements in Fab-7 (dHS1) and Fab-8 (CP) have both blocking and boundary bypass activity (Kyrchanova et al. 2023; Kyrchanova et al. 2019a; Kyrchanova et al. 2019b; Postika et al. 2018).  Elsewhere, LBC binds to the bx and bxd PREs in the Ubx regulatory domains, to two PREs upstream of engrailed, to the hsp70 promoter, the histone H3-H4 promoters, and the eve promoter (unpublished data).  Based on ChIP signatures, it likely binds to most PREs/tethering elements in the fly genome (Batut et al. 2022; Li et al. 2023).  Indirect end-labeling experiments (Galloni et al. 1993; Samal et al. 1981; Udvardy and Schedl 1984) indicate that LBC protects an ~150-180 bp DNA segment from MNase digestion, which would explain why LBC-bound sequences are able to generate dots in MicroC experiments.  Also unlike typical boundary elements, the pairing interactions of the LBC elements we’ve tested appear to be orientation-independent (unpublished data).

      The difference in MNase sensitivity between typical TAD boundaries and LBC-bound elements is illustrated in the MicroC of the Leukocyte-antigen-related-like (Lar) meta-loop in Author response image 4 panels A and B.  Direct physical pairing of two TAD boundaries (blue and purple) brings two TADs encompassing the 125 kb Lar gene into contact with two TADs in a gene poor region 620 kb away.  This interaction generates two regions of greatly enhanced contact: the two boxes on either side of the paired boundaries (panel A).  Note that like transgene homie pairing with the eve boundaries, the boundary pairing interaction that forms the Lar meta-loop is orientation-dependent.  In this case the TAD boundary in the Lar locus pairs with the TAD boundary in the gene poor region head-to-head (arrow tip to arrow tip), generating a circle-loop.  This circle-loop configuration brings the TAD upstream of the blue boundary into contact with the TAD upstream of the purple boundary.  Likewise, the TAD downstream of the blue boundary is brought into contact with the TAD downstream of the purple boundary.

      In the MicroC procedure, the sequences that correspond to the paired boundaries are not recovered (red arrow in Author response image 4 panel B).  This is why there are vertical and horizontal blank stripes (red arrowheads) emanating from the missing point of contact.  Using a different HiC procedure (dHS-C) that allows us to recover sequences from typical boundary elements (Author response image 4 panels C and D), there is a strong “dot” at the point of contact which corresponds to the pairing of the blue and purple boundaries.

      There is a second dot (green arrow) within the box that represents physical contacts between sequences in the TADs downstream of the blue and purple boundaries.  This dot is resistant to MNase digestion and is visible both in the MicroC and dHS-C profiles.  Based on the ChIP signature of the corresponding elements in the two TADs downstream of the blue and purple boundaries, this dot represents paired LBC elements.

      Author response image 4.

      Lar meta-loop. Panels A & B: MicroC. Panels C & D: dHS-C

      That being said, the authors' analyses do not distinguish between the formation and the maintenance of domains. It is not clear to this reviewer why a single mechanism should explain the formation of the complex structures observed in static Hi-C heatmaps from a population of cells at a single developmental time point. For example, how can the authors rule out that extrusion initially provides the necessary proximity and possibly the cis preference of contacts required for boundary-boundary pairing whereas the latter may more reflect the structures observed at maintenance?

      (2.2) The MicroC profiles shown in Fig. 2 of our paper were generated from nuclear cycle (NC) 14 embryos.  NC14 is the last nuclear cycle before cellularization (Foe 1989).  After the nuclei exit mitosis, S phase begins, and because satellite sequences are late replicating in this nuclear cycle, S phase lasts 50 min instead of only 4-6 min during earlier cycles (Shermoen et al. 2010).  So unlike MicroC studies in mammals, our analysis of chromatin architecture in NC14 embryos likely offers the best opportunity to detect any intermediates that are generated during TAD formation.  In particular, we should be able to observe evidence of cohesin linking the sequences from the two extruding strands together (the stripes) as it generates TADs de novo.  However, there are no vertical stripes in the eve TAD as would be expected if cohesin entered at a few specific sites somewhere within the TAD and extruded loops in opposite directions synchronously, nor are there stripes at 45° as would be expected if it started at nhomie or homie (see Figure Supplemental 1).  We also do not detect cohesin-generated stripes in any of the TADs in between eve and the attP site at -142 kb.  Note that in some models, cohesin is thought to be continuously extruding loops.  After hitting the CTCF roadblocks, cohesin either falls off after a short period and starts again or it breaks through one or more TAD boundaries generating the LDC domains.  In this dynamic model, stripes of crosslinked DNA generated by the passing cohesin complex should be observed throughout the cell cycle.  They are not.

      As for formation versus maintenance, and the possible involvement of cohesin loop extrusion in the former, but not the latter:  This question was indirectly addressed in point #1.2 above.  In this point we described multiple examples of specific boundary:boundary pairing interactions that take place over Mbs, in cis and in trans and even between different chromosomes.  These long-distance interactions don’t preexist; instead they must be established de novo and then maintained.  This process was actually visualized in the studies of Lim et al. (Lim et al. 2018) on the establishment of trans boundary pairing interactions in NC14 embryos.  There is no conceivable mechanism by which cohesin-based loop extrusion could establish the long or short distance trans interactions that have been documented in many studies on fly boundary elements.  Also as noted above, it seems unlikely that it is necessary for long-range interactions in cis.

      A more plausible scenario is that cohesin entrapment helps to stabilize these long-distance interactions after they are formed.  If this were true, then one could argue that cohesin might also function to maintain TADs after boundaries have physically paired with their neighbors in cis.  However, the Rad21 depletion experiments of Goel et al. (Goel et al. 2023) would rule out an essential role for cohesin in maintaining TADs after boundary:boundary pairing.  In short, while we cannot formally rule out that loop extrusion might help bring sequences closer together to increase their chance of pairing, neither the specificity of that pairing, nor its orientation can be explained by loop extrusion.  Furthermore, since pairing in trans cannot be facilitated by loop extrusion, invoking it as potentially important for boundary-boundary pairing in cis can only be described as a potential mechanism in search of a function, without clear evidence in its favor.

      On the other hand, the apparent loss of contacts between TADs within large multi-TAD neighborhoods (Goel et al. 2023) would suggest that there is some sort of decompaction of neighborhoods after Rad21 depletion.  It is possible that this might stress interactions that span multiple TADs as is the case for homie at -142 kb, or for the other examples described in #1.2 above.  This kind of involvement of cohesin might or might not be associated with a loop extrusion mechanism.

      Future work aimed at analyzing micro-C data in cohesin-depleted cells might shed additional light on this.

      (2.3) This experiment has been done by Goel et al. (Goel et al. 2023) in mammalian tissue culture cells.  They found that TADs, as well as local TAD neighborhoods, are not disrupted/altered by Rad21 depletion (see Goel et al. 2023 and our response to point #1.1 of reviewer #1).

      Additional mechanisms at play include compartment-level interactions driven by chromatin states. Indeed, in mammalian cells, these interactions often manifest as a "plume" on Hi-C maps similar to what the authors attribute to boundary interactions in this manuscript. How do the chromatin states in the neighboring domains of the eve locus impact the model if at all?

      (2.4) Chromatin states have been implicated in driving compartment level interactions. 

      Compartments as initially described were large, often Mb sized, chromosomal segments that “share” similar chromatin marks/states, and are thought to merge via co-polymer segregation.  They were visualized using large multi-kb bin sizes.  In the studies reported here, we use bin sizes of 200 bp to examine a DNA segment of less than 200 kb which is subdivided into a dozen or so small TADs.  Several of the TADs contain more than one transcription unit, and they are expressed in quite different patterns, and thus might be expected to have different “chromatin states” at different points in development and in different cells in the organism. However, as can be seen by comparing the MicroC patterns in our paper that are shown in Fig. 2 with Fig. 7, Figure Supplemental 5 and Figure Supplemental 6, the TAD organization in NC14 and 12-16 hr embryos is for the most part quite similar.  There is no indication that these small TADs are participating in liquid phase compartmentalization that depends upon shared chromatin/transcriptional states in NC14 and then again in 12-16 hr embryos. 

      In NC14 embryos, eve is expressed in 7 stripes, while it is potentially active throughout much of the embryo.  In fact, the initial pattern in early cycles is quite broad and is then refined during NC14.  In 12-16 hr embryos, the eve gene is silenced by the PcG system in all but a few cells in the embryo.  However, here again the basic structure of the TAD, including the volcano plume, looks quite similar at these different developmental stages.  

      As for the suggestion that the plume topping the eve volcano triangle is generated because the TADs flanking the eve TAD share chromatin states and coalesce via some sort of phase separation:

      This model has been tested directly in Ke et al. (Ke et al. 2024).  In Ke et al., we deleted the nhomie boundary and replaced it with either nhomie in the reverse orientation or homie in the forward orientation.  According to the compartment model, changing the orientation of the boundaries so that the topology of the eve TAD changes from a stem-loop to a circle-loop should have absolutely no effect on the plume topping the eve volcano triangle.  The TADs flanking the eve TAD would still be expected to share the same chromatin states and would still be able to coalesce via phase transition.  However, this is not what is observed.  The plume disappears and is replaced by “clouds” on both sides of the eve TAD. The clouds arise because the eve TAD bumps into the neighboring TADs when the topology is a circle-loop.  

      We would also note that “compartment-level” interactions would not explain the findings presented in Muller et al. 1999, in Table 1 or in Author response image 4.  It is clear that the long-distance (Mb) interactions observed for Mcp, gypsy, Fab-7, homie, nhomie and the blue and purple boundaries in Author response image 4 arise by the physical pairing of TAD boundary elements.  This fact is demonstrated directly by the MicroC experiments in Fig. 7 and Fig. Supplemental 4 and 5, and by the MicroC and dHS-C experiments in Author response image 4.  There is no evidence for any type of “compartment/phase separation” driving these specific boundary pairing interactions.

      In fact, given the involvement of TAD boundaries in meta-loop formation, one might begin to wonder whether some of the “compartment level interactions” are generated by the specific pairing of TAD boundary elements rather than by “shared chromatin” states.  For example, the head-to-head pairing of the blue and purple boundaries generates a Lar meta-loop that has a circle-loop topology.  As a consequence, sequences upstream of the blue and purple boundary come into contact, generating the small dark rectangular box on the upper left side of the contact map.  Sequences downstream of the blue and purple boundary also come into contact, and this generates the larger rectangular box in the lower right side of the contact map.  A new figure, Fig. 9, shows that the interaction pattern flips (lower left and top right) when the meta-loop has a stem-loop topology.  If these meta-loops are visualized using larger bin sizes, the classic “compartment” patchwork pattern of interactions emerges.  Would the precise patchwork pattern of “compartmental” interactions involving the four distant TADs that are linked in the two meta-loops shown in Fig. 9 persist as is if we deleted one of the TAD boundaries that forms the meta-loop?  Would the precise patchwork pattern persist if we inverted one of the meta-loop boundaries so that we converted the topology of the loop from a circle-loop to a stem-loop or vice versa?  We haven’t used MicroC to compare the compartment organization after deleting or inverting a meta-loop TAD boundary; however, a comparison of the MicroC pattern in WT in Fig. 1C with that for the homie transgenes in Fig. 7 and Figs. Supplemental 5, 6 and 7 indicates a) that novel patterns of TAD:TAD interactions are generated by this homie-dependent mini-meta-loop and b) that the patterns of TAD:TAD interactions depend upon loop topology.  Were these novel TAD:TAD interactions generated instead by compartment level interactions/shared chromatin states, they should be evident in WT as well (Fig. 1).  They are not.
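
The dependence of flanking-TAD contacts on loop topology can be written down as a small lookup. This is only a bookkeeping sketch of the description above, not a model of chromatin: the circle-loop entries restate the Lar meta-loop contacts, and the stem-loop entries are inferred from the statement that the interaction pattern flips. The labels and the function names `flanking_contacts` and `invert_one_boundary` are hypothetical.

```python
# Which flanking regions are brought together when two boundaries
# ("blue" and "purple") pair, keyed by the topology of the resulting loop.
CONTACTS = {
    # Circle-loop: upstream meets upstream, downstream meets downstream.
    "circle-loop": [
        ("upstream of blue", "upstream of purple"),
        ("downstream of blue", "downstream of purple"),
    ],
    # Stem-loop (inferred from the flipped pattern): upstream meets downstream.
    "stem-loop": [
        ("upstream of blue", "downstream of purple"),
        ("downstream of blue", "upstream of purple"),
    ],
}


def flanking_contacts(topology):
    """Return the pairs of flanking regions expected to contact each other."""
    if topology not in CONTACTS:
        raise ValueError(f"unknown topology: {topology!r}")
    return CONTACTS[topology]


def invert_one_boundary(topology):
    """Inverting one paired boundary converts one topology into the other."""
    return "stem-loop" if topology == "circle-loop" else "circle-loop"


start = "circle-loop"
flipped = invert_one_boundary(start)
for name in (start, flipped):
    print(name)
    for left, right in flanking_contacts(name):
        print(f"  {left} <-> {right}")
```

Under these assumptions the two topologies share no contact pair, which is why inverting a single boundary would be expected to change the patchwork of contacts rather than leave it intact.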

      How does intrachromosomal homolog pairing impact the models proposed in this manuscript (Abed et al. 2019; Erceg et al., 2019)? Several papers recently have shown that somatic homolog pairing is not uniform and shows significant variation across the genome with evidence for both tight pairing regions and loose pairing regions. Might loose pairing interactions have the capacity to alter the cis configuration of the eve locus?

      (2.5) At this point it is not entirely clear how homolog pairing impacts the cis configuration/MicroC contact maps.  We expect that homolog pairing is incomplete in the NC14 embryos we analyzed;  however, since replication of eve and the local neighborhood is likely complete, sister chromosomes should be paired.  So we are likely visualizing the 3D organization of paired TADs.

      In summary, the transgenic experiments are extensive and elegant and fully support the authors' models. However, in my opinion, they do not completely rule out additional models at play, including extrusion-based mechanisms. Indeed, my major issue is the limited conceptual advance in this manuscript. The authors essentially repeat many of their previous work and analyses.

      (2.6) In our view, the current paper makes a number of significant contributions that go well beyond those described in our 2016 publication.  These are summarized below.

      A) While our 2016 paper used transgenes inserted in the -142 kb attP site to study pairing interactions of homie and nhomie, we neither considered nor discussed how our findings might bear on the loop extrusion model.  However, since the loop extrusion model is currently accepted as established fact by many labs working on chromosome structure, it is critically important to devise experimental approaches which test the predictions of this particular model.  One approach would be to deplete cohesin components; however, as discussed in #1.1, our experimental system is not ideal for this type of approach.  On the other hand, there are other ways to test the extrusion model.  Given the mechanism proposed for TAD formation—extruding a loop until cohesin bumps into CTCF/boundary road blocks—it follows that only two types of loop topologies are possible: stem-loop and unanchored loop.  The loop extrusion model, as currently conceived, can’t account for the two cases in this study in which the reporter on the wrong side of the homie boundary from the eve locus is activated by the eve enhancers.  In contrast, our findings are completely consistent with orientation-specific boundary:boundary pairing.

      B) In the loop extrusion model, cohesin embraces both of the extruded chromatin fibers, transiently bringing them into close proximity.  As far as we know, there have been no (high resolution) experiments that have actually detected these extruding cohesin complexes during TAD formation.  In order to have a chance of observing the expected signatures of extruding cohesin complexes, one would need a system in which TADs are being formed.  As described in the text, this is why we used MicroC to analyze TADs in NC14 embryos.  We do not detect the signature stripes that would be predicted (see Figure Supp 2) by the current version of the loop extrusion model.

      C) Reporter expression in the different -142 kb transgenes provides only an indirect test of the loop extrusion and boundary:boundary pairing models for TAD formation.  The reporter expression results need to be confirmed by directly analyzing the pattern of physical interactions in each instance.  While we were able to detect contacts between the transgenes and eve in our 2016 paper, the 3C experiments provided no information beyond that.  By contrast, the MicroC experiments in the current paper give high resolution maps of the physical contacts between the transgene and the eve TAD.  The physical contacts track completely with reporter activity.  Moreover, just as is the case for reporter activity, the observed physical interactions are inconsistent with the loop extrusion model.

      D) Genetic studies in Muller et al. (Muller et al. 1999) and imaging in Vazquez et al. (Vazquez et al. 2006) suggested that more than two boundaries can participate in pairing interactions.  Consistent with these earlier observations, viewpoint analysis indicates the transgene homie interacts with both eve boundaries.  While this could be explained by transgene homie alternating between nhomie and homie in the eve locus, this would require the remodeling of the eve TAD each time the pairing interaction switched between the three boundary elements.  Moreover, two out of the three possible pairing combinations would disrupt the eve TAD, generating an unanchored loop (cf. the lambda DNA TAD in Ke et al. (Ke et al. 2024)).  However, the MicroC profile of the eve TAD is unaffected by transgenes carrying the homie boundary.  This would suggest that like Mcp, the pairing interactions of homie and nhomie might not be exclusively pairwise.  In this context it is interesting to compare the contact profiles of the Lar meta-loop shown in Author response image 4 with the different -142 kb homie inserts.  Unlike the homie element at -142 kb, there is clearly only a single point of contact between the blue and purple boundaries.

      E) Chen et al. (Chen et al. 2018) used live imaging to link physical interactions between a homie-containing transgene inserted at -142 kb and the eve locus to reporter activation by the eve enhancers.  They found that the reporter was activated by the eve enhancers only when it was in “close proximity” to the eve gene.  “Close proximity” in this case was 331 nm.  This distance is equivalent to ~1.1 kb of linear duplex B form DNA, or ~30 nucleosome core particles lined up in a row.  It would not be possible to ligate two DNAs wrapped around nucleosome core particles that are located 330 nm apart in a fixed matrix.  Since our MicroC experiments were done on embryos in which the gene is silent in the vast majority of cells, it is possible that the homie transgene only comes into close enough proximity for transgene nucleosome: eve nucleosome ligation events when the eve gene is off.  Alternatively, and clearly more likely, distance measurements using imaging procedures that require dozens of fluorescent probes may artificially inflate the distance between sequences that are actually close enough for enzymatic ligation.
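
The length conversions in point E can be checked with a back-of-the-envelope calculation. The sketch below assumes a rise of 0.34 nm per base pair for B-form DNA and a nucleosome core particle diameter of about 11 nm; both are textbook approximations supplied here for illustration and are not values stated in the response. The function names are likewise hypothetical.

```python
RISE_NM_PER_BP = 0.34          # assumed: approximate B-form rise per base pair
NUCLEOSOME_DIAMETER_NM = 11.0  # assumed: approximate core particle diameter


def bp_spanned(distance_nm, rise_nm=RISE_NM_PER_BP):
    """Base pairs of straight duplex DNA needed to span a distance."""
    return distance_nm / rise_nm


def nucleosomes_in_row(distance_nm, diameter_nm=NUCLEOSOME_DIAMETER_NM):
    """Core particles lined up side by side needed to span a distance."""
    return distance_nm / diameter_nm


distance_nm = 331.0  # the "close proximity" distance quoted from Chen et al.
bp = bp_spanned(distance_nm)
nucleosomes = nucleosomes_in_row(distance_nm)

print(f"{distance_nm:.0f} nm spans about {bp:.0f} bp of B-form DNA")
print(f"{distance_nm:.0f} nm spans about {nucleosomes:.0f} nucleosome core particles")
```

With these assumed constants the distance comes out at roughly 1 kb of DNA and about 30 core particles, the same order as the figures quoted above; the exact kb value shifts with the rise per base pair that is assumed.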

      F) The findings reported in Goel et al. (Goel et al. 2023) indicate that mammalian TADs don’t require cohesin activity; however, the authors do not provide an alternative mechanism for TAD formation/stability.  Here we have suggested a plausible mechanism.

      The authors make no attempt to dissect the mechanism of this process by modifying extrusion components directly.

      (2.7) See point #1.1

      Some discussion of Rollins et al. on the discovery of Nipped-B and its role in enhancer-promoter communication should also be made to reconcile their conclusions in the proposed absence of extrusion events.

      (2.8) The reason why reducing Nipped-B activity enhances the phenotypic effects of gypsy-induced mutations is not known at this point; however, the findings reported in Rollins et al. (Rollins et al. 1999) would appear to argue against an extrusion mechanism for TAD formation.

      Given what we know about enhancer blocking and TADs, there are two plausible mechanisms for how the Su(Hw) element in the gypsy transposon blocks enhancer-promoter interactions in the gypsy-induced mutants studied by Rollins et al.  First, the Su(Hw) element could generate two new TADs through pairing interactions with boundaries in the immediate neighborhood.  This would place the enhancers in one TAD and the target gene in another TAD.  Alternatively, the studies of Sigrist and Pirrotta (Sigrist and Pirrotta 1997) as well as several publications from Victor Corces’ lab raise the possibility that the Su(Hw) element in gypsy-induced mutations is pairing with gypsy transposons inserted elsewhere in the genome.  This would also isolate enhancers from their target genes.  In either case, the loss of Nipped-B activity increases the mutagenic effects of the Su(Hw) element, presumably by strengthening its boundary function.  If this is due to a failure to load cohesin onto chromatin, this would suggest that cohesin normally functions to weaken the boundary activity of the Su(Hw) element, i.e., by disrupting the ability of Su(Hw) elements to interact with either other boundaries in the neighborhood or with themselves.  Were this a general activity of cohesin (to weaken boundary activity), one would imagine that cohesin normally functions to disrupt TADs rather than generate/stabilize TADs.

      An alternative model is that Nipped-B (and thus cohesin) functions to stabilize enhancer-promoter interactions within TADs.  In this case, loss of Nipped-B would result in a destabilization of the weak enhancer:promoter interactions that can still be formed when gypsy is located between the enhancer and promoter.  In this model the loss of these weak interactions in Nipped-B mutants would appear to increase the “blocking” activity of the gypsy element.  However, this alternative model would also provide no support for the notion that Nipped-B and cohesin function to promote TAD formation.

      Reviewer #3 (Public Review):

      Bing et al. attempt to address fundamental mechanisms of TAD formation in Drosophila by analyzing gene expression and 3D conformation within the vicinity of the eve TAD after insertion of a transgene harboring a Homie insulator sequence 142 kb away in different orientations. These transgenes along with spatial gene expression analysis were previously published in Fujioka et al. 2016, and the underlying interpretations regarding resulting DNA configuration in this genomic region were also previously published. This manuscript repeats the expression analysis using smFISH probes in order to achieve more quantitative analysis, but the main results are the same as previously published. The only new data are the Micro-C and an additional modeling/analysis of what they refer to as the 'Z3' orientation of the transgenes. The rest of the manuscript merely synthesizes further interpretation with the goal of addressing whether loop extrusion may be occurring or if boundary:boundary pairing without loop extrusion is responsible for TAD formation. The authors conclude that their results are more consistent with boundary:boundary pairing and not loop extrusion; however, most of this imaging data seems to support both loop extrusion and the boundary:boundary models. This manuscript lacks support, especially new data, for its conclusions.

      (3.1) The new results/contributions of our paper are described in #2.6 above. 

      Although there are (two) homie transgene configurations that give expression patterns that would be consistent with the loop extrusion model, that is not quite the same as strong evidence supporting loop extrusion.  On the contrary, key aspects of the expression data are entirely inconsistent with loop extrusion, and they thus rule out the possibility that loop extrusion is sufficient to explain the results.  Moreover, the conclusions drawn from the expression patterns of the four transgenes are backed up by the MicroC contact profiles, which are also not consistent with the loop extrusion model.  Further, as documented above, loop extrusion is not only unable to explain the findings reported in this manuscript, but also the results from a large collection of published studies on fly boundaries.  Since all of these boundaries function in TAD formation, there is little reason to think that loop extrusion makes a significant contribution at the TAD level in flies.  Given the results reported by Goel et al. (Goel et al. 2023), one might also have doubts about the role of loop extrusion in the formation/maintenance of mammalian TADs.

      To further document these points, we’ve included a new figure (Fig. 9) that shows two meta-loops.  Like the loops seen for homie-containing transgenes inserted at -142 kb, meta-loops are formed by the pairing of distant fly boundaries.  As only two boundaries are involved, the resulting loop topologies are simpler than those generated when transgene homie pairs with nhomie and homie in the eve locus.  The meta-loop in panel B is a stem-loop.  While a loop with this topology could be formed by loop extrusion, cohesin would have to break through dozens of intervening TAD boundaries and then somehow know to come to a halt at the blue boundary on the left and the purple boundary on the right.  However, none of the mechanistic studies on either cohesin or the mammalian CTCF roadblocks have uncovered activities of either the cohesin complex or the CTCF roadblocks that could explain how cohesin would be able to extrude hundreds of kb and ignore dozens of intervening roadblocks, and then stop only when it encounters the two boundaries that form the beat-IV meta-loop.  The meta-loop in panel A is even more problematic in that it is a circle-loop, a topology that can’t be generated by cohesin extruding a loop until it comes into contact with CTCF roadblocks on the extruded strands.

      Furthermore, there are many parts of the manuscript that are difficult to follow. There are some minor errors in the labelling of the figures that if fixed would help elevate understanding. Lastly, there are several major points that if elaborated on, would potentially be helpful for the clarity of the manuscript.

      Major Points:

      (1) The authors suggest and attempt to visualize in the supplemental figures, that loop extrusion mechanisms would appear during crosslinking and show as vertical stripes in the micro-C data. In order to see stripes, a majority of the nuclei would need to undergo loop extrusion at the same rate, starting from exactly the same spots, and the loops would also have to be released and restarted at the same rate. If these patterns truly result from loop extrusion, the authors should provide experimental evidence from another organism undergoing loop extrusion.

      (3.2) We don’t know of any reports that actually document cohesin extrusion events that are forming TADs (TADs as defined in our paper, in the RCMC experiments of Goel et al. (Goel et al. 2023), in response #1.1, or in the high-resolution images from the MicroC data of Krietenstein et al. (Krietenstein et al. 2020) and Hsieh et al. (Hsieh et al. 2020)).  However, an extruding cohesin complex would be expected to generate stripes because it transiently brings together the two chromatin strands, as illustrated by the broken zipper in Figure Supplemental 2 of our paper.  While stripes generated by cohesin forming a TAD have not to our knowledge ever been observed, Fig. 4 in Goel et al. (Goel et al. 2023) shows 45° stripes outlining TADs and connecting neighboring TADs.  These stripes are visible with or without Rad21.

      In some versions of the loop extrusion model, cohesin extrudes a loop until it comes to a halt at both boundaries, where it then remains holding the loop together.  In this model, the extrusion event would occur only once per cell cycle.  This is the reason we selected NC14 embryos, as this point in development should provide by far the best opportunity to visualize cohesin-dependent TAD formation.  However, the expected stripes generated by cohesin embrace of both strands of the extruding loop were not evident.  Other newer versions of the loop extrusion model are much more dynamic: cohesin extrudes the loop, coming to a halt at the two boundaries, but either doesn’t remain stably bound or breaks through one or both boundaries.  In the former case, the TAD needs to be reestablished by another extrusion event, while in the latter case LDC domains are generated.  In this dynamic model, we should also be able to observe vertical and 45° stripes (or stripes leaning to one side or another of the loading site if the extrusion rates aren’t equal on both fibers) in NC14 embryos corresponding to the formation of TADs and LDC domains.  However, we don’t.
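      The geometry behind the expected stripes can be illustrated with a toy calculation. The sketch below is a deliberately simplified model, not a simulation used in the paper: it assumes a single extruder loaded at one site that reels in each flanking fiber at a constant rate, ignores boundaries and unloading, and reports where the two loop anchors would appear in a contact map rotated by 45° (horizontal axis: midpoint of the two anchors; vertical axis: half their separation). The function names and rates are hypothetical.

```python
def stripe_coordinates(load_site, left_rate, right_rate, n_steps):
    """Trace the contact made by the two loop anchors over time.

    Returns a list of (midpoint, half_separation) pairs, i.e. the position
    of the anchor-anchor contact in a 45-degree-rotated contact map.
    """
    coords = []
    for t in range(1, n_steps + 1):
        left = load_site - left_rate * t    # anchor moving upstream
        right = load_site + right_rate * t  # anchor moving downstream
        coords.append(((left + right) / 2, (right - left) / 2))
    return coords


def stripe_shape(coords):
    """Classify the trace as a vertical, 45-degree, or leaning stripe."""
    dx = coords[-1][0] - coords[0][0]
    dy = coords[-1][1] - coords[0][1]
    if dx == 0:
        return "vertical"
    if abs(dx) == abs(dy):
        return "45-degree"
    return "leaning"


# Equal rates on both fibers: the contact stays above the loading site.
symmetric = stripe_shape(stripe_coordinates(100, 1, 1, 10))
# One anchor held in place (for example, halted at a boundary).
one_sided = stripe_shape(stripe_coordinates(100, 0, 1, 10))
# Unequal rates on the two fibers.
unequal = stripe_shape(stripe_coordinates(100, 1, 3, 10))

print(symmetric, one_sided, unequal)
```

      In this toy model, equal rates place every contact directly above the loading site (a vertical stripe), a stationary anchor on one side gives a 45° stripe, and unequal rates give a stripe that leans toward the faster side.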

      (2) On lines 311-314, the authors discuss that stem-loops generated by cohesin extrusion would possibly be expected to have more next-next-door neighbor contacts than next-door neighbor contacts and cite their models in Figure 1. Based on the boundary:boundary pairing models in the same figure, would the stem-loops created by head-to-tail pairing also have the same phenotype, making enrichment of next-next-door neighbor contacts possible in both situations? The concepts in the text are not clear, and the diagrams are not well-labeled relative to the two models.

      (3.3) Yes, we expect that stem-loops formed by cohesin extrusion or head-to-tail pairing would behave in a similar manner.  They could be stem-loops separated by unanchored loops as shown in Fig. 1B and E.  Alternatively, adjacent loops could be anchored to each other (by cohesin/CTCF roadblocks or by pairing interactions) as indicated in Fig. 1C and F.  In stem-loops generated either by cohesin extrusion or by head-to-tail pairing, next-next-door neighbors should interact with each other, generating a plume above the volcano triangle.  In the case of circle-loops, the volcano triangle should be flanked by clouds that are generated when the TAD bumps into both next-door neighbors.  In the accompanying paper, we test this idea by deleting the nhomie boundary and then a) inserting nhomie back in the reverse orientation, or b) inserting homie in the forward orientation.  The MicroC patterns fit with the predictions that were made in this paper.

      (3) The authors appear to cite Chen et al., 2018 as a reference for the location of these transgenes being 700 nm away in a majority of the nuclei. However, the exact transgenes in this manuscript do not appear to have been measured for distance. The authors could do this experiment and include expression measurements.

      (3.4) The transgenes used in Chen et al. are modified versions of a transgene used in Fujioka et al. (2016) inserted into the same attP site.  When we visualize reporter transcription in NC14 embryos driven by the eve enhancers using smFISH, HCR-FISH or DIG, only a subset of the nuclei at this stage are active.  The number of active nuclei we detect is similar to that observed in the live imaging experiments of Chen et al.  The reason we cited Chen et al. (Chen et al. 2018) was that they found that proximity was a critical factor in determining whether the reporter was activated or not in a given nucleus.  The actual distance they measured wasn’t important.  Moreover, as we discussed in response #2.6 above, there are good reasons to think that the “precise” distances measured in live imaging experiments like those used in Chen et al. are incorrect.  However, their statements are certainly correct if one considers that a distance of ~700 nm or so is “more distant” relative to a distance of ~300 nm or so, which is “closer.”

      (4) The authors discuss the possible importance of CTCF orientation in forming the roadblock to cohesin extrusion and discuss that Homie orientation in the transgene may impact Homie function as an effective roadblock. However, the Homie region inserted in the transgene does not contain the CTCF motif. Can the authors elaborate on why they feel the orientation of Homie is important in its ability to function as a roadblock if the CTCF motif is not present? Trans-acting factors responsible for Homie function have not been identified and this point is not discussed in the manuscript.

      We discussed the “importance” of CTCF orientation in forming roadblocks because one popular version of the cohesin loop extrusion/CTCF roadblock model postulates that CTCF must be oriented so that the N-terminus of the protein is facing towards the oncoming cohesin complex, otherwise it won’t be able to halt extrusion on that strand.  When homie in the transgene is pointing towards the eve locus, the reporter on the other side (farther from eve) is activated by the eve enhancers.  One possible way to explain this finding (if one believes the loop extrusion model) is that when homie is inverted, it can’t stop the oncoming cohesin complex, and it runs past the homie boundary until it comes to a stop at a properly oriented boundary farther away.  In this case, the newly formed loop would extend from the boundary that stopped cohesin to the homie boundary in the eve locus, and would include not only the distal reporter, but also the proximal reporter.  If both reporters are in the same loop with the eve enhancers (which they would have to be given the mechanism of TAD formation by loop extrusion), both reporters should be activated.  They are not.

      For the boundary pairing model, the reporter that will be activated will depend upon the orientation of the pairing interaction, which can be either head-to-head or head-to-tail (or both: see discussion of LBC elements in #2.1).  For an easy visualization of how the orientation of pairing interactions is connected to the patterns of interactions between sequences neighboring the boundary, please look at Fig. 9.  This figure shows two different meta-loops.  In panel A, head-to-head pairing of the blue and purple boundaries brings together, on the one hand, sequences upstream of the blue and purple boundaries, and on the other hand, sequences downstream of the blue and purple boundaries.  In the circle-loop configuration, the resulting rectangular boxes of enhanced contact are located in the upper left and lower right of the contact map.  In panel B, the head-to-tail pairing of the blue and purple boundaries changes how sequences upstream and downstream of the blue and purple boundaries interact with each other.  Sequences upstream of the blue boundary interact with sequences downstream of the purple boundary, and this gives the rectangular box of enhanced interactions on the top right.  Sequences downstream of the blue boundary interact with sequences upstream of the purple boundary, and this gives the rectangular box of enhanced contact on the lower left.
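      The relationship between pairing orientation and the contact map described above can be summarized as a small lookup table. The sketch below simply restates the two cases in the text for the blue and purple boundaries; the dictionary layout and function name are illustrative assumptions, and the quadrant labels follow the description of Fig. 9 given above.

```python
# Flank pairs are written as (flank of blue boundary, flank of purple boundary).
# Entries restate the description of Fig. 9; the structure itself is hypothetical.
PAIRING_PATTERNS = {
    "head-to-head": {
        "topology": "circle-loop",  # panel A
        "flank_contacts": [("upstream", "upstream"),
                           ("downstream", "downstream")],
        "quadrants": ["upper left", "lower right"],
    },
    "head-to-tail": {
        "topology": "stem-loop",  # panel B
        "flank_contacts": [("upstream", "downstream"),
                           ("downstream", "upstream")],
        "quadrants": ["upper right", "lower left"],
    },
}


def purple_flank_contacted(orientation, blue_flank):
    """Return the purple-boundary flank that contacts the given blue flank."""
    for blue, purple in PAIRING_PATTERNS[orientation]["flank_contacts"]:
        if blue == blue_flank:
            return purple
    raise ValueError(f"unknown flank: {blue_flank!r}")


for name, pattern in PAIRING_PATTERNS.items():
    print(name, pattern["topology"], pattern["quadrants"])
```

      For example, `purple_flank_contacted("head-to-tail", "upstream")` returns `"downstream"`, which corresponds to the box of enhanced contacts at the top right of the map.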

      CTCF: Our analysis of the homie boundary suggests that CTCF contributes little to its activity.  It has an Su(Hw) recognition sequence and a CP190 “associated” sequence.  Mutations in both compromise boundary activity (blocking and -142 kb pairing).  Gel shift experiments and ChIP data indicate there are half a dozen or more additional proteins that associate with the 300 bp homie fragment used in our experiments.

      Orientation of CTCF or other protein binding sites:  The available evidence suggests that orientation of the individual binding sites is not important (Kyrchanova et al. 2016; Lim et al. 2018).  Instead, it is likely that the order of binding sites affects function.

      (5) The imaging results seem to be consistent with both boundary:boundary interaction and loop extrusion stem looping.

      It is not clear whether the reviewer is referring to the different patterns of reporter expression (which clearly don’t fit with the loop extrusion model in the key cases that distinguish the two models) or to the live imaging experiments in Chen et al. (Chen et al. 2018).

      (6) The authors suggest that the eveMa TAD could only be formed by extrusion after the breakthrough of Nhomie and several other roadblocks. Additionally, the overall long-range interactions with Nhomie appear to be less than the interactions with endogenous Homie (Figures 7, 8, and supplemental 5). Is it possible that in some cases boundary:boundary pairing is occurring between only the transgenic Homie and endogenous Homie and not including Nhomie?

      Yes, it is possible.  On the other hand, the data that are currently available support the idea that transgene homie usually interacts with endogenous homie and nhomie at the same time.  This is discussed in #2.6D above.  The viewpoints indicate that crosslinking occurs more frequently to homie than to nhomie.  This could indicate that when there are only pairwise interactions, these tend to be between homie and homie.  Alternatively, this could also be explained by a difference in relative crosslinking efficiency.

      (7) In Figure 4E, the GFP hebe expression shown in the LhomieG Z5 transgenic embryo does not appear in the same locations as the LlambdaG Z5 control. Is this actually hebe expression or just a background signal?

      The late-stage embryos shown in E are oriented differently.  For GlambdaL, the embryo is oriented so that hebe-like reporter expression on the ventral midline is readily evident.  However, this orientation is not suitable for visualizing eve enhancer-dependent expression of the reporters in muscle progenitor cells.  For this reason, the 12-16 hr GeimohL embryo in E is turned so that the ventral midline isn’t readily visible in most of the embryo.  As is the case in NC14 embryos, the eve enhancers drive lacZ but not gfp expression in the muscle progenitor cells.

      (8) Figure 6- The LhomieG Z3 (LeimohG) late-stage embryo appears to be showing the ventral orientation of the embryo rather than the lateral side of the embryo as was shown in the previous figure. Is this for a reason? Additionally, there are no statistics shown for the Z3 transgenic images.

      Were these images analyzed in the same way as the Z5 line images?

      The LeimohG embryo was turned so that the hebe enhancer-dependent expression of lacZ is visible.  While the eve enhancer-dependent expression of lacZ in the muscle progenitor cells isn’t visible with this orientation, eve enhancer-dependent expression in the anal plate is.

      (9) Do the Micro-C data align with the developmental time points used in the smFISH probe assays?

      The MicroC data align with the smFISH images of older embryos: 12-14 hour embryos or stages 14-16.

      Recommendations for the authors:   

      Reviewer #1 (Recommendations For The Authors):

      This was a difficult paper to review. It took me several hours to understand the terminology and back and forth between different figures to put it together. It might be useful to put the loop models next to the MicroC results and have a cartoon way of incorporating which enhancers are turning on which reporters.

      I also found the supercoiled TAD models in Figure 1 not useful. These plectoneme-type of structures likely do not exist, based on the single-cell chromosome tracing studies, and the HiC structures not showing perpendicular to diagonal interactions between the arms of the plectonemes.

      We wanted to represent the TAD as a coiled 30 nm fiber, as TADs are not likely to resemble large loops like those shown in Fig. 1 A, D, and G.

      There are no stripes emerging from homies, which is consistent with the pairing model, but there seem to be stripes from the eve promoter. I think these structures may be a result of both the underlying loop extruders + pairing elements.

      There are internal structures in the eve TAD that link the upstream region of the eve promoter to the eve PRE and sequences in nhomie.  All three of these sequences are bound by LBC.  Each of the regulatory domains in BX-C also has LBC elements and, as shown in Author response image 1, you can see stripes connecting some of these LBC elements to each other.  Since the stripes that Goel et al. (Goel et al. 2023) observed in their RCMC analysis of Ppm1g didn’t require cohesin, how these stripes are generated (active: e.g., a chromatin remodeler, or passive: e.g., the LBC complex has non-specific DNA binding activity that can be readily crosslinked as the chromatin fiber slides past) isn’t clear.

      The authors say there are no TADs that have "volcano plumes" but the leftmost TAD TA appears to have one. What are the criteria for calling the plumes? I am also not clear why there is a stripe off the eve volcano. It looks like homie is making a "stripe" loop extrusion type of interaction with the next TAD up. Is this maybe cohesin sliding off the left boundary?

      The reviewer is correct, the left-most TAD TA appears to have a plume.  We mentioned TA seems to have a plume in the original text, but it was inadvertently edited out.

      Two different types of TAD↔TAD interactions are observed.  In the case of eve, the TADs to either side of eve interact more frequently with each other than they do with eve.  This generates a “plume” above the eve volcano triangle.  The TADs that comprise the Abd-B regulatory domains (see Author response image 1) are surrounded by clouds of diminishing intensity.  Clouds at the first level represent interactions with both next-door neighbors; clouds at the second level represent interactions with both next-next-door neighbors; clouds at the third level represent interactions with next-next-next-door neighbors.  The Abd-B TADs are close to the same size, so that interactions with neighbors are relatively simple.  However, this is not always the case.  When there are smaller TADs near larger TADs, the pattern of interaction can be quite complicated.  An example is indicated by the red bar in Author response image 2.

      The authors state "In the loop-extrusion model, a cohesin complex initiating loop extrusion in the eve TAD must break through the nhomie roadblock at the upstream end of the eve TAD. It must then make its way past the boundaries that separate eve from the attP site in the hebe gene, and come to a halt at the homie boundary associated with the lacZ reporter." Having multiple loops formed by cohesin would also bring in the 142kb apart reporter and homie. Does cohesin make 140 kb long loops in flies?

      A mechanism in which cohesin brings the reporter close to the eve TAD by generating many smaller loops (which would be the intervening TADs) was discussed in #1.2.

      Figure 5 title mistakes the transgene used?

      Fixed.

      In figure 6, the orientation of the embryos does not look the same for the late-stage panels. So it was difficult to tell if the eve enhancer was turning the reporter on.

      Here we were focusing mainly on the AP enhancer activation of the reporter, as this is most easily visualized.  It should be clear from the images that the appropriate reporter is activated by the AP enhancer for each of the transgene inserts.

      It is not clear to me why the GFP makes upstream interactions (from the 4C viewpoint) in GhomieLZ5 but not in LhomieGZ5? Corresponding interactions for Fig Supp 5 & 6 are not the same. That is, LacZ in the same place and with the same homie orientation does not show a similar upstream enrichment as the GFP reporter does.

      We are uncertain as to whether we understand this question/comment.  In GhomieLZ5 (now GhomieL), the lacZ reporter is on the eve side of the homie boundary while gfp is on the hebe enhancer side of the homie boundary.  Since homie is pointing away from gfp, pairing interactions with homie and nhomie in the eve locus bring the eve enhancers in close proximity with the gfp reporter.  This is what is seen in the lower trace in Fig. 7 panel D.  In LhomieGZ5 (now GeimohL), the lacZ reporter is again on the eve side of the homie boundary while gfp is on the hebe enhancer side of the homie boundary.  However, in this case homie is inverted so that it points away from lacZ (towards gfp).  In this orientation, pairing brings the lacZ reporter into contact with the eve enhancers.  This is what is seen in the upper trace in Fig. 7 panel D.

      The orientation of the transgene is switched in Fig. Supp 5 and 6.  For these “Z3” transgenes (now called LeimohG and LhomieG), the gfp reporter is on the eve side of homie while the lacZ reporter is on the hebe enhancer side of homie.  The interactions between the reporters and eve are determined by the orientation of homie in the transgene.  When homie is pointing away from gfp (as in LeimohG), gfp is activated and that is reflected in the trace in Supp Fig. 5.  When homie is pointing away from lacZ, lacZ is activated and this is reflected (though not as cleanly as in other cases) in the trace in Supp Fig. 6.

      I did not see a data availability statement. Is the data publicly available? The authors also should consider providing the sequences of the insertions, or provide the edited genomes, in case other researchers would like to analyze the data.

      Data have been deposited.

      Reviewer #3 (Recommendations For The Authors):

      Minor Points:

      (1) There is an inconsistency in the way that some of the citations are formatted. Some citations have 'et al' italicized while others do not. It seems to be the same ones throughout the manuscript. Some examples: Chetverina et al 2017, Chetverina et al 2014, Cavalheiro et al 2021, Kyrchanova et al 2008a, Muravyova et al 2001.

      Fixed

      (2) Pita is listed twice in line 48.

      Fixed

      (3) Line 49, mod(mdg4)67.2 is written just as mod(mdg4). The isoform should be indicated.

      This refers to all Mod isoforms.

      (4) Homie and Nhomie are italicized throughout the manuscript and do not need to be.

      This is the convention used previously.  

      (5) The supplemental figure captions 1 and 2 in the main document are ordered differently than in the supplemental figures file. This caused it to look like the figures are being incorrectly cited in lines 212-214 and 231-232.

      Fixed

      (6) Is the correct figure being cited in line 388-389? The line cites Figure 6E when mentioning LlambdaG Z5; however, LlambdaG Z5 is not shown in Figure 6.

      Fixed

      (7) Section heading 'LhomieG Z5 and GhomieL Z5' could be renamed for clarity. GhomieL Z5 results are not mentioned until the next section, named 'GhomieL Z5'.

      Fixed

      (8) Can the authors provide better labeling for control hebe expression? This would help to determine what is hebe expression and what is background noise in some of the embryos in Figures 4-6.

      Author response image 5 shows expression of the lacZ reporter in GeimohL and GlambdaL.  For the GlambdaL transgene, the hebe enhancers drive lacZ expression in 12-16 hr embryos.  Note that lacZ expression is restricted to a small set of quite distinctive cells along the ventral midline.  lacZ is also expressed on the ventral side of the GeimohL embryo (top panel).  However, the locations of these lacZ-positive cells are quite different from those in the GlambdaL transgene embryo.  These cells are displaced from the midline, and are arranged as pairs of cells in each hemisegment, locations that correspond to eve-expressing cells in the ventral nerve cord.  The eve enhancers also drive lacZ expression elsewhere in the GeimohL embryo, including the anal plate and dorsal muscle progenitor cells (seen most clearly in the lower left panel).

      Author response image 5.

      lacZ expression in GeimohL and GlambdaL embryos

      (9) The Figure 5 title is labeled with the wrong transgene.

      Fixed

      (10) Heat map scales are missing for Figures 7, supplemental 5, and supplemental 6.

      Fixed

      (11) Did the authors check if there was a significant difference in the expression of GFP and lacZ from lambda control lines to the Homie transgenic lines?

      Yes.  Statistical analysis added in Table Supplemental #1

      (12) The Figure 7 title references that these are Z3 orientations, however, it is Z5 orientations being shown.

      Fixed

      (13) The virtual 4C data should include an axis along the bottom of the graphs for better clarity. An axis is missing in all 4C figures.

      References:

      Bantignies F, Grimaud C, Lavrov S, Gabut M, Cavalli G. 2003. Inheritance of polycomb-dependent chromosomal interactions in drosophila. Genes Dev. 17(19):2406-2420.

      Batut PJ, Bing XY, Sisco Z, Raimundo J, Levo M, Levine MS. 2022. Genome organization controls transcriptional dynamics during development. Science. 375(6580):566-570.

      Bonchuk A, Boyko K, Fedotova A, Nikolaeva A, Lushchekina S, Khrustaleva A, Popov V, Georgiev P. 2021. Structural basis of diversity and homodimerization specificity of zinc-finger-associated domains in drosophila. Nucleic Acids Res. 49(4):2375-2389.

      Bonchuk AN, Boyko KM, Nikolaeva AY, Burtseva AD, Popov VO, Georgiev PG. 2022. Structural insights into highly similar spatial organization of zinc-finger associated domains with a very low sequence similarity. Structure. 30(7):1004-1015.e1004.

      Chen H, Levo M, Barinov L, Fujioka M, Jaynes JB, Gregor T. 2018. Dynamic interplay between enhancer–promoter topology and gene activity. Nat Genet. 50(9):1296.

      Fedotova AA, Bonchuk AN, Mogila VA, Georgiev PG. 2017. C2h2 zinc finger proteins: The largest but poorly explored family of higher eukaryotic transcription factors. Acta Naturae. 9(2):47-58.

      Foe VE. 1989. Mitotic domains reveal early commitment of cells in drosophila embryos. Development. 107(1):1-22.

      Fujioka M, Mistry H, Schedl P, Jaynes JB. 2016. Determinants of chromosome architecture: Insulator pairing in cis and in trans. PLoS Genet. 12(2):e1005889.

      Galloni M, Gyurkovics H, Schedl P, Karch F. 1993. The bluetail transposon: Evidence for independent cis-regulatory domains and domain boundaries in the bithorax complex. The EMBO Journal. 12(3):1087-1097.

      Goel VY, Huseyin MK, Hansen AS. 2023. Region capture micro-c reveals coalescence of enhancers and promoters into nested microcompartments. Nat Genet. 55(6):1048-1056.

      Hsieh TS, Cattoglio C, Slobodyanyuk E, Hansen AS, Rando OJ, Tjian R, Darzacq X. 2020. Resolving the 3d landscape of transcription-linked mammalian chromatin folding. Mol Cell. 78(3):539-553.e538.

      Ke W, Fujioka M, Schedl P, Jaynes JB. 2024. Chromosome structure ii: Stem-loops and circle-loops. eLife.

      Krietenstein N, Abraham S, Venev SV, Abdennur N, Gibcus J, Hsieh TS, Parsi KM, Yang L, Maehr R, Mirny LA et al. 2020. Ultrastructural details of mammalian chromosome architecture. Mol Cell. 78(3):554-565.e557.

      Kyrchanova O, Ibragimov A, Postika N, Georgiev P, Schedl P. 2023. Boundary bypass activity in the abdominal-b region of the drosophila bithorax complex is position dependent and regulated. Open Biol. 13(8):230035.

      Kyrchanova O, Kurbidaeva A, Sabirov M, Postika N, Wolle D, Aoki T, Maksimenko O, Mogila V, Schedl P, Georgiev P. 2018. The bithorax complex iab-7 polycomb response element has a novel role in the functioning of the fab-7 chromatin boundary. PLoS Genet. 14(8):e1007442.

      Kyrchanova O, Mogila V, Wolle D, Deshpande G, Parshikov A, Cleard F, Karch F, Schedl P, Georgiev P. 2016. Functional dissection of the blocking and bypass activities of the fab-8 boundary in the drosophila bithorax complex. PLoS Genet. 12(7):e1006188.

      Kyrchanova O, Sabirov M, Mogila V, Kurbidaeva A, Postika N, Maksimenko O, Schedl P, Georgiev P. 2019a. Complete reconstitution of bypass and blocking functions in a minimal artificial fab7 insulator from drosophila bithorax complex. Proceedings of the National Academy of Sciences.201907190.

      Kyrchanova O, Wolle D, Sabirov M, Kurbidaeva A, Aoki T, Maksimenko O, Kyrchanova M, Georgiev P, Schedl P. 2019b. Distinct elements confer the blocking and bypass functions of the bithorax fab-8 boundary. Genetics.genetics. 302694.302019.

      Li H-B, Muller M, Bahechar IA, Kyrchanova O, Ohno K, Georgiev P, Pirrotta V. 2011. Insulators, not polycomb response elements, are required for long-range interactions between polycomb targets in drosophila melanogaster. Mol Cell Biol. 31(4):616-625.

      Li X, Tang X, Bing X, Catalano C, Li T, Dolsten G, Wu C, Levine M. 2023. Gaga-associated factor fosters loop formation in the drosophila genome. Mol Cell. 83(9):1519-1526.e1514.

      Lim B, Heist T, Levine M, Fukaya T. 2018. Visualization of transvection in living drosophila embryos. Mol Cell. 70(2):287-296.e286.

      Link N, Kurtz P, O'Neal M, Garcia-Hughes G, Abrams JM. 2013. A p53 enhancer region regulates target genes through chromatin conformations in cis and in trans. Genes Dev. 27(22):2433-2438.

      Mohana G, Dorier J, Li X, Mouginot M, Smith RC, Malek H, Leleu M, Rodriguez D, Khadka J, Rosa P et al. 2023. Chromosome-level organization of the regulatory genome in the drosophila nervous system. Cell. 186(18):3826-3844.e3826.

      Muller M, Hagstrom K, Gyurkovics H, Pirrotta V, Schedl P. 1999. The mcp element from the drosophila melanogaster bithorax complex mediates long-distance regulatory interactions. Genetics. 153(3):1333-1356.

      Postika N, Metzler M, Affolter M, Müller M, Schedl P, Georgiev P, Kyrchanova O. 2018. Boundaries mediate long-distance interactions between enhancers and promoters in the drosophila bithorax complex. PLoS Genet. 14(12):e1007702.

      Rollins RA, Morcillo P, Dorsett D. 1999. Nipped-b, a drosophila homologue of chromosomal adherins, participates in activation by remote enhancers in the cut and ultrabithorax genes. Genetics. 152(2):577-593.

      Samal B, Worcel A, Louis C, Schedl P. 1981. Chromatin structure of the histone genes of d. Melanogaster. Cell. 23(2):401-409.

      Shermoen AW, McCleland ML, O'Farrell PH. 2010. Developmental control of late replication and s phase length. Curr Biol. 20(23):2067-2077.

      Shidlovskii YV, Bylino OV, Shaposhnikov AV, Kachaev ZM, Lebedeva LA, Kolesnik VV, Amendola D, De Simone G, Formicola N, Schedl P et al. 2021. Subunits of the pbap chromatin remodeler are capable of mediating enhancer-driven transcription in drosophila. Int J Mol Sci. 22(6).

      Sigrist CJ, Pirrotta V. 1997. Chromatin insulator elements block the silencing of a target gene by the drosophila polycomb response element (pre) but allow trans interactions between pres on different chromosomes. Genetics. 147(1):209-221.

      Udvardy A, Schedl P. 1984. Chromatin organization of the 87a7 heat shock locus of drosophila melanogaster. J Mol Biol. 172(4):385-403.

      Vazquez J, Muller M, Pirrotta V, Sedat JW. 2006. The mcp element mediates stable long-range chromosome-chromosome interactions in drosophila. Molecular Biology of the Cell. 17(5):2158-2165.

      Wolle D, Cleard F, Aoki T, Deshpande G, Schedl P, Karch F. 2015. Functional requirements for fab-7 boundary activity in the bithorax complex. Mol Cell Biol. 35(21):3739-3752.

    1. Author response:

      The following is the authors’ response to the original reviews.

      In this important paper, the authors propose a computational model for understanding how the dynamics of neural representations may lead to specific patterns of errors as observed in working memory tasks. The paper provides solid evidence showing how a two-area model of sensory-memory interactions can account for the error patterns reported in orientation estimation tasks with delays. By integrating ideas from efficient coding and attractor networks, the resulting theoretical framework is appealing, and nicely captures some basic patterns of behavior data and the distributed nature of memory representation as reported in prior neurophysiological studies. The paper can be strengthened if (i) further analyses are conducted to deepen our understanding of the circuit mechanisms underlying the behavior effects; (ii) the necessity of the two-area network model is better justified; (iii) the nuanced aspects of the behavior that are not captured by the current model are discussed in more detail.

      We thank the Editors and Reviewers for their constructive comments. In response to the suggestions provided, we have implemented the following revisions:

      - Clarified the origin of the specific pattern of diffusion: We showed that variance patterns remain consistent across different noise types or levels in new Figure 5 – Figure supplement 2 and Figure 9 – Figure supplement 1 (uniform Gaussian noise with varying strengths). This is connected to the representation geometry induced by heterogeneous connections (Eq. 21).

      - Provided an intuitive explanation of the two-module network’s advantages: Additional simulations demonstrated that the degree of heterogeneity of the sensory connections and the strength of the intermodal connections affect the drift and diffusion terms differently (new Figure 6). This provides an extra degree of freedom for controlling heterogeneity in the drift and diffusion terms in the two-module network (new Figure 9).

      - Addressed a limitation and future directions in the Discussion: Our study is limited to the dynamic evolution of memory representation for a single orientation stimulus and its associated error patterns. We acknowledge the need for further investigation to capture nuanced error patterns in broader experimental settings, such as changes in error patterns for varying stimulus presentation durations in perception tasks. We have discussed potential extensions, such as incorporating more biologically plausible baseline activities, external noise, or variations of loss functions.

      Additionally, we showed consistent error patterns when decoded from activities of the sensory module (Figure 4 – Figure supplement 1), and incorrect error patterns with autapses in the sensory module (Figure 7 – Figure supplement 2). Below, we have reorganized each Reviewer’s comments and addressed them separately. All changes are shown in red in the manuscript submitted as a Related Manuscript File.

      Reviewer #1:

      Summary:

      Working memory is imperfect - memories accrue errors over time and are biased towards certain identities. For example, previous work has shown memory for orientation is more accurate near the cardinal directions (i.e., variance in responses is smaller for horizontal and vertical stimuli) while being biased towards diagonal orientations (i.e., there is a repulsive bias away from horizontal and vertical stimuli). The magnitudes of errors and biases increase the longer an item is held in working memory and when more items are held in working memory (i.e., working memory load is higher). Previous work has argued that biases and errors could be explained by increased perceptual acuity at cardinal directions. However, these models are constrained to sensory perception and do not explain how biases and errors increase over time in memory. The current manuscript builds on this work to show how a two-layer neural network could integrate errors and biases over a memory delay. In brief, the model includes a 'sensory' layer with heterogeneous connections that lead to the repulsive bias and decreased error in the cardinal directions. This layer is then reciprocally connected with a classic ring attractor layer. Through their reciprocal interactions, the biases in the sensory layer are constantly integrated into the representation in memory. In this way, the model captures the distribution of biases and errors for different orientations that have been seen in behavior and their increasing magnitude with time. The authors compare the two-layer network to a simpler one-layer network model, showing that the one-layer network is harder to tune and shows an attractive bias for memories that have lower error (which is incompatible with empirical results).

      Strengths:

      The manuscript provides a nice review of the dynamics of items in working memory, showing how errors and biases differ across stimulus space. The two-layer neural network model is able to capture the behavioral effects as well as relate to neurophysiological observations that memory representations are distributed across the sensory cortex and prefrontal cortex.

      The authors use multiple approaches to understand how the network produces the observed results. For example, analyzing the dynamics of memories in the low-dimensional representational space of the networks provides the reader with an intuition for the observed effects.

      As a point of comparison with the two-layer network, the authors construct a heterogeneous one-layer network (analogous to a single memory network with embedded biases). They argue that such a network is incapable of capturing the observed behavioral effects but could potentially explain biases and noise levels in other sensory domains where attractive biases have lower errors (e.g., color).

      The authors show how changes in the strength of Hebbian learning of excitatory and inhibitory synapses can change network behavior. This argues for relatively stronger learning in inhibitory synapses, an interesting prediction.

      The manuscript is well-written. In particular, the figures are well done and nicely schematize the model and the results.

      Overall:

      Overall, the manuscript was successful in building a model that captured the biases and noise observed in working memory. This work complements previous studies that have viewed these effects through the lens of optimal coding, extending these models to explain the effects of time in memory. In addition, the two-layer network architecture extends previous work with similar architectures, adding further support to the distributed nature of working memory representations.

      We appreciate the reviewer’s comments that the work successfully explains error patterns of working memory, extends previous models of optimal coding to include temporal effects, and supports the distributed nature of working memory representations. Below, we address the specific concerns of the reviewer.

      Weaknesses:

      Despite its strengths, the manuscript does have some weaknesses.

      Major Point 1: First, as far as we can tell, behavioral data is only presented in schematic form. This means some of the nuances of the effects are lost. It also means that the model is not directly capturing behavioral effects. Therefore, while providing insight into the general phenomenon, the current manuscript may be missing some important aspects of the data.

      Relatedly, the models are not directly fit to behavioral data. This makes it hard for the authors to exclude the possibility that there is a single network model that could capture the behavioral effects. In other words, it is hard to support the authors' conclusion that "....these evolving errors...require network interaction between two distinct modules." (from the abstract, but similar comments are made throughout the manuscript). Such a strong claim needs stronger evidence than what is presented. Fitting to behavioral data could allow the authors to explore the full parameter space for both the one-layer and two-layer network architectures.

      In addition, directly comparing the ability of different model architectures to fit behavioral data would allow for quantitative comparison between models. Such quantitative comparisons are currently missing from the manuscript.

      We agree with the reviewer that incorporating quantitative comparisons to the data will strengthen our results. However, we note the limitations in fitting network models to behavioral data. Previous studies employed drift-diffusion models to fit error patterns observed in visual working memory tasks (Panichello, DePasquale et al. 2019, Gu, Lee et al. 2023). In contrast to these phenomenological models, network models have more parameters, which can cause overfitting. Consequently, we focused on comparing the qualitative differences between one-module and two-module networks, examining whether each network can generate the correct shape of bias and variance patterns. In response to the reviewers’ suggestions, we have revised the manuscript to reinforce our claim by providing an intuitive explanation of the qualitative differences between these two models (see response to your Major Point 3) and conducting additional simulations to support our claim that error patterns are consistent under different noise types or levels (see responses to Major Point 2 of Reviewer 2 and Minor Point 1 of Reviewer 3).

      Major Point 2: To help broaden the impact of the paper, it would be helpful if the authors provided insight into how the observed behavioral biases and/or network structures influence cognition. For example, previous work has argued that biases may counteract noise, leading to decreased variance at certain locations. Is there a similar normative explanation for why the brain would have repulsive biases away from commonly occurring stimuli? Are they simply a consequence of improved memory accuracy? Why isn't this seen for all stimulus domains?

      Previous work has found both diffusive noise and biases increase with the number of items in working memory. It isn't clear how the current model would capture these effects. The authors do note this limitation in the Discussion, but it remains unclear how the current model can be generalized to a multi-item case.

      As pointed out by the reviewer, attractors counteract noise and lead to reduced variance around the attracting locations. However, most attractor models reporting such effects did not consider the interaction of attractor dynamics with the sensory network. For the repulsive biases considered here, previous studies on the sensory stage have theoretically demonstrated that they could lower the discrimination threshold around cardinal orientations (e.g., see Wei and Stocker, 2017). In Wei and Stocker (2017), the authors showed that this relationship between bias and discrimination threshold was observed across many stimulus modalities. In the present study, we demonstrated that the bias and variability patterns naturally emerged from the underlying neural dynamics. Nonetheless, we also noted that color working memory shows attractive biases, which necessitates further study of the underlying neural mechanisms of color perception. A plausible explanation is that the categorical effect dominates color perception and memory processes, as suggested by existing modelling work (Tajima et al., 2016).

      However, we do note a limitation of our current work: it does not capture nuanced error patterns in broader experimental settings, such as variations of perception tasks or memory of multiple items. For instance, while shorter stimulus presentations with no explicit delay lead to larger biases experimentally, our current model, which starts activities from a flat baseline, shows an increase in bias throughout the stimulus presentation. Additionally, the error variance during stimulus presentation is almost negligible compared to that during the delay period, as the external input overwhelms the internal noise. These mismatches during stimulus presentation have minimal impact on activities during the delay period, when the internal dynamics dominate. Nonetheless, the model needs further refinement to accurately reproduce activities during stimulus presentation, possibly by incorporating more biologically plausible baseline activities. Also, a recent Bayesian perception model suggested that different types of noise, such as external noise, or variations in loss functions that adjust tolerance to small errors may help explain various error patterns observed across different modalities (Hahn and Wei, 2024). Even for memories involving multiple items, noise can be critical in determining error patterns, as encoding more items might be equivalent to higher noise for each individual item (Chunharas, Rademaker et al. 2022).

      To make this limitation clear, we included the above response in a new paragraph on limitations and future directions in the Discussion (2nd paragraph in p. 11). Also, we modified the text that previously described that our model can “explain error patterns in both perception and working memory tasks” in p. 3 and p. 5 as

      “explain error patterns in working memory tasks that are similar to those observed in perception tasks.”

      And we added the bias and variance pattern right after the stimulus offset in Figure 4C,D with the following note in p. 6:

      “Note that the variance of errors is nearly zero during stimulus presentation because the external input overwhelms internal noise, which does not fully account for the variability observed during perception tasks (see Discussion).”

      Major Point 3: The role of the ring attractor memory network isn't completely clear. There is noise added in this stage, but how is this different from the noise added at the sensory stage? Shouldn't these be additive? Is the noise necessary?  

      Similarly, it isn't clear whether the memory network is necessary - can it be replaced by autapses (self-connections) in the sensory network to stabilize its representation? In short, it would be helpful for the authors to provide an intuition for why the addition of the memory network facilitates the repulsive bias.

      Internal noise in the circuits is necessary to replicate the variability of the readout in estimating the stimulus because our model did not incorporate external noise (i.e., noise associated with the stimulus). We note the distinct noise implementations in the extension of the previous Bayesian model (Fig. 2) and in the network models (Fig. 3 and beyond). In Fig. 2, we followed previous studies by employing static tuning curves for the sensory module and Poisson noise to account for variability in the perception stage. In the memory stage, sensory output undergoes the addition of constant Gaussian noise, replicating the diffusion process along the memory manifolds as shown in traditional memory network models. In the network models, we consider the same noise in both sensory and memory modules, subjecting all units to Poisson noise to simulate neuronal spiking variability. In the network models, the two modules dynamically interact, which warps the energy landscape and generates uneven noise coefficients along the memory manifold, reminiscent of the conditions shown in Fig. 1.

      From the bias and variance patterns, we can infer two requirements for the network to fulfill: one is efficient coding, suggested by the sensory perception stage, and the other is memory maintenance. The former is achieved by realizing the previous Bayesian models in the sensory networks with specific heterogeneous connections. In our work, the latter is achieved by strong recurrent connections that sustain persistent activity during the delay period. On the other hand, as the reviewer noted, memory can be maintained through autapses in the sensory network, which is equivalent to elongating the intrinsic time constants of individual units (Seung, Lee et al. 2000). We simulated such a sensory network and showed the results in Figure 7 – Figure Supplement 2. As shown in the figure, a larger time constant also slows down the increase in bias significantly, which can be deduced from Eq. 20.
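      The equivalence between autapses and longer intrinsic time constants can be illustrated with a single linear rate unit. This is only a sketch under assumptions: the activity $s$, the autapse weight $w$ (with $0 \le w < 1$), the intrinsic time constant $\tau$, and the input $I(t)$ are illustrative symbols and are not taken from Eq. 20 of the manuscript.

```latex
\begin{align}
\tau \frac{ds}{dt} &= -s + w\,s + I(t) \\
\frac{\tau}{1-w}\,\frac{ds}{dt} &= -s + \frac{I(t)}{1-w}
\end{align}
```

      In this simplified picture, the autapse rescales the time constant to $\tau_{\mathrm{eff}} = \tau/(1-w)$, so every process in the unit slows down by the same factor as $w$ approaches 1. This is consistent with the slower increase in bias described above.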

      When memory is maintained through strong recurrent connections, there are two possible scenarios: a one-module network combining both efficient coding and memory maintenance (Fig. 8), or a two-module network satisfying each condition in a different module (Fig. 7). In both networks, the heterogeneous connections that achieve efficient coding shape the drift and diffusion dynamics similarly, as illustrated in Figure 9 (previous Figure 7 – Supplement 1). Discrete attractors are formed near oblique orientations, inducing an increase of repulsive bias during the delay period. Also, the noise coefficient is lowest at cardinal orientations. However, there is a difference in the degree of asymmetry of the drift and diffusion between cardinal and oblique orientations: the one-module network shows larger asymmetry in potential energy, while the two-module network shows larger asymmetry in the noise coefficient. These varying degrees of heterogeneity in drift and diffusion lead to qualitative differences in bias and variance patterns in estimation. Shallower potential differences with more asymmetrical noise coefficients result in correct bias and variance patterns in the two-module network, while the opposite leads to flipped variance patterns in the one-module network.

      An intuitive explanation of how connectivity heterogeneity differentially affects the degree of asymmetry of drift and diffusion in one-module and two-module networks is detailed in our response to Major Point 3 of Reviewer 2. In summary, separating the memory module from the sensory module provides an additional degree of freedom, allowing for more flexible control over drift and diffusion, and thereby over bias and variance patterns. To clarify this, we have added simulations in Figure 6 and Figure 9 and provided an intuitive explanation in the accompanying text in pp. 6-7 and p. 9.
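      The qualitative argument about drift and diffusion can be illustrated with a toy one-dimensional simulation. The sketch below is not the network model of the manuscript: it assumes an Itô drift-diffusion process on the orientation variable, with hypothetical cosine forms for the potential and the noise coefficient, and the parameters `a` (potential depth), `b` (noise asymmetry), and `d0` (baseline noise) are illustrative values that are not fitted to any data.

```python
import math
import random


def simulate_errors(theta0, a, b, d0=0.002, T=3.0, dt=0.02, n_trials=1000, seed=0):
    """Euler-Maruyama simulation of d(theta) = -U'(theta) dt + sqrt(2 D(theta)) dW.

    Assumed toy forms (hypothetical, for illustration only):
      U(theta) = a * cos(4 theta)        -> maxima at cardinal, minima at oblique
      D(theta) = d0 * (1 - b cos(4 theta)) -> lowest at cardinal, highest at oblique
    Returns the signed errors (radians) at the end of the delay.
    """
    rng = random.Random(seed)
    n_steps = int(round(T / dt))
    errors = []
    for _ in range(n_trials):
        theta = theta0
        for _ in range(n_steps):
            drift = 4.0 * a * math.sin(4.0 * theta)        # equals -U'(theta)
            diff = d0 * (1.0 - b * math.cos(4.0 * theta))  # D(theta)
            theta += drift * dt + math.sqrt(2.0 * diff * dt) * rng.gauss(0.0, 1.0)
        # Orientation has period pi, so wrap the error into [-pi/2, pi/2).
        errors.append((theta - theta0 + math.pi / 2) % math.pi - math.pi / 2)
    return errors


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def summarize(a, b):
    """Error variance at a cardinal and an oblique orientation, and the mean
    error (bias) at the orientation halfway between them (22.5 degrees)."""
    cardinal = simulate_errors(0.0, a, b, seed=1)
    oblique = simulate_errors(math.pi / 4, a, b, seed=2)
    midway = simulate_errors(math.pi / 8, a, b, seed=3)
    return {
        "var_cardinal": variance(cardinal),
        "var_oblique": variance(oblique),
        "bias_mid": mean(midway),
    }


# Shallow potential with strongly asymmetric noise (two-module-like regime).
shallow = summarize(a=0.005, b=0.6)
# Deep potential with weakly asymmetric noise (one-module-like regime).
deep = summarize(a=0.03, b=0.2)

for name, result in (("shallow", shallow), ("deep", deep)):
    print(name, {k: round(v, 5) for k, v in result.items()})
```

      In both regimes the mean error at the intermediate orientation points toward the nearest oblique orientation. In the shallow regime the variance is smaller at the cardinal orientation, whereas in the deep regime the unstable dynamics at the cardinal orientation amplify noise and the variance ordering flips, which mirrors the qualitative difference described above.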

      Minor Point 1: The code is stated to be available on GitHub, but I could not access it.

      Thank you for pointing it out. The repository is now publicly available.

      Minor Point 2: The legend for late/mid/early is in an odd place in Figure 1, as it is in panel E where you can't see the difference between the lines. We would suggest moving this to another panel where the different time points are clear. In general, we would suggest adding more text (legends and titles) to the figure to help the reader understand the figures without having to refer to the details in the text and/or figure legends.

      We have now moved the legend to panel B, where late/mid/early is first introduced. Also, we added more text to the figure legends (Figures 3, 4, 5, and 8).

      Minor Point 3: The last line of the first paragraph of the Introduction ends awkwardly. I assume it's referring to indirect evidence for dynamics in memory?

      Thank you. We have modified the sentence as follows:

      “For instance, biases of errors, the systematic deviation from the original stimuli, observed in estimation tasks have been used as indirect evidence to infer changes in internal representations of stimuli.”

      Minor Point 4: Similarly, the first line of the second paragraph of the Introduction was also awkward. Specifically, the clause "..., such as nonuniform stimulus distribution in nature." Seems to be missing a 'the' before 'nonuniform'.

      We have modified the sentence as follows:

      “One important source of biases is adaptation to environmental statistics, such as the nonuniform stimulus distribution found in nature or the limited range in specific settings.”

      Reviewer #2:

      In this manuscript, Yang et al. present a modeling framework to understand the pattern of response biases and variance observed in delayed-response orientation estimation tasks. They combine a series of modeling approaches to show that coupled sensory-memory networks are in a better position than single-area models to support experimentally observed delay-dependent response bias and variance in cardinal compared to oblique orientations. These errors can emerge from a population-code approach that implements efficient coding and Bayesian inference principles and is coupled to a memory module that introduces random maintenance errors. A biological implementation of such operation is found when coupling two neural network modules, a sensory module with connectivity inhomogeneities that reflect environment priors, and a memory module with strong homogeneous connectivity that sustains continuous ring attractor function. Comparison with single-network solutions that combine both connectivity inhomogeneities and memory attractors shows that two-area models can more easily reproduce the patterns of errors observed experimentally. This, the authors take as evidence that a sensory-memory network is necessary, but I am not convinced about the evidence in support of this "necessity" condition. A more in-depth understanding of the mechanisms operating in these models would be necessary to make this point clear.

      Strengths:

      The model provides an integration of two modeling approaches to the computational bases of behavioral biases: one based on Bayesian and efficient coding principles, and one based on attractor dynamics. These two perspectives are not usually integrated consistently in existing studies, which this manuscript beautifully achieves. This is a conceptual advancement, especially because it brings together the perceptual and memory components of common laboratory tasks.

      The proposed two-area model provides a biologically plausible implementation of efficient coding and Bayesian inference principles, which interact seamlessly with a memory buffer to produce a complex pattern of delay-dependent response errors. No previous model had achieved this.

      We appreciate the reviewer’s comments that the work is a conceptual advancement, combining Bayesian perception models and attractor memory models, and produces error patterns that were not achieved by previous models. Below, we address the specific concerns of the reviewer.

      Major Point 1: The correspondence between the various computational models is not fully disclosed. It is not easy to see this correspondence because the network function is illustrated with different representations for different models and the correspondence between components of the various models is not specified. For instance, Figure 1 shows that a specific pattern of noise is required in the low-dimensional attractor model, but in the next model in Figure 2, the memory noise is uniform for all stimuli. How do these two models integrate? What element in the population-code model of Figure 2 plays the role of the inhomogeneous noise of Figure 1? Also, the Bayesian model of Figure 2 is illustrated with population responses for different stimuli and delays, while the attractor models of Figures 3 and 4 are illustrated with neuronal tuning curves but not population activity. In addition, error variance in the Bayesian model appears to be already higher for oblique orientations in the first iteration whereas it is only first shown one second into the delay for the attractor model in Figure 4. It is thus unclear whether variance inhomogeneities appear already at the perceptual stage in the attractor model, as it does in the population-code model. Of course, correspondences do not need to be perfect, but the reader does not know right now how far the correspondence between these models goes.

      Thank you for pointing out the lack of clarity in the correspondence between different models. We note the distinct noise implementations in the extension of the previous Bayesian model (Fig. 2) and in the network models (Fig. 3 and beyond). In Fig. 2, we followed previous studies by employing static tuning curves for the sensory module and Poisson noise to account for variability in the perception stage. In the memory stage, sensory output undergoes the addition of constant Gaussian noise, replicating the diffusion process along the memory manifolds as shown in traditional memory network models. In the network models in Fig. 3 and beyond, we consider the same noise in both sensory and memory modules, subjecting all units to Poisson noise to simulate neuronal spiking variability. In the network models, the two modules dynamically interact, which warps the energy landscape and generates uneven noise coefficients along the memory manifold, reminiscent of the conditions shown in Fig. 1.

      However, we do note a limitation of the current study: it cannot fully replicate behavioral patterns observed in variations of perception tasks. For instance, while shorter stimulus presentations with no explicit delay lead to larger biases experimentally, our current model, which starts activities from a flat baseline, shows an increase in bias throughout the stimulus presentation. Additionally, the error variance during stimulus presentation is almost negligible compared to that during the delay period, as the external input overwhelms the internal noise. These mismatches during stimulus presentation have minimal impact on activities during the delay period, when the internal dynamics dominate. Nonetheless, the model needs further refinement to accurately reproduce activities during stimulus presentation, possibly by incorporating more biologically plausible baseline activities. To make this limitation clear, we included the above response in a new paragraph on limitations and future directions in the Discussion (2nd paragraph in p. 11). Also, we modified the text that previously described that our model can “explain error patterns in both perception and working memory tasks” in p. 3 and p. 5 as “explain error patterns in working memory tasks that are similar to those observed in perception tasks.”

      And we added the bias and variance pattern right after the stimulus offset in Figure 4C,D with the following note in p. 6:

      “Note that the variance of errors is nearly zero during stimulus presentation because the external input overwhelms internal noise, which does not fully account for the variability observed during perception tasks (see Discussion).”

      Major Point 2: The manuscript does not identify the mechanistic origin in the model of Figure 4 of the specific noise pattern that is required for appropriate network function (with higher noise variance at oblique orientations). This mechanism appears critical, so it would be important to know what it is and how it can be regulated. In particular, it would be interesting to know if the specific choice of Poisson noise in Equation (3) is important. Tuning curves in Figure 4 indicate that population activity for oblique stimuli will have higher rates than for cardinal stimuli and thus induce a larger variance of injected noise in oblique orientations, based on this Poisson-noise assumption. If this explanation holds, one wonders if network inhomogeneities could be included (for instance in neural excitability) to induce higher firing rates in the cardinal/oblique orientations so as to change noise inhomogeneities independently of the bias and thus control more closely the specific pattern of errors observed, possibly within a single memory network.

      The specific pattern of noise coefficient, lower variability at cardinal orientations in the network models, inherited that of the previous Bayesian perception models (Wei and Stocker, 2017). Either in one-module or two-module networks, the specific pattern of heterogeneous connections induces more neurons tuned to cardinal orientations with narrower tuning widths. Such sparser representation near cardinal stimuli generates lower noise variability even with constant Gaussian noise. This is verified in Eq. 21 in Methods, showing the derivation of noise coefficients – with constant Gaussian noise, Eq. 21 is modified as 

      because . Thus, 𝒟(𝜃) is inversely proportional to , which reflects the length travelled on the stable trajectory 𝒔̄(𝜃) when θ increases by one unit. For sparser representation,   becomes larger and 𝒟(𝜃) is reduced. Intuitively, with more neurons tuned to cardinal stimuli, noise is averaged and reduced. In sum, the heterogeneous connection induces the specific noise coefficient, and the choice of Poisson-like noise is not essential, although it facilitates the correct variance pattern. To clarify this point, we have added the results of using uniform Gaussian noise in new Figure 5 – Figure Supplement 2 and Figure 9 – Figure Supplement 1.

      Major point 3: The main conclusion of the manuscript, that the observed patterns of errors "require network interaction between two distinct modules" is not convincingly shown. The analyses show that there is a quantitative but not a qualitative difference between the dynamics of the single memory area compared to the sensory-memory two-area network, for specific implementations of these models (Figure 7 - Figure Supplement 1). There is no principled reasoning that demonstrates that the required patterns of response errors cannot be obtained from a different memory model on its own. Also, since the necessity of the two-area configuration is highlighted as the main conclusion of the manuscript, it is inconvenient that the figure that carefully compares these conditions is in the Supplementary Material.

      Following the suggestion by the reviewer, we moved Figure 7 – Figure supplement 1 to the main text as new Figure 9. As noted by the reviewer, drift dynamics and diffusion projected onto the low-dimensional memory manifold have similar shapes in both one-module and two-module networks, with the lowest potential and highest noise coefficient observed at the oblique orientations. However, there is a difference in the asymmetry degrees of the drift and diffusion at cardinal and oblique orientations: the one-module network shows larger asymmetry in potential energy, while the two-module network shows larger asymmetry in the noise coefficient. These varying degrees of heterogeneity in drift and diffusion lead to qualitative differences in bias and variance patterns in estimation. Shallower potential differences with more asymmetrical noise coefficients result in correct bias and variance patterns in the two-module network, while the opposite leads to flipped variance patterns in the one-module network.

      To intuitively understand how connectivity heterogeneity differentially affects the asymmetry degrees of drift and diffusion in one-module and two-module networks, consider a simple case where only the excitatory connection is heterogeneous, denoted as α. The asymmetry of diffusion reflects the degree of heterogeneity in either the sensory or memory modules. The noise coefficient derived from the low-dimensional projection is mainly determined by the heterogeneity of . While the one-module network, with a much lower α, shows almost flat , the two-module network shows more prominent asymmetry in with a larger α in the sensory module.  

      On the other hand, the asymmetry in the potential energy is influenced differently by the connectivity heterogeneity of the sensory module and that of the memory module. For memory maintenance, overall recurrent connections need to be strong enough to overcome intrinsic decay, simplifying to w = 1. In the one-module network, α in the memory module creates potential differences at cardinal and oblique orientations as 1 ± α. In the two-module network, by contrast, with w = 1 fulfilled by the memory module, α in the sensory module acts as a perturbation. The effect of α is modulated by the connectivity strengths between the sensory and memory modules, denoted by γ. Potential differences at cardinal and oblique orientations can be represented as 1 ± γα. While both α and γ determine the energy level, the noise coefficient depends less on γ (see response to your Major Point 4). Thus, even with a relatively larger α in the sensory module leading to more asymmetrical noise coefficients, the potential difference could be shallower in the two-module network with small γ < 1.

      In sum, in the two-module network, there is an additional degree of freedom, connectivity strengths between sensory and memory modules, which provides the flexibility to control drift and diffusion separately, unlike in the one-module network. To clarify this, we have added simulations in Figure 6 and Figure 9 and provided an intuitive explanation in the accompanying texts in pp. 6-7 and p. 9.
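
      The scaling argument above can be illustrated with a minimal numerical sketch. It uses only the two expressions given in the response, 1 ± α for the one-module network and 1 ± γα for the two-module network; the function name and all numerical values of α and γ are hypothetical and chosen purely for illustration, not taken from the simulations.

```python
def potential_asymmetry(alpha, gamma=1.0):
    """Gap between the two potential levels 1 + gamma*alpha and 1 - gamma*alpha.

    gamma = 1.0 corresponds to the one-module case (levels 1 +/- alpha).
    """
    return 2.0 * gamma * alpha


# Hypothetical parameter values, for illustration only.
one_module = potential_asymmetry(alpha=0.05)             # small heterogeneity, no gamma scaling
two_module = potential_asymmetry(alpha=0.2, gamma=0.2)   # larger alpha, weak inter-module coupling

print(f"one-module potential gap: {one_module:.3f}")
print(f"two-module potential gap: {two_module:.3f}")
```

      With these illustrative numbers, the sensory-module heterogeneity in the two-module case is four times larger, yet the potential gap is smaller because it is scaled by γ < 1.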

      Major Point 4: The proposed model has stronger feedback than feedforward connections between the sensory and memory modules. This is not a common assumption when thinking about hierarchical processing in the brain, and it is not discussed in the manuscript.

      As noted in the previous response, the connectivity strengths between the sensory and memory modules, denoted as γ, are important parameters determining the qualitative features of bias and variance patterns. γ corresponds to the product of Jf and Jb, feedforward and feedback strengths, and our additional simulation shows that the bias and variance patterns remain similar for a fixed γ. Note that further simulation revealed that the heterogeneity degree, α, and the intermodal connectivity strengths, γ, influence the drift and diffusion terms differently. As this result highlights the advantage of the two-module network, we moved the dependence of error patterns on intermodal connectivity strengths to the main figure (previous Figure 5 – Figure supplement 2), which now includes more simulations showing bias and variance patterns for different Jf and Jb and for different α and Jb (new Figure 6). 

      Minor Point 1: page 11: "circular standard deviation of sigma_theta = 1.3º at cardinal orientations" but in Figure 2 we see sigma_theta = 2º at cardinal orientations.

      The circular standard deviation of σ_θ = 1.3º refers to the standard deviation of the sensory module output in iteration 1, that is, before feeding into the memory module to complete this iteration. In Figure 2, the standard deviation plotted is that of the output of the memory module, which has a Gaussian memory noise with standard deviation 1.3º added on top of the sensory output. Hence we see a standard deviation of √(1.3² + 1.3²) ≈ 1.84º, which is close to 2º in the figure. We added a sentence in this paragraph of Methods (p. 13) to avoid confusion.
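
      The arithmetic in this response can be checked with a short sketch. It assumes the sensory and memory noise are independent, and that the angles are small enough for the ordinary (non-circular) rule of adding variances to apply; the function name is hypothetical.

```python
import math


def combined_sd(sensory_sd_deg, memory_sd_deg):
    """SD (in degrees) of the sum of two independent noise sources.

    Assumes independence and a small-angle (linear) approximation,
    so that the variances simply add.
    """
    return math.sqrt(sensory_sd_deg ** 2 + memory_sd_deg ** 2)


total_sd = combined_sd(1.3, 1.3)
print(f"{total_sd:.2f}")  # 1.84
```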

      Minor Point 2: equation (19): What does the prime of ||s'(theta)|| mean?

      The prime denotes the derivative with respect to θ, 𝒔̄′(𝜃) = d𝒔̄(𝜃)/d𝜃, so ‖𝒔̄′(𝜃)‖ reflects the length travelled on the stable trajectory when θ increases by one unit. As we plotted it in Figure 9 and Figure 5 – Figure supplement 2, we clarified it in the legend.

      Minor Point 3: page 15: "The Fisher information (F) is estimated by assuming that the likelihood function p(r|theta) is Gaussian", but the whole point of Wei and Stocker (2015) and your Figure 2 is that likelihoods are skewed in these networks. This could be clarified.

      Thank you for pointing out the lack of clarity. In Wei and Stocker (2015) and our Figure 2, the likelihood is skewed with respect to 𝜃 (note the horizontal axes). However, in the Methods section, we assumed the distribution function 𝑝(𝑟|𝜃) is Gaussian with respect to 𝑟 when 𝜃 is considered fixed:

      where . The distribution function is skewed with respect to 𝜃 because the tuning curves are skewed with respect to 𝜃 (see Figure 4B). We have clarified our assumption in p. 16 to avoid confusion.

      Reviewer #3:

      Summary:

      The present study proposes a neural circuit model consisting of coupled sensory and memory networks to explain the circuit mechanism of the cardinal effect in orientation perception which is characterized by the bias towards the oblique orientation and the largest variance at the oblique orientation.

      Strengths:

      The authors have done numerical simulations and preliminary analysis of the neural circuit model to show the model successfully reproduces the cardinal effect. And the paper is well-written overall. As far as I know, most of the studies on the cardinal effect are at the level of statistical models, and the current study provides one possibility of how neural circuit models reproduce such an effect.

      We appreciate the reviewer’s comments that the work successfully reproduces error patterns through circuit models, advancing beyond previous statistical models. Below, we address the specific concerns of the reviewer.

      Weaknesses:

      There are no major weaknesses and flaws in the present study, although I suggest the author conduct further analysis to deepen our understanding of the circuit mechanism of the cardinal effects. Please find my recommendations for concrete comments.

      Minor Point 1: Likely, the interplay of the potential function (Figure 5D) and the noise amplitude (Figure 5C) in the memory network is the key to reproducing the cardinal effect. For me, it is obvious to understand the spatial profile of the potential function as what it currently looks like (Figure 5D), while I haven't had an intuitive understanding of how the spatial profile of noise structure emerges from the circuit model. Therefore I suggest the authors provide a more comprehensive analysis, including theory and simulation, to demonstrate how the noise structure depends on the network parameters. I am concerned about whether the memory network can still reproduce the minimal variance at the cardinal orientation if we reduce the Fano factor of single neuron variabilities. In this case, the shape of the potential function will be dominant in determining the variance over orientation (Figure 5F) and the result might be reverted.

      Thank you for the suggestion. Either in one-module or two-module networks, the specific pattern of heterogeneous connections induces more neurons tuned to cardinal orientations with narrower tuning widths. Such sparser representation near cardinal stimuli generates lower noise variability even with constant Gaussian noise, which is now added in Figure 5 – Figure Supplement 2. We also showed that the distinctive error patterns in one-module and two-module networks are maintained under Gaussian noise with varying amplitude in Figure 9 – Figure supplement 1.

      Minor Point 2: In addition, it is interesting to show how the representation of the sensory module looks like, e.g., plotting the figures similar to Figures B-F but from the sensory module. I feel the sensory module doesn't have a result similar to Figure 5F. Is it?

      Yes, decoded error patterns obtained from the sensory module are similar to the results obtained from the memory module. We have added Figure 4 – Figure supplement 1 to show that our conclusions remain valid when decoding from the sensory module.

      Minor point 3: Last but not least, I have a conceptual question about the presentation mechanism in the proposed circuit model. The present study refers to Wei, et al., 2015 and 2017 about the statistical model mechanism of the cardinal effect. If I remember correctly, Wei's papers considered joint encoding and decoding processes to render the cardinal effect. Can the authors regard the processes in the proposed circuit model with the stages in the statistical model? Or at least the authors should discuss this link in the Discussions.

      We have now included a mention of using a population vector decoder that mimics Bayesian optimal readout in the Results section (p. 6), in addition to the Discussion and Methods. However, we acknowledge that this decoder is only optimal under a specific loss function. A recent Bayesian perception model suggested that different types of noise, such as external noise, or variations in loss functions that adjust tolerance to small errors may help explain various error patterns observed across different modalities (Hahn and Wei, 2024). We have now added this limitation in the Discussion, along with the inconsistency of the current model with experimental observations during perception tasks and future directions (p. 11).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Strengths: 

      Sarpaning et al. provide a thorough characterization of putative Rnt1 cleavage of mRNA in S. cerevisiae. Previous studies have discovered Rnt1 mRNA substrates anecdotally, and this global characterization expands the known collection of putative Rnt1 cleavage sites. The study is comprehensive, with several types of controls to show that Rnt1 is required for several of these cleavages.

      Weaknesses: 

      (1) Formally speaking, the authors do not show a direct role of Rnt1 in mRNA cleavage - no studies were done (e.g., CLIP-seq or similar) to define direct binding sites. Is the mutant Rnt1 expected to trap substrates? Without direct binding studies, the authors rely on genetics and structure predictions for their argument, and it remains possible that a subset of these sites is an indirect consequence of rnt1. This aspect should be addressed in the discussion.

      We have added to this point in the discussion, as requested. We do not, however, agree that CLIP-seq or other methods are needed to address this point, or would even be helpful in the question the reviewer raises. 

      Importantly, we show that recombinant Rnt1 purified from E. coli cleaves the same sites as those mapped in vivo. This does provide direct evidence that Rnt1 directly binds those RNAs. Furthermore, it shows that it can bind these RNAs without the need of other proteins. Our observation that many mRNAs are cleaved at -14 and +16 positions from NGNN stem loops to leave 2-nt 3’ overhangs provides further support that these are the products of an RNase III enzyme, and Rnt1 is the only family member in yeast. Thus, we disagree with the reviewer that our studies do not show direct targeting.

      CLIP-seq experiments would be valuable, but they would address a different point. CLIP-seq measures protein binding to RNA targets, and it is likely that Rnt1 binds some RNAs without cleaving them. In addition, only a transient interaction is needed for cleavage, and such transient interactions might not be readily detected by CLIP-seq. Thus, CLIP-seq would reveal the RNAs bound by Rnt1, but would not help identify which ones are cleaved. Catala et al (2004) showed that the catalytically inactive mutant of Rnt1 carries out some functions that are important for the cell cycle. The CLIP-seq studies would be valuable to determine these non-catalytic roles of Rnt1, but we consider those questions beyond the scope of the current study.

      (2) The comprehensive list of putative Rnt1 mRNA cleavage sites is interesting insofar as it expands the repertoire of Rnt1 on mRNAs, but the functional relevance of the majority of these sites remains unknown. Along these lines, the authors should present a more thorough characterization of putative Rnt1 sites recovered from in vitro Rnt1 cleavage.

      We have included new data that confirm that YDR514C cleavage by Rnt1 is relevant to yeast cell physiology. We show that YDR514C overexpression is indeed toxic, as we previously postulated. More importantly, we generated an allele of YDR514C that has synonymous mutations designed to disrupt the stem-loop recognized by Rnt1. We show that at 37 °C, both the wild-type and mutant allele are toxic to rnt1∆ cells, but that in cells that express Rnt1, the wild-type cleavable allele is more toxic than the allele with the mutated stem-loop. This genetic interaction provides strong evidence that cleavage of YDR514C by Rnt1 is relevant to cell physiology. 

      We have also added PARE analysis of poly(A)-enriched and poly(A)-depleted reactions and show that compared to Dcp2, Rnt1 preferentially targets poly(A)+ mRNAs, consistent with it targeting nuclear RNAs. We discuss in more detail that by cleaving nuclear RNA, Rnt1 provides a kinetic proofreading mechanism for mRNA export competence.

      (3) The authors need to corroborate the rRNA 3'-ETS tetraloop mutations with a northern analysis of 3'-ETS processing to confirm an ETS processing defect (which might need to be done in decay mutants to stabilize the liberated ETS fragment). They state that the tetraloop mutation does not yield a growth defect and use this as the basis for concluding that rRNA cleavage is not the major role of Rnt1 in vivo, which is a surprising finding. But it remains possible that tetraloop mutations did not have the expected disruptive effect in vivo; if the ETS is processed normally in the presence of tetraloop mutations, it would undermine this interpretation. This needs to be more carefully examined.

      We have removed the rRNA 3'-ETS tetraloop mutations, because initial northern blot analysis indicated that Rnt1 cleavage is not completely blocked by the mutations we designed. Therefore, the reviewer is correct that tetraloop mutations did not have the expected disruptive effect in vivo. Future investigations will be required to fully understand this. This was a minor point, and removing it focuses the paper on its major contributions.

      (4) To support the assertion that YDR514C cleavage is required for normal "homeostasis," and more specifically that it is the major contributor to the rnt1∆ growth defect, the authors should express the YDR514C-G220S mutant in the rDNA∆ strains with mutations in the 3'-ETS (assuming they disrupt ETS processing, see above). This simple experiment should provide a relative sense of "importance" for one or the other cleavage being responsible for the rnt1∆ defect. Given the accepted role of Rnt1 cleavage in rRNA processing and a dogmatic view that this is the reason for the rnt1∆ growth defect, such a result would be surprising and elevate the functional relevance and significance of Rnt1 mRNA cleavage.

      We agree that the experiment proposed by the reviewer is very simple, but we are puzzled by the rationale. First, our experiments do not support that there is anything special about the G220S mutation in YDR514C. A complete loss of function (ydr514c∆) also suppresses the growth defect, suggesting that ydr514c-G220S is a simple loss of function allele. We have clarified that the G220S mutation is distant from the stem-loop recognized by Rnt1 and is unlikely to affect cleavage by Rnt1. Instead, Rnt1 cleavage and the G220S mutation are independent alternative ways to reduce Ydr514c function. We have clarified this point in the text. 

      As mentioned in response to point #3, we have included other additional experiments that address the same overall question raised here – the importance of YDR514C mRNA cleavage by Rnt1.    

      (5) Given that some Rnt1 mRNA cleavage is likely nuclear, it is possible that some of these targets are nascent mRNA transcripts, as opposed to mature but unexported mRNA transcripts, as proposed in the manuscript. A role for Rnt1 in co-transcriptional mRNA cleavage would be conceptually similar to Rnt1 cleavage of the rRNA 3'-ETS to enable RNA Pol I "torpedo" termination by Rat1, described by Proudfoot et al (PMID 20972219). To further delineate this point, the authors could e.g., examine the poly-A tails on abundant Rnt1 targets to establish whether they are mature, polyadenylated mRNAs (e.g., northern analysis of oligo-dT purified material). A more direct test would be PARE analysis of oligo-dT enriched or depleted material to determine the poly-A status of the cleavage products. Alternatively, their association with chromatin could be examined. 

      We have added the requested PARE analysis of oligo-dT enriched or depleted material to determine the polyA status of the cleavage products and related discussions. These confirm our proposal that Rnt1 cleaves mature but unexported mRNA transcripts.

      We also note that the northern blots shown in figures 2E, 4C, and 5B use oligo dT selected RNA because the signal was undetectable when we used total RNA. This suggests that the cleaved mRNAs are indeed polyadenylated. 

      The term nascent is somewhat ambiguous, but if the reviewer means RNA that is still associated with Pol II and has not yet been cleaved by the cleavage and polyadenylation machinery, we think that is inconsistent with our findings. We have also re-analyzed the NET-seq data from https://pubmed.ncbi.nlm.nih.gov/21248844/ and find no prominent peaks for our Rnt1 sites in Pol II associated RNAs, although for BDF2 NET-seq does suggest that “spliceosome-mediated decay” is co-transcriptional as would be expected. Altogether these data confirm our previous proposal that Rnt1 mainly cleaves mRNAs that have completed polyadenylation but are not yet exported.

      (6) While laboratory strains of budding yeast have a single RNase III ortholog Rnt1, several other budding yeast have a functional RNAi system with Dcr and Ago (PMID 19745116), and laboratory yeast strains are a derived state due to pressure from the killer virus to lose the RNAi system (PMID 21921191). The current study could provide new insight into the relative substrate preferences of Rnt1 and budding yeast Dicer, which could be experimentally confirmed by expressing Dcr in RNT1 and rnt1∆ strains. In lieu of experiments, discussion of the relevance of Rnt1 cleavage compared to yeast RNAi should be included in the discussion before the "human implications" section.

      The reviewer points out that most other eukaryotic species have multiple RNase III family members, which is a general point we discussed and have now expanded on. The reviewer specifically points to papers that study a species that was incorrectly referred to as Saccharomyces castellii in PMID 19745116, but whose current name is Naumovozyma castellii, reflecting that it is not that closely related to S. cerevisiae (diverged about 86 million years ago; for the correct species phylogeny, see http://ygob.ucd.ie/browser/species.html, as both of the published papers the reviewer cites have some errors in the phylogeny). 

      The other species discussed in PMID 19745116 (Vanderwaltozyma polyspora and Candida albicans) are even more distant. There have been several studies on substrate specificity of Dcr1 versus Rnt1 (including PMID 19745116). 

      The reviewer suggests that expressing Dcr1 in S. cerevisiae would be a valuable addition. However, we can’t envision a mechanism by which S. cerevisiae maintained physiologically relevant Dcr1 substrates in the absence of Dcr1. The results from the proposed study would, in our opinion, be limited to identifying RNAs that can be cleaved in this particular artificial system. We think an important implication of our work is that similar studies to ours should be carried out in rnt1∆, dcr1∆, and double mutants in either S. pombe or N. castellii, as well as in drosha knockouts in animals, and we discuss this in more detail in the revised paper.

      (7) For SNR84 in Figure S3D, it appears that the TSS may be upstream of the annotated gene model. Does RNA-seq coverage (from external datasets) extend upstream to these additional mapped cleavages? The assertion that the mRNA is uncapped is concerning; an alternative explanation is that the nascent mRNA has a cap initially but is subsequently cleaved by Rnt1. This point should be clarified or reworded for accuracy.

      We agree with the reviewer that the most likely explanation is that the primary SNR84 transcript is capped, and 5’ end processed by Rnt1 and Rat1 to make a mature 5’ monophosphorylated SNR84, and have clarified the text accordingly. We suspect our usage of “uncapped” might have been confusing. The term “uncapped” was not meant to indicate that the primary transcript did not receive a cap, but instead that the mature transcript did not have a cap. We now use “5’ end processed” and “5’ monophosphorylated”.

      Reviewer #2 (Public review):  

      The yeast double-stranded RNA endonuclease Rnt1, a homolog of bacterial RNase III, mediates the processing of pre-rRNA, pre-snRNA, and pre-snoRNA molecules. Cells lacking Rnt1 exhibit pronounced growth defects, particularly at lower temperatures. In this manuscript, Notice-Sarpaning examines whether these growth defects can be attributed at least in part to a function of Rnt1 in mRNA degradation. To test this, the authors apply parallel analysis of RNA ends (PARE), which they developed in previous work, to identify polyA+ fragments with 5' monophosphates in RNT1 yeast that are absent in rnt1Δ cells. Because such RNAs are substrates for 5' to 3' exonucleolytic decay by Rat1 in the nucleus or Xrn1 in the cytoplasm, these analyses were performed in a rat1-ts xrn1Δ background. The data recapitulate known Rnt1 cleavage sites in rRNA, snRNAs, and snoRNAs, and identify 122 putative novel substrates, approximately half of which are mRNAs. Of these, two-thirds are predicted to contain double-stranded stem loop structures with A/UGNN tetraloops, which serve as a major determinant of Rnt1 substrate recognition. Rnt1 resides in the nucleus, and it likely cleaves mRNAs there, but cleavage products seem to be degraded after export to the cytoplasm, as analysis of published PARE data shows that some of them accumulate in xrn1Δ cells. The authors then leverage the slow growth of rnt1Δ cells for experimental evolution. Sequencing analysis of thirteen faster-growing strains identifies mutations predominantly mapping to genes encoding nuclear exosome co-factors. Some of the strains have mutations in genes encoding a lariat-debranching enzyme, a ribosomal protein nuclear import factor, poly(A) polymerase 1, and the RNA-binding protein Puf4. In one of the puf4 mutant strains, a second mutation is also present in YDR514C, which the authors identify as an mRNA substrate cleaved by Rnt1. Deletion of either puf4 or ydr514C marginally improves the growth of rnt1Δ cells, which the authors interpret as evidence that mRNA cleavage by Rnt1 plays a role in maintaining cellular homeostasis by controlling mRNA turnover.

      While the PARE data and their subsequent in vitro validation convincingly demonstrate Rnt1-mediated cleavage of a small subset of yeast mRNAs, the data supporting the biological significance of these cleavage events is substantially less compelling. This makes it difficult to establish whether Rnt1-mediated mRNA cleavage is biologically meaningful or simply "collateral damage" due to a coincidental presence of its target motif in these transcripts.

      We thank the reviewer and have added additional data to support our conclusion that mRNA cleavage, at least for YDR514C, is not simply collateral damage, but a physiologically relevant function of Rnt1. From an evolutionary perspective, cleavage of mRNAs by Rnt1 might have initially been collateral damage, but if there is a way to use this mechanism, evolution is probably going to use it.

      (1) A major argument in support of the claim that "several mRNAs rely heavily on Rnt1 for turnover" comes from comparing number of PARE reads at the transcript start site (as a proxy for fraction of decapped transcripts) and at the Rnt1 cleavage site (as a proxy for fraction of Rnt1-cleaved transcripts). The argument for this is that "the major mRNA degradation pathway is through decapping". However, polyA tail shortening usually precedes decapping, and transcripts with short polyA tails would be strongly underrepresented in PARE sequencing libraries, which were constructed after two rounds of polyA+ RNA selection. This will likely underestimate the fraction of decapped transcripts for each mRNA. There is a wide range of well-established methods that can be used to directly measure differences in the half-life of Rnt1 mRNA targets in RNT1 vs rnt1Δ cells. Because the PARE data rely on the presence of a 5' phosphate to generate sequencing reads, they also cannot be used to estimate what fraction of a given mRNA transcript is actually cleaved by Rnt1. 

      The reviewer is correct that decapping preferentially affects mRNAs with shortened poly(A) tails, that Rnt1 cleavage likely affects mostly newly made mRNAs with long poly(A) tails, and that PARE may underestimate the decay of mRNAs with shortened poly(A) tails. We have reanalyzed our previously published data where we performed PARE on both the poly(A)-enriched fraction and the poly(A)-depleted fraction (that remains after two rounds of oligo dT selection). Rnt1 products are over-represented in the poly(A)-enriched fraction, while decapping products are enriched in the poly(A)-depleted fraction, providing further support to our conclusion that Rnt1 cleaves nuclear RNA. We have re-written key sections of the paper accordingly.

      The reviewer also points out that “There is a wide range of well-established methods that can be used to directly measure differences in the half-life of Rnt1 mRNA targets in RNT1 vs rnt1Δ cells.” However, all of those methods measure mRNA degradation rates from the steady state pool, which is mostly cytoplasmic. We have, in different contexts, used these methods, but as we pointed out they are inappropriate to measure degradation of nuclear RNA. There are some studies that measure nuclear degradation rates, but this requires purifying nuclei. There are two major drawbacks to this. First, it cannot distinguish between degradation in the nucleus and export from the nucleus because both processes cause disappearance from the nucleus. Second, the purification of yeast nuclei requires “spheroplasting” or enzymatically removing the rigid cell wall. This spheroplasting is likely to severely alter the physiological state of the yeast cell. Given these significant drawbacks and the substantial time and money required, we chose not to perform this experiment.  

      (2) Rnt1 is almost exclusively nuclear, and the authors make a compelling case that its concentration in the cytoplasm would likely be too low to result in mRNA cleavage. The model for Rnt1-mediated mRNA turnover would therefore require mRNAs to be cleaved prior to their nuclear export in a manner that would be difficult to control. Alternatively, the Rnt1 targets would need to re-enter prior to cleavage, followed by export of the cleaved fragments for cytoplasmic decay. These processes would need to be able to compete with canonical 5' to 3' and 3' to 5' exonucleolytic decay to influence mRNA fate in a biologically meaningful way.

      We disagree that mRNA export would be difficult to control, as is elegantly demonstrated by the 13 KDa HIV Rev protein. The export of many other RNAs is tightly controlled such that many RNAs are rapidly degraded in the nucleus by, for example, Rat1 and the RNA exosome, while other RNAs are rapidly exported. Indeed, the competition between RNA export and nuclear degradation is generally thought to be an important quality control for a variety of mRNAs and ncRNAs. We do agree with the reviewer that re-import of mRNAs appears unlikely (which is why we do not discuss it), although it occurs efficiently for other Rnt1-cleaved RNAs such as snRNAs. We have clarified the text accordingly, including in the introduction, results, and discussion. 

      (3) The experimental evolution clearly demonstrates that mutations in nuclear exosome factors are the most frequent suppressors of the growth defects caused by Rnt1 loss. This can be rationalized by stabilization of nuclear exosome substrates such as misprocessed snRNAs or snoRNAs, which are the major targets of Rnt1. The rescue mutations in other pathways linked to ribosomal proteins (splicing, ribosomal protein import, ribosomal mRNA binding) support this interpretation. By contrast, the potential suppressor mutation in YDR514C does not occur on its own but only in combination with a puf4 mutation; it is also unclear whether it is located within the Rnt1 cleavage motif or if it impacts Rnt1 cleavage at all. This can easily be tested by engineering the mutation into the endogenous YDR514C locus with CRISPR/Cas9 or expressing wild-type and mutant YDR514C from a plasmid, along with assaying for Rnt1 cleavage by northern blot. Notably, the growth defect complementation of YDR514C deletion in rnt1Δ cells is substantially less pronounced than the growth advantage afforded by nuclear exosome mutations (Figure S9, evolved strains 1 to 5). These data rather argue for a primary role of Rnt1 in promoting cell growth by ensuring efficient ribosome biogenesis through pre-snRNA/pre-snoRNA processing. 

      The reviewer makes several points. 

      First, we have clarified that the ydr514c-G220S mutation is not near the Rnt1 cleavage motif and is unlikely to affect cleavage by Rnt1. This is exactly what would be expected for a mutation that was selected for in an rnt1∆ strain. Although the reviewer appears to expect it, a mutation that affects Rnt1 cleavage could not be selected for in a strain that lacks Rnt1.

      Second, the reviewer points out that the original ydr514c mutations arose in a strain that also had a puf4 deletion. However, we show that ydr514c∆ also suppresses rnt1∆. Furthermore, we have added data showing that overexpressing an uncleavable YDR514C mRNA affects yeast growth at 37 °C more than the wild-type cleavable form, further supporting that the cleavage of YDR514C by Rnt1 is physiologically relevant.

      Reviewer #2 (Recommendations for the authors): 

      (1) The description of the PARE library construction protocol and data analysis workflow is insufficient to ensure their robustness and reproducibility. The library construction protocol should include details of the individual steps, and the data analysis workflow description should include package versions and exact commands used for each analysis step.

      We have clarified that the experiments were performed exactly as previously described and have included very detailed methods. The Galaxy server does not require commands and instead we have indicated the parameters chosen in the various steps. We have also added that the PARE libraries for poly(A)+ and poly(A)- fractions were generated in the lab of Pam Green according to their protocol, which is not exactly the same as ours. Nevertheless, the Rnt1 sites are also evident from those libraries, further demonstrating the robustness of our data. 

      (2) PARE signal is expressed as a ratio of sequencing coverage at a given nucleotide in RNT1 vs rnt1Δ cells. This poses challenges to estimating fold changes: by definition, there should be no coverage at Rnt1 cleavage sites in rnt1Δ cells, as there will not be any 5' monophosphate-containing mRNA fragments to be ligated to the library construction linker. This should be accounted for in the data analysis pipeline - the DESeq2 package, for example, handles this very well (https://support.bioconductor.org/p/64014/).

      The reviewer is correct, and we have clarified how we account for the possibility of having 0 reads: we add an arbitrary 0.01 cpm to all PARE scores for wild type and mutant. In the original manuscript this was not explicitly mentioned, and the reader would have had to go to our previous paper to learn about this detail. Adding this 0.01 cpm pseudocount avoids dividing by 0 when we calculate a comPARE score. This means we actually underestimate the fold change. As can be seen in the red line in the image below, the y-axis modified log2FC score maxes out along a diagonal line at log2([average RNT1 reads]/0.01) instead of at infinity. That is, at a wild-type peak height of 1 cpm, the maximum possible score is log2(1.01/0.01), which equals 6.66; at 10 cpm, the maximum score is ~10; and so on. As can be seen, many of the scores fall along this diagonal, reflecting that there are indeed 0 reads in the rnt1∆ samples.

      Author response image 1.

      There are multiple ways to deal with this issue, and ours is not uncommon. DESeq2, suggested by the reviewer, uses a different method, which relies on the assumption that the dispersion of read counts for genes of any given expression strength is constant, and then uses that dispersion to “correct” the 0 read counts. While this is a valid approach to differential gene expression analysis when comparing similar RNAs, the underlying assumption that all genes of similar expression level have similar dispersion is questionable when comparing, for example, mRNAs, snoRNAs, and snRNAs. Thus, we are not convinced that this is a better way to deal with 0 counts. Our analysis accepts that 0 might be the best estimate for the number of counts that are expected from rnt1∆ samples.
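
      The pseudocount arithmetic described in this response can be sketched as follows. This is only an illustrative calculation, not the authors' pipeline: the function names are hypothetical, and the only values taken from the response are the 0.01 cpm pseudocount and the example peak heights of 1 and 10 cpm.

```python
import math

PSEUDOCOUNT_CPM = 0.01  # pseudocount stated in the response, added to both samples


def modified_log2fc(wt_cpm: float, mutant_cpm: float,
                    pseudocount: float = PSEUDOCOUNT_CPM) -> float:
    """log2 ratio of wild-type to mutant signal after adding the pseudocount to each."""
    return math.log2((wt_cpm + pseudocount) / (mutant_cpm + pseudocount))


def max_score(wt_cpm: float, pseudocount: float = PSEUDOCOUNT_CPM) -> float:
    """Ceiling of the score for a given wild-type peak, reached when the mutant has 0 reads."""
    return modified_log2fc(wt_cpm, 0.0, pseudocount)


if __name__ == "__main__":
    # Hypothetical helper names; peak heights are the two examples from the response.
    for peak in (1.0, 10.0):
        print(f"wild-type peak {peak} cpm: maximum score {max_score(peak):.2f}")
```

      Because the ceiling grows with the wild-type peak height, sites with no reads in the mutant fall on a diagonal rather than at infinity, which is the behaviour the response describes.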

      (3) The analysis in Figure S8 is insufficient to demonstrate that the four mRNAs depicted are significantly more abundant in rnt1Δ vs RNT1 cells - differences in coverage could simply be a result of different sequencing depth. Please use an appropriate method for estimating differential expression from RNA-Seq data (e.g., DESeq2). 

      Unfortunately, the previously published data we included as figure S8 (now figure S9) did not include replicates, and we agree that it does not rigorously show an effect. The reviewer suggests that we analyze the data by DESeq2, which requires replicates, and thus, cannot be done. Instead we have clarified this. If the reviewer is not satisfied with this, we are prepared to delete it.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This is a useful study examining the determinants and mechanisms of LRMP inhibition of cAMP regulation of HCN4 channel gating. The evidence provided to support the main conclusions is unfortunately incomplete, with discrepancies in the work that reduce the strength of mechanistic insights.

      Thank you for the reviews of our manuscript. We have made a number of changes to clarify our hypotheses in the manuscript and addressed all of the potential discrepancies by revising some of our interpretation. In addition, we have provided additional experimental evidence to support our conclusions. Please see below for a detailed response to each reviewer comment.

      Public Reviews

      Reviewer #1 (Public Review):

      Summary:

      The authors use truncations, fragments, and HCN2/4 chimeras to narrow down the interaction and regulatory domains for LRMP inhibition of cAMP-dependent shifts in the voltage dependence of activation of HCN4 channels. They identify the N-terminal domain of HCN4 as a binding domain for LRMP, and highlight two residues in the C-linker as critical for the regulatory effect. Notably, whereas HCN2 is normally insensitive to LRMP, putting the N-terminus and 5 additional C-linker and S5 residues from HCN4 into HCN2 confers LRMP regulation in HCN2.

      Strengths:

      The work is excellent, the paper well written, and the data convincingly support the conclusions which shed new light on the interaction and mechanism for LRMP regulation of HCN4, as well as identifying critical differences that explain why LRMP does not regulate other isoforms such as HCN2.

      Thank you.

      Reviewer #2 (Public Review):

      Summary:

      HCN-4 isoform is found primarily in the sino-atrial node where it contributes to the pacemaking activity. LRMP is an accessory subunit that prevents cAMP-dependent potentiation of HCN4 isoform but does not have any effect on HCN2 regulation. In this study, the authors combine electrophysiology, FRET with standard molecular genetics to determine the molecular mechanism of LRMP action on HCN4 activity. Their study shows that parts of N- and C-termini along with specific residues in C-linker and S5 of HCN4 are crucial for mediating LRMP action on these channels. Furthermore, they show that the initial 224 residues of LRMP are sufficient to account for most of the activity. In my view, the highlight of this study is Fig. 7 which recapitulates LRMP modulation on HCN2-HCN4 chimera. Overall, this study is an excellent example of using time-tested methods to probe the molecular mechanisms of regulation of channel function by an accessory subunit.

      Weaknesses:

      (1) Figure 5A- I am a bit confused with this figure and perhaps it needs better labeling. When it states Citrine, does it mean just free Citrine, and "LRMP 1-230" means LRMP fused to Citrine which is an "LF" construct? Why not simply call it "LF"? If there is no Citrine fused to "LRMP 1-230", this figure would not make sense to me.

      We have clarified the labelling of this figure and specifically defined all abbreviations used for HCN4 and LRMP fragments in the results section on page 14.

      (2) Related to the above point- Why is there very little FRET between NF and LRMP 1-230? The FRET distance range is 2-8 nm which is quite large. To observe baseline FRET for this construct more explanation is required. Even if one assumes that about 100 amino are completely disordered (not extended) polymers, I think you would still expect significant FRET.

      FRET is extremely sensitive to distance (efficiency falls off with the 6th power of distance). The difference in contour length (the maximum length of a peptide if fully extended) between our ~260 aa fragment and our ~130 aa fragments is on the order of 450 Å (45 nm). So, even if the fragments are not extended, it is not hard to imagine that the larger fragments show a weaker FRET signal. In fact, we do see slightly larger FRET than in control (not significant), which is consistent with the idea that the larger fragments simply do not produce a large FRET signal.

      Moreover, this hybridization assay is sensitive to a number of other factors including the affinity between the two fragments, the expression of each fragment, and the orientation of the fluorophores. Any of these factors could also result in reduced FRET.
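
      The distance argument above can be illustrated with the standard Förster relation, E = 1 / (1 + (r/R0)^6). The sketch below is a rough illustration only: the Förster radius of 5 nm and the contour length of 0.35 nm per residue are assumed round numbers (not values from the manuscript), and the function names are hypothetical.

```python
ASSUMED_R0_NM = 5.0               # assumed Förster radius for the fluorophore pair
ASSUMED_NM_PER_RESIDUE = 0.35     # assumed contour length per residue of an extended chain


def fret_efficiency(distance_nm: float, r0_nm: float = ASSUMED_R0_NM) -> float:
    """Förster relation: efficiency drops with the 6th power of donor-acceptor distance."""
    return 1.0 / (1.0 + (distance_nm / r0_nm) ** 6)


def contour_length_nm(n_residues: int,
                      nm_per_residue: float = ASSUMED_NM_PER_RESIDUE) -> float:
    """Maximum end-to-end length of a fully extended peptide of n_residues."""
    return n_residues * nm_per_residue


if __name__ == "__main__":
    # ~130 extra residues is the difference between the ~260 aa and ~130 aa fragments.
    extra = contour_length_nm(130)
    print(f"extra contour length for 130 residues: {extra:.1f} nm")
    for r in (2.0, 5.0, 8.0, 10.0):
        print(f"distance {r:>4.1f} nm: efficiency {fret_efficiency(r):.3f}")
```

      Under these assumptions, doubling the separation from R0 to 2·R0 cuts the efficiency from 0.5 to about 0.015, so even a partly compacted longer fragment could plausibly push the fluorophores out of the detectable range.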

      We have added a section on the limitations of the FRET 2-hybrid assay in the discussion section on page 20. Our goal with the FRET assay was to provide complementary evidence that shows some of the regions that are important for direct association, and we have edited the text to make sure we are not over-interpreting our results.

      (3) Unless I missed this, have all the Cerulean and Citrine constructs been tested for functional activity?

      All Citrine-tagged LRMP constructs (or close derivatives) were tested functionally by coexpression with HCN (see Table 1 and pages 10-11). Cerulean-tagged HCN4 fragments are of course intrinsically non-functional, as they do not include the ion-conducting pore.

      Reviewer #3 (Public Review):

      Summary:

      Using patch clamp electrophysiology and Förster resonance energy transfer (FRET), Peters and co-workers showed that the disordered N-terminus of both LRMP and HCN4 are necessary for LRMP to interact with HCN4 and inhibit the cAMP-dependent potentiation of channel opening. Strikingly, they identified two HCN4-specific residues, P545 and T547 in the C-linker of HCN4, that are close in proximity to the cAMP transduction centre (elbow C-linker, S4/S5-linker, HCND) and account for the LRMP effect.

      Strengths:

      Based on these data, the authors propose a mechanism in which LRMP specifically binds to HCN4 via its isotype-specific N-terminal sequence and thus prevents the cAMP transduction mechanism by acting at the interface between the elbow C-linker, the S4S5-linker, and the HCND.

      Weaknesses:

      Although the work is interesting, there are some discrepancies between data that need to be addressed.

      (1) I suggest inserting in Table 1 and in the text, the Δ shift values (+cAMP; + LRMP; +cAMP/LRMP). This will help readers.

      Thank you, Δ shift values have been added to Tables 1 and 2 as suggested.

      (2) Figure 1 is not clear; the distribution of values is anomalously wide. For instance, in 1B the distribution of values of V1/2 in the presence of cAMP goes from -85 to -115. I agree that in the absence of cAMP, HCN4 in HEK293 cells shows some variability in V1/2 values, which nonetheless cannot be so wide (here the variability sometimes spans even 30 mV) and usually disappears with cAMP (here not).

      With a large N, this is an expected distribution. In 5 previous reports from 4 different groups of HCN4 with cAMP in HEK 293 (Fenske et al., 2020; Liao et al., 2012; Peters et al., 2020; Saponaro et al., 2021; Schweizer et al., 2010), the average expected range of the data is 26.6 mV and 39.9 mV for 95% (mean ± 2SD) and 99% (mean ± 3SD) of the data, respectively. As the reviewer mentions, the expected range from these papers is slightly larger in the absence of cAMP. The average SDs of HCN4 (with/without cAMP) in these papers are 9.9 mV (Schweizer et al., 2010), 4.4 mV (Saponaro et al., 2021), 7.6 mV (Fenske et al., 2020), 10.0 mV (Liao et al., 2012), and 5.9 mV (Peters et al., 2020). Our SD in this paper is roughly in the middle at 7.6 mV. This is likely because we used an inclusive approach to data so as not to bias our results (see the statistics section of the revised manuscript on page 9). We have removed 2 data points that meet the statistical classification as outliers; no measures of statistical significance were altered by this.
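
      The relationship between an SD and the expected spread of the data used in this response can be sketched as below. This is a hedged illustration, not the authors' analysis: the SD values are those listed in the response, the function names are hypothetical, and the back-calculated SD of about 6.65 mV for the quoted 26.6 mV and 39.9 mV ranges is an inference from the arithmetic (it may reflect the with-cAMP subset only), not a number stated by the authors.

```python
from statistics import mean

# SDs (mV) listed in the response for HCN4 (with/without cAMP) in five earlier reports
REPORTED_SD_MV = {
    "Schweizer et al., 2010": 9.9,
    "Saponaro et al., 2021": 4.4,
    "Fenske et al., 2020": 7.6,
    "Liao et al., 2012": 10.0,
    "Peters et al., 2020": 5.9,
}


def expected_span_mv(sd_mv: float, k: float) -> float:
    """Width of the interval mean ± k·SD, i.e. 2·k·SD."""
    return 2.0 * k * sd_mv


def implied_sd_mv(span_mv: float, k: float) -> float:
    """SD that would produce a given mean ± k·SD span."""
    return span_mv / (2.0 * k)


if __name__ == "__main__":
    print(f"mean of listed SDs: {mean(REPORTED_SD_MV.values()):.2f} mV")
    # 7.6 mV is the SD the authors report for their own data
    for k in (2, 3):
        print(f"mean ± {k}SD span for SD 7.6 mV: {expected_span_mv(7.6, k):.1f} mV")
    # SD implied by the ranges quoted in the response
    print(f"SD implied by 26.6 mV at 2SD: {implied_sd_mv(26.6, 2):.2f} mV")
    print(f"SD implied by 39.9 mV at 3SD: {implied_sd_mv(39.9, 3):.2f} mV")
```

      On this arithmetic, an SD of 7.6 mV corresponds to a mean ± 2SD span of roughly 30 mV, which is in line with the spread that the reviewer questioned.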

      This problem is spread throughout the manuscript, and the measured mean effects are indeed always at the limit of statistical significance. Why so? Is this a problem with the analysis, or with the recordings?

      The exact P-values are NOT typically at the limit of statistical significance; about two-thirds would meet the stringent P < 0.0001 cut-off. We have clarified in the statistics section (page 10) that any comparison meeting our significance threshold (P < 0.05) or a stricter criterion is treated equally in the figure labelling. Exact P-values are provided in Tables 1-3.

      There are several other problems with Figure 1 and in all figures of the manuscript: the Y scale is very narrow while the mean values are marked with large square boxes. Moreover, the exemplary activation curve of Figure 1A is not representative of the mean values reported in Figure 1B, and the values of 1B are different from those reported in Table 1.

      Y-axis values for mean plots were picked such that all data points are included and are consistent across all figures. They have been expanded slightly (-75 to -145 mV for all HCN4 channels and -65 to -135 mV for all HCN2 channels). The size of the mean value marker has been reduced slightly. Exact midpoints for all data are also found in Tables 1-3.

      The GV curves in Figure 1B (previously Fig. 1A) are averages with the ±SEM error bars smaller than the symbols in many cases owing to relatively high n’s for these datasets. These curves match the midpoints in panel 1C (previously 1B). E.g., the midpoint of the average curve for HCN4 control in panel A is -117.9 mV, essentially the same as the -117.8 mV average for the individual fits in panel B.

      We made an error in the text based on a previous manuscript version about the ordering of the tables that has now been fixed so these values should now be aligned.

      On this ground, it is difficult to judge the conclusions and it would also greatly help if exemplary current traces would be also shown.

      Exemplary current traces have been added to all figures in the revised manuscript.

      (3) "....HCN4-P545A/T547F was insensitive to LRMP (Figs. 6B and 6C; Table 1), indicating that the unique HCN4 C-linker is necessary for regulation by LRMP. Thus, LRMP appears to regulate HCN4 by altering the interactions between the C-linker, S4-S5 linker, and N-terminus at the cAMP transduction centre."

      Although this is an interesting theory, there are no data supporting it. Indeed, P545 and T547 at the tip of the C-linker elbow (Fig. 6A) are crucial for the LRMP effect, but these two residues are not involved in the cAMP transduction centre (interface between HCND, S4S5 linker, and C-linker elbow), at least according to the data accumulated so far in the literature. Indeed, the hypothesis that LRMP somehow inhibits the cAMP transduction mechanism of HCN4, given the fact that the two necessary residues P545 and T547 are close to the cAMP transduction centre, remains to be proven.

      Moreover, I suggest analysing the putative role of P545 and T547 in light of the available HCN4 structures. In particular, T547 (elbow) points towards the underlying shoulder of the adjacent subunit and, therefore, is in a key position for the cAMP transduction mechanism. The presence of bulky hydrophobic residues (very different nature compared to T) in the equivalent position of HCN1 and HCN2 also favours this hypothesis. In this light, it will be also interesting to see whether a single T547F mutation is sufficient to prevent the LRMP effect.

      We agree that testing this hypothesis would be very interesting. However, it is challenging. Any mutation we make that is involved in cAMP transduction makes measuring the LRMP effect on cAMP shifts difficult or impossible.

      Our simple idea, now clarified in the discussion, is that if you look at the regions involved in cAMP transduction (HCND, C-linker, S4-S5), there are very few residues that differ between HCN4 and HCN2. When we mutate the 5 non-conserved residues in the S5 segment and the C-linker, along with the NT, we are able to render HCN2 sensitive to LRMP. Therefore, something about the small sequence differences in this region confers isoform specificity to LRMP. We speculate that this happens because of small structural differences that result from those 5 mutations. If you compare the solved structures of HCN1 and HCN4 (there is no HCN2 structure available), you can see small differences in the distances between key interacting residues in the transduction centre. Also, there is a kink at the bottom of the S4 helix in HCN4 but not HCN1. This points a putatively important residue for cAMP dependence in a different direction in HCN4. We hypothesize in the discussion that this may be how LRMP is isoform specific.

      Moreover, previous work has shown that the HCN4 C-linker is uniquely sensitive to di-cyclic nucleotides and magnesium ions. We are hypothesizing that it is the subtle change in structure that makes this region more prone to regulation in HCN4.

      Reviewing Editor (recommendations for the Authors):

      (1) Exemplar recordings need to be shown and some explanation for the wide variability in the V-half of activation.

      Exemplar currents are now shown for each channel. See the response to Reviewer 3’s public comment 2.

      (2) The rationale for cut sites in LRMP for the investigation of which parts of the protein are important for blocking the effect of cAMP is not logically presented in light of the modular schematics of domains in the protein (N-term, CCD, post-CCD, etc).

      There is limited structural data on LRMP and the HCN4 N-terminus. The cut sites in this paper were determined empirically. We made fragments that were small enough to work for our FRET hybridization approach and that expressed well in our HEK cell system. The residue numbering of the LRMP modules is based on updated structural predictions using Alphafold, which was released after our fragments were designed. This has been clarified in the methods section on pages 5-6 and the Figure 2 legend of the revised manuscript.

      (3) Role of the HCN4 C-terminus. Truncation of the unstructured HCN4 C-terminus distal to the CNBD (Fig. 4 A, B) partially reverses the impact of LRMP (i.e. there is now a significant increase in cAMP effect compared to full-length HCN4). The manuscript is written in a manner that minimizes the potential role of the C-terminus and it is, therefore, eliminated from consideration in subsequent experiments (e.g. FRET) and the discussion. The model is incomplete without considering the impact of the C-terminus.

      We thank the reviewer for this comment as it was a result that we too readily dismissed. We have added discussion around this point and revised our model to suggest that not only can we not eliminate a role for the distal C-terminus, our data is consistent with it having a modest role. Our HCN4-2 chimera and HCN4-S719x data both suggest the possibility that the distal C-terminus might be having some effect on LRMP regulation. We have clarified this in the results (pages 12-13) and discussion (page 19).

      (4) For FRET experiments, it is not clear why LF should show an interaction with N2 (residues 125-160) but not NF (residues 1-160). N2 is contained within NF, and given that Citrine and Cerulean are present on the C-terminus of LF and N2/NF, respectively, residues 1-124 in NF should not impact the detection of FRET because of greater separation between the fluorophores as suggested by the authors.

      This is a fair point but FRET is somewhat more complicated. We do not know the structure of these fragments and it’s hard to speculate where the fluorophores are oriented in this type of assay. Moreover, this hybridization assay is sensitive to affinity and expression as well. There are a number of reasons why the larger 1-260 fragment might show reduced FRET compared to 125-260. As mentioned in our response to reviewer 2’s public comment 2, we have added a limitation section that outlines the various caveats of FRET that could explain this.

      (5) For FRET experiments, the choice of using pieces of the channel that do not correlate with the truncations studied in functional electrophysiological experiments limits the holistic interpretation of the data. Also, no explanation or discussion is provided for why LRMP fragments that are capable of binding to the HCN4 N-terminus as determined by FRET (e.g. residues 1-108 and 110-230, respectively) do not have a functional impact on the channel.

      As mentioned in the response to comment 2, the exact fragment design is a function of which fragments expressed well in HEK cells. Importantly, because FRET experiments do not provide atomic resolution (for the reasons listed in the revised limitations section on pages 20-21), small differences in the cut sites do not change the interpretation of these results. For example, the N-terminal 1-125 construct is analogous to experiments with the Δ1-130 HCN4 channel.

      We suspect that residues in both fragments are required and that the interaction involves multiple parts. This is stated in the results “Thus, the first 227 residues of LRMP are sufficient to regulate HCN4, with residues in both halves of the LRMP N-terminus necessary for the regulation” (page 11). We have also added discussion on this on page 21.

      (6) A striking result was that mutating two residues in the C-linker of HCN4 to amino acids found in HCN channels not affected by LRMP (P545A, T547F), completely eliminated the impact of LRMP on preventing cAMP regulation of channel activation. However, a chimeric channel, (HCN4-2) in which the C-linker, the CNBD, and the C-terminus of HCN4 were replaced by that of HCN2 was found to be partially responsive to LRMP. These two results appear inconsistent and not reconciled in the model proposed by the authors for how LRMP may be working.

      As stated in our answer to your question #3, we have revised our interpretation of these data. If the more distal C-terminus plays some role in the orientation of the C-linker and the transduction centre as a whole, these data can still be viewed consistent with our model. We have added some discussion of this idea in our discussion section.

      (7) Replacing the HCN2 N-terminus with that from HCN4, along with mutations in the S5 (MCS/VVG) and C-linker (AF/PT) recapitulated LRMP regulation on the HCN2 background. The functional importance of the S5 mutations is not clear as no other experiments are shown to indicate whether they are necessary for the observed effect.

      We have added our experiments on a midpoint HCN2 clone that includes the S5 mutants and the C-linker mutants in the absence of the HCN4 N-terminus (i.e., HCN2 MCSAF/VVGPT) (Fig. 7). We have also discussed our rationale for the S5 mutations, as we believe they may be responsible for the different orientations of the S4-S5 linker in HCN1 and HCN4 structures that are known to impact cAMP regulation.

      Reviewer #1 (Recommendations For The Authors):

      A) Comments:

      (1) Figure 1: Please show some representative current traces.

      Exemplar currents are now shown for each channel in the manuscript.

      (2) Figure 1: There appears to be a huge number of recordings for HCN4 +/- cAMP as compared to those with LRMP 1-479Cit. How was the number of recordings needed for sufficient statistical power decided? This is particularly important because the observed slowing of deactivation by cAMP in Fig. 1C seems like it may be fairly subtle. Perhaps a swarm plot would make the shift more apparent? Also, LRMP 1-479Cit distributions in Fig. 1B-C look like they are more uniform than normal, so please double-check the appropriateness of the statistical test employed.

      We have revised the methods section (page 7) to discuss this. Briefly, we performed regular control experiments throughout this project to ensure that a normal cAMP response was occurring. Our minimum target for sufficient power was 8-10 recordings. We have expanded the statistics section (page 9) to discuss tests of normality and the use of a log scale for deactivation time constants, which is why the shifts in Fig. 1D (revised) are less apparent.

      (3) It would be helpful if the authors could better introduce their logic for the M338V/C341V/S345G mutations in the HCN4-2 VVGPT mutant.

      See response to the reviewing editor’s comment 7.

      B) Minor Comments:

      (1) pg. 9: "We found that LRMP 1-479Cit inhibited HCN4 to an even greater degree than the full-length LRMP, likely because expression of this tagged construct was improved compared to the untagged full-length LRMP, which was detected by co-transfection with GFP." Co-transfection with GFP seems like an extremely poor and a risky measure for LRMP expression.

      We agree that the exact efficiency of co-transfection is contentious although some papers and manufacturer protocols indicate high co-transfection efficiency (Xie et al., 2011). In this paper we used both co-transfection and tagged proteins with similar results.

      (2) pg 9: "LRMP 1-227 construct contains the N-terminus of LRMP with a cut-site near the N-terminus of the predicted coiled-coil sequence". In Figure 2 the graphic shows the coiled-coil domain starting at 191. What was the logic for splitting at 227, which appears to be the middle of the coiled-coil?

      See response to the reviewing editor’s comment 2.

      (3) Figure 5C: Please align the various schematics for HCN4 as was done for LRMP. It makes it much easier to decipher what is what.

      Fig. 5 has been revised as suggested.

      (4) pg 12: I assume that the HCN2 fragment chosen aligns with the HCN4 N2 fragment which shows binding, but this logic should be stated if that is the case. If not, then how was the HCN2 fragment chosen?

      This is correct. This has been explicitly stated in the revised manuscript (page 14).

      (5) Figure 7: Add legend indicating black/gray = HCN4 and blue = HCN2.

      This has been stated in the revised figure legend.

      (6) pg 17: Conservation of P545 and T547 across mammalian species is not shown or cited.

      This sentence is not included in the revised manuscript. However, for the interest of the reviewer, we have provided an alignment of this region across species here.

      Author response image 1.

      Reviewer #2 (Recommendations For The Authors):

      (1) It is not clear whether in the absence of cAMP, LRMP also modestly shifts the voltage-dependent activity of the channels. Please clarify.

      We have clarified that LRMP does not shift the voltage-dependence in the absence of cAMP (page 10). In the absence of cAMP, LRMP does not significantly shift the voltage-dependence of activation in any of the channels we have tested in this paper (or in our prior 2020 paper).

      (2) Resolution of Fig. 8b is low.

      We ultimately decided that the cartoon did not provide any important information for understanding our model and it was removed.

      (3) Please add a supplementary figure showing the amino acid sequence of LRMP to show where the demarcations are made for each fragment as well as where the truncations were made as noted in Fig 3 and Fig 4.

      A new supplementary figure showing the LRMP sequence has been added and cited in the methods section (page 5). Truncation sites have been added to the schematic in Fig. 2A.

      (4) In the cartoon schematic illustration for Fig. 3 and Fig.4, the legend should include that the thick bold lines in the C-Terminal domain represent the CNBD, while the thick bold lines in the N-Terminal domain represent the HCN domain. This was mentioned in Liao 2012, as you referenced when you defined the construct S719X, but it would be nice for the reader to know that the thick bold lines you have drawn in your cartoon indicate that it also highlights the CNBD or the HCN domain.

      This has been added to figure legends for the relevant figures in the revised manuscript.

      (5) On page 12, missing a space between "residues" and "1" in the parenthesis "...LRMP L1 (residues1-108)...".

      Fixed. Thank you.

      (6) Which isoform of LRMP was used? What is the NCBI accession number? Is it the same one from Peters 2020 ("MC228229")?

      This information has been added to the methods (page 5). It is the same as Peters 2020.

      Reviewer #3 (Recommendations For The Authors):

      (1) "Truncation of residues 1-62 led to a partial LRMP effect where cAMP caused a significant depolarizing shift in the presence of LRMP, but the activation in the presence of LRMP and cAMP was hyperpolarized compared to cAMP alone (Fig. 3B, C and 3E; Table 1). In the HCN4Δ1-130 construct, cAMP caused a significant depolarizing shift in the presence of LRMP; however, the midpoint of activation in the presence of LRMP and cAMP showed a non-significant trend towards hyperpolarization compared to cAMP alone (Fig. 3C and 3E; Table 1)".

      This means that sequence 62-185 is necessary and sufficient for the LRMP effect. I suggest a competition assay with this peptide (synthetic, or co-expressed with HCN4 full-length and LRMP to see whether the peptide inhibits the LRMP effect).

      We respectfully disagree with the reviewer’s interpretation. Our results strongly suggest that other regions, such as residues 25-65 (Fig. 3C) and C-terminal residues (Fig. 6), are also necessary. The use of a peptide could be an interesting future experiment; however, it would be very difficult to control the relative expression of a co-expressed peptide. We think that our results in Fig. 7E-F, where this fragment is added to HCN2, are a better-controlled way of validating the importance of this region.

      (2) "Truncation of the distal C-terminus (of HCN4) did not prevent LRMP regulation. In the presence of both LRMP and cAMP the activation of HCN4-S719X was still significantly hyperpolarized compared to the presence of cAMP alone (Figs. 4A and 4B; Table 1). And the cAMP-induced shift in HCN4-S719X in the presence of LRMP (~7mV) was less than half the shift in the absence of LRMP (~18 mV)."

      On the basis of the partial effects reported for the truncations of the N-terminus of HCN4 1-62 and 1-130 (Fig 3B and C), I do not think it is possible to conclude that "truncation of the distal C-terminus (of HCN4) did not prevent LRMP regulation". Indeed, the cAMP-induced shifts in HCN4 Δ1-62 and Δ1-130 in the presence of LRMP were 10.9 and 10.5 mV, respectively, way more than the ~7 mV measured for the HCN4-S719X mutant.

      As you rightly stated at the end of the paragraph: "Together, these results show significant LRMP regulation of HCN4 even when the distal C-terminus is truncated, consistent with a minimal role for the C-terminus in the regulatory pathway". I would discuss this minimal role of the C-terminus in more detail. It is true that deletion of the first 185 aa of the HCN4 N-terminus abolishes the LRMP effect, but it is also true that removal of the very C-terminus of HCN4 does affect LRMP. This unstructured C-terminal region of HCN4 contains isotype-specific sequences. Maybe they also play a role in recognizing LRMP. Thus, I would suggest further investigation via truncations, even internal deletions of HCN4-specific sequences.

      Please see the response to the reviewing editor’s comment 3.

      (3) Figure 5: The N-terminus of LRMP FRETs with the N-terminus of HCN4.

      Why didn't you test the same truncations used in Fig. 3? Indeed, based on Fig 3, sequences 1-25 can be removed. I would have considered peptides 26-62 and 63-130 and 131-185 and a fourth (26-185). This set of peptides will help you connect binding with the functional effects of the truncations tested in Fig 3.

      Please see the response to the reviewing editor’s comments 2 and 5.

      Why didn't you test the C-terminus (from 719 till the end) of HCN4? This can help with understanding why truncation of the HCN4 C-terminus does affect LRMP, though partially (Fig. 4A).

      Please see the response to the reviewing editor’s comment 3.

      (4) "We found that a previously described HCN4-2 chimera containing the HCN4 N-terminus and transmembrane domains (residues 1-518) with the HCN2 C-terminus (442-863) (Liao et al., 2012) was partially regulated by LRMP (Fig. 7A and 7B)".

      I do not understand this partial LRMP effect on the HCN4-2 chimera. In Fig. 6 you have shown that the "HCN4-P545A/T547F was insensitive to LRMP (Figs. 6B and 6C; Table 1), indicating that the unique HCN4 C-linker is necessary for regulation by LRMP". How can this be reconciled with the HCN4-2 chimera? HCN4-2, "containing" P545A/T547F mutations, should not perceive LRMP.

      Please see the response to the reviewing editor’s comment 6.

      (5) "we next made a targeted chimera of HCN2 that contains the distal HCN4 N-terminus (residues 1-212) and the HCN2 transmembrane and C-terminal domains with 5 point mutants in non-conserved residues of the S5 segment and C-linker elbow (M338V/C341V/S345G/A467P/F469T)......Importantly, the HCN4-2 VVGPT channel is insensitive to cAMP in the presence of LRMP (Fig. 7C and 7D), indicating that the HCN4 N-terminus and cAMP-transduction centre residues are sufficient to confer LRMP regulation to HCN2".

      Why did you also insert the 3 mutations of S5? Are these mutations somehow involved in the cAMP transduction mechanism?

      You have already shown that in HCN4 only P545 and T547 (C-linker) are necessary for the LRMP effect. I suggest trying, at least, the chimera of HCN2 with only A467P/F469T. It should work without the 3 mutations in S5.

      Please see the response to the reviewing editor’s comment 7.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, the authors investigated the effect of chronic activation of dopamine neurons using chemogenetics. Using Gq-DREADDs, the authors chronically activated midbrain dopamine neurons and observed that these neurons, particularly their axons, exhibit increased vulnerability and degeneration, resembling the pathological symptoms of Parkinson's disease. Baseline calcium levels in midbrain dopamine neurons were also significantly elevated following the chronic activation. Lastly, to identify cellular and circuit-level changes in response to dopaminergic neuronal degeneration caused by chronic activation, the authors employed spatial genomics (Visium) and revealed comprehensive changes in gene expression in the mouse model subjected to chronic activation. In conclusion, this study presents novel data on the consequences of chronic hyperactivation of midbrain dopamine neurons.

      Strengths:

      This study provides direct evidence that the chronic activation of dopamine neurons is toxic and gives rise to neurodegeneration. In addition, the authors achieved the chronic activation of dopamine neurons using water application of clozapine-N-oxide (CNO), a method not commonly employed by researchers. This approach may offer new insights into pathophysiological alterations of dopamine neurons in Parkinson's disease. The authors also utilized state-of-the-art spatial gene expression analysis, which can provide valuable information for other researchers studying dopamine neurons. Although the authors did not elucidate the mechanisms underlying dopaminergic neuronal and axonal death, they presented a substantial number of intriguing ideas in their discussion, which are worth further investigation.

      We thank the reviewer for these positive comments.

      Weaknesses:

      Many claims raised in this paper are only partially supported by the experimental results. So, additional data are necessary to strengthen the claims. The effects of chronic activation of dopamine neurons are intriguing; however, this paper does not go beyond reporting phenomena. It lacks a comprehensive explanation for the degeneration of dopamine neurons and their axons. While the authors proposed possible mechanisms for the degeneration in their discussion, such as differentially expressed genes, these remain experimentally unexplored.

      We thank the reviewer for this review. We do believe that the manuscript has a substantial mechanistic component, as the central experiments involve direct manipulation of neuronal activity, and we show an increase in calcium levels and gene expression changes in dopamine neurons that coincide with the degeneration. However, we agree that deeper mechanistic investigation would strengthen the conclusions of the paper. We have executed several important revisions, including the addition of CNO behavioral controls, manipulation of intracellular calcium using isradipine, additional transcriptomics experiments and further validation of findings. We believe that these additions significantly bolster the conclusions of the paper.

      Reviewer #2 (Public Review):

      Summary:

      Rademacher et al. present a paper showing that chronic chemogenetic excitation of dopaminergic neurons in the mouse midbrain results in differential degeneration of axons and somas across distinct regions (SNc vs VTA). These findings are important. This mouse model also has the advantage of showing an axon-first degeneration over an experimentally useful time course (2-4 weeks). The finding that direct excitation of dopaminergic neurons causes differential degeneration sheds light on the mechanisms of dopaminergic neuron selective vulnerability. The evidence that activation of dopaminergic neurons causes degeneration and alters mRNA expression is convincing, as the authors use both vehicle and CNO control groups, but the evidence that chronic dopaminergic activation alters circadian rhythm and motor behavior is incomplete as the authors did not run a CNO-control condition in these experiments.

      Strengths:

      This is an exciting and important paper.

      The paper compares mouse transcriptomics with human patient data.

      It shows that selective degeneration can occur across the midbrain dopaminergic neurons even in the absence of a genetic, prion, or toxin neurodegeneration mechanism.

      We thank the reviewer for these comments.

      Weaknesses:

      Major concerns:

      (1) The lack of a CNO-positive, DREADD-negative control group in the behavioral experiments is the main limitation in interpreting the behavioral data. Without knowing whether CNO on its own has an impact on circadian rhythm or motor activity, the certainty that dopaminergic hyperactivity is causing these effects is lacking.

      We thank the reviewer for this important recommendation. Although the initial version showed that CNO does not produce degeneration of DA neuron terminals, it did not exclude a contribution to the behavioral changes. To address this, we now include a cohort of DREADD-free, non-injected mice treated with either vehicle or CNO (Figure S1C). We found that on its own, CNO did not significantly impact either light cycle or dark cycle running. Together, these results, along with the lack of degeneration observed with CNO treatment in non-DREADD mice (Figure 2D), support that our behavioral and histological results are the result of dopamine neuron activation.

      (2) One of the most exciting things about this paper is that the SNc degenerates more strongly than the VTA when both regions are, in theory, excited to the same extent. However, it is not perfectly clear that both regions respond to CNO to the same extent. The electrophysiological data showing CNO responsiveness is only conducted in the SNc. If the VTA response is significantly reduced vs the SNc response, then the selectivity of the SNc degeneration could just be because the SNc was more hyperactive than the VTA. Electrophysiology experiments comparing the VTA and SNc response to CNO could support the idea that the SNc has substantial intrinsic vulnerability factors compared to the VTA.

      We agree that additional electrophysiology conducted in the VTA dopamine neurons would meaningfully add to our understanding of the selective vulnerability in this model, and have completed these experiments in the revision (Figure 1, Figure S2). We now show that in vivo treatment with CNO causes some of the same physiological changes in VTA dopamine neurons as we found in SNc dopamine neurons, including an increased spontaneous firing rate, and a similar decrease in responsiveness to CNO in the slice recordings. Together these observations support the conclusion that SNc axons are intrinsically more vulnerable to increased activity than VTA dopamine axons. 

      (3) The mice have access to a running wheel for the circadian rhythm experiments. Running has been shown to alter the dopaminergic system (Bastioli et al., 2022) and so the authors should clarify whether the histology, electrophysiology, fiber photometry, and transcriptomics data are conducted on mice that have been running or sedentary.

      We have clarified which mice had access to a running wheel in the methods of our revision. Briefly, mice for histology, electrophysiology, and transcriptomics all had access to a running wheel during their treatment. The mice used for photometry underwent about 7 days of running wheel access approximately 3 weeks prior to the beginning of the experiment. The photometry headcaps prevented mice from having access to a running wheel in their home cage. Mice used for non-responder and non-hM3Dq (CNO alone) experiments also had access to a running wheel during their treatment. Mice used for the isradipine experiment did not have access to a running wheel, as the number of mice was too large and, while unilateral hM3Dq expression allows for within-animal controls, it does not lend itself to clear interpretation of running wheel data.

      Reviewer #3 (Public Review):

      Summary:

      In this manuscript, Rademacher and colleagues examined the effect on the integrity of the dopamine system in mice of chronically stimulating dopamine neurons using a chemogenetic approach. They find that one to two weeks of constant exposure to the chemogenetic activator CNO leads to a decrease in the density of tyrosine hydroxylase staining in striatal brain sections and to a small reduction of the global population of tyrosine hydroxylase positive neurons in the ventral midbrain. They also report alterations in gene expression in both regions using a spatial transcriptomics approach. Globally, the work is well done and valuable and some of the conclusions are interesting. However, the conceptual advance is perhaps a bit limited in the sense that there is extensive previous work in the literature showing that excessive depolarization of multiple types of neurons associated with intracellular calcium elevations promotes neuronal degeneration. The present work adds to this by showing evidence of a similar phenomenon in dopamine neurons.

      We thank the reviewer for the careful and thoughtful review of our manuscript.

      While extensive depolarization and associated intracellular calcium elevations promote degeneration generally, we emphasize that the process we describe is novel. Indeed, prior studies delivering chronic DREADDs to vulnerable neurons in models of Alzheimer’s disease did not detect an increase in neurodegeneration, despite seeing changes in protein aggregation (e.g. Yuan and Grutzendler, J Neurosci 2016, PMID: 26758850; Hussaini et al., PLOS Bio 2020, PMID: 32822389). Further, a critical finding from our study is that in our paradigm, this stressor does not impact all dopamine neurons equally, as the SNc DA neurons are more vulnerable than VTA DA neurons, mirroring selective vulnerability characteristic of Parkinson’s disease. This is consistent with a large body of literature that SNc dopamine neurons are less capable of handling large energetic and calcium loads compared to neighboring VTA neurons, and the finding that chronically altered activity is sufficient to drive this preferential loss is novel. In addition, we are not aware of prior studies that have chronically activated DREADDs over several weeks to produce neurodegeneration.

      In terms of the mechanisms explaining the neuronal loss observed after 2 to 4 weeks of chemogenetic activation, it would be important to consider that dopamine neurons are known from a lot of previous literature to undergo a decrease in firing through a depolarization-block mechanism when chronically depolarized. Is it possible that such a phenomenon explains much of the results observed in the present study? It would be important to consider this in the manuscript.

      Thank you for this comment. As discussed in greater detail in the “comments on results section” below, our data suggests this isn’t a prominent feature in our model. However, we cannot rule out a contribution of depolarization block, and have expanded on the discussion of this possibility in the revised manuscript.

      The relevance to Parkinson's disease (PD) is also not totally clear because there is not a lot of previous solid evidence showing that the firing of dopamine neurons is increased in PD, either in human subjects or in mouse models of the disease. As such, it is not clear if the present work is really modelling something that could happen in PD in humans.

      We completely agree that evidence of increased dopamine neuron activity from human PD patients is lacking, and the little data that exists is difficult to interpret without human controls. However, as we outline in the manuscript, multiple lines of evidence suggest that the activity level of dopamine neurons almost certainly does change in PD. Therefore, it is very important that we understand how changes in the level of neural activity influence the degeneration of DA neurons. In this paper we examine the impact of increased activity. Increased activity may be compensatory after initial dopamine neuron loss, or may be an initial driver of death (Rademacher & Nakamura, Exp Neurol 2024, PMID: 38092187). In addition to the human and rodent data already discussed in the manuscript, additional support for increased activity in PD models includes:

      • Elevated firing rates in asymptomatic MitoPark mice (Good et al., FASEB J 2011, PMID: 21233488)

      • Increased frequency of spontaneous firing in patient-derived iPSC dopamine neurons and primary mouse dopamine neurons that overexpress synuclein (Lin et al., Acta Neuropath Comm 2021, PMID: 34099060)

      • Increased spontaneous firing in dopamine neurons of rats injected with synuclein preformed fibrils compared to sham (Tozzi et al., Brain 2021, PMID: 34297092)

      We have included citation of these important examples in our revision. In our model, we have found that chronic hyperactivity causes a substantial loss of nigral DA terminals while mesolimbic terminals are relatively spared (Figure 2), and that striatal DA levels are markedly decreased (Figure S6), phenomena that are hallmarks of Parkinson’s disease.

      There are additional levels of complexity to accurately model changes in PD, which may differ between subtypes of the disease, the disease stage, and the subtype of dopamine neuron. Our study models a form of increased intrinsic activity, and interpretation of our results will be facilitated as we learn more about how the activity of DA neurons changes in humans in PD. Similarly, in future studies, it will also be important to study the impact of decreasing DA neuron activity.

      Comments on the introduction:

      The introduction cites a 1990 paper from the lab of Anthony Grace as support of the fact that DA neurons increase their firing rate in PD models. However, in this 1990 paper, the authors stated that: "With respect to DA cell activity, depletions of up to 96% of striatal DA did not result in substantial alterations in the proportion of DA neurons active, their mean firing rate, or their firing pattern. Increases in these parameters only occurred when striatal DA depletions exceeded 96%." Such results argue that an increase in firing rate is most likely to be a consequence of the almost complete loss of dopamine neurons rather than an initial driver of neuronal loss. The present introduction would thus benefit from being revised to clarify the overriding hypothesis and rationale in relation to PD and better represent the findings of the paper by Hollerman and Grace.

      We agree that the findings of Hollerman and Grace support compensatory changes in dopamine neuron activity in response to loss of dopamine neurons, rather than informing whether increased activity can also be an initial driver of dopamine neuron loss. Importantly, while significant changes to burst firing were not seen until almost complete loss of dopamine neurons, these recordings were made in anesthetized rats, which may not be representative of neural activity in awake animals. We adjusted the text so that this is no longer referred to as ‘partial’ loss. At the same time, we point out that the results of other studies on this point are mixed: a 50% reduction in dopamine neurons didn’t alter firing rate or bursting (Harden and Grace, J Neurosci 1995, PMID: 7666198; Bilbao et al., Brain Res 2006, PMID: 16574080), while a 40% loss was found to increase firing rate and bursting (Chen et al., Brain Res 2009, PMID: 19545547) and larger reductions alter burst firing (Hollerman & Grace, Brain Res 1990, PMID: 2126975; Stachowiak et al., J Neurosci 1987, PMID: 3110381). Importantly, even if compensatory, such late-stage increases in dopamine neuron activity may contribute to disease progression and drive a vicious cycle of degeneration in surviving neurons. In addition, we also don’t know how the threshold of dopamine neuron loss and altered activity may differ between mice and humans, and PD patients do not present with clinical symptoms until ~30-60% of nigral neurons are lost (Burke & O’Malley, Exp Neurol 2013, PMID: 22285449; Shulman et al., Annu Rev Pathol 2011, PMID: 21034221).

      Other lines of evidence support the potential role of hyperactivity in disease initiation, including increased activity before dopamine neuron loss in MitoPark mice (Good et al., FASEB J 2011, PMID: 21233488), increased spontaneous firing in patient-derived iPSC dopamine neurons (Lin et al., Acta Neuropath Comm 2021, PMID: 34099060), and increased activity observed in genetic models of PD (Bishop et al., J Neurophysiol 2010, PMID: 20926611; Regoni et al., Cell Death Dis 2020, PMID: 33173027).

      It would be good that the introduction refers to some of the literature on the links between excessive neuronal activity, calcium, and neurodegeneration. There is a large literature on this and referring to it would help frame the work and its novelty in a broader context.

      We agree that a discussion of hyperactivity, calcium, and neurodegeneration would benefit the introduction. Accordingly, we have expanded on our citation of this literature in both the introduction and discussion sections. However, we believe that the novelty of our study lies in: 1) a chronic chemogenetic activation paradigm via drinking water, 2) demonstrating selective vulnerability of dopamine neurons as a result of altering their activity/excitability alone, and 3) comparing mouse and human spatial transcriptomics.

      Comments on the results section:

      The running wheel results of Figure 1 suggest that the CNO treatment caused a brief increase in running on the first day after which there was a strong decrease during the subsequent days in the active phase. This observation is also in line with the appearance of a depolarization block.

      The authors examined many basic electrophysiological parameters of recorded dopamine neurons in acute brain slices. However, it is surprising that they did not report the resting membrane potential, or the input resistance. It would be important that this be added because these two parameters provide key information on the basal excitability of the recorded neurons. They would also allow us to obtain insight into the possibility that the neurons are chronically depolarized and thus in depolarization block.

      We do report the input resistance in Figure S1C (now Figure S2A, S2B), which was unchanged in CNO-treated animals compared to controls. We did not previously report the resting membrane potential because many of the DA neurons were spontaneously firing. In the revision, we now report the initial membrane potential on first breaking into the cell for the whole cell recordings, which did not vary between groups (Figure S2). This is still influenced by action potential activity, but is the timepoint in the recording least impacted by dialyzing the neuron with the internal solution, which might alter the intracellular concentrations of ions. We observed increased spontaneous action potential activity ex vivo in slices from CNO-treated mice (Figure 1D), thus at least under these conditions these dopamine neurons are not in depolarization block. We also did not see strong evidence of changes in other intrinsic properties of the neurons with whole cell recordings (e.g. Figure S2). Overall, our electrophysiology experiments are not consistent with the depolarization block model, at least not due to changes in the intrinsic properties of the neurons. Although our ex vivo findings cannot exclude a contribution of depolarization block in vivo, we do show that CNO-treated mice removed from their cages for open field testing continue to have a strong trend for increased activity for approximately 10 days (Figure S4B). This finding is also consistent with increased activity of the DA neurons. We have added discussion of these important considerations in the revision.

      It is great that the authors quantified not only TH levels but also the levels of mCherry, coexpressed with the chemogenetic receptor. This could in principle help to distinguish between TH downregulation and true loss of dopamine neuron cell bodies. However, the approach used here has a major caveat in that the number of mCherry-positive dopamine neurons depends on the proportion of dopamine neurons that were infected and expressed the DREADD, and this could very well vary between different mice. It is very unlikely that the virus injection infected 100% of the neurons in the VTA and SNc. This could for example explain in part the mismatch between the number of VTA dopamine neurons counted in panel 2G when comparing TH and mCherry counts. Also, I see that the mCherry counts were not provided at the 2-week time point. If the mCherry had been expressed genetically by crossing the DAT-Cre mice with a floxed fluorescent reporter mouse line, the interpretation would have been simpler. In this context, I am not convinced of the benefit of the mCherry quantifications. The authors should consider either removing these results from the final manuscript or discussing this important limitation.

      We thank the reviewer for this comment, and we agree that this is a caveat of our mCherry quantification. Quantitation of the number of mCherry+ DA neurons specifically informs the impact on transduced DA neurons, and mCherry appears to be less susceptible to downregulation versus TH. As the reviewer points out, it carries the caveat that there is some variability between injections. Our control animals give us an indicator of injection variability, which is likely substantial and prevents us from detecting more subtle changes. Nonetheless, we believe that it conveys useful complementary data. We discuss this caveat in our revision. Note that mCherry was not quantified at the two-week timepoint because there is no loss of TH+ cells at that time.

      Although the authors conclude that there is a global decrease in the number of dopamine neurons after 4 weeks of CNO treatment, the post-hoc tests failed to confirm that the decrease in dopamine neuron number was significant in the SNc, the region most relevant to Parkinson's. This could be due to the fact that only a small number of mice were tested. An "n" of just 4 or 5 mice is very small for a stereological counting experiment. As such, this experiment was clearly underpowered at the statistical level. Also, the choice of the image used to illustrate this in panel 2G should be reconsidered: the image suggests that a very large loss of dopamine neurons occurred in the SNc and this is not what the numbers show. A more representative image should be used.

      We agree that the stereology experiments were performed on relatively small numbers of animals, such that only robust effects would be detected. Combined with the small effect size, this may have contributed to the post-hoc tests showing a trend of p=0.1 for both the TH and mCherry dopamine cell counts in the SN at 4 weeks. Given this small effect size, we would indeed need much larger groups to better discern these changes. Stereology is an intensive technique, and we have therefore elected to focus on terminal loss. We have also replaced panel 2G with a more representative CNO image.

      In Figure 3, the authors attempt to compare intracellular calcium levels in dopamine neurons using GCaMP6 fluorescence. Because this calcium indicator is not quantitative (unlike ratiometric sensors such as Fura2), it is usually used to quantify relative changes in intracellular calcium. The present use of this probe to compare absolute values is unusual and the validity of this approach is unclear. This limitation needs to be discussed. The authors also need to refer in the text to the difference between panels D and E of this figure. It is surprising that the fluctuations in calcium levels were not quantified. I guess the hypothesis was that there should be more or larger fluctuations in the mice treated with CNO if the CNO treatment led to increased firing. This needs to be clarified.

      We thank the reviewer for this comment. We understand that this method of comparing absolute values is unconventional. However, these animals were tested concurrently on the same system, and a clear effect on the absolute baseline was observed. We have included a caveat of this in our discussion. Panel D of this figure shows the raw, uncorrected photometry traces, whereas panel E shows the isosbestic corrected traces for the same recording. In panel E, the traces follow time in ascending order. We have also included frequency and amplitude data for these recordings (Figure S4A), along with discussion of the significance of these findings.
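
      The distinction between raw and isosbestic-corrected traces drawn above can be illustrated with a minimal sketch. It assumes one common form of correction (a least-squares linear fit of the isosbestic control channel to the calcium-dependent channel, followed by a dF/F calculation) and is not taken from the manuscript; the function name `isosbestic_dff` and the toy values are hypothetical.

```python
def isosbestic_dff(signal, control):
    """Return dF/F after fitting the isosbestic control channel to the signal.

    Assumed approach (not the authors' confirmed pipeline): ordinary
    least-squares fit signal ~ slope * control + intercept, then
    dF/F = (signal - fitted) / fitted for every sample.
    """
    n = len(signal)
    if n != len(control) or n < 2:
        raise ValueError("signal and control must have equal length >= 2")

    mean_s = sum(signal) / n
    mean_c = sum(control) / n
    var_c = sum((c - mean_c) ** 2 for c in control)
    if var_c == 0:
        raise ValueError("control channel is constant; fit is undefined")

    # Closed-form least-squares slope and intercept.
    cov = sum((c - mean_c) * (s - mean_s) for c, s in zip(control, signal))
    slope = cov / var_c
    intercept = mean_s - slope * mean_c

    fitted = [slope * c + intercept for c in control]
    return [(s - f) / f for s, f in zip(signal, fitted)]


# Toy example: the signal tracks the control except for a transient at index 2.
control = [1.0, 2.0, 3.0, 4.0]
signal = [3.0, 5.0, 7.5, 9.0]
dff = isosbestic_dff(signal, control)
```

      Because a dF/F of this kind is relative by construction, it removes shared drift but also discards the absolute fluorescence level, which is why a comparison of absolute baselines between animals relies on the uncorrected traces and on concurrent recording on the same system.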

      Although the spatial transcriptomic results are intriguing and certainly a great way to start thinking about how the CNO treatment could lead to the loss of dopamine neurons, the presented results, which focus on some broad classes of differentially expressed genes and on some specific examples, do not really suggest any clear mechanism of neurodegeneration. It would perhaps be useful for the authors to use the obtained data to validate that a state of chronic depolarization was indeed induced by the chronic CNO treatment. Were genes classically linked to increased activity like cfos or bdnf elevated in the SNc or VTA dopamine neurons? In the striatum, the authors report that the levels of DARPP32, a gene whose levels are linked to dopamine levels, are unchanged. Does this mean that there were no major changes in dopamine levels in the striatum of these mice?

      While levels of DARPP32 mRNA were unchanged, our additional HPLC data show strong decreases in striatal dopamine in hyperactivated mice. We do not see strong changes in classic activity-related genes (data not shown); however, these genes may behave differently in the context of chronic hyperactivity and ongoing degeneration. Instead, we employed NEUROeSTIMator (Bahl et al., Nature Comm. 2024, PMID: 38278804), a deep learning method to predict neural activation based on transcriptomic data. We found that predicted activity scores were significantly higher in GqCNO dopaminergic regions compared to controls (Figure X). Indeed, some of the genes used within the model to predict activity are immediate early genes, e.g., c-fos.

      The usefulness of comparing the transcriptome of human PD SNc or VTA sections to that of the present mouse model should be better explained. In the human tissues, the transcriptome reflects the state of the tissue many years after extensive loss of dopamine neurons. It is expected that there will be few if any SNc neurons left in such sections. In comparison, the mice after 7 days of CNO treatment do not appear to have lost any dopamine neurons. As such, how can the two extremely different conditions be reasonably compared?

      Our mouse model and human PD progress over distinct timescales, as is the case with essentially all mouse models of neurodegenerative diseases. Nonetheless, in our view there is still great value in comparing gene expression changes in mouse models with those in human disease. It seems very likely that the same pathologic processes that drive degeneration early in the disease continue to drive degeneration later in the disease. Note that we have tried to address the discrepancy in time scales in part by comparing our mouse model to early PD samples when there is more limited SNc DA neuron loss (see the proportion of DA neurons within the areas of human tissues we selected for sampling in Author response image 1). Therefore, we can indeed use spatial transcriptomics to compare dopamine neurons from mice with initial degeneration to those in patients where degeneration is ongoing.

      Author response image 1.

      Violin plot of DA neuron proportions sampled within the vulnerable SNV (deconvoluted RCTD method used in unmasked tissue sections of the SNV). Control and early PD subjects.

      Comments on the discussion:

      In the discussion, the authors state that their calcium photometry results support a central role of calcium in activity-induced neurodegeneration. This conclusion, although plausible because of the very broad pre-existing literature linking calcium elevation (such as in excitotoxicity) to neuronal loss, should be toned down a bit as no causal relationship was established in the experiments that were carried out in the present study.

      Our model utilizes hM3Dq-DREADDs that function by activating Gq pathways that are classically expected to increase intracellular calcium to increase neuronal excitability. Indeed, in slices from mice that were not treated with CNO, acute CNO application caused depolarizations (Figure 1E) that can be due to an increase in intracellular calcium and can also cause increases in intracellular calcium. Additionally, our results show increased calcium by fiber photometry and changes to calcium-related genes, suggesting a causal relation and crucial role of calcium in the mechanism of degeneration. However, we agree that we have not experimentally proven this point. Indeed, a small preliminary experiment with chronic isradipine failed to show protection, although it lacked power to detect a partial effect. We have acknowledged this in the text, and also briefly consider other mechanisms, such as increased dopamine levels, that could also mediate the toxicity.

      In the discussion, the authors discuss some of the parallel changes in gene expression detected in the mouse model and in the human tissues. Because few if any dopamine neurons are expected to remain in the SNc of the human tissues used, this sort of comparison has important conceptual limitations and these need to be clearly addressed.

      As discussed, we sampled SN DA neurons in early PD (see Author response image 1), and in our view there is great value for such comparisons.

      A major limitation of the present discussion is that it does not discuss the possibility that the observed phenotypes are caused by the induction of a chronic state of depolarization block by the chronic CNO treatment. I encourage the authors to consider and discuss this hypothesis.

      As discussed above, our analyses of DA neuron firing in slices and open field testing to date do not support a prominent contribution of depolarization block with chronic CNO treatment. However, we cannot rule out this hypothesis; we have therefore included additional electrophysiology experiments and have added discussion of this important consideration.

      Also, the authors need to discuss the fact that previous work was only able to detect an increase in the firing rate of dopamine neurons after more than 95% loss of dopamine neurons. As such, the authors need to clearly discuss the relevance of the present model to PD. Are changes in firing rate a driver of neuronal loss in PD, as the authors argue here, or are such changes only a secondary consequence of extensive neuronal loss (for example because a major loss of dopamine would lead to reduced D2 autoreceptor activation in the remaining neurons, and to reduced autoreceptor-mediated negative feedback on firing)? This needs to be discussed.

      As discussed above, while increases in dopamine neuron activity may be compensatory after loss of neurons, the precise percentage required to induce such compensatory changes is not defined in mice and varies between paradigms, and the threshold level is not known in humans. We also reiterate that a compensatory increase in activity could still promote the degeneration of critical surviving DA neurons, whose loss underlies the substantial decline in motor function that typically occurs over the course of PD. Moreover, there are also multiple lines of evidence to suggest that changes in activity can initiate and drive dopamine neuron degeneration (Rademacher & Nakamura, Exp Neurol 2024). For example, overexpression of synuclein can increase firing in cultured dopamine neurons (Dagra et al., NPJ Parkinsons Dis 2021, PMID: 34408150), while mice expressing mutant Parkin have higher mean firing rates (Regoni et al., Cell Death Dis 2020, PMID: 33173027). Similarly, an increased firing rate has been reported in the MitoPark mouse model of PD at a time preceding DA neuron degeneration (Good et al., FASEB J 2011, PMID: 21233488). We also acknowledge that alterations to dopamine neuron activity are likely complex in PD, and that dopamine neuron health and function can be impacted not just by simple increases in activity, but also by changes in activity patterns and regularity. We have amended our discussion to include the important caveat of changes in activity occurring as compensation, as well as further evidence of changes in activity preceding dopamine neuron death.

      There is a very large, multi-decade literature on calcium elevation and its effects on neuronal loss in many different types of neurons. The authors should discuss their findings in this context and refer to some of this previous work. In a nutshell, the observations of the present manuscript could be summarized by stating that the chronic membrane depolarization induced by the CNO treatment is likely to induce a chronic elevation of intracellular calcium and this is then likely to activate some of the well-known calcium-dependent cell death mechanisms. Whether such cell death is linked in any way to PD is not really demonstrated by the present results. The authors are encouraged to perform a thorough revision of the discussion to address all of these issues, discuss the major limitations of the present model, and refer to the broad pre-existing literature linking membrane depolarization, calcium, and neuronal loss in many neuronal cell types.

      While our model demonstrates classic excitotoxic cell death pathways, we would like to emphasize both the chronic nature of our manipulation and the progressive changes observed, with increasing degeneration seen at 1, 2, and 4 weeks of hyperactivity in an axon-first manner. This is a unique aspect of our study, in contrast to much of the previous literature, which has focused on shorter timescales. Thus, while we have revised the discussion to more comprehensively acknowledge previous studies of calcium-dependent neuron cell death, we believe we have made several new contributions that are not predicted by existing literature. We have shown that this chronic manipulation is specifically toxic to nigral dopamine neurons, and the finding that VTA dopamine neurons continue to be resilient even at 4 weeks is interesting and disease-relevant. We therefore do not want to use findings from other neuron types to draw assumptions about DA neurons, which are a unique and very diverse population. We acknowledge that, as with all preclinical models of PD, we cannot draw definitive conclusions about PD from these data. However, we reiterate that we strongly believe that drawing connections to human disease is important, as dopamine neuron activity is very likely altered in PD and a clearer understanding of how dopamine neuron survival is impacted by activity will provide insight into the mechanisms of PD.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) The temporal design of the experiments is quite confusing. For instance, Figures 1 and 3 illustrate the daily changes of the mice and suggest some critical time points within 2 weeks of CNO administration, whereas Figure 2 presents data at 2 and 4 weeks, which are much later than the proposed critical time points. Furthermore, Figure 4 includes only 1 week data, and lacks subsequent data from 2 and 4 weeks, at which significant changes such as calcium levels and neuronal/axonal degeneration are observed.

      While interesting behavior and calcium phenotypes were detected within 2 and 4 weeks of CNO administration (Figures 1 and 3), we only collected tissues for histology at the 2 and 4 week time points (Figure 2). Observing degeneration of DA neuron axons but not cell bodies at 2 weeks served as a rationale to extend to the 4 week time point to determine whether degeneration was progressive. At the same time, our primary focus is on identifying early changes that may drive or contribute to the degeneration. As such, we recorded calcium changes over a 2-week treatment period, capturing the period during which almost all of the dopamine axons are lost. Similarly, we had the capacity to perform spatial transcriptomics at only one time point, and the 1 week time point was selected to capture transcriptomic changes that precede and potentially contribute to the mild and severe degeneration that occurs at 2 and 4 weeks, respectively. We have added text clarifying the rationale for the time points chosen.

      (2) The authors showed the changes in neuronal firing in dopamine neurons by the administration of CNO. However, one of the most important features of dopaminergic neuronal activity is dopamine release at its axon terminals in the striatum. Thus, the claims raised in this paper would be better supported if the authors further show any alterations in dopamine release (by FSCV or fluorescent dopamine sensors) at some critical time points during or after CNO application.

      While we are confident that DA release is altered due to the significant changes in behavior when hM3Dq DREADDs are activated specifically in DA neurons, the current manuscript does not quantify this, or distinguish between axonal and somatodendritic DA release. Interestingly, we did find significantly decreased striatal dopamine by HPLC after chronic activation (Figure S6). We believe that resolving these questions is beyond the scope of this manuscript, but have added text indicating the importance of these experiments.

      (3) The authors used 2% sucrose as a vehicle via drinking water. Please explain the rationale behind this choice.

      We used 2% sucrose as the vehicle because it is also added to the CNO water to counteract the bitterness of CNO (Kumar et al., J Neurotrauma 2024, PMID: 37905504). We have clarified this in the manuscript.

      (4) As we know, mRNA levels of some genes do not always predict their protein levels; there is sometimes a huge discrepancy between mRNA and protein abundance. In this paper, the mechanistic interpretation of the results by the authors heavily relies on the spatial transcriptomics of the midbrain and striatum. Thus, the authors need to provide additional data proving that the gene expression of some genes in the CNO group is also changed at the level of protein.

      We agree that validating hits at the protein level is valuable; however, we were limited in our ability to assess these changes for the revision. We have nonetheless performed additional transcriptomics with the high-resolution Xenium platform to increase confidence in a subset of hits of interest for follow-up in future work, and we have included data on genes related to DA metabolism and markers of DA neurons.

      (5) The authors provided spatial transcriptomics data only for mice with one week of chronic activation. However, other data also indicate significant differences when the activation period extends beyond 10 to 12 days (Figure 1C, Figure 3D-F). While a 7-day chronic activation time point might be crucial, additional transcriptomics data from later time points would be beneficial to confirm the persistence of these changes in gene expression. Furthermore, differential gene expression (DEG) analysis at these later time points could identify novel pathways or genes influenced by the chronic activation of dopamine neurons.

      This is an interesting point, and such data would be valuable for understanding how chronic activity influences gene expression; however, additional transcriptomics at later time points is beyond the scope of this paper. In future studies we will assess the changes observed in this manuscript at other time points.

      (6) Figure 1D, Figure S1C:

      The authors should present the sample recording traces to demonstrate that the electrophysiological recordings were appropriately made.

      These data have been provided in Figure S2.

      (7) Figure S1C:

      AP thresholds in SNc dopamine neurons from both groups look quite high. In addition, considering the data from the previous reports, AP peak amplitudes in SNc dopamine neurons from both groups seem to be very low. Are these values correct? 

      The thresholds and peaks are correct, including the AP amplitude (threshold to peak), which is typical in our (Dr. Margolis’s) experience. AP thresholds are measured from an average of at least 10 APs, as the voltage at which the derivative of the trace first exceeds 10 V/s. As mentioned in the methods section, junction potentials were not corrected, which can result in values that are a bit depolarized from ground truth. This junction potential would be consistent across all recordings, and thus would not impede detection of a difference in AP thresholds between groups of animals.
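
      The threshold criterion described above can be illustrated with a minimal sketch. It assumes an AP waveform in mV sampled at 10 kHz (the sampling rate mentioned elsewhere in this response) and that the "average of at least 10 APs" refers to a point-by-point mean of pre-aligned waveforms; the function names and the alignment assumption are hypothetical and are not taken from the authors' analysis code.

```python
def ap_threshold_mv(waveform_mv, fs_hz=10_000.0, dvdt_criterion=10.0):
    """Voltage (mV) at the first sample where dV/dt exceeds the criterion (V/s).

    Returns None if the criterion is never reached.
    """
    dt_ms = 1000.0 / fs_hz  # sample interval in ms; note that mV/ms equals V/s
    for i in range(len(waveform_mv) - 1):
        dvdt = (waveform_mv[i + 1] - waveform_mv[i]) / dt_ms
        if dvdt > dvdt_criterion:
            return float(waveform_mv[i])
    return None


def average_waveform(aligned_aps_mv):
    """Point-by-point mean of equally long, pre-aligned AP waveforms (assumed)."""
    n = len(aligned_aps_mv)
    return [sum(samples) / n for samples in zip(*aligned_aps_mv)]
```

      Because an uncorrected junction potential adds the same offset to every sample, it shifts the returned threshold by a constant but leaves the derivative, and therefore the detected sample, unchanged.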

      (8) Figure 1E:

      It would be better if the statistical significance is depicted in the graph.

      We don’t perform repeated measures statistics across data like these, as the data are continuous, collected at 10 kHz. For ease of display, the data for each neuron are binned and then these traces are averaged together. We display SEM to give a sense of the variance across neurons. We have provided sample traces of individual neurons to better demonstrate the variability and significance of these data (Figure S2).
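
      The binning and averaging described above could look roughly like the following sketch. The bin width, the handling of a trailing partial bin, and the function names are assumptions made for illustration only; the response does not specify them.

```python
import math
import statistics


def bin_trace(trace, samples_per_bin):
    """Average consecutive samples into bins; a trailing partial bin is dropped."""
    n_bins = len(trace) // samples_per_bin
    return [
        statistics.fmean(trace[b * samples_per_bin:(b + 1) * samples_per_bin])
        for b in range(n_bins)
    ]


def mean_and_sem(binned_traces):
    """Across-neuron mean and SEM for each bin (needs at least two neurons)."""
    n = len(binned_traces)
    means, sems = [], []
    for values in zip(*binned_traces):  # one tuple of per-neuron values per bin
        means.append(statistics.fmean(values))
        sems.append(statistics.stdev(values) / math.sqrt(n))
    return means, sems
```

      Here each neuron contributes one binned trace, so the SEM reflects variability across neurons rather than across the 10 kHz samples within a neuron.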

      (9) Figure 2C:

      The representative staining images appear to be taken from coronal slices at anatomically different positions along the rostral-to-caudal axis. Although the total numbers of TH+ cells are comparable between vehicle and CNO groups in the graph, the sample images do not reflect this result. The authors should replace the current images with the better ones.

      We have replaced this image in the manuscript.

      Reviewer #2 (Recommendations For The Authors):

      Minor concerns:

      (1) The authors claim that their transcriptomics experiments are conducted 'before any degeneration has occurred'. And they do not see significant differences in the TH expression in the striatum. However, the n for these mice at 1 week is lower than the n used at 2 weeks (n=5 vs n=8-9) and the images used to show 'no degeneration' really look like there is some degeneration going on. Also, throughout the paper, there is a stronger effect when degeneration is measured with mCherry compared to when it is measured with TH. The 'no change' claim is made only with the TH comparison. It seems possible (and almost likely) that there would be significant axonal degeneration at one week with either a higher sample size or using the mCherry comparison. The authors should simply claim that their transcriptomics data is collected before any 'somatic' degeneration occurs.

      Thank you, we have included data that shows partial terminal loss after one week of activation (Figure S3B, Figure S5A) and have corrected this language in the manuscript to reflect transcriptomics occurring before somatic degeneration.

      (2) While selective degeneration is one of the most interesting findings in the paper, that finding is not emphasized and why it would be interesting to compare the VTA vs SNc is not discussed in the introduction.

      Emphasis for comparing the VTA vs the SNc has been added to the introduction, along with additional electrophysiology data in VTA dopamine neurons in Figure 1 and Figure S2.

      (3) In a similar direction, the vulnerability of dopaminergic neurons has been shown to be differential even within the SNc, with the ventral tier neurons degenerating more severely and the dorsal tier neurons remaining resilient. Is there any evidence for a ventral-dorsal degeneration gradient in the SNc in these experiments?

      This is a really interesting point, and changes to dopamine neuron subtypes along the ventral-dorsal axis may be occurring in this model, particularly as there is more selective loss of SNc neurons. However, the cell type involved would be difficult to determine at this stage, since single cell transcriptomic resolution is necessary across the entire SNc to identify cell subtypes. Transcriptomic identification is further complicated given that the transcriptome has recently been shown to change with genetic manipulation (Gaertner et al., bioRxiv 2024, PMID: 38895448), and we expect that it could similarly change with increased activity. Assessing these issues is beyond the scope of this paper.

      (4) The running data is very interesting and the circadian rhythm alterations are compelling. However, it is unclear whether the CNO mice run more total compared with the vehicle mice. The authors should show the combined total running data to evaluate this.

      We now show total running data in Figure 1C.

      (5) The finding that acute CNO has no effect on the membrane potential of SNc neurons after chronic CNO exposure is very peculiar! Especially because the fiber photometry data suggests that CNO continues to have an effect in vivo. Is there any explanation for this?

      While there is no acute electrophysiological response to CNO detected in this group, there may be intracellular pathways activated by the DREADD that do not acutely impact membrane potential in current clamp (I = 0 pA) mode.

      (6) The terminology of chronic CNO is sometimes confusing as it refers to both 2-week and 4-week administration. Using additional terminology such as 'early' and 'late' might help with clarity.

      We have decreased usage of ‘chronic,’ and increased usage of more specific treatment times in order to increase clarity throughout the manuscript.

      (7) In Figure 2C, the SNc image looks binarized.

      This image has been updated.

      (8) Also in Figure 2, why are TH and mCherry measured for the 4-week time point, but only TH measured for the 2-week time point?

      mCherry quantification was performed to further support the finding of DA neuron death, and was therefore not assessed at 2 weeks given that there was no change in the TH stereology.

      (9) Additional scale bars and labeling are needed in Figure 3. In addition, there is such a strong reduction in noise after chronic CNO in the fiber photometry recordings, and the noise does not return upon CNO washout. What is the explanation for this?

      Additional scale bars were added to Figure 3. Traces are not getting less noisy with chronic CNO treatment, rather, there is less bursting activity in the dopamine cells. Our interpretation is that the baseline activity is rescued during washout but this bursting activity is not.

      (10) While not necessary to support the claims in this paper, it would be very interesting to see if chronic inhibition of dopaminergic neurons had a similar or different effect, as too little dopaminergic activity may also cause degeneration in some cases.

      We agree that assessing chronic inhibition is valuable, and this is an important area for future research.

      Reviewer #3 (Recommendations For The Authors):

      Not all of the mice used in the study are listed in the methods section. For example, the GCaMP6f floxed mice discussed in the results section are not listed in the methods. Also, the breeding scheme used for the different mouse lines needs to be described. For example, did the DAT-Cre mice carry one or two alleles?

      Both the DAT<sup>IRES</sup>Cre and GCaMP6f floxed (Ai148) Jax mouse line numbers and RRIDs are included in the methods. DAT<sup>IRES</sup>Cre mice carried two alleles.

      In the methods section, the amount of virus injected needs to be mentioned.

      This information has been added to the methods section.

      In all result graphs, please include the individual data points so that the readers can see the distribution of the data and quickly see the sample size.

      Graphs have been updated to include all individual data points. For line graphs, the distribution is communicated by the error bars, while the n is in the legends.

      The authors provide running wheel data in supplementary figure 1A to validate that chemogenetic activation of dopamine neurons leads to increased locomotor activity. The results shown in the figure appear to be qualitative as no average data is presented. The authors should provide average data from all mice tested.

      Average IP response data for all mice assessed for running wheel activity has been included in Figure S1.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment 

      fMRI was used to address an important aspect of human cognition - the capacity for structured representations and symbolic processing - in a cross-species comparison with non-human primates (macaques); the experimental design probed implicit symbolic processing through reversal of learned stimulus pairs. The authors present solid evidence in humans that helps elucidate the role of brain networks in symbolic processing, however the evidence from macaques was incomplete (e.g., sample size constraints, potential and hard-to-quantify differences in attention allocation, motivation, and lived experience between species).

      Thank you very much for your assessment. We would like to address the potential issues that you raise point-by-point below.

      We agree that for macaque monkey physiology, sample size is always a constraint, for both financial and ethical reasons. We addressed this concern by combining the results from two different labs, which allowed us to test 4 animals in total, twice as many as is common practice in the field of primate physiology. (We discuss this now on lines 473-478.)

      Interspecies differences in motivation, attention allocation, task strategies etc. could also be limiting factors. Note that we did address the potential lack of attention allocation directly in Experiment 2 using implicit reward association, which was successful as evidenced by the activation of attentional control areas in the prefrontal cortex. We cannot guarantee that the strategies that the two species deploy are identical, but we tentatively suggest that this might be a less important factor in the present study than in other interspecies comparisons that use explicit behavioral reports. In the current study, we directly measured surprise responses in the brain in the absence of any explicit instructions in either species, which allowed us to measure the spontaneous reversal of learned associations, which is a very basic element of symbolic representation. Our reasoning is that such spontaneous responses should be less dependent on attention allocation and task strategies. (We discuss this now in more detail on lines 478-485.)

      Finally, lived experience could be a major factor. Indeed, obvious differences include a lifetime of open-field experiences and education in our human adult subjects, which was not available to the monkey subjects, and includes a strong bias towards explicit learning of symbolic systems (e.g. words, letters, digits, etc). However, we have previously shown using EEG that 5-month-old human infants spontaneously generalize learning to the reversed pairs after a short learning period in the lab (Kabdebon et al, PNAS, 2019). This indicates that, even with very limited experience, humans spontaneously reverse learned associations. (We discuss this now in more detail on lines 478-485.) It could be very interesting to investigate whether spontaneous reversal could be present in infant macaque monkeys, as there might be a critical period for this effect. Although neurophysiology in awake infant monkeys is highly challenging, it would be very relevant for future work. (We discuss this in more detail on lines 493-498.)

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Kerkoerle and colleagues present a very interesting comparative fMRI study in humans and monkeys, assessing neural responses to surprise reactions at the reversal of a previously learned association. The implicit nature of this task, assessing how this information is represented without requiring explicit decision-making, is an elegant design. The paper reports that both humans and monkeys show neural responses across a range of areas when presented with incongruous stimulus pairs. Monkeys also show a surprise response when the stimuli are presented in a reversed direction. However, humans show no such surprise response based on this reversal, suggesting that they encode the relationship reversibly and bidirectionally, unlike the monkeys. This has been suggested as a hallmark of symbolic representation, that might be absent in nonhuman animals. 

      I find this experiment and the results quite compelling, and the data do support the hypothesis that humans are somewhat unique in their tendency to form reversible, symbolic associations. I think that an important strength of the results is that the critical finding is the presence of an interaction between congruity and canonicity in macaques, which does not appear in humans. These results go a long way to allay concerns I have about the comparison of many human participants to a very small number of macaques. 

      We thank the reviewer for the positive assessment. We also very much appreciate the point about the interaction effect in macaque monkeys – indeed, we do not report just a negative finding. 

      I understand the impossibility of testing 30+ macaques in an fMRI experiment. However, I think it is important to note that differences necessarily arise in the analysis of such datasets. The authors report that they use '...identical training, stimuli, and whole-brain fMRI measures'. However, the monkeys (in experiment 1) actually required 10 times more training. 

      We agree that this description was imprecise. We have changed it to “identical training stimuli” (line 151); indeed, the movies used for training were strictly identical. Furthermore, please note that we do report the fMRI results after the same training duration. In experiment 1, after 3 days of training, the monkeys did not show any significant results, even in the canonical direction. However, in experiment 2, with increased attention and motivation, a significant effect was observed on the first day of scanning after training, as was found in human subjects (see Figure 4 and Table 3).

      More importantly, while the fMRI measures are the same, group analysis over 30+ individuals is inherently different from comparing only 2 macaques (including smoothing and averaging away individual differences that might be more present in the monkeys, due to the much smaller sample size). 

      Thank you for understanding that a limited sampling size is intrinsic to macaque monkey physiology. We also agree that data analysis in humans and monkeys is necessarily different. As suggested by the reviewer, we added an analysis to address this, see the corresponding reply to the ‘Recommendations for the authors’ section below.

      Despite this, the results do appear to show that macaques show the predicted interaction effect (even despite the sample size), while humans do not. I think this is quite convincing, although had the results turned out differently (for example an effect in humans that was absent in macaques), I think this difference in sample size would be considerably more concerning. 

      Thank you for noting this. Indeed, the interaction effect is crucial, and the task design was explicitly made to test this precise prediction, described in our manuscript as the “reversibility hypothesis”. The congruity effect in the learned direction served as a control for learning, while the corresponding congruity effect in the reversed direction tested for spontaneous reversal. The reversibility hypothesis stipulates that in humans there should not be a difference between the learned and the reversed direction, while there should be for monkeys. We already wrote about that in the result section of the original manuscript and now also describe this more explicitly in the introduction and beginning of the result section.
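
      The logic of the reversibility hypothesis can be written as a simple interaction contrast, sketched below. The function names and the example response values are hypothetical illustrations and do not correspond to any measured data in the study.

```python
def congruity_effect(incongruent, congruent):
    """Surprise response: incongruent minus congruent activation."""
    return incongruent - congruent


def interaction_contrast(canon_cong, canon_incong, rev_cong, rev_incong):
    """Congruity x canonicity interaction.

    Near zero if the congruity effect is the same in both directions
    (spontaneous reversal); positive if the effect is present only, or
    more strongly, in the learned (canonical) direction.
    """
    return (congruity_effect(canon_incong, canon_cong)
            - congruity_effect(rev_incong, rev_cong))
```

      With made-up values, `interaction_contrast(1.0, 2.0, 1.0, 2.0)` gives 0.0 (the pattern predicted for humans), whereas `interaction_contrast(1.0, 2.0, 1.0, 1.0)` gives 1.0 (the pattern predicted for monkeys).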

      I would also note that while I agree with the authors' conclusions, it is notable to me that the congruity effect observed in humans (red vs blue lines in Fig. 2B) appears to be far more pronounced than any effect observed in the macaques (Fig. 3C-3). Again, this does not challenge the core finding of this paper but does suggest methodological or possibly motivational/attentional differences between the humans and the monkeys (or, for example, that the monkeys had learned the associations less strongly and clearly than the humans). 

      As also explained in response to the eLife assessment above, we expanded the “limitations” section of the discussion, with a deeper description of the possible methodological differences between the two species (see lines 478-485).

      With the same worry in mind, we did increase the attention and motivation of monkeys in experiment 2, and indeed obtained a greater activation to the canonical pairs and their violation, notably in the prefrontal cortex, but crucially still without reversibility.

      In the end, we believe that the striking interspecies difference in size and extent of the violation effect, even for purely canonical stimuli, is an important part of our findings and points to a more efficient species-specific learning system, that our experiment tentatively relates to a symbolic competence.

      This is a strong paper with elegant methods and makes a worthwhile contribution to our understanding of the neural systems supporting symbolic representations in humans, as opposed to other animals. 

      We again thank the reviewer for the positive review.

      Reviewer #2 (Public Review): 

      In their article titled "Brain mechanisms of reversible symbolic reference: a potential singularity of the human brain", van Kerkoerle et al address the timely question of whether non-human primates (rhesus macaques) possess the ability for reverse symbolic inference as observed in humans. Through an fMRI experiment in both humans and monkeys, they analyzed the BOLD signal in both species while observing audio-visual and visual-visual stimuli pairs that had been previously learned in a particular direction. Remarkably, the findings pertaining to humans revealed that a broad brain network exhibited increased activity in response to surprises occurring in both the learned and reverse directions. Conversely, in monkeys, the study uncovered that the brain activity within sensory areas only responded to the learned direction but failed to exhibit any discernible response to the reverse direction. These compelling results indicate that the capacity for reversible symbolic inference may be unique to humans.

      In general, the manuscript is skillfully crafted and highly accessible to readers. The experimental design exhibits originality, and the analyses are tailored to effectively address the central question at hand.

      Although the first experiment raised a number of methodological inquiries, the subsequent second experiment thoroughly addresses these concerns and effectively replicates the initial findings, thereby significantly strengthening the overall study. Overall, this article is already of high quality and brings new insight into human cognition. 

      We sincerely thank the reviewer for the positive comments. 

      I identified three weaknesses in the manuscript: 

      - One major issue in the study is the absence of significant results in monkeys. Indeed, authors draw conclusions regarding the lack of significant difference in activity related to surprise in the multidemand network (MDN) in the reverse congruent versus reverse incongruent conditions. Although the results are convincing (especially with the significant interaction between congruency and canonicity), the article could be improved by including additional analyses in a priori ROI for the MDN in monkeys (as well as in humans, for comparison). 

      First, we disagree with the statement about “absence of significant results in monkeys”. We do report a significant interaction which, as noted by the referee, is a crucial positive finding.

      Second, we performed the suggested analysis for experiment 2, using the bilateral ROIs of the putative monkey MDN from previous literature (Mitchell et al., 2016), which are based on the human study by Fedorenko et al. (PNAS, 2013).

      Author response table 1.

      Congruity effect for monkeys in Experiment 2 within the ROIs of the MDN (n=3). Significance was assessed with one-sided one-sample t-tests.

      As can be seen, none of the regions within the monkey MDN showed an FDR-corrected significant difference or interaction. Although the absence of a canonical congruity effect makes it difficult to draw strong conclusions, it did approach significance at an uncorrected level in the lateral frontal posterior region, similar to the large prefrontal effect we report in Figures 4 and 5. Furthermore, for the reversed congruity effect there was never even a trend at the uncorrected level, and the crucial interaction of canonicity and congruity again approached significance in the lateral prefrontal cortex.
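
      To make the distinction between uncorrected and FDR-corrected significance concrete, the sketch below applies the Benjamini-Hochberg step-up procedure to a list of per-ROI p-values. The response does not state which FDR procedure was used, so Benjamini-Hochberg is an assumption here, and the p-values in the usage note are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

      For example, with hypothetical p-values `[0.01, 0.04, 0.03, 0.20]`, the ROI with p = 0.04 is below 0.05 uncorrected but has an adjusted value of about 0.053, which is the kind of "approaching significance at an uncorrected level" pattern described above.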

      We also performed an ANOVA in the human participants of the VV experiment on the average betas across the 7 different fronto-parietal ROIs used by Mitchell et al. to define their equivalent in the monkey brain (Fig 1a, right in Mitchell et al. 2016), with congruity, canonicity and hemisphere (except for the anterior cingulate, which is a bilateral ROI) as within-subject factors. We confirmed the results presented in the manuscript (Figure 4C), with notably no significant interaction between congruity and canonicity in any of these ROIs (all F-values (except insula) <1). A significant main effect of congruity was observed in the posterior middle frontal gyrus (MFG) and inferior precentral sulcus at the FDR-corrected level. Analyses restricted to the canonical trials found a congruity effect in these two regions plus the anterior insula and anterior cingulate/presupplementary motor area, whereas no ROIs were significant at an FDR-corrected level for reversed trials. There was a trend in the middle MFG and inferior precentral region for reversed trials. Crucially, there was not even a trend for the interaction between congruity and canonicity at the uncorrected level. The difference in effect size between the canonical and reversed directions can therefore be explained by the larger statistical power due to the larger number of congruent trials (70%, versus 10% for the other trial conditions), not by a significant difference between the canonical and the reversed directions.

      Author response table 2.

      Congruity effect for humans in Experiment 2 within the ROIs of the MDN (n=23).

      These results support our contention that the type of learning of the stimulus pairs was very different in the two species. We thank the reviewer for suggesting these relevant additional analyses.

      - While the authors acknowledge in the discussion that the number of monkeys included in the study is considerably lower compared to humans, it would be informative to know the variability of the results among human participants. 

      We agree that this is an interesting question, although it is also very open-ended. For instance, we could report each subject’s individual whole-brain results, but this would take too much space (and the interested reader will be able to examine them from the data that we make available as part of this publication). As a step in this direction, we provide below a figure showing the individual congruity effects, separately for each experiment and for each ROI of Table 5, for each of the 52 participants for whom an fMRI localizer was available:

      Author response image 1.

      Difference in mean betas between congruent and incongruent conditions in a-priori linguistic and mathematical ROIs (see definition and analyses in Table 5) in both experiments (experiment 1 = AV, left panel; experiment 2 = VV, right panel). Dots correspond to participants (red: canonical trials; green: reversed trials). The boxplot notch is located at the median and the lower and upper box hinges at the 25th and 75th centiles. Whiskers extend to 1.5 inter-quartile ranges on either side of the hinges. ROIs are ranked by the median of the Incongruent-Congruent difference across canonical and reversed order, within a given experiment. For purposes of comparison between the two experiments, we have underlined with colors the top-five common ROIs between the two experiments. N.s.: non-significant congruity effect (p>0.05).
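
      The boxplot conventions described in this legend can be sketched as follows. This is only an illustration: the input values are hypothetical, the function name is ours, and we assume the common convention in which each whisker is drawn to the most extreme data point lying within 1.5 inter-quartile ranges of the hinge.

```python
import statistics


def boxplot_stats(values):
    """Compute the hinges, median and whisker ends for one box."""
    # Quartiles with linear interpolation that includes the extreme values.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr
    high_limit = q3 + 1.5 * iqr
    # Points beyond the limits are outliers and do not stretch the whiskers.
    inside = [v for v in values if low_limit <= v <= high_limit]
    outliers = [v for v in values if v < low_limit or v > high_limit]
    return {
        "lower_hinge": q1,
        "median": median,
        "upper_hinge": q3,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": outliers,
    }


# Hypothetical per-participant differences (arbitrary units).
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
```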

      Several regions show a rather consistent difference across subjects (see, for instance, the posterior STS in experiment 1, left panel). Overall, only 3 of the 52 participants did not show a beta above 2 in any ROI, for either canonical or reversed trials. The consistency is quite striking, given the limited number of test trials (in total only 16 incongruent trials per direction per participant), and the fact that these ROIs were selected for their responses to spoken or written sentences, as part of a subsidiary task quite different from the main task.

      - Some details are missing in the methods.  

      Thank you for these comments, we reply to them point-by-point below.

      Reviewer #3 (Public Review): 

      This study investigates the hypothesis that humans (but not non-human primates) spontaneously learn reversible temporal associations (i.e., learning a B-A association after only being exposed to A-B sequences), which the authors consider to be a foundational property of symbolic cognition. To do so, they expose humans and macaques to 2-item sequences (in a visual-auditory experiment, pairs of images and spoken nonwords, and in a visual-visual experiment, pairs of images and abstract geometric shapes) in a fixed temporal order, then measure the brain response during a test phase to congruent vs. incongruent pairs (relative to the trained associations) in canonical vs. reversed order (relative to the presentation order used in training). The advantage of neuroimaging for this question is that it removes the need for a behavioral test, which non-human primates can fail for reasons unrelated to the cognitive construct being investigated. In humans, the researchers find statistically indistinguishable incongruity effects in both directions (supporting a spontaneous reversible association), whereas in monkeys they only find incongruity effects in the canonical direction (supporting an association but a lack of spontaneous reversal). Although the precise pattern of activation varies by experiment type (visual-auditory vs. visual-visual) in both species, the authors point out that some of the regions involved are also those that are most anatomically different between humans and other primates. The authors interpret their finding to support the hypothesis that reversible associations, and by extension symbolic cognition, is uniquely human. 

      This study is a valuable complement to prior behavioral work on this question. However, I have some concerns about methods and framing. 

      We thank the reviewer for the careful summary of the manuscript, and the positive comments.

      Methods - Design issues: 

      The authors originally planned to use the same training/testing protocol for both species but the monkeys did not learn anything, so they dramatically increased the amount of training and evaluation. By my calculation from the methods section, humans were trained on 96 trials and tested on 176, whereas the monkeys got an additional 3,840 training trials and 1,408 testing trials. The authors are explicit that they continued training the monkeys until they got a congruity effect. On the one hand, it is commendable that they are honest about this in their write-up, given that this detail could easily be framed as deliberate after the fact. On the other hand, it is still a form of p-hacking, given that it's critical for their result that the monkeys learn the canonical association (otherwise, the critical comparison to the non-canonical association is meaningless). 

      Thank you for this comment. 

      Indeed, for experiment 1, the amount of training and testing was not equal for the humans and monkeys, as also mentioned by reviewer 2. We now describe in more detail how many training and imaging days we used for each experiment and each species, as well as the number of blocks per day and the number of trials per block (see lines 572-577). We also added the information on the amount of training received to all of the legends of the Tables.

      We are sorry for giving the impression that we trained until the monkeys learned this. This was not the case. Based on previous literature, we actually anticipated that the short training would not be sufficient, and therefore planned additional training in advance. Specifically, Meyer & Olson (2011) had observed pair learning in the inferior temporal cortex of macaque monkeys after 816 exposures per pair. This is similar to the additional training we gave, about 80 blocks with 12 trials per pair per block. This is now explained in more detail (lines 577-580).
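
      The training quantities mentioned above can be checked with a few lines of arithmetic. This is a minimal sketch with variable names of our own choosing; it takes the figure of "about 80 blocks" at face value and assumes 4 pairs per block, as in the block description given later in this response.

```python
# Additional monkey training in experiment 1, as described in the text.
BLOCKS = 80                      # "about 80 blocks" (approximate)
TRIALS_PER_PAIR_PER_BLOCK = 12
PAIRS_PER_BLOCK = 4              # assumption: 4 pairs per block

# Exposures to each pair over the additional training.
exposures_per_pair = BLOCKS * TRIALS_PER_PAIR_PER_BLOCK

# Total number of additional training trials across all pairs.
total_training_trials = exposures_per_pair * PAIRS_PER_BLOCK

# Exposures per pair reported by Meyer & Olson (2011), for comparison.
MEYER_OLSON_EXPOSURES = 816
ratio = exposures_per_pair / MEYER_OLSON_EXPOSURES
```

      Under these assumptions, each pair was seen 960 times, which is of the same order as the 816 exposures of Meyer & Olson, and the total of 3,840 trials matches the figure computed by the reviewer.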

      Furthermore, we strongly disagree with the pejorative term p-hacking. The aim of the experiment was not to show a congruency effect in the canonical direction in monkeys, but to track and compare their behavior in the same paradigm as that of humans for the reverse direction. It would have been unwise to stop after human-identical training and only show that humans learn better, which is a given. Instead, we looked at brain activations at both times, at the end of human-identical training and when the monkeys had learned the pairs in the canonical direction. 

      Finally, in experiment 2, monkeys were tested after the same 3 days of training as humans. We wrote: “Using this design, we obtained significant canonical congruity effects in monkeys on the first imaging day after the initial training (24 trials per pair), indicating that the animals had learned the associations” (lines 252-253).

      (2) Between-species comparisons are challenging. In addition to having differences in their DNA, human participants have spent many years living in a very different culture than that of NHPs, including years of formal education. As a result, attributing the observed differences to biology is challenging. One approach that has been adopted in some past studies is to examine either young children or adults from cultures that don't have formal educational structures. This is not the approach the authors take. This major confound needs to minimally be explicitly acknowledged up front. 

      Thank you for raising this important point. We already had a section on “limitations” in the manuscript, which we have now extended (lines 478-485). Indeed, this study follows a previous study in 5-month-old infants using EEG, in which we already showed that after learning associations between labels and categories, infants spontaneously generalize learning to the reversed pairs after a short learning period in the lab (Kabdebon et al, PNAS, 2019). We also cited preliminary results of the same paradigm as used in the current study but using EEG in 4-month-old infants (Ekramnia and Dehaene-Lambertz, 2019), where we replicated the results obtained by Kabdebon et al. 2019 showing that preverbal infants spontaneously generalize learning to the reversed pairs.

      Functional MRI in awake infants remains a challenge at this age (but see our own work, Dehaene-Lambertz et al, Science, 2002), especially because the experimental design means only a few trials in the conditions of interest (10%) and thus a long experimental duration that exceeds infants’ quietness and attentional capacities in the noisy MRI environment. (We discuss this on lines 493-496.)

      (3) Humans have big advantages in processing and discriminating spoken stimuli and associating them with visual stimuli (after all, this is what words are in spoken human languages). Experiment 2 ameliorates these concerns to some degree, but still, it is difficult to attribute the failure of NHPs to show reversible associations in Experiment 1 to cognitive differences rather than the relative importance of sound string to meaning associations in the human vs. NHP experiences. 

      As the reviewer wrote, we deliberately performed Experiment 2 with visual shapes to control for various factors that might have explained the monkeys' failure in Experiment 1. 

      (4) More minor: The localizer task (math sentences vs. other sentences) makes sense for math but seems to make less sense for language: why would a language region respond more to sentences that don't describe math vs. ones that do? 

      The referee is correct: our use of the word “reciprocally” was improper (although see Amalric and Dehaene, 2016 for significant differences in both directions when non-mathematical sentences concern specific knowledge). We changed the formulation to clarify this as follows: “In these ROIs, we recovered the subject-specific coordinates of each participant’s 10% best voxels in the following comparisons: sentences vs rest for the 6 language Rois ; reading vs listening for the VWFA ; and numerical vs non-numerical sentences for the 8 mathematical ROIs.” (lines 678-680).

      Methods - Analysis issues: 

      (5) The analyses appear to "double dip" by using the same data to define the clusters and to statistically test the average cluster activation (Kriegeskorte et al., 2009). The resulting effect sizes are therefore likely inflated, and the p-values are anticonservative. 

      It is not clear to us which result the reviewer is referring to. In Tables 1-4, we report the values that we found significant in the whole-brain analysis; we do not report additional statistical tests for these data. For Table 5, the subject-specific voxels were identified through a separate localizer experiment, which was designed to pinpoint the precise activation areas for each subject in the domains of oral and written language processing and math. Subsequently, we compared the activation at these voxel locations across different conditions of the main experiment. Thus, the two datasets were distinct, and there was no double dipping. Under both interpretations of the comment, we therefore disagree with the reviewer.

      Framing: 

      (6) The framing ("Brain mechanisms of reversible symbolic reference: A potential singularity of the human brain") is bigger than the finding (monkeys don't spontaneously reverse a temporal association but humans do). The title and discussion are full of buzzy terms ("brain mechanisms", "symbolic", and "singularity") that are only connected to the experiments by a debatable chain of assumptions. 

      First, this study shows relatively little about brain "mechanisms" of reversible symbolic associations, which implies insights into how these associations are learned, recognized, and represented. But we're only given standard fMRI analyses that are quite inconsistent across similar experimental paradigms, with purely suggestive connections between these spatial patterns and prior work on comparative brain anatomy. 

      We agree with the referee that the term “mechanism” is ambiguous and, for systems neuroscientists, may suggest more than we are able to do here with functional MRI. We changed the title to “Brain areas for reversible symbolic reference, a potential singularity of the human brain”. This title better describes our specific contribution: mapping out the areas involved in reversibility in humans, and showing that they do not seem to respond similarly in macaque monkeys.

      Second, it's not clear what the relationship is between symbolic cognition and a propensity to spontaneously reverse a temporal association. Certainly, if there are inter-species differences in learning preferences this is important to know about, but why is this construed as a difference in the presence or absence of symbols? Because the associations aren't used in any downstream computation, there is not even any way for participants to know which is the sign and which is the signified: these are merely labels imposed by the researchers on a sequential task. 

      As explained in the introduction, the reversibility test addressed a very minimal core property of symbolic reference. There cannot be a symbol if its attachment doesn’t operate in both directions. Thus, this property is necessary, but we agree that it is not sufficient. Indeed, more tests are needed to establish whether and how the learned symbols are used in further downstream compositional tasks (as discussed in our recent TICS paper, Dehaene et al. 2022). We added a sentence in the introduction to acknowledge this fact:

      “Such reversibility is a core and necessary property of symbols, although we readily acknowledge that it is not sufficient, since genuine symbols present additional referential and compositional properties that will not be tested in the present work.” (lines 89-92).

      Third, the word "singularity" is both problematically ambiguous and not well supported by the results. "Singularity" is a highly loaded word that the authors are simply using to mean "that which is uniquely human". Rather than picking a term with diverse technical meanings across fields and then trying to restrict the definition, it would be better to use a different term. Furthermore, even under the stated definition, this study performed a single pairwise comparison between humans and one other species (macaques), so it is a stretch to then conclude (or insinuate) that the "singularity" has been found (see also pt. 2 above). 

      We have published an extensive review including a description of our use of the term “singularity” (Dehaene et al., TICS 2022). Here is a short excerpt: “Humans are different even in domains such as drawing and geometry that do not involve communicative language. We refer to this observation using the term “human cognitive singularity”, the word singularity being used here in its standard meaning (the condition of being singular) as well as its mathematical sense (a point of sudden change). Hominization was certainly a singularity in biological evolution, so much so that it opened up a new geological age (the Anthropocene). Even if evolution works by small continuous change (and sometimes it doesn’t [4]), it led to a drastic cognitive change in humans.”

      We find the referee’s use of the pejorative term “insinuate” quite inappropriate. From the title on, we are quite nuanced and refer only to a “potential singularity”. Furthermore, as noted above, we explicitly mention in the discussion the limitations of our study, and in particular the fact that only a single non-human species was tested (see lines 486-493). We are working hard to get chimpanzee data, but this is remarkably difficult for us, and we hope that our paper will incite other groups to collect more evidence on this point.

      (7) Related to pt. 6, there is circularity in the framing whereby the authors say they are setting out to find out what is uniquely human, hypothesizing that the uniquely human thing is symbols, and then selecting a defining trait of symbols (spontaneous reversible association) *because* it seems to be uniquely human (see e.g., "Several studies previously found behavioral evidence for a uniquely human ability to spontaneously reverse a learned association (Imai et al., 2021; Kojima, 1984; Lipkens et al., 1988; Medam et al., 2016; Sidman et al., 1982), and such reversibility was therefore proposed as a defining feature of symbol representation reference (Deacon, 1998; Kabdebon and Dehaene-Lambertz, 2019; Nieder, 2009).", line 335). They can't have it both ways. Either "symbol" is an independently motivated construct whose presence can be independently tested in humans and other species, or it is by fiat synonymous with the "singularity". This circularity can be broken by a more modest framing that focuses on the core research question (e.g., "What is uniquely human? One possibility is spontaneous reversal of temporal associations.") and then connects (speculatively) to the bigger conceptual landscape in the discussion ("Spontaneous reversal of temporal associations may be a core ability underlying the acquisition of mental symbols").

      We fail to understand the putative circularity that the referee sees in our introduction. We urge him/her to re-read it, and hope that, with the changes that we introduced, it does boil down to his/her summary, i.e. “What is uniquely human? One possibility is spontaneous reversal of temporal associations."

      Reviewer #1 (Recommendations For The Authors): 

      In general, the manuscript was very clear, easy to read, and compelling. I would recommend the authors carefully check the text for consistency and minor typos. For example: 

      The sample size for the monkeys kept changing throughout the paper. E.g., Experiment 1: n = 2 (line 149); n = 3 (line 205).  

      Thank you for catching this error, which we have corrected. The number of animals was indeed 2 for experiment 1 and 3 for experiment 2. (Animals JD and YS participated in experiment 1, and JD, JC and DN in experiment 2, so only JD participated in both experiments.)

      Similarly, the number of stimulus pairs is reported inconsistently (4 on line 149, 5 pairs later in the paper). 

      We’re sorry that this was unclear. We used 5 sets of 4 audio-visual pairs each. We now clarify this, on line 157 and on lines 514-516.

      At least one case of p>0.0001, rather than p < 0.0001 (I assume). 

      Thank you once again, we now corrected this.

      Reviewer #2 (Recommendations For The Authors): 

      One major issue in the study is the absence of significant results in monkeys. Indeed, the authors draw conclusions regarding the lack of significant difference in activity related to surprise in the multidemand network (MDN) in the reverse congruent versus reverse incongruent conditions. Although the results are convincing (especially with the significant interaction between congruency and canonicity), the article could be improved by including additional analyses in a priori ROI for the MDN in monkeys (as well as in humans, for comparison). In other words: what are the statistics for the MDN regarding congruity, canonicity, and interaction in both species? Since the authors have already performed this type of analysis for language and Math ROIs (table 5), it should be relatively easy for them to extend it to the MDN. Demonstrating that results in monkeys are far from significant could further convince the reader. 

      Furthermore, while the authors acknowledge in the discussion that the number of monkeys included in the study is considerably lower compared to humans, it would be informative to know the variability of the results among human participants. Specifically, it would be valuable to describe the proportion of human participants in which the effects of congruency, canonicity, and their interaction are significant. Additionally, stating the variability of the F-values for each effect would provide reassurance to the reader regarding the distinctiveness of humans in comparison to monkeys. Low variability in the results would serve to mitigate concerns that the observed disparity is merely a consequence of testing a unique subset of monkeys, which may differ from the general population. Indeed, this would be a greater support to the notion that the dissimilarity stems from a genuine distinction between the two species. 

      We responded to both of these points above.

      In terms of methods, details are missing: 

      - How many trials of each condition are there exactly? (10% of 44 trials is 4.4) : 

      We wrote: “In both humans and monkeys, each block started with 4 trials in the learned direction (congruent canonical trials), one trial for each of the 4 pairs (2 O-L and 2 L-O pairs). The rest of the block consisted of 40 trials in which 70% of trials were identical to the training; 10% were incongruent pairs but the direction (O-L or L-O) was correct (incongruent canonical trials), thus testing whether the association was learned; 10% were congruent pairs but the direction within the pairs was reversed relative to the learned pairs (congruent reversed trials) and 10% were incongruent pairs in reverse (incongruent reversed trials).”(See lines 596-600.)

      Thus, each block comprised 4 initial trials, 28 canonical congruent trials, 4 canonical incongruent, 4 reverse congruent and 4 reverse incongruent trials, i.e. 4+28+3x4=44 trials.
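
      The block composition follows directly from the percentages quoted above, as this short sketch shows. The variable names are ours; the numbers are those given in the quoted methods text.

```python
# Each block starts with 4 congruent canonical trials (one per pair).
INITIAL_TRIALS = 4
# The remaining trials are split by percentage across four conditions.
REMAINING_TRIALS = 40
PERCENT_BY_CONDITION = {
    "canonical congruent": 70,
    "canonical incongruent": 10,
    "reversed congruent": 10,
    "reversed incongruent": 10,
}

# Number of trials per condition among the 40 remaining trials.
counts = {
    name: REMAINING_TRIALS * pct // 100
    for name, pct in PERCENT_BY_CONDITION.items()
}

# Total block length, including the 4 initial trials.
block_length = INITIAL_TRIALS + sum(counts.values())
```

      The 10% conditions therefore each contain exactly 4 trials, because the percentages apply to the 40 trials that follow the 4 initial ones, not to the full block of 44.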

      - How long is one trial? 

      As written in the method section: “In each trial, the first stimulus (label or object) was presented during 700ms, followed by an inter-stimulus-interval of 100ms then the second stimulus during 700ms. The pairs were separated by a variable inter-trial-interval of 3-5 seconds” i.e. 700+100+700=1500 ms per pair, followed by 3 to 4.75 seconds of blank between the trials (see lines 531-533).

      - How are the stimulus presentations jittered? 

      See : “The pairs were separated by a variable inter-trial-interval randomly chosen among eight different durations between 3 and 4.75 seconds (step=250 ms). The series of 8 intervals was randomized again each time it was completed.”(lines 533-535).
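
      The jittering scheme quoted above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the actual stimulation code: the function and constant names are ours, and the use of a seeded Python random generator is only for reproducibility of the example.

```python
import random

# Duration of one pair: first stimulus + inter-stimulus interval + second stimulus.
PAIR_DURATION_MS = 700 + 100 + 700

# Eight possible inter-trial intervals, from 3 to 4.75 s in steps of 250 ms.
ITI_SECONDS = [3.0 + 0.25 * k for k in range(8)]


def iti_sequence(n_trials, rng):
    """Draw inter-trial intervals by reshuffling the series of 8 each time it is used up."""
    sequence = []
    while len(sequence) < n_trials:
        series = ITI_SECONDS[:]
        rng.shuffle(series)  # new random order for each pass through the 8 values
        sequence.extend(series)
    return sequence[:n_trials]


# Example: one block of 44 trials (hypothetical seed).
itis = iti_sequence(44, random.Random(0))
```

      With this scheme, every run of 8 consecutive trials starting at a multiple of 8 contains each interval exactly once, so the mean interval over a complete series is 3.875 s.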

      - What is the statistical power achieved for humans? And for monkeys? 

      We know of no standard way to define power for fMRI experiments. Power will depend on so many parameters, including the fMRI signal-to-noise ratio, the attention of the subject, the areas being considered, the type of analysis (whole-brain versus ROIs), etc.

      - Videos are mentioned in the methods, is it the image and sound? It is not clear. 

      We’re sorry that it was unclear. Videos were only used for the training of the human subjects. We now corrected this in the method section (lines 552-554).

      Reviewer #3 (Recommendations For The Authors): 

      The main recommendations are to adjust the framing (making it less bold and more connected to the empirical evidence) and to ensure independence in the statistical analyses of the fMRI data. 

      See our replies to the reviewer’s comments on “Framing” above. In particular, we changed the title of the paper from “Brain mechanisms of reversible symbolic reference” to “Brain areas for reversible symbolic reference”.

      References cited in this response

      Dehaene, S., Al Roumi, F., Lakretz, Y., Planton, S., & Sablé-Meyer, M. (2022). Symbols and mental programs: A hypothesis about human singularity. Trends in Cognitive Sciences, 26(9), 751-766. https://doi.org/10.1016/j.tics.2022.06.010

      Dehaene-Lambertz, G., Dehaene, S., & Hertz-Pannier, L. (2002). Functional Neuroimaging of Speech Perception in Infants. Science, 298(5600), 2013-2015. https://doi.org/10.1126/science.1077066

      Ekramnia M, Dehaene-Lambertz G. 2019. Investigating bidirectionality of associations in young infants as an approach to the symbolic system. Presented at the CogSci. p. 3449.

      Fedorenko E, Duncan J, Kanwisher N (2013) Broad domain generality in focal regions of frontal and parietal cortex. Proc Natl Acad Sci U S A 110:16616-16621.

      Kabdebon, C., & Dehaene-Lambertz, G. (2019). Symbolic Labeling in 5-Month-Old Human Infants. Proceedings of the National Academy of Sciences, 116(12), 5805-5810. https://doi.org/10.1073/pnas.1809144116

      Mitchell, D. J., Bell, A. H., Buckley, M. J., Mitchell, A. S., Sallet, J., & Duncan, J. (2016). A Putative Multiple-Demand System in the Macaque Brain. Journal of Neuroscience, 36(33), 8574‑8585. https://doi.org/10.1523/JNEUROSCI.0810-16.2016

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment  

      This manuscript compiles existing algorithms into an open-source software package that enables real-time motor unit decomposition from muscle activity collected via grids of surface electrodes and indwelling electrode arrays. The software package is valuable given that many motor neuroscience labs are using such algorithms and that there exist a host of potential real-time applications for such data. Validation of the software package is generally solid but incomplete in some important areas: the primary data is narrow in scope and only from male participants, and there is a lack of ground truth tests on synthetic data. The impact of the software package could be strengthened by making it less tied to specific electrode hardware and by expanding it to easily permit offline analysis.

      We thank the reviewers and editors for their comments and suggestions after reading the initial version of our manuscript. In this second iteration, we have performed a validation of the algorithm using synthetic EMG signals. We have also added experimental data collected in female participants. Finally, the new version of I-Spin is compatible with the Open Ephys GUI that can interface with devices such as the Open Ephys and Intan acquisition boards. Another version has been developed for interfacing with the devices provided by the TMSi company (https://info.tmsi.com/blog/ispin-saga-real-timemotor-unit-decomposition-tool). We believe that such changes will make I-Spin more accessible for a broad range of experimental setups and research teams. Please find below the specific answers to the reviewers’ comments.

      Reviewer #1 (Public Review):  

      Many labs worldwide now use the blind source deconvolution technique to identify the firing patterns of multiple motor units simultaneously in human subjects. This technique has had a truly transformative effect on our understanding of the structure of motor output in both normal subjects and, increasingly, in persons with neurological disorders. The key advance presented here is that the software provides real-time identification of these firing patterns. The main strengths are the clarity of the presentation and the great potential that real-time decoding will provide. Figures are especially effective and statistical analyses are excellent. 

      We thank the reviewer for this positive appreciation of our work. 

      The main limitation of the work is that only male subjects were included in the validation of the software. The reason given - that yield of number of motor units identified is generally larger in males than females - is reasonable in the sense that this is the first systematic test of this real-time approach. At a minimum, however, the authors should clearly commit to future work with female subjects and emphasize the importance of considering sex differences. 

      As emphasised by the reviewer, the number of identified motor units is typically higher in males than females when using surface EMG (Taylor et al., 2022), which is currently the main limitation to the implementation of offline EMG decomposition techniques in a broad and representative sample of research participants. These differences between sexes are less pronounced when using intramuscular EMG, as the signals are less affected by the filtering effect of the volume conductor separating the motor units from the recording electrodes. Besides the different yields expected between males and females, we do not expect differences in terms of the accuracy of the motor unit identification algorithm, which is the main outcome of this paper.

      Nevertheless, we acknowledge the importance of understanding the reasons for this difference, and the imperative to refine algorithms and/or surface electrode design to mitigate this major limitation with surface EMG.

      To support this point, the discussion has been updated (P20; L480):

      ‘An important consideration regarding the implementation of offline or real-time surface EMG decomposition is the difference between individuals, with an overall lower yield in number of identified motor units in females (here: 9 ± 12) than in males (here: 30 ± 13). Typically, the number of identified motor units from surface EMG is twice as low in females than males (32, 49, 50). The cause for this difference remains unclear. It may be related to variations in properties of the tissues separating the motor units from the recording electrodes, or to differences in the morphological and physiological properties of muscle fibres, as well as to the innervation ratios of motor units. These sex-related differences have so far only been supported by data extracted from animal experiments (51). However, the recent developments of simulation frameworks capable of generating highly realistic EMG signals for anthropometrically diverse populations may help understanding the impact of sex-related differences in humans (52). Specifically, these simulations can account for diverse anatomical (e.g. muscle volume and architecture, thickness of subcutaneous tissues) and physiological characteristics (e.g. innervation ratio, number of motor units, fibre cross sectional area, fibre conduction velocity, contribution of rate coding vs. spatial recruitment). Generating such dataset could help identifying the primary factors affecting EMG decomposition performance, ultimately enabling the refinement of algorithms and/or surface electrode design.’

      Finally, we have completed new experiments including males and females in this new iteration (P.12; L.295):

      ‘Application of motor unit filters in experimental data

      We then asked eight participants (4 males and 4 females) to perform trapezoidal isometric contractions with plateaus of force set at 10% and 20% MVC during which surface EMG signals were recorded from the TA with 256 electrodes separated by 4 mm. The aim of this experiment was to confirm the results of the simulation; specifically, to test the accuracy of the online decomposition when the level of force was below, equal to, or above the level of force produced during the baseline contraction used to estimate the motor unit filters (Figure 4). We assessed the accuracy of the motor unit spike trains identified in real time using their manually edited version as reference. 144 motor units were identified at both 10 and 20% MVC. When the test signals were recorded at the same level of force as the baseline contraction, we obtained rates of agreement of 95.6 ± 6.8% (10% MVC) and 93.9 ± 5.9% (20% MVC). The sensitivity reached 95.9 ± 6.7% (10% MVC) and 94.4 ± 5.6% (20% MVC), and the precision reached 99.6 ± 1.3% (10% MVC) and 99.4 ± 1.9% (20% MVC). 

      When the filters identified at 20% MVC were applied to signals recorded at a lower level of force (10% MVC), the rates of agreement decreased to 87.9 ± 16.2%. The sensitivity also decreased to 88.0 ± 16.2%, but the precision remained high (99.4 ± 4.3%). Thus, the decrease in accuracy was mostly caused by missed discharge times rather than the false identification of artifacts or spikes from other motor units. When the filters identified at 10% MVC were applied to signals recorded at a higher level of force, the rates of agreement decreased to 83.3 ± 13.5%. The sensitivity decreased to 90.7 ± 8.1%, and the precision also decreased to 90.9 ± 12.6%. This result confirms what was observed with synthetic EMG, that is, motor units recruited between 10 and 20% MVC can substantially disrupt the accuracy of the decomposition in real-time, as highlighted in Figure 4 (lower panel). Importantly, this situation does not happen for all the motor units, as suggested by the distribution of the values in Figure 4.’
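The three accuracy measures reported above can be illustrated with a minimal Python sketch. It assumes the usual definitions for discharge-time comparisons: rate of agreement = TP / (TP + FP + FN), sensitivity = TP / (TP + FN), and precision = TP / (TP + FP), where a true positive (TP) is an identified discharge time matched to a reference discharge time within a small tolerance. The function names, the greedy one-to-one matching, and the tolerance of one sample are assumptions made for illustration and are not taken from the I-Spin live code.

```python
import numpy as np

def match_discharges(identified, reference, tol=1):
    """Count true positives, false positives and false negatives.

    Both inputs are discharge times in samples. Each reference spike can be
    matched to at most one identified spike (greedy nearest match; assumption).
    """
    identified = np.sort(np.asarray(identified))
    reference = np.sort(np.asarray(reference))
    used = np.zeros(len(reference), dtype=bool)
    tp = 0
    for t in identified:
        # Reference spikes within tolerance that have not been matched yet
        candidates = np.where((np.abs(reference - t) <= tol) & ~used)[0]
        if candidates.size:
            best = candidates[np.argmin(np.abs(reference[candidates] - t))]
            used[best] = True
            tp += 1
    fp = len(identified) - tp   # identified spikes with no reference match
    fn = len(reference) - tp    # reference spikes that were missed
    return tp, fp, fn

def accuracy_metrics(identified, reference, tol=1):
    """Return the three measures in percent (inputs assumed non-empty)."""
    tp, fp, fn = match_discharges(identified, reference, tol)
    return {
        "rate_of_agreement": 100 * tp / (tp + fp + fn),
        "sensitivity": 100 * tp / (tp + fn),
        "precision": 100 * tp / (tp + fp),
    }

# One missed discharge and no false detection: precision stays at 100%
# while sensitivity and rate of agreement drop.
print(accuracy_metrics([100, 201, 300], [100, 200, 300, 400], tol=1))
```

In this toy case the pattern matches the one described for filters applied at a lower force level: missed discharge times lower the sensitivity and the rate of agreement, but not the precision.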

      A second weakness is that the Introduction does a poor job of establishing the potential importance of the real-time approach. 

      The introduction has been modified to highlight the importance of identifying the spiking activity of motor units in real time. Specifically, the first paragraph has been rewritten to read (P3; L67): 

      ‘The activity of motor neurons – in the form of spike trains – represents the neural code of movement to muscles. Decoding this firing activity in real-time during various behaviours can thus substantially enhance our understanding of movement control (2-5). Real-time decoding is also essential for interfacing with external devices (6) or virtual limbs (7) when activity is present at the periphery of the nervous system. For example, individuals with a spinal cord injury can control a virtual hand with the residual firing activity of the motor units in their forearm (7). Furthermore, sampling the activity of motor units receiving a substantial portion of independent synaptic inputs may pave the way for movement augmentation – specifically, extending a person’s movement repertoire through the increase of controllable degrees of freedom (8). In this way, Formento et al. (3) showed that individuals can intuitively learn to independently control motor units within the same muscle using visual cues. Having access to open-source tools that perform the real-time decoding of motor units would allow an increasing number of researchers to improve and expand the range of these applications.’

      Reviewer #2 (Public Review):  

      Rossato et al present I-spin live, a software package to perform real-time blind-source separation-based sorting of motor unit activity. The core contribution of this manuscript is the development and validation of a software package to perform motor unit sorting, apply the resulting motor unit filters in real-time during muscle contractions, and provide real-time visual feedback of the motor unit activity. I have a few concerns with the work as presented: 

      I found it challenging to specifically understand the technical contributions of this manuscript. The authors do not appear to be claiming anything novel algorithmically (with respect to spike sorting) or methodologically (with respect to manual editing of spikes before the use of the algorithms in real-time). My takeaway is that the key contributions are C1) development of an open-source implementation of the Negro algorithm, C2) validating it for real-time application (evaluating its sorting efficacy, and closed-loop performance, etc), and developing a software package to run in closed-loop with visual feedback. I will comment on each of these items separately below. It would be great if the authors could more explicitly lay out the key contributions of this manuscript in the text. 

      The main objective of this work was to provide an open-source implementation of the real-time identification of motor units, together with a user interface that allows researchers to easily process the data and display the firing activity of motor units through several types of visual feedback. We have explicitly laid out these key contributions in the introduction: ‘Having access to open-source tools that perform the real-time decoding of motor units would allow an increasing number of researchers to improve and expand the range of these applications.’

      Related to the above, much of the validation of the algorithms in this manuscript has a "trust me" feel. The authors note that the Negro et al. algorithm has already been validated, so very few details or presentations of primary data showing the algorithm's performance are shown. Similarly, the efficacy of the decomposition approach is evaluated using manual editing of the sorting output as a reference, which is a subjective process, and users would greatly benefit from explicit guidance. There are very few details of manual editing shown in this manuscript (I believe the authors reference the Hug et al. 2021 paper for these details), and little discussion of the core challenges and variability of that process, even though it seems to be a critical step in the proposed workflow. So this is very hard to evaluate and would be challenging for readers to replicate. 

      To address the reviewer’s comment, we added a validation step using synthetic EMG data (P.10; L.235). 

      ‘Validation of the algorithm

      We first validated the accuracy of the algorithm using synthetic EMG signals generated with an anatomical model entailing a cylindrical muscle volume with parallel fibres [see Farina et al. (29) and Konstantin et al. (36) for a full description of the model]. In this model, subcutaneous and skin layers separate the muscle from a grid of 65 surface electrodes (5 columns, 13 rows), while an intramuscular array of electrodes is directly inserted in the muscle under the grid with an angle of 30 degrees. 150 motor units were distributed within the cross section of the muscle. Recruitment thresholds, firing rate/excitatory drive relations, and twitch parameters were assigned to each motor unit using the same procedure as Fuglevand et al. (37). During each simulation, a proportional-integral-derivative controller adjusted the level of excitatory drive to minimise the error between a predefined target of force and the force generated by the active motor units.

      Figure 3A displays the raster plots of the active motor units during simulated trapezoidal isometric contractions with plateaus of force set at 10%, 20%, and 30% MVC. A sinusoidal isometric contraction ranging between 15 and 25% MVC at a frequency of 0.5 Hz was also simulated. We identified on average 10 ± 1 and 12 ± 2 motor units with surface and intramuscular arrays, respectively (Figure 3A). During the offline decomposition, the rate of agreement between the identified discharge times and the ground truth, that is, the simulated discharge times, reached 100.0 ± 0.0% for intramuscular EMG signals and 99.2 ± 1.8% for surface EMG signals (Figure 3B). The offline estimation of motor unit filters was therefore highly accurate, independently of the level of force or the pattern of the isometric contraction.

      Motor unit filters estimated during a baseline contraction at 20% MVC were then applied in real-time on signals simulated during a contraction with a different pattern (sinusoidal; Figure 3C). The rates of agreement between the online decomposition and the ground truth reached 96.3 ± 4.6% and 98.4 ± 2.3% for surface and intramuscular EMG signals, respectively. Finally, we tested whether the accuracy of the online decomposition changed when the level of force decreased or increased by 10% MVC when compared to the calibration performed at 20% MVC (Figure 3D). The rate of agreement remained high when applying the motor unit filters on signals recorded at 10% MVC: 99.8 ± 0.2% (surface EMG) and 99.5 ± 0.3% (intramuscular EMG). It is worth noting that only 3 out of 10 motor units identified from surface EMG at 20% MVC were active at 10% MVC, while 8 out of 12 motor units identified from intramuscular EMG were active at 10% MVC. This shows how the decomposition of EMG signals tends to identify the last recruited motor units, which often innervate a larger number of fibres than the early recruited motor units (38). On the contrary, the application of motor unit filters on signals simulated at 30% MVC led to a decrease in the rate of agreement, with values of 88.6 ± 14.0% (surface EMG) and 80.3 ± 19.2% (intramuscular EMG). This decrease in accuracy did not impact all the motor units, with 5 motor units keeping a rate of agreement above 95% in both signals. For the other motor units, we observed a decrease in precision, which estimates the ratio of true discharge times over the total number of identified discharge times. This was caused by the recruitment of two motor units sharing a similar space within the muscle, which resulted in a merge in the same pulse train (Figure 3D).’

      In addition, we added a new paragraph in the Method section to describe the manual editing process (P.26; L.658). 

      ‘There is a consensus among experts that automatic decomposition should be followed by visual inspection and manual editing (55). Manual editing involves the following steps: i) removing spikes that result in erroneous firing rates (outliers), ii) adding discharge times that are clearly distinguishable from the noise, iii) recalculating the separation vector, iv) reapplying the separation vector on the EMG signals (either a selected window or the entire signal), and v) repeating this procedure until no outliers are present and all clearly distinguishable spikes have been selected. Importantly, the manual editing of potentially missed or falsely identified discharge times should not be accepted before the application of the updated motor unit separation vector, thereby generating a new pulse train. Manual edits should be accepted only if the silhouette value improves following this operation or remains well above the preestablished threshold. A more extensive description of the manual editing of motor unit pulse trains can be found in (32). Even though some of the aforementioned steps involve subjective decision-making, evidence suggests that manual editing after EMG decomposition with blind source separation approaches remains highly reliable across operators (33). Specifically, the median rate of agreement calculated for 126 motor units over eight operators with varying levels of experience in manual editing was 99.6%. All raw and processed data have been made available on a public data repository so that they can be used for training new operators (10.6084/m9.figshare.13695937).’

      I found the User Guide in the Github package to be easy to follow. Importantly, it seems heavily tied to the specific hardware (Quattrocento). I understand it may be difficult to make the full software package work with different hardware, but it seems important to at least make an offline analysis of recorded data possible for this package to be useful more broadly. 

      The software was updated to perform real-time decomposition with signals recorded from the Quattrocento and the Open Ephys GUI, which is compatible with Intan and Open Ephys acquisition boards. I-Spin has also been adapted by TMSi to perform real-time decomposition with their devices (https://info.tmsi.com/blog/ispin-saga-real-time-motor-unit-decomposition-tool). 

      Moreover, the manual editing panel of the software can now import any files from these devices and allow users to reformat data in mat files to perform offline analyses.

      While this may be a powerful platform, it is also very possible that without more details and careful guidance for users on potential pitfalls, many non-experts in sorting could use this as a platform for somewhat sloppy science. 

      We fully agree with the reviewer that real-time EMG decomposition - with a different approach here than spike sorting - may yield unreliable results if not applied properly. As outlined in the introduction of our initial manuscript, assessing the accuracy and limitations of real-time decomposition was a primary motivation for this study. Specifically, we compared accuracy between contraction intensities, muscles, and electrode types (see Results section). 

      We also demonstrated that manual editing of the decomposition outputs should be done after the training phase to improve the motor unit filters, thereby improving the accuracy of real-time decomposition. We also emphasised the importance of never blindly accepting the result of the decomposition without visual inspection and manual editing (P8; L214).

      ‘These results show how manual editing can improve the accuracy of spike detection from the motor unit pulse trains. Moreover, a SIL value around 0.9 can be used as a threshold to automatically remove, a priori, the motor unit pulse trains of poor quality. Thus, these two steps were performed in all the subsequent analyses. Importantly, it is worth noting that the motor unit pulse train must always be visually inspected after the session to check for errors in the automatic identification of discharge times.’

      We have also included more detailed information about the manual editing process (see above).

      The authors mention that data is included with the Github software package. I could not find any included data, or instructions on how to run the software offline on example data. 

      A link to the data on figshare has been added to the GitHub repository.

      Given the centrality of the real-time visual feedback to their system, the authors should show some examples of the actual display etc. so readers can understand what the system in action actually looks like (I believe there is no presentation of the actual system in the manuscript, just in the User Guide). Similarly, it would be helpful to have a schematic figure outlining the full workflow that a user goes through when using this system. 

      A figure of the workflow is present in the user manual. Additionally, we now display traces of visual feedback in Figure 5, and we added videos of the software running with each type of visual feedback to the supplemental materials.

      The authors note all data was collected with male subjects because more motor units can be decomposed from male subjects relative to females. But what is the long-term outlook for the field if studies avoid female subjects because their motor units may be harder to decompose? This should at least be discussed - it is an important challenge for the field to solve, and it is unacceptable if new methods just avoid this problem and are only tested on male subjects. 

      This point was rightly raised by each of the three reviewers. To solve this, we added data collected on four females, and discussed future developments to make the decomposition of surface EMG equally performant for everyone (P.20; L.480).

      ‘An important consideration regarding the implementation of offline or real-time surface EMG decomposition is the difference between individuals, with an overall lower yield in number of identified motor units in females (here: 9 ± 12) than in males (here: 30 ± 13). Typically, the number of identified motor units from surface EMG is about half as high in females as in males (32, 49, 50). The cause of this difference remains unclear. It may be related to variations in properties of the tissues separating the motor units from the recording electrodes, or to differences in the morphological and physiological properties of muscle fibres, as well as to the innervation ratios of motor units. These sex-related differences have so far only been supported by data extracted from animal experiments (51). However, the recent development of simulation frameworks capable of generating highly realistic EMG signals for anthropometrically diverse populations may help in understanding the impact of sex-related differences in humans (52). Specifically, these simulations can account for diverse anatomical (e.g. muscle volume and architecture, thickness of subcutaneous tissues) and physiological characteristics (e.g. innervation ratio, number of motor units, fibre cross-sectional area, fibre conduction velocity, contribution of rate coding vs. spatial recruitment). Generating such datasets could help identify the primary factors affecting EMG decomposition performance, ultimately enabling the refinement of algorithms and/or surface electrode design.’

      Specific comments on the core contributions of this paper:  

      C1. Development of an open-source implementation of the Negro algorithm 

      This seems an important contribution and useful for the community. There are very few figures showing any primary data, the efficacy of sorting, raw traces showing the waveforms that are identified, cluster shapes, etc. I realize the high-level algorithm has been outlined elsewhere, but the implementation in this package, and its efficacy, is a core component of the system and the claims being made in this paper. Much more presentation of data is needed to evaluate this. 

      It is worth noting that the approach used here is based on blind source separation, which is different than spike-sorting algorithms as it relies on the statistical properties of the spike trains (their sparseness) rather than the profiles of the action potentials. In short, we optimise separation vectors that are applied onto the whitened signal to generate a sparse motor unit pulse train. The discharge times are then directly estimated from the high peaks of this pulse train (Section 1 of the results; overview of the approach).
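The optimisation of a separation vector on whitened signals can be sketched in Python as follows. This is a generic FastICA-style fixed-point iteration and not the I-Spin live implementation: the contrast function (g(y) = y², i.e. a skewness-based contrast), the ZCA whitening, the convergence criterion, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def whiten(x):
    """Demean and whiten (ZCA) signals of shape (channels, samples)."""
    x = x - x.mean(axis=1, keepdims=True)
    cov = x @ x.T / x.shape[1]
    eigval, eigvec = np.linalg.eigh(cov)
    keep = eigval > 1e-10 * eigval.max()          # drop near-null directions
    w_mat = eigvec[:, keep] @ np.diag(eigval[keep] ** -0.5) @ eigvec[:, keep].T
    return w_mat @ x

def fixed_point_separation(z, w0, max_iter=200, tol=1e-6):
    """Iterate w <- E[z g(w'z)] - E[g'(w'z)] w with g(y) = y**2, then normalise.

    z: whitened signals (channels, samples); w0: initial separation vector.
    The loop stops when w barely changes between two successive iterations.
    """
    w = w0 / np.linalg.norm(w0)
    for _ in range(max_iter):
        y = w @ z                                  # current source estimate
        w_new = (z * y**2).mean(axis=1) - 2 * y.mean() * w
        w_new /= np.linalg.norm(w_new)
        converged = 1 - abs(w_new @ w) < tol
        w = w_new
        if converged:
            break
    return w

# Toy example: one sparse source mixed into 4 noisy channels (synthetic data).
rng = np.random.default_rng(0)
T = 2000
source = np.zeros(T)
source[::100] = 1.0
x = np.outer(rng.standard_normal(4), source) + 0.1 * rng.standard_normal((4, T))
z = whiten(x)
w = fixed_point_separation(z, rng.standard_normal(4))
pulse_train = (w @ z) ** 2   # squaring emphasises the high peaks
```

The discharge times would then be estimated from the high peaks of `pulse_train`, as described in the response above.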

      We are thus displaying motor unit pulse trains in three figures with the automatically detected discharge times, with cases of successful separation in figure 1 and merged motor units in the same pulse train in figures 3 and 4.

      We also validated the algorithm with synthetic EMG to provide objective data on the accuracy of the algorithm. These results are shown in the section ‘Validation of the algorithm’ and displayed in figure 3.

      Similarly, more information on the offline manual editing process (e.g. showing before/after examples with primary data) would be important to gain confidence in the method. The current paper shows application to both surface EMG and intramuscular EMG, but I could not find IM EMG examples in the Hug paper (apologies if I missed them). Surface and IM data are very, very different, so one would imagine the considerations when working with them should also be different. 

      In response to another comment from the reviewer, we have included more detailed information about the manual editing process (see above). As stated above, the decomposition approach used in our software differs from a spike sorting approach. Therefore, even though intramuscular and surface EMG signals are different, the decomposition and manual editing process is the same. 

      All descriptions of math/algorithms are presented in text, without any actual math, variable definitions, etc. This presentation makes it difficult to understand what is done. I would strongly recommend writing out equations and defining variables where possible. 

      More details on how the level of sparseness is controlled during optimization would be helpful.

      And how this sparseness penalty is weighed against other optimization costs. 

      A mathematical description of the model has been added to the Methods (P25; L620).

      ‘Mathematical modelling of the recorded spike trains.

      The spike train of a motor neuron recorded over time $t \in [0, T]$ can be described as the result of a convolution between a delta function ($\delta$) representing the firing times ($j$), and finite impulse responses ($h$) representing action potentials of duration $L$: $x(t) = \sum_{l=0}^{L-1} h(l)\, s(t-l)$, with $s(t) = \sum_{j} \delta(t-j)$. In practice, the nature of $h$ and the duration $L$ depend on the type of recordings. For electrophysiological measurements, $h$ characterises the local electrical field generated by the spike and conducted through the surrounding tissues.

      As the recorded volume of tissue comprises many active neurons, each recording can be considered as a convolutive mixture of multiple sources, and the previous equation can be expressed in the form of a matrix to also consider all the electrodes of an array: $x(t) = \sum_{l=0}^{L-1} H(l)\, s(t-l)$, where $x(t) = [x_1(t), \ldots, x_m(t)]^T$ is a matrix of $m$ electrophysiological signals, $s(t) = [s_1(t), \ldots, s_n(t)]^T$ is a matrix of $n$ motor neurons’ spike trains, and $H(l)$ is an $m$ by $n$ matrix containing the $l$th sample of action potentials from $n$ neurons and $m$ signals. In this situation, we can reformulate the model as an instantaneous mixture of an extended set of sources, that is, the motor neurons’ spike trains and their delayed versions. This allows us to simply write the previous equation as a multiplication of matrices, in which each source is delayed $L$ times, $L$ being the duration of the impulse response $h$. This model can be inverted for neural decoding with source-separation approaches.’
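The convolutive mixture and its extension with delayed copies can be illustrated with a short Python sketch. It assumes the mixture takes the usual form x(t) = Σ_l H(l) s(t − l), with H stored as an array of shape (L, m, n). The function names and the zero-padding at the start of the delayed signals are hypothetical choices made for illustration only.

```python
import numpy as np

def convolutive_mixture(H, s):
    """Simulate x(t) = sum_l H[l] @ s(t - l).

    H: (L, m, n) array, H[l] holds the l-th sample of the action potentials.
    s: (n, T) array of spike trains (1 at discharge times, 0 elsewhere).
    Returns an (m, T) array of simulated signals.
    """
    L, m, n = H.shape
    T = s.shape[1]
    x = np.zeros((m, T))
    for l in range(L):
        # Contribution of the sources delayed by l samples
        x[:, l:] += H[l] @ s[:, :T - l]
    return x

def extend_signals(x, n_delays):
    """Stack the signals with their delayed copies (zero-padded at the start).

    x: (m, T) array; returns an (m * n_delays, T) array in which rows
    d*m to (d+1)*m - 1 contain the signals delayed by d samples.
    """
    x = np.asarray(x, dtype=float)
    m, T = x.shape
    extended = np.zeros((m * n_delays, T))
    for d in range(n_delays):
        extended[d * m:(d + 1) * m, d:] = x[:, :T - d]
    return extended

# One source, one electrode, action potential lasting two samples.
H = np.array([[[1.0]], [[0.5]]])
s = np.array([[1.0, 0.0, 0.0, 1.0]])
x = convolutive_mixture(H, s)      # [[1.0, 0.5, 0.0, 1.0]]
x_ext = extend_signals(x, 2)       # second row is x delayed by one sample
```

Stacking delayed copies in this way is what allows the convolutive model to be handled as an instantaneous mixture by a source-separation algorithm.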

      The rest of the decomposition approach was rewritten to make it clearer for the reader:

      ‘The monopolar EMG signals collected during the baseline contractions were extended with an extension factor of 1000/m (21), where m is the number of channels free of any noise or artifact. The signals were then demeaned and whitened. A contrast function was iteratively applied to estimate a separation vector that maximised the level of sparseness of the motor unit pulse train (Figure 1B). This loop stopped when the variation of the separation vector between two successive iterations reached a predefined lower bound. After the application of a peak detection algorithm, the motor unit pulse train contained high peaks (i.e., the spikes from the identified motor unit) and low peaks from other motor units and noise. High peaks were separated from low peaks and noise using K-means classification with two classes (Figure 1B). The peaks from the class with the highest centroid were considered as spikes of the identified motor unit. A second algorithm refined the estimation of the discharge times by iteratively recalculating the separation vector and repeating the steps with peak detection and K-means classification until the coefficient of variation of the inter-spike intervals was minimised. The accuracy of each estimated spike train was assessed by computing the silhouette (SIL) value between the two classes of peaks identified with K-means classification (24). When the SIL exceeded a predetermined threshold, the motor unit filter was saved for the real-time decomposition, together with the centroids of the ‘spikes’ and ‘noise’ classes (Figure 2A).’
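The peak detection, two-class K-means classification, and SIL steps described above can be sketched in Python as follows. This is a simplified illustration rather than the code of the software: the local-maximum peak detector, the min/max initialisation of the centroids, and the SIL formula used here (difference between the between-class and within-class sums of point-to-centroid distances for the spike class, divided by the larger of the two) are assumptions, as are all function names.

```python
import numpy as np

def detect_peaks(pulse_train):
    """Return the indices of local maxima of a 1-D pulse train."""
    p = np.asarray(pulse_train, dtype=float)
    return np.where((p[1:-1] > p[:-2]) & (p[1:-1] >= p[2:]))[0] + 1

def two_class_kmeans(values, n_iter=50):
    """1-D K-means with two classes, initialised at the min and max values."""
    values = np.asarray(values, dtype=float)
    centroids = np.array([values.min(), values.max()])
    for _ in range(n_iter):
        # Assign each peak to its nearest centroid
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        new = np.array([values[labels == k].mean() if np.any(labels == k)
                        else centroids[k] for k in (0, 1)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

def classify_spikes(pulse_train):
    """Return discharge times (class with the highest centroid) and the SIL."""
    peaks = detect_peaks(pulse_train)
    heights = np.asarray(pulse_train, dtype=float)[peaks]
    labels, centroids = two_class_kmeans(heights)
    spike_class = int(np.argmax(centroids))
    in_spikes = labels == spike_class
    # Distances of the spike-class peaks to their own and to the other centroid
    within = np.sum(np.abs(heights[in_spikes] - centroids[spike_class]))
    between = np.sum(np.abs(heights[in_spikes] - centroids[1 - spike_class]))
    denom = max(within, between)
    sil = (between - within) / denom if denom > 0 else 0.0
    return peaks[in_spikes], sil

# Toy pulse train: three high peaks (spikes) and two low peaks (noise).
train = np.zeros(50)
train[[10, 25, 40]] = 1.0
train[[5, 30]] = 0.1
spikes, sil = classify_spikes(train)
keep_filter = sil > 0.9   # threshold value discussed in the text
```

With well-separated classes the SIL approaches 1, whereas overlapping peak heights pull it towards 0, which is why it can serve as an automatic quality threshold before visual inspection.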

      Overall the paper is not very rigorous about the accuracy of motor unit identification. For example, the authors note that SIL of 0.9 is generally used for offline evaluation (why is this acceptable?), but it was lowered to 0.8 for particular muscles in this study. But overall, it is unclear how sorting accuracy/inaccuracy affects performance in the target applications of this work. 

      In the section mentioned by the reviewer, we aimed to show how this metric can help to automatically select motor units that are likely to have a higher accuracy of spike detections as the peaks of their pulse train are easily separable from the noise. 

      We reformulated the conclusion of this section to make it clearer (P8; L214):

      ‘These results show how manual editing can improve the accuracy of spike detection from the motor unit pulse trains. Moreover, a SIL value around 0.9 can be used as a threshold to automatically remove, a priori, the motor unit pulse trains of poor quality. Thus, these two steps were performed in all the subsequent analyses. Importantly, it is worth noting that the motor unit pulse train must always be visually inspected after the session to check for errors in the automatic identification of discharge times.’

      C2. For real-time experiments, variability/jitter is important to characterize. Fig. 4 seems to be presenting mean computational times, etc, but no presentation of variability is shown. It would be helpful to depict data distributions somehow, rather than just mean values. 

      The variability in computational time was added to this section (P.28; L.730):

      ‘The standard deviation of computational times across windows reached 5.4 ± 4.0 ms (raster plot), 4.0 ± 3.2 ms (smoothed firing rate), and 2.8 ± 2.5 ms (quadrant)’

      The computational time minimally varied between the successive windows, except when the labels of the x-axis were updated in real-time with scrolling feedback. It was overall always well below the duration of the window.

      Author response image 1.

      Computational time for each iteration of the algorithm in one participant. The top panels display the continuous computation time throughout the recording, while the bottom panels display the distribution of computational times. The dashed line represents the duration of a window of EMG signals.

      There is some description about the difference between units identified during baseline contractions, and how they might be misidentified during online contractions ("Accuracy of the real-time identification..."). This should be described in more detail. 

      We added a section to the Results to clarify the concept of motor unit filters and their reapplication on signals in real-time. We highlighted how each motor unit must have a unique spatio-temporal signature to be accurately identified by our algorithms, in contrast to merged motor units sharing the same spatio-temporal features. This section shows how motor units accurately identified during baseline contractions can be misidentified during online contractions (P12; L295).

      ‘Application of motor unit filters in experimental data

      We then asked eight participants (4 males and 4 females) to perform trapezoidal isometric contractions with plateaus of force set at 10% and 20% MVC during which surface EMG signals were recorded from the TA with 256 electrodes separated by 4 mm. The aim of this experiment was to confirm the results of the simulation; specifically, to test the accuracy of the online decomposition when the level of force was below, equal to, or above the level of force produced during the baseline contraction used to estimate the motor unit filters (Figure 4). We assessed the accuracy of the motor unit spike trains identified in real time using their manually edited version as reference. 144 motor units were identified at both 10 and 20% MVC. When the test signals were recorded at the same level of force as the baseline contraction, we obtained rates of agreement of 95.6 ± 6.8% (10% MVC) and 93.9 ± 5.9% (20% MVC). The sensitivity reached 95.9 ± 6.7% (10% MVC) and 94.4 ± 5.6% (20% MVC), and the precision reached 99.6 ± 1.3% (10% MVC) and 99.4 ± 1.9% (20% MVC).  

      When the filters identified at 20% MVC were applied to signals recorded at a lower level of force (10% MVC), the rates of agreement decreased to 87.9 ± 16.2%. The sensitivity also decreased to 88.0 ± 16.2%, but the precision remained high (99.4 ± 4.3%). Thus, the decrease in accuracy was mostly caused by missed discharge times rather than the false identification of artifacts or spikes from other motor units.

      When the filters identified at 10% MVC were applied to signals recorded at a higher level of force, the rates of agreement decreased to 83.3 ± 13.5%. The sensitivity decreased to 90.7 ± 8.1%, and the precision also decreased to 90.9 ± 12.6%. This result confirms what was observed with synthetic EMG, that is, motor units recruited between 10 and 20% MVC can substantially disrupt the accuracy of the decomposition in real-time, as highlighted in Figure 4 (lower panel). Importantly, this situation does not happen for all the motor units, as suggested by the distribution of the values in Figure 4.’

      Fig. 6: Given that a key challenge in sorting should be that collisions occur during large contractions, much more primary data should be presented/visualized to show how the accuracy of sorting changes during larger contractions in online experiments. 

      As indicated above, the decomposition approach implemented in our software is not based on spike sorting, so it does not require separating overlapping profiles of action potentials (see Methods).

      Fig.7: In presenting the accuracy of biofeedback, it is very hard to gain any intuition for performance by just looking at RMSE values. Showing the online decoded and edited trajectories would help readers understand the magnitude of errors. 

      We updated the figure to display examples of visual feedback before and after manual editing.

      Reviewer #3 (Public Review):  

      In this manuscript, Rossato and colleagues present a method for real-time decoding of EMG into putative single motor units. Their manuscript details a variety of decision points in their code and data collection pipeline that led to a final result of recording on the order of ~10 putative motor units per muscle in human males. Overall, the manuscript is highly restricted in its potential utility but may be of interest to aficionados. For those outside the field of human or nonhuman primate EMG, these methods will be of limited interest.

      We thank the reviewer for his/her thorough evaluation of our manuscript. We recognise that this tool/resource will immediately benefit groups working with humans or nonhuman primate models. However, the recent development of intramuscular thin films with various designs adapted to rodents and smaller animals could expand the range of future users (Chung et al., 2023, Elife). Moreover, decoding motor units in humans could be useful for many fields, e.g. in the domains of movement restoration and augmentation. The following paragraph has been added to the introduction to highlight the importance of real-time decoding of motor unit activity (P3; L67):

      ‘The activity of motor neurons – in the form of spike trains – represents the neural code of movement to muscles. Decoding this firing activity in real-time during various behaviours can thus substantially enhance our understanding of movement control (2-5). Real-time decoding is also essential for interfacing with external devices (6) or virtual limbs (7) when activity is present at the periphery of the nervous system. For example, individuals with a spinal cord injury can control a virtual hand with the residual firing activity of the motor units in their forearm (7). Furthermore, sampling the activity of motor units receiving a substantial portion of independent synaptic inputs may pave the way for movement augmentation – specifically, extending a person’s movement repertoire through the increase of controllable degrees of freedom (8). In this way, Formento et al. (3) showed that individuals can intuitively learn to independently control motor units within the same muscle using visual cues. Having access to open-source tools that perform the real-time decoding of motor units would allow an increasing number of researchers to improve and expand the range of these applications.’

      Notes 

      (1) Artificial data should be used with this method to provide ground truth performance evaluations. Without it, the study assumptions are unchallenged and could be seriously flawed.

      A new section on the validation of the algorithm has been added. We verified the accuracy of the algorithm by comparing the series of identified discharge times with the ground truth, i.e., the simulated discharge times. (P10; L235)

      ‘Validation of the algorithm

      We first validated the accuracy of the algorithm using synthetic EMG signals generated with an anatomical model entailing a cylindrical muscle volume with parallel fibres [see Farina et al. (29), Konstantin et al. (36) for a full description of the model]. In this model, subcutaneous and skin layers separate the muscle from a grid of 65 surface electrodes (5 columns, 13 rows), while an intramuscular array of electrodes is directly inserted in the muscle under the grid with an angle of 30 degrees. 150 motor units were distributed within the cross section of the muscle. Recruitment thresholds, firing rate/excitatory drive relations, and twitch parameters were assigned to each motor unit using the same procedure as Fuglevand et al. (37). During each simulation, a proportional-integral-derivative controller adjusted the level of excitatory drive to minimise the error between a predefined target of force and the force generated by the active motor units.

      Figure 3A displays the raster plots of the active motor units during simulated trapezoidal isometric contractions with plateaus of force set at 10%, 20%, and 30% MVC. A sinusoidal isometric contraction ranging between 15 and 25% MVC at a frequency of 0.5 Hz was also simulated. We identified on average 10 ± 1 and 12 ± 2 motor units with surface and intramuscular arrays, respectively (Figure 3A). During the offline decomposition, the rate of agreement between the identified discharge times and the ground truth, that is, the simulated discharge times, reached 100.0 ± 0.0% for intramuscular EMG signals and 99.2 ± 1.8% for surface EMG signals (Figure 3B). The offline estimation of motor unit filters was therefore highly accurate, independently of the level of force or the pattern of the isometric contraction.

      Motor unit filters estimated during a baseline contraction at 20% MVC were then applied in real-time on signals simulated during a contraction with a different pattern (sinusoidal; Figure 3C). The rates of agreement between the online decomposition and the ground truth reached 96.3 ± 4.6% and 98.4 ± 2.3% for surface and intramuscular EMG signals, respectively. Finally, we tested whether the accuracy of the online decomposition changed when the level of force decreased or increased by 10% MVC when compared to the calibration performed at 20% MVC (Figure 3D). The rate of agreement remained high when applying the motor unit filters on signals recorded at 10% MVC: 99.8 ± 0.2% (surface EMG) and 99.5 ± 0.3% (intramuscular EMG). It is worth noting that only 3 out of 10 motor units identified from surface EMG at 20% MVC were active at 10% MVC, while 8 out of 12 motor units identified from intramuscular EMG were active at 10 % MVC. This shows how the decomposition of EMG signals tends to identify the last recruited motor units, which often innervate a larger number of fibres than the early recruited motor units (38). On the contrary, the application of motor unit filters on signals simulated at 30% MVC led to a decrease in the rate of agreement, with values of 88.6 ± 14.0% (surface EMG) and 80.3 ± 19.2% (intramuscular EMG). This decrease in accuracy did not impact all the motor units, with 5 motor units keeping a rate of agreement above 95% in both signals. For the other motor units, we observed a decrease in precision, which estimates the ratio of true discharge times over the total number of identified discharge times. This was caused by the recruitment of two motor units sharing a similar space within the muscle, which resulted in a merge in the same pulse train (Figure 3D).’
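      As an illustration of how such agreement metrics can be computed, the Python snippet below matches identified discharge times to reference (ground-truth) discharge times within a tolerance window and derives the rate of agreement, sensitivity, and precision. This is a minimal sketch rather than the I-Spin live implementation: the function names, the 0.5 ms tolerance, the greedy matching rule, and the definition of the rate of agreement as TP / (TP + FP + FN) are assumptions made for this example. Precision follows the definition given above (true discharge times over all identified discharge times).

```python
import numpy as np

def match_discharges(identified, reference, tol=0.0005):
    """Greedy one-to-one matching of discharge times (in seconds) within +/- tol.

    Returns (true positives, false positives, false negatives).
    """
    identified = np.sort(np.asarray(identified, dtype=float))
    reference = np.sort(np.asarray(reference, dtype=float))
    i = j = tp = 0
    while i < len(identified) and j < len(reference):
        diff = identified[i] - reference[j]
        if abs(diff) <= tol:
            tp += 1          # identified spike has a partner in the reference
            i += 1
            j += 1
        elif diff < 0:
            i += 1           # identified spike without a partner (false positive)
        else:
            j += 1           # reference spike that was missed (false negative)
    fp = len(identified) - tp
    fn = len(reference) - tp
    return tp, fp, fn

def decomposition_metrics(identified, reference, tol=0.0005):
    """Rate of agreement, sensitivity and precision, all in percent."""
    tp, fp, fn = match_discharges(identified, reference, tol)
    def pct(num, den):
        return 100.0 * num / den if den else float("nan")
    return {
        "rate_of_agreement": pct(tp, tp + fp + fn),  # assumed definition
        "sensitivity": pct(tp, tp + fn),
        "precision": pct(tp, tp + fp),
    }

# Hypothetical example: one reference spike is missed and one spurious spike is added.
reference_times = [0.1, 0.2, 0.3, 0.4]
identified_times = [0.1002, 0.2, 0.35, 0.4]
metrics = decomposition_metrics(identified_times, reference_times)
```

      Under these assumptions, missed discharge times lower the sensitivity only, whereas spikes wrongly attributed to a motor unit (for example, after two units merge into one pulse train) lower the precision; both lower the rate of agreement.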

      (2) From the point of view of a motor control neuroscientist studying movement in animals other than humans or non-human primates, the title was misleadingly hopeful. The use case presented in this study requires human participants to perform isometric contractions, facilitating spatially redundant recordings across the muscle for the algorithm to work. It is unclear whether these methods will be of utility to use cases under more physiological conditions (ie. dynamic movement). 

      We modified the title to read: “I-Spin live: An open-source software based on blind-source separation for real-time decoding of motor unit activity in humans”. 

      (3) The text states that "EMG signals recorded with an array of electrodes can be considered and instantaneous mixture of the original motor unit spike trains and their delayed versions." While this may be a true statement, it is not a complete statement, since motor units at distal sites may be shared, not shared, or novel. It was not clear to me whether the diversity of these scenarios would affect the performance of the software or introduce artifacts. In other words, if at site 1 you can pick up the bulk signal of units 1,2,3,4; at site two you pick up the signals of units 2,3,4,5 and site three you pick up the signal of units 3,4,5,6, what does the algorithm assume is happening and what does it report and why?

      This section has been rewritten to clarify this point. The EMG signal indeed represents the sum of the active motor units within the recorded muscle volume. In other words, it is possible that deep motor units, or motor units whose innervated fibres lie far away from the grid, were not in this recorded muscle volume and were thus non-identifiable. Another necessary condition for the identifiability of a motor unit is its unique spatio-temporal signature within the signal. This means that two motor units close to each other within the muscle volume will be merged by the model. This point was clarified in the results during the validation and the application of filters on experimental data.

      (P5; L115)

      ‘An EMG signal represents the sum of trains of action potentials from all the active motor units within the recorded muscle volume (Figure 1A). During stationary conditions, e.g., isometric contractions, the train of motor unit action potentials can be modelled as the convolution of series of discrete delta functions, representing the discharge times, and motor unit action potentials that have a consistent shape across time. When EMG signals are recorded with an array of electrodes, the shape of the recorded potential of each motor unit differs across electrodes. This is due to 1) the varying conduction velocity of action potentials among the muscle fibres, and 2) the location/depth of the muscle fibres that belong to each motor unit relative to the electrodes, which impacts the low pass filtering effect of the tissue on the recorded potential. Increasing the number and density of recording electrodes increases the likelihood that each motor unit will have a unique motor unit action potential profile (shape), i.e., a temporal and spatial profile that differs from all the other active motor units within the recorded volume (16, 29). The uniqueness of motor unit action potential profiles is necessary for the blind source separation to accurately estimate the motor unit discharge times. Conversely, the spike trains of two motor units with similar action potential profiles will be merged by the model.

      Our software uses a fast independent component analysis (fastICA) to retrieve motor unit spike trains from the EMG signals. For this, it iteratively optimises a separation vector (i.e., the motor unit filter) for each motor unit [Figure 1B; (24-26)]. The projection of the EMG signals on this separation vector generates a sparse motor unit pulse train, with most of its samples close to zero and a smaller number of samples significantly greater than zero (Figure 1B). The discharge times are estimated from this motor unit pulse train using a peak detection function and a k-means classification with two classes to separate the high peaks (spikes) from the low peaks (noise and other motor units). During the decomposition in real-time, short segments of EMG signals are projected on the saved separation vectors, and the peaks are classified as discharge times if they are closer to the centroid of the class ‘spikes’ than to the centroid of the class ‘noise’ (Figure 1C). The algorithm used to identify motor unit discharge activity is based on that proposed by Negro et al. (24) and Barsakcioglu et al. (26).’
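      The peak classification step described above can be sketched as follows. This is an illustrative example under stated assumptions, not the code of the software: it uses plain NumPy, a simple local-maximum rule in place of the peak detection function, and a hand-written one-dimensional two-class k-means; the function names and the synthetic pulse train are hypothetical.

```python
import numpy as np

def local_peaks(pulse_train):
    """Return indices of local maxima (a simple stand-in for a peak detector)."""
    x = np.asarray(pulse_train, dtype=float)
    is_peak = (x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:])
    return np.where(is_peak)[0] + 1

def two_class_kmeans(values, n_iter=50):
    """1-D k-means with two classes; returns (noise_centroid, spike_centroid)."""
    v = np.asarray(values, dtype=float)
    lo, hi = v.min(), v.max()              # initialise centroids at the extremes
    for _ in range(n_iter):
        is_spike = np.abs(v - hi) < np.abs(v - lo)
        if not is_spike.any() or is_spike.all():
            break                          # degenerate case: only one class present
        new_lo, new_hi = v[~is_spike].mean(), v[is_spike].mean()
        if new_lo == lo and new_hi == hi:
            break                          # assignments no longer change
        lo, hi = new_lo, new_hi
    return lo, hi

def classify_peaks(values, noise_centroid, spike_centroid):
    """A peak counts as a discharge if it is closer to the 'spikes' centroid."""
    v = np.asarray(values, dtype=float)
    return np.abs(v - spike_centroid) < np.abs(v - noise_centroid)

# Calibration on a synthetic pulse train: three high peaks and three low peaks.
pulse_train = np.zeros(100)
pulse_train[[10, 40, 70]] = [1.0, 0.9, 1.1]      # spikes of the motor unit
pulse_train[[20, 55, 85]] = [0.10, 0.15, 0.05]   # noise and other motor units
peaks = local_peaks(pulse_train)
noise_centroid, spike_centroid = two_class_kmeans(pulse_train[peaks])
discharge_idx = peaks[classify_peaks(pulse_train[peaks], noise_centroid, spike_centroid)]

# Real-time step: peaks from a new short segment are compared with the saved centroids.
new_peak_heights = [0.95, 0.20]
online_labels = classify_peaks(new_peak_heights, noise_centroid, spike_centroid)
```

      Saving the two centroids after calibration keeps the real-time step down to a distance comparison per peak, which is one plausible reason for classifying against stored centroids rather than re-clustering each segment.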

      (4) I could not fully appreciate the performance gap solved by the current methods. What was not achievable before that is now achievable? The 125 ms speed of deconvolution? What was achievable before? Intro text around ln 85 states that 'most of the current implementations of this approach rely on offline processing, which restricts its ability to be used..." but no reference is provided here about what the non 'most' of can achieve. 

      (8) The authors might try to add text to be more circumspect about the contributions of this method. I would recommend emphasizing the conceptual advances over the specifics of the performance of the algorithm since processor speed and implementation of the ideas in a faster environment (Matlab can be slow) will change those outcomes in a trivial way. Yet, much of the results section is very focused on these metrics. 

      The main contribution of this work, submitted to the ‘Tools and Resources’ section of eLife, is to provide a user interface that enables researchers to decompose EMG signals recorded with multichannel systems into motor unit activities, to perform this process in real-time, and to translate it into visual feedback. The user interface is fully open source and does not require coding experience. If necessary, the users can inspect the commented code and even modify it for their own experimental setup. The toolbox is now compatible with various acquisition boards, which can expand its use to novel surface and intramuscular arrays of electrodes.

      (5) Relatedly, it would have been nice to see a proof of concept using real-time feedback for some kind of biofeedback signal. If that is the objective here, why not show us this? I found the actual readout metrics of performance rather esoteric. They may be of interest to very close experts so I will defer to them for input.

      We agree with the reviewer. Videos were added to the supplemental materials to show the different forms of feedback, together with a case scenario where the participant tries to separate the activity of two motor units from the same muscle.

      (6) I was disappointed to see that only male participants are used because of some vague statement that 'it is widely known in the field' that more motor units can be resolved in males, without thorough referencing. It seems that the objective of the algorithm is the speed of analysis, not the number of units, which makes the elimination of female participants not justified. 

      The reviewer is right and that was corrected in the new version of the manuscript. We first performed additional experiments in both males and females focused on the accuracy of the approach, and further discussed the differences in yield between men and women in the discussion together with research perspectives to solve this issue.

      Results (P12; L296):

      ‘We then asked eight participants (4 males and 4 females) to perform trapezoidal isometric contractions with plateaus of force set at 10% and 20% MVC during which surface EMG signals were recorded from the TA with 256 electrodes separated by 4 mm. The aim of this experiment was to confirm the results of the simulation; specifically, to test the accuracy of the online decomposition when the level of force was below, equal to, or above the level of force produced during the baseline contraction used to estimate the motor unit filters (Figure 4). We assessed the accuracy of the motor unit spike trains identified in real time using their manually edited version as reference. 144 motor units were identified at both 10 and 20% MVC. When the test signals were recorded at the same level of force as the baseline contraction, we obtained rates of agreement of 95.6 ± 6.8% (10% MVC) and 93.9 ± 5.9% (20% MVC). The sensitivity reached 95.9 ± 6.7% (10% MVC) and 94.4 ± 5.6% (20% MVC), and the precision reached 99.6 ± 1.3% (10% MVC) and 99.4 ± 1.9% (20% MVC).  

      When the filters identified at 20% MVC were applied on signals recorded at a lower level of force (10% MVC), the rates of agreement decreased to 87.9 ± 16.2%. The sensitivity also decreased to 88.0 ± 16.2%, but the precision remained high (99.4 ± 4.3%). Thus, the decrease in accuracy was mostly caused by missed discharge times rather than the false identification of artifacts or spikes from other motor units. When the filters identified at 10% MVC were applied to signals recorded at a higher level of force, the rates of agreement decreased to 83.3 ± 13.5%. The sensitivity decreased to 90.7 ± 8.1%, and the precision also decreased to 90.9 ± 12.6%. This result confirms what was observed with synthetic EMG, that is, motor units recruited between 10 and 20% MVC can substantially disrupt the accuracy of the decomposition in real-time, as highlighted in Figure 4 (lower panel). Importantly, this situation does not happen for all the motor units, as suggested by the distribution of the values in Figure 4.’

      Discussion (P20; L480):

      “An important consideration regarding the implementation of offline or real-time surface EMG decomposition is the difference between individuals, with an overall lower yield in number of identified motor units in females (here: 9 ± 12) than in males (here: 30 ± 13). Typically, the number of motor units identified from surface EMG in females is about half that in males (32, 49, 50). The cause of this difference remains unclear. It may be related to variations in properties of the tissues separating the motor units from the recording electrodes, or to differences in the morphological and physiological properties of muscle fibres, as well as to the innervation ratios of motor units. These sex-related differences have so far only been supported by data extracted from animal experiments (51). However, the recent developments of simulation frameworks capable of generating highly realistic EMG signals for anthropometrically diverse populations may help to understand the impact of sex-related differences in humans (52). Specifically, these simulations can account for diverse anatomical (e.g. muscle volume and architecture, thickness of subcutaneous tissues) and physiological characteristics (e.g. innervation ratio, number of motor units, fibre cross sectional area, fibre conduction velocity, contribution of rate coding vs. spatial recruitment). Generating such datasets could help to identify the primary factors affecting EMG decomposition performance, ultimately enabling the refinement of algorithms and/or surface electrode design.”

      (7) Human curation is often used in spike sorting, but the description of criteria used in this step or how the human curation choices are documented is missing. 

      To address the reviewer’s comment, we added a new paragraph in the Method section to describe the manual editing process: (P26; L657)

      “There is a consensus among experts that automatic decomposition should be followed by visual inspection and manual editing (55). Manual editing involves the following steps: i) removing spikes that result in erroneous firing rates (outliers), ii) adding discharge times that are clearly distinguishable from the noise, iii) recalculating the separation vector, iv) reapplying the separation vector on the EMG signals (either a selected window or the entire signal), and v) repeating this procedure until no outliers are present and all clearly distinguishable spikes have been selected. Importantly, the manual editing of potentially missed or falsely identified discharge times should not be accepted before the application of the updated motor unit separation vector, thereby generating a new pulse train. Manual edits should be accepted only if the silhouette value improves following this operation or remains well above the preestablished threshold. A more extensive description of the manual editing of motor unit pulse trains can be found in (32). Even though some of the aforementioned steps involve subjective decision-making, evidence suggests that manual editing after EMG decomposition with blind source separation approaches remains highly reliable across operators (33). Specifically, the median rate of agreement calculated for 126 motor units over eight operators with varying experience in manual editing was 99.6%. All raw and processed data have been made available on a public data repository so that they can be used for training new operators (10.6084/m9.figshare.13695937).”

      Minor 

      Ln 115, "inversing" is not a word. "inverse" is not a verb 

      Changed as suggested

      Ln 186, typo, bioadhesive 

      Changed as suggested

      MVC should be defined on first use. It is currently defined on 3rd use or so. 

      The term rate is used in a variety of places without units. Eg line 465 but not limited to that 

      Changed as suggested

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      Two minor comments: Para 125: it is not clear what is meant by "spatial distribution" of recording electrodes. 

      ‘Density’ was used instead of ‘spatial distribution’ to now read:

      ‘Increasing the number and density of recording electrodes increases the likelihood that each motor unit will have a unique motor unit action potential profile (shape), i.e., a temporal and spatial profile that differs from all the other active motor unit within the recorded volume (16, 29).’

      Para 545: perhaps a bit more explanation about why low spatial overlap is better would be appropriate. 

      We added a section in the results showing how motor units with similar spatial signatures are merged by our model, leading to a lower precision. We therefore changed this sentence to now read:

      ‘Therefore, the likelihood of having spatially overlapping motor unit action potentials - and thus merged motor units - is lower, which explains why the rate of agreement of motor units identified from intramuscular arrays of electrodes is much higher than grids of surface electrodes (12, 13).’

      Reviewer #2 (Recommendations For The Authors): 

      The authors mention that data is included with the Github software package. I could not find any included data, or instructions on how to run the software offline on example data. (Apologies if I missed this - it would be helpful to make it more prominent)

      The link to the data on figshare was added in the GitHub, as well as data samples to run the algorithm offline and test manual editing.

      Minor comments: 

      Not sure what is meant by "boundary capabilities of online decomposition" 

      This was removed to only discuss the accuracy of online decomposition.

      CoV for ISIs is not formally defined or justified.

      This was added to the caption of figure 2:

      ‘The CoV of ISI estimates the regularity of spiking for each motor unit, an expected behaviour during isometric contractions at consistent levels of force.’
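      For readers unfamiliar with this metric, the coefficient of variation of the interspike intervals can be computed as the standard deviation of the intervals divided by their mean, usually expressed as a percentage. The snippet below is a minimal sketch under that common definition; the function name and the use of the sample standard deviation (ddof=1) are assumptions of this example and may differ from the manuscript's computation.

```python
import numpy as np

def cov_isi(discharge_times):
    """Coefficient of variation of interspike intervals, in percent.

    discharge_times: discharge times of one motor unit, in seconds.
    """
    t = np.sort(np.asarray(discharge_times, dtype=float))
    isi = np.diff(t)                      # interspike intervals
    if isi.size < 2:
        return float("nan")               # not enough intervals to estimate variability
    return 100.0 * isi.std(ddof=1) / isi.mean()

# Hypothetical spike trains: a perfectly regular unit and an irregular one.
regular = cov_isi(np.arange(10) * 0.1)
irregular = cov_isi([0.0, 0.1, 0.3, 0.4, 0.6])
```

      A value close to zero indicates regular firing, as expected during an isometric contraction held at a constant force; larger values indicate irregular discharge, which may reflect missed or spurious discharge times.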

      Fig. 4: slope units should be ms/motor unit, perhaps? 

      Changed as suggested.

      In some places, the manuscript uses "edition" to describe the editing process. I am not familiar with this usage, "editing" may be more common. 

      ‘Editing’ is now used throughout the manuscript.

      Reviewer #3 (Recommendations For The Authors): 

      I would recommend that the authors revise their manuscript to conform to eLife formatting guidelines, including moving the methods to the end of the manuscript. This change may entail substantial editing since many ideas are presented in order from the beginning of the methods. While this suggestion may seem superficial, the success of the new publishing model might benefit from general uniformity in manuscript style.

      We changed and edited the draft to follow the classic format of eLife papers.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study dissects distinct pools of diacylglycerol (DAG), continuing a line of research on the central concept that there is a major lipid metabolism DAG pool in cells, but also a smaller signaling DAG pool. It tests the hypothesis that the second pool is regulated by Dip2, which influences Pkc1 signaling. The group shows that stressed yeast increase the specific DAG species C36:0 and C36:1, and proposes that this promotes Pkc1 activation via Pkc1 binding C36:0. The study also examines how perturbing the lipid metabolism DAG pool via various deletions, such as lro1, dga1, and pah1, impacts DAG and stress signaling. Overall this is an interesting study that adds new data to how different DAG pools influence cellular signaling.

      Strengths:

      The study nicely combined lipidomic profiling with stress signaling biochemistry and yeast growth assays.

      We thank the reviewer for finding this study of interest and appreciating our multi-pronged approach to prove our hypothesis that a distinct pool of DAGs regulated by Dip2 activate PKC signalling.

      Weaknesses:

      One suggestion to improve the study is to examine the spatial organization of Dip2 within cells, and how this impacts its ability to modulate DAG pools. Dip2 has previously been proposed to function at mitochondria-vacuole contacts (Mondal 2022). Examining how Dip2 localization is impacted when different DAG pools are manipulated, such as by deleting Pah1 (also suggested to work at yeast contact sites such as the nucleus-vacuole junction), or with Lro1 or Dga1 deletion, would broaden the scope of the study.

      We thank the reviewer for the suggestion to trace the localization of Dip2 in the absence of various DAG-acting enzymes. To address this, we generated Dip2-GFP knock-in (KI) in Δpah1, Δlro1 and Δdga1 strains, confirming successful integration by western blotting using an anti-GFP antibody. We then performed microscopy to examine the localization of Dip2. Since Dip2 is a mitochondria-vacuole contact site protein that predominantly localizes to mitochondria (approximately 60% of Dip2 puncta localize to mitochondria) (Mondal et al. 2022), we co-stained the cells with MitoTracker red to visualize mitochondria.

      Consistent with our previous findings, Dip2 colocalizes with MitoTracker red in WT (Figure 3-figure supplement 2A). As suggested by the reviewer, we deleted PAH1, which converts phosphatidic acid to DAGs and is also known to work at the nucleus-vacuole junction. On examining whether the absence of PAH1 influences the localization of Dip2, we found no change in Dip2’s spatial organization. This could also be due to the lack of any observable change in the DAG species on deleting PAH1, as noted in our lipidomic studies (Figure 4-figure supplement 2A). These observations suggest that in a homeostatic condition, Pah1 does not affect the DAG pool acted upon by Dip2 and therefore has no influence on Dip2’s subcellular localization. These data have been incorporated in the revised manuscript (line no. 286-289) and Figure 4-figure supplement 2D-E.

      Similarly, we probed for the localization of Dip2 in LRO1 and DGA1 knockout strains. These enzymes are responsible for converting bulk DAGs to TAGs. We have previously shown that Dip2 is selective for only C36:0 and C36:1 and does not act on the bulk DAGs (Mondal et al. 2022). Both Lro1 and Dga1 are endoplasmic reticulum (ER) resident proteins, and the bulk DAG accumulation in their knockouts has been shown to occur in the ER (Li et al. 2020), not influencing the mitochondrial DAG pool. On tracing Dip2’s localization in these knockouts, we found that Dip2 remains in the mitochondria (Figure 3-figure supplement 2, Figure 4-figure supplement 2D-E). These results suggest that Dip2 localization is not influenced by bulk DAG accumulation, reinforcing its specificity toward selective DAGs, which are likely to be present at mitochondria and mitochondria-vacuole contact sites. We have added these data in the revised manuscript (line no. 240-246) with Figure 3-figure supplement 2.

      Reviewer #2 (Public review):

      Summary:

      The authors use yeast genetics, lipidomic and biochemical approaches to demonstrate the DAG isoforms (36:0 and 36:1) can specifically activate PKC. Further, these DAG isoforms originate from PI and PI(4,5)P2. The authors propose that the Psi1-Plc1-Dip2 functions to maintain a normal level of specific DAG species to modulate PKC signalling.

      Strengths:

      Data from yeast genetics are clear and strong. The concept is potentially interesting and novel.

      We would like to thank the reviewer for the positive comments on our work and finding the study novel and interesting.

      Weaknesses:

      More evidence is needed to support the central hypothesis. The authors may consider the following:

      (1) Figure 2: the authors should show/examine C36:1 DAG. Also, some structural evidence would be highly useful here. What is the structural basis for the assertion that the PKC C1 domain can only be activated by C36:0/1 DAG but not other DAGs? This is a critical conclusion of this work and clear evidence is needed.

      We thank the reviewer for the insightful comments. We were unable to include C36:1 DAG in our in vitro DAG binding assays because it is not commercially available. We have now explicitly mentioned it in the revised manuscript (Line no. 186).

      We agree with the reviewer that the activation of PKC by C36:0 and C36:1 DAGs is a critical conclusion of our work. While we understand that there is no obvious structural explanation as to how the DAG-binding C1 domain of PKC attains the acyl chain specificity for DAGs, our conclusion that yeast Pkc1 is selective for C36:0 and C36:1 DAGs is supported by a combination of robust in vitro and in vivo data:

      (1) In Vitro Evidence: The liposome binding assays demonstrate that the Pkc1 C1 domain binds only to the selective DAG and does not interact with bulk DAGs.

      (2) In Vivo Evidence: Lipidomic analyses of wild-type cells subjected to cell wall stress reveal increased levels of C36:0 and C36:1 DAGs, while levels of bulk DAGs remain unaffected.

      These findings collectively indicate that Pkc1 neither binds nor is activated by bulk DAGs, reinforcing its specificity for C36:0 and C36:1 DAGs.

      Moreover, establishing the structural basis of this selectivity would require a DAG-bound structure of the Pkc1 C1 domain, which is difficult to obtain owing to the flexibility of the longer acyl chains present in C36:0 and C36:1 DAGs. In addition, capturing the full-length Pkc1 structure, which might provide deeper insights, has proved challenging for several other groups. We also hypothesize that the DAG selectivity of Pkc1 is more of a membrane phenomenon, wherein these DAGs might create a specific microdomain or form a particular curvature that is sensed by Pkc1. Investigating this would require extensive structural and biophysical studies that are beyond the scope of the current work but are planned for future research.

      (2) Does Dip2 colocalize with Plc1 or Pkc1?

      As shown in our previous study (Mondal et al. 2022) and in the above section (Figure 3-figure supplement 2A-B), Dip2 predominantly localizes to the mitochondria. Pkc1, on the other hand, is known to be found in the cytosol, plasma membrane and bud site (Andrews and Stark 2000). We also checked the localization of Pkc1 in cells co-stained with mitotracker-red and observed no significant overlap between the two, confirming that Pkc1 does not colocalize with Dip2 (Author response image 1).

      Author response image 1.

      Live cell microscopy for tracing Pkc1 localization. (A) Microscopy image panel showing DIC image (left), fluorescence for Pkc1 tagged with GFP, mitotracker-red for staining mitochondria and the merged image for both the fluorophores (right). Scale bar represents 5 µm. (B) Line scan plotted for the fluorescence intensity of Pkc1-GFP along with mitotracker-red across the line shown in the merged panel.

      Moreover, as suggested by the reviewer, we also checked the localization of Plc1 and found that Plc1 is present in the cytosol and shows a partial colocalization with the mitochondria (Figure 4-figure supplement 3A-B). As some puncta of Dip2 also colocalize with the vacuoles, we checked whether Plc1 follows a similar localization pattern. We co-stained Plc1-GFP with FM4-64, a vacuolar membrane dye, and observed that Plc1 partially localizes to vacuoles as well (Figure 4-figure supplement 3C-D). This was also observed in a previous study where Plc1 was found in a subcellular fractionation of isolated yeast vacuoles and total cell lysate (Jun, Fratti, and Wickner 2004). We also checked whether Plc1, like Dip2, localizes to the mitochondria-vacuole contact site, using tri-colour imaging with FM4-64 for vacuole, DAPI for mitochondria and GFP-tagged Plc1. We were not able to trace Dip2 and Plc1 simultaneously as we could not generate a strain endogenously tagged with two different colours even after several attempts. However, from our observations, we can conclude that Plc1 partially localizes to mitochondria and vacuole and might be locally producing the selective DAGs to be acted upon by Dip2. We have incorporated these data in the revised manuscript (line no. 301-304) with Figure 4-figure supplement 3.

      To probe the localization of Dip2 upon Plc1 activation, we used cell wall stress, a condition that induces Plc1 activation for selective DAG production (this study). Under this condition, we probed the localization of Dip2 by fluorescence microscopy and found that Dip2 does not move to the plasma membrane but remains localized to mitochondria (Figure 1-figure supplement 3). This result has been added in the revised manuscript (line no. 153-160) with Figure 1-figure supplement 3.

      This raises intriguing questions regarding the spatial regulation of Pkc1 by Dip2. Since Dip2’s localization remains unaffected, whether the selective DAGs, presumably at the mitochondria, move to the plasma membrane for Pkc1 activation or the Pkc1 translocates to the mitochondria needs further exploration. Addressing these possibilities will require a combination of genetic approaches, organellar lipidomics, and advanced microscopy, which we aim to explore in future studies.

      References:

      Andrews, P. D., and M. J. Stark. 2000. “Dynamic, Rho1p-Dependent Localization of Pkc1p to Sites of Polarized Growth.” Journal of Cell Science 113 (Pt 15): 2685–93. doi:10.1242/jcs.113.15.2685.

      Jun, Youngsoo, Rutilio A. Fratti, and William Wickner. 2004. “Diacylglycerol and Its Formation by Phospholipase C Regulate Rab- and SNARE-Dependent Yeast Vacuole Fusion*.” Journal of Biological Chemistry 279(51): 53186–95. doi:10.1074/jbc.M411363200.

      Li, Dan, Shu-Gao Yang, Cheng-Wen He, Zheng-Tan Zhang, Yongheng Liang, Hui Li, Jing Zhu, et al. 2020. “Excess Diacylglycerol at the Endoplasmic Reticulum Disrupts Endomembrane Homeostasis and Autophagy.” BMC Biology 18(1): 107. doi:10.1186/s12915-020-00837-w.

      Mondal, Sudipta, Priyadarshan Kinatukara, Shubham Singh, Sakshi Shambhavi, Gajanan S Patil, Noopur Dubey, Salam Herojeet Singh, et al. 2022. “DIP2 Is a Unique Regulator of Diacylglycerol Lipid Homeostasis in Eukaryotes.” eLife 11: e77665. doi:10.7554/eLife.77665.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The biogenesis of outer membrane proteins (OMPs) into the outer membranes of Gram-negative bacteria is still not fully understood, particularly substrate recognition and insertion by the beta-assembly machinery (BAM). In this work, the authors report that, in addition to recognition of the last strand of an OMP, sometimes referred to as the beta-signal, an additional signal upstream of the last strand is also important for OMP biogenesis.

      Strengths:

      1. Overall the manuscript is well organized and written, and addresses an important question in the field. The idea that BAM recognizes multiple signals on OMPs has been presented previously, however, it was not fully tested.

      2. The authors here re-address this idea and propose that it is a more general mechanism used by BAM for OMP biogenesis.

      3. The notion that additional signals assist in biogenesis is an important concept that indeed needs to be fully tested in OMP biogenesis.

      4. A significant study was performed with extensive experiments reported in an attempt to address this important question in the field.

      5. The identification of important crosslinks and regions of substrates and Bam proteins that interact during biogenesis is an important contribution that gives clues to the path substrates take en route to the membrane.

      Weaknesses:

      Major critiques (in no particular order):

      1. The title indicates 'simultaneous recognition', however no experiments were presented that test the order of interactions during OMP biogenesis.

      We have replaced the word “Simultaneous” with “Dual” so as not to reflect on the timing of the recognition events for the distinct C-terminal signal and -5 signal.

      2. Aspects of the study focus on the peptides that appear to inhibit OmpC assembly, but should also include an analysis of the peptides that do not, to determine whether the motif(s) are still present or not.

      We thank the reviewer for this comment. Our study focuses on the peptides that exhibited an inhibitory effect in order to elucidate further interactions between the BAM complex and substrate proteins, especially in the early stage of the assembly process. Peptide 9 contains all of our proposed elements but did not have an inhibitory effect; in this peptide, an arginine residue occupies the polar position next to the hydrophobic residue at position 0 (0 Φ). As seen in Fig S5, S6, and S7, there are no positively charged amino acids at the polar residue positions in the -5 or last strands. This might be the reason why peptide 9, as well as peptide 24, which is the β-signal derived from the mitochondrial OMP Tom40 and contains a lysine at the polar position, did not display an inhibitory effect. Incorporating the reviewer's suggestions might reveal features that should be excluded from the elements, but this is not the focus of this paper and was not discussed to avoid complicating the paper.

      3. The β-signal is known to form a β-strand, therefore it is unclear why the authors did not choose to chop OmpC up according to its strands, rather than by a fixed peptide size. What was the rationale for how the peptide lengths were chosen since many of them partially overlap known strands, and only partially (2 residues) overlap each other? It may not be too surprising that most of the inhibitory peptides consist of full strands (#4, 10, 21, 23).

      A simple scan of known β-strands would have been an alternative approach, however this comes with the bias of limiting the experiments to predicted substrate (strand) sequences, and it presupposes that the secondary structure element would be formed by this tightly truncated peptide.

      Instead, we allowed for the possibility that OMPs meet the BAM complex in an unfolded or partially folded state, and that the secondary structure (β-strand) might only form via β-augmentation after the substrate is placed in the context of the lateral gate. We therefore used peptides that mapped right across the entirety of OmpC, with a two amino acid overlap.

      To clarify this important point regarding the unbiased nature of our screen, we have revised the text:

      (Lines 147-151) "We used peptides that mapped the entirety of OmpC, with a two amino acid overlap. This we considered preferable to peptides that were restricted by structural features, such as β-strands, in consideration that β-strand formation may or may not have occurred in early-stage interactions at the BAM complex."
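
      As an illustration only (this is not the code used in the study), the tiling scheme described above can be sketched in a few lines of Python. The function name `tile_peptides`, the toy sequence, and the peptide length of 6 are all hypothetical; the response specifies only the two-residue overlap, not the peptide length.

```python
def tile_peptides(sequence, length, overlap=2):
    """Split `sequence` into consecutive peptides of `length` residues,
    each sharing `overlap` residues with the peptide before it.
    The final peptide may be shorter than `length`."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and smaller than length")
    step = length - overlap
    peptides = []
    start = 0
    while start < len(sequence):
        peptides.append(sequence[start:start + length])
        if start + length >= len(sequence):
            break  # the whole sequence is covered
        start += step
    return peptides


# Hypothetical toy sequence (NOT the OmpC sequence); length=6 is an assumption.
toy_sequence = "ABCDEFGHIJKLMNOP"
tiles = tile_peptides(toy_sequence, length=6, overlap=2)
```

      Because the tiles are defined by position alone, the scan makes no assumption about where β-strands begin or end, which is the point made in the revised text.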

      4. It would be good to have an idea of the propensity of the chosen peptides to form β-strands and participate in β-augmentation. We know from previous studies with darobactin and other peptides that they can inhibit OMP assembly by competing with substrates.

      We appreciate the reviewer's suggestion. However, we have not conducted biophysical characterizations of the peptides to calculate the propensity of each peptide to form β-strands and participate in β-augmentation. The sort of detailed biophysical analysis done for darobactin (by the Maier and Hiller groups; “The antibiotic darobactin mimics a β-strand to inhibit outer membrane insertase”, Nature 593:125-129) was a Nature publication based on this single peptide. A further biophysical analysis of all of the peptides presented here goes well beyond the scope of our study.

      5. The recognition motifs that the authors present span up to 9 residues which would suggest a relatively large binding surface, however, the structures of these regions are not large enough to accommodate these large peptides.

      The β-signal motif (ζxGxx[Ω/Φ]x[Ω/Φ]) is an 8-residue consensus; some of the inhibitory peptides include additional residues before and after the defined motif of 8 residues, and the lateral gate of BamA has been shown to interact with a 7-residue span (e.g. Doyle et al., 2022). Cross-linking presented in our study showed BamD residues R49 and G65 cross-linked to the positions 0 and 6 of the internal signal in OmpC (Fig. 6D).

      We appreciate this point of clarification and have modified the text to acknowledge that in the final registering of the peptide with its binding protein, some parts of the peptide might sit beyond the bounds of the BamD receptor’s binding pocket and the BamA lateral gate:

      (Lines 458-471) "The β-signal motif (ζxGxx[Ω/Φ]x[Ω/Φ]) is an eight-residue consensus, and the internal signal motif is composed of a nine-residue consensus. Recent structures have shown the lateral gate of BamA interacts with a 7-residue span of substrate OMPs. Interestingly, inhibitory compounds, such as darobactin, mimic only three residues of the C-terminal side of the β-signal motif. Cross-linking presented here in our study showed that BamD residues R49 and G65 cross-linked to the positions 0 and 6 of the internal signal in OmpC (Fig. 6D). Both signals are larger than the assembly machinery's signal binding pocket, implying that the signal might sit beyond the bounds of the signal binding pocket in BamD and the lateral gate in BamA. These findings are consistent with similar observations in other signal sequence recognition events, such as the mitochondrial targeting presequence signal that is longer than the receptor groove formed by Tom20, a subunit of the translocator of outer membrane (TOM) complex (Yamamoto et al., 2011). The presequence has been shown to bind to Tom20 in several different conformations within the receptor groove (Nyirenda et al., 2013)."
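
      For readers who wish to scan a sequence for the eight-residue β-signal consensus, a minimal sketch is given below. It is illustrative only: the residue sets assigned to ζ (polar), Ω (aromatic) and Φ (hydrophobic) are assumptions made for this example, since the response does not define them, and the function name `find_beta_signal` and the toy sequence are hypothetical.

```python
import re

# Assumed residue classes (illustrative; not defined in the response).
POLAR = "DEHKNQRST"          # stands in for ζ
AROMATIC = "FWY"             # stands in for Ω
HYDROPHOBIC = "AILMV"        # stands in for Φ
OMEGA_OR_PHI = AROMATIC + HYDROPHOBIC

# ζ x G x x [Ω/Φ] x [Ω/Φ], where "x" is any residue (8 positions in total).
MOTIF = rf"[{POLAR}].G..[{OMEGA_OR_PHI}].[{OMEGA_OR_PHI}]"


def find_beta_signal(sequence):
    """Return (start_index, matched_segment) for every motif occurrence.
    A lookahead is used so that overlapping occurrences are also reported."""
    pattern = re.compile(rf"(?=({MOTIF}))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


# Hypothetical toy sequence, not taken from any OMP.
hits = find_beta_signal("AASAGAAFAYAA")
```

      Such a pattern match only flags candidate segments; whether a segment functions as a signal is an experimental question, as the peptide 9 and peptide 24 results discussed above indicate.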

      Moreover, the distance between the amino acids of BamD which cross-linked to the internal signal, R49 and Y62, is approximately 25 Å (pdbID used 7TT3). The maximum length of the internal signal of OmpC, from F280 to Y288, is approximately 22 Å (pdbID used 2J1N). This would allow the signal to fit within the confines of the TPR motif of BamD.
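
      The comparison above reduces to a Euclidean distance between two atom positions followed by a size check. A minimal sketch, assuming coordinates are supplied as (x, y, z) tuples in Å; the function names and the example coordinates are hypothetical, and only the approximate 22 Å and 25 Å values are taken from the text.

```python
import math


def span_angstrom(coord_a, coord_b):
    """Euclidean distance between two atoms given as (x, y, z) in Å."""
    return math.dist(coord_a, coord_b)


def signal_fits(signal_length, site_separation):
    """True if the extended signal is no longer than the site separation."""
    return signal_length <= site_separation


# Hypothetical coordinates, chosen only to exercise the function.
example_span = span_angstrom((0.0, 0.0, 0.0), (3.0, 4.0, 12.0))

# Approximate values quoted in the response (in Å).
fits = signal_fits(signal_length=22.0, site_separation=25.0)
```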

      Author response image 1.

      6. The authors highlight that the sequence motifs are common among the inhibiting peptides, but do not test if this is a necessary motif to mediate the interactions. It would have been good to see if a library of non-OMP related peptides that match this motif could also inhibit or not.

      With respect, this additional work would not address any biological question relevant to the function of BamD. To randomize sequences and then classify those that do or don’t fit the motif would help in refining the parameters of the β-signal motif, but that was not our intent.

      We have identified the peptides from within the total sequence of an OMP, shown which peptides inhibit in an assembly assay, and then observed that the inhibitory peptides conform to a previously published (β-signal) motif.

      7. In the studies that disrupt the motifs by mutagenesis, an effect was observed and attributed to disruption of the interaction of the 'internal signal'. However, the literature is filled with point mutations in OMPs that disrupt biogenesis, particularly those within the membrane region. F280, Y286, V359, and Y365 are all residues that are in the membrane region that point into the membrane. Therefore, more work is needed to confirm that these mutations are in parts of a recognition motif rather than on the residues that are disrupting stability/assembly into the membrane.

      As the reviewer pointed out, the side chains of the amino acids constituting the signal elements we determined all face the lipid side; of these, Y286 and Y365 were important for folding as well as for recognition. However, F280A and V359A had no effect on folding, but only on assembly through the BAM complex. The fact that position 0 functions as a signal has been demonstrated by peptidomimetics (Fig. 1) and point mutant analysis (Fig. 2). We appreciate this clarification and have modified the text to acknowledge that all of the signal elements face the lipid side, which ultimately contributes to their stability in the membrane, and that before this the BAM complex actively recognizes them and determines their orientation:

      (Lines 519-526) "After OMP assembly, all elements of the internal signal are positioned such that they face into the lipid-phase of the membrane. This observation may be a coincidence, or may be utilized by the BAM complex to register and orientate the lipid facing amino acids in the assembling OMP away from the formative lumen of the OMP. Amino acids at position 6, such as Y286 in OmpC, are not only a component of the internal signal for binding by the BAM complex, but also act in a structural capacity to register the aromatic girdle for optimal stability of the OMP in the membrane."

      8. The title of Figure 3 indicates that disrupting the internal signal motif disrupts OMP assembly, however, the point mutations did not seem to have any effect. Only when both 280 and 286 were mutated was an effect observed. And even then, the trimer appeared to form just fine, albeit at reduced levels, indicating assembly is just fine, rather the rate of biogenesis is being affected.

      We appreciate this point and have revised the title of Figure 3 to be:

      (Lines 1070-1071) "Modifications in the putative internal signal slow the rate of OMP assembly in vivo."

      9. In Figure 4, the authors attempt to quantify their blots. However, this seems to be a difficult task given the lack of quality of the blots and the spread of the intended signals, particularly of the 'int' bands. However, the more disturbing trend is the obvious reduction in signal from the post-urea treatment, even for the WT samples. The authors are using urea washes to indicate removal of only stalled substrates. However a reduction of signal is also observed for the WT. The authors should quantify this blot as well, but it is clear visually that both WT and the mutant have obvious reductions in the observable signals. Further, this data seems to conflict with Fig 3D where no noticeable difference in OmpC assembly was observed between WT and Y286A, why is this the case?

      We have addressed this point by adding a statistical analysis on Fig. 4A. As the reviewer points out, BN-PAGE band quantification is a difficult task given the broad spread of the bands on these gels. Statistical analysis showed that the increase in intermediates (int) was statistically significant for Y286A at all times until 80 min, when the intermediate form signals decrease.

      (Lines 1093-1096) "Statistical significance was indicated by the following: N.S. (not significant), p<0.05; *, p<0.005; **. Exact p values of intermediate formed by Wt vs Y286A at each timepoint were as follows; 20 minutes: p = 0.03077, 40 minutes: p = 0.02402, 60 minutes: p = 0.00181, 80 minutes: p = 0.0545."
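
      As a quick check of how the exact p values above translate into significance labels, the following sketch applies the two thresholds named in the legend. The mapping of one asterisk to p<0.05 and two asterisks to p<0.005 is our reading of the legend and should be treated as an assumption; the function name `significance_label` is hypothetical.

```python
def significance_label(p_value):
    """Map a p value to a label using the assumed legend thresholds."""
    if p_value < 0.005:
        return "**"
    if p_value < 0.05:
        return "*"
    return "N.S."


# Exact p values quoted in the legend (WT vs Y286A intermediate), by minutes.
p_values = {20: 0.03077, 40: 0.02402, 60: 0.00181, 80: 0.0545}
labels = {minutes: significance_label(p) for minutes, p in p_values.items()}
```

      Under these thresholds the 20, 40 and 60 minute timepoints are significant and the 80 minute timepoint is not, in line with the statement that the difference holds until 80 min.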

      Further regarding the Int. band, we correct the statement as follows.

      (Lines 253-254) "Consistent with this, the assembly intermediate which was prominently observed for OmpC(Y286A) can be extracted from the membranes with urea;"

      OMP assembly in vivo has additional periplasmic chaperones and factors present in order to support the assembly process. Therefore, it is likely that some proteins were assembled properly in vivo compared to their in vitro counterparts. Such a decrease has been observed not only in E. coli but also in mitochondrial OMP import (Yamano et al., 2010).

      10. The pull-down assays with BamA and BamD should include a no protein control at the least to confirm there is no non-specific binding to the resin. Also, no detergent was mentioned as part of the pull downs that contained BamA or OmpC, nor was it detailed if OmpC was urea solubilized.

      We have performed pull down experiments with a no-protein (Ni-NTA only) control as noted (Author response image 1). The results showed that the amount of OmpC carrying through on beads only was significantly lower than the amount of OmpC bound in the presence of BamD or BamA. The added OmpC was not treated with urea, but was synthesized by in vitro translation; the in vitro translated OmpC is the standard substrate in the EMM assembly assay (Supp Fig. S1) where it is recognized by the BAM complex. Thus, we used it for pull-down as well and, to make this clearer, we have revised as follows:

      Author response image 2.

      Pull-down assay of radio-labelled OmpC with the indicated protein or Ni-NTA alone (Ni-NTA). T, total; FT, flow-through; W, wash; E, elute.

      (Lines 252-265) "Three subunits of the BAM complex have been previously shown to interact with the substrates: BamA, BamB, and BamD (Hagan et al., 2013; Harrison, 1996; Ieva et al., 2011). In vitro pull-down assay showed that while BamA and BamD can independently bind to the in vitro translated OmpC polypeptide (Fig. S9A), BamB did not (Fig. S9B)."

      11.

      • The neutron reflectometry experiments are not convincing primarily due to the lack of controls to confirm a consistent uniform bilayer is being formed and even if so, uniform orientations of the BamA molecules across the surface.

      • Further, no controls were performed with BamD alone, or with OmpC alone, and it is hard to understand how the method can discriminate between an actual BamA/BamD complex versus BamA and BamD individually being located at the membrane surface without forming an actual complex.

      • Previous studies have reported difficulty in preparing a complex with BamA and BamD from purified components.

      • Additionally, little signal differences were observed for the addition of OmpC. However, an elongated unfolded polypeptide that is nearly 400 residues long would be expected to produce a large distinct signal given that only the C-terminal portion is supposedly anchored to BAM, while the rest would be extended out above the surface.

      • The depiction in Figure 5D is quite misleading when viewing the full structures on the same scales with one another.

      We have addressed these five points individually as follows.

      i. The uniform orientation of BamA on the surface is ensured by fixation through a His-tag engineered into extracellular loop 6 of BamA and has been validated in previous studies as cited in the text. Moreover, to test this, we constructed an alternative theoretical model in which BamA is not well oriented in the system, as shown below. However, we found that the solid lines (after fitting) did not align well with the experimental data. We therefore assumed that BamA is well oriented in the membrane bilayer.

      Author response image 3.

      Experimental (symbols) and fitted (curves) NR profiles of BamA not oriented well in the POPC bilayer in D2O (black), GMW (blue) and H2O (red) buffer.

      ii. There would be no means by which to do a control with OmpC alone or BamD alone, as neither protein binds to the lipid layer chip. OmpC is diluted from urea and the unbound OmpC is then washed from the chip before NR measurements. BamD does not have an acyl group to anchor it to the lipid layer; without BamA to anchor to, it too is washed from the chip before NR measurements. We have constructed another theoretical model in which both BamA and BamD are embedded in the membrane bilayer, and the fits are shown below. The fits did not align well with the experimental data, which argues against BamA and BamD being individually located at the membrane surface without forming an actual complex.

      Author response image 4.

      Experimental (symbols) and fitted (curves) NR profiles of BamA+D embedding together in the POPC bilayer in D2O (black), GMW (blue) and H2O (red) buffer.

      iii. The previous studies that reported difficulty in preparing a complex with BamA and BamD from purified components were assays done in aqueous solution including detergent solubilized BamA, or with BamA POTRA domains only. Our assay is superior in that it reports the binding of BamD to a purified BamA that has been reconstituted in a lipid bilayer.

      iv. The relatively small signal differences observed for the addition of OmpC are expected, since OmpC is an elongated, unfolded polypeptide of nearly 400 residues long which, in the context of this assay, can occupy a huge variation in the positions at which it will sit with only the C-terminal portion anchored to BAM, and the rest moving randomly about and extended from the surface.

      v. We appreciate the point raised and have now added a note in the Figure legend that these are depictions of the results and not a scale drawing of the structures.

      12. In the crosslinking studies, the authors show 17 crosslinking sites (43% of all tested) on BamD crosslinked with OmpC. Given that the authors are presenting specific interactions between the two proteins, this is worrisome as the crosslinks were found across the entire surface of BamD. How do the authors explain this? Are all these specific or non-specific?

      The crosslinking experiment using purified BamD was an effective assay for comprehensive analysis of the interaction sites between BamD and the substrate. However, as the reviewer pointed out, cross-linking was observed even at the sites that, in the context of the BAM complex, interact with BamC as a protein-protein interaction and would not be available for substrate protein-protein interactions. To complement this analysis and to address this issue, we also performed the experiment in Fig. 6C.

      In Fig. 6C, the interaction of BamD with the substrate is examined in vivo, and the results demonstrate that if BPA is introduced into the site we designated as the substrate recognition site, it is cross-linked to the substrate. On the other hand, position 114 was found to cross-link with the substrate in in vitro cross-linking, but not in vivo. Given that position 114 has also been confirmed to form cross-link products with BamC, we believe that the in vivo experiment has investigated BamD-substrate interactions in the native state. To explain the above, we have added the following description to the Results section.

      (Lines 319-321) "Structurally, these amino acids locate to both the lumen side of the funnel-like structure (e.g. 49 or 62) and outside of the funnel-like structure, such as the BamC binding site (e.g. 114) (fig. S12C)."

      (Lines 350-357) "Positions 49, 53, 65, and 196 of BamD face the interior of the funnel-like structure of the periplasmic domain of the BAM complex, while position 114 is located outside of the funnel-like structure (Bakelar et al., 2016; Gu et al., 2016; Iadanza et al., 2016). We note that while position 114 was cross-linked with OmpC in vitro using purified BamD, this was not seen with in vivo cross-linking. Instead, in the context of the BAM complex, position 114 of BamD binds to the BamC subunit and would not be available for substrate binding in vivo (Bakelar et al., 2016; Gu et al., 2016; Iadanza et al., 2016)."

      13. The study in Figure 6 focuses on defined regions within the OmpC sequence, but a more broad range is necessary to demonstrate specificity to these regions vs binding to other regions of the sequence as well. If the authors wish to demonstrate a specific interaction to this motif, they need to show no binding to other regions.

      The region of affinity for the BAM complex was determined by peptidomimetic analysis, and the signal region was further identified by mutational analysis of OmpC. Subsequently, the subunit that recognizes the signal region was identified as BamD. In other words, in the process leading up to Fig. 6, we were able to analyze in detail that other regions were not the target of the study. We have revised the text to make clear that we focus on the signal region including the internal signal, and have not also analyzed other parts of the signal region:

      (Lines 329-332) "As our peptidomimetic screen identified conserved features in the internal signal, and cross-linking highlighted the N-terminal and C-terminal TPR motifs of BamD as regions of interaction with OmpC, we focused on amino acids specifically within the β-signals of OmpC and regions of BamD which interact with β-signal."

      14. The levels of the crosslinks are barely detectable via western blot analysis. If the interactions between the two surfaces are required, why are the levels for most of the blots so low?

      These are western blots of cross-linked products – the efficiency of cross-linking is far less than 100% of the interacting protein species present in a binding assay and this explains why the levels for the blots are ‘so low’. We have added a sentence to the revised manuscript to make this clear for readers who are not molecular biologists:

      (Lines 345-348) "These western blots reveal cross-linked products representing the interacting protein species. Photo cross-linking of an unnatural amino acid is not a 100% efficient process, so the level of cross-linked products is only a small proportion of the molecules interacting in the assays."

      15.

      • Figure 7 indicates that two regions of BamD promote OMP orientation and assembly, however, none of the experiments appears to measure OMP orientation?

      • Also, one common observation from panel F was that not only was the trimer reduced, but also the monomer. But even then, still a percentage of the trimer is formed, not a complete loss.

      (i) We appreciate this point and have revised the title of Figure 7 to be:

      (Lines 1137-1138) "Key residues in two structurally distinct regions of BamD promote β-strand formation and OMP assembly."

      (ii) In our description of Fig. 7F (Lines 356-360) we do not distinguish between the amount of monomer and trimer forms, since both are reflective of the overall assembly rate i.e. assembly efficiency. Rather, we state that:

      "The EMM assembly assay showed that the internal signal binding site was as important as the β-signal binding site to the overall assembly rates observed for OmpC (Fig. 7F), OmpF (fig. S15D), and LamB (fig. S15E). These results suggest that recognition of both the C-terminal β-signal and the internal signal by BamD is important for efficient protein assembly."

      16.

      • The experiment in Fig 7B would be more conclusive if it was repeated with both the Y62A and R197A mutants and a double mutant. These controls would also help resolve any effect from crowding that may also promote the crosslinks.

      • Further, the mutation of R197 is an odd choice given that this residue has been studied previously and was found to mediate a salt bridge with BamA. How was this resolved by the authors in choosing this site since it was not one of the original crosslinking sites?

      As stated in the text, the purpose of the experiment in Figure 7B is to measure the impact of pre-forming a β-strand in the substrate (OmpC) before providing it to the receptor (BamD). We thank the reviewer for the comment on the R197 position of BamD. The C-terminal domain of BamD has been suggested to mediate the BamA-BamD interface; specifically, BamD R197 forms a salt bridge with BamA E373 (Ricci et al., 2012). It has been postulated that this salt bridge is not strictly structural, with R197 highlighted as a key amino acid in BamD activity and the salt bridge acting as a “check-point” in BAM complex activity (Ricci et al., 2012; Storek et al., 2023). Our results agree with this, showing that the C-terminus of BamD acts in substrate recognition and alignment of the β-signal (Fig. 6, Fig S12). We show that amino acids in the vicinity of R197 (N196, G200, D204) cross-linked well to substrate and that mutations to the β-signal prevent this interaction (Fig S12B, D). For mutational analysis of BamD, we then looked at the conservation of the C-terminus of BamD and determined that R197 was the most highly conserved amino acid (Fig 6C). In order to account for this, we have adjusted the manuscript:

      (Lines 376-377) "R197 has previously been isolated as a suppressor mutation of a BamA temperature sensitive strain (Ricci et al., 2012)."

      (Lines 495-496) "This adds an additional role of the C-terminus of BamD beyond a complex stability role (Ricci et al., 2012; Storek et al., 2023)."

      17. As demonstrated by the authors in Fig 8, the mutations in BamD lead to reduction in OMP levels for more than just OmpC and issues with the membrane are clearly observable with Y62A, although not with R197A in the presence of VCN. The authors should also test with rifampicin which is smaller and would monitor even more subtle issues with the membrane. Oddly, no growth was observed for the Vec control in the lower concentration of VCN, but was near WT levels for 3 times VCN, how is this explained?

      While it would be interesting to correlate the extent of differences to the molecular size of different antibiotics such as rifampicin, such correlations are not the intended aim of our study. Vancomycin (VCN) is a standard measure of outer membrane integrity in our field, hence its use in our tests for membrane integrity.

      We apologize to the reviewer as Figure 8 D-G may have been misleading. Figure 8D, E use bamD shut-down cells expressing plasmid-borne BamD mutants, whereas Figure 8F, G use the same strain as in Figure 3. We have adjusted the figure as well as the figure legend: (Lines 1165-1169) "D, E, E. coli bamD depletion cells expressing mutations at residues, Y62A and R197A, in the β-signal recognition regions of BamD were grown in the presence of VCN. F, G, E. coli cells expressing mutations to OmpC internal signal, as shown in Fig 3, grown in the presence of VCN. Mutations to two key residues of the internal signal were sensitive to the presence of VCN."

      18. While Fig 8I indeed shows diminished levels for FY as stated, little difference was observed for the trimer for the other mutants compared to WT, although differences were observed for the dimer. Interestingly, the VY mutant has nearly WT levels of dimer. What do the authors postulate is going on here with the dimer to trimer transition? How do the levels of monomer compare, which is not shown?

      The BN-PAGE gel system cannot resolve protein species that migrate below ~50 kDa, and the monomer species of the OMPs is below this size. We can’t comment on effects on the monomer because it is not visualized. The non-cropped gel image is shown here. Recently, Hussain et al. have shown that, in an in vitro proteoliposome system, OmpC assembly progresses through a “short-lived dimeric” form before the final process of trimerization (Hussain et al., 2021). However, their findings suggest that LPS plays the final role in stimulating the dimer-to-trimer transition, a step well past the recognition step of the β-signals. Mutations to the internal signal of OmpC result in the formation of an intermediate, the substrate stalled on the BAM complex. This stalling, presumably, causes a hindrance to the BAM complex, resulting in reduced trimer and loss of dimer OmpF signal in the EMM of cells expressing the OmpC double mutant, FY. We have noted this in the revised text:

      Author response image 5.

      Non-cropped gel of Fig. 8I. The asterisk indicates a band observed in the sample loading wells at the top of the gel.

      (Lines 417-418) "The dimeric form of endogenous OmpF was prominently observed in both the OmpC(WT) as well as the OmpC(VY) double mutant cells."

      19. In the discussion, the authors indicate they have '...defined an internal signal for OMP assembly', however, their study is limited and only investigates a specific region of OmpC. More is needed to definitively say this for even OmpC, and even more so to indicate this is a general feature for all OMPs.

      We acknowledge the reviewer's comment on this point and have expanded the statement to make sure that the conclusion is justified with the specific evidence that is shown in the paper and the supplementary data. We now state:

      (Lines 444-447) "This internal signal corresponds to the -5 strand in OmpC and is recognized by BamD. Sequence analysis shows that similar sequence signatures are present in other OMPs (Figs. S5, S6 and S7). These sequences were investigated in two further OMPs: OmpF and LamB (Fig. 2C and D)."

      Note, we did not state that this is a general feature for all OMPs. That would not be a reasonable proposition.

      20.

      • In the proposed model in Fig 9, it is hard to conceive how 5 strands will form along BamD given the limited surface area and tight space beneath BAM.

      • More concerning is that the two proposed interaction sites on BamD, Y62 and R197, are on opposite sides of the BamD structure, not along the same interface, which makes this model even more unlikely.

      • As evidence against this model, in Figure 9E, the two indicated sites of BamD are not even in close proximity of the modeled substrate strands.

      We can address the reviewer’s three concerns here:

      i. The first point is that the region (formed by BamD engaged with POTRA domains 1-2 and 5 of BamA) is not sufficient to accommodate five β-strands. Structural analysis reveals that the interface between the N-terminal side of BamD and POTRA1-2 changes conformation substantially upon substrate binding, and that this surface is greatly extended. This surface does have enough space to accommodate five β-strands, as now documented in Fig. 9D, 9E using the latest structures (7TT5 and 7TT2) as illustrations of this. The text now reads:

      (Lines 506-515) "Spatially, this indicates that BamD can serve to organize two distinct parts of the nascent OMP substrate at the periplasmic face of the BAM complex, either prior to or in concert with engagement to the lateral gate of BamA. Assessing this structurally showed that, of the N-terminal region of BamD (interacting with the POTRA1-2 region of BamA) and the C-terminal region of BamD (interacting with POTRA5 proximal to the lateral gate of BamA) (Bakelar et al., 2016; Gu et al., 2016; Tomasek et al., 2020), the N-terminal region changes conformation depending on the folding states of the last four β-strands of the substrate OMP, EspP (Doyle et al., 2022). The overall effect of this is a change in the dimensions of this cavity, a change which is dependent on the folded state of the substrate engaged in it (Fig 9 B-E)."

      ii. The second point concerns the orientation of the substrate recognition residues of BamD. Both Y62 and R197 are located on the lumen side of the funnel in the EspP-BAM transport intermediate structure (PDB ID: 7TTC). Y62 lies close to the edge of BamD but, given that POTRA1-2 undergoes a conformational change that opens this region, as described above, both residues are positioned where they could bind substrates. This is explained in the following text in the Results section of the revised manuscript.

      (Lines 377-379) "Each residue was located on the lumen side of the funnel-like structure in the EspP-BAM assembly intermediate structure (PDBID; 7TTC) (Doyle et al., 2022)."

      Reviewer #2 (Public Review):

      Previously, in a bioinformatics study, the authors identified potential sequence motifs that are common to a large subset of beta-barrel outer membrane proteins in Gram-negative bacteria. Interestingly, in that study, some of those motifs were located in the internal strands of barrels (not near the termini), in addition to the well-known "beta-signal" motif in the C-terminal region.

      Here, the authors carried out rigorous biochemical, biophysical, and genetic studies to prove that the newly identified internal motifs are critical to the assembly of outer membrane proteins and the interaction with the BAM complex. The author's approaches are rigorous and comprehensive, whose results reasonably well support the conclusions. While overall enthusiastic, I have some scientific concerns with the rationale of the neutron refractory study, and the distinction between "the intrinsic impairment of the barrel" vs "the impairment of interaction with BAM" that the internal signal may play a role in. I hope that the authors will be able to address this.

      Strengths:

      1. It is impressive that the authors took multi-faceted approaches using the assays on reconstituted, cell-based, and population-level (growth) systems.

      2. Assessing the role of the internal motifs in the assembly of model OMPs in the absence and presence of BAM machinery was a nice approach for a precise definition of the role.

      Weaknesses:

      1. The result section employing the neutron refractory (NR) needs to be clarified and strengthened in the main text (from line 226). In the current form, the NR result seems not so convincing.

      What is the rationale of the approach using NR?

      We have now modified the text to make clear that:

      (Lines 276-280) "The rationale to these experiments is that NR provides: (i) information on the distance of specified subunits of a protein complex away from the atomically flat gold surface to which the complex is attached, and (ii) allows the addition of samples between measurements, so that multi-step changes can be made to, for example, detect changes in domain conformation in response to the addition of a substrate."

      What is the molecular event (readout) that the method detects?

      We have now modified the text to make clear that:

      (Lines 270-274) "While the biochemical assay demonstrated that the OmpC(Y286A) mutant forms a stalled intermediate with the BAM complex, in a state in which membrane insertion was not completed, biochemical assays such as this cannot elucidate where on BamA-BamD this OmpC(Y286A) substrate is stalled."

      What are "R"-y axis and "Q"-x axis and their physical meanings (Fig. 5b)?

      The neutron reflectivity, R, is the ratio of the reflected to the incident neutron beam intensity, and it is measured as a function of the momentum transfer Q, defined as Q = 4π sinθ/λ, where θ is the angle of incidence and λ is the neutron wavelength. R(Q) is approximately given by R(Q) ≈ (16π²/Q²)|ρ(Q)|², where ρ(Q) is the one-dimensional Fourier transform of ρ(z), the scattering length density (SLD) distribution normal to the surface. SLD is the sum of the coherent neutron scattering lengths of all atoms in the sample layer divided by the volume of the layer. Therefore, the intensity of the reflected beam is highly dependent on the thickness, density, and interface roughness of the sample layers. This is explained in the following text in the Methods section of the revised manuscript.

      (Lines 669-678) "Neutron reflectivity, denoted as R, is the ratio of the reflected to the incident neutron beam intensity. It is measured as a function of the momentum transfer Q, which is defined by the formula Q = 4π sinθ/λ, where θ represents the angle of incidence and λ stands for the neutron wavelength. The approximate value of R(Q) can be expressed as R(Q) ≈ (16π²/Q²)|ρ(Q)|², where ρ(Q) is the one-dimensional Fourier transform of ρ(z), which is the scattering length density (SLD) distribution perpendicular to the surface. SLD is calculated by dividing the sum of the coherent neutron scattering lengths of all atoms in a sample layer by the volume of that layer. Consequently, factors such as thickness, volume fraction, and interface roughness of the samples significantly influence the intensity of the reflected beams."
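
      To make the relation between Q, R(Q) and layer thickness concrete, the following minimal Python sketch evaluates the approximate expression quoted above for a single uniform layer. It is an illustration only, not the seven-layer fitting procedure used in the study: the function names, the single free-standing slab, and the numerical values (SLD in Å⁻², thickness in Å) are assumptions chosen for the example, and the approximation is not reliable at very low Q near total reflection.

      ```python
      import cmath
      import math


      def momentum_transfer(theta_deg: float, wavelength: float) -> float:
          """Q = 4*pi*sin(theta)/lambda, in inverse units of the wavelength."""
          return 4.0 * math.pi * math.sin(math.radians(theta_deg)) / wavelength


      def slab_reflectivity_born(q: float, sld: float, thickness: float) -> float:
          """Approximate R(Q) = (16*pi^2 / Q^2) * |rho_hat(Q)|^2 for one slab.

          rho_hat(Q) is the Fourier transform of a profile equal to `sld` for
          0 <= z <= thickness and zero elsewhere (a hypothetical single layer).
          """
          rho_hat = sld * (cmath.exp(1j * q * thickness) - 1.0) / (1j * q)
          return 16.0 * math.pi ** 2 / q ** 2 * abs(rho_hat) ** 2


      # Illustrative values only (assumptions, not taken from the study).
      example_sld = 4.5e-6       # scattering length density, 1/A^2
      example_thickness = 80.0   # layer thickness, A
      for q in (0.03, 0.05, 2.0 * math.pi / example_thickness):
          r = slab_reflectivity_born(q, example_sld, example_thickness)
          print(f"Q = {q:.4f} 1/A, R = {r:.3e}")
      ```

      In this simplified picture the reflectivity of a slab of thickness d drops to a minimum whenever Q is a multiple of 2π/d, so the spacing of the fringes in an R versus Q plot reflects the layer thickness.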

      How are the "layers" defined from the plot (Fig. 5b)?

      The “layers” in the plot (Fig. 5b) represent different regions of the sample being studied. In this study, we used a seven-layer model to fit the experimental data (chromium - gold - NTA - His8 - β-barrel - P3-5 - P1-2). This is explained in the following text in the figure legend of the revised manuscript. (Lines 1115-1116) "The experimental data was fitted using a seven-layer model: chromium - gold - NTA - His8 - β-barrel - P3-5 - P1-2."

      What are the meanings of "thickness" and "roughness" (Fig. 5c)?

      We used neutron reflectometry to determine the relative positions of BAM subunits in a membrane environment. The binding of certain subunits induced conformational changes in other parts of the complex. When a substrate membrane protein is added, the periplasmic POTRA domain of BamA extends further away from the membrane surface. This could result in an increase in thickness as observed in neutron reflectometry measurements.

      As for roughness, it is related to the interface properties of the sample. In neutron reflectometry, the intensity of the reflected beams is highly dependent on the thickness, densities, and interface roughness of the samples. An increase in roughness could suggest changes in these properties, possibly due to protein-membrane interactions or structural changes within the membrane.

      (Lines 1116-1120) "The table summarizes the thickness, roughness and volume fraction data of each layer from the NR analysis. The thickness refers to the depth of the layered structures being studied, as measured in Å. The roughness refers to the irregularities in the surface of the layered structures being studied, as measured in Å."

      What does "SLD" stand for?

      We apologize for not defining the abbreviation SLD at its first use. It is now defined in the revised manuscript (Line 298).

      1. In the result section, "The internal signal is necessary for insertion step of assembly into OM" This section presents an important result that the internal beta-signal is critical to the intrinsic propensity of barrel formation, distinct from the recognition by BAM complex. However, this point is not elaborated in this section. For example, what is the role of these critical residues in the barrel structure formation? That is, are they involved in any special tertiary contacts in the structure or in membrane anchoring of the nascent polypeptide chains?

      We appreciate the reviewer's comment on this point. Both position 0 and position 6 appear to be important amino acids for recognition by the BAM complex, since mutations introduced at these positions in peptide 18 prevent competitive inhibition activity.

      In terms of the tertiary structure of OmpC, position 6 is an amino acid that contributes to the aromatic girdle, and since Y286A and Y365A affected OMP folding as measured in folding experiments, it is perhaps their position in the aromatic girdle that contributes to the efficiency of β-barrel folding in addition to its function as a recognition signal. We have added a sentence in the revised manuscript:

      (Lines 233-236) "Position 6 is an amino acid that contributes to the aromatic girdle. Since Y286A and Y365A affected OMP folding as measured in folding experiments, their positioning into the aromatic girdle may contribute to the efficiency of β-barrel folding, in addition to contributing to the internal signal."

      The mutations made at position 0 had no effect on folding, so this residue may function solely in the signal. Given the register of each β-strand in the final barrel, the position 0 residues have side-chains that face out into the lipid environment. From examination of the OmpC crystal structure, the residue at position 0 makes no special tertiary contacts with other, neighbouring residues.  

      Reviewer #1 (Recommendations For The Authors):

      Minor critiques (in no particular order):

      1. Peptide 18 was identified based on its strong inhibition for EspP assembly but another peptide, peptide 23, also shows inhibition and has no particular consensus.

      We would like to correct this point: peptide 23 has a strong consensus to the canonical β-signal. We had explained the sequence consensus of the β-signal in the Results section of the text. In the third paragraph, we have added a sentence indicating the relationship between peptide 18 and peptide 23.

      (Lines 152-168) "Six peptides (4, 10, 17, 18, 21, and 23) were found to inhibit EspP assembly (Fig. 1A). Of these, peptide 23 corresponds to the canonical β-signal of OMPs: it is the final β-strand of OmpC and it contains the consensus motif of the β-signal (ζxGxx[Ω/Φ]x[Ω/Φ]). The inhibition seen with peptide 23 indicated that our peptidomimetics screening system using EspP can detect signals recognized by the BAM complex. In addition to inhibiting EspP assembly, five of the most potent peptides (4, 17, 18, 21, and 23) inhibited additional model OMPs: the porins OmpC and OmpF, the peptidoglycan-binding OmpA, and the maltoporin LamB (Fig. S3). Comparing the sequences of these inhibitory peptides suggested the presence of a sub-motif from within the β-signal, namely [Ω/Φ]x[Ω/Φ] (Fig. 1B). The sequence codes refer to conserved residues such that: ζ is any polar residue; G is a glycine residue; Ω is any aromatic residue; Φ is any hydrophobic residue and x is any residue (Hagan et al., 2015; Kutik et al., 2008). The non-inhibitory peptide 9 contained some elements of the β-signal but did not show inhibition of EspP assembly (Fig. 1A).

      Peptide 18 also showed a strong sequence similarity to the consensus motif of the β-signal (Fig. 1B) and, like peptide 23, had a strong inhibitory action on EspP assembly (Fig. 1A). Variant peptides based on the peptide 18 sequence were constructed and tested in the EMM assembly assay (Fig. 1C)."
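
      As an illustration of how the motif notation in the quoted passage can be read, the following minimal Python sketch translates the β-signal consensus ζxGxx[Ω/Φ]x[Ω/Φ] and the sub-motif [Ω/Φ]x[Ω/Φ] into regular expressions. The residue class memberships and the test peptide are assumptions made for this example (the text defines the classes only in words), and the sketch is not the sequence analysis used in the study.

      ```python
      import re

      # Assumed one-letter residue classes; these memberships are illustrative.
      POLAR = "DEHKNQRST"        # zeta: any polar residue
      AROMATIC = "FWY"           # Omega: any aromatic residue
      HYDROPHOBIC = "ACFILMVWY"  # Phi: any hydrophobic residue
      OMEGA_OR_PHI = "".join(sorted(set(AROMATIC + HYDROPHOBIC)))

      # "x" (any residue) becomes "." in the regular expression.
      BETA_SIGNAL = f"[{POLAR}].G..[{OMEGA_OR_PHI}].[{OMEGA_OR_PHI}]"
      SUB_MOTIF = f"[{OMEGA_OR_PHI}].[{OMEGA_OR_PHI}]"


      def find_motif(sequence, pattern):
          """Return (start_index, matched_text) for every hit, overlaps included."""
          lookahead = re.compile(f"(?=({pattern}))")
          return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence.upper())]


      # Hypothetical peptide, invented for this example (not an OmpC sequence).
      peptide = "NAGVAYRF"
      print(find_motif(peptide, BETA_SIGNAL))
      print(find_motif(peptide, SUB_MOTIF))
      ```

      The lookahead wrapper is used so that overlapping occurrences of the short sub-motif are all reported, which a plain left-to-right search would skip.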

      1. It is unclear why the authors immediately focused on BamD rather than BamB, given that both were mentioned to mediate interaction with substrate. Was BamB also tested?

      We thank the reviewer for this comment. Following the reviewer's suggestion, we have now performed a pull-down experiment on BamB and added it to Fig. S9. We also modified the text of the results as follows.

      (Lines 262-265) "Three subunits of the BAM complex have been previously shown to interact with the substrates: BamA, BamB, and BamD (Hagan et al., 2013; Harrison, 1996; Ieva et al., 2011). An in vitro pull-down assay showed that while BamA and BamD can independently bind to the in vitro translated OmpC polypeptide (Fig. S9A), BamB did not (Fig. S9B)."

      1. For the in vitro folding assays of the OmpC substrates, labeled and unlabeled, no mention of adding SurA or any other chaperone which is known to be important for mediating OMP biogenesis in vitro.

      We appreciate the reviewer’s concerns on this point; however, chaperones such as SurA are non-essential factors in the OMP assembly reaction mediated by the BAM complex: the surA gene is not essential and the assembly of OMPs can be measured in the absence of exogenously added SurA. It remains possible that addition of SurA to some of these assays could be useful in detailing aspects of chaperone function in the context of the BAM complex, but that was not the intent of this study.

      1. For the supplementary document, it would be much easier for the reader to have the legends grouped with the figures.

      Following the reviewer's suggestion, we have placed the legends of Supplemental Figures together with each Figure.

      1. Some of the figures and their captions are not grouped properly and are separated which makes it hard to interpret the figures efficiently.

      We thank the reviewer for this comment. We have revised the manuscript and figures so that each figure and its caption are grouped together on a single page.

      1. The authors begin their 'Discussion' with a question (line 454), however, they don't appear to answer or even attempt to address it; suggest removing rhetorical questions.

      As per the reviewer’s suggestion, we have removed this question.

      1. Line 464, 'unbiased' should be removed. This would imply that if not stated, experiments are 'negatively' biased.

      We removed this word and revised the sentence as follows:

      (Lines 431-433) "In our experimental approach to assess for inhibitory peptides, specific segments of the major porin substrate OmpC were shown to interact with the BAM complex as peptidomimetic inhibitors."

      1. Lines 466-467; '...go well beyond expected outcomes.' What does this statement mean?

      Our peptidomimetics led to unexpected results in elucidating the additional essential signal elements. The manuscript was revised as follows:

      (Lines 433-435) "Results for this experimental approach went beyond expected outcomes by identifying the essential elements of the signal Φxxxxxx[Ω/Φ]x[Ω/Φ] in β-strands other than the C-terminal strand."

      1. Line 478; '...rich information that must be oversimplified...'?

      We appreciate the reviewer pointing this out. For clarity, the manuscript was revised as follows:

      (Lines 450-453) "The abundance of information which arises from modeling approaches and from the multitude of candidate OMPs, is generally oversimplified when written as a primary structure description typical of the β-signal for bacterial OMPs (i.e. ζxGxx[Ω/Φ]x[Ω/Φ]) (Kutik et al., 2008)."

      1. There are typos in the supplementary figures.

      We have revised and corrected the Supplemental Figure legends.  

      Reviewer #2 (Recommendations For The Authors):

      1. In Supplementary Information, I recommend adding the figure legends directly to the corresponding figures. Currently, it is very inconvenient to go back and forth between legends and figures.

      Following the reviewer's suggestion, we have placed the legends of Supplemental Figures together with each Figure.

      1. Line 94 (p.3): "later"

      Lateral?

      Yes. We have corrected this.

      1. Line 113 (p.3): The result section, "Peptidomimetics derived from E. coli OmpC inhibit OMP assembly" Rationale of the peptide inhibition assay is not clear. How can the peptide sequence that effectively inhibit the assembly interpreted as the b-assembly signal? By competitive binding to BAM or by something else? What is the authors' hypothesis in doing this assay?

      In the revision, we have added the following sentences to explain the aim and design of the peptidomimetics:

      (Lines 140-145) "The addition of peptides with BAM complex affinity, such as the OMP β-signal, is capable of exerting an inhibitory effect by competing for binding of substrate OMPs to the BAM complex (Hagan et al., 2015). Thus, the addition of peptides derived from the entirety of OMPs to the EMM assembly assay, which can evaluate assembly efficiency with high accuracy, is expected to identify novel regions that have affinity for the BAM complex."

      1. Line 113- (p.3) and Fig. S1: The result section, "Peptidomimetics derived from E. coli OmpC inhibit OMP assembly"

      Some explanation seems to be needed why b-barrel domain of EspP appears even without ProK?

      We appreciate the reviewer pointing this out. We added the following sentences to explain:

      (Lines 128-137) "EspP, a model OMP substrate, belongs to the autotransporter family of proteins. Autotransporters have two domains: (1) a β-barrel domain, assembled into the outer membrane via the BAM complex, and (2) a passenger domain, which traverses the outer membrane via the lumen of the β-barrel domain itself and is subsequently cleaved by the correctly assembled β-barrel domain (Celik et al., 2012). When EspP is correctly assembled into the outer membrane, a visible decrease in the molecular mass of the protein is observed due to the self-proteolysis. Once the barrel domain is assembled into the membrane it becomes protease-resistant, with residual unassembled and passenger domains degraded (Leyton et al., 2014; Roman-Hernandez et al., 2014)."

      1. Line 186 (p.6): "Y285"

      Y285A?

      We have corrected the error; it should read Y285A.

      1. Lines 245- (p. 7)/ Lines 330- (p. 10)

      It needs to be clarified that the results described in these paragraphs were obtained from the assays with EMM.

      We appreciate the reviewer’s concerns on these points. For the first half, the following text was added at the beginning of the applicable paragraph to indicate that all of Fig. 4 is the result of the EMM assembly assay.

      (Line 241) "We further analyzed the role of internal β-signal by the EMM assembly assay."

      For the second half, we used purified BamD rather than EMM. We have stated this clearly with the following sentence:

      (Lines 316-318) "We purified 40 different BPA variants of BamD, and then irradiated them with UV after incubation with 35S-labelled OmpC."

    1. Author response:

      The following is the authors’ response to the original reviews.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Lines 40-42: The sentence "The coupling of structural connectome (SC) and functional connectome (FC) varies greatly across different cortical regions reflecting anatomical and functional hierarchies as well as individual differences in cognitive function, and is regulated by genes" is a misstatement. Regional variations of structure-function coupling do not really reflect differences in cognitive function among individuals, but inter-subject variations do.

      Thank you for your comment. We have revised the sentence to correct the misstatement. Please see lines 40-43: “The coupling of structural connectome (SC) and functional connectome (FC) varies greatly across different cortical regions reflecting anatomical and functional hierarchies[1, 6-9] and is regulated by genes[6, 8], and its individual differences relate to cognitive function[8, 9].”

      (2) In Figure 1, the graph showing the relation between intensity and cortical depth needs explanation.

      Thank you for your comment. We have added the necessary explanation; please see lines 133-134: “The MPC was used to map similarity networks of intracortical microstructure (voxel intensity sampled at different cortical depths) for each cortical node.”

      (3) Line 167: Change "increased" to "increase".

      We have corrected it, please see lines 173-174: “…networks significantly increased with age and exhibited greater increase.”

      (4) Line 195: Remove "were".

      We have corrected it, please see line 204: “…default mode networks significantly contributed to the prediction…”

      (5) Lines 233-240, Reproducibility analyses: Comparisons of parcellation templates were not made with respect to gene weights. Is there any particular reason?

      Thank you for your comment. We have quantified the gene weights based on HCPMMP using the same procedures and identified a correlation (r = 0.25, p < 0.001) between the gene weights in HCPMMP and BNA. Given that this correlation is relatively weak, we would like to clarify the following points.

      Based on HCPMMP, we produced an averaged gene expression profile for 10,027 genes covering 176 left cortical regions[1]. The exclusion of 4 cortical regions with an insufficient number of assigned samples may contribute to the relatively weak correlation of gene associations between templates. Moreover, the effect of template resolution on the results of human connectome-transcriptome association is still unclear.

      In brain connectome analysis, the choice of parcellation template can indeed influence subsequent findings to some extent. A methodological study[2] reported reference correlations of about 0.4~0.6 for white matter connectivity and 0.2~0.4 for white matter nodal properties between two templates (refer to Figures 4 and 5 in [2]). The age-related coupling changes, as a downstream analysis, were calculated from the multimodal connectome and correlated with gene expression profiles, and may therefore be influenced by the choice of template.

      We have further supplemented the gene weight results obtained from HCPMMP to explicitly clarify their dependence on the parcellation template.

      Please see lines 251-252: “The gene weights of HCPMMP were consistent with those of BNA (r = 0.25, p < 0.001).”

      Author response image 1.

      The consistency of gene weights between HCPMMP and BNA.

      Please see lines 601-604: “Finally, we produced an averaged gene expression profile for 10,027 genes covering 176 left cortical regions based on HCPMMP and obtained the gene weights by PLS analysis. We performed Pearson's correlation analyses to assess the consistency of gene weights between HCPMMP and BNA.”
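
      The consistency check described above, a Pearson correlation between gene weights obtained from two parcellation templates, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the dictionary layout (gene identifier mapped to weight) and the toy values are hypothetical, and the sketch returns only the correlation coefficient, not the p-value reported in the response.

      ```python
      import math


      def pearson_r(x, y):
          """Pearson correlation coefficient of two equal-length sequences."""
          if len(x) != len(y) or len(x) < 2:
              raise ValueError("need two equal-length sequences with at least 2 values")
          mx, my = sum(x) / len(x), sum(y) / len(y)
          sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
          sxx = sum((a - mx) ** 2 for a in x)
          syy = sum((b - my) ** 2 for b in y)
          if sxx == 0.0 or syy == 0.0:
              raise ValueError("correlation is undefined for a constant sequence")
          return sxy / math.sqrt(sxx * syy)


      def weight_consistency(weights_a, weights_b):
          """Correlate two gene -> weight mappings over the genes they share."""
          shared = sorted(set(weights_a) & set(weights_b))
          r = pearson_r([weights_a[g] for g in shared], [weights_b[g] for g in shared])
          return r, len(shared)


      # Hypothetical gene weights for two templates (invented values).
      template_one = {"gene_1": 0.8, "gene_2": -0.1, "gene_3": 0.4, "gene_4": 0.0}
      template_two = {"gene_1": 0.5, "gene_2": 0.2, "gene_3": 0.1, "gene_5": 0.9}
      r, n_shared = weight_consistency(template_one, template_two)
      print(f"r = {r:.3f} over {n_shared} shared genes")
      ```

      Restricting the comparison to shared genes matters because the two templates need not retain exactly the same gene set after preprocessing.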

      Reviewer #2 (Recommendations For The Authors):

      Your paper is interesting to read and I found your efforts to evaluate the robustness of the results of different parcellation strategies and tractography methods very valuable. The work is globally easy to navigate and well written with informative good-quality figures, although I think some additional clarifications will be useful to improve readability. My suggestions and questions are detailed below (I aimed to group them by topic which did not always succeed so apologies if the comments are difficult to navigate, but I hope they will be useful for reflection and to incorporate in your work).

      * L34: 'developmental disorder'

      ** As far as I understand, the subjects in HCP-D are mostly healthy (L87). Thus, while your study provides interesting insights into typical brain development, I wonder if references to 'disorder' might be premature. In the future, it would be interesting to extend your approach to the atypical populations. In any case, it would be extremely helpful and appreciated if you included a figure visualising the distribution of behavioural scores within your population and in relationship to age at scan for your subjects (and to include a more detailed description of the assessment in the methods section) given that large part of your paper focuses on their prediction using coupling inputs (especially given a large drop of predictive performance after age correction). Such figures would allow the reader to better understand the cognitive variability within your data, but also potential age relationships, and generally give a better overview of your cohort.

      We agree with your comment that references to 'disorder' are premature. We have made revisions in the abstract and conclusion.

      Please see lines 33-34: “This study offers insight into the maturational principles of SC-FC coupling in typical development.”

      Please see lines 395-396: “Further investigations are needed to fully explore the clinical implications of SC-FC coupling for a range of developmental disorders.”

      In addition, we have included a more detailed description of the cognitive scores in the methods section and provided a figure to visualize the distributions of cognitive scores and in relationship to age for subjects. Please see lines 407-413: “Cognitive scores. We included 11 cognitive scores which were assessed with the National Institutes of Health (NIH) Toolbox Cognition Battery (https://www.healthmeasures.net/exploremeasurement-systems/nih-toolbox), including episodic memory, executive function/cognitive flexibility, executive function/inhibition, language/reading decoding, processing speed, language/vocabulary comprehension, working memory, fluid intelligence composite score, crystal intelligence composite score, early child intelligence composite score and total intelligence composite score. Distributions of these cognitive scores and their relationship with age are illustrated in Figure S12.”

      Author response image 2.

      Cognitive scores and age distributions of scans.

      * SC-FC coupling

      ** L162: 'Regarding functional subnetworks, SC-FC coupling increased disproportionately with age (Figure 3C)'.

      *** As far as I understand, in Figure 3C, the points are the correlation with age for a given ROI within the subnetwork. Is this correct? If yes, I am not sure how this shows a disproportionate increase in coupling. It seems that there is great variability of SC-FC correlation with age across regions within subnetworks, more so than the differences between networks. This would suggest that the coupling with age is regionally dependent rather than network-dependent? Maybe you could clarify?

      The points in Figure 3C are the correlations with age for a given ROI within the subnetwork. We have revised the description; please see lines 168-174: “Age correlation coefficients distributed within functional subnetworks are shown in Figure 3C. Regarding mean SC-FC coupling within functional subnetworks, the somatomotor (β_age = 2.39E-03, F = 4.73, p = 3.10E-06, r = 0.25, p = 1.67E-07, Figure 3E), dorsal attention (β_age = 1.40E-03, F = 4.63, p = 4.86E-06, r = 0.24, p = 2.91E-07, Figure 3F), frontoparietal (β_age = 2.11E-03, F = 6.46, p = 2.80E-10, r = 0.33, p = 1.64E-12, Figure 3I) and default mode (β_age = 9.71E-04, F = 2.90, p = 3.94E-03, r = 0.15, p = 1.19E-03, Figure 3J) networks significantly increased with age and exhibited greater increase.” In addition, we agree with your comment that the coupling with age is more likely region-dependent than network-dependent. We have added the description; please see lines 329-332: “We also found the SC-FC coupling with age across regions within subnetworks has more variability than the differences between networks, suggesting that the coupling with age is more likely region-dependent than network-dependent.” This is why our subsequent analysis focused on regional coupling.

      *** Additionally, we see from Figure 3C that regions within networks have very different changes with age. Given this variability (especially in the subnetworks where you show both positive and negative correlations with age for specific ROIs (i.e. all of them)), does it make sense then to show mean coupling over regions within the subnetworks which erases the differences in coupling with age relationships across regions (Figures 3D-J)?

      Given the interest in, and interpretation of, SC-FC coupling at the network level, we consider it necessary to show the correlation of mean subnetwork coupling with age, although this averages out variability at the regional scale. The results at both scales confirm that coupling predominantly increases with age in this age group.

      *** Also, I think it would be interesting to show correlation coefficients across all regions, not only the significant ones (3B). Is there a spatially related tendency of increases/decreases (rather than a 'network' relationship)? Would it be interesting to show a similar figure to Figure S7 instead of only the significant regions?

      Following your comment, we have added a graph showing correlation coefficients across all regions to Figure 3B. We have similarly updated the other figures (Figures S3-S6).

      Author response image 3.

      Age-related changes in SC-FC coupling. (A) Increases in whole-brain coupling with age. (B) Correlation of age with SC-FC coupling across all regions and significant regions (p<0.05, FDR corrected). (C) Comparisons of age-related changes in SC-FC coupling among functional networks. The boxes show the median and interquartile range (IQR; 25–75%), and the whiskers depict 1.5× IQR from the first or third quartile. (D-J) Correlation of age with SC-FC coupling across the VIS, SM, DA, VA, LIM, FP and DM. VIS, visual network; SM, somatomotor network; DA, dorsal attention network; VA, ventral attention network; LIM, limbic network; FP, frontoparietal network; DM, default mode network.

      *** For the quantification of MPC.

      **** L421: you reconstructed 14 cortical surfaces from the wm to pial surface. If we take the max thickness of the cortex to be 4.5mm (Fischl & Dale, 2000), the sampling is above the resolution of your anatomical images (0.8mm). Could you expand on what the interest is in sampling such a higher number of surfaces given that the resolution is not enough to provide additional information?

      The surface reconstruction was based on state-of-the-art equivolumetric surface construction techniques[3], which provide a simplified recapitulation of cellular changes across the putative laminar structure of the cortex. By referencing a 100-μm resolution Merker-stained 3D histological reconstruction of an entire post mortem human brain (BigBrain: https://bigbrain.loris.ca/main.php), a methodological study[4] systematically evaluated MPC stability with four to 30 intracortical surfaces when the resolution of the anatomical image was 0.7 mm, and selected 14 surfaces as the most stable solution. Importantly, it has been shown that the in vivo approach can serve as a lower resolution yet biologically meaningful extension of the histological work[4].

      **** L424: did you aggregate intensities over regions using mean/median or other statistics?

      It might be useful to specify.

      Thank you for your careful comment. We have revised the description in lines 446-447: “We averaged the intensity profiles of vertices over 210 cortical regions according to the BNA”.

      **** L426: personal curiosity, why did you decide to remove the negative correlation of the intensity profiles from the MPC? Although this is a common practice in functional analyses (where the interpretation of negatives is debated), within the context of cortical correlations, the negative values might be interesting and informative on the level of microstructural relationships across regions (if you want to remove negative signs it might be worth taking their absolute values instead).

      We agree with your comment that the interpretation of negative correlations in MPC is debated. Considering that MPC is a nascent approach to network modeling, we adopted a more conservative strategy of removing negative correlations, following the study [4] that proposed the approach. As you note, the negative correlations might be informative. We will continue to explore the information on microstructural relationships that negative correlations may carry.
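
      The two processing choices discussed in the preceding responses, averaging vertex-wise intensity profiles within each region and discarding negative correlations between regional profiles, can be sketched as follows. This is a simplified illustration and not the authors' pipeline: it uses a plain Pearson correlation, whereas the published MPC procedure may include additional steps not shown here, and the function names, region labels and four-point profiles (instead of 14 surfaces) are assumptions.

      ```python
      import math


      def region_mean_profiles(vertex_profiles, vertex_regions):
          """Average depth-wise intensity profiles over the vertices of each region."""
          grouped = {}
          for profile, region in zip(vertex_profiles, vertex_regions):
              grouped.setdefault(region, []).append(profile)
          return {
              region: [sum(depth) / len(depth) for depth in zip(*profiles)]
              for region, profiles in grouped.items()
          }


      def pearson_r(x, y):
          """Pearson correlation coefficient of two equal-length sequences."""
          mx, my = sum(x) / len(x), sum(y) / len(y)
          sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
          sxx = sum((a - mx) ** 2 for a in x)
          syy = sum((b - my) ** 2 for b in y)
          return sxy / math.sqrt(sxx * syy)


      def similarity_without_negatives(region_profiles):
          """Pairwise profile correlations with negative values set to zero."""
          regions = sorted(region_profiles)
          return {
              (a, b): max(pearson_r(region_profiles[a], region_profiles[b]), 0.0)
              for i, a in enumerate(regions)
              for b in regions[i + 1:]
          }


      # Hypothetical toy data: four vertices, three regions, four depth samples.
      vertex_profiles = [[1, 2, 3, 4], [3, 4, 5, 6], [2, 4, 6, 9], [4, 3, 2, 1]]
      vertex_regions = ["A", "A", "B", "C"]
      means = region_mean_profiles(vertex_profiles, vertex_regions)
      similarity = similarity_without_negatives(means)
      print(means)
      print(similarity)
      ```

      Replacing `max(r, 0.0)` with `abs(r)` would implement the reviewer's alternative of keeping the magnitude of negative correlations instead of discarding them.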

      **** L465: could you please expand on the notion of self-connections, it is not completely evident what this refers to.

      We have revised the description in lines 493-494: “𝑁𝑐 is the number of connections (𝑁𝑐 = 245 for BNA)”.

      **** Paragraph starting on L467: did you evaluate the multicollinearities between communication models? It is possibly rather high (especially for the same models with similar parameters (listed on L440-444)). Such dependence between variables might affect the estimates of feature importance (given the predictive models only care to minimize error, highly correlated features can be selected as a strong predictor while the impact of other features with similarly strong relationships with the target is minimized thus impacting the identification of reliable 'predictors').

      We agree with your comment. The covariance structure (multicollinearity) among the communication models is likely to lead to unreliable predictor weights. In our study, we applied Haufe's inversion transform[5], which addresses this issue by computing the covariance between the predicted FC and each communication model in the training set. For more details on Haufe's inversion transform, please see [5]. We have further clarified this in the manuscript; please see lines 497-499: “And covariance structure among the predictors may lead to unreliable predictor weights. Thus, we applied Haufe's inversion transform[38] to address these issues and identify reliable communication mechanisms.”
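
      A minimal sketch of the covariance step described above, assuming a single linear model with weight vector `w` over a predictor matrix `X` (rows are training samples, columns are communication models). The function name `haufe_pattern` and the toy data are illustrative assumptions, not the authors' code.

```python
import numpy as np

def haufe_pattern(X, w):
    """Covariance between each predictor and the model's prediction.

    X: (n_samples, n_predictors) training predictors.
    w: (n_predictors,) fitted linear weights.
    Returns one value per predictor: cov(x_j, y_hat).
    """
    Xc = X - X.mean(axis=0)           # centre each predictor
    y_hat = Xc @ w                    # centred prediction
    return Xc.T @ y_hat / (X.shape[0] - 1)

# Toy example: two nearly collinear predictors. A fitted model may put
# all weight on the first one, yet both relate equally to the output.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(500)
x2 = x1 + 0.01 * rng.standard_normal(500)
X = np.column_stack([x1, x2])
w = np.array([1.0, 0.0])              # hypothetical fitted weights
pattern = haufe_pattern(X, w)
```

      Here the raw weights suggest that the second predictor is irrelevant, while the transformed pattern assigns both predictors nearly the same value, which is why the transform is useful when predictors are highly correlated.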

      **** L474: I am not completely familiar with spin tests but to my understanding, this is a spatial permutation test. I am not sure how this applies to the evaluation of the robustness of feature weight estimates per region (if this was performed per region), it would be useful to provide a bit more detail to make it clearer.

      Following your comment, we have added the details; please see lines 503-507: “Next, we generated 1,000 FC permutations through a spin test[86] for each nodal prediction in each subject and obtained random distributions of model weights. These weights were averaged over the group and were investigated the enrichment of the highest weights per region to assess whether the number of highest weights across communication models was significantly larger than that in a random discovery.”

      **** L477: 'significant communication models were used to represent WMC...', but in L103 you mention you select 3 models: communicability, mean first passage, and flow graphs. Do you want to say that only 3 models were 'significant' and these were exactly the same across all regions (and data splits/ parcellation strategies/ tractography methods)? In the methods, you describe a lot of analysis and testing but it is not completely clear how you come to the selection of the final 3, it would be beneficial to clarify. Also, the final 3 were selected on the whole dataset first and then the pipeline of SC-FC coupling/age assessment/behaviour predictions was run for every (WD, S1, S2) for both parcellations schemes and tractography methods or did you end up with different sets each time? It would be good to make the pipeline and design choices, including the validation bit clearer (a figure detailing all the steps which extend Figure 1 would be very useful to understand the design/choices and how they relate to different runs of the validation).

      Thank you for your comment. In all reproducibility analyses, we used the same three models, which were selected in the main pipeline (probabilistic tractography and BNA parcellation). Following your comment, we produced a figure showing the model selection pipeline as an extension of Figure 1. For the description, please see lines 106-108: “We used these three models to represent the extracortical connectivity properties in subsequent discovery and reproducibility analyses (Figure S1).”

      Author response image 4.

      Pipeline of model selection and reproducibility analyses.

      **** Might the imbalance of features between structural connectivity and MPC affect the revealed SC-FC relationships (3 vs 1)? Why did you decide on this ratio rather than for example best WM structural descriptor + MPC?

      We understand your concern. The WMC communication models represent diverse geometric, topological, or dynamic factors. To describe the properties of WMC as fully as possible, we selected, from the 27 models and after controlling for covariance structure, the three communication models that significantly predicted FC. Compared with MPC, this does present a potential feature imbalance. However, the results still support the conclusion that coupling models incorporating microarchitectural properties yield more accurate predictions of FC from SC[6, 7]. The relevant experiments are shown in Figure S2 below. Using only the best WM structural descriptor might lose some communication properties of WMC.

      **** L515: were intracranial volume and in-scanner head motion related to behavioural measures? These variables likely impact the inputs, do you expect them to influence the outcome assessments? Or is there a mistake on L518 and you actually corrected the input features rather than the behaviour measures?

      In-scanner head motion and intracranial volume are related to some age-adjusted behavioural measures, as shown in the following table. The regression of covariates from the cognitive measures follows two cognitive prediction studies [8, 9]. Please see lines 549-554: “Prior to applying the nested fivefold cross-validation framework to each behaviour measure, we regressed out covariates including sex, intracranial volume, and in-scanner head motion from the behaviour measure[59, 69]. Specifically, we estimated the regression coefficients of the covariates using the training set and applied them to the testing set. This regression procedure was repeated for each fold.”
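
      The fold-wise covariate regression quoted above (coefficients estimated on the training set, then applied to the testing set) can be sketched as follows. This is a minimal illustration using ordinary least squares; the function name `residualize` and the toy data are assumptions for the example, not the authors' implementation.

```python
import numpy as np

def residualize(y_train, C_train, y_test, C_test):
    """Regress covariates out of a behaviour measure without leakage.

    y_*: (n,) behaviour scores; C_*: (n, k) covariates
    (e.g. sex, intracranial volume, head motion).
    Coefficients are fitted on the training set only and then
    applied unchanged to the testing set.
    """
    A_train = np.column_stack([np.ones(len(C_train)), C_train])
    A_test = np.column_stack([np.ones(len(C_test)), C_test])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return y_train - A_train @ beta, y_test - A_test @ beta

# Toy fold: the score depends exactly linearly on one covariate,
# so both residual vectors should be numerically zero.
C_train = np.arange(10, dtype=float).reshape(-1, 1)
C_test = np.array([[10.0], [11.0]])
y_train = 2.0 + 3.0 * C_train[:, 0]
y_test = 2.0 + 3.0 * C_test[:, 0]
res_train, res_test = residualize(y_train, C_train, y_test, C_test)
```

      In a cross-validation loop this function would be called once per fold, so that no information from the testing set enters the estimated coefficients.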

      Author response table 1.

      ** Additionally, in the paper, you propose that the incorporation of cortical microstructural (myelin-related) descriptors with white-matter connectivity to explain FC provides for 'a more comprehensive perspective for characterizing the development of SC-FC coupling' (L60). This combination of cortical and white-matter structure is indeed interesting, however the benefits of incorporating different descriptors could be studied further. For example, comparing results of using only the white matter connectivity (assessed through selected communication models) ~ FC vs (white matter + MPC) ~ FC vs MPC ~ FC. Which descriptors better explain FC? Are the 'coupling trends' similar (or the same)? If yes, what is the additional benefit of using the more complex combination? This would also add strength to your statement at L317: 'These discrepancies likely arise from differences in coupling methods, highlighting the complementarity of our methods with existing findings'. Yes, discrepancies might be explained by the use of different SC inputs. However, it is difficult to see how discrepancies highlight complementarity - does MPC (and combination with wm) provide additional information to using wm structural alone?

      Following your comment, we have added analyses that use only the myelin-related predictor or only WM connectivity to predict FC, and compared the results across models. Please see lines 519-521: “In addition, we have constructed the models using only MPC or SCs to predict FC, respectively. Spearman’s correlation was used to assess the consistency between spatial patterns based on different models.”

      Please see lines 128-130: “In addition, the coupling pattern based on other models (using only MPC or only SCs to predict FC) and the comparison between the models were shown in Figure S2A-C.” Please see lines 178-179: “The age-related patterns of SC-FC coupling based other coupling models were shown in Figure S2D-F.”

      Although the coupling patterns were spatially consistent across models, incorporating MPC with SC connectivity improved the prediction of FC relative to models based on only MPC or only SC. For age-related changes in coupling, the differences between the models were further amplified. We agree that the complementarity cannot be explicitly quantified, and we have revised the description; please see line 329: “These discrepancies likely arise from differences in coupling methods.”

      Author response image 5.

      Comparison results between different models. Spatial pattern of mean SC-FC coupling based on MPC ~ FC (A), SCs ~ FC (B), and MPC + SCs ~ FC (C). Correlation of age with SC-FC coupling across cortex based on MPC ~ FC (D), SCs ~ FC (E), and MPC + SCs ~ FC (F).

      ** For the interpretation of results: L31 'SC-FC coupling is positively associated with genes in oligodendrocyte-related pathways and negatively associated with astrocyte-related gene'; L124: positive myelin content with SC-FC coupling...and similarly on L81, L219, L299, L342, and L490:

      ***You use a T1/T2 ratio which is (in large part) a measure of myelin to estimate the coupling between SC and FC. Evaluation with SC-FC coupling with myelin described in Figure 2E is possibly biased by the choice of this feature. Similarly, it is possible that reported positive associations with oligodendrocyte-related pathways and SC-FC coupling in your work could in part result from a bias introduced by the 'myelin descriptor' (conversely, picking up the oligodendrocyte-related genes is a nice corroboration for the T1/T2 ratio being a myelin descriptor, so that's nice). However, it is possible that if you used a different descriptor of the cortical microstructure, you might find different expression patterns associated with the SC-FC coupling (for example using neurite density index might pick up neuronal-related genes?). As mentioned in my previous suggestions, I think it would be of interest to first use only the white matter structural connectivity feature to assess coupling to FC and assess the gene expression in the cortical regions to see if the same genes are related, and subsequently incorporate MPC to dissociate potential bias of using a myelin measure from genetic findings.

      Thank you for your insightful comments. In this paper, however, the core method of measuring coupling is to predict functional connections from multimodal structural connections, which may yield more information than a single modality. We agree that examining the genes associated with SCs and MPC separately could lead to interesting discoveries. We will continue to explore this in the future.

      ** Generally, I find it difficult to understand the interpretation of SC-FC coupling measures and would be interested to hear your thinking about this. As you mention on L290-294, how well SC predicts FC depends on which input features are used for the coupling assessment (more complex communication models, incorporating additional microstructural information etc 'yield more accurate predictions of FC' L291) - thus, calculated coupling can be interpreted as a measure of how well a particular set of input features explain FC (different sets will explain FC more or less well) ~ coupling is related to a measure of 'missing' information on the SC-FC relationship which is not contained within the particular set of structural descriptors - with this approach, the goal might be to determine the set that best, i.e. completely, explains FC to understand the link between structure and function. When you use the coupling measures for comparisons with age, cognition prediction etc, the 'status' of the SC-FC changes, it is no longer the amount of FC explained by the given SC descriptor set, but it's considered a descriptor in itself (rather than an effect of feature selection / SC-FC information overlap) - how do you interpret/argue for this shift of use?

      Thank you for your comment. In this paper, we obtain reasonable SC-FC coupling by determining the optimal set of structural features to explain function. The coupling essentially measures the direct correspondence between structure and function. Studying the relationship of coupling with age and cognition therefore amounts to studying how this direct correspondence between structure and function relates to age and cognition.

      ** In a similar vein to the above comment, I am interested to hear what you think: on L305 you mention that 'perfect SC-FC coupling may be unlikely'. Would this reasoning suggest that functional activity takes place through other means than (and is therefore somehow independent of) biological (structural) substrates? For now, I think one can only say that we have imperfect descriptors of the structure so there is always information missing to explain function, this however does not mean the SC and FC are not perfectly coupled (only that we look at insufficient structural descriptors - limitations of what imaging can assess, what we measure etc). This is in line with L305 where you mention that 'Moreover, our results suggested that regional preferential contributions across different SCs lead to variations in the underlying communication process'. This suggests that locally different areas might use different communication models which are not reflected in the measures of SC-FC coupling that was employed, not that the 'coupling' is lower or higher (or coupling is not perfect). This is also a change in approach to L293: 'This configuration effectively releases the association cortex from strong structural constraints' - the 'release' might only be in light of the particular structural descriptors you use - is it conceivable that a different communication model would be more appropriate (and show high coupling) in these areas.

      Thank you for your insightful comments. We have changed the description; please see lines 315-317: “SC-FC coupling is dynamic and changes throughout the lifespan[7], particularly during adolescence[6,9], suggesting that perfect SC-FC coupling may require sufficient structural descriptors.”

      *Cognitive predictions:

      ** From a practical stand-point, do you think SC-FC coupling is a better (more accurate) indicator of cognitive outcomes (for example for future prediction studies) than each modality alone (which is practically easier to obtain and process)? It would be useful to check the behavioural outcome predictions for each modality separately (as suggested above for coupling estimates). In case SC-FC coupling does not outperform each modality separately, what is the benefit of using their coupling? Similarly, it would be useful to compare to using only cortical myelin for the prediction (which you showed to increase in importance for the coupling). In the case of myelin->coupling-> intelligence, if you are able to predict outcomes with the same performance from myelin without the need for coupling measures, what is the benefit of coupling?

      From the standpoint of predictive performance, we do not believe that SC-FC coupling is a better indicator than a single modality (voxel, network, or other indicator). Our aim was to assess whether SC-FC coupling is related to individual differences in cognitive performance, rather than to demonstrate its predictive power over other measures. As you suggest, separating the modalities and comparing their power to predict cognition is a very interesting perspective. We will continue to explore this issue in future studies.

      ** The statement on L187 'suggesting that increased SC-FC coupling during development is associated with higher intelligence' might not be completely appropriate before age corrections (especially given the large drop in performance that suggests confounding effects of age).

      According to your comment, we have removed the statement.

      ** L188: it might be useful to report the range of R across the outer cross-validation folds as from Figure 4A it is not completely clear that the predictive performance is above the random (0) threshold. (For the sake of clarity, on L180 it might be useful for the reader if you directly report that other outcomes were not above the random threshold).

      Following your comment, we have added the range of R and revised the description; please see lines 195-198: “Furthermore, even after controlling for age, SC-FC coupling remained a significant predictor of general intelligence better than at chance (Pearson’s r=0.11±0.04, p=0.01, FDR corrected, Figure 4A). For fluid intelligence and crystal intelligence, the predictive performances of SC-FC coupling were not better than at chance (Figure 4A).”

      In a similar vein, in the text, you report Pearson's R for the predictive results but Figure 4A shows predictive accuracy - accuracy is a different (categorical) metric. It would be good to homogenise to clarify predictive results.

      We have made the corresponding changes in Figure 4.

      Author response image 6.

      Encoding individual differences in intelligence using regional SC-FC coupling. (A) Predictive accuracy of fluid, crystallized, and general intelligence composite scores. (B) Regional distribution of predictive weight. (C) Predictive contribution of functional networks. The boxes show the median and interquartile range (IQR; 25–75%), and the whiskers depict the 1.5× IQR from the first or third quartile.

      *Methods and QC:

      -Parcellations

      ** It would be useful to mention briefly how the BNA was applied to the data and if any quality checks were performed for the resulting parcellations, especially for the youngest subjects which might be most dissimilar to the population used to derive the atlas (healthy adults HCP subjects) ~ question of parcellation quality.

      We have added the description, please see lines 434-436: “The BNA[31] was projected on native space according to the official scripts (http://www.brainnetome.org/resource/) and the native BNA was checked by visual inspection.” 

      ** Additionally, the appropriateness of structurally defined regions for the functional analysis is also a topic of important debate. It might be useful to mention the above as limitations (which apply to most studies with similar focus).

      We have added your comment to the methodological issues, please see lines 378-379: “Third, the appropriateness of structurally defined regions for the functional analysis is also a topic of important debate.”

      - Tractography

      ** L432: it might be useful to name the method you used (probtrackx).

      We have added this name to the description, please see lines 455-456: “probabilistic tractography (probtrackx)[78, 79] was implemented in the FDT toolbox …”

      ** L434: 'dividing the total fibres number in source region' - dividing by what?

      We have revised the description, please see line 458: “dividing by the total fibres number in source region.”
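
      The normalization described in this revision (streamline counts divided by the total fibre number in the source region) can be sketched as follows. This is a minimal illustration; the function name `connectivity_density`, the array layout (rows as source regions), and the handling of empty source regions are assumptions made for the example.

```python
import numpy as np

def connectivity_density(counts, source_totals):
    """Normalize streamline counts by the source region's total.

    counts:        (n, n) streamlines from source i reaching target j.
    source_totals: (n,) total fibre number in each source region.
    Rows whose total is zero are left at zero (an assumption here).
    """
    counts = np.asarray(counts, dtype=float)
    totals = np.asarray(source_totals, dtype=float)[:, None]
    density = np.zeros_like(counts)
    np.divide(counts, totals, out=density, where=totals > 0)
    return density

# Toy example with two regions; the second has no fibres sampled.
counts = [[0, 50],
          [20, 0]]
source_totals = [100, 0]
density = connectivity_density(counts, source_totals)
```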

      ** L436: 'connections in subcortical areas were removed' - why did you trace connections to subcortical areas in the first place if you then removed them (to match with cortical MPC areas I suspect)? Or do you mean there were spurious streamlines through subcortical regions that you filtered?

      There are two reasons. First, we need to match the cortical MPC. Second, as we state in the methodological issues, accurately resolving the connections of small structures within subcortical regions using whole-brain diffusion imaging and tractography techniques remains a challenge[10, 11].

      ** Following on the above, did you use any exclusion masks during the tracing? In general, more information about quality checks for the tractography would be useful. For example, L437: did you do any quality evaluations based on the removed spurious streamlines? For example, were there any trends between spurious streamlines and the age of the subject? Distance between regions/size of the regions?

      We did not use any exclusion masks. We performed visual inspection for the tractography quality and did not assess the relationship between spurious streamlines and age or distance between regions/size of the regions.

      ** L439: 'weighted probabilistic network' - this was weighted by the filtered connectivity densities or something else?

      The probabilistic network is weighted by the filtered connectivity densities.

      ** I appreciate the short description of the communication models in Text S1, it is very useful.

      Thank you for your comment.

      ** In addition to limitations mentioned in L368 - during reconstruction, have you noticed problems resolving short inter-hemispheric connections?

      We had not considered this issue and have added it to the limitations; please see lines 383-384: “In addition, the reconstruction of short connections between hemispheres is a notable challenge.”

      - Functional analysis:

      ** There is a difference in acquisition times between participants below and above 8 years (21 vs 26 min), does the different length of acquisition affect the quality of the processed data?

      We applied relatively strict quality control to ensure the quality of the processed data.

      ** L446 'regressed out nuisance variables' - it would be informative to describe in more detail what you used to perform this.

      We have provided more detail about the regression of nuisance variables, please see lines 476-477: “The nuisance variables were removed from time series based on general linear model.”

      ** L450-452: it would be useful to add the number of excluded participants to get an intuition for the overall quality of the functional data. Have you checked if the quality is associated with the age of the participant (which might be related to motion etc). Adding a distribution of remaining frames across participants (vs age) would be useful to see in the supplementary methods to better understand the data you are using.

      We have added information on the exclusion of subjects during data processing, together with the distributions of motion and remaining frames and their correlations with age. Please see lines 481-485: “Quality control. The exclusion of participants in the whole multimodal data processing pipeline was depicted in Figure S13. In the context of fMRI data, we computed Pearson’s correlation between motion and age, as well as between the number of remaining frames and age, for the included participants aged 5 to 22 years and 8 to 22 years, respectively. These correlations were presented in Figure S14.”

      Author response image 7.

      Exclusion of participants in the whole multimodal data processing pipeline.  

      Author response image 8.

      Figure S14. Correlations between motion and age and number of remaining frames and age.

      ** L454: 'Pearson's correlation's... ' In contrast to MPC you did not remove negative correlations in the functional matrices. Why this choice?

      Whether negative correlations in functional signals should be removed has long been controversial. Previous studies of SC-FC coupling[12-14] have widely retained negative correlations, and we chose this strategy to retain more information. Because MPC is a nascent approach to network modeling, we adopted a more conservative strategy there and removed negative correlations, following the study [4] that proposed the approach.

      - Gene expression:

      ** L635, you focus on the left cortex, is this common? Do you expect the gene expression to be fully symmetric (given reported functional hemispheric asymmetries)? It might be good to expand on the reasoning.

      An important consideration regarding sample assignment arises from the fact that only two out of six brains were sampled from both hemispheres and four brains have samples collected only in the left. This sparse sampling should be carefully considered when combining data across donors[1]. We have supplemented the description, please see lines 569-571: “Restricting analyses to the left hemisphere will minimize variability across regions (and hemispheres) in terms of the number of samples available[40].”

      ** Paragraph of L537: you use evolution of coupling with age (correlation) and compare to gene expression with adults (cohort of Allen Human Brain Atlas - no temporal evolution to the gene expressions) and on L369 you mention that 'relative spatial patterns of gene expressions remain stable after birth'. Of course this is not a place to question previous studies, but would you really expect the gene expression associated with the temporary processes to remain stable throughout the development? For example, myelination would follow different spatiotemporal gradient across brain regions, is it reasonable to expect that the expression patterns remain the same? How do you then interpret a changing measure of coupling (correlation with age) with a gene expression assessed statically?

      We agree that spatial expression patterns are expected to vary across developmental periods. We have revised the previous description; please see lines 383-386: “Fifth, it is important to acknowledge that changes in gene expression levels during development may introduce bias in the results.”

      - Reproducibility analyses:

      ** Paragraph L576: are we to understand that you performed the entire pipeline 3 times (WD, S1, S2) for both parcellations schemes and tractography methods (~12 times) including the selection of communication models and you always got the same best three communication models and gene expression etc? Or did you make some design choices (i.e. selection of communication models) only on a specific set-up and transfer to other settings?

      The communication models were chosen at the outset, which we have clarified in the article; please see lines 106-108: “We used these three models to represent the extracortical connectivity properties in subsequent discovery and reproducibility analyses (Figure S1).” For the reproducibility analyses (parcellation, tractography, and split-half validation), we fixed the other settings and assessed the impact of a single factor at a time.

      ** Paragraph of L241: I really appreciate you evaluated the robustness of your results to different tractography strategies. It is reassuring to see the similarity in results for the two approaches. Did you notice any age-related effects on tractography quality for the two methods given the wide age range (did you check?)

      In our study, tractography quality was checked by visual inspection. Using quantitative tools to assess tractography quality in future studies could answer this question objectively.

      ** Additionally, I wonder how much of that overlap is driven by the changes in MPC which is the same between the two methods... especially given its high weight in the SC-FC coupling you reported earlier in the paper. It might be informative to directly compare the connectivity matrices derived from the two tracto methods directly. Generally, as mentioned in the previous comments, I think it would be interesting to assess coupling using different input settings (with WM structural and MPC separate and then combined).

      Following your previous comment, we have examined the coupling patterns, coupling differences, correlations of coupling with age, and spatial correlations between the patterns based on different models, as shown in Figure S2. Please see our response to the previous comment for details.

      ** L251 - I also wonder if the random splitting is best adapted to validation in your case given you study relationships with age. Would it make more sense to make stratified splits to ensure a 'similar age coverage' across splits?

      In our study, we adopted a random splitting process repeated 1,000 times to minimize bias due to data partitioning. The stratification you mention is a reasonable method, and keeping the age distribution balanced would likely yield higher similarity across splits than our validation method. However, the similarity obtained with our method is sufficient to support the generalizability of our findings.
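
      For illustration only, the age-stratified alternative raised by the reviewer could look like the sketch below. It is not the procedure used in the study, which relied on repeated random splits; the function name `stratified_halves`, the quantile-based age bins, and the number of bins are assumptions made for the example.

```python
import numpy as np

def stratified_halves(ages, n_bins=5, seed=0):
    """Split subjects into two halves with similar age coverage.

    Subjects are grouped into quantile-based age bins; each bin is
    shuffled and divided between the two halves.
    Returns two sorted arrays of subject indices.
    """
    ages = np.asarray(ages, dtype=float)
    rng = np.random.default_rng(seed)
    # Interior quantile edges define n_bins age strata.
    edges = np.quantile(ages, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(ages, edges)
    half_a, half_b = [], []
    for b in np.unique(bins):
        idx = rng.permutation(np.flatnonzero(bins == b))
        cut = len(idx) // 2
        half_a.extend(idx[:cut])
        half_b.extend(idx[cut:])
    return np.array(sorted(half_a)), np.array(sorted(half_b))

# Toy cohort: 100 subjects spread evenly between 5 and 22 years.
ages = np.linspace(5, 22, 100)
s1, s2 = stratified_halves(ages)
```

      Repeating the call with different `seed` values would give repeated stratified splits, analogous to the 1,000 random repetitions described above.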

      Minor comments

      L42: 'is regulated by genes'

      ** Coupling (if having a functional role and being regulated at all) is possibly resulting from a complex interplay of different factors in addition to genes, for example, learning/environment, it might be more cautious to use 'regulated in part by genes' or similar.

      We have corrected it, please see line 42.

      L43 (and also L377): 'development of SC-FC coupling'

      ** I know this is very nitpicky and depends on your opinion about the nature of SC-FC coupling, but 'development of SC-FC coupling' gives an impression of something maturing that has a role 'in itself' (for example development of eye from neuroepithelium to mature organ etc.). For now, I am not sure it is fully certain that SC-FC coupling is more than a byproduct of the comparison between SC and FC, using 'changes in SC-FC coupling with development' might be more apt.

      We have corrected it, please see lines 43-44.

      L261 'SC-FC coupling was stronger ... [] ... and followed fundamental properties of cortical organization.' vs L168 'No significant correlations were found between developmental changes in SC-FC coupling and the fundamental properties of cortical organization'.

      **Which one is it? I think in the first you refer to mean coupling over all infants and in the second about correlation with age. How do you interpret the difference?

      Between the ages of 5 and 22 years, we found that the mean SC-FC coupling pattern has become similar to that of adults, consistent with the fundamental properties of cortical organization. However, the developmental changes in SC-FC coupling are heterogeneous and sequential and do not follow the mean coupling pattern to change in the same magnitude.

      L277: 'temporal and spatial complexity'

      ** Additionally, communication models have different assumptions about the flow within the structural network and will have different biological plausibility (they will be more or less 'realistic').

      Here temporal and spatial complexity is from a computational point of view.

      L283: 'We excluded a centralized model (shortest paths), which was not biologically plausible' ** But in Text S1 and Table S1 you specify the shortest paths models. Does this mean you computed them but did not incorporate them in the final coupling computations even if they were predictive?

      ** Generally, I find the selection of the final 3 communication models confusing. It would be very useful if you could clarify this further, for example in the methods section.

      We used all twenty-seven communication models (including shortest paths) to predict FC at the node level for each participant. We then identified three communication models that significantly predicted FC. The shortest path model was excluded because it did not meet the significance criteria. We have added further methodological details to this section; please see lines 503-507.

      L332 'As we observed increasing coupling in these [frontoparietal network and default mode network] networks, this may have contributed to the improvements in general intelligence, highlighting the flexible and integrated role of these networks' vs L293 'SC-FC coupling in association areas, which have lower structural connectivity, was lower than that in sensory areas. This configuration effectively releases the association cortex from strong structural constraints imposed by early activity cascades, promoting higher cognitive functions that transcend simple sensori-motor exchanges'

      ** I am not sure I follow the reasoning. Could you expand on why it would be the decoupling promoting the cognitive function in one case (association areas generally), but on the reverse the increased coupling in frontoparietal promoting the cognition in the other (specifically frontoparietal)?

      We have tried to clarify this point: for general intelligence, increased coupling in the frontoparietal network could allow more effective information integration and enable efficient collaboration between different cognitive processes.

      * Formatting errors etc.

      L52: maybe rephrase?

      We have rephrased it; please see lines 51-53: “The T1- to T2-weighted (T1w/T2w) ratio of MRI has been proposed as a means of quantifying microstructure profile covariance (MPC), which reflects a simplified recapitulation in cellular changes across intracortical laminar structure[6, 12-15].”

      L68: specialization1,[20].

      We have corrected it.

      L167: 'networks significantly increased with age and exhibited greater increased' - needs rephrasing.

      We have corrected it.

      L194: 'networks were significantly predicted the general intelligence' - needs rephrasing.

      We have corrected it, please see lines 204-205: “we found that the weights of frontoparietal and default mode networks significantly contributed to the prediction of the general intelligence.”

      L447: 'and temporal bandpass filtering' - there is a verb missing.

      We have corrected it, please see line 471: “executed temporal bandpass filtering.”

      L448: 'greater than 0.15' - unit missing.

      We have corrected it, please see line 472: “greater than 0.15 mm”.

      L452: 'After censoring, regression of nuisance variables, and temporal bandpass filtering,' - no need to repeat the steps as you mentioned them 3 sentences earlier.

      We have removed it.

      L458-459: sorry I find this description slightly confusing. What do you mean by 'modal'? Connectional -> connectivity profile. The whole thing could be simplified, if I understand correctly your vector of independent variables is a set of wm and microstructural 'connectivity' of the given node... if this is not the case, please make it clearer.

      We have corrected it, please see line 488: “where 𝒔𝑖 is the 𝑖th SC profiles, 𝑛 is the number of SC profiles”.

      L479: 'values and system-specific of 480 coupling'.

      We have corrected it.

      L500: 'regular' - regularisation.

      We have changed it to “regularization”.

      L567: Do you mean that in contrast to probabilistic with FSL you use deterministic methods within Camino? For L570, you introduce communication models through 'such as': did you fit all models like before? If not, it might be clearer to just list the ones you estimated rather than introduce through 'such as'.

      We have changed the description to avoid ambiguity, please see lines 608-609: “We then calculated the communication properties of the WMC including communicability, mean first passage times of random walkers, and flow graphs (timescales=1).”
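
      For readers unfamiliar with these measures, weighted communicability is commonly defined as the matrix exponential of the strength-normalized connectivity matrix, exp(D^(-1/2) W D^(-1/2)). The sketch below computes it in plain Python with a truncated Taylor series. It is an illustrative implementation of that standard definition, not the code used in the study; the helper names and the toy matrix are hypothetical, and every node is assumed to have non-zero strength.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def communicability(w, n_terms=30):
    """Weighted communicability exp(D^-1/2 W D^-1/2) via a truncated Taylor series."""
    n = len(w)
    strength = [sum(row) for row in w]  # node strengths (row sums), assumed non-zero
    # Normalize each weight by the square root of both endpoint strengths.
    norm = [[w[i][j] / (strength[i] * strength[j]) ** 0.5 for j in range(n)]
            for i in range(n)]
    # Start with the identity matrix (the k = 0 term of the series).
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, n_terms):
        # term_k = term_{k-1} @ norm / k, which equals norm^k / k!
        term = [[v / k for v in row] for row in matmul(term, norm)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


# Toy symmetric connectivity matrix for two connected regions (hypothetical weights).
w = [[0.0, 2.0],
     [2.0, 0.0]]
c = communicability(w)
```

      For this two-node example the normalized matrix has ones off the diagonal, so the result is [[cosh 1, sinh 1], [sinh 1, cosh 1]], which gives a quick sanity check on the implementation.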

      Citation [12], it is unusual to include competing interests in the citation, moreover, Dr. Bullmore mentioned is not in the authors' list - this is most likely an error with citation import, it would be good to double-check.

      We have corrected it.

      L590: Python scripts used to perform PLS regression can be found at https://scikitlearn.org/. The link leads to general documentation for sklearn.

      We have corrected it, please see lines 627-630: “Python scripts used to perform PLS regression can be found at https://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.PLSRegression.html#sklearn.cross_decomposition.PLSRegression.”

      P26 and 27 - there are two related sections: Data and code availability and Code availability - it might be worth merging into one section if possible.

      We have corrected it, please see lines 623-633.

      References

      (1) Arnatkeviciute A, Fulcher BD, Fornito A. A practical guide to linking brain-wide gene expression and neuroimaging data. Neuroimage. 2019;189:353-67. Epub 2019/01/17. doi: 10.1016/j.neuroimage.2019.01.011. PubMed PMID: 30648605.

      (2) Zhong S, He Y, Gong G. Convergence and divergence across construction methods for human brain white matter networks: an assessment based on individual differences. Hum Brain Mapp. 2015;36(5):1995-2013. Epub 2015/02/03. doi: 10.1002/hbm.22751. PubMed PMID: 25641208; PubMed Central PMCID: PMCPMC6869604.

      (3) Waehnert MD, Dinse J, Weiss M, Streicher MN, Waehnert P, Geyer S, et al. Anatomically motivated modeling of cortical laminae. Neuroimage. 2014;93 Pt 2:210-20. Epub 2013/04/23. doi: 10.1016/j.neuroimage.2013.03.078. PubMed PMID: 23603284.

      (4) Paquola C, Vos De Wael R, Wagstyl K, Bethlehem RAI, Hong SJ, Seidlitz J, et al. Microstructural and functional gradients are increasingly dissociated in transmodal cortices. PLoS Biol. 2019;17(5):e3000284. Epub 2019/05/21. doi: 10.1371/journal.pbio.3000284. PubMed PMID: 31107870.

      (5) Haufe S, Meinecke F, Gorgen K, Dahne S, Haynes JD, Blankertz B, et al. On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage. 2014;87:96-110. Epub 2013/11/19. doi: 10.1016/j.neuroimage.2013.10.067. PubMed PMID: 24239590.

      (6) Demirtas M, Burt JB, Helmer M, Ji JL, Adkinson BD, Glasser MF, et al. Hierarchical Heterogeneity across Human Cortex Shapes Large-Scale Neural Dynamics. Neuron. 2019;101(6):1181-94 e13. Epub 2019/02/13. doi: 10.1016/j.neuron.2019.01.017. PubMed PMID: 30744986; PubMed Central PMCID: PMCPMC6447428.

      (7) Deco G, Kringelbach ML, Arnatkeviciute A, Oldham S, Sabaroedin K, Rogasch NC, et al. Dynamical consequences of regional heterogeneity in the brain's transcriptional landscape. Sci Adv. 2021;7(29). Epub 2021/07/16. doi: 10.1126/sciadv.abf4752. PubMed PMID: 34261652; PubMed Central PMCID: PMCPMC8279501.

      (8) Chen J, Tam A, Kebets V, Orban C, Ooi LQR, Asplund CL, et al. Shared and unique brain network features predict cognitive, personality, and mental health scores in the ABCD study. Nat Commun. 2022;13(1):2217. Epub 2022/04/27. doi: 10.1038/s41467-022-29766-8. PubMed PMID: 35468875; PubMed Central PMCID: PMCPMC9038754.

      (9) Li J, Bzdok D, Chen J, Tam A, Ooi LQR, Holmes AJ, et al. Cross-ethnicity/race generalization failure of behavioral prediction from resting-state functional connectivity. Sci Adv. 2022;8(11):eabj1812. Epub 2022/03/17. doi: 10.1126/sciadv.abj1812. PubMed PMID: 35294251; PubMed Central PMCID: PMCPMC8926333.

      (10) Thomas C, Ye FQ, Irfanoglu MO, Modi P, Saleem KS, Leopold DA, et al. Anatomical accuracy of brain connections derived from diffusion MRI tractography is inherently limited. Proc Natl Acad Sci U S A. 2014;111(46):16574-9. Epub 2014/11/05. doi: 10.1073/pnas.1405672111. PubMed PMID: 25368179; PubMed Central PMCID: PMCPMC4246325.

      (11) Reveley C, Seth AK, Pierpaoli C, Silva AC, Yu D, Saunders RC, et al. Superficial white matter fiber systems impede detection of long-range cortical connections in diffusion MR tractography. Proc Natl Acad Sci U S A. 2015;112(21):E2820-8. Epub 2015/05/13. doi: 10.1073/pnas.1418198112. PubMed PMID: 25964365; PubMed Central PMCID: PMCPMC4450402.

      (12) Gu Z, Jamison KW, Sabuncu MR, Kuceyeski A. Heritability and interindividual variability of regional structure-function coupling. Nat Commun. 2021;12(1):4894. Epub 2021/08/14. doi: 10.1038/s41467-021-25184-4. PubMed PMID: 34385454; PubMed Central PMCID: PMCPMC8361191.

      (13) Liu ZQ, Vazquez-Rodriguez B, Spreng RN, Bernhardt BC, Betzel RF, Misic B. Time-resolved structure-function coupling in brain networks. Commun Biol. 2022;5(1):532. Epub 2022/06/03. doi: 10.1038/s42003-022-03466-x. PubMed PMID: 35654886; PubMed Central PMCID: PMCPMC9163085.

      (14) Zamani Esfahlani F, Faskowitz J, Slack J, Misic B, Betzel RF. Local structure-function relationships in human brain networks across the lifespan. Nat Commun. 2022;13(1):2053. Epub 2022/04/21. doi: 10.1038/s41467-022-29770-y. PubMed PMID: 35440659; PubMed Central PMCID: PMCPMC9018911.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study presents valuable new insights into HIV-associated nephropathy (HIVAN) kidney phenotype in the Tg26 transgenic mouse model and delineates the kidney cell types that express HIV genes and are injured in these HIV-transgenic mice. A series of compelling experiments demonstrated that PKR inhibition can ameliorate HIVAN with reversal of mitochondrial dysfunction (mainly confined to endothelial cells), a prominent feature shared in other kidney diseases. Although there are concerns regarding the specificity of C16 to PKR inhibition, as well as with the in situ hybridization studies, the data suggests that inhibition of PKR and mitochondrial dysfunction has potential clinical significance for HIVAN.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      HIV-associated nephropathy (HIVAN) is a rapidly progressing form of kidney disease that manifests secondary to untreated HIV infection, and is predominantly seen in individuals of African descent. Tg26 mice carrying an HIV transgene lacking gag and pol exhibit high levels of albuminuria and rapid decline in renal function that recapitulates many features of HIVAN in humans. HIVAN is seen predominantly in individuals carrying two copies of missense variants in the APOL1 gene, and the authors have previously shown that APOL1 risk variant mRNA induces activity of the double-strand RNA sensor kinase PKR. Because of the tight association between the APOL1 risk genotype and HIVAN, the authors hypothesized that PKR activation may mediate renal injury in Tg26 mice and tested this hypothesis by treating mice with a commonly used PKR inhibitory compound called C16. Treatment with C16 substantially attenuated renal damage in the Tg26 model as measured by urinary albumin/creatinine ratio, urinary NGAL/creatinine ratio, and improvement in histology. The authors then performed bulk and single-nucleus RNAseq on kidneys from mice from different treatment groups to identify pathways and patterns of cell injury associated with HIV transgene expression as well as to determine the mechanistic basis for the effect of C16 treatment. They show that proximal tubule nuclei from Tg26 mice appear to have more mitochondrial transcripts which was reversed by C16 treatment and suggest that this may provide evidence of mitochondrial dysfunction in this model. They explore this hypothesis by showing there is a decrease in the expression of nuclear-encoded genes and proteins involved in oxidative phosphorylation as well as a decrease in respiratory capacity via functional assessment of respiration in tubule and glomerular preparations from these mouse kidneys. All of these changes were reversed by C16 treatment. 
      The authors propose the existence of a novel injured proximal tubule cell-type characterized by the leak of mitochondrial transcripts into the nucleus (PT-Mito). Analysis of HIV transgene expression showed high level expression in podocytes, consistent with the pronounced albuminuria that characterizes this model and HIVAN, but transcripts were also detected in tubular and endothelial cells. Because of the absence of mitochondrial transcripts in the podocytes, the authors speculate that glomerular mitochondrial dysfunction in this model is driven by damage to glomerular endothelial cells.

      Strengths:

      The strengths of this study include the comprehensive transcriptional analysis of the Tg26 model, including an evaluation of HIV transgene expression, which has not been previously reported. This data highlights that HIV transcripts are expressed in a subset of podocytes, consistent with the highly proteinuric disease seen in mice and humans. However, transcripts were also seen in other tubular cells, notably intercalated cells, principal cells and injured proximal tubule cells. Though the podocyte expression makes sense, the relevance of the tubular expression to human disease is still an open question.

      The data in support of mitochondrial dysfunction are also robust and rely on combined evidence from downregulation of transcripts involved in oxidative phosphorylation, decreases in complex I and II as determined by immunoblot, and assessments of respiratory capacity in tubular and glomerular preparations. These data are largely consistent with other preclinical renal injury models reported in the literature as well as previous, less thorough assessments in the Tg26 model.

      Weaknesses:

      The key weakness of the study lies in the use of a PKR inhibitor with questionable specificity. C16 has been reported to inhibit numerous other kinases including cyclin CDKs and GSK3α and -β, and this means that the conclusions of this study with respect to the role of PKR are highly questionable. The rationale for the dose used was not provided (and is lower than used in other publications with C16), and in the absence of drug exposure data and assessment of target engagement, it is difficult to ascertain whether substantial inhibition of PKR was achieved.

      A second key weakness lies in the identification of the PT-Mito cell cluster. Though the authors provide some rationale for the identification of this specific cell type, it seems equally plausible the cells merely reflect a high background capture of mitochondria in a subset of droplets. The IHC analysis that was provided is not convincing enough to support the claim and more careful high resolution imaging and in situ hybridization (with appropriate quantitation) will be needed to provide substantive support for the presence of a proximal tubule cell type with mitochondrial transcripts that are trafficked to the nucleus.

      We appreciate the reviewer’s thoughtful summary.

      With regard to the non-specificity of C16, we have added to the Discussion a description and references addressing this limitation, as suggested by the reviewer. Of note, the C16 doses that we used were also used previously (Okamoto, CommBiol, 2018). Importantly, newly added immunofluorescence images using a phospho-PKR-specific antibody showed PKR inhibition (Supplemental Figure 1).

      Identification of the PT-Mito cluster in tissues was challenging, mainly due to the absence of known marker genes for the newly identified cluster. Finally, we added in situ hybridization images, with a negative control probe, to show the specificity of the target probes.

      Reviewer #2 (Public Review):

      Summary:

      Numerous studies by the authors and other groups have demonstrated an important role for HIV gene expression in kidney cells in promoting progressive chronic kidney disease, especially HIV-associated nephropathy. The authors had previously demonstrated a role for protein kinase R (PKR) in a non-HIV transgenic model of kidney disease (Okamoto, Commun Bio, 2021). In this study, the authors used innovative techniques including bulk and single nuclear RNAseq to demonstrate that mice expressing a replication-incompetent HIV transgene have prominent dysregulation of mitochondrial gene expression and activation of PKR and that treatment of these mice with a small molecule PKR inhibitor ameliorated the kidney disease phenotype in HIV-transgenic mice. They also identified STAT3 as a key upstream regulator of kidney injury in this model, which is consistent with previously published studies. Other important advances include identifying the kidney cell types that express the HIV transgene and have dysregulation of cellular pathways.

      Strengths:

      Major strengths of the study include the use of a wide variety of state-of-the-art molecular techniques to generate important new data on the pathogenesis of kidney injury in this commonly used model of kidney disease and the identification of PKR as a potential druggable target for the treatment of HIV-induced kidney disease. The authors also identify a potential novel cell type within the kidney characterized by high expression of mitochondrial genes.

      Weaknesses:

      Though the HIV-transgenic model used in these studies results in a phenotype that is very similar to HIV-associated nephropathy in humans, the model has several limitations that may prevent direct translation to human disease, including the fact that mice lack several genetic factors that are important contributors to HIV and kidney pathogenesis in humans. Additional studies are therefore needed to confirm these findings in human kidney disease.

      We appreciate the succinct summary of the present work. We agree that the findings from the HIV Tg26 mouse model warrant additional investigation in human kidney disease samples. Further studies will be needed to confirm whether the mechanisms presented here are operative in human HIVAN or other RNA virus-associated kidney diseases.

      Reviewer #1 (Recommendations For The Authors)

      The specificity of the C16 tool has been called into question in 3 publications - Chen et al, 2008, PMID: 19046382; Lopez-Grancha et al, 2021, PMID: 34531308; and Cusak et al, 2023, PMID: 36400288. Lopez-Grancha et al have reported a novel, more selective PKR inhibitor with good pharmacological properties that might enable a more robust test of the PKR hypothesis. Regardless, compound exposures and target engagement (i.e. by monitoring phosphorylation of PKR targets such as eIF2α) should accompany these studies. Alternatively, it may be easier to probe the role of PKR in Tg26 pathogenicity by crossing the Tg26 line to a PKR knockout mouse.

      In response, we have added a description and references about the possibility of non-specificity of C16 to the Discussion as a limitation, as suggested (page 21).

      “Third, we acknowledge possibility of a non-specific effect of C16 as an inhibitor of PKR.66-68”

      Further, we added immunohistochemistry images of pPKR on kidney tissue as shown in Supplemental Figure 1A-D. Images showed PKR activation in Tg26 tubular cells, which was inhibited by C16 treatment.

      Author response image 1.

      Immunofluorescent images showing pPKR. (A-D) Immunofluorescent images showed PKR activation by detecting pPKR in Tg26 mouse kidney. pPKR was inhibited by C16 treatments.

      The suggested PKR knockout mouse experiment is an excellent idea for future work, but we believe it is outside the scope of the current manuscript.

      To enhance the evidentiary base for the PT-Mito cell type, it would be interesting to know whether these cells can also be found in human datasets like KPMP, though this might require reprocessing the original snRNAseq data. Further in situ hybridization in both mouse and human samples using fluorescent rather than colorimetric approaches should yield a more compelling dataset to provide evidence for this cell type. These approaches would also allow for more precise quantification of the PT-Mito cells compared to the population of proximal tubule cells. Again, the default assumption here should be that the mitochondrial transcripts represent a contamination, and the purpose of these additional experiments is to definitively rule out that explanation.

      Authors: First, as suggested, we carried out additional analyses. We examined a publicly available human kidney snRNA-seq dataset (GSE131882) and found in it the same PT-Mito cluster, as shown in Supplemental Figure 6. The PT-Mito cluster was located in close proximity to the PT cluster in a UMAP plot. We added this finding to the Results as follows (page 12):

      “We also confirmed the existence of similar PT-Mito cluster in published human kidney single-nuclear RNA-seq data47 by the re-analysis of the original data. (Supplemental Figure 6A-C).”

      Author response image 2.

      PT-Mito cluster detection in publicly available human kidney single-nuclear RNA-seq data (GSE131882). (A) UMAP plot of human kidney single-nuclear RNA-seq data shows 16 clusters. Clusters 1 and 4 are proximal tubule (PT) clusters, and cluster 7 is the PT-Mito cluster. (B) Dot plot shows expression of PT marker genes and PT-Mito marker genes obtained from the current manuscript data. PT-Mito markers including MT-CO1 and MT-CO2 had high expression in cluster 7. (C) UMAP plot shows that all six samples contribute to all cell clusters.
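
      Because the reviewer's concern is that mitochondrial reads might reflect contamination, a simple per-nucleus check is the fraction of counts that come from mitochondrially encoded genes. The sketch below illustrates that calculation only; it is not the analysis performed on GSE131882, and the count values, the nucleus names, and the 0.2 threshold are hypothetical. Gene-name prefixes ("MT-" for human, "mt-" for mouse) follow the usual naming convention.

```python
def mito_fraction(counts, prefixes=("mt-", "MT-")):
    """Fraction of a nucleus's total counts from mitochondrially encoded genes."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    mito = sum(n for gene, n in counts.items() if gene.startswith(prefixes))
    return mito / total


# Hypothetical count dictionaries (gene -> UMI count) for two nuclei.
nuclei = {
    "nucleus_A": {"LRP2": 40, "CUBN": 50, "MT-CO1": 5, "MT-CO2": 5},
    "nucleus_B": {"LRP2": 20, "CUBN": 20, "MT-CO1": 35, "MT-CO2": 25},
}
fractions = {name: mito_fraction(c) for name, c in nuclei.items()}

# Flag nuclei above an illustrative threshold (0.2 is not taken from the source).
high_mito = [name for name, f in fractions.items() if f > 0.2]
```

      A high mitochondrial fraction alone cannot distinguish a genuine cell state from ambient RNA capture, which is why the orthogonal in situ hybridization controls remain important.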

      Second, as suggested, we also included negative control data from in situ hybridization studies (Supplementary Figure 5A, 5B), which shows that the signals in Figure 4B, 4C are true signals.

      Author response image 3.

      Additional in situ hybridization images. (A) In situ hybridization images probing dapB (negative control probe) showed no signals. (B) In situ hybridization images probing Ppib (positive control probe) showed strong signals.

      Reviewer #2 (Recommendations For The Authors)

      (1) The supplementary data file seems to have been uploaded twice but the supplementary methods were not available which would have been helpful when assessing some methods such as using PodoCount to count podocytes.

      We acknowledge that we inadvertently failed to upload the Supplementary Methods section; thank you for pointing this out. The supplementary methods are now provided in the revised submission, including detailed methods about PodoCount. Corresponding descriptions are as follows:

      “Estimation of glomerular podocyte count

      PodoCount5, a computational tool for whole slide podocyte estimation from digitized histologic sections, was used to detect, enumerate, and characterize podocyte nuclear profiles in the glomeruli of immunohistochemically labeled (IHC-labeled) murine kidney sections. Formalin-fixed, paraffin embedded tissues (2 µm thickness) were IHC-labeled for p57kip2, a marker of podocyte terminal differentiation (ab75974, Abcam, Cambridge, UK), and detected with horse radish peroxidase (RU-HRP1000, Diagnostic BioSystems, Pleasanton, CA) and diaminobenzidine chromogen substrate (BSB0018A, Bio SB, Santa Barbara, CA). A periodic acid-Schiff post-stain was applied without hematoxylin counterstain. The tool uses a combination of stain deconvolution, digital image processing, and feature engineering to compute histologic podometrics6 with correction for section thickness7. In this study, PodoCount was used to assess mean glomerular podocyte count per mouse.”

      (2) In the abstract, the authors give the impression that they know definitively the sequence of HIV gene expression, cytoskeletal dysregulation, dedifferentiation, then loss from glomeruli. Since they could only examine cells that were present in glomeruli, they can't definitively say much about the cells that were lost from glomeruli.

      As suggested, we deleted the following text: “and were lost from glomeruli tuft”.

      (3) The authors state that 56,976 cells were used for snRNAseq studies. Was the number of cells similar for each of the 8 mice (from 4 different groups)?

      In response, we have created a new table summarizing the numbers of nuclei from each sample (i.e., each mouse) and added it as Supplemental Figure 2D, as follows:

      Author response table 1.

      Pre-processing of single-nuclear RNA-seq data. Breakdown of nuclei numbers from each sample showed comparable numbers of nuclei analyzed.

      (4) Please provide information on the assay that was used to measure creatinine, since some methods can be unreliable in mice.

      This is now provided in the revised submission, including creatinine measurement methods (LC-MS/MS) on page 3 of Supplementary Material:

      “Mouse chemistry measurements

      Plasma creatinine was measured by isotope dilution LC-MS/MS at The University of Alabama at Birmingham O’Brien Center Core C (Birmingham, AL).”

      (5) The authors state that PKR (Eif2ak2) was expressed in all nephron segments. However, it appears on visual inspection of the UMAP in Fig S2B that the percentage of cells expressing Eif2ak2 was low. What percent of cells expressed Eif2ak2 and, if it was a low percentage, what is the authors' hypothesis for how expression in a small percentage of cells led to the kidney phenotype?

      Supplemental Figure 2B (now 3B) does show modest expression of Eif2ak2, in approximately 10% of cells. The technique may lack the sensitivity to detect low gene expression, and even low gene expression may be sufficient to cause phenotypic change.

      (6a) In figure 4B and C, it is not clear what genotype/treatment group is shown.

      The legend for Figure 4B, 4C has been modified to state that the images are from wild-type mice:

      (B, C) In situ hybridization of mt-Co1 and mt-Atp6 genes showed signals inside nuclei of WT mice

      (6b) Also, if these ISH images are from Tg26 mice, it would be helpful to do ISH in mice with/without C16 treatment.

      These images of ISH for these two genes are from wild-type mice, as now stated in the revised legend. Our purpose was to show that these mitochondrial-encoded gene transcripts (mt-Co1 and mt-Atp6) are transported to nuclei from the cytoplasm. We believe it is not necessary to do ISH in Tg26 mice because these genes are not disease-specific.

      (6c) Also, only 3-6% of cells express these "PT-mito" markers by snRNAseq, but it appears that far more are expressed by ISH, raising concerns for nonspecific binding of the ISH probe.

      (6d) Also, nonsense controls should be included to demonstrate the specificity of the ISH data.

      First (comment 6c), the PT-mito cluster does not have specific markers, to our knowledge. Second (comment 6d), to address the concern for non-specific binding of the ISH probes, we have now added additional ISH images, together with a negative control probe (bacterial gene dapB) and a positive control probe (mouse Ppib), as shown in Supplementary Figure 5A and 5B, respectively.

      Author response image 4.

      Additional in situ hybridization images. (A) In situ hybridization images probing dapB (negative control probe) showed no signals. (B) In situ hybridization images probing Ppib (positive control probe) showed strong signals.

      (7) The authors state that "mitochondrial dysfunction was most pronounced in the PT-Mito cluster" but in Figure 4D, the oxidative phosphorylation activation Z score was most down in the PT-inj (injured PT cells) and the PT-Mito cells were the fourth-most downregulated cell type.

      We appreciate the careful reading and agree with the reviewer’s comment. In the revision, we have deleted “most” from this description.

      (8) In Fig 4F, please state what "Cp expression" means.

      We have spelled out ceruloplasmin (Cp).

      (9) It is not clear in immunohistochemistry images in Fig 5F where the p-stat3 was detected due to the hematoxylin counterstain which may have obscured subtle nuclear staining. Also, some of the strongest staining appears to be in peritubular capillaries, instead of tubular and glomerular epithelial cells.

      We have added arrows to help readers see where p-Stat3 was detected: as faint brown, distinct cytoplasmic granules in injured tubular cells in Tg26 mice (panel F), as opposed to diffuse tubular cytoplasmic color in wild-type mice (panel E).

      Author response image 5.

      (10) For the studies of mitochondrial oxygen consumption (Fig 6), it would be helpful to also provide data on the effect of C16 in wild-type kidneys, in case C16 somehow causes a primary increase in mitochondrial oxygen consumption rather than preventing HIV-induced loss in kidney cells from HIV-transgenic mice.

      We did not include Seahorse data regarding oxygen consumption from WT mice treated with C16, as C16 did not affect either renal function or transcriptomes in WT mice, in contrast to the Tg26 mice (Figure 1A-G).

      (11) The authors emphasize that podocytes had the highest expression of HIV genes (Fig 7). However, it appears that <2% of podocytes expressed HIV genes. How do the authors explain the severe renal phenotype given the relatively small number of cells expressing the HIV transgene? Also, did the same cells express all/most of the HIV transcripts, or did some cells express some HIV transcripts? For instance, since the authors state that vpr and nef have the most important role in kidney injury, were the same cells that expressed nef also expressing Vpr?

      We know that snRNA-seq cannot detect the whole transcriptome in each cell, due to the well-known drop-out effect characteristic of the method. Several factors may contribute to this drop-out effect, including stochastic patterns of gene expression, low RNA amounts and inefficient mRNA capture (Qiu, Nature Comm, 2020; Ran, Bioinformatics, 2020).

      Our interpretation is that HIV gene-expressing podocytes had higher expression of HIV genes, but this does not mean that other kidney cells entirely lack HIV gene expression. With regard to co-expression of other HIV transcripts, nef and vpr were more often co-expressed, as shown in Figure 7J. Vpr was expressed in nef-positive podocytes and not detected in nef-negative podocytes.
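
      The kind of co-detection comparison described above can be summarized as a 2x2 table of cells by whether each transcript was detected. The following is a minimal sketch of that tabulation, assuming per-cell count dictionaries; the function name and the counts are hypothetical and do not reproduce the data in Figure 7J. Given the drop-out effect noted earlier, a zero count indicates that a transcript was not detected, not that it was absent.

```python
def codetection_table(cells, gene_a, gene_b):
    """Count cells by detection (count > 0) of two genes.

    Keys are (gene_a detected, gene_b detected) pairs.
    """
    table = {(True, True): 0, (True, False): 0,
             (False, True): 0, (False, False): 0}
    for counts in cells:
        key = (counts.get(gene_a, 0) > 0, counts.get(gene_b, 0) > 0)
        table[key] += 1
    return table


# Hypothetical per-podocyte counts (gene -> UMI count); missing genes mean zero counts.
podocytes = [
    {"nef": 3, "vpr": 1},
    {"nef": 2},
    {"nef": 1, "vpr": 2},
    {},
    {},
]
table = codetection_table(podocytes, "nef", "vpr")
```

      In this toy input, vpr is detected only in nef-positive cells, mirroring the pattern described in the response.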

      (12) In figure 8, the authors emphasize the dysregulation of genes involved in cell-cell interaction, particularly PDGF-D. They show some data for the effect of C16 in this system in Fig 8 but it would be helpful if they can state the effect in the text of the Results section.

      We have added text to the Results (page 18) describing activating interactions in Tg26 mice that were reduced by C16 exposure, as follows:

      “For example, platelet derived growth factor D (PDGF-D) was upregulated in PT-Inj in Tg26 mice and was downregulated by C16 treatment (Figure 8D). Further, PDGF-D may interact with PDGFR-B in fibroblasts.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      We sincerely appreciate the editors for overseeing an efficient review process and for upholding the high standards of the journal. We have made extensive revisions to the manuscript after carefully reviewing the reviewers’ comments. We have addressed all the comments in our response and have incorporated the changes suggested by the reviewers to the best of our abilities. Notably, we have made the following major changes to the manuscript:

      (1) We have increased the patient cohort size from 10 to 23 for evaluating the levels of YEATS2 and H3K27cr.

      (2) To further strengthen the clinical relevance of our study, we have checked the expression of major genes involved in the YEATS2-mediated histone crotonylation axis (YEATS2, GCDH, ECHS1, Twist1 along with H3K27cr levels) in head and neck cancer tissues using immunohistochemistry.

      (3) We have performed extensive experiments to look into the role of p300 in assisting YEATS2 in regulating promoter histone crotonylation.

      The changes made to the manuscript figures have been highlighted in our response. We have also updated the Results section in accordance with the updated figures. Tables 1-4 and Supplementary files 1-3 have been moved to one single Excel workbook named ‘Supplementary Tables 1-8’. Additional revisions have been made to improve the overall quality of the manuscript and enhance data visualization. These additional changes are highlighted in the tracked changes version of the manuscript.

      Our response to the Public Reviews and ‘Recommendations to the Authors’ can be found below.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This manuscript investigates a mechanism between the histone reader protein YEATS2 and the metabolic enzyme GCDH, particularly in regulating epithelial-to-mesenchymal transition (EMT) in head and neck cancer (HNC).

      Strengths:

      Great detailing of the mechanistic aspect of the above axis is the primary strength of the manuscript.

      Weaknesses:

      Several critical points require clarification, including the rationale behind EMT marker selection, the inclusion of metastasis data, the role of key metabolic enzymes like ECHS1, and the molecular mechanisms governing p300 and YEATS2 interactions.

      We would like to sincerely thank the reviewer for the detailed, in-depth, and positive response. We have implemented constructive revisions to the manuscript to address the reviewer’s concerns effectively.

      Major Comments:

      (1) The title, "Interplay of YEATS2 and GCDH mediates histone crotonylation and drives EMT in head and neck cancer," appears somewhat misleading, as it implies that YEATS2 directly drives histone crotonylation. However, YEATS2 functions as a reader of histone crotonylation rather than a writer or mediator of this modification. It cannot itself mediate the addition of crotonyl groups onto histones. Instead, the enzyme GCDH is the one responsible for generating crotonyl-CoA, which enables histone crotonylation. Therefore, while YEATS2 plays a role in recognizing crotonylation marks and may regulate gene expression through this mechanism, it does not directly catalyse or promote the crotonylation process.

      We thank the reviewer for their insightful comment regarding the precision of our title. We agree that the initial wording 'mediates' could imply a direct enzymatic role for YEATS2 in histone crotonylation, which is indeed not the case. As the reviewer correctly points out, YEATS2 functions as a 'reader' of histone crotonylation marks.

      However, our research demonstrates that YEATS2 plays a crucial indirect regulatory role in the establishment of these crotonylation marks. Specifically, our data indicate that YEATS2 facilitates the recruitment of the histone crotonyltransferase p300 to specific gene promoters, such as that of SPARC. This recruitment mechanism directly impacts the localized deposition of crotonyl marks on nearby histone residues. Therefore, while YEATS2 does not directly catalyze the addition of crotonyl groups, its presence and interaction with p300 are essential for the regulation and establishment of histone crotonylation at these critical sites.

      To accurately reflect this nuanced, yet significant, regulatory mechanism, we have revised the title. We are replacing 'mediates' with 'regulates' to precisely convey that YEATS2 influences the histone crotonylation process, albeit indirectly, through its role in recruiting the enzymatic machinery. The updated title will now read: 'Interplay of YEATS2 and GCDH regulates histone crotonylation and drives EMT in head and neck cancer.' We believe this change maintains the core message of our findings while enhancing the scientific accuracy of the title.

      (2) The study suggests a link between YEATS2 and metastasis due to its role in EMT, but the lack of clinical or pre-clinical evidence of metastasis is concerning. Only primary tumor (PT) data is shown, but if the hypothesis is that YEATS2 promotes metastasis via EMT, then evidence from metastatic samples or in vivo models should be included to solidify this claim.

      We thank the reviewer for their valuable suggestion regarding the need for clinical or pre-clinical evidence of metastasis. We fully agree that direct evidence linking YEATS2 to metastasis would significantly strengthen our claims, especially given its demonstrated role in EMT.

      Our primary objective in this study was to meticulously dissect the molecular mechanisms by which YEATS2 regulates histone crotonylation and drives EMT in head and neck cancer. We have provided comprehensive upstream and downstream molecular insights into this process, culminating in a clear demonstration of YEATS2's functional importance in promoting EMT through multiple in vitro phenotypic assays (e.g., Matrigel invasion, wound healing, 3D invasion assays). As the reviewer notes, EMT is a widely recognized prerequisite for cancer metastasis[1]. Therefore, establishing YEATS2 as a driver of EMT directly implicates its potential role in metastatic progression.

      To further address the reviewer's concern and bridge the gap between EMT and metastasis, we have performed additional analyses that will be incorporated into the revised manuscript:

      Clinical Correlation with Tumor Grade: We analyzed publicly available head and neck cancer patient datasets. Our analysis revealed a significant positive correlation between YEATS2 expression and increasing tumor grade. Specifically, we observed significantly higher YEATS2 expression in Grade 2-4 tumors compared to Grade 1 tumors. Given that higher tumor grades are frequently associated with increased metastatic potential and poorer prognosis in HNC[2], this finding provides compelling clinical correlative evidence linking elevated YEATS2 expression to more aggressive disease.

      Gene Set Enrichment Analysis (GSEA) for Metastasis Pathways: To further explore the biological processes associated with YEATS2 in a clinical context, we performed GSEA on TCGA HNC patient samples stratified by high versus low YEATS2 expression. This analysis robustly demonstrated a positive enrichment of metastasis-related gene sets in the high YEATS2 expression group, compared to the low YEATS2 group. This strengthens the mechanistic link by showing that pathways associated with metastasis are co-ordinately upregulated when YEATS2 is highly expressed.

      These new clinical data provide strong correlative evidence supporting a direct association of YEATS2 with metastasis, building upon our detailed mechanistic dissection of its role in EMT.
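
      The stratification step that precedes the GSEA described above can be sketched as follows. This is a minimal illustration only: the response does not state which cutoff defined the high and low YEATS2 groups, so a median split is assumed here, and the sample IDs, expression values, and the function name `split_by_median` are hypothetical.

```python
from statistics import median

def split_by_median(expression):
    """Split sample IDs into high and low groups around the median value.

    Assumption: a median cutoff; the actual cutoff used in the study is not stated.
    """
    cutoff = median(expression.values())
    high = sorted(s for s, v in expression.items() if v > cutoff)
    low = sorted(s for s, v in expression.items() if v <= cutoff)
    return high, low

# Hypothetical YEATS2 expression values (arbitrary units), not data from the study.
yeats2_expression = {"S1": 2.0, "S2": 8.0, "S3": 5.0, "S4": 11.0}

high_group, low_group = split_by_median(yeats2_expression)
print("high YEATS2:", high_group)
print("low YEATS2:", low_group)
```

      The two groups would then be passed to a GSEA tool as the phenotype labels; the enrichment calculation itself is not reproduced here.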

      (3) There seems to be some discrepancy in the invasion data with BICR10 control cells (Figure 2C). BICR10 control cells with mock plasmids, specifically shControl and pEGFP-C3 show an unclear distinction between invasion capacities. Normally, we would expect the control cells to invade somewhat similarly, in terms of area covered, within the same time interval (24 hours here). But we clearly see more control cells invading when the invasion is done with KD and fewer control cells invading when the invasion is done with OE. Are these just plasmid-specific significant effects on normal cell invasion? This needs to be addressed.

      We thank the reviewer for their careful examination of Figure 2C and their insightful observation regarding the appearance of the control cells in relation to the knockdown (Figure 2B) and overexpression (Figure 2C) experiments. We understand how, at first glance, the control invasion levels across these panels might seem disparate.

      We wish to clarify that Figure 2B (YEATS2 knockdown) and Figure 2C (YEATS2 overexpression) represent two entirely independent experiments, conducted with distinct experimental conditions and methodologies, as detailed in our Methods section.

      Specifically:

      Figure 2B (Knockdown): Utilizes lentivirus-mediated transduction for stable shRNA delivery (shControl as control).

      Figure 2C (Overexpression): Utilizes transfection with plasmid DNA (pEGFP-C3 as control) via a standard transfection reagent.

      These fundamental differences in genetic manipulation methods (transduction vs. transfection), along with potential batch-to-batch variations in reagents or cell passage number at the time of each independent experiment, can indeed lead to variations in absolute basal invasion rates of control cells[3].

      Therefore, the invasion capacity of BICR10 control cells in Figure 2B (shControl) should only be compared to the YEATS2 knockdown conditions within that same panel. Similarly, the invasion capacity of control cells in Figure 2C (pEGFP-C3) should only be compared to the YEATS2 overexpression conditions within that specific panel. The crucial finding in each panel lies in the relative change in invasion caused by YEATS2 manipulation (knockdown or overexpression) compared to its respective, concurrently run control.

      We have ensured that all statistical analyses (as indicated in the figure legends and methods) were performed by comparing the experimental groups directly to their matched internal controls within each independent experiment. The significant increase in invasion upon YEATS2 overexpression and the significant decrease upon YEATS2 knockdown, relative to their respective controls, are robust and reproducible findings.
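
      The within-experiment comparison described above can be illustrated with a short sketch. It assumes that invasion is quantified as a count per field and expressed as a fold change over the mean of the matched control; the numbers below are made-up placeholders, not data from the study, and `relative_invasion` is a hypothetical helper name.

```python
def relative_invasion(treated, control):
    """Express each treated replicate as a fold change over its matched control mean."""
    control_mean = sum(control) / len(control)
    return [value / control_mean for value in treated]

# Hypothetical invaded-cell counts for two independent experiments.
# The absolute control levels differ, as they can between transduction
# and transfection experiments.
knockdown = {"control": [200.0, 220.0, 180.0], "treated": [100.0, 90.0, 110.0]}
overexpression = {"control": [80.0, 100.0, 120.0], "treated": [190.0, 200.0, 210.0]}

kd_fold = relative_invasion(knockdown["treated"], knockdown["control"])
oe_fold = relative_invasion(overexpression["treated"], overexpression["control"])

kd_mean = sum(kd_fold) / len(kd_fold)
oe_mean = sum(oe_fold) / len(oe_fold)
print(f"knockdown vs shControl: {kd_mean:.2f}-fold")
print(f"overexpression vs pEGFP-C3: {oe_mean:.2f}-fold")
```

      Because each condition is divided by its own control, the differing absolute control levels in the two experiments do not affect the reported fold changes.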

      (4) In Figure 3G, the Western blot shows an unclear band for YEATS2 in shSP1 cells with YEATS2 overexpression condition. The authors need to clearly identify which band corresponds to YEATS2 in this case.

      We thank the reviewer for pointing out the ambiguity in the YEATS2 Western blot for the shSP1 + pEGFP-C3-YEATS2 condition in Figure 3G. We apologize for this lack of clarity. The two bands seen in the shSP1+pEGFP-C3-YEATS2 condition correspond to the endogenous YEATS2 band (lower band) and YEATS2-GFP band (upper band, corresponding to overexpressed YEATS2-GFP fusion protein, which has a higher molecular weight). To avoid confusion, the endogenous band is now highlighted (marked by *) in the lane representing the shSP1+pEGFP-C3-YEATS2 condition. We have also updated the figure legend accordingly.

      (5) In ChIP assays with SP1, YEATS2 and p300 which promoter regions were selected for the respective genes? Please provide data for all the different promoter regions that must have been analysed, highlighting the region where enrichment/depletion was observed. Including data from negative control regions would improve the validity of the results.

      Throughout our study, we have performed ChIP-qPCR assays to check SP1 binding at the YEATS2 and GCDH promoters, and YEATS2 and p300 binding at the SPARC promoter. Using transcription factor binding prediction tools and luciferase assays, we selected multiple sites on the YEATS2 and GCDH promoters to check for SP1 binding. The results corresponding to the site that showed significant enrichment were provided in the manuscript. The region of the SPARC promoter used in the YEATS2 and p300 ChIP assays was selected on the basis of the YEATS2 enrichment found in the YEATS2 ChIP-seq data. The ChIP-qPCR data for all the promoter regions investigated (including negative controls) can be found below (Authors’ response image 1).

      Authors’ response image 1.

      (A) SP1 ChIP-qPCR results indicating SP1 occupancy on different regions of YEATS2 promoter. YEATS2 promoter region showing SP1 binding sites (indicated by red boxes) is shown above. SP1 showed significant enrichment at F1R1 region. The results corresponding to F1R1 region were included in Figure 3D. (B) SP1 ChIP-qPCR results indicating SP1 occupancy on different regions of GCDH promoter. GCDH promoter region showing SP1 binding sites (indicated by red boxes) is shown above. SP1 showed significant enrichment at F2R2 region. The results corresponding to F2R2 region were included in Figure 7E. (C) YEATS2 ChIP-qPCR results in shControl vs. shYEATS2 BICR10 cells indicating YEATS2 occupancy on different regions of SPARC promoter. SPARC promoter region showing YEATS2 ChIP-seq and H3K27cr ChIP-seq signals is shown above. YEATS2 showed significant enrichment at F1R1 region. The results corresponding to F1R1 region were included in Figure 5C. (D) p300 ChIP-qPCR results in shControl vs. shYEATS2 BICR10 cells indicating p300 occupancy on different regions of SPARC promoter. p300 showed significant enrichment at F1R1 region. The results corresponding to F1R1 region were included in Figure 5F.
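
      For readers unfamiliar with how ChIP-qPCR enrichment at regions such as F1R1 is commonly quantified, the sketch below shows the standard percent-input calculation. The response does not specify which quantification method was used, so this method is an assumption, and the Ct values, the 1% input fraction, and the function name `percent_input` are hypothetical.

```python
import math

def percent_input(ct_ip, ct_input, input_fraction):
    """Percent-input for one ChIP-qPCR amplicon.

    The input Ct is first adjusted to represent 100% of the chromatin:
    an input saved as a fraction f of the chromatin amplifies
    log2(1/f) cycles later than the full amount would.
    """
    adjusted_input_ct = ct_input - math.log2(1.0 / input_fraction)
    return 100.0 * 2.0 ** (adjusted_input_ct - ct_ip)

# Hypothetical Ct values for one primer pair, with a 1% input sample.
specific_ip = percent_input(ct_ip=24.0, ct_input=25.0, input_fraction=0.01)
igg_control = percent_input(ct_ip=27.0, ct_input=25.0, input_fraction=0.01)

print(f"specific antibody: {specific_ip:.2f}% of input")
print(f"IgG control: {igg_control:.2f}% of input")
print(f"fold over IgG: {specific_ip / igg_control:.1f}")
```

      Comparing the same calculation across several amplicons, including negative control regions, is what distinguishes region-specific enrichment from background.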

      (6) The authors establish a link between H3K27Cr marks and GCDH expression, and this is an already well-known pathway. A critical missing piece is the level of ECHS1 in patient samples. This will clearly delineate if the balance shifted towards crotonylation.

      We greatly appreciate the reviewer's insightful comment regarding the importance of assessing ECHS1 levels in patient samples to clearly delineate the metabolic balance shifting towards crotonylation. We fully agree that this is a critical piece of evidence.

      To directly address this point and substantiate our claim regarding the altered metabolic balance in HNC, we had previously analyzed the expression of both GCDH and ECHS1 in TCGA HNC RNA-seq data (as presented in Figure 4—figure supplement 1A and B). This analysis revealed a consistent increase in GCDH expression and a concomitant decrease in ECHS1 expression in tumor samples compared to normal tissues. Based on these findings, we hypothesized that this altered expression profile would indeed lead to an accumulation of crotonyl-CoA and, consequently, an overall increase in histone crotonylation in HNC.

      To further validate and extend these findings at the protein level, we have now performed immunohistochemistry (IHC) analysis for both ECHS1 and GCDH in a cohort of HNC normal vs. tumor tissues. Our IHC results strikingly corroborate the RNA-seq data: GCDH consistently showed increased protein expression in tumor samples, whereas ECHS1 exhibited significantly reduced protein expression in tumors compared to their adjacent normal counterpart tissues (Figure 4E and Authors’ response figure 5).

      These new data, combined with the existing TCGA HNC RNA-seq analysis, strongly support our proposed mechanism in which altered GCDH and ECHS1 expression contributes to increased histone crotonylation in head and neck cancer.

      (7) The p300 ChIP data on the SPARC promoter is confusing. The authors report reduced p300 occupancy in YEATS2-silenced cells, on SPARC promoter. However, this is paradoxical, as p300 is a writer, a histone acetyltransferase (HAT). The absence of a reader (YEATS2) shouldn't affect the writer (p300) unless a complex relationship between p300 and YEATS2 is present. The role of p300 should be further clarified in this case. Additionally, transcriptional regulation of SPARC expression in YEATS2 silenced cells could be analysed via downstream events, like Pol-II recruitment. Assays such as Pol-II ChIP-qPCR could help explain this.

      We greatly appreciate the reviewer's insightful observation regarding the apparently paradoxical reduction of p300 occupancy on the SPARC promoter upon YEATS2 silencing (Figure 5F), and their call for further clarification of p300's role and the potential complex relationship with YEATS2. We agree that this point required further mechanistic investigation.

      As we have shown through RNA-seq and ChIP-seq analyses, YEATS2 broadly influences histone crotonylation levels at gene promoters, thereby impacting gene expression. While p300 is indeed a known histone acetyltransferase (HAT) with promiscuous acyltransferase activity, including crotonyltransferase activity[4], the precise mechanism by which its occupancy is affected by a 'reader' protein like YEATS2 was unclear. Our initial data suggested a dependency of p300 recruitment on YEATS2.

      To directly address the reviewer's concern and thoroughly delineate the molecular mechanism of cooperativity between YEATS2 and p300 in regulating histone crotonylation, we have now performed a series of targeted experiments, which have been incorporated into the revised manuscript:

      (a) Validation of p300's role in SPARC expression: We performed p300 knockdown in BICR10 cells, followed by immunoblotting to assess SPARC protein levels. As expected, a significant decrease in SPARC protein levels was observed upon p300 knockdown (Figure 5G). This confirms p300's direct involvement in SPARC gene expression.

      (b) Direct interaction between YEATS2 and p300: To investigate a potential physical association, we performed co-immunoprecipitation assays to check for an interaction between endogenous YEATS2 and p300. Our results clearly demonstrate the presence of YEATS2 in the p300-immunoprecipitate sample, indicating that YEATS2 and p300 physically interact and likely function together as a complex to drive the expression of target genes like SPARC (Figure 5H). This direct interaction provides the mechanistic basis for how YEATS2 influences p300 occupancy.

      (c) Impact on transcriptional activity (Pol II recruitment): As suggested, we performed RNA Polymerase II (Pol II) ChIP-qPCR on the SPARC promoter in YEATS2 knockdown cells. We observed a significant decrease in Pol II occupancy on the SPARC promoter after YEATS2 knockdown in BICR10 cells (Figure 6C). This confirms that YEATS2 silencing leads to reduced transcriptional initiation/elongation at this promoter.

      (d) p300's direct role in H3K27cr on SPARC promoter: To confirm p300's specific role in crotonylation at this locus, we performed H3K27cr ChIP-qPCR after p300 knockdown. As anticipated, a significant decrease in H3K27cr enrichment was observed on the SPARC promoter upon p300 knockdown (Figure 6J), directly demonstrating p300's crotonyltransferase activity at this site.

      (e) Rescue of p300 occupancy and H3K27cr by YEATS2 overexpression in SP1-deficient cells: To further establish the YEATS2-p300 axis, we performed SP1 knockdown (which reduces YEATS2 expression) followed by ectopic YEATS2 overexpression, and then assessed p300 occupancy and H3K27cr levels on the SPARC promoter. While SP1 knockdown led to a decrease in both p300 and H3K27cr enrichment, we observed a significant rescue of both p300 occupancy and H3K27cr enrichment upon YEATS2 overexpression in the shSP1 cells (Figure 6K and L). This provides strong evidence that YEATS2 acts downstream of SP1 to regulate p300 recruitment and H3K27cr levels.

      Collectively, these comprehensive new results clearly establish that YEATS2 directly interacts with and assists in the recruitment of p300 to the SPARC promoter. This recruitment is crucial for p300's localized crotonyltransferase activity, leading to increased H3K27cr marks and subsequent activation of SPARC transcription. This clarifies the previously observed 'paradox' and defines a novel cooperative mechanism between a histone reader (YEATS2) and a writer (p300) in regulating histone crotonylation and gene expression.

      (8) The role of GCDH in producing crotonyl-CoA is already well-established in the literature. The authors' hypothesis that GCDH is essential for crotonyl-CoA production has been proven, and it's unclear why this is presented as a novel finding. It has been shown that YEATS2 KD leads to reduced H3K27cr, however, it remains unclear how the reader is affecting crotonylation levels. Are GCDH levels also reduced in the YEATS2 KD condition? Are YEATS2 levels regulating GCDH expression? One possible mechanism is YEATS2 occupancy on GCDH promoter and therefore reduced GCDH levels upon YEATS2 KD. This aspect is crucial to the study's proposed mechanism but is not addressed thoroughly.

      We appreciate the reviewer's valuable comment questioning the novelty of GCDH's role in crotonyl-CoA production and seeking further clarification on how YEATS2 influences crotonylation levels beyond its reader function.

      We agree that GCDH's general role in producing crotonyl-CoA is well-established[5,6]. Our study, however, aims to delineate a novel epigenetic-metabolic crosstalk in head and neck cancer, specifically investigating how the interplay between the histone crotonylation reader YEATS2 and the metabolic enzyme GCDH contributes to increased histone crotonylation and drives EMT in this context.

      Our initial investigations using GSEA on publicly available TCGA RNA-seq data revealed that HNC patients with high YEATS2 expression also exhibit elevated expression of genes involved in the lysine degradation pathway, prominently including GCDH. Recognizing the known roles of YEATS2 in preferentially binding H3K27cr[7] and GCDH in producing crotonyl-CoA, we hypothesized that the elevated H3K27cr levels observed in HNC are a consequence of the combined action of both YEATS2 and GCDH. We have provided evidence that increased nuclear GCDH correlates with higher H3K27cr abundance, likely due to an increased nuclear pool of crotonyl-CoA, and that YEATS2 contributes through its preferential maintenance of crotonylation marks by recruiting p300 (as detailed in Figure 5F-H and Figure 6J-L of the manuscript and elaborated in our response to point 7). Thus, our work highlights that both YEATS2 and GCDH are crucial for the regulation of histone crotonylation-mediated gene expression in HNC.

      To directly address the reviewer's query regarding YEATS2's influence on GCDH levels and nuclear histone crotonylation:

      • YEATS2 does not transcriptionally regulate GCDH: We did not find any evidence of YEATS2 directly regulating the expression levels of GCDH at the transcriptional level in HNC cells.

      • Novel finding: YEATS2 regulates GCDH nuclear localization: Crucially, we discovered that YEATS2 downregulation significantly reduces the nuclear pool of GCDH in head and neck cancer cells (Figure 7G). This is a novel mechanism suggesting that YEATS2 influences histone crotonylation not only by affecting promoter H3K27cr levels via p300 recruitment, but also by regulating the availability of the crotonyl-CoA producing enzyme, GCDH, within the nucleus.

      • Common upstream regulation by SP1: Interestingly, we found that both YEATS2 and GCDH expression are commonly regulated by the transcription factor SP1 in HNC. Our data demonstrate that SP1 binds to the promoters of both genes, and its downregulation leads to a decrease in their respective expressions (Figure 3 and Figure 7). This provides an important upstream regulatory link between these two key players.

      • Functional validation of GCDH in EMT: We further assessed the functional importance of GCDH in maintaining the EMT phenotype in HNC cells. Matrigel invasion assays after GCDH knockdown and overexpression in BICR10 cells revealed that the invasiveness of HNC cells was significantly reduced upon GCDH knockdown and significantly increased upon GCDH overexpression (results provided in revised manuscript Figure 7F and Figure 7—figure supplement 1F).

      These findings collectively demonstrate a multifaceted role for YEATS2 in regulating histone crotonylation by both direct recruitment of the writer p300 and by influencing the nuclear availability of the crotonyl-CoA producing enzyme GCDH. We acknowledge that the precise molecular mechanism governing YEATS2's effect on GCDH nuclear localization remains an exciting open question for future investigation, but our current data establishes a novel regulatory axis.

      (9) The authors should provide IHC analysis of YEATS2, SPARC alongside H3K27cr and GCDH staining in normal vs. tumor tissues from HNC patients.

      We thank the reviewer for their suggestion. We have performed IHC analysis for YEATS2, H3K27cr and GCDH in normal and tumor samples obtained from HNC patients.

      Reviewer #2 (Public review):

      Summary:

      The manuscript emphasises the increased invasive potential of histone reader YEATS2 in an SP1-dependent manner. They report that YEATS2 maintains high H3K27cr levels at the promoter of EMT-promoting gene SPARC. These findings assigned a novel functional implication of histone acylation, crotonylation.

      We thank the reviewer for the constructive comments. We are committed to making beneficial changes to the manuscript in order to alleviate the reviewer’s concerns.

      Concerns:

      (1) The patient cohort is very small with just 10 patients. To establish a significant result the cohort size should be increased.

      We thank the reviewer for this suggestion. We have increased the number of patient samples to assess the levels of YEATS2 (n=23 samples) and the results have been included in Figure 1G and Figure 1—figure supplement 1F.

      (2) Figure 4D compares H3K27Cr levels in tumor and normal tissue samples. Figure 1G shows overexpression of YEATS2 in a tumor as compared to normal samples. The loading control is missing in both. Loading control is essential to eliminate any disparity in protein concentration that is loaded.

      To address the reviewer’s concern, we have repeated the experiment and used H3 as a loading control, because nuclear protein lysates from patient samples were used to check YEATS2 and H3K27cr levels.

      (3) Figure 4D only mentions 5 patient samples checked for the increased levels of crotonylation and hence forms the basis of their hypothesis (increased crotonylation in a tumor as compared to normal). The sample size should be more and patient details should be mentioned.

      As part of the revision, we have now checked the H3K27cr levels in a total of 23 patient samples and the results have been included in Figure 4D and Figure 4—figure supplement 1D. Patient details are provided in Supplementary Table 6.

      (4) YEATS2 maintains H3K27Cr levels at the SPARC promoter. The p300 is reported to be hyper-activated (hyperautoacetylated) in oral cancer. Probably, the activated p300 causes hyper-crotonylation, and other protein factors cause the functional translation of this modification. The authors need to clarify this with a suitable experiment.

      We thank the reviewer for this insightful comment regarding the functional relationship between YEATS2 and p300 in the context of H3K27cr, especially considering reports of p300 hyper-activation in oral cancer. We agree that a precise clarification of p300's role and its cooperativity with YEATS2 is crucial to fully understand the functional translation of this modification.

      As we have shown through global RNA-seq and ChIP-seq analyses, YEATS2 broadly affects gene expression by regulating histone crotonylation levels at gene promoters. We also recognize that the histone writer p300 is a promiscuous acyltransferase, known to add various non-acetyl marks, including crotonylation[4]. Our initial data, showing decreased p300 occupancy on the SPARC promoter upon YEATS2 downregulation (Figure 5F), suggested a strong dependency of p300 on YEATS2 for its recruitment. To fully delineate the molecular mechanism of this cooperativity and clarify how YEATS2 influences p300-mediated histone crotonylation and its functional outcomes, we have performed the following series of experiments, which have been integrated into the revised manuscript:

      (a) Validation of p300's role in SPARC expression: We performed p300 knockdown in BICR10 cells, followed by immunoblotting to assess SPARC protein levels. As expected, a significant decrease in SPARC protein levels was observed upon p300 knockdown (Figure 5G). This confirms p300's direct involvement in SPARC gene expression.

      (b) Direct interaction between YEATS2 and p300: To investigate a potential physical association, we performed co-immunoprecipitation assays to check for an interaction between endogenous YEATS2 and p300. Our results clearly demonstrate the presence of YEATS2 in the p300-immunoprecipitate sample, indicating that YEATS2 and p300 physically interact and likely function together as a complex to drive the expression of target genes like SPARC (Figure 5H). This direct interaction provides the mechanistic basis for how YEATS2 influences p300 occupancy.

      (c) Impact on transcriptional activity (Pol II recruitment): As suggested, we performed RNA Polymerase II (Pol II) ChIP-qPCR on the SPARC promoter in YEATS2 knockdown cells. We observed a significant decrease in Pol II occupancy on the SPARC promoter after YEATS2 knockdown in BICR10 cells (Figure 6C). This confirms that YEATS2 silencing leads to reduced transcriptional initiation/elongation at this promoter.

      (d) p300's direct role in H3K27cr on SPARC promoter: To confirm p300's specific role in crotonylation at this locus, we performed H3K27cr ChIP-qPCR after p300 knockdown. As anticipated, a significant decrease in H3K27cr enrichment was observed on the SPARC promoter upon p300 knockdown (Figure 6J), directly demonstrating p300's crotonyltransferase activity at this site.

      (e) Rescue of p300 occupancy and H3K27cr by YEATS2 overexpression in SP1-deficient cells: To further establish the YEATS2-p300 axis, we performed SP1 knockdown (which reduces YEATS2 expression) followed by ectopic YEATS2 overexpression, and then assessed p300 occupancy and H3K27cr levels on the SPARC promoter. While SP1 knockdown led to a decrease in both p300 and H3K27cr enrichment, we observed a significant rescue of both p300 occupancy and H3K27cr enrichment upon YEATS2 overexpression in the shSP1 cells (Figure 6K and L). This provides strong evidence that YEATS2 acts downstream of SP1 to regulate p300 recruitment and H3K27cr levels.

      Collectively, these comprehensive new results clearly establish that YEATS2 directly interacts with and assists in the recruitment of p300 to the SPARC promoter. This recruitment is crucial for p300's localized crotonyltransferase activity, leading to increased H3K27cr marks and subsequent activation of SPARC transcription. This clarifies the previously observed 'paradox' and defines a novel cooperative mechanism between a histone reader (YEATS2) and a writer (p300) in regulating histone crotonylation and gene expression.

      (5) I do not entirely agree with using GAPDH as a control in the western blot experiment since GAPDH has been reported to be overexpressed in oral cancer.

      We would like to clarify that GAPDH was not used as a loading control for protein expression comparisons between normal and tumor samples. GAPDH was used as a loading control only in experiments using head and neck cancer cell lines where shRNA-mediated knockdown or overexpression was employed. These manipulations specifically target the genes of interest and are not expected to alter GAPDH expression, making it a suitable loading control in these instances.

      (6) The expression of EMT markers has been checked in shControl and shYEATS2 transfected cell lines (Figure 2A). However, their expression should first be checked directly in the patients' normal vs. tumor samples.

      We thank the reviewer for the suggestion. We have now checked the expression of EMT marker Twist1 alongside YEATS2 expression in normal vs. tumor tissue samples using IHC (Figure 4E).

      (7) In Figure 3G, knockdown of SP1 led to the reduced expression of YEATS2 controlled gene Twist1. Ectopic expression of YEATS2 was able to rescue Twist1 partially. In order to establish that SP1 directly regulates YEATS2, SP1 should also be re-introduced upon the knockdown background along with YEATS2 for complete rescue of Twist1 expression.

      To address the reviewer’s concern regarding the partial rescue of Twist1 in SP1-depleted, YEATS2-overexpressing cells, we performed the experiment as suggested by the reviewer. We overexpressed both SP1 and YEATS2 in SP1-depleted cells and found that Twist1 expression was almost completely rescued.

      Authors’ response image 2.

      Immunoblot depicting the decreased Twist1 levels on SP1 knockdown and its subsequent rescue of expression upon YEATS2 and SP1 overexpression in BICR10 (endogenous YEATS2 band indicated by *).

      (8) In Figure 7G, the expression of EMT genes should also be checked upon rescue of SPARC expression.

      We thank the reviewer for the suggestion. We have examined the expression of the EMT marker Twist1 upon YEATS2/GCDH rescue. On overexpressing both YEATS2 and GCDH in shSP1 cells, we found that the depleted expression of Twist1 was rescued.

      Authors’ response image 3.

      Immunoblot depicting the decreased Twist1 levels on SP1 knockdown and its subsequent rescue of expression upon dual overexpression of YEATS2 and GCDH in BICR10 (* indicates GFP-tagged YEATS2 probed using GFP antibody).

      Reviewer #1 (Recommendations for the authors):

      While the study offers insights into the specific role of this axis in regulating epithelial-to-mesenchymal transition (EMT) in HNC, its broader mechanistic novelty is limited by prior discoveries in other cancer types (https://doi.org/10.1038/s41586-023-06061-0). The manuscript would benefit from the inclusion of metastasis data, the role of key metabolic enzymes like ECHS1, the molecular mechanisms governing p300 and YEATS2 interactions, additional IHC data, negative control data in ChIP, and an explanation of discrepancies in certain figures.

      We thank the reviewer for their constructive suggestions. We have made extensive revisions to our manuscript to substantiate our findings. We have looked into the expression of ECHS1/GCDH in HNC tumor tissues using IHC, performed extensive experiments to validate the role of p300 in YEATS2-mediated histone crotonylation, and provided additional data supporting our findings wherever required. The revised figures have been provided in the updated version of the manuscript and also in the Authors’ response.

      Minor Comments:

      (1) The study begins with a few EMT markers, such as Vimentin, Twist, and N-Cadherin to validate the role of YEATS2 in promoting EMT. Including a broader panel of EMT markers would strengthen the conclusions about the effects of YEATS2 on EMT and invasion. Additionally, the rationale for selecting these EMT markers is not fully elaborated. Why were other well-known EMT players not included in the analysis?

      On performing RNA-seq with shControl and shYEATS2 samples, we found that TWIST1 expression decreased upon YEATS2 downregulation. Twist1 was therefore investigated as a potential target of YEATS2 in HNC cells. N-Cadherin was chosen because it is known to be upregulated directly by Twist1[8]. Further, Vimentin was chosen as it is a well-known marker of the mesenchymal phenotype and is frequently used to indicate EMT in cancer cells[9].

      Authors’ response image 4.

      IGV plot showing the decrease in Twist1 expression in shControl vs. shYEATS2 RNA-seq data.

      Other than the EMT-markers used in our study, the following markers were amongst those that showed significant change in gene expression on YEATS2 downregulation.

      Authors’ response table 1.

      List of EMT-related genes that showed significant change in expression on YEATS2 knockdown in RNA-seq analysis.

      As depicted in the table above, the majority of the genes that showed downregulation on YEATS2 knockdown were mesenchymal markers, while epithelial-specific genes such as E-cadherin and Claudin-1 showed upregulation. These data indicate the essential role of YEATS2 in driving EMT in head and neck cancer.

      (2) The authors use Ponceau staining, but the rationale behind this choice is unclear. Ponceau is typically used for transfer validation. For the same patient, western blot loading controls like Actin/GAPDH should be shown. Also, at various places throughout the manuscript, Ponceau staining has been used. These should also be replaced with Actin/GAPDH blots.

      Ponceau S staining is frequently used as an alternative to housekeeping genes like GAPDH as a control for protein loading[10]. However, to address this issue, we have repeated the western blot and used H3 as a loading control, as nuclear protein lysates from patient samples were used to check YEATS2 and H3K27cr levels.

      For the experiments (Figures 5E, 6F, 6I, and 7H) in which we assessed SPARC levels in conditioned media obtained from BICR10 cells (secretory fraction), Ponceau S staining was deliberately used as the loading control. In such extracellular protein analyses, traditional intracellular housekeeping genes (like Actin or GAPDH) are not applicable. Ponceau S has also been used as a control for showing SPARC expression in the secretory fraction of mammalian cell lines in previous studies[11].

      (3) The manuscript briefly mentions that p300 was identified as the only protein with increased expression in tumours compared to normal tissue in the TCGA dataset. What other writers were checked for? Did the authors check for their levels in HNC patients?

      We thank the reviewer for this observation. As stated by previous studies [12,13], p300 and GCN5 are the histone writers that can act as crotonyltransferases at the H3K27 position. Although the crotonyltransferase activity of GCN5 has been demonstrated in yeast, it has not been confirmed in humans. In contrast, the histone crotonyltransferase activity of p300 has been validated in human cells using in vitro HCT assays[4,14]. Therefore, we chose to focus on p300 for further validation of its role in YEATS2-mediated regulation of histone crotonylation. We did not check the levels of p300 in HNC patient tissues. However, p300 showed higher expression in tumor as compared to normal in publicly available HNC TCGA RNA-seq data (Figure 5—figure supplement 1G).

      We acknowledge that the original statement in the manuscript, 'For this we looked at expression of the known writers of H3K27Cr mark in TCGA dataset, and discovered that p300 was the only protein that had increased expression in tumor vs. normal HNC dataset…', was indeed slightly misleading. Our intention was to convey that p300 is considered the major and most validated histone crotonyltransferase capable of influencing crotonylation at the H3K27 position in humans, and that its expression was notably increased in the HNC TCGA tumor dataset. We have now reframed this sentence in the revised manuscript to accurately reflect our findings and focus, as follows:

      'For this, we checked the expression of p300, a known writer of H3K27cr mark in humans, in the TCGA dataset. We found that p300 had increased expression in tumor vs. normal HNC dataset…'

      This revised wording more accurately reflects our specific focus on p300's established role and its observed upregulation in HNC.

      (4) Figure 6E, blot should be replaced. The results aren't clearly visible.

      We thank the reviewer for this observation. We have repeated the western blot, and Figure 6E (Figure 6F in the revised version of the manuscript) has now been replaced with a cleaner blot.

      (5) Reference 9 and 19 are the same. Please rectify.

      We apologize for this inadvertent error. We have rectified this error in the updated version of the manuscript.

      References

      (1) Brabletz, T.; Kalluri, R.; Nieto, M. A.; Weinberg, R. A. EMT in Cancer. Nat Rev Cancer 2018, 18(2), 128–134. https://doi.org/10.1038/nrc.2017.118.

      (2) Pisani, P.; Airoldi, M.; Allais, A.; Aluffi Valletti, P.; Battista, M.; Benazzo, M.; Briatore, R.; Cacciola, S.; Cocuzza, S.; Colombo, A.; Conti, B.; Costanzo, A.; Della Vecchia, L.; Denaro, N.; Fantozzi, C.; Galizia, D.; Garzaro, M.; Genta, I.; Iasi, G. A.; Krengli, M.; Landolfo, V.; Lanza, G. V.; Magnano, M.; Mancuso, M.; Maroldi, R.; Masini, L.; Merlano, M. C.; Piemonte, M.; Pisani, S.; Prina-Mello, A.; Prioglio, L.; Rugiu, M. G.; Scasso, F.; Serra, A.; Valente, G.; Zannetti, M.; Zigliani, A. Metastatic Disease in Head & Neck Oncology. Acta Otorhinolaryngol Ital 2020, 40 (SUPPL. 1), S1–S86. https://doi.org/10.14639/0392-100X-suppl.1-40-2020.

      (3) Lin, J.; Zhang, P.; Liu, W.; Liu, G.; Zhang, J.; Yan, M.; Duan, Y.; Yang, N. A Positive Feedback Loop between ZEB2 and ACSL4 Regulates Lipid Metabolism to Promote Breast Cancer Metastasis. Elife 2023, 12, RP87510. https://doi.org/10.7554/eLife.87510.

      (4) Liu, X.; Wei, W.; Liu, Y.; Yang, X.; Wu, J.; Zhang, Y.; Zhang, Q.; Shi, T.; Du, J. X.; Zhao, Y.; Lei, M.; Zhou, J.-Q.; Li, J.; Wong, J. MOF as an Evolutionarily Conserved Histone Crotonyltransferase and Transcriptional Activation by Histone Acetyltransferase-Deficient and Crotonyltransferase-Competent CBP/P300. Cell Discov 2017, 3 (1), 17016. https://doi.org/10.1038/celldisc.2017.16.

      (5) Jiang, G.; Li, C.; Lu, M.; Lu, K.; Li, H. Protein Lysine Crotonylation: Past, Present, Perspective. Cell Death Dis 2021, 12 (7), 703. https://doi.org/10.1038/s41419-021-03987-z.

      (6) Yuan, H.; Wu, X.; Wu, Q.; Chatoff, A.; Megill, E.; Gao, J.; Huang, T.; Duan, T.; Yang, K.; Jin, C.; Yuan, F.; Wang, S.; Zhao, L.; Zinn, P. O.; Abdullah, K. G.; Zhao, Y.; Snyder, N. W.; Rich, J. N. Lysine Catabolism Reprograms Tumour Immunity through Histone Crotonylation. Nature 2023, 617 (7962), 818–826. https://doi.org/10.1038/s41586-023-06061-0.

      (7) Zhao, D.; Guan, H.; Zhao, S.; Mi, W.; Wen, H.; Li, Y.; Zhao, Y.; Allis, C. D.; Shi, X.; Li, H. YEATS2 Is a Selective Histone Crotonylation Reader. Cell Res 2016, 26 (5), 629–632. https://doi.org/10.1038/cr.2016.49.

      (8) Alexander, N. R.; Tran, N. L.; Rekapally, H.; Summers, C. E.; Glackin, C.; Heimark, R. L. N-Cadherin Gene Expression in Prostate Carcinoma Is Modulated by Integrin-Dependent Nuclear Translocation of Twist1. Cancer Res 2006, 66 (7), 3365–3369. https://doi.org/10.1158/0008-5472.CAN-05-3401.

      (9) Satelli, A.; Li, S. Vimentin in Cancer and Its Potential as a Molecular Target for Cancer Therapy. Cellular and Molecular Life Sciences 2011, 68 (18), 3033–3046. https://doi.org/10.1007/s00018-011-0735-1.

      (10) Romero-Calvo, I.; Ocón, B.; Martínez-Moya, P.; Suárez, M. D.; Zarzuelo, A.; Martínez-Augustin, O.; de Medina, F. S. Reversible Ponceau Staining as a Loading Control Alternative to Actin in Western Blots. Anal Biochem 2010, 401 (2), 318–320. https://doi.org/10.1016/j.ab.2010.02.036.

      (11) Ling, H.; Li, Y.; Peng, C.; Yang, S.; Seto, E. HDAC10 Inhibition Represses Melanoma Cell Growth and BRAF Inhibitor Resistance via Upregulating SPARC Expression. NAR Cancer 2024, 6 (2), zcae018. https://doi.org/10.1093/narcan/zcae018.

      (12) Gao, D.; Li, C.; Liu, S.-Y.; Xu, T.-T.; Lin, X.-T.; Tan, Y.-P.; Gao, F.-M.; Yi, L.-T.; Zhang, J. V; Ma, J.-Y.; Meng, T.-G.; Yeung, W. S. B.; Liu, K.; Ou, X.-H.; Su, R.-B.; Sun, Q.-Y. P300 Regulates Histone Crotonylation and Preimplantation Embryo Development. Nat Commun 2024, 15 (1), 6418. https://doi.org/10.1038/s41467-024-50731-0.

      (13) Li, K.; Wang, Z. Histone Crotonylation-Centric Gene Regulation. Epigenetics Chromatin 2021, 14 (1), 10. https://doi.org/10.1186/s13072-021-00385-9.

      (14) Sabari, B. R.; Tang, Z.; Huang, H.; Yong-Gonzalez, V.; Molina, H.; Kong, H. E.; Dai, L.; Shimada, M.; Cross, J. R.; Zhao, Y.; Roeder, R. G.; Allis, C. D. Intracellular Crotonyl-CoA Stimulates Transcription through P300-Catalyzed Histone Crotonylation. Mol Cell 2015, 58 (2), 203–215. https://doi.org/10.1016/j.molcel.2015.02.029.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study presents a valuable insight into a computational mechanism of pain perception. The evidence supporting the authors’ claims is solid, although the inclusion of 1) more diverse candidate computational models, 2) more systematic analysis of the temporal regularity effects on the model fit, and 3) tests on clinical samples would have strengthened the study. The work will be of interest to pain researchers working on computational models and cognitive mechanisms of pain in a Bayesian framework.

      Thank you very much again for considering the manuscript and judging it as a valuable contribution to understanding mechanisms of pain perception. We recognise the above-mentioned points of improvement and elaborate on them in the initial response to the reviewers.

      Response to the reviewers

      Reviewer 1:

      Reviewer Comment 1.1 — Selection of candidate computational models: While the paper juxtaposes the simple model-free RL model against a Kalman Filter model in the context of pain perception, the rationale behind this choice remains ambiguous. It prompts the question: could other RL-based models, such as model-based RL or hierarchical RL, offer additional insights? A more detailed explanation of their computational model selection would provide greater clarity and depth to the study.

      Initial reply: Thank you for this point. Our models were selected a-priori, following the modelling strategy from Jepma et al. (2018) and hence considered the same set of core models for clear extension of the analysis to our non-cue paradigm. The key question for us was whether expectations were used to weight the behavioural estimates, so our main interest was to compare expectation vs non-expectation weighted models.

      Model-based and hierarchical RL are very broad terms that can be used to refer to many different models, and we are not clear about which specific models the reviewer is referring to. Our Bayesian models are generative models, i.e. they learn the generative statistics of the environment (which is characterised by inherent stochasticity and volatility) and hence operate model-based analyses of the stimulus dynamics. In our case, this happened hierarchically and it was combined with a simple RL rule.

      Revised reply: We clarified our modelling choices in the “Modelling strategy” subsection of the results section.

      Reviewer Comment 1.2 — Effects of varying levels of volatility and stochasticity: The study commendably integrates varying levels of volatility and stochasticity into its experimental design. However, the depth of analysis concerning the effects of these variables on model fit appears shallow. A looming concern is whether the superior performance of the expectation-weighted Kalman Filter model might be a natural outcome of the experimental design. While the non-significant difference between eKF and eRL for the high stochasticity condition somewhat alleviates this concern, it raises another query: Would a more granular analysis of volatility and stochasticity effects reveal fine-grained model fit patterns?

      Initial reply: We are sorry that the reviewer finds “the depth of analysis concerning the effects of these variables on model fit” shallow. We are not sure which analysis the reviewer has in mind when suggesting a “more granular analysis of volatility and stochasticity effects” to “reveal fine-grained model fit patterns”. Therefore, we find it difficult to improve our manuscript in this regard. We are happy to add analyses to our paper, but we would be grateful for some specific pointers. We have already provided:

      •    Analysis of model-naive performance across different levels of stochasticity and volatility (section 2.3, figure 3, supplementary information section 1.1 and tables S1-2)

      •    Model fitting for each stochasticity/volatility condition (section 2.4.1, figure 4, supplementary table S5)

      •    Group-level and individual-level differences of each model parameter across stochasticity/volatility conditions (supplementary information section 7, figures S4-S5).

      •    Effect of confidence on scaling factor for each stochasticity/volatility condition (figure 5)

      Reviewer Comment 1.3 — Rating instruction: According to Fig. 1A, participants were prompted to rate their responses to the question, “How much pain DID you just feel?” and to specify their confidence level regarding their pain. It is difficult for me to understand the meaning of confidence in this context, given that they were asked to report their *subjective* feelings. It might have been better to query participants about perceived stimulus intensity levels. This perspective is seemingly echoed in lines 100-101, “the primary aim of the experiment was to determine whether the expectations participants hold about the sequence inform their perceptual beliefs about the intensity of the stimuli.”

      Initial reply: Thank you for raising this question, which allows us to clarify our paradigm. On half of the trials, participants were asked to report the perceived intensity of the previous stimulus; on the remaining trials, participants were requested to predict the intensity of the next stimulus. Therefore, we did query “participants about perceived stimulus intensity levels”, as described at lines 49-55, 296-303, and depicted in figure 1.

      The confidence rating refers to how confident participants are in the rating they have just given, that is, how sure they are of it. It is collected in addition to the perceived stimulus intensity, and this approach has been used in a large body of previous studies across sensory modalities.

      Reviewer Comment 1.4 — Relevance to clinical pain: While the authors underscore the relevance of their findings to chronic pain, they did not include data pertaining to clinical pain. Notably, their initial preprint seemed to encompass data from a clinical sample (https://www.medrxiv.org/content/10.1101/2023.03.23.23287656v1), which, for reasons unexplained, has been omitted in the current version. Clarification on this discrepancy would be instrumental in discerning the true relevance of the study’s findings to clinical pain scenarios.

      Initial reply: The preprint that the Reviewer is referring to was an older version of the manuscript in which we combined two different experiments, which were initially born as separate studies: the one that we submitted to eLife (done in the lab, with noxious stimuli in healthy participants) and an online study with a different statistical learning paradigm (without noxious stimuli, in chronic back pain participants). Unfortunately, the paradigms were different and not directly comparable. Indeed, following submission to a different journal, the manuscript was criticised for this reason. We therefore split the paper in two, and submitted the first study to eLife. We are now planning to perform the same lab-based experiment with noxious stimuli on chronic back pain participants. Progress on this front has been slowed down by the fact that I (Flavia Mancini) am on maternity leave, but it remains top priority once back to work.

      Reviewer Comment 1.5 — Paper organization: The paper’s organization appears a little bit weird, possibly due to the removal of significant content from their initial preprint. Sections 2.1-2.2 and 2.4 seem more suitable for the Methods section, while 2.3 and 2.4.1 are the only parts that present results. In addition, enhancing clarity through graphical diagrams, especially for the experimental design and computational models, would be quite beneficial. A reference point could be Fig. 1 and Fig. 5 from Jepma et al. (2018), which similarly explored RL and KF models.

      Initial reply: Thank you for these suggestions. We will consider restructuring the paper in the revised version.

      Revised reply: We restructured the introduction, results and parts of the methods. We followed the reviewer’s suggestion regarding enhancing clarity through graphical diagrams. We have visualised the experimental design in Figure 1D. Furthermore, we have visualised the two main computational models (eRL and eKF) in Figure 2, following Jepma et al. (2018). As a result, we have updated the notation in Section 4.4 to be clearer and consistent with the graphical representation (renaming the variable referring to observed thermal input from O_t to N_t).

      Reviewer Comment 1.6 — In lines 99-100, the statement “following the work by [23]” would be more helpful if it included a concise summary of the main concepts from the referenced work.

      - It would be helpful to have descriptions of the conditions that Figure 1C is elaborating on.

      - In line 364, the “N_{t}” in the sentence “The observation on trial t, N_{t}”, should be O_{t}.

      Initial reply: Thank you for spotting these and for providing the suggestions. We will include the correction in the revised version.

      Revised reply: We have added the following regarding the lines 99-100:

      “We build on the work by [23], who show that pain perception is strongly influenced by expectations as defined by a cue that predicts high or low pain. In contrast to the cue-paradigm from [23], the primary aim of our experiment was to determine whether the expectations participants hold about the sequence itself inform their perceptual beliefs about the intensity of the stimuli.”

      See the comment in the previous reply regarding the notation change from O_t to N_t.

      Reviewer 2:

      Reviewer Comment 2.1 — This is a highly interesting and novel finding with potential implications for the understanding and treatment of chronic pain where pain regulation is deficient. The paradigm is clear, the analysis is state-of-the-art, the results are convincing, and the interpretation is adequate.

      Initial reply: Thank you very much for these positive comments.

      Reviewer 3:

      Summary:

      I am pleased to have had the opportunity to review this manuscript, which investigated the role of statistical learning in the modulation of pain perception. In short, the study showed that statistical aspects of temperature sequences, with respect to specific manipulations of stochasticity (i.e., randomness of a sequence) and volatility (i.e., speed at which a sequence unfolded) influenced pain perception. Computational modelling of perceptual variables (i.e., multi-dimensional ratings of perceived or predicted stimuli) indicated that models of perception weighted by expectations were the best explanation for the data. My comments below are not intended to undermine or question the quality of this research. Rather, they are offered with the intention of enhancing what is already a significant contribution to the pain neuroscience field. Below, I highlight the strengths and weaknesses of the manuscript and offer suggestions for incorporating additional methodological details.

      Strengths:

      The manuscript is articulate, coherent, and skilfully written, making it accessible and engaging.

      - The innovative stimulation paradigm enables the exploration of expectancy effects on perception without depending on external cues, lending a unique angle to the research.

      - By including participants’ ratings of both perceptual aspects and their confidence in what they perceived or predicted, the study provides an additional layer of information to the understanding of perceptual decision-making. This information was thoughtfully incorporated into the modelling, enabling the investigation of how confidence influences learning.

      - The computational modelling techniques utilised here are methodologically robust. I commend the authors for their attention to model and parameter recovery, a facet often neglected in previous computational neuroscience studies.

      - The well-chosen citations not only reflect a clear grasp of the current research landscape but also contribute thoughtfully to ongoing discussions within the field of pain neuroscience.

      Initial reply: We are really grateful for the reviewer’s insightful comments and for the useful guidance regarding our methodology. We are also thankful to the reviewer for highlighting the strengths of our manuscript. Below we respond to each weakness mentioned in the reviewer’s report.

      Reviewer Comment 3.1 — In Figure 1, panel C, the authors illustrate the stimulation intensity, perceived intensity, and prediction intensity on the same scale, facilitating a more direct comparison. It appears that the stimulation intensity has been mathematically transformed to fit a scale from 0 to 100, aligning it with the intensity ratings corresponding to either past or future stimuli. Given that the pain threshold is specifically marked at 50 on this scale, one could logically infer that all ratings falling below this value should be deemed non-painful. However, I find myself uncertain about this interpretation, especially in relation to the term ”arbitrary units” used in the figure. I would greatly appreciate clarification on how to accurately interpret these units, as well as an explanation of the relationship between these values and the definition of pain threshold in this experiment.

      Initial reply: Indeed, as detailed in the Methods section 4.3, the stimulation intensity was originally transformed from the 1-13 scale to 0-100 scale to match the scales in the participant response screens.

      Following the method used to establish the pain threshold, we set the stimulus intensity of 7 as the threshold on the original 1-13 scale. However, during the rating part of the experiment, several of the participants never or very rarely selected a value above 50 (their individually defined pain threshold), despite having previously indicated, during the pain threshold procedure, the point at which a stimulus becomes painful. As a result, for these participants neither the re-scaled intensity values nor the perception ratings, both on the same 0-100 scale of arbitrary units, go above the pain threshold. Please see all participant ratings and inputs in the Figure below. We see that it would be more illustrative to re-plot Figure 1 with a different exemplary participant, whose ratings go above the pain threshold, perhaps with the input intensity on the 1-13 scale on an additional right-hand-side y-axis. We will add this in the revised version and highlight the fact above.

      Importantly, while values below 50 are deemed non-painful by participants, the thermal stimulation still activates C-fibres involved in nociception, and we would argue that the modelling framework and analysis still applies in this case.
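      The rescaling discussed in this reply can be illustrated with a minimal sketch. It assumes a plain linear (min-max) mapping from the 1-13 stimulus scale to the 0-100 rating scale; the reply does not state the exact transform, so the function name and the linear form are assumptions made for illustration. Under this assumption, level 7 lands exactly on 50, which agrees with the pain threshold marked on the 0-100 scale.

```python
def rescale_intensity(level, lo=1, hi=13, out_lo=0.0, out_hi=100.0):
    """Linearly map a stimulus level on the lo-hi scale onto out_lo-out_hi.

    Hypothetical helper: the linear form is an assumption, not the
    authors' documented transform.
    """
    if not lo <= level <= hi:
        raise ValueError(f"level {level} is outside the {lo}-{hi} scale")
    return out_lo + (level - lo) * (out_hi - out_lo) / (hi - lo)


# Rescale every level of the original scale and locate the threshold.
rescaled = [rescale_intensity(level) for level in range(1, 14)]
threshold_on_rating_scale = rescale_intensity(7)
```

      With this mapping, any stimulus below level 7 falls under 50 on the rating scale, which is the region the reviewer describes as non-painful.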

      Revised reply: We re-plotted Figure 1E-F with a different exemplary participant, whose ratings go above the pain threshold. We also included all participant pain perception and prediction ratings, noxious input sequences and confidence ratings in the supplement in Figures S1-S3.

      Reviewer Comment 3.2 — The method of generating fluctuations in stimulation temperatures, along with the handling of perceptual uncertainty in modelling, requires further elucidation. The current models appear to presume that participants perceive each stimulus accurately, introducing noise only at the response stage. This assumption may fail to capture the inherent uncertainty in the perception of each stimulus intensity, especially when differences in consecutive temperatures are as minimal as 1°C.

      Initial reply: We agree with the reviewer that there are multiple sources of uncertainty involved in the process of rating the intensity of thermal stimuli, including perceptual uncertainty. In order to include an account of inaccurate perception, one would have to consider the different sources that contribute to it, of which there may be many. In our approach, we consider one, which is captured in the expectation-weighted models and most clearly exemplified in the expectation-weighted Kalman filter model (eKF). The model treats participants’ perception of the input as an imperfect indicator of the true level of pain. In this case, perception is corrupted as a result of the expectation participants hold about the upcoming stimuli. The extent of this effect is partly governed by a subjective level of noise ϵ, which may also subsume other sources of uncertainty beyond the expectation effect. Moreover, the response noise ξ could also subsume any other unexplained sources of noise.
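      To make the roles of the two noise terms concrete, the following is a rough sketch of a generic scalar Kalman-style update. It is not the authors' eKF specification: the function names, the volatility term `vol_sd`, and the exact form of the update are assumptions chosen for illustration. It only shows how a subjective noise level (here `eps_sd`, standing in for ϵ) can pull perception towards the prior expectation, and how a separate response noise (here `xi_sd`, standing in for ξ) can be added at the rating stage.

```python
import random


def kalman_perception_step(exp_mean, exp_var, noxious_input, eps_sd, vol_sd):
    """One trial of a generic scalar Kalman-style update (illustrative only).

    exp_mean, exp_var: current expectation about the stimulus and its variance.
    eps_sd: subjective noise on the input; larger values favour the expectation.
    vol_sd: assumed drift of the underlying sequence between trials.
    """
    gain = exp_var / (exp_var + eps_sd ** 2)
    # Perception is a precision-weighted blend of expectation and input.
    perception = exp_mean + gain * (noxious_input - exp_mean)
    posterior_var = (1.0 - gain) * exp_var
    # The expectation for the next trial inherits the percept plus drift.
    next_mean = perception
    next_var = posterior_var + vol_sd ** 2
    return perception, next_mean, next_var


def noisy_rating(perception, xi_sd, rng):
    """Add Gaussian response noise to a percept to produce a rating."""
    return perception + rng.gauss(0.0, xi_sd)


# Example: expectation and input are weighted equally when exp_var == eps_sd**2.
percept, next_mean, next_var = kalman_perception_step(
    exp_mean=40.0, exp_var=4.0, noxious_input=60.0, eps_sd=2.0, vol_sd=1.0
)
rating = noisy_rating(percept, xi_sd=1.5, rng=random.Random(0))
```

      In this sketch, increasing `eps_sd` lowers the gain, so the percept stays closer to the expectation, which is the qualitative behaviour the reply attributes to expectation weighting.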

      Author response image 1.

      Stimulus intensity transformation

      Revised reply: We clarified our modelling choices in the “2.2 Modelling strategy” subsection.

      Reviewer Comment 3.3 — A key conclusion drawn is that eKF is a better model than eRL. However, a closer examination of the results reveals that the two models behave very similarly, and it is not clear that they can be readily distinguished based on model recovery and model comparison results.

      Initial reply: While the eKF appears to rank higher than the eRL in terms of LOOIC and sigma effects, we do not wish to make sweeping statements regarding the significance of differences between eRL and eKF, but merely point to the trend in the data. We shall make this clearer in the revised version of the manuscript. However, the most important result is that the models involving expectation-weighting arguably capture the data better.

      Revised reply: We elaborated on the significance statements in the “Modelling Results” subsection:

      • We considered at least a 2 sigma effect as indication of a significant difference. In each condition, the expectation weighted models (eKF and eRL) provided better fit than models without this element (KF and RL; approx. 2-4 sigma difference, as reported in Figure 5A-D). This suggests that regardless of the levels of volatility and stochasticity, participants still weigh perception of the stimuli with their expectation.

      and in the first paragraph of the Discussion:

      • When varying different levels of inherent uncertainty in the sequences of stimuli (stochasticity and volatility), the expectation and confidence weighted models fitted the data better than models weighted for confidence but not for expectations (Figure 5A-D). The expectation-weighted Bayesian (KF) model offered a better fit than the expectation-weighted, model-free RL model, although in conditions of high stochasticity this difference was short of significance. Overall, this suggests that participants’ expectations play a significant role in the perception of sequences of noxious stimuli.

      We are aware of the limitations and lack of clear guidance regarding using sigma effects to establish significance (as per reviewer’s suggestion: https://discourse.mc-stan.org/t/loo-comparison-in-reference-to-standard-error/4009). Here we decided to use the above-mentioned threshold of 2-sigma as an indication of significance, but note the potential limitations of the inferences, especially when distinguishing between eRL/eKF models.
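      The 2-sigma rule described above can be sketched as follows. This is an illustration under stated assumptions: the sigma effect is taken to be the absolute difference in expected log predictive density (ELPD) divided by the standard error of that difference, the function names are hypothetical, and the example numbers are invented rather than taken from the paper.

```python
def sigma_effect(elpd_diff, se_diff):
    """Size of an ELPD difference in units of its standard error."""
    if se_diff <= 0:
        raise ValueError("se_diff must be positive")
    return abs(elpd_diff) / se_diff


def exceeds_threshold(elpd_diff, se_diff, n_sigma=2.0):
    """Apply an n-sigma rule of thumb; n_sigma=2 mirrors the reply above."""
    return sigma_effect(elpd_diff, se_diff) >= n_sigma


# Invented example values, for illustration only.
clear_case = exceeds_threshold(elpd_diff=-12.0, se_diff=4.0)   # 3.0 sigma
marginal_case = exceeds_threshold(elpd_diff=5.0, se_diff=4.0)  # 1.25 sigma
```

      Because there is no agreed-upon cutoff, `n_sigma` is left as a parameter so that the sensitivity of a conclusion to the chosen threshold can be checked.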

      Reviewer Comment 3.4 — Regarding model recovery, the distinction between the eKF and eRL models seems blurred. When the simulation is based on the eKF, there is no ability to distinguish whether either eKF or eRL is better. When the simulation is based on the eRL, the eRL appears to be the best model, but the difference with eKF is small. This raises a few more questions. What is the range of the parameters used for the simulations?

      Initial reply: We agree that the distinction between eKF and eRL in the model recovery is not that clean-cut, which may in turn point to the similarity between the two models. To simulate the data for the model and parameter recovery analysis, we used the group means and variances estimated on the participant data to sample individual parameter values.

      Reviewer Comment 3.5 — Is it possible that either eRL or eKF are best when different parameters are simulated? Additionally, increasing the number of simulations to at least 100 could provide more convincing model recovery results.

      Initial reply: It could be a possibility, but it would require further investigation and comparison of fits for different bins/ranges of parameters to see if there is any consistent advantage of one model over another in each bin. We will consider adding this analysis, and provide an additional 50 simulations to paint a more convincing picture.

      Revised reply: We increased the number of simulations per model pair to ≈ 100 (after rejecting fits based on diagnostics criteria - E-BFMI and divergent transitions) and updated the confusion matrix (Table S4). Although the confusion between eRL and eKF remains, the model recovery shows good distinction between expectation weighted vs non-expectation weighted (and Random) models, which supports our main conclusion in the paper.
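      A model-recovery confusion matrix of the kind referred to above can be tabulated as in the sketch below. The helper names and the toy input are hypothetical and the numbers are invented; only the model labels (eKF, eRL, KF, RL, Random) come from the response. Each input pair records which model generated a simulated data set and which model fitted it best.

```python
from collections import Counter

MODELS = ["eKF", "eRL", "KF", "RL", "Random"]
EXPECTATION_WEIGHTED = {"eKF", "eRL"}


def confusion_matrix(pairs, models=MODELS):
    """Row-normalised table: P(best-fitting model | simulated model)."""
    counts = Counter(pairs)
    matrix = {}
    for sim in models:
        total = sum(counts[(sim, fit)] for fit in models)
        matrix[sim] = {
            fit: (counts[(sim, fit)] / total if total else 0.0) for fit in models
        }
    return matrix


def family_recovery(pairs):
    """Share of simulations recovered within the correct model family."""
    if not pairs:
        raise ValueError("no simulations supplied")
    hits = sum(
        (sim in EXPECTATION_WEIGHTED) == (fit in EXPECTATION_WEIGHTED)
        for sim, fit in pairs
    )
    return hits / len(pairs)


# Toy (simulated_model, best_fitting_model) pairs, invented for illustration.
toy_pairs = [
    ("eKF", "eKF"), ("eKF", "eRL"), ("eRL", "eRL"), ("eRL", "eRL"),
    ("KF", "KF"), ("RL", "RL"), ("Random", "Random"),
]
toy_matrix = confusion_matrix(toy_pairs)
toy_family_score = family_recovery(toy_pairs)
```

      The family-level score separates the two questions raised in the review: eKF and eRL may be confused with each other while the expectation-weighted family as a whole is still recovered reliably.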

      Reviewer Comment 3.6 — Regarding model comparison, the authors reported that “the expectation-weighted KF model offered a better fit than the eRL, although in conditions of high stochasticity, this difference was short of significance against the eRL model.” This interpretation is based on a significance test that hinges on the ratio between the ELPD and the surrounding standard error (SE). Unfortunately, there’s no agreed-upon threshold of SEs that determines significance, but a general guideline is to consider “several SEs,” with a higher number typically viewed as more robust. However, the text lacks clarity regarding the specific number of SEs applied in this test. At a cursory glance, it appears that the authors may have employed 2 SEs in their interpretation, while only depicting 1 SE in Figure 4.

      Initial reply: Indeed, we considered 2 sigma effect as a threshold, however we recognise that there is no agreed-upon threshold, and shall make this and our interpretation clearer regarding the trend in the data, in the revision.

      Revised reply: We clarify this further, as per our revised response to Comment 3.3 above. We have also added the following statement in section 4.5.1 (Methods, Model comparison): “There’s no agreed-upon threshold of SEs that determines significance, but the higher the sigma difference, the more robust is the effect.”

      Reviewer Comment 3.7 — With respect to parameter recovery, a few additional details could be included for completeness. Specifically, while the range of the learning rate is understandably confined between 0 and 1, the range of other simulated parameters, particularly those without clear boundaries, remains ambiguous. Including scatter plots with the simulated parameters on the x-axis and the recovered parameters on the y-axis would effectively convey this missing information.

      Furthermore, it would be beneficial for the authors to clarify whether the same priors were used for both the modelling results presented in the main paper and the parameter recovery presented in the supplementary material.

      Initial reply: Thanks for this comment and for the suggestions. To simulate the data for the model and parameter recovery analysis, we used the group means and variances estimated on the participant data to sample individual parameter values. The priors on the group and individual-level parameters in the recovery analysis were the same as in the fitting procedure. We will include the requested scatter plots in the next iteration of the manuscript.

      Revised reply: We included parameter recovery scatter plots for each model and parameter in Supplementary Figures S7-S11.

      Reviewer Comment 3.8 — While the reliance on R-hat values for convergence in model fitting is standard, a more comprehensive assessment could include estimates of the effective sample size (bulk ESS and/or tail ESS) and the Estimated Bayesian Fraction of Missing Information (EBFMI), to show efficient sampling across the distribution. Consideration of divergences, if any, would further enhance the reliability of the results.

      Initial reply: Thank you very much for this suggestion, we will aim to include these measures in the revised version.

      Revised reply: We have considered the suggested diagnostics and include bulk and tail ESS values for each condition, model, and parameter in Supplementary Tables S6-S9. We also report the number of chains with low E-BFMI (0), the number of divergent transitions (0), and the E-BFMI values per chain in Table S10.
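
      For readers unfamiliar with the E-BFMI diagnostic, the sketch below computes it for a single chain from its sampler energy trace, as the sum of squared successive energy differences divided by the sum of squared deviations from the mean energy. This is an illustration under stated assumptions, not the authors' code: the energy values are hypothetical, the 0.3 warning threshold is an assumption (tools differ, with 0.2 also in use), and library implementations may normalise the two sums slightly differently.

```python
from statistics import fmean


def e_bfmi(energies):
    """E-BFMI for one chain: sum of squared successive energy differences
    divided by the sum of squared deviations from the mean energy."""
    if len(energies) < 2:
        raise ValueError("need at least two energy values")
    mean_energy = fmean(energies)
    numerator = sum((b - a) ** 2 for a, b in zip(energies, energies[1:]))
    denominator = sum((e - mean_energy) ** 2 for e in energies)
    if denominator == 0:
        raise ValueError("energy trace is constant")
    return numerator / denominator


# Hypothetical per-chain energy traces, for illustration only.
chains = {
    "chain_1": [10.2, 11.5, 9.8, 12.1, 10.7, 11.9],
    "chain_2": [10.0, 10.1, 10.2, 10.3, 10.4, 10.5],  # slowly drifting energy
}
LOW_E_BFMI = 0.3  # assumed warning threshold; 0.2 is also used in practice
for name, energy in chains.items():
    value = e_bfmi(energy)
    print(name, round(value, 3), "LOW" if value < LOW_E_BFMI else "ok")
```

      A chain whose energy changes only slowly between iterations, like the second hypothetical trace, yields a small ratio and would be flagged.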

      Reviewer Comment 3.9 — The authors write: “Going beyond conditioning paradigms based in cuing of pain outcomes, our findings offer a more accurate description of endogenous pain regulation.” Unfortunately, this statement isn’t substantiated by the results. The authors did not engage in a direct comparison between conditioning and sequence-based paradigms. Moreover, even if such a comparison had been made, it remains unclear what would constitute the gold standard for quantifying “endogenous pain regulation.”

      Initial reply: This is a valid point; indeed, we do not compare paradigms in our study, and we will remove this statement in the future version.

      Revised reply: We have removed this statement from the revised version.

      Reviewer Comment 3.10 — In relation to the comment on model comparison in my public review, I believe the following link may provide further insight and clarify the basis for my observation. It discusses the use of standard error in model comparison and may be useful for the authors in addressing this particular point: https://discourse.mc-stan.org/t/loo-comparison-in-referenceto-standard-error/4009

      Initial reply: Thank you for this suggestion, we will consider the forum discussion in our manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer 1 (Public review):

      (1) The authors state that they have reclassified the allelic expression status of 32 genes (shown in Table S5, Supplementary Figure 3). The concern is the source of the tissue or cell line which was originally used to make the classification of XCI status, and whether the comparisons are equivalent. For example, if cell lines (and not tissues) were used to define the XCI status for EGFL6, TSPAN6, and CXorf38, then how can the authors be sure that the escape status in whole tissues would be the same? Also, along these lines, the authors should consider whether escape status in previous studies using immortalized/cancer cell lines (such as the meta-analyses done in Balaton publication) would be different compared to healthy tissues (seems like it should be). Therefore, making comparisons between healthy whole tissues and cancer cell lines doesn't make sense.

      Indeed, many previous classifications were based on clonal cell lines, which could result in atypical patterns of escape due to the profound and varied effects of adaptation to culture. However, one of the primary goals of our study was to directly determine allele-specific expression from the X-chromosome in healthy primary tissues, in part to exclude the potential confounding effects of cell culture. 

      Whereas we do perform comparisons with cell culture-based classifications, we also provide detailed comparisons with the previous classification of Tukiainen et al, which also uses primary human tissues. In addition, whereas the comparison with Balaton et al is not optimal, we hold that it is valuable as it reveals which genes may exhibit aberrant escape patterns in culture. Finally, despite the above reservations, our comparison revealed an overwhelming agreement with previous research, which suggests that in the vast majority of cases, escape appears to be correctly maintained in culture.

      (2) The authors note that skewed XCI is prevalent in the human population, and cite some publications (references 8, 10-12). If RNAseq data is available from these female individuals with skewed XCI (such as ref 12), the authors should consider using their allelic expression pipeline to identify XCI status of more X-linked genes.

      Indeed, we completely agree and are in the process of obtaining these data, which has proven complex and time-consuming in the current regulatory environment.

      (3) It has been well established that the human inactive X has more XCI escape genes compared to the mouse inactive X. In light of the author's observations across human tissues, how does the XCI status compare with the same tissues in mice?

      This is a very interesting point, and a comparison we are currently working on. However, this is a major undertaking and one that is outside the scope of this study. We do appreciate the differences between mice and humans at the X-chromosome level and could only speculate that the overlap is relatively small, as the number of escapees in mice has been shown to be far lower than in humans.

      Reviewer 2 (Public review):

      In my view there are only minor weaknesses in this work, that tend to come about due to the requirement to study individuals with highly skewed X inactivation. I wonder whether the cause of the highly skewed X inactivation may somehow influence the likelihood of observing tissue-specific escape from X inactivation. In this light, it would be interesting to further understand the genetic cause for the highly skewed X inactivation in each of these three cases in the whole exome sequencing data. Future additional studies may validate these findings using single-cell approaches in unrelated individuals across tissues, where there is normal X inactivation.

      We thank the reviewer for their positive assessment of our work. This is a point we have grappled with and continue to grapple with. We cannot rule out that the genetic cause of complete skewing may influence tissue-specific XCI. Moreover, the genetic cause for the non-mosaic XCI is currently unclear and is likely to vary between individuals, which could also result in inter-individual variation in tissue-specific escape. We are currently performing large prospective studies in the tissues of healthy females to specifically address this point.

      Reviewer 3 (Public review):

      There are very few, except that this escape catalogue is limited to 3 donors, based on a single (representative) tissue screen in 285 female donors, mostly using muscle samples. However, if only pituitary samples had been screened, nmXCI-1 would have been missed. Additional donors in the 285 representative samples cross a lower threshold of AE = 0.4. It would be worthwhile to query all tissues of the 285 donors to discover more nmXCI cases, as currently fewer than half of X-linked genes received a call using this very worthwhile approach.

      We thank the reviewer for their positive assessment of our work. Of course, we agree that a tissue-wide screen in all individuals would have been optimal and is a line of research we are currently pursuing. However, the analysis of allele-specific expression in all 5,000 RNA-seq samples is a massive undertaking and was simply not practicable within the time-scale of this study. 

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Thanks to the authors for an interesting manuscript! I enjoyed reading it and the care that has gone into explaining the analyses and the findings. There are a few recommendations that I have for strengthening the work.

      We thank the reviewer for the nice feedback. Much appreciated.

      (1) I would like to see a genetic analysis of the three individuals, to try and identify the genetic causes of the skewed X inactivation beyond just considering the XIC or translocations. The cause of the highly skewed X inactivation would be of interest to many.

      This is certainly a very interesting avenue of research and one that we are currently focusing on. However, in the current study we simply had too few skewed XCI females to assess this in an exhaustive manner. To tackle this issue, we have begun a prospective study of healthy females to identify additional non-mosaic females.

      (2) I wonder whether the cause of the skewed XCI may somehow influence the assessment of tissue-specific escape? If there is a problem with X inactivation itself, perhaps escape would also be different, making it appear more constitutive than tissue-specific?

      This is a point we have grappled with and continue to grapple with. We cannot rule out that the genetic cause of complete skewing may influence tissue-specific XCI. Moreover, the genetic cause for the non-mosaic XCI is currently unclear and is likely to vary between individuals, which could result in inter-individual variation in tissue-specific escape.

      (3) Presentation/wording suggestions:

      I think the abstract is likely a bit inaccessible to those outside the field. I am in the X inactivation field, but don't use the term non-mosaic X inactivation, but rather would call it highly skewed, or non-random X inactivation. In my view, it would be simpler for the abstract to call non-mosaic XCI highly skewed XCI instead, or to use more words to ensure it is clear for the reader.

      We agree that the terminology of completely skewed/non-mosaic XCI could be more clearly defined in the abstract and have clarified this. “Using females that are non-mosaic (completely skewed) for X-inactivation (nmXCI) has proven a powerful and natural genetic system for profiling X-inactivation in humans.”

      I would consider calling the always escape genes constitutive escapees, while the variable may be facultative.

      This is something we have also considered and have received differing feedback on. However, we will definitely keep this in mind for future publications.

      Line 132, it would be useful to explain median >0.475 as less than 2.5% of reads coming from the inactive allele here, not just in the methods. Can you also explain why this cutoff was chosen?

      We thank the reviewer for this clarification. A clarification has been added to the main text as suggested.

      The cutoff was applied to account for potential variations in skewing, given that we screened only a single tissue sample per individual. Although nmXCI females are theoretically expected to have 0% of reads originating from the 'inactive' allele, this is not always observed due to (a) technical errors such as PCR or sequencing inaccuracies, or (b) differences in skewing between tissue types.
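
      The correspondence between the 0.475 cutoff and 2.5% of reads can be checked with a short Python sketch. It assumes that allelic expression (AE) is defined as |0.5 − ref/(ref + alt)|, which is consistent with the stated correspondence but is not spelled out in this reply; the function names and read counts are hypothetical.

```python
from statistics import median

MEDIAN_AE_CUTOFF = 0.475  # screening cutoff quoted in the reply


def allelic_expression(ref_reads: int, alt_reads: int) -> float:
    """AE = |0.5 - ref/(ref + alt)| (assumed definition); ranges from 0 to 0.5."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads at this site")
    return abs(0.5 - ref_reads / total)


def minor_allele_fraction(ae: float) -> float:
    """Fraction of reads from the less-expressed allele implied by an AE value."""
    return 0.5 - ae


# Hypothetical (ref, alt) read counts at heterozygous SNPs in one sample.
snp_counts = [(990, 10), (0, 500), (480, 20)]
ae_values = [allelic_expression(ref, alt) for ref, alt in snp_counts]
is_candidate = median(ae_values) > MEDIAN_AE_CUTOFF
print(f"median AE = {median(ae_values):.3f}; candidate nmXCI sample: {is_candidate}")
```

      Under this definition, an AE of 0.475 leaves 0.5 − 0.475 = 0.025 of reads for the less-expressed allele, which is the 2.5% figure raised by the reviewer.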

      Lines 156-160 describe how the heterozygous SNPs were identified in relation to Figure 2. I read these in the methods so that I could understand Figure 1, so I suggest moving this section up.

      We have moved the section as suggested by the reviewer.

      Line 156, consider adding in a sentence to describe what is shown in Figures 2A and B i.e, the overlap of SNPs and spread along the X.

      We have added a sentence describing what is shown in Figures 2A and 2B as suggested by the reviewer.

      Line 217, it would be useful to give the % of genes that show tissue-specific escape, to quantify rare.

      We have added a sentence quantifying ‘rare’ at the suggested line.

      (4) Typos:

      Line 119, missing 'the most' before extensive (and remove an).

      We thank the reviewer for pointing this out. This error has been corrected.

      Reviewer #3 (Recommendations for the authors):

      Some results in the supplementary figures were quite striking. What is going on with DDX3X and ZRSR2? How come total read counts are so different between individuals?

      Indeed, this is a very intriguing observation and one that we have simply failed to understand thus far. We are currently performing a large prospective study to obtain a greater number of non-mosaic females and tissue samples. Hopefully, additional observations across females will allow us to gain further insights into the inter-individual behaviour of DDX3X and ZRSR2.

      One item I would like to see added is some analysis to address the cause of these extremely skewed XCI individuals. The copy number analysis suggests there are some segmental deletions on the X in all three nmXCI cases. Where are these deletions, and do any fall in the region of the X-inactivation centre? Have the authors performed any analysis of potentially deleterious X-linked variants in the WGS or WES data? Why are these donors so skewed? It's interesting that UPIC was still more skewed than the other two.

      The features the reviewer points out are not segmental deletions: the same variation in coverage is found in all females we have looked at, including females with mosaic XCI (see Author response image 1 below, where the same pattern of slightly lower read counts is observed at the same sites in all female samples). No deletions were identified in the XIC region. No analysis was performed of deleterious X-linked variants. Why the donors are so skewed is unknown and intriguing. Indeed, identifying the origin of extreme skewing (including in the females in this study) is now the main focus of the group. Whereas UPIC had trisomy 17, which has likely resulted in the observed skewing, we have not yet found a genetic variant that could explain the skewing observed in 13PLJ or ZZPU.

      Author response image 1.

      Copy number as log2 ratio using 500kb bins across the X-chromosome for 3 mosaic XCI females (1QPFJ, OXRO, and RU1J) and 3 nmXCI females, UPIC, nmXCI-1 and nmXCI-2.

      This is not necessary to address with new analyses, but as alluded to above, the authors could screen more than a single representative tissue. And to apply this analysis to larger databases (UK biobank), which the authors may be planning to do already.

      This is an avenue of research we are currently investigating.

      The code is well-documented and accessible. Additional information on the manual reclassification (to deal with inflated binomial P-values) would be helpful. Why not require a minimal threshold for escape (10% of active X allele) in addition to a significant binomial P (inactive X exp. > 2.5% of active)?

      We thank the reviewer for this positive assessment of the code. 

      Indeed, how to define ‘escape’ is a vexed issue, and one we feel has been given undue weight within the field. In reality, studies of escape are often dealing with sparse data (e.g. read depth), few observations (genes and individuals) and substantial amounts of missing data. Thus, it is unlikely that a standard statistical approach will be sensitive and specific across different studies and data types. Similarly, cut-offs, though useful, would also need to be adjusted to the data type and quality in any given study.

      Whereas we initially used a significant binomial P-value as our sole test (often quoted as ‘best practice’), this resulted in widespread inflation of P-values. Thus, we switched to manually curating the allelic expression status of all 380 genes using the empirical guideline of allelic ratio >0.4 (also a commonly used cut-off) as indicating mono-allelic expression. We considered combining the binomial P-value with the cut-off but felt that this would result in an overly complex definition of escape and would unnecessarily exclude many genes from classification, due to the opposing effects of low/high read depth on the binomial and cut-off approaches respectively.

      Indeed, it is because accurate and objective ‘classification’ of escape is so difficult that we placed an emphasis on clearly displaying all data for each gene in each individual, allowing readers to see the data on which each classification was based.
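
      The opposing effects of read depth on the two criteria can be illustrated with a small Python sketch. It is not the authors' pipeline: the null hypothesis (at most 2.5% of reads from the inactive allele), the AE definition |0.5 − major/(major + minor)|, and the read counts are assumptions chosen for illustration.

```python
import math

NULL_MINOR_FRACTION = 0.025  # assumed null: at most 2.5% of reads from the inactive allele
AE_CUTOFF = 0.4              # allelic-ratio guideline quoted in the reply


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p); each term is built via log-gamma
    so that large read depths do not overflow."""
    def log_pmf(i: int) -> float:
        return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                + i * math.log(p) + (n - i) * math.log1p(-p))
    return min(1.0, sum(math.exp(log_pmf(i)) for i in range(k, n + 1)))


def summarize(major_reads: int, minor_reads: int):
    """Return (AE, one-sided binomial P-value) for one gene in one sample."""
    n = major_reads + minor_reads
    ae = abs(0.5 - major_reads / n)
    p_value = binom_upper_tail(minor_reads, n, NULL_MINOR_FRACTION)
    return ae, p_value


# Hypothetical (major, minor) read counts: one deep and one shallow gene.
for major, minor in [(1920, 80), (19, 1)]:
    ae, p_value = summarize(major, minor)
    print(f"depth={major + minor}: AE={ae:.2f} "
          f"(mono-allelic by cut-off: {ae > AE_CUTOFF}), binomial P={p_value:.3g}")
```

      In this toy example the deep gene has only 4% of reads from the minor allele and passes the AE > 0.4 guideline, yet its binomial P-value is very small; the shallow gene has a similar minor fraction but cannot reach significance. Requiring both criteria would therefore tend to leave such genes unclassified.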

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank both reviewers for their supportive comments. Reviewer 1 has suggested a different data processing strategy to better resolve subunits at the CALHM4/CALHM2 interface:

      I recommend an alternative data processing strategy. First, refine particles with 2-4 CALHM4 subunits with symmetry imposed. This is followed by symmetry expansion, signal subtraction of two adjacent subunits, and subsequent classification and refinement of the subtracted particles. This approach, while not guaranteed, can potentially provide a clearer definition of CALHM2 and CALHM4 interfaces and show whether CALHM2 subunits adopt different conformations based on their proximity to CALHM4 subunits.

      We have followed the recommended strategy in an attempt to improve the resolution and better resolve the structural heterogeneity in CALHM2/4 channels. To this end, we have combined symmetry expansion and partial signal subtraction, as suggested by the reviewer. Initially, a symmetrized (C11) 3.4 Å consensus map of undecameric CALHM2/4 channels bound to sybodies SbC2 and SbC4 was used. The particles of this reconstruction were subjected to symmetry expansion (C11) followed by signal subtraction of nine adjacent subunits. Next, we performed focused, alignment-free 3D classification of the remaining two subunits followed by refinement of these classes, leading to the classification of CALHM subunit pairs. The majority of the classes feature well-resolved CALHM2 pairs, consistent with the original approach (Author response image 1A). A minority of the classes contain CALHM4 subunits, revealing heterogeneity similar to regions of CALHM4 subunits observed in the non-symmetrized channel reconstruction (Author response image 1B). Unfortunately, this approach thus did not improve resolution or facilitate a more accurate subunit assignment. Consequently, we decided not to include these attempts in our manuscript. The resubmitted version thus contains only small corrections compared to the previous version.
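
      The bookkeeping behind the symmetry expansion and partial signal subtraction described above can be sketched in a few lines of Python. This is only an illustration of the counting involved, with hypothetical function names and particle identifiers; it does not reproduce the cryo-EM software or parameters used for the actual processing.

```python
N_SUBUNITS = 11  # undecameric channel, C11 symmetry


def symmetry_expand(particle_ids, n_fold=N_SUBUNITS):
    """Yield (particle_id, copy_index, rotation_deg) for every symmetry-related copy."""
    step = 360.0 / n_fold
    for pid in particle_ids:
        for k in range(n_fold):
            yield pid, k, k * step


def retained_pair(copy_index, n_fold=N_SUBUNITS):
    """Indices of the two adjacent subunits kept after subtracting the other nine."""
    return copy_index, (copy_index + 1) % n_fold


# Hypothetical particle identifiers, for illustration only.
expanded = list(symmetry_expand(["particle_001", "particle_002"]))
print(len(expanded), "expanded copies from 2 particles")
print("pair kept for the last copy:", retained_pair(N_SUBUNITS - 1))
```

      Each particle thus contributes eleven subunit-pair images to the focused, alignment-free classification, one for each position around the ring.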

      Author response image 1.

      Classification of subunit pairs of undecameric CALHM2/4 channels bound to sybodies SbC2 and SbC4 after the processing combining symmetry expansion and partial signal subtraction. (A) Classes showing CALHM2 subunit pairs. (B) Classes showing subunits at interfaces to CALHM4.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      Summary:

      Left-right asymmetry in the developing embryo is important for establishing correct lateralisation of the internal organs, including the gut. It has been shown previously that the dorsal mesentery (DM), which supports looping of the endodermal gut tube during development, is asymmetric with sharp delineation of left and right domains prior to gut looping. The authors set out to investigate the nature of the midline barrier that separates the left and right sides of the DM. They identify a transient basement membrane-like structure which is organised into two layers between the notochord and descending endoderm. In the time window when this basement membrane structure exists, there is no diffusion or cell mixing between the left and right sides of the DM, but once this structure starts breaking down, mixing and diffusion occur. This suggests it acts as a barrier, both physical and chemical, between left and right at the onset of gut lateralisation.

      Strengths:

      The authors identify a new midline structure that likely acts as a barrier to facilitate left and right separation during early organogenesis. This is an interesting addition to the field of laterality, with relevance to laterality-related disorders including heterotaxia, and may represent a gut-specific mechanism for establishing and maintaining early left-right asymmetry. The structure of this midline barrier appears to be an atypical basement membrane, comprising two adjacent basement membranes. The complexities of basement membrane assembly, maintenance, and function are of importance in almost all organismal contexts. Double basement membranes have been previously reported (for example in the kidney glomeruli as the authors note), and increasing evidence suggests that atypical basement membrane organisation or consideration is likely to be more prevalent than previously appreciated. Thus this work is both novel and broadly interesting.

      The data presented are well executed, using a variety of well-established methods. The characterisation of the midline barrier at the stages examined is extensive, and the data around the correlation between the presence of the midline barrier and molecular diffusion or cell mixing across the midline are convincing.

      Weaknesses:

      The study is rather descriptive, and the authors' hypotheses around the origins of the midline barrier are speculative and not experimentally demonstrated. While several potential origins of the midline are excluded raising interesting questions about the timing and cell-type-specific origin of the midline basement membrane, these remain unanswered which limits the scope of the paper.

      We extend our appreciation to Reviewer #1 for their thoughtful and comprehensive evaluation of our work, recognizing the considerable time and effort they dedicated to our work. We agree that functional data would significantly strengthen our understanding of the midline barrier and its exact role during LR asymmetric gut development. However, we would like to note that repeated and diligent attempts to perturb this barrier were made using various strategies, such as in vivo laser ablation, diphtheria toxin, molecular disruption (Netrin 4), and enzymatic digestion (MMP2 and MMP9 electroporation) but we observed no significant effect or stable disruption of the midline. We acknowledge and accept this limitation and hope that our discovery will invite future investigations and perturbation of this novel midline structure.

      For example, it is unclear whether the two basement membranes originally appear to be part of a single circular/spherical structure (which looks possible from the images) that simply becomes elongated, or whether it is indeed initially two separate basement membranes that extend.

      We favor the hypothesis that the elongation of the preexisting small circular structure to an extended double membrane of relatively increased length would be unlikely without continued contribution of new basement membrane components. However, our attempts to label and trace the basement membrane of the endoderm using tagged laminins (LAMB1-GFP, LAMB1-His, and LAMC1-His), and more recently tagged nidogen constructs (NID1-GFP and NID1-mNG) have met with export issues (despite extensive collaboration with experts, Drs. Dave Sherwood and Peter Yurchenco). As such, it remains difficult to differentiate between the two possibilities suggested. We also believe this is an important question and will continue to investigate methods to trace it.

      There is a substantial gap between the BMs at earlier stages before the endoderm has descended - is this a lumen, or is it filled with interstitial matrix?

      Our preliminary studies indicate that the gap enclosed by the basement membranes in the early midline structure does have extracellular matrix present, such as fibrillin-2 (see Author response image 1). Also, the electron microscopy shown in Fig. 2 C’’ supports that the space between the notochord and endoderm has fibrillar matrix.

      Author response image 1.

      The authors show where this basement membrane does not originate from, but only speculate on its origin. Part of this reasoning is due to the lack of Lama1-expressing cells either in the early midline barrier before it extends, or in the DM cells adjacent to it. However, the Laminin observed in the midline could be comprised of a different alpha subtype for example, that wasn't assessed (it has been suggested that the Laminin antibody used in this study is not specific to the alpha-1 subunit, see e.g. Lunde et al, Brain Struct Funct, 2015).

      We appreciate this comment and have tried other laminin RNA probes that showed a similar lack of midline expression (Lama1, Lama3, Lama5). Importantly, the laminin alpha 1 subunit is a component of the laminin 111 heterotrimer, which along with laminin 511 is the first laminin to be expressed and assemble in embryonic basement membranes, as reviewed in Yurchenco 2011. Laminin 111 is particularly associated with embryonic development while laminins 511/521 become the most widespread in the adult (reviewed in Aumailley 2013). It is likely that the midline contains laminin 111 based on our antibody staining and the accepted importance and prevalence of laminin 111 in embryonic development. However, it is indeed worth noting that most laminin heterotrimers contain beta 1, gamma 1, or both subunits, and due to this immunological relation laminin antibody cross-reactivity is certainly known (Aumailley 2013). As such, while laminin 511 remains a possibility as a component of the midline BM, our Lama5 in situs have shown no differential expression at the midline of the dorsal mesentery (see Author response image 2), and we are therefore confident that our finding of no local laminin transcription is accurate. Additionally, we will note that the study referenced by the Reviewer observed cross-reactivity between the alpha 1 and alpha 2 subunits. Laminin 211/221 is an unlikely candidate based on the embryonic context, and because these laminins are primarily associated with muscle basement membranes (Aumailley 2013). In further support, we recently conducted a preliminary transcriptional profile analysis of midline cells isolated through laser capture microdissection (LCM), which revealed no differential expression of any laminin subunit at the midline. Please note that these data will be included as part of a follow-up story and fall beyond the scope of our initial characterization.

      Author response image 2.

      Similarly, the authors show that the midline barrier breaks down, and speculate that this is due to the activity of e.g. matrix metalloproteinases, but don't assess MMP expression in that region.

      This is an important point, as the breakdown of the midline is unusually rapid. Our RNA in situ hybridization for MMP2 at HH21, and for ADAMTS1 (and TS9) at HH19-21, indicates no differential expression at the midline (see Author response images 3 and 4). Our future focus will be on identifying a potential protease that exhibits differential activity at the midline of the DM.

      Author response image 3.

      Author response image 4.

      The authors suggest the (plausible) hypothesis that the descent of the endoderm pulls or stretches the midline barrier out from its position adjacent to the notochord. This is an interesting possibility, but there is no experimental evidence to directly support this. Similarly, while the data supporting the barrier function of this midline is good, there is no analysis of the impact of midline/basement membrane disruption demonstrating that it is required for asymmetric gut morphogenesis. A more functional approach to investigating the origins and role of this novel midline barrier would strengthen the study.

      Yes, we fully agree that incorporating functional data would immensely advance our understanding of the midline barrier and its crucial role in left-right gut asymmetry. However, our numerous efforts to perturb this barrier have encountered technical obstacles. For instance, while perturbing the left and right compartments of the DM is a routine and well-established procedure in our laboratory, accessing the midline directly through similar approaches has been far more challenging. We have made several attempts to address this hurdle using various strategies, such as in vivo laser ablation, diphtheria toxin, molecular disruption (Netrin 4), and enzymatic digestion (MMP2 and MMP9 electroporation). Despite employing diverse approaches, we have yet to achieve effective and interpretable perturbation of this resilient structure. We acknowledge this limitation and remain committed to developing methods to disrupt the midline in our current investigations. We again thank Reviewer #1 for the detailed feedback on our manuscript, guidance, and the time taken to provide these comments.

      Recommendations For The Authors:

      Using Laminin subunit-specific antibodies, or exploring the mRNA expression of more laminin subunits may support the argument that the midline does not derive from the notochord, endoderm, or DM.

      As mentioned above, RNA in situ hybridization for candidate genes and a preliminary RNA-seq analysis of cells isolated from the dorsal mesentery midline revealed no differential expression of any laminin subunits.

      Similarly, expression analysis of Laminin-degrading MMPs, and/or application of an MMP inhibitor and assessment of midline integrity could strengthen the authors' hypothesis that the BM is actively and specifically broken down.

      Our RNA in situ hybridization for MMP2 at HH21, and for ADAMTS1 at HH19-21, shows no differential expression pattern at the midline of the DM (see Author response image 3). We have not included these data in the revision, but future work on this topic will aim at identifying a protease that is differentially active at the midline of the DM.

      Functionally testing the role of barrier formation in regulating left-right asymmetry or the role of endoderm descent in elongating the midline barrier would be beneficial. Regarding the former, the authors show that Netrin4 overexpression is insufficient to disrupt the midline, but perhaps overexpression of e.g. MMP9 prior to descent of the endoderm would facilitate early degradation of the midline, and the impact of this on gut rotation could be assessed.

      Unfortunately, MMP9 electroporation has produced little appreciable effect. We acknowledge that the lack of direct evidence for the midline’s role in regulating left-right asymmetry is a shortcoming, but current work on this subject aims to define the midline’s function in LR asymmetric morphogenesis.

      Reviewer #2:

      When the left-right asymmetry of an animal body is established, the barrier that prevents the mixing of signals or cells across the midline is essential. The midline barrier that prevents the mixing of asymmetric signals during the patterning step has been identified. However, a midline barrier that separates both sides during asymmetric organogenesis is unknown. In this study, the authors discovered the cellular structure that seems to correspond to the midline in the developing midgut. This midline structure is transient, present at the stage when the barrier would be required, and composed of Laminin-positive membrane. Stage-dependent diffusion of dextran across the midline (Figure 6) coincides with the presence or absence of the structure (Figures 2, 3). These lines of indirect evidence suggest that this structure most likely functions as the midline barrier in the developing gut.

      We extend our gratitude to Reviewer #2 for their thoughtful assessment of our research and for taking the time to provide these constructive comments. We are excited to report that we have now included additional new data on midline diffusion using BODIPY and a quantification method to further support our findings on the midline's barrier function. While our data on dextran and now BODIPY both indirectly suggest barrier function, we aspire to perturb the midline directly to assess its role in the dorsal mesentery more conclusively. However, our numerous efforts to perturb this barrier have encountered technical obstacles. For instance, while perturbing the left and right compartments of the DM is a routine and well-established procedure in our laboratory, accessing the midline directly through similar approaches has been far more challenging. We have made several attempts to address this hurdle using various strategies, such as in vivo laser ablation, diphtheria toxin, molecular disruption (Netrin 4), and enzymatic digestion (MMP2 and MMP9 electroporation). Despite employing diverse approaches, we have yet to achieve effective and interpretable perturbation of this resilient structure. Moving forward, our focus is on identifying an effective means of perturbation that can offer direct evidence of barrier function.

      Recommendations For The Authors:

      (1) It would be much nicer if the requirement of this structure for asymmetric morphogenesis was directly tested. However, experimental manipulations such as ectopic expression of Netrin4 or transplantation of the notochord were not able to influence the formation of this structure (these results, however, suggested the mechanism of the midline formation in the gut dorsal mesentery). Therefore, it seems not feasible to directly test the function of the structure, and this should be the next issue.

      We fully agree that the midline will need to be perturbed to fully elucidate its role in asymmetric gut morphogenesis. As noted, multiple attempts were ineffective at perturbing this structure. Extensive current work on this topic is dedicated to finding an effective perturbation method.

      (2) Whereas Laminin protein was present in the double basement membrane at the midline, Laminin mRNA was not expressed in the corresponding region (Fig. 4A-C). It is necessary to discuss (with experimental evidence if available) the origin of Laminin protein.

      As we have noted, the source of laminin and basement membrane components for the midline remains unclear. The absence of local transcription and the insufficiency of the notochord to produce a midline indicate that the endoderm is a likely source of laminin, as we have proposed in our zippering endoderm model. We note that Fig. 4A-C indicates that laminin is in fact actively transcribed in the endoderm. Currently, attempts to trace the endodermal basement membrane using tagged laminins (LAMB1-GFP, LAMB1-His, and LAMC1-His), and more recently tagged nidogen constructs (NID1-GFP and NID1-mNG), have met with export issues (despite extensive collaboration with experts, Drs. Dave Sherwood and Peter Yurchenco). Confirmation of our proposed endodermal origin model is a goal of our ongoing work.

      (3) Figure 4 (cell polarity from GM130 staining): addition of representative GM130 staining images for each Rose graph (Figure 4E) would help. They can be shown in Supplementary Figures. Also, a graph for the right coelomic epithelium in Fig. 4E would be informative.

      We have added the requested GM130 images in our Supplemental Figures (please refer to Fig. S4ABB’) and modified the main Fig. 4E to include a rose graph for the polarity of the right coelomic epithelium.

      (4) Histological image of HH19 DM shown in Fig. 2J looks somehow different from that shown in Fig. 3F. Does Fig. 2J represent a slightly earlier stage than Fig. 3F?

      Figure 2J and Figure 3F depict a similar stage, although the slight variation in the length of the dorsal mesentery is attributed to the pseudo time phenomenon illustrated in Figure 3J-J’’’. This implies that the sections in Figure 2J and Figure 3F might originate from slightly different positions along the anteroposterior axis. Nonetheless, these distinctions are minimal, and based on the dorsal mesentery's length in Figure 2J, the midline is likely extremely robust regardless of this minor pseudo time difference.

      Reviewer #3:

      Summary:

      The authors report the presence of a previously unidentified atypical double basement membrane (BM) at the midline of the dorsal mesentery (DM) during the establishment of left-right (LR) asymmetry. The authors suggest that this BM functions as a physical barrier between the left and the right sides of the DM preventing cell mixing and ligand diffusion, thereby establishing LR asymmetry.

      Strengths:

      The observation of the various components in the BM at the DM midline is clear and convincing. The pieces of evidence ruling out the roles of DM and the notochord in the origin of this BM are also convincing. The representation of the figures and the writing is clear.

      Weaknesses:

      The paper's main and most important weakness is that it lacks direct evidence for the midline BM's barrier and DM LR asymmetry functions.

      We thank Reviewer #3 for their thoughtful and comprehensive evaluation of our work, recognizing the considerable time and effort they dedicated to assessing our study. We fully agree that incorporating functional data would immensely advance our understanding of the midline barrier and its crucial role in left-right gut asymmetry. However, several distinct attempts at perturbing this barrier have encountered technical obstacles. While our laboratory routinely perturbs the left and right compartments of the DM via DNA electroporation and other techniques, directly perturbing the midline using these methods is far more challenging. We have made diligent attempts to address this using various strategies, such as in vivo laser ablation, diphtheria toxin, molecular disruption (Netrin 4), and enzymatic digestion (MMP2 and MMP9 electroporation). However, we have not yet been able to identify a means of producing consistent and interpretable perturbation of the midline. We acknowledge this limitation and remain committed to developing methods to disrupt the midline in our current investigations.

      Recommendations For The Authors:

      Major:

      (1) We suggest the authors test their hypotheses i.e., physical barrier and proper LR asymmetry establishment by the midline BM, by disrupting it using techniques such as physical ablation, over-expression of MMPs, or treatment with commercially available enzymes that digest the BM.

      As above, efforts involving physical ablation and MMP overexpression have not yielded significant effects on the midline thus far. Moving forward, investigating the midline's role in asymmetric morphogenesis will necessitate finding a method to perturb it effectively. In pursuit of progress on this critical question, we recently conducted laser capture microdissection (LCM) and RNA-sequencing of the midline to unravel the mechanisms underlying its formation and potential disruption. This work shows promise but it is still in its early stages; validating it will require significant time and effort, and it falls outside the scope of the current manuscript.

      (2) Lefty1's role in the midline BM was ruled out by correlating lack of expression of the gene at the midline during HH19 when BM proteins expression was observed. Lefty1 may still indirectly or directly trigger the expression of these BM proteins at earlier stages. The only way to test this is by inhibiting lefty1 expression and examining the effect on BM protein localization.

      We have added a section to discuss the potential of Lefty1 inhibition as a future direction. However, similar to perturbing global Nodal expression, interpreting the results of Lefty1 inhibition could be challenging. This is because it may not specifically target the midline but could affect vertebrate laterality as a whole. Despite this complexity, we acknowledge the value of such an experiment and consider it worth pursuing in the future.

      (3) Using a small dextran-based assay, the authors conclude that diffusible ligands such as cxcl2 and bmp4 do not diffuse across the midline (Figure 6). However, dextran injection in this system seems to label the cells, not the extracellular space. The authors measure diffusion, or the lack thereof, by counting the proportion of dextran-labeled cells rather than dextran intensity itself. Therefore, this result shows a lack of cell mixing across the midline (already shown in Figure 2) rather than a lack of diffusion.

      We should emphasize that the dextran-injected embryos shown in Fig. 6 D-F were isolated two hours post-injection, a timeframe insufficient for cell migration to occur across the DM (Mahadevan et al., 2014). We also collected additional post-midline stage embryos ten minutes after dextran injections - too short a timeframe for significant cellular migration (Mahadevan et al., 2014). Importantly, the fluorescent signal in those embryos was comparable to that observed in the embryos in Fig. 6. Thus, we believe the movement of fluorescent signal across the DM when the barrier starts to fragment (HH20-HH23) is unlikely to represent cell migration. More than a decade of DNA electroporation experiments of the left vs. right DM by our laboratory and others have never indicated substantial cell migration across the midline (Davis et al., 2008; Kurpios et al., 2008; Welsh et al., 2013; Mahadevan et al., 2014; Arraf et al. 2016; Sivakumar et al., 2018; Arraf et al. 2020; and Sanketi et al., 2022). This is also shown in our current GFP/RFP double electroporation data in Fig. 2 G-H, and DiI/DiO labeling data in Fig. 2 E-G. Collectively, our experiments suggest that the dextran signal we observed at HH20 and HH23 is likely not driven by cell mixing.

      To further strengthen this argument, we now have additional new data on midline diffusion using BODIPY and a quantification method to support our findings on the midline's function against diffusion (please refer to New Fig. 6H-M). Briefly, we utilized a BODIPY-tagged version of AMD3100 (Poty et al., 2015) delivered via soaked resin beads surgically inserted into the left coelomic cavity (precursor to the DM). The ratio of average AMD3100-BODIPY intensity in the right DM versus the left DM was below 0.5 when the midline was intact (HH19), indicating little diffusion across the DM (Fig. 6J). At HH21, when no midline remains, this ratio rose significantly to near one, indicating that diffusion of the drug is not impeded when the midline basement membrane structure is absent. Collectively, these data suggest that the basement membrane structure at the midline forms a transient functional barrier against diffusion.

      (4) Moreover, in a previous study (Mahadevan et al., Dev Cell., 2014), cxcl2 and bmp4 expression was observed on both the left and right side before gut closure (HH17, when midline BM is observed). Then their expression patterns were restricted on the left or right side of DM at around HH19-20 (when midline BM is dissociated). The authors must explain how the midline BM can act as a barrier against diffusible signals at HH-17 to 19, where diffusible signals (cxcl12 and bmp4) were localized on both sides.

      We appreciate the Reviewer's invitation to clarify this crucial point. Early in dorsal mesentery (DM) formation, genes like Cxcl12 (Mahadevan et al., Dev Cell 2014) and Bmp4 (Sanketi et al., Science 2021) exhibit symmetry before Pitx2 expression initiates on the left (around HH18, Sanketi et al., 2021). Pitx2 then inhibits BMP4 (transcription) and maintains Cxcl12 (mRNA) expression on the left side. The loss of Cxcl12 mRNA on the right is due to the extracellular matrix (ECM), particularly hyaluronan (Sivakumar et al., Dev Cell 2018). Our hypothesis is that during these critical stages of initial DM asymmetry establishment, the midline serves as a physical barrier against protein diffusion to protect this asymmetry during a critical period of symmetry breaking. Although some genes, such as Pitx2 and Cxcl12, continue to display asymmetric transcription after midline dissolution (Cxcl12 becomes very dynamic later on; see Mahadevan et al., 2014), it's crucial to note that the midline's primary role is preventing protein diffusion across it, akin to an insurance policy. Thus, the absence of the midline barrier at HH21 does not result in the loss of asymmetric mRNA expression. We think its primary function is to block diffusible factors from crossing the midline at a critical period of symmetry breaking. We acknowledge that confirming this hypothesis will necessitate experimental disruption of the midline and observing the consequent effects on asymmetry in the DM. This remains central to our ongoing research on this subject.

      (5) On page 11, lines 15-17, the authors mention that "We know that experimentally mixing left and right signals is detrimental to gut tilting and vascular patterning-for example, ectopic expression of pro-angiogenic Cxcl12 on the right-side results in an aberrant vessel forming on the right (Mahadevan et al., Dev Cell., 2014)". In this previous report from the author's laboratory, the authors suggested that ectopic expression of cxcl12 on the right side induced aberrant formation of the vessel on the right side, which was formed from stage HH17, and the authors also suggested that the vessel originated from left-sided endothelial cells. If the midline BM acts as a barrier against the diffusible signal, how the left-sided endothelial cells can contribute to vessel formation at HH17 (before midline BM dissociation)?

      To address this point, we direct the Reviewer to previously published supplemental movies of time-lapse imaging, which clearly illustrate the migration path of endothelial cells from left to right DM (Mahadevan et al., Dev Cell 2014). While the Reviewer correctly notes that ectopic induction of Cxcl12 on the right induces left-to-right migration, it's crucial to highlight that these cells never cross the midline. Instead, they migrate immediately adjacent to the tip of the endoderm (please also refer to published Movies S2 and S3). We observe this migration pattern even in wild-type scenarios during the loss of the endogenous right-sided endothelial cords, where some endothelial cells from the right begin slipping over to the left around HH19-20 (over the endoderm), as the midline is beginning to fragment, but never traverse the midline. We attribute this migration pattern to a dorsal-to-ventral gradient of left-sided Cxcl12 expression, as disrupting this pattern perturbs the migration trajectory (Mahadevan et al., 2014).

      (6) It is unclear how continuous is the midline BM across the anterior-posterior axis across the relevant stages. Relatedly, it is unclear how LR segregated the cells are, across the anterior-posterior axis across the relevant stages.

      We refer the reviewer to Fig. 3J-K, in which the linear elongation of the midline basement membrane structure is shown and measured at HH19 in three embryos from the posterior of the embryo to the anterior point at which the midline is fragmented and ceases to be continuous. Similarly, Fig. S2 shows the same phenomenon in serial sections along the length of the anterior-posterior (AP) axis at HH17, also showing the continuity of the midline. All our past work at all observed sections of the AP axis has shown that cells do not move across the midline, as indicated by electroporation of DNA encoding fluorescent reporters (Davis et al. 2008, Kurpios et al. 2008, Welsh et al. 2013, Mahadevan et al. 2014, Sivakumar et al. 2018, Sanketi et al. 2022), and this is shown again in Fig. 2 E-H. As noted previously, very few endothelial cells cross the midline at a point just above the endoderm (image above) when the right endothelial cord remodels (Mahadevan et al. 2014), but this phenomenon is limited to endothelial cells, and cells of the left and right DM are fully segregated as previously established.

      Minor comments:

      (1) The authors found that left and right-side cells were not mixed with each other even after the dissociation of the DM midline at HH21 (Fig2 H). And the authors also previously mentioned that N-cadherin contributes to cell sorting for left-right DM segregation (Kurpios et al., Proc Natl Acad Sci USA., 2008). It could be a part of the discussion about the difference in tissue segregation systems before or after the dissociation of DM midline.

      We appreciate this thoughtful suggestion. N-cadherin mediated cell sorting is key to the LR asymmetry of the DM and gut tilting, and we believe it underlies the observed lack of cell mixing from left and right DM compartments after the midline fragments. We have added a brief section to the discussion concerning the asymmetries in N-cadherin expression that develop after the midline fragments.

      (2) Please add the time point on the images (Fig3 C, D, Fig 6A and B)

      We have updated these figures to provide the requested stage information.

      (3) The authors suggested that the endoderm might be responsible for making the DM BM midline because the endoderm links to DM midlines and have the same resistance to NTN4. The authors mentioned that the midline and endoderm might have basement membranes of the same "flavor." However, perlecan expression was strongly expressed in the midline BM compared with the endodermal BM. It could be a part of the discussion about the difference in the properties of the BM between the endoderm and DM midline.

      Perlecan does indeed localize strongly to the endoderm as well as the midline. The HH18 image included in prior Fig. S3 B’, B’’ appears to show atypically low antibody staining in the endoderm for all membrane components. Perlecan is an important component for general basement membrane assembly, and the bulk of our HH18 and HH19 images indicate strong staining for perlecan in both midline and endoderm. Perlecan staining at the very earliest stages of midline formation also indicates perlecan in the endoderm, supporting the endoderm as a potential source for the midline basement membrane. We have updated Fig. S3 to include these images in our revision.

      (4) The authors investigated whether the midline BM originates from the notochord or endoderm, but did not examine a role for endothelial cells and pericytes surrounding the dorsal aorta (DA). In Fig S1, Fig S2, and FigS3, the authors showed that DA is very close to the DM midline basement membrane, so it is worth checking their roles.

      We fully agree that the dorsal aorta and the endothelial cords that originate from the dorsal aorta may interact with the midline in important ways. However, accessing the dorsal aorta for electroporation or other perturbation is extremely difficult. Additionally, the basement membrane of vascular endothelial cells has a distinct composition from a non-vascular basement membrane. Vascular endothelial cells produce only alpha 4 and alpha 5 laminin subunits but contain no alpha 1 subunit in any known species (reviewed in DiRusso et al., 2017). Thus, endothelial cell-derived basement membranes would not contain the alpha 1 laminin subunit that we used in our studies as a robust marker of the midline basement membrane. Additionally, no fibronectin is found in the midline basement membrane, while it is enriched in the dorsal aorta (see Supplemental Figure 3CC’C’’). We will briefly note that our preliminary data in quail tissue indicate that QH1+ cord cells (i.e. endothelial cells) sometimes exhibit striking contact with the midline along the dorso-ventral length of the DM, suggesting not an origin but an important interaction.

      Reviewer #4 (Recommendations For The Authors):

      Major comments:

      (1) The descending endoderm zippering model for the formation of the midline lacks evidence.

      We have attempted to address this issue by introducing several tagged laminin constructs (LAMB1-GFP, LAMB1-His, LAMC1-His), and more recently tagged nidogen plasmids (NID1-GFP and NID1-mNG) to the endoderm via DNA electroporation to try to label the source of the basement membrane. Production of the tagged components occurred but no export was observed in any case (despite extensive collaboration with experts in this area, Drs. Dave Sherwood and Peter Yurchenco). This experiment was further complicated by the necessary large size of these constructs at 10-11kb due to the size of laminin subunit genes, resulting in low electroporation efficiency. We also believe this is an important question and are continuing to investigate methods to trace it.

      The midline may be Ntn4 resistant until it is injected in the source cells.

      Ntn4 has been shown to disrupt both assembling and existing basement membranes (Reuten et al. 2016). Thus, we feel that the midline and endodermal basement membranes’ resistance to degradation is not determined by stage of assembly or location of secretion.

      Have you considered an alternative origin from the bilateral dorsal aorta or the paraxial mesoderm, which would explain the double layer as a meeting of two lateral tissues? The left and right paraxial mesoderm seem to abut in Fig. S1B-C and S2E, and is laminin-positive in Fig 4A'. What are the cells present at the midline (Fig.4D-E)? Are they negative for the coelomic tracing, paraxial or aortic markers?

      We fully agree that alternate origins of the midline basement membrane cannot be ruled out from our existing data. We have considered the dorsal aorta and even the endothelial cords that originate from the dorsal aorta. However, accessing the dorsal aorta for electroporation or other perturbation is extremely difficult. Importantly, the basement membrane of vascular endothelial cells has a distinct composition from a non-vascular basement membrane. Vascular endothelial cells produce only alpha 4 and alpha 5 laminin subunits but contain no alpha 1 subunit in any known species (reviewed in Hallmann et al. 2005). Thus, endothelial cell-derived basement membranes would not contain the alpha 1 laminin subunit that we used in our studies as a robust marker of the midline basement membrane. Note in Fig. 3 E-H that our laminin alpha 1 antibody staining does not label the aortae. Additionally, no fibronectin is found in the midline basement membrane, while it is enriched in the dorsal aorta (see Supplemental Figure 3CC’C’’). We will briefly note that our preliminary data in quail tissue indicate that QH1+ cord cells (i.e. endothelial cells) sometimes exhibit striking contact with the midline along the dorso-ventral length of the DM, suggesting not an origin but an important interaction. Moreover, at the earliest stages of midline basement membrane emergence, the dorsal aortae are distant from the nascent basement membrane, as are the somites, which have not yet undergone any epithelial to mesenchymal transition. Fig. S2G provides an example of an extremely early midline basement membrane without dorsal aorta or somite contact. Fig. S2G is from a fairly posterior section of the embryo; it is thus less developed in pseudo-time and gives a window on midline formation in very early embryos.

      (2) The importance of the midline is inferred from previously published data and stage correlations but will require more direct evidence. Can the midline be manipulated with Hh signaling or MMPs?

      We agree that direct evidence in the form of midline perturbation will be critically required. As previously noted, our numerous efforts to perturb this barrier have encountered technical obstacles. For instance, while perturbing the left and right compartments of the DM is a routine and well-established procedure in our laboratory, accessing the midline directly through similar approaches has been far more challenging. We have made several attempts to address this hurdle using various strategies, such as in vivo laser ablation, diphtheria toxin, molecular disruption (Netrin 4), and enzymatic digestion (MMP2 and MMP9 electroporation). Despite employing diverse approaches, we have yet to achieve effective and interpretable perturbation of this resilient structure. Targeting Hh signaling between the endoderm and notochord is a good idea and we will continue these efforts. Thanks very much.

      Minor comments:

      - Please add the species in the title.

      We have altered the title as follows: “An atypical basement membrane forms a midline barrier during left-right asymmetric gut development in the chicken embryo.”

      - The number of observations in Fig2, Fig3A-B, 4A-C, G-H, S1, S3 is lacking.

      We have added the requested n numbers of biological replicates to the legends of the specified figures.

      - Please annotate Fig 3J to show what is measured in K.

      We have modified Fig. 3J to include a dashed bar indicating the length measurements in Fig. 3K.

      - Please provide illustrations of Fig 4E.

      We have added a representative image of GM130 staining to the supplement.

      - If laminin gamma is the target of Ntn4, its staining would help interpret the results of Ntn4 manipulation. Is laminin gamma present in different proportions in the different types of basement membranes, underlying variations in sensitivity?

      Laminin is exported as a heterotrimer consisting of an alpha, beta, and gamma subunit. Laminin gamma is therefore present in equal proportions to the other laminin subunits in all basement membranes with a laminin network. Several gamma isoforms do exist, but only laminin gamma 1 will bind to laminin alpha 1, which we use throughout this paper to mark the midline as well as nearby basement membranes that are sensitive to Ntn4 disruption. Thus, gamma laminin proportions or isoforms are unlikely to underlie the resistance of the midline and endodermal basement membranes to Ntn4 (reviewed in Yurchenco 2011).

      - Please comment: what is the red outline abutting the electroporated DM on the left of Fig5B?

      The noted structure is the basement membrane of the nephric duct – we added this information to Fig. 5B image and legend.

      - The stage in Fig 6A-B is lacking.

      We have added the requested stage information to Fig. 6.

      - Please comment on whether there is or is not some cell mixing Fig 2H, at HH21 after the midline disappearance. Is it consistent with Fig. 6E-F which labels cells?

      More than a decade of DNA electroporation experiments of the left vs. right DM by our laboratory and others have never indicated dorsal mesentery cell migration across the midline (Davis et al., 2008; Kurpios et al., 2008; Welsh et al., 2013; Mahadevan et al., 2014; Arraf et al. 2016; Sivakumar et al., 2018; Arraf et al. 2020; and Sanketi et al., 2022). This is also shown in our current GFP/RFP double electroporation data in Fig. 2 G-H, and DiI/DiO labeling data in Fig. 2 E-G. Cell mixing does not occur even after midline disappearance, most likely due to asymmetric N-cadherin expression on the left side of the DM (Kurpios et al., 2008). The sparse, green-labeled cells observed on the right side in Fig. 2H are likely a result of DNA electroporation - the accuracy of this process relies on the precise injection of the left (or right) coelomic cavity (precursor to the gut mesenchyme including the DM) and subsequent correct placement of the platinum electrodes.

      Based on these data, we strongly feel that cellular migration is not responsible for the pattern of dextran observed in Fig. 6E-F, especially in light of the N-cadherin mediated segregation of left and right. We will also note that there is no significant difference between dextran diffusion at HH19 and HH20, only a trend towards significance. Additionally, we would like to note that the dextran-injected embryos were isolated two hours post-injection, which we do not believe is sufficient time for any cell migration to occur across the DM. We also collected additional post-midline stage embryos ten minutes after dextran injections (data not shown), too short a timeframe for significant cellular migration, and the fluorescent signal in those embryos was comparable to that represented in the embryos in Fig. 6. Thus, we believe the movement of fluorescent signal across the DM observed when the barrier starts to fragment at HH20 and HH23 is unlikely to represent movement of cells.

      To further strengthen this argument, we now have additional new data on midline diffusion using BODIPY and a quantification method to support our findings on the midline's function against diffusion (please refer to New Fig. 6H-M). Briefly, we utilized a BODIPY-tagged version of AMD3100 (Poty et al., 2015) delivered via soaked resin beads surgically inserted into the left coelomic cavity (precursor to the DM). The ratio of average AMD3100-BODIPY intensity in the right DM versus the left DM was below 0.5 when the midline was intact (HH19), indicating little diffusion across the DM (Fig. 6J). At HH21, when no midline remains, this ratio rose significantly to near one, indicating that diffusion of the drug is not impeded when the midline basement membrane structure is absent. Collectively, these data suggest that the basement membrane structure at the midline forms a transient functional barrier against diffusion.

      - 'independent of Lefty1': rephrase or show the midline phenotype after lefty1 inactivation.

      We agree with this comment and have rephrased this section to indicate the midline is present “at a stage when Lefty1 is no longer expressed at the midline.”

      We again would like to extend our sincere gratitude to our reviewers and the editors at eLife for their dedicated time and thorough evaluation of our paper. Their meticulous attention to detail and valuable insights have strengthened our data and provided further support for our findings.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The manuscript by Griesius et al. addresses the dendritic integration of synaptic input in cortical GABAergic interneurons (INs). Dendritic properties, passive and active, of principal cells have been extensively characterized, but much less is known about the dendrites of INs. The limited information is particularly relevant in view of the high morphological and physiological diversity of IN types. The few studies that investigated IN dendrites focused on parvalbumin-expressing INs. In fact, in a previous study, the authors examined dendritic properties of PV INs, and found supralinear dendritic integration in basal, but not in apical dendrites (Cornford et al., 2019 eLife).

      In the present study, complementary to the prior work, the authors investigate whether dendrite-targeting IN types, NDNF-expressing neurogliaform cells, and somatostatin(SOM)-expressing O-LM neurons, display similar active integrative properties by combining clustered glutamate-uncaging and pharmacological manipulations with electrophysiological recording and calcium imaging from genetically identified IN types in mouse acute hippocampal slices.

      The main findings are that NDNF IN dendrites show strong supralinear summation of spatially- and temporally-clustered EPSPs, which is changed into sublinear behavior by bath application of NMDA receptor antagonists, but not by Na+-channel blockers. L-type calcium channel blockers abolished the supralinear behavior associated calcium transients but had no or only weak effect on EPSP summation. SOM IN dendrites showed similar, albeit weaker NMDA-dependent supralinear summation, but no supralinear calcium transients were detected in these INs. In summary, the study demonstrates that different IN types are endowed with active dendritic integrative mechanisms, but show qualitative and quantitative divergence in these mechanisms.

      While the research is conceptually not novel, it constitutes an important incremental gain in our understanding of the functional diversity of GABAergic INs. In view of the central roles of IN types in network dynamics and information processing in the cortex, results and conclusions are of interest to the broader neuroscience community.

      The experiments are well designed, and closely follow the approach from the previous publication in parts, enabling direct comparison of the results obtained from the different IN types. The data is convincing and the conclusions are well-supported, and the manuscript is very well-written.

      I see only a few open questions and some inconsistencies in the presentation of the data in the figures (see details below).

      We thank the reviewer for the evaluation and address the detailed points below.

      Reviewer #2 (Public review):

      Summary:

      Griesius et al. investigate the dendritic integration properties of two types of inhibitory interneurons in the hippocampus: those that express NDNF+ and those that express somatostatin. They found that both neurons showed supralinear synaptic integration in the dendrites, blocked by NMDA receptor blockers but not by blockers of Na+ channels. These experiments are critically overdue and very important because knowing how inhibitory neurons are engaged by excitatory synaptic input has important implications for all theories involving these inhibitory neurons.

      Strengths:

      (1) Determined the dendritic integration properties of two fundamental types of inhibitory interneurons.

      (2) Convincing demonstration that supra-threshold integration in both cell types depends on NMDA receptors but not on Na+ channels.

      Weaknesses:

      It is unknown whether highly clustered synaptic input, as used in this study (and several previous studies), occurs physiologically.

      We are grateful to the reviewer for the critique. Indeed, the degree to which clustered inputs belonging to a functional neuronal assembly occur on interneuron dendrites is an open question. However, Chen et al (2013, Nature 499:295-300) reported that dendritic domains of PV-positive interneurons in visual cortex, unlike their somata, exhibit calcium transients in vivo which are highly tuned to stimulus orientation. This suggests that clustered inputs to dendritic segments may well belong to functional assemblies, much as in principal cells (e.g. Wilson et al, 2016, Nature Neuroscience 19:1003–1009; Iacaruso et al, 2017, Nature 547:449–452). In our earlier work reporting NMDAR-dependent supralinear summation of glutamate uncaging-evoked responses at a subset of dendrites on PV-positive interneurons, we demonstrated how this arrangement in an oscillating feedback circuit could be exploited to stabilise neuronal assemblies.

      Reviewer #3 (Public review):

      Summary:

      The authors study the temporal summation of caged EPSPs in dendrite-targeting hippocampal CA1 interneurons. There are some descriptive data presented, indicating non-linear summation, which seems to be larger in dendrites of NDNF expressing neurogliaform cells versus OLM cells. However, the underlying mechanisms are largely unclear.

      Strengths:

      Focal 2-photon uncaging of glutamate is a nice and detailed method to study temporal summation of small potentials in dendritic segments.

      Weaknesses:

      (1) NMDA-receptor signaling in NDNF-IN. The authors nicely show that temporal summation in dendrites of NDNF-INs is to a certain extent non-linear. However, this non-linearity varies massively from cell to cell (or dendrite to dendrite) from 0% up to 400% (Figure S2). The reason for this variability is totally unclear. Pharmacology with AP5 hints towards a contribution of NMDA receptors. However, the authors claim that the non-linearity is not dependent on EPSP amplitude (Figure S2), which should be the case if NMDA-receptors are involved. Unfortunately, there are no voltage-clamp data of NMDA currents similar to the previous study. This would help to see whether NMDA-receptor contribution varies from synapse to synapse to generate the observed variability. Furthermore, the NMDA- and AMPA-currents would help to compare NDNF with the previously characterized PV cells and would help to contribute to our understanding of interneuron function.

      We thank the reviewer for the helpful comments.

      We did not actually claim that EPSP amplitude has no role in determining the magnitude of non-linearity: “Among possible sources of variability for voltage supralinearity, we did not observe a systematic dependence on the average amplitude of individual uEPSPs […] (Fig. S2)”. Whilst we fully agree that, at first sight, a positive dependence of supralinearity on uEPSP amplitude might be expected simply from the voltage-dependent kinetics of NMDARs, there are two main reasons why this could have been obscured. First, the expected relationship is non-monotonic, because with large local depolarizations the driving force collapses, as seen in the overall sigmoid shape of the average relationship between the scaled observed response and arithmetic sum (e.g. Figs 2a & c; 4c & e). Therefore, we would arguably expect a parabolic relationship rather than a simple positive slope relating the degree of supralinearity to the average amplitude of individual uEPSPs. Second, given that the uncaging distance varied substantially, the average amplitudes of the individual uEPSPs recorded at the soma would have undergone different degrees of electrotonic attenuation and further distortion by active conductances before they were measured. Ultimately, the plots in Fig. S2 show too much scatter to be able to exclude a positive or parabolic relationship of nonlinearity to uEPSP amplitude. To avoid misunderstanding, we have changed the sentence in the Results that refers to Fig. S2 to: “Among possible sources of variability for voltage supralinearity, we did not observe a significant monotonic dependence on the average amplitude of individual uEPSPs, distance from the uncaging location along the dendrite to the soma, [or] the dendrite order (Fig. S2)”.
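
      As an illustration only, the expected non-monotonic dependence can be sketched with a toy model in which the NMDAR current is the product of a voltage-dependent Mg2+ unblock term and an ohmic driving force. The Jahr-Stevens-style block expression, the 0 mV reversal potential, the 1 mM Mg2+ concentration and the function name are all assumptions made for this sketch; none of them is taken from the manuscript.

```python
import numpy as np

def nmdar_current(v_mv, g_max=1.0, e_rev_mv=0.0, mg_mm=1.0):
    """Toy NMDAR current (arbitrary units) at membrane potential v_mv (mV).

    Assumed model: Jahr-Stevens-style Mg2+ unblock multiplied by an
    ohmic driving force. All parameter values are illustrative.
    """
    v = np.asarray(v_mv, dtype=float)
    # Fraction of channels relieved of Mg2+ block; rises with depolarization.
    unblocked = 1.0 / (1.0 + (mg_mm / 3.57) * np.exp(-0.062 * v))
    # Driving force (v - e_rev) shrinks to zero as v approaches e_rev.
    return g_max * unblocked * (v - e_rev_mv)

voltages = np.arange(-80.0, 1.0, 1.0)   # mV, from rest-like to reversal
inward = -nmdar_current(voltages)       # magnitude of the inward current
v_peak = voltages[np.argmax(inward)]    # potential of maximal inward current
```

      In this toy model the inward current is largest at an intermediate potential and falls to zero at the reversal potential, which is the qualitative behaviour that would yield a rise followed by a fall, rather than a simple positive slope, when relating supralinearity to local depolarization.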

      As for the relative contributions of NMDARs and AMPARs, voltage clamp recordings from both neurogliaform and OLM interneurons have already been reported, with the conclusion that neurogliaform cells exhibit relatively larger NMDAR-mediated currents (e.g. Chittajallu et al. 2017; Booker et al. 2021; Mercier et al. 2022), entirely in keeping with the conclusions of our study. Repeating these measurements would add little to the study. Furthermore, because the mean baseline uEPSP amplitude was <0.5 mV (Fig S2), it would be difficult to obtain reliable measurements of isolated NMDAR-mediated uEPSCs.

      Turning to the high variability of supralinearity, indeed, the 95% confidence interval for the data in Fig. 2d is 73%, 213%. This degree of variability is consistent with the wide range of NMDAR/AMPAR ratios reported by Chittajallu et al. 2017 (their Fig. 1g), compounded by the expected non-monotonic relationship alluded to above.

      (2) Sublinear summation in NDNF-INs. In the presence of AP5, the temporal summation of caged EPSPs is sublinear. That is potentially interesting. The authors claim that this might be dependent on the diameter of dendrites. Many voltage-gated channels can mediate such things as well. To conclude the contribution of dendritic diameter, it would be helpful to at least plot the extent of sublinearity in single NDNF dendrites versus the dendritic diameter. Otherwise, this statement should be deleted.

      We have plotted the degree of nonlinearity against dendritic diameter for neurogliaform cells (under baseline conditions and in D-AP5) in Fig S2h-k. We did not observe any significant linear correlations, other than between amplitude nonlinearity and dendrite diameter post D-AP5. This does not negate the possibility that the significant difference in average dendritic diameters between neurogliaform and OLM cells contributes to differences in impedance (which we have rephrased as “Among possible explanations is that the local dendritic impedance is greater in neurogliaform cells, lowering the threshold for recruitment of regenerative currents”).

      (3) Nonlinear EPSP summation in OLM-IN. The authors do similar experiments in dendrite-targeting OLM-INs and show that the non-linear summation is smaller than in NDNF cells. The reason for this remains unclear. The authors claim that this is due to the larger dendritic diameter in OLM cells. However, there is no analysis. The minimum would be to correlate non-linearity with dendritic diameter in OLM-cells. Very likely there is an important role of synapse density and glutamate receptor density, which was shown to be very low in proximal dendrites of OLM cells and strongly increase with distance (Guirado et al. 2014, Cerebral Cortex 24:3014-24, Gramuntell et al. 2021, Front Aging Neurosci 13:782737). Therefore, the authors should perform a set of experiments in more distal dendrites of OLM cells with diameters similar to the diameters of the NDNF cells. Even better would be if the authors would quantify synapse density by counting spines and show how this density compares with non-linearity in the analyzed NDNF and OLM dendrites.

      The difference in average dendritic diameters between OLM and neurogliaform cells is highly significant (Fig. 8q, P<0.001). We do not claim that dendritic diameter (and by implication local impedance) is the only determinant of the degree of non-linearity. The suggestion that a gradient of glutamate receptor density contributes is interesting. However, the results of uncaging experiments targeting more distal OLM dendrites of similar diameter as neurogliaform dendrites would be subject to numerous confounds, not least the very different electrotonic attenuation, likely differences in various active conductances, and the presence of spines in OLM dendrites (which are generally sparse and were not reliably imaged in our experiments). Moreover, the cell would have to remain patched for longer in order for the fluorescent dyes to invade the distal dendrites. This alone could potentially result in systematic biases among groups. We now cite Guirado et al (2014) and Gramuntell et al (2021) to highlight that factors other than dendritic diameter per se, such as inhomogeneity in spine and NMDA receptor density, may also contribute to the heterogeneity of nonlinear summation in OLM cells.

      (4) NMDA in OLM. Similar to the NDNF cells, the authors claim the involvement of NMDA receptors in OLM cells. Again there seems to be no dependence on EPSP amplitude, which is not understandable at this point (Figure S3). Even more remarkable is the fact that the authors claim that there is no dendritic calcium increase after activation of NMDA receptors. Similar to NDNF-cell analysis there are no NMDA currents in OLMs. Unfortunately, even no calcium imaging experiments were shown. Why? Are there calcium-impermeable NMDA receptors in OLM cells? To understand this phenomenon the minimum is to show some physiological signature of NMDA-receptors, for example, voltage-clamp currents. Furthermore, it would be helpful to systematically vary stimulus intensity to see some calcium signals with larger stimulation. In case there is still no calcium signal, it would be helpful to measure reversal potentials with different ion compositions to characterize the potentially 'Ca2+ impermeable' voltage-dependent NMDA receptors in OLM cells.

      The same response to point 1) above applies to OLM cells. As with neurogliaform cells, mean OLM baseline asynchronous (separate response) amplitudes were <0.5 mV, making it very difficult to record an isolated NMDAR-mediated uEPSC. Having said that, NMDARs do contribute to EPSCs elicited by stimulation of multiple afferents (e.g. Booker et al, 2021). We do not claim that dendritic calcium transients cannot be elicited following activation of NMDARs in OLM cells. We simply reported that the evoked uEPSPs, designed to approximate individual synaptic signals, were sub-threshold for detectable dendritic calcium signals under conditions that were suprathreshold in neurogliaform cells. The statement has been amended to specify that there were no detectable signals under our recording conditions. There is no evidence presented in the manuscript to suggest that OLM NMDARs are calcium impermeable and indeed no such claim was made.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      There is a large variability in the observed dendritic nonlinearity, in NDNF IN dendrites e.g. the uEPSP amplitude nonlinearity measure varies from as low as 10-20% to over 200%. As only single dendrites were recorded from each IN, it is unclear if this variability is among the cells or between individual dendrites. While the authors analyzed some potential factors, such as distance along the dendrites, branch order, or response magnitude (amplitude and integral), they did not find any substantial correlation. It remains open if different dendrites of NDNF INs, located in the str. moleculare vs. those in or projecting towards str. radiatum, have divergent properties. Similarly, for SOM INs an important question is if axon-carrying dendrites show distinct properties.

      In this context, it would be interesting to see not only values for the mean nonlinearity but also the maximal nonlinearity and its distribution.

      Nonlinearity as defined in the manuscript is a cumulative measurement. The final value per dendritic segment is therefore the sum of nonlinearities at 1 to 12 near-synchronous uncaging locations. The data for the individual dendritic segments are shown in the slopegraphs as in Fig 2b, with their distribution visible. The averages referred to in the results correspond to the paired mean difference plots, which are the group summaries. The method section has been amended to clarify the analysis method. We did not address specifically whether dendrites projecting in different directions behaved differently. This is an interesting question beyond the scope of this study. Nor did we compare axon-carrying OLM dendrites to other dendrites.
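
      A minimal sketch of a cumulative measure of this kind is given below. The per-step normalisation (percent deviation of the near-synchronous response from the arithmetic sum) and the function name are assumptions made for illustration; the amended Methods give the definition actually used.

```python
import numpy as np

def cumulative_nonlinearity(measured, arithmetic_sum):
    """Summed percent deviation of measured responses from linear summation.

    measured[k] and arithmetic_sum[k] are the responses for k+1 uncaging
    locations. The percent normalisation is an assumption of this sketch.
    """
    measured = np.asarray(measured, dtype=float)
    arithmetic_sum = np.asarray(arithmetic_sum, dtype=float)
    # Deviation from linearity at each number of uncaging locations.
    per_step = 100.0 * (measured - arithmetic_sum) / arithmetic_sum
    # One cumulative value per dendritic segment.
    return per_step.sum()

# Hypothetical amplitudes (mV) for 1, 2 and 3 uncaging locations.
linear = cumulative_nonlinearity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
supra = cumulative_nonlinearity([1.0, 2.5, 4.5], [1.0, 2.0, 3.0])
```

      With these hypothetical values the linear case sums to 0% and the supralinear case to 75% (0% + 25% + 50%), showing how a single per-dendrite value accumulates the deviations across all location counts.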

      Figures:

      Figure 1: The gray line in plots g and h is not explained. While it looks like an identity line, the legend in plot i ("asynchronous") interferes.

      In plots g and h the gray line is the line of identity. In plot i it is an estimate of the linear summation. In plot i it is not the line of identity as it does not start at the origin with a slope of 1. The figure legend has been amended to clarify.

      In the same panels (Figure 1g,h, and subsequent figures) consider changing the title from "soma (voltage)" to uEPSP.

      The titles have been amended.

      In panel Figure 1i note the missing "(" in the title.

      Title amended.

      In panel Figure 1h: Shouldn't the X-axis label and legend text read "Arithmetic sum of (EPSP) integrals" instead of "Integral of arithmetic sum"?

      The wording more accurately reflects the analytical operations. The asynchronous (separate) responses were summed arithmetically first, and then the integral was taken of each cumulative sum. We have therefore left the axis title and legend unchanged.
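
      The order of operations described here (sum the separate responses first, then measure each cumulative sum) can be sketched as follows. The array layout, the rectangle-rule integral and the function name are assumptions of this sketch, and the traces are taken to be already baseline-subtracted and aligned.

```python
import numpy as np

def cumulative_sum_measures(separate_traces, dt_ms):
    """Amplitude and integral of each cumulative arithmetic sum.

    separate_traces: array of shape (n_locations, n_samples) holding the
    asynchronous (separate) responses, assumed baseline-subtracted.
    dt_ms: sampling interval in ms.
    """
    traces = np.asarray(separate_traces, dtype=float)
    # Row k is the arithmetic sum of the first k+1 separate responses.
    cumulative = np.cumsum(traces, axis=0)
    # Measurements are taken on the summed traces, not on the originals.
    amplitudes = cumulative.max(axis=1)
    integrals = cumulative.sum(axis=1) * dt_ms   # rectangle-rule integral
    return amplitudes, integrals

# Two hypothetical responses peaking at different samples.
amps, areas = cumulative_sum_measures([[0.0, 1.0, 0.0], [0.0, 0.0, 2.0]], dt_ms=1.0)
```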

      Figure 2a,c: Could you please describe how the scaling was performed for the two axes?

      Method section amended.

      In the same panels (Figure 2a,c, and subsequent figures), the legend seems to be misleading: the plot is NS Amplitude/Integral vs Arithmetic sum, and the black line is the identity line (or scaled interpolation of the arithmetic sum, which is essentially the same).

      The scaled arithmetic sums (uEPSP amplitude, integral) represent linear summation and so overlap with the line of identity. The interpolation estimate of the asynchronous (separate) calcium transient response does not overlap with the line of identity as this estimate does not start at the origin with a slope of 1. The legends throughout the manuscript have been amended to clarify this.

      Figure 2b,d,f (and subsequent figures) slope plots: Please indicate that this is the average amplitude supralinearity for the individual recorded dendrites. Note here that the Results text mentions only the average amplitude supralinearity, but not the slope plots, paired mean difference, or Gardner-Altman estimation, illustrated in the figures.

      Nonlinearity as defined in the manuscript is a cumulative measurement. The final value per dendritic segment is therefore the sum of nonlinearities at 1 to 12 near-synchronous uncaging locations. The data for the individual dendritic segments are shown in the slopegraphs as in Fig 2b, with their distribution visible. The averages referred to in the results correspond to the paired mean difference plots, which are the group summaries. The method section has been amended to clarify.

      Fig 2e: The legend (both text and figure, also in the following figures) is confusing, as the gray line and diamonds are defined as separate 12(?) responses, but it seems to represent a linear interpolation of the scaled arithmetic sums (ultimately nothing else but an identity line).

      The grey line shows the linear interpolation output between the calcium transient measurements at 1 uncaging location and at 12 uncaging locations. The 12th uncaging location is indicated in the key as “separate 12”. The linear interpolation in these plots does represent linear summation but is not the line of identity as it does not begin at the origin and does not have a slope of 1.
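
      A minimal sketch of such a reference line is shown below, assuming the x-axis is the number of uncaging locations and using hypothetical calcium transient values; the function name is illustrative only.

```python
import numpy as np

def interpolated_linear_reference(response_at_1, response_at_12, n_locations=12):
    """Straight line between the separate responses at 1 and n_locations."""
    counts = np.arange(1, n_locations + 1)
    # Linear interpolation between the two measured end points.
    reference = np.interp(counts, [1, n_locations], [response_at_1, response_at_12])
    return counts, reference

# Hypothetical dF/F values at 1 and at 12 uncaging locations.
counts, reference = interpolated_linear_reference(0.05, 0.16)
slope = (reference[-1] - reference[0]) / (counts[-1] - counts[0])
intercept = reference[0] - slope * counts[0]
```

      With these hypothetical end points the line has a slope of 0.01 per location and a non-zero intercept of 0.04, so it represents linear summation without being the line of identity.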

      Reviewer #2 (Recommendations for the authors):

      This study is well-developed and technically executed. I only have minor comments for the authors:

      (1) To target NDNF+ neurons, the authors use the NDNF-Cre mouse line and a Cre-dependent AAV using the mDLX promoter. Why the mDLX promoter? Would it have been sufficient to use any Cre-dependent fluorophore?

      Pilot experiments revealed leaky expression when a virus driving flexed ChR2 under a non-specific promoter (EF1a) was injected in the neocortex of Ndnf-Cre mice (Author response image 1). In our hands, and in line with Dimidschstein et al (2016), the use of the mDLX enhancer reduced off-target expression.

      Author response image 1.

      A. AAV2/5-EF1a-DIO-hChR2(H134R)-mCherry injected into superficial neocortex of Ndnf-Cre mice led to expression in a few pyramidal neurons in addition to layer 1 neurogliaform cells. B. Patch-clamp recording from a non-labelled pyramidal cell showed that an optogenetically evoked glutamatergic current remained after blockade of GABAA and GABAB receptors, further confirming limited specificity of expression of ChR2. (Data from M Muller, M Mercier and V Magloire, Kullmann lab.)

      (2) The distance of the uncaging sites from the soma plays a key role. The authors should indicate the mean distance of the cluster and its variance.

      Uncaging distance from soma is indicated for both NGF and OLM interneurons in the supplementary figures S2 and S3 respectively.

      (3) Martina et al., in Science 2000, showed high levels of Na+ channels in the dendrites of OLM cells and hinted that spikes could occur in them. The authors should discuss this possible discrepancy.

      Discussion amended.

      (4) Looking at Figure 1d, the EPSPs look exceptionally long-lasting, longer than those observed by stimulating axonal inputs. Could this indicate spill-over excitation? If so, how could this affect the outcome of this study?

      The asynchronous (separate) responses decay to baseline within 100 ms, similar to the neurogliaform EPSPs evoked by electrical stimulation of axons in the SLM in Mercier et al. 2022. We observed clear plateau potentials in a minority of cells (e.g. Fig. S1b). Such plateau potentials can be generated by dendritic calcium channels and we do not consider that glutamate spillover needs to be invoked to account for them.

      (5) In the legend of Figure 2: "n=11 dendrites in 11 cells from 9 animals". Why do the authors only study 11 dendrites from 11 cells? Isn't it possible to repeatedly stimulate clusters of synaptic inputs onto the same cells? In principle, could one test many dendrites of the same cell at different distances from the soma? It is also remarkable that there were very few cells per animal.

      The goal always was to record from as many dendrites as possible from the same cells whilst maintaining high standards of cell health. When cell health indicators such as blebbing, input resistance change or resting voltage change were detected, no further dendritic location could be tested with reasonable confidence. In a given 400 µm slice there would be relatively few healthy candidate cells at a suitable depth to attempt to patch-clamp.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      In this ms, Tejeda-Muñoz and colleagues examine the roles of macropinocytosis in WNT signalling activation in development (Xenopus) and cancer (CRC sections, cell lines and xenograft experiments). Furthermore, they investigate the effect of the inflammation inducer Phorbol-12-myristate-13-acetate (PMA) in WNT signalling activation through macropinocytosis. They propose that macropinocytosis is a key driver of WNT signalling, including upon oncogenic activation, with relevance in cancer progression.

      I found the analyses and conclusions of the relevance of macropinocytosis in WNT signalling compelling, notably upon constitutive activation both during development and in CRC.

      Thank you.

      However, I think this manuscript only partially characterises the effects of PMA in WNT signalling, largely due to a lack of an epistatic characterisation of PMA roles in Wnt activation. For example: 1- The authors show that PMA cooperates with 1) GSK3 inhibition in Xenopus to promote WNT activation, and 2) (possibly) with APCmut in SW480 to induce b-cat and FAK accumulation. To sustain a specific functional interaction between WNT and PMA, the effects should be tested through additional epistatic experiments. For example, does PMA cooperate with Wnt8 in axis duplication analyses? Does PMA cooperate with any other WNT alteration in CRC or other cell lines? Importantly, does APC re-introduction in SW480 rescue the effect of PMA? Such analyses could be critical to determine specificity of the functional interactions between WNT and PMA. This question could be addressed by performing classical epistatic analyses in cell lines (CRC or HEK) focusing on WNT activity, and by including rescue experiments targeting the WNT pathway downstream of the effects e.g., dnTCF, APC re-introduction, etc.

      We agree that there was a need for additional direct evidence of functional interactions between macropinocytosis, Wnt signaling, and PMA beyond the previously provided target gene assays in Xenopus (now shown in Figure 1I) and luciferase assays in cultured cells (Figure 1J), which used LiCl and inhibition by Bafilomycin. We therefore carried out a new experiment using 3T3 cells, now shown in Figure 1K-P. Wnt3a protein increased the uptake of TMR-dextran 70 kDa, and PMA enhanced this response. The macropinocytosis inhibitor EIPA blocked induction of macropinocytosis by Wnt3a and PMA. These results were quantitated in Figure 1Q. We think this new experiment strengthens the main conclusion that the tumor promoter PMA increases macropinocytosis. Thank you.

      2) While the epistatic analyses of WNT and macropinocytosis are clear in frog, the causal link in CRC cells is confined to b-catenin accumulation. While it is clear that macropinocytosis reduces spheroid growth in SW480, the lack of rescue experiments with e.g., constitutive active b-catenin or any other WNT perturbation or/and APC re-introduction, limits the conclusions of this experiment.

      We now provide new experiments in 3T3 cells treated with LiCl, overexpression of constitutively-active β-catenin and constitutively-active Lrp6 (Figure 4, panels I through L’’); the new results indicate that Wnt signaling activation increases protein levels of the macropinocytosis activator Rac1.

      Minor comments:

      3- Different compounds targeting membrane trafficking are used to rescue modes of WNT activation (Wnt8 vs LiCl) in Xenopus.

      The main goal of our experiments was to test the requirement of membrane trafficking for tumor promoter activity through the Wnt pathway. We therefore used PMA, and a variety of inhibitors such as EIPA (Na+/H+ exchanger, Figure 1I and Figure 3D), Bafilomycin A (Figure 1H), DN-Rab7 (Figure 3G) and EHT1864 (a Rac1 inhibitor, Figure 4G). One could argue that using a wide variety of membrane trafficking inhibitors is a plus.

      4- The abstract does not state the results in CRC/xenografts

      We have added a sentence to the abstract.

      5- Labels of Figure 2E might be swapped

      Thank you for detecting this error, we now label the last two columns in Figure 2E correctly.

      6- Figure 4i,j, 6 and s4 rely on qualitative analyses instead of quantifications, which underscores their evaluation. On the other hand, the detailed quantifications in Figure S3A-D strongly support the images of Figure 5

      The quantifications of the previous Figure 4I-J supported the data in the initial reviewed preprint, shown in Author response image 1:

      Author response image 1.

      However, these data have now been deleted from this version to make space for new experiments showing the stabilization of Rac1 by stabilized β-catenin and CA-LRP6. Quantifications in Figure 6C-F’’ are not shown because they represent changes in subcellular localization, but a western blot is provided in Figure 6B. Quantifications for Figure 6H-I’’ are shown in panel 6G. Supplemental Figure S4 already has 24 panels so introducing quantifications would be unwieldy.

      Thank you for the thoughtful comments.

      Reviewer #2 (Public Review):

      Tejeda Muñoz et al. investigate the intersection of Wnt signaling, macropinocytosis, lysosomes, focal adhesions and membrane trafficking in embryogenesis and cancer. Following up on their previous papers, the authors present evidence that PMA enhances Wnt signaling and embryonic patterning through macropinocytosis. Proteins that are associated with the endo-lysosomal pathway and Wnt signaling are co-increased in colorectal cancer samples, consistent with their pro-tumorigenic action. The function of macropinocytosis is not well understood in most physiological contexts, and its role in Wnt signaling is intriguing. The authors use a wide range of models - Xenopus embryos, cancer cells in culture and in xenografts and patient samples to investigate several endolysosomal processes that appear to act upstream or downstream of Wnt. A downside of this broad approach is a lack of mechanistic depth. In particular, few experiments monitor macropinocytosis directly, and macropinocytosis manipulations have pleiotropic effects that are open to alternative interpretations. Several experiments are confirmatory of previous findings; the manuscript could be improved by focusing on the novel relationship between PMA-induced macropinocytosis and by better supporting these conclusions with additional experiments.

      New additional experiments focusing on the role of PMA are now provided.

      The authors use a range of inhibitors that suppress macropinosome formation (EIPA, Bafilomycin A1, Rac1 inhibition). However, these are not specific macropinocytosis inhibitors (EIPA blocks an Na+/H+ exchanger, which is highly toxic and perturbs cellular pH balance; Bafilomycin blocks the V-ATPase, which has essential functions in the Golgi, endosomes and lysosomes; Rac1 signals through multiple downstream pathways). A specific macropinocytosis inhibitor does not exist, and it is thus important to support key conclusions with dextran uptake experiments.

      We used a wide range of inhibitors because the main idea is to show that membrane trafficking is important in Wnt and PMA activity. We would like to point out that the current experimental definition in the field of macropinocytosis, despite any caveats, is the ability to block dextran uptake with EIPA. Because inhibitors may not be entirely specific, we think using a broad approach to target membrane trafficking might be a plus. We now provide in Figure 1K-Q a new experiment showing that Wnt3a protein treatment increases dextran uptake and PMA stimulates this macropinocytosis in 3T3 cells. EIPA inhibited dextran macropinocytosis in the presence of Wnt and PMA (Figure 1N and 1Q). We also provide a time-lapse video of the rapid induction of macropinocytic vesicles by PMA in SW480 CRC cells in which the plasma membrane is tagged (Supplemental Movie S1).

      The title states that PMA increases Wnt signaling through macropinocytosis. However, the mechanistic relationship between PMA-induced macropinocytosis and Wnt signaling is not well supported. The authors refer to a classical paper that demonstrates macropinocytosis induction by PMA in macrophages (PMID: 2613767). Unlike most cell types, macrophages display growth factor-induced and constitutive macropinocytic pathways (PMID: 30967001). It would thus be important to demonstrate macropinocytosis induction by PMA experimentally in Xenopus embryos / cancer cells. Does treatment with EIPA / Bafilomycin / Rac1i decrease the dextran signal in embryos? In macrophages, the PKC inhibitor Calphostin C blocks macropinocytosis induction by PMA (PMID: 25688212). Does Calphostin C block macropinocytosis in embryos / cancer cells? Do the various combinations of Wnts / Wnt agonists and PMA have additive or synergistic effects on dextran uptake? If the authors want to conclude that PMA activates Wnt signaling, it would also be important to demonstrate the effect of PMA on Wnt target gene expression.

      We now provide a new experiment showing macropinocytosis induction by PMA experimentally in cancer cells. CRC SW480 cells, despite having a mutant APC, are able to respond to PMA by further increasing TMR-dextran 70 kDa uptake over background within 1 hour (now shown in Figure S1):

      Investigating PKC and Calphostin C is outside the goals of this paper. With respect to the final point on the effect of PMA on Wnt target gene expression, this was shown in the context of the Xenopus embryo in Figure 1I (Siamois and Xnr3 are direct targets of Wnt).

      Author response image 2.

      The experiments concerning macropinosome formation in Xenopus embryos are not very convincing. Macropinosomes are circular vesicles whose size in mammalian cells ranges from 0.2 - 10 µm (PMID: 18612320). The TMR-dextran signal in Fig. 1A does not obviously label structures that look like macropinosomes; rather the signal is diffusely localized throughout the dorsal compartment, which could be extracellular (or perhaps cytosolic). I have similar concerns for the cell culture experiments, where dextran uptake is only shown for SW480 spheroids in Fig. S2. It would be helpful to quantify the size of the circular structures (is this consistent with macropinosomes?).

      In response, we have deleted the TMR experiments in Xenopus embryos; they will be reinvestigated at a later time. With respect to macropinosome sizes in cultured cells, they are indeed large at the plasma membrane level (see new Supplemental Movie S1), but rapidly decrease in size once dextran is concentrated inside the cell. This can be visualized in the new experiments showing dextran vesicles in Supplemental Figure S1J-K and Figure 1K-P.

      In Fig. 4I - J, the dramatic decrease in b-catenin and especially in Rac1 after overnight EIPA treatment is rather surprising. How do the authors explain these findings? Is there any evidence that macropinocytosis stabilizes Rac1? Could this be another effect of EIPA or general toxicity?

      We now provide new evidence that Wnt signaling stabilizes Rac1. The old data relying on overnight EIPA treatment have been replaced by new experiments in 3T3 cells showing (i) that LiCl treatment increases Rac1 and β-catenin protein levels (Figure 4I-J’’), (ii) that cells transfected with constitutively active β-catenin-GFP have higher levels of Rac1 than control untransfected cells (Figure 4K-K’’) and (iii) that Rac1 is stabilized in cells transfected with CA-Lrp6-GFP when compared to untransfected cells (Figure 4L-L’’).

      On a similar note, Fig. 6 K - L the FAK staining in control cells appears to localize to focal adhesions, but in PMA-treated cells is strongly localized throughout the cell. Do the authors have any thoughts on how PMA stabilizes FAK and where the kinase localizes under these conditions? Does PMA treatment increase FAK signaling activity?

      The previous Figure 6K-L’’ panels are now found in Supplementary Figure S4, panels C-D’’. The result is that FAK is greatly stabilized by overnight incubation with PMA. How this is achieved is unknown; it is perhaps the result of increased macropinocytosis, but we do not wish to speculate in the main manuscript. We have not measured FAK activity, but the FAK inhibitor PF-00562271 strongly decreased the β-catenin signaling induced by GSK3 inhibition (Figure 6J) and has strong effects in neural development that mimic inhibition of the early Wnt signal (new experiments shown in Figure 6K-L’’’). The results suggest that FAK activity affects Wnt signaling and dorsal development; the molecular mechanism of this interaction is unknown but worthy of future studies.

      The tumor stainings in Figure 5 are interesting but correlative. Pak1 functions in multiple cellular processes and Pak1 levels are not a direct marker for macropinocytosis. In the discussion, the authors discuss evidence that the V-ATPase translocates to the plasma membrane in cancer to drive extracellular acidification. To which extent does the Voa3 staining reflect lysosomal V-ATPase? Do the authors have controls for antibody specificity?

      It is true that Pak1 has multiple functions, yet it is essential for the actin machinery that drives macropinocytosis. We have now rephrased the discussion to say “Rac1 is an upstream regulator of the Pak1 kinase required for the actin machinery that drive macropinocytosis (Redelman-Sidi et al., 2018)”. We also explain that: “V-ATPase has been associated with acidification of the extracellular milieu in tumors (Capecci and Forgac, 2013; Hinton et al., 2009; Perona and Serrano, 1988). Extracellular acidification is probably due to increased numbers of lysosomes which are exocytosed, since V0a3 was located within the cytoplasm in advanced cancer or xenografts in mice (Figures 5I and S3I)”. The antibody we used for V0a3 is highly specific and has been used widely (Ramirez et al., 2019).

      Reviewer #3 (Public Review):

      The manuscript by Tejeda-Munoz examines signaling by Wnt and macropinocytosis in Xenopus embryos and colon cancer cells. A major problem with the study is the extensive use of pleiotropic inhibitors as "specific" inhibitors of macropinocytosis in embryos. It is true that BafA and EIPA block macropinocytosis, but they do many other things as well. A major target of EIPA is the NheI Na+/proton transporter, which also regulates invasive structures (podosomes, invadopodia) which could have major roles in development. Similarly, Baf1 will disrupt lysosomes and the endocytic system, which secondary effects on mTOR signaling and growth factor receptor trafficking. The authors cannot assume that processes inhibited by these drugs demonstrate a role of macropinocytosis. While correlations in tumor samples between increased expression of PAK1 and V0a3 and decreased expression of GSK3 are consistent with a link between macropinocytosis and Wnt-driven malignancy, the cell and embryo-based experiments do not convincingly make this connection. Finally, the data on FAK and TES are not well integrated with the rest of the manuscript.

      The criticism that drugs are not entirely specific is a valid one. Our approach of using a variety of drugs, such as EIPA, BafA, EHT1864 and the FAK inhibitor PF-00562271, points to the main conclusion that membrane trafficking is important in signaling by Wnt and in the action of the tumor promoter PMA. The data on FAK, TES and focal adhesions have been better integrated in the manuscript, and new experiments on the effect of the FAK inhibitor on embryonic dorsal development are now provided (Figure 6K-L’’’).

      1) The data in Fig. 1A do not convincingly demonstrate macropinocytosis - it is impossible to tell what is being labeled by the dextran.

      In response, we have deleted the TMR-dextran experiments in Xenopus embryos; they will be reported at a later time.

      2) The data in Fig. 2 do not make sense. LiCL2 bypasses the WNT activation pathway by inhibiting GSK3. If subsequent treatment with BafA blocks the effects of GSK3 inhibition, then BafA is doing something unrelated to Wnt activation, whose target is the inhibition/sequestration of GSK3. While BafA might block GSK3 sequestration by inhibiting MVB function, it should have no effect on the inhibition of GSK3 by LiCl2.

      We now explain, in the Results text describing Figure 2, that the initial effect of GSK3 inhibition by LiCl is to trigger macropinocytosis (Albrecht et al., 2020). If the downstream acidification of lysosomes is inhibited, then the brief treatment with LiCl (7 min at the 32-cell stage) has no effect (LiCl 1st+BafA 2nd, Figure 2H). BafA inhibits lysosomal acidification at the 32-cell stage, resulting in ventralization, but the effect of brief BafA treatment can be reversed by inducing membrane trafficking with LiCl (BafA 1st+LiCl 2nd, Figure 2C). The labelling of figure panels C and H has been modified to indicate that this is an order-of-addition experiment. These order-of-addition experiments strongly support the proposal that endogenous lysosomal activity is required to generate the initial endogenous Wnt signal that takes place at the 32-cell stage of development (Tejeda-Muñoz and De Robertis, 2022a).

      3) The effect of EHT on MP in SW480 cells is not clearly related to what is happening in the embryos. The nearly total loss of staining for Rac and β-catenin after overnight EIPA does not implicate MP in protein stability - critical controls for cell viability and overall protein turnover are absent. Inhibition of WNT signaling might be expected to enhance β-catenin turnover, but the effect on Rac1 is surprising. A more quantitative analysis by western blotting is required.

      The results from EIPA inhibition of SW480 cells have been replaced in Figure 4. We now provide new evidence in 3T3 cells that Wnt signaling stabilizes Rac1. The old data relying on EIPA treatment in SW480 cells have been replaced by new experiments in 3T3 cells showing (i) that LiCl treatment increases Rac1 and β-catenin protein levels (Figure 4I-J’’), (ii) that cells transfected with constitutively active β-catenin-GFP have higher levels of Rac1 than control untransfected cells (Figure 4K-K’’) and (iii) that Rac1 is stabilized in cells transfected with CA-Lrp6-GFP when compared to untransfected cells (Figure 4L-L’’). In the original EIPA experiment in SW480 cells, now deleted from this version of the manuscript, we tested cell viability using a Vi-Cell Beckman-Coulter Viability Analyzer and found that cells were 96-98% viable but that proliferation was strongly decreased after 12 h of EIPA treatment. The effect of brief Rac1 inhibition (7 min) at the critical 32-cell stage in decreasing dorsal development in embryos is robust (Figure 4A-C). In addition, coinjection of EHT entirely blocks the effects of microinjected xWnt8 mRNA (compare Figure 4E to 4G, see also Figure 4H), suggesting that Rac1 is required for Wnt signaling. Quantitative target gene expression analysis is provided for the embryo experiments (Figure 4C and 4H); for the stabilization of Rac1 by Wnt we do not provide quantitative measurements, but we found similar results with three independent approaches (LiCl, CA-β-catenin and CA-Lrp6).

      4) The data on FAK inhibition and TES trafficking are poorly integrated with the rest of the paper.

      We attempted to better relate the TES trafficking to our previous paper showing that canonical Wnt signaling induces focal adhesion and Integrin-β1 endocytosis. We now write in the results: “We have previously reported a crosstalk between the Wnt and focal adhesion (FA) signaling pathways. Wnt3a treatment rapidly led to the endocytosis of Integrin β1 and of multiple focal adhesion proteins into MVBs (Tejeda-Muñoz et al., 2022). FAs link the actin cytoskeleton with the extracellular matrix (Figure 6A), and we now investigated whether FA activity is affected by Wnt signaling, PMA treatment and CRC progression”.

      Reviewer #3 (Recommendations For The Authors):

      The reliance on pleiotropic inhibitors is a weakness and should be supplemented by genetic approaches to inhibit macropinocytosis.

      We agree, but that would be outside of the scope of this study.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This solid study investigates the transdifferentiation of chicken embryonic fibroblasts into muscle and fat cells in 3D to create whole-cut meat mimics. The study is important and provides a method to control muscle, fat, and collagen content within the 3D meat mimics and thus provides a new avenue for customized cultured meat production. Limitations of this study include the use of transgene for transdifferentiation and thus the creation of GMO food.

      We are grateful for the substantial effort that the editors and reviewers put into assessing our manuscript and providing insightful feedback. We have tried to address, as far as possible, all comments and criticisms, and we believe that the manuscript is now significantly improved. A point-by-point response follows.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors presented here a novel 3D fibroblast culture and transdifferentiation approach for potential meat production with GelMA hydrogel.

      Strengths:

      (1) Reduced serum concentration for 3D chicken fibroblast culture and transdifferentiation is optimized.

      (2) Efficient myogenic transdifferentiation and lipogenesis as well as controlled fat deposition are achieved in the 3D GelMA.

      Weaknesses:

      (1) While the authors stated the rationale of using fibroblasts instead of myogenic/adipogenic stem cells for meat production, the authors did not comment on the drawbacks/disadvantages of genetic engineering (e.g., forced expression of MyoD) in meat production.

      We thank the reviewer for raising this important issue. We have now described this drawback in the Discussion.

      As a proof-of-concept study, we sought to explore the potential of using genome-integrated transgene tools to overexpress a transdifferentiation factor and achieve maximum muscle production. However, it is important to acknowledge that genetically modified meat products derived from the genetic engineering of cultured cells are unlikely to gain consumer acceptance or be commercially viable. We are currently testing non-integrating delivery methods, such as modRNAs and chemical cocktails, to induce myogenic transdifferentiation in fibroblasts. We believe these non-integrating methods would be compatible with meat production and consumer acceptance.

      Please see lines 439-445.

      “As a proof-of-concept, we utilized the transgene method to achieve maximum myogenic induction and the final products still retain the foreign transgene fragment in the cells’ genome. It is therefore posing a risk of genetic modified food which is not suitable for mass production. In the next step, other non-transgenic means such as non-integrating vectors, chemical reprogramming, modified RNAs, and recombinant transgene removal techniques will be explored to develop transgene-free end products.”

      (2) While the authors cited one paper to state the properties and applications of GelMA hydrogel in tissue engineering and food processing, concerns/examples of the food safety with GelMA hydrogel are not discussed thoroughly.

      Thank you for pointing out this issue. We have discussed the drawbacks of GelMA hydrogel applications in meat production in the main text.

      GelMA-based hydrogels have shown great potential due to their biocompatibility and mechanical tunability. GelMA is widely used in 3D cell culture and tissue engineering for regenerative medicine, but is less common in food processing and agricultural applications. Its photo-crosslinking properties, biocompatibility and degradability allow the material to be shaped into complex tissue structures by 3D printing or modelling. Many researchers have also used GelMA hydrogel as a scaffold for cultured meat production (Jeong et al., 2022; Li et al., 2021; Park et al., 2023). Later research will carefully consider GelMA hydrogel as well as other types of scaffold biomaterials for cost-effective and food-safety compliant cultured meat production (Bomkamp et al., 2022).

      Bomkamp, C., Skaalure, S. C., Fernando, G. F., Ben‐Arye, T., Swartz, E. W., & Specht, E. A. (2022). Scaffolding biomaterials for 3D cultivated meat: prospects and challenges. Advanced Science (Weinh), 9(3), 2102908.

      Jeong, D., Seo, J. W., Lee, H. G., Jung, W. K., Park, Y. H., & Bae, H. (2022). Efficient Myogenic/Adipogenic Transdifferentiation of Bovine Fibroblasts in a 3D Bioprinting System for Steak-Type Cultured Meat Production. Advanced Science (Weinh), 9(31), e2202877.

      Li, Y., Liu, W., Li, S., Zhang, M., Yang, F., & Wang, S. (2021). Porcine skeletal muscle tissue fabrication for cultured meat production using three-dimensional bioprinting technology. Journal of Future Foods, 1(1), 88-97.

      Park, S., Hong, Y., Park, S., Kim, W., Gwon, Y., Jang, K.-J., & Kim, J. (2023). Designing Highly Aligned Cultured Meat with Nanopatterns-Assisted Bio-Printed Fat Scaffolds. Journal of Biosystems Engineering, 48(4), 503-511.

      We discussed the drawbacks of GelMA hydrogel. Please see lines 445-457.

      “Another food safety concern in this study is the use of GelMA hydrogel for culture meat production. Due to its excellent biocompatibility and mechanical flexibility, GelMA-based hydrogel has demonstrated significant potential in scalable 3D cell culture for creating artificial tissue ranging in sizes from millimeters to centimeters. It is widely used in 3D cell culture and tissue engineering for regenerative medicine, but less common in food processing and agricultural applications. Due to its special photo-crosslinking properties, biocompatibility and degradability, it allows this material to be shaped into complex tissue structures by 3D printing or modelling. Many researchers have also used GelMA hydrogel as a scaffold for culture meat production (Jeong et al., 2022; Li et al., 2021; Park et al., 2023). Later research will carefully consider hydrogel as well as other types of scaffold biomaterials for cost-effective and food-safety compliant culture meat production (Bomkamp et al., 2022). ”

      (3) In Fig. 4C, there seems no significant difference in the Vimentin expression between Fibroblast_MyoD and Myofibroblast. The conclusion of "greatly reduced in the myogenic transdifferentiated cells" is overstated.

      Thanks for pointing out this mistake.

      We have revised the wording accordingly. Vimentin expression was reduced in Fibroblast_MyoD compared with the original fibroblasts.

      Please see lines 231-233.

      “The fibroblast intermediate filament Vimentin (Tarbit et al., 2019) was abundantly expressed in the fibroblasts but reduced in the myogenic transdifferentiated cells (Figure 4C)”

      (4) The presented cell culture platform is only applied to chicken fibroblasts and should be tested in other species such as pigs and fish.

      Thank you for the suggestion.

      In this pilot cultured meat study, we utilized chicken embryonic fibroblasts. These specific cells were chosen for their near-immortal nature and robustness in culture, as well as the inducible myogenic capacity. In our previous experiments (Ren et al, Cell Reports, 2022, 40:111206), we have tested the myogenic transdifferentiation potential of fibroblasts from mice, pigs, and chickens, and observed varying efficiencies of myogenesis. It is important to note that fibroblast cells derived from different species, or even different tissues within the same species, would exhibit significant variations in their capacities for myogenic and adipogenic transdifferentiation.

      In this proof-of-concept study we used only one source of fibroblasts for testing cultured meat production, and confirmed that myogenic/adipogenic transdifferentiation can be manipulated as a feasible means of precisely controlling muscle, fat and collagen content. We would expect fibroblasts of different origins to display different transdifferentiation efficiencies and thus to produce various muscle/fat ratios in meat mimics. Testing this is beyond the scope of the current study.

      Furthermore, we are also testing myogenic/adipogenic transdifferentiation of fibroblasts from pigs through non-genomic integration approaches. We believe that only non-transgenic tools are viable solutions for cultured meat production in the future. We have added the species information to the Discussion.

      See lines 515-517.

      “This approach can be readily extrapolated to other species such as pigs and presents promising avenues for the large-scale production of customized and versatile meat products that may cater to varying consumer preferences.”

      Reviewer #2 (Public Review):

      The manuscript by Ma et al. tries to develop a protocol for cell-based meat production using chicken fibroblasts as three-dimensional (3D) muscle tissues with fat accumulation. The authors used genetically modified fibroblasts which can be forced to differentiate into muscle cells and formulated 3D tissues with these cells and a biphasic material (hydrogel). The degrees of muscle differentiation and lipid deposition in culture were determined by immunohistochemical, biochemical, and molecular biological evaluations. Notably, the protocol successfully achieved the process of myogenic and lipogenic stimulation in the 3D tissues.

      Overall, the study is reasonably designed and performed including adequate analysis. The manuscript is clearly written with well-supported figures. While it presents valuable results in the field of cultivated meat science and skeletal muscle biology, some critical concerns were identified. First, it is unclear whether some technical approaches were really the best choice for cell-based meat production. Next, more careful evaluations and justifications would be required to properly explain biological events in the results. These points include additional evaluations and considerations with regard to myocyte alignment and lipid accumulation in the differentiated 3D tissues. The present data are very suggestive in general, but further clarifications and arguments would properly support the findings and conclusions.

      We thank the reviewer for these comments. We have performed additional experiments and analyses to address the critical questions. We have also revised the text extensively to clarify or discuss some of the concerns, such as cell alignment and the cellular distribution of intramuscular fat. We expect that the revised data and text adequately support the conclusions of the manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) In Figure 1, the authors used 1% chicken serum. Have the authors tested other lower concentrations? It will be interesting to see the lowest chicken serum concentrations in fibroblast culture and transdifferentiation;

      Thank you for your suggestion.

      Yes, we have tested lower concentrations of serum, such as 1% FBS and 0.5% chicken serum. However, the cells are not in a healthy state under these low serum levels, as shown by abnormal cell morphology and almost no cell growth. Please see the revised Supplementary Figure S1D, to which we added the 1% FBS and 0.5% chicken serum data. Hence, 1% chicken serum is optimal in our hands. We will also test other types of specialized serum-free medium in future experiments.

      (2) In Figure 2, the authors should quantify the fold expansion of fibroblasts cultured in 3D gel after 1, 3, 5, and 9 days since this data is important for future meat manufacturing. In addition, long-term expansion (e.g., 1 month) in 3D gel should also be shown;

      Thank you for the question. We have quantified cell growth in 3D by measuring PKH26-stained cells. After the cells were implanted into the gel, they proliferated exponentially from day 1 to day 9. The cell proliferation data provide a good reference for future meat manufacturing (Figure 2D). We have attempted long-term expansion in 3D but could not measure cell proliferation, because the 3D gel always collapsed after 12-15 days in culture for unknown reasons: either the cells grew too densely and compromised the gel structure, or the gel matrix itself is not strong enough to last long-term. We believe the cells will grow well long-term if we provide enough 3D attachment surface, since they grow indefinitely in 2D. We will test different 3D matrices in the future.

      Please see the revised Figure 2D for the quantification of cells.
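
      As an illustration of how the fold expansion requested by the reviewer could be derived from such counts, a minimal sketch follows. It assumes simple exponential growth between two time points; the function names and the cell counts are hypothetical and are not taken from Figure 2D.

```python
import math

def fold_expansion(n_start, n_end):
    """Fold expansion between two cell counts (n_end / n_start)."""
    if n_start <= 0:
        raise ValueError("n_start must be positive")
    return n_end / n_start

def doubling_time(n_start, n_end, days):
    """Population doubling time in days, assuming exponential growth."""
    fold = fold_expansion(n_start, n_end)
    if fold <= 1:
        raise ValueError("no net growth; doubling time is undefined")
    return days * math.log(2) / math.log(fold)

# Hypothetical counts (illustrative only): day -> number of labelled cells
counts = {1: 1.0e5, 9: 1.6e6}
fold = fold_expansion(counts[1], counts[9])
td = doubling_time(counts[1], counts[9], days=9 - 1)
print(f"fold expansion: {fold:.1f}, doubling time: {td:.2f} days")
```

      Under the exponential-growth assumption, the doubling time is the elapsed time multiplied by ln 2 and divided by the natural log of the fold expansion; it would not apply once growth plateaus or the gel collapses.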

      (3) In Figure 3, please also show MyoD staining as it'll be interesting to see the expression of exogenous and endogenous MyoD expression after dox treatment. In Figure G, the hydrogel meat seems very small, please show/discuss the maximum size of hydrogel meat that may be achieved using this approach;

      Thank you for requesting this information. We performed immunostaining with anti-MyoD and anti-Flag antibodies to show the expression of all MyoD (exogenous and endogenous) and of exogenous MyoD only after dox treatment. MyoD and 3xFlag were fused in-frame in the transgene plasmid; thus, anti-Flag staining indicates exogenous MyoD expression, whereas anti-MyoD staining indicates the combined expression of exogenous and endogenous MyoD.

      As shown in Figure S4, we found that almost 100% of cells were positive for MyoD staining and that 60% of them expressed Flag; these data are consistent with our previous results (Ren et al., 2022, Cell Reports).

      Author response image 1.

      As for the size of the hydrogel-based cultured meat, we have discussed the possibility of scalable production of hydrogel-based whole-cut meat mimics. Please see lines 446-449. “Due to its excellent biocompatibility and mechanical flexibility, GelMA-based hydrogel has demonstrated significant potential in scalable 3D cell culture for creating artificial tissue ranging in sizes from millimeters to centimeters.”

      (4) In Figure 5 and Supplementary Figure 6, please quantify the Oil-red O+ fat cells in the 2D and 3D lipogenic induction. Also in Fig. 6B, quantify the oil-red+MHC+ cells;

      Thank you for this advice. We have quantified the Oil Red O-stained images in the Results section “Stimulate the fat deposition in chicken fibroblasts in 3D” using the analysis software ImageJ, and the quantification of Oil Red O area was added to the corresponding graphs (Figure 5C, Figure S6C and S6F).

      However, due to the unique structure of the 3D matrix, many MHC+ and Oil Red O+ double-positive cells overlap with each other across different Z-stack layers in 3D. This overlap makes it challenging to accurately position and quantify the double-positive cells as the different layers interfere with each other.

      (5) In Figure 7, please show immunostaining images of collagen and other major ECMs;

      Thank you for this question. We tried to stain collagen networks with Picrosirius Red but were unsuccessful. Instead, we used laminin immunostaining to confirm that the ECM content in the 3D matrix increases steadily during cell culture.

      Please see Figure 7C. Lines 346-348.

      “the laminin protein content was accumulated and increased steadily during 3D culturation (Figure 7C) “

      (6) In Figure 8, please show hierarchical clustering analysis of whole transcriptomes of 3D_fibroblasts, 3D_MyoD, 3D+FI, and 3D_MyoD+FI. A Venn Diagram showing the overlap and distinct gene expression among these groups is also appreciated.

      Thank you for the suggestion.

      We added a hierarchical clustering analysis of the whole transcriptomes of 3D_fibroblasts, 3D_MyoD, 3D+FI, and 3D_MyoD+FI using Euclidean distance with the ward.D clustering method. Please see Figure 8B. The result showed that these groups formed two large clusters, in which 3D+FI clustered separately and 3D_fibroblasts, 3D_MyoD and 3D_MyoD+FI were more similar to one another.
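
      For readers unfamiliar with this type of analysis, the sketch below shows agglomerative clustering on Euclidean distances with a Ward-type (Lance-Williams) update. It is a minimal illustration, not the pipeline used for Figure 8B: the function names and the two-value expression profiles are hypothetical. The update is applied to the distances as supplied, which is our understanding of what R's `hclust` does for `ward.D`; `ward.D2` would square the distances first.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ward_linkage(profiles):
    """Agglomerative clustering with the Lance-Williams Ward update.

    profiles: dict mapping sample name -> list of expression values.
    Returns a list of merges as (members_a, members_b, height).
    """
    vectors = list(profiles.values())
    clusters = {i: (name,) for i, name in enumerate(profiles)}
    dist = {frozenset((i, j)): euclidean(vectors[i], vectors[j])
            for i, j in combinations(clusters, 2)}
    merges = []
    next_id = len(clusters)
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)          # closest pair of clusters
        i, j = sorted(pair)
        d_ij = dist.pop(pair)
        n_i, n_j = len(clusters[i]), len(clusters[j])
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            n_k = len(clusters[k])
            d_ki = dist.pop(frozenset((k, i)))
            d_kj = dist.pop(frozenset((k, j)))
            # Lance-Williams coefficients for Ward's method
            updated[k] = ((n_i + n_k) * d_ki + (n_j + n_k) * d_kj
                          - n_k * d_ij) / (n_i + n_j + n_k)
        merges.append((clusters[i], clusters[j], d_ij))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        for k, d in updated.items():
            dist[frozenset((k, next_id))] = d
        next_id += 1
    return merges

# Hypothetical two-gene profiles, chosen only to illustrate the procedure
toy = {
    "3D_fibroblasts": [0.0, 0.0],
    "3D_MyoD": [1.0, 0.0],
    "3D_MyoD+FI": [0.0, 2.0],
    "3D+FI": [10.0, 10.0],
}
merges = ward_linkage(toy)
for left, right, height in merges:
    print(left, "+", right, f"at height {height:.2f}")
```

      In this toy input the sample placed far from the others joins last, mirroring the kind of two-cluster structure described above; real transcriptome data would use thousands of genes per sample and normalized expression values.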

      As the reviewer suggested, we also compared the transcriptomes of 3D_MyoD, 3D+FI, and 3D_MyoD+FI to the original 3D_fibroblasts to identify differentially expressed genes (DEGs), and then analyzed the overlapping and distinct DEGs. As shown in Figure 8D, the Venn diagram showed that the majority of DEGs from 3D_MyoD+FI (3D_MyoD+FI versus 3D_fibroblasts) overlap with those of 3D_MyoD and 3D+FI, indicating that 3D_MyoD+FI is compatible with both myogenic and adipogenic functions.
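
      The overlap analysis behind such a Venn diagram reduces to set operations on the DEG lists. The sketch below is illustrative only: the gene identifiers are placeholders, the function names are hypothetical, and the sets do not reproduce the data in Figure 8D.

```python
def venn_regions(deg_sets):
    """Map each combination of groups to the genes found in exactly those groups."""
    regions = {}
    for gene in set().union(*deg_sets.values()):
        members = frozenset(name for name, genes in deg_sets.items() if gene in genes)
        regions.setdefault(members, set()).add(gene)
    return regions

def fraction_shared(target, others):
    """Fraction of the target DEGs present in at least one of the other sets."""
    if not target:
        return 0.0
    return len(target & set().union(*others)) / len(target)

# Placeholder DEG sets (each versus 3D_fibroblasts); identifiers are hypothetical
degs = {
    "3D_MyoD": {"g1", "g2", "g3"},
    "3D+FI": {"g3", "g4", "g5"},
    "3D_MyoD+FI": {"g2", "g3", "g4", "g6"},
}
regions = venn_regions(degs)
shared = fraction_shared(degs["3D_MyoD+FI"], [degs["3D_MyoD"], degs["3D+FI"]])
for groups, genes in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(groups), sorted(genes))
print(f"fraction of 3D_MyoD+FI DEGs shared with the other groups: {shared:.2f}")
```

      Each key of `regions` corresponds to one area of the Venn diagram, so the size of every area can be read directly from the length of the associated set.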

      Please see the revised Figure 8.

      Reviewer #2 (Recommendations For The Authors):

      In this study, the authors demonstrated a new approach for cultivated meat production using chicken fibroblasts. Specifically, the cells were cultured as 3D and induced muscle differentiation and lipid deposition. The manuscript contains a good set of data, which would be valuable to researchers in the fields of both cell-based meat and skeletal muscle biology. From the aspect of cultivated meat science, the rationale behind the idea is understandable, but it remains unclear whether the proposed approach was really the best choice to achieve their final goal. On the other hand, when we read this manuscript as a paper in skeletal muscle biology, the overall approach was not innovative enough and several uncertain issues remain. The authors should add more sufficient justifications, arguments, and discussions.

      (1) When considering their goal to produce edible meat products, the current approach has some concerns. First, there are issues with the approach used for the induction of myogenesis by MyoD transgene. This makes the end products GMO foods, which are not easily acceptable to a wide range of consumers. Next, the hydrogel was used for 3D tissue formation, but it is unclear whether this matrix type is edible, safe, and bio-comparable for cell-based meat production. The authors already discussed these points by excusing that the current work remains proof-of-concept. However, more careful considerations and justifications would be required.

      Thank you for the suggestion.

      We acknowledge that the current transgene-based myogenic induction method is not suitable for mass production of cultured meat because of GMO food concerns. We used the MyoD transgene as the means of myogenic transdifferentiation in the first place because of its ease of genetic manipulation and maximum efficiency. We are currently testing non-genomic integration tools, such as chemical cocktails and modified RNAs, for myogenic transdifferentiation.

      When it comes to the applications of hydrogel in the food industry, certain types of hybrid hydrogels, such as those made from pectin or sodium polyacrylate, are not only edible but also safe for consumption. While GelMA hydrogel is typically utilized in tissue engineering and subsequent implantation in patients for therapeutic regenerative medicine purposes, it has not been commonly employed in food processing. In this study, we cultivated cells within GelMA hydrogel due to its durability and ease of use in cell culture. Moving forward, we plan to investigate alternative types of matrices to develop cultured meat suitable for food applications.

      We have now described the GMO and hydrogel drawbacks in the discussion part. Please see lines 439-457.

      “As a proof-of-concept, we utilized the transgene method to achieve maximum myogenic induction and the final products still retain the foreign transgene fragment in the cells’ genome. It is therefore posing a risk of genetic modified food which is not suitable for mass production. In the next step, other non-transgenic means such as non-integrating vectors, chemical reprogramming, modified RNAs, and recombinant transgene removal techniques will be explored to develop transgene-free end products. Another food safety concern in this study is the use of GelMA hydrogel for culture meat production. Due to its excellent biocompatibility and mechanical flexibility, GelMA-based hydrogel has demonstrated significant potential in scalable 3D cell culture for creating artificial tissue ranging in sizes from millimeters to centimeters. It is widely used in 3D cell culture and tissue engineering for regenerative medicine, but less common in food processing and agricultural applications. Due to its special photo-crosslinking properties, biocompatibility and degradability, it allows this material to be shaped into complex tissue structures by 3D printing or modelling. Many researchers have also used GelMA hydrogel as a scaffold for culture meat production (Jeong et al., 2022; Li et al., 2021; Park et al., 2023). Later research will carefully consider hydrogel as well as other types of scaffold biomaterials for cost-effective and food-safety compliant culture meat production (Bomkamp et al., 2022). ”

      (2) From the view of skeletal muscle biology, the approaches (MyoD overexpression, hydrogel-based 3D tissue formation, and lipogenic induction) have already been tested.

      Thank you for the insightful comments from the perspective of skeletal muscle cell biology. We agree that the current approaches, including MyoD overexpression, 3D cell culture and lipogenic induction, are routine experiments in muscle cell biology. However, we want to highlight that using these classical and robust muscle cell approaches, combined with the unique advantages of fibroblast cells (easily accessible, immortalized, cost-effective, ...), provides a novel and practical avenue for cultured meat production. We have stated these points in the Discussion of the revised manuscript.

      Please see lines 511-515.

      “In conclusion, we have effectively utilized immortalized chicken fibroblasts in conjunction with classical myogenic/adipogenic transdifferentiation approaches within 3D hydrogel to establish a cultured meat model. This model allows for the precise regulation of the synthesis of key components found in conventional meat, including muscle, fat, and ECM.”

      (3) The common emphasis in this manuscript is to use the advantages of 3D culture for tissue differentiation. As the authors described, skeletal muscle is a highly aligned tissue. In this study, some results successfully demonstrated advantages in terms of myocyte alignment, maturation, and lipid deposition. However, the current results cannot address whether the entire 3D tissues maintained these advantageous characteristics or not. Because the method for 3D formation does not have any additional modifications to make the cells aligned, like micropatterning, scaffolding, or bioprinting.

      Thank you for the suggestion.

      We agree with the reviewer that skeletal muscle tissue is composed of well-organized, directional bundles of fibers, and that cell alignment greatly affects meat tenderness and sensory properties. Alignment of the cells in the cultured meat matrix is therefore a desirable attribute. However, achieving it would require sophisticated biomaterial engineering, mainly scaffold manipulation, which is beyond the scope of this study. The hydrogel used in this study formed pores of different sizes in random directions, so we expected the embedded cells to be entirely non-directional. Nevertheless, we found localized cell alignment in some parts of the gel matrix, which confirms cell-cell interactions (see Figure 3D). We describe this feature in the Results. In the future, we will test whether applying physical or electrical stimulation to the matrix can align the muscle cells throughout the whole matrix.

      Please see lines 186-190.

      “The separate XY axis views of the orthogonal projections at different depths (Figure 3D) and a multi-angle video (Supplementary Video 2) also showed the several myotubes were aligned together. Nevertheless, many myotubes were oriented in different directions, preventing the entire matrix from aligning in one direction.”

      (4) In the skeletal muscle, fat accumulation mainly occurs in adipocytes between myocytes. This means that "intra-" muscular fat deposition is identified. However, lipid deposition within myocytes also occurred in this preparation (Supplementary Figure 7C). This situation is not "intra-" muscular accumulation, which sounds different from what is going on in normal skeletal muscle tissues. Please explain what happened and what biological situations accounted for this. Also, the authors should clarify better how lipogenesis was induced in the 3D tissues, such as cell types (transdifferentiated myocytes, remained/un-transdifferentiated fibroblasts, or both).

      Thank you for the very insightful question. We have revised the corresponding text to further explain the intramuscular fat distribution in the different cell types in cultured meat.

      We fully agree with the reviewer that intramuscular fat accumulation may occur mainly in intramuscular adipocytes. However, under some pathological and physiological conditions in humans and animals, lipid droplets are also abundant inside myofibers (intramyocellular lipids within the myofiber cytoplasm). For instance, high intramyocellular lipid content is found in insulin-resistant patients and, paradoxically, in endurance-trained athletes (doi.org/10.1016/j.tem.2012.05.009), as well as in some farm animals under intensive selective breeding (doi:10.2174/1876142910901010059). In the current study, using Oil Red O staining of lipid droplets, we identified lipid deposition in both the transdifferentiated myocytes and the remaining un-transdifferentiated fibroblasts in the cultured meat. This lipid distribution pattern is comparable to the intramuscular fat storage pattern observed in some humans and animals, in which fat accumulates both in myofibers (intramyocellular lipids) and in intramuscular adipocytes (extramyocellular lipids), which reside within the muscle tissue bundle but between myofibers. We reason that the current adipogenic induction treatment caused lipogenesis in both the MyoD-transdifferentiated cells and the un-transdifferentiated fibroblasts. It is difficult to compare the absolute amount of lipids between these two cell types by Oil Red O staining, and it is almost impossible to separate them from the 3D meat mimics. Thus, we can only confirm that lipid deposition occurs in both transdifferentiated myocytes and un-transdifferentiated fibroblasts, without knowing which is the dominant contributor to the intramuscular fat content of the cultured meat.

      Please see lines 486-492.

      “In this study, the deposition of fat in the myotubes/myofibers facilitated the storage of significant lipid quantities in transdifferentiated muscle cells, known as intramyocellular lipids. Additionally, we observed Oil Red O staining in the remaining un-transdifferentiated fibroblasts, resembling cells of intramuscular adipocytes (extramyocellular lipids) found within muscle tissue. Hence, current adipogenic induction treatment caused lipogenesis in both the MyoD-transdifferentiated cells and un-transdifferentiated fibroblasts.”

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review): 

      Summary:

      The authors demonstrate that the immunosuppressive environment in pancreatic ductal adenocarcinoma (PDAC) can be mitigated by a combination of ionizing radiation (IR), CCR5 inhibition, and PD1 blockade. This combination therapy increases tissue-resident natural killer (trNK) cells that facilitate CD8 T cell activity, resulting in a reduction of E-cadherin positive tumor cells. They identify a specific "hypofunctional" NK cell population in both mouse and human PDAC that supports CD8 T cell involvement. A trNK signature is found to be associated with better survival outcomes in PDAC and other solid tumors.   

      Strengths: 

      Overall, I think this is an interesting study that combines testing of therapeutic concepts in mice with bioinformatics analysis of single-cell transcriptome data in primary tumors and exploration of clinical outcomes using signature genes in TCGA data. The key finding is that immunoregulatory properties of tumor-infiltrating/resident CD56-bright NK cells (assumed to be non-cytotoxic) are beneficial for outcome through cross-talk with DC and recruitment of CD8 T cells. The latter is specifically induced by irradiation combined with CCR5i and PD1 blockade. 

      "These results collectively support the notion that IR/CCR5i/αPD1 combination treatment alters immune infiltration by reducing Tregs and increasing NK and CD8 T cells, thereby resulting in greater local tumor control." I agree with this conclusion.  

      Weaknesses:  

      There are a few points to discuss and that the authors may want to address. 

      (1)   "Notably, CCR5i significantly reduced Treg infiltration but had no effect on the infiltration of other immune cells, indicating the active recruitment of CCR5+ Tregs in PDAC (Figure 2B)." 

      CCR5i treatment seems to inhibit infiltration of CD8 T cells and NK cells to a greater extent, in relative terms, compared to Treg, albeit it is not statistically significant. If this visual inspection of the graph does not reflect reality, additional experiments may be needed to verify the selective targeting of Tregs or confirm the fact that also CD8 T cells and NK cells are affected by single agent CCR5i. The reduced recruitment of Treg, NK cells, and CD8T cells was completely reversed when combined with irradiation. In the data shown in Figure 3E it seems as if CCR5i induced infiltration of Tregs along with other immune cells. However, this said, I agree with the conclusion of the authors that this combined treatment leads to an altered immune composition and ratio between Tregs and effector cells (CD8T cells and NK cells). Could this altered composition be displayed more clearly? 

      We would like to thank the reviewer for their comments and agree that there is a trend toward reduced NK and T-cell infiltration during CCR5i standalone treatment (as seen in Figure 2B), although it does not reach significance. To reflect this more clearly, we have added n.s. (non-significant) for the NK cells and CD8+ T-cells and adjusted the text to describe a trend toward decreased NK and CD8+ T-cell infiltration (see Lines 162-165). Moreover, to present the data accurately, we have taken the Treg data out of the original Figure 2B and now present it separately as a percentage of CD45+CD3+ T-cells.

      (2) The definition of active and hypofunctional NK cells based on solely NKG2D expression alone seems like an oversimplification. I realize it is not trivial to test tumor-infiltrating NK cells from these tumors functionally but perhaps scRNAseq of the tumors would allow for characterization of cytotoxicity scores using KEGG or GO analysis or reversed gene set enrichment in responders/non-responders.  

      We agree that scRNA-seq of the tumors would add to the overall characterization of the tumor-infiltrating NK cells; unfortunately, we are currently not in a position to carry out this experiment. We did, however, immunophenotype the tumor-infiltrating NK cell population in more depth by also looking at NKp46 and NKG2D surface expression. These newly added data demonstrate not only increased infiltration of “bona-fide” trNK cells (based on surface expression of CD103+CD49a+) under the triple treatment combination but, more importantly, that these trNK cells have reduced levels of CD69, NKp46, and NKG2D and increased TIM-3 surface expression compared to conventional NK cells, suggesting that these trNK cells could be more hypoactive than conventional NK cells. These data have been added to the revised manuscript as Figure 4E, F; Figure supplement 4E-G; and Lines 244-260. To clarify this difference, we have replaced the word “hypofunctional” with “hypoactive” throughout the manuscript.

      (3) It seems as if the abstract refers to this phenotype incorrectly since the "hyporesponsive" subset is described as NKG2C-negative. 

      We apologize for the typographic confusion and have corrected our abstract and changed the subset to NKG2D-negative (as was intended).

      (4) "The NK_C1 cluster correlates best with the hypofunction NK phenotype observed in mice as similarly displayed reduced activation (reduced NKG7, NKp80, GZMA, and PRF1) with additional expression of tissue residency markers CD103, CD49a and, surprisingly, the adaptive activating receptor NKG2C (KLRC2) (Figure 5B, C)." 

      There is no doubt that NK_C1 represents tumor-infiltrating NK cells with a CD56bright gene signature with a strong tissue resident score. However, the transcriptional expression of KLRC2 on these is not surprising! It is well established that KLRC2 transcripts (but not protein) are highly expressed on conventional CD56bright NK cells. There are several published sources where the authors can find such data for confirmation. Thus, this is not to be confused with adaptive NK cells having an entirely different transcriptional signature and expressing high levels of NKG2C at the cell surface. I strongly recommend reinterpreting the results based on the fact that KLRC2 is expressed at high levels in conventional CD56bright NK cells. If not, it would be important to verify that these tissue-resident NK cells express NKG2C and not NKG2A at the cell surface.

      We agree with the reviewer and have modified the text accordingly in the revised manuscript (Lines 279-283), including references to tissue-resident adaptive-like cells as described previously in literature. 

      (5) NCAM1 transcript alone is not sufficient to deconvolute CD56bright NK cells in TCGA data (Figure 7A). As a single marker, it likely reflects NK cell infiltration without providing further evidence on the contribution of the bright/dim components. Therefore, the use of the bright Tr NK signature described in Table 1 is very important (Figure 7B). Table 1 is not provided. Nor Supplementary Table 1. There is only one supplementary figure in the ppt attached.

      We agree that a high NCAM1/CD56 single gene signature could also represent NK cell infiltration. We have rephrased this in the text accordingly (Lines 354-357). We apologize for the missing tables and Supplementary figures. We have added these now to the manuscript as Supplementary table 1.

      Reviewer #2 (Public Review)  

      Summary: 

      This work elaborates on a combined therapeutic approach comprising ionizing radiation and CCR5i/αPD1 immunotherapy as a promising strategy in pancreatic cancer. Previous research has established that NK cell-derived CCL5 and XCL1 play a crucial role in recruiting cDC1 cells to the tumor microenvironment, contributing to tumor control. In this study, by using a murine pancreatic cancer model, the authors propose that the addition of radiation therapy to CCR5i and αPD1 immunotherapy could upregulate CD8+ T cells and a subgroup of NK cells within the tumor and result in better tumor control. They further analyzed human single-cell sequencing data from pancreatic cancer patients and identified one subgroup of NK cells (NK C1) with tissue-resident features. Subsequent cell-cell contact analysis reveals the NK-cDC1-CD8 cell axis in pancreatic cancer. By analyzing TCGA data, they found that high NK C1 signature levels were associated with better survival in pancreatic cancer patients. Thus, radiotherapy could benefit the outcome of patients bearing low NK C1 signatures. Importantly, the positive correlation between NK C1 score with survival extends beyond pancreatic cancer, showing potential applicability across various solid cancers.  

      Strengths: 

      This study could add new insight into the clinical practice by introducing such novel combined therapy and shed light on the underlying immune cell dynamics. These findings hold potential for more effective and targeted treatment in the future. Mouse experiments nicely confirmed that such combined therapy could significantly reduce tumor volume. The elegant use of single-cell sequencing analysis and human database examination enriches the narrative and strengthens the study's foundation. Additionally, the notion that NK C1 signature correlates with patient survival in various solid cancers is of high interest and relevance.  

      Weaknesses: 

      The role of CCR5i requires further clarification. While the authors demonstrated its capacity to reduce Treg in murine tumors, its impact on other cell populations, including NK cells and CD8+ T cells, was not observed. Nevertheless, the effect of CCR5i on tumor growth in Figure 2B should be shown. If the combination of radiotherapy and αPD1 already can achieve good outcomes as shown in Figure 3A, the necessity to include CCR5i is questioned. Overall, a more comprehensive elucidation of the roles of CCL5 and CCR5i in this context would be good.  

      We would like to thank the reviewer for their comments and agree that standalone CCR5i also shows a trend toward reduced infiltration of NK cells and CD8+ T-cells, although this does not reach significance. We have mentioned this trend in the manuscript (see Lines 162-165) and added n.s. to Figure 2B as well. With regard to adding CCR5i: although we observe volumetric control by radiotherapy and anti-PD1, we observe an increase in necrosis induction only in the triple combination compared to radiotherapy combined with anti-PD1, suggesting that CCR5i has an additive effect in our model only as part of a combination modality. We therefore believe that adding CCR5i to radiotherapy and anti-PD1 has a beneficial effect. The growth curves for CCR5i alone were already presented in Figure 3A, and we have modified our manuscript to refer to this (see Lines 165-167).

      (1) In line with this, spatial plots in Figure 4 did not include the group with only radiotherapy and αPD1. This inclusion would facilitate a clearer comparison and better highlight the essential role of CCR5i. 

      We agree with the reviewer that including radiotherapy and αPD1 would facilitate a clear comparison of our data. Our experiments did include single controls for radiotherapy and αPD1; unfortunately, the tissue slides were of poor quality and therefore not suitable for quantification. In line with this, we have added references to other studies that investigated the effect of immune checkpoint inhibitors in combination with radiotherapy (see Lines 169-172).

      (2) NK C1 cells should be also analyzed in the mouse model. The authors suggest that NKG2D-ve NK cells could be the cell population. Staining of inhibitory markers should be considered, for example, TIGIT and TIM3 as presented in Figure 5B.

      As per the reviewer's suggestion, we have now included additional data on the surface expression of inhibitory markers and activating receptors on tumor-infiltrating NK cells in our model under the triple combination. These additional data demonstrate increased infiltration of trNK cells under the triple combination, and these cells seem to be more “hypoactive” than conventional NK cells. These data have been added as Figure 4E in the revised figures.

      (3) While the cell-cell contact analysis generated from single-cell sequencing data is insightful, extending this analysis to the mouse model under therapy would be highly informative. NK and CD8 cells in the tumor increased upon the combined therapy. However, cDC1 was not characterized. Analysis regarding cDC1 would provide more information on the NK/cDC1/CD8 axis. 

      We agree that looking into cDC1 would be highly interesting in our treatment model, and its characterization is currently under investigation. The importance of the interaction between cDC1 and NK cells has been described before by various groups, and we have provided additional references for this in our manuscript (see Lines 449-455).

      (4) Human database analysis showed a positive correlation between NK C1 score and CCL5 in pancreatic cancer. Furthermore, radiotherapy could benefit the outcome of patients bearing low NK C1 scores. It would be interesting to test if radiotherapy could also benefit patients with low CCL5 levels in this cohort. 

      We would like to thank the reviewer for their suggestion; please see the figure below for the comparison. Patients with CCL5high are enriched for NK_C1 (Figure 7D), and CCL5high patients with NK_C1high have significantly increased overall and disease-free survival compared to NK_C1low (Figure 7E); those with NK_C1low benefit significantly from radiotherapy (Figure 7B). Accordingly, patients with CCL5high have significantly decreased overall survival compared to CCL5low patients, again confirming CCL5 as a prognostic marker (Figure 1A, Figure R1). When we look at CCL5low patients, however, radiotherapy provides no additional significant benefit (see the image below; only significant p-values are shown). These data collectively support the strong correlation between CCL5 levels and NK_C1 enrichment, and imply that radiotherapy alone is insufficient to drive NK_C1 cells to improve overall survival in the absence of high CCL5 gradients. However, given the increased overall survival of CCL5low compared to CCL5high patients, it is likely that other factors are at play. Future studies will be required to further elucidate the role of CCL5 gradients on NK_C1 cells and the beneficial effect of radiotherapy.

      Author response image 1.

      Overall survival of CCL5high versus CCL5low patients stratified into groups with and without radiotherapy using TCGA-PAAD. Log-rank p-value indicates the significance level across all groups while individual significant comparisons are shown as indicated.

      Reviewer #3 (Public Review):

      Summary

      In the submitted manuscript by Go et al, the authors evaluated the tumor microenvironment in pancreatic ductal adenocarcinoma (PDAC) and made a number of interesting observations, including the following: 1) CCL5 expression within the tumor microenvironment negatively correlated with clinical outcomes in human patients with PDAC; 2) there were both positive and negative correlations between CCL5 expression and the expression of specific genes (e.g. those encoding CD56 and CD16, respectively) included among gene signature lists for Treg, MDSC, TAM, and NK cells; 3) CCR5 inhibition with the inhibitor, maraviroc, reduced Treg infiltration but not that of other immune cell types in an orthotopic murine model of PDAC; 4) CCR5 inhibition augmented anti-PD1 immunotherapy when combined with ionizing radiation (IR) therapy in the murine model; 5) the above therapy resulted in increased infiltration of CD8+ cytotoxic T cells as well as of a subset of NKG2D-negative, tissue-residency (tr) marker expressing NK cells (deemed Cluster 1 NK in their data sets) that inversely correlated with the number of E-cadherin+ cells (i.e. tumor cells) and showed predicted interactions with cDC1 dendritic cells (including XCL1/XCL2 expressed by the NK and XCR1 expressed by the cDC1); 6) the authors identified a number of putative signals stemming from the trNK (e.g. IL-16, TNFSF14, FASLG, CSF, MIF) as well as incoming from cDC1s to NK (e.g. BAG6-NKp30); 7) these trNK cells positively correlated with good outcomes and with CD8+ T cell infiltrations in human PDAC as well as in many other solid tumor types; and 8) importantly, the benefit of IR therapy was specific to the subset of PDAC patients (represented in the TCGA dataset) that were predicted to have low amounts of trNK cells. The authors used murine experimental models, multiplexed imaging analyses, and a number of publicly available sequencing data sets from human tumor samples to perform their investigations. Based on their findings, the authors proposed that combining IR with CCR5 inhibition and anti-PD1 immunotherapy is a promising strategy to treat solid cancers.

      Strengths

      Overall, the collective analyses and conclusions appear to be novel and could be of high and rapid impact on the field, particularly in terms of directing clinical trials to incorporate IR with CCR5 inhibition and immunotherapy. The manuscript is well written; the figures are for the most part clear; and the Discussion is very thoughtful.   

      Weaknesses

      There were a number of minor typographical errors, missing references, or minor issues with the figures. In general, while many of the observations provided strong suggestive evidence of relationships, phenotypes, and functions, the authors often used language to indicate that such things were confirmed, validated, or proven. In fact, there was a paucity of such functional/confirmatory experiments. This does not necessarily detract from the overall significance, excitement for, and potential impact of the study; but the language could likely be adjusted to be more in keeping with the true nature of the findings. The main title and running title are a bit different; consider making them more similar.

      We apologize for the typographical errors, missing references, and issues with the figures. We have revised our manuscript, with a major focus on adjusting our language to reflect our data more carefully, and hope to have addressed all of the reviewer's concerns. The slight discrepancy between the main title and the running title is intended to convey the contents of this manuscript in a comprehensive way.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):  

      Please make sure all files are made available. Also please check available datasets describing KLRC2 transcripts in CD56brights. This is not to be confused with an adaptive-like signature. 

      We have added the missing table to the supplementary figures and revised the manuscript text with regard to the KLRC2 transcript in our NK_C1 cluster and its implications for an adaptive-like signature in the context of tissue residency (see Lines 279-283; 465-474).

      Reviewer #2 (Recommendations For The Authors): 

      Additional experiments as mentioned in the 'weakness' section could help to further strengthen this study. Besides these points, I would recommend the following: 

      (1) The description in the figure should be more precise and clear. Especially in Figure 3A, it seems the addition of IR into CCR5i or CCR5i/aPD1 leads to a bigger tumor volume.  

      We have adjusted the figure descriptions to describe the figures more clearly. We apologise for the confusion in Figure 3A; this was a figure legend error and has been rectified in the revised figures (i.e. closed symbols represent +IR conditions).

      (2) The definition of Tregs in figures should be described, e.g. it is not specified which population is shown in Figure S2c.  

      We have added a definition of Tregs (i.e. Live/CD45+CD3+CD4+FOXP3+) in our revised manuscript (see Lines 162-165). To avoid confusion, we have removed the subsequent gating of CCR5 and PD-1 of Tregs in our revised Supplementary Figures.

      (3) Please add a bar in all histology figures, for example, Figure 2A, S2A, S3E. It seems in Figure S3D, E, the green group is missing.  

      We have added scale bars to all the indicated figures. As the reviewer correctly points out, the green group (i.e. IR+CCR5i) is missing. We felt that the excessive growth seen with CCR5i alone may have given a false impression of the extent of infiltration; we therefore did not include this group in the original analysis and do not have the data for the figure.

      (4) Please check through the manuscript, there are some grammar mistakes.  

      We apologise for the grammar mistakes in our original manuscript and have carefully revised the current manuscript to correct them.

      (5) Figure S7B, the left cell lacks a name.  

      We have annotated the left cell accordingly in our revised supplementary figure.

      Reviewer #3 (Recommendations For The Authors): 

      (1) Abbreviations (e.g. PDAC) should be spelled out the first time introduced in the manuscript.

      We have adjusted this in our revised manuscript.

      (2) Referring to the tissue-resident NK cells as "hypofunctional" may not be useful...they seem to be functional, just not in the conventional sense. The authors may want to consider another term, such as non-cytotoxic (given the low expression of cytolytic granules, etc) or immunoregulatory (as they actually refer to them on line 310).

      We agree with the reviewer and have revised the manuscript to refer to them as “immunoregulatory” or “hypoactive” when appropriate. The latter is supported by the additional experiments as shown in Figure 4E.

      (3) Barry et al 2018 Nat Med demonstrated that NK cells in melanoma could support cDC1s and promote positive clinical outcomes in the setting of immunotherapy. It would likely be beneficial to also cite this paper (e.g. on line 425). 

      Thank you for the suggestion, which is in line with our hypothesis of crosstalk between NK_C1 and cDC1. We looked for FLT3L in our NK_C1 cluster and did not find any enrichment for the FLT3L transcript (see Figure 5E). Nevertheless, we have added the reference to the Discussion of our manuscript to further support the importance of crosstalk between cDC1 and NK cells (see Lines 449-455).

      (4) Figure 2B: by eye, it looks like the difference between CD8+ T cells in the two conditions would be significantly different; is this not the case? Same thing for the NK cells...what are the p-values?

      We have added n.s. to our revised Figure 2B. The p-values for CD8+ T-cells and NK cells were 0.14 and 0.19 (two-tailed Student's t-test), respectively.

      (5) The murine data strongly suggest that the combination therapy promotes trNK cell infiltration into the tumors, in turn resulting in cDC1-mediated CD8+ T cell infiltration and/or activation. It could be highly valuable/useful to functionally determine (e.g. by depleting NK cells in this model) if NK cells are required for the effects seen. 

      We agree that depletion of NK cells could further solidify the findings, and it is part of ongoing investigations for future projects. However, it would be imperative to first characterise these NK cells in more depth, as conventional global ablation of NK cells is expected to strongly impact immunosurveillance as well. This is part of current ongoing work.

      (6) Figure 7B: how were "high" and "low" defined (for the NK signature)?

      An enrichment score for the NK_C1 gene signature (see Table supplement 1) was first calculated per patient sample in the TCGA RNA-seq dataset using the Gene Set Variation Analysis (GSVA) method. A cut-off value was then determined using the maximally selected rank statistics method (maxstat R package) to divide patients into “high” and “low” groups.
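      The cut-off step described above can be illustrated with a minimal sketch. It assumes the per-patient enrichment scores have already been computed (GSVA itself is not reproduced here) and uses a simplified stand-in for maximally selected rank statistics: it scans candidate cut-offs and keeps the one with the largest absolute two-group log-rank Z statistic. The function names, the `min_frac` parameter, and the toy data are hypothetical; this is not the maxstat package's implementation, which additionally provides a p-value adjusted for the multiple cut-offs tested.

```python
import numpy as np

def logrank_z(time, event, in_high):
    """Two-group log-rank Z statistic (negative when the 'high' group has fewer events than expected)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    in_high = np.asarray(in_high, dtype=bool)
    observed_minus_expected = 0.0
    variance = 0.0
    for t in np.unique(time[event]):          # loop over distinct event times
        at_risk = time >= t
        n = at_risk.sum()                     # patients at risk just before t
        n_high = (at_risk & in_high).sum()
        died = event & (time == t)
        d = died.sum()                        # events at time t
        d_high = (died & in_high).sum()
        observed_minus_expected += d_high - d * n_high / n
        if n > 1:                             # hypergeometric variance term
            p = n_high / n
            variance += d * p * (1 - p) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0
    return observed_minus_expected / np.sqrt(variance)

def best_cutoff(scores, time, event, min_frac=0.1):
    """Return (cutoff, |Z|) maximizing |Z|; patients with score > cutoff are 'high'."""
    scores = np.asarray(scores, dtype=float)
    # Assumption: restrict candidates to the central quantile range so that
    # neither group becomes vanishingly small.
    lo, hi = np.quantile(scores, [min_frac, 1 - min_frac])
    best = None
    for c in np.unique(scores[(scores >= lo) & (scores <= hi)]):
        in_high = scores > c
        if in_high.all() or not in_high.any():
            continue                          # skip splits with an empty group
        z = abs(logrank_z(time, event, in_high))
        if best is None or z > best[1]:
            best = (float(c), float(z))
    return best

# Toy data (hypothetical): higher scores go with longer survival.
scores = [-3, -2, -1, 1, 2, 3]
time = [1, 2, 3, 10, 11, 12]
event = [1, 1, 1, 1, 1, 1]
cutoff, z = best_cutoff(scores, time, event)
```

      Because the cut-off is chosen to maximize the statistic, an unadjusted log-rank p-value at that cut-off is optimistic; this is the reason the maximally selected rank statistics framework exists, and the sketch omits that correction.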

      (7) Lines 164-165 of the Results: it would be good to include a reference supporting the statement.

      We have rephrased the manuscript and added the corresponding references (see Lines 170-173 in the revised manuscript).

      (8) There are many conclusions and very speculative language based only on sequencing results, and these have not been validated (e.g. in the Discussion, lines 447-453). As another example, it was concluded that a decrease in NKG2D+ NK cells implied a reduction in overall NK cell cytolytic activity and that NKG2D- NK cells were hypofunctional and did not kill well. This was not tested. Generally, it would be useful for the authors to use language that conveys that the data are primarily suggestive (rather than "confirmatory", line 447) of relationships, phenotypes, and functions at this point. 

      We thank the reviewer for raising these concerns and have carefully adapted the manuscript text so that the findings are presented as suggestive rather than confirmatory.

      (9) On lines 246-247 the authors refer to cluster 3 NK cells, which express CD16, as "immature". The rationale for this designation is not provided, and most human NK cell development models hold that CD16+ NK cells represent the most mature subset(s). 

      We apologize for the typographic error – later on we refer to the NK_C3 cluster as cytotoxic NK cells and we have corrected this in our revised manuscript (see Lines 273-275).

      (10) On line 351, the authors reference supplemental Figure 7C...but I don't see this figure in the accompanying powerpoint file. 

      This should have been Supplementary Figure 7B, and we have corrected it in the revised manuscript (see Lines 374-377).

      (11) On line 417, the authors reference NKp40; this is likely a typographical error. 

      This has been corrected in the revised manuscript to NKp46 (see Lines 439-442).

    1. Author Response

      The following is the authors’ response to the current reviews.

      Overall Response

      We thank the reviewers for reviewing our manuscript, recognizing the significance of our study, and offering valuable suggestions. Based on the reviewers’ comments and the updated eLife assessment, we would like to choose the current version of our manuscript as the Version of Record.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Given knowledge of the amino acid sequence and of some version of the 3D structure of two monomers that are expected to form a complex, the authors investigate whether it is possible to accurately predict which residues will be in contact in the 3D structure of the expected complex. To this effect, they train a deep learning model which takes as inputs the geometric structures of the individual monomers, per-residue features (PSSMs) extracted from MSAs for each monomer, and rich representations of the amino acid sequences computed with the pre-trained protein language models ESM-1b, MSA Transformer, and ESM-IF. Predicting inter-protein contacts in complexes is an important problem. Multimer variants of AlphaFold, such as AlphaFold-Multimer, are the current state of the art for full protein complex structure prediction, and if the three-dimensional structure of a complex can be accurately predicted then the inter-protein contacts can also be accurately determined. By contrast, the method presented here seeks state-of-the-art performance among models that have been trained end-to-end for inter-protein contact prediction.

      Strengths:

      The paper is carefully written and the method is very well detailed. The model works both for homodimers and heterodimers. The ablation studies convincingly demonstrate that the chosen model architecture is appropriate for the task. Various comparisons suggest that PLMGraph-Inter performs substantially better, given the same input, than DeepHomo, GLINTER, CDPred, DeepHomo2, and DRN-1D2D_Inter.

      The authors control for some degree of redundancy between their training and test sets, both using sequence and structural similarity criteria. This is more careful than can be said of most works in the field of PPI prediction.

      As a byproduct of the analysis, a potentially useful heuristic criterion for acceptable contact prediction quality is found by the authors: namely, to have at least 50% precision in the prediction of the top 50 contacts.

      We thank the reviewer for recognizing the strengths of our work!
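
      The heuristic noted by the reviewer (at least 50% precision among the top 50 predicted contacts) can be sketched as below. This is only a minimal illustration: the function names, the nested-list representation of the contact map, and the tie-breaking rule are assumptions made for the example and are not taken from the PLMGraph-Inter code.

```python
def top_k_precision(pred_probs, true_contacts, k=50):
    """Fraction of the k highest-scoring residue pairs that are true contacts.

    pred_probs:    L1 x L2 nested list of predicted contact probabilities.
    true_contacts: L1 x L2 nested list of 0/1 labels (1 = native contact).
    """
    scored = [
        (pred_probs[i][j], i, j)
        for i in range(len(pred_probs))
        for j in range(len(pred_probs[i]))
    ]
    # Highest score first; ties are broken by index so the result is deterministic.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    top = scored[:k]
    if not top:
        return 0.0
    hits = sum(true_contacts[i][j] for _, i, j in top)
    return hits / len(top)


def meets_heuristic(pred_probs, true_contacts, k=50, threshold=0.5):
    """True if the top-k precision reaches the threshold (50% by default)."""
    return top_k_precision(pred_probs, true_contacts, k) >= threshold
```

      Note that evaluating this criterion requires the native contact map, so it describes prediction quality in benchmarking rather than a confidence estimate available at prediction time.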

      Weaknesses:

      The authors check for performance drops when the test set is restricted to pairs of interacting proteins such that the chain pair is not similar as a pair (in sequence or structure) to a pair present in the training set. A more challenging test would be to restrict the test set to pairs of interacting proteins such that none of the chains are separately similar to monomers present in the training set. In the case of structural similarity (TM-scores), this would amount to replacing the two "min"s with "max"s in Eq. (4). In the case of sequence similarity, one would simply require that no monomer in the test set is in any MMSeqs2 cluster observed in the training set. This may be an important check to make, because a protein may interact with several partners, and/or may use the same sites for several distinct interactions, contributing to residual data leakage in the test set.

      We thank the reviewer for the suggestion! In the case of protein-protein interaction prediction (“0D prediction”) or protein-protein interfacial residue prediction (“1D prediction”), we think it is necessary that none of the chains in the test set be separately similar to monomers in the training set, since, as the reviewer pointed out, a protein may interact with several partners and may even use the same sites for these interactions. However, the task of this study is predicting the inter-protein residue-residue contacts (“2D prediction”). Even if a protein uses the same site to interact with different partners, as long as the interacting partners are different, the inter-protein contact maps would be different. Therefore, we don’t think this restriction on the test set is necessary for our task.

      The training set of AFM with v2 weights has a global cutoff of 30 April 2018, while that of PLMGraph-Inter has a cutoff of March 7 2022. So there may be structures in the test set for PLMGraph-Inter that are not in the training set of AFM with v2 weights (released between May 2018 and March 2022). The "Benchmark 2" dataset from the AFM paper may have a few additional structures not in the training or test set for PLMGraph-Inter. I realize there may be only few structures that are in neither training set, but still think that showing the comparison between PLMGraph-Inter and AFM there would be important, even if no statistically significant conclusions can be drawn.

      We thank the reviewer for the suggestion! A date cutoff alone is not enough to remove the redundancy, since similar structures can be deposited in the PDB on different dates. Because AFM does not release the PDB codes of its training set, it is difficult for us to remove the redundancy completely. Therefore, we think no rigorous conclusion can be drawn by including these comparisons in the manuscript. Besides, the main point of this study is to demonstrate that the integration of multiple protein language models using protein geometric graphs can dramatically improve the model performance for inter-protein contact prediction, which can provide important insights for the future development of more powerful protein complex structure prediction methods beyond AFM, rather than to provide a tool which can beat AFM at this moment. We think including too much material in the comparison with AFM may distract the readers. Therefore, we chose not to include these comparisons in the manuscript.

      Finally, the inclusion of AFM confidence scores is very good. A user would likely trust AFM predictions when the confidence score is high, but look for alternative predictions when it is low. The authors' analysis (Figure 6, panels c and d) seems to suggest that, in the case of heterodimers, when AFM has low confidence, PLMGraph-Inter improves precision by (only) about 3% on average. By comparison, the reported gains in the "DockQ-failed" and "precision-failed" bins are based on knowledge of the ground truth final structure, and thus are not actionable in a real use-case.

      We agree with the reviewer that more studies are needed to provide a model which can complement or even beat AFM. The main point of this study is to demonstrate that the integration of multiple protein language models using protein geometric graphs can dramatically improve the model performance for inter-protein contact prediction, which can provide important insights for the future development of more powerful protein complex structure prediction methods beyond AFM.

      Reviewer #2 (Public Review):

      This work introduces PLMGraph-Inter, a new deep learning approach for predicting inter-protein contacts, which is crucial for understanding protein-protein interactions. Despite advancements in this field, especially driven by AlphaFold, prediction accuracy and efficiency in terms of computational cost still remain an area for improvement. PLMGraph-Inter utilizes invariant geometric graphs to integrate the features from multiple protein language models into the structural information of each subunit. When compared against other inter-protein contact prediction methods, PLMGraph-Inter shows better performance which indicates that utilizing both sequence embeddings and structural embeddings is important to achieve high-accuracy predictions with relatively smaller computational costs for the model training.

      We thank the reviewer for recognizing the strengths of our work!

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      • I recommend renaming the section "Further potential redundancies removal between the training and the test" to "Further potential redundancies removal between the training and the test sets"

      Changed.

      • In lines 768-769, the sentence seems to end prematurely in "to use more stringent threshold in the redundancy removal"

      Corrected.

      • In Eq. (4), line 789, there are many instances of dashes that look like minus signs, creating some confusion.

      Corrected.

      • I think I may have mixed up figure references in my first review. When I said (Recommendations to the authors): "p. 22, line 2: from the figure, I would have guessed "greater than or equal to 0.7", not 0.8", I think I was referring to what is now lines 423-424, referring to what is now Figure 5c. The point stands there, I think.

      Corrected.

      • A couple of new grammatical mishaps have been introduced in the revision. These could be rectified.

      We carefully rechecked our revisions, and corrected the grammatical issues we found.

      Reviewer #2 (Recommendations For The Authors):

      Most of my concerns were resolved through the revision. I have only one suggestion for the main figure.

      The current scatter plots in Figure 2 are hard to understand as too many different methods are abstracted into a single plot with multiple colors. I would suggest comparing their performances using box plot or violin plot for the figure 2.

      We thank the reviewer for the suggestion! In the revision, we tried a violin plot, but it does not look good because too many different methods are included in the plot. Besides, we chose the scatter plot because it provides much more detail. We also provided the individual head-to-head scatter plots as supplementary figures, which we think can help readers interpret the figure.


      The following is the authors’ response to the original reviews.

      Overall Response

      We would like to thank the reviewers for reviewing our manuscript, recognizing the significance of our study, and offering valuable suggestions. We have carefully revised the manuscript to address all the concerns and suggestions raised by the reviewers.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Given knowledge of the amino acid sequence and of some version of the 3D structure of two monomers that are expected to form a complex, the authors investigate whether it is possible to accurately predict which residues will be in contact in the 3D structure of the expected complex. To this effect, they train a deep learning model that takes as inputs the geometric structures of the individual monomers, per-residue features (PSSMs) extracted from MSAs for each monomer, and rich representations of the amino acid sequences computed with the pre-trained protein language models ESM-1b, MSA Transformer, and ESM-IF. Predicting inter-protein contacts in complexes is an important problem. Multimer variants of AlphaFold, such as AlphaFold-Multimer, are the current state of the art for full protein complex structure prediction, and if the three-dimensional structure of a complex can be accurately predicted then the inter-protein contacts can also be accurately determined. By contrast, the method presented here seeks state-of-the-art performance among models that have been trained end-to-end for inter-protein contact prediction.

      Strengths:

      The paper is carefully written and the method is very well detailed. The model works both for homodimers and heterodimers. The ablation studies convincingly demonstrate that the chosen model architecture is appropriate for the task. Various comparisons suggest that PLMGraph-Inter performs substantially better, given the same input than DeepHomo, GLINTER, CDPred, DeepHomo2, and DRN-1D2D_Inter. As a byproduct of the analysis, a potentially useful heuristic criterion for acceptable contact prediction quality is found by the authors: namely, to have at least 50% precision in the prediction of the top 50 contacts.

      We thank the reviewer for recognizing the strengths of our work!

      Weaknesses:

      My biggest issue with this work is the evaluations made using bound monomer structures as inputs, coming from the very complexes to be predicted. Conformational changes in protein-protein association are the key element of the binding mechanism and are challenging to predict. While the GLINTER paper (Xie & Xu, 2022) is guilty of the same sin, the authors of CDPred (Guo et al., 2022) correctly only report test results obtained using predicted unbound tertiary structures as inputs to their model. Test results using experimental monomer structures in bound states can hide important limitations in the model, and thus say very little about the realistic use cases in which only the unbound structures (experimental or predicted) are available. I therefore strongly suggest reducing the importance given to the results obtained using bound structures and emphasizing instead those obtained using predicted monomer structures as inputs.

      We thank the reviewer for the suggestion! In the revision, to emphasize the performance of PLMGraph-Inter using the predicted monomer structures, we moved the evaluation results based on the predicted monomer from the supplementary to the main text (see the new Table 1 and Figure 2 in the revised manuscript) and re-organized the two subsections “Evaluation of PLMGraph-Inter on HomoPDB and HeteroPDB test sets” and “Impact of the monomeric structure quality on contact prediction” in the main text.

      In particular, the most relevant comparison with AlphaFold-Multimer (AFM) is given in Figure S2, not Figure 6. Unfortunately, it substantially shrinks the proportion of structures for which AFM fails while PLMGraph-Inter performs decently. Still, it would be interesting to investigate why this occurs. One possibility would be that the predicted monomer structures are of bad quality there, and PLMGraph-Inter may be able to rely on a signal from its language model features instead. Finally, AFM multimer confidence values ("iptm + ptm") should be provided, especially in the cases in which AFM struggles.

      We thank the reviewer for the suggestion! It is worth noting that AFM automatically searches monomer templates in the prediction, and when we checked our AFM runs, we found that for 99% of the targets in our study (including all the targets in the four datasets: HomoPDB, HeteroPDB, DHTest and DB5.5) at least 20 templates were identified (AFM employed the top 20 templates in the prediction), and 87.8% of the targets employed the native templates (line 455-462 in page 25 in the subsection of “Comparison of PLMGraph-Inter with AlphaFold-Multimer”). Therefore, we think Figure 6, not Figure S5 (the original Figure S2), shows a fairer comparison. Besides, it is also worth noting that the targets used in this study would have a large overlap with the training set of AlphaFold-Multimer, since AFM used all protein complex structures in PDB deposited before 2018-04-30 in the model training, which would further cause an overestimation of the performance of AFM (line 450-455 in page 24-25 in the subsection of “Comparison of PLMGraph-Inter with AlphaFold-Multimer”).

      To mimic the performance of AlphaFold2 in real practice and produce predicted monomeric structures with more diverse qualities, we only used the MSA searched from Uniref100 protein sequence database as the input to AlphaFold2 and set to not use the template (line 203~210 in page 12 in the subsection of “Evaluation of PLMGraph-Inter on HomoPDB and HeteroPDB test sets”). Since some of the predicted monomer structures are of bad quality, it is reasonable that the performance of PLMGraph-Inter drops when the predicted monomeric structures are used in the prediction. We provided a detailed analysis of the impact of the monomeric structure quality on the prediction performance in the subsection “Impact of the monomeric structure quality on contact prediction” in the main text.

      We provided the analysis of the AFM multimer confidence values (“iptm + ptm”) in the revision (Figure 6, Figure S5 and line 495-501 in page 27 in the subsection of “Comparison of PLMGraph-Inter with AlphaFold-Multimer”).

      Besides, in cases where any experimental structures - bound or unbound - are available and given to PLMGraph-Inter as inputs, they should also be provided to AlphaFold-Multimer (AFM) as templates. Withholding these from AFM only makes the comparison artificially unfair. Hence, a new test should be run using AFM templates, and a new version of Figure 6 should be produced. Additionally, AFM's mean precision, at least for top-50 contact prediction, should be reported so it can be compared with PLMGraph-Inter's.

      We thank the reviewers for the suggestion, and we are sorry for the confusion! In the AFM runs to predict protein complex structures, we used the default setting of AFM which automatically searches monomer templates in the prediction. When we checked our AFM runs, we found that 99% of the targets in our study (including all the targets in the four datasets: HomoPDB, HeteroPDB, DHTest and DB5.5) employed at least 20 templates in their predictions (AFM only used the top 20 templates), and 87.8% of the targets employed the native template. We further clarified this in the revision (line 455-462 in page 25 in the subsection of “Comparison of PLMGraph-Inter with AlphaFold-Multimer”). We also included the mean precisions of AFM (top-50 contact prediction) in the revision (Table S5 and line 483-484 in page 26 in the subsection of “Comparison of PLMGraph-Inter with AlphaFold-Multimer”).

      It's a shame that many of the structures used in the comparison with AFM are actually in the AFM v2 training set. If there are any outside the AFM v2 training set and, ideally, not sequence- or structure-homologous to anything in the AFM v2 training set, they should be discussed and reported on separately. In addition, why not test on structures from the "Benchmark 2" or "Recent-PDB-Multimers" datasets used in the AFM paper?

      We thank the reviewer for the suggestion! The biggest challenge in objectively evaluating AFM is that, as far as we know, AFM does not release the PDB ids of its training set and the “Recent-PDB-Multimers” dataset. “Benchmark 2” only includes 17 heterodimer proteins, and the number would be further decreased after removing targets redundant to our training set. We think it is difficult to draw conclusions from such a small number of targets.

      It is also worth noting that the AFM v2 weights have now been outdated for a while, and better v3 weights now exist, with a training cutoff of 2021-09-30.

      Author response image 1.

      The head-to-head comparison of qualities of complex predicted by AlphaFold-Multimer (2.2.0) and AlphaFold-Multimer (2.3.2) for each target PPI.

      We thank the reviewer for reminding us of the new version of AFM. The only difference between AFM V3 and V2 is the cutoff date of the training set. During the revision, we also tested the new version of AFM on the datasets of HomoPDB and HeteroPDB, but we found the performance difference between the two versions of AFM is actually very small (see the figure above, not shown in the main text). One reason might be that some targets in HomoPDB and HeteroPDB are redundant with the training sets of the two versions of AFM. Since our test sets would have more overlap with the training set of AFM V3, we keep using the AFM V2 weights in this study.

      Another weakness in the evaluation framework: because PLMGraph-Inter uses structural inputs, it is not sufficient to make its test set non-redundant in sequence to its training set. It must also be non-redundant in structure. The Benchmark 2 dataset mentioned above is an example of a test set constructed by removing structures with homologous templates in the AF2 training set. Something similar should be done here.

      We thank the reviewer for the suggestion! In the revision, we explored the performance of PLMGraph-Inter when using different thresholds of fold similarity scores of interacting monomers to further remove potential redundancies between the training and test sets (i.e. redundancy in structure) (line 353-386 in page 19-21 in the subsection “Ablation study”; line 762-797 in page 41-43 in the subsection “Further potential redundancies removal between the training and the test”). We found that for heteromeric PPIs (targets in HeteroPDB), the further removal of potential redundancy in structure has little impact on the model performance (~3%, when TM-score 0.5 is used as the threshold). However, for homomeric PPIs (targets in HomoPDB), the further removal of potential redundancy in structure significantly reduces the model performance (~18%, when TM-score 0.5 is used as the threshold) (see Table 2). One possible reason for this phenomenon is that the binding mode of a homomeric PPI is largely determined by the fold of its monomer, thus the model does not generalize well on targets whose folds have never been seen during the training.

      Whether the deep learning model can generalize well on targets with novel folds is a very interesting and important question. We thank the reviewer for pointing this out! However, to the best of our knowledge, this question has rarely been addressed by previous studies including AFM. For example, the Benchmark 2 dataset is prepared by ClusPro TBM (bioRxiv 2021.09.07.459290; Proteins 2020, 88:1082-1090), which uses a sequence-based approach (HHsearch) rather than a structure-based one to identify templates. Therefore, we don’t think this dataset is non-redundant in structure.

      Finally, the performance of DRN-1D2D for top-50 precision reported in Table 1 suggests to me that, in an ablation study, language model features alone would yield better performance than geometric features alone. So, I am puzzled why model "a" in the ablation is a "geometry-only" model and not a "LM-only" one.

      Using the protein geometric graph to integrate multiple protein language models is the main idea of PLMGraph-Inter. Compared with our previous work (DRN-1D2D_Inter), we consider the building of the geometric graph as one major contribution of this work. To emphasize the efficacy of this geometric graph, we chose to use the “geometry-only” model as the base model.

      Reviewer #1 (Recommendations For The Authors):

      Some sections of the paper use technical terminology which limits accessibility to a broad audience. An obvious example is in the section "Results > Overview of PLMGraph-Inter > The residual network module": the average eLife reader is not a machine learning expert and might not be familiar with a "convolution with kernel size of 1 * 1". In general, the "Overview of PLMGraph-Inter" is a bit heavy with technical details, and I suggest moving many of these to Methods. This overview section can still be there but it should be shorter and written using less technical language.

      We thank the reviewer for the suggestion! We moved some technical details to the Methods section in the revision (line 184-185 in page 11; line 729-735 in page 39).

      List of typos and minor issues (page number according to merged PDF):

      • p. 3. line -3: remove "to"

      Corrected (line 36, page 3)

      • p. 5, line 7: "GINTER" should be "GLINTER"

      Corrected (line 64, page 5)

      • p. 6, line -4: "Given structures" -> "Given the structures"

      Corrected (line 95, page 6)

      • p. 6, line -2: "with which encoded"... ?

      We rephrased this sentence in revision. (line 97, page 6)

      • p. 9, line 1: "principal" -> "principle"

      Corrected (line 142, page 9)

      • p. 13, line 1: "has" -> "but have"

      Corrected (line 231, page 13)

      • p. 14, lines 6-7: "As can be seen from the figure that the predicted" -> "As can be seen from the figure, the predicted"

      We rephrased this paragraph, and the sentence was deleted in the revision (line 257-259 in page 15).

      • p. 18, line 1: the "five models" are presumably models a-e? If so, say "of models a-e"

      Corrected (line 310, page 17)

      • p. 22, line 2: from the figure, I would have guessed "greater than or equal to 0.7", not 0.8

      Based on Figure 3C, we think 0.8 is a more appropriate cutoff, since the precision drops significantly when the DTM-score is within 0.7~0.8.

      • p. 23, lines 2-3: "worth to making" -> "worth making"

      Corrected (line 443, page 24)

      • p. 24, line -5: "predict" -> "predicted"

      Corrected (line 484, page 26)

      • p 28, line -5: Please clarify what you mean by "We doubt": are you saying that you don't think these rearrangements exist in nature? If not, then reword.

      Corrected (line 566, page 30)

      • Figure 2, panel c, "DCPred" in the legend should be "CDPred"

      Corrected

      • Figures 3 and 5: Please improve the y-axis title in panel C. "Percent" of what?

      We changed the “Percent” to “% of targets” in the revision.

      We thank the reviewer for carefully reading our manuscript!

      Reviewer #2 (Public Review):

      This work introduces PLMGraph-Inter, a new deep-learning approach for predicting inter-protein contacts, which is crucial for understanding protein-protein interactions. Despite advancements in this field, especially driven by AlphaFold, prediction accuracy and efficiency (in terms of computational cost) still remain an area for improvement. PLMGraph-Inter utilizes invariant geometric graphs to integrate the features from multiple protein language models into the structural information of each subunit. When compared against other inter-protein contact prediction methods, PLMGraph-Inter shows better performance which indicates that utilizing both sequence embeddings and structural embeddings is important to achieve high-accuracy predictions with relatively smaller computational costs for the model training.

      The conclusions of this paper are mostly well supported by data, but test examples should be revisited with a more strict sequence identity cutoff to avoid any potential information leakage from the training data. The main figures should be improved to make them easier to understand.

      We thank the reviewer for recognizing the significance of our work! We have carefully revised the manuscript to address the reviewer’s concerns.

      (1) The sequence identity cutoff to remove redundancies between training and test set was set to 40%, which is a bit high to remove test examples having homology to training examples. For example, CDPred uses a sequence identity cutoff of 30% to strictly remove redundancies between training and test set examples. To make their results more solid, the authors should have curated test examples with lower sequence identity cutoffs, or have provided the performance changes against sequence identities to the closest training examples.

      We thank the reviewer for the valuable suggestion! The “40% sequence identity” is a widely used threshold to remove redundancy when evaluating deep-learning based protein-protein interaction and protein complex structure prediction methods, thus we also chose this threshold in our study (bioRxiv 2021.10.04.463034, Cell Syst. 2021 Oct 20;12(10):969-982.e6). In the revision, we explored whether PLMGraph-Inter can keep its performance when more stringent thresholds (30%, 20%, 10%) are applied (line 353-386 in page 20-21 in the subsection of “Ablation study” and line 762-780 in page 40 in the subsection of “Further potential redundancies removal between the training and the test”). The result shows that even when using “10% sequence identity” as the threshold, the mean precision of the predicted contacts only decreases by ~3% (Table 2).

      (2) Figures with head-to-head comparison scatter plots are hard to understand as scatter plots because too many different methods are abstracted into a single plot with multiple colors. It would be better to provide individual head-to-head scatter plots as supplementary figures, not in the main figure.

      We thank the reviewer for the suggestion! We will include the individual head-to-head scatter plots as supplementary figures in the revision (Figure S1 and Figure S2 in the supplementary).

      (3) The authors claim that PLMGraph-Inter is complementary to AlphaFold-multimer as it shows better precision for the cases where AlphaFold-multimer fails. To strengthen the point, the qualities of predicted complex structures via protein-protein docking with predicted contacts as restraints should have been compared to those of AlphaFold-multimer structures.

      We thank the reviewer for the suggestion! We included this comparison in the revision (Figure S7).

      (4) It would be interesting to further analyze whether there is a difference in prediction performance depending on the depth of multiple sequence alignment or the type of complex (antigen-antibody, enzyme-substrates, single species PPI, multiple species PPI, etc).

      We thank the reviewer for the suggestion! We analyzed the relationship between the prediction performance and the depth of MSA in the revision (Figure S4 and Line 253-264 in page 15 in the subsection of “Evaluation of PLMGraph-Inter on HomoPDB and HeteroPDB test sets” and line 798-806 in page 42 in the subsection of “Calculating the normalized number of the effective sequences of paired MSA”).

      Reviewer #2 (Recommendations For The Authors):

      I have the following suggestions in addition to the public review.

      (1) Overall, the manuscript is well-written; however, I recommend a careful review for minor grammar corrections to polish the final text.

      We carefully checked the manuscript and corrected all the grammar issues and typos we found in the revision.

      (2) It would be better to indicate that single sequence embeddings, MSA embeddings, and structure embeddings are ESM-1b, ESM-MSA & PSSM, and ESM-IF when they are first mentioned in the manuscript e.g. single sequence embeddings from ESM-1b, MSA embeddings from ESM-MSA and PSSM, and structural embeddings from ESM-IF.

      We revised the manuscript according to the reviewer’s suggestion (line 86-88 in page 6; line 99-101 in page 7).

      (3) I don't think "outer concatenation" is commonly used. Please specify whether it's outer sum, outer product, or horizontal & vertical tiling followed by concatenation.

      It is horizontal & vertical tiling followed by concatenation. We clarified this in the revision (line 129-130 in page 8).

      (4) 10th sentence on the page where the Results section starts, please briefly mention what are the other 2D pairwise features.

      We clarified this in the revision (line 131-132 in page 8).

      (5) In the result section, it states edges are defined based on Ca distances, but in the method section, it says edges are determined based on heavy atom distances. Please correct one of them.

      It should be Ca distances. We are sorry for the carelessness, and we corrected this in the revision (line 646 in page 35).

      (6) For the sentence, "Where ESM-1b and ESM-MSA-1b are pretrained PLMs learned from large datasets of sequences and MSAs respectively without label supervision,", I'd suggest replacing "without label supervision" with "with masked language modeling tasks" for clarity.

      We revised the manuscript according to the reviewer’s suggestion (line 150-151 in page 9).

      (7) It would be better to briefly explain what the dimensional hybrid residual block is when it is first mentioned.

      We explained the dimensional hybrid residual block when it is first mentioned in the revision (line 107 in page 7).

      (8) Please include error bars for the bar plots and standard deviations for the tables.

      We thank the reviewer for the suggestion! Our understanding is that error bars and standard deviations are very informative for data which follow Gaussian-like distributions, but our data (precisions of the predicted contacts) are clearly not of this type. Most previous studies in protein contact prediction and inter-protein contact prediction also did not include these in their plots or tables. In our case, including these elements would require a dramatic change to the styles of our figures and tables, and we would prefer not to change our figures and tables too much in the revision.

      (9) Please indicate whether the chain break is considered to generate attention map features from ESM-MSA-1b. If it's considered, please specify how.

      The paired sequences were directly concatenated without using any letter to connect them, which means we did not consider chain break in generating the attention maps from ESM-MSA-1b.
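
      Two of the points clarified above, the "outer concatenation" in (3) (horizontal and vertical tiling followed by concatenation) and the Ca-distance-based edge definition in (5), can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the nested-list data layout, and in particular the 10.0 Å default cutoff are hypothetical and are not taken from the manuscript or the PLMGraph-Inter code.

```python
import math


def outer_concat(feat_a, feat_b):
    """Pair 1D per-residue features of two chains into a 2D pairwise feature map.

    feat_a: L1 x d1 nested list, feat_b: L2 x d2 nested list.
    Entry [i][j] is feat_a[i] followed by feat_b[j], which is what tiling
    feat_a along the columns, tiling feat_b along the rows, and concatenating
    along the feature axis produces. Output shape: L1 x L2 x (d1 + d2).
    """
    return [[list(a) + list(b) for b in feat_b] for a in feat_a]


def ca_edges(ca_coords, cutoff=10.0):
    """Undirected residue-residue edges based on Ca-Ca distance.

    ca_coords: list of (x, y, z) Ca coordinates for one monomer.
    cutoff:    distance threshold in angstroms (hypothetical value, assumed
               here only for illustration).
    """
    edges = []
    for i in range(len(ca_coords)):
        for j in range(i + 1, len(ca_coords)):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                edges.append((i, j))
    return edges
```

      For two chains of lengths L1 and L2, outer_concat yields one feature vector per inter-chain residue pair, which matches the shape of the inter-protein contact map being predicted.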

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations for the authors):

      The biochemical fractionation and use of the term "synaptic" were my biggest issues. I would recommend using a more targeted approach to measure the PSD or compare and contrast synaptic from extrasynaptic. For instance, PMID 16797717 does a PSD purification, whereas other papers have fractionated extrasynaptic from synaptic. Moreover, a PSD95 immunoprecipitation may be of interest as one question that could arise is since you see decreases in PSD95 GluN2B, but not 2A or GluA1, could the association of PSD95 with the different proteins be altered? To evaluate this, proteomics or some other unbiased methodology could enhance an understanding of the full panoply of changes induced by Prosapip1 within the dHP.

      The reviewer makes valid points; however, this is a large endeavor, which we will address in future experiments.

      There seems to be a missed opportunity to really determine how Prosapip1 is influencing protein expression and/or phosphorylation at the PSD.

      There is no indication that Prosapip1 is linked to transcription or translation machinery; therefore, we don’t see the value of examining protein expression in this context. Phosphorylation is a broad term, and although this can be answered through phosphoproteomics, this is outside the scope of this study.

      At the very least, additional discussion within this realm would help the reader contextualize the biochemical data.

      Further studies are needed to determine the mechanism by which Prosapip1 controls the localization of PSD95, GluN2B, and potentially others. It is plausible that posttranslational modifications are responsible for Prosapip1 function. For example, the Prosapip1 sequence contains a potential glycosylation site (Ser622), and several potential phosphorylation sites (https://glygen.org/protein/O60299#Glycosylation, https://www.phosphosite.org/proteinAction.action?id=18395&showAllSites=true#appletMsg). These posttranslational modifications can contribute to the stabilization of the synaptic localization of GluN2B and PSD95.

      We added to the discussion the paragraph above as well as the caveat that proteomic studies are needed for a comprehensive study of the role of Prosapip1 in the PSD.

      Weaknesses:

      (1) Methodological Weaknesses

      a. The synapsin-Cre mice may more broadly express Cre-recombinase than just in neuronal tissues. Specifically, according to Jackson Laboratories, there is a concern with these mice expressing Cre-recombinase germline. As the human protein atlas suggests that Prosapip1 protein is expressed extraneuronally, validation of neuron or at least brain-specific knockout would be helpful in interpreting the data. Having said that, the data demonstrating that the brain region-specific knockout has similar behavioral impacts helps alleviate this concern somewhat; however, there are no biochemical or electrophysiological readouts from these animals, and therefore an alternative mechanism in this adult knockout cannot be excluded.

      This is a valuable insight from the reviewer, especially considering the information from Jackson Laboratories. As mentioned in the paper, we exclusively used female Syn1-Cre carrying breeders to avoid germline recombination. Furthermore, we consistently assessed the prevalence of the Prosapip1 flox sites alongside the presence of Syn1-Cre with our regular litter genotyping, confirming the presence of Prosapip1. Additionally, Prosapip1 protein expression was directly examined in rats in Wendholdt et al., 2006, where this group reported that Prosapip1 is a brain-specific protein, minimizing the potential consequences of a peripheral loss of Prosapip1. In addition, to confirm that Prosapip1 is a brain-specific protein in mice, we performed a western blot analysis on the dorsal hippocampus, liver, and kidney of a C57BL/6 mouse (Author response image 1), and found that Prosapip1 protein is not found in these peripheral organs, aligning with the findings in rats reported by Wendholdt et al.

      Author response image 1. Prosapip1 protein in the dorsal hippocampus, liver, and kidney of C57BL/6 mice.

      b. The use of the word synaptic and the crude fractionation make some of the data difficult to interpret/contextualize. It is unclear how a single centrifugation that eliminates the staining of a nuclear protein can be considered a "synaptic" fraction. This is highlighted by the presence of GAPDH in this fraction which is a cytosolically-enriched protein. While GAPDH may be associated with some membranes it is not a synaptic protein. There is no quantification of GAPDH against total protein to validate that it is not enriched in this fraction over control. Moreover, it should not be used as a loading control in the synaptic fraction. There are multiple different ways to enrich membranes, extrasynaptic fractions, and PSDs and a better discussion on the caveats of the biochemical fractionation is a minimum to help contextualize the changes in PSD95 and GluN2B.

      We apologize for the confusion. As we described in the methods section, the crude synaptosome was isolated by several centrifugations as depicted in the figure which we are now including in the manuscript. As shown in Extended Figure 2, the P2 fraction does contain PSD-95 and synapsin, as well as GluN2B, GluN2A, and GluA1; however, it does not contain the transcription factor CREB, indicating the isolation of the crude synaptosomal fraction. As shown in the figure, a small amount of GAPDH is present in the crude synaptosomal fraction. The presence of GAPDH in the crude synaptosomal fraction has been previously reported in (Atsushi et al., 2003; Lee et al. 2016; Wang et al. 2012). As we have added to the discussion, there remains a caveat that we cannot differentiate the pre- and post-synaptic fraction, and as a result we do not know if Prosapip1 plays a role in the assembly of axonal proteins.

      c. Also, the word synaptosomal on page 7 is not correct. One issue is this is more than synaptosomes and another issue is synaptosomes are exclusively presynaptic terminals. The correct term to use is synaptoneurosome, which includes both pre and postsynaptic components. Moreover, as stated above, this may contain these components but is most likely not a pure or even enriched fraction.

      Since we cannot exclude the possibility that Prosapip1 is also expressed in glia, we do not believe that the term synaptoneurosome is accurate.

      d. The age at which the mice underwent injection of the Cre virus was not mentioned.

      We apologize for the oversight. As now noted in the methods, the mice used for experiments underwent surgery to infect neurons with the AAV-GFP or AAV-Cre viruses between 5 and 6 weeks of age to ensure full viral expression by the experimental window beginning at 8 weeks old.

      (2) Weaknesses of Results

      a. There were no measures of GluN1 or GluA2 in the biochemical assays. As GluN1 is the obligate subunit, how it is impacted by the loss of Prosapip1 may help contextualize the fact that GluN2B, but not GluN2A, is altered. Moreover, as GluA2 has different calcium permeance, alterations in it may be informative.

      Since we detect NMDAR current, which requires the obligatory subunit GluN1 and at least one GluN2 subunit (GluN2A, GluN2B, GluN2C, GluN2D), we did not see the rationale behind examining the level of GluN1 in the Prosapip1 knockout mice.

      b. While there was no difference in GluA1 expression in the "synaptic" fraction, it does not mean that AMPAR function is not impacted by the loss of Prosapip1. This is particularly important as Prosapip1 may interact with kinases or phosphatases or their targeting proteins. Therefore, measuring AMPAR function electrophysiologically or synaptic protein phosphorylation would be informative.

      We agree with the reviewer that the loss of Prosapip1 could potentially impact AMPAR function. To address this, we measured spontaneous excitatory postsynaptic currents (sEPSCs) in hippocampal pyramidal neurons from both Prosapip1(flx/flx);Syn1-Cre(-) and Prosapip1(flx/flx);Syn1-Cre(+) mice. Given that neurons were voltage-clamped at -70 mV and extracellular Mg<sup>2+</sup> was maintained at 1.3 mM, the sEPSCs we recorded were primarily mediated by AMPARs.

      We found no significant differences in either the frequency or amplitude of these AMPA-mediated sEPSCs between Prosapip1(flx/flx);Syn1-Cre(-) and Prosapip1(flx/flx);Syn1-Cre(+) mice, suggesting that AMPAR function in hippocampal pyramidal neurons is not noticeably affected by the loss of Prosapip1 (see Author response image 2 below).

      Author response image 2. Comparison of hippocampal sEPSCs between Prosapip1(flx/flx);Syn1-Cre(-) (Cre(-)) and Prosapip1(flx/flx);Syn1-Cre(+) (Cre(+)) mice. sEPSCs were recorded in the presence of 1.3 mM Mg<sup>2+</sup> and 0.1 mM picrotoxin, with neurons clamped at -70 mV. (A) Sample sEPSC traces from Prosapip1(flx/flx);Syn1-Cre(-) (top) and Prosapip1(flx/flx);Syn1-Cre(+) (bottom) mice. (B, C) Bar graphs showing no significant differences in sEPSC frequency (B) or amplitude (C) between Prosapip1(flx/flx);Syn1-Cre(-) and Prosapip1(flx/flx);Syn1-Cre(+) mice. Statistical analysis was performed using an unpaired t-test; p > 0.05, n.s. (not significant). Data represent 11 neurons from 3 Prosapip1(flx/flx);Syn1-Cre(-) mice (11/3) and 8 neurons from 3 Prosapip1(flx/flx);Syn1-Cre(+) mice (8/3).

      c. There is a lack of mechanistic data on what specifically and how GluN2B and PSD95 expression is altered. This is due to some of the challenges with interpreting the biochemical fractionation and a lack of results regarding changes in protein posttranslational modifications.

      See response above.

      d. The loss of social novelty measures in both the global and dHP-specific Prosapip1 knockout mice were not very robust. As they were consistently lost in both approaches and as there were other consistent memory deficits, this does not impact the conclusions, but may be important to temper discussion to match these smaller deficits within this domain.

      There is a clear difference between the Prosapip1(flx/flx);Syn1-Cre(-) and Prosapip1(flx/flx);Syn1-Cre(+) mice as well as the AAV-GFP and AAV-Cre mice in the loss of social novelty metric. We have emphasized that the Prosapip1(flx/flx);Syn1-Cre(+) mice and AAV-Cre mice do not recognize social novelty, which is supported by the statistics.

      4E: Two-way ANOVA: Effect of Social Novelty F<sub>(1,20)</sub> = 17.60, p = 0.0002; Post hoc Familiar vs. Novel (Cre(-)) p = 0.0008, Familiar vs. Novel (Cre(+)) p = 0.1451.

      5I: Two-way ANOVA: Effect of Social Novelty F<sub>(1,31)</sub> = 9.777, p = 0.0038; Post hoc Familiar vs. Novel (AAV-GFP) p = 0.0303, Familiar vs. Novel (AAV-Cre) p = 0.1319.

      e. Alterations in presynaptic paired-pulse ratio measures are intriguing and may point to a role for Prosapip1 in synapse development, as discussed in the manuscript. It would be interesting to delineate if these PPR changes also occur in the adult knockout to help detail the specific Prosapip1-induced neuroadaptations that link to the alterations in novelty-induced behaviors.

      This interesting question will be addressed in future studies.

      Reviewer #2 (Recommendations for the authors):

      (1) The test statistics are required for each experiment for completeness. Currently, only p-values, tests used, and N are included.

      The entirety of the statistical information can be found in Table 1, including test statistics and degrees of freedom (see Column 7, ‘Result’).

      (2) The authors claim that the function of Prosapip1 is not known in vivo, yet detail a study in the NAc where they investigated its function in vivo. The wording or discussion around what is and is not known should be altered to reflect this.

      The reviewer is correct to point to our previous manuscript (Laguesse et al. Neuron. 2017.) in which we found that Prosapip1 is important in mechanisms underlying alcohol-associated molecular, cellular and behavioral adaptations. However, these findings are specific to alcohol-related paradigms. Since the normal physiological role of Prosapip1 has never been delineated, this study was aimed to start addressing this gap in knowledge.

      References

      Wang, M., Li, S., Zhang, H. et al. Direct interaction between GluR2 and GAPDH regulates AMPAR-mediated excitotoxicity. Mol Brain 5, 13 (2012). https://doi.org/10.1186/1756-6606-5-13

      Atsushi Ikemoto, David G. Bole, Tetsufumi Ueda, Glycolysis and Glutamate Accumulation into Synaptic Vesicles: Role of Glyceraldehyde Phosphate Dehydrogenase and 3-Phosphoglycerate Kinase, Journal of Biological Chemistry, 8, 278 (2003). https://doi.org/10.1074/jbc.M211617200.

      Lee, F., Su, P., Xie, YF. et al. Disrupting GluA2-GAPDH Interaction Affects Axon and Dendrite Development. Sci Rep 6, 30458 (2016). https://doi.org/10.1038/srep30458

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this paper, Manley and Vaziri investigate whole-brain neural activity underlying behavioural variability in zebrafish larvae. They combine whole brain (single cell level) calcium imaging during the presentation of visual stimuli, triggering either approach or avoidance, and carry out whole brain population analyses to identify whole brain population patterns responsible for behavioural variability. They show that similar visual inputs can trigger large variability in behavioural responses. Though visual neurons are also variable across trials, they demonstrate that this neural variability does not degrade population stimulus decodability. Instead, they find that the neural variability across trials is in orthogonal population dimensions to stimulus encoding and is correlated with motor output (e.g. tail vigor). They then show that behavioural variability across trials is largely captured by a brain-wide population state prior to the trial beginning, which biases choice - especially on ambiguous stimulus trials. This study suggests that parts of stimulus-driven behaviour can be captured by brain-wide population states that bias choice, independently of stimulus encoding.

      Strengths:

      - The strength of the paper principally resides in the whole brain cellular level imaging in a well-known but variable behaviour.

      - The analyses are reasonable and largely answer the questions the authors ask.

      - Overall the conclusions are well warranted.

      Weaknesses:

      A more in-depth exploration of some of the findings could be provided, such as:

      - Given that thousands of neurons are recorded across the brain, a more detailed parcellation of where the neurons contribute to different population coding dimensions would be useful to better understand the circuits involved in different computations.

      We thank the reviewer for noting the strengths of our study and agree that these findings have raised a number of additional avenues which we intend to explore in depth in future studies. In response to the reviewer’s comment above, we have added a number of additional figure panels (new Figures S1E, S3F-G, 4I(i), 4K(i), and S5F-G) and updated panels (Figures 4I(ii) and 4K(ii) in the revised manuscript) to show a more detailed parcellation of the visually-evoked neurons, noise modes, turn direction bias population, and responsiveness bias population. To do so, we have aligned our recordings to the Z-Brain atlas (Randlett et al., 2015) as shown in new Figure S1E. In addition, we provided a more detailed parcellation of the neuronal ensembles by providing projections of the full 3D volume along the xy and yz axes, in addition to the unregistered xy projection shown in Figures 4H and 4J in the revised manuscript. We also found that the distribution of neurons across our huc:h2b-gcamp6s recordings is very similar to the distribution of labeling in the huc:h2b-rfp reference image from the Z-Brain atlas (Figure S1E), which further supports our whole-brain imaging results.

      Overall, we find that this more detailed quantification and visualization is consistent with our interpretations. In particular, we show that the optimal visual decoding population (w<sub>opt</sub>) and the largest noise mode (e1) are localized to the midbrain (Figures S3F-G). This is expected, as in Figure 3 we first extracted a low-dimensional subspace of whole-brain neural activity that optimally preserved visual information. Additionally, we provide new evidence that the populations correlated with the turn bias and responsiveness bias are distributed throughout the brain, including a relatively dense localization to the cerebellum, telencephalon, and dorsal diencephalon (habenula, new Figures 4H-K and S5F-G).

      - Given that the behaviour on average can be predicted by stimulus type, how does the stimulus override the brain-wide choice bias on some trials? In other words, a better link between the findings in Figures 2 and 3 would be useful for better understanding how the behaviour ultimately arises.

      We agree with the reviewer that one of the most fundamental questions that this study has raised is how the identified neuronal populations predictive of decision variables (which we describe as an internal “bias”) interact with the well-studied, visually-evoked circuitry. A major limitation of our study is that the slow dynamics of the NL-GCaMP6s prevent clearly distinguishing any potential difference in the onset time of various neurons during the short trials, which might provide clues into which neurons drive versus later reflect the motor output. However, given that these ensembles were also found to be correlated with spontaneous turns, our hypothesis is that these populations reflect brain-wide drives that enable efficient exploration of the local environment (Dunn et al. 2016, doi.org/10.7554/eLife.12741). Further, we suspect that a sufficiently strong stimulus drive (e.g., large, looming stimuli) overrides these ongoing biases, which would explain the higher average pre-stimulus predictability in trials with small to intermediate-sized stimuli. An important follow-up line of experimentation could involve comparing the neuronal dynamics of specific components of the visual circuitry at distinct internal bias states, ideally utilizing emerging voltage indicators to maximize spatiotemporal specificity. For example, what is the difference between trials with a large looming stimulus in the left visual fields when the turn direction bias indicates a leftward versus rightward drive?

      - What other motor outputs do the noise dimensions correlate with?

      To better demonstrate the relationship between neural noise modes and motor activity that we described, we have provided a more detailed correlation analysis in new Figure S4A. We extracted additional features related to the larva’s tail kinematics, including tail vigor, curvature, principal components of curvature, angular velocity, and angular acceleration (S4A(i)). Some of these behavioral features were correlated with one another; for example, in the example traces, PC1 appears to capture nearly the same behavioral feature as tail vigor. The largest noise modes showed stronger correlations with motor output than the smaller noise modes, which is reminiscent of recent work in the mouse showing that some of the neural dimensions with highest variance were correlated with various behavioral features (Musall et al. 2019; Stringer et al. 2019; Manley et al. 2024). We anticipate additional motor outputs would exhibit correlations with neural noise modes, such as pectoral fin movements (not possible to capture in our preparation due to immobilization) and eye movements.

      The dataset that the authors have collected is immensely valuable to the field, and the initial insights they have drawn are interesting and provide a good starting ground for a more expanded understanding of why a particular action is determined outside of the parameters experimenters set for their subjects.

      We thank the reviewer for noting the value of our dataset and look forward to future efforts motivated by the observations in our study.

      Reviewer #2 (Public Review):

      Overview

      In this work, Manley and Vaziri investigate the neural basis for variability in the way an animal responds to visual stimuli evoking prey-capture or predator-avoidance decisions. This is an interesting problem and the authors have generated a potentially rich and relevant data set. To do so, the authors deployed Fourier light field microscopy (Flfm) of larval zebrafish, improving upon prior designs and image processing schemes to enable volumetric imaging of calcium signals in the brain at up to 10 Hz. They then examined associations between neural activity and tail movement to identify populations primarily related to the visual stimulus, responsiveness, or turn direction - moreover, they found that the activity of the latter two populations appears to predict upcoming responsiveness or turn direction even before the stimulus is presented. While these findings may be valuable for future more mechanistic studies, issues with resolution, rigor of analysis, clarity of presentation, and depth of connection to the prior literature significantly dampen enthusiasm.

      Imaging

      - Resolution: It is difficult to tell from the displayed images how good the imaging resolution is in the brain. Given scattering and lensing, it is important for data interpretation to have an understanding of how much PSF degrades with depth.

      We thank the reviewer for their comments and agree that the dependence of the PSF and resolution as a function of depth is an important consideration in light field imaging. To quantify this, we measured the lateral resolution of the fLFM as a function of distance from the native image plane (NIP) using a USAF target. The USAF target was positioned at various depths using an automated z-stage, and the slice of the reconstructed volume corresponding to that depth was analyzed. An element was considered resolved if the modulation transfer function (MTF) was greater than 30%.

      In new Figure S1A, we plot the resolution measurements of the fLFM as compared to the conventional LFM (Prevedel et al., 2014), which shows the increase in resolution across the axial extent of imaging. In particular, the fLFM does not exhibit the dramatic drop in lateral resolution near the NIP which is seen in conventional LFM. In addition, the expanded range of high-resolution imaging motivates our increase from an axial range of 200 microns in previous studies to 280 microns in this study.
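
      The resolution criterion described above (an element counts as resolved when its modulation exceeds 30%) can be illustrated with a minimal sketch. It assumes that modulation is measured as the Michelson contrast of an intensity profile drawn across the bars of a target element; the function names and the example profiles are hypothetical and are not the analysis code used in the study.

```python
def modulation(profile):
    """Michelson contrast (I_max - I_min) / (I_max + I_min) of a line profile."""
    i_max, i_min = max(profile), min(profile)
    if i_max + i_min == 0:
        return 0.0
    return (i_max - i_min) / (i_max + i_min)


def is_resolved(profile, threshold=0.30):
    """Treat an element as resolved when its modulation exceeds the threshold."""
    return modulation(profile) > threshold


# Hypothetical line profiles across the three bars of one target element.
sharp = [10, 90, 10, 90, 10, 90, 10]    # bars clearly separated
blurred = [45, 55, 45, 55, 45, 55, 45]  # bars nearly merged
```

      Under these assumptions the sharp profile passes the 30% criterion and the blurred one does not; repeating the test at each depth would give a resolution-versus-depth curve of the kind plotted in Figure S1A.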

      - Depth: In the methods it is indicated that the imaging depth was 280 microns, but from the images of Figure 1 it appears data was collected only up to 150 microns. This suggests regions like the hypothalamus, which may be important for controlling variation in internal states relevant to the behaviors being studied, were not included.

      The full axial range of imaging was 280 microns, i.e. spanning from 140 microns below to 140 microns above the native imaging plane. After aligning our recordings to the Z-Brain dataset, we have compared the 3D distribution of neurons in our data (new Figure S1E(i)) to the labeling of the reference brain (Figure S1E(ii)). This provides evidence that our imaging preparation largely captures the labeling seen in a dense, high-resolution reference image within the indicated 280 microns range.

      - Flfm data processing: It is important for data interpretation that the authors are clearer about how the raw images were processed. The de-noising process specifically needs to be explained in greater detail. What are the characteristics of the noise being removed? How is time-varying signal being distinguished from noise? Please provide a supplemental with images and algorithm specifics for each key step.

      We thank the reviewer for their comment. To address the reviewer’s point regarding the data processing pipeline utilized in our study, in our revised manuscript we have added a number of additional figure panels in Figure S1B-E to quantify and describe the various steps of the pipeline in greater depth.

      First, the raw fLFM images are denoised. The denoising approach utilized in the fLFM data processing pipeline is not novel, but rather a custom-trained variant of Lecoq et al.’s (2021) DeepInterpolation method. In our original manuscript, we also described the specific architecture and parameters utilized to train our variant of the DeepInterpolation model. To make this procedure clearer, we have added the following details to the methods:

      “DeepInterpolation is a self-supervised approach to denoising, which denoises the data by learning to predict a given frame from a set of frames before and after it. Time-varying signal can be distinguished from shot noise because shot noise is independent across frames, but signal is not. Therefore, only the signal is able to be predicted from adjacent frames. This has been shown to provide a highly effective and efficient denoising method (Lecoq et al., 2021).”

      Therefore, time-varying signal is distinguished from noise based on the correlations of pixel intensity across consecutive imaging frames. To better visualize this process, in new Figure S1B we show example images and fluorescence traces before and after denoising.
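
      The principle behind this distinction can be shown with a toy example. The sketch below is not DeepInterpolation: it replaces the trained network with a plain average of the frames before and after each frame, and it uses a synthetic sinusoidal trace with Gaussian noise standing in for shot noise. All names and parameter values are assumptions chosen for illustration.

```python
import math
import random

random.seed(0)
n_frames = 2000

# Slowly varying "signal" plus frame-independent noise (stand-in for shot noise).
signal = [math.sin(2 * math.pi * t / 200) for t in range(n_frames)]
noise = [random.gauss(0.0, 1.0) for _ in range(n_frames)]
noisy = [s + n for s, n in zip(signal, noise)]


def predict_from_neighbors(trace, half_window=15):
    """Predict each frame as the mean of the frames before and after it,
    excluding the frame itself."""
    out = []
    for t in range(half_window, len(trace) - half_window):
        before = trace[t - half_window:t]
        after = trace[t + 1:t + 1 + half_window]
        out.append((sum(before) + sum(after)) / (2 * half_window))
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


hw = 15
denoised = predict_from_neighbors(noisy, hw)
truth = signal[hw:n_frames - hw]
err_before = mse(noisy[hw:n_frames - hw], truth)  # error of the raw noisy trace
err_after = mse(denoised, truth)                  # error of the neighbor-based prediction
```

      Because the noise in a frame is independent of its neighbors, the prediction recovers the shared signal and leaves most of the noise behind, so `err_after` is much smaller than `err_before`. A learned predictor exploits the same property with a far more flexible model.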

      - Merging: It is noted that nearby pixels with a correlation greater than 0.7 were merged. Why was this done? Is this largely due to cross-contamination due to a drop in resolution? How common was this occurrence? What was the distribution of pixel volumes after aggregation? Should we interpret this to mean that a 'neuron' in this data set is really a small cluster of 10-20 neurons? This of course has great bearing on how we think about variability in the response shown later.

      First, to be clear, nearby pixels were not merged; instead neuronal ROIs identified by CNMF-E were merged, as we had described: “the CNMF-E algorithm was applied to each plane in parallel, after which the putative neuronal ROIs from each plane were collated and duplicate neurons across planes were merged.” If this merging was not performed, the number of neurons would be overestimated due to the relatively dense 3D reconstruction with voxels of 4 µm axially. Therefore, this merging is a requisite component of the pipeline to avoid double counting of neurons, regardless of the resolution of the data.
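
      One way such duplicate merging could be organised is sketched below. It assumes that two putative ROIs are duplicates when their centroids lie within a small 3D distance and their traces correlate above 0.7 (the threshold quoted in the reviewer's comment). The distance cutoff, the data layout, and the function names are hypothetical, and this is not the pipeline's actual implementation.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length traces."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def merge_duplicates(rois, max_dist_um=5.0, min_corr=0.7):
    """Group ROIs that are close in 3D and strongly correlated in time.

    rois: list of dicts with 'xyz' (centroid in micrometres) and 'trace'.
    Returns a sorted list of index groups, one group per merged neuron.
    """
    parent = list(range(len(rois)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(rois)):
        for j in range(i + 1, len(rois)):
            close = math.dist(rois[i]["xyz"], rois[j]["xyz"]) <= max_dist_um
            if close and pearson(rois[i]["trace"], rois[j]["trace"]) > min_corr:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(rois)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical example: ROIs 0 and 1 are the same cell seen in adjacent planes.
rois = [
    {"xyz": (10.0, 10.0, 0.0), "trace": [0.0, 1.0, 4.0, 2.0, 0.0, 3.0]},
    {"xyz": (10.5, 10.0, 4.0), "trace": [0.0, 1.1, 3.9, 2.2, 0.1, 3.0]},
    {"xyz": (60.0, 40.0, 0.0), "trace": [3.0, 0.0, 1.0, 0.0, 2.0, 0.0]},
]
merged = merge_duplicates(rois)
```

      In this example the first two ROIs collapse into a single neuron and the third remains separate, which is the double-counting correction the merging step is meant to provide.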

      However, we agree with the reviewer that the practical consequences of this merging were not previously described in sufficient detail. Therefore, in our revision we have added additional quantification of the two critical components of the merging procedure: the number of putative neuronal ROIs merged and the volume of the final 3D neuronal ROIs, which demonstrate that a neuron in our data should not be interpreted as a cluster of 10-20 neurons.

      In new Figure S1C(i), we summarize the rate of occurrence of merging by assessing the number of putative 2D ROIs which were merged to form each final 3D neuronal ROI. Across n=10 recordings, approximately 75% of the final 3D neuronal ROIs involved no merging at all, and few instances involved merging more than 5 putative ROIs. Next, in Figure S1C(ii), we quantify the volume of the final 3D ROIs. To do so, we counted the number of voxels contributing to each final 3D neuronal ROI and multiplied that by the volume of a single voxel (2.4 x 2.4 x 4 µm<sup>3</sup>). The majority of neurons had a volume of less than 1000 µm<sup>3</sup>, which corresponds to a spherical volume with a radius of roughly 6.2 µm. In summary, both the merging statistics and volume distribution demonstrate that few neuronal ROIs could be consistent with “a small cluster of 10-20 neurons”.
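
      The volume arithmetic above can be checked directly. The short sketch below uses the voxel size stated in the text and the standard sphere-volume formula; the function names are illustrative.

```python
import math

VOXEL_UM = (2.4, 2.4, 4.0)  # voxel size in micrometres, as stated in the text


def roi_volume_um3(n_voxels, voxel_um=VOXEL_UM):
    """Volume of an ROI made of n_voxels voxels, in cubic micrometres."""
    return n_voxels * voxel_um[0] * voxel_um[1] * voxel_um[2]


def equivalent_sphere_radius_um(volume_um3):
    """Radius of a sphere with the given volume: r = (3V / (4*pi))^(1/3)."""
    return (3.0 * volume_um3 / (4.0 * math.pi)) ** (1.0 / 3.0)


radius = equivalent_sphere_radius_um(1000.0)  # about 6.2 micrometres
voxels_at_1000 = 1000.0 / roi_volume_um3(1)   # about 43 voxels
```

      A 1000 µm<sup>3</sup> ROI therefore corresponds to roughly 43 voxels at this sampling and to an equivalent sphere radius of about 6.2 µm.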

      - Bleaching: Please give the time constants used in the fit for assessing bleaching.

      As described in the Methods, the photobleaching correction was performed by fitting a bi-exponential function to the mean fluorescence across all neurons. We have provided the time constants determined by these fits for n=10 recordings in new Figure S1D(i). In addition, we provided an example of raw mean activity, the corresponding bi-exponential fit, and the mean activity after correction in Figure S1D(ii). These data demonstrate that the dominant photobleaching effect is a steep decrease in mean signal at the beginning of the recording (represented by the estimated time constant τ<sub>1</sub>), followed by a slow decay (τ<sub>2</sub>).
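
      For readers unfamiliar with this type of correction, the sketch below shows the form of a bi-exponential bleaching model and one common way of applying it. The parameter values are hypothetical rather than the fitted values reported in Figure S1D(i), the fitting step itself is omitted, and dividing by the normalised curve is an assumption about how the correction is applied.

```python
import math


def biexp(t, a1, tau1, a2, tau2):
    """Bi-exponential decay: fast component (tau1) plus slow component (tau2)."""
    return a1 * math.exp(-t / tau1) + a2 * math.exp(-t / tau2)


# Hypothetical parameters (amplitudes in a.u., time constants in seconds).
params = dict(a1=0.3, tau1=60.0, a2=1.0, tau2=5000.0)

times = [float(t) for t in range(0, 3600, 10)]
# Synthetic mean fluorescence: a constant underlying level scaled by bleaching.
raw_mean = [2.0 * biexp(t, **params) for t in times]

# Divide by the bleaching curve, normalised to its value at t = 0.
f0 = biexp(0.0, **params)
corrected = [f / (biexp(t, **params) / f0) for f, t in zip(raw_mean, times)]
```

      With a short τ<sub>1</sub> and a long τ<sub>2</sub>, the model reproduces a steep initial drop followed by a slow decay, and the corrected trace is flat when the underlying activity is constant.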

      Analysis

      - Slow calcium dynamics: It does not appear that the authors properly account for the slow dynamics of calcium-sensing in their analysis. Nuclear-localized GCaMP6s will likely have a kernel with a multiple-second decay time constant for many of the cells being studied. The value used needs to be given and the authors should account for variability in this kernel time across cell types. Moreover, by not deconvolving their signals, the authors allow for contamination of their signal at any given time with a signal from multiple seconds prior. For example, in Figure 4A (left turns), it appears that much of the activity in the first half of the time-warped stimulus window began before stimulus presentation - without properly accounting for the kernel, we don't know if the stimulus-associated activity reported is really stimulus-associated firing or a mix of stimulus and pre-stimulus firing. This also suggests that in some cases the signals from the prior trial may contaminate the current trial.

      We would like to respond to each of the points raised here by the reviewer individually.

      (1) “It does not appear that the authors properly account for the slow dynamics of calcium-sensing in their analysis. Nuclear-localized GCaMP6s will likely have a kernel with a multiple-second decay time constant for many of the cells being studied. The value used needs to be given…”

      We disagree with the reviewer’s claim that the slow dynamics of the calcium indicator GCaMP were not accounted for. While we did not deconvolve the neuronal traces with the GCaMP response kernel, in every step in which we correlated neural activity with sensory or motor variables, we convolved the stimulus or motor timeseries with the GCaMP kernel, as described in the Methods. Therefore, the expected delay and smoothing effects were accounted for when analyzing the correlation structure between neural and behavioral or stimulus variables, as well as during our various classification approaches. To better describe this, we have added the following description of the kernel to our Methods:

      “The NL-GCaMP6s kernel was estimated empirically by aligning and averaging a number of calcium events. This kernel corresponds to a half-rise time of 400 ms and half-decay time of 4910 ms.”

      This approach accounts for the GCaMP kernel when relating the neuronal dynamics to stimuli and behavior, while avoiding any artifacts that could be introduced from improper deconvolution or other corrections directly to the calcium dynamics. Deconvolution of calcium imaging data, and in particular nuclear-localized (NL) GCaMP6s, is not always a robust procedure. In particular, GCaMP6s has a much more nonlinear response profile than newer GCaMP variants such as jGCaMP8 (Zhang et al. 2023, doi:10.1038/s41586-023-05828-9), as the reviewer notes later in their comments. The nuclear-localized nature of the indicator used in our study also provides an additional nonlinear effect. Accounting for a nonlinear relationship between calcium concentration and fluorescence readout is significantly more difficult because such nonlinearities remove the guarantee that the optimization approaches generally used in deconvolution will converge to global extrema. This means that deconvolution assuming nonlinearities is far less robust than deconvolution using the linear approximation (Vogelstein et al. 2010, doi: 10.1152/jn.01073.2009). Therefore, we are not currently aware of any appropriate methods for deconvolving our NL-GCaMP6s data, and we take a more conservative approach in our study.

      We also argue that the natural smoothness of calcium imaging data is important for the analyses utilized in our study (Shen et al., 2022, doi:10.1016/j.jneumeth.2021.109431). Even if our data were deconvolved in order to estimate spike trains or more point-like activity patterns, such data are generally smoothed (e.g., by estimating firing rates) before dimensionality reduction, which is a core component of our neuronal population analyses. Further, Wei et al. (2020, doi:10.1371/journal.pcbi.1008198) showed in detail that deconvolved calcium data resulted in less accurate population decoding, whereas binned electrophysiological data and raw calcium data were equally accurate. When using other techniques, such as clustering of neuronal activity patterns (a method we do not employ in this study), spike and deconvolved calcium data were instead shown to be more accurate than raw calcium data. Therefore, we do not believe deconvolution of the neuronal traces is appropriate in this case without a better understanding of the NL-GCaMP6s response, and do not rely on the properties of deconvolution for our analyses. Still, we agree with the reviewer that one must be mindful of the GCaMP kernel when analyzing and interpreting these data, and therefore have noted the delayed and slow kinematics of the NL-GCaMP within our manuscript, for example: “To visualize the neuronal activity during a given trial while accounting for the delay and kinematics of the nuclear-localized GCaMP (NL-GCaMP) sensor, a duration of approximately 15 seconds is extracted beginning at the onset of the 3-second visual stimulus period.”

      (2) “… and the authors should account for variability in this kernel time across cell types.”

      In addition to the points raised above, we are not aware of any deconvolution procedures that have successfully accounted for variability in the response kernel across cell types in whole-brain imaging data when cell type is unknown a priori. Pachitariu et al. (2018, doi:10.1523/JNEUROSCI.3339-17.2018) showed that the best deconvolution procedures for calcium imaging data rely on a simple algorithm with a fixed kernel. Further, more complicated approaches either utilize explicit priors about the calcium kernel or learn implicit priors using supervised learning, neither of which we would be able to confirm are appropriate for our dataset without ground truth electrophysiological spike data.

      However, we agree with the reviewer that we must interpret the data while being mindful that there could be variability in this kernel across neurons, which is not accounted for in our fixed calcium kernel. We have added the following sentence to our revised manuscript to highlight this limitation:

      “The use of a fixed calcium kernel does not account for any variability in the GCaMP response across cells, which could be due to differences such as cell type or expression level. Therefore, this analysis approach may not capture the full set of neurons which exhibit stimulus correlations but exhibit a different GCaMP response.”
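      As a rough illustration of what a fixed calcium kernel implies for timing, the sketch below convolves a 3-second stimulus with a difference-of-exponentials kernel. The kernel shape, the time constants (`tau_rise`, `tau_decay`), the frame interval `dt`, and the function names are all assumptions chosen for illustration; they are not the measured NL-GCaMP6s response or the kernel used in the manuscript.

```python
import numpy as np

def calcium_kernel(tau_rise, tau_decay, dt, duration):
    """Difference-of-exponentials kernel scaled to unit peak (assumed shape)."""
    t = np.arange(0.0, duration, dt)
    k = np.exp(-t / tau_decay) - np.exp(-t / tau_rise)
    return k / k.max()

def convolve_with_kernel(stimulus, kernel):
    """Causal convolution: the output at time t depends only on the stimulus at times <= t."""
    return np.convolve(stimulus, kernel)[: len(stimulus)]

dt = 0.1                       # seconds per frame (hypothetical)
stimulus = np.zeros(300)       # 30 s of recording
stimulus[100:130] = 1.0        # 3-second stimulus starting at t = 10 s
kernel = calcium_kernel(tau_rise=0.5, tau_decay=3.0, dt=dt, duration=20.0)
regressor = convolve_with_kernel(stimulus, kernel)

onset, offset = 100, 130
peak = int(regressor.argmax())
print(f"stimulus ends at frame {offset}, regressor peaks at frame {peak}")
```

      With these assumed parameters, the convolved regressor is exactly zero before stimulus onset and reaches its peak only after the stimulus has ended, which is the kind of delay and smoothing a slow indicator introduces. A cell with a different kernel would need a different regressor, which is the limitation the added sentence acknowledges.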

      (3) “without properly accounting for the kernel, we don't know if the stimulus-associated activity reported is really stimulus-associated firing or a mix of stimulus and pre-stimulus firing”

      While we agree with the reviewer that the slow dynamics of the indicator will cause a delay and smoothing of the signal over time, we would like to point out that this effect is highly directional. In particular, we can be confident that pre-stimulus activity is not contaminated by the stimulus given the data we describe in the next point regarding the timing of visual stimuli relative to the GCaMP kernel. The reviewer is correct that post-stimulus firing can be mixed with pre-stimulus firing due to the GCaMP kernel. However, our key claims in Figure 4 center around turn direction and responsiveness biases, which are present even before the onset of the stimulus. Still, we have highlighted this delay and smoothing to readers in the updated version of our manuscript.

      (4) “This also suggests that in some cases the signals from the prior trial may contaminate the current trial”

      We have carefully chosen the inter-stimulus interval for maximum efficiency of stimulation, while ensuring that contamination from the previous stimulus is negligible. The inter-stimulus interval was chosen by empirically analyzing preliminary data of visual stimulation with our preparation. New Figure S3C shows the delay and slow kinematics due to our indicator; indeed, visually-evoked activity peaks after the end of the short stimulus period. Importantly, however, the visually-evoked activity is at or near baseline at the start of the next trial.

      Finally, we would like to note that our stimulation protocol is randomized, as described in the Methods. Therefore, the previous stimulus has no correlation with the current stimulus, which would prevent any contamination from providing predictive power that could be identified by our visual decoding methods.

      - Partial Least Squares (PLS) regression: The steps taken to identify stimulus coding and noise dimensions are not sufficiently clear. Please provide a mathematical description.

      We have updated the Results and Methods sections of our revised manuscript to describe in more mathematical detail the approach taken to identify the relevant dimensions of neuronal activity:

      “The comparison of the neural dimensions encoding visual stimuli versus trial-to-trial noise was modeled after Rumyantsev et al. (2020). Partial least squares (PLS) regression was used to find a low-dimensional space that optimally predicted the visual stimuli, which we refer to as the visually-evoked neuronal activity patterns. To perform regression, a visual stimulus kernel was constructed by summing the timeseries of each individual stimulus type, weighted by the stimulus size and negated for trials on the right visual field, thus providing a single response variable encoding the location, size, and timing of all the stimulus presentations. This stimulus kernel was then convolved with the temporal response kernel of our calcium indicator (NL-GCaMP6s).

      PLS regression identifies normalized dimensions that maximize the covariance between two sets of paired observations. In our case, the visual stimulus is represented by a single variable, simplifying the problem to identifying the subspace of neural activity that optimally preserves information about the visual stimulus (sometimes referred to as PLS1 regression). That is, the N x T neural time series matrix X is reduced to a d x T matrix spanned by a set of orthonormal vectors. PLS1 regression is performed as follows:

      PLS1 algorithm

      Let X<sub>i</sub> = X and . For i = 1…d,

      (1) 

      (2) 

      (3) 

      (4) 

      (5)  (note this is scalar)

      (6) 

      The projections of the neural data {p<sub>i</sub>} thus span a subspace that maximally preserves information about the visual stimulus. Stacking these projections into the N x d matrix P that represents the transform from the whole-brain neural state space to the visually-evoked subspace, the optimal decoding direction w<sub>opt</sub> is given by the linear least squares solution. The dimensionality d of PLS regression was optimized using 6-fold cross-validation with 3 repeats, choosing the dimensionality between d = 1 and 20 with the lowest cross-validated mean squared error for each larva. Then, w<sub>opt</sub> was computed using all time points.

      For each stimulus type, the noise covariance matrix was computed in the low-dimensional PLS space, given that direct estimation of the noise covariances across many thousands of neurons would likely be unreliable. A noise covariance matrix was calculated separately for each stimulus, and then averaged across all stimuli. As before, the mean activity µ<sub>i</sub> for each neuron i was computed over each stimulus presentation period. The noise covariance then describes the correlated fluctuations δ<sub>i</sub> around this mean response for each pair of neurons i and j.

      The noise modes e<sub>α</sub> for α = 1…d were subsequently identified by eigendecomposition of the mean noise covariance matrix across all stimuli. The angle between the optimal stimulus decoding direction w<sub>opt</sub> and each noise mode e<sub>α</sub> was then computed.”
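      For readers who want a concrete picture of this pipeline, the following is a minimal sketch on synthetic data. It uses the textbook NIPALS form of PLS1, which may differ in detail from the six steps in the manuscript, and every function name, array shape, and parameter value here is an assumption made for illustration. The toy data also treat each time point as a "trial" when estimating noise covariance, which is a simplification.

```python
import numpy as np

def pls1_weights(X, y, d):
    """Textbook NIPALS PLS1 (assumed variant).
    X: (N, T) neurons x time, y: (T,) stimulus regressor.
    Returns W: (N, d), whose columns are orthonormal weight vectors."""
    Xk = X - X.mean(axis=1, keepdims=True)   # center each neuron
    yk = y - y.mean()
    W = np.zeros((X.shape[0], d))
    for i in range(d):
        w = Xk @ yk                          # covariance of each neuron with the stimulus
        w /= np.linalg.norm(w)
        t = Xk.T @ w                         # scores over time, shape (T,)
        tt = t @ t                           # scalar
        p = Xk @ t / tt                      # loadings, shape (N,)
        q = yk @ t / tt                      # scalar regression of y on the scores
        Xk = Xk - np.outer(p, t)             # deflate the neural data
        yk = yk - q * t                      # deflate the stimulus
        W[:, i] = w
    return W

def decoding_direction(Z, y):
    """Unit-norm least-squares decoder in the subspace. Z: (d, T)."""
    w_opt, *_ = np.linalg.lstsq(Z.T, y - y.mean(), rcond=None)
    return w_opt / np.linalg.norm(w_opt)

def noise_modes(Z, labels):
    """Average the per-stimulus noise covariance, then eigendecompose.
    Z: (d, n_trials) responses in PLS space; labels: stimulus id per trial."""
    covs = []
    for s in np.unique(labels):
        R = Z[:, labels == s]
        delta = R - R.mean(axis=1, keepdims=True)   # fluctuations around the mean response
        covs.append(delta @ delta.T / (R.shape[1] - 1))
    evals, evecs = np.linalg.eigh(np.mean(covs, axis=0))
    order = np.argsort(evals)[::-1]                 # largest noise mode first
    return evals[order], evecs[:, order]

def angle_deg(u, v):
    """Angle between two directions, folded into [0, 90] degrees."""
    c = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, 0.0, 1.0)))

rng = np.random.default_rng(0)
N, T, d = 50, 200, 3
y = rng.standard_normal(T)
X = np.outer(rng.standard_normal(N), y) + rng.standard_normal((N, T))
W = pls1_weights(X, y, d)
Z = W.T @ (X - X.mean(axis=1, keepdims=True))       # (d, T) activity in the PLS subspace
w_opt = decoding_direction(Z, y)
labels = rng.integers(0, 4, size=T)                 # hypothetical stimulus ids
evals, evecs = noise_modes(Z, labels)
theta = angle_deg(w_opt, evecs[:, 0])
print(f"angle between w_opt and the largest noise mode: {theta:.1f} degrees")
```

      Because the angle is computed inside the d-dimensional subspace rather than across all N neurons, the comparison is not dominated by the near-orthogonality that arises by chance in very high-dimensional spaces.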

      - No response: It is not clear from the methods description if cases where the animal has no tail response are being lumped with cases where the animal decides to swim forward and thus has a large absolute but small mean tail curvature. These should be treated separately. 

      We thank the reviewer for raising the potential for this confusion and agree that forward-motion trials should not be treated the same as motionless trials. While these types of trials were indeed treated separately in our original manuscript, we have updated the Methods section of our revised manuscript to make this clear:

      “Left and right turn trials were extracted as described previously. Response trials included both left and right turn trials (i.e., the absolute value of mean tail curvature > σ<sub>active</sub>), whereas nonresponse trials were motionless (absolute mean tail curvature < σ<sub>active</sub>). In particular, forward-motion trials were excluded from these analyses.”
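      One way to read this trial definition is sketched below. The function name, the example threshold, and in particular the rule used to tell forward-motion trials from motionless ones (a small mean curvature but a large mean absolute curvature) are assumptions based on the reviewer's description, not the manuscript's exact criteria. The sign convention (positive curvature for leftward turns) follows the responses.

```python
import numpy as np

def classify_trial(curvature, sigma_active):
    """Label one trial from its tail-curvature trace (hypothetical rule)."""
    mean_curv = np.mean(curvature)
    mean_abs = np.mean(np.abs(curvature))
    if abs(mean_curv) > sigma_active:
        # Net curvature to one side: a turn (positive = leftward).
        return "left" if mean_curv > 0 else "right"
    if mean_abs < sigma_active:
        # Little tail movement of any kind: motionless.
        return "nonresponse"
    # Tail moved, but symmetrically: treated here as forward motion and excluded.
    return "forward"

sigma_active = 0.2  # hypothetical threshold
trials = {
    "a": np.array([0.5, 0.6, 0.4]),
    "b": np.array([-0.5, -0.6, -0.4]),
    "c": np.zeros(4),
    "d": np.array([0.5, -0.5, 0.5, -0.5]),
}
labels = {name: classify_trial(trace, sigma_active) for name, trace in trials.items()}
print(labels)
```

      Keeping the forward-motion case as its own label makes it explicit that such trials are neither responses nor nonresponses in the turn analyses.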

      While our study has focused specifically on left and right turns, we hypothesize that the responsiveness bias ensemble may also be involved in forward movements and look forward to future work exploring the relationship between whole-brain dynamics and the full range of motor outputs.

      - Behavioral variability: Related to Figure 2, within- and across-subject variability are confounded. Please disambiguate. It may also be informative on a per-fish basis to examine associations between reaction time and body movement.

      The reviewer is correct that our previously reported summary statistics in Figure 2D-F were aggregated across trials from multiple larvae. Following the reviewer’s suggestion to make the magnitudes of across-larvae and within-larva variability clear, in our revised manuscript we have added two additional figure panels to Figure S2.

      New Figure S2A highlights the across-larvae variability in mean head-directed behavioral responses to stimuli of various sizes. Overall, the relationship between stimulus size and the mean tail curvature across trials is largely consistent across larvae; however, the crossing-over point between leftward (positive curvature) and rightward (negative curvature) turns for a given side of the visual field exhibits some variability across larvae.

      New Figure S2B shows examples of within-larva variability by plotting the mean tail curvature during single trials for two example larvae. Consistent with Figure 2G which also demonstrates within-larva variability, responses to a given stimulus are variable across trials in both examples. However, this degree of within-larva variability can appear different across larvae. For example, the larva shown on the left of Figure S2B exhibits greater overlap between responses to stimuli presented on opposite visual fields, whereas the larva shown on the right exhibits greater distinction between responses.

      - Data presentation clarity: All figure panels need scale bars - for example, in Figure 3A there is no indication of timescale (or time of stimulus presentation). Figure 3I should also show the time series of the w_opt projection.

      We appreciate the reviewer’s attention to detail in this regard. We have added scale bars to Figures 3A, 3H-I, S4B(ii), 4H, and 4J in the revised manuscript, and to all new figure panels where relevant. In addition, the caption of Figure 3A has been updated to include a description of the time period plotted relative to the onset of the visual stimulus.

      Additionally, we appreciate the reviewer’s idea to show w<sub>opt</sub> in Figure 3J of the revised manuscript (previously Figure 3I). This clearly shows that the visual decoding projection is inactive during the short baseline period before visual stimulation begins, whereas the noise mode is correlated with motor output throughout the recording.

      - Pixel locations: Given the poor quality of the brain images, it is difficult to tell the location of highlighted pixels relative to brain anatomy. In addition, given that the midbrain consists of much more than the tectum, it is not appropriate to put all highlighted pixels from the midbrain under the category of tectum. To aid in data interpretation and better connect this work with the literature, it is recommended that the authors register their data sets to standard brain atlases and determine if there is any clustering of relevant pixels in regions previously associated with prey-capture or predator-avoidance behavior.

      We agree with the reviewer that registration of our datasets to a standard brain atlas is a highly useful addition. While the dense, pan-neuronal labeling makes the isolation of highly specific circuit components difficult, we have shown in more detail the specific brain regions contributing to these populations by aligning our recordings to the Z-Brain atlas (Randlett et al., 2015) as shown in new Figures S1E, S3F-G, 4I, 4K, and S5F-G. In addition, we provided a more detailed parcellation of the neuronal ensembles by providing projections of the full 3D volume along the xy and yz axes, in addition to the unregistered xy projection shown in new Figures 4H and 4J. We also found that the distribution of neurons in our huc:H2B-GCaMP6s recordings is very similar to the distribution of labeling in the huc:H2B-RFP reference image from the Z-Brain atlas (new Figure S1E), which further supports our whole-brain imaging results.

      Overall, we find that this more detailed quantification and visualization is consistent with the interpretations in the previous version of our manuscript. In particular, we show that the optimal visual decoding population (w<sub>opt</sub>) and the largest noise mode (e<sub>1</sub>) are localized to the midbrain (new Figures S3F-G), which is expected since in Figure 3 we first extracted a low-dimensional subspace of whole-brain neural activity that optimally preserved visual information. We also provide further evidence that the populations correlated with the turn bias and responsiveness bias are distributed throughout the brain, including a relatively dense localization to the cerebellum, telencephalon, and dorsal diencephalon (habenula; new Figures 4H-K and S5F-G).

      Finally, the reviewer is correct that our original label of “tectum” was a misnomer; the region analyzed corresponded to the midbrain, including the tegmentum, torus longitudinalis, and torus semicircularis in addition to the tectum. We have updated the brain regions shown and labels throughout the manuscript.

      Interpretation

      - W_opt and e_1 orthogonality: The statement that these two vectors, determined from analysis of the fluorescence data, are orthogonal, actually brings into question the idea that true signal and leading noise vectors in firing-rate state-space are orthogonal. First, the current analysis is confounding signals across different time periods - one could assume linearity all the way through the transformations, but this would only work if earlier sources of activation were being accounted for. Second, the transformation between firing rate and fluorescence is most likely not linear for GCaMP6s in most of the cells recorded. Thus, one would expect a change in the relationship between these vectors as one maps from fluorescence to firing rate.

      Unfortunately, we are not entirely sure we have understood the reviewer’s argument. We are assuming that the reviewer’s first sentence is suggesting that the observation of orthogonality in the neural state space measured in calcium imaging precludes the possibility (“actually brings into question”, as the reviewer states) that the same neural ensembles could be orthogonal in firing rate state space measured by electrophysiological data. If this is the reviewer’s conjecture, we respectfully disagree with it. Consider a toy example of a neural network containing N ensembles of neurons, where the neurons within an ensemble all fire simultaneously, and no two ensembles ever fire at the same time. As long as the “switching” of firing between ensembles is not fast relative to the resolution of the GCaMP kernel, the largest principal components would represent orthogonal dimensions differentiating the various ensembles, both when observing firing rates and when observing timeseries convolved by the GCaMP kernel. This is a simple example where the observed orthogonality would appear similar in both calcium imaging and electrophysiological data, demonstrating that we should not allow conclusions from fluorescence data to “bring into question” that the same result could be observed in firing rate data.

      We also disagree with the reviewer’s argument that we are “confounding signals across time periods”. Indeed, we must interpret the data in light of the GCaMP response kernel. However, all of the analyses presented here are performed on instantaneous measurements of population activity patterns. These activity patterns do represent a smoothed, likely nonlinear integration of recent neuronal activity, but unless the variability in the GCaMP response kernel (discussed above) is widely different across these populations (which has not been observed in the literature), we do not expect that the GCaMP transformations would artificially induce orthogonality in our analysis approach. Such smoothing operations tend to instead increase correlations across neurons and population decoding approaches generally benefit from this smoothness, as we have argued above. However, a much more problematic situation would be if we were comparing the activity of two neuronal populations at different points in time (which we do not include in this study), in which case the nonlinearities could overaccentuate orthogonality between non-time-matched activity patterns.

      Finally, we agree with the reviewer that the transformation between firing rate and fluorescence is very likely nonlinear and that these vectors of population activity do not perfectly represent what would be observed if one had access to whole-brain, cellular-resolution electrophysiology spike data. However, similar observations regarding the brain-wide, distributed encoding of behavior have been confirmed across recording modalities in the mouse (Stringer et al., 2019; Steinmetz et al., 2019), where large-scale electrophysiology utilizing highly invasive probes (e.g., Neuropixels) is more feasible than in the larval zebrafish. With the advent of whole-brain voltage imaging in the larval zebrafish, we expect any differences between calcium and voltage dynamics will be better understood, yet such techniques will likely continue to suffer to some extent from the nonlinearities described here.

      - Sources of variability: The authors do not take into account a fairly obvious source of variability in trial-to-trial response - eye position. We know that prey capture responsiveness is dependent on eye position during stimulus (see Figure 4 of PMID: 22203793). We also expect that neurons fairly early in the visual pathway with relatively narrow receptive fields will show variable responses to visual stimuli as the degree of overlap with the receptive field varies with eye movement. There can also be small eye-tracking movements ahead of the decision to engage in prey capture (Figure 1D, PMID: 31591961) that can serve as a drive to initiate movements in a particular direction. Given these possibilities indicating that the behavioral measure of interest is gaze, and the fact that eye movements were apparently monitored, it is surprising that the authors did not include eye movements in the analysis and interpretation of their data.

      We agree with the reviewer that eye movements, such as saccades and convergence, are important motor outputs that are well-known to play a role in the sequence of motor actions during prey capture and other behaviors. Therefore, we have added the following new eye tracking results to our revised manuscript:

      “In order to confirm that the observed neural variability in the visually-evoked populations was not predominantly due to eye movements, such as saccades or convergence, we tracked the angle of each eye. We utilized DeepLabCut, a deep learning tool for animal pose estimation (Mathis et al., 2018), to track keypoints on the eye which are visible in the raw fLFM images, including the retina and pigmentation (Figure S3D(i)). This approach enabled identification of various eye movements, such as convergence and the optokinetic reflex (Figure S3D(ii-iii)). Next, we extracted a number of eye states, including those based on position (more leftward vs. rightward angles) and speed (high angular velocity vs. low or no motion). Figure S3E(i) provides example stimulus response profiles across trials of the same visual stimulus in each of these eye states, similar to a single column of traces in Figure 3A broken out into more detail. These data demonstrate that the magnitude and temporal dynamics of the stimulus-evoked responses show apparently similar levels of variability across eye states. If neural variability were driven by eye movement during the stimulus presentation, for example, one would expect to see much more variability during the high angular velocity trials than low, which is not apparent. Next, we asked whether the dominant neural noise modes vary across eye states, which would suggest that the geometry of neuronal variability is influenced by eye movements or states. To do so, the dominant noise modes were estimated in each of the individual eye conditions, as well as in bootstrapped trials from across all eye conditions. The similarity of these noise modes estimated from different eye conditions (Figure S3E(ii), right) was not significantly different from the similarity of noise modes estimated from bootstrapped random samples across all eye conditions (Figure S3E(ii), left). Therefore, while movements of the eye likely contribute to aspects of the observed neural variability, they do not dominate the observed neural variability here, particularly given our observation that the largest noise mode represents a considerable fraction of the observed neural variance (Figure 3E).”

      While these results provide an important control in our study, we anticipate further study of the relationship between eye movements or states, visually-evoked neural activity, and neural noise modes would identify the additional neural ensembles which are correlated with and drive this additional motor output.

      Reviewer #3 (Public Review):

      Summary:

      In this study, Manley and Vaziri designed and built a Fourier light-field microscope (fLFM) inspired by previous implementations but improved and built exclusively from commercially available components so others can more easily reproduce the design. They combined this with the design of novel algorithms to efficiently extract whole-brain activity from larval zebrafish brains.

      This new microscope was applied to the question of the origin of behavioral variability. In an assay in which larval zebrafish are exposed to visual dots of various sizes, the fish respond by turning left or right or not responding at all. Neural activity was decomposed into an activity that encodes the stimulus reliably across trials, a 'noise' mode that varies across trials, and a mode that predicts tail movements. A series of analyses showed that trial-to-trial variability was largely orthogonal to activity patterns that encoded the stimulus and that these noise modes were related to the larvae's behavior.

      To identify the origins of behavioral variability, classifiers were fit to the neural data to predict whether the larvae turned left or right or did not respond. A set of neurons that were highly distributed across the brain could be used to classify and predict behavior. These neurons could also predict spontaneous behavior that was not induced by stimuli above chance levels. The work concludes with findings on the distributed nature of single-trial decision-making and behavioral variability.

      Strengths:

      The design of the new fLFM microscope is a significant advance in light-field and computational microscopy, and the open-source design and software are promising to bring this technology into the hands of many neuroscientists.

      The study addresses a series of important questions in systems neuroscience related to sensory coding, trial-to-trial variability in sensory responses, and trial-to-trial variability in behavior. The study combines microscopy, behavior, dynamics, and analysis and produces a well-integrated analysis of brain dynamics for visual processing and behavior. The analyses are generally thoughtful and of high quality. This study also produces many follow-up questions and opportunities, such as using the methods to look at individual brain regions more carefully, applying multiple stimuli, investigating finer tail movements and how these are encoded in the brain, and the connectivity that gives rise to the observed activity. Answering questions about variability in neural activity in the entire brain and its relationship to behavior is important to neuroscience and this study has done that to an interesting and rigorous degree.

      Points of improvement and weaknesses:

      The results on noise modes may be a bit less surprising than they are portrayed. The orthogonality between neural activity patterns encoding the sensory stimulus and the noise modes should be interpreted within the confounds of orthogonality in high-dimensional spaces. In higher dimensional spaces, it becomes more likely that two random vectors are almost orthogonal. Since the neural activity measurements performed in this study are quite high dimensional, a more explicit discussion is warranted about the small chance that the modes are not almost orthogonal.

      We agree with the reviewer that orthogonality is less “surprising” in high-dimensional spaces, and we have added this important point of interpretation to our revised manuscript. Still, it is important to remember that while the full neural state space is very high-dimensional (we record the activity of up to tens of thousands of neurons simultaneously), our analyses regarding the relationship between the trial-to-trial noise modes and decoding dimensions were performed in a low-dimensional subspace (up to 20 dimensions) identified by PLS regression that optimally preserved visual information. This is a key step in our analysis which serves two purposes: 1. it removes some of the confound described by the reviewer regarding the dimensionality of the neural state space analyzed; and 2. it ensures that the noise modes we analyze are even relevant to sensorimotor processing. It would certainly not be surprising or interesting if we identified a neural dimension outside the midbrain which was orthogonal to the optimal visual decoding dimension.

      Regardless, in order to better control for this confound, we estimated the distribution of angles between random vectors in this subspace. As we describe in the revised manuscript:

      “However, in high-dimensional spaces, it becomes increasingly common that two random vectors could appear orthogonal. While this is particularly a concern when analyzing a neural state space spanned by tens of thousands of neurons, our application of PLS regression to identify a low-dimensional subspace of relevant neuronal activity partially mitigates this concern. In order to control for this confound, we compared the angles between w<sub>opt</sub> and e<sub>1</sub> across larvae to those computed with shuffled versions (w<sub>opt,shuff</sub>), estimated by randomly shuffling the stimulus labels before identifying the optimal decoding direction. While it is possible to observe shuffled vectors which are nearly orthogonal to e<sub>1</sub>, the shuffled distribution spans a significantly greater range of angles than the observed data, demonstrating that this orthogonality is not simply a consequence of analyzing multi-dimensional activity patterns.”
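      The dimensionality effect itself is easy to reproduce numerically. The sketch below draws pairs of random Gaussian vectors and compares the angles between them in a 20-dimensional space (the upper bound of the PLS subspace) and in a 2,000-dimensional space (a stand-in for a large neural state space, chosen to keep memory use small). The function name and sample sizes are assumptions for illustration, and this is only the random-vector baseline, not the label-shuffling control described above.

```python
import numpy as np

def random_angles(dim, n_pairs, rng):
    """Angles (degrees, folded into [0, 90]) between pairs of random Gaussian vectors."""
    u = rng.standard_normal((n_pairs, dim))
    v = rng.standard_normal((n_pairs, dim))
    cos = np.abs(np.sum(u * v, axis=1))
    cos /= np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

rng = np.random.default_rng(0)
low = random_angles(dim=20, n_pairs=2000, rng=rng)
high = random_angles(dim=2000, n_pairs=2000, rng=rng)

for name, angles in (("20-D", low), ("2000-D", high)):
    print(f"{name}: median {np.median(angles):.1f} deg, spread (std) {angles.std():.1f} deg")
```

      In the higher-dimensional case the angles cluster tightly near 90 degrees, whereas in 20 dimensions they are noticeably more spread out, which is why a near-orthogonal result in the low-dimensional subspace is more informative than the same result in the full state space.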

      The conclusion that sparsely distributed sets of neurons produce behavioral variability needs more investigation because the way the results are shown could lead to some misinterpretations. The prediction of behavior from classifiers applied to neural activity is interesting, but the results are insufficiently presented for two reasons.

      (1) The neurons that contribute to the classifiers (Figures 4H and J) form a sufficient set of neurons that predict behavior, but this does not mean that neurons outside of that set cannot be used to predict behavior. Lasso regularization was used to create the classifiers and this induces sparsity. This means that if many neurons predict behavior but they do so similarly, the classifier may select only a few of them. This is not a problem in itself but it means that the distributions of neurons across the brain (Figures 4H and J) may appear sparser and more distributed than the full set of neurons that contribute to producing the behavior. This ought to be discussed better to avoid misinterpretation of the brain distribution results, and an alternative analysis that avoids the confound could help clarify.

      We thank the reviewer for raising this point, which we agree should be discussed in the manuscript. Lasso regularization was a key ingredient in our analysis; l2 regularization alone was not sufficient to prevent overfitting to the training trials, particularly when decoding turn direction and responsiveness. Previous studies have also found that sparse subsets of neurons better predict behavior than single neuron or non-sparse populations, for example Scholz et al. (2018).

      While showing l2 regularization would not be a fair comparison given the poor performance of the l2-regularized classifiers, we opted to identify a potentially “fuller” set of neurons correlated with these biases based on the correlation between each neuron’s activity over the recording and the projection along the turn direction or responsiveness dimension identified using l1 regularization. This procedure has the potential to identify all neurons correlated with the final ensemble dynamics, rather than just a “sufficient set” for lasso regression. In new Figures S5F-G, we show the 3D distribution of all neurons significantly correlated with these biases, which appear similar to those in Figures 4H-K and widely distributed across practically the entire labeled area of the brain.
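      A minimal sketch of this kind of correlation screen is given below, on synthetic data. The significance test is not specified in this response, so the circular-shift null, the number of shuffles, the `alpha` level, and the function name are all assumptions made for illustration.

```python
import numpy as np

def correlated_neurons(X, projection, n_shuffles=200, alpha=0.01, seed=0):
    """Flag neurons whose activity correlates with an ensemble projection.
    X: (N, T) neural activity; projection: (T,) activity along a decoding dimension.
    Returns (r, mask): Pearson r per neuron and a boolean significance mask,
    using a circular-shift null distribution (assumed test)."""
    rng = np.random.default_rng(seed)
    T = X.shape[1]
    Xz = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

    def corr(p):
        pz = (p - p.mean()) / p.std()
        return Xz @ pz / T                      # Pearson r for every neuron at once

    r = corr(projection)
    null = np.empty((n_shuffles, X.shape[0]))
    for k in range(n_shuffles):
        shift = rng.integers(1, T)              # break the temporal alignment
        null[k] = np.abs(corr(np.roll(projection, shift)))
    threshold = np.quantile(null, 1.0 - alpha, axis=0)
    return r, np.abs(r) > threshold

rng = np.random.default_rng(1)
N, T = 30, 500
projection = rng.standard_normal(T)
X = rng.standard_normal((N, T))
X[:5] += projection                             # five neurons that share the ensemble signal
r, mask = correlated_neurons(X, projection)
print("flagged neurons:", np.flatnonzero(mask))
```

      Unlike a lasso fit, which may keep only one of several redundant neurons, a per-neuron screen of this kind flags every neuron that tracks the ensemble dynamics, at the cost of not showing which neurons are jointly sufficient for prediction.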

      (2) The distribution of neurons is shown in an overly coarse manner in only a flattened brain seen from the top, and the brain is divided into four coarse regions (telencephalon, tectum, cerebellum, hindbrain). This makes it difficult to assess where the neurons are and whether those four coarse divisions are representative or whether the neurons are in other non-labeled deeper regions. For these two reasons, some of the statements about the distribution of neurons across the brain would benefit from a more thorough investigation.

      We agree with the reviewer that a more thorough description and visualization of these distributed populations is warranted.

      While the dense, pan-neuronal labeling makes the isolation of highly specific circuit components difficult, we have shown in more detail the specific brain regions contributing to these populations by aligning our recordings to the Z-Brain atlas (Randlett et al., 2015) as shown in new Figures S1E, S3F-G, 4I, 4K, and S5F-G. In addition, we provided a more detailed parcellation of the neuronal ensembles by providing projections of the full 3D volume along the xy and yz axes, in addition to the unregistered xy projection shown in new Figures 4H and 4J. We also found that the distribution of neurons in our huc:H2B-GCaMP6s recordings is very similar to the distribution of labeling in the huc:H2B-RFP reference image from the Z-Brain atlas (new Figure S1E), which further supports our whole-brain imaging results.

      Overall, we find that this more detailed quantification and visualization is consistent with the interpretations in the previous version of our manuscript. In particular, we show that the optimal visual decoding population (w<sub>opt</sub>) and the largest noise mode (e<sub>1</sub>) are localized to the midbrain (new Figures S3F-G), which is expected since in Figure 3 we first extracted a low-dimensional subspace of whole-brain neural activity that optimally preserved visual information. We also provide further evidence that the populations correlated with the turn bias and responsiveness bias are distributed throughout the brain, including a relatively dense localization to the cerebellum, telencephalon, and dorsal diencephalon (habenula; new Figures 4H-K and S5F-G).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      In addition to the overall strengths and weaknesses above, I have a few specific comments that I think could improve the study:

      (1) In lines 334-335 you write that 'We proceeded to build various logistic regression classifiers to decode'. Do you mean you tested this with other classifier types as well (e.g. SVM, Naive Bayes) or do you mean various because you trained the classifier described in the methods on each animal? This is not clear. If it is the first, more information is needed about what other classifiers you used.

      We appreciate the reviewer raising this point of clarification. Here, we simply meant that we fit the multiclass logistic regression classifier in the one-vs-rest scheme. In this sense, a single multiclass logistic regression classifier was fit for each larva. We have updated our revised manuscript with this clarification: “The visual stimuli were decoded using a one-versus-rest, multiclass logistic regression classifier with lasso regularization.”
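
      To make the one-versus-rest scheme concrete, the following is a minimal, dependency-free sketch and not the code used in the study. It assumes a proximal gradient (ISTA) solver for the lasso penalty; the toy features, labels, function names, and hyperparameters (`lam`, `lr`, `n_iter`) are all hypothetical.

```python
import math

def soft_threshold(value, threshold):
    """Proximal operator of the L1 (lasso) penalty: shrink a weight toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_binary_lasso_logistic(X, y, lam=0.01, lr=0.1, n_iter=500):
    """Fit one binary logistic regression with an L1 penalty on the weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # Gradient step on the logistic loss, then soft-thresholding for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b  # the intercept is not penalized
    return w, b

def fit_one_vs_rest(X, labels, **kwargs):
    """Train one binary classifier per class (that class versus all others)."""
    return {
        c: fit_binary_lasso_logistic(X, [1.0 if l == c else 0.0 for l in labels], **kwargs)
        for c in sorted(set(labels))
    }

def predict(models, x):
    """Return the class whose binary classifier gives the highest score."""
    def score(c):
        w, b = models[c]
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(models, key=score)

# Hypothetical toy data: two features per trial, three stimulus classes.
X = [[3.0, 0.0], [2.5, 0.3], [3.2, -0.2],
     [-1.5, 2.6], [-1.2, 2.4], [-1.7, 2.9],
     [-1.5, -2.6], [-1.3, -2.3], [-1.8, -2.8]]
LABELS = [0, 0, 0, 1, 1, 1, 2, 2, 2]

models = fit_one_vs_rest(X, LABELS)
predictions = [predict(models, x) for x in X]
```

      In this sketch a single call to `fit_one_vs_rest` corresponds to one multiclass classifier, which is the sense in which one classifier was fit per larva.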

      (2) In Figure 3 you train the decoder on all visually responsive cells identified across the brain. Does this reliability of stimulus decoding also hold for neurons sampled from specific brain regions? For example, does this reliable decoding come from stronger and more reliable responses in the optic tectum, whereas stimulus decodability is not as good in visual encoding neurons identified in other structures?

      In new Figure S5B, we show the performance of stimulus decoding from various brain regions. We find that stimulus classification is possible from the midbrain and cerebellum, very poor from the hindbrain, and not possible from the telencephalon during the period between stimulus onset and the decision.

      (3) In relation to point 2, it would be good to show in which brain areas the visually responsive neurons are located, and maybe the average coefficients per brain area. Plots like Figures 3G, and H would benefit from a quantification into areas. Similarly, a parcellation into more specific brain areas in Figure 4 would also be valuable.

      In addition to providing a more detailed parcellation of the turn direction and responsiveness bias populations in Figure 4, we have provided a similar visualization and quantification of the optimal stimulus decoding population and the dominant noise mode in new Figures S3F-G, respectively.

      (4) In Figure 3f, it is not clear to me how this shows that w<sub>opt</sub> and e1 are orthogonal. They appear correlated.

      The orthogonality we quantify is related to the pattern of coefficients across neurons, not necessarily the timeseries of their projections. The slight shift in the noise mode activations as you move from stimuli on the left visual field to the right actually comes from the motor outputs. Large left stimuli tend to evoke a rightward turn and vice versa, and the example noise mode shown encodes the directionality and vigor of tail movements, resulting in the slight shifts observed.
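
      The distinction between orthogonal coefficient patterns and correlated projections can be illustrated with a small toy example. The two-neuron activity values and the vectors `w` and `e` below are hypothetical and are not taken from our recordings; the sketch only shows that two coefficient vectors with zero dot product can still yield correlated projection timeseries when the underlying activity is itself correlated.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pearson(p, q):
    """Pearson correlation between two equally long sequences."""
    mp, mq = sum(p) / len(p), sum(q) / len(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    var_p = sum((a - mp) ** 2 for a in p)
    var_q = sum((b - mq) ** 2 for b in q)
    return cov / math.sqrt(var_p * var_q)

# Hypothetical coefficient patterns across two neurons.
w = [1.0, 1.0]   # stands in for a decoding direction
e = [1.0, -1.0]  # stands in for a noise mode

# Hypothetical activity (one row per time point): both neurons covary,
# but the first neuron has the larger variance.
activity = [[0.0, 0.1], [2.0, 0.9], [4.0, 2.2], [6.0, 2.8]]

proj_w = [dot(w, x) for x in activity]
proj_e = [dot(e, x) for x in activity]

orthogonality = dot(w, e)                     # zero: the patterns are orthogonal
projection_corr = pearson(proj_w, proj_e)     # close to 1: projections are correlated
```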

      (5) I think the wording of this conclusion is too strong for the results and a bit illogical:

      'Thus, our data suggest that the neural dynamics underlying single-trial action selection are the result of a widely-distributed circuit that contains subpopulations encoding internal time-varying biases related to both the larva's responsiveness and turn direction, yet distinct from the sensory encoding circuitry.'

      If that is the case, how is it even possible that the larvae can do a visually guided behaviour?

      Especially given Suppl Fig 4C it would be more appropriate to say something along the lines of: 'When stimuli are highly ambiguous, single trial action selection is dominated by widely-distributed circuit that contains subpopulations encoding internal time-varying biases related to both the larva's responsiveness and turn direction, that encode choice distinctly from the sensory encoding circuitry'.

      We appreciate the reviewer’s suggestion and have re-worded this line in the discussion in order to clarify that these time-varying biases are predominant in the case of ambiguous stimuli, as shown in Figure S5C in our revised manuscript (corresponding to Figure S4C in our original submission).

      (6) Line 599: typo: trial-to-trail

      We thank the reviewer for noting this error, which has been corrected in the revised text of the manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      As you will see, the main changes in the revised manuscript pertain to the structure and content of the introduction. Specifically, we have tried to more clearly introduce our paradigm, the rationale behind the paradigm, why it is different from learning paradigms, and why we study “relief”.

      In this rebuttal letter, we will go over the reviewers’ comments one-by-one and highlight how we have adapted our manuscript accordingly. However, because one concern was raised by all reviewers, we will start with an in-depth discussion of this concern.

      The shared concern pertained to the validity of the EVA task as a model to study threat omission responses. Specifically, all reviewers questioned the effectiveness of our so-called “inaccurate”, “false” or “ruse” instructions in triggering an equivalent level of shock expectancy, and relatedly, how this effectiveness was affected by dynamic learning over the course of the task.

      We want to thank the reviewers for raising this important issue. Indeed, it is a vital part of our design and it therefore deserves considerable attention. It is now clear to us that in the previous version of the manuscript we may have focused too little on why we moved away from a learning paradigm, how we ensured that the instructions successfully raised the necessary expectations, and how the instructions were affected by learning. We believe this has resulted in some misunderstandings, which consequently may have cast doubts on our results. In the following sections, we will go into these issues.

      The rationale behind our instructed design

      The main aim of our study was to investigate brain responses to unexpected omissions of threat in greater detail by examining their similarity to the reward prediction error axioms (Caplin & Dean, 2008), and exploring the link with subjective relief. Specifically, we hypothesized that omission-related responses should be dependent on the probability and the intensity of the expected-but-omitted aversive event (i.e., electrical stimulation), meaning that the response should be larger when the expected stimulation was stronger and more expected, and that fully predicted outcomes should not trigger a difference in responding.

      To this end, we required that participants had varying levels of threat probability and intensity predictions, and that these predictions would most of the time be violated. Although we fully agree with the reviewers that fear conditioning and extinction paradigms can provide an excellent way to track the teaching properties of prediction error responses (i.e., how they are used to update expectancies on future trials), we argued that they are less suited to create the varying probability and intensity-related conditions we required (see Willems & Vervliet, 2021). Specifically, in a standard conditioning task participants generally learn fast, rendering relatively few trials on which the prediction is violated. As a result, there is generally little intraindividual variability in the prediction error responses. This precludes an in-depth analysis of the probability-related effects. Furthermore, conditioning paradigms generally only include one level of aversive outcome: the electrical stimulation is either delivered or omitted. As a result, intensity-related effects cannot be tested. Finally, because CS-US contingencies change over the course of a fear conditioning and extinction study (e.g. from acquisition to extinction), there is never complete certainty about when the US will (not) follow. This precludes a direct comparison of fully predicted outcomes.

      Another added value of studying responses to the prediction error at threat omission outside a learning context is that it can offer a way to disentangle responses to the violation of threat expectancy from those to subsequent expectancy updating.

      Also note that Rutledge and colleagues (2010), who were the first to show that human fMRI responses in the Nucleus Accumbens comply with the reward prediction error axioms, also did not use learning experiences to induce expectancy. In that sense, we argued it was not necessary to adopt a learning paradigm to study threat omission responses.

      Adaptations in the revised manuscript: We included two new paragraphs in the introduction of the revised manuscript to elaborate on why we opted not to use a learning paradigm in the present study (lines 90-112).

      “However, is a correlation with the theoretical PE over time sufficient for neural activations/relief to be classified as a PE-signal? In the context of reward, Caplin and colleagues proposed three necessary and sufficient criteria all PE-signals should comply to, independent of the exact operationalizations of expectancy and reward (the so-called axiomatic approach24,25; which has also been applied to aversive PE26–28). Specifically, the magnitude of a PE signal should: (1) be positively related to the magnitude of the reward (larger rewards trigger larger PEs); (2) be negatively related to likelihood of the reward (more probable rewards trigger smaller PEs); and (3) not differentiate between fully predicted outcomes of different magnitudes (if there is no error in prediction, there should be no difference in the PE signal).”

      “It is evident that fear conditioning and extinction paradigms have been invaluable for studying the role of the threat omission PE within a learning context. However, these paradigms are not tailored to create the varying intensity and probability-related conditions that are required to evaluate the threat omission PE in the light of the PE axioms. First, conditioning paradigms generally only include one level of aversive outcome: the electrical stimulation is either delivered or omitted. As a result, the magnitude-related axiom cannot be tested. Second, in conditioning tasks people generally learn fast, rendering relatively few trials on which the prediction is violated. As a result, there is generally little intra-individual variability in the PE responses. Moreover, because of the relatively low signal to noise ratio in fMRI measures, fear extinction studies often pool across trials to compare omission-related activity between early and late extinction16, which further reduces the necessary variability to properly evaluate the probability axiom. Third, because CS-US contingencies change over the course of the task (e.g. from acquisition to extinction), there is never complete certainty about whether the US will (not) follow. This precludes a direct comparison of fully predicted outcomes. Finally, within a learning context, it remains unclear whether PE-related responses are in fact responses to the violation of expectancy itself, or whether they are the result of subsequent expectancy updating.”
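
      As an illustration of how the three axioms quoted above translate into testable patterns for threat omission, the sketch below checks them on condition means. It is a simplification under stated assumptions: the function names and the toy numbers are hypothetical, the checks are simple orderings rather than the statistical tests used in the manuscript, and the axioms are expressed in terms of the omitted stimulation (a stronger or more probable expected stimulation makes its omission a larger PE).

```python
def is_increasing(values):
    """True if each value is strictly larger than the previous one."""
    return all(a < b for a, b in zip(values, values[1:]))

def check_omission_pe_axioms(by_intensity, by_probability, fully_predicted, tolerance=0.0):
    """Check the three PE axioms on mean responses.

    by_intensity:    mean omission responses ordered weak -> strong
    by_probability:  mean omission responses ordered by increasing
                     instructed stimulation probability
    fully_predicted: mean responses to fully predicted outcomes
    tolerance:       largest difference still treated as "no difference"
    """
    return {
        "axiom1_intensity": is_increasing(by_intensity),
        "axiom2_probability": is_increasing(by_probability),
        "axiom3_no_surprise": max(fully_predicted) - min(fully_predicted) <= tolerance,
    }

# Hypothetical condition means in arbitrary units.
result = check_omission_pe_axioms(
    by_intensity=[0.2, 0.5, 0.9],      # weak, moderate, strong
    by_probability=[0.3, 0.4, 0.8],    # 25%, 50%, 75%
    fully_predicted=[0.10, 0.12],      # two fully predicted outcomes
    tolerance=0.05,
)
```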

      Can verbal instructions be used to raise the expectancy of shock?

      The most straightforward way to obtain sufficient variability in both probability and intensity-related predictions is by directly providing participants with instructions on the probability and intensity of the electrical stimulation. In a previous behavioral study, we have shown that omission responses (self-reported relief and omission SCR) indeed varied with these instructions (Willems & Vervliet, 2021). In addition, the manipulation checks that are reported in the supplemental material provided further support that the verbal instructions were effective at raising the associated expectancy of stimulation. Specifically, participants recollected having received more stimulations after higher probability instructions (see Supplemental Figure 2). Furthermore, we found that anticipatory SCR, which we used as a proxy of fearful expectation, increased with increasing probability and intensity (see Supplemental Figure 3). This suggests that it is not necessary to have expectation based on previous experience if we want to evaluate threat omission responses in the light of the prediction error axioms.

      Adaptations in the revised manuscript: We more clearly referred to the manipulation checks that are presented in the supplementary material in the results section of the main paper (lines 135-141).

      “The verbal instructions were effective at raising the expectation of receiving the electrical stimulation in line with the provided probability and intensity levels. Anticipatory SCR, which we used as a proxy of fearful expectation, increased as a function of the probability and intensity instructions (see Supplementary Figure 3). Accordingly, post-experimental questions revealed that by the end of the experiment participants recollected having received more stimulations after higher probability instructions, and were willing to exert more effort to prevent stronger hypothetical stimulations (see Supplementary Figure 2).”

      How did the inconsistency between the instructed and experienced probability impact our results?

      All reviewers questioned how the inconsistency between the instructed and experienced probability might have impacted the probability-related results. However, judging from the way the comments were framed, it seems that part of the concern was based on a misunderstanding of the design we employed. Specifically, reviewer 1 mentions that “To ensure that the number of omissions is similar across conditions, the task employs inaccurate verbal instructions; i.e., 25% of shocks are omitted regardless of whether subjects are told that the probability is 100%, 75%, 50%, 25%, 0%.”, and reviewer 3 states that “... the fact remains that they do not get shocks outside of the 100% probability shock. So learning is occurring, at least for subjects who realize the probability cue is actually a ruse.” We want to emphasize that this was not what we did, and if it were true, we fully agree with the reviewers that it would have caused serious trust- and learning-related issues, given that it would be immediately evident to participants that probability instructions were false. It is clear that under such circumstances, dynamic learning would be a big issue.

      However, in our task 0% and 100% instructions were always accurate. This means that participants never received a stimulus following 0% instructions and always received the stimulation of the given intensity on the 100% instructions (see Supplemental Figure 1 for an overview of the trial types). Only for the 25%, 50% and 75% trials an equal reinforcement rate (25%) was maintained, meaning that the stimulation followed in 25% of the trials, irrespective of whether a 25%, 50% or 75% instruction was given. The reason for this was that we wanted to maximize and balance the number of omission trials across the different probability levels, while also keeping the total number of presentations per probability instruction constant. We reasoned that equating the reinforcement rate across the 25%, 50% and 75% instructions should not be detrimental, because (1) in these trials there was always the possibility that a stimulation would follow; and (2) we instructed the participants that each trial is independent of the previous ones, which should have discouraged them from actively counting the number of shocks in order to predict future shocks.

      Adaptations in the revised manuscript: We have tried to further clarify the design in several sections of the manuscript, including the introduction (lines 121-125), results (line 220) and methods (lines 478-484) sections:

      Adaptation in the Introduction section: “Specifically, participants received trial-by-trial instructions about the probability (0%, 25%, 50%, 75% and 100%) and intensity (weak, moderate, strong) of a potentially painful upcoming electrical stimulation, time-locked by a countdown clock (see Fig.1A). While stimulations were always delivered on 100% trials and never on 0% trials, most of the other trials (25%-75%) did not contain the expected stimulation and hence provoked an omission PE.”

      Adaptation in the Results section: “Indeed, the provided instructions did not map exactly onto the actually experienced probabilities, but were all followed by stimulation in 25% of the trials (except for the 0% trials and the 100% trials).”

      Adaptation in the Methods section: “Since we were mainly interested in how omissions of threat are processed, we wanted to maximize and balance the number of omission trials across the different probability and intensity levels, while also keeping the total number of presentations per probability and intensity instruction constant. Therefore, we crossed all non-0% probability levels (25, 50, 75, 100) with all intensity levels (weak, moderate, strong) (12 trials). The three 100% trials were always followed by the stimulation of the instructed intensity, while stimulations were omitted in the remaining nine trials. Six additional trials were intermixed in each run: Three 0% omission trials with the information that no electrical stimulation would follow (akin to 0% Probability information, but without any Intensity information as it does not apply); and three trials from the Probability x Intensity matrix that were followed by electrical stimulation (across the four runs, each Probability x Intensity combination was paired at least once, and at most twice with the electrical stimulation).”
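
      The trial structure of a single run, as described in the Methods passage above, can be sketched as follows. This is an illustrative reconstruction rather than the task code: the function names are hypothetical, the choice of which three cells are reinforced in a given run is an arbitrary example, and we assume here that the reinforced trials are drawn from the 25%, 50% and 75% cells only.

```python
from itertools import product

PROBABILITIES = [25, 50, 75, 100]
INTENSITIES = ["weak", "moderate", "strong"]

def build_run(reinforced_cells):
    """Build the trial list of one run.

    reinforced_cells: three (probability, intensity) cells from the
    25/50/75% rows that are followed by stimulation in this run
    (hypothetical example choice; the real assignment varied across runs).
    """
    trials = []
    # 12 trials: every non-0% probability crossed with every intensity.
    # Only the three 100% trials are followed by stimulation.
    for probability, intensity in product(PROBABILITIES, INTENSITIES):
        trials.append({"probability": probability, "intensity": intensity,
                       "stimulated": probability == 100})
    # Three 0% trials without intensity information, never stimulated.
    for _ in range(3):
        trials.append({"probability": 0, "intensity": None, "stimulated": False})
    # Three additional trials from the matrix that are followed by stimulation.
    for probability, intensity in reinforced_cells:
        trials.append({"probability": probability, "intensity": intensity,
                       "stimulated": True})
    return trials

def reinforcement_rate(trials, probabilities=(25, 50, 75)):
    """Experienced stimulation rate on the uncertain (25-75%) trials."""
    uncertain = [t for t in trials if t["probability"] in probabilities]
    return sum(t["stimulated"] for t in uncertain) / len(uncertain)

run = build_run([(25, "weak"), (50, "moderate"), (75, "strong")])
print(len(run))                 # 18 trials per run
print(reinforcement_rate(run))  # 0.25
```

      With nine omission trials and three reinforced trials among the 25% to 75% instructions, the experienced reinforcement rate in each run is 3/12 = 25%, while the 0% and 100% instructions remain fully accurate.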

      Could the incongruence between the instructed and experienced reinforcement rate have detrimental effects on the probability effect? We agree with reviewer 2 that it is possible that the inconsistency between instructed and experienced reinforcement rates could have rendered the exact probability information less informative to participants, which might have resulted in them paying less attention to the probability information whenever the probability was not 0% or 100%. This might to some extent explain the relatively larger difference in responding between 0% and 25% to 75% trials, but the relatively smaller differences between the 25% to 75% trials.

      However, there are good reasons to believe that the relatively smaller difference between 25% to 75% trials was not caused by the “inaccurate” nature of our instructions, but is inherent to “uncertain” probabilities.

      We added a description of these reasons to the supplementary materials in a supplementary note (supplementary note 4; lines 97-129 in supplementary materials), and added a reference to this note in the methods section (lines 488-490).

      “Supplementary Note 4: “Accurate” probability instructions do not alter the Probability-effect

      A question that was raised by the reviewers was whether the inconsistency between the probability instruction and the experienced reinforcement rate could have detrimental effects on the Probability-related results; especially because the effect of Probability was smaller when only including non-0% trials.

      However, there are good reasons to believe that the relatively smaller difference between 25% to 75% trials was not caused by the “inaccurate” nature of our instructions, but is inherent to “uncertain” probabilities.

      First, in a previously unpublished pilot study, we provided participants with “accurate” probability instructions, meaning that the instruction corresponded to the actual reinforcement rate (e.g., 75% instructions were followed by a stimulation in 75% of the trials etc.). In line with the present results and our previous behavioral study (Willems & Vervliet, 2021), the results of this pilot (N = 20) showed that the difference in the reported relief between the different probability levels was largest when comparing 0% and the rest (25%, 50% and 75%). Furthermore, the overall effect size of Probability (excluding 0%) matched the one of our previous behavioral study (Willems & Vervliet, 2021): ηp2 ≈ 0.50.”

      Author response image 1.

      Main effect of Probability including 0% : F(1.74,31.23) = 53.94, p < .001, ηp2 = 0.75. Main effect of Probability excluding 0%: F(1.50, 28.43) = 21.03, p < .001, ηp2 = 0.53.
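
      As a consistency check, the reported effect sizes can be recovered from the F statistics and their degrees of freedom using the standard identity ηp2 = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies it to the two values above; the function name is ours.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared computed from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Main effect of Probability including 0%.
print(round(partial_eta_squared(53.94, 1.74, 31.23), 2))  # 0.75
# Main effect of Probability excluding 0%.
print(round(partial_eta_squared(21.03, 1.50, 28.43), 2))  # 0.53
```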

      Second, other published studies that used CSs with varying reinforcement rates (with or without explicit written instructions of the reinforcement rates) also showed that the difference in expectations, anticipatory SCR or omission SCR was largest when comparing the CS0% to the other CSs of varying reinforcement rates (Grings & Sukoneck, 1971; Öhman et al., 1973; Ojala et al., 2022).

      Together, this suggests that when there is a possibility of stimulation, any additional difference in probability will have a smaller effect on the omission responses, irrespective of whether the underlying reinforcement rate is accurate or not.

      Adaptation to methods section: “Note that, based on previous research, we did not expect the inconsistency between the instructed and perceived reinforcement rate to have a negative effect on the Probability manipulation (see Supplementary Note 4).”

      Did dynamic learning impact the believability of the instructions?

      Although we tried to minimize learning in our paradigm by providing instructions that trials are independent from one another, we agree with the reviewers that this cannot preclude all learning. Any remaining learning effects should present themselves by downweighting the effect of the probability instructions over time. We controlled for this time-effect by including a “run” regressor in our analyses. Results of the Run regressor for subjective relief and omission-related SCR are presented in Supplemental Figure 5. These figures show that although there was a general drop in reported relief pleasantness and omission SCR over time, the effects of probability and intensity remained present until the last run. This indicates that even though some learning might have taken place, the main manipulations of probability and intensity were still present until the end of the task.

      Adaptations in the revised manuscript: We more clearly referred to the results of the Block-regressor, which were presented in the supplementary material, in the results section of the main paper (lines 159-162).

      Note that while there was a general drop in reported relief pleasantness and omission SCR over time, the effects of Probability and Intensity remained present until the last run (see Supplementary Figure 5). This further confirms that probability and intensity manipulations were effective until the end of the task.

      In the following sections of the rebuttal letter, we will go over the rest of the comments and our responses one by one.

      Reviewer #1 (Public Review):

      Summary:

      Willems and colleagues test whether unexpected shock omissions are associated with reward-related prediction errors by using an axiomatic approach to investigate brain activation in response to unexpected shock omission. Using an elegant design that parametrically varies shock expectancy through verbal instructions, they see a variety of responses in reward-related networks, only some of which adhere to the axioms necessary for prediction error. In addition, there were associations between omission-related responses and subjective relief. They also use machine learning to predict relief-related pleasantness, and find that none of the a priori "reward" regions were predictive of relief, which is an interesting finding that can be validated and pursued in future work.

      Strengths:

      The authors pre-registered their approach and the analyses are sound. In particular, the axiomatic approach tests whether a given region can truly be called a reward prediction error. Although several a priori regions of interest satisfied a subset of axioms, no ROI satisfied all three axioms, and the authors were candid about this. A second strength was their use of machine learning to identify a relief-related classifier. Interestingly, none of the ROIs that have been traditionally implicated in reward prediction error reliably predicted relief, which opens important questions for future research.

      Weaknesses:

      To ensure that the number of omissions is similar across conditions, the task employs inaccurate verbal instructions; i.e. 25% of shocks are omitted, regardless of whether subjects are told that the probability is 100%, 75%, 50%, 25%, or 0%. Given previous findings on interactions between verbal instruction and experiential learning (Doll et al., 2009; Li et al., 2011; Atlas et al., 2016), it seems problematic a) to treat the instructions as veridical and b) average responses over time. Based on this prior work, it seems reasonable to assume that participants would learn to downweight the instructions over time through learning (particularly in the 100% and 0% cases); this would be the purpose of prediction errors as a teaching signal. The authors do recognize this and perform a subset analysis in the 21 participants who showed parametric increases in anticipatory SCR as a function of instructed shock probability, which strengthened findings in the VTA/SN; however given that one-third of participants (n=10) did not show parametric SCR in response to instructions, it seems like some learning did occur. As prediction error is so important to such learning, a weakness of the paper is that conclusions about prediction error might differ if dynamic learning were taken into account.

      We thank the reviewer for raising this important concern. We believe we replied to all the issues raised in the general reply above.

      Lastly, I think that findings in threat-sensitive regions such as the anterior insula and amygdala may not be adequately captured in the title or abstract which strictly refers to the "human reward system"; more nuance would also be warranted.

      We fully agree with this comment and have changed the title and abstract accordingly.

      Adaptations in the revised manuscript: We adapted the title of the manuscript.

      “Omissions of Threat Trigger Subjective Relief and Prediction Error-Like Signaling in the Human Reward and Salience Systems”

      Adaptations in the revised manuscript: We adapted the abstract (lines 27-29).

      “In line with recent animal data, we showed that the unexpected omission of (painful) electrical stimulation triggers activations within key regions of the reward and salience pathways and that these activations correlate with the pleasantness of the reported relief.”

      Reviewer #2 (Public Review):

      The question of whether the neural mechanisms for reward and punishment learning are similar has been a constant debate over the last two decades. Numerous studies have shown that the midbrain dopamine neurons respond to both negative and salient stimuli, some of which can't be well accounted for by the classic RL theory (Delgado et al., 2007). Other research even proposed that aversive learning can be viewed as reward learning, by treating the omission of aversive stimuli as a negative PE (Seymour et al., 2004).

      Although the current study took an axiomatic approach to search for the PE encoding brain regions, which I like, I have major concerns regarding their experimental design and hence the results they obtained. My biggest concern comes from the false description of their task to the participants. To increase the number of "valid" trials for data analysis, the instructed and actual probabilities were different. Under such a circumstance, testing axiom 2 seems completely artificial. How does the experimenter know that the participants truly believe that the 75% is more probable than, say, the 25% stimulation? The potential confusion of the subjects may explain why the SCR and relief report were rather flat across the instructed probability range, and some of the canonical PE encoding regions showed a rather mixed activity pattern across different probabilities. Also for the post-hoc selection criteria, why pick the larger SCR in the 75% compared to the 25% instructions? How would the results change if other criteria were used?

      We thank the reviewer for raising this important concern. We believe the general reply above covers most of the issues raised in this comment. Concerning the post-hoc selection criteria, we took 25% < 75% as the criterion because this was a quite “lenient” criterion in the sense that it looked only at the effects of interest (i.e., did anticipatory SCR increase with increasing instructed probability?). However, even when the criterion was stricter (e.g., selecting participants only if their anticipatory SCR monotonically increased with each increase in instructed probability 0% < 25% < 50% < 75% < 100%, N = 11 participants), the probability effect (ωp2 = 0.08), but not the intensity effect, for the VTA/SN remained.
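
      The difference between the two selection criteria can be made explicit with a short sketch. The participant identifiers and SCR values below are hypothetical and serve only to show how each rule includes or excludes a participant; this is not the analysis code.

```python
LEVELS = [0, 25, 50, 75, 100]

def meets_lenient_criterion(scr):
    """Lenient rule: anticipatory SCR is larger for 75% than for 25% instructions."""
    return scr[25] < scr[75]

def meets_strict_criterion(scr):
    """Strict rule: anticipatory SCR increases with every step in instructed probability."""
    values = [scr[level] for level in LEVELS]
    return all(a < b for a, b in zip(values, values[1:]))

# Hypothetical mean anticipatory SCR per instructed probability (arbitrary units).
participants = {
    "p01": {0: 0.10, 25: 0.20, 50: 0.30, 75: 0.40, 100: 0.50},
    "p02": {0: 0.10, 25: 0.20, 50: 0.15, 75: 0.35, 100: 0.30},
    "p03": {0: 0.20, 25: 0.40, 50: 0.30, 75: 0.30, 100: 0.50},
}

lenient_sample = [pid for pid, scr in participants.items() if meets_lenient_criterion(scr)]
strict_sample = [pid for pid, scr in participants.items() if meets_strict_criterion(scr)]
```

      By construction, every participant who satisfies the strict rule also satisfies the lenient one, so the strict sample is a subset of the lenient sample.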

      To test axiom 3, which was to compare the 100% stimulation to the 0% stimulation conditions, how did the actual shock delivery affect the fMRI contrast result? It would be more reasonable if this analysis could control for the shock delivery, which itself could contaminate the fMRI signal, with extra confound that subjects may engage certain behavioral strategies to "prepare for" the aversive outcome in the 100% stimulation condition. Therefore, I agree with the authors that this contrast may not be a good way to test axiom 3, not only because of the arguments made in the discussion but also the technical complexities involved in the contrast.

      We thank the reviewer for addressing this additional confound. It was indeed impossible to control for the delivery of shock since the delivery of the shock was always present on the 100% trials (and thus completely overlapped with the contrast of interest). We added this limitation to our discussion in the manuscript. In addition, we have also added a suggestion for a contrast that can test the “no surprise equivalence” criterion.

      Adaptations in the revised manuscript: We adapted lines 358-364.

      “Thus, given that we could not control for the delivery of the stimulation in the 100% > 0% contrast (the delivery of the stimulation completely overlapped with the contrast of interest), it is impossible to disentangle responses to the salience of the stimulation from those to the predictability of the outcome. A fairer evaluation of the third axiom would require outcomes that are roughly similar in terms of salience. When evaluating threat omission PE, this implies comparing fully expected threat omissions following 0% instructions to fully expected absence of stimulation at another point in the task (e.g. during a safe intertrial interval).”

      Reviewer #3 (Public Review):

      We thank the reviewer for their comments. Overall, based on the reviewer’s comments, we noticed that there was an imbalance between a focus on “relief” in the introduction and the rest of the manuscript and preregistration. We believe this focus raised the expectation that all outcome measures were interpreted in terms of the relief emotion. However, this was not what we did nor what we preregistered. We therefore restructured the introduction to reduce the focus on relief.

      Adaptations in the revised manuscript: We restructured the introduction of the manuscript. Specifically, after our opening sentence: “We experience a pleasurable relief when an expected threat stays away1” we only introduce the role of relief for our research in lines 79-89.

      “Interestingly, unexpected omissions of threat not only trigger neural activations that resemble a reward PE, they are also accompanied by a pleasurable emotional experience: relief. Because these feelings of relief coincide with the PE at threat omission, relief has been proposed to be an emotional correlate of the threat omission PE. Indeed, emerging evidence has shown that subjective experiences of relief follow the same time-course as theoretical PE during fear extinction. Participants in fear extinction experiments report high levels of relief pleasantness during early US omissions (when the omission was unexpected and the theoretical PE was high) and decreasing relief pleasantness over later omissions (when the omission was expected and the theoretical PE was low)22,23. Accordingly, preliminary fMRI evidence has shown that the pleasantness of this relief is correlated to activations in the NAC at the time of threat omission. In that sense, studying relief may offer important insights in the mechanism driving safety learning.”

      Summary:

      The authors conducted a human fMRI study investigating the omission of expected electrical shocks with varying probabilities. Participants were informed of the probability of shock and shock intensity trial-by-trial. The time point corresponding to the absence of the expected shock (with varying probability) was framed as a prediction error producing the cognitive state of relief/pleasure for the participant. fMRI activity in the VTA/SN and ventral putamen corresponded to the surprising omission of a high probability shock. Participants' subjective relief at having not been shocked correlated with activity in brain regions typically associated with reward-prediction errors. The overall conclusion of the manuscript was that the absence of an expected aversive outcome in human fMRI looks like a reward-prediction error seen in other studies that use positive outcomes.

      Strengths:

      Overall, I found this to be a well-written human neuroimaging study investigating an often overlooked question on the role of aversive prediction errors, and how they may differ from reward-related prediction errors. The paper is well-written and the fMRI methods seem mostly rigorous and solid.

      Weaknesses:

      I did have some confusion over the use of the term "prediction-error" however as it is being used in this task. There is certainly an expectancy violation when participants are told there is a high probability of shock, and it doesn't occur. Yet, there is no relevant learning or updating, and participants are explicitly told that each trial is independent and the outcome (or lack thereof) does not affect the chances of getting the shock on another trial with the same instructed outcome probability. Prediction errors are primarily used in the context of a learning model (reinforcement learning, etc.), but without a need to learn, the utility of that signal is unclear.

      We operationalized “prediction error” as the response to the error in prediction or the violation of expectancy at the time of threat omission. In that sense, prediction error and expectancy violation (which is more commonly used in clinical research and psychotherapy; Craske et al., 2014) are synonymous. While prediction errors (or expectancy violations) are predominantly studied in learning situations, the definition in itself does not specify how the “expectancy” or “prediction” arises: whether it was through learning based on previous experience or through mere instruction. The rationale why we moved away from a conditioning study in the present manuscript is discussed in our general reply above.

      We agree with the reviewer that studying prediction errors outside a learning context limits the ecological validity of the task. However, we do believe there is also a strength to this approach. Specifically, the omission-related responses we measure are less confounded by subsequent learning (or updating of the wrongful expectation). Any difference between our results and prediction error responses in learning situations can therefore be traced to this difference in paradigm, and can thus help identify responses that are specific to learning situations.

      An overarching question posed by the researchers is whether relief from not receiving a shock is a reward. They take as neural evidence activity in regions usually associated with reward prediction errors, like the VTA/SN. This seems to be a strong case of reverse inference. The evidence may have been stronger had the authors compared activity to a reward prediction error, for example using a similar task but with reward outcomes. As it stands, the neural evidence that the absence of shock is actually "pleasurable" is limited, albeit there is a subjective report asking subjects if they felt relief.

      We thank the reviewer for cautioning us and prompting us to reflect critically on our interpretation. We agree that it is important not to be overly enthusiastic when interpreting fMRI results and not to carelessly attribute psychological functions to mere activations. Therefore, we elaborate below on the precautions we took to minimize detrimental reverse inference.

      First, prior to analyzing our results, we preregistered clear hypotheses that were based on previous research, in addition to clear predictions, regions of interest and a testing approach on OSF. With our study, we wanted to investigate whether unexpected omissions of threat: (1) triggered activations in the VTA/SN, putamen, NAc and vmPFC (as has previously been shown in animal and human studies); (2) represented PE signals; and (3) were related to self-reported relief, which has also been shown to follow a PE time-curve in fear extinction (Vervliet et al., 2017). Based on previous research, we selected three criteria all PE signals should comply with. This means that if omission-related activations were to represent true PE signals, they should comply with these criteria. However, we agree that it would go too far to conclude based on our research that relief is a reward, or even that the omission-related activations represent only PE signals. While we found support for most of our hypotheses, this does not preclude alternative explanations. In fact, in the discussion, we acknowledge this and also discuss alternative explanations, such as responding to the salience (lines 395-397; “One potential explanation is therefore that the deactivation resulted from a switch from default mode to salience network, triggered by the salience of the unexpected threat omission or by the salience of the experienced stimulation.”), or anticipation (lines 425-426; “... we cannot conclusively dismiss the alternative interpretation that we assessed (part of) expectancy instead”).

      Second, we deliberately opted to use only descriptive labels such as omission-related activations when discussing fMRI results. Only when we discuss how the activations were related to self-reported relief do we talk about relief-related activations.

      I have some other comments, and I elaborate on those above comments, below:

      (1) A major assumption in the paper is that the unexpected absence of danger constitutes a pleasurable event, as stated in the opening sentence of the abstract. This may sometimes be the case, but it is not universal across contexts or people. For instance, for pathological fears, any relief derived from exposure may be short-lived (the dog didn't bite me this time, but that doesn't mean it won't next time or that all dogs are safe). And even if the subjective feeling one gets is temporary relief at that moment when the expected aversive event is not delivered, I believe there is an overall conflation between the concepts of relief and pleasure throughout the manuscript. Overall, the manuscript seems to be framed on the assumption that "aversive expectations can transform neutral outcomes into pleasurable events," but this is situationally dependent and is not a common psychological construct as far as I am aware.

      We thank the reviewer for their comment. We have restructured the introduction because we agree with the reviewer that the introduction might have set false expectations concerning our interpretation of the results. The statements related to relief have been toned down in the revised manuscript.

      Still, we want to note that the initial opening statement “unexpected absence of danger constitutes the pleasurable emotion relief” was based on a commonly used definition of relief, which states that relief refers to “the emotion that is triggered by the absence of expected or previously experienced negative stimulation” (Deutsch, 2015). Both aspects, that it is elicited by the absence of an otherwise expected aversive event and that it is pleasurable in nature, have received considerable empirical support in emotion and fear conditioning research (Deutsch et al., 2015; Leknes et al., 2011; Papalini et al., 2021; Vervliet et al., 2017; Willems & Vervliet, 2021).

      That said, the notion that the feeling of relief is linked to the (reward) prediction error underlying the learning of safety is included in several theoretical papers in order to explain the commonly observed dopaminergic response at the time of threat omission (both in animals and humans; Bouton et al., 2020; Kalisch et al., 2019; Pittig et al., 2020).

      Together, these studies indicate that the definition of relief, and its potential role in threat omission-driven learning, is – at least in our research field – established. Still, we felt that more direct research linking feelings of relief to omission-related brain responses was warranted.

      One of the main reasons why we specifically focus on the “pleasantness” of the relief is to assess the hedonic impact of the threat omission, as has been done in previous studies by our lab and others (Leknes et al., 2011; Leng et al., 2022; Papalini et al., 2021; Vervliet et al., 2017; Willems & Vervliet, 2021). Nevertheless, we agree with the reviewer that the relief we measure is a short-lived emotional state that is subject to individual differences (as are all emotions).

      (2) The authors allude to this limitation, but I think it is critical. Specifically, the study takes a rather simplistic approach to prediction errors. It treats the instructed probability as the subjects' expectancy level and treats the prediction error as omission related activity to this instructed probability. There is no modeling, and any dynamic parameters affected by learning are unaccounted for in this design. That is, subjects are informed that each trial is independently determined and so there is no learning "the presence/absence of stimulations on previous trials could not predict the presence/absence of stimulation on future trials." Prediction errors are central to learning. It is unclear if the "relief" subjects feel on not getting a shock on a high-probability trial is in any way analogous to a prediction error, because there is no reason to update your representation on future trials if they are all truly independent. The construct validity of the design is in question.

      (3) Related to the above point, even if subjects veered away from learning by the instruction that each trial is independent, the fact remains that they do not get shocks outside of the 100% probability shock. So learning is occurring, at least for subjects who realize the probability cue is actually a ruse.

      We thank the reviewer for raising these concerns. We believe that the general reply above covers the issues raised in points 2 and 3.

      (4) Bouton has described very well how the absence of expected threat during extinction can create a feeling of ambiguity and uncertainty regarding the signal value of the CS. This in large part explains the contextual dependence of extinction and the "return of fear" that is so prominent even in psychologically healthy participants. The relief people feel when not receiving an expected shock would seem to have little bearing on changing the long-term value of the CS. In any event, the authors do talk about conditioning (CS-US) in the paper, but this is not a typical conditioning study, as there is no learning.

      We fully agree with the reviewer that our study is not a typical conditioning study. Nevertheless, because our research mostly builds on recent advances in the fear extinction domain, we felt it was necessary to introduce the fear extinction procedure and related findings. In the context of fear extinction learning, we have previously shown that relief is an emotional correlate of the prediction error driving acquisition of the novel safety memory (CSnoUS; Papalini et al., 2021; Vervliet et al., 2017). The ambiguity Bouton describes is the result of the extinguished CS holding multiple meanings once the safety memory is acquired: does it signal danger or safety? We agree with Bouton that the meaning of the CS for any new encounter will depend on the context and the passage of time, but also on the initial strength of the safety acquisition (which is dependent on the size of the prediction error, and hence the amount of relief; Craske et al., 2014). However, it was not our objective to directly study the relation of relief to subsequent CS value, and our design is not tailored to do so post hoc.

      (5) In Figure 2 A-D, the omission responses are plotted on trials with varying levels of probability. However, it seems to be missing omission responses in 0% trials in these brain regions. As depicted, it is an incomplete view of activity across the different trial types of increasing threat probability.

      We thank the reviewer for pointing out this lack of clarity. The betas that are presented in the figures represent the ROI averages from each non-0% vs 0% contrast (i.e., 25%>0%; 50%>0%; and 75%>0% for the weak, moderate and strong intensity levels). Any positive beta therefore indicates a stronger activation in the given region compared to a fully predicted omission. Any negative beta indicates a weaker activation.

      Adaptations in the revised manuscript: We have adapted the figure captions of figures 2 and 3.

      “The extracted beta-estimates in figures A-D represent the ROI averages from each non0% > 0% contrast (i.e., 25%>0%; 50%>0%; and 75%>0% for the weak, moderate and strong intensity levels). Any positive beta therefore indicates a stronger activation in the given region compared to a fully predicted omission. Any negative beta indicates a weaker activation.”

      (6) If I understand Figure 2 panels E-H, these are plotting responses to the shock versus no-shock (when no-shock was expected). It is unclear why this would be especially informative, as it would just be showing activity associated with shocks versus no-shocks. If the goal was to use this as a way to compare positive and negative prediction errors, the shock would induce widespread activity that is not necessarily reflective of a prediction error. It is simply a response to a shock. Comparing activity to shocks delivered after varying levels of probability (e.g., a shock delivered at 25% expectancy, versus 75%, versus 100%) would seem to be a much better test of a prediction error signal than shock versus no-shock.

      We thank the reviewer for this comment. The purpose of this preregistered contrast was to test whether fully predicted outcomes elicited equivalent activations in our ROIs (corresponding to the third prediction error axiom). Specifically, if a region represents a pure prediction error signal, the 100% (fully predicted shocks) > 0% (fully predicted shock omissions) contrast should be nonsignificant, and follow-up Bayes Factors would further provide evidence in favor of this null-hypothesis.

      We agree with the reviewer that the delivery of the stimulation triggers widespread activations in our regions of interest that confounded this contrast. However, given that it was a preregistered test for the prediction error axioms, we cannot remove it from the manuscript. Instead, we have argued in the discussion that future studies that want to take an axiomatic stance should consider alternative tests to examine this axiom.

      Adaptations in the revised manuscript: We adapted lines 358-364.

      “Thus, given that we could not control for the delivery of the stimulation in the 100% > 0% contrast (the delivery of the stimulation completely overlapped with the contrast of interest), it is impossible to disentangle responses to the salience of the stimulation from those to the predictability of the outcome. A fairer evaluation of the third axiom would require outcomes that are roughly similar in terms of salience. When evaluating threat omission PE, this implies comparing fully expected threat omissions following 0% instructions to fully expected absence of stimulation at another point in the task (e.g. during a safe intertrial interval).”

      Also note that our task did not lend itself to an in-depth analysis of aversive (worse-than-expected) prediction error signals, given that there was only one stimulation trial for each probability x intensity level (see Supplemental Figure 1). The most informative contrast that can inform us about aversive prediction error signals contrasts all non-100% stimulation trials with all 100% stimulation trials. The results of this contrast are presented in Supplemental Figure 16 and Supplemental Table 11 for completeness.

      (7) I was unclear what the results in Figure 3 E-H were showing that was unique from panels A-D, or where it was described. The images looked redundant from the images in A-D. I see that they come from different contrasts (non0% > 0%; 100% > 0%), but I was unclear why that was included.

      We thank the reviewer for this comment. Our answer is related to that of the previous comment. Figure 3 presents the results of the axiomatic tests within the secondary ROIs we extracted from a wider secondary mask based on the non0%>0% contrast.

      (8) As mentioned earlier, there is a tendency to imply that subjects felt relief because there was activity in "the reward pathway."

      We thank the reviewer for their comment, but we respectfully disagree. Subjective relief was explicitly probed when the instructed stimulations stayed away. In the manuscript we only talk about “relief” when discussing these subjective reports. We found that participants reported higher levels of relief-pleasantness following omissions of stronger and more probable threat. This was an observation that matches our predictions and replicates our previous behavioral study (Willems & Vervliet, 2021).

      The fMRI evidence is treated separately from the “pleasantness” of the relief. Specifically, we refrain from calling the threat omission-related neural responses “relief-activity” as this would indeed imply that the activation would only be attributed to this psychological function. Instead, we talked about omission-related activity, and we assessed whether it complied with the prediction error criteria as specified by the axiomatic approach.

      Only afterwards, because we hypothesized that omission-related fMRI activation and self-reported relief-pleasantness were related, and because we found a similar response pattern for both measures, we examined how relief and omission-related fMRI activations within our ROIs were related on a trial-by-trial basis. To this end, we entered relief-pleasantness ratings as a parametric modulator to the omission regressor.
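
      As a generic illustration of what a parametric modulator is (not the analysis code used in the study), the sketch below builds an omission regressor and a rating-weighted companion regressor. The repetition time, onsets, ratings, and the simple double-gamma response are all hypothetical, and mean-centering the ratings is an assumption that reflects common practice rather than a detail stated above.

```python
import math
import numpy as np

def double_gamma_hrf(t):
    """Simple double-gamma haemodynamic response (peak near 5 s, undershoot near 15 s)."""
    peak = t ** 5 * np.exp(-t) / math.gamma(6)
    undershoot = t ** 15 * np.exp(-t) / math.gamma(16)
    return peak - undershoot / 6.0

# Hypothetical acquisition and trial parameters, for illustration only.
tr = 1.0                                             # repetition time in seconds
n_scans = 200
onsets = np.array([10, 40, 70, 100, 130, 160])       # omission onsets in seconds
ratings = np.array([20, 80, 50, 90, 10, 60], float)  # relief-pleasantness ratings

hrf = double_gamma_hrf(np.arange(0, 30, tr))

def convolved_regressor(event_onsets, amplitudes):
    """Place one impulse per event, scaled by its amplitude, and convolve with the HRF."""
    sticks = np.zeros(n_scans)
    sticks[(event_onsets / tr).astype(int)] = amplitudes
    return np.convolve(sticks, hrf)[:n_scans]

# Main regressor: every omission has the same unit amplitude.
omission = convolved_regressor(onsets, np.ones_like(ratings))
# Parametric modulator: amplitudes follow the mean-centered ratings, so it captures
# rating-related variation around the average omission response.
modulator = convolved_regressor(onsets, ratings - ratings.mean())

print(np.dot(omission, modulator))  # close to zero when events do not overlap
```

      In this toy design the events are spaced far enough apart that the two regressors are orthogonal; in a real design with overlapping responses they generally are not.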

      By no means do we want to reduce an emotional experience (relief) to fMRI activations in isolated regions in the brain. We agree with the reviewer that this would be far too reductionist. We therefore also ran a pre-registered LASSO-PCR analysis in order to identify whether a whole-brain pattern of activations can predict subjective relief (independent from the exact instructions we gave, and independent of our a priori ROIs). This analysis used trial-by-trial patterns of activation across all voxels in the brain as the predictor and self-reported relief as the outcome variable. It is therefore completely data-driven and can be seen as a preregistered exploratory analysis that is intended to inform future studies.

      (9) From the methods, it wasn't entirely clear where there is jitter in the course of a trial. This centers on the question of possible collinearity in the task design between the cue and the outcome. The authors note there is "no multicollinearity between anticipation and omission regressors in the first-level GLMs," but how was this quantified? The issue is of course that the activity coded as omission may be from the anticipation of the expected outcome.

      We thank the reviewer for pointing out this lack of clarity. Jitter was introduced in all parts of the trial: i.e., the duration of the inter-trial interval (4-7s), countdown clock (3-7s), and omission window (4-8s) were all jittered (see fig. 1A and methods section, lines 499-507). We added an additional line to the methods section.

      Adaptations in the revised manuscript: We added an additional line to the methods section to further clarify the jittering (lines 498-500).

      “The scale remained on the screen for 8 seconds or until the participant responded, followed by an intertrial interval between 4 and 7 seconds during which only a fixation cross was shown. Note that all phases in the trial were jittered (i.e., duration countdown clock, duration outcome window, duration intertrial interval).”

      Multicollinearity between the omission and anticipation regressors was assessed by calculating the variance inflation factor (VIF) of omission and anticipation regressors in the first level GLM models that were used for the parametric modulation analyses.

      Adaptations in the revised manuscript: We replaced the VIF abbreviation with “variance inflation factor” (line 423-424).

      “Nevertheless, there was no multicollinearity between anticipation and omission regressors in the first-level GLMs (Variance Inflation Factor, VIF < 4), making it unlikely that the omission responses purely represented anticipation.”
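
      For readers unfamiliar with the measure, the following is a minimal sketch of how a variance inflation factor can be computed: each regressor is regressed on all the others, and VIF = 1 / (1 - R²). The regressors here are simulated and the function name is hypothetical; this is not the code used for the first-level GLMs.

```python
import numpy as np

def variance_inflation_factors(design):
    """Return the VIF of each column of an (n_scans, n_regressors) design matrix."""
    design = np.asarray(design, dtype=float)
    n_scans, n_regressors = design.shape
    vifs = np.empty(n_regressors)
    for j in range(n_regressors):
        target = design[:, j]
        # Regress column j on all other columns plus an intercept.
        others = np.delete(design, j, axis=1)
        predictors = np.column_stack([np.ones(n_scans), others])
        coef, *_ = np.linalg.lstsq(predictors, target, rcond=None)
        residual = target - predictors @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Simulated (hypothetical) regressors: "omission" is only weakly related to "anticipation".
rng = np.random.default_rng(0)
anticipation = rng.normal(size=300)
omission = 0.3 * anticipation + rng.normal(size=300)
vifs = variance_inflation_factors(np.column_stack([anticipation, omission]))
print(vifs)
```

      A VIF of 1 means a regressor shares no variance with the others; values grow without bound as a regressor becomes a linear combination of the rest.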

      (10) I did not fully understand what the LASSO-PCR model using relief ratings added. This result was not discussed in much depth, and seems to show a host of clusters throughout the brain contributing positively or negatively to the model. Altogether, I would recommend highlighting what this analysis is uniquely contributing to the interpretation of the findings.

      The main added value of this analysis is that it uses a different approach altogether. The (mass univariate) parametric modulation analysis estimated in each voxel (and each ROI) whether the activity in this voxel/ROI covaried with the reported relief, so a significant activation only indicated that this voxel was related to relief. However, given that each voxel/ROI is treated independently in this analysis, it remains unclear how the activations were embedded in a wider network across the brain, and which regions contributed most to the prediction of relief. The multivariate LASSO-PCR analysis attempts to overcome this limitation by examining whether a whole-brain pattern can predict relief. Because we use the whole-brain pattern (and not only our a priori ROIs), this analysis is completely data-driven and is intended to inform future studies. In addition, the LASSO-PCR model was cross-validated using five-fold cross-validation, which is also a difference (and a strength) compared to the mass univariate GLM approach.
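
      The general idea of a cross-validated LASSO-PCR model can be sketched as follows, using scikit-learn on simulated data. This is an assumption-laden illustration rather than the pipeline used in the study: the data dimensions, number of components, and penalty strength are arbitrary, and the sketch ignores that real trials are nested within participants.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline

# Simulated stand-in data (hypothetical): 200 omission trials x 500 voxels,
# where a few latent components drive both the voxel pattern and the rating.
rng = np.random.default_rng(0)
n_trials, n_voxels, n_latent = 200, 500, 3
latent = rng.normal(size=(n_trials, n_latent))
loadings = rng.normal(size=(n_latent, n_voxels))
voxel_patterns = latent @ loadings + 0.5 * rng.normal(size=(n_trials, n_voxels))
relief_ratings = 2.0 * latent[:, 0] + 0.5 * rng.normal(size=n_trials)

# PCA reduces the voxel space; LASSO then regresses ratings on the component scores.
# n_components and alpha are arbitrary illustrative values, not the study's settings.
model = make_pipeline(PCA(n_components=20), Lasso(alpha=0.1))

# Five-fold cross-validation: each trial is predicted by a model that never saw it.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
predicted = cross_val_predict(model, voxel_patterns, relief_ratings, cv=folds)
cv_correlation = np.corrcoef(predicted, relief_ratings)[0, 1]
print(f"cross-validated prediction-outcome correlation: {cv_correlation:.2f}")
```

      Wrapping PCA and LASSO in one pipeline means the components are re-estimated inside every training fold, so no information from held-out trials leaks into the model.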

      One interesting finding that only became evident when we combined univariate and multivariate approaches is that, although the parametric modulation analysis showed that omission-related fMRI responses in the ROIs were modulated by the reported relief, none of these ROIs contributed significantly to the prediction of relief based on the identified signature. Instead, some of the contributing clusters fell within other valuation and error-processing regions (e.g. lateral OFC, mid cingulate, caudate nucleus). This suggests that regions other than our a priori ROIs may have been especially important for the subjective experience of relief, at least in this task. However, all these clusters were small and require further validation in out-of-sample participants. More research is necessary to test the generalizability and validity of the relief signature to new individuals and tasks, and to compare the signature with other existing signature models (e.g., signature of pain, fear, reward, pleasure). However, this was beyond the scope of the present study.

      Adaptations in the revised manuscript: We altered the explanation of the LASSO-PCR approach in the results section (lines 286-295) and the discussion (lines 399-402).

      Adaptations in the Results section: “The (mass univariate) parametric modulation analysis showed that omission-related fMRI activity in our primary and secondary ROIs correlated with the pleasantness of the relief. However, given that each voxel/ROI is treated independently in this analysis, it remains unclear how the activations were embedded in a wider network of activation across the brain, and which regions contributed most to the prediction of relief. To overcome these limitations, we trained a (multivariate) LASSO-PCR model (Least Absolute Shrinkage and Selection Operator-Regularized Principal Component Regression) in order to identify whether a spatially distributed pattern of brain responses can predict the perceived pleasantness of the relief (or “neural signature” of relief)31. Because we used the whole-brain pattern (and not only our a priori ROIs), this analysis is completely data-driven and can thus identify which clusters contribute most to the relief prediction.”

      Adaptations in the Discussion section: “In addition to examining the PE-properties of neural omission responses in our a priori ROIs, we trained a LASSO-PCR model to establish a signature pattern of relief. One interesting finding that only became evident when we compared the univariate and multivariate approach was that none of our a priori ROIs appeared to be an important contributor to the multivariate neural signature, even though all of them (except NAc) were significantly modulated by relief in the univariate analysis.”

      In addition to the public peer review, the reviewers provided some recommendations on how to further improve our manuscript. We reply to these recommendations below.

      Reviewer #1 (Recommendations For The Authors):

      Given that you do have trial-level estimates from the classifier analysis, it would be very informative to use learning models and examine responses trial-by-trial to test whether there are prediction errors that vary over time as a function of learning.

      We thank the reviewer for the suggestion. However, based on the results of the run-regressor, we do not anticipate large learning effects in our paradigm. As we mentioned in our responses above, we controlled for time-related drops in omission-responding by including a “run” regressor in our analyses. Results of this regressor for subjective relief and omission-related SCR showed that although there was a general drop in reported relief pleasantness and omission SCR over time, the effects of probability and intensity remained present until the last run. This suggests that even though some learning might have taken place, its effect was likely small and did not abolish our manipulations of probability and intensity. In any case, we cannot use the LASSO-PCR signature model to investigate learning, as this model uses the trial-level brain pattern at the time of US omission to estimate the associated level of relief. These estimates can therefore not be used to examine learning effects.

      Reviewer #2 (Recommendations For The Authors):

      The LASSO-PCR model feels rather disconnected from the rest of the paper and does not add much to the main theme. I would suggest to remove this part from the paper.

      We thank the reviewer for this suggestion. However, the LASSO-PCR analysis was preregistered. We therefore cannot remove it from the manuscript. We hope to have clarified its added value in the revised version of the manuscript.

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations For The Authors):

      - There were no mechanistic or causation-focused investigations that could have greatly strengthened the study. The study is ultimately providing two prioritized candidate genes that may be causative, reactive, or independent of the disease.

      Answer: We thank the reviewer for their positive assessment and agree that our study lacks formal causal analyses. We are aware of this limitation and have made it clear throughout the text. Through triangulation of evidence across tissues and species, we point to very interesting candidates that merit further study, which is the usual scope of such systems genetics investigations. Nevertheless, to introduce some causal inference and reinforce the human relevance of our results, we have performed Mendelian randomization (MR) analysis to investigate the potential associations between MUC4’s gene expression in human colons and the risk of IBD. EPHA6 lacks detectable eQTLs in human colon so we could not include it in this analysis. We found suggestive evidence that increased expression of MUC4 in the sigmoid, but not transverse, colon may increase the risk of IBD (nominal p = 0.033).

      The description in the manuscript:

      However, it is unclear through what mechanisms the genetic variants in the candidate genes affect IBD susceptibility. One possibility is that genetic variation leads to altered levels of expression of the gene, ultimately affecting disease susceptibility. To test this possibility, we examined the GTEx resource (GTEx Consortium, 2013) and found that MUC4, but not EPHA6, has cis-eQTLs in the sigmoid and transverse colon. To establish likely causal links with IBD incidence, we used these associations as instruments in a two-sample Mendelian randomization (MR) (Hemani, Tilling and Smith, 2017; Hemani et al., 2018) analysis. Using publicly available GWAS summary statistics for IBD, Crohn’s disease, and ulcerative colitis (Liu et al., 2015; Elsworth et al., 2020) as outcomes, we found suggestive evidence that increased expression of MUC4 in the sigmoid, but not transverse, colon may increase the risk of IBD (nominal P value = 0.033, Appendix 1 - Table 6). No eQTLs were reported for EPHA6 in the colon, precluding us from investigating the potential consequences of changes in its expression in these tissues.
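
      To make the logic of a two-sample MR estimate concrete, the sketch below computes a Wald ratio for a single instrument and a fixed-effect inverse-variance-weighted (IVW) estimate for several. The summary statistics are made up for illustration and are not taken from GTEx or any GWAS; the function names are hypothetical, and the software cited above may use different estimators and additional sensitivity checks.

```python
import math

def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-instrument MR estimate: variant-outcome effect over variant-exposure effect."""
    estimate = beta_outcome / beta_exposure
    # First-order approximation that ignores uncertainty in beta_exposure.
    standard_error = se_outcome / abs(beta_exposure)
    z = estimate / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p value
    return estimate, standard_error, p_value

def inverse_variance_weighted(betas_exposure, betas_outcome, ses_outcome):
    """Fixed-effect IVW estimate combining several independent instruments."""
    weights = [bx ** 2 / se ** 2 for bx, se in zip(betas_exposure, ses_outcome)]
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(betas_exposure, betas_outcome, ses_outcome)
    )
    estimate = numerator / sum(weights)
    standard_error = math.sqrt(1.0 / sum(weights))
    p_value = math.erfc(abs(estimate / standard_error) / math.sqrt(2))
    return estimate, standard_error, p_value

# Made-up summary statistics for one hypothetical cis-eQTL:
# effect of the variant on expression (exposure) and on disease log-odds (outcome).
estimate, standard_error, p_value = wald_ratio(
    beta_exposure=0.5, beta_outcome=0.1, se_outcome=0.05
)
print(f"log-odds change per unit expression: {estimate:.2f} "
      f"(SE {standard_error:.2f}, p = {p_value:.3f})")
```

      With a single instrument the IVW estimate reduces to the Wald ratio, which is why both are shown side by side.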

      - Figures 3 and its supplement Figure 1: Among the 39 modules, the authors have only focused on significantly overlapping up-regulated IBD-related gene modules in both CD (M28 and M32) and HFD (M9 and M28) for their follow up analyses in Figures 4 and 5 to prioritize candidate genes. However, this reviewer thinks there is great value in also focusing on significantly overlapping down-regulated IBD-related gene modules in both CD (M17) and HFD (M15 and M26) for their follow up candidate gene prioritization analyses.

      Answer: Thank you for your suggestion. We had initially performed overrepresentation analyses in HFD_M15, HFD_M26 and CD_M17, but did not find enrichments related to inflammation (see Author response image 1 below). We did not include this result in the manuscript.

      Author response image 1.

      Dot plot showing the enrichment of IBD-related modules in hallmark genesets. Gene ratios higher than 0.1 are shown and represented by dot size. Dots are colored by -Log10(BH-adjusted P values).

      We also checked the module QTL mapping for the significantly overlapping down-regulated IBD-related gene modules in both CD and HFD. We did not find any loci that are significantly associated with these modules, indicating that they are not modulated by genetic variation and hence are less likely to inform on IBD susceptibility.

      The description in the manuscript:

      The ModQTL analysis was also performed on the modules that are significantly enriched in IBD-downregulated genes (HFD_M15, HFD_M24, and HFD_M26), but no significant or suggestive QTLs were detected. Therefore, we focused on the QTL for IBD-induced genes in HFD_M28 and annotated its candidate genes based on three criteria (Figure 5B).

      Reviewer #2 (Recommendations For The Authors):

      - One small addition that would be nice would be to indicate if the two candidate genes have cis eQTL in human tissues and/or have any protein-coding variants in humans. This would provide nice additional evidence of causality for these two genes.

      Answer: Thank you for your positive assessment and suggestion. MUC4 and EPHA6 both have protein-coding variants in humans that were listed in the Appendix – Table 3 and Table 4. In addition, cis-eQTLs have been found for MUC4 in both the sigmoid and transverse colon in humans (GTEx, https://gtexportal.org/home/locusBrowserPage/ENSG00000145113.21). As indicated in our response to the first comment of Reviewer #1, we have now performed Mendelian randomization (MR) on human eQTL for MUC4. However, no eQTLs were reported for EPHA6 in the colon, preventing us from performing MR analysis on its expression.

      - Also, it would be helpful to include the size of the modules in the text of the manuscript. Especially the two modules that were followed up on.

      Answer: Thank you for your suggestion. We have indicated the sizes of the IBD-related modules in the text of the manuscript.

      The description in the manuscript:

      Enrichment analyses indicated that modules HFD_M9 (484 genes), HFD_M16 (328 genes), and HFD_M28 (123 genes) were enriched with genes that are upregulated by DSS-induced colitis, while HFD_M15 (368 genes), HFD_M24 (159 genes), and HFD_M26 (135 genes) were significantly enriched with downregulated genes (Figure 3C). Of note, more than 20% of genes involved in HFD_M9 and HFD_M28 were part of the dysregulated genes of the acute phase of mouse UC (day6 and day7) (Figure 3C). Interestingly, genes perturbed during IBD pathogenesis in humans were also enriched in HFD_M9 and HFD_M28 (Figure 3C).

      While IBD-related genes were predominantly found in HFD modules, we also found that two modules, CD_M28 (185 genes) and CD_M32 (142 genes), in CD-fed mouse colons were associated with IBD (Figure 3—figure supplement 1A). These two modules significantly overlapped with the IBD-related HFD_M9 and HFD_M28 modules, respectively (BH-adjusted P value < 0.05) (Figure 3—figure supplement 1B). Moreover, the molecular signatures underlying human UC and Crohn’s disease were also clustered in these two modules (CD_M28 and CD_M32) under CD (Figure 3—figure supplement 1C). Collectively, the co-expression and enrichment analyses identify HFD_M9 and HFD_M28 as IBD-related modules on which we focus our subsequent investigation.
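
      The "BH-adjusted P value" threshold quoted above refers to the Benjamini-Hochberg correction for multiple testing. As a minimal sketch of how such an adjustment works (not the authors' code; the input p-values below are hypothetical and do not come from the study), the step-up procedure can be written as:

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four module-overlap tests (illustration only).
example_p = [0.001, 0.02, 0.03, 0.5]
adjusted_p = bh_adjust(example_p)
significant = [p < 0.05 for p in adjusted_p]
```

      In this toy input, the first three tests remain below 0.05 after adjustment and the fourth does not.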

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #2 (Public Review)

      Weaknesses

      1) The usage of young growing mice (8-10 weeks) versus adult mice (>4 months) in the murine mechanical overload experiments. The usage of adult mice would be preferable for these experiments given that maturational growth may somehow affect the outcomes.

      The basis for this critique is not clear as it has been shown that the longitudinal growth of bones is complete by ~8 weeks of age (e.g., PMID: 28326349, and 31997656). These studies, along with others, also indicate that 8 weeks is a post-pubescent age in mice. For these reasons, 8 weeks of age was viewed as being representative of the human equivalent of when people start to perform resistance exercise with the goal of increasing muscle mass. Also, it’s important to consider that the mice were 10-12 weeks of age when the muscles were collected, which would be equivalent to a human in their early 20s. In our human study, the mean age of the subjects was 23. Given the above points, it’s hard for us to appreciate why the use of mice that started at 8-10 weeks of age is viewed as a weakness. With that being said, we recognize that there may be age-related changes in mechanisms of mechanical load-induced growth, but it was not our intent to address this topic.

      1b) No consideration for biological sex.

      We appreciate this point and we agree that sex is an important variable to consider. In this study, we explored an uncharted topic and therefore we wanted to minimize as many known variables as possible. We did that, in part, by focusing specifically on male subjects. In the future, it will certainly be important to explore whether sex (and age) impact the structural adaptations that drive the mechanical load-induced growth of muscle fibers.

      2) Information on whether myofibrillogenesis is dependent on hypertrophy induced by loading, or just hypertrophy in general. To provide information on this, the authors could use, for instance, inducible Myostatin KO mice (a model where hypertrophy and force production are not always in lockstep) to see whether hypertrophy independent from load induces the same result as muscle loading regarding myofibrillogenesis.

      This is a great suggestion, but it goes beyond the intended scope of our study. Nevertheless, with the publication of our FIM-ID methodology, the answer to this and related questions can now be obtained in a time- and cost-effective manner.

      3) Limited information on Type 1 fiber hypertrophy. A "dual overload" model is used for the mouse where the soleus is also overloaded, but presumably, the soleus was too damaged to analyze. Exploring hypertrophy of murine Type 1 fibers using a different model (weight pulling, weighted wheel running, or forced treadmill running) would be a welcome addition.

      The point is well taken and further studies that are aimed at determining whether there are differences in how Type I vs. Type II fibers grow would be an excellent subject for future studies.

      Reviewer #3 (Public Review)

      1) Supplemental Figure 1 is not very clear.

      Supplemental Figure 1 is now presented as Supplemental Figure 2. We carefully reexamined this figure and, in our opinion, the key points have been appropriately conveyed. We would be more than happy to revise the figure, but we would need guidance with respect to which aspect(s) of the figure were not clear to the reviewer.

      Reviewer #1 (Recommendations For The Authors)

      Introduction.

      1) I do not think the first paragraph is really necessary. Cell growth is a fundamental property of cell biology that requires no further justification.

      We believe that it is essential to remind all readers about the importance of skeletal muscle research. For some, the detrimental impact of skeletal muscle loss on one’s quality of life and the greater burden on the healthcare system may not be known.

      2) I prefer "fundamental" over "foundationally".

      All mentions of the word “foundational” and “foundationally” have been changed to “fundamental” and “fundamentally.”

      3) As usual for the Hornberger lab, the authors do an excellent job of providing the (historical) context of the research question.

      Thank you for this positive comment.

      4) I prefer “Goldspink” as “Dr. Goldspink” feels too personal especially when you are critical of his studies.

      All instances of “Dr.” have been removed when referring to the works of others. This includes Dr. Goldspink and Dr. Tokuyasu.

      5) Fourth paragraph, after reference #17. I felt like this discussion was not necessary and did not really add any value to the introduction.

      We believe that this discussion should remain since it highlights the widely accepted notion that mechanical loading leads to an increase in the number of myofibrils per fiber, yet there is no compelling data to support this notion. This discussion highlights the need for documented evidence for the increase in myofibril number in response to mechanical loading and, as such, it serves as a major part of the premise for the experiments that were conducted in our manuscript.

      6) The authors do a nice job of laying out the challenge of rigorously testing the Goldspink model of myofiber hypertrophy.

      Thank you!

      Results

      1). For the EM images, can the authors provide a representative image of myofibril tracing? From the EM image provided, it is difficult to evaluate how accurate the tracing is.

      Representative images and an explanation of myofibril calculation have been provided in Supplemental Figure 5.

      2) In the mouse, how does the mean myofibril CSA compare between EM and FIM-ID?

      Author response image 1.

      The above figures compare the myofibril CSA and fiber CSA measurements that were obtained with EM and FIM-ID for all analyzed fibers, as well as the same fibers separated according to the fiber type (i.e., Ox vs. Gly). The above figure shows that the FIM-ID measurements of myofibril CSA were slightly, yet significantly, lower than the measurements obtained with EM. However, we believe that it would be misleading to present the data in this manner. Specifically, as shown in Fig. 4C, a positive linear relationship exists between myofibril CSA and fiber CSA. Thus, a direct comparison of myofibril CSA measurements obtained from EM and FIM-ID would only be meaningful if the mean CSA of the fibers that were analyzed were the same. As shown on the panel on the right, the mean CSA of the fibers analyzed with FIM-ID was slightly, yet significantly, lower than the mean CSA of the fibers analyzed with EM. As such, we believe that the most appropriate way to compare the measurements of the two methods is to express the values for the myofibril CSA relative to the fiber CSA and this is how we presented the data in the main figure (i.e., Fig. 4E).

      3) Looking at Fig. 3D, how is intermyofibrillar space calculated when a significant proportion of the ROI is odd-shaped myofibrils that are not outlined? It is not clear how the intermyofibrillar space between the odd-shaped myofibrils is included in the total intermyofibrillar space calculation for the fiber.

      The area occupied by the intermyofibrillar components is calculated by using our custom “Intermyofibrillar Area” pipeline within CellProfiler. Briefly, the program creates a binary image of the SERCA signal. The area occupied by the white pixels in the binary image is then used to calculate the area that is occupied by the intermyofibrillar components. To help readers, an example of this process is now provided in Supplemental Figure 4.
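
      The binarize-and-count idea described above can be sketched in a few lines. This is only an illustration of the principle and not the CellProfiler pipeline itself; the fixed threshold, the pixel area, and the toy intensity values are all assumptions made for the example.

```python
def intermyofibrillar_area(image, threshold, pixel_area_um2):
    """Binarize a grayscale SERCA image and return the area of the white pixels.

    image          -- 2D list of pixel intensities (rows of numbers)
    threshold      -- intensity at or above which a pixel counts as SERCA signal
                      (a fixed cutoff is assumed here for simplicity)
    pixel_area_um2 -- area represented by one pixel, in square micrometres
    """
    # Binary image: 1 (white) where SERCA signal is present, 0 (black) elsewhere.
    binary = [[1 if px >= threshold else 0 for px in row] for row in image]
    white_pixels = sum(sum(row) for row in binary)
    return white_pixels * pixel_area_um2


# Toy 3 x 4 image with hypothetical intensities (0-255 scale).
toy_image = [
    [10, 200, 220, 15],
    [12, 210, 30, 20],
    [180, 190, 25, 18],
]
area = intermyofibrillar_area(toy_image, threshold=100, pixel_area_um2=0.01)
```

      Five of the twelve toy pixels exceed the threshold, so the computed area is five times the assumed pixel area.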

      4) What is the average percentage of each ROI that was not counted by CP (because a myofibril did not fit the shape criteria)? The concern is that the method of collection is biasing the data. In looking at EM images of myofibrils (from other studies), it is apparent that myofibrils are not always oval; in fact, it appears that often myofibrils have a more rectangular shape. These odd-shaped myofibrils are excluded from the analysis yet they might provide important information; maybe these odd-shaped myofibrils always hypertrophy such that their inclusion might change the overall conclusion of the study. I completely understand the challenges of trying to quantify odd-shaped myofibrils. I think it is important the authors discuss this important limitation of the study.

      First, we would like to clarify that myofibrils of a generally rectangular shape were not excluded. The intent of the filtering steps was to exclude objects that exhibited odd shapes because of an incomplete closure of the signal from SERCA. To illustrate this point we have annotated the images from Figure 3B-D with a red arrow which points to a rectangular object and blue arrows which point to objects that most likely consisted of two or more individual myofibrils that were falsely identified as a single object.

      Author response image 2.

      We appreciate the reviewer's concern that differences in the exclusion rates between groups could have biased the outcomes. Indeed, this was something that we were keeping a careful eye on during our analyses, and we hope that the reviewer will take comfort in knowing that objects were excluded at a very similar rate in both the mouse and human samples (44% vs. 46% for SHAM vs. MOV in mice, and 47% vs. 47% for PRE vs. POST in humans). We realize that this important data should have been included in our original submission and it is now contained within the results section of the revised version of our manuscript. Hopefully the explanation above, along with the inclusion of this data, will alleviate the reviewer's concerns that differences between the groups may have been biased by the filtering steps.

      Discussion.

      1) I think the authors provided a balanced interpretation of the data by acknowledging the limitation of having only one time-point. i.e., not being able to assess the myofibril splitting mechanism.

      Thank you!

      2) I think a discussion on the important limitation of only quantifying oval-shaped myofibrils should be included in the discussion.

      Please refer to our response to comment #4 of the results section.

      Reviewer #2 (Recommendations For The Authors)

      Overall, this is a thoughtful, clear, and impactful manuscript that provides valuable tools and information for the skeletal muscle field. My specific comments are as follows:

      1) In the introduction, I really appreciate the historical aspect provided on myofibrillogenesis. As written, however, I was expecting the authors to tackle the myofibril "splitting" question in greater detail with their experiments given the amount of real estate given to that topic, but this was not the case. Consider toning this down a bit as I think it sets a false expectation.

      We acknowledge that the study does not directly address the question about myofibril splitting. However, we believe that it is important to highlight the background of this untested theory since it serves as a major part of the premise for the experiments that were performed.

      2) In the introduction, is it worth citing this study? https://rupress.org/jcb/articlepdf/111/5/1885/1464125/1885.pdf.

      This is a very interesting study but, despite the title, we do not believe that it is accurate to say that this study investigated myofibrillogenesis. Instead (as illustrated by the author in Fig. 9) the study focused on the in-series addition of new sarcomeres at the ends of the pre-existing myofibrils (i.e., it studied in-series sarcomerogenesis). In our opinion, the study does not provide any direct evidence of myofibrillogenesis, and we are not aware of any studies that have shown that the chronic stretch model employed by the authors induces myofibrillogenesis. However, numerous studies have shown that chronic stretch leads to the in-series addition of new sarcomeres.

      3) Is there evidence for myofibrillogenesis during cardiac hypertrophy that could be referenced here?

      This is a great question, and one would think that it would have been widely investigated. However, direct evidence for myofibrillogenesis during load-induced cardiac hypertrophy is just as sparse as the evidence for myofibrillogenesis during load-induced skeletal muscle hypertrophy.

      4) In the introduction, perhaps mention that prolonged fixation is another disadvantage of EM tissue preparation. This typically prevents the usage of antibodies afterwards, whereas the authors have been able to overcome this using their method, which is a great strength.

      Thank you for the suggestion. This point has been added to the 5th paragraph of the introduction.

      5) In the introduction, are there not EM-compatible computer programs that could sidestep the manual tracing and increase throughput? Why could software such as this not be used? https://www.nature.com/articles/s41592-019-0396-9

      While we agree that automated pipelines have been developed for EM, such methods require a high degree of contrast between the measured objects. With EM, the high degree of contrast required for automated quantification is rarely observed between the myofibrils and the intermyofibrillar components (especially in glycolytic fibers). Moreover, one of the primary goals of our study was to develop a time and cost-effective method for identifying and quantifying myofibrils. As such, we developed a method that would not require the use of EM. We only incorporated EM imaging and analysis to validate the FIM-ID method. Therefore, utilizing an EM-compatible program to sidestep the manual tracing would have sped up the validation step, but it would not have accomplished one of the primary goals of our study.

      6) In the results, specifically for the human specimens, were "hybrid" fibers detected and, if so, how did the pattern of SERCA look? Also, did the authors happen to notice centrally nucleated muscle fibers in the murine plantaris after overload? If so, how did the myofibrils look? Could be interesting.

      For the analysis of the human fibers, two distinct immunolabeling methods were performed. One set of sections was stained for SERCA1 and dystrophin, while the other set was stained for SERCA2 and dystrophin. In other words, we did not perform dual immunolabeling for SERCA1 and SERCA2 on the same sections. Therefore, during the analysis of the human fibers, we did not detect the presence of hybrid fibers. Furthermore, while we did not perform nuclear staining on these sections, it should be noted that nuclei do not contain SERCA, and to the best of our recollection, we did not detect any SERCA-null objects within the center of the fibers. Moreover, our previous work has shown that the model of MOV used in this study does not lead to signs of degeneration/regeneration (You, Jae-Sung et al. (2019). doi:10.1096/fj.201801653RR). Therefore, it can be safely assumed that very few (if any) of the fibers analyzed in this study were centrally nucleated.

      7) In the Results, fixed for how long? This is important since, at least in my experience, with 24+ hours of fixation, antibody reactivity is significantly reduced unless an antigen retrieval step is performed (even then, not always successful). Also, presumably these tissues were drop-fixed? These details are in the Methods but some additional detail here could be warranted for the benefit of the discerning and interested reader.

      For both the mouse and human, the samples were immersion-fixed (presumably the equivalent of “drop-fixed”) in 4% paraformaldehyde in 0.1M phosphate buffer solution for a total of 24 hours (as described in the Methods section). We agree that prolonged aldehyde fixation can affect antibody reactivity; however, the antibodies used for FIM-ID did not require an antigen retrieval step.

      8) In the results regarding NADH/FAD autofluorescence imaging, a complementary approach in muscle was recently described and could be cited here: https://journals.physiology.org/doi/full/10.1152/japplphysiol.00662.2022

      We appreciate the reviewer’s recommendation to add this citation for the support of our method for fiber type classification and have added it to the manuscript in the second paragraph under the “Further refinement and validation of the automated measurements with FIM-ID” subsection of the Results as citation number 57.

      9) In the results, "Moreover, no significant differences in the mean number of myofibrils per fiber CSA were found when the results from the FIM-ID and EM-based measurements were directly compared, and this point was true when the data from all analyzed fibers was considered..." Nit-picky, but should it be "were considered" since data is plural?

      Thanks, this error was corrected.

      10) In the discussion, are the authors developing a "methodology" or a "method"? I think it may be the latter.

      We agree that “method” is the correct term to use. Instances of the word “methodology” have been replaced with “method.”

      11) In the discussion, since the same fibers were not being tracked over time, I'm not sure that saying "radial growth" is strictly correct. It is intuitive that the fibers were growing during loading, of course, but it may be safer to say "larger fibers versus control or the Pre sample" or something of the like. For example, "all the fiber types that were larger after loading versus controls" as opposed to "showed significant radial growth"

      While we agree that the fiber size was not tracked over time, the experiments were designed to test for a main effect of mechanical loading. Therefore, we are attributing the morphological adaptations to the mechanical loading variable (i.e., mechanical load-induced growth). Terms like “the induction of radial growth” or “the induction of hypertrophy” are commonly used in studies with the methods employed in this study. Respectfully, we believe that it would be more confusing for the readers if we used the suggested terms like "all the fiber types that were larger after loading versus controls". For instance, if I were the reader I would think to myself… but there were fiber types that were larger than others before loading (e.g., Ox vs. Gly), so what are the authors really trying to talk about?

      12) I would suggest making a cartoon summary figure to complement and summarize the Methods/Results/Discussion

      Thank you for this suggestion. We created a cartoon that summarizes the overall workflow for FIM-ID and this cartoon is now presented in Supplemental Figure 1.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #2 (Public Review):

      The authors make a compelling case for the biological need to exquisitely control RecB levels, which they suggest is achieved by the pathway they have uncovered and described in this work. However, this conclusion is largely inferred as the authors only investigate the effect on cell survival in response to (high levels of) DNA damage and in response to two perturbations - genetic knock-out or over-expression, both of which are likely more dramatic than the range of expression levels observed in unstimulated and DNA damage conditions.

      In the discussion of the updated version of the manuscript, we have clarified the limits of our interpretation of the role of the uncovered regulation.

      Lines 411-417: “It is worth noting that the observed decrease in cell viability upon DNA damage was detected for relatively drastic perturbations such as recB deletion and RecBCD overexpression. Verifying these observations in the context of more subtle changes in RecB levels would be important for further investigation of the biological role of the uncovered regulation mechanism. However, the extremely low numbers of RecB proteins make altering its abundance in a refined, controlled, and homogeneous across cells manner extremely challenging and would require the development of novel synthetic biology tools.”

      Reviewer #3 (Public Review):

      The major weaknesses include a lack of mechanistic depth, and part of the conclusions are not fully supported by the data.

      (1) Mechanistically, it is still unclear why upon DNA damage, the translation level of recB mRNA increases, which makes the story less complete. The authors mention in the Discussion that a moderate (30%) decrease in Hfq protein was observed in a previous study, which may explain the loss of translation repression on recB. However, given that this mRNA exists in very low copy number (a few per cell) and that Hfq copy number is on the order of a few hundred to a few thousand, it's unclear how a 30% decrease in the protein level should result in a significant change in its regulation of recB mRNA.

      We agree that the entire mechanistic pathway controlling recB expression may not be limited to just Hfq involvement. We have performed additional experiments, proposed by the reviewer, suggesting that a small RNA might be involved (see below, response to comments 3&4). However, we consider that the full characterisation of all players is beyond the scope of this manuscript. In addition to describing the new data (see below), we expanded the discussion to explain more precisely why changes in Hfq abundance upon DNA damage may impact RecB translation.

      Lines 384-391: “A modest decrease (~30%) in Hfq protein abundance has been seen in a proteomic study in E. coli upon DSB induction with ciprofloxacin (DOI: 10.1016/j.jprot.2018.03.002). While Hfq is a highly abundant protein, it has many mRNA and sRNA targets, some of which are also present in large amounts (DOI: 10.1046/j.1365-2958.2003.03734.x). As recently shown, the competition among the targets over Hfq proteins results in unequal (across various targets) outcomes, where the targets with higher Hfq binding affinity have an advantage over the ones with less efficient binding (DOI: 10.1016/j.celrep.2020.02.016). In line with these findings, it is conceivable that even modest changes in Hfq availability could result in significant changes in gene expression, and this could explain the increased translational efficiency of RecB under DNA damage conditions.”

      (2) Based on the experiment and the model, Hfq regulates translation of recB gene through binding to the RBS of the upstream ptrA gene through translation coupling. In this case, one would expect that the behavior of ptrA gene expression and its response to Hfq regulation would be quite similar to recB. Performing the same measurement on ptrA gene expression in the presence and absence of Hfq would strengthen the conclusion and model.

      Indeed, based on our model, we expect PtrA expression to be regulated by Hfq in a similar manner to RecB. However, the product encoded by the ptrA gene, Protease III, (i) has been poorly characterised; (ii) unlike RecB, is located in the periplasm (DOI: 10.1128/jb.149.3.1027-1033.1982); and (iii) is not involved in any DNA repair pathway. Therefore, analysing PtrA expression would take us away from the key questions of our study.

      (3) The authors agree that they cannot exclude the possibility of sRNA being involved in the translation regulation. However, this can be tested by performing the imaging experiments in the presence of Hfq proximal face mutations, which largely disrupt binding of sRNAs.

      (4) The data on construct with a long region of Hfq binding site on recB mRNA deleted is less convincing. There is no control to show that removing this sequence region itself has no effect on translation, and the effect is solely due to the lack of Hfq binding. A better experiment would be using a Hfq distal face mutant that is deficient in binding to the ARN motifs.

      We performed the requested experiments. We included this data in the manuscript in the supplementary figure (Figure S11), and our interpretation in the discussion.

      Lines 354-378: “While a few recent studies have shown evidence for direct gene regulation by Hfq in a sRNA-independent manner (DOI: 10.1101/gad.302547.117; DOI: 10.1111/mmi.14799; DOI: 10.1371/journal.pgen.1004440; DOI: 10.1111/mmi.12961; DOI: 10.1038/emboj.2013.205), we attempted to investigate whether a small RNA could be involved in the Hfq-mediated regulation of RecB expression. We tested Hfq mutants containing point mutations in the proximal and distal sides of the protein, which were shown to disrupt either binding with sRNAs or with ARN motifs of mRNA targets, respectively [DOI: 10.1016/j.jmb.2013.01.006, DOI: 10.3389/fcimb.2023.1282258]. Hfq mutated in either proximal (K56A) or distal (Y25D) faces were expressed from a plasmid in a ∆hfq background. In both cases, Hfq expression was confirmed with qPCR and did not affect recB mRNA levels (Supplementary Figure S11b). When the proximal Hfq binding side (K56A) was disrupted, RecB protein concentration was nearly similar to that obtained in a ∆hfq mutant (Supplementary Figure S11a, top panel). This observation suggests that the repression of RecB translation requires the proximal side of Hfq, and that a small RNA is likely to be involved as small RNAs (Class I and Class II) were shown to predominantly interact with the proximal face of Hfq [DOI: 10.15252/embj.201591569]. When we expressed Hfq mutated in the distal face (Y25D) which is deficient in binding to mRNAs, less efficient repression of RecB translation was detected (Supplementary Figure S11a, bottom panel). This suggests that RecB mRNA interacts with Hfq at this position. We did not observe full de-repression to the ∆hfq level, which might be explained by residual capacity of Hfq to bind its recB mRNA target in the point mutant (Y25D) (either via the distal face with less affinity or via the lateral rim Hfq interface).”

      Taken together, these results suggest that Hfq binds to recB mRNA and that a small RNA might contribute to the regulation although this sRNA has not been identified.

      (5) Ln 249-251: The authors claim that the stability of recB mRNA is not changed in ∆hfq simply based on the steady-state mRNA level. To claim so, the lifetime needs to be measured in the absence of Hfq.

      We measured recB mRNA lifetime in the absence of Hfq in a time-course experiment where transcription initiation was inhibited with rifampicin and mRNA abundance was quantified with RT-qPCR. The results confirmed that recB mRNA lifetime in hfq mutants is similar to the one in the wild type (Figure S7d, referred to in line 263 of the manuscript).
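
      For readers unfamiliar with this type of measurement, the sketch below shows one common way a lifetime can be estimated from a rifampicin time course, assuming single-exponential decay and a log-linear least-squares fit. It is not the authors' analysis code; the time points and the decay rate used to generate the example data are hypothetical.

```python
import math


def fit_decay_rate(times_min, abundances):
    """Fit ln(abundance) = ln(A0) - k * t by least squares; return k in 1/min."""
    logs = [math.log(a) for a in abundances]
    n = len(times_min)
    mean_t = sum(times_min) / n
    mean_y = sum(logs) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_min, logs))
    sxx = sum((t - mean_t) ** 2 for t in times_min)
    # The slope of the log-linear fit is -k.
    return -sxy / sxx


# Hypothetical, noise-free relative mRNA levels after rifampicin addition.
times = [0, 2, 4, 6, 8]          # minutes after transcription is blocked
true_k = 0.35                    # assumed decay rate, for illustration only
levels = [math.exp(-true_k * t) for t in times]

k = fit_decay_rate(times, levels)
mean_lifetime = 1 / k            # mean mRNA lifetime, minutes
half_life = math.log(2) / k      # mRNA half-life, minutes
```

      Comparing the fitted rate between wild-type and ∆hfq time courses is then a direct test of whether stability differs, which steady-state levels alone cannot show.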

      (6) What's the labeling efficiency of Halo-tag? If not 100% labeled, is it considered in the protein number quantification? Is the protein copy number quantification through imaging calibrated by an independent method? Does Halo tag affect the protein translation or degradation?

      Our previous study (DOI: 10.1038/s41598-019-44278-0) described a detailed characterization of the HaloTag labelling technique for quantifying low-copy proteins in single E. coli cells using RecB as a test case. 

      In that study, we showed complete quantitative agreement of RecB quantification between two fully independent methods: HaloTag-based labelling with cell fixation and RecB-sfGFP combined with a microfluidic device that lowers protein diffusion in the bacterial cytoplasm. This second method had previously been validated for protein quantification (DOI: 10.1038/ncomms11641) and provides detection of 80-90% of the labelled protein. Additionally, in our protocol, immediate chemical fixation of cells after the labelling and quick washing steps ensure that new, unlabelled RecB proteins are not produced. We, therefore, conclude that our approach to RecB detection is highly reliable and sufficient for comparing RecB production in different conditions and mutants.

      The RecB-HaloTag construct has been designed for minimal impact on RecB production and function. The HaloTag is translationally fused to RecB in a loop positioned after the serine present at position 47 where it is unlikely to interfere with (i) the formation of RecBCD complex (based on RecBCD structure, DOI: 10.1038/nature02988), (ii) the initiation of translation (as it is far away from the 5’UTR and the beginning of the open reading frame) and (iii) conventional C-terminal-associated mechanisms of protein degradation (DOI: 10.15252/msb.20199208). In our manuscript, we showed that the RecB-HaloTag degradation rate is similar to the dilution rate due to bacterial growth. This is in line with a recent study on unlabelled proteins, which shows that RecB’s lifetime is set by the cellular growth rate (DOI: 10.1101/2022.08.01.502339).

      Furthermore, we have demonstrated (DOI: 10.1038/s41598-019-44278-0) that (i) bacterial growth is not affected by replacing the native RecB with RecB-HaloTag, (ii) RecB-HaloTag is fully functional upon DNA damage, and (iii) no proteolytic processing of the RecB-HaloTag is detected by Western blot. 

      These results suggest that RecB expression and functionality are unlikely to be affected by the translational HaloTag insertion at Ser-47 in RecB.

      In the revised version of the manuscript, we have added information about the construct and discuss the reliability of the quantification.

      Lines 141-152: “To determine whether the mRNA fluctuations we observed are transmitted to the protein level, we quantified RecB protein abundance with single-molecule accuracy in fixed individual cells using the Halo self-labelling tag (Fig. 2A&B).

      The HaloTag is translationally fused to RecB in a loop after Ser47 (DOI: 10.1038/s41598-019-44278-0) where it is unlikely to interfere with the formation of RecBCD complex (DOI: 10.1038/nature02988), the initiation of translation and conventional C-terminal-associated mechanisms of protein degradation (DOI: 10.15252/msb.20199208). Consistent with minimal impact on RecB production and function, bacterial growth was not affected by replacing the native RecB with RecB-HaloTag, the fusion was fully functional upon DNA damage and no proteolytic processing of the construct was detected (DOI: 10.1038/s41598-019-44278-0). To ensure reliable quantification in bacteria with HaloTag labelling, the technique was previously verified with an independent imaging method and resulted in > 80% labelling efficiency (DOI: 10.1038/s41598-019-44278-0, DOI: 10.1038/ncomms11641). In order to minimize the number of newly produced unlabelled RecB proteins, labelling and quick washing steps were followed by immediate chemical fixation of cells.”

      Lines 164-168: “Comparison to the population growth rate [in these conditions (0.017 1/min)] suggests that RecB protein is stable and effectively removed only as a result of dilution and molecule partitioning between daughter cells. This result is consistent with a recent high-throughput study on protein turnover rates in E. coli, where the lifetime of RecB proteins was shown to be set by the doubling time (DOI: 10.1038/s41467-024-49920-8).”
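
      The relationship between growth rate and protein lifetime quoted above can be checked with a short calculation. The following is a minimal sketch, assuming that removal occurs purely by exponential dilution at the stated population growth rate of 0.017 1/min; the function names are illustrative and do not come from the manuscript.

```python
import math


def doubling_time(growth_rate_per_min):
    """Doubling time (min) for exponential growth at the given rate (1/min)."""
    return math.log(2) / growth_rate_per_min


def fraction_remaining(t_min, growth_rate_per_min):
    """Fraction of the initial per-cell protein concentration left after t_min,
    assuming no active degradation and no new synthesis (dilution only)."""
    return math.exp(-growth_rate_per_min * t_min)


GROWTH_RATE = 0.017  # 1/min, value quoted in the response above

t_double = doubling_time(GROWTH_RATE)
half_left = fraction_remaining(t_double, GROWTH_RATE)
print(f"doubling time: {t_double:.1f} min, fraction left after one doubling: {half_left:.2f}")
```

      Under this assumption, a stable protein has an effective half-life equal to the doubling time (roughly 41 min here), which is the sense in which the lifetime is "set by" growth.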

      (7) Upper panel of Fig S8a is redundant as in Fig 5B. Seems that Fig S8d is not described in the text.

      We have now stated in the legend of Fig S8a that the data in the upper panel were taken from Fig 5B to visually facilitate the comparison with the results given in the lower panel. We also noticed that we had not specified this for the upper panel of Fig S9a (these data were taken from Fig 5C for the same reason). We have added this clarification to the legend of Fig S9 as well.

      We now refer to Fig S8d in the main text.

      Lines 283-284: “We confirmed the functionality of the Hfq protein expressed from the pQE-Hfq plasmid in our experimental conditions (Fig. S8d).”

      Reviewer #1 (Recommendations For The Authors):

      (1) Experimental regime to measure protein and mRNA levels.

      (a) Authors expose cells to ciprofloxacin for 2 hrs. They provide a justification via a mathematical model. However, in the absence of a measurement of protein and mRNA across time, it is unclear whether this single time point is sufficient to make the conclusion on RecB induction under double-strand break.

      In our experiments, we only aimed to compare recB mRNA and RecB protein levels in two steady-state conditions: no DNA damage and DNA damage caused by sublethal levels of ciprofloxacin. We did not aim to look at RecB dynamic regulation from non-damaged to damaged conditions – this would indeed require additional measurements at different time points. We revised this part of the results to ensure that our conclusions are stated as steady-state measurements and not as dynamic changes.

      Lines 203-205: “We used mathematical modelling to verify that two hours of antibiotic exposure was sufficient to detect changes in mRNA and protein levels and for RecB mRNA and protein levels to reach a new steady state in the presence of DNA damage.”

      (b) Authors use cell area to account for the elongation under damage conditions. However, it is unclear whether the number of copies of the recB gene are similar across these elongated cells. Hence, authors should report mRNA and protein levels with respect to the number of gene copies of RecB or chromosome number as well.

      Based on the experiments in DNA-damaging conditions, our main conclusion is that the average translational efficiency of RecB is increased in perturbed conditions. We believe that this conclusion is well supported by our measurements and that it does not require information about the copy number of the recB gene but only the concentration of mRNA and protein. We did observe lower recB mRNA concentration upon DNA damage in comparison to the untreated conditions, which may be due to a lower concentration of genomic DNA in elongated cells upon DNA damage, as we mention in lines 221-223.

      Our calculation of translation efficiency could be affected by variations of mRNA concentration across cells in the dataset. For example, longer cells that are potentially more affected by DNA damage could have lower concentrations of mRNA. We verified that this is not the case, as recB mRNA concentration is constant across cell size distribution (see the figure below or Figure S5a from Supplementary Information).

      Therefore, we do not think that the measurements of recB gene copy would change our conclusions. We agree that measuring recB gene copies could help to investigate the reason behind the lower recB mRNA concentration under the perturbed conditions as this could be due to lower DNA content or due to shortage of resources (such as RNA polymerases). However, this is a side observation we made rather than a critical result, whose investigation is beyond the scope of this manuscript.

      Author response image 1.

      (2) RecB as a proxy for RecBCD. Authors suggest that RecB levels are regulated by hfq. However, how does this regulatory circuit affect the levels of RecC and RecD? Ratio of the three proteins has been shown to be important for the function of the complex.

      A full discussion of the regulation of RecBCD complex formation would require a complete quantitative model based on precise information on the dynamics of complex formation, which is currently lacking.

      We can however offer the following (speculative) suggestions assuming that all three subunits are present in similar abundance in native conditions (DOI: 10.1038/s41598-019-44278-0 for RecB and RecC). As the complex is formed in a 1:1:1 ratio (DOI: 10.1038/nature02988), we propose that the regulation mechanism of RecB expression affects complex formation in the following way. If the RecB abundance becomes lower than the level of RecC and RecD subunits, the complex formation would be limited by the number of available RecB subunits and hence the number of functional RecBCDs will be decreased. Conversely, if the number of RecB is higher than the baseline, then, especially in the context of low numbers, we would expect that the probability of forming the RecBC complex (and then RecBCD) will be increased. Based on this simple explanation, we might speculate that regulation of RecB expression may be sufficient to regulate RecB levels and RecBCD complex formation. However, we feel that this argument is too speculative to be added to the manuscript.

      (3) Role of Hfq in RecB regulation. While authors show the role of hfq in recB translation regulation in non-damage conditions, it is unclear as to how this regulation occurs under damage conditions.

      (a) Have the authors carried out recB mRNA and protein measurement in hfq-deleted cells under ciprofloxacin treatment?

      We attempted to perform experiments in hfq mutants under ciprofloxacin treatment. However, the cells exhibited a very strong and pleiotropic phenotype: they had large size variability and shape changes and were also frequently lysing. Therefore, we did not proceed with mRNA and protein quantification because the data would not have been reliable. 

      (b) How do the authors propose that Hfq regulation is alleviated under conditions of DNA damage, when RecB translation efficiency increases?

      We propose that Hfq could be involved in a more global response to DNA damage as follows. 

      A proteomic study found that Hfq protein abundance decreases (~30%) upon DSB induction with ciprofloxacin (DOI: 10.1016/j.jprot.2018.03.002); we suggest that this decrease could explain the increased translational efficiency of RecB. While Hfq is a highly abundant protein, it has many targets (mRNA and sRNA), some of which are also highly abundant. Therefore the competition among the targets over Hfq proteins results in unequal (across various targets) outcomes (DOI: 10.1046/j.1365-2958.2003.03734.x), where the targets with higher Hfq binding affinity have an advantage over the ones with less efficient binding. We reason that upon DNA damage, a moderate decrease in the Hfq protein abundance (30%) can lead to a similar competition among Hfq targets where high-affinity targets outcompete low-affinity ones as well as low-abundant ones (such as recB mRNAs). Thus, the regulation of low-abundant targets of Hfq by moderate perturbations of Hfq protein level is a potential explanation for the change in RecB translation that we have observed. Potential reasons behind the changes of Hfq levels upon DNA damage would be interesting to explore; however, this would require a completely different approach and is beyond the scope of this manuscript.

      We have modified the text of the discussion to explain our reasoning:

      Lines 384-391: “A modest decrease (~30%) in Hfq protein abundance has been seen in a proteomic study in E. coli upon DSB induction with ciprofloxacin (DOI: 10.1016/j.jprot.2018.03.002). While Hfq is a highly abundant protein, it has many mRNA and sRNA targets, some of which are also present in large amounts (DOI: 10.1046/j.1365-2958.2003.03734.x). As recently shown, the competition among the targets over Hfq proteins results in unequal (across various targets) outcomes, where the targets with higher Hfq binding affinity have an advantage over the ones with less efficient binding (DOI: 10.1016/j.celrep.2020.02.016). In line with these findings, it is conceivable that even modest changes in Hfq availability could result in significant changes in gene expression, and this could explain the increased translational efficiency of RecB under DNA damage conditions.”

      (c) Is there any growth phenotype associated with recB mutant where hfq binding is disrupted in damage and non-damage conditions? Does this mutation affect cell viability when over-expressed or under conditions of ciprofloxacin exposure?

      We checked the phenotype and did not detect any difference in growth or cell viability affecting the recB-5 UTR* mutants either in normal conditions or upon exposure to ciprofloxacin. However, this is expected because the repair capacity is associated with RecB protein abundance and in this mutant, while translational efficiency of recB mRNA increases, the level of RecB proteins remains similar to the wild-type (Figure 5E).

      Minor points:

      (1) Introduction - authors should also discuss the role of RecFOR at sites of fork stalling, a likely predominant pathway for break generated at such sites.

      The manuscript focuses on the repair of DNA double-strand breaks (DSBs). RecFOR plays a very important role in the repair of stalled forks because of single-strand gaps but is not involved in the repair of DSBs (DOI: 10.1038/35003501). We have modified the beginning of the introduction to mention the role of RecFOR. 

      Lines 35-39: “For instance, replication forks often encounter obstacles leading to fork reversal, accumulation of gaps that are repaired by the RecFOR pathway (DOI: 10.1038/35003501) or breakage which has been shown to result in spontaneous DSBs in 18% of wild-type Escherichia coli cells in each generation (DOI: 10.1371/journal.pgen.1007256), underscoring the crucial need to repair these breaks to ensure faithful DNA replication.”

      (2) Methods: The authors refer to previous papers for the method used for single RNA molecule detection. More information needs to be provided in the present manuscript to explain how single molecule detection was achieved.

      We added additional information in the method section on the fitting procedure allowing quantifying the number of mRNAs per detected focus.

      Lines 515-530: “Based on the peak height and spot intensity, computed from the fitting output, the specific signal was separated from false positive spots (Fig. S1a). To identify the number of co-localized mRNAs, the integrated spot intensity profile was analyzed as previously described (DOI: 10.1038/nprot.2013.066). Assuming that (i) probe hybridization is a probabilistic process, (ii) binding each RNA FISH probe happens independently, and (iii) in the majority of cases, due to low-abundance, there is one mRNA per spot, it is expected that the integrated intensities of FISH probes bound to one mRNA are Gaussian distributed. In the case of two co-localized mRNAs, there are two independent binding processes and, therefore, a wider Gaussian distribution with twice higher mean and twice larger variance is expected. In fact, the integrated spot intensity profile had a main mode corresponding to a single mRNA per focus, and a second one representing a population of spots with two co-localized mRNAs (Fig. S1b). Based on this model, the integrated spot intensity histograms were fitted to the sum of two Gaussian distributions (see equation below where a, b, c, and d are the fitting parameters), corresponding to one and two mRNA molecules per focus. An intensity equivalent corresponding to the integrated intensity of FISH probes in average bound to one mRNA was computed as a result of multiple-Gaussian fitting procedure (Fig. S1b), and all identified spots were normalized by the one-mRNA equivalent.
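
      The one-mRNA versus two-mRNA model described in the passage above can be illustrated as follows. This is a minimal sketch and not the authors' procedure: the manuscript fits the intensity histogram to a sum of two Gaussians with fitting parameters a, b, c and d, whose equation is not reproduced here. The sketch instead swaps in an expectation-maximization fit on raw spot intensities, keeping only the stated constraint that the two-mRNA component has twice the mean and twice the variance of the one-mRNA component. All function names, starting values and the simulated data are assumptions for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_one_two_mrna(intensities, mu0, var0, w0=0.8, n_iter=200):
    """EM fit of a constrained mixture: N(mu, var) for one mRNA per spot and
    N(2*mu, 2*var) for two co-localized mRNAs. Returns (mu, var, w), where w
    is the estimated fraction of single-mRNA spots."""
    mu, var, w = mu0, var0, w0
    n = len(intensities)
    total = sum(intensities)
    for _ in range(n_iter):
        # E-step: probability that each spot holds a single mRNA
        resp = []
        for x in intensities:
            p1 = w * normal_pdf(x, mu, var)
            p2 = (1.0 - w) * normal_pdf(x, 2.0 * mu, 2.0 * var)
            s = p1 + p2
            resp.append(p1 / s if s > 0.0 else 0.5)
        n1 = sum(resp)
        n2 = n - n1
        # M-step: closed-form updates under the 2x-mean, 2x-variance constraint
        mu = total / (n1 + 2.0 * n2)
        var = sum(r * (x - mu) ** 2 + 0.5 * (1.0 - r) * (x - 2.0 * mu) ** 2
                  for r, x in zip(resp, intensities)) / n
        w = n1 / n
    return mu, var, w


def simulate_spots(n_single=400, n_double=100, mu=100.0, sd=15.0, seed=0):
    """Hypothetical integrated spot intensities (arbitrary units)."""
    rng = random.Random(seed)
    singles = [rng.gauss(mu, sd) for _ in range(n_single)]
    doubles = [rng.gauss(2.0 * mu, math.sqrt(2.0) * sd) for _ in range(n_double)]
    return singles + doubles


spots = simulate_spots()
mu_hat, var_hat, w_hat = fit_one_two_mrna(spots, mu0=90.0, var0=400.0)

# Normalize every spot by the one-mRNA intensity equivalent
mrna_per_spot = [x / mu_hat for x in spots]
print(f"one-mRNA equivalent: {mu_hat:.1f}, single-mRNA fraction: {w_hat:.2f}")
```

      The fitted mean of the first component plays the role of the one-mRNA intensity equivalent, and dividing each spot intensity by it gives the estimated number of mRNAs per spot.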

      Reviewer #2 (Recommendations For The Authors):

      Overall the work is carefully executed and highly compelling, providing strong support for the conclusions put forth by the authors.

      One point: the potential biological consequences of the post-transcriptional mechanism uncovered in the work would be enhanced if the authors could 1) tune RecB protein levels and 2) directly monitor the role that RecB plays in generating single-stranded DNA at DSBs.

      We agree that testing the viability of cells under tunable changes in RecB levels would be important to further investigate the biological role of the uncovered regulation mechanism. However, this is a very challenging experiment, as it is technically difficult to alter the low number of RecB proteins in a controlled manner that is homogeneous across cells, and it would require the development of precisely tunable and very low-abundant synthetic designs.

      We did monitor real-time RecB dynamics by tracking single molecules in live E. coli cells in a different study (DOI: 10.1101/2023.12.22.573010) that is currently under revision. There, reduced motility of RecB proteins was observed upon DSB induction indicating that RecB is recruited to DNA to start the repair process.

    1. Author Response:

      The following is the authors' response to the original reviews.

      Thank you for considering our manuscript “An Unexpected Role of Neutrophils in Clearing Apoptotic Hepatocytes In Vivo". We also thank the referees for their review. We have addressed their comments in detail and added new data to buttress our conclusions.

      Reviewer #1 (Public Review):

      This study by Cao et al. demonstrates a role for neutrophils in clearing apoptotic hepatocytes by directly burrowing into the apoptotic hepatocytes and ingesting the effete cells from inside without causing inflammation. The authors applied intravital microscopy, immunostaining and electron microscopy to visualize perforocytosis of neutrophils in hepatocytes. They also found that neutrophil depletion impairs the clearance of apoptotic hepatocytes, causing impaired liver function and generation of autoantibodies, implying a role of defective neutrophil-mediated clearance of apoptotic cells in autoimmune liver disease. The experiments were well designed and conducted, the results were reasonably interpreted, and the manuscript was clearly written with logical inputs.

      Thank you for your comments.

      One weak point is that the signals/mechanisms that determine why neutrophil specifically target apoptotic hepatocytes in liver and no other organs or cells is not clearly understood.

      We are still studying why neutrophils selectively phagocytose hepatocytes but not HUVEC or 293 cells. We have some intriguing preliminary data so far showing that apoptotic 293 cells have no significant increase of IL-1β production as compared with their nonapoptotic controls; both apoptotic 293 cells and HUVECs do not have increased surface selectin proteins (new Fig. S3C).

      Reviewer #2 (Public Review):

      […] By examination of HE-stained, noncancerous liver tissue sections from patients with hepatocellular carcinoma and hepatic hemangioma, the authors observed that cells with neutrophil nuclear morphology were inside apoptotic hepatocytes. The authors also further characterized this observation by staining the sections with neutrophil and apoptosis markers. In addition, the authors observed the same phenomena in mouse livers using intravital microscopy, which also recorded the time course of the disappearance of a neutrophil-associated apoptotic cell. The author went on further characterization of neutrophil-mediated efferocytosis of cultured hepatic cells in vitro and demonstrated the process was specific for apoptotic hepatic cells, but not HEK293 or endothelial cells. The in vitro system was then used to characterize the molecular bases for neutrophil-mediated efferocytosis of apoptotic hepatic cells. The evidence was provided to suggest that IL1b and IL-8 released from and selectins upregulated in apoptotic hepatic cells were important. Importantly, the authors used two methods to deplete the neutrophils and showed that the neutrophil depletion increased apoptotic cells in livers. Finally, the authors showed that neutrophil depletion caused defects in liver function parameters. At the end, the authors presented evidence to suggest that AIL disease may be due to defective neutrophils that fail to perform "perforocytosis."

      Thank you for your comments.

      Point #1. Although the evidence in its totality indicates that neutrophils burrow into apoptotic hepatocytes, the significance of this "perforocytosis" phenomenon and the circumstances under which it may occur remain to be better defined. In both neutrophil depletion models, the TUNEL-positive cells were not definitively identified but were instead assumed to be hepatocytes.

      Anatomically, the apoptotic hepatocytes are randomly distributed in the hepatic plate from the central vein to the portal region (please refer to the image below: hematoxylin staining of liver tissues, black arrowhead indicates perforocytosis sites).

      Author response image 1.

      Histologically, the structure of the liver/hepatic lobe is well defined, and the cell types in the liver are easy to identify based on their location, morphology and relationship to the hepatic plate and sinusoid. In addition, hepatocytes are well known for their rich cytoplasmic components, cellular connections and prominent large round nucleus. Thus, hepatocytes are very easy to identify even without using specific molecular markers such as E-cadherin or albumin. Based on these characteristics, the TUNEL-positive cells that we displayed in Fig. 5A are apoptotic hepatocytes.

      Point #2. In addition, there are discrepancies in the number of neutrophils and apoptotic cells in mouse liver studies; Fig. 2a WT (many neutrophils; locations unclear) vs Fig. 5A Ctr (a few neutrophils that appear in or near a vessel), and Fig. 2a DTR (a few apoptotic cells) vs Fig. 5A Depletion (many apoptotic cells).

      In response, Fig. 2A shows a larger area of the mouse liver (bar, 100 µm), while Fig. 5A shows a relatively small area of the liver sample (bars, 20 µm for Ctrl and 15 µm for DTR). Similarly, the apoptotic cells in Fig. 2A DTR are visible only when zoomed in. We apologize for the confusion; we did quantify the apoptotic cells in Fig. 2A WT vs DTR (see the bar graph next to the images in Fig. 2A).

      Point #3. Importantly, in Fig 5a Ctrl, which is presumably a section from a mouse without any surgical treatment or without inflammation, the sole TUNEL signal does not appear to be associated with neutrophils. Does this mean that "perforocytosis" primarily occurs in inflamed livers? (Of note, human liver samples in Fig 1 are from patients with tumors. There should be inflammation in the livers of these patients.)

      In Fig 5A Ctrl, the TUNEL signal indicates apoptotic hepatocytes. The neutrophils (stained with anti-NE antibody, red) are associated with the apoptotic hepatocyte (Fig. 5A). We observed that perforocytosis primarily occurs in normal noninflamed livers.

      Human liver samples in Fig 1 are from patients with tumors; hence it is possible that neutrophil burrowing is somehow associated with cancerous/inflammatory livers, as the reviewer pointed out. We ruled out this possibility based on our method of sample preparation and the experimental results themselves.

      1) Both noncancerous and cancerous liver samples were sliced based on the anatomical appearance of normal and cancer tissues (differences were rather easy to identify, and these samples were prepared by highly experienced pathologists from the Liver Cancer Center of Zhongshan Hospital, Shanghai). Furthermore, the results were confirmed by determining whether the surrounding tissue contained microlesions characteristic of metastatic tumors. We only counted apoptotic hepatocytes in noncancerous regions having normal liver lobes and morphologically normal hepatocytes, plates, sinusoid and Kupffer cells. We also excluded hepatoma, chronic inflammatory regions, and necrotic regions.

      2) We did not observe recruitment of neutrophils into apoptotic HCC cells, indicating that the clearance of apoptotic cancer cells was not mediated by neutrophils (unpublished observations).

      3) Normal human liver samples are hard for us to obtain. However, we did study samples from patients with liver hemangioma, which is characterized by aberrant vasculature in the liver but normal liver function; the hemangioma livers that we analyzed were histologically nearly identical to healthy liver (these samples contained no cancerous regions and showed no apparent cirrhosis or inflammation). We obtained similar results in these samples (shown in Fig. 1B; a total of 40 apoptotic hepatocytes were examined).

      4) Our data from normal mouse livers, isolated primary cells (hepatocytes and neutrophils) and cell lines (NCTC and HL60) all confirmed the central findings in this paper (Fig. 2, 3).

      Point #4. The data on human AIL patient neutrophils raises more questions: how many AIL patients have been examined? Do these AIL neutrophils lack IL1, IL8 receptors, and/or selectin ligands? Are there increases in apoptotic hepatocytes in AIL patients?

      In response, we have analyzed 16 AIL patient samples (see table below).

      Author response table 1.

      We performed a microarray assay to screen for differential gene expression in neutrophils from normal individuals and AIL patients. We found that the IL-1β receptor IL1R1 and the selectin-binding protein P-selectin glycoprotein ligand 1 (PSGL-1) were decreased in neutrophils from the AIL patients (new Fig 7D). These findings are consistent with our observations using cells and mouse models.

      Point #5. Additionally, the overall numbers of apoptotic cells even in the absence of neutrophils are rare; thus, it is questionable that such rarity of apoptotic cells can cause significant AIL phenotypes.

      We quantified apoptotic liver cells as percentages instead of overall numbers (Fig. 5; we were not able to precisely calculate the overall numbers, which could be large since billions of cells undergo apoptosis daily). Depletion of neutrophils increased the percentage of apoptotic cells about 5-6-fold in livers, and we observed the generation of autoantibodies (Fig. 6).

      Reviewer #1 (Recommendations For The Authors):

      This study by Cao et al. was well designed and conducted, the results were reasonably interpreted, and the manuscript was clearly written with logical inputs.

      It would further gain the significance of this study if authors could address the following questions:

      1.  What are the mechanisms/ signals that prevents AIL Liver neutrophils from burrowing into hepatocytes?

      We found that the IL-1β receptor IL1R1 and the selectin-binding protein P-selectin glycoprotein ligand 1 (PSGL-1) were decreased in neutrophils from the AIL patients (new Fig 7D).

      2.  Have authors looked if autoantigens expressed on hepatocytes, which are often found in autoimmune liver disease trigger signaling events that activate neutrophils to burrow?

      Thank you for the comment. We have not examined autoantigens expressed in hepatocytes and plan to carry out this research as suggested.

      3.  Is perforocytosis observed in apoptotic hepatocytes induced by different agents like LPS, TNF-a , rapamycin, alcohol etc?

      We did not observe perforocytosis in LPS- or TNF-a-treated hepatocytes. One possible reason is that the LPS or TNF-a we used induced massive necrosis instead of apoptosis. However, we did observe neutrophil perforocytosis in FasL-induced apoptotic hepatocytes (unpublished observations).

      Reviewer #2 (Recommendations For The Authors):

      In addition to the questions raised in the "Public review" section, the authors are also recommended to address the following issues:

      1) Why is CD11b+ not associated with the apoptotic sites, as neutrophils express CD11b?

      We have co-immunostained human liver samples with CD11b antibody (from Abcam: ab133357) and MPO antibody (from R&D: AF3667) and observed that tissue-infiltrating neutrophils in livers have low to undetectable levels of CD11b expression (please refer to the image below; white arrowheads point to neutrophils). Few CD11b+ cells in liver tissues express MPO (the CD11b+ cells are mostly macrophages, unpublished observations).

      Based on these data, we conclude that CD11b is hardly expressed in neutrophils inside livers.

      Author response image 2.

      2) Can TUNEL signals in Fig. S1C be from apoptotic neutrophils?

      In response, fragmentation of the nucleus is a hallmark of apoptosis; hence TUNEL staining will uniformly label all fragmented parts of an apoptotic nucleus. The nuclei of NE+ neutrophils are not labelled by TUNEL staining in Fig. S1C. The TUNEL+ nuclear fragments seen inside neutrophils are nuclear debris of apoptotic hepatocytes phagocytosed by neutrophils (Fig. S1C).

      3) The Fig 2B experiment may be done with induced apoptosis so that neutrophil burrowing steps may be recorded from the very beginning and a better time course for the entire process can be assessed.

      Thank you for the suggestion. We have tried many times under various conditions but have not yet succeeded in capturing the very beginning of perforocytosis in vivo. We are continuing to work on this.

      4) In "we found that U937 cells exhibited much lower phagocytosis of apoptotic NCTC cells than did HL60 cells (Fig. S2B, C)," the citation should be only S2C.

      Thank you for pointing this out. We have corrected this in the manuscript.

      5) Both neutrophil depletion models cause neutrophil death, which may complicate the interpretation of the liver function and AIL disease phenotypes. A neutropenic model such as G-CSFR−/− or Cebpe-/- mice may be used to avoid the caveat of antibody/DTR-dependent depletion models.

      Thank you for this thoughtful suggestion. We have also induced AIL phenotypes in mice by using α-Galcer. α-Galcer did not cause neutrophil death but impaired neutrophil perforocytosis and further generated AIL phenotypes in mice (unpublished observations). We plan to perform similar experiments in G-CSFR−/− or Cebpe−/− mice as the reviewer suggested.

      6) RNAi silencing experiments need additional controls for off-target effects

      These RNAi silencing constructs were purchased from Santa Cruz Biotechnology, and the off-target effects have been tested by the company. No significant off-target effects have been detected according to the manufacturer's report.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment:

      This manuscript is a valuable study of the responses of GPi neurons to DBS stimulation in human PD and dystonia patients and it finds evidence for altered short-term and long-term plasticity in response to DBS between the two patient populations. This data set is of interest to both basic and clinical researchers working in the field of DBS and movement disorders. While there was enthusiasm for the potential significance of these findings, support for their conclusions was incomplete. Their data may be indicative of more interesting and complex interpretations than currently considered in the article.

      The authors would like to express their gratitude to the Editorial Team and Reviewers for their invaluable feedback which helped to improve the manuscript.

      Reviewer #1:

      Summary:

      Sumarac et al investigate differences in globus pallidus internus (GPi) spike activity and short- and long-term plasticity of direct pathway projections in patients with Parkinson's disease (PD) and dystonia. Their main claims are that GPi neurons exhibit distinct characteristics in these two disorders, with PD associated with specific power-frequency oscillations and dystonia showing lower firing rates, increased burstiness, and less regular activity. Additionally, long-term plasticity and synaptic depression appear to differ between the two conditions. The authors suggest that these findings support the concept of hyperfunctional GPi output in PD and hypofunctional output in dystonia, possibly driven by variations in the plasticity of striato-pallidal synapses. Overall enthusiasm is relatively high, but I think the discussion omits discussing findings that don't align well with standard models. 

      Strengths: 

      These types of studies are valuable as the data arise from patients who have dystonia or PD. This could provide unique insights into disease pathophysiology that might not be recapitulated in animal systems work. 

      Thank you for the positive feedback.

      Weaknesses: 

      - The rate model and indirect/direct pathway ideas lack explanatory power; too much of the hypothesis generation and discussion in this manuscript is set in the context of these old ideas. Their data in my view emphasize this somewhat emphatically. Most patients with the 'hypokinetic' movement disorder PD have dystonia as a part of their motor features. Dystonia is a form of excessive muscle activation that on the one hand is 'hyperkinetic' but on the other usually decreases the speed of motor tasks, even in patients with primary dystonia. Similarly, PD patients display a bewildering variety of hyperkinetic manifestations as well (rest tremor, dystonia, dyskinesia). If these are truly independent classifications, i.e. hyper- versus hypo-kinetic, the authors must acknowledge that there is considerable overlap in the spike activity across groups - numerous dystonia patients display higher discharge rates than the majority of the PD sample. Based on the firing rate alone, it would not be possible to distinguish these groups. 

      Thank you for your insightful comments regarding the discussion of the rate model and the distinction between hyperkinetic and hypokinetic movement disorders. We acknowledge that the rate model, primarily derived from a limited number of animal subjects [1], may not fully encapsulate the complexities of Parkinson's disease (PD) and dystonia. Our study aimed to validate animal model findings in humans by correlating single-neuron features with disease symptom severity. However, we concur with the Reviewer’s comment regarding the overlapping motor features in hypokinetic and hyperkinetic disorders. We can speculate that the overlap in neuronal properties may be reflected in the overlap of symptoms, for example, hyperkinetic features also being present in PD, as suggested by the Reviewer. Per the Reviewer’s request, we have now acknowledged this notion in the manuscript. Interestingly, hypokinetic symptoms have been reported to occur in dystonia in response to GPi-stimulation and have been associated with beta activity in the LFP [2], which reinforces the notion that neural activity may be more related to specific symptoms rather than diseases as a whole. Supplementing our analyses, in addition to total UPDRSIII scores, we have now provided correlations with only hypokinetic (i.e. bradykinesia) subscores of the UPDRSIII to focus on a more direct assessment of hypokinetic features in PD versus hyperkinetic features in dystonia. We have updated our methods and results accordingly.

      [1] M. R. DeLong, “Primate models of movement disorders of basal ganglia origin.,” Trends Neurosci, vol. 13, no. 7, pp. 281–285, Jul. 1990, doi: 10.1016/0166-2236(90)90110-v.

      [2] R. Lofredi et al., “Pallidal Beta Activity Is Linked to Stimulation-Induced Slowness in Dystonia,” Movement Disorders, vol. 38, no. 5, pp. 894–899, 2023, doi: 10.1002/mds.29347.

      Amendments to the manuscript:

      “Indeed, variability in spike firing rates in PD may be reflected in the considerable overlap in spiking activity between PD and dystonia (Fig. 1A), with many dystonia patients exhibiting higher discharge rates compared to PD patients.”

      “Given that UPDRSIII includes both hypokinetic and hyperkinetic symptoms of PD, we further sought to disaggregate the score by only considering items 23-26 in UPDRSIII, which assess hypokinetic symptoms of PD.”

      “… with a marginally stronger correlation for PD hypokinetic symptoms only (items 23-26 of UPDRSIII, Spearman's rho=0.32, p=.0330; Supplementary Fig. 3)”

      Supplementary Fig. 3: We provided correlations with hypokinetic (i.e., bradykinesia) subscore of the UPDRSIII. There is very little difference between correlation results of UPDRSIII total (Fig. 1) and the hypokinetic-only subscore (Supplementary Fig. 3).

      “though our results do not change substantially when only hypokinetic PD features are considered (Supplementary Fig. 3).”

      - If beta power is pathognomonic of parkinsonism, the authors found no differences in beta-related spike discharges across the groups. One would have predicted greater beta power in PD than in primary dystonia. This should be discussed explicitly and an interpretation should be provided. 

      We agree with the reviewer that, considering the previous LFP literature, one might have expected a difference in single-neuron oscillation power between PD and dystonia. However, while prior studies [3], [4] have reported significant differences in oscillatory power between the two diseases, those studies examined local field potential (LFP) activity only. Other work [5] in non-human primates investigated single-neuron oscillations and reported no differences between PD and dystonia at the single-neuron level, in line with our findings. Despite the lack of difference in overall power presented here, we provide evidence that the strength of the beta-frequency single-neuron oscillations nevertheless correlates with symptom severity in PD but not dystonia, whereas the strength of the theta-frequency single-neuron oscillations correlates with symptom severity in dystonia but not PD.

      [3] P. Silberstein et al., “Patterning of globus pallidus local field potentials differs between Parkinson’s disease and dystonia.,” Brain, vol. 126, no. Pt 12, pp. 2597–2608, Dec. 2003, doi: 10.1093/brain/awg267.

      [4] D. D. Wang et al., “Pallidal Deep-Brain Stimulation Disrupts Pallidal Beta Oscillations and Coherence with Primary Motor Cortex in Parkinson’s Disease,” J Neurosci, vol. 38, no. 19, pp. 4556–4568, May 2018, doi: 10.1523/JNEUROSCI.0431-18.2018.

      [5] P. A. Starr et al., “Spontaneous pallidal neuronal activity in human dystonia: comparison with Parkinson’s disease and normal macaque.,” J Neurophysiol, vol. 93, no. 6, pp. 3165–3176, Jun. 2005, doi: 10.1152/jn.00971.2004.

      Amendments to the manuscript:

      “Although previous research has reported differences in the LFP power between PD and dystonia [27,28], a study in non-human primates found no such differences in single-neuron oscillatory strength [8], as reflected in our findings. However, despite a lack of difference in overall power across disorders, we were able to derive disease/frequency-specific relationships with respect to clinical scores (Fig. 1C; oscillatory features).”

      - The study lacks a healthy control group, making it challenging to differentiate disease-specific findings from normal variations in GPi activity and plasticity. Although this is acknowledged in the discussion, this complicates the interpretation of the results. The sample sizes for PD and dystonia patients are relatively small, and the study combines various forms of dystonia, potentially masking subtype-specific differences. A larger and more homogenous sample could enhance the study's reliability.

      Indeed, intraoperative microelectrode recordings cannot be obtained in healthy individuals. We agree with the Reviewer that this limits the interpretation of the data. However, directly comparing clinical correlations with single-neuron readouts between two distinct clinical entities may, to some degree, compensate for the lack of healthy control data. This contrast, while not providing a healthy control, is still able to point to disease-specific differences. This approach has previously been used for comparisons at the LFP level [6]. While the sample size is indeed small, it is comparable to, or even larger than, that of similar studies that have investigated the relation between symptom severity and single-neuron readouts [7]. The Reviewer is right that we do not differentiate between generalized and cervical dystonia. We chose to do so because our subgroup analysis provided in the Supplementary Material did not suggest specific differences, though there are insufficient data from specific dystonia subtypes to make formal statistical comparisons. Indeed, future studies should investigate specific subtypes further.

      [6] R. Lofredi et al., “Pallidal beta bursts in Parkinson’s disease and dystonia,” Movement Disorders, vol. 34, no. 3, pp. 420–424, 2019, doi: 10.1002/mds.27524.

      [7] A. Gulberti et al., “Subthalamic and nigral neurons are differentially modulated during parkinsonian gait,” Brain, p. awad006, Feb. 2023, doi: 10.1093/brain/awad006.

      Amendments to the manuscript:

      “While we did not observe differences across dystonia subtypes (Supplementary Fig. 1), future studies in larger patient cohorts are warranted. Finally, as many findings in Fig. 1 do not survive corrections for multiple comparisons, we suggest interpretation of results with caution. Despite this, many of our findings related to neuronal correlates are generally in line with previous literature, especially related to oscillatory correlates of PD and dystonia.”

      - While they mention that data are available on request, sharing data openly would increase transparency and allow for independent validation of the results. It is unclear how sharing deidentified data would compromise patient privacy or present ethical issues of any kind, as claimed by the authors. 

      Much of the data in question were collected under an old Research Ethics Board (REB) protocol which did not address data sharing. However, we have consulted with our REB and gained retroactive permission to post de-identified data which are now available in the Supplementary Material.

      Amendments to the manuscript:

      “The data that support the findings of this study are available in a public repository (see: https://osf.io/nqzd2/)”

      - They appropriately acknowledge several limitations, such as the inability to use pharmacological interventions and the need for further research in the chronic setting. 

      Thank you for the comment.

      - The manuscript highlights differences in GPi activity and plasticity between PD and dystonia but could provide more context on the clinical implications of these findings, particularly regarding what the implications would be for novel paradigms for deep brain stimulation. 

      Thank you for the comment. Our finding that striato-pallidal plasticity decays more slowly in dystonia compared to PD may relate to the slower time course of symptom relief associated with GPi-DBS in dystonia, as presently outlined in the discussion. On the other hand, symptoms are also suppressed for longer after the cessation of stimulation in dystonia compared to PD, which may reflect long-term plastic changes [8], [9]. In the context of clinical DBS, plasticity modulation may be facilitated by intermittent stimulation algorithms that achieve the necessary plastic network change by applying stimulation for a defined time, after which stimulation could be switched off to reduce energy consumption and perhaps to mitigate side effects. DBS devices with chronic sensing may enable monitoring of evoked potential amplitudes for future adaptive stimulation applications. Currently available devices are limited by low sampling rates, but future devices may overcome these technical limitations.

      [8] D. Ruge et al., “Deep brain stimulation effects in dystonia: time course of electrophysiological changes in early treatment.,” Mov Disord, vol. 26, no. 10, pp. 1913–1921, Aug. 2011, doi: 10.1002/mds.23731.

      [9] D. Ruge et al., “Shaping reversibility? Long-term deep brain stimulation in dystonia: the relationship between effects on electrophysiology and clinical symptoms.,” Brain, vol. 134, no. Pt 7, pp. 2106–2115, Jul. 2011, doi: 10.1093/brain/awr122.

      Amendments to the manuscript:

      “While further work is certainly required to better understand disease-related differences in plasticity, our findings may nevertheless motivate the development of periodic intermittent (ON/OFF) DBS strategies which periodically modulate synaptic plasticity for therapeutic benefits which outlast stimulation delivery, as have recently been employed in preclinical work [52,53].”

      - While statistical tests are mentioned, the manuscript could benefit from a more detailed presentation of statistical methods, including correction for multiple comparisons and effect sizes. Did the authors consider different recording sites within each patient as independent observations? I think this is not appropriate if that was the case. 

      Thank you for your constructive feedback. In response to the concerns regarding the statistical methods, we have expanded our analysis to provide a more comprehensive statistical overview. Specifically, we implemented the Bonferroni correction for multiple comparisons across the seven tests conducted for the differences in single-neuron features between PD and dystonia. The adjustment revealed that only the burst index and coefficient of variation retain statistical significance after post hoc correction, while the firing rate does not. Results of the Bonferroni corrections are now presented in Supplementary Table 3. Reflecting on the initial comment about firing rates between the two disorders, our updated findings underscore the limitation of using firing rates alone to differentiate between PD and dystonia; instead, our analysis now points to burstiness and firing irregularity as more reliable discriminators. Regarding the clinical correlations, we refined our statistical analysis by employing nonparametric Monte Carlo permutation tests with 5000 permutations, as used in recent work [10], [11]. This method was chosen for its independence from assumptions regarding data distribution. Specifically, we computed the Spearman rho and tested it for significance using the permutation test. Then, to address multiple comparisons, we controlled the false discovery rate (FDR) using the Benjamini-Hochberg procedure. Results of these comparisons are now presented in Supplementary Table 4. Lastly, to address the concern regarding recording site independence within patients, we updated our plasticity analysis methodology. In our study, 6 out of 18 patients had multiple recording sites. We therefore employed linear mixed models (LMM) with patient ID as a random factor to account for the non-independence of these observations.

      [10] R. Lofredi et al., “Dopamine-dependent scaling of subthalamic gamma bursts with movement velocity in patients with Parkinson’s disease,” Elife, vol. 7, p. e31895, Feb. 2018, doi: 10.7554/eLife.31895.

      [11] R. Lofredi et al., “Subthalamic beta bursts correlate with dopamine-dependent motor symptoms in 106 Parkinson’s patients,” npj Parkinsons Dis., vol. 9, no. 1, Art. no. 1, Jan. 2023, doi: 10.1038/s41531-022-00443-3.

      Amendments to the manuscript:

      “For comparing differences in single-neuron features between PD and dystonia, significant results were followed up with post hoc multiple comparisons with a Bonferroni correction. For clinical correlations, non-parametric Monte Carlo permutation tests were used, avoiding assumptions about data distribution. The tested values were randomly shuffled 5,000 times to form a probability distribution, with the p-value reflecting the original sample rank. All tests underwent adjustment for multiple comparisons, controlling the false discovery rate (FDR) at an α-level of 0.05.”
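To make the permutation procedure described above concrete, the following is a minimal, standard-library sketch of a Monte Carlo permutation test on Spearman's rho. It is not the authors' analysis code. The function names, the two-sided comparison, the add-one correction to the p-value, and the fixed random seed are all assumptions made for illustration.

```python
import random
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(x, y, n_permutations=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value for Spearman's rho.

    Assumptions (not taken from the manuscript): two-sided test,
    add-one correction, and a fixed seed for reproducibility.
    """
    rng = random.Random(seed)
    observed = spearman_rho(x, y)
    shuffled = list(y)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any pairing between x and y
        # Small tolerance guards against floating-point ties.
        if abs(spearman_rho(x, shuffled)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_permutations + 1)
```

Calling `permutation_p_value(feature_values, clinical_scores)` with two equal-length lists (hypothetical names) returns the observed rho and its permutation p-value. The p-value is the fraction of shuffles whose absolute rho is at least as large as the observed one.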

      “analyzed using a linear mixed model (LMM) with patient ID as a random factor, normalized fEP amplitudes as the response variable, and epoch as a fixed effect”

      “using a LMM with patient ID as a random factor”
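One way to write the mixed model described in these amendments is given below. This is a sketch under stated assumptions rather than the authors' exact specification: a random intercept per patient (no random slopes), epoch coded as an indicator variable, and Gaussian residuals are assumed. The indices are hypothetical: $i$ denotes the patient and $j$ an observation (recording site and epoch) within that patient.

```latex
y_{ij} = \beta_0 + \beta_1\,\mathrm{epoch}_{ij} + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $y_{ij}$ is the normalized fEP amplitude, $\beta_1$ is the fixed effect of epoch, and the patient-specific term $u_i$ absorbs the correlation among recording sites from the same patient, so that those sites are not treated as independent observations.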

      “However, none of the clinical correlations survived Benjamini-Hochberg FDR-correction for multiple comparisons (Supplementary Table 4).”
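For readers less familiar with the two corrections mentioned in this response, the sketch below contrasts Bonferroni adjustment with Benjamini-Hochberg adjustment of a list of p-values. It is illustrative only and not the authors' code; the function names are assumptions, and any p-values passed to it in examples are placeholders rather than values from the study.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A test is declared significant at an FDR of 0.05 when its Benjamini-Hochberg adjusted p-value is at most 0.05. Bonferroni is the stricter of the two, because every p-value is multiplied by the full number of tests rather than by the number of tests divided by its rank.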

      “In PD, fEP amplitudes were significantly greater after compared to before HFS (LMM; p = .0075, effect size = 5.42 ± 1.79; Fig. 2C), while in dystonia, the increase approached but did not reach statistical significance (LMM; p = .0708, effect size = 2.82 ± 1.45; Fig. 2C).”

      All statistics were updated in the results section and the figures.

      “Finally, as many findings in Fig. 1 do not survive corrections for multiple comparisons, we suggest interpretation of results with caution. Despite this, many of our findings related to neuronal correlates are generally in line with previous literature, especially related to oscillatory correlates of PD and dystonia.”

      - The manuscript could elaborate on the potential mechanisms underlying the observed differences in GPi activity and plasticity and their relevance to the pathophysiology of PD and dystonia. 

      Thank you for your feedback. We have enhanced the manuscript by integrating additional discussions on previous studies related to plasticity in dystonia and PD (e.g., [12], [13]), which highlight excessive plasticity in dystonia. Although these may appear contradictory to our findings of increased plasticity in PD compared to dystonia, we propose (also justified by previous literature) that chronic dopaminergic medication use may lead to synaptic over-sensitization, which has been hypothesized as a biological mechanism underlying levodopa-induced dyskinesias (a hyperkinetic feature) in PD [14].

      [12] Y. Tamura et al., “Disordered plasticity in the primary somatosensory cortex in focal hand dystonia.,” Brain, vol. 132, no. Pt 3, pp. 749–755, Mar. 2009, doi: 10.1093/brain/awn348.

      [13] D. A. Peterson, T. J. Sejnowski, and H. Poizner, “Convergent evidence for abnormal striatal synaptic plasticity in dystonia.,” Neurobiol Dis, vol. 37, no. 3, pp. 558–573, Mar. 2010, doi: 10.1016/j.nbd.2009.12.003.

      [14] P. Calabresi, B. Picconi, A. Tozzi, V. Ghiglieri, and M. Di Filippo, “Direct and indirect pathways of basal ganglia: a critical reappraisal.,” Nat Neurosci, vol. 17, no. 8, pp. 1022–1030, Aug. 2014, doi: 10.1038/nn.3743.

      Amendments to the manuscript:

      “Converging evidence from past animal and human studies suggests that dystonia is associated with impaired synaptic function and abnormal synaptic plasticity [35–37]. Compared to healthy controls, it has been shown that transcranial magnetic stimulation induced motor evoked potentials (MEPs) are hyperexcitable in dystonia [38,39], and somatosensory and motor cortical plasticity is greater [40]. Likewise, enhanced long-term potentiation at cortico-striatal synapses has been shown in rodent models of dystonia [41,42]. While our finding that long term potentiation effects are greater in PD compared to dystonia (Fig. 2D) is difficult to corroborate with this literature, one potential explanation can be that all of our PD patients are long-term users of levodopa. We have previously shown that the intake of this antiparkinsonian dopaminergic medication leads to potent increases in the magnitude of direct pathway plasticity [15]. Although patients are 12hr withdrawn from antiparkinsonian medications for surgery, it could be that striato-pallidal synapses are nevertheless chronically over-sensitized from prolonged use of dopaminergic medication; which is a well-known hypothesis related to the manifestation of levodopa-induced dyskinesias (a hyperkinetic feature) in PD [43]. Indeed, a lack of depotentiation of striato-pallidal projections has previously been observed in patients with levodopa-induced dyskinesias [44]. As such, excessive plasticity of these projections may corroborate hyperkinetic features of dystonia and levodopa-induced dyskinesias in PD.”

      Reviewer #2: 

      Summary: 

      The authors investigated how neuronal activity and metrics of plasticity using local electrical stimulation in the GPi were different between Parkinson's disease and dystonia patients. 

      Strengths: 

      The introduction highlights the importance of the work and the fundamental background needed to understand the rest of the paper. It also clearly lays out the novelty (i.e., that the dynamics of plastic effects in GPi between dystonia and PD have not been directly compared). 

      The methods are clearly described and the results are well organized in the figures. 

      The results are strong with measurements from a large population of patients for each disease group and with distinct findings for each group. 

      Thank you for the kind appraisal.

      Weaknesses: 

      The discussion was hard to follow in several places, making it difficult to fully appreciate how well the authors' claims and conclusions are justified by their data, mostly in relation to the plasticity results. It may help to summarize the relevant findings for each section first and then further expand on the interpretation, comparison with prior work, and broader significance. Currently, it is hard to follow each section without knowing which results are being discussed until the very end of the section. With the current wording in the "Neuronal correlates.." section, it is not always clear which results are from the current manuscript, and where the authors are referring to past work.

      Thank you for this feedback. The main findings are now summarized in a paragraph at the beginning of the Discussion section, before being discussed in comparison to other studies in the literature in subsequent sub-sections. Moreover, throughout the Discussion, findings from our study are now always reflected by a reference to the relevant figure to more easily differentiate current findings from previous literature. Additionally, Discussion sub-sections have been expanded to consider additional literature in response to various comments throughout the Review process (including the subsequent Review comment).

      Amendments to the manuscript:

      Paper findings are referenced to figures which depict the results at hand; discussion sub-sections expanded; and the following text has been added at the start of the Discussion:

      “In particular, we found that GPi neurons exhibited lower firing rates, but greater burstiness and variability in dystonia compared to PD (Fig. 1A). While no differences were found in the power of spiketrain oscillations across disorders (Fig. 1B), we found that PD symptom severity positively correlated with the power of low-beta frequency spiketrain oscillations, whereas dystonia symptom severity positively correlated with the power of theta frequency spiketrain oscillations (Fig. 1C). Dystonia symptom severity moreover correlated negatively with firing rate, and positively with neuronal variability. These results are discussed in greater detail with respect to previous literature in the subsequent Discussion section entitled “Neuronal correlates of PD and dystonia.” In response to electrical stimulation (protocol depicted in Fig. 2A), we found significant increases in the amplitudes of positive-going stimulation-evoked field potentials (considered to reflect striato-pallidal synaptic strength; as exemplified in Fig. 2B) before versus after HFS in both PD and dystonia (Fig. 2C); with recording sites in PD exhibiting significantly greater increases (Fig. 2D). While changes to evoked potential amplitude before versus after stimulation can be considered to be reflective of long-term plasticity [15,18], the dynamics of evoked potentials during HFS (as depicted in Fig. 2E) can be considered as reflective of short-term synaptic plasticity [18,21]. To this end, our findings are suggestive of faster latency synaptic depression in PD compared to dystonia (Fig. 2F/G). Plasticity findings are discussed in greater detail in the Discussion section entitled “Direct pathway plasticity.””

      Also, I felt that more discussion could be used to highlight the significance of the current results by comparing and/or contrasting them to prior relevant work and mechanisms. The novelty or impact is not very clear as written. Could this be further substantiated in the Discussion? 

      Thank you for the feedback. The discussion has been expanded to include additional literature that is relevant to the findings reported in the manuscript. For example, with regards to the neuronal correlates sub-section, we now highlight the important findings [15] that show changes to the discharge rates and oscillatory tendencies of GPi neurons in non-human primates in response to staged MPTP applications to progressively titrate motor severity; these results substantiate our lack of correlation with firing rates in PD, and presence of a clinical correlation with beta oscillations. We additionally now emphasize human studies that found LFP power difference between PD and dystonia [3], [4]; but simultaneously highlight studies that did not find such differences in spike-train oscillations (in non-human primates) [5], which is reflective of our own findings. With regards to our plasticity sub-section, we have added new content related to previous literature on plasticity in dystonia and PD (also addressed in response to a query from Reviewer #1). For example, we bring to light a variety of previous studies [12], [13] emphasizing excessive plasticity in dystonia. However, while such studies may seem to contradict our findings of greater plasticity in PD compared to dystonia, we additionally provide hypotheses (justified by previous literature) that prolonged use of dopaminergic medication may result in synaptic over-sensitization, thus giving rise to levodopa-induced dyskinesias (a hyperkinetic feature) in PD [14].

      [3] P. Silberstein et al., “Patterning of globus pallidus local field potentials differs between Parkinson’s disease and dystonia.,” Brain, vol. 126, no. Pt 12, pp. 2597–2608, Dec. 2003, doi: 10.1093/brain/awg267.

      [4] D. D. Wang et al., “Pallidal Deep-Brain Stimulation Disrupts Pallidal Beta Oscillations and Coherence with Primary Motor Cortex in Parkinson’s Disease,” J Neurosci, vol. 38, no. 19, pp. 4556–4568, May 2018, doi: 10.1523/JNEUROSCI.0431-18.2018.

      [5] P. A. Starr et al., “Spontaneous pallidal neuronal activity in human dystonia: comparison with Parkinson’s disease and normal macaque.,” J Neurophysiol, vol. 93, no. 6, pp. 3165–3176, Jun. 2005, doi: 10.1152/jn.00971.2004.

      [12] Y. Tamura et al., “Disordered plasticity in the primary somatosensory cortex in focal hand dystonia.,” Brain, vol. 132, no. Pt 3, pp. 749–755, Mar. 2009, doi: 10.1093/brain/awn348.

      [13] D. A. Peterson, T. J. Sejnowski, and H. Poizner, “Convergent evidence for abnormal striatal synaptic plasticity in dystonia.,” Neurobiol Dis, vol. 37, no. 3, pp. 558–573, Mar. 2010, doi: 10.1016/j.nbd.2009.12.003.

      [14] P. Calabresi, B. Picconi, A. Tozzi, V. Ghiglieri, and M. Di Filippo, “Direct and indirect pathways of basal ganglia: a critical reappraisal.,” Nat Neurosci, vol. 17, no. 8, pp. 1022–1030, Aug. 2014, doi: 10.1038/nn.3743.

      [15] A. Muralidharan et al., “Physiological changes in the pallidum in a progressive model of Parkinson’s disease: Are oscillations enough?,” Exp Neurol, vol. 279, pp. 187–196, May 2016, doi: 10.1016/j.expneurol.2016.03.002.

      Amendments to the manuscript:

      “Despite the lack of correlations with firing rate in PD, our findings seem to align with those of Muralidharan and colleagues [25], who showed that GPi neuronal firing rates may not directly correlate with motor severity but exhibit variability across the disease severity continuum in parkinsonian non-human primates (initially increasing, then decreasing, then increasing again at mild, moderate, and severe disease manifestations, respectively). Thus, while GPi discharge rates may change in PD, such changes may not be reflected by linear relationships with motor sign development and progression. Indeed, variability in spike firing rates in PD may be reflected in the considerable overlap in spiking activity between PD and dystonia (Fig. 1A), with many dystonia patients exhibiting higher discharge rates compared to PD patients. While differences in discharge rates were nevertheless observed between PD and dystonia, it may be that the combination of rate and pattern (reflected in the BI and CV) changes best differentiates the two disorders.”

      “Converging evidence from past animal and human studies suggests that dystonia is associated with impaired synaptic function and abnormal synaptic plasticity [35–37]. Compared to healthy controls, it has been shown that transcranial magnetic stimulation induced motor evoked potentials (MEPs) are hyperexcitable in dystonia [38,39], and somatosensory and motor cortical plasticity is greater [40]. Likewise, enhanced long-term potentiation (LTP) at cortico-striatal synapses has been shown in rodent models of dystonia [41,42]. While our finding that LTP effects are greater in PD compared to dystonia (Fig. 2D) is difficult to corroborate with this literature, one potential explanation can be that all of our PD patients are long-term users of levodopa. We have previously shown that the intake of this antiparkinsonian dopaminergic medication leads to potent increases in the amount of plasticity elicited in GPi [15]. Although patients are 12hr withdrawn from antiparkinsonian medications for surgery, it could be that striato-pallidal synapses are nevertheless chronically over-sensitized from prolonged use of dopaminergic medication; which is a well-known hypothesis related to the manifestation of levodopa-induced dyskinesias (a hyperkinetic feature) in PD [43]. Indeed, a lack of depotentiation of striato-pallidal projections has previously been observed in patients with levodopa-induced dyskinesias [44]. As such, excessive plasticity of these projections may corroborate hyperkinetic features of dystonia and levodopa-induced dyskinesias in PD.”

      Some specific comments and questions about the Discussion: 

      Lines 209-211 - This sentence was hard to understand, could it be clarified? 

      Lines 211-213 - What do phasic and tonic components mean exactly? Could this be specifically defined? Are there specific timescales (as referred to in Intro)?

      Lines 215-217 - It's not clear what was delayed in dystonia, and how the authors are trying to contrast this with the faster time course in PD. I think some of this is explained in the introduction, but could also be re-summarized here as relevant to the results discussed. 

      Lines 223-224 - I'm not sure I follow the implication that network reorganization leads to delayed functional benefits. Could this be further elaborated? 

      Reply & Amendments to the manuscript: Thank you for your feedback. We've made the following concise revisions to address the comments:

      We've clarified lines 209-211 to explain that variations in electrical stimulation effects on pathways in PD and dystonia may reveal the operational mechanisms of DBS, despite a common target:

      “The variation in the modulation of these projections / pathways to electrical stimulation may also indicate the mechanism by which DBS operates across PD and dystonia, despite a common stimulation target.”

      In response to the second comment on lines 211-213 about phasic and tonic components, we now specify that phasic refers to dynamic muscle contractions, and tonic to continuous muscle contractions, providing clear definitions relevant to our context:

      “Clinical studies in dystonia have shown that DBS leads to a more rapid improvement in the transient, dynamic muscle contractions (phasic components) of the disorder when compared to the sustained, continuous muscle contractions (tonic or fixed components) [33]”

      For lines 215-217, we've refined our discussion to clearly contrast the delayed response in dystonia with the faster onset in PD:

      “This contrasts with PD, where the maximal clinical response to DBS occurs within a much faster time course [13,36].”

      On lines 223-224, we've expanded the explanation of how network reorganization may lead to delayed functional benefits, highlighting adjustments in neural connectivity and synaptic efficacy in response to stimulation:

      “which involves adjustments in neural connectivity or synaptic efficacy in response to the stimulation [14,35].”

      Could the absence of a relationship between FR and disease in PD be discussed? 

      Thank you for raising this point. Given that we observed higher firing rates in PD compared to dystonia, the rate model of PD [1] would predict that these rates correlate with symptom severity, so the absence of such a correlation is unexpected. However, despite the lack of correlations with firing rates, our findings align with similar animal work of Muralidharan et al. [15], which reported that neuronal firing rates within the GPi of rhesus monkeys did not increase linearly with respect to varying intensities of parkinsonian motor severity. We did however show that low beta oscillatory strength within the GPi may play a significant role in the manifestation of motor symptoms in PD, which is also in line with findings of Muralidharan and colleagues. As per the Reviewer’s request, we have included this content in our discussion.

      [1] M. R. DeLong, “Primate models of movement disorders of basal ganglia origin.,” Trends Neurosci, vol. 13, no. 7, pp. 281–285, Jul. 1990, doi: 10.1016/0166-2236(90)90110-v.

      [15] A. Muralidharan et al., “Physiological changes in the pallidum in a progressive model of Parkinson’s disease: Are oscillations enough?,” Exp Neurol, vol. 279, pp. 187–196, May 2016, doi: 10.1016/j.expneurol.2016.03.002.

      Amendments to the manuscript:

      “Despite the lack of correlations with firing rate in PD, our findings seem to align with those of Muralidharan and colleagues [25], who showed that GPi neuronal firing rates may not directly correlate with motor severity but exhibit variability across the disease severity continuum in parkinsonian non-human primates (initially increasing, then decreasing, then increasing again at mild, moderate, and severe disease manifestations, respectively). Thus, while GPi discharge rates may change in PD, such changes may not be reflected by linear relationships with motor sign development and progression.”

      “Indeed, Muralidharan and colleagues [25] also showed linear group-level relationships between low-beta frequency spiketrain oscillations and disease severity in parkinsonian non-human primates, despite the lack of linear relationships with spike discharge rates (as discussed above).”

      It wasn't very clear how the direct pathway can be attributed to plasticity changes if the GPi makes up both the direct and indirect pathways. Could this be further clarified? 

      The reviewer brings up an important, nuanced point. Recent work from our lab [16] shows that inhibitory evoked fields in STN (which receives inhibitory inputs from GPe and no other inhibitory sources) are persistent, with very minimal depression during HFS. On the other hand, inhibitory fields in the SNr (which receives the majority of its inhibitory inputs from striatum, though some come by way of GPe as well per the anatomical literature) depress quickly. We have previously also shown these rapidly depressing fields in GPi [17], [18], which also receives the majority of its inhibitory inputs via striatum, though some also from GPe. As such, the disaggregation of striatum-mediated versus GPe-mediated inhibitory fields is achieved based on: lack of rapidly depressing inhibitory evoked field potentials in STN (which receives inhibitory inputs via GPe and not striatum), but a common presence of rapidly depressing evoked field potentials in SNr and GPi (which both receive most of their inhibitory inputs from striatum); differences in the morphology of purportedly GPe- (fast latency) versus striatum-mediated (slow latency) evoked field potentials [16]; and the presence of slow latency caudato-nigral evoked field potentials in slices [19] that are reversed by GABA antagonist application [20]. These points are indeed outlined in the first paragraph of the Discussion sub-section “Direct pathway plasticity.” However, we have now added a point to the Limitations that inhibitory inputs to the GPi also come by way of GPe, though in lesser abundance.

      [16] L. A. Steiner et al., “Persistent synaptic inhibition of the subthalamic nucleus by high frequency stimulation,” Brain Stimul, vol. 15, no. 5, pp. 1223–1232, 2022, doi: 10.1016/j.brs.2022.08.020.

      [17] L. D. Liu, I. A. Prescott, J. O. Dostrovsky, M. Hodaie, A. M. Lozano, and W. D. Hutchison, “Frequency-dependent effects of electrical stimulation in the globus pallidus of dystonia patients.,” J Neurophysiol, vol. 108, no. 1, pp. 5–17, Jul. 2012, doi: 10.1152/jn.00527.2011.

      [18] L. Milosevic et al., “Modulation of inhibitory plasticity in basal ganglia output nuclei of patients with Parkinson’s disease,” Neurobiology of Disease, vol. 124, pp. 46–56, Apr. 2019, doi: 10.1016/j.nbd.2018.10.020.

      [19] M. Yoshida and W. Precht, “Monosynaptic inhibition of neurons of the substantia nigra by caudato-nigral fibers,” Brain Res, vol. 32, no. 1, pp. 225–228, Sep. 1971, doi: 10.1016/0006-8993(71)90170-3.

      [20] W. Precht and M. Yoshida, “Blockage of caudate-evoked inhibition of neurons in the substantia nigra by picrotoxin,” Brain Res, vol. 32, no. 1, pp. 229–233, Sep. 1971, doi: 10.1016/0006-8993(71)90171-5.

      Amendments to the manuscript:

      “Indeed, GPi receives the greatest abundance of inhibitory inputs from striatum (direct pathway), but it also receives inhibitory inputs by way of GPe (indirect pathway). Although we can functionally disaggregate these pathway-specific responses based on differences in morphology and dynamics of GPe-mediated versus striatum-mediated inhibitory fEPs [21], the possibility of compounded effects cannot be completely ruled out.”

      The mechanism of short- and long-term plasticity as applied in the protocols used in this work are outlined in reference to previous citations [15, 16, 18]. Because this is a central aspect of the current work and interpreting the results, it was difficult to appreciate how these protocols provide distinct metrics of short and long-term plasticity in GPi without some explanation of how it applies to the current work and the specific mechanisms. It would also help to be able to better link how the results fit with the broader conclusions. 

      Short-term plasticity is measured as the dynamic change to the fEP during ongoing HFS. For long-term plasticity analyses, the fEP amplitudes during LFS were compared pre- versus post-HFS. To make this analysis more intuitive we have added a protocol illustration to Fig 2. We have moreover greatly expanded the discussion to include more literature related to disease-specific differences in plasticity, and implications of modulating plasticity using DBS.
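
      The two metrics described above can be illustrated with a minimal sketch. It is not the analysis code used in the study: the function names are hypothetical, the long-term metric is taken to be the percentage change in mean fEP amplitude, and the short-term metric assumes a single-exponential decay without an offset, fitted by least squares on log-amplitudes (so all amplitudes must be positive). The fitting procedure actually used may differ.

```python
import math


def percent_change(pre_amplitudes, post_amplitudes):
    """Long-term plasticity: % change in mean fEP amplitude, post- vs pre-HFS."""
    pre_mean = sum(pre_amplitudes) / len(pre_amplitudes)
    post_mean = sum(post_amplitudes) / len(post_amplitudes)
    return 100.0 * (post_mean - pre_mean) / pre_mean


def exponential_half_life(amplitudes):
    """Short-term plasticity: half-life (in pulses) of A(n) = A0 * exp(-k * n).

    Assumption: the decay is fitted as a straight line through log-amplitude
    versus pulse index; amplitudes must be strictly positive.
    """
    n = len(amplitudes)
    pulse_index = list(range(n))
    log_amp = [math.log(a) for a in amplitudes]
    x_mean = sum(pulse_index) / n
    y_mean = sum(log_amp) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(pulse_index, log_amp))
    slope /= sum((x - x_mean) ** 2 for x in pulse_index)
    decay_rate = -slope  # k in the model above
    return math.log(2) / decay_rate


# Hypothetical amplitudes: LFS responses before/after HFS, and the first
# five fEPs during HFS.
plasticity_pct = percent_change([2.0, 2.0], [3.0, 3.0])
half_life_pulses = exponential_half_life([8.0, 4.0, 2.0, 1.0, 0.5])
```

      A shorter half-life corresponds to faster attenuation of the fEP during HFS, and a larger percentage change corresponds to greater potentiation after HFS.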

      Amendments to the manuscript:

      Added new panel to Fig 2

      Author response image 1.

      “Converging evidence from past animal and human studies suggests that dystonia is associated with impaired synaptic function and abnormal synaptic plasticity [35–37]. Compared to healthy controls, it has been shown that transcranial magnetic stimulation induced motor evoked potentials (MEPs) are hyperexcitable in dystonia [38,39], and somatosensory and motor cortical plasticity is greater [40]. Likewise, enhanced long-term potentiation at cortico-striatal synapses has been shown in rodent models of dystonia [41,42]. While our finding that long term potentiation effects are greater in PD compared to dystonia (Fig. 2D) is difficult to corroborate with this literature, one potential explanation can be that all of our PD patients are long-term users of levodopa. We have previously shown that the intake of this antiparkinsonian dopaminergic medication leads to potent increases in the amount of plasticity elicited in GPi [15]. Although patients are 12 hr withdrawn from antiparkinsonian medications for surgery, it could be that striato-pallidal synapses are nevertheless chronically over-sensitized from prolonged use of dopaminergic medication, which is a well-known hypothesis related to the manifestation of levodopa-induced dyskinesias (a hyperkinetic feature) in PD [43]. Indeed, a lack of depotentiation of striato-pallidal projections has previously been observed in patients with levodopa-induced dyskinesias [44]. As such, excessive plasticity of these projections may corroborate hyperkinetic features of dystonia and levodopa-induced dyskinesias in PD.”

      In the Conclusion, it was difficult to understand the sentence about microcircuit interaction (line 232) and how it selectively modulates the efficacy of target synapses. Some further explanation here would be helpful. Also, it was not clear how these investigations (line 237) provide cellular-level support for closed-loop targeting. Could the reference to closed-loop targeting also be further explained? 

      We agree with the reviewer that the current wording may be confusing. We have changed the wording to be clearer. We have additionally added content related to closed-loop DBS based on chronic monitoring of evoked potential responses.

      Amendments to the manuscript:

      “Furthermore, chronic monitoring of evoked fields may allow for tracking of subcortical neuronal projections as indexed by inhibitory fields reported in this study. microcircuit interaction to selectively modulate the efficacy of target synapses.”

      future applications of DBS may also benefit from closed loop tuning of basal-ganglia-thalamo-cortical circuit dynamics and plasticity through chronic monitoring of evoked potential responses [56].

      How is the burst index calculated (Methods)? 

      Thank you for pointing out that the burst index definition was missing from the paper. It has now been added to the manuscript.

      Amendments to the manuscript:

      “The burst index was computed by taking the ratio of the means from a two-component Gaussian mixture model applied to the log interspike interval distribution, a modification of the previous mode-over-mean ISI method [20].”
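
      For readers unfamiliar with this kind of definition, the sketch below shows one way such a ratio could be computed. It is an illustration under stated assumptions, not the implementation used in the study: the mixture is fitted with a plain expectation-maximization loop, the ISIs are assumed to be in milliseconds, and the ratio is taken as the larger component mean over the smaller one. The quoted definition does not specify the direction of the ratio, the units, or the fitting routine, and all function names here are hypothetical.

```python
import math
import random


def fit_two_gaussians(values, n_iter=200):
    """Fit a 1-D two-component Gaussian mixture by EM; return sorted means."""
    data = sorted(values)
    half = len(data) // 2
    # Initialise each component from one half of the sorted data.
    means = [sum(data[:half]) / half, sum(data[half:]) / (len(data) - half)]
    grand_mean = sum(data) / len(data)
    sd0 = math.sqrt(sum((v - grand_mean) ** 2 for v in data) / len(data))
    sds = [sd0, sd0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for v in data:
            dens = [
                w * math.exp(-0.5 * ((v - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                for w, m, s in zip(weights, means, sds)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weight, mean and standard deviation per component.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            means[k] = sum(r[k] * v for r, v in zip(resp, data)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
            weights[k] = nk / len(data)
    return sorted(means)


def burst_index(isis_ms):
    """Ratio of the two mixture means of log(ISI).

    Assumption: ISIs in ms, ratio = larger mean / smaller mean. The ratio is
    only meaningful when the smaller mean is positive.
    """
    log_isis = [math.log(x) for x in isis_ms if x > 0]
    low_mean, high_mean = fit_two_gaussians(log_isis)
    return high_mean / low_mean


# Synthetic spiketrain: short intra-burst ISIs (~5 ms) mixed with long
# inter-burst ISIs (~100 ms).
random.seed(0)
example_isis_ms = [random.lognormvariate(math.log(5), 0.2) for _ in range(300)]
example_isis_ms += [random.lognormvariate(math.log(100), 0.2) for _ in range(300)]
example_bi = burst_index(example_isis_ms)
```

      With this convention, a spiketrain whose log-ISI distribution is clearly bimodal yields a ratio well above 1, whereas a regular spiketrain with a single ISI mode yields two nearly identical component means and a ratio close to 1.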

      Figures and figure captions are missing some details:

      Fig. 1 - What does shading represent? 

      The shading in Fig. 1 illustrates results that were significant before adjustment for multiple comparisons.

      Amendments to the manuscript:

      “Depicted scatterplots are results that were significant before correction for multiple comparisons”

      Fig. 2 - Can the stimulation artifact be labeled so as not to be confused with the physiological signal? Is A representing the average of all patients or just one example? Are there confidence intervals for this data as it's not clear if the curves are significantly different or not (may not be important to show if just one example)? Same for D. What is being plotted in E? Is this the exponential fitted on data? Can this be stated in the figure citation directly so readers don't have to find it in the text, where it may not be directly obvious which figure the analyses are being applied towards? 

      Thank you for your comments regarding Fig. 2. We have made the following revisions to address the concerns:

      To differentiate the stimulation artifacts from the physiological signal, we have revised Panels B and E of the updated Fig. 2 so that the stimulation artifacts are highlighted.

      Regarding the comment about Panel A (now B in the updated figure), it represents one single example per disease, rather than an average of all patients.

      In response to the comment about what is plotted in Panel E, we have revised the figure caption to explicitly state that it includes the exponential fit on the data.

      Amendments to the manuscript:

      Figure 2 panels B and E now highlight stimulation artifacts.

      Author response image 2.

      Author response image 3.

      The figure captions could use more details, that can be taken from the text, so that readers can understand figures without searching for relevant details across the paper. 

      Thank you for your feedback. We have revised the figure captions accordingly to provide more details.

      Amendments to the manuscript:

      “Fig 1 – GPi spiketrain feature analyses and clinical correlates of PD and dystonia. With respect to (A) rate-based spiketrain features, firing rate was greater in PD while burst index (BI) and coefficient of variation (CV) were greater in dystonia; whereas no differences were found for (B) oscillatory spiketrain features for theta, alpha, low beta, high beta frequencies. MWU statistical results depicted are not corrected for multiple comparisons; after correction using the Bonferroni method, only CV and BI results remain significant (please see Supplementary Table 3). (C) In PD, the power of low beta spiketrain oscillations positively correlated (Spearman correlation) with symptom severity; in dystonia, neuronal firing rate negatively correlated with symptom severity, whereas CV and the power of theta spiketrain oscillations positively correlated with symptom severity. Depicted scatterplots are results that were significant before correction for multiple comparisons; however, none of the results persist after Benjamini-Hochberg correction for false discovery rate (please see Supplementary Table 4).”
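
      The two corrections named in the Fig 1 caption behave quite differently, which a short sketch can make concrete. The code below is a generic illustration of the standard Bonferroni and Benjamini-Hochberg adjustments; the p-values are made up and are not taken from the Supplementary Tables.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (false discovery rate control)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical uncorrected p-values from four tests.
raw_p = [0.01, 0.04, 0.03, 0.005]
p_bonferroni = bonferroni(raw_p)
p_bh = benjamini_hochberg(raw_p)
```

      Bonferroni controls the family-wise error rate and is the more conservative of the two; Benjamini-Hochberg controls the expected proportion of false discoveries and scales each p-value by its rank among the tests.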

      “Fig 2 – Long-term and short-term effects of HFS on striato-pallidal plasticity in PD and dystonia. (A) Schematic of the plasticity protocol to assess long-term plasticity via fEP amplitude comparisons pre- versus post-HFS and short-term plasticity via fEP dynamics during HFS. (B) Highlights example fEP traces for measuring long-term plasticity pre- versus post-HFS, with (C) displaying group-level fEP amplitudes pre- versus post-HFS across diseases. (D) Illustrates the amount of plasticity (i.e., percentage change in fEP amplitudes pre- versus post-HFS) in both PD and dystonia, with PD showing higher levels of plasticity. (E) Provides an example of fEP traces during HFS for assessing short-term plasticity, with (F) depicting group-level decay rates of fEP amplitudes using an exponential fit on the fEP amplitudes over the first 5 stimulus pulses across diseases. (G) Shows the half-life of the fitted exponential (i.e., rate of attenuation of fEP amplitudes) between PD and dystonia, with PD demonstrating faster fEP attenuation.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      To gain further insight into the dynamics of microglial aging in the hippocampus, the authors used a bioinformatics method known as "pseudotime" or "trajectory inference" to understand how cells may progress through different functional states, as defined by cellular transcriptome (15,16). These bioinformatics approaches can reveal key patterns in scRNAseq / snRNAseq datasets and, in the present study, the authors conclude that a "stress response" module characterized by expression of TGFb1 represents a key "checkpoint" in microglial aging in midlife, after which the cells can move along distinct transcriptional trajectories as aging progresses. This is an intriguing possibility. However, pseudotime analyses need to be validated via additional bioinformatics as well as follow-up experiments. Indeed, Heumos et al, in their Nature Genetics "Expert Guidelines" Review, emphasize that "inferred trajectories might not necessarily have biological meaning." They recommend that "when the expected topology is unknown, trajectories and downstream hypotheses should be confirmed by multiple trajectory inference methods using different underlying assumptions."(15) Numerous algorithms are available for trajectory inference (e.g. Monocle, PAGA, Slingshot, RaceID/StemID, among many others) and their performance and suitability depends on the individual dataset and nature of the trajectories that are to be inferred. It is recommended to use dynGuidelines(16) for the selection of optimal pseudotime analysis methods. In the present manuscript, the authors do not provide any justification for their use of Monocle 3 over other trajectory inference approaches, nor do they employ a secondary trajectory inference method to confirm observations made with Monocle 3. Finally, follow-up validation experiments that the authors carry out have their own limitations and caveats (see below). 
      Hence, while the microglial aging trajectories identified by this study are intriguing, they remain hypothetical trajectories that need to be proven with additional follow-up experiments.

      We thank the reviewer for their suggestion. Following the dynGuidelines recommended by the reviewer, we analyzed our data with an additional trajectory inference tool. We selected SCORPIUS based on the structure of our data. The tool has provided additional support that microglia progress from a homeostatic state (Cx3cr1, Mef2c) to the induction of stress genes (Hspa1, Atf3) at an intermediate point during aging progression. Furthermore, we observe a concordant increase in ribosomal protein genes at a time point in the pseudotime analysis immediately prior to activation of inflammation-related genes (Il1b, Cst7). These additional analyses support the main findings of our original pseudotime analysis and have been added to the manuscript as Figure S3C,D. Additionally, in the statistical test that uncovers differentially expressed genes along the pseudotime trajectory in this analysis, we find that Tgfb1 is one of the genes that is differentially expressed, with peak expression at an intermediate timepoint along the pseudotime trajectory. Furthermore, we have done some preliminary trajectory analysis with Slingshot (Street et al, BMC Genomics, PMID: 29914354) that found a similar trajectory with analogous gene expression patterns and dynamic expression of Tgfb1.

      To follow up on the idea that TGFb1 signaling in microglia plays a key role in determining microglial aging trajectories, the authors use RNAscope to show that TGFb1 levels in microglia peak in middle age. They also treat primary LPS-activated microglia with TGFb1 and show that this restores expression of microglial homeostatic gene expression and dampens expression of stress response and, potentially, inflammatory genes. Finally, they utilize transgenic approaches to delete TGFb1 from microglia around 8-10mo of age and scRNAseq to show that homeostatic signatures are lost and inflammatory signatures are gained. Hence, findings in this study support the idea that TGFb1 can strongly regulate microglial phenotype. Loss of TGFb1 signaling to microglia in adulthood has already been shown to cause decreased microglial morphological complexity and upregulation of genes typically associated with microglial responses to CNS insults(17-19). TGFb1 signaling to microglia has also been implicated in microglial responses to disease and manipulations to increase this signaling can improve disease progression in some cases(19). In this light, the findings in the present study are largely confirmatory of previous findings in the literature. They also fall short of unequivocally demonstrating that TGFb1 signaling acts as a "checkpoint" for determining subsequent microglial aging trajectory. To show this clearly, one would need to perturb TGFb1 signaling around 12mo of age and carry out sequencing (bulkRNAseq or scRNAseq) of microglia at 18mo and 24mo. Such experiments could directly demonstrate whether the whole microglial population has been diverted to the TGFb1-low aging trajectory (that progresses through a translational burst state to an inflammation state as proposed). 
      Future development of tools to tag TGFb1 high or low microglia could also enable fate tracing type experiments to directly show whether the TGFb1 state in middle age predicts cell state at later phases of aging.

      We apologize for the use of the term “checkpoint” when referring to the role of Tgfb1 in microglial aging. Instead, our model posits that Tgfb1 expression increases in response to the early insults of the aging process in an attempt to return microglia to homeostasis. Therefore, this would predict that increasing TGFB1 levels after an insult would decrease activation and age-related progression of microglia, which we demonstrate in vitro (Figure 3). Alternatively, the loss of TGFB1 should prevent microglia from returning to a homeostatic state after an age-related stressor, and thus increase the number of microglia in activated states. We observe this increase in activated microglia in our middle-aged microglia-specific Tgfb1 knockout mouse model. Furthermore, the haploinsufficiency of Tgfb1 at this age indicates that TGFB1 signaling in microglia is sensitive to relative levels of Tgfb1. The transient increase in Tgfb1 expression further suggests that the threshold for TGFB1 signaling is dynamic. Finally, RNA-Seq analysis of both in vitro TGFB1-supplemented microglia and in vivo Tgfb1-depleted microglia highlights that TGFB1 alters the aging microglia transcriptome. Combined, these results provide evidence that Tgfb1 modulates advancement of microglia through an aging continuum.

      The present study would also like to draw links between features of microglial aging in the hippocampus and a decline in hippocampal-dependent cognition during aging. To this end, they carry out behavioral testing in 8-10mo old mice that have undergone microglial-specific TGFb1 deletion and find deficits in novel object recognition and contextual fear conditioning. While this provides compelling evidence that TGFb1 signaling in microglia can impact hippocampus-dependent cognition in midlife, it does not demonstrate that this signaling accelerates or modulates cognitive decline (see below). Age-associated cognitive decline refers to cognitive deficits that emerge as a result of the normative brain aging process (20-21). For a cognitive deficit to be considered age-associated cognitive decline, it must be shown that the cognitive operation under study was intact at some point earlier in the adult lifespan. This requires longitudinal study designs that determine whether a manipulation impacts the relationship between brain status and cognition as animals age (22-24). Alternatively, cross-sectional studies with adequate sample sizes can be used to sample the variability in cognitive outcomes at different points of the adult lifespan (22-24) and show that this is altered by a particular manipulation. For this specific study, one would ideally demonstrate that hippocampal-based learning/memory was intact at some point in the lifespan of mice with microglial TGFb1 KO but that this manipulation accelerated or exacerbated the emergence of deficits in hippocampal-dependent learning/memory during aging. In the absence of these types of data, the authors should tone down their claims that they have identified a cellular and molecular mechanism that contributes to cognitive decline.

      We agree with the reviewer that to adequately demonstrate an age-dependent effect of microglia-derived TGFB1 on cognition it is necessary to perturb microglial TGFB1 at young and mature ages and assess the age-dependent effect on cognition. To address this, we have now performed a complementary behavioral study utilizing the Tmem119-CreER mouse model to drive the microglia-specific excision of Tgfb1 in two separate cohorts of mice – one young (2-3 months) and one mature (7-8 months) – followed by cognitive testing. Using the novel object recognition test, we find that young mice of all genotypes (WT, Tgfb1 Het and Tgfb1 cKO) retain the ability to recognize the novel object (as determined by having a significant preference in exploring the novel object). In contrast, only the WT mature mice demonstrate a preference for the novel object, while the Tgfb1 Het and Tgfb1 cKO mice show no preference for the novel object. These behavioral data demonstrate an age-dependent necessity for microglia-specific TGFB1 in maintaining proper hippocampal-dependent memory and are now included in the manuscript as revised Figure 4I-J. We have also included additional behavioral tests (Y-Maze and open field) that did not show any difference between the genotypes as Figure S6D-G. Unfortunately, we were unable to perform the fear conditioning testing, as our apparatus broke during this time. Together, these results reveal that there is an age-dependent necessity for microglia-derived TGFB1 for hippocampal-dependent cognitive function.

      A final point of clarification for the reader pertains to the mining of previously generated data sets within this study. The language in the results section, methods, and figure legends causes confusion about which experiments were actually carried out in this study versus previous studies. Some of the language makes it sound as though parabiosis experiments and experiments using mouse models of Alzheimer's Disease were carried out in this study. However, parabiosis and AD mouse model experiments were executed in previous studies (25,26), and in the present study, RNAseq datasets were accessed for targeted data mining. It is fantastic to see further mining of datasets that already exist in the field. However, descriptions in the results and methods sections need to make it crystal clear that this is what was done.

      The reviewer makes an excellent point. While we referenced the public dataset in the original manuscript, the citation style of superscripted numbers diminishes our ability to adequately reference the datasets. Therefore, we have added the names of the first authors (Palovics for the parabiosis dataset and Sala Frigerio for the Alzheimer’s Disease dataset) to all the instances in the results and figure legends when we refer to these datasets.

      Additional recommendations:

      Major comments.

      (1) There is some ambiguity surrounding how to interpret the microglial TGFb1 knockout that seems incompatible with viewing this molecule as a "checkpoint" in microglial aging. TGFb1 is believed to be primarily produced by microglia. Secreted TGFb1 is then detected by microglial TGFbR2. Are the microglia that have high levels of TGFb1 in middle age signaling to themselves (autocrine signaling)? Or contributing to a local milieu that impacts multiple neighbor microglia (paracrine signaling)? The authors could presumably look in their own dataset to evaluate microglial capacity to detect TGFb1 via its receptors.

      We thank the reviewer for this insightful suggestion. We have undertaken analysis of our dataset to assess whether Tgfb1 acts through autocrine or paracrine signaling. To do so, we reanalyzed our microglia aging scRNA-Seq dataset leveraging the variation in microglia Tgfb1 expression to probe the relative activity of TGFB1. Specifically, we partitioned microglia into quartiles based on their Tgfb1 expression, and subsequently investigated the expression of TGFB signaling effectors and targets. High expression of downstream TGFB signaling pathway components in microglia with high Tgfb1 expression would point to autocrine mechanisms while, alternatively, high expression of downstream TGFB signaling pathway components in microglia with low Tgfb1 expression would point to paracrine mechanisms. We observed highest expression of TGFB signaling pathway components and targets in microglia with the highest expression of Tgfb1. These data suggest that Tgfb1 acts through an autocrine mechanism. These results have been added to our manuscript as Figure S4E-G. Additionally, while our manuscript was under review, a paper by Bedolla et al (Nature Communications 2024; PMID: 38906887) was published that investigated the role of Tgfb1 in adult microglia. This paper utilized orthogonal techniques – sparse microglia-specific Tgfb1 knockout and IHC - to also suggest that microglia utilize autocrine Tgfb1 signaling. Together, these complementary data provide strong evidence that Tgfb1 acts through an autocrine mechanism in adult microglia.
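
      The quartile logic described above can be sketched in a few lines. This is an illustration only, not the analysis pipeline used for Figure S4E-G: the function name and the toy expression values are hypothetical, cells are split into four equal-sized bins by rank, and ties (for example, the many zero counts typical of scRNA-Seq data) are broken by input order, which may differ from how the actual analysis handled them.

```python
def quartile_means(tgfb1_expr, target_expr):
    """Mean target-gene expression per Tgfb1 quartile, lowest to highest.

    Both inputs are per-cell values in the same cell order; at least four
    cells are required.
    """
    n = len(tgfb1_expr)
    # Rank cells by Tgfb1 expression.
    order = sorted(range(n), key=lambda i: tgfb1_expr[i])
    means = []
    for q in range(4):
        # Equal-sized rank bins: Q1 (lowest Tgfb1) to Q4 (highest Tgfb1).
        cells = order[q * n // 4:(q + 1) * n // 4]
        means.append(sum(target_expr[i] for i in cells) / len(cells))
    return means


# Toy data: eight cells with a hypothetical downstream TGFB target whose
# expression rises with Tgfb1.
toy_tgfb1 = [7, 0, 3, 4, 1, 6, 2, 5]
toy_target = [4, 1, 2, 3, 1, 4, 2, 3]
toy_quartile_means = quartile_means(toy_tgfb1, toy_target)
```

      Under the reasoning given in the response, target expression that rises from the lowest to the highest Tgfb1 quartile is the pattern interpreted as consistent with autocrine signaling, whereas high target expression in the low-Tgfb1 quartiles would point toward paracrine signaling.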

      (2) Conclusions of the study rest on the assumption that microglial inflammatory responses are a central driver of cognitive decline. They assume that manipulations that increase microglial progression into an inflammatory state will negatively impact cognitive function. Although there are certainly a lot of data in the field that inflammatory factors can impact synaptic function, additional experiments would be required to unequivocally demonstrate that a "TGFb1 dependent" progression of microglia to an inflammatory state underlies any observed changes in cognition. For example, in the context of microglial TGFb1 deletion, can NSAIDs or blockers of soluble TNFa (e.g. XENP345), or blockers of SPP1, etc. rescue behavior? Can microglial depletion in this context rescue behavior? Assuming behavior was carried out in the same microglial TGFb1 KO mice that were used for microglial scRNAseq, they could also carry out linear regression-type analyses to link microglial inflammatory status to the behavioral performance of individual mice. In the absence of additional evidence of this sort, the authors should tone down claims about mechanistic relationships between microglial state and cognitive performance.

      We thank the reviewer for pointing out that the link between cognition and inflammation in our paper is speculative. Therefore, we have taken the reviewer’s advice and toned down the claims linking inflammation to cognition in our manuscript. Instead, we connect the disruption in cognition to what is observed in our data: a loss of microglia homeostasis and a shift in the microglia aging trajectories.

      Additional Recommendations:

      Minor comments:

      (1) Ideally at some point in the results or discussion, the authors should acknowledge that the hippocampus has highly distinct sub-regions and that microglia show different functions and properties across these sub-regions (e.g. microglia in hilus and subgranular zone vs microglia in stratum radiatum, vs microglia immediately adjacent to or embedded within stratum pyrimidale). Do expression levels of TGFb1 and microglial aging trajectories vary across sub-regions? To what extent can this account for heterogeneity of aging trajectories observed in microglial aging within the hippocampus?

      We are interested in how microglia heterogeneity during aging is influenced by the specific functions, and thus microenvironments within the hippocampus. Therefore, we have expanded our IHC analysis of microglia to determine how the microenvironment influences microglia phenotypes by looking at several different regions of the hippocampus. We have included this regional analysis as Figure S2 in the manuscript. This analysis has revealed region-specific effects on microglia activation during aging.

      (2) For immunohistochemistry data, it is not particularly convincing to see one example of one cell from each condition. Generally, an accepted approach in the field is to present lower magnification images accompanied by zoom panels for several cells from each field of view. This reassures the reader that specific cells haven't simply been "cherry-picked" to support a particular conclusion.

      To allay the concerns of the reviewer that cells haven’t been “cherry-picked”, we have provided low magnification images for the aging CD68 and NFκB stains in Supplemental Figure S2.

      (3) In immunohistochemistry data, have measures been taken to ensure that observed signals are not simply autofluorescence that becomes prominent in tissues with aging? (i.e. use of trueblack or photoquenching of tissue prior to staining) See PMID 37923732

      We agree that autofluorescence, at least partially due to the accumulation of lipofuscin, becomes prominent in certain regions and cells of the hippocampus during aging. This most prominently occurs in the microglia of the hilus. This autofluorescence has a particular subcellular distribution, as it is localized to lyso-endosomal bodies. The microglia activation marker CD68 is also localized to lysosomes. A previous publication by Burns et al (eLife; PMID: 32579115) identified autofluorescent microglia (AF+) with unique molecular profiles that accumulate with age. They posited that these AF+ microglia resembled other microglia subsets that have pronounced storage compartments, such as the pro-inflammatory lipid droplet-containing microglia that accumulate with age reported by Marschallinger et al (Nature; PMID: 31959936). As such, autofluorescence present in microglia potentially represents distinctive and functional states of microglia. Our CD68 immunostaining accumulates with age, which could overlap with autofluorescent storage bodies. Thus, we performed a complementary CD68 immunostaining in an independent cohort of young (3 months) and aged (24 months) mice with the autofluorescence quencher TrueBlack, and found that the staining pattern and accumulation of CD68 microglia with age persisted as previously observed after use of this quencher (see Author response image 1). Images are IBA1 (cyan) and CD68 (yellow) with the molecular layer (ML), granule cell (GC), and hilus illustrated and corresponding quantification provided (Two-way ANOVA with Sidak’s multiple comparisons test; ***P<0.001; ****P<0.0001).

      We would like to note that the subcellular localization of the other immunostainings included in the manuscript was distinct from CD68, and not likely to be associated with the autofluorescent storage bodies. Additionally, our RNAScope staining for Tgfb1 did not show an accumulation with age, but rather a transient increase at 12 months of age, which indicates that the interpretation of the RNAScope stain for Tgfb1 was not unduly influenced by autofluorescence.

      Author response image 1.

      (4) Ideally, more care is needed with the language used to describe microglial state during aging. The terms "dystrophic," "dysfunctional," and "inflammatory" all carry their own implications and assumptions. Many changes exhibited by microglia during aging can initially be adaptive or protective, particularly during middle age. Without additional experiments to show that specific microglial attributes during aging are actively detrimental to the tissue and additional experiments to show that microglia have ceased to be capable of engaging in many of their normal actions to support tissue homeostasis, the authors should exercise caution in using terms like dysfunctional.

      We appreciate the reviewers’ suggestion. To allay the concerns of the reviewer about the multiple implications of terms such as “dysfunctional” and “inflammatory”, we have tried to replace them throughout the text with more specific terms.

      Reviewer #2:

      That said, given what we recently learned about microglia isolation for RNA-seq analysis, there is a danger that some of the observations are a result of not age, but cell stress from sample preparation (enzymatic digestion 10min at 37C; e.g. PMID: 35260865). Changes in cell state distribution along aging were made based on scRNA-seq and were not corroborated by any other method, such as imaging of cluster-specific marker expression in microglia at different ages. This analysis would allow confirming the scRNA-seq data and would also give us an idea of where the subsets are present within the hippocampus, and whether there is any interesting distribution of cell states (e.g. some are present closer to stem cells?). Since TGFb is thought to be crucial to microglia biology, it would be valuable to include more analysis of the mice with microglia-specific Tgfb deletion e.g. what was the efficiency of recombination in microglia? Did their numbers change after induction of Tgfb deletion in Cx3cr1-creERT2::Tgfb-flox mice.

      We thank the reviewer for their comment regarding potential ex vivo transcriptional alterations with the approaches used in our study. We performed our aging microglia scRNA-Seq characterization prior to the release of Marsh et al (Nature Neuroscience; PMID: 35260865), which revealed the potential transcriptional artefacts induced by isolation. That being said, we took great care to minimize the amount of time samples were subjected to enzymatic digestion (15 minutes) and kept cells at 4C during the remainder of the isolation. Furthermore, we performed all isolations simultaneously, so that transcriptional changes induced by the isolation would be present across all ages and should not be observed during our analysis unless indicative of a true age-related change. Additionally, we have corroborated changes in cell state distribution across ages using several markers (Tgfb1 and KLF2 for the intermediate stress state, S6 for the translation state, and NFKB and CD68 for activation states). In the revised manuscript, we have added additional hippocampal subregion analysis of several IHC immunostains to provide spatial insights into the microglia aging process (Figure S2). This analysis reveals unique spatial dynamics of microglia aging. For example, as the reviewer foresaw, we found that the granule cell layer (the location of adult hippocampal neurogenesis) had a more pronounced age-associated progression of microglial activation than several other regions. A subset of regions had minimal levels of activation during aging, such as the molecular layer and the stratum radiatum of the CA1 (inner CA1 in the manuscript), both regions enriched in synaptic terminals. Furthermore, this analysis highlights the susceptibility of microglia aging to microenvironmental influences.

      Regarding the temporally controlled microglia-specific genetic KO mouse model used in our original submission, the Cx3cr1-CreER allele selected (B6.129P2(Cg)-Cx3cr1tm2.1(cre/ERT2)Litt/WganJ) has been reported to have very high recombination efficiency (~94% in Parkhurst et al (Cell; PMID: 24360280)), and we used a tamoxifen induction protocol very similar to Faust et al. (Cell Reports; PMID: 37635351) that achieved ~98% recombination (they injected 100mg/kg for 5 days, while we injected 90mg/kg for 5 days). We analyzed our scRNA-Seq data for the expression of Tgfb1 and found that the knockout mice had a 67% reduction in cells expressing higher levels of Tgfb1 (see panel A in Author response image 2). This is likely a large underestimate of the recombination efficiency, as exon 3 is floxed and residual nonfunctional transcripts could be present, given nonsense-mediated decay is not realized in a number of knockout lines (Lindner et al, Methods, PMID: 33838271). We likely achieved a much higher excision efficiency. We would like to highlight that our data indicating increased microglia activation after tamoxifen treatment (Figure S5A) and the involvement of autonomous signaling (Figure S4E-G) are consistent with recently published work by Bedolla et al (Nature Communications; PMID: 38906887). Additionally, as part of the revision process, we have now corroborated our behavioral data using an independent temporally controlled microglia-specific KO mouse model - Tmem119-CreER::Tgfb1 knockout mice (Figure 4I-K). We performed qPCR on sorted microglia to determine RNA levels in wildtype and knockout mice. Relative levels of Tgfb1 and exon 3 of Tgfb1 (the floxed exon) on technical replicates of 3 pooled samples indicated overall loss of Tgfb1 expression, as well as undetectable levels of exon 3 as normalized to Actb (see panel B in Author response image 2).

      Author response image 2.

      With respect to the effects of aging and Tgfb1 on microglia density, we find a slight region-specific increase in microglia density with age (see Author response image 3). The density of Iba1 cells across hippocampal regions was analyzed at 3 and 24 months of age (see panel A in Author response image 3) and along an aging continuum at 3, 6, 12, 18, and 24 months (see panel B in Author response image 3). These data are also included in the revised manuscript (Figure S2D-F).

      Author response image 3.

      Deletion of Tgfb1 also had region-specific effects on microglia. While there was no difference in microglia density between wildtype and heterozygous microglia, there was a significant increase in microglia density in the hilus and molecular layers in knockout mice (see Author response image 4); these data are also included in the revised manuscript (Figure S5A). Together, these results indicate that there are subtle region-specific increases in microglia density with age, as well as following the deletion of Tgfb1 from microglia of mature mice.

      Author response image 4.

      Additional Recommendations:

      (1) The problem of possible digestion artifacts in scRNA-seq should be at least addressed in the discussion as a caveat in data interpretation. Staining for unique cluster markers in undigested tissue would solve the problem. It can be done with microscopy or using flow cytometry, but for this microglia, isolation should be done with no enzymes or with Actinomycin (PMID: 35260865).

      The ex vivo activation signature uncovered by Marsh et al. (Nature Neuroscience; PMID: 35260865) arises from the digestion methods used to isolate microglia. We took the utmost care in processing our microglia identically within experiments, which should minimize the amount of uneven ex vivo activation of microglia. This is borne out by the structure of our single-cell sequencing data. Unlike Marsh et al., who observe a unique cluster after addition of their inhibitors, we do not see any clusters unique to a single condition, suggesting that any influence of ex vivo activation was evenly distributed.

      Importantly, as suggested by the reviewer, we have complemented our scRNA-Seq analysis by corroborating several markers for various stages of microglia aging progression using RNAScope and IHC in intact tissue. Specifically, the transient age-dependent increase in Tgfb1 high microglia was confirmed using RNAScope (Figure 3B), the age-related increase in ribosomal high microglia was confirmed using S6 immunostaining (Figure 3I), and the increase of various markers of age-associated activation (C1q, CD68 and NFkB) was confirmed using immunostaining (Figure 1F and Figure S2D-I). Additionally, we have performed immunostainings for KLF2 and confirmed peak microglia expression at 18 months of age with lower levels at 24 months of age (Figure 2H).

      (2) The figures of GO and violin plots are not easy to follow sometimes... what are the data points in the violin plots, maybe worth showing them as points? For the GO, e.g. in 3D, 3J, including a short description of the figure could help, e.g. in Figure 1. it was clear.

      We chose not to include the datapoints in the violin plots for aesthetic purposes. Each violin plot would have had hundreds of points that would have made the plots very busy and hidden the structure of the distribution. In Author response image 5 we show the violin plot in Figure 2M with (panel A) and without (panel B) individual points. In a small format, the points overlap and become jumbled together. Therefore, we chose to present the violin plots without points for clarity on the data structure. As for the gene ontology plots in Figure 3, we have updated the descriptions in both the text and figure legends to provide clarification on what they represent.

      Author response image 5.

      (3) I'm very curious to see the mechanism of action of "aged" microglia in the TGFb-depletion model. Is it creating hostile conditions for stem cells, or we have increased synapse loss? Something else?

      We thank the reviewer for their insightful questions. We would like to note that during the revision process of our manuscript, a complementary study was published reporting that the loss of microglia-derived Tgfb1 leads to an aberrant increase in the density of dendritic spines in the CA1 region of the hippocampus (Bedolla et al, Nature Communications, PMID: 38906887). The data from Bedolla et al show sparsely labeled neurons in the CA1 with a mGreenLantern-expressing virus in mice that had Tgfb1 deleted from microglia using the Cx3cr1-CreERT driver (Figure 7U,V). Additionally, McNamara et al (Nature; PMID: 36517604) demonstrated that microglia-derived Tgfb1 signaling regulates myelin integrity during development, and several studies have revealed links between Tgfb1 signaling and altered neurogenesis (e.g., He et al, Nature, PMID: 24859199 and Dias et al, Neuron, PMID: 25467979). Together, this growing body of work indicates that microglia-derived TGFB1 regulates myelination, neurogenesis and synaptic plasticity, which have all been shown to play a role in cognition.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This valuable study introduces an innovative method for measuring interocular suppression depth, which implicates mechanisms underlying subconscious visual processing. The evidence supporting the effectiveness of this method would be solid after successfully addressing concerns raised by the reviewers. The novel method will be of interest not only to cognitive psychologists and neuroscientists who study sensation and perception but also to philosophers who work on theories of consciousness.

      Thank you for the recognition and appreciation of our work.

      Public Reviews:

      Reviewer #1 (Public Review):

      Strengths:

      The authors introduced a new adapted paradigm from continuous flash suppression (CFS). The new CFS tracking paradigm (tCFS) allowed them to measure suppression depth in addition to breakthrough thresholds. This innovative approach provides a more comprehensive understanding of the mechanisms underlying continuous flash suppression. The observed uniform suppression depth across target types (e.g., faces and gratings) is novel and has new implications for how the visual system works. The experimental manipulation of the target contrast change rate, as well as the modeling, provided strong support for an early interocular suppression mechanism. The authors argue that the breakthrough threshold alone is not sufficient to infer about unconscious processing.

      Weaknesses:

      A major finding in the current study is the null effect of the image categories on the suppression depth measured in the tCFS paradigm, from which the authors infer an early interocular mechanism underlying CFS suppression. This is not strictly logical as an inference based on the null effect. The authors may consider statistical evaluation of the null results, such as equivalence tests or Bayesian estimation.

      We have now included a Bayesian model comparison (implemented in JASP), to assess the strength of evidence in favour of the alternative hypothesis (or null effect). For example in Experiment 1 (comparing discrete to tCFS), we found inconsistent evidence in favour of the null effect of image-category on suppression depth:

      Lines 382 – 388: “We quantified the evidence for this null-effect on suppression depth with a subsequent Bayesian model comparison. A Bayesian repeated-measures ANOVA (2 x 2; procedure x image type on suppression depth) found that the best model to explain suppression depth included the main effect of procedure (BF10 = 3231.74), and weak evidence/data insensitivity for image type (BF10 = 0.37). This indicates that the data was insensitive as to whether image-type was better at predicting suppression depth than the null model.”

      In Experiment 2, which was specifically designed to investigate the effect of image category on suppression depth, we found strong evidence in favour of the null:

      Lines 429 – 431: “A Bayesian repeated-measures ANOVA (1 x 5, effect of image categories on suppression depth), confirmed strong evidence in favour of the null hypothesis (BF01 = 20.30).”

      In Experiment 3, we also had image categories, but the effect of rate of contrast change was our main focus. For completeness, we have also included the Bayes factors for image-category in Experiment 3 in our text.

      Lines 487 – 490: “This null-effect of image-type was again confirmed with a Bayesian model comparison (3 speed x 4 image categories on suppression depth), demonstrating moderate support for the null effect of image category (BF01 = 4.06).”

      We have updated our Methods accordingly with a description of this procedure:

      Lines 297-305: “We performed Bayesian model comparison to quantify evidence for and against the null in JASP, using Bayesian repeated measures ANOVAs (uninformed prior with equal weight to all models). We report Bayes factors (B) for main effects of interest (e.g. effect of image type on suppression depth), as evidence in favour compared to the null model (BF10= B). Following the guidelines recommended in (Dienes 2021), B values greater than 3 indicate moderate evidence for H1 over H0, and B values less than 1/3 indicate moderate evidence in favour of the null. B values residing between 1/3 and 3 are interpreted as weak evidence, or an insensitivity of the data to distinguish between the null and alternative models.”
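
      The interpretation rule quoted above can be sketched in a few lines. This is only an illustration of the stated thresholds (B > 3, B < 1/3, and the range in between), not the JASP analysis itself; the function names `interpret_bayes_factor` and `bf01_to_bf10` are hypothetical, and the Bayes factors are the values reported in the quoted passages.

```python
def interpret_bayes_factor(bf10: float) -> str:
    """Label BF10 (evidence for H1 relative to H0) using the quoted cut-offs."""
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive.")
    if bf10 > 3:
        return "evidence for H1"
    if bf10 < 1 / 3:
        return "evidence for H0"
    # Between 1/3 and 3: weak evidence, data insensitive.
    return "insensitive"


def bf01_to_bf10(bf01: float) -> float:
    """BF01 and BF10 are reciprocals of each other."""
    return 1.0 / bf01


# Bayes factors as reported in the quoted manuscript passages.
reported = {
    "Exp 1, procedure (BF10 = 3231.74)": 3231.74,
    "Exp 1, image type (BF10 = 0.37)": 0.37,
    "Exp 2, image category (BF01 = 20.30)": bf01_to_bf10(20.30),
    "Exp 3, image category (BF01 = 4.06)": bf01_to_bf10(4.06),
}

for label, bf10 in reported.items():
    print(f"{label}: {interpret_bayes_factor(bf10)}")
```

      Note that BF10 = 0.37 sits just above 1/3, which is why Experiment 1 is described as insensitive with respect to image type, whereas the BF01 values from Experiments 2 and 3 fall on the side of the null.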

      More importantly, since limited types of image categories have been tested, there may be some exceptional cases. According to "Twofold advantages of face processing with or without visual awareness" by Zhou et al. (2021), pareidolia faces (face-like non-face objects) are likely to be an exceptional case. They measured bidirectional binocular rivalry in a blocked design, similar to the discrete condition used in the current study. They reported that the face-like non-face object could enter visual awareness in a similar fashion to genuine faces but remain in awareness in a similar fashion to common non-face objects. We could infer from their results that: when compared to genuine faces, the pareidolia faces would have a similar breakthrough threshold but a higher suppression threshold; when compared to common objects, the pareidolia faces would have a similar suppression threshold but a low breakthrough threshold. In this case, the difference between these two thresholds for pareidolia faces would be larger than either for genuine faces or common objects. Thus, it would be important for the authors to discuss the boundary between the findings and the inferences.

      This is correct. We acknowledge that our sampling of image-categories is limited, and have added a treatment of this limitation in our discussion. We have expanded on the particular case of Zhou et al (2021), and the possibility of the asymmetries suggested:

      Lines 669 – 691: “As a reminder, we explicitly tested image types that in other studies have shown differential susceptibility to CFS attributed to some form of expedited unconscious processing. Nevertheless, one could argue that our failure to obtain evidence for category specific suppression depth is based on the limited range of image categories sampled in this study. We agree it would be informative to broaden the range of image types tested using tCFS to include images varying in familiarity, congruence and affect. We can also foresee value in deploying tCFS to compare bCFS and reCFS thresholds for visual targets comprising physically meaningless ‘tokens’ whose global configurations can synthesise recognizable perceptual impressions. To give a few examples, dynamic configurations of small dots varying in location over time can create the compelling impression of rotational motion of a rigid, 3D object (structure from motion) or of a human engaged in given activity (biological motion) (Grossmann & Dobbins, 2006; Watson et al., 2004). These kinds of visual stimuli are associated with neural processing in higher-tier visual areas of the human brain, including the superior occipital lateral region (e.g., Vanduffel et al., 2002) and the posterior portion of the superior temporal sulcus (e.g., Grossman et al., 2000). These kinds of perceptually meaningful impressions of objects from rudimentary stimulus tokens are capable of engaging binocular rivalry. Such stimuli would be particularly useful in assessing high-level processing in CFS because they can be easily manipulated using phase-scrambling to remove the global percept without altering low-level stimulus properties. In a similar vein, small geometric shapes can be configured so as to resemble human or human-like faces, such as those used by (Zhou et al., 2021)[1]. These kinds of faux faces could be used in concert with tCFS to compare suppression depth with that associated with actual faces.

      [1] Zhou et al. (2021) derived dominance and suppression durations with fixed-contrast images. In their study, genuine face images and faux faces remained suppressed for equivalent durations whereas genuine faces remained dominant significantly longer than did faux faces. The technique used by those investigators - interocular flash suppression (Wolfe, 1994) - is quite different from CFS in that it involves abrupt, asynchronous presentation of dissimilar stimuli to the two eyes. It would be informative to repeat their experiment using the tCFS procedure.”

      Reviewer #2 (Public Review):

      Summary

      The paper introduces a valuable method, tCFS, for measuring suppression depth in continuous flash suppression (CFS) experiments. tCFS uses a continuous-trial design instead of the discrete trials standard in the literature, resulting in faster, better controlled, and lower-variance estimates. The authors measured suppression depth during CFS for the first time and found similar suppression depths for different image categories. This finding provides an interesting contrast to previous results that breakthrough thresholds differ for different image categories and refine inferences of subconscious processing based solely on breakthrough thresholds. However, the paper overreaches by claiming breakthrough thresholds are insufficient for drawing certain conclusions about subconscious processing.

      We agree that breakthrough thresholds can provide useful information to draw conclusions about unconscious processing – as our procedure is predicated on breakthrough thresholds. Our key point is that breakthrough provides only half of the needed information.

      We have amended our manuscript thoroughly (detailed below) to accommodate this nuance and avoid this overreaching claim.

      Strengths

      (1) The tCFS method, by using a continuous-trial design, quickly estimates breakthrough and re-suppression thresholds. Continuous trials better control for slowly varying factors such as adaptation and attention. Indeed, tCFS produces estimates with lower across-subject variance than the standard discrete-trial method (Fig. 2). The tCFS method is straightforward to adopt in future research on CFS and binocular rivalry.

      (2) The CFS literature has lacked re-suppression threshold measurements. By measuring both breakthrough and re-suppression thresholds, this work calculated suppression depth (i.e., the difference between the two thresholds), which warrants different interpretations from the breakthrough threshold alone.

      (3) The work found that different image categories show similar suppression depths, suggesting some aspects of CFS are not category-specific. This result enriches previous findings that breakthrough thresholds vary with image categories. Re-suppression thresholds vary symmetrically, such that their differences are constant.

      Thank you for this positive and succinct summary of our contribution. We have adopted your 3rd point “... suggesting that some aspects...” in our revised manuscript to more appropriately treat the ways that bCFS and reCFS thresholds may interact with suppression depths. For example:

      Lines 850 – 852: “These [low level] factors could be parametrically varied to examine specifically whether they modulate bCFS thresholds alone, or whether they also cause a change in suppression depth by asymmetrically affecting reCFS thresholds”.

      Weaknesses

      (1) The results and arguments in the paper do not support the claim that 'variations in breakthrough thresholds alone are insufficient for inferring unconscious or preferential processing of given image categories,' to take one example phrasing from the abstract. The same leap in reasoning recurs on lines 28, 39, 125, 566, 666, 686, 759, etc.

      We have thoroughly updated our manuscript with respect to mentions of preferential processing, to avoid this leap in reasoning throughout. For example, this phrase in the abstract now reads:

      Lines 27-30: “More fundamentally, it shows that variations in bCFS thresholds alone are insufficient for inferring whether the barrier to achieving awareness exerted by interocular suppression is weaker for some categories of visual stimuli compared to others”.

      Take, for example, the arguments on lines 81-83. Grant that images are inequivalent, and this explains different breakthrough times. This is still no argument against differential subconscious processing. Why are images non-equivalent? Whatever the answer, does it qualify as 'residual processing outside of awareness'? Even detecting salience requires some processing. The authors appear to argue otherwise on lines 694-696, for example, by invoking the concept of effective contrasts, but why is effective contrast incompatible with partial processing? Again, does detecting (effective) contrast not involve some processing? The phrases 'residual processing outside of awareness' and 'unconscious processing' are broad enough to encompass bottom-up salience and effective contrast. Salience and (effective) contrast are arguably uninteresting, but that is a different discussion. The authors contrast 'image categories' or semantics with 'low-level factors.' In my opinion, this is a clearer contrast worth emphasizing more. However, semantic processing is not equal to subconscious processing writ large.

      We are in agreement with your analysis that differential subconscious processing may contribute to differences between images, and have updated our manuscript to clarify this possibility. In particular, we have now included a section in our Discussion which offers a suggestion for future research, linking sensitivity to different low-level image features with differences in gain of the respective contrast-response functions.

      From Lines 692 – 722: “Next we turn to another question raised about our conclusion concerning invariant depth of suppression: If certain image types have overall lower bCFS and reCFS contrast thresholds relative to other image types, does that imply that images in the former category enjoy “preferential processing” relative to those in the latter? Given the fixed suppression depth, what might determine the differences in bCFS and reCFS thresholds? Figure 3 shows that polar patterns tend to emerge from suppression at slightly lower contrasts than do gratings and that polar patterns, once dominant, tend to maintain dominance to lower contrasts than do gratings and this happens even though the rate of contrast change is identical for both types of stimuli. But while rate of contrast change is identical, the neural responses to those contrast changes may not be the same: neural responses to changing contrast will depend on the neural contrast response functions (CRFs) of the cells responding to each of those two types of stimuli, where the CRF defines the relationship between neural response and stimulus contrast. CRFs rise monotonically with contrast and typically exhibit a steeply rising initial response as stimulus contrast rises from low to moderate values, followed by a reduced growth rate for higher contrasts. CRFs can vary in how steeply they rise and at what contrast they achieve half-max response. CRFs for neurons in mid-level vision areas such as V4 and FFA (which respond well to polar stimuli and faces, respectively) are generally steeper and shifted towards lower contrasts than CRFs for neurons in primary visual cortex (which responds well to gratings). Therefore, the effective strength of the contrast changes in our tCFS procedure will depend on the shape and position of the underlying CRF, an idea we develop in more detail in Supplementary Appendix 1, comparing the case of V1 and V4 CRFs. 
      Interestingly, the comparison of V1 and V4 CRFs shows two interesting points: (i) that V4 CRFs should produce much lower bCFS and reCFS thresholds than V1 CRFs, and (ii) that V4 CRFs should produce more suppression than V1 CRFs. Our data do not support either prediction: Figure 3 shows that bCFS and reCFS thresholds are very similar for all image categories and suppression depth is uniform. There is no room in these results to support the claim that certain images receive “preferential processing” or processing outside of awareness, although there are many other kinds of images still to be tested and exceptions may potentially be found. As a first step in exploring this idea, one could use standard psychophysical techniques (e.g., (Ling & Carrasco, 2006)) to derive CRFs for different categories of patterns and then measure suppression depth associated with those patterns using tCFS.”
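
      To make the CRF description concrete, the following is a minimal sketch that assumes a Naka-Rushton form for the contrast response function. That functional form, the parameter values, and the "V1-like" and "V4-like" labels are illustrative assumptions, not taken from the manuscript or its Supplementary Appendix. The sketch only shows that a CRF with a lower half-max contrast reaches a fixed criterion response at a lower stimulus contrast.

```python
def naka_rushton(contrast, r_max=1.0, c50=0.2, n=2.0):
    """Assumed CRF: monotonic, steep at low contrast, saturating at high contrast."""
    return r_max * contrast**n / (contrast**n + c50**n)


def contrast_for_response(response, r_max=1.0, c50=0.2, n=2.0):
    """Inverse of the CRF: contrast needed to reach `response` (0 < response < r_max)."""
    if not 0 < response < r_max:
        raise ValueError("response must lie strictly between 0 and r_max")
    return c50 * (response / (r_max - response)) ** (1.0 / n)


# Hypothetical parameter sets, chosen only to illustrate "steeper and shifted
# towards lower contrasts"; they are not fitted to any data.
v1_like = dict(c50=0.30, n=2.0)
v4_like = dict(c50=0.10, n=3.0)

criterion = 0.25  # arbitrary criterion response, as a fraction of r_max
c_v1 = contrast_for_response(criterion, **v1_like)
c_v4 = contrast_for_response(criterion, **v4_like)
print(f"Contrast needed for criterion response, V1-like: {c_v1:.3f}")
print(f"Contrast needed for criterion response, V4-like: {c_v4:.3f}")
```

      Under these assumed parameters, the same physical contrast ramp reaches the criterion response sooner for the V4-like CRF, which is the sense in which the effective strength of a contrast change depends on the shape and position of the underlying CRF.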

      We have also expanded on this nuanced line of reasoning in a new Supplementary Appendix for the interested reader.

      The preceding does not detract from the interest in finding uniform suppression depth. Suppression depth and absolute bCFS can conceivably be due to orthogonal mechanisms warranting their own interpretations. In fact, the authors briefly take this position in the Discussion (lines 696-704, 'A hybrid model ...'). The involvement of different mechanisms would defeat the argument on lines 668-670.

      We agree with this analysis, and note our response to Reviewer 1 and the possibility of exceptional cases that may affect absolute bCFS or reCFS thresholds independently.

      Similarly, we agree with the notion that some aspects of CFS may not be category specific. The symmetric relationship of thresholds for a given category of stimuli should be assessed in the context of other categories, such as with pointillist images and by incorporating semantic features of images into the mask as in Che et al. (2019) and Han et al. (2021). This line of reasoning and suggestions for future research is provided in the revised discussion, beginning:

      Lines 67: “Nevertheless, one could argue that our failure to obtain evidence for category specific suppression depth is based on a limited range of image categories….”

      (2) These two hypotheses are confusing and should be more clearly distinguished: a) varying breakthrough times may be due to low-level factors (lines 76-79); b) uniform suppression depth may also arise from early visual mechanisms (e.g., lines 25-27).

      Thank you for highlighting this opportunity for clarification. We have updated our text:

      Lines 25 – 27: “This uniform suppression depth points to a single mechanism of CFS suppression, one that likely occurs early in visual processing, because suppression depth was not modulated by target salience or complexity”

      Lines 78 – 79: “Sceptics argue, however, that differences in breakthrough times can be attributed to low-level factors such as spatial frequency, orientation and contrast that vary between images”

      Neutral remarks

      The depth between bCFS and reCFS depended on measurement details such as contrast change speed and continuous vs. discrete trials. With discrete trials, the two thresholds showed inverse relations (i.e., reCFS > bCFS) in some participants. The authors discuss possible reasons at some length (adaptation, attention, etc.). Still, a variable measure does not clearly indicate a uniform mechanism.

      We have ensured our revised manuscript makes no mention of a uniform mechanism, although we frequently mention our result of uniform suppression depth.

      Reviewer #3 (Public Review):

      Summary:

      In the 'bCFS' paradigm, a monocular target gradually increases in contrast until it breaks interocular suppression by a rich monocular suppressor in the other eye. The present authors extend the bCFS paradigm by allowing the target to reduce back down in contrast until it becomes suppressed again. The main variable of interest is the contrast difference between breaking suppression and (re) entering suppression. The authors find this difference to be constant across a range of target types, even ones that differ substantially in the contrast at which they break interocular suppression (the variable conventionally measured in bCFS). They also measure how the difference changes as a function of other manipulations. Interpretation in terms of the processing of unconscious visual content, as well as in terms of the mechanism of interocular suppression.

      Thank you for your positive assessment of our methodology.

      Strengths:

      Interpretation of bCFS findings is mired in controversy, and this is an ingenious effort to move beyond the paradigm's exclusive focus on breaking suppression. The notion of using the contrast difference between breaking and entering suppression as an index of suppression depth is interesting, but I also feel like it can be misleading at times, as detailed below.

      Weaknesses:

      Here's one doubt about the 'contrast difference' measure used by the authors. The authors seem confident that a simple subtraction is meaningful after the logarithmic transformation of contrast values, but doesn't this depend on exactly what shape the contrast-response function of the relevant neural process has? Does a logarithmic transformation linearize this function irrespective of, say, the level of processing or the aspect of processing that we're talking about?

      Given that stimuli differ in terms of the absolute levels at which they break (and re-enter) suppression, the linearity assumption needs to be well supported for the contrast difference measure to be comparable across stimuli.

      Our motivation to quantify suppression depth after log-transform to decibel scale was two-fold. First, we recognised that the traditional use of a linear contrast ramp in bCFS is at odds with the well-characterised profile of contrast discrimination thresholds which obey a power law (Legge, 1981) and the observations that neural contrast response functions show the same compressive non-linearity in many different cortical processing areas (e.g.: V1, V2, V3, V4, MT, MST, FST, TEO. See (Ekstrom et al., 2009)). Increasing contrast in linear steps could thus lead to a rapid saturation of the response function, which may account for the overshoot that has been reported in many canonical bCFS studies. For example, in (Jiang et al., 2007), target contrast reached 100% after 1 second, yet average suppression times for faces and inverted faces were 1.36 and 1.76 seconds respectively. As contrast response functions in visual neurons saturate at high contrast, the upper levels of a linear contrast ramp have less and less effect on the target's strength. This approach to response asymptote may have exaggerated small differences between stimulus conditions and may have inflated some previously reported differences. In sum, the use of a log-transformed contrast ramp allows finer increments in contrast to be explored before saturation, a simple manipulation which we hope will be adopted by our field.

      Second, by quantifying suppression depth as a decibel change, we enable the comparison of suppression depth between experiments and laboratories, which inevitably differ in presentation environments. By contrast, a bCFS reaction time of 1.36 s cannot easily be compared across studies without access to near-identical stimulation and testing environments. In addition, log-transforming the contrast ramp effectively linearises the neural contrast response function. This means that studies using different contrast levels for masker or target can be directly compared, because a given suppression depth (for example, 15 dB) is the same proportionate difference between bCFS and reCFS regardless of the contrasts used in the particular study.
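
      The proportionality argument can be illustrated with a minimal sketch. It assumes the common 20·log10 convention for converting contrast to decibels, and the threshold values and function names are hypothetical, chosen only to show that equal bCFS/reCFS ratios give equal suppression depths at different absolute contrasts.

```python
import math

def contrast_to_db(contrast):
    """Convert a contrast value (0 < contrast <= 1) to decibels.

    Assumes the 20*log10 amplitude convention (an assumption of this sketch).
    """
    return 20.0 * math.log10(contrast)

def suppression_depth_db(bcfs_contrast, recfs_contrast):
    """Suppression depth as the dB difference between bCFS and reCFS thresholds."""
    return contrast_to_db(bcfs_contrast) - contrast_to_db(recfs_contrast)

# Hypothetical thresholds from two labs working at different absolute contrasts.
lab_a = suppression_depth_db(bcfs_contrast=0.20, recfs_contrast=0.04)
lab_b = suppression_depth_db(bcfs_contrast=0.50, recfs_contrast=0.10)

# Both threshold ratios are 5:1, so both depths are the same number of decibels.
print(f"Lab A: {lab_a:.2f} dB, Lab B: {lab_b:.2f} dB")
```

      A raw contrast difference would not share this property: the two hypothetical labs differ by 0.16 and 0.40 in linear contrast units despite having the same proportionate suppression.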

      We also acknowledge that different stimulus categories may engage neural and visual processing associated with different contrast gain values (e.g., magno- vs parvo-mediated processing). But the breaks and returns to suppression of a given stimulus category would be dependent on the same contrast gain function appropriate for that stimulus which thus permits their direct comparison. Indeed, this is why our novel approach offers a promising technique for comparing suppression depth associated with various stimulus categories (a point mentioned above). Viewed in this way, differences in actual durations of break times (such as we report in our paper) may tell us more about differences in gain control within neural mechanisms responsible for processing of those categories.

      We have now included a summary of these arguments in a new paragraph of our discussion (from lines 696- cf Reviewer 2 above), as well as a new Supplementary Appendix.

      Here's a more conceptual doubt. The authors introduce their work by discussing ambiguities in the interpretation of bCFS findings with regard to preferential processing, unconscious processing, etc. A large part of the manuscript doesn't really interpret the present 'suppression depth' findings in those terms, but at the start of the discussion section (lines 560-567) the authors do draw fairly strong conclusions along those lines: they seem to argue that the constant 'suppression depth' value observed across different stimuli argues against preferential processing of any of the stimuli, let alone under suppression. I'm not sure I understand this reasoning. Consider the scenario that the visual system does preferentially process, say, emotional face images, and that it does so under suppression as well as outside of suppression. In that scenario, one might expect the contrast at which such a face breaks suppression to be low (because the face is preferentially processed under suppression) and one might also expect the contrast at which the face enters suppression to be low (because the face is preferentially processed outside of suppression). So the difference between the two contrasts might not stand out: it might be the same as for a stimulus that is not preferentially processed at all. In sum, even though the author's label of 'suppression depth' on the contrast difference measure is reasonable from some perspectives, it also seems to be misleading when it comes to what the difference measure can actually tell us that bCFS cannot.

      We have addressed this point with respect to the differences between suppression depth and overall value of contrast thresholds in our revised discussion (reproduced above), and supplementary appendix.

      The authors acknowledge that non-zero reaction time inflates their 'suppression depth' measure, and acknowledge that this inflation is worse when contrast ramps more quickly. But they argue that these effects are too small to explain either the difference between breaking contrast and re-entering contrast to begin with, or the increase in this difference with the contrast ramping rate. I agree with the former: I have no doubt that stimuli break suppression (ramping up) at a higher contrast than the one at which they enter suppression (ramping down). But about the latter, I worry that the RT estimate of 200 ms may be on the low side. 200 ms may be reasonable for a prepared observer to give a speeded response to a clearly supra-threshold target, but that is not the type of task observers are performing here. One estimate of RT in a somewhat traditional perceptual bistability task is closer to 500 ms (Van Dam & Van Ee, Vis Res 45 2005), but I am uncertain what a good guess is here. Bottom line: can the effect of contrast ramping rate on 'suppression depth' be explained by RT if we use a longer but still reasonable estimated RT than 200 ms?

      A 500 ms reaction time estimate would not account for the magnitude of the changes observed in Experiment 3. Suppression depths in our slow, medium, and fast contrast ramps were 9.64 dB, 14.64 dB and 18.97 dB, respectively (produced by step sizes of .035, .07 and .105 dB per video frame at 60 fps). At each rate, assuming a 500 ms reaction time for both thresholds would inflate suppression depth by 2.1 dB, 4.2 dB and 6.3 dB, respectively. This inflation cannot account for the size of the effects observed between our different ramp speeds. Note that any critique using the RT argument also applies to all other bCFS studies, which inevitably will have inflated breakthrough points for the same reason.

      We’ve updated our discussion with this more conservative estimate:

      Lines 744 – 747: “For example, if we assume an average reaction time of 500 ms for appearance and disappearance events, then suppression depth will be inflated by ~4.2 dB at the rate of contrast change used in Experiments 1 and 2 (.07 dB per frame at 60 fps). This cannot account for suppression depth in its entirety, which was many times larger at approximately 14 dB across image categories.”

      Lines 755 – 760: [In Experiment 3] “Using the same assumptions of a 500 ms response time delay, this would predict a suppression depth of 2.1 dB, 4.2 dB and 6.3 dB for the slow, medium and fast ramp speeds respectively. However, this difference cannot account for the size of the effects (Slow 9.64 dB, Medium 14.6 dB, Fast 18.97 dB). The difference in suppression depth based on reaction-time delays (± 2.1 dB) also does not match with our empirical data (Medium - Slow = 4.96 dB; Fast - Medium = 4.37 dB)”
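
      The reaction-time arithmetic quoted above can be reproduced with a short sketch. The step sizes, frame rate, 500 ms delay and observed depths are taken from the text; the function and variable names are hypothetical, and the calculation assumes the delay inflates both the bCFS and the reCFS threshold by the same amount.

```python
def rt_inflation_db(step_db_per_frame, frame_rate_hz=60.0,
                    reaction_time_s=0.5, n_thresholds=2):
    """Contrast change (dB) accumulated during the response delay,
    summed over both thresholds (bCFS and reCFS)."""
    frames_during_delay = frame_rate_hz * reaction_time_s
    return step_db_per_frame * frames_during_delay * n_thresholds

# Step sizes (dB per frame) and observed suppression depths (dB) from the text.
ramp_steps = {"slow": 0.035, "medium": 0.07, "fast": 0.105}
observed_depth_db = {"slow": 9.64, "medium": 14.6, "fast": 18.97}

inflation = {name: rt_inflation_db(step) for name, step in ramp_steps.items()}
residual = {name: observed_depth_db[name] - inflation[name] for name in ramp_steps}

for name in ramp_steps:
    print(f"{name}: RT inflation {inflation[name]:.2f} dB, "
          f"depth after removing it {residual[name]:.2f} dB")
```

      Under these assumptions, the depth remaining after subtracting the reaction-time contribution still increases from the slow to the fast ramp.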

      A second remark about the 'ramping rate' experiment: if we assume that perceptual switches occur with a certain non-zero probability per unit time (stochastically) at various contrasts along the ramp, then giving the percept more time to switch during the ramping process will lead to more switches happening at an earlier stage along the ramp. So: ramping contrast upward more slowly would lead to more switches at relatively low contrast, and ramping contrast downward more slowly would lead to more switches at relatively high contrasts. This assumption (that the probability of switching is non-zero at various contrasts along the ramp) seems entirely warranted. To what extent can that type of consideration explain the result of the 'ramping rate' experiment?

      We agree that for a given ramp speed there is a variable probability of a switch in perceptual state for both bCFS and reCFS portions of the trial. To put it in other words, for a given ramp speed and a given observer the distribution of durations at which transitions occur will exhibit variance. We see that variance in our data (just as it’s present in conventional binocular rivalry duration histograms), for example as a non-zero probability of switches at very short durations. One might surmise that slower ramp speeds would afford more opportunity for stochastic transitions to occur and that the measured suppression depths for slow ramps are underestimates of the suppression depth produced by contrast adaptation. Yet by the same token, the same underestimation would occur during fast ramp speeds, indicating that the difference may be even larger than we reported. In our revision we will spell this out in more detail, and indicate that a non-zero probability of switches at any time may lead to an underestimation of all recorded suppression depths.

      In our data, we believe the contribution of these stochastic switches is minimal. Our current Supplementary Figure 1(d) indicates that there is a non-zero probability of responses early in each ramp (e.g. durations < 2 seconds), yet these are a small proportion of all percept durations. This small proportion is clear in the empirical cumulative density function of percept durations, which we include below. Notably, during slow-ramp conditions, average percept durations actually increased, implying a resistance to any effect of early stochastic switching.

      Author response image 1.

      The data from Supplementary Figure 1D. (right) Same data reproduced as a cumulative density function. The non-zero probability of a switch occurring (for example at very short percept durations) is clear, but such switches are a small proportion of all switches. Notably, in slow ramp trials, there is more time for this stochastic switching to occur, which should lead to an underestimate of the overall suppression depth. Yet during slow-ramp conditions, average percept durations increased (vertical arrows), implying a resistance to any effect of early stochastic switching.

      When tying the 'dampened harmonic oscillator' finding to dynamic systems, one potential concern is that the authors are seeing the dampened oscillating pattern when plotting a very specific thing: the amount of contrast change that happened between two consecutive perceptual switches, in a procedure where contrast change direction reversed after each switch. The pattern is not observed, for instance, in a plot of neural activity over time, threshold settings over time, etcetera. I find it hard to assess what the observation of this pattern when representing a rather unique aspect of the data in such a specific way, has to do with prior observations of such patterns in plots with completely different axes.

      We acknowledge that fitting the DHO model to response order (rather than time) is a departure from previous investigations modelling oscillations over time. Our alignment to response order was a necessary step to avoid the smearing which occurs due to variation in individual participant threshold durations.

      Our Supplementary Figure 1 shows the variation in participant durations for the three rates of contrast change. From this pattern we can expect that fitting the DHO to perceptual changes over time would result in the poorest fit for slow rates of change (with the largest variation in durations), and best fit for fast rates of change (with least variation in durations).

      That is indeed what we see, reproduced in the review figure below. We include this to show the DHO is still applicable to perceptual changes over time when perceptual durations have relatively low variance (in the fast example), but not in the other cases. Thus the DHO is not merely a product of our alignment to response number, but this step is crucial to avoid the confound of temporal smearing when comparing between conditions.

      Author response image 2.

      DHO fit to perceptual thresholds over time. As a comparison to manuscript Figure 5 (aligning to response order), here we display the raw detrended changes in threshold over time per participant, and their average. Individual traces are shown in thin lines, the average is thick. Notably, in the slow and medium conditions, when perceptual durations had relatively high variance, the DHO is a poor fit to the average (shown in pink). The DHO is still an excellent fit in fast conditions, when modelling changes in threshold over time, owing to the reduced variance in perceptual durations (cf. Supplementary Figure 1). As a consequence, to remove the confound of individual participant durations, we have fitted the DHO when aligned to response order in our manuscript.
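
      The smearing argument can be illustrated with a toy sketch. It assumes, purely for illustration, that every participant follows the same damped harmonic oscillator as a function of response number and differs only in mean percept duration; the DHO parameters, the durations and all names are hypothetical and are not the fitted values from the manuscript.

```python
import math

def dho(x, amplitude=1.0, decay=0.15, freq=0.25):
    """Damped harmonic oscillator: amplitude * exp(-decay * x) * cos(2 * pi * freq * x)."""
    return amplitude * math.exp(-decay * x) * math.cos(2.0 * math.pi * freq * x)

# Hypothetical mean percept durations (s) for five participants.
percept_durations = [0.6, 0.8, 1.0, 1.2, 1.4]
n_points = 13

# Aligned to response order: every participant contributes dho(k) at response k.
order_avg = [sum(dho(k) for _ in percept_durations) / len(percept_durations)
             for k in range(n_points)]

# Aligned to time: at time t, a participant with duration d has reached response t / d,
# so participants are at different phases of the oscillation when averaged.
time_avg = [sum(dho(t / d) for d in percept_durations) / len(percept_durations)
            for t in range(n_points)]

print(f"deepest trough, response-order alignment: {min(order_avg):.2f}")
print(f"deepest trough, time alignment: {min(time_avg):.2f}")
```

      In this toy case the group average keeps the full oscillation when aligned to response order, whereas averaging over time mixes phases and flattens it; wider spreads of durations flatten it further.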

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      The terminology used: "suppression depth". The depth of interocular suppression indexed by detection threshold has long been used in the literature, such as in Tsuchiya et al., 2006. I notice that this manuscript has created a totally different manipulative definition of the depth of suppression, the authors should make this point clear to the readers to avoid confusion.

      We believe that our procedure does not create a new definition for suppression depth, but rather utilises the standard definition used for many years in the binocular rivalry literature: the ratio between a threshold measured for a target while it is in the state of suppression and for that same target when in the dominance state.

      We have now revised our introduction to make the explicit continuation from past methods to our present methodology clear:

      Lines 94 – 105: “One method for measuring interocular suppression is to compare the threshold for change-detection in a target when it is monocularly suppressed and when it is dominant, an established strategy in binocular rivalry research (Alais, 2012; Alais et al., 2010; Alais & Melcher, 2007; Nguyen et al., 2003). Probe studies using contrast as the dependent variable for thresholds measured during dominance and during suppression can advantageously standardise suppression depth in units of contrast within the same stimulus (e.g., Alais & Melcher, 2007; Ling et al., 2010). Ideally, the change should be a temporally smoothed contrast increment to the rival image being measured (Alais, 2012), a tactic that precludes abrupt onset transients and, moreover, provides a natural complement to the linear contrast ramps that are standard in bCFS research. In this study, we measure bCFS thresholds as the analogue of change-detection during suppression, and as their complement, record thresholds for returns to suppression (reCFS).”

      The paper provides a new method to measure CFS bidirectionally. Given the possible exceptional case of pareidolia faces, it would be important to discuss how the bidirectional measurement offers more information, e.g., how the bottom-up and top-down factors would be involved in the breakthrough phase and the re-suppression phase.

      In our discussion, we have now included the possibility of exceptional cases (such as pareidolia faces), and how an asymmetry may arise with respect to separate image categories affecting either bCFS or reCFS thresholds orthogonally.

      Lines 688 - 691: “...In a similar vein, small geometric shapes can be configured so as to resemble human faces, such as those used by Zhou et al. (2021)[footnote]. These kinds of faux faces could be used in concert with tCFS to compare suppression depth with that associated with actual faces.

      [footnote] Zhou et al. (2021) derived dominance and suppression durations with fixed-contrast images. In their study, genuine face images and faux faces remained suppressed for equivalent durations whereas genuine faces remained dominant significantly longer than did faux faces. The technique used by those investigators - interocular flash suppression (Wolfe, 1994) - is quite different from CFS in that it involves abrupt, asynchronous presentation of dissimilar stimuli to the two eyes. It would be informative to repeat their experiment using the tCFS procedure.”

      What makes the individual results in the discrete condition much less consistent than the tCFS (in Figure 2c)? The authors discussed that motivation or attention to the task would change between bCFS and reCFS blocks (Line 589). But this point is not clear. Does not the attention to task also fluctuate in the tCFS paradigm, as the target continuously comes and goes?

      We believe the discrete conditions have greater variance owing to the blocked design of the discrete conditions. A sequence of bCFS thresholds was collected in order (over ~15 mins), before switching to a sequence of back-to-back discrete reCFS thresholds (another ~15 mins), or a sequence of the tCFS condition. As the order of these blocks was randomized, thresholds collected in the discrete bCFS vs reCFS blocks could be separated by many minutes. In contrast, during tCFS, every bCFS threshold used to calculate the average is accompanied by a corresponding reCFS threshold collected within the same trial, separated by seconds. Thus the tCFS procedure naturally controls for waxing and waning attention, as within every change in attention, both thresholds are recorded for comparison.

      A second advantage is that because the tCFS design changes contrast based on visibility, targets spend more time close to the threshold governing awareness. This reduced distance to thresholds removes the opportunity for other influences (such as oculomotor influences, blinks, etc.) to introduce variance into the collected thresholds.

      Experiment 3 reported greater suppression depth with faster contrast change. Because the participant's response was always delayed (e.g., they report after they become aware that the target has disappeared), is it possible that the measured breakthrough threshold gets lower, the re-suppression threshold gets higher, just because the measuring contrast is changing faster?

      We have included an extended discussion of the contribution of reaction-times to the differences in suppression depth we report. Importantly, even a conservative reaction time of 500 ms, for both bCFS and reCFS events, cannot account for the difference in suppression depth between conditions.

      Lines 755 – 760: “Using the same assumptions of a 500 ms response time delay, this would predict a suppression depth of 2.1 dB, 4.2 dB and 6.3 dB for the slow, medium and fast ramp speeds respectively. However, this difference cannot account for the size of the effects (Slow 9.64 dB, Medium 14.6 dB, Fast 18.97 dB). The difference in suppression depth based on reaction-time delays (± 2.1 dB) also does not match with our empirical data (Medium - Slow = 4.96 dB; Fast - Medium = 4.37 dB).”

      In the current manuscript, some symbols are not shown properly (lines 145, 148, 150, 303).

      Thank you for pointing this out. We will arrange with the editors to fix the typos.

      Reviewer #2 (Recommendations For The Authors):

      Line 13: 'time needed'-> contrast needed?

      This sentence was referring to previous experiments which predominantly focus on the time of breakthrough.

      Line 57: Only this sentence uses saliency; everywhere else in the paper uses salience.

      We have updated to salience throughout.

      Fig. 1c: The higher variance in discrete measurement results may be due to more variation in discrete trials, e.g., trial duration and inter-trial intervals (ITIs). Tighter control is indeed one advantage of the continuous tCFS design. For the discrete condition, it would help to report more information about variation across trials. How long and variable are the trials? The ITIs? This information is also relevant to the hypothesis about adaptation in Experiment 3.

      In the discrete condition, each trial ended after the collection of a single response. Thus the variability of the trials is the same as the variability of the contrast thresholds reported in Figure 2. The distribution of these ‘trials’ (aka percept durations), is also shown in Supplementary Figure 1.

      The ITI between discrete trials was self-paced, and not recorded during the experiment.

      Line 598: 'equivalently' is a strong word. The benefit is perhaps best stated relatively: bCFS and reCFS are measured under closer conditions (e.g., adaptation, attention) with continuous experiments compared to discrete ones.

      We agree - and have amended our manuscript:

      Lines 629 – 632: “Alternating between bCFS/reCFS tasks also means that any adaptation occurring over the trial will occur in close proximity to each threshold, as will any waning of attention. The benefit being that bCFS and reCFS thresholds are measured under closer conditions in continuous trials, compared to discrete ones.”

      Reviewer #3 (Recommendations For The Authors):

      Figure 1 includes fairly elaborate hypothetical results and how they would be interpreted by the authors, but I didn't really see any mention of this content in the main text. It wasn't until I started reading the caption that I figured it out. A more elaborate reference to the figure would prevent readers from overlooking (part of) the figure's message.

      We have now made it clearer in the text that those details are contained in the caption to Figure 1.

      Lines 113 – 115: “Figure 1 outlines hypothetical results that can be obtained when recording reCFS thresholds as a complement to bCFS thresholds in order to measure suppression depth.”

      A piece of text seems to have been accidentally removed on line 267.

      Thank you, this has now been amended.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      The authors have developed a compelling coarse-grained simulation approach for nucleosome-nucleosome interactions within a chromatin array. The data presented are solid and provide new insights that allow for predictions of how chromatin interactions might occur in vivo, but some of the claims should be tempered. The tools will be valuable for the chromosome biology field.

      Response: We want to thank the editors and all the reviewers for their insightful comments. We have made substantial changes to the manuscript to improve its clarity and temper necessary claims, as detailed in the responses, and we performed additional analyses to address the reviewers’ concerns. We believe that we have successfully addressed all the comments, and the quality of our paper has improved significantly.

      In the following, we provide point-by-point responses to all the reviewer comments.

      RESPONSE TO REFEREE 1:

      Comment 0: This study develops and applies a coarse-grained model for nucleosomes with explicit ions. The authors perform several measurements to explore the utility of a coarse-grained simulation method to model nucleosomes and nucleosome arrays with explicit ions and implicit water. ‘Explicit ions’ means that the charged ions are modeled as particles in simulation, allowing the distributions and dynamics of ions to be measured. Since nucleosomes are highly charged and modulated by charge modifications, this innovation is particularly relevant for chromatin simulation.

      Response: We thank the reviewer for the excellent summary of the work.

      Comment 1: Strengths: This simulation method produces accurate predictions when compared to experiments for the binding affinity of histones to DNA, counterion interactions, nucleosome DNA unwinding, nucleosome binding free energies, and sedimentation coefficients of arrays. The variety of measured quantities makes both this work and the impact of this coarse-grained methodology compelling. The comparison between the contributions of sodium and magnesium ions to nucleosome array compaction, presented in Figure 3, was exciting and a novel result that this simulation methodology can assess.

      Response: We appreciate the reviewer’s strong assessment of the paper’s significance, novelty, and broad interest, and we thank him/her for the detailed suggestions and comments.

      Comment 2: Weaknesses: The presentation of experimental data as representing in vivo systems is a simplification that may misrepresent the results of the simulation work. In vivo, in this context, typically means experimental data from whole cells. What one could expect for in vivo experimental data is measurements on nucleosomes from cell lysates where various and numerous chemical modifications are present. On the contrary, some of the experimental data used as a comparison are from in vitro studies. In vitro in this context means nucleosomes were formed ‘in a test tube’ or under controlled conditions that do not represent the complexity of an in vivo system. The simulations performed here are more directly compared to in vitro conditions. This distinction likely impacts to what extent these simulation results are biologically relevant. In vivo and in vitro differences could be clarified throughout and discussed.

      Response: As detailed in Response to Comment 3, we have made numerous modifications in the Introduction, Results, and Discussion Section to emphasize the differences between reconstituted and native nucleosomes. The newly added texts also delve into the utilization of the interaction strength measured for reconstituted nucleosomes as a reference point for conceptualizing the interactions among native nucleosomes.

      Comment 3: In the introduction (pg. 3), the authors discuss the uncertainty of nucleosome-to-nucleosome interaction strengths in vivo. For example, the authors discuss works such as Funke et al. However, Funke et al. used reconstituted nucleosomes from recombinant histones with one controlled modification (H4 acetylation). Therefore, this study that the authors discuss is measuring nucleosome’s in vitro affinity, and there could be significant differences in vivo due to various posttranslational modifications. Please revise the introduction, results section “Close contacts drive nucleosome binding free energy,” and discussion to reflect and clarify the difference between in vitro and in vivo measurements. Please also discuss how biological variability could impact your findings in vivo. The works of Alexey Onufriev’s lab on the sensitivity of nucleosomes to charge changes (10.1016/j.bpj.2010.06.046, 10.1186/s13072-018-0181-5), such as some PTMs, are one potential starting place to consider how modifications alter nucleosome stability in vivo.

      Response: We thank the reviewer for the insightful comments and agree that native nucleosomes can differ from reconstituted nucleosomes due to the presence of histone modifications.

      We have revised the introduction to emphasize the differences between in vitro and in vivo nucleosomes. The new text now reads

      "The relevance of physicochemical interactions between nucleosomes to chromatin organization in vivo has been constantly debated, partly due to the uncertainty in their strength [cite]. Examining the interactions between native nucleosomes poses challenges due to the intricate chemical modifications that histone proteins undergo within the nucleus and the variations in their underlying DNA sequences [cite]. Many in vitro experiments have opted for reconstituted nucleosomes that lack histone modifications and feature well-positioned 601-sequence DNA to simplify the chemical complexity. These experiments aim to establish a fundamental reference point for understanding the strength of interactions within native nucleosomes. Nevertheless, even with reconstituted nucleosomes, a consensus regarding the significance of their interactions remains elusive. For example, using force-measuring magnetic tweezers, Kruithof et al. estimated the inter-nucleosome binding energy to be ∼ 14 kBT [cite]. On the other hand, Funke et al. introduced a DNA origami-based force spectrometer to directly probe the interaction between a pair of nucleosomes [cite], circumventing any potential complications from interpretations of single molecule traces of nucleosome arrays. Their measurement reported a much weaker binding free energy of approximately 2 kBT. This large discrepancy in the reported reference values complicates a further assessment of the interactions between native nucleosomes and their contribution to chromatin organization in vivo."

      We modified the first paragraph of the results section to read

      "Encouraged by the explicit ion model’s accuracy in reproducing experimental measurements of single nucleosomes and nucleosome arrays, we moved to directly quantify the strength of inter-nucleosomes interactions. We once again focus on reconstituted nucleosomes for a direct comparison with in vitro experiments. These experiments have yielded a wide range of values, ranging from 2 to 14 kBT [cite]. Accurate quantification will offer a reference value for conceptualizing the significance of physicochemical interactions among native nucleosomes in chromatin organization in vivo."

      New text was added to the Discussion Section to emphasize the implications of simulation results for interactions among native nucleosomes.

      "One significant finding from our study is the predicted strong inter-nucleosome interactions under the physiological salt environment, reaching approximately 9 kBT. We showed that the much lower value reported in a previous DNA origami experiment is due to the restricted nucleosomal orientation inherent to the device design. Unrestricted nucleosomes allow more close contacts to stabilize binding. A significant nucleosome binding free energy also agrees with the high forces found in single-molecule pulling experiments that are needed for chromatin unfolding [cite]. We also demonstrate that this strong inter-nucleosomal interaction is largely preserved at longer nucleosome repeat lengths (NRL) in the presence of linker histone proteins. While posttranslational modifications of histone proteins may influence inter-nucleosomal interactions, their effects are limited, as indicated by Ding et al. [cite], and are unlikely to completely abolish the significant interactions reported here. Therefore, we anticipate that, in addition to molecular motors, chromatin regulators, and other molecules inside the nucleus, intrinsic inter-nucleosome interactions are important players in chromatin organization in vivo."
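
      To give a sense of scale for the binding free energies discussed above (approximately 2, 9 and 14 kBT), the following sketch converts them to kcal/mol. The temperature of 300 K is an assumption of this sketch, as are the function names; the relative Boltzmann weight ignores concentration and standard-state effects and is meant only as a rough illustration.

```python
import math

# Gas constant in kcal/(mol*K), i.e. the Boltzmann constant per mole.
GAS_CONSTANT_KCAL_PER_MOL_K = 0.0019872041

def kbt_to_kcal_per_mol(energy_kbt, temperature_k=300.0):
    """Convert an energy expressed in units of kBT to kcal/mol.

    The default temperature of 300 K is an assumption of this sketch.
    """
    return energy_kbt * GAS_CONSTANT_KCAL_PER_MOL_K * temperature_k

def relative_bound_weight(binding_energy_kbt):
    """Boltzmann weight exp(dF / kBT) of a bound state relative to the unbound
    state, for a binding free energy of magnitude dF (in kBT)."""
    return math.exp(binding_energy_kbt)

# Binding free energies mentioned in the text, in kBT.
reported = {"Funke et al.": 2.0, "this study": 9.0, "Kruithof et al.": 14.0}

for label, energy in reported.items():
    print(f"{label}: {energy:.0f} kBT = {kbt_to_kcal_per_mol(energy):.2f} kcal/mol, "
          f"relative weight {relative_bound_weight(energy):.3g}")
```

      Because the weight is exponential in the free energy, the gap between 2 and 9 kBT corresponds to roughly a thousand-fold difference in the relative stability of the bound state.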

      The suggested references (10.1016/j.bpj.2010.06.046, 10.1186/s13072-018-0181-5) are now included as citations # 44 and 45.

      Comment 4: Due to the implicit water model, do you know if ions can penetrate the nucleosome more? For example, does the lack of explicit water potentially cause sodium to cluster in the DNA grooves more than is biologically relevant, as shown in Figure 1?

      Response: We thank the reviewer for the insightful comments. The parameters of the explicit-ion model were deduced from all-atom simulations and fine-tuned to replicate crucial aspects of the local ion arrangements around DNA (1). The model’s efficacy was demonstrated in reproducing the radial distribution function of Na+ and Mg2+ ion distributions in the proximity of DNA (see Author response image 1). Consequently, the number of ions near DNA in the coarse-grained models aligns with that observed in all-atom simulations, and we do not anticipate any significant, unphysical clustering. It is worth noting that previous atomistic simulations have also reported the presence of a substantial quantity of Na+ ions in close proximity to nucleosomal DNA (refer to Author response image 2).

      Author response image 1.

      Comparison between the radial distribution functions of Na+ (left) and Mg2+ (right) ions around the DNA phosphate groups computed from all-atom (black) and coarse-grained (red) simulations. Figure adapted from Figure 4 of Ref. 1. The coarse-grained explicit ion model used in producing the red curves is identical to the one presented in the current manuscript. (© 2011, AIP Publishing. This figure is reproduced with permission from Figure 4 in Freeman GS, Hinckley DM, de Pablo JJ (2011) A coarse-grain three-site-per-nucleotide model for DNA with explicit ions. The Journal of Chemical Physics 135:165104. It is not covered by the CC-BY 4.0 license and further reproduction of this figure would need permission from the copyright holder.)
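
      For readers unfamiliar with the quantity being compared, a radial distribution function can be computed from particle coordinates roughly as sketched below. This is a generic single-frame illustration for a cubic periodic box, not the analysis code used for the figure; the function name, the box size and the coordinates are all hypothetical.

```python
import math

def radial_distribution(ref_points, ion_points, box_length, r_max, n_bins):
    """Return (bin_centers, g_r) for ions around reference sites.

    Assumes a cubic periodic box and r_max <= box_length / 2 so that the
    minimum-image convention is valid.
    """
    dr = r_max / n_bins
    counts = [0] * n_bins
    for ref in ref_points:
        for ion in ion_points:
            d2 = 0.0
            for a, b in zip(ref, ion):
                delta = a - b
                # Minimum-image convention: use the nearest periodic copy.
                delta -= box_length * round(delta / box_length)
                d2 += delta * delta
            r = math.sqrt(d2)
            if r < r_max:
                counts[min(int(r / dr), n_bins - 1)] += 1

    # Normalise by the count expected for uniformly distributed ions.
    density = len(ion_points) / box_length ** 3
    centers, g_r = [], []
    for k, count in enumerate(counts):
        r_lo, r_hi = k * dr, (k + 1) * dr
        shell_volume = 4.0 / 3.0 * math.pi * (r_hi ** 3 - r_lo ** 3)
        centers.append(0.5 * (r_lo + r_hi))
        g_r.append(count / (len(ref_points) * density * shell_volume))
    return centers, g_r

# Hypothetical coordinates: one phosphate site and two ions in a 10-unit box.
phosphates = [(0.0, 0.0, 0.0)]
ions = [(1.2, 0.0, 0.0), (9.0, 0.0, 0.0)]  # the second ion is 1.0 away via periodicity
centers, g_r = radial_distribution(phosphates, ions, box_length=10.0,
                                   r_max=5.0, n_bins=10)
```

      In practice the histogram would be averaged over many simulation frames before comparing models; values of g(r) above 1 indicate ion accumulation relative to a uniform distribution.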

      Author response image 2.

      Three-dimensional distribution of sodium ions around the nucleosome determined from all-atom explicit solvent simulations. Darker blue colors indicate higher sodium density and high density of sodium ions around the DNA is clearly visible. The crystallographically identified acidic patch has been highlighted as spheres on the surface of the histone core and a high level of sodium condensation is observed around these residues. Figure adapted from Ref. 2. (© 2009, American Chemical Society. This figure is reproduced with permission from Figure 7 in Materese CK, Savelyev A, Papoian GA (2009) Counterion Atmosphere and Hydration Patterns near a Nucleosome Core Particle. J. Am. Chem. Soc. 131:15005–15013. It is not covered by the CC-BY 4.0 license and further reproduction of this figure would need permission from the copyright holder.)

      Comment 5: Histone side chain to DNA interactions, such as histone arginines to DNA, are essential for nucleosome stability. Therefore, can the authors provide validation or references supporting your model of the nucleosome with one bead per amino acid? I would like to see if the nucleosomes are stable in an extended simulation or if similar dynamic motions to all-atom simulations are observed.

      Response: The nucleosome model, which employs one bead per amino acid and lacks explicit ions, has undergone extensive calibration and has found application in numerous prior studies. For instance, the de Pablo group utilized a similar model to showcase its ability to accurately replicate the experimentally measured nucleosome unwinding free energy penalty (3), sequence-dependent nucleosome sliding (4), and the interaction between two nucleosomes (5). Similarly, the Takada group employed a comparable model to investigate acetylation-modulated tri-nucleosome structures (6), chromatin structures influenced by chromatin factors (7), and nucleosome sliding (8). Our group also employed this model to study the structural rearrangement of a tetranucleosome (9) and the folding of larger chromatin systems (10). In cases where data were available, simulations frequently achieved quantitative reproduction of experimental results.

      We added the following text to the manuscript to emphasize previous studies that validate the model accuracy.

      "We observe that residue-level coarse-grained models have been extensively utilized in prior studies to examine the free energy penalty associated with nucleosomal DNA unwinding [cite], sequence-dependent nucleosome sliding [cite], binding free energy between two nucleosomes [cite], chromatin folding [cite], the impact of histone modifications on tri-nucleosome structures [cite], and protein-chromatin interactions [cite]. The frequent quantitative agreement between simulation and experimental results supports the utility of such models in chromatin studies. Our introduction of explicit ions, as detailed below, further extends the applicability of these models to explore the dependence of chromatin conformations on salt concentrations."

      We agree that arginines are important for nucleosome stability. Since we assign positive charges to these residues, their contribution to DNA binding can be effectively captured. The model’s ability to reproduce nucleosome stability is supported by the good agreement between the simulated free energy penalty associated with nucleosomal DNA unwinding and the experimental value estimated from single-molecule experiments (Figure 1).

      To further evaluate nucleosome stability in our simulations, we conducted a 200-ns-long simulation of a nucleosome featuring the 601-sequence under physiological salt conditions (100 mM NaCl and 0.5 mM MgCl2), consistent with the conditions in Figure 1 of the main text. We found that the nucleosome maintains its overall structure during this simulation. The nucleosome’s radius of gyration (Rg) remained close to the value corresponding to the PDB structure (3.95 nm) throughout the entire simulation period (see Author response image 3).

      Author response image 3.

      Time trace of the radius of gyration (Rg) of a nucleosome with the 601-sequence along an unbiased, equilibrium trajectory. It is evident the Rg fluctuates around the value found in the PDB structure (3.95 nm), supporting the stability of the nucleosome in our simulation.

      Occasional fluctuations in Rg corresponded to momentary, partial unwrapping of the nucleosomal DNA, a phenomenon observed in single-molecule experiments. However, we advise caution due to the coarse-grained nature of our simulations, which prevents a direct mapping of simulation timescale to real time. Importantly, the rate of DNA unwrapping in our simulations is notably overestimated.
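
      As an illustration of the stability check described above, the following is a minimal sketch of how a radius-of-gyration time trace could be computed from bead coordinates. It is not the authors' analysis script: the function names, the plain-list trajectory layout (frames of (x, y, z) tuples), and the optional per-bead masses are all assumptions made for this example.

```python
import math

def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration of one frame.

    coords: sequence of (x, y, z) bead positions (hypothetical layout).
    masses: optional per-bead masses; equal weights are assumed if omitted.
    """
    if masses is None:
        masses = [1.0] * len(coords)
    total = sum(masses)
    # Center of mass, one component at a time.
    center = [sum(m * c[k] for m, c in zip(masses, coords)) / total
              for k in range(3)]
    # Mass-weighted mean squared distance from the center of mass.
    msd = sum(m * sum((c[k] - center[k]) ** 2 for k in range(3))
              for m, c in zip(masses, coords)) / total
    return math.sqrt(msd)

def rg_time_trace(trajectory, masses=None):
    """Rg for every frame of a trajectory (a sequence of frames)."""
    return [radius_of_gyration(frame, masses) for frame in trajectory]
```

      Comparing the resulting trace against the Rg of the reference PDB structure (3.95 nm in the response above) would then show whether the particle stays near its folded state; transient excursions would correspond to the partial unwrapping events mentioned in the text.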

      It’s plausible that coarse-grained models, lacking side chains, might underestimate the barrier for DNA sliding along the nucleosome. Specifically, our model, without differentiation between interactions among various amino acids and nucleotides, accurately reproduces the average nucleosomal DNA binding affinity but may not capture the energetic variations among binding interfaces. Since sliding’s contribution to chromatin organization is minimal due to the use of strongly positioning 601 sequences, we imposed rigidity on the two nucleotides situated at the dyad axis to prevent nucleosomal DNA sliding. In future studies, enhancing the calibration of protein-DNA interactions to achieve improved sequence specificity would be an intriguing avenue. To underscore this limitation of the model, we have included the following text in the discussion section of the main text.

      "Several aspects of the coarse-grained model presented here can be further improved. For instance, the introduction of specific protein-DNA interactions could help address the differences in non-bonded interactions between amino acids and nucleotides beyond electrostatics [cite]. Such a modification would enhance the model’s accuracy in predicting interactions between chromatin and chromatin-proteins. Additionally, the single-bead-per-amino-acid representation used in this study encounters challenges when attempting to capture the influence of histone modifications, which are known to be prevalent in native nucleosomes. Multiscale simulation approaches may be necessary [cite]. One could first assess the impact of these modifications on the conformation of disordered histone tails using atomistic simulations. By incorporating these conformational changes into the coarse-grained model, systematic investigations of histone modifications on nucleosome interactions and chromatin organization can be conducted. Such a strategy may eventually enable the direct quantification of interactions among native nucleosomes and even the prediction of chromatin organization in vivo."

      Comment 6: The solvent salt conditions vary in the experimental reference data for internucleosomal interaction energies. The authors note, for example, that the in vitro data from Funke et al. differs the most from other measurements, but the solvent conditions are 35 mM NaCl and 11 mM MgCl2. Since this simulation method allows for this investigation, could the authors speak to or investigate if solvent conditions are responsible for the variability in experimental reference data? The authors conclude on pg. 8-9 and Figure 4 that orientational restraints in the DNA origami methodology are responsible for differences in interaction energy. Can the authors rule out ion concentration contributions?

      Response: We thank the reviewer for the insightful comment. We would like to clarify that the black curve presented in Figure 4B of the main text was computed using the salt concentration specified by Funke et al. (35 mM NaCl and 11 mM MgCl2). Furthermore, there were no restraints placed on nucleosome orientations during these calculations. Consequently, the results in Figure 4B can be directly compared with the black curve in Figure 5C. The data in Figure 5C were calculated under physiological salt conditions (150 mM NaCl and 2 mM MgCl2), which are the standard solvent salt conditions used in most studies. It is worth noting that the free energy of nucleosome binding is significantly higher at the salt concentration employed by Funke et al. (14 kBT) than the value at the physiological salt condition (9 kBT). Therefore, comparing the results in Figure 4B and 5C eliminates ion concentration conditions as a potential cause for the almost negligible result reported by Funke et al.

      Comment 7: In the discussion on pg. 12 residual-level should be residue-level.

      Response: We apologize for the oversight and have corrected the grammatical error in our manuscript.

      RESPONSE TO REFEREE 2:

      Comment 0: In this manuscript, the authors introduced an explicit ion model using the coarse-grained modelling approach to model the interactions between nucleosomes and evaluate their effects on chromatin organization. The strength of this method lies in the explicit representation of counterions, especially divalent ions, which are notoriously difficult to model. To achieve their aims and validate the accuracy of the model, the authors conducted coarse-grained molecular dynamics simulations and compared predicted values to the experimental values of the binding energies of protein-DNA complexes and the free energy profile of nucleosomal DNA unwinding and inter-nucleosome binding. Additionally, the authors employed umbrella sampling simulations to further validate their model, reproducing experimentally measured sedimentation coefficients of chromatin under varying salt concentrations of monovalent and divalent ions.

      Response: We thank the reviewer for the excellent summary of the work.

      Comment 1: The significance of this study lies in the authors’ coarse-grained model which can efficiently capture the conformational sampling of molecules while maintaining a low computational cost. The model reproduces the scale and, in some cases, the shape of the experimental free energy profile for specific molecule interactions, particularly inter-nucleosome interactions. Additionally, the authors’ method resolves certain experimental discrepancies related to determining the strength of inter-nucleosomal interactions. Furthermore, the results from this study support the crucial role of intrinsic physicochemical interactions in governing chromatin organization within the nucleus.

      Response: We appreciate the reviewer’s strong assessment of the paper’s significance, novelty, and broad interest, and we thank him/her for the detailed suggestions and comments.

      Comment 2: The method is simple but can be useful, given the authors can provide more details on their ion parameterization. The paper says that parameters in their “potentials were tuned to reproduce the radial distribution functions and the potential of mean force between ion pairs determined from all-atom simulations.” However, no details on their all-atom simulations were provided; at some point, the authors refer to Reference 67 which uses all-atom simulations but does not employ the divalent ions. Also, no explanation is given for their modelling of protein-DNA complexes.

      Response: We appreciate the reviewer’s suggestion on clarifying the parameterization of the explicit-ion model. The parameterization was carried out neither in reference 67 nor by us, but by the de Pablo group in citation 53. Specifically, ion potentials were parameterized to fit the potential of mean force between both monovalent and divalent ion pairs, calculated either from all-atom simulations or from the literature. The authors carried out extensive validations of the model parameters by comparing the radial distribution functions of ions computed using the coarse-grained model with those from all-atom simulations. Good agreement between coarse-grained and all-atom results supports the parameters’ accuracy in reproducing the local structures of ion interactions.

      To avoid confusion, we have revised the text from:

      "Parameters in these potentials were tuned to reproduce the radial distribution functions and the potential of mean force between ion pairs determined from all-atom simulations."

      to

      "Parameters in these potentials were tuned by Freeman et al. [cite] to reproduce the radial distribution functions and the potential of mean force between ion pairs determined from all-atom simulations."

      We modified the Supporting Information at several places to clarify the setup and interpretation of protein-DNA complex simulations.

      For example, we clarified the force fields used in these simulations with the following text:

      "All simulations were carried out using the software Lammps [cite] with the force fields defined in the previous two sections."

      We added details on the preparation of these simulations as follows

      "We carried out a series of umbrella-sampling simulations to compute the binding free energies of a set of nine protein-DNA complexes with experimentally documented binding dissociation constants [cite]. Initial configurations of these simulations were prepared using the crystal structures with the corresponding PDB IDs listed in Fig. S1."

      We further revised the caption of Figure S1 (included as Author response image 4) to facilitate the interpretation of simulation results.

      Author response image 4.

      The explicit-ion model predicts the binding affinities of protein-DNA complexes well, related to Fig. 1 of the main text. Experimental and simulated binding free energies are compared for nine protein-DNA complexes [cite], with a Pearson Correlation coefficient of 0.6. The PDB ID for each complex is indicated in red, and the diagonal line is drawn in blue. The significant correlation between simulated and experimental values supports the accuracy of the model. To further enhance the agreement between the two, it will be necessary to implement specific non-bonded interactions that can resolve differences among amino acids and nucleotides beyond simple electrostatics. Such modifications will be interesting avenues for future research. See text Section: Binding free energy of protein-DNA complexes for simulation details.
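
      For readers who wish to run a comparison of the kind summarized in this caption on their own data, the following is a minimal sketch rather than the authors' analysis script. It assumes that an experimental dissociation constant Kd (in mol/L) is converted to a standard binding free energy through ΔG = RT ln(Kd / c°) with c° = 1 M, and that agreement with simulation is measured by the Pearson correlation coefficient. The function names and the default temperature of 300 K are hypothetical.

```python
import math

R_KCAL_PER_MOL_K = 0.0019872041  # gas constant in kcal/(mol K)

def binding_free_energy(kd_molar, temperature=300.0):
    """Standard binding free energy (kcal/mol) from a dissociation constant.

    Assumes a 1 M standard state: dG = R * T * ln(Kd / 1 M).
    Tighter binding (smaller Kd) gives a more negative value.
    """
    return R_KCAL_PER_MOL_K * temperature * math.log(kd_molar / 1.0)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

      In use, one would pass the list of experimental free energies and the list of simulated free energies (in the same order of complexes) to `pearson`.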

      Comment 3: Overall, the paper is well-written, concise and easy to follow but some statements are rather blunt. For example, the linker histone contribution (Figure 5D) is not clear and could be potentially removed. The result on inter-nucleosomal interactions and comparison to experimental values from Ref#44 is the most compelling. It would be nice to see if the detailed shape of the profile for restrained inter-nucleosomal interactions in Figure 4B corresponds to the experimental profile. Including the dependence of free energy on a vertex angle would also be beneficial.

      Response: We thank the reviewer for the comments and agree that the discussion on linker histone results was brief. However, we believe the results are important and demonstrate our model’s advantage over mesoscopic approaches in capturing the impact of chromatin regulators on chromatin organization.

      Therefore, instead of removing the result, we expanded the text to better highlight its significance, to aid its comprehension, and to emphasize its biological implications. The image in Figure 5D was also redesigned to better visualize the cross contacts between nucleosomes mediated by histone H1. The added text is quoted below, and the new Figure 5 is included.

      Author response image 5.

      Revised main text Figure 5, with Figure 5D modified for improved visual clarity.

      "Importantly, we found that the weakened interactions upon extending linker DNA can be more than compensated for by the presence of histone H1 proteins. This is demonstrated in Fig. 5C and Fig. S8, where the free energy cost for tearing apart two nucleosomes with 167 bp DNA in the presence of linker histones (blue) is significantly higher than the curve for bare nucleosomes (red). Notably, at larger inter-nucleosome distances, the values even exceed those for 147 bp nucleosomes (black). A closer examination of the simulation configurations suggests that the disordered C-terminal tail of linker histones can extend and bind the DNA from the second nucleosome, thereby stabilizing the internucleosomal contacts (as shown in Fig. 5D). Our results are consistent with prior studies that underscore the importance of linker histones in chromatin compaction [cite], particularly in eukaryotic cells with longer linker DNA [cite]."

      We further compared the simulated free energy profile, depicting the center of mass distance between nucleosomes, with the experimental profile, as depicted in Author response image 6. The agreement between the simulated and experimental results is evident. The nuanced features observed between 60 and 80 Å in the simulated profile stem from DNA unwinding to accommodate the incoming nucleosome, creating a small energy barrier. It’s worth noting that such unwinding is unlikely to occur in the experimental setup due to the hybridization method used to anchor nucleosomes onto the DNA origami. Moreover, our simulation did not encompass configurations below 60 Å, resulting in a lack of data in that region within the simulated profile.

      We projected the free energy profile onto the vertex angle of the DNA origami device, utilizing the angle between two nucleosome faces as a proxy. Once more, the simulated profile demonstrates reasonable agreement with the experimental data (Author response image 6). Author response image 6 has been incorporated as Figure S4 in the Supporting Information.

      Author response image 6.

      Explicit ion modeling reproduces the experimental free energy profiles of nucleosome binding. (A) Comparison between the simulated (black) and experimental (red) free energy profile as a function of the inter-nucleosome distance. Error bars were computed as the standard deviation of three independent estimates. The barrier observed between 60 Å and 80 Å arises from the unwinding of nucleosomal DNA when the two nucleosomes are in close proximity, as highlighted in the orange circle. (B) Comparison between the simulated (black) and experimental (red) free energy profile as a function of the vertex angle. Error bars were computed as the standard deviation of three independent estimates. (C) Illustration of the vertex angle Φ used in panel (B).

      Comment 4: Another limitation of this study is that the authors’ model sacrifices certain atomic details and thermodynamic properties of the modelled systems. The potential parameters of the counter ions were derived solely by reproducing the radial distribution functions (RDFs) and potential of mean force (PMF) based on all-atom simulations (see Methods), without considering other biophysical and thermodynamic properties from experiments. Lastly, the authors did not provide any examples or tutorials for other researchers to utilize their model, thus limiting its application.

      Response: We agree that residue-level coarse-grained modeling indeed sacrifices certain atomistic details. This sacrifice can be potentially limiting when studying the impact of chemical modifications, especially on histone and DNA methylations. We added a new paragraph in the Discussion Section to point out such limitations and the relevant text is quoted below.

      "Several aspects of the coarse-grained model presented here can be further improved. For instance, the introduction of specific protein-DNA interactions could help address the differences in non-bonded interactions between amino acids and nucleotides beyond electrostatics [cite]. Such a modification would enhance the model’s accuracy in predicting interactions between chromatin and chromatin-proteins. Additionally, the single-bead-per-amino-acid representation used in this study encounters challenges when attempting to capture the influence of histone modifications, which are known to be prevalent in native nucleosomes. Multiscale simulation approaches may be necessary [cite]. One could first assess the impact of these modifications on the conformation of disordered histone tails using atomistic simulations. By incorporating these conformational changes into the coarse-grained model, systematic investigations of histone modifications on nucleosome interactions and chromatin organization can be conducted. Such a strategy may eventually enable the direct quantification of interactions among native nucleosomes and even the prediction of chromatin organization in vivo."

      Nevertheless, it’s important to note that while the model sacrifices accuracy, it compensates with superior efficiency. Atomistic simulations face significant challenges in conducting extensive free energy calculations required for a quantitative evaluation of ion impacts on chromatin structures.

      The explicit ion model, introduced by the de Pablo group, follows a standard approach adopted by other research groups, such as the parameterization of ion models using the potential of mean force from atomistic simulations (11; 12). According to multiscale coarse-graining theory, reproducing the potential of mean force (PMF) enables the coarse-grained model to achieve thermodynamic consistency with the atomistic model, ensuring identical statistical properties derived from them. However, it’s crucial to recognize that an inherent limitation of such approaches is their dependence on the accuracy of atomistic force fields in reproducing thermodynamic properties from experiments, as any inaccuracies in the atomistic force fields will similarly affect the resulting coarse-grained (CG) model.
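
      The link between the two quantities used for parameterization, the radial distribution function g(r) and the pair potential of mean force W(r), is the standard relation W(r) = -kBT ln g(r). The sketch below only illustrates that relation; it is not the fitting procedure of Freeman et al., and the function name, units (kcal/mol), and default temperature are assumptions made for this example.

```python
import math

KB_KCAL_PER_MOL_K = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def pmf_from_rdf(g_values, temperature=300.0):
    """Pair potential of mean force from a tabulated RDF.

    Applies W(r) = -kB * T * ln g(r) to each bin. Bins with g(r) == 0
    were never sampled, so W is undefined there and None is returned.
    """
    kt = KB_KCAL_PER_MOL_K * temperature
    return [(-kt * math.log(g) if g > 0 else None) for g in g_values]
```

      A peak in g(r) (values above 1) maps to a free energy minimum (negative W), which is why matching the RDF peaks of ions near DNA phosphates and matching the ion-pair PMF are closely related validation targets.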

      We have provided the implementation of the CG model and detailed instructions on setting up and performing simulations in a GitHub repository. Examples include simulation setup for a protein-DNA complex and for a nucleosome with the 601-sequence.

      References

      [1] Freeman GS, Hinckley DM, de Pablo JJ (2011) A coarse-grain three-site-per-nucleotide model for DNA with explicit ions. The Journal of Chemical Physics 135:165104.

      [2] Materese CK, Savelyev A, Papoian GA (2009) Counterion Atmosphere and Hydration Patterns near a Nucleosome Core Particle. J. Am. Chem. Soc. 131:15005–15013.

      [3] Lequieu J, Cordoba A, Schwartz DC, de Pablo JJ (2016) Tension-Dependent Free Energies of Nucleosome Unwrapping. ACS Cent. Sci. 2:660–666.

      [4] Lequieu J, Schwartz DC, de Pablo JJ (2017) In silico evidence for sequence-dependent nucleosome sliding. Proc. Natl. Acad. Sci. U.S.A. 114.

      [5] Moller J, Lequieu J, de Pablo JJ (2019) The Free Energy Landscape of Internucleosome Interactions and Its Relation to Chromatin Fiber Structure. ACS Cent. Sci. 5:341–348.

      [6] Chang L, Takada S (2016) Histone acetylation dependent energy landscapes in trinucleosome revealed by residue-resolved molecular simulations. Sci Rep 6:34441.

      [7] Watanabe S, Mishima Y, Shimizu M, Suetake I, Takada S (2018) Interactions of HP1 Bound to H3K9me3 Dinucleosome by Molecular Simulations and Biochemical Assays. Biophysical Journal 114:2336–2351.

      [8] Brandani GB, Niina T, Tan C, Takada S (2018) DNA sliding in nucleosomes via twist defect propagation revealed by molecular simulations. Nucleic Acids Research 46:2788–2801.

      [9] Ding X, Lin X, Zhang B (2021) Stability and folding pathways of tetra-nucleosome from six-dimensional free energy surface. Nat Commun 12:1091.

      [10] Liu S, Lin X, Zhang B (2022) Chromatin fiber breaks into clutches under tension and crowding. Nucleic Acids Research 50:9738–9747.

      [11] Savelyev A, Papoian GA (2010) Chemically accurate coarse graining of double-stranded DNA. Proc. Natl. Acad. Sci. U.S.A. 107:20340–20345.

      [12] Noid WG (2013) Perspective: Coarse-grained models for biomolecular systems. The Journal of Chemical Physics 139:090901.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Response to Reviewer 1

      Summary:

      The authors introduce a denoising-style model that incorporates both structure and primary-sequence embeddings to generate richer embeddings of peptides. My understanding is that the authors use ESM for the primary sequence embeddings, take resolved structures (or use structural predictions from AlphaFold when they're not available), and then develop an architecture to combine these two with a loss that seems reminiscent of diffusion models or masked language model approaches. The embeddings can be viewed as ensemble-style embedding of the two levels of sequence information, or with AlphaFold, an ensemble of two methods (ESM+AlphaFold). The authors also gather external datasets to evaluate their approach and compare it to previous approaches. The approach seems promising and appears to out-compete previous methods at several tasks. Nonetheless, I have strong concerns about a lack of verbosity as well as the exclusion of relevant methods and references.

      Thank you for the comprehensive summary. We have responded point by point to the concerns listed in the review below and have modified our manuscript accordingly.

      Advances:

      I appreciate the breadth of the analysis and comparisons to other methods. The authors separate tasks, models, and sizes of models in an intuitive, easy-to-read fashion that I find valuable for selecting a method for embedding peptides. Moreover, the authors gather two datasets for evaluating embeddings' utility for predicting thermostability. Overall, the work should be helpful for the field as more groups choose methods/pretraining strategies amenable to their goals, and can do so in an evidence-guided manner.

      Thank you for recognizing the strength of our work in terms of the notable contributions, the solid analysis, and the clear presentation.

      Considerations:

      (1) Primarily, a majority of the results and conclusions (e.g., Table 3) are reached using data and methods from ProteinGym, yet the best-performing methods on ProteinGym are excluded from the paper (e.g., EVE-based models and GEMME). In the ProteinGym database, these methods outperform ProtSSN models. Moreover, these models were published over a year (or even 4 years in the case of GEMME) before ProtSSN, and I do not see justification for their exclusion in the text.

      We decided to exclude the listed methods from the primary table as they are all MSA-based methods, which are considered few-shot methods in deep learning (Rao et al., ICML, 2021). In contrast, the proposed ProtSSN is a zero-shot method that makes inferences based on less information than few-shot methods. Moreover, it is possible for MSA-based methods to query aligned sequences based on predictions. For instance, Tranception (Notin et al., ICML, 2022) selects the model with the optimal proportions of logits and retrieval results according to the average correlation score on ProteinGym (Table 10, Notin et al., 2022).

      With this in mind, we only included zero-shot deep learning methods in Table 3, which require no more than the sequence and structure of the underlying wild-type protein when scoring the mutants. In the revision, we have added the performance of SaProt to Table 3, and the performance of GEMME, TranceptEVE, and SaProt to Table 5. Furthermore, we have released the model's performance on the public leaderboard of ProteinGym v1 at proteingym.org.

      (2) Secondly, related to the comparison of other models, there is no section in the methods about how other models were used, or how their scores were computed. When comparing these models, I think it's crucial that there are explicit derivations or explanations for the exact task used for scoring each method. In other words, if the pre-training is indeed an important advance of the paper, the paper needs to show this more explicitly by explaining exactly which components of the model (and previous models) are used for evaluation. Are the authors extracting the final hidden layer representations of the model, treating these as features, and then using these features in a regression task to predict fitness/thermostability/DDG etc.? How are the model embeddings of other methods being used, since, for example, many of these methods output a k-dimensional embedding of a given sequence, rather than one single score that can be correlated with some fitness/functional metric? Summarily, I think the text lacks an explicit mention of how these embeddings are being summarized or used, as well as how this compares to the model presented.

      Thank you for the suggestion. Below we address the questions in three points. 

      (1) The task and the scoring for each method. We followed your suggestion and added a new paragraph titled “Scoring Function” on page 9 to provide a detailed explanation of the scoring functions used by other deep learning zero-shot methods.

      (2) The importance of individual pre-training modules. The complete architecture of the proposed ProtSSN model has been introduced on pages 7-8. Empirically, the influence of each pre-training module on the overall performance has been examined through ablation studies on page 12. In summary, the optimal performance is achieved by combining all the individual modules and designs.

      (3) The input of fitness scoring. For a zero-shot prediction task, the final score for a mutant is calculated by one of two widely used functions: the log-odds ratio (for encoder models, including ours) or the log-likelihood (for autoregressive models or inverse folding models). In the revision, we explicitly define these functions in sections “Inferencing” (page 7) and “Scoring Function” (page 9).
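
      To make the log-odds ratio concrete, the following is a minimal sketch of the usual form of this score, not the ProtSSN implementation. It assumes the model has already produced, from the wild-type input, a per-position probability distribution over amino acids; the data layout (a list of dictionaries keyed by one-letter codes), the 0-based positions, and the function name are hypothetical choices for this example.

```python
import math

def log_odds_score(probs, wild_type, mutations):
    """Zero-shot fitness score of a mutant as a sum of log-odds ratios.

    probs:      probs[i][aa] is the predicted probability of amino acid
                `aa` at 0-based position i, computed from the wild type.
    wild_type:  wild-type sequence as a string.
    mutations:  list of (position, mutant_aa) pairs.
    """
    score = 0.0
    for pos, mutant_aa in mutations:
        wt_aa = wild_type[pos]
        # log p(mutant) - log p(wild type) at the mutated position.
        score += math.log(probs[pos][mutant_aa]) - math.log(probs[pos][wt_aa])
    return score
```

      A positive score means the model assigns the mutant residue a higher probability than the wild-type residue at that site. Summing over sites for multi-site mutants treats the sites as independent, which is an additional assumption of this sketch.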

      (3) I think the above issues can mainly be addressed by considering and incorporating points from Li et al. 2024[1] and potentially Tang & Koo 2024[2]. Li et al.[1] make extremely explicit the use of pretraining for downstream prediction tasks. Moreover, they benchmark pretraining strategies explicitly on thermostability (one of the main considerations in the submitted manuscript), yet there is no mention of this work nor the dataset used (FLIP (Dallago et al., 2021)) in this current work. I think a reference and discussion of [1] is critical, and I would also like to see comparisons in line with [1], as [1] is very clear about what features from pretraining are used, and how. If the comparisons with previous methods were done in this fashion, this level of detail needs to be included in the text.

      The initial version did not include an explicit comparison with the mentioned reference due to the difference in the learning task. In particular, [1] formulates a supervised learning task on predicting the continuous scores of mutants of specific proteins. In comparison, we make zero-shot predictions, where the model is trained in a self-supervised learning manner that requires no labels from experiments. In the revision, we added discussions in “Discussion and Conclusion” (lines 476-484):

      Recommendations For The Authors:

      Comment 1

      I found the methods lacking in the sense that there is never a simple, explicit statement about what is the exact input and output of the model. What are the components of the input that are required by the user (to generate) or supply to the model? Are these inputs different at training vs inference time? The loss function seems like it's trying to de-noise a modified sequence, can you make this more explicit, i.e. exactly what values/objects are being compared in the loss?

      We have added a more detailed description in the "Model Pipeline" section (page 7), which explains the distinct input requirements for training and inference, as well as the formulation of the employed loss function. To summarize:

      (1) Both sequence and structure information are used in training and inference. Specifically, structure information is represented as a 3D graph with coordinates, while sequence information consists of AA-wise hidden representations encoded by ESM2-650M. During inference, instead of encoding each mutant individually, the model encodes the WT protein and uses the output probability scores relevant to the mutant to calculate the fitness score. This is a standard operation in many zero-shot fitness prediction models, commonly referred to as the log-odds-ratio.

      (2) The loss function compares the differences between the noisy input sequence and the output (recovered) AA sequence. Noise is added to the input sequences, and the model is trained to denoise them (see “Ablation Study” for the different types of noise we tested). This approach is similar to a one-step diffusion process or BERT-style token permutation. The model learns to recover the probability of each node (AA) being one of 33 tokens. A cross-entropy loss is then applied to compare this distribution with the ground-truth (unpermuted) AA sequence, aiming to minimize the difference.
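
      The two points above can be sketched as a corruption step followed by a cross-entropy comparison. This is an illustrative sketch under stated assumptions and not the ProtSSN training code: the uniform random substitution noise, the integer token ids, and the function names are hypothetical, and the model that maps the noisy sequence to per-position probabilities is left out entirely.

```python
import math
import random

def corrupt(sequence, alphabet, noise_rate, rng):
    """Replace each token with a random one from `alphabet` with
    probability `noise_rate` (one simple choice of noise)."""
    return [rng.choice(alphabet) if rng.random() < noise_rate else token
            for token in sequence]

def cross_entropy(pred_probs, target):
    """Mean negative log-probability assigned to the ground-truth tokens.

    pred_probs: pred_probs[i][t] is the predicted probability that
                position i holds token id t.
    target:     ground-truth (uncorrupted) token ids.
    """
    total = sum(math.log(p[t]) for p, t in zip(pred_probs, target))
    return -total / len(target)
```

      For example, `corrupt(tokens, list(range(33)), 0.25, random.Random(0))` would produce a noisy input over a 33-token vocabulary, and the loss would then be `cross_entropy(model_output, tokens)` where `model_output` stands for whatever distribution the network predicts. A model that predicts a uniform distribution over 33 tokens has a loss of ln 33, which gives a baseline against which training progress can be read.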

      To better present the workflow, we revised the manuscript accordingly.
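
      As an illustration of points (1) and (2), the two quantities can be sketched as follows. This is a minimal sketch on toy data, not the ProtSSN implementation: the function names, the three-letter alphabet (the model itself uses 33 tokens), and the probability values are all hypothetical.

```python
import math

def cross_entropy(pred_probs, true_tokens):
    """Mean negative log-likelihood of the clean sequence under the model output.

    pred_probs: one dict per residue mapping token -> predicted probability
    true_tokens: the ground-truth (noise-free) residue at each position
    """
    nll = [-math.log(p[t]) for p, t in zip(pred_probs, true_tokens)]
    return sum(nll) / len(nll)

def log_odds_score(wt_probs, mutations):
    """Sum of log p(mutant) - log p(wild type) over the mutated positions.

    wt_probs: per-residue distributions from a single pass over the WT protein
    mutations: (wt_aa, zero-based position, mut_aa) triples
    """
    return sum(
        math.log(wt_probs[pos][mt]) - math.log(wt_probs[pos][wt])
        for wt, pos, mt in mutations
    )

# Toy distributions over a 3-letter alphabet (hypothetical values, not model output).
probs = [
    {"A": 0.7, "G": 0.2, "V": 0.1},
    {"A": 0.1, "G": 0.8, "V": 0.1},
]
loss = cross_entropy(probs, ["A", "G"])          # training-time objective
score = log_odds_score(probs, [("A", 0, "V")])   # inference-time score for mutant A1V
```

      A negative score means the model assigns the mutant residue a lower probability than the wild-type residue at that position.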

      Comment 2

      Related to the above, I'm not exactly sure where the structural/tertiary structure information comes from. In the methods, they don't state exactly whether the 3D coordinates are given in the CATH repository or where exactly they come from. In the results section they mention using AlphaFold to obtain coordinates for a specific task---is the use of AlphaFold limited only to these tasks/this is to show robustness whether using AlphaFold or realized coordinates?

      The 3D coordinates of all proteins in the training set are derived from the crystal structures in CATH v4.3.0 to ensure a high-quality input dataset (see "Training Setup," Page 8). However, during the inference phase, we used predicted structures from AlphaFold2 and ESMFold as substitutes. This approach enhances the generalizability of our method, as in real-world scenarios, the crystal structure of the template protein to be engineered is not always available. The associated descriptions can be found in “Training Setup” (lines 271-272) and “Folding Methods” (lines 429-435).

      Comment 3

      Lines 142+144 missing reference "Section establishes", "provided in Section ."

      199 "see Section " missing reference

      214 missing "Section"

      Thank you for pointing this out. We have fixed all missing references in the revision.

      Comment 4

      Table 2 - seems inconsistent to mention the number of parameters in the first 2 methods, then not in the others (though I see in Table 3 this is included, so maybe should just be omitted in Table 2).

      In Table 2, we present the zero-shot methods used as baselines. Since many methods have different versions due to varying hyperparameter settings, we decided to list the number of parameters in the following tables.

      We have double-checked both Table 3 and Table 5 and confirm that there is no inconsistency in the reported number of parameters. One potential explanation for the difference noted in the comment is the difference in the number of parameters between single and ensemble methods. The ensemble method averages the predictions of multiple models, and we sum the total number of parameters across all models involved. For example, RITA-ensemble has 2210M parameters, derived from the sum of four individual models with 30M, 300M, 680M, and 1200M parameters.

      Comment 5

      In general, I found using the word "type" instead of "residue" a bit unnatural. As far as I can tell, the norm in the field is to say "amino acid" or "residue" rather than "type". This somewhat confused me when trying to understand the methods section, especially when talking about injecting noise (I figured "type" may refer to evolutionarily-close, or physicochemically-close residues). Maybe it's not necessary to change this in every instance, but something to consider in terms of ease of reading.

      Thank you for your suggestion. The term "type" we used is a common expression similar to "class" in the NLP field. To avoid confusing biologists, we have revised the manuscript accordingly.

      Comment 6

      197 should this read "based on the kNN "algorithm"" (word missing) or maybe "based on "its" kNN"?

      We have corrected the typo accordingly. It now reads “the 𝑘-nearest neighbor algorithm (𝑘NN)” (line 198).

      Comment 7

      200 weights of dimension 93, where does this number come from?

      The edge features are derived by Zhou et al., 2024. We have updated the reference in the manuscript for clarity (lines 201-202).

      Comment 8

      210-212 "representations of the noisy AA sequence are encoded from the noisy input" what is the "noisy AA sequence?" might be helpful to exactly defined what is "noisy input" or "noisy AA sequence". This sentence could potentially be worded to make it clearer, e.g. "we take the modified input sequence and embed it using [xyz]."

      We have revised the text accordingly. In the revised manuscript, see lines 211-212:

      Comment 9

      In Table 3

      Formatting, DTm (million), (million) should be under "# Params" likely?

      Also for DDG this is reported on only a few hundred mutations, it might be worth plotting the confidence intervals over the Spearman correlation (e.g. by bootstrapping the correlation coefficient).

      We followed the suggestion and added “million” under the "# Params". We have added the bootstrapped results for DDG and DTm to Table 6. For each dataset, we randomly sampled 50% of the data for ten independent runs. ProtSSN achieves the top performance with a considerably small variance.
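
      The resampling procedure described above can be sketched as follows. This is a minimal sketch, assuming subsampling without replacement and a rank-based Spearman correlation with average ranks for ties; the function names and the toy scores are hypothetical and are not taken from the ProtSSN code or data.

```python
import math
import random
import statistics

def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def subsample_spearman(pred, truth, fraction=0.5, runs=10, seed=0):
    """Mean and standard deviation of Spearman's rho over random subsamples."""
    rng = random.Random(seed)
    n = len(pred)
    k = max(2, int(n * fraction))
    scores = []
    for _ in range(runs):
        idx = rng.sample(range(n), k)  # draw a subset without replacement
        scores.append(spearman([pred[i] for i in idx], [truth[i] for i in idx]))
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical predicted and measured scores for 12 mutants.
truth = [0.1, 0.4, 0.35, 0.8, 0.9, 0.05, 0.6, 0.7, 0.2, 0.55, 0.95, 0.3]
pred = [0.2, 0.3, 0.45, 0.7, 0.85, 0.1, 0.5, 0.9, 0.15, 0.6, 0.8, 0.25]
mean_rho, sd_rho = subsample_spearman(pred, truth)
```

      The spread of the per-run correlations (here `sd_rho`) is what indicates how stable a method's ranking is on a small dataset.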

      Comment 10

      The paragraph in lines 319 to lines 328 I feel may lack sufficient evidence.

      "While sequence-based analysis cannot entirely replace the role of structure-based analysis, compared to a fully structure-based deep learning method, a protein language model is more likely to capture sufficient information from sequences by increasing the model scale, i.e., the number of trainable parameters."

      This claim is made without a citation, such as [1]. Increasing the scale of the model doesn't always align with improving out-of-sample/generalization performance. I don't feel fully convinced by the claim that worse prediction is ameliorated by increasing the number of parameters. In Table 3 the performance is not monotonic with (nor scales with) the number of parameters, even within a model. See ProGen2 Expression scores, or ESM-2 Stability scores, as a function of their model sizes. In [1], the authors discuss whether pretraining strategies are aligned with specific tasks. I think rewording this paragraph and mentioning this paper is important. Figure 3 shows that maybe there's some evidence for this but I don't feel entirely convinced by the plot.

      We agree that increasing the number of learnable parameters does not always result in better performance in downstream tasks. However, what we intended to convey is that language models typically need to scale up in size to capture the interactions among residues, while structure-based models can achieve this more efficiently with lower computational costs. We have rephrased this paragraph in the paper to clarify our point in lines 340-342.

      Comment 11

      Line 327 related to my major comment, " a comprehensive framework, such as ProtSSN, exhibits the best performance." Refers to performance on ProteinGym, yet the best-performing methods on ProteinGym are excluded from the comparison.

      The primary comparisons were conducted using zero-shot models for fairness, meaning that the baseline models were not trained on MSA and did not use test performance to tune their hyperparameters. It's also worth noting that SaProt (the current SOTA model) had not been updated on the leaderboard at the time of submitting this paper. In the revised manuscript, we have included GEMME and TranceptEVE in Table 5 and SaProt in Tables 3, 5, and 6. While ProtSSN does not achieve SOTA performance in every individual task, our key argument in the analysis is to highlight the overall advantage of hybrid encoders compared to single sequence-based or structure-based models. We made a clearer statement in the revised manuscript (line 349):

      Comment 12

      Line 347, line abruptly ends "equivariance when embedding protein geometry significantly." (?).

      We have fixed the typo (lines 372-373):

      Comment 13

      Figure 3 I think can be made clearer. Instead of using True/false maybe be more explicit. For example in 3b, say something like "One-hot encoded" or "ESM-2 embedded".

      The labels were set to True/False with the title of the subfigures so that they can be colored consistently.

      Following the suggestion, we have updated the captions in the revised manuscript for clarity.

      Comment 14

      Lines 381-382 "average sequential embedding of all other Glycines" is to say that the score is taken as the average score in which Glycine is substituted at every other position in the peptide? Somewhat confused by the language "average sequential embedding" and think rephrasing could be done to make things clearer.

      We have revised the related text accordingly for a clearer presentation (lines 406-413).

      Comment 15

      Table 5, and in mentions to VEP, if ProtSSN is leveraging AlphaFold for its structural information, I disagree that ProtSSN is not an MSA method, and I find it unfair to place ProtSSN in the "non-MSA" categories. If this isn't the case, then maybe making clearer the inputs etc. in the Methods will help.

      We respectfully disagree with classifying a protein encoding method based solely on its input structure. While AF2 leverages MSA sequences to predict protein structures, this information is not used in our model, and our model is not exclusive to AF2-predicted structures. When applicable, the model can encode structures derived from experimental data or other folding methods. For example, in the manuscript, we compared the performance of ProtSSN using proteins folded by both AF2 and ESMFold.

      However, we would like to emphasize that comparing the sensitivity of an encoding method across different structures or conformations is not the primary focus of our work. In contrast, some methods explicitly use MSA during model training. For instance, MSA-Transformer encodes MSA information directly into the protein embedding, and Tranception-retrieval utilizes different sets of MSA hyperparameters depending on the validation set's performance.

      To avoid further confusion, we have revised the terms "MSA methods" and "non-MSA methods" in the manuscript to "zero-shot methods" and "few-shot methods."

      Comment 16

      Table 3 they're highlighted as the best, yet on ProteinGym there's several EVE models that do better as well as GEMME, which are not referenced.

      The comparison in Table 3 focuses on zero-shot methods, whereas GEMME and EVE are few-shot models. Since these methods have different input requirements, directly comparing them could lead to unfair conclusions. For this reason, we reserved the comparisons with these few-shot models for Table 5, where we aim to provide a more comprehensive evaluation of all available methods.

      Response to Reviewer 2

      Summary:

      To design proteins and predict disease, we want to predict the effects of mutations on the function of a protein. To make these predictions, biologists have long turned to statistical models that learn patterns that are conserved across evolution. There is potential to improve our predictions however by incorporating structure. In this paper, the authors build a denoising auto-encoder model that incorporates sequence and structure to predict mutation effects. The model is trained to predict the sequence of a protein given its perturbed sequence and structure. The authors demonstrate that this model is able to predict the effects of mutations better than sequence-only models.

      Thank you for your thorough review and clear summary of our work. Below, we provide a detailed, point-by-point response to each of your questions and concerns.

      Strengths:

      The authors describe a method that makes accurate mutation effect predictions by informing its predictions with structure.

      Thank you for your clear summary of our highlights.

      Weaknesses:

      Comment 1

      It is unclear how this model compares to other methods of incorporating structure into models of biological sequences, most notably SaProt.

      (https://www.biorxiv.org/content/10.1101/2023.10.01.560349v1.full.pdf).

      In the revised manuscript, we have updated the performance results for SaProt's single models (both masked and unmasked versions with the pLDDT score) as well as the ensemble models. These updates are reflected in Tables 3, 5, and 6.

      Comment 2

      ProteinGym is largely made of deep mutational scans, which measure the effect of every mutation on a protein. These new benchmarks contain on average measurements of less than a percent of all possible point mutations of their respective proteins. It is unclear what sorts of protein regions these mutations are more likely to lie in; therefore it is challenging to make conclusions about what a model has necessarily learned based on its score on this benchmark. For example, several assays in this new benchmark seem to be similar to each other, such as four assays on ubiquitin performed at pH 2.25 to pH 3.0.

      We agree that both DTm and DDG are smaller datasets, making them less comprehensive than ProteinGym. However, we believe DTm and DDG provide valuable supplementary insights for the following reasons:

      (1) These two datasets are low-throughput and manually curated. Compared to datasets from high-throughput experiments like ProteinGym, they contain fewer errors from experimental sources and data processing, offering cleaner and more reliable data.

      (2) Environmental factors are crucial for the function and properties of enzymes, which is a significant concern for many biologists when discussing enzymatic functions. Existing benchmarks like ProteinGym tend to simplify these factors and focus more on global protein characteristics (e.g., AA sequence), overlooking the influence of environmental conditions.

      (3) While low-throughput datasets like DTm and DDG do not cover all AA positions or perform extensive saturation mutagenesis, these experiments often target mutations at sites with higher potential for positive outcomes, guided by prior knowledge. As a result, the positive-to-negative ratio is more meaningful than random mutagenesis datasets, making these benchmarks more relevant for evaluating model performance.

      We would like to emphasize that DTm and DDG are designed to complement existing benchmarks rather than replace ProteinGym. They address different scales and levels of detail in fitness prediction, and their inclusion allows for a more comprehensive evaluation of deep learning models.

      Recommendations For The Authors:

      Comment 1

      I recommend including SaProt in your benchmarks.

      In the revision, we added comparisons with SaProt in Tables 3, 5, and 6.

      Comment 2

      I also recommend investigating and giving a description of the bias in these new datasets.

      The bias of the new benchmarks can be assessed from Table 1, where the mutants are distributed evenly across different pH levels.

      In the revision, we added a discussion regarding the new datasets in “Discussion and Conclusion” (lines 496-504 of the revised version).

      Comment 3

      I also recommend reporting the model's ability to predict disease using ClinVar -- this experiment is conspicuously absent.

      Following the suggestion, we retrieved 2,525 samples from the ClinVar dataset available on ProteinGym’s website. Since the official source did not provide corresponding structure files, we performed the following three steps:

      (1) We retrieved the UniProt IDs for the sequences from the UniProt website and downloaded the corresponding AlphaFold2 structures for 2,302 samples.

      (2) For the remaining proteins, we used ColabFold 1.5.5 to perform structure prediction.

      (3) Among these, 12 proteins were too long to be folded by ColabFold, for which we used the AlphaFold3 server for prediction.

      All processed structural data can be found at https://huggingface.co/datasets/tyang816/ClinVar_PDB. Our test results are provided in the following table. ProtSSN achieves the top performance over baseline methods.

      Author response table 1.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      This manuscript examines the individual and dual effects of CHIP and LOY in MI employing a cohort of ~460 individuals. CHIP is assessed by NGS and LOY is assessed by PCR. The threshold for CHIP is set at 2% (an arbitrary cutoff that is often used) and LOY at 9% (according to the Discussion text - this reviewer may have missed the section that describes why this threshold was employed). The investigation assessed whether LOY could modulate inflammation, atherosclerotic burden, or MI risk associated with CHIP. Neither CHIP nor LOY independently affected hsCRP, atherosclerotic burden, or MI incidence, nor did LOY presence diminish these outcomes in CHIP+ male subjects.

      This study represents the first dual analysis of CHIP and LOY on CVD outcomes. The results are largely negative, contradictory to other studies (many with much larger sample sizes). I would attribute the limitation of sample size as a major contributor to the negative data. While the negative data are suspect, the "positive" finding that LOY abolishes the prognostic significance of CHIP on MI is of interest (and consistent with what is understood from mechanistic studies).

      Overall, I enjoyed reading the paper, and it is of interest to the research community.

      However, I disagree with some of the authors' interpretations of the data.

      Generally, many conclusions on CHIP interpretation are based on the comparison of findings from very large datasets that have been evaluated by shallow NGS DNA sequencing. These studies lack sensitivity and accuracy, but this is counterbalanced by their very large sample sizes. Thus, they draw conclusions from the sickest individuals (ICD codes) with the largest clones (explaining the 10% VAF threshold). Here, the study has a well-phenotyped cohort, but as far as this reviewer can tell, the DNA sequencing is "shallow" NGS. Typically, to assess smaller datasets, investigators employ an error-correction method (DNA barcodes, duplex sequencing, etc.) for the sensitivity and accuracy of calling variants. Thus, the current study appears to suffer from this limitation (small sample sizes combined with NGS).

      We thank the reviewer for his/her positive and open comment. We acknowledge that we did not use an error-corrected sequencing method for our study. However, we do not fully agree with the statement that our NGS sequencing technique is “shallow”.

      Considering our entire sequencing panel, we achieve a sequencing depth ≥100X and ≥300X for 100% [99%;100%] and 99% [99%;100%] of the targeted regions, respectively. This corresponds to a median depth of 2111X [1578;2574] for all regions sequenced. When considering “CHIP genes”, the median depth is 2694X [1875;3785] for patients from the CHAth study and 3455X [2266;4885] for patients from the 3C study. More specifically, for the DNMT3A and TET2 genes, the median sequencing depths are 2531X [1818;3313] and 3710X [2444;4901] for patients from the CHAth and 3C studies, respectively. These values are far higher than the 300X recommended for capture-based NGS by the French National Institute of Cancer. Coupling this high sequencing depth with our bioinformatic pipeline that uses 3 different variant callers, manual curation of all variants by trained hematobiologists, and a bioinformatic tool to estimate the background noise allows us to detect somatic mutations with a VAF of 1% with high accuracy. Notably, our accuracy in detecting mutations in leukemia-associated genes is tested twice a year as part of the quality control program organized by the French Group of Molecular Biologists in Hematology (GBMHM). We added the information about the depth of sequencing in the Supplementary Methods section.

      While the "negative" data from this study are inconclusive, the positive data (i.e. CHIP being prognostic for MI in the absence but not presence of MI) is of interest. Thus, the investigators may want to consider a shorter report that largely focuses on this finding.

      We thank the reviewer for his/her interest in this result. We also agree that it would be interesting to focus specifically on demonstrating the impact of mLOY in countering the cardiovascular risk associated with CHIP. We performed additional analysis to demonstrate that this effect was independent of age and cardiovascular risk factors and included this information in the results section.

      However, we believe that it is also of interest to show negative results that, although probably due to limited sample size, suggest that the cardiovascular risk associated with CHIP is not as strong and clinically pertinent as initially suggested. Of note, if CHIP really increased the risk of myocardial infarction in a significant manner, it would be detected more frequently in subjects who suffered from a MI than in those who did not, which was not observed in our cohort. Moreover, we were able to determine that if CHIP increases the risk of MI, it does so to a much lesser extent (HR = 1.03 for CHIP) than other established cardiovascular risk factors such as hypercholesterolemia or tobacco use (HR = 1.47 and HR = 1.86, respectively, in our cohort), which questions the pertinence of considering CHIP in the management of patients with atherothrombosis. These data have been added in the Results and Discussion sections.

      We also believe that our study has the merit of directly assessing the impact of CHIP on atheroma burden, which has been done in only a limited number of studies in the context of coronary artery disease. This would not have been possible had we analyzed only the male subjects in our cohort, because doing so would have further decreased the statistical power of our analyses.

      Reviewer #2 (Public Review):

      Summary: 

      The preprint by Fawaz et al. presents the findings of a study that aimed to assess the relationship between somatic mutations associated with clonal hematopoiesis (CHIP) and the prevalence of myocardial infarction (MI). The authors conducted targeted DNA sequencing analyses on samples from 149 MI patients and 297 non-MI controls from a separate cohort. Additionally, they investigated the impact of the loss of the Y chromosome (LOY), another somatic mutation frequently observed in clonally expanded blood cells. The results of the study primarily demonstrate no significant associations, as neither CHIP nor LOY were found to be correlated with an increased prevalence of MI. Of note, the null findings regarding CHIP are in conflict with several larger studies in the literature.

      Strengths:

      Overall, this is a useful research work on an emerging risk factor for cardiovascular disease (CVD). The use of a targeted sequencing approach is a strength, as it offers higher sensitivity than the whole exome sequencing approaches used in many previous studies.

      Weaknesses:

      Reporting null findings is definitely relevant in an emerging field such as the role of somatic mutations in cardiovascular disease. Nevertheless, the study suffers from severe limitations, which casts doubts on the authors' conclusions, as detailed below:

      (1) The small sample size of the study population is a critical limitation, particularly when reporting null findings that conflict (partly) with positive findings in much larger studies, totaling hundreds of thousands of individuals (e.g. Zekavat et al, Nature CVR 2023, Vlasschaert et al, Circulation 2023; Zhao et al, JAMA Cardio 2024). The authors claim that they have 90% power to detect an effect size of CHIP on MI comparable to that in a previous report (Jaiswal et al, NEJM 2017). However, the methodology used to estimate statistical power is not described.

      We thank the reviewer for his/her pertinent and constructive comments. We totally agree that our study presents a substantially smaller sample size as compared to the studies of Zekavat et al, Vlasschaert et al or Zhao et al.

      The CHAth study was designed as a prospective study (which is not frequent in CHIP reports) to demonstrate that, if CHIP increases the risk of MI, it would be detected more frequently in patients who suffered from a MI than in those who did not. To achieve this, we defined eligibility criteria to obtain a rather high prevalence of CHIP and optimize the statistical power of a study based on a limited number of patients. We thus enrolled patients who suffered from a first MI after the age of 75 years. These patients were to be compared with subjects from the Three-City study who were 65 years or older at inclusion and had not presented any cardiovascular event before inclusion.

      To determine the number of patients necessary to achieve our objective, we considered a CHIP prevalence of 20% in the general population after the age of 75 years, as estimated when we set up our study (Genovese et al, NEJM 2014, Jaiswal et al, NEJM 2014, Jaiswal et al, NEJM 2017). At this time the relative risk of MI associated with CHIP was shown to be 1.7, leading to an expected prevalence of CHIP of 37% in subjects who presented a MI. Based on these hypotheses, the recruitment of 112 patients in the CHAth study would have been sufficient to detect a significantly higher prevalence of CHIP in MI(+) patients compared to MI(-) subjects with a power of 0.90 at a type I error rate of 5%. These calculations were performed by the Research Methodology Support Unit of the University Hospital of Bordeaux. These data were added in the Supplementary Methods section to present more clearly the design and objectives of the CHAth study.
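
      For readers who wish to explore how such a power figure depends on its inputs, a generic calculation can be sketched as follows. This is a minimal sketch using the normal approximation for a two-sided comparison of two proportions; it is not the calculation performed by the Research Methodology Support Unit, whose exact method is not described here. The function name and the control group size of 224 are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p_cases, p_controls, n_cases, n_controls, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions.

    Uses the pooled standard error under the null hypothesis and the unpooled
    standard error under the alternative; the opposite tail is ignored.
    """
    z = NormalDist()
    p_pool = (p_cases * n_cases + p_controls * n_controls) / (n_cases + n_controls)
    se_null = sqrt(p_pool * (1 - p_pool) * (1 / n_cases + 1 / n_controls))
    se_alt = sqrt(
        p_cases * (1 - p_cases) / n_cases
        + p_controls * (1 - p_controls) / n_controls
    )
    z_crit = z.inv_cdf(1 - alpha / 2)
    diff = abs(p_cases - p_controls)
    return z.cdf((diff - z_crit * se_null) / se_alt)

# Prevalences from the text (37% in MI(+), 20% in MI(-)) and 112 MI(+) patients;
# the number of controls (224) is an assumption made for this example only.
power = power_two_proportions(0.37, 0.20, n_cases=112, n_controls=224)
```

      Varying `n_controls` or the assumed prevalences shows how sensitive the resulting power is to the design hypotheses.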

      Finally, we recruited 149 patients in the CHAth study and compared them to 297 control subjects. Although we recruited more patients than initially needed, we observed a similar prevalence of CHIP in our 2 cohorts, suggesting that the cardiovascular risk associated with CHIP is lower than the 1.7-fold increased risk claimed in most publications related to CHIP in the cardiovascular field. We should note that our study was not designed to demonstrate the impact of CHIP on the occurrence of MI during follow-up, which could explain our negative results due to a limited number of patients, as stated by the reviewers. This statement has been added in the Supplementary Methods section. However, performing such an analysis allowed us to confirm that the risk of MI associated with CHIP was lower than 1.7 and lower than the one associated with hypercholesterolemia or smoking.

      We would also like to note that the eligibility criteria for both the CHAth and the Three-City studies may have led to a selection bias, possibly contributing to the contradiction between our results and those of other studies. As stated before, in the CHAth study, only patients who experienced a first MI after the age of 75 were enrolled. In the Three-City study, all subjects were 65 years or older at inclusion. In contrast, most of the cohorts showing an association between CHIP and cardiovascular events were composed of younger subjects:

      - Bioimage: median age 70 years (55-80 years)

      - MDC: median age 60 years

      - ATVB: subjects with a MI before 45 years

      - PROMIS: subjects between 30 and 80 years

      - UK Biobank: between 40 and 70 years at inclusion, median age of 58 years in the study of Vlasschaert et al.

      - Zhao et al: median age of 53.83 years (45.35-62.39 years).

      This last information was added in the Discussion section (lines 452-454).

      Furthermore, the work by Jaiswal et al (NEJM 2017) showed a hazard ratio of approx. 2.0, but more recent work in much larger populations suggests that the overall effect of CHIP on atherosclerotic CVD is smaller, most likely due to the heterogeneity of effects of different mutated genes (e.g. Zekavat et al, Nature CVR 2023, Vlasschaert et al, Circulation 2023; Zhao et al, JAMA Cardio 2024).

      We thank the reviewer for insisting on the fact that the initial HR of 2.0 observed by Jaiswal et al was shown to be smaller in more recent studies. This corresponds to what we wrote in the introduction (lines 103-109) and discussion (lines 365-370, 465-471).

      In addition, several analyses in the current manuscript are conducted separately in MI(+) (n= 149) and MI(-) (N=297) individuals, further limiting statistical power. Power is still lower in the investigation of the effects of LOY and its interaction with CHIP, as only men are included in these analyses. Overall, I believe the study is severely underpowered, which calls into question the validity of the reported null findings.

      We agree with the reviewer that the statistical power of our study is lower than the one of other studies, in particular those based on several hundred thousand patients. Whenever possible, we analyzed our data by combining MI(+) and MI(-) subjects. However, for some aspects such as atherosclerosis, we did not have the same parameters available for these 2 groups and had to analyze them separately, leading to a more limited statistical power. We also have to acknowledge that our study was not designed to demonstrate an effect of CHIP on incident MI (as stated before), limiting our statistical power to demonstrate an effect of CHIP +/- mLOY on the incident risk of coronary artery disease.

      However, when designing our prospective study (CHAth study), we aimed to address the limitations of a small cohort and obtain rapid, significant results regarding the impact of CHIP. We hypothesized that if CHIP really increases the risk of myocardial infarction (MI), it would be detected more frequently in patients who have experienced a MI compared to those who have not. This study design would demonstrate the importance of CHIP in MI pathophysiology without requiring thousands of patients. However, we did not observe such an association, which questions the relevance of detecting CHIP for the management of patients in the field of cardiology. This was confirmed by the fact that in our cohort, the cardiovascular risk associated with CHIP appears to be low (HR = 1.03 [0.657;1.625] after adjustment for sex, age and cardiovascular risk factors) compared to hypercholesterolemia (HR = 1.474 [0.758;2.866]) or smoking (HR = 1.865 [0.943;3.690]). These data have been added in the Results and Discussion sections.

      In addition, we would like to mention that despite the limited number of subjects studied, we do not have only negative results. When studying only male subjects, we were able to show that CHIP accelerates the occurrence of MI, particularly in the absence of mLOY (Figure 2D). This effect was independent of age and cardiovascular risk factors (diabetes, cholesterol and high blood pressure). We added this last information in the results section of the manuscript, although we acknowledge that this has to be confirmed in future work.

      (2) Related to the above, it is widely accepted that the effects of CHIP on CVD are highly heterogeneous, as some mutated genes appear to have a strong impact on atherosclerosis, whereas the effect of others is negligible (e.g. Zekavat et al, Nature CVR 2023, Vlasschaert et al, Circulation 2023, among others). TET2 mutations are frequently considered a "positive control", given the multiple lines of evidence suggesting that these mutations confer a higher risk of atherosclerotic disease.

      However, no association with MI or related variables was found for TET2 mutations in the current work. Reporting the statistical power specifically for assessing the effect of TET2 mutations would enhance the interpretation of these results.

      We thank the reviewer for this pertinent remark. It has indeed been shown that depending on the somatic mutation, the impact of CHIP on inflammation, atherosclerosis and cardiovascular risk is different. The studies cited by the reviewer suggest that DNMT3A mutations have a low impact on atherosclerosis/atherothrombosis while other “non-DNMT3A” mutations, including TET2 mutations, have a greater impact. In particular, Zekavat et al suggested that TP53, PPM1D, ASXL1 and spliceosome mutations have a similar impact on atherosclerosis/atherothrombosis to TET2.

      To answer the reviewer: in our cohort, we did not find a clear association between the detection of TET2 mutations with a VAF≥2% and:

      -          A history of MI at inclusion (p=0.5339)

      -          Inflammation (p=0.440)

      -          Atherosclerosis burden :

      -   In the CHAth study:

      -  p=0.031 for stenosis≥50%

      -  p=0.442 for multitruncular lesions

      -  p=0.241 for atheroma volume

      -   In the 3C study:

      -  p=0.792 for the presence of atheroma

      -  p=0.3966 for the number of plaques

      -  p=0.876 for intima-media thickness

      -          Incidence of MI (p=0.5993)

      Similarly we did not find any association between the detection of TET2 mutations with a VAF≥1% and:

      -          A history of MI at inclusion (p=0.5339)

      -          Inflammation (p=0.802)

      -          Atherosclerosis burden :

      -   In the CHAth study :

      -  p=0.104 for stenosis≥50%

      -  p=0.617 for multitruncular lesions

      -  p=0.391 for atheroma volume

      -   In the 3C study:

      -  p=0.3291 for the presence of atheroma

      -  p=0.2060 for the number of plaques

      -  p=0.2300 for intima-media thickness

      -          Incidence of MI (p=0.195)

      However, analyzing the specific effect of TET2 mutations reduces the cohort of CHIP(+) subjects to 61 individuals. In these conditions, considering a prevalence of “TET2-CHIP” of 13.5% (in our cohort) and a hazard ratio of 1.3 (Vlasschaert et al), the statistical power to show an increased risk of MI is only 16%.
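
      The order of magnitude of such a power estimate can be explored with Schoenfeld's approximation for a two-group comparison under proportional hazards. This is a sketch under stated assumptions, not necessarily the method used for the 16% figure above: the number of MI events (`n_events`) is not given in this response, so the value used below is purely a placeholder, and the function name `schoenfeld_power` is hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def schoenfeld_power(hazard_ratio, exposed_fraction, n_events, alpha=0.05):
    """Approximate power of a two-sided test of HR = 1 (Schoenfeld formula)."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    # Information grows with the number of events and is largest for a 50/50 split.
    information = n_events * exposed_fraction * (1 - exposed_fraction)
    signal = sqrt(information) * abs(log(hazard_ratio))
    return norm.cdf(signal - z_alpha)


# HR = 1.3 (Vlasschaert et al) and the 13.5% TET2-CHIP prevalence come from the text.
# n_events = 100 is a placeholder, NOT a figure from the study.
power = schoenfeld_power(hazard_ratio=1.3, exposed_fraction=0.135, n_events=100)
print(f"approximate power: {power:.1%}")
```

      With a low exposed fraction and a modest hazard ratio, the approximation stays far below the conventional 80% unless the number of events is very large.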

      (3) One of the most essential features of CHIP is the tight correlation with age. In this study, the effect of age on CHIP (Supplementary Tables S5, S6) seems substantially milder than in previous studies. Given the relatively weak association with age here, it is not surprising that no association with MI or atherosclerotic disease was found, considering that this association would have a much smaller effect size.

      We thank the reviewer for highlighting this point. Although the difference in median age between subjects with and without CHIP is not very large in our cohort, we did observe a significant association of CHIP with age:

      -          The differences in age were statistically significant both in the CHAth and 3C study (Supplementary Tables S5 and S6)

      -          We observed a significant association between age and CHIP prevalence (p<0.001 for the total cohort, p=0.0197 for the CHAth study, and p=0.0394 for the 3C cohort after adjustment for sex). This association was already shown in Figure 1. We added the significant association between age and CHIP prevalence in the Results section (line 279).

      As stated before, we would like to remind the reviewer that we enrolled only subjects of ≥75 years and ≥65 years in the CHAth and 3C studies respectively. This led to a median age in our cohort that was substantially higher than in other cohorts (in particular the UK Biobank and the different cohorts studied by Jaiswal et al). This could have contributed to an apparently milder effect of age on CHIP, even if this association was still observed.

      In addition, there are previous reports of sex-related differences in the prevalence of CHIP, is there an association between CHIP and age after adjusting for sex? 

      The reviewer correctly pointed out that sex has been associated with various aspects of CHIP. While Zekavat et al reported that CHIP carriers were more frequently males, Kar et al (Nature Genetics 2022), and Kamphuis et al (Hemasphere 2023) did not observe a difference in the prevalence of CHIP between males and females, but rather a difference in the mutational spectrum. Males more frequently presented SRSF2, ASXL1, SF3B1, U2AF1, JAK2, TP53 and PPM1D mutations, while females more frequently had DNMT3A, CBL and GNB1 mutations.

      In our study, the association between CHIP prevalence and age was indeed significant even after adjustment for sex (p<0.001 for the total cohort, p=0.0197 for the CHAth study and p=0.0394 for the 3C study).

      (4) The mutated genes included in the definition of "CHIP" here are markedly different than those in most previous studies, particularly when considering specifically the studies that demonstrated an association between CHIP and atherosclerotic CVD. For instance, the definition of CHIP in this manuscript includes genes such as ANKRD26, CALR, CCND2, and DDX41... that are not prototypical CHIP genes. This is unlikely to have a major impact on the main results, as the vast majority of mutations detected are indeed in bona fide CHIP genes, but it should be at least acknowledged.

      We agree with the reviewer that our gene panel includes genes that are not considered prototypical CHIP genes. This acknowledgment has been added in the Supplementary Methods section. To perform this study, we did not design a specific targeted sequencing panel. We used the one that is used for the diagnosis of myeloid malignancies at the University Hospital of Bordeaux. ANKRD26 and DDX41 are genes that, when mutated, predispose to the development of hematological malignancies. CALR mutations are frequently detected in myeloproliferative neoplasms, while CCND2 mutations can be detected in acute myeloid leukemia among other diseases. As usually performed in our routine practice, we analyzed all the genes in the panel. However, as stated by the reviewer, most of the mutations we detected involved bona fide CHIP genes.

      Furthermore, the strategy used here for the CHIP variant calling and curation seems substantially different than that used in previous studies, which precludes a direct comparison. This is important because such differences in the definition of CHIP and the curation of variants are the basis of most conflicting findings in the literature regarding the effects of this condition. Ideally, the authors should conduct sensitivity analyses restricted to prototypical CHIP genes, using the criteria that have been previously established in the field (e.g. Vlasschaert et al, Blood 2023).

      We agree with the reviewer that our strategy for CHIP variant calling and curation was substantially different from what has been used in other studies. We decided to apply the criteria we used in previous studies for the analysis of somatic mutations in myeloid malignancies. Because CHIP is defined by the detection of “somatic mutations in leukemia driver genes”, this appeared to follow the definition of CHIP.

      We also acknowledge that this discrepancy with the criteria defined by Vlasschaert et al could contribute to our findings that differ from those of other studies. We thus checked whether the variants detected were in accordance or not with the criteria defined by Vlasschaert et al. Pooling the 2 cohorts, we detected 439 variants, 381 of which were in accordance with the criteria established by Vlasschaert et al, representing a concordance rate of 86.8%. Moreover, the variants “wrongly” retained according to these criteria had an impact on the conclusion on the detection of CHIP in only 15 patients (because these variants were associated with a mutation in a bona fide CHIP gene and/or because their VAF was below 2%). Thus, the strategy used for CHIP variant calling and curation had only a limited impact on our results. This has been added in the discussion (lines 455-459).
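
      The concordance figure follows directly from the two counts given above; the short check below only restates that arithmetic, and the variable names are ours.

```python
total_variants = 439  # variants detected when pooling the 2 cohorts
concordant = 381      # variants meeting the criteria of Vlasschaert et al

discordant = total_variants - concordant
concordance_rate = concordant / total_variants
print(f"{discordant} discordant variants, concordance = {concordance_rate:.1%}")
# prints "58 discordant variants, concordance = 86.8%"
```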

      However, we would like to discuss the criteria that have been defined by Vlasschaert et al, which are probably too restrictive. For some genes, such as ZRSR2, in addition to frameshift and nonsense mutations that are expected to be associated with a loss of function, only some single nucleotide variations were retained (probably those detected by this group). In our patient 20785, we detected a c.524A>G, p.(Tyr175Cys) mutation that was not reported in the list published by Vlasschaert et al. However, this variant presents a VAF suggestive of a somatic origin (3%), affects the Zn finger domain of the protein and is observed in a male subject. Thus, it meets several criteria for being considered as associated with a loss of function. Similarly, the CBL variant c.1139T>C, p.(Leu380Pro) observed in our patient 21536, although not affecting residues 381-421 of the protein (the criterion defined by Vlasschaert et al), has been reported in 29 cases of hematological malignancies. It is thus likely to have a significant impact on the behavior of hematopoietic cells. Moreover, in the same patient, a TET2 c.4534G>A, p.(Ala1512Thr) variant was detected. Although not directly affecting the CD1 domain, it has been reported in a case of AML with a VAF suggestive of a somatic origin (Papaemmanuil et al, NEJM 2016). The SH2B3 gene is not considered by Vlasschaert et al as a bona fide CHIP gene, contrary to other genes involved in cell signaling such as JAK2, GNAS, GNB1, CBL. However, inactivating mutations in SH2B3 can be detected in myeloid malignancies and were recently shown to drive the phenotype in some patients with an MPN (Zhang et al, American Journal of Hematology 2024). We could thus expect that this also happens in our patients 22591 and 21998, who harbor mutations of SH2B3 (an SNV in the PH domain and a frameshift mutation respectively).

      Regarding the BCOR, STAG2, SMC3 and RAD21 genes, although frameshift mutations are the most prevalent, there are several reports of SNVs in the context of hematological malignancies (COSMIC, Blood (2021) 138 (24): 2455–2468, Blood Cancer Journal (2023)13:18 ; https://doi.org/10.1038/s41408-023-00790-1).

      We can also add that although Vlasschaert et al did not consider CSF3R and CALR as CHIP genes, Kessler et al did. Because CHIP is an emerging field, the concepts that define it are expected to evolve, as demonstrated by the recent study from Jyoti Nangalia’s group (Bernstein et al, Nature Genetics 2024), which showed that 17 additional genes (including SH2B3) should be considered as drivers of clonal hematopoiesis.

      (5) An important limitation of the current study is the cross-sectional design of most of the analyses. For instance, it is not surprising that no association is found between CHIP and prevalent atherosclerosis burden by ultrasound imaging, considering that many individuals may have developed atherosclerosis years or decades before the expansion of the mutant clones, limiting the possible effect of CHIP on atherosclerosis burden. Similarly, the analysis of the relationship between CHIP and a history of MI may be confounded by the potential effects of MI on the expansion of mutant clones. In this context, it is noteworthy that the only positive results here are found in the analysis of the relationship between CHIP at baseline and incident MI development over follow-up. Increasing the sample size for these longitudinal analyses would provide deeper insights into the relationship between CHIP and MI. 

      We agree with the reviewer that increasing the sample size for longitudinal analyses would provide deeper insights into the relationship between CHIP and MI. Unfortunately, for the moment, we do not have access to additional samples of the 3C study and are not able to perform these additional analyses.

      (6) The description of some analyses lacks detail, but it seems that statistical analyses were exclusively adjusted for age or age and sex. The lack of adjustment for conventional cardiovascular risk factors in statistical analyses may confound results, particularly given the marked differences in several variables observed between groups.

      The reviewer is right that we adjusted our analyses for age and/or sex. This was done because, as stated before, our results showed few significant differences. However, we reanalyzed our data, further adjusting the tests for conventional cardiovascular risk factors, and observed similar results. These data have been added in the results section (lines 286-287, 303, 319, 331-332, 341).

      (7) The variant allele fraction (VAF) threshold for identifying clinically relevant clonal hematopoiesis is still a subject of debate. The authors state that subjects without any detectable mutation or with mutations with a VAF below 2% were considered non-CHIP carriers. While this approach is frequent in the field, it likely misses many impactful mutations with lower VAFs. Such false negatives could contribute to the null findings reported here. Ideally, the authors should determine the lower detection limit of their sequencing approach (either computationally or through serial dilution experiments) and identify the threshold of VAF that can be detected reliably with their sequencing assay. The association between CHIP and MI should then be evaluated considering all mutations above this VAF threshold, in addition to sensitivity analyses with other thresholds frequent in the literature, such as 1% VAF, 2% VAF, and 10% VAF.

      We agree with the reviewer that the VAF threshold for identifying clinically relevant CH is still debated. As stated in the manuscript and by the reviewer, we used the conventional threshold of 2%. Considering that different studies have shown that the increase in cardiovascular risk is greater for CHIP with a high VAF (Jaiswal et al, NEJM 2017, Kessler et al Nature 2022, Vlasschaert et al, Circulation 2023), it is not certain that considering variants with a very low VAF (below 2%) would help us in finding an impact of CHIP on inflammation, atherosclerosis or atherothrombotic risk.

      However, as mentioned by the reviewer, variants with a low VAF could have a clinical impact, as recently reported by Zhao et al. In France, the use of biological analyses for medical purposes requires demonstrating that all aspects of the assay, including its performance, are under control. In that context, we determined that our NGS strategy allowed us to reliably detect mutations with a VAF down to 1% (data not shown). As stated in the discussion, we also analyzed our results considering variants with a VAF of 1% and found similar results (lines 394-395). The sensitivity analyses were already mentioned in the manuscript, as we also searched for an effect of CHIP with a high VAF (≥5%) and found no effect either. We did not have a sufficient number of subjects carrying variants with a VAF≥10% to perform an analysis with this threshold.

      (8) The authors should justify the use of 3D vascular ultrasound imaging exclusively in the supra-aortic trunk. I am not familiar with this technique, but it seems to be most typically used to evaluate atherosclerosis burden in superficial vascular beds such as carotids or femorals. I am concerned about the potential impact of tissue depth on the accurate quantification of atherosclerosis burden in the current study (e.g. https://doi.org/10.1016/j.atherosclerosis.2016.03.002). It is unclear whether the carotids or femorals were imaged in the study population. 

      We apologize for the lack of precision in the Methods section. As stated by the reviewer, we evaluated the atherosclerosis burden in superficial vascular beds. We measured atheroma volume at the site of the common carotid (as described by B Lopez-Melgar, in Atherosclerosis, 2016). We did not analyze femoral arteries in this study. The sentence is now corrected in the Methods (lines 176-179).

      (9) The specific criteria used to define LOY need to be justified. LOY is stated to be defined based on a "A cut off of 9% of cells with mLOY defined the detection of a mLOY based on the study of 30 men of less than 40 years who had a normal karyotype as assessed by conventional cytogenetic study." As acknowledged by the authors, this definition of LOY is substantially different than that used in recent studies employing the same technique to detect LOY (Mas-Peiro et al, EHJ 2023). In addition, it seems essential to provide more detailed information on the ddPCR assay used to determine LOY, including the operating range and, more importantly, the lower limit of detection (%LOY) of the assay. A dilution series of a control DNA with no LOY would be helpful in this context. 

      We apologize if the definition of the threshold for detecting mLOY was unclear. To test the performance of our ddPCR technique, we first determined the background noise by testing DNA obtained from total leukocytes in 30 men of ≤40 years who presented a normal karyotype as assessed by conventional cytogenetic techniques. In this control population, presumed not to carry mLOY, we detected a proportion of cells with mLOY of 2.34 ± 1.98% (see Author response image 1, panel A). We thus considered a value above 9% as being different from background noise (mean + 3 times the standard deviation).
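
      The "mean + 3 times the standard deviation" rule can be written out as a short calculation. This is only an illustrative sketch using the two summary values quoted above; rounding the result up to a whole percentage is an assumption made here to connect the computed value to the 9% cut-off, and the variable names are hypothetical.

```python
from math import ceil

background_mean = 2.34  # % of cells with apparent mLOY in the 30 control men
background_sd = 1.98    # standard deviation of that background signal

raw_threshold = background_mean + 3 * background_sd
# Assumption: the cut-off is obtained by rounding up to a whole percentage.
cutoff = ceil(raw_threshold)
print(f"mean + 3 SD = {raw_threshold:.2f}%, rounded up to {cutoff}%")
```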

      We then compared the proportion of cells with mLOY measured by ddPCR and by conventional karyotyping and observed a fairly good correlation between the 2 techniques (R²=0.6430, p=0.0053, see Author response image 1, panel B). Finally, we tested the reliability of our ddPCR assay in detecting different levels of mLOY using a dilution series of control DNA (from an equivalent of 2% of cells with mLOY to 98% of cells with mLOY). We observed a very strong correlation between the theoretical and measured proportions of cells with mLOY (R²=0.9989, p<0.001, see Author response image 1, panel C). Of note, the proportions of mLOY measured for values ≤10% were concordant with theoretical values. However, considering the background noise determined with control DNA, we were unable to confirm that this “signal” was different from the background noise. Therefore, we set a threshold of 9% to define the detection of mLOY by ddPCR. It is also noteworthy that the 10% cell population with mLOY was consistently detected by the ddPCR technique. This has been added in the Methods section (lines 228-235).

      Author response image 1.

      (10) Our understanding of the relationship between CHIP and CVD is evolving fast, and the manuscript should be considered in the context of recent literature in the field. For instance, the recent work by Zhao et al (JAMA Cardio 2024, doi:10.1001/jamacardio.2023.5095) should be considered, as it used a similar targeted DNA sequencing approach as the one used here, but found a clear association between CHIP and coronary heart disease (in a population of 6181 individuals). 

      We thank the reviewer for this pertinent reference. We did not include it in the first version of our manuscript because it had not yet been published when we submitted our work. We included this reference in the discussion (lines 451, 455, 464). We also included the recent study of Heimlich et al (Circ Gen Pre Med 2024, lines 464-468), who studied the association of CHIP with atherosclerosis burden.

      (11) The use of subjective terms like "comprehensive" or "thorough" in the title of the manuscript does not align with the objective nature of scientific reporting. 

      We removed the terms “comprehensive” and “thorough” from the title and the text.

      Recommendations for the authors:

      Reviewing Editor:

      The Editors believe that in light of the small study the word Comprehensive has to be removed (including from the title and abstract).

      We agree and removed the term comprehensive from the title and the text.

      Reviewer #1 (Recommendations For The Authors):

      Other comments:

      It has long been recognized that hsCRP does not adequately address the inflammation associated with CHIP. For example, see Bick et al Nature 2020; 586:763. Through an assessment of a large dataset, the regulation of multiple inflammatory mediators was associated with CHIP but not with CRP. 

      We agree that hsCRP is probably not the most sensitive marker of the inflammatory state associated with CHIP, although it is the one most commonly used in medical practice. However, as indicated in the discussion (lines 418-420), we did not observe any association between CHIP and the plasma levels of different cytokines (IL1ß, IL6, IL18 and TNFα) in patients enrolled in the CHAth study.

      Many of the citations lack journal names, volumes, page numbers, etc. 

      We apologize for this and corrected the citations.

      Please provide more details on the methodology (i.e. is CHIP assessed only through NGS with no error correction?). Specify the rationale for why the 9% LOY threshold was employed. Provide this information in the Methods section.

      We added more details on the methodology, as requested, in the results section (lines 212-214 and 228-235).

      Supplementary Table S3 lacks headings. What are the designations for columns 6-8? 

      We apologize for this and corrected the Table. Columns 6-8 correspond to the VAF, coverage of the variants and depth of sequencing, as for Table S4.