  1. Jan 2026
    1. Reviewer #2 (Public review):

      Summary:

      This paper aims to elucidate the gene regulatory network governing the development of cone photoreceptors, the light-sensing neurons responsible for high acuity and color vision in humans. The authors provide a comprehensive analysis through stage-matched comparisons of gene expression and chromatin accessibility using scRNA-seq and scATAC-seq from the cone-dominant 13-lined ground squirrel (13LGS) retina and the rod-dominant mouse retina. The abundance of cones in the 13LGS retina arises from a dominant trajectory from late retinal progenitor cells (RPCs) to photoreceptor precursors and then to cones, whereas only a small proportion of rods are generated from these precursors.

      Strengths:

      The paper presents intriguing insights into the gene regulatory network involved in 13LGS cone development. In particular, the authors highlight the expression of cone-promoting transcription factors such as Onecut2, Pou2f1, and Zic3 in late-stage neurogenic progenitors, which may be driven by 13LGS-specific cis-regulatory elements. The authors also characterize candidate cone-promoting genes Zic3 and Mef2C, which have been previously understudied. Overall, I found that the cross-species analysis presented by this study is a useful resource for the field.

      Comments on Revision:

      The authors have addressed my questions, and the revised text now presents their findings more clearly.

    2. Reviewer #3 (Public review):

      Summary:

      The authors perform deep transcriptomic and epigenetic comparisons between mouse and 13-lined ground squirrel (13LGS) to identify mechanisms that drive rod vs cone-rich retina development. Through cross-species analysis, the authors find extended cone generation in 13LGS, gene expression within progenitor/photoreceptor precursor cells consistent with a lengthened cone window, and differential regulatory element usage. Two of the transcription factors, Mef2c and Zic3, were subsequently validated using OE and KO mouse lines to verify the role of these genes in regulating competence to generate cone photoreceptors.

      Strengths:

      Overall, this is an impactful manuscript with broad implications toward our understanding of retinal development, cell fate specification, and TF network dynamics across evolution and with the potential to influence our future ability to treat vision loss in human patients. The generation of this rich new dataset profiling the transcriptome and epigenome of the 13LGS is a tremendous addition to the field that assuredly will be useful for numerous other investigations and questions of a variety of interests. In this manuscript, the authors use this dataset and compare it to data they previously generated for mouse retinal development to identify 2 new regulators of cone generation and shed insights into their regulation and their integration into the network of regulatory elements within the 13LGS compared to mouse.

      The authors have done considerable work to address reviewer concerns from the first draft. The current version of the manuscript is strong and supports the claims.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary 

      In this manuscript, Weir et al. investigate why the 13-lined ground squirrel (13LGS) retina is unusually rich in cone photoreceptors, the cells responsible for color and daylight vision. Most mammals, including humans, have rod-dominant retinas, making the 13LGS retina both an intriguing evolutionary divergence and a valuable model for uncovering novel mechanisms of cone generation. The developmental programs underlying this adaptation were previously unknown. 

      Using an integrated approach that combines single-cell RNA sequencing (scRNAseq), scATACseq, and histology, the authors generate a comprehensive atlas of retinal neurogenesis in 13LGS. Notably, comparative analyses with mouse datasets reveal that in 13LGS, cones can arise from late-stage neurogenic progenitors, a striking contrast to mouse and primate retinas, where late progenitors typically generate rods and other late-born cell types but not cones. They further identify a shift in the timing (heterochrony) of expression of several transcription factors.

      Further, the authors show that these factors act through species-specific regulatory elements. And overall, functional experiments support a role for several of these candidates in cone production. 

      Strengths 

      This study stands out for its rigorous and multi-layered methodology. The combination of transcriptomic, epigenomic, and histological data yields a detailed and coherent view of cone development in 13LGS. Cross-species comparisons are thoughtfully executed, lending strong evolutionary context to the findings. The conclusions are, in general, well supported by the evidence, and the datasets generated represent a substantial resource for the field. The work will be of high value to both evolutionary neurobiology and regenerative medicine, particularly in the design of strategies to replace lost cone photoreceptors in human disease. 

      Weaknesses 

      (1) Overall, the conclusions are strongly supported by the data, but the paper would benefit from additional clarifications. In particular, some of the conclusions could be toned down slightly to reflect that the observed changes in candidate gene function, such as those for Zic3 by itself, are modest and may represent part of a more complex regulatory network.  

      We have revised the text to qualify these conclusions as suggested.

      “Zic3 promotes cone-specific gene expression and is necessary for generating the full complement of cone photoreceptors”

      “Pou2f1 overexpression upregulated an overlapping but distinct, and larger, set of cone-specific genes relative to Zic3, while also downregulating many of the same rod-specific genes, often to a greater extent (Fig. 3C).”

      “This resulted in a statistically significant ~20% reduction in the density of cone photoreceptors in the mutant retina (Fig. 3E,F), while the relative numbers of rods and horizontal cells remained unaffected (Fig. S4A-D).”

      “Our analysis suggests that gene regulatory networks controlling cone specification are highly redundant, with transcription factors acting in complex, redundant, and potentially synergistic combinations. This is further supported by our findings on the synergistic effects of combined overexpression of Zic3 and Pou2f1 increasing both the number of differentially expressed genes and their level of change in expression relative to the modest changes seen with overexpression of either gene alone (Fig. 3) and the relatively mild or undetectable phenotypes observed following loss of function of Zic3 and Mef2c (Fig. 3, Fig. S6), as well as other cone-promoting factors such as Onecut1 and Pou2f1[18,19].”

      (2) Additional explanations about the cell composition of the 13LGS retina are needed. The ratios between cone and rod are clearly detailed, but do those lead to changes in other cell types? 

      The 13LGS retina, like most cone-dominant retinas, shows relatively lower numbers of rod and cone photoreceptors (~20%) than do nocturnal species such as mice (~80%). The difference is made up by increased numbers of inner retinal neurons and Muller glia. While rigorous histological quantification of the abundance of inner retinal cell types has not yet been performed for 13LGS, we can estimate these values using our snATAC-Seq data.  These numbers are provided in Table ST1, and are now discussed in the text.  

      (3) Could the lack of a clear trajectory for rod differentiation be just an effect of low cell numbers for this population? 

      This is indeed likely to be the case. This is now stated explicitly in the text.

      “However, no clear trajectory for rod differentiation was detected, likely due to the very low number of rod cells detected prior to P17 (Fig. 2A).”

      (4) The immunohistochemistry and RNA hybridization experiments shown in Figure S2 would benefit from supporting controls to strengthen their interpretability. While it has to be recognized that performing immunostainings on non-conventional species is not a simple task, negative controls are necessary to establish the baseline background levels, especially in cases where there seems to be labeling around the cells. The text indicates that these experiments are both immunostainings and ISH, but the figure legend only says "immunohistochemistry". Clarifying these points would improve readers' confidence in the data. 

      The figure legend has been corrected, and negative controls for P24 have been added. The figure legend has been modified as follows:

      “Fluorescent in situ hybridization showing co-expression of (A) Pou2f1 and Otx2 or (B) Zic3, Rxrg, and Otx2 in P1, P5, P10, and P24 retinas. Insets show higher power images of highlighted areas. (C) Zic3, Rxrg, and Otx2 fluorescent in situ hybridization from P24 with matched (C’) negative controls. (D) Pou2f1 and Otx2 fluorescent in situ hybridization from P24 with matched (D’) negative controls. (E) Quantification of the fraction of Otx2-positive cells in the outer neuroblastic layer (P1, P5) and ONL (P10, P24) that also express Zic3. (F) Immunohistochemical analysis of Mef2c and Otx2 expression in P1, P5, P10, and P24 retinas. (G) Mef2c and Otx2 immunohistochemistry from P24 with matched (G’) negative controls. Negative controls for fluorescent in situ hybridization omit the probe and for immunohistochemistry omit primary antibodies. Scale bars, 10 µm (S2A-F), 50 µm (S2G) and 5 µm (inset). Cell counts in E were analyzed using one-way ANOVA analysis with Sidak multiple comparisons test and 95% confidence interval. ** = p <0.01, **** = p <0.0001, and ns = non-significant. N=3 independent experiments.”

      (5) Figure S3: The text claims that overexpression of Zic3 alone is sufficient to induce the cone-like photoreceptor precursor cells as well as horizontal cell-like precursors, but this is not clear in Figure S3A nor in any other figure. Similarly, the effects of Pou2f1 overexpression are different in Figure S3A and Figure S3B. In Figure S3B, the effects described (increased presence of cone-like and horizontal-like precursors) are very clear, whereas they are not in Figure S3A. How are these experiments different? 

      These UMAP data represent two independent experiments. Total numbers and relative fractions of each cell type are now included in Table ST5.

      In these experiments, cone-like precursors were identified by both cell type clustering and differential gene expression. Cells from all conditions were found in the cone-like precursor cluster. However, cells electroporated with a plasmid expressing GFP alone only showed GFP as a differentially expressed gene, identifying them most likely as GFP+ rods. In contrast, Zic3 overexpression resulted in increased expression of cone-specific genes and decreased expression of rod-specific genes in both cone-like precursors and rods relative to controls electroporated with GFP alone. Cell type proportions across independent overexpression single-cell experiments could be influenced by a number of factors, including electroporation efficiency and ex vivo growth conditions. 

      (6) The analyses of Zic3 conditional mutants (Figure S4) reveal an increase in many cone, rod, and pan-photoreceptor genes with only a reduction in some cone genes. Thus, the overall conclusion that Zic3 is essential for cones while repressing rod genes doesn't seem to match this particular dataset. 

      We observe that loss of function of Zic3 in developing retinal progenitors leads to a reduction in the total number of cones (Fig. 4E,F). In Fig. S4, we investigate how gene expression is altered in both the remaining cones and in other retinal cell types. We only observed significant changes in mutant cones and Muller glia relative to controls. We observe a mixed phenotype in cones, with a subset of cone-specific genes downregulated (notably including Thrb) and a subset of others upregulated (including Opn1sw). We also find that genes expressed both in rods and cones, as well as rod-specific genes, are downregulated in cKO cones. Since rods are fragile cells that are located immediately adjacent to cones, some level of contamination of rod-specific genes is inevitable in single-cell analysis of dissociated cones (c.f. PMID: 31128945, 34788628), and this reduced level of rod contamination could result from altered adhesion between mutant rods and cones. In mutant Muller glia, in contrast, we see a broad decrease in expression of Muller glia-specific genes, which likely reflects the indirect effects of Zic3 loss of function in retinal progenitors, and an upregulation of both broadly photoreceptor-specific genes and a subset of rod-specific genes, which may also result from altered adhesion between Muller glia and rods. 

      This is consistent with the conclusions in the text, although we have both modified the text and included heatmaps showing downregulation of rod-specific genes in mutant cones, to clarify this finding.

      “In addition, we observe a broad decrease in expression of genes expressed at high levels in both cones and rods (Rpgrip1, Drd4) and rod-specific genes (Rho, Cnga1, Pde6b) in mutant cones (Fig. S4F). Since rods are fragile cells that are located immediately adjacent to cones, some level of contamination of rod-specific genes is inevitable in single-cell analysis of dissociated cones (c.f. PMID: 31128945, 34788628), and this reduced level of rod contamination could result from altered adhesion between mutant rods and cones. In contrast, increased expression of rod-specific genes (Rho, Nrl, Pde6g, Gngt1) and pan-photoreceptor genes (Crx, Stx3, Rcvrn) was observed in Müller glia (Fig. S4G), which may likewise result from altered adhesion between Muller glia and rods. Finally, several Müller glia-specific genes were downregulated, including Clu, Aqp4, and Notch pathway components such as Hes1 and Id3, with the exception of Hopx, which was upregulated (Fig. S4G). This likely reflects the indirect effects of Zic3 loss of function in retinal progenitors. These findings indicate that Zic3 is essential for the proper expression of photoreceptor genes in cones while also playing a role in regulating expression of Müller glia-specific genes.”

      (7) Throughout the text, the authors used the term "evolved". To substantiate this claim, it would be important to include sequence analyses or to rephrase to a more neutral term that does not imply evolutionary inference. 

      We have modified the text as requested to replace “evolved” and “evolutionarily conserved” where possible, with examples of revised text listed below:  

      “These results demonstrate that modifications to gene regulatory networks underlie the development of cone-dominant retina,...”

      “Our results demonstrate that heterochronic expansion of the expression of transcription factors that promote cone development is a key event in the development of the cone-dominant 13LGS retina.”

      “Conserved patterns of motif accessibility, identified using ChromVAR and the TRANSFAC2018 database, (Fig. S1F, Table ST1)...”

      “However, most of these elements mapped to sequences that were not shared between 13LGS and mouse, with intergenic enhancers exhibiting particularly low levels of conservation (Fig. 5B).”

      “We conclude that the development of the cone-dominant retina in 13LGS is driven by novel cis-regulatory elements…”

      “Based on our bioinformatic analysis, the cone-dominant 13LGS retina follows this paradigm, in which species-specific enhancer elements…”

      “Dot plots showing the enrichment of binding sites for Otx2 and Neurod1, TFs which are broadly expressed in both neurogenic RPC and photoreceptor precursors, which are enriched in both conserved cis-regulatory elements in both species. (D) Bar plots showing the number of conserved and species-specific enhancers per TSS in four cone-promoting genes between 13LGS and mouse.”

      Reviewer #2 (Public review): 

      Summary: 

      This paper aims to elucidate the gene regulatory network governing the development of cone photoreceptors, the light-sensing neurons responsible for high acuity and color vision in humans. The authors provide a comprehensive analysis through stage-matched comparisons of gene expression and chromatin accessibility using scRNA-seq and scATAC-seq from the cone-dominant 13-lined ground squirrel (13LGS) retina and the rod-dominant mouse retina. The abundance of cones in the 13LGS retina arises from a dominant trajectory from late retinal progenitor cells (RPCs) to photoreceptor precursors and then to cones, whereas only a small proportion of rods are generated from these precursors. 

      Strengths: 

      The paper presents intriguing insights into the gene regulatory network involved in 13LGS cone development. In particular, the authors highlight the expression of cone-promoting transcription factors such as Onecut2, Pou2f1, and Zic3 in late-stage neurogenic progenitors, which may be driven by 13LGS-specific cis-regulatory elements. The authors also characterize candidate cone-promoting genes Zic3 and Mef2C, which have been previously understudied. Overall, I found that the cross-species analysis presented by this study is a useful resource for the field. 

      Weaknesses: 

      The functional analysis on Zic3 and Mef2C in mice does not convincingly establish that these factors are sufficient or necessary to promote cone photoreceptor specification. Several analyses lack clarity or consistency, and figure labeling and interpretation need improvement. 

      We have modified the text and figures to more clearly describe the observed roles of Zic3 and Mef2c in cone photoreceptor development as detailed in our responses to reviewer recommendations.

      Reviewer #3 (Public review): 

      Summary: 

      The authors perform deep transcriptomic and epigenetic comparisons between mouse and 13-lined ground squirrel (13LGS) to identify mechanisms that drive rod vs cone-rich retina development. Through cross-species analysis, the authors find extended cone generation in 13LGS, gene expression within progenitor/photoreceptor precursor cells consistent with a lengthened cone window, and differential regulatory element usage. Two of the transcription factors, Mef2c and Zic3, were subsequently validated using OE and KO mouse lines to verify the role of these genes in regulating competence to generate cone photoreceptors. 

      Strengths: 

      Overall, this is an impactful manuscript with broad implications toward our understanding of retinal development, cell fate specification, and TF network dynamics across evolution and with the potential to influence our future ability to treat vision loss in human patients. The generation of this rich new dataset profiling the transcriptome and epigenome of the 13LGS is a tremendous addition to the field that assuredly will be useful for numerous other investigations and questions of a variety of interests. In this manuscript, the authors use this dataset and compare it to data they previously generated for mouse retinal development to identify 2 new regulators of cone generation and shed insights into their regulation and their integration into the network of regulatory elements within the 13LGS compared to mouse. 

      Weaknesses: 

      (1) The authors chose to omit several cell classes from analyses and visualizations that would have added to their interpretations. In particular, I worry that the omission of 13LGS rods, early RPCs, and early NG from Figures 2C, D, and F is notable and would have added to the understanding of gene expression dynamics. In other words, (a) are these genes of interest unique to late RPCs or maintained from early RPCs, and (b) are rod networks suppressed compared to the mouse? 

      We were unable to include 13LGS rods in our analysis due to the extremely low number of cells detected prior to P17. Relative expression levels of cone-promoting transcription factors in 13LGS in early RPCs and early NG cells are shown in Fig. 2H. Particularly when compared to mice, we also observe elevated expression of cone-promoting genes in early-stage RPC and/or early NG cells. These include Zic3, Onecut2, Mef2c, and Pou2f1, as well as transcription factors that promote the differentiation of post-mitotic cone precursors, such as Thrb and Rxrg. Contrast this with genes that promote specification and differentiation of both rods and cones, such as Otx2 and Crx, which show similar or even slightly higher expression in mice. Genes such as Casz1, which act in late NG cells to promote rod specification, are indeed downregulated in 13LGS late NG cells relative to mice. We have modified the text to clarify these points, as shown below:

      “To further characterize species-specific patterns of gene expression and regulation during postnatal photoreceptor development, we analyzed differential gene expression, chromatin accessibility, and motif enrichment across late-stage primary and neurogenic progenitors, immature photoreceptor precursors, rods, and cones. Due to their very low number before time point P17, we were unable to include 13LGS rods in the analysis.”

      “In contrast, two broad patterns of differential expression of cone-promoting transcription factors were observed between mouse and 13LGS.”

      “First, transcription factors identified in this network that are known to be required for committed cone precursor differentiation, including Thrb, Rxrg, and Sall3 [25,26,45], consistently showed stronger expression in late-stage RPCs and early-stage primary and/or neurogenic RPCs of 13LGS compared to mice.”

      “Second, transcription factors in the network known to promote cone specification in early-stage mouse RPCs, such as Onecut2 and Pou2f1, exhibited enriched expression in early and late-stage primary and/or neurogenic RPCs of 13LGS, implying a heterochronic expansion of cone-promoting factors into later developmental stages.”

      “In contrast, genes such as Casz1, which act in late neurogenic RPCs to promote rod specification, are downregulated in 13LGS late neurogenic RPCs relative to mice.”

      (2) The authors claim that the majority of cones are generated by late RPCs and that this is driven primarily by the enriched enhancer network around cone-promoting genes. With the temporal scRNA/ATACseq data at their disposal, the authors should compare early vs late born cones and RPCs to determine whether the same enhancers and genes are hyperactivated in early RPCs as well as in the 13LGS. This analysis will answer the important question of whether the enhancers activated/evolved to promote all cones, or are only and specifically activated within late RPCs to drive cone genesis at the expense of rods. 

      This is an excellent question.  We have addressed this question by analyzing both expression of the cone-promoting genes identified in C2 and C3 in Figure 2C and accessibility of their associated enhancer sequences, which are shown in Figure 6B, in early and late-stage RPCs and cone precursors.  The results are shown in Author response image 1 below. We observe that cone-promoting genes consistently show higher expression in both late-stage RPCs and cones.  We do not observe any clear differences in the accessibility of the associated enhancer regions, as determined by snATAC-Seq.  However, since we have not performed CUT&RUN analysis in embryonic retina for H3K27Ac or any other marker of active enhancer elements, we cannot determine whether the total number of active enhancers differs between early and late-stage RPCs. We suspect, however, this is likely to be the case, given the differences in the expression levels of these genes.

      Author response image 1.

      Relative expression levels of cone-promoting genes and accessibility of enhancer elements associated with these genes in early- and late-stage RPCs and cone precursors.

      (3) The authors repeatedly use the term 'evolved' to describe the increased number of local enhancer elements of genes that increase in expression in 13LGS late RPCs and cones. Evolution can act at multiple levels on the genome and its regulation. The authors should consider analysis of sequence level changes between mouse, 13LGS, and other species to test whether the enhancer sequences claimed to be novel in the 13LGS are, in fact, newly evolved sequence/binding sites or if the binding sites are present in mouse but only used in late RPCs of the 13LGS. 

      Novel enhancer sequences here are defined as having divergent sequences rather than simply divergent activity. This point has been clarified in the text, with the following changes made:

      “However, most of these elements mapped to sequences that were not shared between 13LGS and mouse, with intergenic enhancers exhibiting particularly low levels of conservation (Fig. 5B).”

      “...demonstrated far greater motif enrichment in active regulatory elements in 13LGS than in mice, though few of these elements mapped to sequences that were shared between 13LGS and mouse (Fig. 5C,D, Table ST10).”

      (4) The authors state that 'Enhancer elements in 13LGS are predicted to be directly targeted by a considerably greater number of transcription factors than in mice'. This statement can easily be misread to suggest that all enhancers display this, when in fact, this is only the cone-promoting enhancers of late 13LGS RPCs. In a way, this is not surprising since these genes are largely less expressed in mouse vs 13LGS late RPCs, as shown in Figure 2. The manuscript is written to suggest this mechanism of enhancer number is specific to cone production in the 13LGS; it would help prove this point if the authors asked the opposite question and showed that mouse late RPCs do not have similar increased predicted binding of TFs near rod-promoting genes in C7-8. 

      The Reviewer’s point is well taken, and we agree that this mechanism is unlikely to be specific to cone photoreceptors, since we are simply looking at genes that show higher expression in late-stage neurogenic RPCs in 13LGS. We have changed the relevant text to now state:

      “Enhancer elements associated with cone-specific genes in 13LGS are predicted to be directly targeted by a considerably greater number of transcription factors in late-stage neurogenic RPCs than in mice, as might be expected, given the higher expression levels of these genes.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) Minor: Clusters C1-C8 (Figure 2) are labeled as "C1-8" in the text but "G1-8" in the figure. 

      This has been done.

      (2) Minor: Showing other neurogenic factors (Olig2, Ascl1, Otx2) and late-stage specific factors (Lhx2, Sox8, Nfia/b) could be shown in Figure 2 to better support the text. 

      This has been done. These motifs are consistent in both species, but Figure 2F shows differential motifs. The reference to Figure 2F has been altered to include Table ST4, while Neurod1 motifs are shown in Fig. 2F.

      Reviewer #2 (Recommendations for the authors): 

      (1) Figure 2 

      2A-B: The exclusion of early-stage data from the species-integrated analysis is puzzling, as it could reveal significant differences between early-stage neurogenic progenitors in mice and late-stage progenitors in 13LGS that both give rise to cones. This analysis would also shed light on how cone-promoting transcription factors are suppressed in mouse early-stage progenitors, limiting the window for cone genesis.

      2C: The figure labels G1-8, while C1-8 are referenced in the text. 

      2F: Neurog2, Olig2, Ascl1, and Neurod1 are mentioned in the text but not labeled in the figure. 

      2A-B: There are indeed substantial differences between early-stage RPC in 13LGS and late-stage RPC in mice that are broadly linked to control of temporal patterning, which are mentioned in the text. For instance, early-stage RPCs in both animals express higher levels of Nr2f1/2, Meis1/2, and Foxp1/4, while late-stage RPCs express higher levels of Nfia/b/x, indicating that the core distinction between early- and late-stage RPCs is maintained. What most clearly differs in 13LGS is the sustained expression of a subset of cone-promoting transcription factors in late-stage RPCs that are normally restricted to early-stage RPCs in mice. However, as mentioned in response to Reviewer #3’s first point, we do observe some evidence for increased expression of cone-promoting transcription factors in early-stage RPCs and NG cells of 13LGS relative to mice, although this is much less dramatic than observed at later stages. We have modified the text to directly mention this point. G1-8 has been corrected to C1-8 in the figure, a reference to Table ST4 has been added in discussion of neurogenic bHLH factors, and Fig. 2F has been modified to label Neurod1. 

      “First, transcription factors identified in this network that are known to be required for committed cone precursor differentiation, including Thrb, Rxrg, and Sall3 [25,26,45], consistently showed stronger expression in late-stage RPCs and early-stage primary and/or neurogenic RPCs of 13LGS compared to mice.”

      “Second, transcription factors in the network known to promote cone specification in early-stage mouse RPCs, such as Onecut2 and Pou2f1, exhibited enriched expression in early and late-stage primary and/or neurogenic RPCs of 13LGS, implying a heterochronic expansion of cone-promoting factors into later developmental stages.”

      (2) Figure 3 

      In 3F, the cone density in the WT retina is approximately 0.25 cones per micron, while in the Zic3 cKO retina, it is about 0.2 cones per micron. However, the WT control in Figure S6C also shows about 0.2 cones per micron, raising questions about whether there is a genuine decrease in cone number or if it results from quantification variability. Additionally, the proportion of cone cells in the Zic3 cKO scRNA-seq data shown in Figure S4E appears comparable to the WT control, which is inconsistent with the conclusion that Zic3 cKO leads to reduced cone production. Therefore, I found that the conclusion that Zic3 is necessary for cone development is not supported by the data.

      The cone density counts in the two mutant lines and accompanying littermate controls were collected by blinded counting by two different observers (R.A. for the Zic3 cKO and N.P. for the Mef2c cKO). We believe that the ~20% difference in the observed cone density in the two control samples likely represents investigator-dependent differences. These can exceed 20% between even highly skilled observers when quantifying dissociated cells (PMID: 35198419) and are likely to be even higher for immunohistochemistry samples.  Since both controls were done in parallel with littermate mutant samples, we therefore stand by our interpretation of these results.
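      The arithmetic behind the "~20%" figures can be checked with a minimal sketch. It uses only the approximate densities quoted in the reviewer's comment (about 0.25 and 0.2 cones per micron); the function and variable names are illustrative assumptions and are not taken from the manuscript or its analysis code.

```python
def percent_reduction(reference: float, value: float) -> float:
    """Return how far `value` falls below `reference`, as a percentage of `reference`."""
    return (reference - value) / reference * 100.0


# Approximate densities (cones per micron) as quoted in the reviewer's comment.
# These are rounded readings from the figures, not raw measurements.
zic3_wt_control = 0.25   # WT control, Fig. 3F
zic3_cko = 0.20          # Zic3 cKO, Fig. 3F
mef2c_wt_control = 0.20  # WT control, Fig. S6C

# Mutant vs. its own littermate control.
mutant_vs_control = percent_reduction(zic3_wt_control, zic3_cko)

# Difference between the two control groups, counted by different observers.
control_vs_control = percent_reduction(zic3_wt_control, mef2c_wt_control)

print(f"Zic3 cKO vs. littermate control: {mutant_vs_control:.0f}% lower")
print(f"Control (S6C) vs. control (3F):  {control_vs_control:.0f}% lower")
```

      With these rounded values the two comparisons give the same percentage, which is why the reviewer's concern and the authors' appeal to littermate-matched, same-observer controls both hinge on how the counts were paired rather than on the size of the difference.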

      (3) Figures 4 and 5

      These figures are duplicates. In Figure 4, Mef2C overexpression in postnatal progenitors leads to increased numbers of neurogenic RPCs, suggesting it may promote cell proliferation rather than inhibit rod cell fate or promote cone cell fate. Electroporation of plasmids into P0 retina typically does not label cone cells, as cones are born prenatally in mice. Given the widespread GFP signal in Figure 4D, the authors should consider that the high background of GFP signal may have misled the quantification of the result.

      The figure duplication has been corrected. We respectfully disagree with the Reviewer’s statement that ex vivo electroporation performed at P0, as is the case here, does not label cones. We routinely observe small numbers of electroporated cones when performing this analysis. Cones at this age are located on the scleral face of the retina and are therefore in direct contact with the buffer solution containing the plasmid in question (c.f. PMID: 20729845, 31128945, 34788628, 40654906). Furthermore, the level of GFP expression that is used to gate electroporated cells for isolation using FACS is typically considerably less than that used to identify a GFP-positive cell using standard immunohistochemical techniques, making it difficult to directly compare the efficiency of cone electroporation between these approaches. We agree, however, that Mef2c overexpression seems to broadly delay the differentiation of rod photoreceptors, and have modified the text to include discussion of this point.

      “Although a few GFP-positive electroporated cells co-expressing the cone-specific marker Gnat2 were detected in control (likely due to the electroporation of cone precursors, which we have previously observed in P0 retinal explants (Clark et al., 2019; Leavey et al., 2025; Lyu et al., 2021; Onishi et al., 2010)), there was a significant increase in double-positive cells in the test condition, matching the novel cone-like precursor population found in the scRNA-Seq (Fig. 4E).”

      “Indeed, overexpression of Mef2c increased the number of both neurogenic RPCs and immature photoreceptor precursors, suggesting that rod differentiation was broadly delayed.”

      (4) Figure S2 

      The figure legend lacks information about panels A and B. It is unclear which panels represent immunohistochemistry and which represent RNA hybridization chain reaction. Overall, the staining results are difficult to interpret, as it appears that all examined RNAs/proteins are positively stained across the sections with varying background levels. Specificity is hard to assess. For instance, in Figure S2B, the background intensity of Zic3 staining varies inconsistently from P1 to P24. The number of Zic3 mRNA dots seems to peak at P5 and decrease at P10, which contradicts the scRNA-seq results showing peak expression in mature cones.

      The figure legend has been corrected. Negative controls are now included for both in situ hybridization (Fig. S2C’) and immunostaining (Fig. S2G) at P24, along with paired experimental data.  We have quantified the total fraction of Otx2+ cells that also contain Zic3 foci, and find that coexpression peaks at P5 and P10.  This is now included as Fig. S2E.

      The number of Zic3 foci is in fact higher at P5 than P10, with XX foci/Otx2+ cell at P5 vs. YY foci/Otx2+ cell at P10.

      “Fluorescent in situ hybridization showing co-expression of (A) Pou2f1 and Otx2 or (B) Zic3, Rxrg, and Otx2 in P1, P5, P10, and P24 retinas. Insets show higher power images of highlighted areas. (C) Zic3, Rxrg, and Otx2 fluorescent in situ hybridization from P24 with matched (C’) negative controls. (D) Pou2f1 and Otx2 fluorescent in situ hybridization from P24 with matched (D’) negative controls. (E) Quantification of the fraction of Otx2-positive cells in the outer neuroblastic layer (P1, P5) and ONL (P10, P24) that also express Zic3. (F) Immunohistochemical analysis of Mef2c and Otx2 expression in P1, P5, P10, and P24 retinas. (G) Mef2c and Otx2 immunohistochemistry from P24 with matched (G’) negative controls. Negative controls for fluorescent in situ hybridization omit the probe and for immunohistochemistry omit primary antibodies. Scale bars, 10 µm (S2A-F), 50 µm (S2G) and 5 µm (inset). Cell counts in E were analyzed using one-way ANOVA with Sidak multiple comparisons test and 95% confidence interval. ** = p <0.01, **** = p <0.0001, and ns = non-significant. N=3 independent experiments.”

      (5) Figure S3

      In S3A and S3B, the UMAPs of the empty vector-treated groups are distinctly different. The same goes for the Zic3+Pou2F1 UMAPs.

      In S3A, Zic3 overexpression alone does not appear to have any impact on cell fate. It is not evident that Zic3, even in combination with Pou2F1, has any significant impact on cone or other cell type production, as the proportions of the cones and cone precursors seem similar across different groups.

      In S3B, Zic3+Pou2F1 seems to increase HC-like precursors without increasing cone-like precursors or cones.

      Moreover, the cone-like precursors described do not seem to contribute to cone generation, as there is no increase in cones in the adult mouse retina; rather, these cells resemble rod-cone mosaic cells with expression of both rod- and cone-specific genes.

      As the Reviewer states, we observe some differences in the proportion of cell types in both control and experimental conditions between the two experiments. Notably, relatively more photoreceptors and correspondingly fewer progenitors, bipolar, and amacrine cells are observed in the samples shown in Fig. S3A relative to Fig. S3B.  However, these represent two independent experiments. Cell type proportions seen across independent ex vivo electroporation experiments such as these can be affected by a number of variables, including precise developmental age of the samples, electroporation efficiency, cell dissociation conditions, and ex vivo growth conditions.  Some differences are inevitable, which is why paired negative controls must always be done for results to be interpretable.

      In both experiments, we observe that overexpression of Zic3, Pou2f1, and most notably Zic3 and Pou2f1 in combination leads to an increase in the relative fraction of cone-like precursors. In the experiment shown in Fig. S3B, we also observe that Zic3 alone, Onecut1 alone, and Zic3 and Pou2f1 in combination promote generation of horizontal-like cells. All treatments likewise induce expression of different subsets of cone-enriched genes in the cone-like precursors, while also suppressing rod-specific genes in these same cells.

      Total numbers and relative fractions of each cell type are now included in Table ST5.

      (6) Figure S4

      The proportion of cone cells in the Zic3 cKO scRNA-seq data shown in Figure S4E appears comparable to the WT control, contradicting the conclusion that Zic3 cKO leads to reduced cone production. 

      Total numbers and relative fractions of each cell type are now included in Table ST6.

      (7) Figure S5

      In Figure S5A, Mef2C overexpression does not decrease expression of the rod gene Nrl. 

      This is correct, and is mentioned in the text.

      “No obvious reduction in the relative number of Nrl-positive cells was observed (Fig. S5A).”

      Reviewer #3 (Recommendations for the authors): 

      (1) The authors make several broad and definitive statements that have the potential to confuse readers. In the first sections of Results: 'retinal ganglion cells and amacrine cells were generated predominantly by early stage progenitors' but later say 'late-stage RPCs in 13LGS retina are competent to generate cone photoreceptors but not other early born cell types.' In the discussion, the authors themselves point out limitations of analyses without birthdating. These definitive statements should be qualified/amended. 

      Both single-cell RNA and ATAC-Seq analysis can be used to accurately profile cells that have recently exited mitosis and committed to a specific cell fate. When applied to data obtained from a developmental timecourse, as is the case here, this can in turn serve as a reasonable proxy for birthdating data. Nonetheless, we have modified the text to state that BrdU/EdU labeling is indeed the gold standard for drawing conclusions about cell birthdates, and should be used to confirm these findings in future studies.

      “The expected temporal patterns of neurogenesis were observed in both species: retinal ganglion cells and amacrine cells were generated predominantly in the early stage, whereas bipolar cells and Müller glia were produced in the late stage.”

      “Though BrdU/EdU labeling would be required to unambiguously demonstrate species-specific differences in birthdating, our findings strongly indicate that 13LGS exhibit a selective expansion of the temporal window of cone generation, extending into late stages of neurogenesis.”

      This sentence does not make a definitive statement about 13LGS RPC competence, and we have left it unaltered. 

      “These findings suggest that late-stage RPCs in 13LGS retina are competent to generate cone photoreceptors but not other early-born cell types…”

      (2) Figure 2C clusters are referred to as C1-8 in the text but G1-8 in the figure. This is confusing and should be fixed. 

      This has been corrected.

      (3) The authors refer to many genes that show differential expression in Figure 2F, but virtually none of these are labelled in the heatmap, making it hard to follow the narrative. 

      Figure 2F represents transcription factor binding motifs that are differentially active between mouse and 13LGS, not gene expression. We have modified the figure to include names of all differentially active motifs discussed in the text, and otherwise refer the reader to Table ST4, which includes a list of all differentially expressed genes.

    1. eLife Assessment

      This valuable retrospective analysis identified three independent components of glucose dynamics - "value," "variability," and "autocorrelation" - which may be used in predicting coronary plaque vulnerability. The study is solid and of interest to a wide range of investigators in the medical field who are interested in the role of glycemia on cardiometabolic health. The manuscript has been substantially strengthened by clarifying methods, improving transparency, and validating key findings, resulting in a coherent and persuasive case for autocorrelation as a meaningful third dimension of glucose dynamics despite remaining design-related limitations.

    2. Reviewer #2 (Public review):

      Summary:

      Sugimoto et al. explore the relationship between glucose dynamics - specifically value, variability, and autocorrelation - and coronary plaque vulnerability in patients with varying glucose tolerance levels. The study identifies three independent predictive factors for %NC and emphasizes the use of continuous glucose monitoring (CGM)-derived indices for coronary artery disease (CAD) risk assessment. By employing robust statistical methods and validating findings across datasets from Japan, America, and China, the authors highlight the limitations of conventional markers while proposing CGM as a novel approach for risk prediction. The study has the potential to reshape CAD risk assessment by emphasizing CGM-derived indices, aligning well with personalized medicine trends.

      Further, the revised version includes expanded biological interpretation, improved statistical justification, and a new web-based calculator for clinical translation. Together, these updates make the study an important contribution to precision risk assessment in diabetes and cardiovascular research.

      Strengths:

      The introduction of autocorrelation as a predictive factor for plaque vulnerability adds a novel dimension to glucose dynamic analysis.

      Inclusion of datasets from diverse regions enhances generalizability.

      The use of a well-characterized cohort with controlled cholesterol and blood pressure levels strengthens the findings.

      The focus on CGM-derived indices aligns with personalized medicine trends, showcasing potential for CAD risk stratification.

      The benchmarking of CGM-derived measures against established CAD risk models (e.g., Framingham Risk Score) enhances interpretability and significance.

      The addition of a web-based computational tool makes the proposed indices accessible for potential clinical and research use.

      Weaknesses:

      The biological mechanism linking glucose autocorrelation to plaque vulnerability, although plausibly associated with insulin clearance pathways, remains largely theoretical.

      The primary cohort size is still modest, and while supported by power analysis and external datasets, broader prospective validation will be important.

      Strict participant selection criteria as employed by the study may reduce applicability to broader populations.

      CGM-derived indices like AC_Var and ADRR may be too complex for routine clinical use without simplified models or guidelines.

      Comments on revised version:

      The authors have thoroughly addressed previous concerns and produced a much stronger manuscript. The study now provides a coherent, validated, and well-reasoned argument for including autocorrelation as a third major dimension of glucose dynamics. It offers both conceptual novelty and translational potential and will likely stimulate further research on temporal glucose metrics in metabolic and cardiovascular risk assessment.

    3. Reviewer #3 (Public review):

      Summary:

      This is a retrospective analysis of 53 individuals over 26 features (12 clinical phenotypes, 12 CGM features, and 2 autocorrelation features) to examine which features were most informative in predicting percent necrotic core (%NC) as a parameter for coronary plaque vulnerability. Multiple regression analysis demonstrated a better ability to predict %NC from 3 selected CGM-derived features than from 3 selected clinical phenotypes. LASSO regularization and partial least squares (PLS) with VIP scores were used to identify the 4 CGM features that contribute most to the prediction of %NC. Using factor analysis, the authors identify 3 components that have CGM-related features: value (relating to the value of blood glucose), variability (relating to glucose variability), and autocorrelation (composed of the two autocorrelation features). These three groupings appeared in the 3 validation cohorts and when performing hierarchical clustering. To demonstrate how these three features change, a simulation was created to allow the user to examine these features under different conditions.

      Summary of Revision 1. This is a Valuable study supported by Solid evidence. The revisions meaningfully strengthen the manuscript by clarifying methods, improving transparency, and refining presentation. The work provides useful conceptual and methodological advances for understanding CGM-derived glucose dynamics and their possible relationship to cardiovascular pathology.

      Strengths:

      The authors have provided a much clearer exposition of how each glycemic component was defined and validated across cohorts. The revised manuscript now includes explicit pairwise correlations, clarified p- and q-value reporting, and better visualization of key associations between CGM indices and %NC. The justification for LASSO and PLS use is now well explained, and additional details on cohort timing relative to PCI, validation dataset structure, and statistical robustness (e.g., VIP stability with covariates) address prior concerns. The inclusion of precise factor definitions and clearer graphics notably improves interpretability.

      Limitations:

      Some limitations remain inherent to the study design, including the modest primary sample size, reliance on retrospective data, and differences between validation datasets in outcome ascertainment. However, these are now acknowledged more openly.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      We thank the reviewer for the critical review of the manuscript and the valuable comments. We have carefully considered the reviewer’s comments and have revised our manuscript accordingly.

      The reviewer’s comments in this letter are in Bold and Italics.

      Summary:

      This study identified three independent components of glucose dynamics - "value," "variability," and "autocorrelation" - and reported findings indicating that they play an important role in predicting coronary plaque vulnerability. Although the generalizability of the results needs further investigation due to the limited sample size and validation cohort limitations, this study makes several notable contributions: validation of autocorrelation as a new clinical indicator, theoretical support through mathematical modeling, and development of a web application for practical implementation. These contributions are likely to attract broad interest from researchers in both diabetology and cardiology and may suggest the potential for a new approach to glucose monitoring that goes beyond conventional glycemic control indicators in clinical practice.

      Strengths:

      The most notable strength of this study is the identification of three independent elements in glycemic dynamics: value, variability, and autocorrelation. In particular, the metric of autocorrelation, which has not been captured by conventional glycemic control indices, may bring a new perspective for understanding glycemic dynamics. In terms of methodological aspects, the study uses an analytical approach combining various statistical methods such as factor analysis, LASSO, and PLS regression, and enhances the reliability of results through theoretical validation using mathematical models and validation in other cohorts. In addition, the practical aspect of the research results, such as the development of a Web application, is also an important contribution to clinical implementation.

      We thank reviewer #1 for the positive assessment and for the valuable and constructive comments on our manuscript.

      Weaknesses:

      The most significant weakness of this study is the relatively small sample size of 53 study subjects. This sample size limitation leads to a lack of statistical power, especially in subgroup analyses, and to limitations in the assessment of rare events. 

      We appreciate the reviewer’s concern regarding the sample size. We acknowledge that a larger sample size would increase statistical power, especially for subgroup analyses and the assessment of rare events.

      We would like to clarify several points regarding the statistical power and validation of our findings. Our sample size determination followed established methodological frameworks, including the guidelines outlined by Muyembe Asenahabi, Bostely, and Peters Anselemo Ikoha. “Scientific research sample size determination.” (2023). These guidelines balance the risks of inadequate sample size with the challenges of unnecessarily large samples. For our primary analysis examining the correlation between CGM-derived measures and %NC, power calculations (a type I error of 0.05, a power of 0.8, and an expected correlation coefficient of 0.4) indicated that a minimum of 47 participants was required. Our sample size of 53 exceeded this threshold and allowed us to detect statistically significant correlations, as described in the Methods section. Moreover, to provide transparency about the precision of our estimates, we have included confidence intervals for all coefficients. 
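
      The power calculation quoted above can be reproduced with a short sketch. The response does not say which formula was used, so the code below assumes the common Fisher z approximation for a two-sided test of a correlation coefficient; the function name `required_n_for_correlation` is hypothetical and is not taken from the manuscript.

```python
import math
from statistics import NormalDist


def required_n_for_correlation(r, alpha=0.05, power=0.8):
    """Approximate sample size needed to detect a correlation r.

    Assumption: two-sided test and the Fisher z approximation,
    n = ((z_{1-alpha/2} + z_{power}) / atanh(r))^2 + 3.
    This may differ from the method the authors actually used.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value for type I error
    z_beta = standard_normal.inv_cdf(power)           # quantile for the desired power
    fisher_z = 0.5 * math.log((1 + r) / (1 - r))      # Fisher transform of r (atanh)
    return math.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3)


if __name__ == "__main__":
    # Parameters stated in the response: alpha = 0.05, power = 0.8, r = 0.4
    print(required_n_for_correlation(0.4))
```

      Under these assumptions the approximation gives a minimum of 47 participants for alpha = 0.05, power = 0.8, and r = 0.4, which agrees with the threshold stated in the response and is below the cohort size of 53.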

      Furthermore, our sample size aligns with previous studies investigating the associations between glucose profiles and clinical parameters, including Torimoto, Keiichi, et al. “Relationship between fluctuations in glucose levels measured by continuous glucose monitoring and vascular endothelial dysfunction in type 2 diabetes mellitus.” Cardiovascular Diabetology 12 (2013): 1-7. (n=57), Hall, Heather, et al. “Glucotypes reveal new patterns of glucose dysregulation.” PLoS biology 16.7 (2018): e2005143. (n=57), and Metwally, Ahmed A., et al. “Prediction of metabolic subphenotypes of type 2 diabetes via continuous glucose monitoring and machine learning.” Nature Biomedical Engineering (2024): 1-18. (n=32).

      Furthermore, the primary objective of our study was not to assess rare events, but rather to demonstrate that glucose dynamics can be decomposed into three main factors - mean, variance and autocorrelation - whereas traditional measures have primarily captured mean and variance without adequately reflecting autocorrelation. We believe that our current sample size effectively addresses this objective. 

      Regarding the classification of glucose dynamics components, we have conducted additional validation across diverse populations including 64 Japanese, 53 American, and 100 Chinese individuals. These validation efforts have consistently supported our identification of three independent glucose dynamics components.

      However, we acknowledge the importance of further validation on a larger scale. To address this, we conducted a large follow-up study of over 8,000 individuals (Sugimoto, Hikaru, et al. “Stratification of individuals without prior diagnosis of diabetes using continuous glucose monitoring” medRxiv (2025)), which confirmed our main finding that glucose dynamics consist of mean, variance, and autocorrelation. As this large study was beyond the scope of the present manuscript due to differences in primary objectives and analytical approaches, it was not included in this paper; however, it provides further support for the clinical relevance and generalizability of our findings.

      To address the sample size considerations, we have added the following sentences in the Discussion section (lines 409-414): 

      Although our analysis included four datasets with a total of 270 individuals, and our sample size of 53 met the required threshold based on power calculations with a type I error of 0.05, a power of 0.8, and an expected correlation coefficient of 0.4, we acknowledge that the sample size may still be considered relatively small for a comprehensive assessment of these relationships. To further validate these findings, larger prospective studies with diverse populations are needed.

      We appreciate the reviewer’s feedback and believe that these clarifications improve the manuscript.

      In terms of validation, several challenges exist, including geographical and ethnic biases in the validation cohorts, lack of long-term follow-up data, and insufficient validation across different clinical settings. In terms of data representativeness, limiting factors include the inclusion of only subjects with well-controlled serum cholesterol and blood pressure and the use of only short-term measurement data.

      We appreciate the reviewer’s comment regarding the challenges associated with validation. In terms of geographic and ethnic diversity, our study includes validation datasets from diverse populations, including 64 Japanese, 53 American and 100 Chinese individuals. These datasets include a wide range of metabolic states, from healthy individuals to those with diabetes, ensuring validation across different clinical conditions. In addition, we recognize the limited availability of publicly available datasets with sufficient sample sizes for factor decomposition that include both healthy individuals and those with type 2 diabetes (Zhao, Qinpei, et al. “Chinese diabetes datasets for data-driven machine learning.” Scientific Data 10.1 (2023): 35.). The main publicly available datasets with relevant clinical characteristics have already been analyzed in this study using unbiased approaches.

      However, we fully agree with the reviewer that expanding the geographic and ethnic scope, including long-term follow-up data, and validation in different clinical settings would further strengthen the robustness and generalizability of our findings. To address this, we conducted a large follow-up study of over 8,000 individuals with two years of follow-up (Sugimoto, Hikaru, et al. “Stratification of individuals without prior diagnosis of diabetes using continuous glucose monitoring” medRxiv (2025)), which confirmed our main finding that glucose dynamics consist of mean, variance, and autocorrelation. As this large study was beyond the scope of the present manuscript due to differences in primary objectives and analytical approaches, it was not included in this paper; however, it provides further support for the clinical relevance and generalizability of our findings.

      Regarding the validation considerations, we have added the following sentences to the Discussion section (lines 409-414, 354-361): 

      Although our analysis included four datasets with a total of 270 individuals, and our sample size of 53 met the required threshold based on power calculations with a type I error of 0.05, a power of 0.8, and an expected correlation coefficient of 0.4, we acknowledge that the sample size may still be considered relatively small for a comprehensive assessment of these relationships. To further validate these findings, larger prospective studies with diverse populations are needed.

      Although our LASSO and factor analysis indicated that CGM-derived measures were strong predictors of %NC, this does not mean that other clinical parameters, such as lipids and blood pressure, are irrelevant in T2DM complications. Our study specifically focused on characterizing glucose dynamics, and we analyzed individuals with well-controlled serum cholesterol and blood pressure to reduce confounding effects. While we anticipate that inclusion of a more diverse population would not alter our primary findings regarding glucose dynamics, it is likely that a broader data set would reveal additional predictive contributions from lipid and blood pressure parameters.

      In terms of elucidation of physical mechanisms, the study is not sufficient to elucidate the mechanisms linking autocorrelation and clinical outcomes or to verify them at the cellular or molecular level.

      We appreciate the reviewer’s point regarding the need for further elucidation of the physical mechanisms linking glucose autocorrelation to clinical outcomes. We fully agree with the reviewer that the detailed molecular and cellular mechanisms underlying this relationship are not yet fully understood, as noted in our Discussion section.

      However, we would like to emphasize the theoretical basis that supports the clinical relevance of autocorrelation. Our results show that glucose profiles with identical mean and variability can exhibit different autocorrelation patterns, highlighting that conventional measures such as mean or variance alone may not fully capture inter-individual metabolic differences. Incorporating autocorrelation analysis provides a more comprehensive characterization of metabolic states. Consequently, incorporating autocorrelation measures alongside traditional diabetes diagnostic criteria - such as fasting glucose, HbA1c and PG120, which primarily reflect only the “mean” component - can improve predictive accuracy for various clinical outcomes. While further research at the cellular and molecular level is needed to fully validate these findings, it is important to note that the primary goal of this study was to analyze the characteristics of glucose dynamics and gain new insights into metabolism, rather than to perform molecular biology experiments.
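
      The claim that two glucose profiles can share the same mean and variability yet differ in autocorrelation can be illustrated with a toy example. This is a minimal sketch on synthetic numbers, not the authors' simulation model: the sinusoidal trace, the reordering scheme, and the helper name `lag_autocorrelation` are all assumptions made for illustration, and the lag-1 statistic shown here is not the AC_Mean or AC_Var index used in the manuscript.

```python
import math
from statistics import fmean, pvariance


def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    mean = fmean(series)
    denominator = sum((value - mean) ** 2 for value in series)
    numerator = sum(
        (series[i] - mean) * (series[i + lag] - mean)
        for i in range(len(series) - lag)
    )
    return numerator / denominator


# Hypothetical smooth trace: 96 readings oscillating around 100 mg/dL.
smooth = [100 + 20 * math.sin(2 * math.pi * i / 48) for i in range(96)]

# Same readings, reordered to alternate low and high values.
ordered = sorted(smooth)
jagged = []
for k in range(len(ordered) // 2):
    jagged.append(ordered[k])        # k-th lowest reading
    jagged.append(ordered[-1 - k])   # k-th highest reading

if __name__ == "__main__":
    # Mean and variance match because both traces hold the same values.
    print(fmean(smooth), fmean(jagged))
    print(pvariance(smooth), pvariance(jagged))
    # Lag-1 autocorrelation is strongly positive for the smooth trace
    # and negative for the alternating one.
    print(lag_autocorrelation(smooth), lag_autocorrelation(jagged))
```

      Because both traces contain exactly the same values, any index computed from the distribution of readings alone cannot tell them apart; only a measure that depends on temporal order, such as autocorrelation, distinguishes them.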

      Furthermore, our previous research has shown that glucose autocorrelation reflects changes in insulin clearance (Sugimoto, Hikaru, et al. “Improved detection of decreased glucose handling capacities via continuous glucose monitoring-derived indices.” Communications Medicine 5.1 (2025): 103.). The relationship between insulin clearance and cardiovascular disease has been well documented (Randrianarisoa, Elko, et al. “Reduced insulin clearance is linked to subclinical atherosclerosis in individuals at risk for type 2 diabetes mellitus.” Scientific reports 10.1 (2020): 22453.), and the mechanisms described in this prior work may potentially explain the association between glucose autocorrelation and clinical outcomes observed in the present study.

      Rather than a limitation, we view these currently unexplored associations as an opportunity for further research. The identification of autocorrelation as a key glycemic feature introduces a new dimension to metabolic regulation that could serve as the basis for future investigations exploring the molecular mechanisms underlying these patterns.

      While we agree that further research at the cellular and molecular level is needed to fully validate these findings, we believe that our study provides a theoretical framework to support the clinical utility of autocorrelation analysis in glucose monitoring, which adds to the broad interest of this study. Regarding the physical mechanisms linking autocorrelation and clinical outcomes, we have added the following sentences in the Discussion section (lines 331-339, 341-352): 

      This study also provided evidence that autocorrelation can vary independently from the mean and variance components using simulated data. In addition, simulated glucose dynamics indicated that even individuals with high AC_Var did not necessarily have high maximum and minimum blood glucose levels. This study also indicated that these three components qualitatively corresponded to the four distinct glucose patterns observed after glucose administration, which were identified in a previous study (Hulman et al., 2018). Thus, the inclusion of autocorrelation in addition to mean and variance may improve the characterization of inter-individual differences in glucose regulation and improve the predictive accuracy of various clinical outcomes.

      Despite increasing evidence linking glycemic variability to oxidative stress and endothelial dysfunction in T2DM complications (Ceriello et al., 2008; Monnier et al., 2008), the biological mechanisms underlying the independent predictive value of autocorrelation remain to be elucidated. Our previous work has shown that glucose autocorrelation is influenced by insulin clearance (Sugimoto et al., 2025), a process known to be associated with cardiovascular disease risk (Randrianarisoa et al., 2020). Therefore, the molecular pathways linking glucose autocorrelation to cardiovascular disease may share common mechanisms with those linking insulin clearance to cardiovascular disease. Although previous studies have primarily focused on investigating the molecular mechanisms associated with mean glucose levels and glycemic variability, our findings open new avenues for exploring the molecular basis of glucose autocorrelation, potentially revealing novel therapeutic targets for preventing diabetic complications.

      Reviewer #2 (Public review):

      We thank the reviewer for the critical review of the manuscript and the valuable comments. We have carefully considered the reviewer’s comments and have revised our manuscript accordingly. The reviewer’s comments in this letter are in Bold and Italics.

      Sugimoto et al. explore the relationship between glucose dynamics - specifically value, variability, and autocorrelation - and coronary plaque vulnerability in patients with varying glucose tolerance levels. The study identifies three independent predictive factors for %NC and emphasizes the use of continuous glucose monitoring (CGM)-derived indices for coronary artery disease (CAD) risk assessment. By employing robust statistical methods and validating findings across datasets from Japan, America, and China, the authors highlight the limitations of conventional markers while proposing CGM as a novel approach for risk prediction. The study has the potential to reshape CAD risk assessment by emphasizing CGM-derived indices, aligning well with personalized medicine trends.

      Strengths:

      (1) The introduction of autocorrelation as a predictive factor for plaque vulnerability adds a novel dimension to glucose dynamic analysis.

      (2) Inclusion of datasets from diverse regions enhances generalizability.

      (3) The use of a well-characterized cohort with controlled cholesterol and blood pressure levels strengthens the findings.

      (4) The focus on CGM-derived indices aligns with personalized medicine trends, showcasing the potential for CAD risk stratification.

      We thank reviewer #2 for the positive assessment and for the valuable and constructive comments on our manuscript.

      Weaknesses:

      (1) The link between autocorrelation and plaque vulnerability remains speculative without a proposed biological explanation. 

      We appreciate the reviewer’s point about the need for a clearer biological explanation linking glucose autocorrelation to plaque vulnerability. We fully agree with the reviewer that the detailed biological mechanisms underlying this relationship are not yet fully understood, as noted in our Discussion section.

      However, we would like to emphasize the theoretical basis that supports the clinical relevance of autocorrelation. Our results show that glucose profiles with identical mean and variability can exhibit different autocorrelation patterns, highlighting that conventional measures such as mean or variance alone may not fully capture inter-individual metabolic differences. Incorporating autocorrelation analysis provides a more comprehensive characterization of metabolic states. Consequently, incorporating autocorrelation measures alongside traditional diabetes diagnostic criteria - such as fasting glucose, HbA1c and PG120, which primarily reflect only the “mean” component - can improve predictive accuracy for various clinical outcomes.
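
      As an illustration only (not the simulation code used in the study), the point that two profiles can share a mean and a variability while differing in autocorrelation can be sketched with a simple first-order autoregressive model. The function names, the AR(1) model, and all parameter values below are assumptions chosen for the example.

```python
import numpy as np

def ar1_series(phi, n, mean, std, seed):
    """Simulate an AR(1) series, then rescale it to an exact sample mean and std."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    x = (x - x.mean()) / x.std()  # standardize (population std, ddof=0)
    return mean + std * x

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a series."""
    d = x - x.mean()
    return float(np.dot(d[:-1], d[1:]) / np.dot(d, d))

# Hypothetical traces: 288 readings each (one day at 5-minute spacing).
smooth = ar1_series(phi=0.95, n=288, mean=110.0, std=20.0, seed=0)
jagged = ar1_series(phi=0.20, n=288, mean=110.0, std=20.0, seed=1)

print(smooth.mean(), jagged.mean())  # same mean by construction
print(smooth.std(), jagged.std())    # same standard deviation by construction
print(lag1_autocorr(smooth), lag1_autocorr(jagged))  # different autocorrelation
```

      Because the rescaling is linear, it fixes the mean and standard deviation without changing the autocorrelation, so the two traces differ only in their temporal structure.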

      Furthermore, our previous research has shown that glucose autocorrelation reflects changes in insulin clearance (Sugimoto, Hikaru, et al. “Improved detection of decreased glucose handling capacities via continuous glucose monitoring-derived indices.” Communications Medicine 5.1 (2025): 103.). The relationship between insulin clearance and cardiovascular disease has been well documented (Randrianarisoa, Elko, et al. “Reduced insulin clearance is linked to subclinical atherosclerosis in individuals at risk for type 2 diabetes mellitus.” Scientific reports 10.1 (2020): 22453.), and the mechanisms described in this prior work may potentially explain the association between glucose autocorrelation and clinical outcomes observed in the present study. 

      Rather than a limitation, we view these currently unexplored associations as an opportunity for further research. The identification of autocorrelation as a key glycemic feature introduces a new dimension to metabolic regulation that could serve as the basis for future investigations exploring the molecular mechanisms underlying these patterns.

      While we agree that further research at the cellular and molecular level is needed to fully validate these findings, we believe that our study provides a theoretical framework supporting the clinical utility of autocorrelation analysis in glucose monitoring, which adds to the broad interest of this study. Regarding the mechanisms linking autocorrelation and clinical outcomes, we have added the following sentences to the Discussion section (lines 331-339, 341-352): 

      This study also provided evidence that autocorrelation can vary independently from the mean and variance components using simulated data. In addition, simulated glucose dynamics indicated that even individuals with high AC_Var did not necessarily have high maximum and minimum blood glucose levels. This study also indicated that these three components qualitatively corresponded to the four distinct glucose patterns observed after glucose administration, which were identified in a previous study (Hulman et al., 2018). Thus, the inclusion of autocorrelation in addition to mean and variance may improve the characterization of inter-individual differences in glucose regulation and improve the predictive accuracy of various clinical outcomes.

      Despite increasing evidence linking glycemic variability to oxidative stress and endothelial dysfunction in T2DM complications (Ceriello et al., 2008; Monnier et al., 2008), the biological mechanisms underlying the independent predictive value of autocorrelation remain to be elucidated. Our previous work has shown that glucose autocorrelation is influenced by insulin clearance (Sugimoto et al., 2025), a process known to be associated with cardiovascular disease risk (Randrianarisoa et al., 2020). Therefore, the molecular pathways linking glucose autocorrelation to cardiovascular disease may share common mechanisms with those linking insulin clearance to cardiovascular disease. Although previous studies have primarily focused on investigating the molecular mechanisms associated with mean glucose levels and glycemic variability, our findings open new avenues for exploring the molecular basis of glucose autocorrelation, potentially revealing novel therapeutic targets for preventing diabetic complications.

      (2) The relatively small sample size (n=270) limits statistical power, especially when stratified by glucose tolerance levels. 

      We appreciate the reviewer’s concern regarding sample size and its potential impact on statistical power, especially when stratified by glucose tolerance levels. We fully agree that a larger sample size would increase statistical power, especially for subgroup analyses.

      We would like to clarify several points regarding the statistical power and validation of our findings. Our sample size followed established methodological frameworks, including the guidelines outlined by Muyembe Asenahabi, Bostely, and Peters Anselemo Ikoha. “Scientific research sample size determination.” (2023). These guidelines balance the risks of inadequate sample size with the challenges of unnecessarily large samples. For our primary analysis examining the correlation between CGM-derived measures and %NC, power calculations (a type I error of 0.05, a power of 0.8, and an expected correlation coefficient of 0.4) indicated that a minimum of 47 participants was required. Our sample size of 53 exceeded this threshold and allowed us to detect statistically significant correlations, as described in the Methods section. Moreover, to provide transparency about the precision of our estimates, we have included confidence intervals for all coefficients. 
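
      For readers who wish to check the power calculation, one common approach is the Fisher z approximation for a two-sided test of a correlation coefficient. The sketch below assumes that approximation; the response does not state which formula or software was used, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def required_n_for_correlation(r, alpha=0.05, power=0.8):
    """Approximate sample size to detect correlation r (two-sided test),
    using the Fisher z transformation: n = ((z_alpha + z_beta) / atanh(r))^2 + 3."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for type I error
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    effect = math.atanh(r)                         # Fisher z of the expected correlation
    return math.ceil(((z_alpha + z_beta) / effect) ** 2 + 3)

print(required_n_for_correlation(0.4))  # 47
```

      With a type I error of 0.05, a power of 0.8, and an expected correlation of 0.4, this approximation gives 47, in agreement with the minimum stated above.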

      Furthermore, our sample size aligns with previous studies investigating the associations between glucose profiles and clinical parameters, including Torimoto, Keiichi, et al. “Relationship between fluctuations in glucose levels measured by continuous glucose monitoring and vascular endothelial dysfunction in type 2 diabetes mellitus.” Cardiovascular Diabetology 12 (2013): 1-7. (n=57), Hall, Heather, et al. “Glucotypes reveal new patterns of glucose dysregulation.” PLoS biology 16.7 (2018): e2005143. (n=57), and Metwally, Ahmed A., et al. “Prediction of metabolic subphenotypes of type 2 diabetes via continuous glucose monitoring and machine learning.” Nature Biomedical Engineering (2024): 1-18. (n=32).

      Regarding the classification of glucose dynamics components, we have conducted additional validation across diverse populations including 64 Japanese, 53 American, and 100 Chinese individuals. These validation efforts have consistently supported our identification of three independent glucose dynamics components.

      However, we acknowledge the importance of further validation on a larger scale. To address this, we conducted a large follow-up study of over 8,000 individuals with two years of follow-up (Sugimoto, Hikaru, et al. “Stratification of individuals without prior diagnosis of diabetes using continuous glucose monitoring” medRxiv (2025)), which confirmed our main finding that glucose dynamics consist of mean, variance, and autocorrelation. As this large study was beyond the scope of the present manuscript due to differences in primary objectives and analytical approaches, it was not included in this paper; however, it provides further support for the clinical relevance and generalizability of our findings.

      To address the sample size considerations, we have added the following sentences in the Discussion section (lines 409-414): 

      Although our analysis included four datasets with a total of 270 individuals, and our sample size of 53 met the required threshold based on power calculations with a type I error of 0.05, a power of 0.8, and an expected correlation coefficient of 0.4, we acknowledge that the sample size may still be considered relatively small for a comprehensive assessment of these relationships. To further validate these findings, larger prospective studies with diverse populations are needed.

      (3) Strict participant selection criteria may reduce applicability to broader populations. 

      We appreciate the reviewer’s comment regarding the potential impact of strict participant selection criteria on the broader applicability of our findings. We acknowledge that extending validation to more diverse populations would improve the generalizability of our findings.

      Our study includes validation cohorts from diverse populations, including 64 Japanese, 53 American and 100 Chinese individuals. These cohorts include a wide range of metabolic states, from healthy individuals to those with diabetes, ensuring validation across different clinical conditions. However, we acknowledge that further validation in additional populations and clinical settings would strengthen our conclusions. To address this, we conducted a large follow-up study of over 8,000 individuals (Sugimoto, Hikaru, et al. “Stratification of individuals without prior diagnosis of diabetes using continuous glucose monitoring” medRxiv (2025)), which confirmed our main finding that glucose dynamics consist of mean, variance, and autocorrelation. As this large study was beyond the scope of the present manuscript due to differences in primary objectives and analytical approaches, it was not included in this paper; however, it provides further support for the clinical relevance and generalizability of our findings.

      We have added the following text to the Discussion section to address these considerations (lines 409-414, 354-361):

      Although our analysis included four datasets with a total of 270 individuals, and our sample size of 53 met the required threshold based on power calculations with a type I error of 0.05, a power of 0.8, and an expected correlation coefficient of 0.4, we acknowledge that the sample size may still be considered relatively small for a comprehensive assessment of these relationships. To further validate these findings, larger prospective studies with diverse populations are needed.

      Although our LASSO and factor analysis indicated that CGM-derived measures were strong predictors of %NC, this does not mean that other clinical parameters, such as lipids and blood pressure, are irrelevant in T2DM complications. Our study specifically focused on characterizing glucose dynamics, and we analyzed individuals with well-controlled serum cholesterol and blood pressure to reduce confounding effects. While we anticipate that inclusion of a more diverse population would not alter our primary findings regarding glucose dynamics, it is likely that a broader data set would reveal additional predictive contributions from lipid and blood pressure parameters.

      (4) CGM-derived indices like AC_Var and ADRR may be too complex for routine clinical use without simplified models or guidelines. 

      We appreciate the reviewer’s concern about the complexity of CGM-derived indices such as AC_Var and ADRR for routine clinical use. We acknowledge that for these indices to be of practical use, they must be both interpretable and easily accessible to healthcare providers. 

      To address this concern, we have developed an easy-to-use web application that automatically calculates these measures, including AC_Var, mean glucose levels, and glucose variability (https://cgmregressionapp2.streamlit.app/). This tool eliminates the need for manual calculations, making these indices more practical for clinical implementation.

      Regarding interpretability, we acknowledge that establishing specific clinical guidelines would enhance the practical utility of these measures. For example, defining a cut-off value for AC_Var above which the risk of diabetes complications increases significantly would provide clearer clinical guidance. However, given our current sample size limitations and our predefined objective of investigating correlations among indices, we have taken a conservative approach by focusing on the correlation between AC_Var and %NC rather than establishing definitive cutoffs. This approach intentionally avoids problematic statistical practices like p-hacking. It is not realistic to expect a single study to accomplish everything from proposing a new concept to conducting large-scale clinical trials to establishing clinical guidelines. Establishing clinical guidelines typically requires the accumulation of multiple studies over many years. Recognizing this reality, we have been careful in our manuscript to make modest claims about the discovery of new “correlations” rather than exaggerated claims about immediate routine clinical use.

      To address this limitation, we conducted a large follow-up study of over 8,000 individuals in the next study (Sugimoto, Hikaru, et al. “Stratification of individuals without prior diagnosis of diabetes using continuous glucose monitoring” medRxiv (2025)), which proposed clinically relevant cutoffs and reference ranges for AC_Var and other CGM-derived indices. As this large study was beyond the scope of the present manuscript due to differences in primary objectives and analytical approaches, it was not included in this paper; however, by integrating automated calculation tools with clear clinical thresholds, we expect to make these measures more accessible for clinical use.

      We have added the following text to the Discussion section to address these considerations (lines 415-419):

      While CGM-derived indices such as AC_Var and ADRR hold promise for CAD risk assessment, their complexity may present challenges for routine clinical implementation. To improve usability, we have developed a web-based calculator that automates these calculations. However, defining clinically relevant thresholds and reference ranges requires further validation in larger cohorts.

      (5) The study does not compare CGM-derived indices to existing advanced CAD risk models, limiting the ability to assess their true predictive superiority. 

      We appreciate the reviewer’s comment regarding the comparison of CGM-derived indices with existing CAD risk models. Given that our study population consisted of individuals with well-controlled total cholesterol and blood pressure levels, a direct comparison with the Framingham Risk Score for Hard Coronary Heart Disease (Wilson, Peter WF, et al. “Prediction of coronary heart disease using risk factor categories.” Circulation 97.18 (1998): 1837-1847.) may introduce inherent bias, as these factors are key components of the score.

      Nevertheless, to further assess the predictive value of the CGM-derived indices, we performed additional analyses using linear regression to predict %NC. Using the Framingham Risk Score, we obtained an R² of 0.04 and an Akaike Information Criterion (AIC) of 330. In contrast, our proposed model incorporating the three glycemic parameters - CGM_Mean, CGM_Std, and AC_Var - achieved a significantly improved R² of 0.36 and a lower AIC of 321, indicating superior predictive accuracy. 
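
      The kind of comparison described above (R² and AIC from ordinary least squares fits with different predictor sets) can be sketched as follows. This is a minimal illustration on synthetic data, not the study data or the authors' code; the AIC convention used here (Gaussian log-likelihood, counting the error variance as a parameter) is an assumption, and other software may differ by a constant.

```python
import numpy as np

def fit_ols(X, y):
    """Fit y ~ intercept + X by least squares; return R^2 and Gaussian AIC."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    k = design.shape[1] + 1  # regression coefficients plus the error variance
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return {"r2": 1 - rss / tss, "aic": 2 * k - 2 * log_lik}

# Synthetic example: three informative predictors versus one unrelated score.
rng = np.random.default_rng(0)
n = 53
informative = rng.normal(size=(n, 3))      # stand-ins for three CGM-derived indices
unrelated = rng.normal(size=n)             # stand-in for a weakly related risk score
outcome = informative @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)

strong = fit_ols(informative, outcome)
weak = fit_ols(unrelated, outcome)
print(strong, weak)
```

      A higher R² together with a lower AIC favors a model even after the penalty for its additional parameters, which is the reading applied to the values reported above.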

      We have added the following text to the Results section (lines 115-122):

      The regression model including CGM_Mean, CGM_Std and AC_Var to predict %NC achieved an R² of 0.36 and an Akaike Information Criterion (AIC) of 321. Each of these indices showed statistically significant independent positive correlations with %NC (Fig. 1A). In contrast, the model using conventional glycemic markers (FBG, HbA1c, and PG120) yielded an R² of only 0.05 and an AIC of 340 (Fig. 1B). Similarly, the model using the Framingham Risk Score for Hard Coronary Heart Disease (Wilson et al., 1998) showed limited predictive value, with an R² of 0.04 and an AIC of 330 (Fig. 1C).

      (6) Varying CGM sampling intervals (5-minute vs. 15-minute) were not thoroughly analyzed for impact on results. 

      We appreciate the reviewer’s comment regarding the potential impact of different CGM sampling intervals on our results. To assess the robustness of our findings across different sampling frequencies, we performed a downsampling analysis by converting our 5-minute interval data to 15-minute intervals. The AC_Var value calculated from 15-minute intervals was significantly correlated with that calculated from 5-minute intervals (R = 0.99, 95% CI: 0.97-1.00). Furthermore, the regression model using CGM_Mean, CGM_Std, and AC_Var from 15-minute intervals to predict %NC achieved an R² of 0.36 and an AIC of 321, identical to the model using 5-minute intervals. These findings indicate that our results are robust to variations in CGM sampling frequency. 
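
      The downsampling step itself is simple to sketch. The example below keeps every third reading of a synthetic 5-minute trace and compares autocorrelation at a matched 15-minute separation. It is an illustration under stated assumptions: the trace is simulated, the function names are hypothetical, and the exact definition of AC_Var is not reproduced here.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of a series at a given positive lag."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(d[:-lag], d[lag:]) / np.dot(d, d))

def downsample(x, step):
    """Keep every `step`-th reading (e.g. step=3 turns 5-minute into 15-minute data)."""
    return np.asarray(x)[::step]

# Synthetic trace: three days of 5-minute readings with a daily cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(3 * 288)
five_min = 110 + 20 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 2, t.size)
fifteen_min = downsample(five_min, 3)

# Lag 3 at 5-minute spacing and lag 1 at 15-minute spacing both span 15 minutes.
ac_5 = autocorr(five_min, lag=3)
ac_15 = autocorr(fifteen_min, lag=1)
print(len(fifteen_min), ac_5, ac_15)
```

      Matching lags by elapsed time rather than by index is the design choice that makes features from the two sampling intervals comparable.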

      We have added this analysis to the Results section (lines 122-125):

      The AC_Var computed from 15-minute CGM sampling was nearly identical to that computed from 5-minute sampling (R = 0.99, 95% CI: 0.97-1.00) (Fig. S1A), and the regression using the 15‑min features yielded almost the same performance (R² = 0.36; AIC = 321; Fig. S1B).

      Reviewer #3 (Public review):

      We thank the reviewer for the critical review of the manuscript and the valuable comments. We have carefully considered the reviewer’s comments and have revised our manuscript accordingly. The reviewer’s comments in this letter are in bold and italics.

      Summary:

      This is a retrospective analysis of 53 individuals over 26 features (12 clinical phenotypes, 12 CGM features, and 2 autocorrelation features) to examine which features were most informative in predicting percent necrotic core (%NC) as a parameter for coronary plaque vulnerability. Multiple regression analysis demonstrated a better ability to predict %NC from 3 selected CGM-derived features than 3 selected clinical phenotypes. LASSO regularization and partial least squares (PLS) with VIP scores were used to identify 4 CGM features that most contribute to the precision of %NC. Using factor analysis they identify 3 components that have CGM-related features: value (relating to the value of blood glucose), variability (relating to glucose variability), and autocorrelation (composed of the two autocorrelation features). These three groupings appeared in the 3 validation cohorts and when performing hierarchical clustering. To demonstrate how these three features change, a simulation was created to allow the user to examine these features under different conditions.

      We thank Reviewer #3 for the valuable and constructive comments on our manuscript.

      The goal of this study was to identify CGM features that relate to %NC. Through multiple feature selection methods, they arrive at 3 components: value, variability, and autocorrelation. While the feature list is highly correlated, the authors take steps to ensure feature selection is robust. There is a lack of clarity of what each component (value, variability, and autocorrelation) includes as while similar CGM indices fall within each component, there appear to be some indices that appear as relevant to value in one dataset and to variability in the validation. 

      We appreciate the reviewer’s comment regarding the classification of CGM-derived measures into the three components: value, variability, and autocorrelation. As the reviewer correctly points out, some measures may load differently between the value and variability components in different datasets. However, we believe that this variability reflects the inherent mathematical properties of these measures rather than a limitation of our study.

      For example, the HBGI clusters differently across datasets due to its dependence on the number of glucose readings above a threshold. In populations where mean glucose levels are predominantly below this threshold, the HBGI is more sensitive to glucose variability (Fig. S3A). Conversely, in populations with a wider range of mean glucose levels, HBGI correlates more strongly with mean glucose levels (Fig. 3A). This context-dependent behaviour is expected given the mathematical properties of these measures and does not indicate an inconsistency in our classification approach.
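
      The threshold behaviour described above can be seen in a small sketch. It assumes the commonly cited risk-function form of the HBGI for glucose in mg/dL, in which readings below roughly 112.5 mg/dL contribute nothing; whether this exact form was used in the analysis is an assumption, and the glucose values are invented for the example.

```python
import math

def hbgi(glucose_mg_dl):
    """High Blood Glucose Index, assuming the commonly cited risk-function form:
    f = 1.509 * (ln(BG)^1.084 - 5.381); risk = 10 * f^2 when f > 0, else 0."""
    total = 0.0
    for bg in glucose_mg_dl:
        f = 1.509 * (math.log(bg) ** 1.084 - 5.381)
        if f > 0:  # only readings above the zero-risk point contribute
            total += 10.0 * f * f
    return total / len(glucose_mg_dl)

flat_low = [95, 95, 95, 95]        # mean 95 mg/dL, no variability
variable_low = [65, 125, 65, 125]  # mean 95 mg/dL, large swings
flat_high = [180, 180, 180, 180]   # mean 180 mg/dL, no variability

print(hbgi(flat_low), hbgi(variable_low), hbgi(flat_high))
```

      At a low mean, the index is nonzero only when swings carry readings above the zero-risk point, so it tracks variability; at a high mean, it is large even with no variability, so it tracks the mean.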

      Importantly, our main findings remain robust: CGM-derived measures systematically fall into three components (value, variability, and autocorrelation). Traditional CGM-derived measures primarily reflect either value or variability, and this categorization is consistently observed across datasets. While specific indices such as HBGI may shift classification depending on population characteristics, the overall structure of CGM data remains stable.

      To address these considerations, we have added the following text to the Discussion section (lines 388-396):

      Some indices, such as HBGI, showed variation in classification across datasets, with some populations showing higher factor loadings in the “mean” component and others in the “variance” component. This variation occurs because HBGI calculations depend on the number of glucose readings above a threshold. In populations where mean glucose levels are predominantly below this threshold, the HBGI is more sensitive to glucose variability (Fig. S5A). Conversely, in populations with a wider range of mean glucose levels, the HBGI correlates more strongly with mean glucose levels (Fig. 3A). Despite these differences, our validation analyses confirm that CGM-derived indices consistently cluster into three components: mean, variance, and autocorrelation.

      We are sceptical about statements of significance without documentation of p-values. 

      We appreciate the reviewer’s concern regarding statistical significance and the documentation of p values.

      First, given the multiple comparisons in our study, we used q values rather than p values, as shown in Figure 1D. Q values provide a more rigorous statistical framework for controlling the false discovery rate in multiple testing scenarios, thereby reducing the likelihood of false positives.
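
      One common way to obtain q values is the Benjamini-Hochberg adjustment, sketched below. The response does not state which procedure was used, so this is an assumption, and the p values in the example are invented for illustration rather than taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (q values) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

example_p = [0.01, 0.04, 0.03, 0.20]  # illustrative values only
print(benjamini_hochberg(example_p))
```

      Each q value estimates the false discovery rate incurred if that test and all tests with smaller p values are called significant, which is why q values are preferred when many indices are tested at once.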

      Second, our statistical reporting follows established guidelines, including those of the New England Journal of Medicine (Harrington, David, et al. “New guidelines for statistical reporting in the journal.” New England Journal of Medicine 381.3 (2019): 285-286.), which recommend that “reporting of exploratory end points should be limited to point estimates of effects with 95% confidence intervals” and that authors “replace p values with estimates of effects or association and 95% confidence intervals”. According to these guidelines, p values should not be reported in this type of study. We determined significance based on whether these 95% confidence intervals excluded zero, a method for determining whether an association is significantly different from zero (Tan, Sze Huey, and Say Beng Tan. “The correct interpretation of confidence intervals.” Proceedings of Singapore Healthcare 19.3 (2010): 276-278.). 

      For the sake of transparency, we provide p values for readers who may be interested, although we emphasize that they should not be the basis for interpretation, as discussed in the referenced guidelines. Specifically, in Figure 1A-B, the p values for CGM_Mean, CGM_Std, and AC_Var were 0.02, 0.02, and <0.01, respectively, while those for FBG, HbA1c, and PG120 were 0.83, 0.91, and 0.25, respectively. In Figure 3C, the p values for factors 1–5 were 0.03, 0.03, 0.03, 0.24, and 0.87, respectively, and in Figure S8C, the p values for factors 1–3 were <0.01, <0.01, and 0.20, respectively.

      We appreciate the opportunity to clarify our statistical methodology and are happy to provide additional details if needed.

      While hesitations remain, the ability of these authors to find groupings of these many CGM metrics in relation to %NC is of interest. The believability of the associations is impeded by an obtuse presentation of the results with core data (i.e. correlation plots between CGM metrics and %NC) buried in the supplement while main figures contain plots of numerical estimates from models which would be more usefully presented in supplementary tables. 

      We appreciate the reviewer’s comment regarding the presentation of our results and recognize the importance of ensuring clarity and accessibility of the core data. 

      The central finding of our study is twofold: first, that the numerous CGM-derived measures can be systematically classified into three distinct components (mean, variance, and autocorrelation), and second, that each of these components is independently associated with %NC. This insight cannot be derived simply from examining scatter plots of individual correlations, which are provided in the Supplementary Figures. Instead, it emerges from our statistical analyses in the main figures, including multiple regression models that reveal the independent contributions of these components to %NC.

      We acknowledge the reviewer’s concern regarding the accessibility of key data. To improve clarity, we have moved several scatter plots from the Supplementary Figures to the main figures (Fig. 1D-J) to allow readers to more directly visualize the relationships between CGM-derived measures and %NC. We believe this revision improved the transparency and readability of our results while maintaining the rigor of our analytical approach.

      Given the small sample size in the primary analysis, there is a lot of modeling done with parameters estimated where simpler measures would serve and be more convincing as they require less data manipulation. A major example of this is that the pairwise correlation/covariance between CGM_mean, CGM_std, and AC_var is not shown and would be much more compelling in the claim that these are independent factors.

      We appreciate the reviewer’s feedback on our statistical analysis and data presentation. The correlations between CGM_Mean, CGM_Std, and AC_Var were documented in Figure S1B. However, to improve accessibility and clarity, we have moved these correlation analyses to the main figures (Fig. 1F). 

      Regarding our modeling approach, we chose LASSO and PLS methods because they are well-established techniques that are particularly suited to scenarios with many input variables and a relatively small sample size. These methods have been used in the literature as robust approaches for variable selection under such conditions (Tibshirani R. 1996. Regression shrinkage and selection via the lasso. J R Stat Soc 58:267–288. Wold S, Sjöström M, Eriksson L. 2001. PLS-regression: a basic tool of chemometrics. Chemometrics Intellig Lab Syst 58:109–130. Pei X, Qi D, Liu J, Si H, Huang S, Zou S, Lu D, Li Z. 2023. Screening marker genes of type 2 diabetes mellitus in mouse lacrimal gland by LASSO regression. Sci Rep 13:6862. Wang C, Kong H, Guan Y, Yang J, Gu J, Yang S, Xu G. 2005. Plasma phospholipid metabolic profiling and biomarkers of type 2 diabetes mellitus based on high-performance liquid chromatography/electrospray mass spectrometry and multivariate statistical analysis. Anal Chem 77:4108–4116.). 

      Lack of methodological detail is another challenge. For example, the time period of CGM metrics or CGM placement in the primary study in relation to the IVUS-derived measurements of coronary plaques is unclear. Are they temporally distant or proximal/ concurrent with the PCI? 

      We appreciate the reviewer’s important question regarding the temporal relationship between CGM measurements and IVUS-derived plaque assessments. As described in our previous work (Otowa‐Suematsu, Natsu, et al. “Comparison of the relationship between multiple parameters of glycemic variability and coronary plaque vulnerability assessed by virtual histology–intravascular ultrasound.” Journal of Diabetes Investigation 9.3 (2018): 610-615.), all individuals underwent continuous glucose monitoring for at least three consecutive days within the seven-day period prior to the PCI procedure. To improve clarity for readers, we have added the following text to the Methods section (lines 440-441):

      All individuals underwent CGM for at least three consecutive days within the seven-day period prior to the PCI procedure.

      A patient undergoing PCI for coronary intervention would be expected to have physiological and iatrogenic glycemic disturbances that do not reflect their baseline state. This is not considered or discussed. 

      We appreciate the reviewer’s concern regarding potential glycemic disturbances associated with PCI. As described in our previous work (Otowa‐Suematsu, Natsu, et al. “Comparison of the relationship between multiple parameters of glycemic variability and coronary plaque vulnerability assessed by virtual histology–intravascular ultrasound.” Journal of Diabetes Investigation 9.3 (2018): 610-615.), all CGM measurements were performed before the PCI procedure. This temporal separation ensures that the glycemic patterns analyzed in our study reflect the baseline metabolic state of the patients, rather than any physiological or iatrogenic effects of PCI. To avoid any misunderstanding, we have clarified this temporal relationship in the revised manuscript (lines 440-441):

      All individuals underwent CGM for at least three consecutive days within the seven-day period prior to the PCI procedure.

      The attempts at validation in external cohorts, Japanese, American, and Chinese are very poorly detailed. We could only find even an attempt to examine cardiovascular parameters in the Chinese data set but the outcome variables are unspecified with regard to what macrovascular events are included, their temporal relation to the CGM metrics, etc. Notably macrovascular event diagnoses are very different from the coronary plaque necrosis quantification. This could be a source of strength in the findings if carefully investigated and detailed but due to the lack of detail seems like an apples-to-oranges comparison. 

      We appreciate the reviewer’s comment regarding the validation cohorts and the need for greater clarity, particularly in the Chinese dataset. We acknowledge that our initial description lacked sufficient methodological detail, and we have expanded the Methods section to provide a more comprehensive explanation.

      For the Chinese dataset, the data collection protocol was previously documented (Zhao, Qinpei, et al. “Chinese diabetes datasets for data-driven machine learning.” Scientific Data 10.1 (2023): 35.). Briefly, trained research staff used standardized questionnaires to collect demographic and clinical information, including diabetes diagnosis, treatment history, comorbidities, and medication use. Physical examinations included anthropometric measurements, and body mass index was calculated using standard protocols. CGM was performed using the FreeStyle Libre H device (Abbott Diabetes Care, UK), which records interstitial glucose levels at 15-minute intervals for up to 14 days. Laboratory measurements, including metabolic panels, lipid profiles, and renal function tests, were obtained within six months of CGM placement. While previous studies have linked necrotic core to macrovascular events (Xie, Yong, et al. “Clinical outcome of nonculprit plaque ruptures in patients with acute coronary syndrome in the PROSPECT study.” JACC: Cardiovascular Imaging 7.4 (2014): 397-405.), we acknowledge the limitations of the cardiovascular outcomes in the Chinese data set. These outcomes were extracted from medical records rather than standardized diagnostic procedures or imaging studies. To address these concerns, we have added the following text to the Methods section (lines 496-504):

      The data collection protocol for the Chinese dataset was previously documented (Zhao et al., 2023). Briefly, trained research staff used standardized questionnaires to collect demographic and clinical information, including diabetes diagnosis, treatment history, comorbidities, and medication use. CGM records interstitial glucose levels at 15-minute intervals for up to 14 days. Laboratory measurements, including metabolic panels, lipid profiles, and renal function tests, were obtained within six months of CGM placement. While previous studies have linked necrotic core to macrovascular events, we acknowledge the limitations of the cardiovascular outcomes in the Chinese data set. These outcomes were extracted from medical records rather than from standardized diagnostic procedures or imaging studies.

      Finally, the simulations at the end are not relevant to the main claims of the paper and we would recommend removing them for the coherence of this manuscript. 

      We appreciate the reviewer’s feedback regarding the relevance of the simulation component of our manuscript. The primary contribution of our study goes beyond demonstrating correlations between CGM-derived measures and %NC; it highlights three fundamental components of glycemic patterns (mean, variability, and autocorrelation) and their independent relationships with coronary plaque characteristics. The simulations are included to illustrate how glycemic patterns with identical means and variability can have different autocorrelation structures. Because temporal autocorrelation can be conceptually difficult to interpret, these visualizations were intended to provide intuitive examples for the readers.

      However, we agree with the reviewer’s concern about the coherence of the manuscript. In response, we have streamlined the simulation section by removing simulations that do not directly support our primary conclusions (old version of the manuscript, lines 239-246, 502-526), retaining only those that enhance understanding of the three glycemic components. Regarding Reviewer 2’s minor comment #4, we acknowledge that autocorrelation can be challenging to understand intuitively. To address this, we kept Fig. 4A with a brief description.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Summary:

      The study by Sugimoto et al. investigates the association between components of glucose dynamics (value, variability, and autocorrelation) and coronary plaque vulnerability (%NC) in patients with varying glucose tolerance levels. The research identifies three key factors that independently predict %NC and highlights the potential of continuous glucose monitoring (CGM)-derived indices in risk assessment for coronary artery disease (CAD). Using robust statistical methods and validation across diverse populations, the study emphasizes the limitations of conventional diagnostic markers and suggests a novel, CGM-based approach for improved predictive performance. While the study demonstrates significant novelty and potential impact, several issues must be addressed by the authors.

      Major Comments:

      (1) The study demonstrates originality by introducing autocorrelation as a novel predictive factor in glucose dynamics, a perspective rarely explored in prior research. While the innovation is commendable, the biological mechanisms linking autocorrelation to plaque vulnerability remain speculative. Providing a hypothesis or potential pathways would enhance the scientific impact and practical relevance of this finding.

      We appreciate the reviewer’s point about the need for a clearer biological explanation linking glucose autocorrelation to plaque vulnerability. Our previous research has shown that glucose autocorrelation reflects changes in insulin clearance (Sugimoto, Hikaru, et al. “Improved detection of decreased glucose handling capacities via continuous glucose monitoring-derived indices.” Communications Medicine 5.1 (2025): 103.). The relationship between insulin clearance and cardiovascular disease has been well documented (Randrianarisoa, Elko, et al. “Reduced insulin clearance is linked to subclinical atherosclerosis in individuals at risk for type 2 diabetes mellitus.” Scientific reports 10.1 (2020): 22453.), and the mechanisms described in this prior work may potentially explain the association between glucose autocorrelation and clinical outcomes observed in the present study. We have added the following sentences to the Discussion section (lines 341-352):

      Despite increasing evidence linking glycemic variability to oxidative stress and endothelial dysfunction in T2DM complications (Ceriello et al., 2008; Monnier et al., 2008), the biological mechanisms underlying the independent predictive value of autocorrelation remain to be elucidated. Our previous work has shown that glucose autocorrelation is influenced by insulin clearance (Sugimoto et al., 2025), a process known to be associated with cardiovascular disease risk (Randrianarisoa et al., 2020). Therefore, the molecular pathways linking glucose autocorrelation to cardiovascular disease may share common mechanisms with those linking insulin clearance to cardiovascular disease. Although previous studies have primarily focused on investigating the molecular mechanisms associated with mean glucose levels and glycemic variability, our findings open new avenues for exploring the molecular basis of glucose autocorrelation, potentially revealing novel therapeutic targets for preventing diabetic complications.

      (2) The inclusion of datasets from Japan, America, and China adds a valuable cross-cultural dimension to the study, showcasing its potential applicability across diverse populations. Despite the multi-regional validation, the sample size (n=270) is relatively small, especially when stratified by glucose tolerance categories. This limits the statistical power and applicability to diverse populations. A larger, multi-center cohort would strengthen conclusions.

      We appreciate the reviewer’s concern regarding sample size and its potential impact on statistical power, especially when stratified by glucose tolerance levels. We fully agree that a larger sample size would increase statistical power, especially for subgroup analyses.

      We would like to clarify several points regarding the statistical power and validation of our findings. Our study adheres to established methodological frameworks for sample size determination, including the guidelines outlined by Muyembe Asenahabi, Bostely, and Peters Anselemo Ikoha. “Scientific research sample size determination.” (2023). These guidelines balance the risks of inadequate sample size with the challenges of unnecessarily large samples. For our primary analysis examining the correlation between CGM-derived measures and %NC, power calculations with a type I error of 0.05, a power of 0.8, and an expected correlation coefficient of 0.4 indicated that a minimum of 47 participants was required. Our sample size of 53 exceeded this threshold and allowed us to detect statistically significant correlations, as described in the Methods section.
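
      The threshold of 47 participants can be reproduced with the standard Fisher z approximation for the sample size needed to detect a correlation. The sketch below assumes a two-sided test; the response does not state which formula or software was used, so the function name `required_n_for_correlation` and the choice of approximation are illustrative assumptions.

```python
import math
from statistics import NormalDist


def required_n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a correlation r (two-sided test).

    Fisher z approximation: n = ((z_{alpha/2} + z_{beta}) / atanh(r))**2 + 3.
    Assumed method for illustration; not necessarily the authors' calculation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the type I error
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the target power
    effect = math.atanh(r)                         # Fisher z transform of the expected r
    return math.ceil(((z_alpha + z_beta) / effect) ** 2 + 3)


print(required_n_for_correlation(0.4))
```

      Under these assumptions the call returns 47, which agrees with the minimum quoted above.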

      Furthermore, our sample size aligns with previous studies investigating the associations between glucose profiles and clinical parameters, including Torimoto, Keiichi, et al. “Relationship between fluctuations in glucose levels measured by continuous glucose monitoring and vascular endothelial dysfunction in type 2 diabetes mellitus.” Cardiovascular Diabetology 12 (2013): 1-7. (n=57), Hall, Heather, et al. “Glucotypes reveal new patterns of glucose dysregulation.” PLoS biology 16.7 (2018): e2005143. (n=57), and Metwally, Ahmed A., et al. “Prediction of metabolic subphenotypes of type 2 diabetes via continuous glucose monitoring and machine learning.” Nature Biomedical Engineering (2024): 1-18. (n=32). Moreover, to provide transparency about the precision of our estimates, we have included confidence intervals for all coefficients.

      Regarding the classification of glucose dynamics components, we have conducted additional validation across diverse populations including 64 Japanese, 53 American, and 100 Chinese individuals. These validation efforts have consistently supported our identification of three independent glucose dynamics components. Furthermore, the primary objective of our study was not to assess rare events, but rather to demonstrate that glucose dynamics can be decomposed into three main factors - mean, variance and autocorrelation - whereas traditional measures have primarily captured mean and variance without adequately reflecting autocorrelation. We believe that our current sample size effectively addresses this objective. 

      However, we acknowledge the importance of further validation on a larger scale. To address this, we conducted a large follow-up study of over 8,000 individuals with two years of follow-up (Sugimoto, Hikaru, et al. “Stratification of individuals without prior diagnosis of diabetes using continuous glucose monitoring” medRxiv (2025)), which confirmed our main finding that glucose dynamics consist of mean, variance, and autocorrelation. As this large study was beyond the scope of the present manuscript due to differences in primary objectives and analytical approaches, it was not included in this paper; however, it provides further support for the clinical relevance and generalizability of our findings.

      To address the sample size considerations, we have added the following sentences to the Discussion section (lines 409-414):

      Although our analysis included four datasets with a total of 270 individuals, and our sample size of 53 met the required threshold based on power calculations with a type I error of 0.05, a power of 0.8, and an expected correlation coefficient of 0.4, we acknowledge that the sample size may still be considered relatively small for a comprehensive assessment of these relationships. To further validate these findings, larger prospective studies with diverse populations are needed.

      (3) The study focuses on a well-characterized cohort with controlled cholesterol and blood pressure levels, reducing confounding variables. However, this stringent selection might exclude individuals with significant variability in these parameters, potentially limiting the study's applicability to broader, real-world populations. The authors should discuss how this may affect generalizability and potential bias in the results.

      We appreciate the reviewer’s comment regarding the potential impact of strict participant selection criteria on the broader applicability of our findings. We acknowledge that extending validation to more diverse populations would improve the generalizability of our findings.

      Our validation strategy included multiple cohorts from different regions, specifically 64 Japanese, 53 American and 100 Chinese individuals. These cohorts represent a clinically diverse population, including both healthy individuals and those with diabetes, allowing for validation across a broad spectrum of metabolic conditions. However, we recognize that further validation in additional populations and clinical settings would strengthen our conclusions. To address this, we conducted a large follow-up study of over 8,000 individuals with two years of follow-up (Sugimoto, Hikaru, et al. “Stratification of individuals without prior diagnosis of diabetes using continuous glucose monitoring” medRxiv (2025)), which confirmed our main finding that glucose dynamics consist of mean, variance, and autocorrelation. As this large study was beyond the scope of the present manuscript due to differences in primary objectives and analytical approaches, it was not included in this paper; however, it provides further support for the clinical relevance and generalizability of our findings.

      We have added the following text to the Discussion section to address these considerations (lines 409-414, 354-361):

      Although our analysis included four datasets with a total of 270 individuals, and our sample size of 53 met the required threshold based on power calculations with a type I error of 0.05, a power of 0.8, and an expected correlation coefficient of 0.4, we acknowledge that the sample size may still be considered relatively small for a comprehensive assessment of these relationships. To further validate these findings, larger prospective studies with diverse populations are needed.

      Although our LASSO and factor analysis indicated that CGM-derived measures were strong predictors of %NC, this does not mean that other clinical parameters, such as lipids and blood pressure, are irrelevant in T2DM complications. Our study specifically focused on characterizing glucose dynamics, and we analyzed individuals with well-controlled serum cholesterol and blood pressure to reduce confounding effects. While we anticipate that inclusion of a more diverse population would not alter our primary findings regarding glucose dynamics, it is likely that a broader data set would reveal additional predictive contributions from lipid and blood pressure parameters.

      (4) The study effectively highlights the potential of CGM-derived indices as a tool for CAD risk assessment, a concept that aligns with contemporary advancements in personalized medicine. Despite its potential, the complexity of CGM-derived indices like AC_Var and ADRR may hinder their routine clinical adoption. Providing simplified models or actionable guidelines would facilitate their integration into everyday practice.

      We appreciate the reviewer’s concern about the complexity of CGM-derived indices such as AC_Var and ADRR for routine clinical use. We recognize that for these indices to be of practical use, they must be both interpretable and easily accessible to healthcare providers.

      To address this, we have developed an easy-to-use web application that automatically calculates these measures, including AC_Var, mean glucose levels, and glucose variability. By eliminating the need for manual calculations, this tool streamlines the process and makes these indices more practical for clinical use.

      Regarding interpretability, we acknowledge that establishing specific clinical guidelines would enhance the practical utility of these measures. For example, defining a cut-off value for AC_Var above which the risk of diabetes complications increases significantly would provide clearer clinical guidance. However, given our current sample size limitations and our predefined objective of investigating correlations among indices, we have taken a conservative approach by focusing on the correlation between AC_Var and %NC rather than establishing definitive cutoffs. This approach intentionally avoids problematic statistical practices like p-hacking. It is not realistic to expect a single study to accomplish everything from proposing a new concept to conducting large-scale clinical trials to establishing clinical guidelines. Establishing clinical guidelines typically requires the accumulation of multiple studies over many years. Recognizing this reality, we have been careful in our manuscript to make modest claims about the discovery of new “correlations” rather than exaggerated claims about immediate routine clinical use.

      To address this limitation, we conducted a large follow-up study of over 8,000 individuals in the next study (Sugimoto, Hikaru, et al. “Stratification of individuals without prior diagnosis of diabetes using continuous glucose monitoring” medRxiv (2025)), which proposed clinically relevant cutoffs and reference ranges for AC_Var and other CGM-derived indices. As this large study was beyond the scope of the present manuscript due to differences in primary objectives and analytical approaches, it was not included in this paper; however, by integrating automated calculation tools with clear clinical thresholds, we expect to make these measures more accessible for clinical use.

      We have added the following text to the Discussion section to address these considerations (lines 415-419):

      While CGM-derived indices such as AC_Var and ADRR hold promise for CAD risk assessment, their complexity may present challenges for routine clinical implementation. To improve usability, we have developed a web-based calculator that automates these calculations. However, defining clinically relevant thresholds and reference ranges requires further validation in larger cohorts.

      (5) The exclusion of TIR from the main analysis is noted, but its relevance in diabetes management warrants further exploration. Integrating TIR as an outcome measure could provide additional clinical insights.

      We appreciate the reviewer’s comment regarding the potential role of time in range (TIR) as an outcome measure in our study. Because TIR is primarily influenced by the mean and variance of glucose levels, it does not fully capture the distinct role of glucose autocorrelation, which was the focus of our investigation.

      To clarify this point, we have expanded the Discussion section as follows (lines 380-388):

      Although time in range (TIR) was not included in the main analyses due to the relatively small number of T2DM patients and the predominance of participants with TIR >70%, our results demonstrate that CGM-derived indices outperformed conventional markers such as FBG, HbA1c, and PG120 in predicting %NC. Furthermore, multiple regression analysis between factor scores and TIR revealed that only factor 1 (mean) and factor 2 (variance) were significantly associated with TIR (Fig. S8C, D). This finding confirms the presence of three distinct components in glucose dynamics and highlights the added value of examining AC_Var as an independent glycemic feature beyond conventional CGM-derived measures.

      (6) While the study reflects a commitment to understanding CAD risks in a global context by including datasets from Japan, America, and China, the authors should provide demographic details (e.g., age, gender, socioeconomic status) and discuss how these factors might influence glucose dynamics and coronary plaque vulnerability.

      We appreciate the reviewer’s comment regarding the potential influence of demographic factors on glucose dynamics and coronary plaque vulnerability. We examined these relationships and found that age and sex had minimal effects on glucose dynamics characteristics, as shown in Figure S8A and S8B. These findings suggest that our primary conclusions regarding glucose dynamics and coronary risk remain robust across demographic groups within our data set.

      To address the reviewer’s suggestion, we have added the following discussion (lines 361-368):

      In our analysis of demographic factors, we found that age and gender had minimal influence on glucose dynamics characteristics (Fig. S8A, B), suggesting that our findings regarding the relationship between glucose dynamics and coronary risk are robust across different demographic groups within our dataset. Future studies involving larger and more diverse populations would be valuable to comprehensively elucidate the potential influence of age, gender, and other demographic factors on glucose dynamics characteristics and their relationship to cardiovascular risk.

      (7) While the article shows CGM-derived indices outperform traditional markers (e.g., HbA1c, FBG, PG120), it does not compare these indices against existing advanced risk models (e.g., Framingham Risk Score for CAD). A direct comparison would strengthen the claim of superiority.

      We appreciate the reviewer’s comment regarding the comparison of CGM-derived indices with existing CAD risk models. Given that our study population consisted of individuals with well-controlled total cholesterol and blood pressure levels, a direct comparison with the Framingham Risk Score for Hard Coronary Heart Disease (Wilson, Peter WF, et al. “Prediction of coronary heart disease using risk factor categories.” Circulation 97.18 (1998): 1837-1847.) may introduce inherent bias, as these factors are key components of the score.

      Nevertheless, to further assess the predictive value of the CGM-derived indices, we performed additional analyses using linear regression to predict %NC. Using the Framingham Risk Score, we obtained an R² of 0.04 and an Akaike Information Criterion (AIC) of 330. In contrast, our proposed model incorporating the three glycemic parameters (CGM_Mean, CGM_Std, and AC_Var) achieved a significantly improved R² of 0.36 and a lower AIC of 321, indicating superior predictive accuracy. We have updated the Results section as follows (lines 115-122):

      The regression model including CGM_Mean, CGM_Std and AC_Var to predict %NC achieved an R² of 0.36 and an Akaike Information Criterion (AIC) of 321. Each of these indices showed statistically significant independent positive correlations with %NC (Fig. 1A). In contrast, the model using conventional glycemic markers (FBG, HbA1c, and PG120) yielded an R² of only 0.05 and an AIC of 340 (Fig. 1B). Similarly, the model using the Framingham Risk Score for Hard Coronary Heart Disease (Wilson et al., 1998) showed limited predictive value, with an R² of 0.04 and an AIC of 330 (Fig. 1C).

      (8) The study mentions varying CGM sampling intervals across datasets (5-minute vs. 15-minute). Authors should employ sensitivity analysis to assess the impact of these differences on the results. This would help clarify whether higher-resolution data significantly improves predictive performance.

      We appreciate the reviewer’s comment regarding the potential impact of different CGM sampling intervals on our results. To assess the robustness of our findings across different sampling frequencies, we performed a downsampling analysis by converting our 5-minute interval data to 15-minute intervals. The AC_Var value calculated from 15-minute intervals was significantly correlated with that calculated from 5-minute intervals (R = 0.99, 95% CI: 0.97-1.00). Consequently, the main findings remained consistent across both sampling frequencies, indicating that our results are robust to variations in temporal resolution. We have added this analysis to the Results section (lines 122-126):

      The AC_Var computed from 15-minute CGM sampling was nearly identical to that computed from 5-minute sampling (R = 0.99, 95% CI: 0.97-1.00) (Fig. S1A), and the regression using the 15-min features yielded almost the same performance (R² = 0.36; AIC = 321; Fig. S1B).
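
      The downsampling step can be sketched as follows. This is only an illustration under stated assumptions: the trace is synthetic, `downsample` and `autocorrelation` are hypothetical helper names, and because AC_Var is not defined in this response, a single lag-matched autocorrelation is used as a stand-in for it.

```python
import math
from statistics import mean


def downsample(readings, factor=3):
    """Keep every `factor`-th reading (5-minute data -> 15-minute data when factor=3)."""
    return readings[::factor]


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag (counted in samples)."""
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    num = sum((series[t] - m) * (series[t + lag] - m)
              for t in range(len(series) - lag))
    return num / denom


# Synthetic 5-minute trace: two days of a smooth daily cycle (288 readings per day).
five_min = [100 + 30 * math.sin(2 * math.pi * i / 288) for i in range(576)]
fifteen_min = downsample(five_min)

# A 15-minute lag is 3 samples at 5-minute resolution and 1 sample at 15-minute resolution.
ac_5 = autocorrelation(five_min, lag=3)
ac_15 = autocorrelation(fifteen_min, lag=1)
print(round(ac_5, 3), round(ac_15, 3))
```

      On a smooth trace like this one the two estimates nearly coincide. With real CGM data the same comparison would be repeated per participant and the two sets of values correlated, as in the analysis described above.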

      (9) The identification of actionable components in glucose dynamics lays the groundwork for clinical stratification. The authors could explore the use of CGM-derived indices to develop a simple framework for stratifying risk into certain categories (e.g., low, moderate, high). This could improve clinical relevance and utility for healthcare providers.

      We appreciate the reviewer’s suggestion regarding the potential for CGM-derived indices to support clinical stratification. We completely agree with the idea that establishing risk categories (e.g., low, moderate, high) based on specific thresholds would enhance the clinical utility of these measures. However, given our current sample size limitations and our predefined objective of investigating correlations among indices, we have taken a conservative approach by focusing on the correlation between AC_Var and %NC rather than establishing definitive cutoffs. This approach intentionally avoids problematic statistical practices like p-hacking. It is not realistic to expect a single study to accomplish everything from proposing a new concept to conducting large-scale clinical trials to establishing clinical thresholds. Establishing clinical thresholds typically requires the accumulation of multiple studies over many years. Recognizing this reality, we have been careful in our manuscript to make modest claims about the discovery of new “correlations” rather than exaggerated claims about immediate routine clinical use.

      To address this limitation, we conducted a large follow-up study of over 8,000 individuals in the next study (Sugimoto, Hikaru, et al. “Stratification of individuals without prior diagnosis of diabetes using continuous glucose monitoring” medRxiv (2025)), which proposed clinically relevant cutoffs and reference ranges for AC_Var and other CGM-derived indices. As this large study was beyond the scope of the present manuscript due to differences in primary objectives and analytical approaches, it was not included in this paper. However, we expect to make these measures more actionable in clinical use by integrating automated calculation tools with clear clinical thresholds.

      We have added the following text to the Discussion section to address these considerations (lines 415-419):

      While CGM-derived indices such as AC_Var and ADRR hold promise for CAD risk assessment, their complexity may present challenges for routine clinical implementation. To improve usability, we have developed a web-based calculator that automates these calculations. However, defining clinically relevant thresholds and reference ranges requires further validation in larger cohorts.

      (10) While the study acknowledges several limitations, authors should also consider explicitly addressing the potential impact of inter-individual variability in glucose metabolism (e.g., age-related changes, hormonal influences) on the findings.

      We appreciate the reviewer’s comment regarding the potential impact of inter-individual variability in glucose metabolism, including age-related changes and hormonal influences, on our results. In our analysis, we found that age had minimal effects on glucose dynamics characteristics, as shown in Figure S8A. In addition, CGM-derived measures such as ADRR and AC_Var significantly contributed to the prediction of %NC independent of insulin secretion (I.I.) and insulin sensitivity (Composite index) (Fig. 2). These results suggest that our primary conclusions regarding glucose dynamics and coronary risk remain robust despite individual differences in glucose metabolism.

      To address the reviewer’s suggestion, we have added the following discussion (lines 186-188, 361-368):

      Conventional indices, including FBG, HbA1c, PG120, I.I., Composite index, and Oral DI, did not contribute significantly to the prediction compared to these CGM-derived indices.

      In our analysis of demographic factors, we found that age and gender had minimal influence on glucose dynamics characteristics (Fig. S8A, B), suggesting that our findings regarding the relationship between glucose dynamics and coronary risk are robust across different demographic groups within our dataset. Future studies involving larger and more diverse populations would be valuable to comprehensively elucidate the potential influence of age, gender, and other demographic factors on glucose dynamics characteristics and their relationship to cardiovascular risk.

      (11) It's unclear whether the identified components (value, variability, and autocorrelation) could serve as proxies for underlying physiological mechanisms, such as beta-cell dysfunction or insulin resistance. Please clarify.

      We appreciate the reviewer’s comment regarding the physiological underpinnings of the glucose components we identified. The mean, variance, and autocorrelation components we identified likely reflect specific underlying physiological mechanisms related to glucose regulation. In our previous research (Sugimoto, Hikaru, et al. “Improved detection of decreased glucose handling capacities via continuous glucose monitoring-derived indices.” Communications Medicine 5.1 (2025): 103.), we explored the relationship between glucose dynamics characteristics and glucose control capabilities using clamp tests and mathematical modelling. These investigations revealed that autocorrelation specifically shows a significant correlation with the disposition index (the product of insulin sensitivity and insulin secretion) and insulin clearance parameters.

      Furthermore, our current study demonstrates that CGM-derived measures such as ADRR and AC_Var significantly contributed to the prediction of %NC independent of established metabolic parameters including insulin secretion (I.I.) and insulin sensitivity (Composite index), as shown in Figure 2. These results suggest that the components we identified capture distinct physiological aspects of glucose metabolism beyond traditional measures of beta-cell function and insulin sensitivity. Further research is needed to fully characterize these relationships, but our results imply that these characteristics of glucose dynamics offer supplementary insight into the underlying beta-cell dysregulation that contributes to coronary plaque vulnerability.

      To address the reviewer’s suggestion, we have added the following discussion to the Results section (lines 186-188):

      Conventional indices, including FBG, HbA1c, PG120, I.I., Composite index, and Oral DI, did not contribute significantly to the prediction compared to these CGM-derived indices.

      Minor Comments:

      (1) The use of LASSO and PLS regression is appropriate, but the rationale for choosing these methods over others (e.g., Ridge regression) should be explained in greater detail.

      We appreciate the reviewer’s comment and have added the following discussion to the Methods section (lines 578-585):

      LASSO regression was chosen for its ability to perform feature selection by identifying the most relevant predictors. Unlike Ridge regression, which simply shrinks coefficients toward zero without reaching exactly zero, LASSO produces sparse models, which is consistent with our goal of identifying the most critical features of glucose dynamics associated with coronary plaque vulnerability. In addition, we implemented PLS regression as a complementary approach due to its effectiveness in dealing with multicollinearity, which was particularly relevant given the high correlation among several CGM-derived measures.
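
      The contrast between the two penalties is easiest to see in the special case of an orthonormal design, where both estimators have closed forms: ridge rescales each least-squares coefficient, whereas LASSO soft-thresholds it. The sketch below uses made-up coefficients and an arbitrary penalty value `lam`; it illustrates the general property and is not the model fitted in the study.

```python
import math


def ridge_shrink(b_ols, lam):
    """Ridge solution for an orthonormal design: proportional shrinkage, never exactly zero
    unless the least-squares coefficient is itself zero."""
    return b_ols / (1.0 + lam)


def lasso_soft_threshold(b_ols, lam):
    """LASSO solution for an orthonormal design: coefficients with |b| <= lam become zero."""
    magnitude = max(abs(b_ols) - lam, 0.0)
    return math.copysign(magnitude, b_ols) if magnitude else 0.0


# Hypothetical least-squares coefficients for four predictors.
ols = [2.0, -0.3, 0.1, -1.5]
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_soft_threshold(b, lam) for b in ols]

print("ridge:", [round(b, 3) for b in ridge])
print("lasso:", lasso)
```

      In this toy case the two small coefficients are removed by LASSO but only shrunk by ridge, which is the sparsity property the paragraph above relies on.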

      (2) While figures are well-designed, adding annotations to highlight key findings (e.g., significant contributors in factor analysis) would improve clarity.

      We appreciate the reviewer’s suggestion to improve the clarity of our figures. In the factor analysis, we decided not to include annotations because indicators such as ADRR and J-index can be associated with multiple factors, which could lead to misleading or confusing interpretations. However, in response to the suggestion, we have added annotations to the PLS analysis, specifically highlighting items with VIP values greater than 1 (Fig. 2D, S2D) to emphasize key contributors.

      (3) The term "value" as a component of glucose dynamics could be clarified. For instance, does it strictly refer to mean glucose levels, or does it encompass other measures?

      We appreciate the reviewer’s question regarding the term “value” in the context of glucose dynamics. Factor 1 was predominantly influenced by CGM_Mean, with a factor loading of 0.99, indicating that it primarily represents mean glucose levels. Given this strong correlation, we have renamed Factor 1 to “Mean” (Fig. 3A) to more accurately reflect its role in glucose dynamics.

      (4) The concept of autocorrelation may be unfamiliar to some readers. A brief, intuitive explanation with a concrete example of how it manifests in glucose dynamics would enhance understanding.

      We appreciate the reviewer’s suggestion. Autocorrelation refers to the relationship between a variable and its past values over time. In the context of glucose dynamics, it reflects how current glucose levels are influenced by past levels, capturing patterns such as sustained hyperglycemia or recurrent fluctuations. For example, if an individual experiences sustained high glucose levels after a meal, the strong correlation between successive glucose readings indicates high autocorrelation. We have included this explanation in the revised manuscript (lines 519-524) to improve clarity for readers unfamiliar with the concept. Additionally, Figure 4A shows an example of glucose dynamics with different autocorrelation.
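
      A small numerical example may make this concrete. The sketch below builds two hypothetical glucose traces (values in mg/dL, invented for illustration) from the same eight readings, so their mean and variance are identical, and computes the lag-1 sample autocorrelation of each. The function name `lag_autocorrelation` is an assumption, not a routine from the manuscript.

```python
from statistics import mean, pvariance


def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation: covariance of the series with its lagged copy,
    divided by the total sum of squared deviations from the mean."""
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    num = sum((series[t] - m) * (series[t + lag] - m)
              for t in range(len(series) - lag))
    return num / denom


# Sustained pattern: glucose rises steadily, so each reading resembles the previous one.
sustained = [90, 100, 110, 120, 130, 140, 150, 160]
# Same readings reordered so that highs and lows alternate.
fluctuating = [90, 160, 100, 150, 110, 140, 120, 130]

for name, trace in [("sustained", sustained), ("fluctuating", fluctuating)]:
    print(name, mean(trace), pvariance(trace), round(lag_autocorrelation(trace), 3))
```

      The sustained trace has a positive lag-1 autocorrelation and the alternating trace a negative one, even though mean and variance are the same in both.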

      (5) Ensure consistent use of terms like "glucose dynamics," "CGM-derived indices," and "plaque vulnerability." For instance, sometimes indices are referred to as "components," which might confuse readers unfamiliar with the field.

      We appreciate the reviewer’s comment about ensuring consistency in terminology. To avoid confusion, we have reviewed and standardized the use of terms such as “CGM-derived indices,” and “plaque vulnerability” throughout the manuscript. Additionally, while many of our measures are strictly CGM-derived indices, several “components” in our analysis include fasting blood glucose (FBG) and glucose waveforms during the OGTT. For these measures, we retained the descriptors “glucose dynamics” and “components” rather than relabelling them as CGM-derived indices.

      (6) Provide a more detailed overview of the supplementary materials in the main text, highlighting their relevance to the key findings.

      We appreciate the reviewer’s suggestion. We revised the manuscript by integrating the supplementary text into the main text (lines 129-160), which provides a clearer overview of the supplementary materials. Consequently, the Supplementary Information section now only contains supplementary figures, while their relevance and key details are described in the main text. 

      Reviewer #3 (Recommendations for the authors):

      Other Concerns:

      (1) The text states the significance of tests, however, no p-values are listed: Lines 118-119: Significance is cited between CGM indices and %NC, however, neither the text nor supplementary text have p-values. Need p-values for Figure 3C, Figure S10. When running the https://cgm-basedregression.streamlit.app/ multiple regression analysis, a p-value should be given as well. Do the VIP scores (Line 142) change with the inclusion of SBP, DBP, TG, LDL, and HDL? Do the other datasets have the same well-controlled serum cholesterol and BP levels?

      We appreciate the reviewer’s concern regarding statistical significance and the documentation of p values.

      First, given the multiple comparisons in our study, we used q values rather than p values, as shown in Figure 1D. Q values provide a more rigorous statistical framework for controlling the false discovery rate in multiple testing scenarios, thereby reducing the likelihood of false positives.
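
      As an illustration of how q values relate to p values, the sketch below applies the Benjamini-Hochberg step-up adjustment. This is an assumption made for illustration: the response does not state which false discovery rate procedure was used, and the p values shown are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (q values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity of q values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


# Hypothetical p values from four simultaneous tests.
example_p = [0.01, 0.04, 0.03, 0.20]
print(benjamini_hochberg(example_p))
```

      Each q value is at least as large as the corresponding p value, which is why a result that passes a q value threshold is less likely to be a false positive among many comparisons.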

      Second, our statistical reporting follows established guidelines, including those of the New England Journal of Medicine (Harrington, David, et al. “New guidelines for statistical reporting in the journal.” New England Journal of Medicine 381.3 (2019): 285-286.), which recommend that “reporting of exploratory end points should be limited to point estimates of effects with 95% confidence intervals” and advise authors to “replace p values with estimates of effects or association and 95% confidence intervals”. According to these guidelines, p values should not be reported in this type of study. We determined significance based on whether these 95% confidence intervals excluded zero, a statistical method for determining whether an association is significantly different from zero (Tan, Sze Huey, and Say Beng Tan. “The correct interpretation of confidence intervals.” Proceedings of Singapore Healthcare 19.3 (2010): 276-278.).
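
      The rule "the 95% confidence interval excludes zero" can be sketched for a Pearson correlation using the Fisher z transformation. This is only one common way to build such an interval and is an assumption here; the response does not specify how the intervals were computed, and the values of r and n below are hypothetical.

```python
import math


def pearson_confidence_interval(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson correlation via the Fisher z transform."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                 # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def excludes_zero(interval):
    """True when both bounds lie on the same side of zero."""
    lower, upper = interval
    return lower > 0 or upper < 0


# Hypothetical correlations from a sample of 50 participants.
print(excludes_zero(pearson_confidence_interval(0.5, 50)))  # interval above zero
print(excludes_zero(pearson_confidence_interval(0.1, 50)))  # interval spans zero
```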

      For the sake of transparency, we provide p values for readers who may be interested, although we emphasize that they should not be the basis for interpretation, as discussed in the referenced guidelines. Specifically, in Figure 1A-B, the p values for CGM_Mean, CGM_Std, and AC_Var were 0.02, 0.02, and <0.01, respectively, while those for FBG, HbA1c, and PG120 were 0.83, 0.91, and 0.25, respectively. In Figure 3C, the p values for factors 1–5 were 0.03, 0.03, 0.03, 0.24, and 0.87, respectively, and in Figure S8C, the p values for factors 1–3 were <0.01, <0.01, and 0.20, respectively. We appreciate the opportunity to clarify our statistical methodology and are happy to provide additional details if needed.

      We confirmed that the results of the variable importance in projection (VIP) analysis remained stable after including additional covariates, such as systolic blood pressure (SBP), diastolic blood pressure (DBP), triglycerides (TG), low-density lipoprotein cholesterol (LDL-C), and high-density lipoprotein cholesterol (HDL-C). The VIP values for ADRR, MAGE, AC_Var, and LI consistently exceeded one even after these adjustments, suggesting that the primary findings are robust in the presence of these clinical variables. We have added the following sentences in the Results and Methods sections (lines 188-191, 491-494):

      Even when SBP, DBP, TG, LDL-C, and HDL-C were included as additional input variables, the results remained consistent, and the VIP scores for ADRR, AC_Var, MAGE, and LI remained greater than 1 (Fig. S2D).

      Of note, as the original reports document, the validation datasets did not specify explicit cutoffs for blood pressure or cholesterol. Consequently, they included participants with suboptimal control of these parameters.

      (2) Negative factor loadings have not been addressed and consistency in components: Figure 3, Figure S7. All the main features for value in Figure 3A are positive. However, MVALUE in S7B is very negative for value whereas the other features highlighted for value are positive. What is driving this difference? Please explain if the direction is important. Line 480 states that variables with factor loadings >= 0.30 were used for interpretation, but it appears in the text (Line 156, Figure 3) that oral DI was used for value, even though it had a -0.61 loading. Figure 3, Figure S7. HBGI falls within two separate components (value and variability). There is not a consistent component grouping. Removal of MAG (Line 185) and only MAG does not seem scientific. Did the removal of other features also result in similar or different Cronbach's ⍺? It is unclear what Figure S8B is plotting. What does each point mean?

      We appreciate the reviewer’s comment regarding the classification of CGM-derived measures into the three components: value, variability, and autocorrelation. As the reviewer correctly points out, some measures may load differently between the value and variability components in different datasets. However, we believe that this variability reflects the inherent mathematical properties of these measures rather than a limitation of our study.

      For example, the HBGI clusters differently across datasets due to its dependence on the number of glucose readings above a threshold. In populations where mean glucose levels are predominantly below this threshold, the HBGI is more sensitive to glucose variability (Fig. S3A). Conversely, in populations with a wider range of mean glucose levels, HBGI correlates more strongly with mean glucose levels (Fig. 3A). This context-dependent behaviour is expected given the mathematical properties of these measures and does not indicate an inconsistency in our classification approach.

      Importantly, our main findings remain robust: CGM-derived measures systematically fall into three components: value, variability, and autocorrelation. Traditional CGM-derived measures primarily reflect either value or variability, and this categorization is consistently observed across datasets. While specific indices such as HBGI may shift classification depending on population characteristics, the overall structure of CGM data remains stable.

      With respect to negative factor loadings, we agree that they may appear confusing at first. However, in the context of exploratory factor analysis, the magnitude, or absolute value, of the loading is most critical for interpretation, rather than its sign. Following established practice, we considered variables with absolute loadings of at least 0.30 to be meaningful contributors to a given component. Accordingly, although the oral DI had a negative loading of –0.61, its absolute magnitude exceeded the threshold of 0.30, so it was considered in our interpretation of the “value” component. Regarding the reviewer’s observation that MVALUE in Figure S7B shows a strongly negative loading while other indices in the same component show positive loadings, we believe this reflects the relative orientation of the factor solution rather than a substantive difference in interpretation. In factor analysis, the direction of factor loadings is arbitrary: multiplying all the loadings for a given factor by –1 would not change the factor’s statistical identity. Therefore, what matters is not whether a variable loads positively or negatively but rather the strength of its association with the latent component (i.e., the absolute value of the loading).
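
      The sign argument can be checked numerically. In the sketch below, the part of the covariance explained by a single factor is the outer product of its loadings, and it is unchanged when every loading is multiplied by -1. The loading values are illustrative only (0.99 and -0.61 echo figures quoted in this response; the third value and the grouping into one factor are assumptions), and the helper names are hypothetical.

```python
def common_covariance(loadings):
    """Covariance implied by one factor: the outer product of its loadings."""
    return [[a * b for b in loadings] for a in loadings]


def salient_variables(named_loadings, threshold=0.30):
    """Names whose absolute loading meets the interpretation threshold."""
    return [name for name, value in named_loadings.items()
            if abs(value) >= threshold]


# Illustrative loadings on a single factor.
factor = {"CGM_Mean": 0.99, "oral DI": -0.61, "weak_index": 0.12}
flipped = {name: -value for name, value in factor.items()}

same_fit = (common_covariance(list(factor.values()))
            == common_covariance(list(flipped.values())))
print(same_fit)                    # flipping every sign leaves the fit unchanged
print(salient_variables(factor))   # selection depends on magnitude only
print(salient_variables(flipped))
```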

      The rationale for removing MAG was based on statistical and methodological considerations. As is common practice in reliability analyses, we examined whether Cronbach’s α would improve if we excluded items with low factor loadings or weak item–total correlations. In the present study, we recalculated Cronbach’s α after removing the MAG item because it had a low loading. Its exclusion did not substantially affect the theoretical interpretation of the factor, which we conceptualize as “secretion” (without CGM). MAG’s removal alone is scientifically justified because it was the only item whose exclusion improved Cronbach's α while preserving interpretability. In contrast, removing other items would have undermined the conceptual clarity of the factor or would not have meaningfully improved α. Furthermore, the MAG item has a high factor 2 loading.
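
      The "alpha if item deleted" check described above can be sketched as follows. The scores are hypothetical and chosen so that one item is inconsistent with the other two; they do not represent the study data, and the function names are illustrative.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a dict mapping item name to per-respondent scores."""
    columns = list(items.values())
    k = len(columns)
    if k < 2:
        raise ValueError("at least two items are required")
    totals = [sum(row) for row in zip(*columns)]       # total score per respondent
    item_variance_sum = sum(variance(col) for col in columns)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def alpha_if_item_deleted(items):
    """Alpha recomputed after dropping each item in turn."""
    return {
        dropped: cronbach_alpha(
            {name: col for name, col in items.items() if name != dropped})
        for dropped in items
    }


# Hypothetical scores for four respondents; item_c disagrees with the others.
scores = {
    "item_a": [1, 2, 3, 4],
    "item_b": [2, 3, 4, 5],
    "item_c": [4, 1, 3, 2],
}
print(cronbach_alpha(scores))
print(alpha_if_item_deleted(scores))
```

      In this toy example, removing item_c is the only deletion that raises alpha, which mirrors the kind of evidence used to justify excluding a single item.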

      Each point in Figure S8B (old version) corresponds to an individual participant.

      To address these considerations, we have added the following text to the Discussion and Methods (lines 388-396, 600-601) and to the legend of Figure S6B (current version):

      Some indices, such as HBGI, showed variation in classification across datasets, with some populations showing higher factor loadings in the “mean” component and others in the “variance” component. This variation occurs because HBGI calculations depend on the number of glucose readings above a threshold. In populations where mean glucose levels are predominantly below this threshold, the HBGI is more sensitive to glucose variability (Fig. S5A). Conversely, in populations with a wider range of mean glucose levels, the HBGI correlates more strongly with mean glucose levels (Fig. 3A). Despite these differences, our validation analyses confirm that CGM-derived indices consistently cluster into three components: mean, variance, and autocorrelation.

      Variables with absolute factor loadings of ≥ 0.30 were used in interpretation.

      Box plots comparing factors 1 (Mean), 2 (Variance), and 3 (Autocorrelation) between individuals without (-) and with (+) diabetic macrovascular complications. Each point corresponds to an individual. The boxes represent the interquartile range, with the median shown as a horizontal line. Mann–Whitney U tests were used to assess differences between groups, with P values < 0.05 considered statistically significant.

      Minor Concerns:

      (1) NGT is not defined.

      We thank the reviewer for pointing out that the term “NGT” was not clearly defined in the original manuscript. We have added the following text to the Methods section (lines 447-451):

      T2DM was defined as HbA1c ≥ 6.5%, fasting plasma glucose (FPG) ≥ 126 mg/dL or 2‑h plasma glucose during a 75‑g OGTT (PG120) ≥ 200 mg/dL. IGT was defined as HbA1c 6.0–6.4%, FPG 110–125 mg/dL or PG120 140–199 mg/dL. NGT was defined as values below all prediabetes thresholds (HbA1c < 6.0%, FPG < 110 mg/dL and PG120 < 140 mg/dL).
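
      These definitions translate into a simple decision rule, sketched below. Two points are assumptions rather than statements from the manuscript: T2DM is checked first so that it takes precedence when criteria conflict, and values falling between the printed ranges (for example HbA1c 6.45%) are treated as IGT. The function name is hypothetical.

```python
def classify_glucose_tolerance(hba1c, fpg, pg120):
    """Classify as 'T2DM', 'IGT' or 'NGT'.

    hba1c in %, fpg and pg120 in mg/dL.
    """
    # Any single diabetic-range value is sufficient for T2DM.
    if hba1c >= 6.5 or fpg >= 126 or pg120 >= 200:
        return "T2DM"
    # Any single prediabetic-range value is sufficient for IGT.
    if hba1c >= 6.0 or fpg >= 110 or pg120 >= 140:
        return "IGT"
    # All three values are below the prediabetes thresholds.
    return "NGT"


print(classify_glucose_tolerance(5.6, 95, 120))
print(classify_glucose_tolerance(6.1, 100, 130))
print(classify_glucose_tolerance(5.8, 130, 150))
```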

      (2) Is it necessary to list the cumulative percentage (Line 173)? It could be clearer to list the percentage explained by each factor instead.

      We appreciate the reviewer’s suggestion to list the percentage explained by each factor rather than the cumulative percentage for improved clarity. Following this suggestion, we have revised the results to show the individual contribution of each factor (39%, 21%, 10%, 5%, 5%) rather than the cumulative percentages (39%, 60%, 70%, 75%, 80%) that were previously listed (lines 220-221).
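
      The conversion between the two presentations is a successive difference, as the short sketch below shows using the percentages quoted above. The function name is illustrative.

```python
def per_factor_variance(cumulative):
    """Convert cumulative percentages of variance explained to per-factor shares."""
    shares = []
    previous = 0
    for value in cumulative:
        shares.append(value - previous)  # contribution of this factor alone
        previous = value
    return shares


cumulative_percent = [39, 60, 70, 75, 80]
print(per_factor_variance(cumulative_percent))
```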

      (3) Figure S10. How were the coefficients generated for Figure S10? No methods are given.

      We conducted a multiple linear regression analysis in which time in range (TIR) was the dependent variable and the factor scores corresponding to the first three latent components (factor 1 representing the mean, factor 2 representing the variance, and factor 3 representing the autocorrelation) were the independent variables. We have added the following text to the figure legend (Fig. S8C) to provide a more detailed description of how the coefficients were generated:

      Comparison of predicted Time in range (TIR) versus measured TIR using multiple regression analysis between TIR and factor scores in Figure 3. In this analysis, TIR was the dependent variable, and the factor scores corresponding to the first three latent components (factor 1 representing the mean, factor 2 representing the variance, and factor 3 representing the autocorrelation) were the independent variables. Each point corresponds to the values for a single individual.

      (4) In https://cgm-basedregression.streamlit.app/, more explanation should be given about the output of the multiple regression. Regression is spelled incorrectly on the app.

      We thank the reviewer for pointing out the need for a clearer explanation of the multiple regression analysis presented in the online tool (https://cgmregressionapp2.streamlit.app/). We have added a description of the regression and corrected the typographical error in the spelling of “regression” within the app.

      (5) The last section of results (starting at line 225) appears to be unrelated to the goal of predicting %NC.

      We appreciate the reviewer’s feedback regarding the relevance of the simulation component of our manuscript. The primary contribution of our study goes beyond demonstrating correlations between CGM-derived measures and %NC; it highlights three fundamental components of glycemic patterns (mean, variance, and autocorrelation) and their independent relationships with coronary plaque characteristics. The simulations are included to illustrate how glycemic patterns with identical means and variability can have different autocorrelation structures. As reviewer 2 pointed out in minor comment #4, temporal autocorrelation can be difficult to interpret, so these visualizations were intended to provide intuitive examples for readers.

      However, we agree with the reviewer’s concern about the coherence of the manuscript. In response, we have streamlined the simulation section by removing technical simulations that do not directly support our primary conclusions (old version of the manuscript, lines 239-246, 502-526), while retaining only those that enhance understanding of the three glycemic components (Fig. 4A).

      (6) Figure S2. The R2 should be reported.

      We thank the reviewer for suggesting that we report R² in Figure S2. In the revised version, we have added the correlation coefficients and their 95% confidence intervals to Figure 1E.

      (7) Multiple panels have a correlation line drawn with a slope of 1, which does not reflect the data or the r^2 listed. This should be fixed.

      We appreciate the reviewer’s concern that several panels included regression lines with a fixed slope of one that did not reflect the associated R² values. We have corrected Figures 1A–C and 3C to display regression lines representing the estimated slopes derived from the regression analyses.

    1. eLife Assessment

      This valuable study identifies a novel regulator of stress-induced gene quiescence in C. elegans: the multi-Zinc-finger protein ZNF-236. The work provides evidence for an active mechanism that maintains the repressed state of inducible genes under basal conditions in the absence of stress. The claims for discovery made in the title and abstract are supported by solid experimental data. However, a deeper investigation into the mechanisms of ZNF-236 action could substantially enhance the manuscript's impact and value.

    2. Reviewer #1 (Public review):

      Summary:

      The paper by Ilbay et al. describes a screen in C. elegans for loss-of-function of factors that are presumed to constitutively downregulate heat shock or stress genes regulated by HSF-1. The hypothesis posits an active mechanism of downregulation of these genes under non-stressed conditions. The screen robustly identified ZNF-236, a multi-zinc-finger protein, whose loss upregulates heat-shock and stress-induced prion-like protein genes, but which does not appear to act in cis at the relevant promoters. The authors speculate that ZNF-236 acts indirectly on chromatin or chromatin domains to repress hs genes under non-stressed conditions.

      Strengths:

      The screen is clever, well-controlled and quite straightforward. I am convinced that ZNF-236 has something to do with keeping heat shock and other stress transcripts low. The mapping of potential binding sites of ZNF-236 is negative, despite the development of a new method to monitor binding sites. I am not sure whether this assay has a detection/sensitivity threshold limit, as it is not widely used. Up to this point, the data are solid, and the logic is easy to follow.

      Weaknesses:

      While the primary observations are well-documented, the mode of action of ZNF-236 is inadequately explored. Multi Zn finger proteins often bind RNA (TFIIIA is a classic example), and the following paper addresses multivalent functions of Zn finger proteins in RNA stability and processing: Mol Cell 2024 Oct 3;84(19):3826-3842.e8. doi: 10.1016/j.molcel.2024.08.010. I see no evidence that would point to a role for ZNF-236 in nuclear organization, yet this is the authors' favorite hypothesis. In my opinion, this proposed mechanism is poorly justified, and certainly should not be posited without first testing whether ZNF-236 acts post-transcriptionally, directly down-regulating the relevant mRNAs in some way. It could regulate RNA stability, splicing, export or translation of the relevant RNAs rather than their transcription rates. This can be tested by monitoring whether ZNF-236 alters run-on transcription rates or not. If nascent RNA synthesis rates are not altered, but co- and/or post-transcriptional events are, and if ZNF-236 is shown to bind RNA (which is likely), the paper could still postulate that the protein plays a role in downregulating stress and heat shock proteins. However, the authors could then rule out that it acts on the promoter by altering RNA Pol II engagement. Another option that should be tested is that ZNF-236 acts by nucleating an H3K9me domain that might shift the affected genes to the nuclear envelope, sequestering them in a zone of low-level transcription. That is also easily tested by tracking the position of an affected gene in the presence and absence of ZNF-236. This latter mechanism is also right in line with known modes of action for Zn finger proteins (in mammals, acting through KAP1 and SETDB1). A role for nucleating H3K9me could be easily tested in worms by screening MET-2 or SET-25 knockouts for heat shock or stress mRNA levels. These data sets are already published.

      Without testing these two obvious pathways of action (through RNA or through H3K9me deposition), this paper is too preliminary.

      Appraisal:

      The authors achieved their initial aim with the screen, and the paper is of interest to the field. However, they do not adequately address the likely modes of action. Indeed, I think their results fail to support the conclusion or speculation that ZNF-236 acts on long-range chromatin organization. No solid evidence is presented to support this claim.

      Impact:

      If the paper were to address and/or rule out likely modes of action, the paper would be of major value to the field of heat shock and stress mRNA control.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript reports the identification of ZNF-236 as a key regulator that maintains quiescence of heat shock inducible genes in C. elegans. Using a forward genetic screen for constitutive activation of an endogenous hsp-16.41 reporter, the authors show that loss of znf-236 leads to widespread, HSF-1-dependent expression of inducible heat shock proteins (iHSPs) and a subset of prion-like stress-responsive genes, in the absence of proteotoxic stress. Transcriptomic analysis reveals that znf-236 mutants partially overlap with the canonical heat shock response, selectively activating highly inducible iHSPs rather than the full HSR program. iHSP transgenes integrated throughout the genome generally become de-repressed in znf-236 mutants, whereas the same constructs on extrachromosomal arrays or inserted into the rDNA locus are insensitive to znf-236 loss. Using a newly developed method, Transcription Factor Deaminase Sequencing (TFD-seq), the authors show that ZNF-236 binds sparsely across the genome and does not associate with iHSP promoters, supporting an indirect mode of regulation. Physiologically, znf-236 mutants exhibit increased thermotolerance and maintain iHSP expression during aging.

      Strengths:

      This is a carefully executed and internally consistent study that identifies a new regulator of stress-induced gene quiescence in C. elegans. The genetics are clean and the phenotypes are robust.

      Weaknesses:

      The manuscript is largely descriptive. It would be substantially strengthened by deeper mechanistic insight into what ZNF-236 does beyond being required for default silencing.

    4. Reviewer #3 (Public review):

      Summary:

      The researchers performed a genetic screen to identify a protein, ZNF-236, which belongs to the zinc finger family, and is required for repression of heat shock inducible genes. The researchers applied a new method to map the binding sites of ZNF-236, and based on the data, suggested that the protein does not repress genes by directly binding to their regulatory regions targeted by HSF1. Insertion of a reporter in multiple genomic regions indicates that repression is not needed in repetitive genomic contexts. Together, this work identifies ZNF-236, a protein that is important to repress heat-shock-responsive genes in the absence of heat shock.

      Strengths:

      A hit from a productive genetic screen was validated, and followed up by a series of well-designed experiments to characterize how the repression occurs. The evidence that the identified protein is required for the repression of heat shock response genes is strong.

      Weaknesses:

      The researchers propose and discuss one model of repression based on protein binding data, which depends on a new technique and data that are not fully characterized.

      Major Comments:

      (1) The phrase "results from a shift in genome organization" in the abstract lacks strong evidence. This interpretation heavily relies on the protein binding technique, using ELT-2 as a positive and an imperfect negative control. If we assume that the binding is a red herring, the interpretation would require some other indirect regulation mechanism. Is it possible that ZNF-236 binds to the RNA of a protein that is required to limit HSF-1 and potentially other transcription factors' activation function? In the extrachromosomal array/rDNA context, perhaps other repressive mechanisms are redundant, and thus active repression by ZNF-236 is not required. This possibility is mentioned in one sentence in the discussion, but most of the other interpretations rely on the ZNF-236 binding data being correct. Given that there is other evidence for a transcriptional role for ZNF-236, and no negative control (e.g. deletion of the zinc fingers, or a control akin to those done for ChIP-seq, like a null mutant or knockdown), a stronger foundation is needed for the presented model for genome organization.

      (2) Continuing along the same line, the study assumes that ZNF-236 function is transcriptional. Is it possible to tag a protein and look at localization? If it is in the nucleus, it could be additional evidence that this is true.

      (3) I suggest that the authors analyze the genomic data further. A MEME analysis for ZNF-236 can be done to test if the motif occurrences are enriched at the binding sites. Binding site locations in the genome with respect to genes (exon, intron, promoter, enhancer?) can be analyzed and compared to existing data, such as ATAC-seq. The authors also propose that this protein could be similar to CTCF. There are numerous high-quality and high-resolution Hi-C data in C. elegans larvae, and so the authors can readily compare their binding peak locations to the insulation scores to test their hypothesis.

      (4) The researchers suggest that ZNF-236 is important for some genomic context. Based on the transcriptomic data, can they find a clue for what that context may be? Are the ZNF-236-repressed genes enriched for non-expressed genes in regions surrounded by highly expressed genes?

    5. Author response:

      We thank the reviewers for their insights and suggestions. We appreciate that the reviewers were engaged by both the observations and their interpretation, and consider their interest in further analysis and clarified discussion to be the best possible compliment to this work.

      As noted by the reviewers, the working hypothesis of a nuclear organization role for ZNF-236 is just one model. Clarifying this model and potential alternatives will certainly add to the manuscript and this will be a key part of the revision. Beyond this, several suggested analyses should explore extant models, while providing context for considering alternatives. We look forward to carrying out such analyses as feasible and will report them in the revised manuscript.

    1. eLife Assessment

      This study delivers valuable new insights into the neural circuits involved in post-mating responses (PMR) in Drosophila females, supported by convincing evidence that the circuits for mating receptivity and egg-laying are distinct. The new experimental evidence adds to the current understanding of the neural circuits and molecular mechanisms underpinning PMR.

    2. Reviewer #1 (Public review):

      Summary:

      Authors explore how sex-peptide (SP) affects post-mating behaviours in adult females, such as receptivity and egg laying. This study identifies different neurons in the adult brain and the VNC that become activated by SP, largely by using an intersectional gene expression approach (split-GAL4) to narrow down the specific neurons involved. They confirm that SP binds to the well-known Sex Peptide Receptor (SPR), initiating a cascade of physiological and behavioural changes related to receptivity and egg laying.

      Comments on revised version:

      The authors have substantially strengthened the manuscript in response to our main concerns.

      In particular, they now explicitly test multiple established PMR nodes (including SAG/SPSN as well as pC1, OviDN/OviEN/OviIN and vpoDN), which helps separate direct SP targets from downstream PMR circuitry and supports their interpretation that some of these known nodes can affect receptivity without necessarily inducing oviposition. They also addressed key technical/clarity points: the requested head/trunk expression controls are provided (Suppl Fig S1), and the VT003280 annotation is corrected (now FD6 rather than "SAG driver"). Overall, these additions make the central conclusion, that distinct CNS neuron subsets ("SPRINz") are sufficient to elicit PMR components, more convincing, and the added comparisons with genital tract expressing lines further argue against a simple "periphery only" explanation.

    3. Reviewer #2 (Public review):

      Sex peptide (SP) transferred during mating from male to female induces various physiological responses in the receiving female. Among those, the increase in oviposition and decrease in sexual receptivity are very remarkable. Naturally, a long-standing and significant question is the identity of the sex peptide target neurons that express the SP receptor and underlie these responses. Identification of these neurons will eventually lead to the identification of the underlying neuronal circuitry.

      The Soller lab has addressed this important question already several years ago (Haussmann et al. 2013), using relevant GAL4-lines and membrane-tethered SP. The results already showed that the action of SP on receptivity and oviposition is mediated by different neuronal subsets and hence can be separated. The GAL4-lines used at that time were, however, broad, and the individual identity of the relevant neurons remained unclear.

      In the present paper, Nallasivan and colleagues carried this analysis a significant step further, using new intersectional approaches and transsynaptic tracing.

      Strength:

      The intersectional approach is appropriate and state-of-the-art. The analysis is a very comprehensive tour-de-force and experiments are carefully performed to a high standard. The authors also produced a useful new transgenic line (UAS-FRTstopFRT mSP). The finding that neurons in the brain (head) mediate the SP effect on receptivity, while neurons in the abdomen and thorax (ventral nerve cord or peripheral neurons) mediate the SP effect on oviposition, is a significant step forward in the endeavour to identify the underlying neuronal networks and hence a mechanistic understanding of SP action. The analysis identifies a small set of neurons underlying SP responses. Some are part of the post-mating circuitry and influence receptivity, while others are likely involved in higher order sensory processing. Though these results are not entirely unexpected, they are novel and represent a significant step forwards as the analysis is at a much higher resolution than previous work.

      Weakness:

      Though the analysis is at a much higher resolution than previous work on SP targets, it does not yet reach the resolution of single neuronal cell types. The last paragraph in the discussion rightfully speculates about the neurochemical identity of some of the intersection neurons (e.g. dopaminergic P1 neurons, NPF neurons). These suggested identities could have been confirmed by straightforward immunostainings against NPF or TH, for which antisera are available. Moreover, specific GAL4 lines for NPF or P1 or at least TH neurons are available which could be used to express mSP to test whether SP activation of those neurons is sufficient to trigger the SP effect. Moreover, the conclusion that SP target neurons operate as key integrators of sensory information for decision of behavioural outputs needs further experimental confirmation.

    4. Reviewer #3 (Public review):

      Summary:

      This paper reports new findings regarding neuronal circuitries responsible for female post-mating responses (PMRs) in Drosophila. The PMRs are induced by sex peptide (SP) transferred from males during mating. The authors sought to identify SP target neurons using a membrane-tethered SP (mSP) and a collection of GAL4 lines, each containing a fragment derived from the regulatory regions of the SPR, fru, and dsx genes involved in PMR. They identified several lines that induced PMR upon expression of mSP. Using split-GAL4 lines, they identified distinct SP-sensing neurons in the central brain and ventral nerve cord. Analyses of pre- and post-synaptic connections using retro- and trans-Tango placed SP target neurons at the interface of sensory processing interneurons that connect to two common post-synaptic processing neuronal populations in the brain. The authors proposed that SP interferes with the processing of sensory inputs from multiple modalities.

      Strengths:

      Besides the main results described in the summary above, the authors discovered the following:

      (1) Reduction of receptivity and induction of egg-laying are separable by restricting the expression of membrane-tethered SP (mSP): head-specific expression of mSP induces reduction of receptivity only, whereas trunk-specific expression of mSP induces oviposition only. Also, they identified a GAL4 line (SPR12) that induced egg laying but did not reduce receptivity.

      (2) Expression of mSP in the genital tract sensory neurons does not induce PMR. The authors identified three GAL4 drivers (SPR3, SPR21, and fru9), which robustly expressed mSP in genital tract sensory neurons but did not induce PMRs. Also, SPR12 does not express in genital tract neurons but induces egg laying by expressing mSP.

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public Review):

      Areas of improvement and suggestions:

      (1) "These results suggest the SP targets interneurons in the brain that feed into higher processing centers from different entry points likely representing different sensory input" and "All together, these data suggest that the abdominal ganglion harbors several distinct type of neurons involved in directing PMRs"

      The characterization of the post-mating circuitry has been largely described by the group of Barry Dickson and other labs. I suggest ruling out a potential effect of mSP in any of the well-known post-mating neuronal circuitry, i.e., SPSN, SAG, pC1, vpoDN or OviDN neurons. A combination of available split-Gal4 lines should be sufficient to prove this.

      We agree that this information is important to distinguish neurons which are direct SP targets from neurons which are involved in directing reproductive behaviors. We have now tested drivers for these neurons and added these data in Fig 3 (SAG neurons) and in Suppl Fig S4 (SPSN and genital tract neuron drivers SPR3 and SPR21), Suppl Fig S6 (overlap in single cell expression atlas), Suppl Fig S7 (overlap of SPSN split drivers with SPR8, fru11/12 and dsx split drivers in the brain inducing PMRs) and Suppl Fig S9 (pC1, OviDNs, OviENs, OviINs and vpoDN).

      The newly added data are in full support of our conclusion that SP targets central nervous system neurons, which we termed SP Response Inducing Neurons (SPRINz). In particular, we find lines that express in genital tract neurons, but do not induce an SP response (Supp Figs S4, S7 and S10) or do not express in genital tract neurons and induce an SP response (Fig 2 and Supp Fig S2).

      We have analysed the expression of SPSN in the brain and VNC and find expression in few neurons (Suppl Fig S4). This result is consistent with expression of the genes driving SPSN expression in the single cell expression atlas, which indicates overlap of expression in very few neurons (Suppl Fig S6). We have already shown that FD6 (VT003280), which is part of the SPSN splitGal4 driver, expresses in the brain and VNC and can induce PMRs from SP expression (Fig 4).

      We have taken this further to test another SPSN driver (VT058873) in combination with SPR8, fru11/12 and dsx and find PMRs induced by mSP expression (Suppl Fig S7). Moreover, if we restrict expression of mSP to the brain with otdflp, we can induce PMRs from mSP expression and obtain the same response by activating these brain neurons (Suppl Fig S7). We note that the VT058873 ∩ fru11/12 intersection in combination with otdflp stopmSP or stopTrpA1 in the head did not result in PMRs. Here, PMR inducing neurons likely reside in the VNC, but currently no tools are available to test this further.

      We further tested pC1, OviDNs, OviENs, OviINs and vpoDN for induction of PMRs from expression of mSP. We are pleased to see that OviEN-SS2s, OviIN-SS1 and vpoDN splitGal4 drivers can reduce receptivity, but not induce oviposition (Suppl Fig S8). We predicted such drivers based on previously published data (Haussmann et al. 2013), which we have now validated.

      (2) Authors must show how specific is their "head" (elav/otd-flp) and "trunk" (elav/tsh) expression of mSP by showing images of the same constructs driving GFP.

      The expression pattern for tshGAL, which expresses in the trunk, is already published (Soller et al., 2006). We have added images of “head” expression for tshGAL in Suppl Fig 1 and adjusted our statement to say that expression is predominantly in the VNC.

      (3) VT3280 is termed as a SAG driver. However, VT3280 is a SPSN specific driver (Feng et al., 2014; Jang et al., 2017; Scheunemann et al., 2019; Laturney et al., 2023). The authors should clarify this.

      Following the reviewer's suggestion, we have clarified the specificity of VT003280 and now say that this is FD6.

      (4) Intersectional approaches must rule out the influence of SP on sex-peptide sensing neurons (SPSN) in the ovary by combining their constructs with an SPSN-Gal80 construct. In line with this, most of their lines target the SAG circuit (4I, J and K). Again, here they need to rule out the involvement of SPSN in their receptivity/egg laying phenotypes. Especially because "In the female genital tract, these split-Gal4 combinations show expression in genital tract neurons with innervations running along oviduct and uterine walls (Figures S3A-S3E)".

      We agree with this reviewer that we need a higher resolution of expression to only one cell type. However, this is a major task that we will continue in follow up studies.

      In principle, use of GAL80 is a valid approach to restrict expression, if levels of GAL80 are higher than those of GAL4, because GAL80 binds GAL4 to inhibit its activity. Hence, if levels of GAL80 are lower, results could be difficult to interpret.

      (5) The authors separate head (brain) from trunk (VNC) responses, but they don't narrow down the neural circuits involved in each response. A detailed characterization of the involved circuits, especially in the case of the VNC, is needed to show (a) that the intersectional approach is indeed labelling distinct subtypes and (b) how these distinct neurons influence oviposition.

      Again, we agree with this reviewer that we need a higher resolution of expression to only one cell type. However, this is a major task that we will continue in follow up studies.

      Reviewer #2 (Public Review):

      Strength:

      The intersectional approach is appropriate and state-of-the-art. The analysis is a very comprehensive tour-de-force and experiments are carefully performed to a high standard. The authors also produced a useful new transgenic line (UAS-FRTstopFRT mSP). The finding that neurons in the brain (head) mediate the SP effect on receptivity, while neurons in the abdomen and thorax (ventral nerve cord or peripheral neurons) mediate the SP effect on oviposition, is a significant step forward in the endeavour to identify the underlying neuronal networks and hence a mechanistic understanding of SP action. Though this result is not entirely unexpected, it is novel as it was not shown before.

      We thank reviewer 2 for recognizing the advance of our work.

      Weakness:

      Though the analysis identifies a small set of neurons underlying SP responses, it does not go the last step to individually identify at least a few of them. The last paragraph in the discussion rightfully speculates about the neurochemical identity of some of the intersection neurons (e.g. dopaminergic P1 neurons, NPF neurons). At least these suggested identities could have been confirmed by straightforward immunostainings against NPF or TH, for which antisera are available. Moreover, specific GAL4 lines for NPF or P1 or at least TH neurons are available which could be used to express mSP to test whether SP activation of those neurons is sufficient to trigger the SP effect.

      We appreciate this reviewer's recognition of our previous work showing that receptivity and oviposition are separable. As pointed out, we have now gone one step further and identified, in a tour de force approach, subsets of neurons in the brain and VNC.

      We agree with this reviewer that we need a higher resolution of expression to only one cell type. As pointed out by this reviewer, determining the neurochemical identity is an excellent suggestion and will help to further restrict expression to just one type of neuron. However, this is a major task that we will continue in follow up studies.

      Reviewer #3 (Public Review):

      Strengths:

      Besides the main results described in the summary above, the authors discovered the following:

      (1) Reduction of receptivity and induction of egg-laying are separable by restricting the expression of membrane-tethered SP (mSP): head-specific expression of mSP induces reduction of receptivity only, whereas trunk-specific expression of mSP induces oviposition only. Also, they identified a GAL4 line (SPR12) that induced egg laying but did not reduce receptivity.

      (2) Expression of mSP in the genital tract sensory neurons does not induce PMR. The authors identified three GAL4 drivers (SPR3, SPR 21, and fru9), which robustly expressed mSP in genital tract sensory neurons but did not induce PMRs. Also, SPR12 does not express in genital tract neurons but induces egg laying by expressing mSP.

      We thank reviewer 3 for recognizing these two important points regarding the SP response that point to a revised model for how the underlying circuitry induces the post-mating response. To further substantiate these findings, we have now added a splitGal4 nSyb ∩ ppk, which expresses in genital tract neurons but does not induce PMRs from mSP expression.

      Weaknesses:

      (1) Intersectional expression involving ppk-GAL4-DBD was negative in all GAL4AD lines (Supp. Fig.S5). As the authors mentioned, ppk neurons may not intersect with SPR, fru, dsx, and FD6 neurons in inducing PMRs by mSP. However, since there was no PMR induction and no GAL4 expression at all in any combination with GAL4-AD lines used in this study, I would like to have a positive control, where intersectional expression of mSP in ppk-GAL4-DBD and other GAL4-AD lines (e.g., ppk-GAL4-AD) would induce PMR.

      We have added a positive control for ppk expression in Supp Fig S8 by combining the ppk-DBD line with an nSyb-AD line, which expresses in all neurons. This experiment confirms our previous observations that ppk splitGal4 in combination with other drivers does not induce an SP response despite driving expression in genital tract neurons. We have expanded the discussion section to point out that we have identified additional cells in the brain expressing ppkGAL4, but expression of split-GAL4 ppk is absent in these cells. Part of this work has previously been published (Nallasivan et al. 2021). Accordingly, we amended the text to say when expression was achieved with ppkGAL or ppk splitGAL4.

      (2) The results of SPR RNAi knock-down experiments are inconclusive (Figure 5). SPR RNAi cancelled the PMR in dsx ∩ fru11/12 and partially in SPR8 ∩ fru 11/12 neurons. SPR RNAi in dsx ∩ SPR8 neurons turned virgin females unreceptive; it is unclear whether SPR mediates the phenotype in SPR8 ∩ fru 11/12 and dsx ∩ SPR8 neurons.

      We agree with this reviewer that the interpretation of the SPR RNAi results is complicated by the fact that SP has additional receptors (Haussmann et al 2013). The results are conclusive for all three intersections when expressing UAS mSP in SPR RNAi with respect to oviposition, i.e., egg laying is not induced in the absence of SPR. For receptivity, the results are conclusive for dsx ∩ fru11/12 and partially for SPR8 ∩ fru 11/12.

      Potentially, SPR RNAi knock-down does not sufficiently reduce SPR levels to completely reduce receptivity in some intersection patterns, likely also because splitGal4 expression is less efficient.

      Why SPR RNAi in dsx ∩ SPR8 neurons turned virgin females unreceptive is unclear, but we anticipate that we need a higher resolution of expression to only one cell type to resolve this unexpected result. However, this is a major task that we will continue in follow up studies.

      SPR RNAi knock-down experiments may also help clarify whether mSP worked in an autocrine or juxtacrine manner to induce PMR. mSP may produce juxtacrine signaling, which is cell non-autonomous.

      Whether membrane-tethered SP induces the response in an autocrine manner is an important aspect in the interpretation of the results from mSP expression.

      Removing SPR by SPR RNAi and expression of mSP in the same neurons did not induce egg laying for all three intersections and did not reduce receptivity for dsx ∩ fru11/12 and for SPR8 ∩ fru 11/12. Accordingly, we can conclude that for these neurons the response is induced in an autocrine manner.

      We have added this aspect to the discussion section.

  2. Dec 2025
    1. eLife Assessment

      This study investigates the function of Chi3l1 in hepatic macrophages in the context of MASLD, providing useful insights at a time when the distinct roles of Kupffer cells or monocyte-derived macrophages in this disease remain incompletely defined. The data suggest that CHI3L1 in Kupffer cells modulates glucose handling in obesity and mitigates systemic metabolic dysfunction and hepatic steatosis during high-fat, high-fructose feeding. However, the loss-of-function studies employing Kupffer cell-restricted versus pan-myeloid Cre lines are not sufficient to support the assertion that CHI3L1 activity is confined to resident Kupffer cells. Additionally, the flow-cytometric analyses reveal a modest depletion of Kupffer cells and no recruitment of TIM4low monocyte-derived macrophages, indicating that the system reflects simple steatosis rather than substantial macrophage turnover or niche remodelling. While the findings are intriguing, further experimentation is required to clarify the cellular specificity and mechanistic basis of the phenotypes observed.

    2. Reviewer #1 (Public review):

      The manuscript by Shan et al seeks to define the role of the CHI3L1 protein in macrophages during the progression of MASH. The authors argue that the Chil1 gene is expressed highly in hepatic macrophages. Subsequently, they use Chil1 flx mice crossed to Clec4F-Cre or LysM-Cre to assess the role of this factor in the progression of MASH using a high-fat, high-fructose diet (HFFC). They found that loss of Chil1 in KCs (Clec4F Cre) leads to enhanced KC death and worsened hepatic steatosis. Using scRNA seq they also provide evidence that loss of this factor promotes gene programs related to cell death. From a mechanistic perspective they provide evidence that CHI3L serves as a glucose sink and thus loss of this molecule enhances macrophage glucose uptake and susceptibility to cell death. Using a bone marrow macrophage system and KCs they demonstrate that cell death induced by palmitic acid is attenuated by the addition of rCHI3L1. The article is well written and potentially highlights a new mechanism of macrophage dysfunction in MASH, and the authors have addressed some of my concerns. However, there are some concerns about the current data that continue to limit my enthusiasm for the study. Please see my specific comments below.

      Major:

      (1) The authors' interpretation of the results from the KC (Clec4F) and MdM KO (LysM-Cre) experiments is flawed. The authors have added new data that suggest LysM-Cre only leads to a 40% reduction of Chil1 in KCs and that this explains the difference in the phenotype compared to the Clec4F-Cre. However, this claim would be made stronger using flow-sorted TIM4hi KCs, as the plating method can lead to heterogeneous populations and thus an underestimation of knockdown by qPCR. Moreover, in the supplemental data the authors show that Clec4f-Cre x Chil1flx leads to a significant knockdown of this gene in BMDMs. As BMDMs do not express Clec4f, this calls into question the rigor of the data. I am still concerned that the phenotype differences between Clec4f-Cre and LysM-Cre are not related to the degree of knockdown in KCs but rather to some other aspect of the model (microbiota etc). It would be more convincing if the authors could show the CHI3L reduction via IF in the tissue of these mice.

      (2) Figure 4 suggests that KC death is increased with KO of Chil1. The authors have added new data with TIM4 that better characterize this phenotype. The lack of TIM4 low, F4/80 hi cells further supports that their diet model is not producing any signs of the inflammatory changes that occur with MASLD and MASH. This is also supported by no meaningful changes in the CD11b hi, F4/80 int cells (that are predominantly monocytes and early MdMs). It is also concerning that loss of KCs does not lead to an increase in Mo-KCs as has been demonstrated in several studies (PMID: 37639126, PMID: 33997821). This would suggest that the degree of resident KC loss is trivial.

      (3) The authors demonstrated that Clec4f-Cre itself was not responsible for the observed phenotype, which mitigates my concerns about this influencing their model.

      (4) I remain somewhat concerned about the conclusion that Chil1 is highly expressed in liver macrophages. The authors agree that mRNA levels of this gene are hard to see in the datasets; however, they argue that IF demonstrates clear evidence of the protein, CHI3L. The IF in the paper only shows a high power view of one KC. I would like to see what percentage of KCs express CHI3L and how this changes with HFHC diet. In addition, showing the knockout IF would further validate the IF staining patterns.

      Minor:

      (1) The authors have answered my question about liver fibrosis. In line with their macrophage data, their diet model does not appear to induce even mild MASH.

    3. Reviewer #2 (Public review):

      In the revised version of the manuscript, the authors have attempted to address my questions; however, a number of my original concerns still remain.

      Firstly, I had asked for a validation of the different Cre lines used - LysM and Clec4f. The authors have now looked at BMDMs and KCs (steady state) from these animals. They conclude LysM only targets BMDMs not KCs, while CLEC4F targets both KCs and BMDMs. This I do not understand: BMDMs do not express CLEC4F, so why are they targeted with this Cre? Additionally, BMDMs are not the correct control here; rather, the authors should look at the incoming moMFs in the livers of these mice in the MASLD setting. Similarly, the KO in the MASLD KCs should be verified.

      Then I had asked for validation of macrophage expression of Chil1 in other MASLD human and mouse datasets. The authors have looked into this, but the data provided do not suggest it is highly expressed by these cells either in the other mouse models or in the human. Nevertheless, they include a statement suggesting a similar expression pattern (although also being expressed by other cells). This is not an accurate discussion of the data and hence must be revised. This also prompted me to take another look at their data and this has left me querying the data in Figure 1D. Is the percent expressed 1%? In Figure 1C the scale goes from 0-100 but here 0-1. If we are talking about expression in 1% of cells which would fit with the additional public mouse data now analysed then how relevant are any of these claims? How sure are the authors that the effects seen are through KCs/moMFs? In figure 1D all cells profiled by scRNA-seq should be shown not just MFs to get a better sense of this data. What is macrophage expression of Chil1 compared with all other liver cells?

      The cell death had also previously concerned me, in that 40-60% of KCs were TUNEL +ve. I do not understand how 60% are +ve at 8 weeks but then they have more or less the same number of TIM4+ cells at 16 weeks. How can this be? Why do the TUNEL +ve cells not die? This concern remains as I don't understand how they reached these numbers given the images. Additionally, larger images were not provided to be sure that the images in the figure are representative. Now in the images provided, there are clearly cells which are TIM4+ where the TUNEL does not overlap; likely it is in an LSEC or other neighbouring cell. Indeed, also taking Fig S11b as an example, there are ~7 KCs and at best 1 expresses TUNEL, so how do they get to 60%?

    4. Reviewer #3 (Public review):

      This paper investigates the role of Chi3l1 in regulating the fate of liver macrophages in the context of metabolic dysfunction leading to the development of MASLD. I do see value in this work, but some issues exist that should be addressed as well as possible.

      Here are my comments:

      (1) Chi3l1 has been linked to macrophage functions in MASLD/MASH, acute liver injury, and fibrosis models before (e.g., PMID: 37166517), which limits the novelty of the current work. It has even been linked to macrophage cell death/survival (PMID: 31250532) in the context of fibrosis, which is a main observation from the current study.

      (2) The LysCre-experiments differ from experiments conducted by Ariel Feldstein's team (PMID: 37166517). What is the explanation for this difference? - The LysCre system is neither specific to macrophages (it also depletes in neutrophils, etc), nor is this system necessarily efficient in all myeloid cells (e.g., Kupffer cells vs other macrophages). The authors need to show the efficacy and specificity of the conditional KO regarding Chi3l1 in the different myeloid populations in the liver and the circulation.

      (3) The conclusions are exclusively based on one MASLD model. I recommend confirming the key findings in a second, ideally a more fibrotic, MASH model.

      (4) Very few human data are being provided (e.g., no work with own human liver samples, work with primary human cells). Thus, the translational relevance of the observations remains unclear.

      Comments on revisions:

      The authors have done a thorough job addressing my comments. However, I am not convinced about the MCD diet model, which is somewhat hidden in the Supplementary Files. MASH does not seem different, nor are any fibrosis data shown to support the conclusions. I am not satisfied with this part of the revised manuscript, and I do not agree that the second MASH model would support the conclusions.

    5. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      The manuscript by Shan et al seeks to define the role of the CHI3L1 protein in macrophages during the progression of MASH. The authors argue that the Chil1 gene is expressed highly in hepatic macrophages. Subsequently, they use Chil1 flx mice crossed to Clec4F-Cre or LysM-Cre to assess the role of this factor in the progression of MASH using a high-fat, high-cholesterol diet (HFHC). They found that loss of Chil1 in KCs (Clec4F Cre) leads to enhanced KC death and worsened hepatic steatosis. Using scRNA seq, they also provide evidence that loss of this factor promotes gene programs related to cell death. From a mechanistic perspective, they provide evidence that CHI3L serves as a glucose sink and thus loss of this molecule enhances macrophage glucose uptake and susceptibility to cell death. Using a bone marrow macrophage system and KCs they demonstrate that cell death induced by palmitic acid is attenuated by the addition of rCHI3L1. While the article is well written and potentially highlights a new mechanism of macrophage dysfunction in MASH, there are some concerns about the current data that limit my enthusiasm for the study in its current form. Please see my specific comments below.

      (1) The authors' interpretation of the results from the KC (Clec4F) and MdM KO (LysM-Cre) experiments is flawed. For example, in Figure 2 the authors present data that knockout of Chil1 in KCs using Clec4f Cre produces worse liver steatosis and insulin resistance. However, in supplemental Figure 4, they perform the same experiment in LysM-Cre mice and find a somewhat different phenotype. The authors appear to be under the impression that LysM-Cre does not cause recombination in KCs and therefore interpret this data to mean that Chil1 is relevant in KCs and not MdMs. However, LysM-Cre DOES lead to efficient recombination in KCs and therefore Chil1 expression will be decreased in both KCs and MdM (along with PMNs) in this line.

      Therefore, a phenotype observed with KC-KO should also be present in this model unless the authors argue that loss of Chil1 from the MdMs has the opposite phenotype of KCs and therefore attenuates the phenotype. The Cx3Cr1 CreER tamoxifen inducible system is currently the only macrophage Cre strategy that will avoid KC recombination. The authors need to rethink their results with the understanding that Chil1 is deleted from KCs in the LysM-Cre experiment. In addition, it appears that only one experiment was performed, with only 5 mice in each group for both the Clec4f and LysM-Cre data. This is generally not enough to make a firm conclusion for MASH diet experiments.

      We thank the reviewer for raising this important point regarding our data interpretation. We have carefully examined the deletion efficiency of Chi3l1 in primary Kupffer cells (KCs) from Lyz2<sup>∆Chil1</sup> (LysM-Cre) mice. Our results show roughly a 40% reduction in Chi3l1 expression at both the mRNA and protein levels (Revised Manuscript, Figure S7B and C). Given this modest decrease, Chi3l1 deletion in KCs of Lyz2<sup>∆Chil1</sup> mice was incomplete, which likely accounts for the phenotypic differences observed between Clec4f<sup>∆Chil1</sup> and Lyz2<sup>∆Chil1</sup> mice in the MASLD model.

      Furthermore, we have increased the sample size in both the Clec4f- and LysM-Cre experiments to 9–12 mice per group following the HFHC diet, thereby strengthening the statistical power and reliability of our findings (Revised Figures 2 and S8).

      (2) The mouse weight gain is missing from Figure 2 and Supplementary Figure 4. This data is critical to interpret the changes in liver pathology, especially since they have worse insulin resistance.

      We thank the reviewer for this valuable comment. We have now included the mouse body weight data in the revised manuscript (Figure 2A, B and Figures S8A, B). Compared with mice on a normal chow diet (NCD), all groups exhibited progressive weight gain during HFHC diet feeding. Notably, Clec4f<sup>∆Chil1</sup> mice gained significantly more body weight than Chil1<sup>fl/fl</sup> controls, whereas Lyz2<sup>∆Chil1</sup> mice showed a similar weight gain trajectory to Chil1<sup>fl/fl</sup> mice under the same conditions.

      (3) Figure 4 suggests that KC death is increased with KO of Chil1. However, this data cannot be concluded from the plots shown. In Supplementary Figure 6 the authors provide a more appropriate gating scheme to quantify resident KCs that includes TIM4. The TIM4 data needs to be shown and quantified in Figure 4. As shown in Supplementary Figure 6, the F4/80 hi population is predominantly KCs at baseline; however, this is not true with MASH diets. Most of the recruited MoMFs also reside in the F4/80 hi gate where they can be identified by their lower expression of TIM4. The MoMF gate shown in this figure is incorrect. The CD11b hi population is predominantly PMNs, monocytes, and cDC2, not MoMFs (PMID: 33997821). In addition, the authors should stain the tissue for TIM4, which would also be expected to reveal a decrease in the number of resident KCs.

      We thank the reviewer for raising this critical point regarding the gating strategy and interpretation of KC death. We have now refined our flow cytometry gating based on the reviewer’s suggestion. Specifically, we analyzed TIM4 expression and attempted to identify TIM4<sup>low</sup> MoMF populations in our model. However, we did not detect a distinct TIM4<sup>low</sup> population, likely because our mice were fed the HFHC diet for only 16 weeks and had not yet developed liver fibrosis. We therefore reason that MoMFs have not fully acquired TIM4 expression at this stage.

      To improve our analysis, we referred to published strategies (PMID: 41131393; PMID: 32562600) and gated KCs as CD45<sup>+</sup>CD11b<sup>+</sup>F4/80<sup>hi</sup> TIM4<sup>hi</sup> and MoMFs as CD45<sup>+</sup>Ly6G<sup>-</sup>CD11b<sup>+</sup>F4/80<sup>low</sup> TIM4<sup>low/-</sup>. Using this approach, we observed a gradual reduction of KCs and a corresponding increase in MoMFs in WT mice, with a significantly faster loss of KCs in Chil1<sup>-/-</sup> mice (Revised Figure 4C, D; Figure S10A).

      Furthermore, immunofluorescence staining for TIM4 combined with TUNEL or cleaved caspase-3 confirmed an increased number of dying KCs in Chil1<sup>-/-</sup> mice compared to WT following HFHC diet feeding (Revised Figure 4E; Figure S10B).

      (4) While the Clec4F Cre is specific to KCs, there is also less data about the impact of the Cre system on KC biology. Therefore, when looking at cell death, the authors need to include some mice that express Clec4F cre without the floxed allele to rule out any effects of the Cre itself. In addition, if the cell death phenotype is real, it should also be present in LysM Cre system for the reasons described above. Therefore, the authors should quantify the KC number and dying KCs in this mouse line as well.

      We thank the reviewer for raising this important point. During our study, we indeed observed an increased number of KCs in Clec4f-Cre mice compared to WT controls, suggesting that the Clec4f-Cre system itself may modestly affect KC homeostasis. To address this, we compared KC numbers between Clec4f<sup>∆Chil1</sup> and Clec4f-Cre mice and found that Clec4f<sup>∆Chil1</sup> mice displayed a significant reduction in KC numbers following HFHC diet feeding. Moreover, co-staining for TIM4 and TUNEL revealed a marked increase in KC death in Clec4f<sup>∆Chil1</sup> mice relative to Clec4f-Cre mice, indicating that the observed phenotype is attributable to Chil1 deletion rather than Cre expression alone. These data have been reported in our related manuscript (He et al., bioRxiv, 2025.09.26.678483; doi: 10.1101/2025.09.26.678483).

      In addition, we quantified KC numbers and KC death in the Lyz2-Cre line. TIM4/TUNEL co-staining showed comparable levels of KC death between Chil1<sup>fl/fl</sup> and Lyz2<sup>∆Chil1</sup> mice (Revised Figure S11B). Consistently, flow cytometry analyses revealed no significant differences in KC numbers between these two groups before (0 weeks) or after (20 weeks) HFHC diet feeding (Revised Figures S11C, D). As discussed in our response to Comment 1, this may be due to the incomplete deletion of Chi3l1 in KCs (<50%) in the Lyz2-Cre line, which likely attenuates the phenotype.

      (5) I am somewhat concerned about the conclusion that Chil1 is highly expressed in liver macrophages. Looking at our own data and those from the Liver Atlas it appears that this gene is primarily expressed in neutrophils. At a minimum, the authors should address the expression of Chil1 in macrophage populations from other publicly available datasets in mouse MASH to validate their findings (several options include - PMID: 33440159, 32888418, 32362324). If expression of Chil1 is not present in these other data sets, perhaps an environmental/microbiome difference may account for the distinct expression pattern observed. Either way, it is important to address this issue.

      We thank the reviewer for this insightful comment and agree that analysis of scRNA-seq data, including our own and those reported in the Liver Atlas as well as in the referenced studies (PMID: 33440159, 32888418, 32362324), indicates that Chil1 is predominantly expressed in neutrophils.

      However, our immunofluorescence staining under normal physiological conditions revealed that Chi3l1 protein is primarily localized in Kupffer cells (KCs), as demonstrated by strong co-staining with TIM4 (Revised Figure 1E). In MASLD mouse models induced by HFHC or MCD diets, we observed that both KCs and monocyte-derived macrophages (MoMFs) express Chi3l1, with particularly high levels in MoMFs.

      We speculate that the apparent discrepancy between scRNA-seq datasets and our in situ findings may reflect differences in cellular proportions and detection sensitivity. Since hepatic macrophages (particularly KCs and MoMFs) constitute a larger proportion of total liver immune cells compared with neutrophils, their contribution to total Chi3l1 protein levels in tissue staining may appear dominant, despite lower transcript abundance per cell in sequencing datasets. We have included a discussion of this point in the revised manuscript to clarify this distinction (Revised manuscript, page 8, lines 341-350).

      Minor points:

      (1) Were there any changes in liver fibrosis or liver fibrosis markers present in these experiments?

      We assessed liver fibrosis using Sirius Red staining and α-SMA Western blot analysis.

      We found no induction of liver fibrosis in our HFHC-induced MASLD model (Revised Figure S1A, B), but a clear elevation of fibrosis markers in the MCD-induced MASH model (Revised Figure S6A, B).

      (2) In Supplementary Figure 3, the authors do a western blot for CHI3L1 in BMDMs. This should also be done for KCs isolated from these mice. Does this antibody work for immunofluorescence? Staining liver tissue would provide valuable information on the expression patterns.

      We have included qPCR and western blot analyses of Chi3l1 in primary KCs isolated from Lyz2<sup>∆Chil1</sup> mice. The data show a slight, non-significant reduction in both mRNA and protein levels in KCs (Revised Figure S7B, C). Immunofluorescence staining of liver tissue showed that Chi3l1 appears to localize to the plasma membrane of TIM4<sup>+</sup> F4/80<sup>+</sup> KCs under both NCD and HFHC diets (Revised Figure 1E).

      (3) What is the impact of MASH diet feeding on Chil1 expression in KCs or in the liver in general?

      In both our MASLD and MASH models, diet feeding consistently upregulates Chi3l1 in KCs and in the liver as a whole (Revised Figure 1F, G; S6C, D).

      (4) In Figure S1 the authors show tSNE plots of various monocyte and macrophage genes in the liver. Are these plots both diets together? How do things look when comparing these markers between the STD and HFHC diet? The population of recruited LAMs seems very small for 16 weeks of diet. Moreover, Chil1 should also be shown on these tSNE plots as well.

      Yes, these plots combine both diets. When the diets are compared separately, core marker expression is consistent between NCD and HFHC. However, the HFHC diet induces a relative increase in KC marker expression within the MoMF cluster, suggesting phenotypic adaptation (Author response image 1A, below). Moreover, Chil1 expression is now shown on the t-SNE plot (Author response image 1B, below). Compared to lineage-specific marker genes, however, Chi3l1 expression is rather low.

      Author response image 1.

      Gene expression levels of lineage-specific marker genes in monocytes/macrophages clusters between NCD and HFHC diets. (A) UMAP plots show the scaled expression changes of lineage-specific markers in KCs/monocyte/macrophage clusters from mice under NCD and HFHC diets. Color represents the level of gene expression. (B) UMAP plots show the scaled expression changes of Chil1 in KCs/monocyte/macrophage clusters from mice under NCD and HFHC diets. Color represents the level of gene expression.

      (5) In Figure 5, the authors demonstrate that CHI3L1 binds to glucose. However, given that all chitin molecules bind to carbohydrates, is this a new finding? The data showing that CHI3L1 is elevated in the serum after diet is interesting. What happens to serum levels of this molecule in KC KO or total macrophage KO mice? Do the authors think it primarily acts as a secreted molecule or in a cell-intrinsic manner?

      We thank the reviewer for these insightful comments, which helped us clarify the novelty of our findings.

      (1) Novelty of CHI3L1-Glucose Binding:

      While chitin-binding domains are known to interact with carbohydrate polymers, our key discovery is that CHI3L1 (YKL-40), a mammalian chitinase-like protein lacking enzymatic activity, specifically binds to glucose, a simple monosaccharide. This differs fundamentally from canonical binding to insoluble polysaccharides such as chitin and reveals a potential role for CHI3L1 in monosaccharide recognition, linking it to glucose metabolism and energy sensing. We clarified this point in the revised manuscript (page 9, lines 374-379).

      (2) Serum CHI3L1 in Knockout Models:

      Consistent with the reviewer’s suggestion, serum Chi3l1 levels are altered in our knockout models:

      KC-specific KO (Clec4f<sup>ΔChil1</sup>): Under normal chow, serum CHI3L1 is markedly reduced compared to controls and remains lower following HFHC feeding (Author response image 2A, below), indicating that Kupffer cells are the main source of circulating CHI3L1 under basal and disease conditions.

      Macrophage KO (Lyz2<sup>ΔChil1</sup>): No significant changes were observed between Chil1<sup>fl/fl</sup> and Lyz2<sup>ΔChil1</sup> mice under either diet (Author response image 2B, below), likely due to minimal monocyte-derived macrophage recruitment in this HFHC model (see Revised Figure 4C,D).

      (3) Secreted vs. Cell-Intrinsic Role:

      CHI3L1 predominantly localizes to the KC plasma membrane, consistent with a secreted role, and its reduction in the serum of KC-specific knockouts supports the physiological relevance of this role. While cell-intrinsic effects have been reported elsewhere, our current data do not address them in KCs; this warrants future investigation.

      Author response image 2.

      Chi3l1 expression in serum before and after HFHC in CKO mice. (A) Western blot to detect Chi3l1 expression in serum of Chil1<sup>fl/fl</sup> and Clec4f<sup>ΔChil1</sup> mice before and after 16 weeks’ HFHC diet. n=3 mice/group. (B) Western blot to detect Chi3l1 expression in serum of Chil1<sup>fl/fl</sup> and Lyz2<sup>ΔChil1</sup> mice before and after 16 weeks’ HFHC diet. n=3 mice/group.

      Reviewer #2 (Public review):

      The manuscript from Shan et al., sets out to investigate the role of Chi3l1 in different hepatic macrophage subsets (KCs and moMFs) in MASLD following their identification that KCs highly express this gene. To this end, they utilise Chi3l1KO, Clec4f-CrexChi3l1fl, and Lyz2-CrexChi3l1fl mice and WT controls fed a HFHC for different periods of time.

      Major:

      Firstly, the authors perform scRNA-seq, which led to the identification of Chi3l1 (encoded by Chil1) in macrophages. However, this is on a limited number of cells (especially in the HFHC context), and hence it would also be important to validate this finding in other publicly available MASLD/Fibrosis scRNA-seq datasets. Similarly, it would be important to examine if cells other than monocytes/macrophages also express this gene, given the use of the full KO in the manuscript. Along these lines, utilisation of publicly available human MASLD scRNA-seq datasets would also be important to understand where the increased expression observed in patients comes from and the overall relevance of macrophages in this finding.

      We thank the reviewer for this valuable suggestion and acknowledge the limited number of cells analyzed under the HFHC condition in our original dataset. To strengthen our findings, we have now examined four additional publicly available scRNA-seq datasets: two from mouse models and two from human MASLD patients (Revised Figure S3, manuscript page 4, lines 164-172). Across these datasets, the specific cell type showing the highest Chil1 expression varied somewhat between studies, likely reflecting model differences and disease stages. Nevertheless, Chil1 expression was consistently enriched in hepatic macrophage populations, including both Kupffer cells and infiltrating macrophages, in mouse and human livers. Notably, Chil1 expression was higher in infiltrating macrophages compared to resident Kupffer cells, supporting its upregulation during MASLD progression. These additional analyses confirm the robustness and cross-species relevance of our finding that macrophages are the primary Chil1-expressing cell type in the liver.

      Next, the authors use two different Cre lines (Clec4f-Cre and Lyz2-Cre) to target KCs and moMFs respectively. However, no evidence is provided to demonstrate that Chil1 is only deleted from the respective cells in the two CRE lines. Thus, KCs and moMFs should be sorted from both lines, and a qPCR performed to check the deletion of Chil1. This is especially important for the Lyz2-Cre, which has been routinely used in the literature to target KCs (as well as moMFs) and has (at least partial) penetrance in KCs (depending on the gene to be floxed). Also, while the Clec4f-Cre mice show an exacerbated MASLD phenotype, there is currently no baseline phenotype of these animals (or the Lyz2Cre) in steady state in relation to the same readouts provided in MASLD and the macrophage compartment. This is critical to understand if the phenotype is MASLD-specific or if loss of Chi3l1 already affects the macrophages under homeostatic conditions.

      We thank the reviewer for raising this important point.

      (1) Chil1 deletion efficiency in Clec4f-Cre and Lyz2-Cre lines:

      We have assessed the efficiency of Chil1 deletion in both Lyz2<sup>∆Chil1</sup> and Clec4f<sup>∆Chil1</sup> mice by evaluating mRNA and protein levels of Chi3l1. For the Lyz2<sup>∆Chil1</sup> mice, we measured Chi3l1 expression in bone marrow-derived macrophages (BMDMs) and primary Kupffer cells (KCs). Both qPCR (for mRNA) and Western blotting (for protein) reveal that Chi3l1 is almost undetectable in BMDMs from Lyz2<sup>∆Chil1</sup> mice when compared to Chil1<sup>fl/fl</sup> controls. In contrast, we observe no significant reduction in Chi3l1 expression in KCs from these animals (Revised Figure S7B, C), suggesting that Chil1 is deleted in BMDMs but not in KCs in the Lyz2-Cre line.

      For the Clec4f<sup>∆Chil1</sup> mice, both mRNA and protein levels of Chi3l1 are barely detectable in BMDMs and primary KCs when compared to Chil1<sup>fl/fl</sup> controls (Revised Figure S4B, C). However, we did observe a faint Chi3l1 band in KCs of Clec4f<sup>∆Chil1</sup> mice, which we suspect is due to contamination from LSECs during the KC isolation process, given that the TIM4 staining for KCs was approximately 90%. Overall, Chil1 is deleted in both KCs and BMDMs in the Clec4f-Cre line.

      Notably, since we observed a pronounced MASLD phenotype in Clec4f-Cre mice but not in Lyz2-Cre mice, these findings further underscore the critical role of Kupffer cells in the progression of MASLD.

      (2) Whether the phenotype is MASLD-specific or whether loss of Chi3l1 already affects the macrophages under homeostatic conditions: We have now included phenotypic data from Clec4f<sup>ΔChil1</sup> mice (KC-specific KO) and Lyz2<sup>∆Chil1</sup> mice (MoMF-specific KO) fed NCD for 16 weeks (Revised Figure 2A-F, S8A-F). In brief, there is no baseline difference between Chil1<sup>fl/fl</sup> and Clec4f<sup>ΔChil1</sup> or Lyz2<sup>∆Chil1</sup> mice in steady state in relation to the same readouts provided in MASLD.

      Next, the authors suggest that loss of Chi3l1 promotes KC death. However, to examine this, they use Chi3l1 full KO mice instead of the Clec4f-Cre line. The reason for this is not clear, because in this regard, it is now not clear whether the effects are regulated by loss of Chi3l1 from KCs or from other hepatic cells (see point above). The authors mention that Chi3l1 is a secreted protein, so does this mean other cells are also secreting it, and are these needed for KC death? In that case, this would not explain the phenotype in the CLEC4F-Cre mice. Here, the authors do perform a basic immunophenotyping of the macrophage populations; however, the markers used are outdated, making it difficult to interpret the findings. Instead of F4/80 and CD11b, which do not allow a perfect discrimination of KCs and moMFs, especially in HFHC diet-fed mice, more robust and specific markers of KCs should be used, including CLEC4F, VSIG4, and TIM4.

      We thank the reviewer for raising this important point. We performed experiments in the Clec4f<sup>∆Chil1</sup> (KC-specific KO) model. The phenotype in these mice closely mirrors that of the full KO: following an HFHC diet, we observed a significant reduction in KC numbers and a concurrent increase in KC death in Clec4f<sup>∆Chil1</sup> mice compared to Clec4f-cre mice. We have reported these data in the following related manuscript (Figure 6D-G). This confirms that the loss of CHI3L1 specifically from KCs is sufficient to drive this effect.

      Hyperactivated Glycolysis Drives Spatially-Patterned Kupffer Cell Depletion in MASLD. Jia He, Ran Li, Cheng Xie, Xiane Zhu, Keqin Wang, Zhao Shan. bioRxiv 2025.09.26.678483; doi: https://doi.org/10.1101/2025.09.26.678483

      While other hepatic cells (e.g., neutrophils and liver sinusoidal endothelial cells) also express Chi3l1, our data indicate that KC-secreted Chi3l1 plays a dominant and cell-autonomous role in maintaining KC viability. The potential contribution of other cellular sources to this phenotype remains an interesting direction for future study.

      We apologize for the lack of clarity in our initial immunophenotyping. We have revised the flow cytometry data to clearly show that KCs are rigorously defined as TIM4+ cells (Revised Figure 4C, D).

      Additionally, while the authors report a reduction of KCs in terms of absolute numbers, there are no differences in proportions. Thus, coupled with a decrease also in moMF numbers at 16 weeks (when one would expect an increase if KCs are decreased, based on previous literature) suggests that the differences in KC numbers may be due to differences in total cell counts obtained from the obese livers compared with controls. To rule this out, total cell counts and total live CD45+ cell counts should be provided. Here, the authors also provide TUNEL staining in situ to demonstrate increased KC death, but as it is typically notoriously difficult to visualise dying KCs in MASLD models, here it would be important to provide more images. Similarly, there appear to be many more TUNEL+ cells in the KO that are not KCs; thus, it would be important to examine this in the CLEC4F-Cre line to ascertain direct versus indirect effects on cell survival.

      We thank the reviewer for raising this important point. We have now included the total cell counts and total live CD45<sup>+</sup> cell counts, which showed similar numbers between WT and Chil1<sup>-/-</sup> mice post HFHC diet (Author response image 3A, below).

      Moreover, we included cleaved caspase-3 and TIM4 co-staining in WT and Chil1<sup>-/-</sup> mice before and after HFHC diets, which confirmed increased KC death in Chil1<sup>-/-</sup> mice (Revised Figure S10B). We have compared KC numbers and KC death between Clec4f-cre and Clec4f<sup>∆Chil1</sup> mice under NCD and HFHC diets in the following manuscript (Figure 6D-G). The data showed similar KC numbers under NCD and reduced KC numbers under HFHC in Clec4f<sup>∆Chil1</sup> mice compared to Clec4f-cre mice, which indicates that the effect on cell survival is due to loss of Chi3l1 rather than to Cre insertion.

      Hyperactivated Glycolysis Drives Spatially-Patterned Kupffer Cell Depletion in MASLD. Jia He, Ran Li, Cheng Xie, Xiane Zhu, Keqin Wang, Zhao Shan. bioRxiv 2025.09.26.678483; doi: https://doi.org/10.1101/2025.09.26.678483

      Author response image 3.

      Number of total cells and total live CD45+ cells in liver of WT and Chil1<sup>-/-</sup> mice. (A) Number of total cells and total live CD45+ cells/liver were statistically analyzed. n= 3-4 mice per group.

      Finally, the authors suggest that Chi3l1 exerts its effects through binding glucose and preventing its uptake. They use ex vivo/in vitro models to assess this with rChi3l1; however, here I miss the key in vivo experiment using the CLEC4F-Cre mice to prove that this in KCs is sufficient for the phenotype. This is critical to confirm the take-home message of the manuscript.

      We agree that it is essential to confirm the in vivo relevance of Chi3l1-mediated glucose regulation in Kupffer cells (KCs). Our data suggest that KCs undergo cell death not because they express Chi3l1 per se, but because they exhibit a glucose-hungry metabolic phenotype that makes them uniquely dependent on Chi3l1-mediated regulation of glucose uptake. To directly assess this mechanism in vivo, we injected 2-NBDG, a fluorescent glucose analog, into overnight-fasted and refed mice and quantified its uptake in hepatic KCs. Notably, Chi3l1-deficient KCs exhibited significantly increased 2-NBDG uptake compared with controls, and this effect was markedly suppressed by co-treatment with recombinant Chi3l1 (rChi3l1) (Revised Figure 6G, H). These findings demonstrate that Chi3l1 regulates glucose uptake by KCs in vivo, supporting our proposed mechanism that Chi3l1 controls KC metabolic homeostasis through modulation of glucose availability.

      Minor points:

      (1) Some key references of macrophage heterogeneity in MASLD are not cited: PMID: 32362324 and PMID: 32888418.

      We thank the reviewer for highlighting these critical references and have included them in the introduction (Revised manuscript, page 2, lines 64-73).

      (2) In the discussion, Figure 3H is referenced (Serum data), but there is no Figure 3H. If the authors have this data (increased Chi3l1 in serum of mice fed HFHC diet), what happens in CLEC4F-Cre mice fed the diet? Is this lost completely? This comes back to the point regarding the specificity of expression.

      We apologize for the mistake. The reference should be to Figure 5F in the revised version, which shows that serum Chi3l1 was significantly upregulated after HFHC diet. Moreover, under a normal chow diet (NCD), serum CHI3L1 is significantly lower in Clec4f<sup>ΔChil1</sup> mice compared to controls (Chil1<sup>fl/fl</sup>). Following an HFHC diet, levels increase in both genotypes but remain relatively lower in the KC-KO mice (please see Author response image 2A above). These data strongly suggest that Kupffer cells (KCs) are the primary source of serum CHI3L1 under basal conditions and a major contributor during MASLD progression.

      Reviewer #3 (Public review):

      This paper investigates the role of Chi3l1 in regulating the fate of liver macrophages in the context of metabolic dysfunction leading to the development of MASLD. I do see value in this work, but some issues exist that should be addressed as well as possible.

      (1) Chi3l1 has been linked to macrophage functions in MASLD/MASH, acute liver injury, and fibrosis models before (e.g., PMID: 37166517), which limits the novelty of the current work. It has even been linked to macrophage cell death/survival (PMID: 31250532) in the context of fibrosis, which is a main observation from the current study.

      We thank the reviewer for this insightful comment regarding the novelty of our findings. We agree that Chi3l1 has previously been linked to macrophage survival and function in models of liver injury and fibrosis (e.g., PMID: 37166517, 31250532). However, our study focuses specifically on the early stage of MASLD, prior to the onset of fibrosis, revealing a distinct mechanistic role for CHI3L1 in this context.

      We demonstrate that CHI3L1 directly interacts with extracellular glucose to regulate its cellular uptake—a previously unrecognized biochemical function. Furthermore, we show that CHI3L1’s protective role is metabolically dependent, safeguarding glucose-dependent Kupffer cells (KCs) but not monocyte-derived macrophages (MoMFs). This metabolic dichotomy and the direct link between CHI3L1 and glucose sensing represent conceptual advances beyond previous studies of CHI3L1 in fibrotic or injury models.

      (2) The LysCre-experiments differ from experiments conducted by Ariel Feldstein's team (PMID: 37166517). What is the explanation for this difference? - The LysCre system is neither specific to macrophages (it also depletes in neutrophils, etc), nor is this system necessarily efficient in all myeloid cells (e.g., Kupffer cells vs other macrophages). The authors need to show the efficacy and specificity of the conditional KO regarding Chi3l1 in the different myeloid populations in the liver and the circulation.

      We thank the reviewer for this important comment and the opportunity to clarify both the efficiency and specificity of our conditional knockouts, as well as the differences from the study by Feldstein’s group (PMID: 37166517).

      (1) Chil1 deletion efficiency in Clec4f-Cre and Lyz2-Cre lines:

      We have assessed the efficiency of Chil1 deletion in both Lyz2<sup>∆Chil1</sup> and Clec4f<sup>∆Chil1</sup> mice by evaluating mRNA and protein levels of Chi3l1. For the Lyz2<sup>∆Chil1</sup> mice, we measured Chi3l1 expression in bone marrow-derived macrophages (BMDMs) and primary Kupffer cells (KCs). Both qPCR (for mRNA) and Western blotting (for protein) reveal that Chi3l1 is almost undetectable in BMDMs from Lyz2<sup>∆Chil1</sup> mice when compared to Chil1<sup>fl/fl</sup> controls. In contrast, we observe no significant reduction in Chi3l1 expression in KCs from these animals (Revised Figure S7B, C), suggesting that Chil1 is deleted in BMDMs but not in KCs in the Lyz2-Cre line.

      For the Clec4f<sup>∆Chil1</sup> mice, both mRNA and protein levels of Chi3l1 are barely detectable in BMDMs and primary KCs when compared to Chil1<sup>fl/fl</sup> controls (Revised Figure S4B, C). However, we did observe a faint Chi3l1 band in KCs of Clec4f<sup>∆Chil1</sup> mice, which we suspect is due to contamination from LSECs during the KC isolation process, given that the TIM4 staining for KCs was approximately 90%. Overall, Chil1 is deleted in both KCs and BMDMs in the Clec4f-Cre line.

      Notably, since we observed a pronounced MASLD phenotype in Clec4f-Cre mice but not in Lyz2-Cre mice, these findings further underscore the critical role of Kupffer cells in the progression of MASLD.

      (2) Explanation for Differences from Feldstein et al. (PMID: 37166517):

      Our findings differ from those reported by Feldstein’s group primarily due to differences in disease stage and model. We used a high-fat, high-cholesterol (HFHC) diet to model early-stage MASLD characterized by steatosis and inflammation without fibrosis (Revised Figure S1A, B). In this context, we observed KC death but minimal MoMF infiltration (Revised Figure 4D). Accordingly, deletion of Chi3l1 in MoMFs (Lyz2<sup>∆Chil1</sup>) had no measurable effect on insulin resistance or steatosis, consistent with limited MoMF involvement at this stage. In contrast, the Feldstein study employed a CDAA-HFAT diet that models later-stage MASH with fibrosis. In that setting, Lyz2<sup>∆Chil1</sup> mice showed reduced recruitment of neutrophils and MoMFs, which likely underlies the attenuation of fibrosis and disease severity reported. Together, these data support a model in which KCs and MoMFs play temporally distinct roles during MASLD progression: KCs primarily drive early lipid accumulation and metabolic dysfunction, whereas MoMFs contribute more substantially to inflammation and fibrosis at later stages.

      (3) The conclusions are exclusively based on one MASLD model. I recommend confirming the key findings in a second, ideally a more fibrotic, MASH model.

      We thank the reviewer for this valuable suggestion to validate our findings in an additional MASH model. We have now included data from a methionine- and choline-deficient (MCD) diet–induced MASH model, which exhibits pronounced hepatic lipid accumulation and fibrosis (Revised Figure S6A, B). Consistent with our HFHC results, Clec4f<sup>∆Chil1</sup> mice displayed exacerbated MASH progression in this model, including increased lipid deposition, inflammation, and fibrosis (Revised Figure S6E-G). These findings confirm that CHI3L1 deficiency in Kupffer cells promotes hepatic lipid accumulation and disease progression across distinct MASLD/MASH models.

      (4) Very few human data are being provided (e.g., no work with own human liver samples, work with primary human cells). Thus, the translational relevance of the observations remains unclear.

      We thank the reviewer for this important comment regarding translational relevance. We fully agree that validation in human liver samples would further strengthen our study. However, obtaining tissue from early-stage steatotic livers is challenging due to the asymptomatic nature of this disease stage. Nonetheless, multiple studies have consistently reported Chi3l1 upregulation in human fibrotic and steatotic liver disease (PMID: 31250532, 40352927, 35360517), supporting the clinical significance of our mechanistic findings. We have now expanded the Discussion to highlight these human data and better contextualize our results within the spectrum of human MASLD/MASH progression (Revised manuscript, page 9, lines 390-394).

      Minor points:

      The authors need to follow the new nomenclature (e.g., MASLD instead of MAFLD, e.g., in Figure 1).

      "MASLD" used throughout.

      We again thank the reviewers for their rigorous critique. We thank eLife for fostering an environment of fairness and transparency that enables authors to communicate openly and present their data honestly.

      Reference

      (1) Tran S, Baba I, Poupel L, et al. (2020) Impaired Kupffer Cell Self-Renewal Alters the Liver Response to Lipid Overload during Non-alcoholic Steatohepatitis. Immunity 53, 627-640.

    1. eLife Assessment

      This study describes a genetic screen to identify deubiquitinases (DUBs) that counteract the activity of small molecule degraders (PROTACs). The presented data is valuable, identifying OTUD6A and UCHL5 as DUBs that impact the efficacy and potency of PROTAC-mediated degradation in distinct subcellular compartments. While the conclusions are broadly supported and the methods employed are solid, the validation of OTUD6A and UCHL5 mechanisms requires additional study. Overall, these findings merit further evaluation by the targeted protein degradation community when developing and optimizing PROTACs and efforts to achieve compartment-specific degradation.

    2. Reviewer #1 (Public review):

      Summary:

      In this study, the authors investigate the role of deubiquitinases (DUBs) in modulating the efficacy of PROTAC-mediated degradation of the cell-cycle kinase AURKA. Using a focused siRNA screen of 97 human DUBs, they identify UCHL5 and OTUD6A as negative regulators of AURKA degradation by PROTACs. They further offer a mechanistic explanation of enhanced AURKA degradation in the nucleus via OTUD6A expression being restricted to the cytosol, thereby protecting the cytoplasmic pool of AURKA. These findings provide important insight into how subcellular localization and DUB activity influence the efficiency of targeted protein degradation strategies, which could have implications for therapy.

      Strengths:

      The manuscript is well-structured, with clearly defined objectives and well-supported conclusions.

      The study employs a broad range of well-validated techniques, including live-cell imaging, proximity ligation assays, HiBiT reporter systems, and ubiquitin pulldowns, to dissect the regulation of PROTAC activity.

      The authors use informative experimental controls, including assessment of cell-cycle progression effects, rescue experiments with siRNA-resistant constructs to confirm specificity, and the application of both AURKA-targeting PROTACs with different warheads and orthogonal degrader systems (e.g., dTAG-13 and dTAGv-1) to differentiate between target- and ligase-specific effects.

      The identification of OTUD6A as a cytosol-restricted DUB that protects cytoplasmic but not nuclear AURKA is novel and may have therapeutic relevance for selectively targeting oncogenic nuclear AURKA pools.

      Weaknesses:

      Although UCHL5 and OTUD6A are shown to limit AURKA degradation, direct physical interaction was not assessed.

      While the authors suggest that combining PROTACs with DUB inhibition could enhance degradation, this was not experimentally tested.

      The authors acknowledge the apparent discrepancy between the enhanced degradation observed with CRBN-recruiting PROTACs and the lack of change in ubiquitination following UCHL5 knockdown, yet they do not propose any mechanistic explanation.

    3. Reviewer #2 (Public review):

      Summary:

      In this study, the authors present a screening approach to identify deubiquitylases that may impact PROTAC efficacy/potency, specifically in this case using a previously reported AURKA PROTAC as an initial model. The authors claim that UCHL5 is able to control the level of degradation of both AURKA and dTAG when using CRBN-mediated PROTACs, but that VHL is not impacted by UCHL5 activity. They additionally claim that OTUD6A is able to control the extent of AURKA degradation in a target protein-specific manner and that this effect is specific to cytoplasm-located AURKA.

      Overall, the endeavour is of interest and important. Some of the claims made were overly generalised, and, in the main, the effects observed when knocking down the respective DUBs were small. In addition, the systems used are highly artificial, and the data are not presented in a way that makes absolute (rather than relative) changes easy to understand.

      Strengths:

      The topic is of high interest and relevance and explores an underappreciated and understudied area of the PROTAC mechanism of action. If further supported and understood, the findings would certainly bring value to the field.

      Weaknesses:

      The overall effects observed are sometimes limited in real terms. The data provided often omit the absolute changes in protein abundance observed. Data on endogenous/less engineered systems and/or with higher resolution read-outs would greatly strengthen some conclusions.

    4. Author response:

      The following is the authors’ response to the original reviews.

      We are grateful for the insightful and constructive feedback received from reviewers. As outlined in our previous response to the public reviews of the manuscript, we have made only minor changes to the manuscript to clarify some points noted by Reviewers 1 and 3. Firstly, we identify the DUB shown in the correlation plot (Fig 3B) - whose knockdown enhances PROTAC sensitivity without significantly altering cell cycle progression - as BAP1. Secondly, we explain in more detail how we selected DUB hits for further study, and thirdly, we acknowledge that the result in Figure 5G is unexpected given prevailing knowledge in the field.

      Please see below the detailed list of changes we have made to the manuscript.

      In response to Reviewer 1 (Point 2 of public review and Point 2 in recommendations to author)

      We have labelled one of the hits (as BAP1) in Figure 3B

      In response to Reviewer 1 (Point 2 of public review and Point 2 in recommendations to author) and Reviewer 3 (Point 6 in recommendations to authors)

      We have rewritten our description of Figure 3 in order to make clarifications about how we selected which hits to take forwards in our study

      In response to Reviewer 3 (Point 1 in the recommendation to authors)

      We corrected a typo in the first subtitle of the results section

      In response to Reviewer 3 (Point 2 in the recommendation to authors)

      We added information requested about how we selected our top hits

      In response to Reviewer 1 (Point 4 in public review and Point 4 in recommendation to authors)

      We pointed out the seemingly contradictory nature of the UCHL5 result in Figure 5G for the reader

      All of the changes have been aimed at clarifying our narrative, without any change to data content, analysis or interpretation, and we hope these improvements can be agreed by editorial review.

    1. eLife Assessment

      This important study contributes to our understanding of how epithelial cells establish polarity by identifying a hierarchy in which Par3 acts upstream of centrosome positioning and apical membrane initiation. The evidence supporting the main conclusions is convincing, although several aspects of the model remain only partially supported due to unresolved questions about microtubule organization and the need for clearer integration of quantitative and conceptual points raised in review. The work will be of interest to cell and developmental biologists, but the conclusions would be strengthened by greater precision in methodology, terminology, and interpretation.

    2. Reviewer #1 (Public review):

      Summary:

      Wang, Po-Kai et al., utilized the de novo polarization of MDCK cells cultured in Matrigel to assess the interdependence between polarity protein localization, centrosome positioning and apical membrane formation. They show that the inhibition of Plk4 with Centrinone does not prevent apical membrane formation, but does result in its delay, a phenotype the authors attribute to the loss of centrosomes due to the inhibition of centriole duplication. However, the targeted mutagenesis of specific centrosome proteins implicated in the positioning of centrosomes in other cell types (CEP164, ODF2, PCNT and CEP120), as well as the use of dominant negative constructs to inhibit centrosomal microtubule nucleation did not affect centrosome positioning in 3D cultured MDCK cells. A screen of proteins previously implicated in MDCK polarization revealed that the polarity protein Par-3 was upstream of centrosome positioning, similar to other cell types.

      Strengths:

      The investigation into the temporal requirement and interdependence of previously proposed regulators of cell polarization and lumen formation is valuable. The authors have provided a detailed analysis of many of these components at defined stages of polarity establishment, and convincingly demonstrate that centrosomes are not necessary for apical polarity formation, but are involved in the efficient establishment of the apical membrane.

      Weaknesses:

      Key questions remain regarding the structure of the intracellular cytoskeleton following depletion of centrosomes, centrosome proteins, or abrogation of centrosome microtubule nucleation. The authors strengthen their model that centrosomes are positioned independently of microtubule nucleation using dominant negative Cdk5RAP2 and NEDD-1 constructs; however, the structure of the intracellular microtubule network remains unresolved and will be an important avenue for future investigation.

    3. Reviewer #3 (Public review):

      Here, Wang et al. resubmit their manuscript describing the events in the establishment of polarity in MDCK cells cultured in vitro. As with the original version, the description is thorough and is important to report to the field, as it establishes a hierarchy of events in polarization, placing Par3 upstream of centrosome positioning and apical membrane component trafficking. Unfortunately, in the revised version, the authors addressed almost none of my points. They did a cursory job of responding in the rebuttal letter but made little attempt to actually address what was being asked or to incorporate any of my suggestions into the manuscript. The particularly egregious examples are cited below:

      Comments on revisions:

      (1) My original main experimental concern was not addressed: I had originally asked what role microtubules play in the process of polarization (either centrosomal or non-centrosomal). An obvious model is that Gp135, Rab11, etc. are delivered to the AMIS on centrosomal microtubules. Centrosomes might also be pulled to the AMIS via cortically derived microtubules, as is the case in the C. elegans intestine, where the centrosome moves apically on apical microtubules via dynein-directed transport to the cortically anchored minus ends. The authors do not explore the role of microtubules in the revision, citing that it was not possible to observe the microtubules directly or to perform nocodazole experiments during polarization. Instead, the authors use a relatively new genetic tool to disrupt centrosomal microtubules. They appear to succeed in displacing centrosomal γ-tubulin using this tool, but without being able to observe microtubules, a remaining caveat of this experiment is that it is still unclear whether the authors have removed centrosomal microtubules. Compounding this issue is that this tool has never been used in MDCK cells. The authors conclude "we found that cells lacking centrosomal microtubules were still able to polarize and position the centrioles apically.", but they have not shown this; instead, the data suggest this conclusion, and the authors should acknowledge the caveat that they have no idea whether centrosomal microtubules are abolished. Similarly, the authors also state: "Additionally, although PCNT knockout cells show reduced microtubule nucleation ability, they still recruit a small amount of γ-tubulin". Where are the data that show that microtubule nucleation is reduced in these PCNT knockout cells?

      (2) Many of my comments were addressed in the rebuttal, but not in the text.

      The non-centrosomal GP135 in Figure 2 is not acknowledged or explained.

      That the polarity index does not actually measure polarity, but nuclear-centrosome distance is not acknowledged or explained in the paper.

      I still don't believe that the quantification in Figure 3D matches the images I am being shown in Figure 3A. In the centrinone treatment condition, there is certainly an enrichment of GP135 at the AMIS that is not detected in the quantification. The method described in the rebuttal might miss this enrichment if it is offset from the line drawn between the centroids of the two nuclei.

      Cell height changes in the centrosome-depleted cysts are still referenced in the text ("the cell heights of the centrosome-depleted cysts are less uniform"), but no specific data or image is called out. Currently, Figure 3G is referenced, but that is a graph of GP135 intensity.

      In my original review, I called on the authors to comment on the striking similarity of the mechanisms they documented in MDCK cells to what has been shown in in vivo systems. The authors did not do this, instead restating in the rebuttal some features of what they found. But the mechanisms shown here are remarkably similar to the polarization of primordia that generate tubular organs in vivo. Perhaps most striking is the similarity to the C. elegans intestine, where Par3 localizes to the cortex at the site of an apical MTOC that pulls the centrosome to the apical surface via dynein (Feldman and Priess, 2012). Instead of discussing this similarity, the authors state: "Par3 is likely to regulate centrosome positioning through some intermediate molecules or mechanisms, but its specific mechanism is still unclear and requires further investigation." Given the acetylated tubulin signal emanating from the Par3 positive patch in Figure 5E and F, I suspect similar mechanisms to the C. elegans intestine are at play here. Such a parallel should be noted in the Discussion.

      I had originally commented that "I find the results in Figure 6G puzzling. Why is ECM signaling required for Gp135 recruitment to the centrosome. Could the authors discuss what this means?" The authors responded that "The data in Figure 6G do not indicate that ECM signaling is required for the recruitment of Gp135 to the centrosome". In Figure 6G, the localization of GP135 to the centrosome appears significantly delayed compared to its localization to the centrosome in images where cells were cultured in Matrigel. Indeed, the authors argue that the centrosomal localization precedes and contributes to its localization to the AMIS. In the absence of ECM, GP135 localizes to the membrane before it localizes to the centrosome and its localization to the centrosome appears significantly reduced. Thus, my original and current interpretation is that ECM signaling is somehow required for the centrosomal targeting of GP135. One could make a competition argument, i.e. that the cortex in the absence of ECM is somehow a more desirable place to localize than the centrosome, but this experiment also argues that the centrosome does not need to be a source of this material in order for it to end up on the cortex.

      (3) There needs to be precision in the language used in many places:

      I don't understand this line in the abstract: "When cultured in Matrigel, de novo polarization of a single epithelial cell is often coupled with mitosis." If a cell has divided, it is no longer a single cell.

      The authors state in the Introduction "Because of its strong ability to nucleate microtubules, the centrosome functions as the primary microtubule organizing center", but then state "In polarized epithelial cells, the centrosome is localized at the apical region during interphase, which contributes to the construction of an asymmetric microtubule network conducive to polarized vesicle trafficking". In the latter statement, I assume the authors are describing the well-characterized apical microtubule network in epithelial cells that is non-centrosomal. Thus, the latter sentence is at odds with the former.

      The authors continually refer to Par3 as a tight junction protein: "Par3, which controls tight junction assembly to partition the apical surface from the basolateral surface". To my knowledge, PARD3 is an apical protein with similar localization to C. elegans PAR-3 and Drosophila Bazooka. PARD3B is a junctional protein. I assume that the antibody that the authors are using is to PARD3 and not PARD3B? Can the authors please clarify this in the text?

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Wang, Po-Kai, et al., utilized the de novo polarization of MDCK cells cultured in Matrigel to assess the interdependence between polarity protein localization, centrosome positioning, and apical membrane formation. They show that the inhibition of Plk4 with Centrinone does not prevent apical membrane formation, but does result in its delay, a phenotype the authors attribute to the loss of centrosomes due to the inhibition of centriole duplication. However, the targeted mutagenesis of specific centrosome proteins implicated in the positioning of centrosomes in other cell types (CEP164, ODF2, PCNT, and CEP120) did not affect centrosome positioning in 3D cultured MDCK cells. A screen of proteins previously implicated in MDCK polarization revealed that the polarity protein Par-3 was upstream of centrosome positioning, similar to other cell types.

      Strengths:

      The investigation into the temporal requirement and interdependence of previously proposed regulators of cell polarization and lumen formation is valuable to the community. Wang et al., have provided a detailed analysis of many of these components at defined stages of polarity establishment. Furthermore, the generation of PCNT, p53, ODF2, Cep120, and Cep164 knockout MDCK cell lines is likely valuable to the community.

      Weaknesses:

      Additional quantifications would greatly improve this manuscript. For example, it is unclear whether the centrosome perturbation affects gamma tubulin levels and therefore microtubule nucleation; it is also not clear how they affect the localization of the trafficking machinery/polarity proteins. For example, in Figure 4, the authors measure the intensity of Gp134 at the apical membrane initiation site following cytokinesis, but there is no measure of Gp134 at the centrosome prior to this.

      We thank the reviewer for this important suggestion. Previous studies have shown that genes encoding appendage proteins and CEP120 do not regulate γ-tubulin recruitment to centrosomes (Betleja, Nanjundappa, Cheng, & Mahjoub, 2018; Vasquez-Limeta & Loncarek, 2021). Although the loss of PCNT reduces γ-tubulin levels, this reduction is partially compensated by AKAP450. Even in the case of PCNT/AKAP450 double knockouts, low levels of γ-tubulin remain at the centrosome (Gavilan et al., 2018), suggesting that it is difficult to completely eliminate γ-tubulin by perturbing centrosomal genes alone.

      To directly address this question, in the revised manuscript (Page 8, Paragraph 4; Figure 4—figure supplement 3), we employed a recently reported method to block γ-tubulin recruitment by co-expressing two constructs: the centrosome-targeting carboxy-terminal domain (C-CTD) of CDK5RAP2 and the γ-tubulin-binding domain of NEDD1 (N-gTBD). This approach effectively depleted γ-tubulin and abolished microtubule nucleation at the centrosome (Vinopal et al., 2023). Interestingly, despite the reduced efficiency of apical vesicle trafficking, these cells were still able to establish polarity, with centrioles positioned apically. These results suggest that microtubule nucleation at the centrosomes (centrosomal microtubules) facilitates—but is not essential for—polarity establishment.

      Regarding Figure 4, we assume the reviewer was referring to Gp135 rather than Gp134. In the revised manuscript (Page 8, Paragraph 2; Figure 4I), we observed a slight decrease in Gp135 intensity near PCNT-KO centrosomes at the pre-Abs stage. However, its localization at the AMIS following cytokinesis remained unaffected. These results suggest that the loss of PCNT has a limited impact on Gp135 localization. 

      Reviewer #2 (Public review):

      Summary:

      The authors decoupled several players that are thought to contribute to the establishment of epithelial polarity and determined their causal relationship. This provides a new picture of the respective roles of junctional proteins (Par3), the centrosome, and endomembrane compartments (Cdc42, Rab11, Gp135) from upstream to downstream.

      Their conclusions are based on live imaging of all players during the early steps of polarity establishment and on the knock-down of their expression in the simplest ever model of epithelial polarity: a cell doublet surrounded by ECM.

      The position of the centrosome is often taken as a readout for the orientation of the cell polarity axis. There is a long-standing debate about the actual role of the centrosome in the establishment of this polarity axis. Here, using a minimal model of epithelial polarization, a doublet of daughter MDCK cells cultured in Matrigel, the authors made several key observations that bring new light to our understanding of a mechanism that has been studied for many years without being fully explained:

      (1) They showed that centrioles can reach their polarized position without most of their microtubule-anchoring structures. These observations challenge the standard model according to which centrosomes are moved by the production and transmission of forces along microtubules.

      (2) (However) they showed that epithelial polarity can be established in the absence of a centriole.

      (3) (Somewhat more expectedly) they also showed that epithelial polarity can't be established in the absence of Par3.

      (4) They found that most other polarity players that are transported through the cytoplasm in lipid vesicles, and finally fused to the basal or apical pole of epithelial cells, are moved along an axis which is defined by the position of centrosome and orientation of microtubules.

      (5) Surprisingly, two non-daughter cells that were brought in contact (for 6h) could partially polarize by recruiting a few Par3 molecules but not the other polarity markers.

      (6) Even more surprisingly, in the absence of ECM, Par3 and centrosomes could move to their proper position close to the intercellular junction after cytokinesis, but other polarity markers (at least GP135) localized to the opposite, non-adhesive, side. So the polarity of the centrosome-microtubule network could be dissociated from the localisation of GP135 (which was believed to be transported along this network).

      Strengths:

      (1) The simplicity and reproducibility of the system allow a very quantitative description of cell polarity and protein localisation.

      (2) The experiments are quite straightforward, well-executed, and properly analyzed.

      (3) The writing is clear and conclusions are convincing.

      Weaknesses:

      (1) The simplicity of the system may not capture some of the mechanisms involved in the establishment of cell polarity in more physiological conditions (fluid flow, electrical potential, ion gradients,...).

      We agree that certain mechanisms may not be captured by this simplified system. However, the model enables us to observe intrinsic cellular responses, minimize external environmental variables, and gain new insights into how epithelial cells position their centrosomes and establish polarity. 

      (2) The absence of centriole in centrinone-treated cells might not prevent the coalescence of centrosomal protein in a kind of MTOC which might still orient microtubules and intracellular traffic. How are microtubules organized in the absence of centriole? If they still form a radial array, the absence of a centriole at the center of it somehow does not conflict with classical views in the field.

      Previous studies have shown that in the absence of centrioles, centrosomal proteins can relocate to alternative microtubule-organizing centers (MTOCs), such as the Golgi apparatus (Gavilan et al., 2018). Furthermore, centriole loss leads to increased nucleation of non-centrosomal microtubules (Martin, Veloso, Wu, Katrukha, & Akhmanova, 2018). However, these microtubules typically do not form the classical radial array or a distinct star-like organization. 

      While this non-centrosomal microtubule network can still support polarity establishment, it does so less efficiently—similar to what is observed in p53-deficient cells undergoing centriole-independent mitosis (Meitinger et al., 2016). Thus, although the absence of centrioles does not completely prevent microtubule-based organization or polarity establishment, it impairs their spatial coordination and reduces overall efficiency compared to a centriole-centered microtubule-organizing center (MTOC). 

      (3) The mechanism is still far from clear and this study shines some light on our lack of understanding. Basic and key questions remain:

      (a) How is the centrosome moved toward the Par3-rich pole? This is particularly difficult to answer if the mechanism does not imply the anchoring of MTs to the centriole or PCM.

      Previous studies have shown that Par3 interacts with dynein, potentially anchoring it at the cell cortex (Schmoranzer et al., 2009). This interaction enables dynein, a minus-end-directed motor, to exert pulling forces on microtubules, thereby promoting centrosome movement toward the Par3-enriched pole.

      In our experiments (Figure 4), we attempted to disrupt centrosomal microtubule nucleation by knocking out multiple genes involved in centrosome structure and function, including ODF2 and PCNT. Under these perturbations, γ-tubulin still remained detectable at the centrosome, and we were unable to completely eliminate centrosomal microtubules. 

      To address this question more directly, we employed a strategy to deplete γ-tubulin from centrosomes by co-expressing the centrosome-targeting C-terminal domain (C-CTD) of CDK5RAP2 and the γ-tubulin-binding domain of NEDD1 (N-gTBD). As shown in the new data of the revised manuscript (Page 8, Paragraph 4; Figure 4—figure supplement 3), this approach effectively depleted γ-tubulin from centrosomes, thereby abolishing microtubule nucleation at the centrosome. 

      Surprisingly, even under these conditions, centrioles remained apically positioned (Page 8, Paragraph 4; Figure 4—figure supplement 3), indicating that centrosomal microtubules are not essential for centrosome movement during polarization.

      Given these findings, we agree that the precise mechanism by which the Par3-enriched cortex attracts or guides centrosome movement remains unclear. Although dynein–Par3 interactions may contribute, further studies are needed to elucidate how centrosome repositioning occurs in the absence of microtubule-based pulling forces from the centrosome itself.

      (b) What happens during cytokinesis that organises Par3 and the intercellular junction in a way that can't be achieved by simply bringing two cells together? In larger epithelia, cells have neighbours that are not daughters; still, they can form tight junctions with Par3, which participates in the establishment of cell polarity as much as those that are closer to the cytokinetic bridge (as judged by the overall cell symmetry). Is the protocol of cell aggregation fully capturing the interaction mechanism of non-daughter cells?

      We speculate that a key difference between cytokinesis and simple cell-cell contact lies in the presence or absence of actomyosin contractility during the process of cell division. Specifically, contraction of the cytokinetic ring generates mechanical forces between the two daughter cells, which are absent when two non-daughter cells are simply brought together. While adjacent epithelial cells can indeed form tight junctions and recruit Par3, the lack of shared cortical tension and contractile actin networks between non-daughter cells may lead to differences in how polarity is initiated. This mechanical input during cytokinesis may serve as an organizing signal for centrosome positioning. This idea is supported by recent work showing that the actin cytoskeleton can influence centrosome positioning (Jimenez et al., 2021), suggesting that contractile actin structures formed during cytokinesis may contribute to spatial organization in a manner that cannot be replicated by simple aggregation. 

      In our experiments, we simply captured two cells that were in contact within Matrigel. We cannot say for sure that it captures all the interaction mechanisms of non-daughter cells, but it does provide a contrast to daughter cells produced by cytokinesis. 

      Reviewer #3 (Public review):

      Here, Wang et al. aim to clarify the role of the centrosome and conserved polarity regulators in apical membrane formation during the polarization of MDCK cells cultured in 3D. Through well-presented and rigorous studies, the authors focused on the emergence of polarity as a single MDCK cell divided in 3D culture to form a two-cell cyst with a nascent lumen. Focusing on these very initial stages, rather than in later large cyst formation as in most studies, is a real strength of this study. The authors found that conserved polarity regulators Gp135/podocalyxin, Crb3, Cdc42, and the recycling endosome component Rab11a all localize to the centrosome before localizing to the apical membrane initiation site (AMIS) following cytokinesis. This protein relocalization was concomitant with a repositioning of centrosomes towards the AMIS. In contrast, Par3, aPKC, and the junctional components E-cadherin and ZO1 localize directly to the AMIS without first localizing to the centrosome. Based on the timing of the localization of these proteins, these observational studies suggested that Par3 is upstream of centrosome repositioning towards the AMIS and that the centrosome might be required for delivery of apical/luminal proteins to the AMIS.

      To test this hypothesis, the authors generated numerous new cell lines and/or employed pharmacological inhibitors to determine the hierarchy of localization among these components. They found that removal of the centrosome via centrinone treatment severely delayed and weakened the delivery of Gp135 to the AMIS and single lumen formation, although normal lumenogenesis was apparently rescued with time. This effect was not due to the presence of CEP164, ODF2, CEP120, or Pericentrin. Par3 depletion perturbed the repositioning of the centrosome towards the AMIS and the relocalization of the Gp135 and Rab11 to the AMIS, causing these proteins to get stuck at the centrosome. Finally, the authors culture the MDCK cells in several ways (forced aggregation and ECM depleted) to try and further uncouple localization of the pertinent components, finding that Par3 can localize to the cell-cell interface in the absence of cell division. Par3 localized to the edge of the cell-cell contacts in the absence of ECM and this localization was not sufficient to orient the centrosomes to this site, indicating the importance of other factors in centrosome recruitment.

      Together, these data suggest a model where Par3 positions the centrosome at the AMIS and is required for the efficient transfer of more downstream polarity determinants (Gp135 and Rab11) to the apical membrane from the centrosome. The authors present solid and compelling data and are well-positioned to directly test this model with their existing system and tools. In particular, one obvious mechanism here is that centrosome-based microtubules help to efficiently direct the transport of molecules required to reinforce polarity and/or promote lumenogenesis. This model is not really explored by the authors except by Pericentrin and subdistal appendage depletion, and the authors do not test whether these perturbations affect centrosomal microtubules. Exploring the role of microtubules in this process could considerably add to the mechanisms presented here. In its current state, this paper is a careful observation of the events of MDCK polarization and will fill a knowledge gap in this field. However, the mechanism could be significantly bolstered with existing tools, thereby elevating our understanding of how polarity emerges in this system.

      We agree that further exploration of microtubule dynamics could strengthen the mechanistic framework of our study. In our initial experiments, we disrupted centrosome function through genetic perturbations (e.g., knockout of PCNT, CEP120, CEP164, and ODF2). However, consistent with previous reports (Gavilan et al., 2018; Tateishi et al., 2013), we found that single-gene deletions did not completely eliminate centrosomal microtubules. Furthermore, imaging microtubule organization in 3D culture presents technical challenges. Due to the increased density of microtubules during cell rounding, we were unable to obtain clear microtubule filament structures—either using α-tubulin staining in fixed cells or SiR-tubulin labeling in live cells. Instead, the signal appeared diffusely distributed throughout the cytosol.

      To overcome this, we employed a recently reported approach by co-expressing the centrosome-targeting carboxy-terminal domain (C-CTD) of CDK5RAP2 and the γ-tubulin-binding domain (gTBD) of NEDD1 to completely deplete γ-tubulin and abolish centrosomal microtubule nucleation (Vinopal et al., 2023). In our new data presented in the revised manuscript (Page 8, Paragraph 4; Figure 4—figure supplement 3), we found that cells lacking centrosomal microtubules were still able to polarize and position the centrioles apically. However, the efficiency of polarized transport of Gp135 vesicles to the apical membrane was reduced. These findings suggest that centrosomal microtubules are not essential for polarity establishment but may contribute to efficient apical transport. 

      References

      Betleja, E., Nanjundappa, R., Cheng, T., & Mahjoub, M. R. (2018). A novel Cep120-dependent mechanism inhibits centriole maturation in quiescent cells. Elife, 7. doi:10.7554/eLife.35439

      Gavilan, M. P., Gandolfo, P., Balestra, F. R., Arias, F., Bornens, M., & Rios, R. M. (2018). The dual role of the centrosome in organizing the microtubule network in interphase. EMBO Rep, 19(11). doi:10.15252/embr.201845942

      Jimenez, A. J., Schaeffer, A., De Pascalis, C., Letort, G., Vianay, B., Bornens, M., . . . Thery, M. (2021). Acto-myosin network geometry defines centrosome position. Curr Biol, 31(6), 1206-1220 e1205. doi:10.1016/j.cub.2021.01.002

      Martin, M., Veloso, A., Wu, J., Katrukha, E. A., & Akhmanova, A. (2018). Control of endothelial cell polarity and sprouting angiogenesis by non-centrosomal microtubules. Elife, 7. doi:10.7554/eLife.33864

      Meitinger, F., Anzola, J. V., Kaulich, M., Richardson, A., Stender, J. D., Benner, C., . . . Oegema, K. (2016). 53BP1 and USP28 mediate p53 activation and G1 arrest after centrosome loss or extended mitotic duration. J Cell Biol, 214(2), 155-166. doi:10.1083/jcb.201604081

      Schmoranzer, J., Fawcett, J. P., Segura, M., Tan, S., Vallee, R. B., Pawson, T., & Gundersen, G. G. (2009). Par3 and dynein associate to regulate local microtubule dynamics and centrosome orientation during migration. Curr Biol, 19(13), 1065-1074. doi:10.1016/j.cub.2009.05.065

      Tateishi, K., Yamazaki, Y., Nishida, T., Watanabe, S., Kunimoto, K., Ishikawa, H., & Tsukita, S. (2013). Two appendages homologous between basal bodies and centrioles are formed using distinct Odf2 domains. J Cell Biol, 203(3), 417-425. doi:10.1083/jcb.201303071

      Vasquez-Limeta, A., & Loncarek, J. (2021). Human centrosome organization and function in interphase and mitosis. Semin Cell Dev Biol, 117, 30-41. doi:10.1016/j.semcdb.2021.03.020

      Vinopal, S., Dupraz, S., Alfadil, E., Pietralla, T., Bendre, S., Stiess, M., . . . Bradke, F. (2023). Centrosomal microtubule nucleation regulates radial migration of projection neurons independently of polarization in the developing brain. Neuron, 111(8), 1241-1263 e1216. doi:10.1016/j.neuron.2023.01.020.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Figures:

      (1) Figure 3 B+C - Although in comparison to Figure 2 it appears the p53 mutation does not affect θN-C or Lo-c, the figure would benefit from direct comparison to control cells.

      We appreciate your suggestion to improve the clarity of the figure. In response, we have revised Figure 3B+C to include control cell data, allowing for clearer side-by-side comparisons in the updated figures. 

      (2) Figure 3D - Clarify if both were normalized to time point 0:00 of the p53 KO. In the image used, Gp135 intensity appears to increase substantially between 0:00 and 0:15, but the graph suggests that the intensity is the same if not slightly lower.

      Figure 3D – The data were normalized to the respective 0:00 time point for each condition. Because the intensity profile was measured along a line connecting the two nuclei, Gp135 signal could only be detected if it appeared along this line. However, the images shown are maximum-intensity projections, meaning that Gp135 signals from peripheral regions are projected onto the center of the image. This may create the appearance of increased intensity at certain time points (e.g., Figure 3A, p53-KO + CN, 0:00–0:15). 
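      The distinction drawn above, between a profile measured along the line connecting the two nuclei and a maximum-intensity projection, can be illustrated with a toy example. This is only a minimal sketch under stated assumptions: the z-stack values, the position of the measurement line (plane z=1, row y=1), and the helper names `max_projection`, `line_profile`, and `normalized_peak` are all hypothetical and are not taken from the authors' analysis pipeline.

```python
# Toy z-stack indexed as stack[z][y][x]; all values are made up for illustration.
# Assumption: the line connecting the two nuclei lies in plane z=1 along row y=1.
stack = [
    [[0, 0, 0, 0, 0],
     [0, 0, 9, 0, 0],   # bright signal in plane z=0, i.e. off the measurement line
     [0, 0, 0, 0, 0]],
    [[0, 0, 0, 0, 0],
     [1, 1, 1, 1, 1],   # plane z=1 contains the measurement line (row y=1)
     [0, 0, 0, 0, 0]],
]

def max_projection(stack):
    """Collapse the z axis by taking the per-pixel maximum."""
    depth, height, width = len(stack), len(stack[0]), len(stack[0][0])
    return [[max(stack[z][y][x] for z in range(depth)) for x in range(width)]
            for y in range(height)]

def line_profile(stack, z, y):
    """Intensity along one row of a single plane (the hypothetical nucleus-to-nucleus line)."""
    return list(stack[z][y])

def normalized_peak(profile, profile_t0):
    """Peak intensity of a profile relative to the peak at the 0:00 time point."""
    return max(profile) / max(profile_t0)

profile = line_profile(stack, z=1, y=1)   # what the quantification would see
projected_row = max_projection(stack)[1]  # what the displayed image would show
print(profile)
print(projected_row)
```

      In this toy case the projected row contains the bright off-line pixel while the line profile does not, which is the kind of discrepancy between image and graph that the response describes.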

      (3) Figure 4A: The diagram does not accurately represent the effect of the mutations, for example, PCNT mutation likely doesn't completely disrupt PCM (given gamma-tubulin is still visible in the staining), but instead results in its disorganization, Cep164 also wouldn't be expected to completely ablate distal appendages.

      Thank you for your comment. We have modified the figure in the revised manuscript (Figure 4A) to more clearly depict the defective DAs. 

      (4) Figure 4 + Supplements: A more in-depth characterization of the mutations would help address the previous comment and strengthen the manuscript. Especially as these components have previously been implicated in centrosome transport.

      Thank you for your valuable suggestion. As noted in previous studies, CEP164 is essential for distal appendage function and basal body docking, with its loss resulting in blocked ciliogenesis (Tanos et al., 2013); CEP120 is required for centriole elongation and distal appendage formation, and its loss also results in blocked ciliogenesis (Comartin et al., 2013; Lin et al., 2013; Tsai, Hsu, Liu, Chang, & Tang, 2019); ODF2 functions upstream in the formation of subdistal appendages, and its loss eliminates these structures and impairs microtubule anchoring (Tateishi et al., 2013); and PCNT functions as a PCM scaffold, necessary for the recruitment of PCM components and for microtubule nucleation at the centrosome (Fong, Choi, Rattner, & Qi, 2008; Zimmerman, Sillibourne, Rosa, & Doxsey, 2004). 

      Given that the phenotypes of these mutants have been well characterized in the literature, we focus here on their roles in centrosome migration and polarized vesicle trafficking within the specific context of our study. 

      (5) Figure 4: It would be interesting to measure the Gp135 intensity at the centrosomes, given that the model proposes it is trafficked from the centrosomes to the AMIS.

      Thank you for your suggestion. We have included measurements of Gp135 intensity at the centrosomes during the Pre-Abs stage in the revised figure (Figure 4I). Our data show no significant differences in Gp135 intensity between wild-type (WT) and CEP164-, ODF2-, or CEP120-knockout (KO) cell lines. However, a slight decrease in Gp135 intensity was observed in PCNT-KO cells. 

      (6) Figure 6F shows that in suspension culture polarity is reversed, however, in Figure 6G gp135 still localizes to the cytokinetic furrow prior to polarity reversal. Given this paper demonstrates Par-3 is upstream of centrosome positioning, it would be important to have temporal data of how Par-3 localizes prior to the ring observed in 6F.

      Thank you for your comment. We have included a temporal analysis of Par3 localization using fixed-cell staining in the revised figure (Figure 6—figure supplement 1D). This analysis shows that Par3 also localizes to the cytokinesis site during the Pre-Abs stage, prior to the ring formation observed during the Post-CK stage (Figure 6F). Interestingly, during the Pre-Abs stage, the centrosomes also migrate toward the center of the cell doublets in suspension culture, and Gp135 surrounding the centrosomes is also recruited to a region near the center (Figure 6—figure supplement 1E). These data suggest that Par3 is initially recruited to the cytokinesis site before polarity reversal, potentially promoting centrosome migration. The main difference from Matrigel culture is the peripheral localization of Par3 and Gp135 in suspension, which is likely due to the lack of external ECM signaling. 

      Results:

      (1) Page 7 Paragraph 1 - consistently use AMIS (Apical membrane initiation site) rather than "the apical site".

      Thank you for your helpful comment. We have revised the manuscript (Page 7, Paragraph 1) and now use "AMIS" (apical membrane initiation site) instead of "the apical site" throughout the text. 

      (2) Page 7 Paragraph 4 - A single sentence explaining why the p53 background had to be used for the Cep120 deletion would be beneficial. Did the cell line have a reduced centrosome number? Does this effect apical membrane initiation similar to centrinone?

      We have revised the text (Page 7, Paragraph 4) to clarify that we were unable to generate a CEP120-KO line in p53-WT cells for unknown reasons. CEP120-KO cells have a normal number of centrosomes, but their centrioles are shorter. Because this KO line still contains centrioles, the effect is different from centrinone treatment, which results in a complete loss of centrioles. 

      (3) Page 10 paragraph 4 - This paragraph is confusing to read. I understand that in the cysts and epithelial sheet the cytokinetic furrow is apical, therefore a movement towards the AMIS could be due to its coincidence with the furrow. However, the phrasing "....we found that centrosomes move towards the apical membrane initiation site direction before bridge abscission. Taken together these findings indicate the position is strongly associated with the site of cytokinesis but not with the apical membrane" is confusing to the reader.

      We have revised the manuscript (Page 11, Paragraph 4) to replace "AMIS" with "the center of the cell doublet". During de novo epithelial polarization, the apical membrane has not yet formed at the Pre-Abs stage. However, at this stage the centrosome has already migrated toward the site of cytokinesis, suggesting that centrosome positioning is correlated with the site of cell division. A similar phenomenon occurs in fully polarized epithelial cysts and sheets, where the centrosomes also migrate before bridge abscission. Thus, we propose that the position of the centrosome is closely associated with the site of cytokinesis and is independent of apical membrane formation. 

      Discussion

      (1) Page 11, Paragraph 2 - citations needed when discussing previous studies.

      Thank you for your suggestion. We have included the necessary references to the discussion of the previous studies in the revised manuscript (Page 12, Paragraph 2). 

      (2) Page 12, Paragraph 2 - This section of the discussion would be strengthened by discussing the role of the actomyosin network in defining centrosome position (Jimenez et al., 2021). It seems plausible that the differences observed in the different conditions could be due to altered actomyosin architecture. Especially where the cells haven't undergone cytokinesis.

      We appreciate the suggestion of a role for the actomyosin network in determining centrosome positioning. Recent studies have indeed highlighted the role of the actomyosin network in regulating centrosome centering and off-centering (Jimenez et al., 2021). During the pre-abscission stage of cell division, the actomyosin network undergoes significant dynamic changes, with the contractile ring forming at the center and actin levels decreasing at the cell periphery. In contrast, under aggregated cell conditions—meaning cells that have not undergone division—the actomyosin network does not exhibit such dynamic changes. The loss of actomyosin remodeling may therefore influence whether the centrosome moves. Thus, alterations in actomyosin architecture may contribute to the differences observed under various conditions, particularly when cells have not yet completed cytokinesis. We have revised Paragraph 2 on Page 13 to briefly mention the referenced study and to propose that the actomyosin network may influence centrosome positioning, contributing to our observed results. This addition strengthens the discussion and clarifies our findings. 

      (3) Page 12 paragraph 3 - Given that centrosome translocation during cytokinesis in MDCK cells (this study) appears to be similar to that observed in HeLa cells and the zebrafish Kupffers vesicle (Krishnan et al., 2022) it would be interesting to discuss why Rab11a and PCNT may not be essential to centrosome positioning in MDCK cells.

      Thank you for your insightful comment. We agree that it is interesting that centrosome translocation during cytokinesis in MDCK cells (as observed in our study) is similar to that observed in HeLa cells and zebrafish Kupffer's vesicle (Krishnan et al., 2022). However, there are notable differences between these systems that may help explain why Rab11a and PCNT are not essential for centrosome positioning in MDCK cells.

      Our study used 3D culture of MDCK cells, while the reference study examined adherent culture of HeLa cells. In the adherent culture, cells attached to the culture surface form large actin stress fibers on their basal side, which weakens the actin networks in the apical and intercellular regions. In contrast, the 3D culture system used in our study better preserves cell polarity and the integrity of the actin network, which might contribute to centrosome positioning independent of Rab11a and PCNT. Differences in culture conditions and actin network architecture may explain why Rab11a and PCNT are not required for centrosome positioning in MDCK cells.

      Furthermore, the referenced study focused on Rab11a and PCNT in zebrafish embryos at 3.3–5 hours post-fertilization (hpf), a time point before the formation of the Kupffer’s vesicle. At this stage, the cells they examined may not yet have become epithelial cells, which may also influence the requirement of Rab11a and PCNT for centrosome positioning. We hypothesize that during the pre-abscission stage, centrosome migration toward the cytokinetic bridge occurs primarily in epithelial cells, and that the polarity and centrosome positioning mechanisms in these cells may differ from those in other cell types, such as zebrafish embryos.

      In addition, data from Krishnan et al. (2022) suggest that cytokinesis failure in pcnt+/- heterozygous embryos and Rab11a function-blocked embryos may be due to the presence of supernumerary centrosomes. Consistent with this, our data show that blocking cytokinesis inhibits centrosome movement in MDCK cells. However, in our MDCK cell lines with PCNT or Rab11a knockdown, we did not observe significant cytokinesis failure, and centrosome migration proceeded normally. 

      Reviewer #2 (Recommendations for the authors):

      Suggestions for experiments:

      (1) A description of the organization of microtubules in the absence of centriole, or in the absence of ECM would be interesting to understand how polarity markers end up where you observed them. This easy experiment may significantly improve our understanding of this system.

      Previous studies have shown that in the absence of centrioles, microtubule organization undergoes significant changes. Specifically, the number of non-centrosomal microtubules increases, and these microtubules are not radially arranged, leading to the absence of focused microtubule-organizing centers in centriole-deficient cells (Martin, Veloso, Wu, Katrukha, & Akhmanova, 2018). This disorganized microtubule network reduces the efficiency of vesicle transport during de novo epithelial polarization at the mitotic pre-abscission stage. 

      In contrast, the organization of microtubules under ECM-free conditions remains less well characterized. Here, we show that while the ECM plays a critical role in establishing the direction of epithelial polarity, it does not influence the positioning of the centrosome, the microtubule-organizing center (MTOC).  

      (2) Would it be possible to knock down ODF2 and pericentrin to completely disconnect the centrosome from microtubules?

      ODF2 is the base of subdistal appendages. When ODF2 is knocked out, the recruitment of all downstream proteins to the subdistal appendages is affected (Mazo, Soplop, Wang, Uryu, & Tsou, 2016). One study has shown that ODF2-knockout cells almost completely lost subdistal appendage structures and had significantly reduced microtubule asters surrounding the centrioles (Tateishi et al., 2013). However, although pericentrin (PCNT) is the main scaffold of the pericentriolar matrix (PCM) of centrosomes, the microtubule-organizing ability of centrosomes can be compensated by AKAP450, a paralog of PCNT, after PCNT knockout. A previous study has even shown that in cells with a double knockout of PCNT and AKAP450, γ-tubulin can still be recruited to the centrosomes, and centrosomes can still nucleate microtubules (Gavilan et al., 2018). This suggests that there are other proteins or pathways that promote microtubule nucleation on centrosomes. We are unsure whether the triple knockout of ODF2, PCNT, and AKAP450 can completely disconnect the centrosome from microtubules. However, a recent study reported a simpler approach involving the expression of dominant-negative fragments of the γ-tubulin-binding protein NEDD1 and the activator CDK5RAP2 at the centrosome (Vinopal et al., 2023). In our revised manuscript (Page 8, Paragraph 4; Figure 4—figure supplement 3), we applied this strategy, which resulted in the depletion of nearly all γ-tubulin from the centrosome. This indicates a strong suppression of centrosomal microtubule nucleation and an effective disconnection of the centrosome from the microtubule network. 

      (3) The study does not distinguish the role of cytokinesis from the role of tight junctions, which form only after cytokinesis and not simply by bringing cells into contact. Would it be feasible and interesting to study the polarization after cytokinesis in cells that could not form tight junctions (due to the absence of Ecad or ZO1 for example)?

      Studying cell polarization after cytokinesis in cells unable to form tight junctions is a promising area of research.

      Recent studies have shown that mouse embryonic stem cells (mESCs) cultured in Matrigel can form ZO-1-labelled tight junctions at the midpoint of cell–cell contact even in the absence of cell division. However, in the absence of E-cadherin, ZO-1 localization is significantly impaired. Interestingly, despite the loss of E-cadherin, the Golgi apparatus and centrosomes remain oriented toward the cell–cell interface (Liang, Weberling, Hii, Zernicka-Goetz, & Buckley, 2022). These findings suggest that cell polarity can be maintained independently of tight junction formation, highlighting the potential value of studying polarization in cells that lack tight junctions.

      Furthermore, while studies have explored the effects of knocking down tight junction components such as JAM-A and Cingulin on lumen formation in MDCK 3D cultures (Mangan et al., 2016; Tuncay et al., 2015), the role of ZO-1 in this context remains underexplored. Cingulin knockdown has been shown to disrupt endosome targeting and the formation of the AMIS, while both JAM-A and Cingulin knockdown result in actin accumulation at multiple points, leading to the formation of multi-lumen structures rather than a reversal of polarity. However, previous research has not specifically investigated centrosome positioning in JAM-A and Cingulin knockdown cells, an area that could provide valuable insights into how polarity is maintained in the absence of tight junctions. 

      Writing details:

      (1) The migration of the centrosome in the absence of appendages or PCM is proposed to be ensured by compensatory mechanisms ensuring the robustness of microtubule anchoring to the centrosome. It could also be envisaged that the centrosome motion does not require this anchoring and that other yet unknown moving mechanisms, based on an actin network for example, might exist.

      Thank you for your valuable comments. We agree that there may indeed be some unexpected mechanisms that allow centrosomes to move independently of microtubule anchoring to the centrosome, such as mechanisms based on actin filaments or noncentrosomal microtubules; these mechanisms are worth further investigation.

      In response to your suggestion, we have clarified in Paragraph 5 of the Discussion that while a microtubule anchoring mechanism might be one explanation, other mechanisms could also influence centrosome movement in the absence of appendages or PCM. Additionally, we revised Paragraph 4 to address the possibility of actin network-driven centrosome movement and emphasized the importance of future research for a deeper understanding of these processes. 

      (2) The actual conclusion of the study of Martin et al (eLife 2018) is not simply that centrosome is not involved in cell polarization but that it hinders cell polarization!

      Thank you for your valuable feedback. We agree with the conclusion of Martin et al. (eLife 2018) that the centrosome is not merely dispensable for cell polarization but rather hinders it. Therefore, we have revised the manuscript (Page 2, Paragraph 2) to more accurately reflect this viewpoint. 

      (3) This study recalls some conclusions of the study by Burute et al (Dev Cell 2017), in particular the role of Par3 in driving centrosome toward the intercellular junction of daughter cells after cytokinesis. It would be welcome to comment on the results of this study in light of their work.

      Thank you for your valuable feedback. The study by Burute et al. (Dev Cell, 2017) showed that in micropattern-cultures of MCF10A cells, the cells exhibit polarity and localize their centrosomes towards the intercellular junction, while downregulation of Par3 gene expression disrupts this centrosome positioning. This result is similar to our findings in 3D cultured MDCK cells and consistent with previous studies in C. elegans intestinal cells and migrating NIH 3T3 cells (Feldman & Priess, 2012; Schmoranzer et al., 2009), indicating that Par3 indeed influences centrosome positioning in different cellular systems. However, Par3 does not directly localize to the centrosome; rather, it localizes to the cell cortex or cell-cell junctions. Therefore, Par3 likely regulates centrosome positioning through other intermediary molecules or mechanisms, but the specific mechanism remains unclear and requires further investigation. 

      (4) Could the term apico-basal be used in the absence of a basement membrane to form a basal pole?

      We understand that using the term "apico-basal" in the absence of a basement membrane might raise some questions. Traditionally, the apico-basal axis refers to the polarity of epithelial cells, where the apical surface faces the lumen or external environment, and the basal surface is oriented toward the basement membrane. However, in the absence of a basement membrane, such as in certain in vitro systems or under specific experimental conditions, polarity along a similar axis can still be observed. In such cases, the term "apico-basal" can still be used to describe the polarity between the apical domain and the region where it contacts the substrate or adjacent cells. 

      (5) The absence of centrosome movement to the intercellular bridge in spread cells in culture is not so surprising considering the work of Lafaurie-Janvore et al (Science 2018) about the role of cell spreading in the regulation of bridge tension and abscission delay.

      Thank you for your valuable comment. Indeed, previous studies have shown that in some cell types, the centrosome does move toward the intercellular bridge in spread cells (Krishnan et al., 2022; Piel, Nordberg, Euteneuer, & Bornens, 2001), but other studies have suggested that this movement may not be significant and may not be universally observed across all cell types (Jonsdottir et al., 2010). In our study, we aim to demonstrate that this phenomenon is more pronounced in 3D culture systems than in 2D spread cell culture systems. Previous studies and our work have observed that centrosome migration occurs during the pre-abscission stage, but whether this migration is directly related to cytokinetic bridge tension or the timing of abscission remains an open question. Further research is needed to explore the potential relationship between centrosome positioning, cytokinetic bridge tension, and the timing of abscission. 

      (6) GP135 (podocalyxin) has been proposed to have anti-adhesive/lubricant properties (hence its pro-invasive effect). Could it be possible that once localized at the cell surface it is systematically moved away from regions that are anchored to either the ECM or adjacent cells? So its localization away from the centrosome in an ECM-free experiment would not be a consequence of defective targeting but relocalization after reaching the plasma membrane?

      Thank you for your valuable comment. We agree that GP135 may indeed move directly across the cell surface, away from the region where it interacts with the ECM or adjacent cells. This relocalization could be due to its anti-adhesive or lubricating properties, which may facilitate its displacement from these adhesive sites. Validating this would require a higher-resolution real-time imaging system to observe the dynamic behavior of GP135 on the cell surface.

      However, this does not contradict our main conclusion. Under suspension culture conditions without ECM, the centrosome positioning in cell doublets is indeed decoupled from apical membrane orientation. This suggests that the localization of the centrosome and the apical membrane is regulated by different mechanisms. Specifically, the GP135 protein tends to accumulate away from areas of contact with the ECM or adjacent cells, possibly through movement within the cell membrane or by recycling endosome transport. In contrast, centrosome positioning is closely related to the cytokinesis site. Our study clearly elucidates the differences between these two polarity properties. 

      Reviewer #3 (Recommendations for the authors):

      Major:

      (1) To me, a clear implication of these studies is that Gp135, Rab11, etc. are delivered to the AMIS on centrosomal microtubules. The authors do not explore this model except to say that depletion of SD appendage or pericentrin has no effect on the protein relocalization to the AMIS. However, the authors do not observe microtubule association with the centrosome in these KO conditions. This analysis is imperative to interpret existing results since these are new KO conditions in this cell/culture system and parallel pathways (e.g. CDK5RAP2) are known to contribute to microtubule association with the centrosome. An ability to comment on the mechanism by which the centrosome contributes to the efficiency of polarization would greatly enhance the paper.

      Microtubule requirement could also be tested in numerous additional ways requiring varying degrees of new experiments:

      (a) faster live cell imaging at abscission to see if the deposition of those components appears to traffic on MTs;

      (b) live cell imaging with microtubules (e.g. SPY-tubulin) and/or EB1 to determine the origin and polarity of microtubules at the pertinent stages;

      For (a) and (b), because the cells were cultured in Matrigel, they tended to round up, with a dense internal structure that made observation difficult. In contrast, under adherent culture conditions, the cells were flattened with more dispersed internal structures, making them easier to observe. We had previously used SPY-tubulin to label microtubules for live cell imaging; however, due to the dense microtubule structure in 3D culture, the image contrast was reduced, and we could not clearly observe the microtubule network within the cells. 

      (c) acute nocodazole treatment at abscission to determine the effect on protein localization.

      Regarding the method of using nocodazole to study microtubule requirements at the abscission stage, we believe that nocodazole treatment may lead to cytokinesis failure. Cell division failure results in the formation of binucleated cells, which are unable to establish cell polarity. Furthermore, nocodazole treatment cannot distinguish between centrosomal and non-centrosomal microtubules, making it unsuitable for studying the specific role of centrosomal microtubules in this process.

      In the new data presented in the revised manuscript (Figure 4—figure supplement 3), we employed a recently reported method, co-expressing the centrosome-targeting carboxy-terminal domain (C-CTD) of CDK5RAP2 and the γ-tubulin-binding domain (gTBD) of NEDD1, to completely deplete γ-tubulin and abolish centrosomal microtubule nucleation (Vinopal et al., 2023). We found that cells lacking centrosomal microtubules were still able to polarize and position the centrioles apically. However, the efficiency of polarized transport of Gp135 vesicles to the apical membrane was reduced. These findings suggest that centrosomal microtubules are not essential for polarity establishment but may facilitate efficient apical transport. 

      (2) Similar to the expanded analysis of the role of microtubules in this system, it would be excellent if the author could expand on the role of Par3 and the centrosome, although this reviewer recognizes that the authors have already done substantial work. For example, what are the consequences of Gp135 and/or Rab11 getting stuck at the centrosome? Do the authors have any later images to determine when and if these components ever leave the centrosome? Existing literature focuses on the more downstream consequence of Par3 removal on single-lumen formation. 

      Similarly, could the authors expand on the description of polarity disruption following centrinone treatment? It is clear that Gp135 recruitment is disrupted, but how and when do things get fixed and what else is disrupted at the very earliest stages of AMIS formation? The authors have an excellent opportunity to really expand on what is known about the requirements for these conserved components.

      Regarding the use of centrinone in treatment, we speculate that Gp135 can still accumulate at the AMIS over time, although the efficiency of its recruitment may be reduced.

      Furthermore, under similar conditions, other apical membrane components (such as the Crumbs3 protein) may exhibit similar characteristics to Gp135 protein. 

      (3) Perhaps satisfying both of the above asks, could the authors do a faster time-lapse at the relevant time points, i.e. as proteins are being recruited to the AMIS (time points between 1Aiv and v)? This type of imaging again might help shed light on the mechanism.

      We believe the above questions are very important and may require further experimental verification in the future. 

      Minor:

      (1) What is the green patch of Gp135 in Figure 2A that does not colocalize with the centrosome? Is this another source of Gp135 that is being delivered to the AMIS? This type of patch is also visible in Figure 3A 15 and 30-minute panels.

      During mitosis, membrane-composed organelles such as the Golgi apparatus are typically dispersed throughout the cytoplasm. However, during the pre-abscission stage, these organelles begin to reassemble and cluster around the centrosome. Furthermore, they also accumulate in the region between the nucleus and the cytokinetic bridge, corresponding to the “patch” mentioned in Figure 2A. 

      Live cell imaging results showed that this Gp135 patch initially appears in a region not associated with the centrosome. Subsequently, it was either transported directly to the AMIS or fused with the centrosome-associated Gp135 and transported together with it. Notably, this patch was only observed when Gp135 was overexpressed in cells. No such distinct patches were observed when staining endogenous Gp135 (Figure 1A), suggesting that overexpression of Gp135 may lead to a localized increase in its concentration in that region. 

      (2) I am confused by the "polarity index" quantification as this appears to just be a nucleus centrosome distance measurement and wouldn't, for example, distinguish if the centrosomes separated from the nucleus but were on the basal side of the cell.

      The position of the centrosome within the cell (i.e., its distance from the nucleus) can indeed serve as an indicator of cell polarity (Burute et al., 2017). We acknowledge that this quantitative method does not directly capture the specific direction in which the centrosome deviates from the cell center. To address this limitation, we have incorporated information about the angle between the nucleus and the centrosome, which allows for a more accurate description of changes in cell polarity (Rodriguez-Fraticelli, Auzan, Alonso, Bornens, & Martin-Belmonte, 2012). 
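
      As an illustration only, the combination of distance and angle described above could be computed as in the sketch below. This is not the analysis code used in the study. The function name, the coordinate tuples, and in particular the choice of reference point (for example, the center of the cell doublet) against which the angle is measured are assumptions made for the example.

```python
import math

def nucleus_centrosome_metrics(nucleus, centrosome, reference):
    """Return (distance, angle_deg) for one cell.

    nucleus, centrosome, reference: 3D coordinates as (x, y, z) tuples.
    `reference` is a hypothetical landmark (e.g. the doublet center);
    the angle is taken between nucleus->centrosome and nucleus->reference.
    """
    v = [c - n for c, n in zip(centrosome, nucleus)]   # nucleus -> centrosome
    r = [p - n for p, n in zip(reference, nucleus)]    # nucleus -> reference
    dist = math.sqrt(sum(a * a for a in v))
    ref_len = math.sqrt(sum(a * a for a in r))
    if dist == 0 or ref_len == 0:
        # Angle is undefined when either vector has zero length.
        return dist, None
    cos_angle = sum(a * b for a, b in zip(v, r)) / (dist * ref_len)
    cos_angle = max(-1.0, min(1.0, cos_angle))         # guard against rounding
    return dist, math.degrees(math.acos(cos_angle))
```

      Under these assumptions, a centrosome displaced toward the reference point gives an angle near 0 degrees, whereas one displaced to the opposite side of the nucleus gives an angle near 180 degrees, even when the two distances are identical.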

      (3) How is GP135 "at AMIS" measured? Is an arbitrary line drawn? This is important later when comparing to centrinone treatment in Figure 3D where the quantification does not seem to accurately capture the enrichment of Gp135 that is seen in the images.

      To measure the expression level of Gp135 in the "AMIS" region of the cell, we first connected the centers of the two cell nuclei in three-dimensional space to form a straight line. Then, we used the Gp135 expression intensity at the midpoint of this line as the representative value for the AMIS region. This method is based on the assumption that the AMIS region is most likely located between the centers of the two cell nuclei. Therefore, this quantitative method provides a standardized assessment tool for comparing Gp135 expression levels under different conditions. 
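
      As an illustration only, the midpoint measurement described above could be sketched as follows. This is not the analysis code used in the study. The function names, the (z, y, x) index order, and the `radius` parameter (which averages over a small cube rather than reading a single voxel) are assumptions made for the example.

```python
def midpoint(p, q):
    """Midpoint of two 3D points given as (z, y, x) tuples."""
    return tuple((a + b) / 2.0 for a, b in zip(p, q))

def intensity_at_midpoint(volume, nucleus_a, nucleus_b, radius=0):
    """Mean intensity in a cube centred on the midpoint between two nuclei.

    volume: nested lists indexed as volume[z][y][x] (one channel, e.g. Gp135).
    nucleus_a, nucleus_b: nuclear centers in voxel coordinates, inside the volume.
    radius: hypothetical half-width of the sampling cube, in voxels.
    """
    cz, cy, cx = (int(round(c)) for c in midpoint(nucleus_a, nucleus_b))
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    values = []
    for z in range(cz - radius, cz + radius + 1):
        for y in range(cy - radius, cy + radius + 1):
            for x in range(cx - radius, cx + radius + 1):
                # Skip voxels that fall outside the image bounds.
                if 0 <= z < nz and 0 <= y < ny and 0 <= x < nx:
                    values.append(volume[z][y][x])
    return sum(values) / len(values)
```

      Because the sampled position is fixed by the two nuclear centers, signal that lies off this line does not contribute to the value, which is the limitation the reviewer's question points to.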

      (4) The authors reference cell height (p.7) but no data for this measurement are shown

      Thank you for the comment. Although we did not perform quantitative measurements, the differences in cell height are clearly visible in Figure 3E (p53-KO + CN), which visually illustrates this phenomenon. 

      (5) Can the authors comment on the seeming reduction of Par3 in p53 KO cells?

      We did not observe a reduction of Par3 in p53-KO cells in our experiments.

      (6) Can the authors make sense of the E-cad localization: Figure 5, Supplement 2.

      Our study revealed that E-cadherin begins to accumulate at the cell-cell contact sites during the pre-abscission stage. Its appearance is similar to that of ZO-1, which also appears near the cell division site during this phase. Therefore, the behavior of E-cadherin contrasts sharply with that of Gp135, further highlighting the unique trafficking mechanisms of apical membrane proteins during this process. 

      (7) I find the results in Figure 6G puzzling. Why is ECM signaling required for Gp135 recruitment to the centrosome. Could the authors discuss what this means?

      We appreciate the reviewer’s valuable comments and thank you for the opportunity to clarify this point. The data in Figure 6G do not indicate that ECM signaling is required for the recruitment of Gp135 to the centrosome. Rather, our findings suggest that even in the absence of ECM, the centrosomes can migrate to a polarized position similar to that in Matrigel culture. This suggests that centrosome migration and the orientation of the nucleus–centrosome axis may be independent of ECM signaling and are primarily driven by cytokinesis alone. 

      Regarding the localization of Gp135, previous studies have shown that ECM signaling through integrin promotes endocytosis, which is crucial for the internalization of Gp135 from the cell membrane and its subsequent transport to the AMIS (Buckley & St Johnston, 2022). Our study found that, prior to its accumulation at the AMIS, Gp135 transiently localizes around the centrosome. In the absence of ECM, due to reduced endocytosis, Gp135 primarily remains on the cell membrane and does not undergo intracellular trafficking.  

      (8) The authors end the Discussion stating that these studies may have implication for in vivo settings, yet do not discuss the striking similarities to the C. elegans and Drosophila intestine or the findings from any other more observational studies of tubular epithelial systems in vivo (e.g. mouse kidney polarization, zebrafish neuroepithelium, etc.). These models should be discussed.

      Thank you for your valuable comment. Indeed, all types of epithelial tissues or tubular epithelial systems in vivo share some common features during cell division, which have been well-documented across various species. 

      These features include: during interphase, the centrosome is located at the apical surface of the cells; after the cell enters mitosis, the centrosome moves to the lateral side of the cell to regulate spindle orientation; and during cytokinesis, the cleavage furrow ingresses asymmetrically from the basal to the apical side, with the cytokinetic bridge positioned at the apical surface. Our study using MDCK 3D culture and transwell culture systems successfully mimicked these key features, demonstrating that these in vitro models are of significant value for studying cell polarization dynamics. 

      Based on our observations, we speculate that the centrosome may return to the apical surface after anaphase, just before bridge abscission. This is consistent with our findings from studies using MDCK 3D cultures and transwell systems, which showed that the centrosome relocates prior to the final stages of cytokinesis.

      Additionally, we propose that de novo polarization of the kidney tubule in vivo may not solely depend on the aggregation and mesenchymal-epithelial transition (MET) of the metanephric mesenchyme. It may also be related to the cell division process, which triggers centrosome migration and polarized vesicle trafficking. These processes likely contribute to enhancing cell polarization, as we observed in our in vitro models.

      We hope this further clarifies the potential implications of our findings for in vivo model studies, as well as their broader impact on the field of tubular epithelial cell polarization research. 

      (9) There are several grammatical issues/typos throughout the paper. A careful readthrough is required. For example:

      this sentence makes no sense "that the centrosome acts as a hub of apical recycling endosomes and centrosome migration during cytokinetic pre-abscission before apical membrane components are targeted to the AMIS"

      We carefully reviewed the paper and made necessary revisions to address the issues raised. In particular, we revised certain sentences to improve clarity and readability (Page 5, Paragraph 3). 

      (10) P.8: have been previously reported [to be] involved in MDCK...

      We appreciate the reviewer's valuable suggestions. We have revised the sentence accordingly (Page 9, Paragraph 2). 

      (11) This sentence seems misplaced: "Cultured conditions influence cellular polarization preferences."

      The sentence itself is fine, but to improve the coherence and clarity of the paragraph, we adjusted the paragraph structure and added some transitional phrases (Page 13, Paragraph 1).  

      (12) "Play a downstream role in Par3 recruitment" doesn't make sense, this should just be downstream of Par3 recruitment.

      Thank you for your suggestion. We have revised the wording accordingly, changing it to "downstream of Par3 recruitment" (Page 10, Paragraph 2).  

      Reference

      Buckley, C. E., & St Johnston, D. (2022). Apical-basal polarity and the control of epithelial form and function. Nat Rev Mol Cell Biol, 23(8), 559-577. doi:10.1038/s41580-022-00465-y

      Burute, M., Prioux, M., Blin, G., Truchet, S., Letort, G., Tseng, Q., . . . Thery, M. (2017). Polarity Reversal by Centrosome Repositioning Primes Cell Scattering during Epithelial-to-Mesenchymal Transition. Dev Cell, 40(2), 168-184. doi:10.1016/j.devcel.2016.12.004

      Comartin, D., Gupta, G. D., Fussner, E., Coyaud, E., Hasegan, M., Archinti, M., . . . Pelletier, L. (2013). CEP120 and SPICE1 cooperate with CPAP in centriole elongation. Curr Biol, 23(14), 1360-1366. doi:10.1016/j.cub.2013.06.002

      Feldman, J. L., & Priess, J. R. (2012). A role for the centrosome and PAR-3 in the hand-off of MTOC function during epithelial polarization. Curr Biol, 22(7), 575-582. doi:10.1016/j.cub.2012.02.044

      Fong, K. W., Choi, Y. K., Rattner, J. B., & Qi, R. Z. (2008). CDK5RAP2 is a pericentriolar protein that functions in centrosomal attachment of the gamma-tubulin ring complex. Mol Biol Cell, 19(1), 115-125. doi:10.1091/mbc.e07-04-0371

      Gavilan, M. P., Gandolfo, P., Balestra, F. R., Arias, F., Bornens, M., & Rios, R. M. (2018). The dual role of the centrosome in organizing the microtubule network in interphase. EMBO Rep, 19(11). doi:10.15252/embr.201845942

      Jimenez, A. J., Schaeffer, A., De Pascalis, C., Letort, G., Vianay, B., Bornens, M., . . . Thery, M. (2021). Acto-myosin network geometry defines centrosome position. Curr Biol, 31(6), 1206-1220 e1205. doi:10.1016/j.cub.2021.01.002

      Jonsdottir, A. B., Dirks, R. W., Vrolijk, J., Ogmundsdottir, H. M., Tanke, H. J., Eyfjord, J. E., & Szuhai, K. (2010). Centriole movements in mammalian epithelial cells during cytokinesis. BMC Cell Biol, 11, 34. doi:10.1186/1471-2121-11-34

      Krishnan, N., Swoger, M., Rathbun, L. I., Fioramonti, P. J., Freshour, J., Bates, M., . . . Hehnly, H. (2022). Rab11 endosomes and Pericentrin coordinate centrosome movement during preabscission in vivo. Life Sci Alliance, 5(7). doi:10.26508/lsa.202201362

      Liang, X., Weberling, A., Hii, C. Y., Zernicka-Goetz, M., & Buckley, C. E. (2022). E-cadherin mediates apical membrane initiation site localisation during de novo polarisation of epithelial cavities. EMBO J, 41(24), e111021. doi:10.15252/embj.2022111021

      Lin, Y. N., Wu, C. T., Lin, Y. C., Hsu, W. B., Tang, C. J., Chang, C. W., & Tang, T. K. (2013). CEP120 interacts with CPAP and positively regulates centriole elongation. J Cell Biol, 202(2), 211-219. doi:10.1083/jcb.201212060

      Mangan, A. J., Sietsema, D. V., Li, D., Moore, J. K., Citi, S., & Prekeris, R. (2016). Cingulin and actin mediate midbody-dependent apical lumen formation during polarization of epithelial cells. Nat Commun, 7, 12426. doi:10.1038/ncomms12426

      Martin, M., Veloso, A., Wu, J., Katrukha, E. A., & Akhmanova, A. (2018). Control of endothelial cell polarity and sprouting angiogenesis by non-centrosomal microtubules. Elife, 7. doi:10.7554/eLife.33864

      Mazo, G., Soplop, N., Wang, W. J., Uryu, K., & Tsou, M. F. (2016). Spatial Control of Primary Ciliogenesis by Subdistal Appendages Alters Sensation-Associated Properties of Cilia. Dev Cell, 39(4), 424-437. doi:10.1016/j.devcel.2016.10.006

      Piel, M., Nordberg, J., Euteneuer, U., & Bornens, M. (2001). Centrosome-dependent exit of cytokinesis in animal cells. Science, 291(5508), 1550-1553. doi:10.1126/science.1057330

      Rodriguez-Fraticelli, A. E., Auzan, M., Alonso, M. A., Bornens, M., & Martin-Belmonte, F. (2012). Cell confinement controls centrosome positioning and lumen initiation during epithelial morphogenesis. J Cell Biol, 198(6), 1011-1023. doi:10.1083/jcb.201203075

      Schmoranzer, J., Fawcett, J. P., Segura, M., Tan, S., Vallee, R. B., Pawson, T., & Gundersen, G. G. (2009). Par3 and dynein associate to regulate local microtubule dynamics and centrosome orientation during migration. Curr Biol, 19(13), 1065-1074. doi:10.1016/j.cub.2009.05.065

      Tanos, B. E., Yang, H. J., Soni, R., Wang, W. J., Macaluso, F. P., Asara, J. M., & Tsou, M. F. (2013). Centriole distal appendages promote membrane docking, leading to cilia initiation. Genes Dev, 27(2), 163-168. doi:10.1101/gad.207043.112

      Tateishi, K., Yamazaki, Y., Nishida, T., Watanabe, S., Kunimoto, K., Ishikawa, H., & Tsukita, S. (2013). Two appendages homologous between basal bodies and centrioles are formed using distinct Odf2 domains. J Cell Biol, 203(3), 417-425. doi:10.1083/jcb.201303071

      Tsai, J. J., Hsu, W. B., Liu, J. H., Chang, C. W., & Tang, T. K. (2019). CEP120 interacts with C2CD3 and Talpid3 and is required for centriole appendage assembly and ciliogenesis. Sci Rep, 9(1), 6037. doi:10.1038/s41598-019-42577-0

      Tuncay, H., Brinkmann, B. F., Steinbacher, T., Schurmann, A., Gerke, V., Iden, S., & Ebnet, K. (2015). JAM-A regulates cortical dynein localization through Cdc42 to control planar spindle orientation during mitosis. Nat Commun, 6, 8128. doi:10.1038/ncomms9128

      Vinopal, S., Dupraz, S., Alfadil, E., Pietralla, T., Bendre, S., Stiess, M., . . . Bradke, F. (2023). Centrosomal microtubule nucleation regulates radial migration of projection neurons independently of polarization in the developing brain. Neuron, 111(8), 1241-1263 e1216. doi:10.1016/j.neuron.2023.01.020

      Zimmerman, W. C., Sillibourne, J., Rosa, J., & Doxsey, S. J. (2004). Mitosis-specific anchoring of gamma tubulin complexes by pericentrin controls spindle organization and mitotic entry. Mol Biol Cell, 15(8), 3642-3657. doi:10.1091/mbc.e03-11-0796

    1. eLife Assessment

      This study uses a novel 3D imaging method to identify the Periportal Lamellar Complex (PLC), an important new structure. Although the methodological advancement and morphological descriptions are convincing, the evidence for its proposed function is incomplete, relying on transcriptomic correlation rather than direct experimental validation. The work would therefore be strengthened by focusing its claims on the robust methodological advancement and detailed morphological characterization.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript, Chengjian Zhao et al. focused on the interactions between vascular, biliary, and neural networks in the liver microenvironment, addressing the critical bottleneck that the lack of high-resolution 3D visualization has hindered understanding of these interactions in liver disease.

      Strengths:

      This study developed a high-resolution multiplex 3D imaging method that integrates multicolor metallic compound nanoparticle (MCNP) perfusion with optimized CUBIC tissue clearing. This method enables the simultaneous 3D visualization of spatial networks of the portal vein, hepatic artery, bile ducts, and central vein in the mouse liver. The authors reported a perivascular structure termed the Periportal Lamellar Complex (PLC), which is identified along the portal vein axis. This study clarifies that the PLC comprises CD34⁺Sca-1⁺ dual-positive endothelial cells with a distinct gene expression profile, and reveals its colocalization with terminal bile duct branches and sympathetic nerve fibers under physiological conditions.

      Comments on revisions:

      The authors very nicely addressed all concerns from this reviewer. There are no further concerns or comments.

    3. Reviewer #2 (Public review):

      Summary:

      The present manuscript of Xu et al. reports a novel clearing and imaging method focusing on the liver. The authors simultaneously visualized the portal vein, hepatic artery, central vein, and bile duct systems by injecting metal compound nanoparticles (MCNPs) with different colors into the portal vein, heart left ventricle, inferior vena cava, and the extrahepatic bile duct, respectively. The method involves: trans-cardiac perfusion with 4% PFA, the injection of MCNPs with different colors, clearing with the modified CUBIC method, cutting 200 micrometer thick slices by vibratome, and then microscopic imaging. The authors also perform various immunostaining (DAB or TSA signal amplification methods) on the tissue slices from MCNP-perfused tissue blocks. With the application of this methodical approach, the authors report dense and very fine vascular branches along the portal vein. The authors name them as 'periportal lamellar complex (PLC)' and report that PLC fine branches are directly connected to the sinusoids. The authors also claim that these structures co-localize with terminal bile duct branches and sympathetic nerve fibers, and contain endothelial cells with a distinct gene expression profile. Finally, the authors claim that PLC-s proliferate in liver fibrosis (CCl4 model) and act as a scaffold for proliferating bile ducts in ductular reaction and for ectopic parenchymal sympathetic nerve sprouting.

      Strengths:

      The simultaneous visualization of different hepatic vascular compartments and their combination with immunostaining is a potentially interesting novel methodological approach.

      Weaknesses:

      This reviewer has some concerns about the validity of the microscopic/morphological findings as well as the transcriptomics results, and suggests that the conclusions of the paper may be critically viewed. Namely, at this point, it is still not fully clear whether the 'periportal lamellar complex (PLC)' that the authors describe really exists as a distinct anatomical or functional unit, or whether these are fine portal branches that connect the larger portal veins to the adjacent sinusoids. Also, in my opinion, the only way to identify the molecular characteristics of such small and spatially highly organized structures as those fine radial portal branches is to perform high-resolution spatial transcriptomics (instead of data mining in an existing liver single-cell database and performing Venn diagram intersection analysis in hepatic endothelial subpopulations). Yet, the existence of such structures with a distinct molecular profile cannot be excluded. Further research with advanced imaging and omics techniques (such as high-resolution volume imaging and spatial transcriptomics/proteomics) is needed to reproduce these initial findings.

    4. Reviewer #3 (Public review):

      Summary:

      In the revised version of the manuscript, the authors addressed multiple comments, clarifying especially the methodological part of their work and PLC identification as a novel morphological feature of the adult liver portal veins. The text is now also much clearer and has better flow.

      The additional assessment of the smartSeq2 data from Pietilä et al., 2025 strengthens the transcriptomic profiling of the CD34+Sca1+ cells and the discussion of the possible implications for liver homeostasis and injury response. While it may suffer from a similar bias as other scRNA-seq datasets (multiple cell fate signatures arising from mRNA contamination from proximal cells during dissociation), it is less likely that such contamination would yield such similar results.

      Nevertheless, a more thorough assessment by functional experimental approaches is needed to decipher the functional molecules and definite protein markers before establishing the PLC as the key hub governing the activity of biliary, arterial, and neuronal liver systems.

      The work does bring a clear new insight into the liver structure and functional units and greatly improves the methodological toolbox to study it even further, and thus fully deserves the attention of the eLife readers.

      Strengths:

      The authors clearly demonstrate an improved technique tailored to the visualization of the liver vasculo-biliary architecture in unprecedented resolution.

      This work proposes a new morphological feature of adult liver facilitating interaction between the portal vein, hepatic arteries, biliary tree, and intrahepatic innervation, centered at previously underappreciated protrusions of the portal veins - the Periportal Lamellar Complexes (PLCs).

      Weaknesses:

      The importance of CD34+Sca1+ endothelial cell subpopulation for PLC formation and function was not tested and warrants further validation.

    5. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, Chengjian Zhao et al. focused on the interactions between vascular, biliary, and neural networks in the liver microenvironment, addressing the critical bottleneck that the lack of high-resolution 3D visualization has hindered understanding of these interactions in liver disease.

      Strengths:

      This study developed a high-resolution multiplex 3D imaging method that integrates multicolor metallic compound nanoparticle (MCNP) perfusion with optimized CUBIC tissue clearing. This method enables the simultaneous 3D visualization of spatial networks of the portal vein, hepatic artery, bile ducts, and central vein in the mouse liver. The authors reported a perivascular structure termed the Periportal Lamellar Complex (PLC), which is identified along the portal vein axis. This study clarifies that the PLC comprises CD34⁺Sca-1⁺ dual-positive endothelial cells with a distinct gene expression profile, and reveals its colocalization with terminal bile duct branches and sympathetic nerve fibers under physiological conditions.

      Weaknesses:

      This manuscript is well-written, organized, and informative. However, there are some points that need to be clarified.

      (1) After MCNP-dye injection, does it remain in the blood vessels, adsorb onto the cell surface, or permeate into the cells? Does the MCNP-dye have cell selectivity?

      The experimental results showed that after injection, the MCNP series nanoparticles predominantly remained within the lumens of blood vessels and bile ducts, with their tissue distribution determined by physical perfusion. No diffusion of the dye signal into the surrounding parenchymal tissue was observed, nor was there any evidence of adsorption onto the cell surface or entry into cells. The newly added Supplementary Figure S2A–H further confirmed this feature, demonstrating that the dye signals were strictly confined to the luminal space, clearly delineating the continuous course of blood vessels and the branching morphology of bile ducts. These findings strongly support the conclusion that “MCNP dyes are distributed exclusively within the luminal compartments.”

      Therefore, the MCNP dyes primarily serve as intraluminal tracers within the tissue rather than as labels for specific cell types.

      (2) All MCNP-dyes were injected after the mice were sacrificed, and the mice's livers were fixed with PFA. After the blood flow had ceased, how did the authors ensure that the MCNP-dyes were fully and uniformly perfused into the microcirculation of the liver?

      Thank you for the reviewer’s valuable comments. Indeed, since all MCNP dyes were perfused after the mice were euthanized and blood circulation had ceased, we cannot fully ensure a homogeneous distribution of the dye within the hepatic microcirculation. The vascular labeling technique based on metallic nanoparticle dyes used in this study offers clear imaging, stable fluorescence intensity, and multiplexing advantages; however, it also has certain limitations. The main issue is that the dye distribution within the hepatic parenchyma can be affected by factors such as lobular overlap, local tissue compression, and variations in vascular pathways, resulting in regional inhomogeneity of dye perfusion. This is particularly evident in areas where multiple lobes converge or where anatomical structures are complex, leading to local dye accumulation or over-perfusion.

      In our experiments, we attempted to minimize local blockage or over-perfusion by performing PBS pre-flushing and low-pressure, constant-speed perfusion. Nevertheless, localized dye accumulation or uneven distribution may still occur in lobe junctions or structurally complex regions. Such variation represents one of the methodological limitations. Overall, the dye signals in most samples remained confined to the vascular and biliary lumens, and the distribution pattern was highly reproducible.

      We have addressed this issue in the Discussion section but would like to emphasize here that, although this system has clear advantages, it remains sensitive to anatomical variability in the liver—such as lobular overlap and vascular heterogeneity. At vascular junctions, local perfusion inhomogeneity or dye accumulation may occur; therefore, injection strategies and perfusion parameters should be adjusted according to liver size and vascular condition to improve reproducibility and imaging quality. It should also be noted that the results obtained using this method primarily aim to visualize the overall and fine anatomical structures of the hepatic vascular system rather than to quantitatively reflect hemodynamic processes. In the future, we plan to combine in vivo perfusion or dynamic fluid modeling to further validate the diffusion characteristics of the dyes within the hepatic microcirculation.

      (3) It is advisable to present additional 3D perspective views in the article, as the current images exhibit very weak 3D effects. Furthermore, it would be better to supplement with some videos to demonstrate the 3D effects of the stained blood vessels.

      Thank you for the reviewer’s valuable comments. In response to the suggestion, we have added perspective-rendered images generated from the 3D staining datasets to provide a more intuitive visualization of the spatial morphology of the hepatic vasculature. These images have been included in Figure S2A–J. In addition, we have prepared supplementary videos (available upon request) that dynamically display the three-dimensional distribution of the stained vessels, further enhancing the spatial perception and visualization of the results.

      (4) In Figure 1-I, the authors used MCNP-Black to stain the central veins; however, in addition to black, there are also yellow and red stains in the image. The authors need to explain what these stains are in the legend.

      Thank you for the reviewer’s constructive comment. In Figure 1I, MCNP-Black labels the central vein (black), MCNP-Yellow labels the portal vein (yellow), MCNP-Pink labels the hepatic artery (pink), and MCNP-Green labels the bile duct (green). We have revised the Figure 1 legend to include detailed descriptions of the color signals and their corresponding structures to avoid any potential confusion.

      (5) There is a typo in the title of Figure 4F; it should be "stem cell".

      Thank you for the reviewer’s careful correction. We have corrected the spelling error in the title of Figure 4F to “stem cell” and updated it in the revised manuscript.

      (6) Nuclear staining is necessary in immunofluorescence staining, especially for Figure 5e. This will help readers distinguish whether the green color in the image corresponds to cells or dye deposits.

      We thank the reviewer for the valuable suggestion. We understand that nuclear staining can help determine the origin of fluorescence signals. However, in our three-dimensional imaging system, the deep signal acquisition range after tissue clearing often causes nuclear dyes such as DAPI to generate highly dense and widespread fluorescence, especially in regions rich in vascular structures, which can obscure the fine vascular and perivascular details of interest. Therefore, this study primarily focuses on high-resolution visualization of the spatial architecture of the vascular and biliary systems. We have added an explanation regarding this point in Figures S2I–J.

      Reviewer #2 (Public review):

      Summary:

      The present manuscript of Xu et al. reports a novel clearing and imaging method focusing on the liver. The authors simultaneously visualized the portal vein, hepatic artery, central vein, and bile duct systems by injecting metal compound nanoparticles (MCNPs) with different colors into the portal vein, heart left ventricle, inferior vena cava, and the extrahepatic bile duct, respectively. The method involves: trans-cardiac perfusion with 4% PFA, the injection of MCNPs with different colors, clearing with the modified CUBIC method, cutting 200 micrometer thick slices by vibratome, and then microscopic imaging. The authors also perform various immunostaining (DAB or TSA signal amplification methods) on the tissue slices from MCNP-perfused tissue blocks. With the application of this methodical approach, the authors report dense and very fine vascular branches along the portal vein. The authors name them as 'periportal lamellar complex (PLC)' and report that PLC fine branches are directly connected to the sinusoids. The authors also claim that these structures co-localize with terminal bile duct branches and sympathetic nerve fibers, and contain endothelial cells with a distinct gene expression profile. Finally, the authors claim that PLC-s proliferate in liver fibrosis (CCl4 model) and act as a scaffold for proliferating bile ducts in ductular reaction and for ectopic parenchymal sympathetic nerve sprouting.

      Strengths:

      The simultaneous visualization of different hepatic vascular compartments and their combination with immunostaining is a potentially interesting novel methodological approach.

      Weaknesses:

      This reviewer has several concerns about the validity of the microscopic/morphological findings as well as the transcriptomics results. In this reviewer's opinion, the introduction contains overstatements regarding the potential of the method, there are severe caveats in the method descriptions, and several parts of the Results are not fully supported by the documentation. Thus, the conclusions of the paper may be critically viewed in their present form and may need reconsideration by the authors.

      We sincerely thank the reviewer for the thorough evaluation and constructive comments on our study. We fully understand and appreciate the reviewer’s concerns regarding the methodological validity and interpretation of the results. In response, we have made comprehensive revisions and additions to the manuscript as follows:

      First, we have carefully revised the Introduction and Discussion sections to provide a more balanced description of the methodological potential, removing statements that might be considered overstated, and clarifying the applicable scope and limitations of our approach (see the revised Introduction and Discussion).

      Second, we have substantially expanded the Methods section with detailed information on model construction, imaging parameters, data processing workflow, and technical aspects of the single-cell transcriptomic reanalysis, to enhance the transparency and reproducibility of the study.

      Third, we have added additional references and explanatory notes in the Results section to better support the main conclusions (see Section 6 of the Results).

      Finally, we have rechecked and validated all experimental data, and conducted a verification analysis using an independent single-cell RNA-seq dataset (Figure S6). The results confirm that the morphological observations and transcriptomic findings are consistent and reproducible across independent experiments.

      We believe these revisions have greatly strengthened the reliability of our conclusions and the overall scientific rigor of the manuscript. Once again, we sincerely appreciate the reviewer’s valuable comments, which have been very helpful in improving the logic and clarity of our work.

      Reviewer #3 (Public review):

      Summary:

      In the reviewed manuscript, researchers aimed to overcome the obstacles of high-resolution imaging of intact liver tissue. They report successful modification of the existing CUBIC protocol into Liver-CUBIC, a high-resolution multiplex 3D imaging method that integrates multicolor metallic compound nanoparticle (MCNP) perfusion with optimized liver tissue clearing, significantly reducing clearing time and enabling simultaneous 3D visualization of the portal vein, hepatic artery, bile ducts, and central vein spatial networks in the mouse liver. Using this novel platform, the researchers describe a previously unrecognized perivascular structure they termed Periportal Lamellar Complex (PLC), regularly distributed along the portal vein axis. The PLC originates from the portal vein and is characterized by a unique population of CD34⁺Sca-1⁺ dual-positive endothelial cells. Using available scRNAseq data, the authors assessed the CD34⁺Sca-1⁺ cells' expression profile, highlighting the mRNA presence of genes linked to neurodevelopment, biliary function, and hematopoietic niche potential. Different aspects of this analysis were then addressed by protein staining of selected marker proteins in the mouse liver tissue. Next, the authors addressed how the PLC and biliary system react to CCl4-induced liver fibrosis, implying PLC dynamically extends, acting as a scaffold that guides the migration and expansion of terminal bile ducts and sympathetic nerve fibers into the hepatic parenchyma upon injury.

      The work clearly demonstrates the usefulness of the Liver-CUBIC technique and the improvement of both resolution and complexity of the information, gained by simultaneous visualization of multiple vascular and biliary systems of the liver at the same time. The identification of PLC and the interpretation of its function represent an intriguing set of observations that will surely attract the attention of liver biologists as well as hepatologists; however, some claims need more thorough assessment by functional experimental approaches to decipher the functional molecules and the sequence of events before establishing the PLC as the key hub governing the activity of biliary, arterial, and neuronal liver systems. Similarly, the level of detail of the methods section does not appear to be sufficient to exactly recapitulate the performed experiments, which is of concern, given that the new technique is a cornerstone of the manuscript.

      Nevertheless, the work does bring a clear new insight into the liver structure and functional units and greatly improves the methodological toolbox to study it even further, and thus fully deserves the attention of readers.

      Strengths:

      The authors clearly demonstrate an improved technique tailored to the visualization of the liver vasculo-biliary architecture in unprecedented resolution.

      This work proposes a new biological framework between the portal vein, hepatic arteries, biliary tree, and intrahepatic innervation, centered at previously underappreciated protrusions of the portal veins - the Periportal Lamellar Complexes (PLCs).

      Weaknesses:

      Possible overinterpretation of the CD34+Sca1+ findings, which were built on re-analysis of a single scRNAseq dataset.

      Lack of detail in the materials and methods section greatly limits the usefulness of the new technique to other researchers.

      We thank the reviewer for this important comment. We agree that when conclusions are mainly based on a single dataset, overinterpretation should be avoided. In response to this concern, we have carefully re-evaluated and clearly limited the scope of our interpretation of the scRNA-seq analysis. In addition, we performed a validation analysis using an independent single-cell RNA-seq dataset (see new Figure S6), which consistently confirmed the presence and characteristic transcriptional profile of the periportal CD34⁺Sca1⁺ endothelial cell population. These supplementary analyses strengthen the robustness of our findings and address the reviewer’s concern regarding potential overinterpretation.

      In the revised manuscript, we have also greatly expanded the Materials and Methods section by providing detailed information on sample preparation, imaging parameters, data processing workflow, and single-cell reanalysis procedures. These revisions substantially improve the transparency and reproducibility of our methodology, thereby enhancing the usability and reference value of this technique for other researchers.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Introduction

      (1) In general, the Introduction is very lengthy and repetitive. It needs extensive shortening to a maximum of 2 A4 pages.

      We thank the reviewer for the valuable suggestions. We have thoroughly condensed and restructured the Introduction, removing redundant content and merging related paragraphs to make the theme more focused and the logic clearer. The revised Introduction has been shortened to within two A4 pages, emphasizing the scientific question, innovation, and technical approach of the study.

      (2) Please correct this erroneous sentence:

      '...the liver has evolved the most complex and densely n organized vascular network in the body, consisting primarily of the portal vein system, central vein system, hepatic artery system, biliary system, and intrahepatic autonomic nerve network [6, 7].'

      We thank the reviewer for pointing out this spelling error. The revised sentence is as follows:

      “…the liver has evolved the most complex and densely organized ductal-vascular network in the body, consisting primarily of the portal vein system, central vein system, hepatic artery system, biliary system, and intrahepatic autonomic nerve network [6, 7].”

      (3) '...we achieved a 63.89% improvement in clearing efficiency and a 20.12% increase in tissue transparency'

      Please clarify what you exactly mean by 'clearing efficiency' and 'increased tissue transparency'.

      We thank the reviewer for the valuable comments and have clarified the relevant terminology in the revised manuscript.

      “Clearing efficiency” refers to the reduction in the time required for the liver tissue to become completely transparent when treated with the optimized Liver-CUBIC protocol (40% urea + H₂O₂), compared with the conventional CUBIC method. In this study, the clearing time was reduced from 9 days to 3.25 days, that is, a 63.89% reduction in clearing time ((9 − 3.25) / 9), which we report as the improvement in clearing efficiency.

      “Tissue transparency” refers to the ability of the cleared tissue to transmit visible light. We quantified the optical transparency by measuring light transmittance across the 400–900 nm wavelength range using a microplate reader. The results showed that the average transmittance increased by 20.12%, indicating that Liver-CUBIC treatment markedly enhanced the optical clarity of the liver tissue.
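      As an illustrative check of the arithmetic behind the clearing-efficiency figure, the short sketch below recomputes the percentage from the two clearing times stated above (9 days and 3.25 days). The function name `percent_reduction` is a hypothetical helper introduced only for this example; it is not part of the authors' analysis pipeline, and the transmittance figure is not recomputed because the underlying measurements are not given here.

```python
def percent_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction from `baseline` to `improved`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline * 100.0


# Clearing times stated in the response:
# 9 days (conventional CUBIC) versus 3.25 days (Liver-CUBIC).
reduction = percent_reduction(9.0, 3.25)
print(f"Reduction in clearing time: {reduction:.2f}%")  # 63.89%
```

      Under this reading, "clearing efficiency" is the fractional saving in clearing time relative to the conventional protocol, not a speed-up factor (which would be 9 / 3.25, roughly 2.8-fold).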

      (4) I am concerned about claiming this imaging method as real '3D imaging'. Namely, while the authors clear full lobes, they actually cut the cleared lobes into 200-micrometer-thick slices and perform further microscopy imaging on these slices. Considering that they focus on ductular structures of the liver (such as vasculature, bile duct system, and innervations), 200 micrometers allows a very limited 3D overview, particularly in comparison with the whole-mount immuno-imaging methods combined with light sheet microscopy (such as Adori 2021, Liu 2021, etc). In this context, I feel several parts of the Introduction to be an overstatement: besides emphasizing the advantages of the technique (such as simultaneous visualization of different hepatic vascular compartments and the bile duct system by MCNPs, the combination with immunostainings), the authors must honestly discuss the limitations (such as limited tissue overview, potential dye perfusion problems - uneven distribution of the dye etc).

      We appreciate the reviewer’s insightful comments. It is true that most of the imaging depth in this study was limited to approximately 200 μm, and thus it could not achieve whole-liver three-dimensional imaging comparable to light-sheet microscopy. However, the primary focus of our study was to resolve the microscopic intrahepatic architecture, particularly the spatial relationships among blood vessels, bile ducts, and nerve fibers. Through high-resolution imaging of thick tissue sections, combined with MCNP-based multichannel labeling and immunofluorescence co-staining, we were able to accurately delineate the three-dimensional distribution of these microstructures within localized regions.

      In addition to thick-section imaging, we also obtained whole-lobe dye perfusion data (as shown in Figure S1F), which comprehensively depict the three-dimensional branching patterns and distribution of the vascular systems within the liver lobe. These images were acquired from intact liver lobes perfused with MCNP dyes, revealing a continuous vascular network extending from major trunks to peripheral branches, thereby demonstrating that our approach is also capable of achieving organ-level visualization.

      We have added this image and a corresponding description in the revised manuscript to more comprehensively present the coverage of our imaging system, and we have incorporated this clarification into the Discussion section.

      Method

      (5) More information may be needed about MCNPs:

      a) As reported, there are nanoparticles with different colors in brightfield microscopy, but the particles are also excitable in fluorescence microscopy. Would you please provide a summary about excitation/emission wavelengths of the different MCNPs? This is crucial to understand to what extent the method is compatible with fluorescence immunohistochemistry.

      We thank the reviewer for the careful attention and professional suggestion. We fully agree that this issue is critical for evaluating the compatibility of our method with fluorescent immunohistochemistry. Different types of metal compound nanoparticles (MCNPs) have clearly distinguishable spectral properties:

      - MCNP-Green and MCNP-Yellow: AF488-matched spectra, with excitation/emission wavelengths of 495/519 nm.

      - MCNP-Pink: Designed for far-red spectra, with excitation/emission wavelengths of 561/640 nm.

      - MCNP-Black: Non-fluorescent, appearing black under bright-field microscopy only.

      The above information has been added to the Materials and Methods section.

      b) Also, is there more systematic information available concerning the advantage of these particles compared to 'traditional' fluorescence dyes, such as Alexa fluor or Cy-dyes, in fluorescence microscopy and concerning their compatibility with various tissue clearing methods (e.g., with the frequently used organic-solvent-based methods)?

      We thank the reviewer for the detailed question. Compared with conventional organic fluorescent dyes, MCNP offers the following advantages:

      - Enhanced photostability: Its inorganic core-shell structure resists fading even after hydrogen peroxide bleaching.

      - High signal stability: Fluorescence is maintained during aqueous-based clearing (e.g., CUBIC) and multiple rounds of staining without quenching.

      In our Liver-CUBIC system, MCNP nanoparticles exhibited excellent multi-channel labeling stability and fluorescence signal retention. Regarding compatibility with other clearing methods (e.g., SCAFE, SeeDB, CUBIC), these methods have limited effectiveness for whole-liver clearing (see Figure 2 of Tainaka et al., 2014) and cannot meet the requirements for high-resolution microstructural imaging in this study; we therefore consider further testing of their compatibility unnecessary.

      In summary, MCNP dye demonstrates superior signal stability and spectral separation compared with conventional organic fluorescent dyes in multi-channel, long-term, high-transparency three-dimensional tissue imaging.

      c) When you perfuse these particles, to which structures do they bind inside the ducts (vessels, bile ducts)? Is the 48h post-fixation enough to keep them inside the tubes/bind them to the vessel walls? Is there any 'wash-out' during the complex cutting/staining procedure? E.g., in Figure 2D: the 'classical' hepatic artery in the portal triad is not visible - but the MCNP apparently penetrated to the adjacent sinusoids at the edge of the lobulus. Also, in Figure 3B, there is a significant mismatch between the MCNP-green (bile duct) signal and the CK19 (epithelium marker) immunostaining. Please discuss these.

      The experimental results showed that following injection, MCNP nanoparticles primarily remained within the vascular and biliary lumens, and their tissue distribution depended on physical perfusion. No dye signal was observed to diffuse into the surrounding parenchyma, nor did the particles adhere to cell surfaces or enter cells. The newly added Supplementary Figures S2A–H further confirm this feature: the dye signal is strictly confined within the lumens, clearly delineating continuous vascular paths and biliary branching patterns, strongly supporting the conclusion that “MCNP dye is distributed only within luminal spaces.”

      Thus, MCNP dye mainly serves as an intraluminal tracer rather than a label for specific cell types.

      We provide the following explanations and analyses regarding MCNP distribution in the hepatic vascular and biliary systems and its post-fixation stability:

      - Potential signal displacement during sectioning/immunostaining: During slicing and immunostaining, a small number of particles may be washed away due to mechanical cutting or washing steps; however, the overall three-dimensional structure retains high spatial fidelity.

      - Observation in Figure 2D: MCNP was seen entering the sinusoidal spaces at the lobule periphery, but hepatic arteries were not visible, likely due to limitations in section thickness. Although arteries were not apparent in this slice, arterial distribution around the portal vein is visible in Figure 2C. It should be noted that Figures 2C, D, and E do not represent whole-liver imaging, so not all regions necessarily contain visible hepatic arteries. For easier identification, the main hepatic artery trunk is highlighted in cyan in Figure 2E.

      - Incomplete biliary signal in Figure 3B: This may be because CK19 labeling only covers biliary epithelial cells, whereas MCNP-green distributes throughout the biliary lumen. In Figure 3B, the terminal MCNP-green signal exhibits irregular polygonal structures, which we interpret as the canalicular regions.

      (6) Which fixative was used for 48h of postfixation (step 6) after MCNP injections?

      After MCNP injection, mouse livers were post-fixed in 4% paraformaldehyde (PFA) for 48 hours. This fixation condition effectively “locks” the MCNP particles within the vascular and biliary lumens, maintaining their spatial positions, while also being compatible with subsequent sectioning and multi-channel immunostaining analyses.

      The above information has been added to the Materials and Methods section.

      (7) What is the 'desired thickness' in step 7? In the case of immunostained tissue, a 200-micrometer slice thickness is mentioned. However, based on the Methods, it is not completely clear what the actual thickness of the tissue was that was examined ultimately in the microscopes, and whether or not the clearing preceded the cutting or vice versa.

      We appreciate the reviewer’s question. The “desired thickness” referred to in step 7 of the manuscript corresponds to the thickness of tissue sections used for immunostaining and high-resolution microscopic imaging, which is typically around 200 µm. We selected 200 µm because this thickness is sufficient to observe the PLC structure in its entirety, allows efficient staining, and preserves tissue architecture well. Other researchers may choose different section thicknesses according to their experimental needs.

      In this study, the processing order for immunostained tissue samples was sectioning followed by clearing, as detailed below:

      Section Thickness

      To ensure antibody penetration and preservation of three-dimensional structure, tissue sections were typically cut to ~200 µm. Thicker sections can be used if more complete three-dimensional structures are required, but adjustments may be needed based on antibody penetration and fluorescence detection conditions.

      Clearing Sequence

      After sectioning, slices were processed using the Liver-CUBIC aqueous-based clearing system.

      (8) More information is needed concerning the 'deep-focus microscopy' (Keyence), the applied confocal system, and the THUNDER 'high resolution imaging system': basic technical information, resolutions, objectives (N.A., working distance), lasers/illumination, filters, etc.

      Imaging Systems and Settings

      VHX-6000 Extended Depth-of-Field Microscope: Objective: VH-Z100R, 100×–1000×; resolution: 1 µm (typical); illumination: coaxial reflected; transmitted illumination on platform: ON.

      Zeiss Confocal Microscope (980): Objectives: 20× or 40×; image size: 1024 × 1024. Fluorescence detection was set up in three channels:

      - Channel 1: 639 nm laser, excitation 650 nm, emission 673 nm, detection range 673–758 nm, corresponding to Cy5-T1 (red).

      - Channel 2: 561 nm laser, excitation 548 nm, emission 561 nm, detection range 547–637 nm, corresponding to Cy3-T2 (orange).

      - Channel 3: 488 nm laser, excitation 493 nm, emission 517 nm, detection range 490–529 nm, corresponding to AF488-T3 (green).

      Leica THUNDER Imager 3D Tissue: Fluorescence detection in two channels:

      - Channel 1: FITC channel (excitation 488 nm, emission ~520 nm).

      - Channel 2: Orange-red channel (excitation/emission 561/640 nm).

      Equipped with matching filter sets to ensure signal separation.

      The above information has been added to the Materials and Methods section.
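
      As a rough cross-check of spectral compatibility, the emission peaks reported for the MCNPs in response 5a can be compared against the confocal detection windows listed above. The sketch below is a minimal illustration: the `Channel` type and `channels_for` helper are hypothetical names, and testing only the emission peak (rather than the full emission spectrum) is a simplifying assumption.

```python
from typing import List, NamedTuple, Optional


class Channel(NamedTuple):
    name: str
    laser_nm: int
    detect_lo_nm: int
    detect_hi_nm: int


# Zeiss confocal detection windows as listed in the response.
CHANNELS = [
    Channel("Cy5-T1", 639, 673, 758),
    Channel("Cy3-T2", 561, 547, 637),
    Channel("AF488-T3", 488, 490, 529),
]

# Emission peaks (nm) as reported for each MCNP; MCNP-Black is non-fluorescent.
MCNP_EMISSION_PEAK = {
    "MCNP-Green": 519,
    "MCNP-Yellow": 519,
    "MCNP-Pink": 640,
    "MCNP-Black": None,
}


def channels_for(peak_nm: Optional[int]) -> List[str]:
    """Names of channels whose detection window contains the emission peak."""
    if peak_nm is None:
        return []
    return [c.name for c in CHANNELS
            if c.detect_lo_nm <= peak_nm <= c.detect_hi_nm]


for dye, peak in MCNP_EMISSION_PEAK.items():
    print(dye, channels_for(peak))
```

      With the values as listed, the 519 nm peak falls inside the AF488-T3 window, while the 640 nm peak of MCNP-Pink lies just beyond the 547–637 nm window of the Cy3-T2 channel. Because emission spectra are broad, a peak-only test of this kind is deliberately crude and says nothing about how much signal is actually collected.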

      (9) Liver-CUBIC, step 2: which lobe(s) did you clear (...whole liver lobes...).

      In this study, all liver lobes (left, right, caudate, and quadrate lobes) were subjected to Liver-CUBIC aqueous-based clearing to ensure uniform visualization of MCNP fluorescence and immunolabeling throughout the three-dimensional imaging of the entire liver.

      The above information has been added to the Materials and Methods section.

      (10) For the DAB and TSA IHC stainings, did you use free-floating slices, or did you mount the vibratome sections and do the staining on mounted sections?

      In this study, fixed livers were first sectioned into thick slices (~200 µm) using a vibratome. Subsequently, DAB and TSA immunohistochemical (IHC) staining were performed on free-floating sections. During the entire staining process, the slices were kept floating in the solutions, ensuring thorough antibody penetration in the thick sections while preserving the three-dimensional tissue architecture, thereby facilitating multiple rounds of staining and three-dimensional imaging.

      (11) Regarding the 'transmission quantification': this was measured on 1 mm thick slices. While it is interesting to make a comparison between different clearing methods in general, one must note that it is relatively easy to clear 1mm thick tissue slices with almost any kind of clearing technique and in any tissues. The 'real' differences come with thicker blocks, such as >5mm in the thinnest dimension. Do you have such experiences (e.g., comparison in whole 'left lateral liver lobes')?

      In this study, we performed three-dimensional visualization of entire liver lobes to depict the distribution of MCNPs and the overall spatial architecture of the vascular and biliary systems (Figure S1F). However, due to the limitations of the plate reader and fluorescence imaging systems in terms of spatial resolution and light penetration depth, quantitative analyses were conducted only on tissue sections approximately 1 mm thick.

      Regarding the comparative quantification of different clearing methods, as the reviewer noted, nearly all aqueous- or organic solvent–based clearing techniques can achieve relatively uniform transparency in 1 mm-thick tissue sections, so differences at this thickness are limited. We have not yet conducted systematic comparisons on whole-lobe sections thicker than 5 mm and therefore cannot provide “true” difference data for thicker tissues.

      (12) There is no method description for the ELMI studies in the Methods.

      Transmission Electron Microscopy (TEM) Analysis of MCNPs

      Before imaging, the MCNP dye solution was centrifuged at 14,000 × g for 10 minutes at 4 °C to remove aggregates and impurities. The supernatant was collected, diluted 50-fold, and 3–4 μL of the sample was applied onto freshly glow-discharged Quantifoil R1.2/1.3 copper grids (Electron Microscopy Sciences, 300 mesh). The sample was allowed to sit for 30 seconds to enable particle adsorption, after which excess liquid was gently wicked away with filter paper and the grid was air-dried at room temperature. The sample was then negatively stained with 1% uranyl acetate for 30 seconds and air-dried again before imaging.

      Negative-stain TEM images were acquired using a JEOL JEM-1400 transmission electron microscope operating at 120 kV and equipped with a CCD camera. Data acquisition followed standard imaging conditions.

      The above information has been added to the Materials and Methods section.

      (13) Please, provide a method description for the applied CCl4 cirrhosis model. This is completely missing.

      (1) Under a fume hood, carbon tetrachloride (CCl₄) was dissolved in corn oil at a 1:3 volume ratio to prepare a working solution, which was filtered through a 0.2 μm filter into a 30 mL glass vial. In our laboratory, to mimic chronic injury, mice in the experimental group were intraperitoneally injected at a dose of 1 mL/kg body weight per administration.

      (2) Mice were carefully removed from the cage and placed on a scale to record body weight for calculation of the injection volume.

      (3) The needle cap was carefully removed, and the required volume of the pre-prepared CCl₄ solution was drawn into the syringe. The syringe was gently flicked to remove any air bubbles.

      (4) Mice were placed on a textured surface (e.g., wire cage) and restrained. When the mouse was properly positioned, ideally with the head lowered about 30°, the left lower or right lower abdominal quadrant was identified.

      (5) Holding the syringe at a 45° angle, with the bevel facing up, the needle was inserted approximately 4–5 mm into the abdominal wall, and the calculated volume of CCl₄ was injected.

      (6) Mice were returned to their cage and observed for any signs of discomfort.

      (7) Needles and syringes were disposed of in a sharps container without recapping. A new syringe or needle was used for each mouse.

      (8) To establish a progressive liver fibrosis model, injections were administered twice per week (e.g., Monday and Thursday) for 3 or 6 consecutive weeks (n=3 per group). Control mice were injected with an equal volume of corn oil for 3 or 6 weeks (n=3 per group).

      (9) Forty-eight hours after the last injection, mice were euthanized by cervical dislocation, and livers were rapidly harvested. Portions of the liver were processed for paraffin embedding and histological sectioning, while the remaining tissue was either immediately frozen or used for subsequent molecular biology analyses.

      The above information has been added to the Materials and Methods section.
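
      The dosing arithmetic in the protocol above can be sketched as follows. This is an illustration rather than the authors' procedure: the function names are hypothetical, and the sketch assumes that the 1 mL/kg dose refers to the volume of the working solution (CCl₄ in corn oil), which the response does not state explicitly.

```python
def injection_volume_ul(body_weight_g: float, dose_ml_per_kg: float = 1.0) -> float:
    """Injection volume in microlitres for a mouse of the given weight in grams.

    1 mL/kg is numerically equal to 1 uL/g, so no unit conversion factor is needed.
    """
    if body_weight_g <= 0:
        raise ValueError("body weight must be positive")
    return dose_ml_per_kg * body_weight_g


def injection_count(weeks: int, per_week: int = 2) -> int:
    """Total number of injections for a schedule of `per_week` doses over `weeks`."""
    if weeks < 0 or per_week < 0:
        raise ValueError("weeks and per_week must be non-negative")
    return weeks * per_week


# Example: a 25 g mouse (hypothetical weight) on the 3-week and 6-week schedules.
print(injection_volume_ul(25.0))  # 25.0 (uL per administration)
print(injection_count(3), injection_count(6))  # 6 12
```

      Recomputing the volume from the body weight recorded before each injection, as in step (2), keeps the dose constant per kilogram as the animals' weight changes over the 3 or 6 weeks.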

      (14) Please provide a method description for the quantifications reported in Figures 5D, 5F, and 6E.

      ImageJ software was used to analyze 3D stained images (Figs. 5F, 6E), and the ultra-depth-of-field 3D analysis module was used to analyze 3D DAB images (Fig. 5D). The specific steps are as follows:

      Figure 5D: DAB-stained 3D images from the control group and the CCl₄ 6-week (CCl₄-6W) group were analyzed. For each group, 20 terminal bile duct branch nodes were randomly selected, and the actual path distance along the branch to the nearest portal vein surface was measured. All measurements were plotted as scatter plots to reflect the spatial extension of bile ducts relative to the portal vein under different conditions.

      Figure 5F: TSA 3D multiplex-stained images from the control group, CCl₄ 3-week (CCl₄-3W), and CCl₄ 6-week (CCl₄-6W) groups were analyzed. For each group, 5 terminal bile duct branch nodes were randomly selected, and the actual path distance along the branch to the nearest portal vein surface was measured. Measurements were plotted as scatter plots to illustrate bile duct spatial extension.

      Figure 6E: TSA 3D multiplex-stained images from the control, CCl₄-3W, and CCl₄-6W groups were analyzed. For each group, 5 terminal nerve branch nodes were randomly selected, and the actual path distance along the branch to the nearest portal vein surface was measured. Scatter plots were generated to depict the spatial distribution of nerves under different treatment conditions.
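
      The "actual path distance along the branch" used in these quantifications differs from the straight-line distance between a terminal node and the portal vein surface. The sketch below illustrates the distinction; it is not the ImageJ workflow the authors used. The function names, the traced coordinates, and the voxel size are all hypothetical, and the sketch assumes the branch has already been traced as an ordered list of voxel coordinates.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def path_length_um(points: Sequence[Point],
                   voxel_size_um: Point = (1.0, 1.0, 1.0)) -> float:
    """Sum of segment lengths along ordered (x, y, z) voxel coordinates, in um."""
    sx, sy, sz = voxel_size_um
    total = 0.0
    for (x0, y0, z0), (x1, y1, z1) in zip(points, points[1:]):
        total += math.sqrt(((x1 - x0) * sx) ** 2
                           + ((y1 - y0) * sy) ** 2
                           + ((z1 - z0) * sz) ** 2)
    return total


def straight_line_um(points: Sequence[Point],
                     voxel_size_um: Point = (1.0, 1.0, 1.0)) -> float:
    """Euclidean distance between the first and last traced points, in um."""
    return path_length_um([points[0], points[-1]], voxel_size_um)


# Hypothetical trace from a terminal branch node to the portal vein surface.
trace = [(0, 0, 0), (3, 4, 0), (3, 4, 12)]
print(path_length_um(trace))    # 17.0
print(straight_line_um(trace))  # 13.0
```

      For anisotropic stacks, passing the real voxel dimensions as `voxel_size_um` matters, because the z-step is often larger than the lateral pixel size.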

      (15) Please provide a method description for the human liver samples you used in Figure S6. Patient data, fixation, etc...

      The human liver tissue samples shown in Figure S6 were obtained from adjacent non-tumor liver tissues resected during surgical operations at West China Hospital, Sichuan University. All samples used were anonymized archived tissues, which were applied for scientific research in accordance with institutional ethical guidelines and did not involve any identifiable patient information. After being fixed in 10% neutral formalin for 24 hours, the tissues were routinely processed for paraffin embedding (FFPE), and sectioned into 4 μm-thick slices for immunostaining and fluorescence imaging.

      Results

      (16) While it is stated in the Methods that certain color MCNPs were used for labelling different structures (i.e., yellow: hepatic artery; green: bile duct; portal vein: pink; central veins: black), in some figures, apparently different color MCNPs are used for the respective structures. E.g., in Figure 1J, the artery is pink and the portal vein is green. Please clarify this.

      The color assignment of MCNP dyes is not fixed across different experiments or schematic illustrations. MCNP dyes of different colors are fundamentally identical in their physical and chemical properties and do not exhibit specific binding or affinity for particular vascular structures. We select different colors based on experimental design and imaging presentation needs to facilitate distinction and visualization, thereby enhancing recognition in 3D reconstruction and image display. Therefore, the color labeling in Figure 1F is primarily intended to illustrate the distribution of different vascular systems, rather than indicating a fixed correspondence to a specific dye or injection color.

      (17) In Figure 1J, the hepatic artery is extremely shrunk, while the portal vein is extremely dilated - compared to the physiological situation. Does it relate to the perfusion conditions?

      We appreciate the reviewer’s attention. Under normal physiological conditions, the hepatic arteries labeled by CD31 are naturally narrow. Therefore, the relatively thin hepatic arteries and thicker portal veins shown in Figure 1J are normal and unrelated to the perfusion conditions (see Figure 1E of Adori et al., 2021).

      (18) Re: MCNP-black labelled 'oval fenestrae': the Results state 50-100 nm, while they are apparently 5-10-micron diameter in Figure 1I. Accordingly, the comparison with the ELMI studies in the subsequent paragraph is inappropriate.

      We thank the reviewer for the correction. The previous statement was a typographical error. In fact, the diameter of the “elliptical windows” marked by MCNP-black is 5–10 μm, so the diameter of 5–10 μm shown in Figure 1I is correct.

      (19) Please, correct this erroneous sentence: 'Pink marked the hepatic arterial system by injection extrahepatic duct (Figure 2B).'

      Original sentence: “The hepatic arterial system was labeled in pink by injection through the extrahepatic duct (Figure 2B).”

      Revised sentence: “The hepatic arterial system was labeled in pink by injection through the left ventricle (Figure 2B).”

      (20) How do you define the 'primary portal vein tract'?

      We thank the reviewer for the question. The term “primary portal vein tract” refers to the first-order branches of the portal vein that enter the liver from the hepatic hilum. These are the major branches arising directly from the main portal vein trunk and are responsible for supplying blood to the respective hepatic lobes. This definition corresponds to the concept of the first-order portal vein in hepatic anatomy.

      (21) I am concerned about whether the 'periportal lamellar complex (PLC)' that the Authors describe really exists as a distinct anatomical or functional unit. I also see these in 3D scans - in my opinion, these are fine, lower-order portal vein branches that connect the portal veins to the adjacent sinusoid. The strong MCNP-labelling of these structures may be caused by the 'sticking' of the perfused MCNP solutions in these 'pockets' during the perfusion process. What do these structures look like with SMA or CD31 immunostaining? Also, one may consider that the anatomical evaluation of these structures may have limitations in tissue slices. Have you ever checked MCNP-perfused, cleared full liver lobes in light sheet microscope scans? I think this would be very useful to have a comprehensive morphological overview. Unfortunately, based on the presented documentation, I am also not convinced that PLCs 'co-localize' with fine terminal bile duct branches (Figure 3E, S3C), or with TH+ 'neuronal bead chain networks' (Fig 6C). More detailed and more convincing documentation is needed here.

      We thank the reviewer for the detailed comments. Regarding the existence and function of the periportal lamellar complex (PLC), our observations are based on MCNP-Pink labeling of the portal vein, through which we were able to identify the PLC structure surrounding the portal branches. It should be noted that the PLC represents a very small anatomical structure. Although we have not yet performed light-sheet microscopy scanning, we anticipate that such imaging would primarily visualize larger portal vein branches. Nevertheless, this does not affect our overall conclusions.

      We also appreciate the reviewer’s suggestion that the observed structures might result from MCNP adherence during perfusion. To verify the structural characteristics of the PLC, we performed immunostaining for SMA and CD31, which revealed a specific arrangement pattern of smooth muscle and endothelial markers rather than simple perfusion-induced deposition (Figures 4F and S6B).

      Regarding the apparent colocalization of the PLC with terminal bile duct branches (Figures 3E and S3C) and TH⁺ neuronal bead-like networks (Figure 6C), we acknowledge that current literature evidence remains limited. Therefore, we have carefully described these observations as possible spatial associations rather than definitive conclusions. Future studies integrating high-resolution three-dimensional imaging with functional analyses will help to further clarify the anatomical and physiological significance of the PLC.

      (22) 'Extended depth-of-field three-dimensional bright-field imaging revealed a strict 1:1 anatomical association between the primary portal vein trunk (diameter 280 ± 32 μm) and the first-order bile duct (diameter 69 ± 8 μm) (Figures 3A and S3A)'.

      How do you define '1:1 anatomical association'? How do you define and identify the 'order' (primary, secondary) of vessel and bile duct branches in 200-micrometer slices?

      We thank the reviewer for the question. In this study, the term “1:1 anatomical association” refers to the stable paired spatial relationship between the main portal vein trunk and its corresponding primary bile duct within the same portal territory. In other words, each main portal vein branch is accompanied by a primary bile duct of matching branching order and trajectory, together forming a “vascular–biliary bundle.”

      The definitions of “primary” and “secondary” branches were based on extended-depth 3D bright-field reconstructions, considering both branching hierarchy and vessel/duct diameters: primary branches arise directly from the main trunk at the hepatic hilum and exhibit the largest diameters (averaging 280 ± 32 μm for the portal vein and 69 ± 8 μm for the bile duct), whereas secondary branches extend from the primary branches toward the lobular interior with smaller calibers.

      (23) In my opinion, the applied methodical approach in the single cell transcriptomics part (data mining in the existing liver single cell database and performing Venn diagram intersection analysis in hepatic endothelial subpopulations) is largely inappropriate and thus, all the statements here are purely speculative. In my opinion, to identify the molecular characteristics of such small and spatially highly organized structures like those fine radial portal branches, the only way is to perform high-resolution spatial transcriptomics.

      We thank the reviewer for the comment. We fully acknowledge the importance of high-resolution spatial transcriptomics in identifying the fine structural characteristics of portal vein branches. Due to current funding and technical limitations, we were unable to perform such high-resolution spatial transcriptomic analyses. However, we validated the molecular features of the PLC using another publicly available liver single-cell RNA-sequencing dataset, which provided preliminary supporting evidence (Figures S6B and S6C). In the manuscript, we have carefully stated that this analysis is exploratory in nature and have avoided overinterpretation. In future studies, high-resolution spatial omics approaches will be invaluable for more precisely delineating the molecular characteristics of these fine structures.

      (24) 'How the autonomic nervous system regulates liver function in mice despite the apparent absence of substantive nerve fiber invasion into the parenchyma remains unclear.'

      Please consider the role of gap junctions between hepatocytes (e.g., Miyashita, 1991; Seseke, 1992).

      In this study, we analyzed the spatial distribution of hepatic nerves in mice using immunofluorescence staining and found that nerve fibers were almost exclusively confined to the portal vein region (Figure S6A). Notably, this distribution pattern differs markedly from that in humans. Previous studies have shown that, in human livers, nerves are not only located around the portal veins but also present along the central veins, interlobular septa, and within the parenchymal connective tissue (Miller et al., 2021; Yi, la Fleur, Fliers & Kalsbeek, 2010).

      Further research has provided a physiological explanation for this interspecies difference: even among species with distinct sympathetic innervation patterns in the parenchyma—i.e., with or without direct sympathetic input—the sympathetic efferent regulatory functions may remain comparable (Beckh, Fuchs, Ballé & Jungermann, 1990). This is because signals released from aminergic and peptidergic nerve terminals can be transmitted to hepatocytes through gap junctions as electrical signals (Hertzberg & Gilula, 1979; Jensen, Alpini & Glaser, 2013; Seseke, Gardemann & Jungermann, 1992; Taher, Farr & Adeli, 2017).

      However, the scarcity of nerve fibers within the mouse hepatic parenchyma suggests that the mechanisms by which the autonomic nervous system regulates liver function in mice may differ from those in humans. This observation prompted us to further investigate the potential role of PLC endothelial cells in this process.

      (25) Please, correct typos throughout the text.

      We thank the reviewer for this comment. We have carefully proofread the entire manuscript and corrected all typographical errors and minor language issues throughout the text.

      Reviewer #3 (Recommendations for the authors):

      (1) A strong recommendation - the authors ought to challenge their scRNA-seq re-analysis with another scRNA-seq dataset, namely a recently published atlas of adult liver endothelial, but also mesenchymal, immune, and parenchymal cell populations https://pubmed.ncbi.nlm.nih.gov/40954217/, performed with the Smart-seq2 approach, which is perfectly suitable as it brings higher resolution data, and extensive cluster identity validation with stainings. Pietilä et al. indicate a clear distinction of portal vein endothelial cells into two populations that express Adgrg6, Jag1 (e2c), from Vegfc double-positive populations (e5c and e2c). Moreover, the dataset also includes the arterial endothelial cells that were shown to be part of the PLC, but were not followed up with the scRNA-seq analysis. This distinction could help the authors to further validate their results, better controlling for cross-contaminations that may occur during scRNA-seq preparation.

      We thank the reviewer for the valuable suggestion. As noted, we have further validated the molecular characteristics of the PLC using a recently published atlas of adult liver endothelial cells (Pietilä et al., 2023, PMID: 40954217). This dataset, generated using the Smart-seq2 technique, provides high-resolution transcriptomic profiles. By analyzing this dataset, we identified a CD34⁺LY6A⁺ portal vein endothelial cell population within the e2 cluster, which is localized around the portal vein. We then examined pathways and gene expression patterns related to hematopoiesis, bile duct formation, and neural signaling within these cells. The results revealed gene enrichment patterns consistent with those observed in our primary dataset, further supporting the robustness of our analysis of the PLC’s molecular characteristics.

      (2) Improving the methods section is highly recommended, this includes more detailed information for material and protocols used - catalog numbers; protocol details of the usage - rocking platforms, timing, and tubes used for incubations; GitHub or similar page with code used for the scRNA seq re-analysis.

      We thank the reviewer for the valuable suggestion. We have added more detailed information regarding the materials and experimental procedures in the Methods section, including catalog numbers, incubation conditions (such as the type of shaker, incubation time, and tube specifications), and other relevant parameters.

      (3) In Figure 2A, the authors claim the size of the nanoparticle is 100nm, while based on the image, the size is ~150-180nm. A more thorough quantification of the particle size would help users estimate the usability of their method for further applications.

      We thank the reviewer for the comment. In the TEM image shown in Figure 2A, the nanoparticles indeed appear to be approximately 150–200 nm in size. We have re-verified the particle dimensions and will update the corresponding description in the Methods section to allow readers to more accurately assess the applicability of this approach.

      (4) In Figure 3E, it is not clear what is labeled by the pink signal. Please consider labeling the structures in the figure.

      We thank the reviewer for the valuable comment. The pink signal in Figure 3E was originally intended to label the hepatic artery. However, a slight spatial misalignment occurred during the labeling process, making its position appear closer to the central vein rather than the portal vein in the image. To avoid misunderstanding, we will add clear annotations to the image and clarify this deviation in the figure legend in the revised version. It should also be noted that this figure primarily aims to illustrate the spatial relationship between the bile duct and the portal vein, and this minor deviation does not affect the reliability of our experimental conclusions.

      (5) The following statement is not backed by quantification, as it ought to be: "Dual-channel three-dimensional confocal imaging combined with CK19 immunostaining revealed that the sites of dye leakage did not coincide with the CK19-positive terminal bile duct epithelium, but instead were predominantly localized within regions adjacent to the PLC structures".

      We thank the reviewer for the valuable comment. We have added the corresponding quantitative analysis to support this conclusion. Quantitative assessment of the extended-depth imaging data revealed that dye leakage predominantly occurred in regions adjacent to the PLC structure, rather than in the perivenous sinusoidal areas. The corresponding results have been presented in the revised Figure 3G.

      (6) Similarly, Figure 4F is central to the Sca1CD34 cell type identification but lacks any quantification; providing it would strengthen the key statement of the article. A possible way to approach this is also by FACS sorting the double-positive cells and bulk/qRT validation.

      We thank the reviewer for raising this point. We agree that quantitative validation of the Sca1⁺CD34⁺ population by FACS sorting could further support our conclusions. However, the primary focus of this study is on the spatial localization and transcriptional features of PLC endothelial cells. The identification of the Sca1⁺CD34⁺ subset is robustly supported by multiple complementary approaches, including three-dimensional imaging, co-staining with pan-endothelial markers, and projection mapping analyses. Collectively, these lines of evidence provide a solid basis for characterizing this unique endothelial population.

      (7) The images in Figure S4D are not comparable, as the Sca1-stained image shows a longitudinal section of the PV, but the other stainings are cross-sections of PVs.

      We thank the reviewer for the careful comment. We agree that the original Sca1-stained image, being a longitudinal section of the portal vein, was not optimal for direct comparison with other cross-sectional images. We have replaced it with a cross-sectional image of the portal vein to ensure comparability across all images. The updated image has been included in the revised Supplementary Figure S4D.

      (8) I might be wrong, but Figure 4J is entirely missing, and only a cartoon is provided. Either remove the results part or provide the data.

      We appreciate the reviewer’s careful observation. Figure 4J was intentionally designed as a schematic illustration to summarize the structural relationships and spatial organization of the portal vein, hepatic artery, and PLC identified in the previous panels (Figures 4A–4I). It does not represent newly acquired experimental data, but rather serves to provide a conceptual overview of the findings.

      To avoid misunderstanding, we have clarified this point in the figure legend and the main text, stating that Figure 4J is a schematic summary rather than an experimental image. Therefore, we respectfully prefer to retain the schematic figure to aid readers’ interpretation of the preceding results.

      (9) The methods section lacks information about the CCl4 concentration, and it is thus hard to estimate the dosage of CCl4 received (ml/kg). This is important for the interpretation of the severity of the fibrosis and presence of cirrhosis, as different doses may or may not lead to cirrhosis within the short regimen performed by the authors [PMID: 16015684 DOI: 10.3748/wjg.v11.i27.4167]. Validation of the fibrosis/cirrhosis severity is, in this case, crucial for the correct interpretation of the results. If the level of cirrhosis is not confirmed, only progressive fibrosis should be mentioned in the manuscript, as these two terms cannot be used interchangeably.

      We thank the reviewer for the comment. We had indeed omitted the concentration of carbon tetrachloride (CCl<sub>4</sub>) from the Methods section. In our experiments, mice received intraperitoneal injections of CCl<sub>4</sub> at a dose of 1 mL/kg body weight, twice per week, for a total of six weeks. We have revised the manuscript accordingly, using the term “progressive fibrosis” to avoid confusion between fibrosis and cirrhosis.

      (10) The following statement is not backed by any correlation analysis: "Particularly during liver fibrosis progression, the PLC exhibits dynamic structural extension correlating with fibrosis severity,.. ".

      We thank the reviewer for the comment. The original statement that the “PLC correlates with fibrosis severity” lacked support from quantitative analysis. To ensure a precise description, we have revised the sentence as follows: “During liver fibrosis progression, the PLC exhibits dynamic structural extension.”

      (11) Similarly, the following statement is not followed by data that would address the impact of innervation on liver function: "How the autonomic nervous system regulates liver function in mice despite the apparent absence of substantive nerve fiber invasion into the parenchyma remains unclear.".

      This section has been revised. In this study, we analyzed the spatial distribution of nerves in the mouse liver using immunofluorescence staining. The results showed that nerve fibers were almost entirely confined to the portal vein region (Figure S6A). Notably, this distribution pattern differs significantly from that in humans. Previous studies have demonstrated that in the human liver, nerves are not only distributed around the portal vein but also present in the central vein, interlobular septa, and connective tissue of the hepatic parenchyma (Miller et al., 2021; Yi, la Fleur, Fliers & Kalsbeek, 2010).

      Previous studies have further explained the physiological basis for this difference: even among species with differences in parenchymal sympathetic innervation (i.e., species with or without direct sympathetic input), their sympathetic efferent regulatory functions may still be similar (Beckh, Fuchs, Ballé & Jungermann, 1990). This is because signals released by adrenergic and peptidergic nerve terminals can be transmitted to hepatocytes as electrical signals through intercellular gap junctions (Hertzberg & Gilula, 1979; Jensen, Alpini & Glaser, 2013; Seseke, Gardemann & Jungermann, 1992; Taher, Farr & Adeli, 2017). However, the scarcity of nerve fibers in the mouse hepatic parenchyma suggests that the mechanism by which the autonomic nervous system regulates liver function in mice may differ from that in humans. This finding also prompts us to further explore the potential role of PLC endothelial cells in this process.

      (12) Could the authors discuss their interpretation of the results in light of the fact that the innervation is lower in cirrhotic patients? https://pmc.ncbi.nlm.nih.gov/articles/PMC2871629/. Also, while ADGRG6 (Gpr126) may play important roles in liver Schwann cells, it is likely not through affecting myelination of the nerves, as the liver nerves are not myelinated https://pubmed.ncbi.nlm.nih.gov/2407769/ and https://www.pnas.org/doi/10.1073/pnas.93.23.13280.

      We have revised the text to state that although most hepatic nerves are unmyelinated, GPR126 (ADGRG6) may regulate hepatic nerve distribution via non-myelination-dependent mechanisms. Studies have shown that GPR126 exerts both Schwann cell–dependent and –independent functions during peripheral nerve repair, influencing axon guidance, mechanosensation, and ECM remodeling (Mogha et al., 2016; Monk et al., 2011; Paavola et al., 2014).

      (13) The manuscript would benefit from text curation that would:

      a) Unify the language describing the PLC, so it is clear that (if) it represents protrusions of the portal veins.

      We have standardized the description of the PLC throughout the manuscript, clearly specifying its anatomical relationship with the portal vein. Wherever appropriate, we indicate that the PLC represents protrusions associated with the portal vein, avoiding ambiguous or inconsistent statements.

      b) Increase the accuracy of the statements.

      Examples: "bile ducts, and the central vein in adult mouse livers."

      We have refined all statements for accuracy.

      c) Reduce the space given to discussion and results in the introduction, moving them to the respective parts. The same applies to the results section, where discussion occurs at more places than in the Discussion part itself.

      We have edited the Introduction, removing detailed results and functional explanations, and retaining only a concise overview.

      Examples: "The formation of PLC structures in the adventitial layer may participate in local blood flow regulation, maintenance of microenvironmental homeostasis, and vascular-stem cell interactions."

      "This finding suggests that PLC endothelial cells not only regulate the periportal microcirculatory blood flow, but also establish a specialized microenvironment that supports periportal hematopoietic regulation, contributing to stem cell recruitment, vascular homeostasis, and tissue repair. "

      "Together, these findings suggest the PLC endothelium may act as a key regulator of bile duct branching and fibrotic microenvironment remodeling in liver cirrhosis. " This one in particular would require further validation with protein stainings and similar, directly in your model.

      d) Provide a clear reference for the used scRNA seq so it's clear that the data were re-analyzed.

      Example: "single-cell transcriptomic analysis revealed significant upregulation of bile duct-related genes in the CD34<sup>+</sup>Sca-1<sup>+</sup> endothelium of PLC in cirrhotic liver, with notably high expression of Lgals1 (Galectin-1) and HGF(Figure 5G) "

      When describing the transcriptional analysis of PLC endothelial cells, we explicitly cited the original scRNA-seq dataset (Su et al., 2021), clarifying that these data were reanalyzed rather than newly generated.

      e) Introducing references for claims that, in places, are crucial for further interpretation of experiments.

      Examples: "It not only guides bile duct branching during development but also"; the authors show no data from liver development.

      Thank you for pointing this out. We have revised the relevant statement to ensure that the claim is accurate and well-supported.

      f) Results sentence "Instead, bile duct epithelial cells at the terminal ducts extended partially along the canalicular network without directly participating in the formation of the bile duct lumen." Lacks a callout to the respective Figure.

      We would like to thank the reviewers for pointing out this issue. In the revised manuscript, the relevant image (Figure 3D) has been clearly annotated with white arrows to indicate the phenomenon of terminal cholangiocytes extending along the bile canaliculi network. Additionally, the schematic diagram on the right side clearly shows the bile canaliculi, cholangiocytes, and bile flow direction using arrows and color coding, thus intuitively corresponding to the textual description.

      (14) Formal text suggestions: The manuscript text contains a lot of missed or excessive spaces and several typos that ought to be fixed. A few examples follow:

      a) "densely n organized vascular network "

      b) "analysis, while offering high spatial "

      c) "specific differences, In the human liver, "

      d) Figure 4F has a typo in the description.

      e) "generation of high signal-to-noise ratio, multi-target " SNR abbreviation was introduced earlier.

      f) Canals of Hering, CoH abbreviation comes much later than the first mention of the Canals of Hering.

      We thank the reviewer for the helpful comment regarding textual consistency. We have carefully reviewed and revised the entire manuscript to improve the accuracy, clarity, and consistency of the text.

    1. eLife Assessment

      In this valuable study, the authors present traces of bone modification on ~1.8 million-year-old proboscidean remains from Tanzania, which they infer to be the earliest evidence for stone-tool-assisted megafaunal consumption by hominins. Challenging published claims, the authors argue that persistent megafaunal exploitation roughly coincided with the earliest Acheulean tools. Notwithstanding the rich descriptive and spatial data, the behavioral inferences about hominin agency rely on traces (such as bone fracture patterns and spatial overlap) that are not unequivocal; the evidence presented to support the inferences thus remains incomplete. Given the implications of the timing and extent of hominin consumption of nutritious and energy-dense food resources, as well as of bone toolmaking, the findings of this study will be of interest to paleoanthropologists and other evolutionary biologists.

    2. Reviewer #1 (Public review):

      Domínguez-Rodrigo and colleagues make a largely convincing case for habitual elephant butchery by Early Pleistocene hominins at Olduvai Gorge (Tanzania), ca. 1.8-1.7 million years ago. They present this at a site scale (the EAK locality, which they excavated), as well as across the penecontemporaneous landscape, analyzing a series of findspots that contain stone tools and large-mammal bones. The latter are primarily elephants, but giraffids and bovids were also butchered in a few localities.

      The authors claim that this is the earliest well-documented evidence for elephant butchery; doing so requires debunking other purported cases of elephant butchery in the literature, or in one case, reinterpreting elephant bone manipulation as being nutritional (fracturing to obtain marrow) rather than technological (to make bone tools). The authors' critical discussion of these cases may not be consensual, but it surely advances the scientific discourse. The authors conclude by suggesting that an evolutionary threshold was achieved at ca. 1.8 Ma, whereby regular elephant consumption rich in fats and perhaps food surplus, more advanced extractive technology (the Acheulian toolkit), and larger human group size had coincided.

      The fieldwork and spatial statistics methods are presented in detail and are solid and helpful, especially the excellent description (all too rare in zooarchaeology papers) of bone conservation and preservation procedures. The results are detailed and clearly presented.

      The authors achieved their aims, showcasing recurring elephant butchery in 1.8-1.7 million-year-old archaeological contexts. The authors cautiously emphasize the temporal and spatial correlation of 1) elephant butchery, 2) Acheulian toolkits, and 3) larger sites, and discuss how these elements may be causally related.

      Overall, this is an interesting manuscript of broad interest that presents original data and interpretations from the Early Pleistocene archaeology of Olduvai Gorge. These observations and the authors' critical review of previously published evidence are an important contribution that will form the basis for building models of Early Pleistocene hominin adaptation.

    3. Reviewer #2 (Public review):

      The manuscript makes a valuable contribution to the Olduvai Gorge record, offering a detailed description of the EAK faunal assemblage. In particular, the paper provides a high-resolution record of a juvenile Elephas recki carcass, associated lithic artifacts, and several green-broken bone specimens. These data are inherently valuable and will be of significant interest to researchers studying Early Pleistocene taphonomy. My concerns do not relate to the quality or importance of the data themselves, but rather to the interpretive inferences drawn from these data, particularly regarding the strength of the claim for unambiguous proboscidean butchery.

      This review follows the authors' response to an earlier round of reviewer feedback and addresses points raised in that exchange. In their rebuttal, the authors state that some of my initial concerns reflect misunderstandings of their analysis, but after carefully re-reading both the manuscript and their responses, I do not believe this is the case.

      In their response, the authors state that they do not treat the EAK evidence as decisive, yet the manuscript repeatedly characterizes the assemblage in very definitive terms. For example, EAK is described as "the oldest unambiguous proboscidean butchery site at Olduvai" and as "the oldest secure proboscidean butchery evidence." These phrases communicate a high level of confidence that does not align with the more qualified position articulated in the rebuttal and extends beyond what the documented evidence securely supports.

      I appreciate the authors' clarification regarding the fracture features, and I agree that these are well-established outcomes of dynamic hammerstone percussion. At the same time, several of these traits have been documented in non-anthropogenic contexts, including helicoidal spiral fractures resulting from trampling and carnivore activity (Haynes 1983), adjacent or flake-like scars created by carnivore gnawing (Villa and Bartram 1996), hackled break surfaces produced by heavy passive breakage such as trampling or sediment pressure (Haynes 1983), and impact-related bone flakes observed in carnivore-modified assemblages (Coil et al. 2020). One of the biggest issues is that there is no quantitative data or images of the bone fracture features that the authors refer to as the main diagnostic criteria at EAK. The only figures that show EAK specimens (S21, S22, S23) illustrate general green-bone fracture morphology but none of the specific traits listed in the text. In contrast, clear examples of similar features come from other Olduvai assemblages, which may be misleading to readers if they mistakenly interpret those as images from EAK. The manuscript also states that these traits "co-occur," but it is not defined whether this refers to multiple features on the same fragment or within the broader assemblage. Without images or counts that document these traits on EAK fossils, readers cannot evaluate the strength of the interpretation. Including that information would substantially strengthen the manuscript.

      Regarding the statement that "natural elephant long limb breaks have been documented only in pre or peri-mortem stages when an elephant breaks a leg, and only in femora (Haynes et al., 2021)," it is not entirely clear what this example is intended to illustrate in relation to the EAK assemblage. My understanding is that the authors are suggesting that naturally produced green bone fractures in elephants are very limited, perhaps occurring only in pre or peri-mortem broken leg cases, and that fractures on other elements should therefore be attributed to hominin activity. If that is not the intended argument, I would encourage clarifying this point. This appears to conflate pre-mortem injury with the broader issue of equifinality. My original comment was not referring to pre-mortem breaks but to the range of natural (i.e., non-hominin) and post-mortem processes that can generate spiral or green bone fractures similar to those described by the authors.

      I fully understand the spatial analyses, and I realize that the association between bones and lithics is statistically significant. My original concern was not about whether the correlation exists, but about how that correlation is interpreted. That point still stands. Statistical co-occurrence cannot distinguish among the multiple depositional and post-depositional processes that can generate similar spatial patterns. However, I agree that the spatial correlation is intriguing, particularly when viewed alongside the possible butchery evidence. The pattern is notable and worthy of publication, even if the behavioral interpretation requires caution.

      Finally, in considering the authors' response on the Nyayanga material, I still find the basis for their dismissal of that evidence difficult to follow and the contrasting treatment of the Nyayanga and EAK evidence raises concerns about interpretive consistency. Plummer et al. (2023) specify that bone surface modifications were examined using low-power magnification (10×-40×) and strong light sources to identify modifications and that they attributed agency (e.g., hominin, carnivore) to modifications only after excluding possible alternatives. The rebuttal does not engage with the procedures reported. The existence of newer analytical techniques does not diminish the validity of long-standing methods that have been applied across many studies. It is also unclear why abrasion is presented as a more likely explanation than stone tool cutmarks. The authors dismiss the Nyayanga images as "blurry," but this is irrelevant to the interpretation, since the analysis was based on the fossils, not the photographs. The Nyayanga dataset is dismissed without a thorough engagement, while the EAK material, despite similar uncertainties and potential for alternative explanations, is treated as definitive.

      These concerns do not diminish the significance of the EAK assemblage, and addressing them would allow the interpretations to more fully reflect the scope of the available data.

      Literature Cited:

      Coil, R., Yezzi-Woodley, K., & Tappen, M. (2020). Comparisons of impact flakes derived from hyena and hammerstone long bone breakage. Journal of Archaeological Science, 120, 105167.

      Haynes, G. (1983). A guide for differentiating mammalian carnivore taxa responsible for gnaw damage to herbivore limb bones. Paleobiology, 9(2), 164-172.

      Haynes, G., Krasinski, K., & Wojtal, P. (2021). A study of fractured proboscidean bones in recent and fossil assemblages. Journal of Archaeological Method and Theory, 28(3), 956-1025.

      Plummer, T. W., et al. (2023). Expanded geographic distribution and dietary strategies of the earliest Oldowan hominins and Paranthropus. Science, 379(6632), 561-566.

      Villa, P., & Bartram, L. (1996). Flaked bone from a hyena den. Paléo, Revue d'Archéologie Préhistorique, 8(1), 143-159.

    4. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Domínguez-Rodrigo and colleagues make a moderately convincing case for habitual elephant butchery by Early Pleistocene hominins at Olduvai Gorge (Tanzania), ca. 1.8-1.7 million years ago. They present this at the site scale (the EAK locality, which they excavated), as well as across the penecontemporaneous landscape, analyzing a series of findspots that contain stone tools and large-mammal bones. The latter are primarily elephants, but giraffids and bovids were also butchered in a few localities. The authors claim that this is the earliest well-documented evidence for elephant butchery; doing so requires debunking other purported cases of elephant butchery in the literature, or in one case, reinterpreting elephant bone manipulation as being nutritional (fracturing to obtain marrow) rather than technological (to make bone tools). The authors' critical discussion of these cases may not be consensual, but it surely advances the scientific discourse. The authors conclude by suggesting that an evolutionary threshold was achieved at ca. 1.8 Ma, whereby regular elephant consumption rich in fats and perhaps food surplus, more advanced extractive technology (the Acheulian toolkit), and larger human group size had coincided.

      The fieldwork and spatial statistics methods are presented in detail and are solid and helpful, especially the excellent description (all too rare in zooarchaeology papers) of bone conservation and preservation procedures. However, the methods of the zooarchaeological and taphonomic analysis - the core of the study - are peculiarly missing. Some of these are explained along the manuscript, but not in a standard Methods paragraph with suitable references and an explicit account of how the authors recorded bone-surface modifications and the mode of bone fragmentation. This seems more of a technical omission that can be easily fixed than a true shortcoming of the study. The results are detailed and clearly presented.

      By and large, the authors achieved their aims, showcasing recurring elephant butchery in 1.8-1.7 million-year-old archaeological contexts. Nevertheless, some ambiguity surrounds the evolutionary significance part. The authors emphasize the temporal and spatial correlation of (1) elephant butchery, (2) Acheulian toolkits, and (3) larger sites, but do not actually discuss how these elements may be causally related. Is it not possible that larger group size or the adoption of Acheulian technology have nothing to do with megafaunal exploitation? Alternative hypotheses exist, and at least, the authors should try to defend the causation, not just put forward the correlation. The only exception is briefly mentioning food surplus as a "significant advantage", but how exactly, in the absence of food-preservation technologies? Moreover, in a landscape full of aggressive scavengers, such excess carcass parts may become a death trap for hominins, not an advantage. I do think that demonstrating habitual butchery bears very significant implications for human evolution, but more effort should be invested in explaining how this might have worked.

      Overall, this is an interesting manuscript of broad interest that presents original data and interpretations from the Early Pleistocene archaeology of Olduvai Gorge. These observations and the authors' critical review of previously published evidence are an important contribution that will form the basis for building models of Early Pleistocene hominin adaptation.

      This is a good example of the advantages of the eLife reviewing process. It has become much too common, among traditional peer-reviewing journals, to reject articles when the reviews do not agree, regardless of the heuristics (i.e., empirically supported weight) of the arguments of both reviewers. Reviewers 1 and 2 provide contrasting evaluations, and the eLife dialogue between authors and reviewers enables us to address their comments differentially. Reviewer 1 (R1), whose evaluation is overall positive, remarks that the methods of the zooarchaeological and taphonomic analysis are missing. We have added them now in the revised version of our manuscript. R1 also remarks that our work highlights correlation of events, but not necessarily causation. We did not establish causation because such interpretations bear a considerable amount of speculation (and they might have fostered further criticism by R2); however, in the revised version, we expanded our discussion of these issues substantially. Establishing causation among the events described is impossible, but we certainly provide arguments to link them.

      Reviewer #2 (Public review):

      The authors argue that the Emiliano Aguirre Korongo (EAK) assemblage from the base of Bed II at Olduvai Gorge shows systematic exploitation of elephants by hominins about 1.78 million years ago. They describe it as the earliest clear case of proboscidean butchery at Olduvai and link it to a larger behavioral shift from the Oldowan to the Acheulean.

      The paper includes detailed faunal and spatial data. The excavation and mapping methods appear to be careful, and the figures and tables effectively document the assemblage. The data presentation is strong, but the behavioral interpretation is not supported by the evidence.

      The claim for butchery is based mainly on the presence of green-bone fractures and the proximity of bones and stone artifacts. These observations do not prove human activity. Fractures of this kind can form naturally when bones break while still fresh, and spatial overlap can result from post-depositional processes. The studies cited to support these points, including work by Haynes and colleagues, explain that such traces alone are not diagnostic of butchery, but this paper presents them as if they were.

      The spatial analyses are technically correct, but their interpretation extends beyond what they can demonstrate. Clustering indicates proximity, not behavior. The claim that statistical results demonstrate a functional link between bones and artifacts is not justified. Other studies that use these methods combine them with direct modification evidence, which is lacking in this case.

      The discussion treats different bodies of evidence unevenly. Well-documented cut-marked specimens from Nyayanga and other sites are described as uncertain, while less direct evidence at EAK is treated as decisive. This selective approach weakens the argument and creates inconsistency in how evidence is judged.

      The broader evolutionary conclusions are not supported by the data. The paper presents EAK as marking the start of systematic megafaunal exploitation, but the evidence does not show this. The assemblage is described well, but the behavioral and evolutionary interpretations extend far beyond what can be demonstrated.

      We disagree with the arguments provided by Reviewer 2 (R2). The arguments are based on two issues: bone breakage and spatial association. We will treat both separately here.

      Bone breakage

      R2 argues that:

      “The claim for butchery is based mainly on the presence of green-bone fractures and the proximity of bones and stone artifacts. These observations do not prove human activity. Fractures of this kind can form naturally when bones break while still fresh, and spatial overlap can result from post-depositional processes. The studies cited to support these points, including work by Haynes and colleagues, explain that such traces alone are not diagnostic of butchery, but this paper presents them as if they were.”

      In our manuscript, we argued that green breakage provides equally good (or even better) taphonomic evidence of butchery if documented following clear taphonomic indicators. Not all green breaks are equal and not all “cut marks” are unambiguously identifiable as such. First, “natural” elephant long limb breaks have been documented only in pre/peri-mortem stages when an elephant breaks a leg. As a matter of fact, they have only been reported in publication on femora, the thinnest long bone (Haynes et al., 2021). Unfortunately, they have been studied many months after the death of the individuals, and the published diagnosis is made under the assumption that no other process intervened in the modification of those bones during this vast time span. Most of the breaks resulting from pre-mortem fractures produce long, smooth, oblique/helical outlines. Occasionally, some flake scarring may occur on the cortical surface. This has been documented as uneven, small-sized, and spaced, and we are not sure whether it resulted from rubbing of broken fragments while the animal was alive and attempting to walk, or whether some may have resulted from desiccation of the bone after one year. When looked at in detail, such breaks sometimes contain step-microfractures and angular (butterfly-like) outlines. Sometimes, they may be accompanied by pseudo-notches, which are distinct and not comparable to the deep notches that hammerstone breaking generates on the same types of bones. Commonly, the edges of the breaks show some polishing, probably from separate break planes rubbing against each other. It should be emphasized that the experimental work on hammerstone breaking documented by Haynes et al. (2021) is based on bone fracture properties of bones that are no longer completely green. The cracking documented in their hammerstone experimentation, with very irregular outlines, differs from the cracking that we have documented in the butchery of recently dead elephants.

      All this contrasts with the overlapping notches and flake scars (mostly occurring on the medullary side of the bone), both of them larger, with clear smooth, spiral and longitudinal trajectories, with more intensive modification on the medullary surface, and with sharp break edges, all resulting from hammerstone breaking of green bone. No “natural” break has been documented that replicates the morphologies displayed in the Supplementary File to our paper. We display specimens with inflection points, hackle marks on the breaks, and overlapping scarring on the medullary surface, with several specimens displaying percussion marks and pitting (the latter also most likely percussion marks). Most importantly, we document this patterned modification on elements other than femora, for which no example of purported morphological equifinality caused by pre-mortem “natural” breaking has been documented. In contrast, such morphologies are documented in hammerstone-broken, completely green bones (work in progress). We cited the works of Haynes to support this because they do not show otherwise. As a matter of fact, Haynes himself had the courtesy of reading our manuscript thoroughly and did not encounter any contradiction with his work.

      Spatial association

      R2 argues in this regard:

      “The spatial analyses are technically correct, but their interpretation extends beyond what they can demonstrate. Clustering indicates proximity, not behavior. The claim that statistical results demonstrate a functional link between bones and artifacts is not justified. Other studies that use these methods combine them with direct modification evidence, which is lacking in this case.”

      We should emphasize that there is some confusion in the use and interpretation of clustering by R2 when applied to EAK. R2 appears to interpret clustering as the typical naked-eye perception of the spatial association of different items. In contrast, we rely on the statistical concept of clustering, more specifically on spatial interdependence or covariance, which is different. Items may appear visually clustered but still be statistically independent. This could, for example, result from two independent depositional episodes that happen to overlap spatially. In such cases, the item-to-item relationship does not necessarily show any spatial interdependence between classes other than simple clustering (i.e., spatial coincidence in intensity).

      Spatial statistical interdependence, on the other hand, reflects a spatial relationship or co-dependence between different items. This goes beyond the mere fact that classes appear clustered: items between classes may show specific spatial relationships — they may avoid each other or occupy distinct positions in space (regular co-dependence), or they may interact within the same spatial area (clustering co-dependence). Our tests indicate the latter for EAK.

      Such patterns are difficult to explain when depositional events are unrelated, since the probability that two independent events would generate identical spatial patterns in the same loci is very low. They are also difficult to reconcile when post-depositional processes intervene and resediment part of the assemblage (Domínguez-Rodrigo et al. 2018).
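
      The distinction between visual overlap and statistical interdependence can be illustrated with a minimal sketch. It is not the analysis reported in the manuscript: the function names, the toy coordinates, and the choice of a random-labelling Monte Carlo test on cross-type nearest-neighbour distances are all assumptions made for illustration. The idea is that if bones and tools merely share the same area, shuffling the class labels among the recorded positions should give cross-type distances similar to the observed ones; if the classes are co-dependent at short range, the observed distances should be unusually small.

```python
import math
import random


def mean_cross_nn(bones, tools):
    """Mean distance from each bone to its nearest tool."""
    return sum(min(math.dist(b, t) for t in tools) for b in bones) / len(bones)


def random_labelling_test(bones, tools, n_sim=999, seed=0):
    """Monte Carlo test of cross-type attraction (hypothetical helper).

    Null hypothesis: the 'bone' and 'tool' labels are exchangeable among
    the recorded positions. Returns the observed statistic and a one-sided
    p-value for the observed cross-type distance being unusually small.
    """
    rng = random.Random(seed)
    observed = mean_cross_nn(bones, tools)
    pooled = list(bones) + list(tools)
    n_bones = len(bones)
    as_small = 0
    for _ in range(n_sim):
        rng.shuffle(pooled)  # reassign labels at random, keep positions
        simulated = mean_cross_nn(pooled[:n_bones], pooled[n_bones:])
        if simulated <= observed:
            as_small += 1
    p_value = (as_small + 1) / (n_sim + 1)
    return observed, p_value


# Toy data (invented): every bone has a tool 0.1 units away,
# while bone-tool pairs are spaced 10 units apart.
bones = [(10.0 * i, 0.0) for i in range(10)]
tools = [(10.0 * i + 0.1, 0.0) for i in range(10)]

observed, p_value = random_labelling_test(bones, tools, n_sim=199)
print(f"mean bone-to-tool distance: {observed:.3f}, p = {p_value:.3f}")
```

      A small p-value in a test of this kind indicates only that the two classes are closer to each other than label shuffling would produce; like the analyses discussed here, it provides a basis for argument about association, not a demonstration of behaviour.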

      Finally, R2 concludes:

      “The discussion treats different bodies of evidence unevenly. Well-documented cut-marked specimens from Nyayanga and other sites are described as uncertain, while less direct evidence at EAK is treated as decisive. This selective approach weakens the argument and creates inconsistency in how evidence is judged.”

      The modifications on the Nyayanga hippo remains are not well-documented cut marks. Neither R2 nor we can differentiate those marks from those inflicted by natural abrasive processes in the coarse-grained sedimentary contexts where the carcasses were found. The fact that the microscopic features (as observed through the low-quality photographs that appear in the original publication) differ between the cut marks documented on smaller animals and those inferred for the hippo remains makes them even more ambiguous. Nowhere in our manuscript do we treat the EAK evidence (or any other evidence) as decisive; we treat it as the most likely interpretation given the methods used and the results reported.

      References

      Haynes, G., Krasinski, K., Wojtal, P. 2021. A Study of Fractured Proboscidean Bones in Recent and Fossil Assemblages. Journal of Archaeological Method and Theory 28: 956-1025.

      Domínguez-Rodrigo, M., Cobo-Sánchez, L., Yravedra, J., Uribelarrea, D., Arriaza, C., Organista, E., Baquedano, E. 2018. Fluvial spatial taphonomy: a new method for the study of post-depositional processes. Archaeological and Anthropological Sciences 10: 1769-1789.

      Recommendations for authors:

      Reviewer #1 (Recommendations for the authors):

      I have several recommendations that, in my opinion, could enhance the communication of this study to the readers. The first point is the only crucial one.

      (1) A detailed zooarchaeological methods section must be added, with explanations (or references to them) of precisely how the authors defined and recorded bone-surface modifications and mode of bone fragmentation.

      This appears in the revised version of the manuscript in the form of a new sub-section within the Methods section.

      (2) The title could be improved to better represent the contents of the paper. It contains two parts: the earliest evidence for elephant butchery (that's ok), and revealing the evolutionary impact of megafaunal exploitation. The latter point is not actually revealed in the manuscript, just alluded to here and there (see also below).

      We have elaborated on this in the revised version, linking megafaunal exploitation and anatomical changes (which appear discussed in much more detail in the references indicated).

      (3) The abstract does not make it clear whether the authors think that the megafaunal adaptation strongly correlates with the Acheulian technocomplex. It seems that they do, so please make this point apparent in the abstract.

      From a functional point of view, we document the correlation but do not infer causation, since most butchering tools around these megafaunal carcasses are typologically non-Acheulian. We have indicated so in the abstract.

      (4) Please define what you mean by "megafauna". How large should an animal be to be considered as megafauna in this particular context?

      We have added this definition: we identify as “megafauna” those animals heavier than 800 kg.

      (5) In the literature survey, consider also this Middle Pleistocene case-study of elephant butchery, including a probable bone tool: Rabinovich, R., Ackermann, O., Aladjem, E., Barkai, R., Biton, R., Milevski, I., Solodenko, N., and Marder, O., 2012. Elephants at the middle Pleistocene Acheulian open-air site of Revadim Quarry, Israel. Quaternary International, 276, pp.183-197.

      Added to the revised version.

      (6) The paragraph in lines 123-160 is unclear. Do the authors argue that the lack of evidence for processing elephant carcasses for marrow and grease is universal? They bring forth a single example of a much later (MIS 5) site in Germany. Then, the authors state the huge importance of fats for foragers (when? Where? Surely not in all latitudes and ecosystems). This left me confused - what exactly are you trying to claim here?

      We have explained this a little more in the revised text. What we pointed out was that most prehistoric (and modern) elephant butchery sites leave grease-containing long bones intact. Evidence of anthropogenic breakage of these elements is rather limited. The most probable reason is the overabundance of meat and fat from the rest of the carcass and the time-consuming effort needed to access the medullary cavity of elephant long bones.

      (7) The paragraph in lines 174-187 disrupts the flow of the text, contains previously mentioned information, ends with an unclear sentence, and could be cut.

      (8) Results: please provide the MNI for the EAK site (presumably 1, but this is never mentioned).

      Done in the revised version.

      (9) Lines 292 - 295: The authors found no traces of carnivoran activity (carnivoran remains, coprolites, or gnawing marks on the elephant bones), yet they attribute the absence of some non-dense skeletal elements to carnivore ravaging. I cannot understand this rationale, given that other density-mediated processes could have deleted the missing bones and epiphysis.

      This interpretation stems from our observations of several elephant carcasses in the Okavango delta in Botswana. Those that were monitored showed deletion of remains (i.e., disappearance of certain bones, like feet) without necessarily imprinting damage on the rest of the carcass. Carnivore intervention in an elephant death site can result in deletion of a few remains without much damage (if any), or if hyena clans access the carcass, much more conspicuous damage can be documented. There is a whole range of carnivore signatures in between. We are currently working on our study of several elephant carcasses subjected to these highly variable degrees of carnivore impact.

      (10) Lines 412 - 422: "The clustering of the elephant (and hippopotamus) carcasses in the areas containing the highest densities of landscape surface artifacts is suggestive of a hominin agency in at least part of their consumption and modification." - how so? It could equally suggest that both hominins and elephants were drawn to the same lush environments.

      We agree. Both hominins and megafauna must have been drawn to the same ecological loci for interaction to emerge. However, the fact that the highest-density clusters of artifacts coincide with the highest density of carcasses “showing evidence of having been broken” is suggestive of hominin use and consumption.

      (11) Discussion: I suggest starting the Discussion with a concise appraisal of the lines of evidence detailed in the Results and their interpretation, and only then, the critical reassessment of other studies. Similarly, a new topic starts in line 508, but without any subheading or an introductory sentence that could assist the readers.

      We added the introductory lines of the former Conclusion section to the revised Discussion section, as suggested by R1.

      (12) Line 607: Neumark-Nord are Late Pleistocene sites (MIS 5), not Middle Pleistocene.

      Corrected.

      (13) Regarding the ambiguity in how megafaunal exploitation may be causally related to the other features of the early Acheulian, the authors can develop the discussion. Alternatively, they should explicitly state that correlation is not causation, and that the present study adds the megafaunal exploitation element to be considered in future discussion of the shifts in lifestyles 1.8 million years ago.

      We have done so.

      Reviewer #2 (Recommendations for the authors):

      The following detailed comments are provided to help clarify arguments, ensure accurate representation of cited literature, and strengthen the logical and methodological framing of the paper. Line numbers refer to the version provided for review.

      (1) Line 55: Such concurrency (sometimes in conjunction with other variables)

      The term "other variables" is very vague. I would suggest expanding on this or taking it out altogether.

      (2) Line 146: Megafaunal long bone green breakage (linked to continuous spiral fractures on thick cortical bone) is probably a less ambiguous trace of butchery than "cut marks", since many of the latter could be equifinal and harder to identify, especially in contexts of high abrasion and trampling (Haynes et al., 2021, 2020).

      This reasoning is not supported by the evidence or the cited sources. Green-bone spiral fractures only show that a bone broke while it was fresh and do not reveal who or what caused it. Carnivore feeding, trampling, and natural sediment pressure can all create the same patterns, so these fractures are not clearer evidence of butchery than cut marks. Cut marks, when they are preserved and morphologically clear, remain the most reliable indicator of human activity. The Haynes papers actually show the opposite of what is claimed here. They warn that spiral fractures and surface marks can form naturally and that fracture patterns alone cannot be used to infer butchery. This section should be revised to reflect what those studies actually demonstrate.

      The reasoning referred to in line 146 is further explained below in the original text as follows:

      “Despite the occurrence of green fractures on naturally-broken bones, such as those trampled by elephants (Haynes et al., 2020), those occurring through traumatic fracturing or gnawed by carnivores (Haynes and Hutson, 2020), these fail to reproduce the elongated, extensive, or helicoidal spiral fractures (uninterrupted by stepped sections), accompanied by the overlapping conchoidal scars (both cortical and medullary), the reflected scarring, the inflection points, or the impact hackled break surfaces and flakes typical of dynamic percussive breakage. Evidence of this type of green breakage had not been documented earlier for the Early Pleistocene proboscidean or hippopotamid carcasses, beyond the documentation of flaked bone with the purpose of elaboration of bone tools (Backwell and d’Errico, 2004; Pante et al., 2020; Sano et al., 2020).”

      The problem with the way R2 uses Haynes et al.'s works is that R2 considers features in isolation. Natural breaks occurring while the bone is green can generate smooth spiral breaks, for example, but it is not the presence of a single feature that determines the diagnosis of agency or that is taphonomically relevant; it is the concurrence of several of them. The best example of a naturally (pre-mortem) broken bone was published by Haynes et al.

      The natural break shows helical fractures that are subordinate to linear (angular) fracture outlines. Notice how the crack displays a zig-zag. The break is smooth, but most damage occurs on the cortical surface, with flaking adjacent to the break and step micro-fracturing on the edges. The cortical scarring is discontinuous (almost marginal) and very small, almost limited to the very edge of the break. No modification occurs on the medullary surface. No extensive conchoidal fractures are documented, and certainly none inside the medullary surface of the break.

      Compare with Figures S8, S10, S17 and S34 (all specimens are shown from their medullary surface):

      In these examples, we see clearly modified medullary surfaces with multiple green breaks and large step fractures, accompanied in some examples by hackle marks. Some show large overlapping scars (substantially larger than those documented in the natural break image). Not a single naturally broken bone has been documented displaying these morphologies simultaneously. It is the comprehensive analysis of the co-occurrence of these features, and not their marginal and isolated occurrence in naturally broken bones, that makes the difference in the attribution of agency. Likewise, no example of a naturally broken bone has been published that could mimic either of the two green-broken bones documented at EAK. In contrast, we do have bones from our ongoing experimentation with green elephant carcasses that jointly reproduce these features. See also Figure 6 of the article for another example without any modern referent among the naturally broken bones documented.

      We should emphasize that R2 is inaccurately portraying what Haynes et al.'s results really document. Contrary to R2's assertion, trampling does not reproduce any of the examples shown above. Neither do carnivores. It should be stressed that Haynes & Harrod only document similar overlapping scarring on the medullary surface of bones when using much smaller animals. In all the carnivore damage repertoire that they document for elephants, durophagous spotted hyenas can only inflict furrowing on the ends of the biggest long bones, especially if the elephants are adults. Long bone midshafts remain inaccessible to them. The mid-shaft portions of bones that we document in our Supplementary File and at EAK cannot be the result of hyena (or other carnivore) damage for this reason, and also because their intense gnawing on elephant bones leaves tooth marks on most of the elements that they modify, and such marks are absent in our sample.

      (3) Line 176: other than hominins accessed them in different taphonomically-defined stages- stages - the "Stages" is repeated twice

      Defined in the revised version.

      (4) Line 174: Regardless of the type of butchery evidence - and with the taphonomic caveat that no unambiguous evidence exists to confirm that megafaunal carcasses were hunted or scavenged other than hominins accessed them in different taphonomically-defined stages- stages - the principal reasons for exploring megafaunal consumption in early human evolution is its origin, its episodic or temporally-patterned occurrence, its impact on hominin adaptation to certain landscapes, and its reflection on hominin group size and site functionality.

      This sentence is confusing and needs to be rewritten for clarity. It tries to combine too many ideas at once, and the phrasing makes it hard to tell what the main point is. The taphonomic caveat in the middle interrupts the sentence and obscures the argument. It should be broken into separate, clearer statements that distinguish what evidence exists, what remains uncertain, and what the broader goals of the discussion are.

      We believe the ideas are displayed clearly.

      (5) Line 179: landscapes, and its reflection on hominin group size and site functionality. If hominins actively sought the exploitation of megafauna, especially if targeting early stages of carcass consumption, the recovery of an apparent surplus of resources reflects a substantially different behavior from the small-group/small-site pattern documented at several earlier Oldowan anthropogenic sites (Domínguez-Rodrigo et al., 2019) -or some modern foragers, like the Hadza, who only exploit megafaunal carcasses very sporadically, mostly upon opportunistic encounters (Marlowe, 2010; O'Connell et al., 1992; Wood, 2010; Wood and Marlowe, 2013).

      This sentence makes a reasonable point, but is written in a confusing way. The idea that early, deliberate access to megafauna would represent a different behavioral pattern from smaller Oldowan or modern foraging contexts is valid, but the sentence is awkward and hard to follow. It should be rephrased to make the logic clearer and more direct.

      We believe the ideas are displayed clearly.

      (6) Line 186: When the process started of becoming megafaunal commensal started has major implications for human evolution.

      This sentence is awkward and needs to be rewritten for clarity. The phrasing "when the process started of becoming megafaunal commensal started" is confusing and grammatically incorrect. It could be revised to something like "Determining when hominins first began to interact regularly with megafauna has major implications for human evolution," or another version that clearly identifies the process being discussed.

      Modified in the revised version.

      (7) Line189: The multiple taphonomic biases intervening in the palimpsestic nature of most of these butchery sites often prevent the detection of the causal traces linking megafaunal carcasses and hominins. Functional links have commonly been assumed through the spatial concurrence of tools and carcass remains; however, this perception may be utterly unjustified as we argued above. Functional association of both archaeological elements can more securely be detected through objective spatial statistical methods. This has been argued to be foundational for heuristic interpretations of proboscidean butchery sites (Giusti, 2021). Such an approach removes ambiguity and solidifies spatial functional association, as demonstrated at sites like Marathousa 1 (Konidaris et al., 2018) or TK Sivatherium (Panera et al., 2019). This method will play a major role in the present study.

      This section overstates what spatial analysis can demonstrate and misrepresents the cited studies. The works by Giusti (2021), Konidaris et al. (2018), and Panera et al. (2019) do use spatial statistics to examine relationships between artifacts and faunal remains, but they explicitly caution that spatial overlap alone does not prove functional or behavioral association. These studies argue that clustering can support such interpretations only when combined with detailed taphonomic and stratigraphic evidence. None of them claims that spatial analysis "removes ambiguity" or "solidifies" functional links. The text should be revised to reflect the more qualified conclusions of those papers and to avoid implying that spatial statistics can establish behavioral causation on their own.

      We disagree. Both works (Giusti and Panera) use spatial statistical tools to create an inferential basis reinforcing a functional association of lithics and bones. In both cases, the anthropogenic agency inferred is based on that. We should stress that this only provides a basis for argumentation, not a definitive causation. Again, those analyses show much more than just apparent visual clustering.

      (8) Line 200: Here, we present the discovery of a new elephant butchery site (Emiliano Aguirre Korongo, EAK), dated to 1.78 Ma, from the base of Bed II at Olduvai Gorge. It is the oldest unambiguous proboscidean butchery site at Olduvai.

      It is fine to state the main finding in the introduction, but the phrasing here is too strong. Calling EAK "the oldest unambiguous proboscidean butchery site" asserts certainty before the evidence is presented. The claim should be stated more cautiously, for example, "a new site that provides early evidence for proboscidean butchery," so that the language reflects the strength of the data rather than pre-judging it.

      We understand the caution by R2, but in this case, EAK is the oldest taphonomically-supported evidence of elephant butchery at Olduvai (see discussion about FLK North in the text). Whether this is declared at the beginning or the end of the text is irrelevant.

      (9) Line 224: The drying that characterizes Bed II had not yet taken place during this moment.

      This sentence reads like a literal translation. It should be rewritten for clarity.

      Modified in the revised version.

      (10) Line 233: During the recent Holocene, the EAK site was affected by a small landslide which displaced the...

      This section contains far more geological detail than is needed for the argument. The reader only needs to know that the site block was displaced by a small Holocene landslide but retains its stratigraphic integrity. The extended discussion of regional faults, seismicity, and slope processes goes well beyond what is necessary for context and distracts from the main focus of the paper.

      We disagree. Geological information is what is most commonly missing from archaeological reports. Here, it is relevant because the process is atypical and because it has been documented only twice at elephant butchery sites. Explaining the dynamic geological process that shaped the site helps to understand its spatial properties.

      (11) Line 264: In June 2022, a partial elephant carcass was found at EAK on a fragmented stratigraphic block...

      This section reads like field notes rather than a formal site description. Most of the details about the discovery sequence, trench setup, and excavation process are unnecessary for the main text. Only the basic contextual information about the find location, stratigraphic position, and anatomical composition is needed. The rest could be condensed or moved to the methods or supplementary material.

      We disagree. See reply above.

      (12) Line 291: hominins or other carnivores. Ongoing restoration work will provide an accurate estimate of well-preserved and modified fractions of the assemblage.

      This sentence is unclear and needs to specify what kind of restoration work is being done and what is meant by well-preserved and modified fractions. It is not clear whether modified refers to surface marks, diagenetic alteration, or something else. If the bones are still being cleaned or prepared, the analysis is incomplete, and the counts cannot be considered final. If restoration only means conservation or stabilization, that should be stated clearly so the reader understands that it does not affect the results. As written, it is not clear whether the data presented here are preliminary or complete.

      We added: For this reason, until restoration is concluded, we cannot produce any assertion about the presence or absence of bone surface modifications.

      (13) Line 294: The tibiae were well preserved, but the epiphyseal portions of the femora were missing, probably removed by carnivores, which would also explain why a large portion of the rib cage and almost all vertebrae are missing.

      This explanation is not well supported. The missing elements could be the result of other forms of density-mediated destruction, such as sediment compaction or post-depositional fragmentation, especially since no tooth marks were found. Given the low density of ribs, vertebrae, and femoral epiphyses, these processes are more likely explanations than carnivore removal. The text should acknowledge these alternatives rather than attributing the pattern to carnivore activity without direct evidence.

      Sediment compaction and post-depositional fragmentation can break bones but cannot make them disappear. Our excavation process was careful enough to detect bone if present. The absence of these elements indicates two possibilities: erosion through the years at the front of the excavation, or carnivore intervention. Carnivores can take elephant bones without impacting the remaining assemblage (see our reply above to a similar comment).

      (14) Line 304: The fact that the carcass was moved while encased in its sedimentary context, along with the close association of stone tools with the elephant bones, is in agreement with the inference that the animal was butchered by hominins. A more objective way to assess this association is through spatial statistical analysis.

      The authors state that "the carcass was moved while encased in its sedimentary context, along with the close association of stone tools with the elephant bones, is in agreement with the inference that the animal was butchered by hominins." This does not logically follow. Movement of the block explains why the bones and tools remain together, not how that association was created. The preserved association alone does not demonstrate butchery, especially in the absence of cut marks or other direct evidence of hominin activity.

      Again, we are sorry that R2 is completely overlooking the strong signal detected by the spatial statistical analysis. The way the block moved preserved the original association of bones and tools. This statement is meant to clarify that, despite the allochthonous nature of the block, the original autochthonous depositional process of both types of archaeological materials has been preserved. The spatial association, as statistically demonstrated, indicates that the functional link is more likely than any alternative process. The additional fact that nowhere else in that portion of the outcrop do we identify scatters of tools (all appear clustered at a landscape scale with the elephant) adds more support to this interpretation. This would no doubt have been further supported by the presence of cut marks, but their absence does not indicate a lack of functional association since, as Haynes' works have clearly shown, most bulk defleshing of modern elephants leaves no traces on most bones.

      (15) Line 370: This also shows that the functional connection between the elephant bones and the tools has been maintained despite the block post-sedimentary movement.

      The spatial analyses appear to have been carried out appropriately, and the interpretations of clustering and segregation are consistent with the reported results. However, the conclusion that the "functional connection" between bones and tools has been maintained goes beyond what spatial correlation alone can demonstrate. These analyses show spatial proximity and scale-dependent clustering but cannot, by themselves, confirm a behavioral or functional link.

      R2 makes this comment repeatedly, and we have addressed it more than once above. We disagree and refer to our replies above in support of our position.

      (16) Line 412: The clustering of the elephant (and hippopotamus) carcasses in the areas containing the highest densities of landscape surface artifacts is suggestive of a hominin agency in at least part of their consumption and modification. The presence of green broken elephant long bone elements in the area surveyed is only documented within such clusters, both for lower and upper Bed II. This constitutes inverse negative evidence for natural breaks occurring on those carcasses through natural (i.e., non-hominin) pre- and peri-mortem limb breaking (Haynes et al., 2021, 2020; Haynes and Hutson, 2020). In this latter case, it would be expected for green-broken bones to show a more random landscape distribution, and occur in similar frequencies in areas with intense hominin landscape use (as documented in high density artifact deposition) and those with marginal or non-hominin intervention (mostly devoid of anthropogenic lithic remains).

      The clustering of green-bone fractures with stone tools is intriguing but should be interpreted cautiously. The Haynes references are misrepresented here. Those studies address both cut marks and green-bone (spiral) fractures, emphasizing that each can arise through non-hominin processes such as trampling, carcass collapse, and sediment loading. They do not treat green fractures as clearer evidence of butchery; in fact, they caution that such breakage patterns can occur naturally and even form clustered distributions in areas of repeated animal activity. The claim that these studies support spiral fractures as unambiguous indicators of hominin activity, or that natural breaks would be randomly distributed, is not accurate.

      We would like to emphasize again that the Haynes references are not misrepresented here. See our extensive reply above. If R2 can provide evidence of natural breakage patterns, resulting from pre-mortem limb breaking or post-mortem trampling, in which all limb bones are affected and which produce smooth spiral breaks accompanied by extensive and overlapping scarring on the medullary surface, in conjunction with the other features described in our replies above, then we would be willing to reconsider. In the evidence reported so far, these features do not occur simultaneously on specimens from studies of modern elephant bones.

      R2 seems to contradict him(her)self here by saying that Haynes' studies show that cut marks are not reliable because they can also be reproduced via trampling. Until this point, R2 had been saying that only cut marks could demonstrate a functional link and support butchery. Haynes' studies do not deal experimentally with sediment loading.

      (17) Line 424: This indicates that from lower Bed II (1.78 Ma) onwards, there is ample documented evidence of anthropogenic agency in the modification of proboscidean bones across the Olduvai paleolandscapes. The discovery of EAK constitutes, in this respect, the oldest evidence thereof at the gorge. The taphonomic evidence of dynamic proboscidean bone breaking across time and space supports, therefore, the inferences made by the spatial statistical analyses of bones and lithics at the site.

      This conclusion is overstated. The claim of "ample documented evidence of anthropogenic agency" is too strong, given that the main support comes from indirect indicators like green-bone fractures and spatial clustering rather than clear butchery marks. It would be more accurate to say that the evidence suggests or is consistent with possible hominin involvement. The final sentence also conflates association with causation; spatial and taphonomic data can indicate a relationship, but do not confirm that the carcasses were butchered by hominins.

      The evidence is based on the spatial clustering (at a landscape scale) of tools and elephant (and other megafaunal taxa) bones, in conjunction with a large number of green-broken elements. This interpretation is supported even more strongly when compared against modern referents. In the past few years, we have been conducting work on modern naturally dead elephant carcasses in Botswana and Zambia, and in the several carcasses that we have seen, we have not identified a single case of long bone shaft breaks like those described by Haynes as natural or like those we describe here as anthropogenic. This probably means that they are highly unlikely or marginal occurrences at a landscape scale. This seems to be supported by Haynes' work too. Out of the hundreds of elephant carcasses that he has monitored and studied over the years for different works, we have managed to identify only two instances where he described natural pre-mortem breaks. This certainly qualifies as extremely marginal.

      Most of the Results section is clearly descriptive, but beginning with "The clustering of the elephant (and hippopotamus) carcasses..." the text shifts from reporting observations to drawing behavioral conclusions. From this point on, it interprets the data as evidence of hominin activity rather than simply describing the patterns. This part would be more appropriate for the Discussion, or should be rewritten in a neutral, descriptive way if it is meant to stay in the Results.

      This is discussed extensively in the Discussion section, but the data presented in the Results are also interpreted in that section, following a clear chain of argument.

      (18) Line 433: A recent discovery of a couple of hippopotamus partial carcasses at the 3.0-2.6 Ma site of Nyayanga (Kenya), spatially concurrent with stone artifacts, has been argued to be causally linked by the presence of cut marks on some bones (Plummer et al., 2023). The only evidence published thereof is a series of bone surface modifications on a hippo rib and a tibial crest, which we suggest may be the result of byproduct of abiotic abrasive processes; the marks contrast noticeably with the well-defined cut marks found on smaller mammal bones (Plummer et al.'s 2023: Figure 3C, D) associated with the hippo remains (Plummer et al., 2023).

      The authors suggest that the Nyayanga marks could result from abiotic abrasion, but this claim does not engage with the detailed evidence presented by Plummer et al. (2023). Plummer and colleagues documented well-defined, morphologically consistent cut marks and considered the sedimentary context in their interpretation. Raising abrasion as a general possibility without addressing that analysis gives the impression of selective skepticism rather than an evaluation grounded in the published data.

      We disagree again on this matter. R2 does not clarify what he/she means by well-defined or morphologically consistent. We provide an alternative interpretation of those marks that fits their morphology and features and that Plummer et al. did not successfully exclude. We also emphasize that the interpretation of the Nyayanga marks was made descriptively, without any analytical approach and with a high degree of subjectivity on the part of the researcher. All of this disqualifies the approach as well defined and perpetuates an outdated view of taphonomy. Descriptive taphonomy is a thing of the 1980s. Today there is a plethora of analytical methods, from multivariate statistics to geometric morphometrics to AI computer vision (so far the most reliable), which represent how taphonomy (and more specifically, the analysis of bone surface modifications) should be conducted in the 21st century. These approaches would reinforce interpretations such as those preliminarily published by Plummer et al., provided they reject alternative explanations like those that we have provided.

      (19) Line 459: It would have been essential to document that the FLK N6 tools associated with the elephant were either on the same depositional surface as the elephant bones and/or on the same vertical position. The ambiguity about the FLK N6 elephant renders EAK the oldest secure proboscidean butchery evidence at Olduvai, and also probably one of the oldest in the early Pleistocene elsewhere in Africa.

      The concern about vertical mixing is fair, but the tone makes it sound like the association is definitely not real. It would be more accurate to say that the evidence is ambiguous, not that it should be dismissed altogether.

      We have done precisely that. We do not dismiss it, but we cannot take it as solid evidence, since we excavated the site and showed how easily one could infer functional associations when the third dimension is ignored. It is not a secure butchery site. This is what we said and we stand by this statement.

      (20) Line 479: In all cases, these wet environments must have been preferred places for water-dependent megafauna, like elephants and hippos, and their overlapping ecological niches are reflected in the spatial co-occurrence of their carcasses. Both types of megafauna show traces of hominin use through either cutmarked or percussed bones, green-broken bones, or both (Supplementary Information).

      The environmental part is good, but the behavioral interpretation is too strong. Saying elephants and hippos "must have been" drawn to these areas is too certain, and claiming that both "show traces of hominin use" makes it sound like every carcass was modified. It should be clearer that only some have possible evidence of this.

      The sentence only refers to both types of fauna taxonomically. No inference can therefore be drawn that all carcasses are modified.

      (21) Line 496: In most green-broken limb bones, we document the presence of a medullary cavity, despite the continuous presence of trabecular bone tissue on its walls.

      This sentence is confusing and doesn't seem to add anything meaningful. All limb bones naturally have a medullary cavity lined with trabecular bone, so it's unclear why this is noted as significant. The authors should clarify what they mean here or remove it if it's simply describing normal bone structure.

      No. Modern elephant long bones do not have a hollow medullary cavity. All the medullary volume is composed of trabecular tissue. Some elephants in the past had hollow medullary cavities, which probably contained larger amounts of marrow and fat. 

      (22) Line 518: We are not confident that the artefacts reported by de la Torre et al are indeed tools.

      While I generally agree with this statement, the paragraph reads as defensive rather than comparative. It would help if they briefly summarized what de la Torre et al. actually argued before explaining why they disagree.

      We devote two full pages of the Discussion section to doing precisely that.

      (23) Lines 518-574: They are similar to the green-broken specimens that we have reported here...

      This part is very detailed but inconsistent. They argue that the T69 marks could come from natural processes, but they use similar evidence (green fractures, overlapping scars) to argue for human activity at EAK. If equifinality applies to one, it applies to both.

      We are confused by this misinterpretation. Features like green fractures and overlapping scars (among others) can be used to detect anthropogenic agency in elephant bone breaking; that is, any given specimen can be determined to have been an “artifact” (in the sense of a human-created item). However, there is a large gap between identifying an artifact and interpreting it as a tool. Whereas an artifact (something made by a human) can be created indirectly through several processes (for example, demarrowing a bone, which results in long bone fragments), a tool implies intentional manufacture, use, or both. That is the difference between de la Torre et al.'s interpretation and ours. We believe that they are showing anthropogenically made items, but they have provided no proof that they were tools.

      (24) Line 576: A final argument used by the authors to justify the intentional artifactual nature of their bone implements is that the bone tools were found in situ within a single stratigraphic horizon securely dated to 1.5 million years ago, indicating systematic production rather than episodic use. This is taphonomically unjustified.

      The reasoning here feels uneven in how clustering evidence is used. At EAK, clustering of bones and artifacts is taken as meaningful evidence of hominin activity, but here the same pattern at T69 is treated as a natural by-product of butchery or carnivore activity. If clustering alone cannot distinguish between intentional and incidental association, the authors should clarify why it is interpreted as diagnostic in one case but not in the other.

      Again, we are confused by this misinterpretation. It applies to two different scenarios/questions:

      a) Is there a functional link between tools and bones at EAK and T69? We have statistically demonstrated that at EAK, and we think de la Torre et al. are trying to do the same for T69, although using a different method.

      b) Are the purported tools at T69 tools? Are those that we report here tools? In this regard there is no evidence in either case. Given that several bones from T69 come from animals smaller than elephants, we do not rule out that carnivores might have been responsible for those, whereas hominin butchery might have been responsible for the intense long limb breaking at that site. It remains to be seen how many (if any) of those specimens were tools.

      (25) Line 600: If such a bone implement was a tool, it would be the oldest bone tool documented to date (>1.7 Ma).

      The comparison to prior studies is useful, and the point about missing use-wear traces is well taken. However, the last lines feel speculative. If no clear use evidence has been found, it's premature to suggest that one specimen "would be the oldest bone tool." That claim should be either removed or clearly stated as hypothetical.

      It clearly reads as hypothetical.

      (26) Line 606: Evidence documents that the oldest systematic anthropogenic exploitation of proboscidean carcasses are documented (at several paleolandscape scales) in the Middle Pleistocene sites of Neumark-Nord (Germany)(Gaudzinski-Windheuser et al., 2023a, 2023b).

      This is the first and only mention of Neumark-Nord in the paper, and it appears without any prior discussion or connection to the rest of the study. If this site is being used for comparison or as part of a broader temporal framework, it needs to be introduced and contextualized earlier. As written, it feels out of place and disconnected from the rest of the argument.

      This is a Late Pleistocene site and we do not see the need to present it earlier, given that the scope of this work is Early Pleistocene.

      (27) Line 608: Evidence of at least episodic access to proboscidean remains goes back in time (see review in Agam and Barkai, 2018; Ben-Dor et al., 2011; Haynes, 2022).

      The distinction between "systematic" and "episodic" exploitation is useful, but the authors should clarify what criteria define each. The phrase "episodic access...goes back in time" is vague and could be replaced with a clearer statement summarizing the nature of the earlier evidence.

      It is self-explanatory.

      (28) Line 610: Redundant megafaunal exploitation is well documented at some early Pleistocene sites from Olduvai Gorge (Domínguez-Rodrigo et al., 2014a, 2014b; Organista et al., 2019, 2017, 2016).

      The phrase "redundant megafaunal exploitation" needs clarification. "Redundant" is not standard terminology in this context. Does this mean repeated, consistent, or overlapping behaviors? Also, while these same Olduvai sites are mentioned earlier, this phrasing also introduces new interpretive language not used before and implies a broader behavioral generalization than what the data actually show.

      Webster: Redundant means repetitive, occurring multiple times.

      (29) Line 612: At the very same sites, the stone artifactual assemblages, as well as the site dimensions, are substantially larger than those documented in the Bed I Oldowan sites (Diez-Martín et al., 2024, 2017, 2014, 2009).

      The placement and logic of this comparison are unclear. The discussion moves from Middle Pleistocene Neumark-Nord to early Pleistocene Olduvai sites, then to Bed I Oldowan contexts without clearly signaling the temporal or geographic transitions. If the intent is to contrast Acheulean vs. Oldowan site scale or organization, that connection needs to be made explicit. As written, it reads as a disjointed shift rather than a continuation of the argument.

      We disagree. Here, we finalize by bringing in some more recent assemblages where hominin agency is not in question.

      (30) Line 616: Here, we have reported a significant change in hominin foraging behaviors during Bed I and Bed II times, roughly coinciding with the replacement of Oldowan industries by Acheulian tool kits -although during Bed II, both industries co-existed for a substantial amount of time (Domínguez-Rodrigo et al., 2023; Uribelarrea et al., 2019, 2017).

      This section should be restructured for flow. The reference to behavioral change during Bed I-II and the overlap of Oldowan and Acheulean industries is important, but feels buried after a long detour. Consider moving this earlier or rephrasing so the main conclusion (behavioral change across Beds I-II) is clearly stated first, followed by supporting examples.

      It is not within the scope of this work and is properly described in the references mentioned.

      (31) Line 620: The evidence presented here, together with that documented by de la Torre et al. (2025), represents the most geographically extensive documentation of repeated access to proboscidean and other megafaunal remains at a single fossil locality.

      The phrase "most geographically extensive documentation of repeated access" overstates what has been demonstrated. The evidence presented is site-specific and does not justify such a broad superlative. This should be toned down or supported with comparative quantitative data.

      We disagree. There is no other example where such an abundant record of green-broken elements from megafauna is documented. Neumark-Nord is the most similar case because it shows extensive evidence of butchery, but not so much of degreasing.

      (32) Line 623: The transition from Oldowan sites, where lithic and archaeofaunal assemblages are typically concentrated within 30-40 m2 clusters, to Acheulean sites that span hundreds or even over 1000 m2 (as in BK), with distinct internal spatial organization and redundancy in space use across multiple archaeological layers spanning meters of stratigraphic sequence (Domínguez-Rodrigo et al., 2014a, 2009b; Organista et al., 2017), reflects significant behavioral and technological shifts.

      This sentence about site size and spatial organization repeats earlier claims without adding new insight. If it's meant as a synthesis, it should explicitly say how the spatial expansion relates to changes in behavior or mobility, not just describe the difference.

      In the Conclusion section these correlations have been explained in more detail to add some causation.

      (33) Line 628: This pattern likely signifies critical innovations in human evolution, coinciding with major anatomical and physiological transformations in early hominins (Dembitzer et al., 2022; Domínguez-Rodrigo et al., 2021, 2012).

      The conclusion that this "signifies critical innovations in human evolution" is too sweeping, given the data presented. It introduces physiological and anatomical transformation without connecting it to any evidence in this paper. Either cite the relevant findings or limit the claim to behavioral implications.

      The references cited elaborate on this at length. The revised version of the Conclusion section also elaborates on this.

      Overall, the conclusions section reads as a loosely connected set of assertions rather than a focused synthesis. It introduces new interpretations and terminology not supported or developed earlier in the paper, and the argument jumps across temporal and geographic scales without clear transitions. The discussion should be restructured to summarize key results, clarify the scope of interpretation, and avoid speculative or overstated claims about evolutionary significance.

      We have done so, supported by the references used, in addition to extending some of the arguments.

      (34) Line 639: The systematic excavation of the stratigraphic layers involved a small crew.

      This sentence is not necessary.

      No comment

      (35) Line 643: The orientation and inclination of the artifacts were recorded using a compass and an inclinometer, respectively.

      What were these measurements used for (e.g., post-depositional movement analysis, spatial patterning)? A short note on the purpose would make this more meaningful.

      Fabric analysis has been added to the revised version.

      (36) Line 659: Restoration of the EAK elephant bones

      This section could be streamlined and clarified. It includes procedural detail that doesn't contribute to scientific replicability (e.g., the texture of gauze, number of consolidant applications), while omitting some key information (such as how restoration may have affected analytical results). It also contains interpretive comments ("most of the assemblage has been successfully studied") that don't belong in Methods.

      No comment

      (37) Line 689: In the field laboratory, cleaning of the bone remains was carried out, along with adhesion of fragments and their consolidation when necessary.

      Clarify whether cleaning or adhesion treatments might obscure or alter bone surface modifications, as this has analytical implications.

      Current protocols no longer affect bone surfaces in that way.

      (38) Line 711: (b) Percussion Tools - Includes hammerstones or cobbles exhibiting diagnostic battering, pitting, and/or impact scars consistent with percussive activities.

      Define how diagnostic features (battering, pitting) were identified - visual inspection, magnification, or quantitative criteria?

      Both macroscopically and microscopically.

      (39) Line 734: We conducted the analysis in three different ways after selecting the spatial window, i.e., the analysed excavated area (52.56 m2).

      Clarify why the 52.56 m² spatial window was chosen. Was this the total excavated area or a selected portion?

      It was what was left of the elephant accumulation after erosion.

      (40) Line 728: The spatial statistical analyses of EAK.

      Adding one or two sentences at the start explaining the analytical objective, such as testing spatial association between faunal and lithic materials, would help readers understand how each analysis relates to the broader research questions.

      This is well explained in the main text.

      (41) Line 782: An intensive survey seeking stratigraphically-associated megafaunal bones was carried out in the months of June 2023 and 2024.

      It would help to specify whether the same areas were resurveyed in both field seasons or if different zones were covered each year. This information is important for understanding sampling consistency and potential spatial bias.

      Both areas were surveyed in both field seasons. We were very consistent.

      (42) Line 787: We focused on proboscidean bones and used hippopotamus bones, some of the most abundant in the megafaunal fossils, as a spatial control.

      Clarify how the hippopotamus remains function as a "spatial control." Are they used as a proxy for water-associated taxa to test habitat patterning, or as a baseline for comparing carcass distribution? The meaning of "control" in this context is ambiguous.

      As a proxy for megafaunal distribution given their greater abundance over any other megafaunal taxa.

      (43) Line 789: Stratigraphic association was carried out by direct observation of the geological context and with the presence of a Quaternary geologist during the whole survey.

      This is good methodological practice, but it would be helpful to describe how stratigraphic boundaries were identified in the field (for example, by reference to tuffs or marker beds). That information would make the geological framework more replicable.

      This is basic geological work. Of course, both tuffs and marker beds were followed.

      (44) Line 791: When fossils found were ambiguously associated with specific strata, these were excluded from the present analysis.

      You might specify what proportion of the total finds were excluded due to uncertain stratigraphic association. Reporting this would indicate the strength of the stratigraphic control.

      This was not quantified but it was a very small amount compared to those whose stratigraphic provenience was certain.

      (45) Line 799: The goals of this survey were: a) collect a spatial sample of proboscidean and megafaunal bones enabling us to understand if carcasses on the Olduvai paleolandscapes were randomly deposited or associated to specific habitats.

      You might clarify how randomness or habitat association was tested.

      Randomness was tested spatially and by comparing density across ecotones. The same applies to habitat association.
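
      As an illustration only, a spatial randomness test of the general kind mentioned in this reply can be sketched as a Monte Carlo comparison of the observed mean nearest-neighbour distance against simulations of complete spatial randomness (CSR). This is not the authors' procedure, which is not specified here; the rectangular window, its dimensions, the function names, and the choice of statistic are all assumptions made for the example.

```python
import math
import random


def mean_nn_distance(points):
    """Mean distance from each point to its nearest neighbour."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
    return total / len(points)


def csr_monte_carlo(points, width, height, n_sim=199, seed=0):
    """One-sided Monte Carlo test for clustering against CSR.

    Hypothetical helper: simulates n_sim uniform point patterns with the
    same number of points in a width x height rectangle and returns the
    observed statistic plus the share of patterns (observed included)
    whose mean nearest-neighbour distance is at most the observed one.
    A small p-value suggests the observed points are more clustered
    than expected under CSR.
    """
    rng = random.Random(seed)
    observed = mean_nn_distance(points)
    n = len(points)
    as_small = 0
    for _ in range(n_sim):
        sim = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]
        if mean_nn_distance(sim) <= observed:
            as_small += 1
    return observed, (as_small + 1) / (n_sim + 1)
```

      A low p-value from such a test indicates clustering relative to a uniform null, but says nothing about the agent responsible; a real survey window would also need edge correction and an irregular boundary, which this sketch omits.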

      (46) The Methods section provides detailed information about excavation, restoration, and spatial analyses but omits critical details about the zooarchaeological and taphonomic procedures. There is no explanation of how faunal remains were analyzed once recovered, including how cut marks, percussion marks, or green bone fractures were identified or what magnification or diagnostic criteria were used. The authors also do not specify the analytical unit used for faunal quantification (e.g., NISP, MNI, MNE, or other), making it unclear how specimen counts were generated for spatial or taphonomic analyses. Even if these details are provided in the Supplementary Information, the main text should include at least a concise summary describing the analytical framework, the criteria for identifying surface modifications and fracture morphology, and the quantification system employed. This information is essential for transparency, replicability, and proper evaluation of the behavioral interpretations.

      See reply above. There is a new subsection on taphonomic methods now.

      Supplementary information:

      (47) The Supplementary Information includes a large number of green-broken proboscidean specimens from other Olduvai localities (BK, LAS, SC, FLK West), but it is never explained why these are shown or how they relate to the EAK study. The main analysis focuses entirely on the EAK elephant, so including so much unrelated material without any stated purpose makes the supplement confusing. If these examples are meant only to illustrate the appearance of green fractures, that should be stated. Otherwise, the extensive inclusion of non-EAK material gives the impression that they were part of the analyzed assemblage when they were not.

      This is stated in the opening paragraph to the section.

      (48) Line 96: A small collection of green-broken elephant bones was retrieved from the lower and upper Bed II units.

      It would help to clarify whether these specimens are part of the EAK assemblage or derive from other Bed II localities. As written, it is not clear whether this description refers to material analyzed in the main text or to comparative examples shown only in the Supplementary Information.

      No, EAK only occupies the lower Bed II section. They belong in the Bed II paleolandscape units.

      (49) Line 97: One of them, a proximal femoral shaft found within the LAS unit, has all the traces of having been used as a tool (Figure 6).

      This says the bone tool in Figure 6 is from LAS, but the main text caption identifies it as from EAK. If I am not mistaken, EAK is a site at the base of Bed II, and LAS is a separate stratigraphic unit higher in the sequence, so the authors should clarify which is correct.

      Our mistake. Its provenience is LAS, in the vicinity of EAK.

      (50) Line 186: Figure S20. Example of other megafaunal long bone shafts showing green breaks.

      Not cited in text or SI narrative. No indication where these bones come from or why they are relevant.

      It appears justified in the revised version.

      (51) Line 474: Figure S28-S30. Hyena-ravaged giraffe bones from Chobe (Botswana).

      These figures are not discussed in the text or SI, and their relevance to the study is unclear. The authors should explain why these modern comparative examples were included and how they inform interpretations of the Olduvai assemblages.

      It appears justified in the revised version.

      (52) Line 498: Figure S31. Bos/Bison bone from Bois Roche (France).

      This figure is not mentioned in the text or Supplementary Information. The authors should specify why this specimen is shown and how it contributes to the study's taphonomic or behavioral comparisons.

      It appears justified in the revised version.

      (53) Line 504: Figure S32. Miocene Gomphotherium femur from Spain.

      This figure is never referenced in the paper. The authors should clarify the purpose of including a Miocene specimen from outside Africa and explain what it adds to the interpretation of Bed II material.

      It appears justified in the revised version.

      (54) Line 508: Figure S33. Elephant femoral shaft from BK (Olduvai).

      This figure appears to show comparative material but is not cited or discussed in the text. The authors should explain why the BK material is presented here and how it relates to EAK or the broader analysis.

      There are two figures labeled S33.

      It appears justified in the revised version.

      (55) Line 515: Figure S33. Tibia fragment from a large medium-sized bovid displaying multiple overlapping scars on both breakage planes inflicted by carnivore damage.

      Because this figure repeats the S33 label and is not cited or explained in the text, it is unclear why this specimen is included or how it contributes to the study. The authors should correct the duplicate numbering and clarify the purpose of this figure.

      It appears justified in the revised version.

      (56) Line 522: Same specimen as shown in Figure S30, viewed on its medial side.

      This is not the same bone as S30. This figure is not discussed in the text or Supplementary Information. The authors should clarify why it is included and how it relates to the rest of the analysis.

      It appears justified in the revised version.

    1. eLife Assessment

      This manuscript presents a fundamental advance in our understanding of nuclear receptor pharmacology by expanding on previous work demonstrating dual ligand occupancy in the peroxisome proliferator-activated receptor-gamma (PPARγ). Using a compelling combination of biophysical, biochemical, and cellular approaches, the authors show that covalent inverse agonists with enhanced efficacy shift the receptor conformation toward a transcriptionally repressive state that limits orthosteric ligand co-binding more effectively. This revised manuscript further strengthens support for a proximal, bidirectional allosteric model of dual ligand occupancy by sharpening the distinction between prior and new findings, adding clear conceptual figures, and strengthening statistical rigor.

    2. Reviewer #1 (Public review):

      Summary:

      This paper focuses on understanding how covalent inhibitors of peroxisome proliferator-activated receptor-gamma (PPARg) show improved inverse agonist activities. This work is important because PPARg plays essential roles in metabolic regulation, insulin sensitization, and adipogenesis. Like other nuclear receptors, PPARg is a ligand-responsive transcriptional regulator. Its important role, coupled with its ligand-sensitive transcriptional activities, makes it an attractive therapeutic target for diabetes, inflammation, fibrosis, and cancer. Traditional non-covalent ligands like thiazolidinediones (TZDs) show clinical benefit in metabolic diseases, but utility is limited by off-target effects and transient receptor engagement. In previous studies, the authors characterized and developed covalent PPARg inhibitors with improved inverse agonist activities. They also showed that these molecules engage unique PPARg ligand binding domain (LBD) conformations whereby the C-terminal helix 12 penetrates into the orthosteric binding pocket to stabilize a repressive state. In the nuclear receptor superclass of proteins, helix 12 is an allosteric switch that governs pharmacologic responses, and this new conformation was highly novel. In this study, the authors did a more thorough analysis of how two covalent inhibitors, SR33065 and SR36708, influence the structural dynamics of PPARg LBD.

      Strengths:

      (1) The authors employed a compelling integrated biochemical and biophysical approach.

      (2) The cobinding studies are unique for the field of nuclear receptor structural biology, and I'm not aware of any similar structural mechanism described for this class of proteins.

      (3) Overall, the results support their conclusions.

      (4) The results open up exciting possibilities for the development of new ligands that exploit the potential bidirectional relationship between the covalent versus non-covalent ligands studied here.

      Weaknesses:

      All weaknesses were addressed by the Authors in revision.

    3. Reviewer #2 (Public review):

      Summary:

      The authors use ligands (inverse agonists, partial agonists) for PPAR, and coactivators and corepressors, to investigate how ligands and cofactors interact in a complex manner to achieve functional outcomes (repressive vs. activating).

      Strengths:

      The data (mostly biophysical data) are compelling from well-designed experiments. Figures are clearly illustrated. The conclusions are supported by these compelling data. These results contribute to our fundamental understanding of the complex ligand-cofactor-receptor interactions.

      Weaknesses:

      Breaking down a complex system into a simpler model system, when possible, provides a unique lens with which to probe systems with mechanistic insight. While it works well in this particular paper, in general, caution should be taken when using simplified models to study a complex system.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary: 

      This paper focuses on understanding how covalent inhibitors of peroxisome proliferator-activated receptor-gamma (PPARg) show improved inverse agonist activities. This work is important because PPARg plays essential roles in metabolic regulation, insulin sensitization, and adipogenesis. Like other nuclear receptors, PPARg is a ligand-responsive transcriptional regulator. Its important role, coupled with its ligand-sensitive transcriptional activities, makes it an attractive therapeutic target for diabetes, inflammation, fibrosis, and cancer. Traditional non-covalent ligands like thiazolidinediones (TZDs) show clinical benefit in metabolic diseases, but utility is limited by off-target effects and transient receptor engagement. In previous studies, the authors characterized and developed covalent PPARg inhibitors with improved inverse agonist activities. They also showed that these molecules engage unique PPARg ligand binding domain (LBD) conformations whereby the C-terminal helix 12 penetrates into the orthosteric binding pocket to stabilize a repressive state. In the nuclear receptor superclass of proteins, helix 12 is an allosteric switch that governs pharmacologic responses, and this new conformation was highly novel. In this study, the authors did a more thorough analysis of how two covalent inhibitors, SR33065 and SR36708, influence the structural dynamics of PPARg LBD.

      Strengths: 

      (1) The authors employed a compelling integrated biochemical and biophysical approach.  

      (2) The cobinding studies are unique for the field of nuclear receptor structural biology, and I'm not aware of any similar structural mechanism described for this class of proteins.  

      (3) Overall, the results support their conclusions.  

      (4) The results open up exciting possibilities for the development of new ligands that exploit the potential bidirectional relationship between the covalent versus non-covalent ligands studied here. 

      Weaknesses: 

      (1) The major weakness in this work is that it is hard to appreciate what these shifting allosteric ensembles actually look like on the protein structure. Additional graphical representations would really help convey the exciting results of this study. 

      We thank the reviewer for the comments. In response to the specific recommendations below, we added two new figures (Figure 1 and Figure 8 in this resubmission) that we hope address the weakness identified by the reviewer.

      Reviewer #2 (Public review): 

      Summary: 

      The authors use ligands (inverse agonists, partial agonists) for PPAR, and coactivators and corepressors, to investigate how ligands and cofactors interact in a complex manner to achieve functional outcomes (repressive vs. activating). 

      Strengths: 

      The data (mostly biophysical data) are compelling from well-designed experiments. Figures are clearly illustrated. The conclusions are supported by these compelling data. These results contribute to our fundamental understanding of the complex ligand-cofactor-receptor interactions. 

      Weaknesses: 

      This is not the weakness of this particular paper, but the general limitation in using simplified models to study a complex system. 

      We appreciate the reviewer’s comments. Breaking down a complex system into a simpler model system, when possible, provides a unique lens with which to probe systems with mechanistic insight. Simplified models may not always explain the complexity of systems in cells, but they can be highly informative. For example, our recently published work showed that a simplified model system (biochemical assays using reconstituted PPARγ ligand-binding domain (LBD) protein and peptides derived from coregulator proteins, similar to the assays in this current work, together with protein NMR structural biology studies using PPARγ LBD) can explain the activity of ligand-induced PPARγ activation and repression to a high degree (Pearson/Spearman correlation coefficients ~0.7-0.9):

      MacTavish BS, Zhu D, Shang J, Shao Q, He Y, Yang ZJ, Kamenecka TM, Kojetin DJ. Ligand efficacy shifts a nuclear receptor conformational ensemble between transcriptionally active and repressive states. Nat Commun. 2025 Feb 28;16(1):2065. doi: 10.1038/s41467-025-57325-4. PMID: 40021712; PMCID: PMC11871303.

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors): 

      (1) More set-up is needed in the results section. The first paragraph is unclear on what is new to this study versus what was done previously. Likewise, a brief description of the assays used and the meaning behind differences in signals would help the general reader along. 

      We modified the last paragraph of the introduction and the first results section to better set the stage for what was done previously vs. what is new or re-collected in this study. In our results section, we also include more description of what the assays measure.

      (2) Since this paper is building on previous work, additional figures are needed in the introduction and discussion. Graphical depictions of what was found in the first study on how these ligands uniquely influence PPARg LBD conformation. A new model/depiction in the discussion for what was learned and its context with the rest of the field. 

      Our revised manuscript includes a new Figure 1 describing the possible allosteric mechanism by which a covalent ligand inhibits binding of other non-covalent ligands that was inferred from our previous study; and a new Figure 8 with a model for what has been learned.

      (3) It is stated that the results shown are representative data for at least two biological replicates. However, I do not see the other replicates shown in the supplementary information. 

      We appreciate the Reviewer’s emphasis on data reproducibility and rigor. We confirm that the biochemical and cellular assay data presented are representative of consistent findings observed across two or more biological replicates. Consistent with standard practice, we show representative data in our figures but do not include the extensive replicate data in the supplementary information.

      (4) Figure 1a could benefit from labels of antagonists, inverse agonist, etc., next to each chemical structure. Likewise, if any co-crystal or other models are available it would be helpful to include those for comparison. 

      We added the pharmacological labels to Figure 2a (old Figure 1a).

      (5) The figure legends don't seem to match up completely with the figures. For example, the legend for Figure 2b states that fitted Ki values +/- standard deviation are shown, but the figure shows the log Ki. 

      We revised the figure legends to ensure they display the appropriate errors as reported from the data fitting.

      (6) EC50, IC50, Ki, and Kd values alongside reported errors and R2 values for the fits should be reported in a table. 

      Our revised manuscript now includes a Source Data file (Figure 5—source data 1.xlsx) of the data (n=2) plotted in Figure 5 (old Figure 4) so that readers can regenerate the plots and calculate the errors and R2 values if desired. Otherwise, fitted values and errors are reported in the figures wherever Prism was able to fit the data and report errors; where Prism was unable to fit the data or the error, n.d. (not determined) is specified.

      (7) Statistical analysis is missing in some places, for example, Figure 1b. 

      We revised Figure 2b (old Figure 1b) to include statistical testing.

      Reviewer #2 (Recommendations for the authors): 

      I suggest that the authors discuss the following points to broaden the significance of the results: 

      (1) The two partial agonists (MRL24 and nTZDpa) are "partial" in the coactivator and corepressor recruitment assays, but are "complete" in the TR-FRET ligand displacement assay (Figure 2). Please explain that a partial agonist is defined based on the functional outcome (cofactor recruitment in this study) but not binding affinity/efficacy. 

      We added the following sentence to describe the partial agonist activity of these compounds: “These high affinity ligands are partial agonists as defined on their functional outcome in coregulator recruitment and cellular transcription; i.e., they are less efficacious than full agonists at recruiting peptides derived from coactivator proteins in biochemical assays (Chrisman et al., 2018; Shang et al., 2019; Shang and Kojetin, 2024) and increasing PPARγ-mediated transcription (Acton et al., 2005; Berger et al., 2003).”

      (2) Will the discovery reported here be broadly applicable? 

      (a) Applicable if other partial agonists and inhibitors are used? 

      (b) Applicable if different coactivators/corepressors, or different segments of the same cofactor, are used?

      (c) Applicable to other NRs (their AF-2 are similar but with sequence variation)?

      (d) The term "allosteric" might mean different things to different people - many readers might think that it means a "distal and unrelated" binding pocket. It might be helpful to point out that in this study, the allosteric site is actually "proximal and related". 

      We revised our introduction and/or discussion sections to expand upon these concepts; specific answers are as follows:

      (a) Orthosteric partial agonists? Yes, because helix 12 would clash with an orthosteric ligand. Other covalent inhibitors? It depends on whether the covalent inhibitor stabilizes helix 12 in the orthosteric pocket.

      (b) Yes, with some nuanced exceptions where certain segments of the same coregulator protein bind with high affinity and others apparently do not bind or bind with low affinity.

      (c) it is not clear yet if other NRs share a similar ligand-induced conformational ensemble to PPARγ

      (d) we addressed this point in the 4th paragraph of the introduction “...the non-covalent ligand binding event we previously described at the alternate/allosteric site, which is proximal to the orthosteric ligand-binding pocket, …”

    1. eLife Assessment

      This study addresses an important problem in gene regulation, namely, which features of chromatin regulate potential RNA Polymerase 2 activity at a locus. The authors provided evidence that specific post-translational modifications of histones within the gene body are correlated with Pol II transcription, that these modifications are dynamic, and that they can be regulated by Pol II activity. The manuscript contributes to the concept of "fragile nucleosomes" as a unifying framework for key epigenetic drivers of transcription; however, the quality of the evidence provided is inadequate in support of the claims made, and further evidence teasing out the mechanistic aspects of the work would strengthen its impact. This work will be of interest to the fields of transcriptional regulation, chromatin structure, and epigenetics.

    2. Reviewer #1 (Public review):

      Summary:

      This study aims to explore how different forms of "fragile nucleosomes" facilitate RNA Polymerase II (Pol II) transcription along gene bodies in human cells. The authors propose that pan-acetylated, pan-phosphorylated, tailless, and combined acetylated/phosphorylated nucleosomes represent distinct fragile states that enable efficient transcription elongation. Using CUT&Tag-seq, RNA-seq, and DRB inhibition assays in HEK293T cells, they report a genome-wide correlation between histone pan-acetylation/phosphorylation and active Pol II occupancy, concluding that these modifications are essential for Pol II elongation.

      Strengths:

      (1) The manuscript tackles an important and long-standing question about how Pol II overcomes nucleosomal barriers during transcription.

      (2) The use of genome-wide CUT&Tag-seq for multiple histone marks (H3K9ac, H4K12ac, H3S10ph, H4S1ph) alongside active Pol II mapping provides a valuable dataset for the community.

      (3) The integration of inhibition (DRB) and recovery experiments offers insight into the coupling between Pol II activity and chromatin modifications.

      (4) The concept of "fragile nucleosomes" as a unifying framework is potentially appealing and could stimulate further mechanistic studies.

      Weaknesses:

      (1) Misrepresentation of prior literature

      The introduction incorrectly describes findings from Bintu et al., 2012. The cited work demonstrated that pan-acetylated or tailless nucleosomes reduce the nucleosomal barrier for Pol II passage, rather than showing no improvement. This misstatement undermines the rationale for the current study and should be corrected to accurately reflect prior evidence.

      (2) Incorrect statement regarding hexasome fragility

      The authors claim that hexasome nucleosomes "are not fragile," citing older in vitro work. However, recent studies clearly showed that hexasomes exist in cells (e.g., PMID 35597239) and that they markedly reduce the barrier to Pol II (e.g., PMID 40412388). These studies need to be acknowledged and discussed.

      (3) Inaccurate mechanistic interpretation of DRB

      The Results section states that DRB causes a "complete shutdown of transcription initiation (Ser5-CTD phosphorylation)." DRB is primarily a CDK9 inhibitor that blocks Pol II release from promoter-proximal pausing. While recent work (PMID: 40315851) suggests that CDK9 can contribute to CTD Ser5/Ser2 di-phosphorylation, the manuscript's claim of initiation shutdown by DRB should be revised to better align with the literature. The data in Figure 4A indicate that 1 µM DRB fully inhibits Pol II activity, yet much higher concentrations (10-100×) are needed to alter H3K9ac and H4K12ac levels. The authors should address this discrepancy by discussing the differential sensitivities of CTD phosphorylation versus histone modification turnover.

      (4) Insufficient resolution of genome-wide correlations

      Figure 1 presents only low-resolution maps, which are insufficient to determine whether pan-acetylation and pan-phosphorylation correlate with Pol II at promoters or gene bodies. The authors should provide normalized metagene plots (from TSS to TTS) across different subgroups to visualize modification patterns at higher resolution. In addition, the genome-wide distribution of another histone PTM with a different localization pattern should be included as a negative control.

      (5) Conceptual framing

      The manuscript frequently extrapolates correlative genome-wide data to mechanistic conclusions (e.g., that pan-acetylation/phosphorylation "generate" fragile nucleosomes) without direct biochemical or structural evidence. Such causality statements should be toned down.

    3. Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors use various genomics approaches to examine nucleosome acetylation, phosphorylation, and PolII-CTD phosphorylation marks. The results are synthesized into a hypothesis that 'fragile' nucleosomes are associated with active regions of PolII transcription.

      Strengths:

      The manuscript contains a lot of genome-wide analyses of histone acetylation, histone phosphorylation, and PolII-CTD phosphorylation.

      Weaknesses:

      This reviewer's main research expertise is in the in vitro study of transcription and its regulation in purified, reconstituted systems. I am not an expert at the genomics approaches and their interpretation, and overall, I had a very hard time understanding and interpreting the data that are presented in this manuscript. I believe this is due to a problem with the manuscript, in that the presentation of the data is not explained in a way that's understandable and interpretable to a non-expert. For example:

      (1) Figure 1 shows genome-wide distributions of H3K9ac, H4K12ac, Ser2ph-PolII, mRNA, H3S10ph, and H4S1ph, but does not demonstrate correlations/coupling - it is not clear from these data that pan-acetylation and pan-phosphorylation are coupled with Pol II transcription.

      (2) Figure 2 - It's not clear to me what Figure 2 is supposed to be showing.

      (A) Needs better explanation - what is the meaning of the labels at the top of the gel lanes?

      (B) This reviewer is not familiar with this technique, its visualization, or its interpretation - more explanation is needed. What is the meaning of the quantitation graphs shown at the top? How were these calculated (what is on the y-axis)?

      (3) To my knowledge, the initial observation of DRB effects on RNA synthesis also concluded that DRB inhibited initiation of RNA chains (pmid:982026) - this needs to be acknowledged.

      (4) Again, Figures 4B, 4C, 5, and 6 are very difficult to understand - what is shown in these heat maps, and what is shown in the quantitation graphs on top?

    4. Reviewer #3 (Public review):

      Summary:

      Li et al. investigated the prevalence of acetylated and phosphorylated histones (using H3K9ac, H4K12ac, H3S10ph & H4S1ph as representative examples) across the gene body of human HEK293T cells, as well as mapping elongating Pol II and mRNA. They found that histone acetylation and phosphorylation were dominant in gene bodies of actively transcribing genes. Genes with acetylation/phosphorylation restricted to the promoter region were also observed. Furthermore, they investigated and reported a correlation between histone modifications and Pol II activity, finding that inhibition of Pol II activity reduced acetylation/phosphorylation levels, while resuming Pol II activity restored them. The authors then proposed a model in which pan-acetylation or pan-phosphorylation of histones generates fragile nucleosomes; the first round of transcription is accompanied by pan-acetylation, while subsequent rounds are accompanied by pan-phosphorylation.

      Strengths:

      This study addresses a highly significant problem in gene regulation. The author provided riveting evidence that certain histone acetylation and/or phosphorylation within the gene body is correlated with Pol II transcription. The author furthermore made a compelling case that such transcriptionally correlated histone modification is dynamic and can be regulated by Pol II activity. This work has provided a clearer view of the connection between epigenetics and Pol II transcription.

      Weaknesses:

      The title of the manuscript, "Fragile nucleosomes are essential for RNA Polymerase II to transcribe in eukaryotes", suggests that fragile nucleosomes lead to transcription. While this study shows a correlation between histone modifications in gene bodies and transcription elongation, a causal relationship between the two has not been demonstrated.

    5. Author response:

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary: 

      This study aims to explore how different forms of "fragile nucleosomes" facilitate RNA Polymerase II (Pol II) transcription along gene bodies in human cells. The authors propose that pan-acetylated, pan-phosphorylated, tailless, and combined acetylated/phosphorylated nucleosomes represent distinct fragile states that enable efficient transcription elongation. Using CUT&Tag-seq, RNA-seq, and DRB inhibition assays in HEK293T cells, they report a genome-wide correlation between histone pan-acetylation/phosphorylation and active Pol II occupancy, concluding that these modifications are essential for Pol II elongation. 

      Strengths: 

      (1) The manuscript tackles an important and long-standing question about how Pol II overcomes nucleosomal barriers during transcription. 

      (2) The use of genome-wide CUT&Tag-seq for multiple histone marks (H3K9ac, H4K12ac, H3S10ph, H4S1ph) alongside active Pol II mapping provides a valuable dataset for the community. 

      (3) The integration of inhibition (DRB) and recovery experiments offers insight into the coupling between Pol II activity and chromatin modifications. 

      (4) The concept of "fragile nucleosomes" as a unifying framework is potentially appealing and could stimulate further mechanistic studies. 

      We really appreciate the reviewer's positive comments.

      Weaknesses: 

      (1)  Misrepresentation of prior literature 

      The introduction incorrectly describes findings from Bintu et al., 2012. The cited work demonstrated that pan-acetylated or tailless nucleosomes reduce the nucleosomal barrier for Pol II passage, rather than showing no improvement. This misstatement undermines the rationale for the current study and should be corrected to accurately reflect prior evidence. 

      Our statement follows the original report (Bintu et al., Cell, 2012). Here is the relevant passage from that report:

      Page 739 (Bintu, L. et al., Cell, 2012; PMID: 23141536):

      “Overall transcription through tailless and acetylated nucleosomes is slightly faster than through unmodified nucleosomes (Figure 1C), with crossing times that are generally under 1 min (39.5 ± 5.7 and 45.3 ± 7.6 s, respectively). Both the removal and acetylation of the tails increase efficiency of NPS passage: 71% for tailless nucleosomes and 63% for acetylated nucleosomes (Figures 1C and S1), in agreement with results obtained using bulk assays of transcription (Ujvári et al., 2008).”

      We will cite this original sentence in our revision.

      (2) Incorrect statement regarding hexasome fragility

      The authors claim that hexasome nucleosomes "are not fragile," citing older in vitro work. However, recent studies clearly showed that hexasomes exist in cells (e.g., PMID 35597239) and that they markedly reduce the barrier to Pol II (e.g., PMID 40412388). These studies need to be acknowledged and discussed. 

      The term “hexasome” was introduced in the transcription field four decades ago. Later, several groups claimed that the “hexasome” is fragile and could be generated during Pol II transcription elongation. However, their original definition was based on the in vivo detection, by micrococcal nuclease sequencing (MNase-seq), of MNase-resistant DNA fragments of ~100 bp, which is the right length to wrap the histone core of one hexasome (two H3/H4 and one H2A/H2B) and form the sub-nucleosome of a hexasome. We should all agree that acetylation or phosphorylation of the histone tails compromises the interaction between DNA and the histone subunits, which could make an intact naïve nucleosome fragile, easy to disassemble, and easy for MNase to access. Fragile nucleosomes give MNase better access to the DNA that wraps around the histone octamer, producing shorter DNA fragments (~100 bp instead of ~140 bp). In this regard, we believe that these ~100 bp fragments are the products of fragile nucleosomes (fragile nucleosome --> hexasome), instead of the other way around (hexasome --> fragile). 

      In fact, two early reports from Dr. David J. Clark’s group at the NIH raised questions about the existence of hexasomes in vivo (PMID: 28157509; PMID: 25348398).

      In the report of PMID: 35597239, depletion of INO80 leads to a reduction of “hexasomes” for a group of genes, and the distribution of both “nucleosomes” and “hexasomes” within gene bodies becomes fuzzier (lower signal to noise). In a recent theoretical model (PMID: 41425263), the corresponding PI found that chromatin remodelers could act as drivers of histone modification complexes to carry out different modifications along gene bodies. The PI found that INO80 could drive NuA3 (an H3 acetyltransferase) to carry out pan-acetylation of H3, and possibly H2B as well, in the later rounds of Pol II transcription for a group of genes (SAGA-dependent). This suggests that depletion of INO80 will reduce the pan-acetylation of nucleosomes, which lowers the number of pan-acetylated fragile nucleosomes and, subsequently, the number of “hexasomes”. This explains why depletion of INO80 leads to fuzzier results for both nucleosomes and “hexasomes” in PMID: 35597239. The result of PMID: 35597239 could be a strong piece of evidence supporting the model proposed by the corresponding PI (PMID: 41425263).

      In a recent report (PMID: 40412388), the authors claimed that FACT could bind to nucleosomes to generate “hexasomes”, which are fragile enough for Pol II to overcome the resistance of nucleosomes. It is well established that FACT enhances the processivity of Pol II in vivo via its chaperone activity. However, the exact working mechanism of FACT remains ambiguous. A report from Dr. Cramer’s group showed that FACT enhances the elongation of regular genes but has the opposite effect on pausing-regulated genes (PMID: 38810649). An excellent review by Drs. Tim Formosa and Fred Winston showed that FACT is not required for the survival of a group of differentiated cells (PMID: 33104782), suggesting that FACT is not always required for transcription. It is also quite tricky to generate naïve hexasomes in vitro, according to early reports from the late Dr. Widom’s group. Most importantly, the speed of Pol II reported in PMID: 40412388 (at best ~27 bp/s, on bare DNA) is much slower than the speed of Pol II in vivo (~2.5 kb/min, or ~40 bp/s). In our recovery experiments (Fig. 4C, as mentioned by Reviewer #3), all Pol IIs move at a uniform speed of ~2.5 kb/min in vivo over the 20 minutes between the 10-minute and 30-minute time points. We exclude the first period because, in CUT&Tag-seq, Pol II remains active after the cells are collected, so there is a long delay before Pol II stops completely during the procedure; the apparent speed over that first period (~5 kb/min) therefore does not reflect the actual speed of Pol II. Interestingly, a recent report from Dr. Shixin Liu’s group (PMID: 41310264) showed that when SPT4/5 is added to an in vitro transcription system with bare DNA, the speed of Pol II reaches ~2.5 kb/min, exactly the speed we derived in vivo. Like the original report (PMID: 23141536), the report of PMID: 40412388 does not exactly mimic in vivo conditions.

      There is an urgent need to revisit the current definition of the “hexasome”, which is claimed to be fragile and to be generated during Pol II elongation in vivo. MNase is an enzyme that only works when its substrate is accessible. In inactive regions of the genome, chromatin is tightly packed, so MNase cannot access individual nucleosomes within gene bodies or upstream of promoters. This is why we only see a phased, clearly spaced distribution of nucleosomes at transcription start sites, while the distribution becomes fuzzy downstream or upstream of promoters. For fragile nucleosomes, on the other hand, accessibility to MNase should increase dramatically, which leads to the ~100 bp fragments. Based on the uniform rate (2.5 kb/min) of Pol II for all genes derived from human 293T cells and the similar rate (2.5 kb/min) of Pol II on bare DNA in vitro, it is unlikely that Pol II pauses in the middle of nucleosomes to generate “hexasomes” in order to continue elongation along gene bodies. As with RNAPs in bacteria (no nucleosomes) and Archaea (tailless nucleosomes), there should be no resistance when Pol IIs transcribe through the fragile nucleosomes within gene bodies in all eukaryotes, as we characterized in this manuscript. 

      (3)  Inaccurate mechanistic interpretation of DRB 

      The Results section states that DRB causes a "complete shutdown of transcription initiation (Ser5-CTD phosphorylation)." DRB is primarily a CDK9 inhibitor that blocks Pol II release from promoter-proximal pausing. While recent work (PMID: 40315851) suggests that CDK9 can contribute to CTD Ser5/Ser2 di-phosphorylation, the manuscript's claim of initiation shutdown by DRB should be revised to better align with the literature. The data in Figure 4A indicate that 1 µM DRB fully inhibits Pol II activity, yet much higher concentrations (10-100×) are needed to alter H3K9ac and H4K12ac levels. The authors should address this discrepancy by discussing the differential sensitivities of CTD phosphorylation versus histone modification turnover. 

      Yes, it has been reported that DRB is also an inhibitor of CDK9. However, we hope the reviewer agrees with us, and with the current view in the field, that phosphorylation of the Ser5-CTD of Pol II marks the initiation of transcription for all Pol II-regulated genes in eukaryotes. CDK9 is only required to act on the already phosphorylated Ser5-CTD of Pol II to release paused Pol II, which only happens in metazoans. A series of works by us and others shows that CDK9 is unique to metazoans and is required only for pausing-regulated genes, not for regular genes. We found that CDK9 works on initiated Pol II (Ser5-CTD phosphorylated Pol II) and generates a unique phosphorylation pattern on the CTD of Pol II (Ser2ph-Ser2ph-Ser5ph-CTD of Pol II), which is required to recruit JMJD5 (via its CID domain) to generate a tailless nucleosome at +1 from the TSS and release paused Pol II (PMID: 32747552). Interestingly, the report from Dr. Jesper Svejstrup’s group (PMID: 40315851) showed that CDK9 could generate a unique phosphorylation pattern (Ser2ph-Ser5ph-CTD of Pol II) that is not recognized by the popular 3E10 antibody, which recognizes the single Ser2ph-CTD of Pol II. This result is consistent with our earlier report showing that the unique phosphorylation pattern (Ser2ph-Ser2ph-Ser5ph-CTD of Pol II) is specifically generated by CDK9 in animals and is not recognized by 3E10 either (PMID: 32747552). In fact, an early report from Dr. Dirk Eick’s group (PMID: 26799765) showed the difference in the phosphorylation pattern of the CTD of Pol II between animal cells and yeast cells. We have characterized how CDK9 is released from 7SK snRNP and recruited onto paused Pol II via the coupling of JMJD6 and BRD4 (PMID: 32048991), which was published in eLife. It is well established that CDK9 works after CDK7 or CDK8. According to our PRO-seq data (Fig. 3) and CUT&Tag-seq data of active Pol II (Fig. 4), adding DRB completely shuts down all genes by inhibiting the initiation of Pol II (generation of Ser5ph-CTD of Pol II). Because CDK9 is unique to metazoans, it is not required for the activation of CDK12 or CDK13 (the orthologs of CTK1 in yeast), as we demonstrated recently (PMID: 41377501). Instead, we found that CDK11/10, which acts as the ortholog of the yeast Bur1 kinase, is essential for the phosphorylation of Spt5, the linker of the CTD of Pol II, and CDK12 (PMID: 41377501). 

      (4) Insufficient resolution of genome-wide correlations 

      Figure 1 presents only low-resolution maps, which are insufficient to determine whether pan-acetylation and pan-phosphorylation correlate with Pol II at promoters or gene bodies. The authors should provide normalized metagene plots (from TSS to TTS) across different subgroups to visualize modification patterns at higher resolution. In addition, the genome-wide distribution of another histone PTM with a different localization pattern should be included as a negative control. 

      A popular view in the field is that most of the genome is inactive, since it does not encode the coding RNAs responsible for the ~20,000 protein candidates characterized in animals. However, our genome-wide characterization using the four histone modification marks, active Pol II, and RNA-seq tells a different story. Figure 1 shows that most of the human genome in HEK293T cells is active in producing not only protein-coding RNAs but also non-coding RNAs (the majority). We believe that Figure 1 could change the current view of the activity of the entire genome and should be of great interest to general readers as well as genomics researchers. Furthermore, it is the basis for Figure 2, which is a zoom-in of Figure 1. 

      (5) Conceptual framing 

      The manuscript frequently extrapolates correlative genome-wide data to mechanistic conclusions (e.g., that pan-acetylation/phosphorylation "generate" fragile nucleosomes) without direct biochemical or structural evidence. Such causality statements should be toned down. 

      The reviewer is right; we should tone down the strong statements. However, we believe that our data are strong enough to support the general conclusion. The reviewer may agree with us that the field of transcription and epigenetics has been stagnant in recent decades and that there is an urgent need for fresh ideas to change the current situation. Our novel discoveries, although they certainly need additional supporting data, should open up a new avenue for people to explore. We believe that a new era of transcription research will emerge based on these discoveries, and we hope that this manuscript will attract more people to these topics. As Reviewer #3 pointed out, this story establishes the connection between transcription and epigenetics in the field. 

      Reviewer #2 (Public review): 

      Summary: 

      In this manuscript, the authors use various genomics approaches to examine nucleosome acetylation, phosphorylation, and PolII-CTD phosphorylation marks. The results are synthesized into a hypothesis that 'fragile' nucleosomes are associated with active regions of PolII transcription. 

      Strengths: 

      The manuscript contains a lot of genome-wide analyses of histone acetylation, histone phosphorylation, and PolII-CTD phosphorylation. 

      Weaknesses: 

      This reviewer's main research expertise is in the in vitro study of transcription and its regulation in purified, reconstituted systems. 

      Actually, the pioneering work establishing in vitro transcription assays in Dr. Robert Roeder’s group led to numerous groundbreaking discoveries in the transcription field. In vitro work was key to exploring the complexity of eukaryotic transcription in the early days and remains important today.

      I am not an expert at the genomics approaches and their interpretation, and overall, I had a very hard time understanding and interpreting the data that are presented in this manuscript.  I believe this is due to a problem with the manuscript, in that the presentation of the data is not explained in a way that's understandable and interpretable to a non-expert.

      Thanks for your suggestions. You are right: we had problems expressing our ideas clearly in this manuscript, which could confuse readers. We will make modifications accordingly.

      For example: 

      (1) Figure 1 shows genome-wide distributions of H3K9ac, H4K12ac, Ser2phPolII, mRNA, H3S10ph, and H4S1ph, but does not demonstrate correlations/coupling - it is not clear from these data that pan-acetylation and pan-phosphorylation are coupled with Pol II transcription. 

      Figure 1 shows the overall distribution of the four major histone modifications, active Pol II, and mRNA genome-wide in human HEK293T cells. It tells general readers that the genome is far more active than the common prediction that most of it is inactive because only a small portion expresses coding RNAs (~20,000 in animals). Figure 1 shows that the majority of the genome is active and expresses not only coding mRNAs but also non-coding RNAs. It is also the basis of Figure 2, which is a zoom-in of Figure 1. However, it is beyond the scope of this manuscript to discuss the non-coding RNAs.

      (2) Figure 2 - It's not clear to me what Figure 2 is supposed to be showing. 

      (A) Needs better explanation - what is the meaning of the labels at the top of the gel lanes? 

      Figure 2 is a zoom-in for the individual gene, which shows how histone modifications are coupled with Pol II activity on the individual gene. We will give a more detailed explanation of the figure per the reviewer’s suggestions.

      (B) This reviewer is not familiar with this technique, its visualization, or its interpretation - more explanation is needed. What is the meaning of the quantitation graphs shown at the top? How were these calculated (what is on the y-axis)? 

      Good suggestions; we will make the corresponding modifications.

      (3) To my knowledge, the initial observation of DRB effects on RNA synthesis also concluded that DRB inhibited initiation of RNA chains (pmid:982026) - this needs to be acknowledged.

      Thanks for the reference, which is the first report to show that DRB inhibits Pol II initiation in vivo. We will cite it in the revision.

      (4) Again, Figures 4B, 4C, 5, and 6 are very difficult to understand - what is shown in these heat maps, and what is shown in the quantitation graphs on top? 

      Thanks for the suggestions, we will give a more detailed description of the Figures.  

      Reviewer #3 (Public review): 

      Summary: 

      Li et al. investigated the prevalence of acetylated and phosphorylated histones (using H3K9ac, H4K12ac, H3S10ph & H4S1ph as representative examples) across the gene body of human HEK293T cells, as well as mapping elongating Pol II and mRNA. They found that histone acetylation and phosphorylation were dominant in gene bodies of actively transcribing genes. Genes with acetylation/phosphorylation restricted to the promoter region were also observed. Furthermore, they investigated and reported a correlation between histone modifications and Pol II activity, finding that inhibition of Pol II activity reduced acetylation/phosphorylation levels, while resuming Pol II activity restored them. The authors then proposed a model in which pan-acetylation or pan-phosphorylation of histones generates fragile nucleosomes; the first round of transcription is accompanied by pan-acetylation, while subsequent rounds are accompanied by pan-phosphorylation.

      Strengths: 

      This study addresses a highly significant problem in gene regulation. The author provided riveting evidence that certain histone acetylation and/or phosphorylation within the gene body is correlated with Pol II transcription. The author furthermore made a compelling case that such transcriptionally correlated histone modification is dynamic and can be regulated by Pol II activity. This work has provided a clearer view of the connection between epigenetics and Pol II transcription. 

      Thanks for the insightful comments, which are exactly what we want to present in this manuscript. 

      Weaknesses: 

      The title of the manuscript, "Fragile nucleosomes are essential for RNA Polymerase II to transcribe in eukaryotes", suggests that fragile nucleosomes lead to transcription. While this study shows a correlation between histone modifications in gene bodies and transcription elongation, a causal relationship between the two has not been demonstrated. 

      Thanks for the suggestions. What we want to express is that the generation of fragile nucleosomes precedes transcription, or, more specifically, transcription elongation. The corresponding PI wrote a hypothetical model on how pan-acetylation is generated by the coupling of chromatin remodelers and acetyltransferase complexes along gene bodies, in which chromatin remodelers act as drivers to carry acetyltransferases along gene bodies to generate pan-acetylation of nucleosomes (PMID: 41425263). We have a series of work to show how “tailless nucleosomes” at +1 from transcription start sites are generated to release paused Pol II in metazoans (PMID: 28847961) (PMID: 29459673) (PMID: 32747552) (PMID: 32048991).   We still do not know how pan-phosphorylation along gene bodies is generated. It should be one of the focuses of our future research.

    1. eLife Assessment

      This is an important study on the sensory roles of cerebrospinal fluid-contacting neurons (CSF-cNs) in mammals. The authors identify PKD2L1 as the predominant pH-sensing channel in CSF-cNs and show how the apical extension is used as an amplifier of chemical changes in the content of the cerebrospinal fluid. The evidence is solid in experimental design but limited in mechanistic interpretation, as the electrophysiological analyses require re-evaluation.

    2. Reviewer #1 (Public review):

      This study by Vitar et al. probes the molecular identity and functional specialization of pH-sensing channels in cerebrospinal fluid-contacting neurons (CSFcNs). Combining patch-clamp electrophysiology, laser-based local acidification, immunohistochemistry, and confocal imaging, the authors propose that PKD2L1 channels localized to the apical protrusion (ApPr) function as the predominant dual-mode pH sensor in these cells.

      The work establishes a compelling spatial-physiological link between channel localization and chemosensory behavior. The integration of optical and electrical approaches is technically strong, and the separation of phasic and sustained response modes offers a useful conceptual advance for understanding how CSF composition is monitored.

      Several aspects of data interpretation, however, require clarification or reanalysis, most notably the single-channel analyses (event counts, Po metrics, and mixed parameters), the statistical treatment, and the interpretation of purported "OFF currents." Additional issues include PKD2L1-TRPP3 nomenclature consistency, kinetic comparison with ASICs, and the physiological relevance of the extreme acidification paradigm. Addressing these points will substantially improve reproducibility and mechanistic depth.

      Overall, this is a scientifically important and technically sophisticated study that advances our understanding of CSF sensing, provided that the analytical and interpretative weaknesses are satisfactorily corrected.

      (1) The authors should re-analyze electrophysiological data, focusing on macroscopic currents rather than statistically unreliable Po calculations. Remove or revise the Po analysis, which currently conflates current amplitude and open probability.

      (2) PKD2L1-TRPP3 nomenclature should be clarified and all figure labels, legends, and text should use consistent terminology throughout.

      (3) The authors should reinterpret the so-called OFF currents as pH-dependent recovery or relaxation phenomena, not as distinct current species. Remove the term "OFF response" from the manuscript.

      (4) Evidence for physiological relevance should be provided, including data from milder acidification (pH 6.5-6.8) and, where appropriate, comparisons with ASIC-mediated currents to place PKD2L1 activity in context.

      (5) Terminology and data presentation should be unified, adopting consistent use of "predominant" (instead of "exclusive") and "sustained" (instead of "tonic"), and all statistical formats and units should be standardized.

      (6) The Discussion should be expanded to address potential Ca²⁺-dependent signaling mechanisms downstream of PKD2L1 activation and their possible roles in CSF flow regulation and central chemoreception.

    3. Reviewer #2 (Public review):

      Summary:

      Cerebrospinal fluid contacting neurons (CSF-cNs) are GABAergic cells surrounding the spinal cord central canal (CC). In mammals, their soma lies sub-ependymally, with a dendritic-like apical extension (AP) terminating as a bulb inside the CC.

      How this anatomy (soma and AP in distinct extracellular environments) relates to their multimodal CSF-sensing function remains unclear.

      The authors confirm that in GATA3:GFP mice, where these cells are labeled, CSFcNs exhibit prominent spontaneous electrical activity mediated by PKD2L1 (TRPP2) channels, non-selective cation channels with ~200 pS conductance modulated by protons and mechanical forces.

      They investigated PKD2L1 pH sensitivity and its effects on CSFcN excitability. They uncovered that PKD2L1 generates both phasic and tonic currents, bidirectionally modulated by pH with high sensitivity near physiological values.

      Combining electrophysiology (intact and isolated AP recordings) with elegant laser-photolysis, they show that functional PKD2L1 channels localize specifically to the apical extension (AP).

      This spatial segregation, coupled with PKD2L1's biophysical properties (high conductance, pH sensitivity) and the AP's unique features (very high input resistance), renders CSFcN excitability highly sensitive to PKD2L1 modulation. Their findings reveal how the AP's properties are optimised for its sensory role.

      Strengths:

      This is a very convincing demonstration using elegant and challenging approaches (uncaging, outside out patch of the AP) together to form a complete understanding of how these sensory cells can detect the changes of pH in the CSF so finely.

      Weaknesses:

      The following do not constitute weaknesses; rather, they are minor requests that this reviewer considers would complete this beautiful study.

      (1) It would be nice to quantify further the relation in spontaneous as well as in acidic or basic pH between the effects observed on channel opening and holding current: do they always vary together and in a linear way?

      (2) Since CSF-cNs also respond to changes in osmolarity (Orts Dell Immagine 2013) & mechanosensory stimulations in a PKD2L1 dependent manner (Sternberg NC 2018), it would be nice to test whether the same results hold true for the role of PKD2L1 in the AP when changes in osmolarity are applied by pressure.

      In mice, like in fish (Sternberg et al, NC 2018), we can observe throughout the figures that a large fraction of the channel activity occurs with partial and very fast openings of the PKD2L1 channel. I recommend the authors analyse the points below:

      (a) To what extent do these partial openings of the channel contribute to the changes in holding current and resting potential?

      (b) In the trace from the outside out AP, it looks like the partial transient openings are gone. Can the authors verify whether these partial openings are only present in somatic recordings?

      (3) Previous studies have observed expression of metabotropic Glutamate receptors in CSF-cNs (transcriptome from Prendergast et al CB 2023). The authors only used blockers for ionotropic glutamate receptors in their recordings: could it be that these metabotropic receptors influence the response to uncaging of MNI-Glu when glutamate is co-released with a proton?

      (4) In the outside out patch of the AP, PKD2L1 unitary currents appear rare. Could it be that the disruption in the cilium or underlying actin/myosin cytoskeleton drastically alter the open probability of the channel?

      (5) Could the authors use drugs against ASIC to specify which ASIC channels contribute to the pH response in the soma?

      (6) This is out of the scope of this study, but we did observe in fish a very rarely-opening channel in the PKD2L1KO mutant. I wonder if the authors have similar observations in the conditions where PKD2L1 is mainly in the closed state.

    4. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      This study by Vitar et al. probes the molecular identity and functional specialization of pH-sensing channels in cerebrospinal fluid-contacting neurons (CSFcNs). Combining patch-clamp electrophysiology, laser-based local acidification, immunohistochemistry, and confocal imaging, the authors propose that PKD2L1 channels localized to the apical protrusion (ApPr) function as the predominant dual-mode pH sensor in these cells.

      The work establishes a compelling spatial-physiological link between channel localization and chemosensory behavior. The integration of optical and electrical approaches is technically strong, and the separation of phasic and sustained response modes offers a useful conceptual advance for understanding how CSF composition is monitored.

      Several aspects of data interpretation, however, require clarification or reanalysis, most notably the single-channel analyses (event counts, Po metrics, and mixed parameters), the statistical treatment, and the interpretation of purported "OFF currents." Additional issues include PKD2L1-TRPP3 nomenclature consistency, kinetic comparison with ASICs, and the physiological relevance of the extreme acidification paradigm. Addressing these points will substantially improve reproducibility and mechanistic depth.

      Overall, this is a scientifically important and technically sophisticated study that advances our understanding of CSF sensing, provided that the analytical and interpretative weaknesses are satisfactorily corrected.

      (1) The authors should re-analyze electrophysiological data, focusing on macroscopic currents rather than statistically unreliable Po calculations. Remove or revise the Po analysis, which currently conflates current amplitude and open probability.

      We agree with the reviewer that the Po analysis has strong limitations, particularly in experiments where the recording times are short, such as when extracellular pH is changed via photolysis (Figure 4D) or puff application (Figure 3Aa). To circumvent this problem and not rely solely on Po estimations, we used alternative methods, including an analysis of the total membrane charge (extensively used throughout the manuscript, as in Figures 3A and 4D) and an analysis of event latencies (Figure 4G). Nevertheless, single channel recordings contain information that is not included in the macroscopic current analysis. In the revised version, we intend to stress that the elementary current amplitude is conserved during manipulations such as pH changes, leaving the total number of channels (N) and the channel open probability (Po) as possible culprits for the current changes. Since these changes are rapid and reversible, it is likely that N remains constant while Po changes. To address the reviewer’s concern, we propose the following changes/reanalysis: (i) report in each condition the minimum N (based on the maximum number of simultaneously open channels; for example, in Figure 3Aa, the minimum N goes from 4-5 in control conditions to 1 during the puff of the pH 6.4 solution). Although imperfect, this method provides a tentative estimate of Po; (ii) report the fraction of time that the channels remain open; (iii) revise the text and figures to use the expression “apparent Po” instead of “Po”, acknowledging the limitations of the measurement in short recordings. We also acknowledge that some traces (Figure 3Aa, top) may appear confusing, as they seem to show macroscopic currents. We will modify these figures by including the amplitude histograms (as in Figure 1Bb) to clearly demonstrate that recordings from CSFcNs primarily reflect single-channel activity when challenged with pH changes.
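
      The quantities proposed in this response (minimum N, fraction of time open, and "apparent Po") can be illustrated with a minimal sketch. It assumes an idealized trace in which each sample already holds the number of simultaneously open channels; the function name and the two example traces are hypothetical and are not taken from the manuscript. Because the minimum N is only a lower bound on the true channel number, the resulting apparent Po should be read as an upper bound.

```python
def summarize_idealized_trace(open_levels):
    """Summarize an idealized single-channel trace.

    open_levels: list of ints, one per sample, giving the number of
    simultaneously open channels at that sample (hypothetical input format).
    """
    if not open_levels:
        raise ValueError("trace is empty")
    total = len(open_levels)
    # Minimum N: the largest number of channels seen open at the same time.
    min_n = max(open_levels)
    # Fraction of samples in which at least one channel is open.
    fraction_open = sum(1 for k in open_levels if k > 0) / total
    # NPo: time-averaged number of open channels.
    npo = sum(open_levels) / total
    # Apparent Po: NPo divided by the minimum N (upper bound on the true Po).
    apparent_po = npo / min_n if min_n > 0 else 0.0
    return {
        "min_n": min_n,
        "fraction_open": fraction_open,
        "npo": npo,
        "apparent_po": apparent_po,
    }


# Hypothetical idealized traces, for illustration only.
control_trace = [0, 1, 2, 4, 3, 1, 0, 2, 1, 0]
acid_trace = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

control_summary = summarize_idealized_trace(control_trace)
acid_summary = summarize_idealized_trace(acid_trace)
print(control_summary)
print(acid_summary)
```

      In this sketch the minimum N differs between the two conditions, so comparing the raw NPo values alongside the apparent Po helps separate a change in open probability from a change in the estimated channel count.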

      (2) PKD2L1-TRPP3 nomenclature should be clarified and all figure labels, legends, and text should use consistent terminology throughout.

      We agree with the reviewer that the nomenclature for the polycystin protein family is confusing. In this manuscript, we have followed the nomenclature proposed in a recent comprehensive review on polycystin channels by Palomero, Larmore and DeCaen (Palomero et al. 2023), which refers to the channels by their gene names. As indicated in that review, the PKD2L1 channel corresponds to TRPP2 (previously known as TRPP3, see their Table 1). However, in another recent review on TRP channels, the PKD2L1 channel is referred to as TRPP3 (Zhang et al. 2023). To prevent any ambiguity, we will remove references to the TRPP nomenclature from the text and exclusively use the PKD2L1 acronym.

      (3) The authors should reinterpret the so-called OFF currents as pH-dependent recovery or relaxation phenomena, not as distinct current species. Remove the term "OFF response" from the manuscript.

      Although largely used in the literature, we concur with the reviewer that the term “OFF response” is not very helpful from a biophysical perspective as it may imply the existence of a distinct current. Consequently, we will remove the terms “OFF response” and “OFF current” from the revised manuscript and replace them with the term “photolysis-evoked PKD2L1 current”. Furthermore, to improve the logical flow, we will condense the two sections (“The proton-induced current is an off-current” and “The off-current is mediated by the activation of PKD2L1 channels”) into a single, new section titled “The photolysis-induced current is mediated by PKD2L1 channels”. This consolidation will prevent the artificial separation of the description of this current. Finally, we will revise the discussion to better characterize this photolysis-evoked phenomenon as a recovery current.

      (4) Evidence for physiological relevance should be provided, including data from milder acidification (pH 6.5-6.8) and, where appropriate, comparisons with ASIC-mediated currents to place PKD2L1 activity in context.

      This point is partly addressed in Figure 3. The data indicate that PKD2L1 channels are highly sensitive to pH variations within the physiological range. To strengthen this conclusion, we will add the EC50 values derived from the curve fittings to the figure. Regarding ASIC-mediated currents, one of our main conclusions is that ASICs are not present in the apical process (ApPr), as the effects of proton photolysis in the ApPr are not blocked by ASIC antagonists. Our results suggest that PKD2L1 channels are the exclusive pH-sensitive channels in the ApPr. ASIC channels likely mediate acid sensitivity in the soma, although we have not investigated the latter in detail. We intend to modify the Discussion in order to provide a physiological framework linking channel activity with physiological and pathophysiological pH changes.

      (5) Terminology and data presentation should be unified, adopting consistent use of "predominant" (instead of "exclusive") and "sustained" (instead of "tonic"), and all statistical formats and units should be standardized.

      Following the reviewer’s suggestions, we will rephrase the text throughout to unify terminology and data presentation and to correct errors.

      (6) The Discussion should be expanded to address potential Ca²⁺-dependent signaling mechanisms downstream of PKD2L1 activation and their possible roles in CSF flow regulation and central chemoreception.

      This is indeed a very interesting and currently unresolved point in the physiology of CSFcNs. Published data indicate that calcium influx through PKD2L1 channels is a key regulator of apical process (ApPr) physiology. These channels are calcium permeable yet are also inhibited by intracellular calcium (DeCaen et al. 2016). Additionally, ultrastructural data show that the ApPr is rich in mitochondria and tubulo-vesicular structures resembling the Golgi apparatus (Bruni and Reddy 1987; Bjugn et al. 1988; Nakamura et al. 2023), intracellular organelles critical for calcium homeostasis. Altogether, this evidence suggests that intra-ApPr calcium concentration must be finely regulated, both in space and time, for the ApPr to fulfill its physiological roles. Based on the existing literature, we can speculate that these calcium signals are decoded by several systems: (i) calcium may act as a second messenger, linking the activation of the multimodal PKD2L1 channels to changes in CSFcN excitability, which in turn regulates spinal neuronal networks controlling locomotor activity; (ii) calcium could initiate the neurosecretion of various molecules from the ApPr into the central canal, as proposed by the Wyart group in the zebrafish in the context of bacterial infections (Prendergast et al. 2023); (iii) calcium could activate the Hedgehog signaling pathway (as has been shown by Delling et al. 2013); (iv) calcium could modulate CSF flow by modulating ependymal cell ciliary activity. Resolving these downstream pathways is essential to fully define the role of CSFcNs as integrators of cerebrospinal fluid homeostasis. We will expand on this topic in the Discussion section of the revised manuscript.

      Reviewer #2 (Public review):

      Summary:

      Cerebrospinal fluid contacting neurons (CSF-cNs) are GABAergic cells surrounding the spinal cord central canal (CC). In mammals, their soma lies sub-ependymally, with a dendritic-like apical extension (AP) terminating as a bulb inside the CC.

      How this anatomy (soma and AP in distinct extracellular environments) relates to their multimodal CSF-sensing function remains unclear.

      The authors confirm that in GATA3:GFP mice, where these cells are labeled, CSFcNs exhibit prominent spontaneous electrical activity mediated by PKD2L1 (TRPP2) channels, non-selective cation channels with ~200 pS conductance modulated by protons and mechanical forces.

      They investigated PKD2L1 pH sensitivity and its effects on CSFcN excitability. They uncovered that PKD2L1 generates both phasic and tonic currents, bidirectionally modulated by pH with high sensitivity near physiological values.

      Combining electrophysiology (intact and isolated AP recordings) with elegant laser-photolysis, they show that functional PKD2L1 channels localize specifically to the apical extension (AP).

      This spatial segregation, coupled with PKD2L1's biophysical properties (high conductance, pH sensitivity) and the AP's unique features (very high input resistance), renders CSFcN excitability highly sensitive to PKD2L1 modulation. Their findings reveal how the AP's properties are optimised for its sensory role.

      Strengths:

      This is a very convincing demonstration using elegant and challenging approaches (uncaging, outside out patch of the AP) together to form a complete understanding of how these sensory cells can detect the changes of pH in the CSF so finely.

      Weaknesses:

      The following do not constitute weaknesses; rather, they are minor requests that this reviewer considers would complete this beautiful study.

      (1) It would be nice to quantify further the relation in spontaneous as well as in acidic or basic pH between the effects observed on channel opening and holding current: do they always vary together and in a linear way?

      Following the reviewer’s suggestion, we performed a Spearman’s rank correlation test. The analysis revealed a significant correlation between the changes in the apparent open probability and the holding current in paired experiments (control vs pH 6.4 pressure applications; p < 0.05, Spearman r = 0.72 and critical value = 0.67). The Pearson correlation coefficient calculated on the same data set was r = 0.63 (critical value = 0.632), indicating that the correlation is not linear. We thank the reviewer for raising this point and will add this analysis to the manuscript.
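
      The distinction drawn here between a rank (Spearman) and a linear (Pearson) correlation can be illustrated with a minimal sketch in plain Python. The helper names and the data set are hypothetical and are not the authors' measurements; the example only shows that a relation that is monotonic but not linear gives a Spearman coefficient of 1 while the Pearson coefficient stays below 1.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical paired changes, for illustration only: the second variable
# grows monotonically but non-linearly with the first.
delta_apparent_po = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.70, 0.90]
delta_holding_current = [math.exp(5 * p) for p in delta_apparent_po]

rho = spearman_r(delta_apparent_po, delta_holding_current)
r = pearson_r(delta_apparent_po, delta_holding_current)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```

      Whether either coefficient is significant still depends on the sample size and the chosen critical value, which this sketch does not compute.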

      (2) Since CSF-cNs also respond to changes in osmolarity (Orts Dell Immagine 2013) & mechanosensory stimulations in a PKD2L1 dependent manner (Sternberg NC 2018), it would be nice to test whether the same results hold true for the role of PKD2L1 in the AP when changes in osmolarity are applied by pressure.

      This is a very important point. As the reviewer notes, previous experimental evidence indicates that CSFcNs are also sensitive to osmolarity changes and mechanical stimulation in a PKD2L1-dependent manner. It is therefore reasonable to assume that, similar to pH sensitivity, osmotic and mechanical sensitivity depend on channels localized to the apical process (ApPr). Regarding mechanosensitivity, this spatial segregation could be tested by mechanically stimulating either the ApPr or the soma with a piezo-controlled blunt pipette (see, for example, Hao et al. 2013). Assessing sensitivity to osmotic changes, however, is more challenging, as pressure application lacks the spatial resolution to discriminate between compartments in such a compact cell. In theory, a highly localized osmotic jump could be achieved via photolysis, provided a caged compound that releases many osmotic particles simultaneously is used. In typical photolysis experiments, a localized osmotic change is produced, but its amplitude is very low (on the order of 1 to 2 mOsm).

      In mice, like in fish (Sternberg et al, NC 2018), we can observe throughout the figures that a large fraction of the channel activity occurs with partial and very fast openings of the PKD2L1 channel. I recommend the authors analyse the points below:

      (a) To what extent do these partial openings of the channel contribute to the changes in holding current and resting potential?

      As the reviewer indicates, these partial and rapid openings are characteristic of PKD2L1 single-channel activity and appear to be conserved across species. However, estimating their precise contribution to the sustained current would require a detailed channel model, which is currently lacking. Indeed, the exact mechanism underlying this prominent sustained current in CSFcNs remains unknown and should definitely be addressed in future work.

      (b) In the trace from the outside out AP, it looks like the partial transient openings are gone. Can the authors verify whether these partial openings are only present in somatic recordings?

      The outside-out recordings from the apical process also show some partial openings (see the upper trace in Figure 4Db). We will specifically mention this important point in the revised version of the ms. 

      (3) Previous studies have observed expression of metabotropic Glutamate receptors in CSF-cNs (transcriptome from Prendergast et al CB 2023). The authors only used blockers for ionotropic glutamate receptors in their recordings: could it be that these metabotropic receptors influence the response to uncaging of MNI-Glu when glutamate is co-released with a proton?

      We thank the reviewer for pointing out the presence of metabotropic glutamate receptors in CSFcNs. However, our evidence indicates that metabotropic receptors do not contribute to the response when uncaging MNI-glutamate. This conclusion is supported by two observations: (i) the response obtained when uncaging MNI-γLGG, which does not release glutamate (Figure 5Ab), and (ii) the response obtained when uncaging protons from DPNI-GABA (data not shown) (DPNI-GABA is a GABA cage with photochemistry similar to MNI cages that also releases a proton upon photolysis; Trigo et al. 2009), are the same. In both experiments (uncaging MNI-γLGG or DPNI-GABA) a clear photolysis-evoked PKD2L1 current is observed.

      (4) In the outside out patch of the AP, PKD2L1 unitary currents appear rare. Could it be that the disruption in the cilium or underlying actin/myosin cytoskeleton drastically alter the open probability of the channel?

      The reviewer is correct in noting that the opening frequency of PKD2L1 channels appears lower in outside-out patches than in whole-ApPr recordings, although we have not quantified this. We interpreted this difference as reflecting a lower channel number. However, as the reviewer suggests, a plausible alternative explanation is that the channel's biophysical properties are altered when removed from its native ionic environment or when it loses interactions with regulatory proteins. We will address this point in the Discussion.

      (5) Could the authors use drugs against ASIC to specify which ASIC channels contribute to the pH response in the soma?

      As described in the manuscript, we performed experiments with ASIC antagonists, although we did not attempt to characterize the specific ASIC subtype mediating the somatic response. Based on the published literature, we used both psalmotoxin-1, which blocks ASIC1 channels, and APETx2, which blocks ASIC3 channels. The presence of ASIC1 in mouse CSFcNs has been demonstrated previously (Orts-Del’immagine et al. 2012; Orts-Del’Immagine et al. 2016), while ASIC3 has been identified in lamprey CSFcNs (Jalalvand et al. 2016). When applying an acidic solution to the soma, we recorded an inward current that was substantially blocked by psalmotoxin-1, although a small residual component persisted, consistent with the earlier findings of Orts-Del’Immagine et al. We did not attempt to block this remaining psalmotoxin-1-insensitive component.

      (6) This is out of the scope of this study, but we did observe in fish a very rarely-opening channel in the PKD2L1KO mutant. I wonder if the authors have similar observations in the conditions where PKD2L1 is mainly in the closed state.

      We have never seen this kind of opening in our recordings (when the channel is closed or in the presence of dibucaine).

      References

      Bjugn, R, H K Haugland, et P R Flood. 1988. “Ultrastructure of the mouse spinal cord ependyma”. Journal of Anatomy 160 (octobre): 117‑25.

      Bruni, J. E., et K. Reddy. 1987. “Ependyma of the Central Canal of the Rat Spinal Cord: A Light and Transmission Electron Microscopic Study”. Journal of Anatomy 152 (juin): 55‑70.

      Delling, Markus, Paul G. DeCaen, Julia F. Doerner, Sebastien Febvay, et David E. Clapham. 2013. ”Primary cilia are specialized calcium signaling organelles”. Nature 504 (7479): 311‑14 https://doi.org/10.1038/nature12833.

      Hao, Jizhe, Jérôme Ruel, Bertrand Coste, Yann Roudaut, Marcel Crest, et Patrick Delmas. 2013. “Piezo-Electrically Driven Mechanical Stimulation of Sensory Neurons”. In Ion Channels, édité par Nikita Gamper, vol. 998. Methods in Molecular Biology. Humana Press. https://doi.org/10.1007/978-1-62703-351-0_12.

      Jalalvand, Elham, Brita Robertson, Hervé Tostivint, Peter Wallén, and Sten Grillner. 2016. “The Spinal Cord Has an Intrinsic System for the Control of pH”. Current Biology: CB 26 (10): 1346‑51. https://doi.org/10.1016/j.cub.2016.03.048.

      Nakamura, Yuka, Miyuki Kurabe, Mami Matsumoto, et al. 2023. “Cerebrospinal Fluid-Contacting Neuron Tracing Reveals Structural and Functional Connectivity for Locomotion in the Mouse Spinal Cord”. eLife 12 (February): e83108. https://doi.org/10.7554/eLife.83108.

      Orts-Del’Immagine, Adeline, Riad Seddik, Fabien Tell, et al. 2016. “A Single Polycystic Kidney Disease 2-like 1 Channel Opening Acts as a Spike Generator in Cerebrospinal Fluid-Contacting Neurons of Adult Mouse Brainstem”. Neuropharmacology 101 (February): 549‑65. https://doi.org/10.1016/j.neuropharm.2015.07.030.

      Orts-Del’Immagine, Adeline, Nicolas Wanaverbecq, Catherine Tardivel, Vanessa Tillement, Michel Dallaporta, and Jérôme Trouslard. 2012. “Properties of Subependymal Cerebrospinal Fluid Contacting Neurones in the Dorsal Vagal Complex of the Mouse Brainstem”. The Journal of Physiology 590 (16): 3719‑41. https://doi.org/10.1113/jphysiol.2012.227959.

      Prendergast, Andrew E., Kin Ki Jim, Hugo Marnas, et al. 2023. “CSF-Contacting Neurons Respond to Streptococcus Pneumoniae and Promote Host Survival during Central Nervous System Infection”. Current Biology 33 (5): 940-956.e10. https://doi.org/10.1016/j.cub.2023.01.039.

      Trigo, Federico F., George Papageorgiou, John E. T. Corrie, and David Ogden. 2009. “Laser photolysis of DPNI-GABA, a tool for investigating the properties and distribution of GABA receptors and for silencing neurons in situ”. Journal of Neuroscience Methods 181 (2): 159‑69. https://doi.org/10.1016/j.jneumeth.2009.04.022.

    1. eLife Assessment

      This study presents important findings on how cardiac regenerative capacity diverges across species by examining heart repair in two species of livebearers, platyfish and swordtails. In contrast to zebrafish, the livebearer species show persistent scarring after cryo-injury, and the work highlights how lineage-specific anatomical and immunological traits may constrain regenerative competence. The study is compelling, the data are convincing, and the results contribute to our understanding of the mechanisms underlying heart regeneration across vertebrates.

    2. Reviewer #1 (Public review):

      Summary:

      How the regenerative capacity of the heart varies among different species has been a long-standing question. Within teleosts, zebrafish can regenerate their hearts, while medaka and cavefish cannot. The authors examined heart regeneration in two livebearers, platyfish and swordtails. Interestingly, they found that these two fish species lack the compact myocardium layer that contains coronary vessels. Furthermore, these fish form a "pseudoaneurysm" after cryoinjury without initial deposition of fibrotic tissues. However, delayed leukocyte infiltration and prolonged inflammation lead to permanent scar tissue in the injured heart. Although their cardiomyocytes can also proliferate, platyfish and swordtails can only regenerate partially. The authors argue that the restorative mechanism of platyfish and swordtails likely reflects "evolutionary innovations in the ventricle type and the immune system".

      Strengths:

      The authors took advantage of the annotated genome of platyfish to perform transcriptomic analyses. The histological analyses and immunostaining are beautifully done.

      Minor Weaknesses:

      Transcriptomic analysis was only done for one time point. Different time points could be included to validate whether some processes occur at different time points. But this can be done in the future for more detailed studies.

    3. Reviewer #2 (Public review):

      This manuscript by Hisler, Rees, and colleagues examines the cardiac regenerative ability of two livebearer species, the platyfish and swordtail. Unlike zebrafish, these species lack cortical myocardium and coronary vasculature. Cryoinjury to their hearts caused persistent scarring at 60 and 90 days post-injury and prevented most of the myocardium from regenerating. Although the wound size progressively shrinks and fibronectin content decreases, the myocardial wall does not recover. Transcriptomic profiling at 7 dpi revealed significant differences between zebrafish and platyfish, including alterations in ECM deposition, immune regulation, and signaling pathways involved in regeneration, such as TGFβ, mTOR, and Erbb2. Platyfish exhibit a delayed but chronic immune response, and although some cardiomyocyte proliferation is observed, it does not appear to contribute to myocardial recovery significantly.

      Overall, this is an excellent manuscript that tackles a crucial question: do different fish lineages have the ability to regenerate hearts, or is this capability limited to a few groups? Therefore, this work is relevant to the fields of cardiac regeneration and comparative regenerative biology for a broad audience. I am very enthusiastic about expanding the list of species tested for their heart regeneration abilities, and this study is detailed and rigorous, providing a solid foundation for future comparative research. However, there are several aspects where additional work could significantly strengthen the manuscript.

      Major comments

      (1) Title selection

      The title the authors chose suggests that platyfish and swordtails "partially regenerate," but I do wonder how much these animals truly regenerate. This may be a semantic discussion and a matter of personal preference. Still, based on other significant work on regenerative capacity (see, for example, the landmark cavefish regeneration paper PMID: 30462998 or work on medaka PMID: 24947076), the persistence of such a prominent fibrotic scar would be considered a minimal regenerative capacity. Measuring this "partial regeneration" more precisely by comparing zebrafish with platyfish and swordtails would also greatly strengthen the comparisons made here - see below.

      The same can be said about lines 152-153 - do these hearts "regenerate" with deformation and partial scarring, or would it be fairer to say that they are "healed" or "repaired" through a process that involves fibrosis?

      (2) Cross-species comparisons

      Having two species of livebearers strengthens the findings of this paper, but the presentation of results from both species is inconsistent. For example, the reader should not be asked to assume that the architecture of the swordtail ventricle is similar to that of the platyfish (line 125). The same applies to the presence or absence of coronary vessels (Figure 1), the reduction in wound area over time (Figure 3), and the immune system's response (Figure 5). Most importantly, the authors miss an opportunity to move from qualitative observations to quantifying the "partial regeneration" phenotype they observe. Specifically, providing a side-by-side comparison between these new species and zebrafish would help define the extent of differences in regeneration potential. For instance, in Figure 6, while the authors provide excellent quantification of PCNA staining in platyfish, these data are less meaningful without a direct comparison with zebrafish results. The same applies to Figures 6E and 6F - although differences are noted, quantifying these results would enable a more rigorous assessment of the process.

      (3) Lack of coronary vasculature

      There is a growing body of evidence highlighting the importance of the coronary vessels during zebrafish heart regeneration (PMIDs: 27647901, 31743664). Surprisingly, this finding has not been integrated or discussed in the context of this literature.

      The results of the alkaline phosphatase assay and anti-podocalyxin-2 staining appear inconsistent. Specifically, in Supplementary Figure 1L-M, we can see some vessels covering the bulbus arteriosus and also what appears to be a signal in the ventricle. However, in Figures 1K and 1L, we cannot see any vessels, even in the bulbus. The authors should also be more rigorous and add a description of how many animals were analyzed, along with their ages and sizes. In zebrafish, the formation of the coronary arteries appears to depend on animal size and age. With the data provided, we cannot say whether this is a one-time observation or a consistent finding across many animals at different ages and across both species.

      The link between livebearers' responses and pseudoaneurysms is overstated. This work is already extremely relevant without trying to make it medically oriented.

    4. Author response:

      Reviewer #1:

      Minor Weaknesses:

      "Transcriptomic analysis was only done for one time point. Different time points could be included to validate whether some processes occur at different time points. But this can be done in the future for more detailed studies."

      Our response regarding time points of transcriptomic analysis:

      We appreciate this constructive suggestion. We fully agree that performing RNA-seq at multiple time points would provide valuable insights into the temporal dynamics of molecular pathways during cardiac regeneration. However, given that our study represents the first comprehensive characterization of cardiac regeneration in poeciliids, we deliberately focused our resources on establishing the foundational framework, including morphological, cellular, and initial transcriptomic analyses between zebrafish and platyfish. Expanding to multiple time points would constitute a substantial additional study that, while scientifically valuable, would extend beyond the scope of this initial characterization.

      We will acknowledge this limitation in the Discussion and indicate that temporal transcriptomic profiling is an important direction for future investigation.

      Reviewer #2:

      (1) Title selection

      Our response regarding the use of the term “partially regenerate” in the title and results:

      We thank Reviewer 2 for this important point regarding the terminology used to describe the cardiac response in platyfish and swordtails. We agree that the term "partially regenerate" may overstate the regenerative capacity of these species, particularly given the persistence of a substantial collagenous scar at the injury site. The reviewer is correct that, based on established criteria in the field, including the landmark studies on cavefish (PMID: 30462998) and medaka (PMID: 24947076), the presence of such prominent fibrotic scarring would be more appropriately characterized as limited or minimal regenerative capacity rather than partial regeneration.

      While we observe a significant reduction in wound volume at 30 dpci and some degree of tissue remodeling, we acknowledge that the persistent scarring and incomplete myocardial recovery more accurately reflect a healing or repair process rather than true regeneration. We therefore agree with the reviewer's suggestion to revise our terminology throughout the manuscript.

      We will revise the title to: "The livebearers platyfish and swordtails heal their hearts with persistent scarring." We will also modify other relevant sections of the Results and Discussion to consistently describe these processes as "healing" or "repair" rather than "regeneration", while still acknowledging the biological changes that do occur (wound contraction, remodeling, limited cardiomyocyte proliferation). This revised framing better aligns our work with the established terminology in the comparative cardiac regeneration literature and more accurately represents the phenotype we observe.

      We believe this change will strengthen the manuscript by providing a more precise characterization of the cardiac response in these species and facilitating clearer comparisons with other model systems.

      (2) Cross-species comparisons

      Our response regarding the inconsistent presentation of results for different species:

      We thank the reviewer for recognizing that our conclusions regarding the regenerative capacity of livebearers are strengthened by including two poeciliid species, platyfish and swordtails. We agree that presenting results more consistently across both species will significantly improve the manuscript. We acknowledge that our current presentation creates a burden on the reader by asking them to assume similarities between species without providing supporting data. While we initially focused primarily on platyfish due to its superior genome annotation (critical for our transcriptomic analyses), we recognize that this approach left important gaps in the manuscript.

      We will address this by generating comprehensive supplementary figures that present swordtail data alongside platyfish for key findings. Specifically, we will add a complete anatomical characterization of swordtail ventricle architecture, demonstrating the structural similarities to platyfish that underpin our comparative conclusions. We will also perform quantification of wound area reduction and immune response dynamics over time in swordtails, allowing direct comparison between species.

      We clarify that we did perform detailed analyses of swordtail heart anatomy during our initial studies, which revealed remarkable similarity to platyfish. However, space constraints in Figures 1 and S1 (which already span full pages with zebrafish-platyfish comparisons) prevented us from including these data in the original submission. We now recognize that explicitly presenting these data is essential for the reader to evaluate our conclusions.

      Our response regarding quantification and comparison with zebrafish: 

      We appreciate the reviewer's suggestion to move beyond qualitative observations toward rigorous quantification of the "partial regeneration" phenotype. As suggested by the reviewer for the PCNA analysis, we will provide direct quantitative comparisons with published zebrafish regeneration studies, including data from several relevant studies and our own lab's work. This comparison will delineate the extent of differences in proliferative response between complete regenerators (zebrafish) and limited regenerators (poeciliids).

      These additions will transform our descriptive observations into quantitative assessments that rigorously define the incomplete healing phenotype in poeciliids relative to complete regeneration in zebrafish. We believe these changes will substantially strengthen the manuscript and address the reviewer's concerns about comparative rigor.

      (3) Lack of coronary vasculature

      Our response regarding inconsistencies in vascularization data:

      We thank the reviewer for their comment regarding our data on the absence of coronary vasculature in the platyfish heart. The reviewer noted differences between alkaline phosphatase (AP) enzymatic staining and anti-Podocalyxin-2 immunofluorescence staining. We would like to clarify that these observed differences are not inconsistencies but rather reflect the distinct specificities of these two complementary approaches.

      Alkaline phosphatase staining is selective for arterial branches and capillaries in the heart (PMID: 13982613; PMID: 9477306; PMID: 8245430; PMID: 3562789; PMID: 29023576; PMID: 28632131) and revealed a typical vascular pattern in the bulbus arteriosus and ventricle in zebrafish but not in platyfish. Anti-Podocalyxin-2 staining displayed a vessel-like pattern in zebrafish but not in platyfish. However, in both species Podocalyxin staining also labeled other types of non-vascular structures. This is expected given that Podocalyxin is a cell surface sialomucin with broader expression beyond blood vessels, including the endocardium (PMID: 19142011) and certain neuronal populations, in addition to other non-cardiac tissue types (PMID: 19578008; PMID: 3511072; PMID: 34201212).

      We will revise the manuscript to emphasize this distinction and clarify our rationale: we deliberately employed Podocalyxin-2 staining as a complementary, less selective approach to corroborate our alkaline phosphatase findings. In platyfish, the convergent evidence from both methods (the absence of typical vascular structures with a selective AP staining and the detection of only non-vascular patterns with the broader marker Podocalyxin-2) strengthens our conclusion that platyfish hearts lack a conventional coronary vascular network.

      Our response regarding reproducibility:

      The assays were performed independently by two researchers at different stages of the study using two different batches of adult platyfish. The results were consistent in both assays, and we are therefore confident in the reproducibility of our findings.

      Our response regarding citations of references on revascularization:

      We thank the reviewer for recommending the studies PMID: 27647901 and PMID: 31743664 that revealed the importance of rapid revascularization during heart regeneration in zebrafish. We will be pleased to integrate these works to present our data in the appropriate context of current knowledge.

      Our response regarding a link to pseudoaneurysms:

      We appreciate the reviewer's feedback regarding the link to pseudoaneurysm. We agree that the primary contributions of our work stand on their own merit, and we will revise the text to present the livebearer findings more cautiously without overstating their potential medical relevance. We will focus on the intrinsic biological significance of our findings.

    1. eLife Assessment

      In this work, the authors intend to assess the existence of a redox potential across germline stem cells and neighbouring somatic stem cells in the Drosophila testis. Some aspects of the manuscript are convincing, like the clear effect of SOD KD on cyst cell differentiation state. Other conclusions of the work, such as the non-autonomous effect of this KD on germ cells, are not sufficiently supported by the data. This remains true even with the revised version of the paper; because the effect of the redox state of the soma on the germline is a major point of the paper, this remains a critical flaw. The work could be potentially useful if the critiques of the reviewers were fully addressed; the strength of the evidence of the manuscript as it stands is still inadequate. Readers should use their own judgment about the validity and meaningfulness of different findings.

    2. Reviewer #1 (Public review):

      Mitochondrial staining difference is convincing, but the status of the mitos, fused vs fragmented, elongated vs spherical, does not seem convincing. Given the density of mito staining in CySC, it is difficult to tell what is an elongated or fused mito vs the overlap of several smaller mitos.

      I'm afraid the quantification and conclusions about the gstD1 staining in CySC vs. GSCs are just not convincing; I cannot see how they were able to distinguish the relevant signals to quantify one cell type vs the other.

      The overall increase in gstD1 staining with the CySC SOD KD looks nice, but again I can't distinguish different cell types. This experiment would have been more convincing if the SOD KD was mosaic, so that individual samples would show changes in only some of the cells. Still, it seems that KD of SOD in the CySC does have an effect on the germline, which is interesting.

      The effect of SOD KD on the number of less differentiated somatic cells seems clear. However, the effect on the germline is less clear and is somewhat confusing. Normally, a tumor of CySC or less differentiated Cyst cells, such as with activated JAK/STAT, also leads to a large increase in undifferentiated germ cells, not a decrease in germline as they conclude they observe here. The images do not appear to show a reduced number of GSCs, but if they counted GSCs at the niche, then that is the correct way to do it, but it's odd that they chose images that do not show the phenotype. In addition, a lower number of GSCs could also be caused by "too many CySCs", which can kick out GSCs from the niche, rather than any effect on GSC redox state. Further, their conclusion of reduced germline overall, e.g. by vasa staining, does not appear to be true in the images they present, and their indication that lower vasa equals fewer GSCs is invalid since all the early germline expresses Vasa.

      The effect of somatic SOD KD is perhaps most striking in the observation of Eya+ cyst cells closer to the niche. The combination of increased Zfh1+ cells with many also being Eya+ demonstrates a strong effect on cyst cell differentiation, but one that is also confusing because they observe increases in both early cyst cells (Zfh1+) as well as late cyst cells (Eya+) or perhaps just an increase in the Zfh1/Eya double-positive state that is not normally common. The effects on the RTK and Hh pathways may also reflect this disturbed state of the Cyst cells.

      However, the effect on germline differentiation is less clear; the images shown do not really demonstrate any change in BAM expression that I can tell, which is even more confusing given the clear effect on cyst cell differentiation.

      For the last figure, any effect of SOD OE in the germline on the germline itself is apparently very subtle and is within the range observed between different "wt" genetic backgrounds.

      Comments on revisions:

      Upon re-re-review, the manuscript is improved but retains many of the flaws outlined in the first reviews.

    3. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public review)  

      Mitochondrial staining difference is convincing, but the status of the mitochondria, fused vs fragmented, elongated vs spherical, does not seem convincing. Given the density of mito staining in CySC, it is difficult to tell what is an elongated or fused mito vs the overlap of several smaller mitos.

      To address this, we have now removed the statements regarding the differences in the shape of mitochondria among the stem cell populations. We have limited our statements to noting that the CySCs are more mitochondria-dense than the neighbouring GSCs.

      The quantification and conclusions about the gstD1 staining in CySC vs. GSCs are just not convincing; I cannot see how they were able to distinguish the relevant signals to quantify one cell type vs the other.

      We appreciate the reviewer’s concern. To address this, we have included new images along with z-stack reconstructions (Fig 1G-P and S1C-D’’’), which now provide a clearer distinction of gstD1 staining between CySCs and GSCs and improve the accuracy of quantification. The intensity of gstD1 staining overlapping with the Vasa+ zone was quantified as the ROS level for GSCs. Similarly, the gstD1 signal in the cytoplasmic area bounded by Dlg and containing Tj+ nuclei was quantified as the ROS level for CySCs.

      Images do not appear to show a reduced number of GSCs, but if they counted GSCs at the niche, then that is the correct way to do it, but it's odd that they chose images that do not show the phenotype. Further, their conclusion of reduced germline overall, e.g. by vasa staining, does not appear to be true in the images they present, and their indication that lower vasa equals fewer GSCs is invalid since all the early germline expresses Vasa.

      We have replaced the figure with images where the GSC rosette is clearly visible, ensuring that the counted GSCs at the niche accurately reflect the phenotype (Fig. 2 C’’, D’’). We agree that Vasa is expressed in all early germline cells. The overall reduced Vasa signal intensity in our western blot analysis for Sod1RNAi reflects a general reduction in the germline population, not just the GSCs. We have modified our statements in the Results appropriately.  

      However, the effect on germline differentiation is less clear; the images shown do not really demonstrate any change in BAM expression that I can tell, which is even more confusing given the clear effect on cyst cell differentiation.

      We appreciate the reviewer’s observation. To clarify this point, we have now included z-stack projection images of Bam expression in the revised version (Fig 3E’’-F’’).

      These images more clearly demonstrate the difference in Bam expression, thereby highlighting the effect on germline differentiation. Moreover, Bam-expressing cells are present closer to the hub in the Sod1RNAi condition, indicating early differentiation.

      For the last figure, any effect of SOD OE in the germline on the germline itself is apparently very subtle and is within the range observed between different "wt" genetic backgrounds.

      We acknowledge that the effect of SOD overexpression on the germline is not very significant. The germline cells already carry a modest ROS load, and it is well established that they possess robust antioxidant defence machinery to protect the genome. Therefore, elevating the levels of antioxidant enzymes such as Sod1 does not translate into a major change, and the effects observed are generally subtle.

      Reviewer #3 (Public review)  

      In Fig. 1N (tj-SODi), one can see that all of gst-GFP resides within the differentiating somatic cells and none is in the germ cells. Furthermore, the information provided in the materials and methods about quantification of gst-GFP is not sufficient. Focusing on Dlg staining is not sufficient. They need to quantify the overlap of Vasa (a cytoplasmic protein in GSCs) with GFP.

      In our analysis, we did quantify the GFP intensity in the area of overlap between gstD1-GFP and the Vasa-positive zone in the germ cells that are in direct contact with the hub, in order to accurately quantify the ROS reporter signal within the germline compartment. Further, to ensure accurate cell boundary demarcation, we used Dlg staining as an additional parameter. While only Dlg staining was included in the figure panels for clarity of visualization, the actual quantification was performed by considering both Vasa (for germ cell cytoplasm) and Dlg (for cellular boundaries). This has been clarified in the Materials and Methods.

      Additionally, since Tj-gal4 is active in hub cells, it is not clear whether the effects of SOD depletion also arise from perturbation of niche cells.

      We acknowledge that Tj-Gal4 also shows minimal activity in hub cells. To address this, we tested C587-Gal4 and observed similar effects on niche architecture, though weaker than with Tj-Gal4, underscoring the effect of ROS originating from CySCs.

      First, the authors are studying a developmental effect, rather than an adult phenotype. Second, the characterization of the somatic lineage is incomplete. It appears that high ROS in the somatic lineage autonomously decreases MAP kinase signaling and increases Hh signaling. They assume that the MAPK signaling is due to changes in Egfr activity but there are other tyrosine kinases active in CySCs, including PVR/VEGFR (PMID: 36400422), that impinge on MAPK. In any event, their results are puzzling because lower Egfr should reduce CySC self-renewal and CySC number (Amoyel, 2016) and the ability of cyst cells to encapsulate gonialblasts (Lenhart Dev Cell 2015). The increased Hh should increase CySC number and the ability of CySCs to outcompete GSCs. The fact that the average total number of GSCs declines in tj>SODi testes suggests that high ROS CySCs are indeed outcompeting GSCs. However, as I wrote in my first critique, the characterization of the high ROS soma is incomplete. And the role of high ROS in the hub cells is acknowledged but not investigated.

      We acknowledge the reviewer’s concern that our study primarily examines a developmental effect. Our rationale was that redox imbalance during early stages can set long-term trajectories for stem cell behavior and niche organization, which ultimately manifest in adult testes.

      We agree that evaluating Erk levels alone may not reflect the actual status of EGFR signalling, and that the combination of low Erk and high CySC self-renewal appears contradictory. We believe that this ROS-mediated change in Erk status, resulting in high CySC proliferation, might be the outcome of an interplay among RTKs other than EGFR. While the expansion of CySCs is primarily governed by Hh, a detailed dissection of these pathways under an altered redox environment would be interesting future work. Regarding GSC number, it cannot be definitively stated that high-ROS CySCs are outcompeting the GSCs, although that possibility exists in parallel. However, in the present case, the ROS levels of GSCs are clearly high under the high CySC-ROS condition. It is known that ROS imbalance in GSCs promotes their differentiation, which was also observed in the present study through Bam staining. Therefore, a redox-mediated reduction in GSC number cannot be completely ruled out. We have already discussed these points in the revised manuscript and suggest possible non-canonical effects of ROS on signal integration within CySCs that might reconcile these findings. Further, in the present study, we have focussed on the redox interplay between the two stem cell populations (GSC and CySC) of the niche. Hence, we have not covered the redox profiling of the hub in detail.

      The paragraph in the introduction (lines 62-76) mentions autonomous ROS levels in stem cells, not the transfer of ROS from one cell to another. And this paragraph is confusing because it starts with the (inaccurate) statement all stem cells have low ROS and then they discuss ISCs, which have high ROS.

      We have revised the paragraph for clarity. It now distinguishes between stem cell types with low versus relatively high ROS requirements (e.g., ISCs, HSCs, NSCs) and includes recent evidence of non-autonomous ROS signaling, such as paracrine ROS action from pericardial cells to cardiomyocytes and gap-junction–mediated ROS waves in cardiomyocyte monolayers. This resolves the ambiguity and presents a balanced view of autonomous and non-autonomous ROS regulation.

      While there has been an improvement in the scholarship of the testis, there are still places where the correct paper is not cited and issues with the text.

      All concerns regarding missing or incorrect citations and textual issues have now been carefully addressed and corrected. Relevant references have been added in the appropriate places to ensure accuracy.

      The authors are encouraged to more completely characterize the phenotype of high ROS in hub and CySCs.

      We have now included improved images showing the respective ROS profiles of GSCs, CySCs and the hub. As mentioned in the earlier response, this work focuses on the redox interplay between GSCs and CySCs; hence, we have not included any analysis of the hub. However, we agree with the reviewer that the hub's contributions should also be evaluated as a future direction.

    1. eLife Assessment

      This important work advances our understanding of the development of the visual system. The data presented are compelling and provide a detailed single-cell atlas of post-natal anterior chamber development in mice, highlighting the trabecular meshwork and Schlemm's canal.

    2. Reviewer #1 (Public review):

      Summary:

      This study presents a comprehensive single-cell atlas of mouse anterior segment development, focusing on the trabecular meshwork and Schlemm's canal. The authors profiled ~130,000 cells across seven postnatal stages, providing detailed and solid characterization of cell types, developmental trajectories, and molecular programs.

      Strengths:

      The manuscript is well-written, with a clear structure and thorough introduction of previous literature, providing a strong context for the study. The characterization of cell types is detailed and robust, supported by both established and novel marker genes as well as experimental validation. The developmental model proposed is intriguing and well supported by the evidence. The study will serve as a valuable reference for researchers investigating anterior segment developmental mechanisms. Additionally, the discussion effectively situates the findings within the broader field, emphasizing their significance and potential impact for developmental biologists studying the visual system.

      Weaknesses:

      The weaknesses of the study are minor and addressable. As the study focuses on the mouse anterior segment, a brief discussion of potential human relevance would strengthen the work by relating the findings to human anterior segment cell types, developmental mechanisms, and possible implications for human eye disease. Data availability is currently limited, which restricts immediate use by the community. Similarly, the analysis code is not yet accessible, limiting the ability to reproduce and validate the computational analyses presented in the study.

    3. Reviewer #2 (Public review):

      Summary:

      This study presents a detailed single-cell transcriptomic analysis of the postnatal development of mouse anterior chamber tissues. Analysis focused on the development of cells that comprise Schlemm's Canal (SC) and trabecular meshwork (TM).

      Strengths:

      This developmental atlas represents a valuable resource for the research community. The dataset is robust, consisting of ~130,000 cells collected across seven time points from early post-natal development to adulthood. Analyses reveal developmental dynamics of SC and TM populations and describe the developmental expression patterns of genes associated with glaucoma.

      Weaknesses:

      (1) Throughout the paper, the authors place significant weight on the spatial relationships of UMAP clusters, which can be misleading (see Chari and Pachter, PLoS Comput Biol 2023). This is perhaps most evident in the assessment of vascular progenitors (VP) into BEC and SEC types (Figures 4 and 5). In the text, VPs are described as a common progenitor for these types; however, the trajectory analysis in Figure 5 denotes a path of PEC -> BEC -> VP -> SEC. These two findings are incongruous and should be reconciled. The limitations of inferring relationships based on UMAP spatial positions should be noted.

      (2) Figure 2d does not include P60. It is also noted that technical variation resulted in fewer TM3 cells at P21; was this due to challenges in isolation? What is the expected proportion of TM3 cells at this stage?

      (3) In Figures 3a and b it is difficult to discern the morphological changes described in the text. Could features of the image be quantified or annotated to highlight morphological features?

      (4) Given the limited number of markers available to identify SC and TM populations during development, it would be useful to provide a table describing potential new markers identified in this study.

      (5) The paper introduces developmental glaucoma (DG), namely Axenfeld-Rieger syndrome and Peters Anomaly, but the expression analysis (Figure S20) does not annotate which genes are associated with DG.

    4. Author response:

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      This study presents a comprehensive single-cell atlas of mouse anterior segment development, focusing on the trabecular meshwork and Schlemm's canal. The authors profiled ~130,000 cells across seven postnatal stages, providing detailed and solid characterization of cell types, developmental trajectories, and molecular programs. 

      Strengths: 

      The manuscript is well-written, with a clear structure and thorough introduction of previous literature, providing a strong context for the study. The characterization of cell types is detailed and robust, supported by both established and novel marker genes as well as experimental validation. The developmental model proposed is intriguing and well supported by the evidence. The study will serve as a valuable reference for researchers investigating anterior segment developmental mechanisms. Additionally, the discussion effectively situates the findings within the broader field, emphasizing their significance and potential impact for developmental biologists studying the visual system. 

      Weaknesses: 

      The weaknesses of the study are minor and addressable. As the study focuses on the mouse anterior segment, a brief discussion of potential human relevance would strengthen the work by relating the findings to human anterior segment cell types, developmental mechanisms, and possible implications for human eye disease. Data availability is currently limited, which restricts immediate use by the community. Similarly, the analysis code is not yet accessible, limiting the ability to reproduce and validate the computational analyses presented in the study. 

      In the revised version, we will highlight the human relevance of our work in the Discussion section. Additionally, the data and code are now publicly available on the Single Cell Portal and GEO, and the accession numbers have been updated.

      Reviewer #2 (Public review): 

      Summary: 

      This study presents a detailed single-cell transcriptomic analysis of the postnatal development of mouse anterior chamber tissues. Analysis focused on the development of cells that comprise Schlemm's Canal (SC) and trabecular meshwork (TM). 

      Strengths: 

      This developmental atlas represents a valuable resource for the research community. The dataset is robust, consisting of ~130,000 cells collected across seven time points from early post-natal development to adulthood. Analyses reveal developmental dynamics of SC and TM populations and describe the developmental expression patterns of genes associated with glaucoma. 

      Weaknesses: 

      (1) Throughout the paper, the authors place significant weight on the spatial relationships of UMAP clusters, which can be misleading (see Chari and Pachter, PLoS Comput Biol 2023). This is perhaps most evident in the assessment of vascular progenitors (VP) into BEC and SEC types (Figures 4 and 5). In the text, VPs are described as a common progenitor for these types; however, the trajectory analysis in Figure 5 denotes a path of PEC -> BEC -> VP -> SEC. These two findings are incongruous and should be reconciled. The limitations of inferring relationships based on UMAP spatial positions should be noted.

      (2) Figure 2d does not include P60. It is also noted that technical variation resulted in fewer TM3 cells at P21; was this due to challenges in isolation? What is the expected proportion of TM3 cells at this stage? 

      (3) In Figures 3a and b it is difficult to discern the morphological changes described in the text. Could features of the image be quantified or annotated to highlight morphological features? 

      (4) Given the limited number of markers available to identify SC and TM populations during development, it would be useful to provide a table describing potential new markers identified in this study. 

      (5) The paper introduces developmental glaucoma (DG), namely Axenfeld-Rieger syndrome and Peters Anomaly, but the expression analysis (Figure S20) does not annotate which genes are associated with DG.

      (1) We agree that inferring biological relationships from the spatial arrangement of UMAP clusters has limitations and we will qualify our interpretation accordingly in the text. We will also add clarifying language to the trajectory analysis in Figure 5. The intended developmental trajectory is PEC → VP → BEC and SEC; however, the cluster labels in Figure 5 were applied incorrectly. Specifically, VP-BECs were mislabeled as BECs, which led to the confusion.

      (2) We recently published the P60 dataset separately (Tolman, Li, Balasubramanian et al., eLife 2025); these data consist of integrated single-nucleus multiome profiles that were subjected to in-depth analysis. Additionally, we found that integrating the P60 dataset with the developmental datasets obscured sub-clustering of mature cell types. In future manuscripts, we will pursue a more detailed analysis of TM development and perform time point–specific clustering, similar to the approach we used for endothelial cells (Figure 4e).

      Comparing proportions of cells at different ages, as the eye grows, needs to be done cautiously. Notwithstanding these limitations, the proportions of the TM1, TM2, and TM3 clusters are expected to be similar between P14 and P21, because the proportions at P14 are similar to those in the separately analyzed P60 data. Importantly, our dissection strategy changed with age: from P2 to P14, we removed approximately one-third of the cornea, whereas at P21 and P60 we removed most of the cornea to help maximize representation of limbal cells as the eyes grew. This change in dissection likely contributed to the reduced number of TM3 cells observed at P21. TM3 cells are enriched anteriorly (at least in adults) and so lie closer to the corneal cut during dissection of P21 eyes, which, despite being larger than eyes at younger ages, are still small and more delicate to dissect accurately than P60 eyes; these cells are therefore more likely to be lost. Additional details are provided in the Methods section.

      (3) For Figure 3a and b, we will work to add clarity by providing additional annotations and an additional illustration.

      (4) We will include a table listing potential new markers for developing SC and TM populations.

      (5) We will annotate the genes associated with DG in Figure S20.

    1. eLife Assessment

      This important study introduces a new biology-informed strategy for deep learning models aiming to predict mutational effects in antibody sequences. It provides solid evidence that separating selection from the nucleotide-level mutation process improves performance over the objectives of protein language models inspired by natural language processing. This paper should be of interest to computational immunologists, but also to the broader community interested in deep learning for biological sequence data and evolution.

    2. Reviewer #1 (Public review):

      Summary:

      Matsen et al. describe an approach for training an antibody language model that explicitly tries to remove effects of "neutral mutation" from the language model training task, e.g. learning the codon table, which they claim results in biased functional predictions. They do so by modeling empirical sequence-derived likelihoods through a combination of a "mutation" model and a "selection" model; the mutation model is a non-neural Thrifty model previously developed by the authors, and the selection model is a small Transformer that is trained via gradient descent. The sequence likelihoods themselves are obtained from analyzing parent-child relationships in natural SHM datasets. The authors validate their method on several standard benchmark datasets and demonstrate its favorable computational cost. They discuss how deep learning models explicitly designed to capture selection and not mutation, trained on parent-child pairs, could potentially apply to other domains such as viral evolution or protein evolution at large.

      Strengths:

      Overall, we think the idea behind this manuscript is really clever and shows promising empirical results. Two aspects of the study are conceptually interesting: the first is factorizing the training likelihood objective to learn properties that are not explained by simple neutral mutation rules, and the second is training not on self-supervised sequence statistics but on the differences between sequences along an antibody evolutionary trajectory. If this approach generalizes to other domains of life, it could offer a new paradigm for training sequence-to-fitness models that is less biased by phylogeny or other aspects of the underlying mutation process.

      Weaknesses:

      Some claims made in the paper are weakly or indirectly supported by the data. In particular, the claim that learning the codon table contributes to biased functional effect predictions may be true, but requires more justification. Additionally, the paper could benefit from additional benchmarking and comparison to enhanced versions of existing methods, such as AbLang plus a multi-hit correction. Further descriptions of model components and validation metrics could help make the manuscript more readable.

    3. Reviewer #2 (Public review):

      Summary:

      Endowing protein language models with the ability to predict the function of antibodies would open a world of translational possibilities. However, antibody language models have yet to achieve breakthrough success, which large language models have achieved for the understanding and generation of natural language. This paper elegantly demonstrates how training objectives imported from natural language applications lead antibody language models astray on function prediction tasks. Training models to predict masked amino acids teaches models to exploit biases of nucleotide-level mutational processes, rather than protein biophysics. Taking the underlying biology of antibody diversification and selection seriously allows for disentangling these processes through what the authors call deep amino acid selection models. These models extend previous work by the authors (Matsen MBE 2025) by providing predictions not only for the selection strength at individual sites, but also for individual amino acid substitutions. This represents a practically important advance.

      Strengths:

      The paper is based on a deep conceptual insight: the existence of a multitude of biological processes that affect antibody maturation trajectories. The figures and writing are very clear, which should help make the broader field aware of this important but sometimes overlooked insight. The paper adds to a growing literature proposing biology-informed tweaks for training protein language models, and should thus be of interest to a wide readership interested in the application of machine learning to protein sequence understanding and design.

      Weaknesses:

      Proponents of the state-of-the-art protein language models might counter the claims of the paper by appealing to the ability of fine-tuning to deconvolve selection and mutation-related signatures in their high-dimensional representation spaces. Leaving the exercise of assessing this claim entirely to future work somewhat diminishes the heft of the (otherwise good!) argument. In the context of predicting antibody binding affinity, the modeling strategy only allows prediction of mutations that improve affinity on average, but not those which improve binding to specific epitopes.

    4. Reviewer #3 (Public review):

      Summary:

      This work proposes DASM, a new transformer-based approach to learning the distribution of antibody sequences which outperforms current foundational models at the task of predicting mutation propensities under selected phenotypes, such as protein expression levels and target binding affinity. The key ingredient is the disentanglement, by construction, of selection-induced mutational effects and biases intrinsic to the somatic hypermutation process (which are embedded in a pre-trained model).

      Strengths:

      The approach is benchmarked on a variety of available datasets and for two different phenotypes (expression and binding affinity). The biologically informed logic for model construction implemented is compelling, and the advantage, in terms of mutational effects prediction, is clearly demonstrated via comparisons to state-of-the-art models.

      Weaknesses:

      The gain in interpretability is only mentioned but not really elaborated upon or leveraged for gaining insight. The following aspects could have been better documented: the hyperparameter search used to establish the optimal model, and the predictive performance of baseline approaches, to fully showcase the gain yielded by DASM.

    1. eLife Assessment

      This potentially valuable manuscript focuses on the phosphorylation of residue T495 as a mechanism to inactivate HSP70 and disrupt cell cycle progression in response to DNA damage. The evidence supporting this model is incomplete and would be strengthened by additional studies defining the extent of T495 phosphorylation induced by DNA damage, identifying the kinase responsible for phosphorylating T495 of HSP70, and further elucidation of the functional implications of T495 phosphorylation in human cells. This work will be of interest to scientists focused on topics including chaperone biology, proteostasis, cell cycle progression, and DNA damage.

    2. Reviewer #1 (Public review):

      This manuscript proposes that phosphorylation of a conserved Hsp70 residue (human T495 / yeast Ssa1 T492) is a BER-triggered, DDR-dependent phospho-switch that acts as a conserved brake on G1/S cell-cycle progression in response to DNA damage.

      Although the topic is interesting and potentially useful, the strength of evidence of the mechanistic and "conserved checkpoint" claims that this site is directly activated by DNA damage is inadequate and fundamentally incorrect. The work requires extensive additional experimentation and substantial tempering of conclusions.

      Specific comments:

      (1) Activation of T495:

      (a) The authors' premise for the site being activated by DNA damage is Albuquerque et al, where PTMs on MMS-treated yeast are analyzed. T492 (the yeast equivalent of human T495) is observed as phosphorylated. However, the authors fail to note that there is no untreated sample analysis in this study, and it is likely that T492 phosphorylation is also present in untreated cells. This is also backed up by later evidence from the same lab (Smolka et al), where they do not identify T492 as being dependent on Mec1/Tel1/Rad53 kinases.

      (b) The kinase(s) directly responsible for T495 phosphorylation are not identified. Instead, the authors show that knockdown or pharmacological inhibition of DNA-PKcs, ATM, Chk2, and CK1 attenuate pHsp70.

      (c) ATM siRNA knockdown has no effect, while ATM inhibitors do, which the authors acknowledge but do not resolve. This discrepancy raises concerns about off-target drug effects.

      (d) No in vitro kinase assays, motif analysis, or phosphosite mapping confirming these kinases as direct T495 kinases are presented. Thus, the proposed signaling cascade remains speculative.

      (e) Smolka and many other labs characterized DDR sites as SQ/TQ motifs, and T492 doesn't fit that motif.

      (f) No genetic tests in yeast (e.g., BER mutants) are used to connect Ssa1 T492 phosphorylation to BER in that system, despite the strong BER-centric model.

      (g) Overexpression of MPG gives only a modest increase in pHsp70, while APE1 overexpression has no effect, and Polβ overexpression does not decrease pHsp70. These mixed results weaken the central claim that Hsp70 phosphorylation is a tuned sensor of BER burden.

      (h) A major concern is that pHsp70 is only convincingly detected after very high, prolonged MMS (10 mM, 5 h) or 0.5 mM arsenite treatments. Other DNA-damaging agents (bleomycin, camptothecin, hydroxyurea) that robustly activate DDR kinases do not induce pHsp70. This suggests to me that the authors are observing a side effect of proteotoxic stress. This is likely (see Paull et al, PMID: 34116476).

      (i) A recent study in Nature Communications (Omkar et al., 2025) demonstrates rapid phosphorylation of yeast T492 in a pkc1-dependent manner, diminishing the impact of these findings.

      (2) Downstream Effects of T492/T495:

      (a) The manuscript's central conceptual advance is that pHsp70 is a cell-cycle-regulated brake on G1/S. Yet in mammalian cells, the authors show only that pHsp70 appears late, after cells have traversed mitosis, and that blocking CDK1 (G2/M) prevents its accumulation.

      (b) There is no functional test in human cells: no knockdown/rescue experiments with T495A or T495E, no cell-cycle profiling upon altering Hsp70 phosphorylation state, and no demonstration that pHsp70 actually causes any delay in S-phase entry, rather than simply correlating with late damage responses. The strong conclusion that pT495 "stalls cell cycle progression" (e.g., Figure 6 model) is therefore not supported in the human system.

      (c) All functional conclusions rely on T492A/E point mutants at the endogenous SSA1 locus, usually in an ssa2Δ background, in a family of highly redundant Hsp70s. Without showing that this site is actually modified during their MMS treatments, the assignment of phenotypes to loss of a physiological phospho-switch is premature. The authors need to repeat their studies in an Ssa1-4 background, as in https://pubmed.ncbi.nlm.nih.gov/32205407/.

      (d) The authors infer that T495E "locks" Hsc70 in a pseudo-open state based on reduced J-protein-stimulated ATPase activity, unchanged ATP binding, altered trypsin sensitivity, and retained tau binding. However, there is no direct comparison of phosphorylated vs T495E protein (e.g., via in vitro phosphorylation with LegK4 followed by side-by-side biochemical assays, or structural analysis). Thus, it remains unclear to what extent the glutamate substitution mimics a phosphate at this position.

      (e) No client release kinetics, co-chaperone binding assays, or in vivo chaperone function tests are provided, yet the discussion builds a detailed model of a "pseudo-open" state that simultaneously resembles ATP-bound conformation and allows persistent substrate engagement.

    3. Reviewer #2 (Public review):

      Summary:

      This paper follows a clue provided by an earlier paper from the same lab, that the pathogen Legionella pneumophila translocates into its host cell a kinase LegK4 that phosphorylates the cytosolic Hsp70 on threonine 495. The consequences of modification of this conserved Hsp70 residue, whether by LegK4-phosphorylation in the cytosol (of infected cells) or by FICD-mediated AMPylation in the ER (under conditions of low ER stress) are to lock the chaperone in a JDP-refractory state, thus functionally inactivating it.

      Here, the claim is to have discovered an endogenous phosphorylation event targeting the same residue in cells in which DNA damage base-excision repair is overburdened.

      Strengths:

      The suggestion of physiological modulation of chaperone activity by covalent modification is an interesting area of cell physiology. Specifically, the claim for discovery of a discrete phosphorylation event of an Hsp70 chaperone, one with a well-defined biochemical consequence, is this paper's strength.

      Weaknesses:

      The kinase(s) responsible for the phosphorylation have not been identified (and hence remain inaccessible to experimental, i.e., genetic or pharmacological, manipulation). The mechanistic links to DNA damage repair and the fitness benefits of this proposed adaptation remain obscure. Of greater concern, the data provided in the paper fail to exclude the trivial possibility that the phosphorylation event described (and characterised through biochemical proxies) is biologically neutral, reflecting nothing more than a bystander event in which kinase(s) activated by application of high concentrations of a powerful alkylating agent (MMS) phosphorylate, at meaninglessly low stoichiometry, an abundant protein (Hsp70) on a surface-exposed residue. Failure to exclude this (plausible) scenario is this paper's weakness.

    4. Reviewer #3 (Public review):

      In this manuscript, Moss et al. demonstrate that Hsp70 phosphorylation at a conserved threonine residue integrates DNA damage responses with cell-cycle control. The authors present unbiased biochemical, cell-based, and yeast genetic analyses showing that phosphorylation of human Hsp70 at T495 (and the analogous Ssa1 T492 in yeast) is triggered by base-excision-repair intermediates and downstream DDR kinase activity, leading to delayed G1/S progression after DNA damage. They used orthogonal approaches such as ATPase assays, phospho-specific detection, kinase-inhibition studies, synchronization experiments, and phenotypic analyses of phosphomutants. They presented robust data that collectively supported the conclusion that dynamic Hsp70 phosphorylation functions as a conserved "molecular brake" to prevent inappropriate S-phase entry under genotoxic stress. However, there are a few minor questions and clarifications that the authors are well-positioned to address.

    1. eLife Assessment

      Rickert and colleagues demonstrate that the host peptidoglycan-binding protein PGLYRP1 has both beneficial and detrimental effects on Bordetella pertussis infection in mice. Using a solid array of techniques, the study provides useful insights into how peptidoglycan species may alter host immune responses. The data on the bactericidal effects on B. pertussis are incomplete, and further experiments are needed to draw conclusions on this question.

    2. Reviewer #1 (Public review):

      Summary:

      The authors aim to demonstrate that PGLYRP1 plays a dual role in host responses to B. pertussis infection. PGLYRP1 signaling is known to activate bactericidal responses due to recognition of peptidoglycan. Through NOD1 activation and TREM-1 engagement, it appears PGLYRP1 also has immunomodulatory activities. The authors present mouse knockout studies and gene expression data to illustrate the role of PGLYRP1 in relation to B. pertussis peptidoglycan. Mice lacking PGLYRP1 had slightly lower pathology scores. When TCT peptidoglycan was removed from the bacteria, expression of IL23A, IL6, IL1B, and other genes encoding pro-inflammatory cytokines surprisingly increased. The relationship between TCT and PGLYRP1 suggests the pathogen uses this strategy to decrease immune activation. The authors went on to show the relationship between PGLYRP1 and TREM-1 as mediated by PGN using various versions of peptidoglycan. The study presents multiple angles of data to back up its findings and demonstrates an interesting strategy used by B. pertussis to downregulate innate responses to its presence during infection.

      Strengths:

      Use of knockout mice of the key factor being considered, paired with isogenic B. pertussis strains, to reveal the mechanism of immune modulation to benefit the bacteria. The authors used in vivo gene expression paired with in vivo assays to establish each aspect of the mechanism.

      Weaknesses:

      The main focus was on innate responses, and some analysis of antigen-specific antibody responses could improve the impact of the findings.

    3. Reviewer #2 (Public review):

      Since its original discovery, the mechanistic basis for TCT-mediated pathogenesis of Bordetella pertussis has been a moving target and difficult to uncouple from confounding variables. The current study provides some exciting data that suggest PGLYRP-1 modulates host responses upon 'activation' by TCT. While there are some strengths associated with the unbiased approaches and collective data to support the claims associated with TCT and PGLYRP-1's function in this system, caution should be used when interpreting and extrapolating some of the information provided. For instance, the amount and purity of TCT used in the studies are unclear, and the in vitro activity of PGLYRP1 on B. pertussis is questionable. Different mouse backgrounds are used for various assays throughout, and it is known that the PRRs vary in these systems, so the confounding variables are difficult to uncouple. Additional concerns include the types of statistical tests being performed to support some of the claims and the relevance of using whole, intact PG sacculi from other species for comparative studies with a fragment of released PG (i.e., TCT).

    4. Reviewer #3 (Public review):

      Summary:

      This study evaluates the contributions of the mammalian PG-binding protein PGLYRP1 to Bordetella infection. The authors find potential roles for PGLYRP1 in both bacterial killing (canonical) and regulation of inflammation (non-canonical). While these findings are interesting, as is the idea that PG fragment release has differential impacts on infection depending on fragment structure, the study is limited by the lack of connection between the in vivo and in vitro experiments, and determining the precise mechanism of how PGLYRP1 regulates host responses and bacterial fitness during infection requires further study.

      Strengths:

      (1) The combination of scRNAseq with in vitro and in vivo assays provides complementary views of PGLYRP1 function during infection.

      (2) The use of TCT-deficient B. pertussis provides a useful control and perturbation in the in vitro assays.

      Weaknesses:

      (1) The study does not ultimately resolve the initial early versus late phenotype divergence. While the in vitro assays suggest explanations for their in vivo observations, further mechanistic links are lacking and necessary for the authors' conclusions throughout. To state one example, what is the early and late infection phenotype of TCT- Bp in mice lacking PGLYRP1? RNAseq data are reported from these mice, but there are no burden or pathology studies. Furthermore, what are the neutrophil phenotypes (NOD-1/TREM-1 activation) in vivo? And are they dependent on PGLYRP1 and/or TCT?

      (2) It is unclear whether or how the NOD1 and TREM-1 pathways interact.

      (3) Many of the study's conclusions rely on the use of HEK293 reporter lines in the absence of bacterial infection, which may not be physiologically representative.

      (4) The methods lack detail overall, and the experimental procedures should be described more concretely, especially for the scRNAseq datasets.

    1. eLife Assessment

      This fundamental work substantially advances our understanding of a major research question: whether collagen can be directly imaged with MRI. The evidence supporting the conclusion is compelling, with methods, data, and analyses that are more rigorous than those currently considered state-of-the-art. The work will be of high interest to MR physicists and clinicians, as collagen is the most abundant protein in the human body and plays an essential role in health.

    2. Reviewer #1 (Public review):

      Summary:

      The aim of this work is to directly image collagen in tissue using a new MRI method with positive contrast. The work presents a new MRI method that allows very short, powerful radio frequency (RF) pulses and very short switching times between transmission and reception of radio frequency signals.

      Strengths:

      The experiments with and without the removal of 1H hydrogen, which is not firmly bound to collagen, on tissue samples from tendons and bones, are very well suited to prove the detection of direct hydrogen signals from collagen. The new method has great potential value in medicine, as it allows for better investigation of ageing processes and many degenerative diseases in which functional tissue is replaced by connective tissue (collagen).

      Weaknesses:

      It is clear that, due to the relatively long time intervals between RF excitation and signal readout, standard hardware in whole-body MRI systems can only be used to examine surrounding water and not hydrogen bound to collagen molecules.

    3. Reviewer #2 (Public review):

      Summary:

      This work presents direct magnetic resonance imaging (MRI) of collagen, which is not possible with conventional MRI or other tomographic imaging modalities.

      Strengths:

      The experimental work is impressive, and the presentation of results is clear and convincing. Through a series of thoughtfully prepared experiments, I found the evidence that the images reflect direct measurements of collagen to be highly compelling.

      Due to the technical demands, direct collagen imaging is unlikely to become widespread for routine clinical work, at least not anytime soon. That said, this work is nonetheless transformative and will likely be highly significant for research and perhaps clinical trials.

    4. Reviewer #3 (Public review):

      The paper is well written and well presented. The topic is important, and its significance is explained succinctly and accurately. I am only capable of reviewing the clinical aspects of this work, which is very largely technical in nature. Several clinical points are worth considering:

      (1) Tendons typically display large magic angle effects as a result of their highly ordered collagen structure (cortical bone much less so), so it would have been of interest to know the orientation of the tendons relative to B<sub>0</sub> (in vitro and in vivo). This could affect the signal level at the longer echo time and thus the signal on the subtracted images.

      (2) The in vivo transverse image looks about mid-forearm, where tendons are not prominent. A transverse image of the lower forearm, where there is an abundance of tendons, might have been preferable.

      (3) The in vivo images show the interosseous membrane as a high signal on both the shorter and longer TE images. The structure contains ordered collagen with fibres at different oblique angles to the radius and ulna, and thus potentially to B<sub>0</sub>. Collagen fibres may have been at an orientation towards the magic angle, and this may account for the high signal on the longer TE image and the low signal on the subtracted image.

      (4) Some of the signals attributed to the muscle may be from an attachment of the muscle to the aponeurosis.

      (5) There is significant collagen in subcutaneous tissues, so the designation "skin" may more correctly be "skin and subcutaneous tissue".

      (6) Cortical bone is very heterogeneous, with boundaries between hard bone and soft tissue and significant susceptibility differences between the two across a small distance. This might be another mechanism for ultrashort T<sub>2</sub>* tissue values in addition to the presence of collagen. The two effects might be distinguished by also including a longer TE spin echo acquisition.

      Solid cortical bone may also have an ultrashort T<sub>2</sub>* in its own right.

      (7) It may be worth noting that in disease T<sub>2</sub>* may be increased. As a result, the subtraction image may make abnormal tissue less obvious than normal tissue. Magic angle effects may also produce this appearance.

      (8) It may be worth distinguishing fibrous connective tissue (loose or dense), which may be normal or abnormal, from fibrosis, which is an abnormal accumulation of fibrous connective tissue in damaged tissue. Fibrosis typically has a longer T<sub>2</sub> initially, and its T<sub>2</sub>* decreases over time. In places, the context suggests that fibrous connective tissue may be more appropriate than fibrosis.

      Overall, the paper appears very well constructed and describes thoughtful and important work.

    1. eLife Assessment

      This study presents a valuable analysis of a large dataset of [NiFe]-CODHs, integrating genomic context, operon organization, and clade-specific gene neighborhoods to discern patterns of functional diversification and adaptation. Carefully looking at the CODH genomic context, e.g., CODH-HCP co-occurrence, the authors gain insight into enzymatic activity, biotechnological potential, and differential functional roles. The approach aligns with current standards in genomic enzymology to characterize newly identified enzymes. With solid support, this work provides a broadly informative contribution to the field.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript analyzes a large dataset of [NiFe]-CODHs with a focus on genomic context and operon organization. Beyond earlier phylogenetic and biochemical studies, it addresses CODH-HCP co-occurrence, clade-specific gene neighborhoods, and operon-level variation, offering new perspectives on functional diversification and adaptation.

      Strengths:

      The study has a valuable approach.

      Weaknesses:

      Several points should be addressed.

      (1) The rationale for excluding clades G and H should be clarified. Inoue et al. (Extremophiles 26:9, 2022) defined [NiFe]-CODH phylogenetic clades A-H. In the present manuscript, clades A-H are depicted, yet the analyses and discussion focus only on clades A-F. If clades G and H were deliberately excluded (e.g., due to limited sequence data or lack of biochemical evidence), the rationale should be clearly stated. Providing even a brief explanation of their status or the reason for omission would help readers understand the scope and limitations of the study. In addition, although Figure 1 shows clades A-H and cites Inoue et al. (2022), the manuscript does not explicitly state how these clades are defined. An explicit acknowledgement of the clade framework would improve clarity and ensure that readers fully understand the basis for subsequent analyses.

      (2) The co-occurrence data would benefit from clearer presentation in the supplementary material. At present, the supplementary data largely consist of raw values, making interpretation difficult. For example, in Figure 3b, the co-occurrence frequencies are hard to reconcile with the text: clade A shows no co-occurrence with clade B and even lower tendencies than clades E or F, while clade E appears relatively high. Similarly, the claim that clades C and D "more often co-occur, especially with A, E, and F" does not align with the numerical trends, where D and E show stronger co-occurrence but C does not. A concise, well-organized summary table would greatly improve clarity and prevent such misunderstandings.

      (3) The rationale for analyzing gene neighborhoods at the single-operon level needs clarification. Many microorganisms encode more than one CODH operon, yet the analysis was carried out at the level of individual operons. The authors should clarify the biological rationale for this choice and discuss how focusing on single operons rather than considering the full complement per organism might affect the interpretation of genomic context.

    3. Reviewer #2 (Public review):

      The authors present a comparative genomic and phylogenetic analysis aimed at elucidating the functions of nickel-dependent carbon monoxide dehydrogenases (Ni-CODHs) and hybrid-cluster proteins (HCPs). By examining gene neighborhoods, phylogenetic relationships, and co-occurrence patterns, they propose functional hypotheses for different CODH clades and highlight those with the greatest potential for biotechnological applications.

      A major strength of this work lies in its systematic and conceptually clear approach, which provides a rapid and low-cost framework for predicting the functional potential of newly identified CODHs based on sequence data and genomic context. The analysis is careful in minimizing false positives and offers valuable insights into the diversity and distribution of CODH enzyme clades.

      However, several limitations should be considered when interpreting the findings. The use of incomplete genome assemblies may lead to the exclusion of relevant genes or operonic regions. Clade H was omitted due to a lack of information on its host, and the number of class II HCPs included is limited. Although the genomic window analyzed is relatively broad, it may still miss functionally relevant neighboring genes. The study assumes that the pathways associated with CODHs are encoded near the enzyme loci, but these could also occur elsewhere in the genome or on the complementary strand. The authors acknowledge these and other limitations clearly and thoughtfully, which strengthens the transparency and credibility of their analysis.

      Given the high evolutionary diversity of CODHs, both across and within clades, phenotypic predictions derived solely from sequence and neighborhood data should be interpreted with caution. Sequence-based searches, while specific, may have limited sensitivity, and structural homology searches could further enrich the dataset. Additionally, the visual inspection used to filter out non-CODH sequences is not described in detail, leaving uncertainty about reproducibility. The generalization of enzymatic activity or inactivity from a few characterized examples to entire clades should also be regarded as tentative.

      Despite these limitations, the study presents a solid and valuable methodological framework that can aid in the rapid functional screening of novel CODH enzymes and may inspire broader applications in enzyme discovery and metabolic annotation.

    1. eLife Assessment

      This study examines an important question regarding the developmental trajectory of neural mechanisms supporting facial expression processing. Leveraging a rare intracranial EEG (iEEG) dataset including both children and adults, the authors reported that facial expression recognition mainly engaged the posterior superior temporal cortex (pSTC) among children, while both pSTC and the prefrontal cortex were engaged among adults. In terms of strength of evidence, the solid methods, data and analyses broadly support the claims with minor weaknesses.

    2. Reviewer #1 (Public review):

      Summary:

      This study investigates how the brain processes facial expressions across development by analyzing intracranial EEG (iEEG) data from children (ages 5-10) and post-childhood individuals (ages 13-55). The researchers used a short film containing emotional facial expressions and applied AI-based models to decode brain responses to facial emotions. They found that in children, facial emotion information is represented primarily in the posterior superior temporal cortex (pSTC), a sensory processing area, but not in the dorsolateral prefrontal cortex (DLPFC), which is involved in higher-level social cognition. In contrast, post-childhood individuals showed emotion encoding in both regions. Importantly, the complexity of emotions encoded in the pSTC increased with age, particularly for socially nuanced emotions like embarrassment, guilt, and pride. The authors claim that these findings suggest that emotion recognition matures through increasing involvement of the prefrontal cortex, supporting a developmental trajectory where top-down modulation enhances understanding of complex emotions as children grow older.

      Strengths:

      (1) The inclusion of pediatric iEEG makes this study uniquely positioned to offer high-resolution temporal and spatial insights into neural development compared to non-invasive approaches, e.g., fMRI, scalp EEG, etc.

      (2) Using a naturalistic film paradigm enhances ecological validity compared to static image tasks often used in emotion studies.

      (3) The idea of using state-of-the-art AI models to extract facial emotion features allows for high-dimensional and dynamic emotion labeling in real time.

      Weaknesses:

      (1) The study has notable limitations that constrain the generalizability and depth of its conclusions. The sample size was very small, with only nine children included and just two having sufficient electrode coverage in the posterior superior temporal cortex (pSTC), which weakens the reliability and statistical power of the findings, especially for analyses involving age. The authors pointed out that a similar sample size has been used in previous iEEG studies, but the cited works focus on adults and do not take a developmental perspective. Similar work looking at developmental changes in iEEG signals usually includes many more subjects (e.g., n = 101 children from Cross ZR et al., Nature Human Behavior, 2025) to account for inter-subject variability.

      (2) Electrode coverage was also uneven across brain regions, with not all participants having electrodes in both the dorsolateral prefrontal cortex (DLPFC) and pSTC, making the conclusion regarding the different developmental changes between DLPFC and pSTC hard to interpret (related to point 3 below). It is understood that it is rare to have such iEEG data collected in this age group, and the electrode location is only determined by clinical needs. However, the scientific rigor should not be compromised by the limited data access. It's the authors' decision whether such an approach is valid and appropriate to address the scientific questions, here the developmental changes in the brain, given all the advantages and constraints of the data modality.

      (3) The developmental differences observed were based on cross-sectional comparisons rather than longitudinal data, reducing the ability to draw causal conclusions about developmental trajectories. Also, see comments in point 2.

      (4) Moreover, the analysis focused narrowly on DLPFC, neglecting other relevant prefrontal areas such as the orbitofrontal cortex (OFC) and anterior cingulate cortex (ACC), which play key roles in emotion and social processing. I agree that this might be beyond the scope of this paper, but addressing it in the Discussion might be insightful.

      (5) Although the use of a naturalistic film stimulus enhances ecological validity, it comes at the cost of experimental control, with no behavioral confirmation of the emotions perceived by participants and uncertain model validity for complex emotional expressions in children. A non-facial music block that could have served as a control was available but not analyzed. The validity of the AI model's emotional output needs to be tested. It is understood that these behavioral data cannot be collected retrospectively from the recorded subjects. Post-hoc experiments and analyses could perhaps be done, e.g., collecting behavioral emotion perception data from age-matched healthy subjects.

      (6) Generalizability is further limited by the fact that all participants were neurosurgical patients, potentially with neurological conditions such as epilepsy that may influence brain responses. At least some behavioral measures should be compared between the patient population and healthy groups to ensure that the perception of emotions is similar.

      (7) Additionally, the high temporal resolution of intracranial EEG was not fully utilized, as data were downsampled and averaged in 500-ms windows. It seems that the authors compromised the iEEG data analyses to match the AI model's output resolution, which is 2 Hz. It is then not clear why fMRI, which is non-invasive and already seems to meet the needs here, was not used directly. The advantages of using iEEG in this study are not made clear.

      (8) Finally, the absence of behavioral measures or eye-tracking data makes it difficult to directly link neural activity to emotional understanding or determine which facial features participants attended to. Related to point 5 as well.

      Comments on revisions:

      A behavioral measurement would help address many of these questions. If data collection continues, additional subjects with both iEEG recordings and behavioral measurements would be valuable.

    3. Reviewer #2 (Public review):

      Summary:

      In this paper, Fan et al. aim to characterize how neural representations of facial emotions evolve from childhood to adulthood. Using intracranial EEG recordings from participants aged 5 to 55, the authors assess the encoding of emotional content in high-level cortical regions. They report that while both the posterior superior temporal cortex (pSTC) and dorsolateral prefrontal cortex (DLPFC) are involved in representing facial emotions in older individuals, only the pSTC shows significant encoding in children. Moreover, the encoding of complex emotions in the pSTC appears to strengthen with age. These findings lead the authors to suggest that young children rely more on low-level sensory areas and propose a developmental shift from reliance on lower-level sensory areas in early childhood to increased top-down modulation by the prefrontal cortex as individuals mature.

      Strengths:

      (1) Rare and valuable dataset: The use of intracranial EEG recordings in a developmental sample is highly unusual and provides a unique opportunity to investigate neural dynamics with both high spatial and temporal resolution.

      (2) Developmentally relevant design: The broad age range and cross-sectional design are well-suited to explore age-related changes in neural representations.

      (3) Ecological validity: The use of naturalistic stimuli (movie clips) increases the ecological relevance of the findings.

      (4) Feature-based analysis: The authors employ AI-based tools to extract emotion-related features from naturalistic stimuli, which enables a data-driven approach to decoding neural representations of emotional content. This method allows for a more fine-grained analysis of emotion processing beyond traditional categorical labels.

      Weaknesses:

      (1) While the authors leverage Hume AI, a tool pre-trained on a large dataset, its specific performance on the stimuli used in this study remains unverified. To strengthen the foundation of the analysis, it would be important to confirm that Hume AI's emotional classifications align with human perception for these particular videos. A straightforward way to address this would be to recruit human raters to evaluate the emotional content of the stimuli and compare their ratings to the model's outputs.

      (2) Although the study includes data from four children with pSTC coverage-an increase from the initial submission-the sample size remains modest compared to recent iEEG studies in the field.

      (3) The "post-childhood" group (ages 13-55) conflates several distinct neurodevelopmental periods, including adolescence, young adulthood, and middle adulthood. As a finer age stratification is likely not feasible with the current sample size, I would suggest that the authors temper their developmental conclusions.

      (4) The analysis of DLPFC-pSTC directional connectivity would be significantly strengthened by modeling it as a continuous function of age across all participants, rather than relying on an unbalanced comparison between a single child and a post-childhood group (N = 7). This continuous approach would provide a more powerful and nuanced view of the developmental trajectory. I would also suggest including the result in the main text.
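
      As a minimal sketch of what modeling connectivity as a continuous function of age could look like, the snippet below fits an ordinary least-squares line to one connectivity value per participant. The function name `fit_line` and all numbers are hypothetical placeholders chosen for illustration; they are not data or methods from the study, and the reviewer does not prescribe a specific model.

```python
def fit_line(ages, values):
    """Ordinary least-squares fit of values = slope * age + intercept."""
    n = len(ages)
    mean_age = sum(ages) / n
    mean_val = sum(values) / n
    sxx = sum((a - mean_age) ** 2 for a in ages)
    if sxx == 0:
        raise ValueError("ages must not all be identical")
    sxy = sum((a - mean_age) * (v - mean_val) for a, v in zip(ages, values))
    slope = sxy / sxx
    intercept = mean_val - slope * mean_age
    return slope, intercept


# Placeholder numbers for illustration only (NOT data from the study).
ages = [5, 10, 15, 20, 25]
connectivity = [0.1, 0.2, 0.3, 0.4, 0.5]
slope, intercept = fit_line(ages, connectivity)
print(f"slope per year = {slope:.3f}, intercept = {intercept:.3f}")
```

      With a sample this small, a rank-based alternative or a permutation test on the slope might be more appropriate than parametric inference; that choice is left open here.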

    4. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study examines a valuable question regarding the developmental trajectory of neural mechanisms supporting facial expression processing. Leveraging a rare intracranial EEG (iEEG) dataset including both children and adults, the authors reported that facial expression recognition mainly engaged the posterior superior temporal cortex (pSTC) among children, while both pSTC and the prefrontal cortex were engaged among adults. However, the sample size is relatively small, with analyses appearing incomplete to fully support the primary claims. 

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study investigates how the brain processes facial expressions across development by analyzing intracranial EEG (iEEG) data from children (ages 5-10) and post-childhood individuals (ages 13-55). The researchers used a short film containing emotional facial expressions and applied AI-based models to decode brain responses to facial emotions. They found that in children, facial emotion information is represented primarily in the posterior superior temporal cortex (pSTC) - a sensory processing area - but not in the dorsolateral prefrontal cortex (DLPFC), which is involved in higher-level social cognition. In contrast, post-childhood individuals showed emotion encoding in both regions. Importantly, the complexity of emotions encoded in the pSTC increased with age, particularly for socially nuanced emotions like embarrassment, guilt, and pride. The authors claim that these findings suggest that emotion recognition matures through increasing involvement of the prefrontal cortex, supporting a developmental trajectory where top-down modulation enhances understanding of complex emotions as children grow older.

      Strengths:

      (1) The inclusion of pediatric iEEG makes this study uniquely positioned to offer high-resolution temporal and spatial insights into neural development compared to non-invasive approaches, e.g., fMRI, scalp EEG, etc.

      (2) Using a naturalistic film paradigm enhances ecological validity compared to static image tasks often used in emotion studies.

      (3) The idea of using state-of-the-art AI models to extract facial emotion features allows for high-dimensional and dynamic emotion labeling in real time.

      Weaknesses:

      (1) The study has notable limitations that constrain the generalizability and depth of its conclusions. The sample size was very small, with only nine children included and just two having sufficient electrode coverage in the posterior superior temporal cortex (pSTC), which weakens the reliability and statistical power of the findings, especially for analyses involving age.

      We appreciate the reviewer’s point regarding the constrained sample size.

      As an invasive method, iEEG recordings can only be obtained from patients undergoing electrode implantation for clinical purposes. Thus, iEEG data from young children are extremely rare, and rapidly increasing the sample size within a few years is not feasible. However, we are confident in the reliability of our main conclusions. Specifically, 8 children (53 recording contacts in total) and 13 control participants (99 recording contacts in total) with electrode coverage in the DLPFC are included in our DLPFC analysis. This sample size is comparable to other iEEG studies with similar experimental designs [1-3].

      For pSTC, we returned to the data set and found another two children who had pSTC coverage. After including these children’s data, the group-level analysis using a permutation test showed that children’s pSTC significantly encodes facial emotion in naturalistic contexts (Figure 3B). Notably, the responses of the two new children (S33 and S49) were highly consistent with our previous observations. Moreover, the averaged prediction accuracy in children’s pSTC (r<sub>speech</sub>=0.1565) was highly comparable to that in the post-childhood group (r<sub>speech</sub>=0.1515).

      (1) Zheng, J. et al. Multiplexing of Theta and Alpha Rhythms in the Amygdala-Hippocampal Circuit Supports Pattern Separation of Emotional Information. Neuron 102, 887-898.e5 (2019).

      (2) Diamond, J. M. et al. Focal seizures induce spatiotemporally organized spiking activity in the human cortex. Nat. Commun. 15, 7075 (2024).

      (3) Schrouff, J. et al. Fast temporal dynamics and causal relevance of face processing in the human temporal cortex. Nat. Commun. 11, 656 (2020).

      (2) Electrode coverage was also uneven across brain regions, with not all participants having electrodes in both the dorsolateral prefrontal cortex (DLPFC) and pSTC, and most coverage limited to the left hemisphere, hindering within-subject comparisons and limiting insights into lateralization.

      The electrode coverage in each patient is determined entirely by clinical needs. Only a few patients have electrodes in both DLPFC and pSTC because these two regions are far apart, so it’s rare for a single patient’s suspected seizure network to span such a large territory. However, this does not affect our results, as most iEEG studies combine data from multiple patients to achieve sufficient electrode coverage in each target brain area. As our data are mainly from the left hemisphere (due to clinical needs), this study was not designed to examine whether there is a difference between hemispheres in emotion encoding. Nevertheless, lateralization remains an interesting question that should be addressed in future research, and we have noted this limitation in the Discussion (Page 8, in the last paragraph of the Discussion).

      (3) The developmental differences observed were based on cross-sectional comparisons rather than longitudinal data, reducing the ability to draw causal conclusions about developmental trajectories.  

      In the context of pediatric intracranial EEG, longitudinal data collection is not feasible due to the invasive nature of electrode implantation. We have added this point to the Discussion to acknowledge that while our results reveal robust age-related differences in the cortical encoding of facial emotions, longitudinal studies using non-invasive methods will be essential to directly track developmental trajectories (Page 8, in the last paragraph of the Discussion). In addition, we revised our manuscript to avoid emphasizing causal conclusions about developmental trajectories in the current study (for example, we use “imply” instead of “suggest” in the fifth paragraph of the Discussion).

      (4) Moreover, the analysis focused narrowly on DLPFC, neglecting other relevant prefrontal areas such as the orbitofrontal cortex (OFC) and anterior cingulate cortex (ACC), which play key roles in emotion and social processing.

      We agree that both OFC and ACC are critically involved in emotion and social processing. However, we have no recordings from these areas because ECoG rarely covers the ACC or OFC due to technical constraints. We have noted this limitation in the Discussion (Page 8, in the last paragraph of the Discussion). Future follow-up studies using sEEG or non-invasive imaging methods could examine developmental patterns in these regions.

      (5) Although the use of a naturalistic film stimulus enhances ecological validity, it comes at the cost of experimental control, with no behavioral confirmation of the emotions perceived by participants and uncertain model validity for complex emotional expressions in children. A non-facial music block that could have served as a control was available but not analyzed.

      The facial emotion features used in our encoding models were extracted by Hume AI models, which were trained on human intensity ratings of large-scale, experimentally controlled emotional expression data [1-2]. Thus, the outputs of the Hume AI model reflect what typical facial expressions convey, that is, the presented facial emotion. Our goal in the present study was to examine how facial emotions presented in the videos are encoded in the human brain at different developmental stages. We agree that children’s interpretation of complex emotions may differ from that of adults, resulting in a different perceived emotion (i.e., the emotion that the observer subjectively interprets). Behavioral ratings are necessary to study the encoding of subjectively perceived emotion, which is a very interesting direction but beyond the scope of the present work. We have added a paragraph in the Discussion (see Page 8) to explicitly note that our study focused on the encoding of presented emotion.

      We appreciate the reviewer’s point regarding the value of non-facial music blocks. However, although there are segments in the music condition in which no faces are presented, these cannot be used as a control condition to test whether the encoding model’s prediction accuracy in pSTC or DLPFC drops to chance when no facial emotion is present. This is because, in the absence of faces, no extracted emotion features are available for constructing the encoding model (see Author response image 1 below). Thus, we chose to use a different control analysis for the present work. For children’s pSTC, we shuffled the facial emotion features in time to generate a null distribution, which was then used to test the statistical significance of the encoding models (see Methods/Encoding model fitting for details).

      (1) Brooks, J. A. et al. Deep learning reveals what facial expressions mean to people in different cultures. iScience 27, 109175 (2024).

      (2) Brooks, J. A. et al. Deep learning reveals what vocal bursts express in different cultures. Nat. Hum. Behav. 7, 240–250 (2023).

      Author response image 1.

      Time courses of Hume AI extracted facial expression features for the first block of the music condition. Only the top 5 facial expressions are shown here due to space limitations.
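
      The time-shuffling control described in the response to point (5) can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the authors' pipeline: it uses a single feature time course and the Pearson correlation as the score instead of a fitted multi-feature encoding model, and the function names, the number of permutations, and the synthetic data are all hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant series
    return sxy / (sxx * syy) ** 0.5


def shuffle_null_test(feature, response, n_perm=500, seed=0):
    """Shuffle the feature in time to build a null distribution of scores."""
    rng = random.Random(seed)
    observed = pearson(feature, response)
    null = []
    for _ in range(n_perm):
        shuffled = feature[:]   # copy so the original order is preserved
        rng.shuffle(shuffled)   # breaks the temporal alignment with the response
        null.append(pearson(shuffled, response))
    # One-sided p-value with the standard +1 correction.
    exceed = sum(1 for r in null if r >= observed)
    p_value = (1 + exceed) / (1 + n_perm)
    return observed, null, p_value


# Synthetic demo (not study data): the response depends weakly on the feature.
demo_rng = random.Random(1)
feature = [demo_rng.gauss(0, 1) for _ in range(300)]
response = [0.5 * f + demo_rng.gauss(0, 1) for f in feature]
observed, null, p_value = shuffle_null_test(feature, response)
print(f"observed r = {observed:.3f}, p = {p_value:.4f}")
```

      Shuffling individual time points also destroys the autocorrelation of the feature, so block-wise or circular shifts are sometimes preferred for time series; which variant was used in the study is not stated here.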

      (6) Generalizability is further limited by the fact that all participants were neurosurgical patients, potentially with neurological conditions such as epilepsy that may influence brain responses. 

      We appreciate the reviewer’s point. However, iEEG data can only be obtained from clinical populations (usually epilepsy patients) who have undergone electrode implantation. Given current knowledge about focal epilepsy and its potential effects on brain activity, researchers believe that epilepsy-affected brains can serve as a reasonable proxy for normal human brains when confounding influences are minimized through rigorous procedures [1]. In our study, we took several steps to ensure data quality: (1) all data segments containing epileptiform discharges were identified and removed at the very beginning of preprocessing, and (2) patients participated in the experiment several hours outside the window of seizures. Please see the Methods for a description of the data quality checks (Page 9, Experimental procedures and iEEG data processing).

      (1) Parvizi J, Kastner S. 2018. Promises and limitations of human intracranial electroencephalography. Nat Neurosci 21:474–483. doi:10.1038/s41593-018-0108-2

      (7) Additionally, the high temporal resolution of intracranial EEG was not fully utilized, as data were down-sampled and averaged in 500-ms windows.  

      We agree that one of the major advantages of iEEG is its millisecond-level temporal resolution. In our case, the main reason for down-sampling was that the time series of facial emotion features extracted from the videos, which was used for modelling neural responses, had a temporal resolution of 2 Hz. In naturalistic contexts, facial emotion features do not change on a millisecond timescale, so a 500 ms window is sufficient to capture the relevant dynamics. Another advantage of iEEG is its tolerance of motion, which is often excessive in young children (e.g., 5-year-olds). This makes our dataset uniquely valuable; it suggests robust representation in the pSTC but not in the DLPFC in young children. Moreover, since our methodological framework (Figure 1) does not rely on high temporal resolution, it can be transferred to non-invasive modalities such as fMRI, enabling future studies to test these developmental patterns in larger populations.
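
      To make the 500 ms averaging concrete, the sketch below bins a one-dimensional signal into consecutive non-overlapping windows so that it matches a 2 Hz feature time series. The sampling rate, the function name `average_in_windows`, and the toy signal are assumptions for illustration; the response does not specify which neural signal was averaged or how window edges were handled.

```python
def average_in_windows(signal, fs, window_s=0.5):
    """Average a 1-D signal in consecutive non-overlapping windows.

    fs is the sampling rate in Hz; window_s is the window length in seconds.
    A trailing partial window is dropped.
    """
    win = int(round(fs * window_s))
    if win <= 0:
        raise ValueError("window must span at least one sample")
    n_windows = len(signal) // win
    return [
        sum(signal[i * win:(i + 1) * win]) / win
        for i in range(n_windows)
    ]


# Toy signal at a hypothetical 1000 Hz rate: a step that increases by one
# every 500 samples, followed by a partial window of 250 samples.
fs = 1000
signal = [float(i // 500) for i in range(2250)]
binned = average_in_windows(signal, fs)
print(binned)  # one value per 500 ms, i.e. a 2 Hz series
```

      Each output value then lines up with one sample of a 2 Hz feature time course, which is what allows the two series to be entered into the same regression.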

      (8) Finally, the absence of behavioral measures or eye-tracking data makes it difficult to directly link neural activity to emotional understanding or determine which facial features participants attended to.

      We appreciate this point. Part of our rationale for the absence of behavioral measures is presented in our response to (5). Following the same rationale, identifying which facial features participants attended to is not necessary for testing our main hypotheses because our analyses examined responses to the overall emotional content of the faces. However, we agree and recommend that future studies of subjective emotional understanding use eye-tracking and corresponding behavioral measures.

      Reviewer #2 (Public review):

      Summary:

      In this paper, Fan et al. aim to characterize how neural representations of facial emotions evolve from childhood to adulthood. Using intracranial EEG recordings from participants aged 5 to 55, the authors assess the encoding of emotional content in high-level cortical regions. They report that while both the posterior superior temporal cortex (pSTC) and dorsolateral prefrontal cortex (DLPFC) are involved in representing facial emotions in older individuals, only the pSTC shows significant encoding in children. Moreover, the encoding of complex emotions in the pSTC appears to strengthen with age. These findings lead the authors to suggest that young children rely more on low-level sensory areas and propose a developmental shift from reliance on lower-level sensory areas in early childhood to increased top-down modulation by the prefrontal cortex as individuals mature.

      Strengths: 

      (1) Rare and valuable dataset: The use of intracranial EEG recordings in a developmental sample is highly unusual and provides a unique opportunity to investigate neural dynamics with both high spatial and temporal resolution. 

      (2) Developmentally relevant design: The broad age range and cross-sectional design are well-suited to explore age-related changes in neural representations. 

      (3) Ecological validity: The use of naturalistic stimuli (movie clips) increases the ecological relevance of the findings. 

      (4) Feature-based analysis: The authors employ AI-based tools to extract emotion-related features from naturalistic stimuli, which enables a data-driven approach to decoding neural representations of emotional content. This method allows for a more fine-grained analysis of emotion processing beyond traditional categorical labels.

      Weaknesses: 

      (1) The emotional stimuli included facial expressions embedded in speech or music, making it difficult to isolate neural responses to facial emotion per se from those related to speech content or music-induced emotion. 

      We thank the reviewer for raising this important point. We agree that in naturalistic settings, faces often co-occur with speech, and that these sources of emotion can overlap. However, emotions induced by background music have distinct temporal dynamics that are separable from facial emotion (see Author response image 2 (A) and (B) below). In addition, faces can convey a wide range of emotions (48 categories in the Hume AI model), whereas music conveys far fewer (13 categories reported by a recent study [1]). Thus, when using facial emotion feature time series as regressors (with 48 emotion categories and rapid temporal dynamics), the model performance will reflect neural encoding of facial emotion in the music condition, rather than the slower and lower-dimensional emotion from music.

      For the speech condition, we acknowledge that it is difficult to fully isolate neural responses to facial emotion from those to speech when the emotional content from faces and speech highly overlaps. However, in our study, (1) the time courses of emotion features from face and voice are still different (Author response image 2 (C) and (D)), and (2) our main finding that DLPFC encodes facial expression information in post-childhood individuals but not in young children was found in both the speech and music conditions (Figure 2B and 2C). In the music condition, neural responses to facial emotion are not affected by speech. Thus, we have included the DLPFC results from the music condition in the revised manuscript (Figure 2C), and we acknowledge that this issue should be carefully considered in future studies using videos with speech, as we have indicated in the future directions in the last paragraph of the Discussion.

      (1) Cowen, A. S., Fang, X., Sauter, D. & Keltner, D. What music makes us feel: At least 13 dimensions organize subjective experiences associated with music across different cultures. Proc Natl Acad Sci USA 117, 1924–1934 (2020).

      Author response image 2.

      Time courses of amusement. (A) and (B) Amusement conveyed by face or music in a 30-s music block. Facial emotion features are extracted by Hume AI. For emotion from music, we approximated the amusement time course using a weighted combination of low-level acoustic features (RMS energy, spectral centroid, MFCCs), which capture intensity, brightness, and timbre cues linked to amusement. Notice that music continues when there are no faces presented. (C) and (D) Amusement conveyed by face or voice in a 30-s speech block. From 0 to 5 seconds, a girl is introducing her friend to a stranger. The camera focuses on the friend, who appears nervous, while the girl’s voice sounds cheerful. This mismatch explains why the shapes of the two time series differ at the beginning. Such situations occur frequently in naturalistic movies.

      (2) While the authors leveraged Hume AI to extract facial expression features from the video stimuli, they did not provide any validation of the tool's accuracy or reliability in the context of their dataset. It remains unclear how well the AI-derived emotion ratings align with human perception, particularly given the complexity and variability of naturalistic stimuli. Without such validation, it is difficult to assess the interpretability and robustness of the decoding results based on these features.  

      Hume AI models were trained and validated on human intensity ratings of large-scale, experimentally controlled emotional expression data [1-2]. The training process used both manual annotations from human raters and deep neural networks. Over 3000 human raters categorized facial expressions into emotion categories and rated them on a 1-100 intensity scale. Thus, the outputs of the Hume AI model reflect what typical facial expressions convey (based on how people actually interpret them), that is, the presented facial emotion. Our goal in the present study was to examine how facial emotions presented in the videos are encoded in the human brain at different developmental stages. We agree that the interpretation of facial emotions may differ across individual participants, resulting in different perceived emotion (i.e., the emotion that the observer subjectively interprets). Behavioral ratings are necessary to study the encoding of subjectively perceived emotion, which is a very interesting direction but beyond the scope of the present work. We have added text in the Discussion to explicitly note that our study focused on the encoding of presented emotion (second paragraph on Page 8).

      (1) Brooks, J. A. et al. Deep learning reveals what facial expressions mean to people in different cultures. iScience 27, 109175 (2024).

      (2) Brooks, J. A. et al. Deep learning reveals what vocal bursts express in different cultures. Nat. Hum. Behav. 7, 240–250 (2023).

      (3) Only two children had relevant pSTC coverage, severely limiting the reliability and generalizability of results.  

      We appreciate this point and agree with both reviewers who raised it as a significant concern. As described in our response to reviewer 1 (comment 1), we have added data from another two children who have pSTC coverage. Group-level analysis using a permutation test showed that children’s pSTC significantly encodes facial emotion in naturalistic contexts (Figure 3B). Because iEEG data from young children are extremely rare, rapidly increasing the sample size within a few years is not feasible. However, we are confident in the reliability of our conclusion that children’s pSTC can encode facial emotion. First, the two new children’s responses (S33 and S49) from pSTC were highly consistent with our previous observations (see individual data in Figure 3B). Second, the averaged prediction accuracy in children’s pSTC (r<sub>speech</sub>=0.1565) was highly comparable to that in the post-childhood group (r<sub>speech</sub>=0.1515).

      (4) The rationale for focusing exclusively on high-frequency activity for decoding emotion representations is not provided, nor are results from other frequency bands explored.   

      We focused on high-frequency broadband (HFB) activity because it is widely considered to reflect the responses of local neuronal populations near the recording electrode, whereas low-frequency oscillations in the theta, alpha, and beta ranges are thought to serve as carrier frequencies for long-range communication across distributed networks[1-2]. Since our study aimed to examine the representation of facial emotion in localized cortical regions (DLPFC and pSTC), HFB activity provides the most direct measure of the relevant neural responses. We have added this rationale to the manuscript (Page 3).

      (1) Parvizi, J. & Kastner, S. Promises and limitations of human intracranial electroencephalography. Nat. Neurosci. 21, 474–483 (2018).

      (2) Buzsaki, G. Rhythms of the Brain. (Oxford University Press, Oxford, 2006).

      (5) The hypothesis of developmental emergence of top-down prefrontal modulation is not directly tested. No connectivity or co-activation analyses are reported, and the number of participants with simultaneous coverage of pSTC and DLPFC is not specified.  

      Directional connectivity analysis results were not shown because only one child has simultaneous coverage of pSTC and DLPFC. However, the Granger causality results from the post-childhood group (N=7) clearly showed that the influence in the alpha/beta band from DLPFC to pSTC (top-down) gradually increased after the onset of face presentation (Author response image 3, below left, plotted in red). By comparison, the influence in the alpha/beta band from pSTC to DLPFC (bottom-up) gradually decreased after the onset of face presentation (Author response image 3, below left, blue curve). The influence in the alpha/beta band from DLPFC to pSTC was significantly increased at 750 and 1250 ms after face presentation (face vs nonface, paired t-test, Bonferroni-corrected P=0.005, 0.006), suggesting enhanced top-down modulation in the post-childhood group while watching emotional faces. Interestingly, this top-down influence appears very different in the 8-year-old child at 1250 ms after face presentation (Author response image 3, below left, black curve).

      As we cannot draw direct conclusions from the single-subject sample presented here, the top-down hypothesis is introduced only as a possible explanation for our current results. We have removed potentially misleading statements, and we plan to test this hypothesis directly using MEG in the future.

      Author response image 3.

      Difference of Granger causality indices (face – nonface) in the alpha/beta and gamma bands for both directions. We identified a series of face onsets in the movie that each participant watched. Each trial was defined as -0.1 to 1.5 s relative to the onset. For the non-face control trials, we used houses, animals and scenes. Granger causality was calculated for the 0-0.5 s, 0.5-1 s and 1-1.5 s time windows. For the post-childhood group, GC indices were averaged across participants. Error bars show SEM.

      (6) The "post-childhood" group spans ages 13-55, conflating adolescence, young adulthood, and middle age. Developmental conclusions would benefit from finer age stratification.  

      We appreciate this insightful comment. Our current sample size does not allow such stratification, but we plan to address this important issue in future MEG studies with larger cohorts.

      (7) The so-called "complex emotions" (e.g., embarrassment, pride, guilt, interest) used in the study often require contextual information, such as speech or narrative cues, for accurate interpretation, and are not typically discernible from facial expressions alone. As such, the observed age-related increase in neural encoding of these emotions may reflect not solely the maturation of facial emotion perception, but rather the development of integrative processing that combines facial, linguistic, and contextual cues. This raises the possibility that the reported effects are driven in part by language comprehension or broader social-cognitive integration, rather than by changes in facial expression processing per se.  

      We agree with this interpretation. Indeed, our results already show that speech influences the encoding of facial emotion in the DLPFC differently in the childhood and post-childhood groups (Figure 2D), suggesting that children’s ability to integrate multiple cues is still developing. Future studies are needed to systematically examine how linguistic cues and prior experiences contribute to the understanding of complex emotions from faces, which we have added to our future directions section (last paragraph in Discussion, Pages 8-9).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      In the introduction: "These neuroimaging data imply that social and emotional experiences shape the prefrontal cortex's involvement in processing the emotional meaning of faces throughout development, probably through top-down modulation of early sensory areas." Aren't these supposed to be iEEG data instead of neuroimaging? 

      Corrected.

      Reviewer #2 (Recommendations for the authors):

      This manuscript would benefit from several improvements to strengthen the validity and interpretability of the findings:

      (1) Increase the sample size, especially for children with pSTC coverage. 

      We added data from another two children who have pSTC coverage. Please see our response to reviewer 2’s comment 3 and reviewer 1’s comment 1.

      (2) Include directional connectivity analyses to test the proposed top-down modulation from DLPFC to pSTC. 

      Thanks for the suggestion. Please see our response to reviewer 2’s comment 5.

      (3) Use controlled stimuli in an additional experiment to separate the effects of facial expression, speech, and music. 

      This is an excellent point. However, iEEG data collection from children is an exceptionally rare opportunity and typically requires many years, so we are unable to add a controlled-stimulus experiment to the current study. We plan to consider using controlled stimuli to study the processing of complex emotion with non-invasive methods in the future. In addition, please see our response to reviewer 2’s comment 1 for a description of how neural responses to facial expression and music are separated in our study.

    1. eLife Assessment

      This important contribution to enzyme annotation offers a deep learning framework for catalytic site prediction. Integrating biochemical knowledge with large language models, the authors demonstrate how to extract meaningful information from sequence alone. They introduce Squidly, a freely available new ML modeling framework, that outperforms existing tools on standard benchmarks, including the CataloDB dataset. The evidence is convincing, with an extensively and carefully addressed narrative upon revision.

    2. Reviewer #1 (Public review):

      In this well-written and timely manuscript, Rieger et al. introduce Squidly, a new deep learning framework for catalytic residue prediction. The novelty of the work lies in the aspect of integrating per-residue embeddings from large protein language models (ESM2) with a biology-informed contrastive learning scheme that leverages enzyme class information to rationally mine hard positive/negative pairs. Importantly, the method avoids reliance on the use of predicted 3D structures, enabling scalability, speed, and broad applicability. The authors show that Squidly outperforms existing ML-based tools and even BLAST in certain settings, while an ensemble with BLAST achieves state-of-the-art performance across multiple benchmarks. Additionally, the introduction of the CataloDB benchmark, designed to test generalization at low sequence and structural identity, represents another important contribution of this work.

    3. Reviewer #2 (Public review):

      Summary:

      The authors aim to develop Squidly, a sequence-only catalytic residue prediction method. By combining protein language model (ESM2) embedding with a biologically inspired contrastive learning pairing strategy, they achieve efficient and scalable predictions without relying on three-dimensional structure. Overall, the authors largely achieved their stated objectives, and the results generally support their conclusions. This research has the potential to advance the fields of enzyme functional annotation and protein design, particularly in the context of screening large-scale sequence databases and unstructured data. However, the data and methods are still limited by the biases of current public databases, so the interpretation of predictions requires specific biological context and experimental validation.

      Strengths:

      The strengths of this work include the innovative methodological incorporation of EC classification information for "reaction-informed" sample pairing, thereby enhancing the discriminative power of contrastive learning. Results demonstrate that Squidly outperforms existing machine learning methods on multiple benchmarks and is significantly faster than structure prediction tools, demonstrating its practicality.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      In this well-written and timely manuscript, Rieger et al. introduce Squidly, a new deep learning framework for catalytic residue prediction. The novelty of the work lies in the aspect of integrating per-residue embeddings from large protein language models (ESM2) with a biology-informed contrastive learning scheme that leverages enzyme class information to rationally mine hard positive/negative pairs. Importantly, the method avoids reliance on the use of predicted 3D structures, enabling scalability, speed, and broad applicability. The authors show that Squidly outperforms existing ML-based tools and even BLAST in certain settings, while an ensemble with BLAST achieves state-of-the-art performance across multiple benchmarks. Additionally, the introduction of the CataloDB benchmark, designed to test generalization at low sequence and structural identity, represents another important contribution of this work.

      We thank the reviewer for their constructive and encouraging assessment of the manuscript. We appreciate the recognition of Squidly’s biology-informed contrastive learning framework with ESM2 embeddings, its scalability through the avoidance of predicted 3D structures, and the contribution of the CataloDB benchmark. We are pleased that the reviewer finds these aspects to be of value, and their comments will help us in further clarifying the strengths and scope of the work.

      The manuscript acknowledges biases in EC class representation, particularly the enrichment for hydrolases. While CataloDB addresses some of these issues, the strong imbalance across enzyme classes may still limit conclusions about generalization. Could the authors provide per-class performance metrics, especially for underrepresented EC classes?

      We thank the reviewer for raising this point. We agree that per-class performance metrics provide important insight into generalizability across underrepresented EC classes. In response, we have updated Figure 3 to include two additional panels: (i) per-EC F1, precision and recall scores, and (ii) a relative display of true positives against the total number of predictable catalytic residues. These additions allow the class imbalance to be more directly interpretable. We have also revised the text between lines 316-321 to better contextualize our generalizability claims in light of these results.
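
      As an illustration of what per-EC metrics involve, the following is a minimal sketch rather than the evaluation code used for Figure 3. The function `per_class_prf`, the record layout, and the toy labels are assumptions; each record is one residue with its top-level EC class, its true catalytic label, and its predicted label.

```python
from collections import defaultdict


def per_class_prf(records):
    """Compute precision, recall and F1 separately for each EC class.

    `records` is an iterable of (ec_class, true_label, predicted_label)
    tuples with binary labels; this layout is hypothetical.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for ec_class, y_true, y_pred in records:
        c = counts[ec_class]  # also registers classes with only true negatives
        if y_true and y_pred:
            c["tp"] += 1
        elif y_pred:
            c["fp"] += 1
        elif y_true:
            c["fn"] += 1

    scores = {}
    for ec_class, c in counts.items():
        predicted = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        precision = c["tp"] / predicted if predicted else 0.0
        recall = c["tp"] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[ec_class] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy residues from two EC classes (labels are invented for illustration).
records = [
    ("3", 1, 1), ("3", 1, 0), ("3", 0, 1), ("3", 0, 0),
    ("5", 1, 1), ("5", 0, 0),
]
scores = per_class_prf(records)
```

      Reporting the raw true-positive count next to the number of catalytic residues per class, as in the second new panel, keeps small classes from being over-read on the basis of a ratio alone.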

      An ablation analysis would be valuable to demonstrate how specific design choices in the algorithm contribute to capturing catalytic residue patterns in enzymes.

      We agree an ablation analysis is beneficial to show the benefits of a specific approach. We consider the main design choice in Squidly to be how we select the training pairs, hence we chose a standard design choice for the contrastive learning model. We tested the effect of different pair schemes on performance and report the results in Figure 2A and lines 244-258. These results are a targeted ablation in which we evaluate Squidly against AEGAN using the AEGAN training and test datasets, while systematically varying the ESM2 model size and pair-mining scheme. As a baseline, we included the LSTM trained directly on ESM2 embeddings and random pair selection. We showed that indeed the choice of pairs has a large impact on performance, which is significantly improved when compared to naïve pairing. This comparison suggests that performance gains are attributable to reaction-informed pair-mining strategies. We recognize that the way these results were originally presented made this ablation less clear. We have revised the wording in the Results section (lines 244-247) and updated the caption to Figure 2A to emphasize the purpose of this section of the paper.

      The statement that users can optionally use uncertainty to filter predictions is promising but underdeveloped. How should predictive entropy values be interpreted in practice? Is there an empirical threshold that separates high- from low-confidence predictions? A demonstration of how uncertainty filtering shifts the trade-off between false positives and false negatives would clarify the practical utility of this feature.

      Thank you for the suggestion. Your comment prompted us to consider the best way to represent the uncertainty and, additionally, the best metric to return to users and how to visualize the results. Based on this, we included several new figures (Figure 3H and Supplementary Figures S3-5). We used these figures to select the cutoffs (mean prediction of 0.6, and variance < 0.225), which were then set as the defaults in Squidly and used in all subsequent analyses. The effect of these cutoffs is most evident in the tradeoff of precision and recall. Hence users may opt to select their own filters based on the mean prediction and variance across the predictions, and these cutoffs can be passed as command line parameters to Squidly. The choice to use a consistent default cutoff selected using the Uni3175 benchmark has slightly improved the reported performance for the benchmarks seen in Table 1 and Figure 3C. However, our interpretation remains the same.
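
      How such cutoffs act on ensemble output can be sketched as follows. This is an illustration under stated assumptions, not Squidly's implementation: the function `confident_residues`, the dictionary layout, the use of population variance, and the inclusive comparison on the mean are all hypothetical; only the two cutoff values come from the response above.

```python
from statistics import mean, pvariance

MEAN_CUTOFF = 0.6        # default mean-prediction cutoff quoted above
VARIANCE_CUTOFF = 0.225  # default variance cutoff quoted above


def confident_residues(ensemble_probs,
                       mean_cutoff=MEAN_CUTOFF,
                       variance_cutoff=VARIANCE_CUTOFF):
    """Return residue indices whose ensemble predictions pass both filters.

    `ensemble_probs` maps a residue index to the list of probabilities
    produced by the individual ensemble members (hypothetical layout).
    Population variance is an assumption; the source does not say which
    estimator is used.
    """
    kept = []
    for index, probs in sorted(ensemble_probs.items()):
        if mean(probs) >= mean_cutoff and pvariance(probs) < variance_cutoff:
            kept.append(index)
    return kept


example = {
    57: [0.90, 0.80, 0.95],            # high mean, low variance: kept
    102: [0.50, 0.40, 0.60],           # mean below the cutoff: dropped
    195: [1.0, 1.0, 0.0, 0.0, 1.0],    # members disagree strongly: dropped
}
kept = confident_residues(example)
```

      Raising the mean cutoff or lowering the variance cutoff removes more borderline residues, which trades recall for precision.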

      The excerpt highlights computational efficiency, reporting substantial runtime improvements (e.g., 108 s vs. 5757 s). However, the comparison lacks details on dataset size, hardware/software environment, and reproducibility conditions. Without these details, the speedup claim is difficult to evaluate. Furthermore, it remains unclear whether the reported efficiency gains come at the expense of predictive performance.

      Thank you for pointing out this limitation in how we presented the runtime results. We have rerun the tests and updated the table. An additional comment has been added underneath, which details the hardware/software environment used to run both tools and notes that the Squidly model is the ensemble version. As for the relationship between efficiency gains and predictive performance, both the 3B and 15B models are benchmarked side by side across the paper.

      Compared to the tools we were able to comprehensively benchmark, the efficiency gain does not come at a cost in predictive performance. However, we note that the runtime benefit assumes that a structure must be folded, which is not the case for enzymes already present in the PDB. Such an enzyme is likely already annotated, and in those cases we recommend using BLAST, which is faster than either Squidly or a structure-based tool and highly accurate for homologous or annotated sequences.

      Given the well-known biases in public enzyme databases, the dataset is likely enriched for model organisms (e.g., E. coli, yeast, human enzymes) and underrepresents enzymes from archaea, extremophiles, and diverse microbial taxa. Would this limit conclusions about Squidly's generalizability to less-studied lineages?

      The enrichment for model organisms in public enzyme databases may indeed affect both ESM2 and Squidly when applied to underrepresented lineages such as archaea, extremophiles, and diverse microbial taxa. We agree that this limitation is significant and have adjusted and expanded the previous discussion of benchmarking limitations accordingly (lines 358, 369). We thank the reviewer for highlighting this issue, which has helped us to improve the transparency and balance of the manuscript.

      Reviewer #2:

      The authors aim to develop Squidly, a sequence-only catalytic residue prediction method. By combining protein language model (ESM2) embedding with a biologically inspired contrastive learning pairing strategy, they achieve efficient and scalable predictions without relying on three-dimensional structure. Overall, the authors largely achieved their stated objectives, and the results generally support their conclusions. This research has the potential to advance the fields of enzyme functional annotation and protein design, particularly in the context of screening large-scale sequence databases and unstructured data. However, the data and methods are still limited by the biases of current public databases, so the interpretation of predictions requires specific biological context and experimental validation.

      Strengths:

      The strengths of this work include the innovative methodological incorporation of EC classification information for "reaction-informed" sample pairing, thereby enhancing the discriminative power of contrastive learning. Results demonstrate that Squidly outperforms existing machine learning methods on multiple benchmarks and is significantly faster than structure prediction tools, demonstrating its practicality.

      Weaknesses:

      Disadvantages include the lack of a systematic evaluation of the impact of each strategy on model performance. Furthermore, some analyses, such as PCA visualization, exhibit low explained variance, which undermines the strength of the conclusions.

      We thank the reviewer for their comments and feedback. 

      The authors state that "Notably, the multiclass classification objective and benchmarks used to evaluate EasIFA made it infeasible to compare performance for the binary catalytic residue prediction task." However, EasIFA has also released a model specifically for binary catalytic site classification. The authors should include EasIFA in their comparisons in order to provide a more comprehensive evaluation of Squidly's performance.

      We thank the reviewer for raising this point. EasIFA’s binary classification task includes catalytic, binding, and “other” residues, which differs from Squidly’s strict catalytic residue prediction. This makes direct comparison non-trivial, which is why we originally had opted to not benchmark against EasIFA and instead highlight it in our discussion.

      Given your comment, we did our best to include a benchmark that could give an indication of a comparison between the two tools. To do this, we filtered EasIFA’s multiclass classification test dataset for a non-overlapping subset with Squidly and AEGAN training data and <40% sequence identity to all training sets. This left only 66 catalytic residue-containing sequences that we could use as a held-out test set for both tools. We note that the comparison is not strictly like-for-like, as Squidly and AEGAN had lower average identity to this subset (8.2%) than EasIFA (23.8%), placing them at a relative disadvantage.

      We also identified a potential limitation in EasIFA’s original recall calculation, where sequences lacking catalytic residues were assigned a recall of 0. We adapted this to instead consider only the sequences which do have catalytic residues, which increased recall across all models. With the updated evaluation, EasIFA continues to show strong performance, consistent with it being SOTA if structural inputs are available. Squidly remains competitive given it operates solely from sequence and has a lower sequence identity to this specific test set.
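
      The effect of the two recall conventions can be shown with a minimal sketch. It assumes recall is averaged per sequence, which the description above implies but does not state outright; the function `mean_recall` and the toy data are hypothetical and are not taken from either tool's code.

```python
def mean_recall(per_sequence, skip_empty=True):
    """Average per-sequence recall over a benchmark.

    `per_sequence` is a list of (true_positions, predicted_positions)
    pairs of sets. With skip_empty=False, a sequence without catalytic
    residues contributes a recall of 0; with skip_empty=True it is
    left out of the average.
    """
    recalls = []
    for true_positions, predicted_positions in per_sequence:
        if not true_positions:
            if not skip_empty:
                recalls.append(0.0)
            continue
        hits = len(true_positions & predicted_positions)
        recalls.append(hits / len(true_positions))
    return sum(recalls) / len(recalls) if recalls else 0.0


# Two sequences with catalytic residues, one without (invented positions).
benchmark = [
    ({10, 20}, {10, 20}),
    ({5}, {5, 6}),
    (set(), {3}),
]
recall_with_zeros = mean_recall(benchmark, skip_empty=False)
recall_skipping = mean_recall(benchmark)
```

      In this toy case the same predictions score two thirds under the first convention and 1.0 under the second, which is why the change raises recall for every model at once.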

      Due to the small and imbalanced benchmark size, differences in training data overlap, and differences in our analysis compared with the original EasIFA analysis, we present this comparison in a new section (A.4) of the supplementary information rather than in the main text. References to this section have been added in the manuscript at lines 265-268. Additionally, we have updated the discussion to emphasize the potential benefits of using EasIFA (lines 353-356).

      The manuscript proposes three schemes for constructing positive and negative sample pairs to reduce dataset size and accelerate training, with Schemes 2 and 3 guided by reaction information (EC numbers) and residue identity. However, two issues remain:

      (a) The authors do not systematically evaluate the impact of each scheme on model performance.

      (b) In the benchmarking results, it is not explicitly stated which scheme was used for comparison with other models (e.g., Table 1, Figure 6, Figure 8). This lack of clarity makes it difficult to interpret the results and assess reproducibility.

      (c) Regarding the negative samples in Scheme 3 in Figure 1, no sampling patterns are shown for residue pairs with the same amino acid, different EC numbers, and both being catalytic residues.

      We thank the reviewer for these suggestions, which enabled us to improve the clarity and presentation of the manuscript. Please find our point by point response:

      (a) We thank the reviewer for highlighting the lack of clarity in the way we have presented our evaluation in the section describing the Uni3175 benchmark. We aimed to systematically evaluate the impact of each scheme using the Uni3175 benchmark and refer to these results at lines 244-258. Additionally, in line with related comments from reviewer 1, we have adjusted the presentation of this section at lines 244-247 to make clear that its purpose, and that of the benchmark results, is to allow a comparison of each scheme to baseline models and AEGAN. These results led us to use Scheme 3 in both models for the other benchmarks in Figures 2 and 3. Please let us know if there is anything we can do to further improve the interpretability of Squidly’s performance.

      (b) We thank the reviewer for highlighting this issue and improving the clarity of our manuscript. We agree that after the Uni3175 benchmark was used to evaluate the schemes, we did not clearly state in the other benchmarks that scheme 3 was chosen for both the 3B and 15B models. We have made changes in table 1 and the Figure legends of Figures 2 and 3 to state that scheme 3 was used. In addition, we integrated related results into panel figures (e.g. Figures 2 and 3 now show models trained and tested on consistent benchmark datasets) and standardized figure colors and legend formatting throughout. Furthermore, we suspect that the previous switch from using the individual vs ensembled Squidly models during the paper was not well indicated, and likely to confuse the reader. Therefore, we decided to consistently report the ensembled Squidly models for all benchmarks except in the ablation study (Figure 2A). In line with this, we altered the overview Figure 1A, so that it is clearer that the default and intended version of Squidly is the ensemble.

      (c) We appreciate the reviewer pointing this out. You’re correct, we explicitly did not sample the negatives described by the reviewer in scheme 3 as our focus was on the hard negatives that relate most to the binary objective.  We do think this is a great idea and would be worth exploring further in future versions of Squidly, where we will be expanding the label space used for hard-negative sampling and including binding sites in our prediction. We have updated the discussion at lines 395-396 to highlight this potential direction.

      The PCA visualization (Figure 3) explains very little variance (~5% + 1.8%), but its use to illustrate the separability of embedding and catalytic residues may overinterpret the meaning of the low-dimensional projection. We question whether this figure is appropriate for inclusion in the main text and suggest that it be moved to the Supporting Information.

      We thank the reviewer for this suggestion. We had discussed this as well, and in the end decided to include it in the main manuscript. We agree that the explained variance is low. However, when we first saw the PCA we were surprised that there was any separation at all. This then prompted us to investigate further, so we kept it in the manuscript to be true to the scientific story. However, we do agree that our interpretation could be read as overly conclusive given the minimal variance explained by the top 2 PCs. Therefore, we agree with the assessment that the figure, alongside the accompanying results section, is more appropriately placed in the supplementary information. We moved this section (A.1) to the appendix to still explain the exploratory data analysis process that we used to tackle this problem, so that the general thought process behind Squidly is available for further reading.

      Minor Comments:

      (1) Figure Quality and Legends

      (a) In Figure 4, the legend is confusing: "Schemes 2 and 3 (S1 and S2) ..." appears inconsistent, and the reference to Scheme 3 (S3) is not clearly indicated.

      (b) In Figure 6, the legend overlaps with the y-axis labels, reducing readability. The authors should revise the figures to improve clarity and ensure consistent notation.

      The reviewer correctly notes inconsistencies in figure presentation. We have revised the legend of Figure 4 (now 2A) to ensure schemes are referred to consistently and Scheme 3 (S3) is clearly indicated. We also adjusted Figure 6 (now 2c) to remove the overlap between the legend and y-axis labels.  

      Conclusion

      We thank the reviewers and editor again for their constructive input. We believe the revisions and clarifications substantially strengthened the manuscript and the resource.

    1. eLife Assessment

      This important study presents a well-constructed multiscale simulation framework to investigate ATP-driven DNA translocation by prokaryotic SMC complexes, supporting a segment-capture mechanism. The strength of evidence is convincing, highlighting the necessity of a precise balance between electrostatic interactions and hydrogen bonding, as well as the critical role of kleisin asymmetry in ensuring unidirectional movement.

    2. Reviewer #1 (Public review):

      Summary:

      This study used explicit-solvent simulations and coarse-grained models to identify the mechanistic features that allow for unidirectional motion of SMC on DNA. Shorter explicit-solvent models provide a description of relevant hydrogen bond energetics, which was then encoded in a coarse-grained structure-based model. In the structure-based model, the authors mimic chemical reactions as signaling changes in the energy landscape of the assembly. By cycling through the chemical cycle repeatedly, the authors show how these time-dependent energetic shifts naturally lead SMC to undergo translocation steps along DNA that are on a length scale that has been identified.

      Strengths:

      Simulating large-scale conformational changes in complex assemblies is extremely challenging. This study utilizes highly-detailed models to parameterize a coarse-grained model, thereby allowing the simulations to connect the dynamics of precise atomistic-level interactions with a large-scale conformational rearrangement. This study serves as an excellent example for this overall methodology, where future studies may further extend this approach to investigate any number of complex molecular assemblies.

      Comments on revisions:

      No additional recommendations. I removed the weakness description in the summary, since the authors have addressed that concern.

    3. Reviewer #2 (Public review):

      Summary:

      The authors perform coarse grained and all atom simulations to provide a mechanism for loop extrusion that is involved in genome compaction.

      Strengths:

      The simulations are very thoughtful. They provide insights into the translocation process, which is only one of the mechanisms. Much of the analysis is very good. Overall, the study advances the use of simulations in these complicated systems.

      Weaknesses:

      Even the authors point out several limitations, which cannot be easily overcome in the paper because of the paucity of experimental data. Nevertheless, the authors could have done more to illustrate the main assertion that loop extrusion occurs by the motor translocating on DNA. They should mention more clearly that there are alternative theories that have accounted for a number of experimental data.

      Comments on revisions:

      The authors have adequately addressed my concerns.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study used explicit-solvent simulations and coarse-grained models to identify the mechanistic features that allow for the unidirectional motion of SMC on DNA. Shorter explicit-solvent models describe relevant hydrogen bond energetics, which were then encoded in a coarse-grained structure-based model. In the structure-based model, the authors mimic chemical reactions as signaling changes in the energy landscape of the assembly. By cycling through the chemical cycle repeatedly, the authors show how these time-dependent energetic shifts naturally lead SMC to undergo translocation steps along DNA that are on a length scale that has been identified.

      Strengths:

      Simulating large-scale conformational changes in complex assemblies is extremely challenging. This study utilizes highly-detailed models to parameterize a coarse-grained model, thereby allowing the simulations to connect the dynamics of precise atomistic-level interactions with a large-scale conformational rearrangement. This study serves as an excellent example for this overall methodology, where future studies may further extend this approach to investigate any number of complex molecular assemblies.

      We thank the reviewer for the careful reading of our manuscript and for highlighting the value of our bottom-up multiscale simulation approach.

      Weaknesses:

      The only relative weakness is that the text does not always clearly communicate which aspects of the dynamics are expected to be robust. That is, which aspects of the dynamics/energetics are less precisely described by this model? Where are the limits of the models, and why should the results be considered within the range of applicability of the models?

      We appreciate this insightful comment and agree that it is important to more explicitly describe the robustness and limitations of the simulation model used in this study. In response to this comment, we have revised the Discussion section of our manuscript.

      First, to clarify the robust aspects of our model, we have added a new subsection titled “Parametric choices and robustness of simulation model” to the Discussion, which is as follows:

      “The switching Gō approach adopted in this study is a powerful tool for providing the relationship between known large-scale conformational changes and the resulting functional and mechanical dynamics of the molecular machine (Brandani and Takada, 2018b; Koga and Takada, 2006b; Nagae et al., 2025). In this study, we mimic conformational change induced by ATP binding and hydrolysis events by instantaneously switching the potential energy function from one that stabilized a given conformation to another that stabilized a different conformation. This drives the protein to undergo a conformational transition toward the minimum of the new energy landscape.

      This approach is particularly well suited to investigate whether a given conformational change in a subunit of a molecular machine can produce the overall motion observed, and whether this process is mechanically feasible. Therefore, the fundamental mechanisms identified in this study, i.e., DNA segment capture mechanism, the correlation between step size and loop length, and the unidirectional translocation mechanism originating from the asymmetric kleisin path, can be considered as robust, as they emerge directly from the structural and topological constraints of the SMC-kleisin architecture rather than from tuned parameters.”

      Additionally, to more clearly define the limits of our model, we have expanded the "Limitations in current simulations" subsection. Specifically, we have added a detailed discussion regarding the energetics and transition pathways inherent to the switching Gō approach, which is as follows:

      “First, the use of switching potentials to trigger conformational changes imposes a limitation on predictive power for energetics and transition pathways. The switching of potentials is akin to a “vertical excitation” from one energy landscape to another, rather than a thermally activated crossing of an energy barrier. Consequently, the model cannot provide quantitative predictions of the transition rates or the free energy barriers associated with these changes. Furthermore, while the subsequent relaxation follows the new potential landscape, it is not guaranteed to reproduce the unique, physically correct transition pathway. Nevertheless, this simplification is justified because conformational changes within the protein are expected to occur on a much faster timescale than the large-scale motion of the DNA. Thus, this simplification has a limited impact on our main conclusions regarding the functional DNA dynamics driven by these large-scale conformational changes.”

      We have not made any additions regarding the timescale and dwell times for each ATP state, as these were already discussed in the original manuscript.

      Reviewer #2 (Public review):

      Summary:

      The authors perform coarse grained and all atom simulations to provide a mechanism for loop extrusion that is involved in genome compaction.

      Strengths:

      The simulations are very thoughtful. They provide insights into the translocation process, which is only one of the mechanisms. Much of the analysis is very good. Overall, the study advances the use of simulations in these complicated systems.

      We sincerely thank the reviewer for their thoughtful and encouraging comments.

      Weaknesses:

      Even the authors point out several limitations, which cannot be easily overcome in the paper because of the paucity of experimental data. Nevertheless, the authors could have done more to illustrate the main assertion that loop extrusion occurs by the motor translocating on DNA. They should mention more clearly that there are alternative theories that have accounted for a number of experimental data.

      We thank the reviewer for these constructive suggestions. As the reviewer pointed out, it is important to state more explicitly how the unidirectional DNA translocation revealed in this study relates to the widely recognized loop-extrusion hypothesis of genome organization, and to situate our findings within the context of major alternative theories.

      To address this, we first clarify the relationship between the translocation mechanism we observed and the phenomenon of loop extrusion. We emphasize that our simulations were designed to elucidate the core motor activity of the SMC complex, and we explicitly state our view that loop extrusion is a functional consequence of this motor activity when the complex is anchored to DNA.

      Second, as the reviewer also suggested, we addressed in more detail alternative models of loop extrusion that also have experimental support. We have revised the Discussion accordingly to provide a more balanced and comprehensive context. Further details are provided in our separate response to the comment below.

      Reviewer #3 (Public review):

      Summary:

      In this manuscript, Yamauchi and colleagues combine all-atom and coarse-grained MD simulations to investigate the mechanism of DNA translocation by prokaryotic SMC complexes. Their multiscale approach is well-justified and supports a segment-capture model in which ATP-dependent conformational changes lead to the unidirectional translocation of DNA. A key insight from the study is that asymmetry in the kleisin path enforces directionality. The work introduces an innovative computational framework that captures key features of SMC motor action, including DNA binding, conformational switching, and translocation.

      This work is well executed and timely, and the methodology offers a promising route for probing other large molecular machines where ATP activity is essential.

      Strengths:

      This manuscript introduces an innovative yet simple method that merges all-atom and coarse-grained, purely equilibrium, MD simulations to investigate DNA translocation by SMC complexes, which is triggered by activated ATP processes. Investigating the impact of ATP on large molecular motors like SMC complexes is extremely challenging, as ATP catalyses a series of chemical reactions that take and keep the system out of equilibrium. The authors simulate the ATP cycle by cycling through distinct equilibrium simulations where the force field changes according to whether the system is assumed to be in the disengaged, engaged, and V-shaped states; this is very clever as it avoids attempting to model the non-equilibrium process of ATP hydrolysis explicitly. This equilibrium switching approach is shown to be an effective way to probe the mechanistic consequences of ATP binding and hydrolysis in the SMC complex system.

      The simulations reveal several important features of the translocation mechanism. These include identifying that a DNA segment of ~200 bp is captured in the engaged state and pumped forward via coordinated conformational transitions, yielding a translocation step size in good agreement with experimental estimates. Hydrogen bonding between DNA and the top of the ATPase heads is shown to be critical for segment capture, as without it, translocation is shown to fail. Finally, asymmetry in the kleisin subunit path is shown to be responsible for unidirectionality.

      This work highlights how molecular simulations are an excellent complement to experiments, as they can exploit experimental findings to provide high-resolution mechanistic views currently inaccessible to experiments. The findings of these simulations are plausible and expand our understanding of how ATP hydrolysis induces directional motion of the SMC complex.

      We thank the reviewer for the thoughtful and encouraging assessment of our work. We appreciate the reviewer’s summary of our key contributions, especially our switching Gō strategy, the segment-capture mechanism of SMC translocation, and the role of kleisin-path asymmetry in ensuring unidirectionality.

      Weaknesses:

      There are aspects of the methodology and modelling assumptions that are not clear and could be better justified. The major ones are listed below:

      (1) The all-atom MD simulations involve a 47-bp DNA duplex interacting with the ATPase heads, from which key residues involved in hydrogen bonding are identified. However, DNA mechanics, including flexibility and hydrogen bond formation, are known to be sequence-dependent. The manuscript uses a single arbitrary sequence but does not discuss potential biases. Could the authors comment on how sequence variability might affect binding geometry or the number of hydrogen bonds observed?

      We thank the reviewer for this insightful comment regarding the potential effects of DNA sequence.

      The primary biological role of the SMC complex is to organize genome architecture on a global scale; as such, its fundamental interaction with DNA is considered not to be sequence-specific. Our all-atom MD simulations and analysis pipeline were designed to probe the nature of this general interaction. Our approach confirms this rationale: the analysis exclusively identified hydrogen bonds formed between amino acid residues and the phosphate groups of the DNA's sugar-phosphate backbone. As shown in Figs. 1B and 1C, the results confirm that the key stabilizing interactions occur between basic residues on the SMC head surface and the DNA backbone. Since the backbone is chemically uniform, the stable binding mode we characterized is inherently sequence-independent.

      While the final bound state is likely sequence-independent, we agree that sequence-dependent properties such as local DNA flexibility or intrinsic curvature could influence the kinetics of the binding process. For example, the rate of initial recognition or the ease of DNA bending on the head surface might vary between AT-rich and GC-rich regions. However, once the DNA is bound, we expect the stable binding geometry and the identity of the key interacting residues to be conserved across different sequences.

      Therefore, we are confident that using a single, representative DNA sequence is a valid approach for elucidating the fundamental, non-sequence-specific aspects of SMC-DNA interaction and does not alter the general validity of the translocation mechanism proposed in this work.

      (2) A key feature of the coarse-grained model is the inclusion of a specific hydrogen-bonding potential between DNA and residues on the ATPase heads. The authors select the top 15 hydrogen-bond-forming residues from the all-atom simulations (with contact probability > 0.05), but the rationale for this cutoff is not explained. Also, the strength of hydrogen bonds in coarse-grained models can be sensitive to context. How did the authors calibrate the strength of this interaction relative to electrostatics, and did they test its robustness (e.g., by varying epsilon or residue set)? Could this interaction be too strong or too weak under certain ionic conditions? What happens when salt is changed?

      Thank you for these comments. We provide our rationale for the parameter choices below.

      The contact probability cutoff of 0.05 was chosen to create a comprehensive set of residues that form physically robust interactions with DNA. To establish this robustness, we performed a parallel set of all-atom simulations using a different force field (see Fig. S2). This cross-validation revealed two key points. First, the top six residues (Arg120, Arg123, Ile63, Arg111, Arg62, and Lys56), which include experimentally confirmed DNA-binding sites, consistently exhibited the highest contact probabilities in both force fields, confirming the reliability of our identification. Second, and just as importantly, many residues with lower contact probabilities (e.g., Trp115, Tyr107, Arg105, Ser124, and Ser54) were also consistently detected across both simulations. This reproducibility suggests that these interactions are physically robust and not artifacts of a specific force field. We therefore concluded that a 0.05 cutoff is a well-balanced threshold that ensures the inclusion of not only the primary anchor residues but also the secondary, moderately interacting residues that are crucial for cooperatively stabilizing the DNA. We discuss this point in the Methods section of the revised manuscript, as follows:

      “The rationale for this cutoff is the physical robustness of the identified interactions; all-atom simulations using a different force field confirmed that the same set of key interacting residues, including both strong and moderate binders, was consistently identified (Fig. S2).”
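      The selection rule described above (keep residues whose contact probability exceeds 0.05, then take the highest-ranked ones) can be sketched as follows. This is only an illustrative sketch, not the authors' analysis pipeline: the function names, the per-frame input format, and the toy frame counts are assumptions made for the example, and the residue "Gly10" is invented to show a residue falling below the cutoff.

```python
def contact_probabilities(frames):
    """Fraction of frames in which each residue contacts DNA.

    `frames` is a list of sets; each set holds the residues that form a
    hydrogen bond with DNA in that frame (hypothetical input format).
    """
    counts = {}
    for contacts in frames:
        for residue in contacts:
            counts[residue] = counts.get(residue, 0) + 1
    n_frames = len(frames)
    return {residue: c / n_frames for residue, c in counts.items()}


def select_residues(probabilities, cutoff=0.05, top_n=15):
    """Keep residues above the cutoff, ranked by probability, at most top_n."""
    kept = [(res, p) for res, p in probabilities.items() if p > cutoff]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [res for res, _ in kept[:top_n]]


# Toy data (invented): 100 frames with three residues at different frequencies.
frames = []
for i in range(100):
    contacts = set()
    if i < 90:
        contacts.add("Arg120")   # strong binder in this toy example
    if i < 6:
        contacts.add("Ser54")    # moderate binder, just above the cutoff
    if i < 3:
        contacts.add("Gly10")    # invented residue, below the cutoff
    frames.append(contacts)

probs = contact_probabilities(frames)
selected = select_residues(probs)
print(selected)
```

      In this toy case the strong and moderate binders are retained and the rare contact is dropped, which mirrors the stated intent of including secondary, moderately interacting residues alongside the primary anchors.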

      The strength of the hydrogen bond potential was set to ϵ = 4.0 k_BT (≈2.4 kcal/mol), a physically plausible value corresponding to an ideal hydrogen bond. To test the robustness of this parameterization, we performed preliminary simulations where we varied these parameters by (i) reducing the value of ϵ and (ii) restricting the interaction to only the top six anchor residues. In both test cases, while a short DNA duplex (47 bp) could still bind to the ATPase heads, simulations with a long DNA (800 bp) failed to form a stable DNA loop after initial docking. These tests demonstrated that a larger set of cooperative interactions with a physically realistic strength was necessary for the full segment capture mechanism. Our final parameter set (15 residues at ϵ = 4.0 k_BT) was thus chosen as the one required to capture both the initial anchoring of DNA and the subsequent cooperative stabilization of the captured loop.
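      The unit conversion quoted above can be checked with a few lines. The temperature of 300 K is an assumption made for this sketch (the response does not state the simulation temperature here), and the function name is hypothetical.

```python
# Molar gas constant in kcal/(mol*K); multiplying by T gives k_BT per mole.
GAS_CONSTANT_KCAL = 1.987204e-3


def kt_to_kcal_per_mol(energy_in_kt, temperature_k=300.0):
    """Convert an energy given in units of k_BT to kcal/mol (T is assumed)."""
    return energy_in_kt * GAS_CONSTANT_KCAL * temperature_k


epsilon_kcal = kt_to_kcal_per_mol(4.0)
print(f"4.0 k_BT at 300 K = {epsilon_kcal:.2f} kcal/mol")
```

      Under this assumption the result is close to the ≈2.4 kcal/mol quoted in the response; a somewhat different temperature would shift the value proportionally.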

      As correctly pointed out, ionic conditions are a critical factor. Our simulations revealed that the salt concentration had a more pronounced effect on the kinetics of the DNA finding its correct binding site than on the thermodynamic stability of the final bound state. During our parameter tuning, we found that at physiological salt conditions (150 mM), long-range electrostatic interactions become dominant. This caused the DNA to be non-specifically captured by positively charged patches on the sides of the heads, which are not the functional binding sites. This off-pathway trapping kinetically prevented the DNA from reaching its proper location within the simulation timeframe. In contrast, the high-salt conditions (300 mM) used in this study screen these long-range interactions, suppressing non-specific trapping and allowing the DNA to efficiently explore the protein surface. This enables the correct binding to be established via the specific, short-range hydrogen bonds. Therefore, the ion concentration in our model acts more as a crucial kinetic control factor for reproducing the correct binding pathway within a realistic simulation timeframe. This point is discussed in the new subsection entitled “Parametric choices and robustness of simulation model”.
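      To give a rough sense of how much the screening changes between the two salt conditions mentioned above, the Debye screening length for a monovalent salt can be estimated from the standard Debye-Hückel expression. This is a back-of-the-envelope sketch, not the authors' model: the temperature (298.15 K) and relative permittivity of water (78.4) are assumed values, and the function name is hypothetical.

```python
import math

# Physical constants in SI units
EPS0 = 8.8541878128e-12      # vacuum permittivity, F/m
KB = 1.380649e-23            # Boltzmann constant, J/K
NA = 6.02214076e23           # Avogadro constant, 1/mol
E_CHARGE = 1.602176634e-19   # elementary charge, C


def debye_length_nm(ionic_strength_molar, temperature_k=298.15, eps_r=78.4):
    """Debye length (nm) for a 1:1 electrolyte; T and eps_r are assumptions."""
    ionic_strength_si = ionic_strength_molar * 1000.0  # mol/L -> mol/m^3
    numerator = EPS0 * eps_r * KB * temperature_k
    denominator = 2.0 * NA * E_CHARGE**2 * ionic_strength_si
    return math.sqrt(numerator / denominator) * 1e9    # m -> nm


for conc in (0.150, 0.300):
    print(f"{conc * 1000:.0f} mM: Debye length ~ {debye_length_nm(conc):.2f} nm")
```

      Under these assumptions the screening length drops from roughly 0.8 nm at 150 mM to roughly 0.55 nm at 300 mM, a reduction by a factor of the square root of two, which is consistent with the weaker long-range attraction described in the response.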

      (3) To enhance sampling, the translocation simulations are run at 300 mM monovalent salt. While this is argued to be physiological for Pyrococcus yayanosii, such a concentration also significantly screens electrostatics, possibly altering the interaction landscape between DNA and protein or among protein domains. This may significantly impact the results of the simulations. Why did the authors not use enhanced sampling methods to sample rare events instead of relying on a high-salt regime to accelerate dynamics?

      We agree that enhanced sampling methods are powerful for exploring rare events. However, many of these techniques require the pre-definition of a suitable, low-dimensional reaction coordinate (RC) to guide the simulation. The primary goal of our study was to discover the DNA translocation mechanism as it emerges naturally from fundamental physical interactions, without imposing a priori assumptions about the specific pathway.

      The DNA segment capture process is complex, involving the coordinated motion of a long DNA polymer and multiple protein domains. Defining a simple RC in advance was not feasible and would have carried a significant risk of biasing the system toward an artificial pathway. Therefore, to avoid such bias, we chose to perform direct, unbiased molecular dynamics simulations. Using a physiologically relevant high-salt concentration (300 mM) for Pyrococcus yayanosii was a strategy to accelerate the system's natural dynamics, allowing us to observe these unbiased trajectories within a feasible computational timescale.

      Because our current work has elucidated the fundamental steps of this mechanism, we agree that this work provides a foundation for more quantitative analyses. As suggested, future studies using methods like Markov State Model analysis or enhanced sampling techniques, guided by more sophisticated RCs defined from the insights of this work, would be a valuable next step for characterizing the free-energy landscape of the process or longer time scale dynamics.

      (4) Only a small fraction of the simulated trajectories complete successful translocation (e.g., 45 of 770 in one set), and this is attributed to insufficient simulation time. While the authors are transparent about this, it raises questions about the reliability of inferred success rates and about possible artefacts (e.g., DNA trapping in coiled-coil arms). Could the authors explore or at least discuss whether alternative sampling strategies (e.g., Markov State Models, transition path sampling) might address this limitation more systematically?

      We thank the reviewer for raising this point that is crucial for considering limitations and future directions of our study.

      As we noted in a previous response, the primary reason we did not employ such enhanced sampling methods was the limited prior knowledge available to define the previously uncharacterized DNA translocation process. Therefore, we first tried to define the key conformational states and transitions without the potential bias of a predefined model or reaction coordinate. This approach was successful, as it allowed us to identify critical on-pathway states like “DNA segment capture” and significant off-pathway or kinetically trapped states such as “DNA trapping” between the coiled-coil arms.

      We fully agree that the low success rate observed is a key finding that points to significant kinetic bottlenecks, and that a more systematic analysis is required. Having identified the essential states, applying techniques such as Markov State Models (MSMs) or transition path sampling represents a powerful and logical next step. These methods, using a state-space definition based on our findings, will enable a quantitative characterization of the free-energy landscape and the transition rates between states. This will provide a rigorous understanding of the kinetic factors, such as the depth of the trapped-state energy well, that underlie the low translocation efficiency.
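      The statistical uncertainty attached to a success rate such as the 45 of 770 trajectories quoted in the reviewer's comment can be illustrated with a Wilson score interval. This is an illustrative calculation only; it treats trajectories as independent Bernoulli trials, which is an assumption, and the function name is hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / trials
    denom = 1.0 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


low, high = wilson_interval(45, 770)
print(f"observed rate: {45 / 770:.3f}")
print(f"approx. 95% interval: {low:.3f} to {high:.3f}")
```

      Under this assumption, the observed rate of about 6% is bracketed by an interval of roughly 4% to 8%, so the rate is low but reasonably well determined for this particular simulation setup.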

      In the revised manuscript, we discuss the application of these advanced sampling methods as a feasible and promising future direction, which is as follows:

      “Future studies can leverage the insights from this work to overcome the current timescale limitations. Techniques such as Markov state modeling (Husic and Pande, 2018; Prinz et al., 2011) or enhanced sampling methods (Hénin et al., 2022) may be employed to quantitatively characterize the free-energy landscape and transition rates. Such an approach would provide a rigorous understanding of the kinetic barriers, such as the stability of the trapped state, that govern the efficiency of SMC translocation.”
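      As a minimal sketch of the Markov state modeling idea mentioned above, a transition matrix can be estimated from a discretized trajectory by counting transitions at a fixed lag and normalizing each row. The state labels (0 = free DNA, 1 = segment captured, 2 = trapped between arms) and the short trajectory are invented for illustration, and this simple estimator does not enforce detailed balance as dedicated MSM software typically can.

```python
def estimate_transition_matrix(dtraj, n_states, lag=1):
    """Row-normalized transition counts from a discrete state trajectory."""
    counts = [[0] * n_states for _ in range(n_states)]
    for i in range(len(dtraj) - lag):
        counts[dtraj[i]][dtraj[i + lag]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # States that are never left get a row of zeros in this sketch.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical discrete trajectory: 0 = free, 1 = captured, 2 = trapped.
dtraj = [0, 0, 0, 1, 1, 0, 0, 2, 2, 2, 2, 2]
T = estimate_transition_matrix(dtraj, n_states=3)
for row in T:
    print([round(p, 2) for p in row])
```

      In this toy trajectory the trapped state is never left, so its self-transition probability is 1, which is the kind of kinetic trap that a quantitative model would need far more data to characterize.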

      Reviewer #1 (Recommendations for the authors):

      As noted in the public review, there could be a more systematic description of the limits of the model. The model appears to be carefully crafted, though every model has limits. It could be helpful for the general readership to give some idea of which parametric choices are more critical, and which mechanistic features should be robust to minor changes in parameters.

      We sincerely thank the reviewer for this constructive comment. We agree that clarifying which aspects of our model are robust and which are sensitive to specific parameter choices is crucial for the reader's understanding.

      We have expanded the Discussion to clarify how specific simulation parameters affect the efficiency and success rate of DNA translocation in our coarse-grained simulations. In particular, we have added a description of the parametric choices for (i) selection and strength of hydrogen bonds, (ii) ionic strength, and (iii) interaction strength between the coiled-coil arms. The discussion can be found in the subsection entitled “Parametric choices and robustness of simulation model” in the Discussion, which is as follows:

      “On the other hand, the efficiency and success rate of DNA translocation in our simulations are more sensitive to certain parametric choices. For instance, the selection and strength of hydrogen bond-like interactions are a key factor. Our model incorporates specific hydrogen bonds between the upper surface of the ATPase heads and DNA, based on all-atom simulations. These interactions are essential for initiating segment capture; without them, DNA fails to migrate to the correct binding surface. While the identification of these key residues is a robust finding that persists across different all-atom force fields (Fig. S2), their strength and number in the coarse-grained potential are critical parameters that directly influence the probability and kinetics of DNA capture. Another critical parameter is the ionic strength. We performed translocation simulations at an ionic strength of 300 mM to accelerate DNA dynamics. At lower concentrations, non-specific electrostatic interactions between DNA and positively charged patches on the sides of the ATPase heads or coiled-coil arms became dominant, hindering the efficient migration of DNA to its functional binding site. Using a higher-than-physiological ionic strength is a justified practice in coarse-grained simulations employing the Debye-Hückel approximation, as it serves as a first-order correction to mimic the strong local charge screening by condensed counterions that is not explicitly captured by the mean-field model (Brandani et al., 2021; Niina et al., 2017b). Finally, the interaction strength between the coiled-coil arms is also important. In our model, once the arms closed during the transition from the V-shaped to the disengaged state, they remained closed on the simulated timescale, frequently trapping DNA pushed from the hinge and thereby leading to failed translocation. This behavior suggests that the arm–arm interactions may be overestimated. A parameterization that allows for more frequent, transient opening of the arms could increase the success rate of DNA pumping.”

      Reviewer #2 (Recommendations for the authors):

      This paper reports simulations (all atom and coarse grained) to provide molecular details of loop extrusion. In general, it is a well done paper. There are a few issues that the authors should address.

      (1) The study supposes that loop extrusion occurs by translocation. Although they point out alternate models like scrunching (C Dekker; the theory by Takaki is also based on the scrunching model that the authors should mention), they should discuss this further. After all, the Takaki theory does predict several experimental outcomes very accurately. The precise mechanism has not been nailed down. The paper by Terakawa in Science suggests the extrusion is by translocation, but the evidence is not clear.

      We thank the reviewer for this insightful comment. We agree that our discussion should briefly acknowledge alternative models such as scrunching. We have therefore revised the manuscript to mention the theory by Takaki et al. (Nat. Commun., 2021), which reproduces several experimental outcomes.

      Because our present work specifically addresses the translocation mechanism based on DNA segment capture, we now state that scrunching and related models represent alternative proposals for loop extrusion.

      In this revision, we have added discussion to the end of the subsection titled "DNA segment capture as the mechanism of the DNA translocation by SMC complexes." in the Discussion section, which is as follows:

      “Turning to loop extrusion mechanisms, alternative mechanisms have been proposed in addition to the DNA-segment capture model. For example, Takaki et al. developed a scrunching-based theory that quantitatively accounts for several experimental observations, including force-velocity relationships and step-size distributions. While our present study focuses on the DNA translocation mechanism via segment capture, it is important to note that scrunching and other models remain plausible alternatives for loop extrusion. The precise mechanism may depend on the specific SMC complex and its subunits and remains to be fully resolved.”

      (2) It is unclear how one can say from Figure 4I and J that translocation has taken place. These panels show that the base pair length increases. This should be explained more clearly. They should also simultaneously plot the location of the heads (2D plot).

      Thank you for this valuable suggestion. In response to the comment on how translocation is presented in Fig. 4I and J, we have revised the text in the subsection entitled “DNA translocation via DNA-segment capture” to make it clear that the SMC complex moves along DNA, as follows:

      “Fig. 4I represents the one-dimensional contour coordinate of the DNA molecule, indexed by base pairs (1-800). In this plot, translocation is visualized as a discontinuous shift in the range of base-pair indices that the SMC complex contacts over one complete ATP cycle”

      “This translocation is recorded in Fig. 4I as the average coordinate of the kleisin contact region (red dots) jumps from ~400 bp before the cycle to ~600 bp after, which corresponds to a translocation event of ~200 bp”

      We believe that adding this explanation makes it clearer to readers that Fig. 4I and 4J provide direct evidence for unidirectional translocation of the SMC complex.

      (3) The transitions between the states are very abrupt (see Figure 2). Please explain. Also, in which state does extrusion take place? What is the role of the V-shape - is it part of the ATPase cycle?

      We thank the reviewer for raising these questions.

      In our simulation, we implemented ATP-binding state changes by instantaneously switching the structure-based (Gō-type) potential between reference conformations for the disengaged (apo), engaged (ATP-bound), and V-shaped (ADP-bound) states at predetermined times. The system rapidly relaxes along the new funnel-shaped potential energy surface toward its minimum. This rapid relaxation is why the transition appears abrupt in metrics such as the Q-score in Fig. 2.

      The V-shaped state corresponds to a key ADP-bound intermediate within the ATP hydrolysis cycle. Its primary role in our model is preparatory; it establishes the necessary open geometry that allows for the subsequent "zipping" of the coiled-coil arms. Crucially, unidirectional pumping motion is generated during the transition from the V-shaped state to the disengaged state. That is, the zipping motion of the coiled-coil arm pushes the captured DNA segment forward, resulting in a net translocation along the DNA.
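
      The abrupt transitions described above can be illustrated with a toy calculation. The following is a minimal sketch, not the coarse-grained model used in the study: a single coordinate moves by overdamped Langevin dynamics in a harmonic well whose minimum is switched instantaneously at scheduled steps. The harmonic well, the function name, and all parameter values are assumptions chosen for illustration; the well stands in for a structure-based potential and the coordinate for a progress variable such as a Q-score.

```python
import math
import random


def simulate_switching(switch_steps, minima, k=1.0, dt=0.01,
                       n_steps=3000, kT=0.0, seed=0):
    """Toy overdamped Langevin dynamics of one coordinate x.

    The coordinate sits in a harmonic well (stiffness k) whose minimum is
    switched instantaneously at the steps listed in switch_steps. This is a
    hypothetical stand-in for switching a reference conformation.
    """
    rng = random.Random(seed)
    x = minima[0]
    state = 0
    trajectory = []
    for step in range(n_steps):
        # Instantaneous change of the reference state at a scheduled step.
        if state + 1 < len(minima) and step == switch_steps[state]:
            state += 1
        force = -k * (x - minima[state])
        noise = math.sqrt(2.0 * kT * dt) * rng.gauss(0.0, 1.0)
        x += force * dt + noise
        trajectory.append(x)
    return trajectory


# Three states with minima at 0.0, 1.0 and 0.5, switched at steps 1000 and 2000.
traj = simulate_switching([1000, 2000], [0.0, 1.0, 0.5])
```

      In this sketch the relaxation time is 1/k, which is short compared with the dwell time in each state, so the coordinate appears to jump at each switch even though it relaxes continuously.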

      (4) It appears the heads do not move between the disengaged to engaged states. Why not in their model?

      Thank you for pointing out the lack of clarity in our explanation of the SMC head movement in our simulations.

      In our model, the transition from the disengaged to the engaged state involves a dynamic rearrangement of the SMC heads. Specifically, one ATPase head slides (~10 Å) and rotates (~85°) relative to the other ATPase head to re-associate at a new dimer interface. This movement drives the global conformational change of the complex from a rod-like shape to an open ring, a mechanism proposed in a previous structural study (Diebold-Durand et al., Mol. Cell, 2017).

      As reviewer 2 noted, this crucial motion, which is reflected in the changing head-head distance and hinge angle in Fig. 2A, was not sufficiently highlighted in the text. We have therefore revised the manuscript to explicitly describe this head rearrangement to improve clarity, which is as follows:

      “Upon transition to the engaged state, the two ATPase heads were quickly rearranged to form the new inter-subunit contacts. Specifically, this rearrangement involves one ATPase head sliding by approximately 10 Å and rotating by 85° relative to the other, allowing it to associate through a different interface (Diebold-Durand et al., 2017b). The fractions of formed contacts, Q-scores, that exist at the disengaged (engaged) states quickly decreased (increased) (Fig. 2A, top two plots).”

      (5) What is pumping - it has been used in Marko NAR in the DNA capture model. How is that illustrated in the simulations?

      We thank the reviewer for raising this point. In the context of the DNA segment-capture model by Marko et al. (NAR, 2019), "pumping" refers to the conceptual process where a DNA loop, captured in an upper compartment of the SMC ring, is transferred to a lower compartment, resulting in net translocation.

      Our simulations provide a direct, molecular-resolution visualization of the physical mechanism underlying this concept. We illustrate that the "pumping" action is not a passive transfer but an active, mechanical process driven by a specific conformational change. This occurs during the transition from the V-shaped (ADP-bound) to the disengaged state. As shown in our trajectories, the two coiled-coil arms close in a zipper-like manner, beginning from the hinge and progressing toward the ATPase heads. This zipping motion physically pushes the captured DNA segment from the hinge region toward the kleisin ring.

      This process is visualized in our simulations as a clear, unidirectional translocation step (see Figs. 4B–D, 4I, and S6). The result is a net forward movement of the DNA by a distance that corresponds to the length of the initially captured loop, a key prediction of Marko’s model that we quantify in our step-size analysis (Figs. 4K–L and S8).

      To make this point clearer for the reader, we have revised the manuscript. We have explicitly defined this "zipping and pushing" action as the physical basis for the "pumping" mechanism in the subsection titled "Zipping motion of coiled-coil arms pushes the DNA from hinge domain toward kleisin ring", as follows:

      “This active, mechanical pushing of the DNA loop, driven by the sequential closing of the coiled-coil arm, constitutes the physical basis of the “pumping” mechanism that drives unidirectional translocation. Our simulations thus provide a concrete, molecular-level visualization for this key step in the DNA segment-capture model.”

      (6) The length of DNA simulated is small for understandable reasons. Both experiments and theory show that loop extrusion sizes can be very large, far exceeding the sizes of the SMC complex. Could the small size of DNA be affecting the results?

      We thank the reviewer for this important comment. The relationship between our simulated system size and the large-scale phenomena observed experimentally is a key point.

      Our study was specifically designed to elucidate the fundamental mechanism of the elementary, single-cycle translocation step at near-atomic resolution. For this purpose, the 800 bp DNA length was sufficient. The observed translocation step size per cycle was 216 ± 71 bp, which is substantially smaller than the total length of the simulated DNA. This confirms that the boundaries of our system did not artificially constrain the core translocation process we aimed to investigate. Therefore, we think that the DNA length used in this study did not systematically bias our main findings regarding the motor mechanism itself.

      As the reviewer pointed out, on the other hand, our current setup cannot reproduce the formation of kilobase-scale loops. We hypothesize that these large-scale events are intrinsically linked to the stochastic nature of the ATP hydrolysis cycle, which was simplified in our simulation model. We used fixed durations for each state for computational feasibility. In a more realistic scenario, a stochastically prolonged engaged state would give a captured DNA loop more time to grow via thermal diffusion. This could lead to occasional, much larger translocation steps upon ATP hydrolysis, contributing to the large loop sizes seen experimentally.

      (7) Minor point: The first CG model using three sites was introduced in PNAS vol 102, 6789 2005. The authors should consider citing it.

      Thank you for this suggestion. We have now cited the paper the reviewer recommended. Please see the subsection entitled "Coarse-grained simulations" in Materials and Methods.

    1. eLife Assessment

      This important study reports three experiments examining how the subjective experience of task regularities influences perceptual decision-making. Although the evidence linking subjective ratings to behavioral measures is solid, the study would be strengthened if potential reverse influences of response times on subjective ratings were ruled out and if more comprehensive model comparisons supporting the main claims were performed. The findings will appeal to a wide range of researchers in decision-making and perception.

    2. Reviewer #1 (Public review):

      Summary:

      Press et al. test, in three experiments, whether responses in a speeded response task reflect people's expectations, and whether these expectations are best explained by the objective statistics of the experimental context (e.g., stimulus probabilities) or by participants' mental representation of these probabilities. The studies use a classical response time and accuracy task, in which people are (1) asked to make a response (with one hand), this response then (2) triggers the presentation of one of several stimuli (with different probabilities depending on the response), and participants (3) then make a speeded response to identify this stimulus (with the other hand). In Experiment 1, participants are asked to rate, after the experiment, the subjective probabilities of the different stimuli. In Experiments 2 and 3, they rated, after each trial, to what extent the stimulus was expected (Experiment 2), or whether they were surprised by the stimulus (Experiment 3). The authors test (using linear models) whether the subjective ratings in each experiment predict stimulus identification times and accuracies better than objective stimulus probabilities (Experiment 1), or than their objective probability derived from a Rescorla-Wagner model of prior stimulus history (Experiments 2 and 3). Across all three experiments, the results are identical. Response times are best described by contributions from both subjective and objective probabilities. Accuracy is best described by subjective probability.

      Strengths:

      This is an exciting series of studies that tests an assumption that is implicit in predictive theories of response preparation (i.e., that response speed/accuracy tracks subjective expectancies), but has not been properly tested so far, to my knowledge. I find the idea of measuring subjective expectancy and surprise in the same trials as the response very clever. The manuscript is extremely well written. The experiments are well thought-out, preregistered, and the results seem highly robust and replicable across studies.

      Weaknesses:

      In my assessment, this is a well-designed, implemented, and analysed series of studies. I have one substantial concern that I would like to see addressed, and two more minor ones.

      (1) The key measure of the relationship between subjective ratings and response times/accuracy is inherently correlational. The causal relationship between both variables is therefore by definition ambiguous. I worry that the results don't reveal an influence of subjective expectancy on response times/accuracies, but the reverse: an influence of response times/accuracies on subjective expectancy ratings.

      This potential issue is most prominent in Experiments 2 and 3, where people rate their expectations in a given trial directly after they made their response. We can assume that participants have at least some insight into whether their response in the current trial was correct/erroneous or fast/slow. I therefore wonder if the pattern of results can simply be explained by participants noticing they made an error (or that they responded very slowly) and subsequently being more inclined to rate that they did not expect this stimulus (in Experiment 2) or that they were surprised by it (in Experiment 3).

      The specific pattern across the two response measures might support this interpretation. Typically, participants are more aware of the errors they make than of their response speed. From the above perspective, it would therefore be not surprising that all experiments show stronger associations between accuracy and subjective ratings than between response times and subjective ratings -- exactly as the three studies found.

      I acknowledge that this problem is less strong in Experiment 1, where participants do not rate expectancy or surprise after each response, but make subjective estimates of stimulus probabilities after the experiment. Still, even here, the flow of information might be opposite to what the authors suggest. Participants might not have made more errors for stimuli that they thought of as least likely, but instead might have used the number of their responses to identify a given stimulus as a proxy for rating their likelihood. For example, if they identify a square as a square 25% of the time, even though 5% of these responses were in error, it is perhaps no surprise if their rating of the stimulus likelihood better tracks the times they identified it as a square (25%) than the actual stimulus likelihoods (20%).

      This potential reverse direction of effects would need to be ruled out to fully support the authors' claims.

      (2) My second, more minor concern, is whether the Rescorla-Wagner model is truly the best approximation of objective stimulus statistics. It is traditionally a model of how people learn. Isn't it, therefore, already a model of subjective stimulus statistics, derived from the trial history, instead of objective ones? If this is correct, my interpretation of Experiments 2 and 3 would be (given my point 1 above is resolved) that subjective expectancy ratings predict responses better than this particular model of learning, meaning that it is not a good model of learning in this task. Comparing results against Rescorla-Wagner may even seem like a stronger test than comparing them against objective stimulus statistics - i.e., they show that subjective ratings capture expectancies better even than this model of learning. The authors already touch upon this point in the General Discussion, but I would like to see this expanded, and - ideally - comparisons against objective stimulus statistics (perhaps up to the current trial) to be included, so that the authors can truly support the claim that it is not the objective stimulus statistics that determine response speed and accuracy.

      (3) There is a long history of research trying to link response times to subjective expectancies. For example, Simon and Craft (1989, Memory & Cognition) reported that stimuli of equal probability were identified more rapidly when participants had explicitly indicated they expect this stimulus to occur in the given trial, and there's similar more recent work trying to dissociate stimulus statistics and explicit expectations (e.g., Umbach et al., 2012, Frontiers; for a somewhat recent review, see Gaschler et al., 2014, Neuroscience & Biobehavioral Reviews). It has not become clear to me how the current results relate to this literature base. How do they impact this discussion, and how do they differ from what is already known?

    3. Reviewer #2 (Public review):

      Summary:

      This work by Clarke, Rittershofer, and colleagues used categorization and discrimination tasks with subjective reports of task regularities. In three behavioral experiments, they found that these subjective reports explain task accuracy and response times at least as well and sometimes better than objective measures. They conclude that subjective experience may play a role in predicting processing.

      Strengths:

      This set of behavioral studies addresses an important question. The results are replicated three times with a different experimental design, which strengthens the claims. The design is preregistered, which further strengthens the results. The findings could inspire many studies in decision-making.

      Weaknesses:

      It seems to me that it is important, but difficult to distinguish whether the objective and subjective measures stem from reasonably different mechanisms contributing to behavior, or whether they are simply two noisy proxies to the same mechanism, in which case it is not so surprising that both contribute to the explained variance. The authors acknowledge in the discussion that the type of objective measure that is chosen is crucial.

      For instance, the RW model's learning rates were not fitted to participants but to the sequence of stimuli, so they represent the optimal parameter values, not the true ones that participants are using. Is the subjective measure just a readout of the RW model's true state when using the participants' parameters? Relatedly, would the authors consider the RW predictions from participants using a sub-optimal alpha to be a subjective or an objective measure? Do the results truly show the importance of subjective measures, or is it another way of saying that humans are sub-optimal (Rahnev & Denison, 2018, BBS) ... or optimal for other goals. I see the difficulty of avoiding double-dipping on accuracy, but this seems essential to address. This relates to a more general question about the underlying mechanisms of subjective versus objective measures, which is alluded to in the discussion but could be interesting to develop a bit further.

      In terms of methods, I did not fully understand the 'RW model expectedness' objective metric in Experiments 2 and 3. VT is defined as the model's expectation for the given tone T. A (signed?) prediction error is defined for the expectation update, but it seems that the RW model expectedness used in the figures and statistical models is VT, sign-inverted for unexpected stimuli. So how do we interpret negative values, and how often do they occur? Shouldn't it be the unsigned value that is taken as objective surprise? This could be explained in a bit more detail. Could this be related to the quadratic effect that one can see in Figures 4E and 5E, which is not taken into account in the statistical model? Figures 4A and 5A also seem to show a combination of linear and quadratic effects. A more complete description of the objective measure could help determine whether this is a serious issue or just noise in the data.
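
      To make the quantities in this question concrete, the following is a minimal sketch of a standard Rescorla-Wagner update. It is not the authors' implementation: the function name, the 0/1 outcome coding, the learning rate `alpha`, and the starting value `v0` are all assumptions. For each trial it records the expectation held before the outcome, the signed prediction error, and the unsigned prediction error that could serve as a surprise measure.

```python
def rescorla_wagner(outcomes, alpha=0.1, v0=0.5):
    """Standard Rescorla-Wagner update over a sequence of outcomes.

    outcomes: 1 if the tracked stimulus occurred on that trial, else 0
              (hypothetical coding for this sketch).
    Returns one tuple per trial:
        (expectation before the trial, signed error, unsigned error).
    """
    v = v0
    rows = []
    for outcome in outcomes:
        delta = outcome - v          # signed prediction error
        rows.append((v, delta, abs(delta)))
        v += alpha * delta           # expectation moves toward the outcome
    return rows


rows = rescorla_wagner([1, 1, 0], alpha=0.5, v0=0.5)
# rows[2] is (0.875, -0.875, 0.875): a strongly expected stimulus failed to occur.
```

      In this formulation the expectation stays between 0 and 1 whenever `v0` and `alpha` do, so negative values can only arise from the signed error or from a sign inversion applied afterwards.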

      Gabor patches in Experiments 2 and 3 seemed to have been presented at quite a sharp contrast (I did not find this info), and accuracy seems to saturate at 100%. What was the distribution of error rates, i.e., how many participants were so close to 100% that there was no point in including them in the analysis?

      In the second preregistration, the authors announced that BIC comparisons between the full model and the objective model will test whether subjective measures capture additional variance [...] beyond objective prediction error. This is also the conclusion reached in sections 3.3 and 4.3. The model comparison, however, is performed by selecting the best of three models, excluding the null model. It seems that the full model still wins over the objective model, but sometimes quite marginally. Could the authors not test the significance of the model comparison since models are nested?
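
      The nested comparison suggested here could be carried out with a likelihood-ratio test. The following is a minimal sketch under stated assumptions: the full model differs from the objective-only model by a single parameter, both log-likelihoods come from maximum-likelihood fits, and the numbers in the usage lines are invented for illustration. The function names are hypothetical.

```python
import math


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion; lower values indicate a better model."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


def lrt_one_df(log_lik_reduced, log_lik_full):
    """Likelihood-ratio test for nested models differing by one parameter.

    Returns the test statistic and its p-value from a chi-square
    distribution with 1 degree of freedom.
    """
    stat = max(2.0 * (log_lik_full - log_lik_reduced), 0.0)
    # Survival function of chi-square with 1 df: P(Z^2 > stat).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Invented log-likelihoods, for illustration only.
stat, p_value = lrt_one_df(log_lik_reduced=-100.0, log_lik_full=-97.0)
delta_bic = bic(-97.0, 4, 200) - bic(-100.0, 3, 200)
```

      The two criteria can disagree: BIC penalizes the extra parameter by the logarithm of the sample size, so a marginal BIC difference may still correspond to a significant likelihood-ratio test, or the reverse.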

    4. Reviewer #3 (Public review):

      Summary:

      Clarke et al. investigate the role of subjective representations of task-based statistical structure on choice accuracy and reaction times during perceptual decision-making. Subjective representations of objective statistical structure are often overlooked in studies of predictive processing and, consequently, little is known about their role in predictive phenomena. By gauging the subjective experience of stimulus probability, expectedness, and surprise in tasks with fixed cue-stimulus contingencies, the authors aimed to separate subjective and objective (task-induced) contributions to predictive effects on behaviour.

      Across three different experiments, subjective and objective contributions to predictions were found to explain unique portions of variance in reaction time data. In addition, choice accuracy was best predicted by subjective representations of statistical structure in isolation. These findings reveal that the subjective experience of statistical regularities may play a key role in the predictive processes that shape perception and cognition.

      Strengths:

      This study combines careful and thorough behavioral experimentation with an innovative focus on subjective experience in predictive processing. By collecting three independent datasets with different perceptual decision-making paradigms, the authors provide converging evidence that subjective representations of statistical structure explain unique variance in behavior beyond objective task structure. The analysis strategy, which directly contrasts the contributions of subjective and objective predictors, is conceptually rigorous and allows clear insight into how subjective and objective influences shape behavior. The methods are consistently applied across all three datasets and produce coherent results, lending strong support to the authors' conclusions. The study emphasizes the critical role of subjective experience in predictive processing, with implications for understanding learning, perception, and decision-making.

      Weaknesses:

      Despite these strengths, there are several conceptual and technical issues that should be addressed. In Experiments 2 and 3, the authors use a Rescorla-Wagner (RW) learning model to estimate trialwise expectedness (Experiment 2) and surprise (Experiment 3). While the RW model is a well-established model for explaining learning behaviour, it does not represent the objective 'ground truth' statistical structure of the environment, and treating RW trajectories as such imposes assumptions about learning that may not match participants' actual behavior. This assumption could strongly affect the comparison between subjective and 'objective' predictors. It would strengthen the primary conclusions of the manuscript if other implementations of the objective statistical structure, such as the true task-defined probabilities (i.e., 25% or 75%), were considered to provide a complementary 'ground truth' perspective.

      Additionally, because objective statistical structure was predictive of subjective ratings in all three experiments, these predictors are likely collinear in the full model. Collinearity can lead to inflated standard errors and unstable coefficient estimates, even if the models converge. Currently, this potentially critical problem with the applied statistical models is not assessed, reported on, or controlled for (e.g., by residualizing predictors). RW trajectories are also not reported in the manuscript, limiting the ability to assess how the model evolves over time and whether it maps onto the task-induced probabilities in a sensible way. This is particularly relevant because participants' subjective estimates of the task-induced probabilities seem to converge to the ground truth after just a few trials, especially for the 75% stimuli (Figure 3C).
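
      One way to assess and control the collinearity raised here is sketched below, assuming a model with exactly two predictors (one objective, one subjective). With two predictors the variance inflation factor reduces to 1 / (1 - r^2), where r is their Pearson correlation. The function names and the example values are hypothetical and do not come from the manuscript.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


def residualize(y, x):
    """Residuals of y after least-squares regression on x with an intercept.

    The residuals are uncorrelated with x, so they carry only the part of y
    that x does not linearly explain.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Invented values, for illustration only.
objective = [1.0, 2.0, 3.0, 4.0]
subjective = [1.0, 2.0, 3.0, 5.0]
vif = vif_two_predictors(objective, subjective)
subjective_unique = residualize(subjective, objective)
```

      Residualizing the subjective predictor assigns all shared variance to the objective predictor, which is a modeling choice in itself and changes how the subjective coefficient should be interpreted.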

    1. eLife Assessment

      This paper uses a new computational method that integrates bulk sequencing and single-cell sequencing data to provide refined gene expression datasets for 52 neuron classes in C. elegans. The paper's findings are convincing, presenting an approach that alleviates a key shortcoming of single-cell RNA sequencing. While the datasets have some limitations that the authors acknowledge, the new methodology and refined datasets will be important resources for those interested in understanding how gene expression shapes neuronal morphology and physiology.

    2. Reviewer #1 (Public review):

      This is an interesting manuscript aimed at improving the transcriptome characterization of 52 C. elegans neuron classes. Previous single-cell RNA seq studies already uncovered transcriptomes for these, but the data are incomplete, with a bias against genes with lower expression levels. Here, the authors use cell-specific reporter combinations to FACS purify neurons and use bulk RNA sequencing to obtain better sequencing depth. This reveals more rare transcripts, as well as non-coding RNAs, pseudo genes, etc. The authors develop computational approaches to combine the bulk and scRNA transcriptome results to obtain more definitive gene lists for the neurons examined.

      To ultimately understand features of any cell, from morphology to function, an understanding of the full complement of the genes it expresses is a pre-requisite. This paper gets us a step closer to this goal, assembling a current "definitive list" of genes for a large proportion of C. elegans neurons. The computational approaches used to generate the list are based on reasonable assumptions, the data appear to have been treated appropriately statistically, and the conclusions are generally warranted. I have a few issues that the authors may choose to address:

      (1) As part of getting rid of cross contamination in the bulk data, the authors model the scRNA data, extrapolate it to the bulk data and subtract out "contaminant" cell types. One wonders, however, given that low expressed genes are not represented in the scRNA data, whether the assignment of a gene to one or another cell type can really be made definitive. Indeed, it's possible that a gene is expressed at low levels in one cell, and in high levels in another, and would therefore be considered a contaminant. The result would be to throw out genes that actually are expressed in a given cell type. The definitive list would therefore be a conservative estimate, and not necessarily the correct estimate.

      (2) It would be quite useful to have tested some genes with lower expression levels using in vivo gene-fusion reporters to assess whether the expression assignments hold up as predicted. i.e. provide another avenue of experimentation, non-computational, to confirm that the decontamination algorithm works.

      (3) In many cases, each cell class would be composed of at least 2 if not more neurons. Is it possible that differences between members of a single class would be missed by applying the cleanup algorithms? Such transcripts would be represented only in a fraction of the cells isolated by scRNAseq, and might then be considered not real?

      (4) I didn't quite catch whether the precise staging of animals was matched between the bulk and scRNAseq datasets. Importantly, there are many genes whose expression is highly stage specific or age specific so that even slight temporal difference might yield different sets of gene expression.

      (5) To what extent does FACS sorting affect gene expression? Can the authors provide some controls?

      Comments on revisions:

      The authors have made reasonable arguments in response to my questions, and have done some additional experiments. I believe that although they did not do so, they could have generated additional reporters for the lower expressed genes, that would have validated their method of data integration. Nonetheless, I think the paper is rigorous and will be of use to the community.

    3. Reviewer #2 (Public review):

      Summary:

      This study from the CenGEN consortium addresses several limitations of single-cell RNA (scRNA) and bulk RNA sequencing in C. elegans with a focus on cells in the nervous system. scRNA datasets can give very specific expression profiles, but detecting rare and non-polyA transcripts is difficult. In contrast, bulk RNA sequencing on isolated cells can be sequenced to high depth to identify rare and non-polyA transcripts but frequently suffers from RNA contamination from other cell types. In this study, the authors generate a comprehensive set of bulk RNA datasets from 53 individual neurons isolated by fluorescence activated cell sorting (FACS). The authors combine these datasets with a previously published scRNA dataset (Taylor et al., 2021) to develop a novel method, called LittleBites, to estimate and subtract contamination from the bulk RNA data. The authors validate the method by comparing detected transcripts against gold-standard datasets on neuron-specific and non-neuronal transcripts. The authors generate an "integrated" list of protein-coding expression profiles for the 53 neuron sub-types, with fewer but higher confidence genes compared to expression profiles based only on scRNA. Also, the authors identify putative novel pan-neuronal and cell-type specific non-coding RNAs based on the bulk RNA data. LittleBites should be generally useful for extracting higher confidence data from bulk RNA-seq data in organisms where extensive scRNA datasets are available. The additional confidence in neuron-specific expression and non-coding RNA expands the already great utility of the neuronal expression reference atlas generated by the CenGEN consortium.

      Strengths:

      The study generates and analyzes a very comprehensive set of bulk RNA datasets from individual fluorescently tagged transgenic strains. These datasets are technically challenging to generate and significantly expand our knowledge of gene expression, particularly in cells that were poorly represented in the initial scRNA-seq datasets. Additionally, all transgenic strains are made available as a resource from the Caenorhabditis Genetics Center (CGC).

      The study uses the authors' extensive experience with neuronal expression to benchmark their method for reducing contamination utilizing a set of gold-standard validated neuronal and non-neuronal genes. These gold-standard genes will be helpful for benchmarking any C. elegans gene expression study.

      Weaknesses:

      The bulk RNA-seq data collected by the authors has high levels of contamination and, in some cases, is based on very few cells. The methodology to remove contamination partly makes up for this shortcoming, but the high background levels of contaminating RNA in the FACS-isolated neurons limit the confidence in cell-specific transcripts.

      The study does not experimentally validate any of the refined gene expression predictions, which was one of the main strengths of the initial CenGEN publication (Taylor et al, 2021). No validation experiments (e.g., fluorescence reporters or single molecule FISH) were performed for protein-coding or non-coding genes, which makes it difficult for the reader to assess how much gene predictions are improved, other than for the gold standard set, which may have specific characteristics (e.g., bias toward high expression as they were primarily identified in fluorescence reporter experiments).

      The study notes that bulk RNA-seq data, in contrast to scRNA-seq data, can be used to identify which isoforms are expressed in a given cell. Although not included in this manuscript, two bioRxiv papers have used the generous openness of the CenGEN consortium to study alternative splicing in C. elegans neurons [bioRxiv, 2024.05.16.594567 (2024) and bioRxiv, 2024.05.16.594572 (2024)], nicely showing the strengths of the data.

      Comments on revisions: I agree that the paper is improved.

    4. Reviewer #3 (Public review):

      Summary

      This study aims to overcome key limitations of single-cell RNA-seq in C. elegans neurons (especially the under-detection of lowly expressed and non-polyadenylated transcripts and residual contamination) by integrating bulk RNA-seq from FACS-isolated neuron types with an existing scRNA-seq atlas. The authors introduce LittleBites, an iterative, reference-guided decontamination algorithm that uses a single-cell reference together with ground-truth reporter datasets to optimize subtraction of contaminating signal from bulk profiles. They then generate an "Integrated" dataset that combines the sensitivity of bulk data with the specificity of scRNA-seq and use it to call neuron-specific expression for protein-coding genes, "rescued" genes not detected in scRNA-seq, and multiple classes of non-coding RNAs across 53 neuron classes. All data, code, and thresholded matrices are made publicly available to enable community reuse.
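
      The core of reference-guided decontamination, as this review describes it, is to estimate how much each cell type contributes to a bulk profile by non-negative least squares (NNLS) and then subtract a weighted share of the non-target contributions. The following is a minimal sketch of that idea only. It is not the LittleBites code: the projected-gradient solver, the function names, the toy expression values, and the `learning_rate` argument are assumptions, and the AUROC-based tuning step is omitted.

```python
def nnls_projected_gradient(A, b, n_iter=2000):
    """Solve min ||A w - b||^2 subject to w >= 0 by projected gradient descent.

    A: list of rows (genes) by columns (reference cell types).
    b: bulk expression, one value per gene.
    """
    n_rows, n_cols = len(A), len(A[0])
    # The squared Frobenius norm bounds the largest eigenvalue of A^T A,
    # so its inverse is a safe step size.
    step = 1.0 / sum(a * a for row in A for a in row)
    w = [0.0] * n_cols
    for _ in range(n_iter):
        resid = [sum(A[i][j] * w[j] for j in range(n_cols)) - b[i]
                 for i in range(n_rows)]
        grad = [sum(A[i][j] * resid[i] for i in range(n_rows))
                for j in range(n_cols)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, w[j] - step * grad[j]) for j in range(n_cols)]
    return w


def subtract_contamination(bulk, A, w, target, learning_rate=1.0):
    """Remove the estimated non-target signal from each gene, floored at zero."""
    cleaned = []
    for i, value in enumerate(bulk):
        contaminant = sum(w[j] * A[i][j] for j in range(len(w)) if j != target)
        cleaned.append(max(0.0, value - learning_rate * contaminant))
    return cleaned


# Toy reference: column 0 is the target neuron type, column 1 a contaminant.
reference = [[10.0, 0.0], [0.0, 8.0], [5.0, 1.0], [1.0, 4.0]]
bulk = [10.0, 4.0, 5.5, 3.0]   # target profile plus half a contaminant profile
weights = nnls_projected_gradient(reference, bulk)
cleaned = subtract_contamination(bulk, reference, weights, target=0)
```

      In this toy case the weights recover the mixing proportions and the cleaned profile matches the target column. With a real reference, genes that the single-cell data under-detect in the target type would be over-subtracted, which is the concern about conservative calls raised by the first reviewer.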

      Strengths

      (1) Conceptual advance and useful resource. The work demonstrates in a concrete way how bulk and single-cell datasets can be combined to overcome the weaknesses of each approach, and delivers a high-resolution transcriptomic resource for a substantial fraction of C. elegans neuron classes. The integrated matrices, thresholded expression calls, and non-coding RNA catalog will be useful both for basic neurobiology and for method developers.

      (2) Careful benchmarking and transparency. The revised manuscript includes extensive benchmarking of LittleBites and the Integrated dataset against multiple independent "ground-truth" sets: neuron-specific reporter lines, curated non-neuronal markers, and ubiquitous genes. The authors evaluate AUROCs over a wide range of thresholds, explain ROC/AUROC metrics for non-specialists, and quantify how integration affects both sensitivity and specificity relative to scRNA-seq alone.

      (3) Improved methodological clarity. In response to review, the authors now provide a much more intuitive description of the LittleBites algorithm, including a stepwise explanation of (1) contamination estimation via NNLS using single-cell references, (2) weighted subtraction tuned by a learning-rate parameter, and (3) performance optimization based on AUROC against ground-truth genes. This makes the approach accessible to readers who are not computational specialists and will facilitate re-implementation.

      (4) Systematic analysis of reference dependence. The authors explicitly address the concern that LittleBites depends on the completeness and accuracy of the scRNA-seq reference. They examine how performance varies with cluster size and by simulated degradation of the reference (e.g., reducing the number of cells per cluster), and show that AUROCs remain robust, but that gene-level assignments are more variable for clusters represented by fewer cells. This is an important and honest characterization of when the method is reliable and when users should be cautious.

      (5) Additional biological context. The manuscript now more clearly situates the dataset in the context of previous and ongoing work. In particular, the authors highlight that other groups have already used these bulk data to discover and validate cell-type-specific alternative splicing events, strengthening the case that the data are biologically meaningful beyond the immediate analyses presented here. The expanded analysis of non-coding RNAs and GPCR pseudogenes also adds biological interest.

      (6) Improved handling and documentation of "unexpressed" genes. The authors have trimmed the original list of 4,440 genes called "unexpressed" in scRNA-seq to a higher-confidence subset and provide new supplementary tables that include gene identities and tissue annotations. They also use a curated set of non-neuronal markers to estimate residual contamination and show that most such markers are not detected in the integrated data, with only a small number of apparent false positives remaining.

      Weaknesses

      (1) Novel assignments remain predictive rather than experimentally validated. Although the authors have strengthened their benchmarking and refer to external work that validates some splicing patterns from these data, the large sets of newly assigned lowly expressed genes and non-coding RNAs, particularly those rescued from the "unexpressed" gene pool, are still inferred from computational criteria (thresholding plus correlation-based decontamination) rather than direct orthogonal assays (e.g., smFISH, in situ hybridization, or reporter lines). This is understandable given scale and cost, but it means that many of these calls should be interpreted as well-supported predictions, not definitive expression maps. The revised manuscript acknowledges this, and a dedicated "Limitations of this study" subsection will further clarify this point for readers.

      (2) Reduced stability for neuron types with sparse single-cell representation. The authors' new analyses show that while integration improves overall correlation and AUROC across a wide range of neuron types, gene-level assignments are less stable for neuron classes represented by relatively few cells in the scRNA-seq reference. For such neuron types, both false negatives and false positives are more likely, and users should be cautious when interpreting cell-type-specific expression differences based solely on these calls.

      (3) Residual contamination and misclassification are not completely eliminated. Despite the careful design of LittleBites and the additional correlation-based decontamination of "unexpressed" genes, the authors' benchmarking against curated non-neuronal markers shows that a small fraction of putative non-neuronal genes remains detectable even at stricter thresholds, and some bona fide neuronal genes are removed as likely contaminants. The new supplementary tables documenting "unexpressed" genes and their tissue annotations, together with explicit statements about residual error rates and the predictive nature of these classifications, help users to judge the reliability of specific genes, but they also underscore that the dataset is not a perfect ground truth.

      (4) Scope and coverage remain incomplete. As the authors note, the dataset covers 53 neuron classes and does not fully represent all 302 neurons or all known neuron subtypes. In addition, bulk samples represent pools of neurons, and so the approach cannot resolve within-class heterogeneity or subtype-specific expression within those pools. These are inherent limitations of the current experimental design rather than flaws in the analysis, but they are important for readers to keep in mind when using the resource.

      Overall, the revised manuscript presents solid evidence for the main methodological and resource claims, with clearly articulated limitations. The work is likely to have valuable impact on the C. elegans community and provides a template for integrating bulk and single-cell data in other systems.

    5. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      (1) As part of getting rid of cross-contamination in the bulk data, the authors model the scRNA data, extrapolate it to the bulk data and subtract out "contaminant" cell types. One wonders, however, given that low expressed genes are not represented in the scRNA data, whether the assignment of a gene to one or another cell type can really be made definitive. Indeed, it's possible that a gene is expressed at low levels in one cell, and high levels in another, and would therefore be considered a contaminant. The result would be to throw out genes that actually are expressed in a given cell type. The definitive list would therefore be a conservative estimate, and not necessarily the correct estimate.

      We agree that the various strategies we employ do not result in perfect annotation of gene expression. However, despite their limitations, they are significantly better than either the single cell or the bulk data alone. We represent these strengths and shortcomings throughout the manuscript (for example, in ROC curves).

      (2) It would be quite useful to have tested some genes with lower expression levels using in vivo gene-fusion reporters to assess whether the expression assignments hold up as predicted. i.e. provide another avenue of experimentation, non-computational, to confirm that the decontamination algorithm works.

      We agree that evaluating only highly-expressed genes might introduce bias. We used a large battery of in vivo reporters, made with best-available technology (CRISPR insertion of the fluorophore into the endogenous locus) to evaluate our approaches. These reporters were constructed without bias in terms of gene expression and therefore represent both high and low expression levels. These data are represented throughout the manuscript (for example, in ROC curves). Details about the battery of reporters may be found in Taylor et al 2021. In addition to these reporters, this manuscript also generates and analyzes two other types of gene sets: non-neuronal and ubiquitous genes. Again, these genes are selected without bias toward gene expression, and the techniques presented here are benchmarked against them as well, with positive results.

      (3) In many cases, each cell class would be composed of at least 2 if not more neurons. Is it possible that differences between members of a single class would be missed by applying the cleanup algorithms? Such transcripts would be represented only in a fraction of the cells isolated by scRNAseq, and might then be considered not real.

      For the data set presented in this manuscript, all cells of a single neuron type were labeled and isolated together by FACS, and sequencing libraries were constructed from this pool of cells. Thus, potential subtypes within a particular type (when that type includes more than one cell) cannot be resolved by this method. By contrast, such subtypes were in some cases resolved in the single cell approach. To make the two data sets compatible with each other, for the single cell data we combined any subtypes together. We state in the Methods:

      “For this work, single cell clusters of neuron subtypes were collapsed to the resolution of the bulk replicates (example: VB and VB1 clusters in the single cell data were treated as one VB cluster).”
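
      The collapsing step quoted above can be pictured as a simple relabeling of single-cell cluster labels before any per-type aggregation. The following is a minimal sketch, not the authors' code: the mapping table, the function name, and the placeholder label "OTHER" are assumptions for illustration, and only the VB/VB1 pair comes from the quoted Methods text.

```python
from collections import Counter

# Hypothetical subtype-to-type map; only the VB1 -> VB entry is taken from the text.
SUBTYPE_TO_TYPE = {"VB1": "VB"}


def collapse_subtypes(cluster_labels, subtype_to_type=SUBTYPE_TO_TYPE):
    """Relabel each cell's cluster so subtypes share the label used for the bulk samples."""
    return [subtype_to_type.get(label, label) for label in cluster_labels]


# Toy per-cell labels; "OTHER" is a placeholder for any unaffected cluster.
labels = ["VB", "VB1", "VB1", "OTHER"]
collapsed = collapse_subtypes(labels)
print(Counter(collapsed))
```

      After relabeling, cells from both original clusters contribute to a single VB profile, which is the resolution at which the bulk replicates were collected.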

      (4) I didn't quite catch whether the precise staging of animals was matched between the bulk and scRNAseq datasets. Importantly, there are many genes whose expression is highly stage-specific or age-specific so even slight temporal differences might yield different sets of gene expression.

      We agree that accurate staging is critically important for valid comparisons between data sets and have included an additional supplemental table with staging metadata for each sample. The staging protocol used for the bulk data set was initially employed to generate scRNA seq data and should be comparable. An additional description of our approach is now included in Methods:

      “Populations of synchronized L1s were grown at 23 C until reaching the L4 stage on 150 mM 8P plates inoculated with Na22. The time in culture to reach the L4 stage varied (40.5-49 h) and was determined for each strain. 50-100 animals were inspected with a 40X DIC objective to determine developmental stage as scored by vulval morphology (Mok et al., 2025). Cultures were predominantly composed of L4 larvae but also typically included varying fractions of L3 larvae and adults.”

      We have also updated supplementary table 1 to include additional information about each sort including the observed developmental stages and their proportions when available, the temperature the worms were grown at, the genotype of each experiment, and the number of cells collected in FACS.

      (5) To what extent does FACS sorting affect gene expression? Can the authors provide some controls?

      We appreciate this suggestion. We agree that FACS sorting (and also dissociation of the animals prior to sorting) might affect gene expression, particularly of stress-related transcripts. We note that dissociation and FACS sorting were also used to collect cells for our single cell data set (Taylor et al 2021). Clean controls for this approach can be prohibitively difficult to collect, because dissociation and FACS inevitably change the proportion of cell types present in the sample. For bulk sequencing efforts it is difficult, even with deconvolution approaches, to separate changes in gene expression that result from dissociation and FACS from changes that result from differences in cell type composition. We regrettably omitted a discussion of these issues in the manuscript. We now write in the Results:

      “The dissociation and FACS steps used to isolate neuron types induce cellular stress responsive pathways (van den Brink et al., 2017; Kaletsky et al., 2016, Taylor 2021). Genes associated with this stress response (Taylor 2021) were not removed from downstream analyses, but should be viewed with caution.”

      Reviewer #2 (Public review):

      The bulk RNA-seq data collected by the authors has high levels of contamination and, in some cases, is based on very few cells. The methodology to remove contamination partly makes up for this shortcoming, but the high background levels of contaminating RNA in the FACS-isolated neurons limit the confidence in cell-specific transcripts.

      We agree that these are the limitations of the source data. One of the manuscript’s main goals is to analyze and refine these source data, reducing these limitations and quantifying the results.

      The study does not experimentally validate any of the refined gene expression predictions, which was one of the main strengths of the initial CenGEN publication (Taylor et al, 2021). No validation experiments (e.g., fluorescence reporters or single molecule FISH) were performed for protein-coding or non-coding genes, which makes it difficult for the reader to assess how much gene predictions are improved, other than for the gold standard set, which may have specific characteristics (e.g., bias toward high expression as they were primarily identified in fluorescence reporter experiments).

      We agree that evaluating only highly-expressed genes might introduce bias. We used a large battery of in vivo reporters, made with best-available technology (CRISPR insertion of the fluorophore into the endogenous locus) to evaluate our approaches. These reporters were constructed without bias in terms of gene expression and therefore represent both high and low expression levels. These data are represented throughout the manuscript (for example, in ROC curves). Details about the battery of reporters may be found in Taylor et al 2021. In addition to these reporters, this manuscript also generates and analyzes two other types of gene sets: non-neuronal and ubiquitous genes. Again, these genes are selected without bias toward gene expression, and the techniques presented here are benchmarked against them as well, with positive results.

      The study notes that bulk RNA-seq data, in contrast to scRNA-seq data, can be used to identify which isoforms are expressed in a given cell. However, no analysis or genome browser tracks were supplied in the study to take advantage of this important information. For the community, isoform-specific expression could guide the design of cell-specific expression constructs or for predictive modeling of gene expression based on machine learning.

      We strongly agree that these datasets allow for new discoveries in neuronal splicing patterns and regulators, which is explored further in other publications from our group and other research groups in the field. We did not sufficiently highlight these works in the body of our text, and have added a reference in the discussion. “In addition, the bulk RNA-seq dataset contains transcript information across the gene body, which parallel efforts have used to identify mRNA splicing patterns that are not found in the scRNA-seq dataset.” These works can be found in references 26 and 27.

      (1) The study relies on thresholding to determine whether a gene is expressed or not. While this is a common practice, the choice of threshold is not thoroughly justified. In particular, the choice of two uniform cutoffs across protein-encoding RNAs and of one distinct threshold for non-coding RNAs is somewhat arbitrary and has several limitations. This reviewer recommends the authors attempt to use adaptive threshold-methods that define gene expression thresholds on a per-gene basis. Some of these methods include GiniClust2, Brennecke's variance modeling, HVG in Seurat, BASiCS, and/or MAST Hurdle model for dropout correction.

      We appreciate the reviewer’s suggestion, and would note that the integrated data currently incorporates some gene-specific weighting to identify gene expression patterns, as the single-cell data are weighted by maximum expression for each gene prior to integration with the LittleBites cleaned data. This gene level normalization markedly improved gene detection accuracy, and is discussed in depth in our 2021 paper “Molecular topography of an entire nervous system”. We previously explored several methods for setting gene specific thresholds for identifying gene expression patterns in the integrated dataset. Unfortunately, we found that none of the tested methods outperformed setting “static” thresholds across all genes in the integrated dataset, and that they tended to increase false positive rates for some low abundance genes, where gene-specific thresholding can tend towards calling a gene expressed in at least one cell type when it is actually not expressed in any cell types present. These methods are likely to provide better results for expanded datasets that cover all tissue types (where one might reasonably expect that a gene is likely to be expressed in at least one sample).

      (2) Most importantly, the study lacks independent experimental validation (e.g., qPCR, smFISH, or in situ hybridization) to confirm the expression of newly detected lowly expressed genes and non-coding RNAs. This is particularly important for validating novel neuronal non-coding RNAs, which are primarily inferred from computational approaches.

      We agree that smFISH and related in situ validation methods would be an asset in this analysis. Unfortunately, because most ncRNAs are very short, they are prohibitively difficult to accurately measure with smFISH. Many ncRNAs we attempted to assay with smFISH methods can bind at most 3 fluorescent probes, a signal that was not reliably distinguishable from background autofluorescence in the worm. Many published methods for smFISH signal amplification have not been optimized for C. elegans, and the tough cuticle is a major barrier for those efforts.

      (3) The novel biology is somewhat limited. One potential area of exploration would be to look at cell-type specific alternative splicing events.

      We appreciate this suggestion. Indeed, as we put our source data online prior to publishing this manuscript, two published papers already use this source data set to analyze alternative splicing. Further, these works include validation of splicing patterns observed in this source data, indicating the biological relevance of these data sets.

      (4) The integration method disproportionately benefits neuron types with limited representation in scRNA-seq, meaning well-sampled neuron types may not show significant improvement. The authors should quantify the impact of this bias on the final dataset.

      We agree that cell-types that are well represented in the single-cell dataset tend to have fewer new genes identified in the Integrated dataset than “rare” cell-types in the single cell data. However, we would note that cell-types that are highly abundant in the single-cell data appear to become increasingly vulnerable to non-neuronal false positives, and that integration’s primary effect in high abundance cell-types appears to be reducing the false positive rate for non-neuronal genes. Thus, we suggest that integration benefits all cell-types across the spectrum of single-cell abundance. The false positives are likely caused by a side-effect of normalization steps in the single-cell dataset, which is moderated by using the LittleBites cleaned bulk samples as an orthogonal measurement. The benefit of integration for cell-types with low abundance in the single-cell dataset is now quantified, and the benefits of integration for low and high abundance cell-types from the single cell data are described in the following section (p.13):

      “To test the stability of LittleBites cleanup across different single-cell reference dataset qualities, we ran the algorithm on a set of bulk samples by first subsetting the corresponding single-cell cluster’s population to 10, 50, 100, or 500 cells. We performed this process 500 times for each subsampling rate for each sample (2000 total runs per sample). We found that testing gene AUROC values are stable across reference cluster sizes (Fig. 2D), suggesting that even if the target cell type is rarely represented in a single cell reference, accurate cleaning is still possible. However, comparing gene level stability across target cluster population levels reveals that low abundance references have higher gene level variance (Fig. 2E), lower purity estimates (Fig. S2F), higher variance in the mean expression across genes (Fig. S2G), and they tend to have lower overall expression (suggesting more aggressive subtraction) (Fig. S2H). This indicates that while binary gene calling is improved even if the reference cluster is small, users should be cautious when using fewer than 100 cells in their single cell reference cluster as the resulting cleanup is less stable.”
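
      The subsampling design in the quoted passage can be illustrated with a simplified sketch. It does not re-run LittleBites as the authors did; it only subsamples a reference cluster to each size and measures how much the resulting mean expression profile varies between draws. The function name, the simulated Poisson counts, and the reduced iteration count are assumptions chosen to keep the example small.

```python
import numpy as np


def subsample_stability(cluster_counts, sizes=(10, 50, 100, 500), n_iter=500, seed=0):
    """For each subsample size, return the gene-averaged variance of the
    subsampled mean profile across repeated draws (higher = less stable)."""
    rng = np.random.default_rng(seed)
    n_cells, n_genes = cluster_counts.shape
    result = {}
    for size in sizes:
        if size > n_cells:
            continue  # cannot draw more cells than the cluster contains
        profiles = np.empty((n_iter, n_genes))
        for i in range(n_iter):
            picked = rng.choice(n_cells, size=size, replace=False)
            profiles[i] = cluster_counts[picked].mean(axis=0)
        result[size] = float(profiles.var(axis=0).mean())
    return result


# Simulated cluster: 1000 cells x 20 genes (toy data, not from the study).
toy_cluster = np.random.default_rng(1).poisson(lam=2.0, size=(1000, 20))
stability = subsample_stability(toy_cluster, n_iter=100)
for size, variance in stability.items():
    print(size, round(variance, 4))
```

      In this toy setting the between-draw variance shrinks as the reference cluster grows, which is the qualitative pattern the authors report for gene-level stability.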

      (5) The authors employ a logit transformation to model single-cell proportions into count space, but they need to clarify its assumptions and potential pitfalls (e.g., how it handles rare cell types).

      We agree that the assumptions and pitfalls of the logit model are key for evaluating its usefulness, especially for cell types that are rarely captured in the single-cell dataset. The assumptions and pitfalls are described in the methods section, but we regretfully omitted any mention of those pitfalls in the results, which we have now rectified.

      The description in the methods section is: “We applied this formula to our real single cell dataset and used this equation to transform proportion measures of gene expression into a count space to generate the Prop2Count dataset for downstream analysis and integration with bulk datasets. This procedure allows for proportions data to be used in downstream analyses that work with counts datasets. However, it does limit the range of potential values that each gene can have, with the potential values set as:

      As n approaches 0, the number of potential values decreases, which can be incompatible with some downstream models. Thus, caution should be used when applying this transformation to datasets with few cells.”

      The new mention in the results is: “However, caution should be taken when using this approach in scRNAseq cases where all replicates of a cell type contain few cells. scProp2Count values are limited to the space of possible proportion values, and so replicates with low numbers of cells will have fewer potential expression “levels” which may break some model assumptions in downstream applications (see Methods).”
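
      The caution about low cell numbers follows from a simple counting argument: with n cells, the proportion of cells in which a gene is detected can only take the values k/n for k = 0, ..., n. The sketch below illustrates that discreteness only; it does not reproduce the authors' Prop2Count formula, which is not shown in this response, and the function name is an assumption.

```python
from fractions import Fraction


def possible_proportions(n_cells):
    """All detection proportions a gene can take in a replicate of n_cells cells."""
    return [Fraction(k, n_cells) for k in range(n_cells + 1)]


for n in (5, 50, 500):
    levels = possible_proportions(n)
    print(n, len(levels), float(levels[1]))  # cell count, number of levels, smallest nonzero step
```

      A replicate with 5 cells offers only 6 distinct levels, so any quantity derived from those proportions inherits the same coarse grid.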

      (6) The LittleBites approach is highly dependent on the accuracy of existing single-cell references. If the scRNA-seq dataset is incomplete or contains classification biases, this could propagate errors into the bulk RNA-seq data. The authors may want to discuss potential limitations and sensitivity to errors in the single-cell dataset, and it is critical to define minimum quality parameters (e.g. via modeling) for the scRNAseq dataset used as reference.

      We appreciate this suggestion, and agree that the manuscript would benefit from a description of where the LittleBites method can give poor results. To this end, we subset our single cell reference for individual neurons of interest to the level of 10, 50, 100, or 500 cells (500 iterations per sample rate), and then ran LittleBites, and compared metrics of gene expression stability, sample composition estimates, and AUROC performance on test genes. We found that when fewer than 100 cells for the target cell type are included in the single cell reference, gene expression stability drops (variance between subsampling iterations was much higher when fewer reference cells were used). However, we found that AUROC values were consistently high regardless of how many reference cells were included, but that this stability in AUROC values was paired with lower overall counts in samples with <100 reference cells after cleanup. This indicates that in cases where few reference cells are present, higher AUROC values might be achieved by more aggressive subtraction, which is attenuated when the reference model is more complete. This analysis is shown in figure 2 and figure S2, and described in the results section, recreated here.

      “To test the stability of Littlebites cleanup across different single-cell reference dataset qualities, we ran the algorithm on a set of bulk samples by first subsetting the corresponding single-cell cluster’s population to 10, 50, 100, or 500 cells. We performed this process 500 times for each subsampling rate for each sample (2000 total runs per sample). We found that testing gene AUROC values are stable across reference cluster sizes (Fig. 2D), suggesting that even if the target cell type is rarely represented in a single cell reference, accurate cleaning is still possible. However, comparing gene level stability across target cluster population levels reveals that low population references have higher gene level variance (Fig. 2E), lower purity estimates (Fig. S2F), higher variance in the mean expression across genes (Fig. S2G), and they tend to have lower overall expression (suggesting more aggressive subtraction) (Fig. S2H). This suggests that while binary gene calling is improved similarly even if the reference cluster is small, users should be cautious when using less than 100 cells in their single cell reference cluster as the resulting cleanup is less stable.”

      (7) Also very important, the LittleBites method could benefit from a more intuitive explanation and schematic to improve accessibility for non-computational readers. A supplementary step-by-step breakdown of the subtraction process would be useful.

      We appreciate this suggestion and implemented a step-by-step breakdown of the subtraction process in the methods section, also copied below. We also updated the graphic representation in figure 2A.

      “LittleBites Subtraction algorithm

      LittleBites is an iterative algorithm for bulk RNA-seq datasets, that improves the accuracy of cell-type specific bulk RNA-seq sample profiles by removing counts from non-target contaminants (e.g. ambient RNA from dead cells, carry-over non-target cells from FACS enrichment due to imperfect gating). This method leverages single cell reference datasets and ground truth expression information to guide iterative and conservative subtraction to enrich for true target cell-type expression. Using this approach, LittleBites balances subtraction by optimizing using both a single-cell reference, and an orthogonal ground truth reference, moderating biases inherent to either reference.

      This algorithm first calculates gene level specificity weights in a single cell reference dataset using SPM (Specificity Preservation Method) (re-add 22, re-add 23). SPM assigns high weights (approaching 1) to genes expressed in single cell types while applying conservative weights to genes with broader expression patterns, which helps to reduce inappropriate subtraction.

      The algorithm proceeds in a loop of three steps:

      Step 1: Estimate Contamination. Each bulk sample is modeled as the sum of a linear combination of single-cell profiles (target cell type and likely contaminants) using non-negative least squares (NNLS). The resulting coefficients provide the estimate of how much of the sample’s counts come from the target cell-type, and how much comes from each contaminant cell-type.

      Step 2: Weighted Subtraction. Each bulk sample is cleaned by subtracting the weighted sum of contaminant single-cell profiles. This subtraction is attempted multiple times (separately) across a series of learning rate weights (usually ranging from 0-1) which moderate the size of the subtraction step (Equation 1). This produces a range of possible “cleaned” sample options for evaluation.

      Step 3: Performance Optimization. For each learning rate, the cleaned result is evaluated against a set of ground truth genes by calculating the area under the receiver operating characteristic curve (AUROC). The learning rate that optimizes the AUROC is then selected. When multiple learning rates yielded equivalent AUROC values, the lowest learning rate value is chosen to minimize subtraction.

      If the optimal learning rate at Step 3 is 0 (no subtraction option beats the baseline) then the loop is halted. Else, the cleaned bulk profile is returned to Step 1, and the loop continues until the AUROC cannot be improved upon using the single-cell reference modeling.”
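
      The three-step loop described above can be sketched in a few lines. This is an illustrative reading of the quoted description, not the authors' implementation: Equation 1 is not reproduced in this response, so the subtraction formula (learning rate times specificity-weighted contaminant profiles times NNLS coefficients), the clipping at zero, the function names, and the toy data are all assumptions. The `weights` matrix stands in for the SPM specificity weights.

```python
import numpy as np
from scipy.optimize import nnls
from scipy.stats import rankdata


def auroc(scores, labels):
    """AUROC from average ranks (Mann-Whitney identity); ties count as 0.5."""
    labels = np.asarray(labels, dtype=bool)
    ranks = rankdata(scores)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def little_bites_sketch(bulk, reference, target, weights, truth_idx, truth_labels,
                        learning_rates=(0.25, 0.5, 0.75, 1.0), max_iter=50):
    """bulk: (genes,), reference/weights: (genes, cell_types), target: column index."""
    cleaned = np.asarray(bulk, dtype=float).copy()
    others = [j for j in range(reference.shape[1]) if j != target]
    for _ in range(max_iter):
        # Step 1: estimate how much each cell type contributes to the sample.
        coef, _ = nnls(reference, cleaned)
        # Step 2: weighted sum of contaminant profiles (assumed form of the subtraction).
        contamination = (reference[:, others] * weights[:, others]) @ coef[others]
        # Step 3: keep the learning rate with the best AUROC; a rate of 0 is the baseline.
        best_rate, best_auc = 0.0, auroc(cleaned[truth_idx], truth_labels)
        for rate in sorted(learning_rates):
            candidate = np.clip(cleaned - rate * contamination, 0.0, None)
            auc = auroc(candidate[truth_idx], truth_labels)
            if auc > best_auc:  # strict comparison keeps the lowest rate among ties
                best_rate, best_auc = rate, auc
        if best_rate == 0.0:
            break  # no subtraction beats the baseline, so halt
        cleaned = np.clip(cleaned - best_rate * contamination, 0.0, None)
    return cleaned


# Toy data: column 0 is the target cell type, column 1 a contaminant.
reference = np.array([[10, 0], [10, 0], [2, 0], [0, 8], [0, 8], [0, 8]], dtype=float)
bulk = reference @ np.array([1.0, 0.5])
weights = np.ones_like(reference)
truth_idx = np.arange(6)
truth_labels = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
cleaned = little_bites_sketch(bulk, reference, 0, weights, truth_idx, truth_labels)
print(auroc(bulk, truth_labels), auroc(cleaned[truth_idx], truth_labels))
```

      In the toy example the contaminant-only genes initially outrank a lowly expressed target gene; subtraction lowers them until the ground-truth genes separate, after which no learning rate improves the AUROC and the loop stops.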

      (8) In the same vein, the ROC curves and AUROC comparisons should have clearer annotations to make results more interpretable for readers unfamiliar with these metrics.

      We agree that the ROC and AUROC metrics need a clearer explanation to make their use and interpretations clearer. We included a description of both metrics, and a suggestion for how to interpret them in the results section, copied below.

      “To evaluate the post-subtraction datasets accuracy we used the area under the Receiver-Operator Characteristic (AUROC) score. In brief, we set a wide range of thresholds to call genes expressed or unexpressed, and then compared it to expected expression from a set of ground truth genes. This comparison produces a true positive rate (TPR, the percentage of truly expressed genes that are called expressed), and false positive rate (FPR, the percentage of truly not expressed genes that are called expressed), and a false discovery rate (FDR, the percentage of genes called expressed that are truly not expressed). The Receiver-Operator Characteristic (ROC) is the graph of the line produced by the TPR and FPR values across the range of thresholds tested, and the AUROC is calculated as the sum of the area under that line. A “random” model of gene expression is expected to have an AUROC value of 0.5, and a “perfect” model is expected to have an AUROC value of 1. Thus, AUROCs below 0.5 are worse than a random guess, and values closer to 1 indicate higher accuracy.”
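
      The metrics in the quoted passage can be made concrete with a small worked example. The sketch below sweeps a set of thresholds, computes TPR, FPR, and FDR at each, and integrates the ROC curve with the trapezoid rule. The expression values and truth labels are invented for illustration, and the function names are assumptions.

```python
import numpy as np


def roc_points(expression, truly_expressed, thresholds):
    """TPR, FPR, and FDR when genes with expression >= threshold are called expressed."""
    expression = np.asarray(expression, dtype=float)
    truth = np.asarray(truly_expressed, dtype=bool)
    tpr, fpr, fdr = [], [], []
    for t in thresholds:
        called = expression >= t
        true_pos = (called & truth).sum()
        false_pos = (called & ~truth).sum()
        tpr.append(true_pos / truth.sum())
        fpr.append(false_pos / (~truth).sum())
        fdr.append(false_pos / called.sum() if called.any() else 0.0)
    return np.array(fpr), np.array(tpr), np.array(fdr)


def auroc_trapezoid(fpr, tpr):
    """Area under the ROC curve, anchored at (0, 0) and (1, 1)."""
    fpr = np.concatenate(([0.0], fpr, [1.0]))
    tpr = np.concatenate(([0.0], tpr, [1.0]))
    order = np.lexsort((tpr, fpr))  # sort by FPR, then TPR
    fpr, tpr = fpr[order], tpr[order]
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))


# Invented example: six ground-truth genes, three truly expressed.
expression = [9.0, 7.0, 5.0, 3.0, 1.0, 0.5]
truth = [True, True, False, True, False, False]
fpr, tpr, fdr = roc_points(expression, truth, thresholds=sorted(set(expression), reverse=True))
print(auroc_trapezoid(fpr, tpr))
```

      Here one truly expressed gene ranks below one unexpressed gene, so 8 of the 9 expressed/unexpressed pairs are ordered correctly and the AUROC falls between the random (0.5) and perfect (1) reference values.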

      (9) Finally, after the correlation-based decontamination of the 4,440 'unexpressed' genes, how many were ultimately discarded as non-neuronal?

      a) Among these non-neuronal genes, how many were actually known neuronal genes or components of neuronal pathways (e.g., genes involved in serotonin synthesis, synaptic function, or axon guidance)?

      b) Conversely, among the "unexpressed" genes classified as neuronal, how many were likely not neuron-specific (e.g., housekeeping genes) or even clearly non-neuronal (e.g., myosin or other muscle-specific markers)?

      Combined with point 10, see below.

      (10) To increase transparency and allow readers to probe false positives and false negatives, I suggest the inclusion of:

      a) The full list of all 4,440 'unexpressed' genes and their classification at each refinement step. In that list flag the subsets of genes potentially misclassified, including:

      - Neuronal genes wrongly discarded as non-neuronal.

      - Non-neuronal genes wrongly retained as neuronal.

      b) Add a certainty or likelihood ranking that quantifies confidence in each classification decision, helping readers validate neuronal vs. non-neuronal RNA assignments.

      This addition would enhance transparency, reproducibility, and community engagement, ensuring that key neuronal genes are not erroneously discarded while minimizing false positives from contaminant-derived transcripts.

      We agree that the genes called “unexpressed” in the single-cell data need more context and clarity. First, we trimmed the list to only include 2,333 genes of highest confidence. Second, for those genes we identified any with published neuronal expression patterns. Identifying genes that were retained as neuronal but are likely non-neuronal in origin is difficult, as many markers are expressed in a mixture of neuronal and non-neuronal cell-types. However, we used a curated list of putative non-neuronal markers to assess the accuracy of the integrated data (see supplementary table 4), and established that most non-neuronal markers are undetected in the integrated data, with the number of detected genes decreasing as our threshold stringency increases. Of note, a few putative non-neuronal genes remain detected even at high thresholds, indicating that our dataset still retains a small percentage of neuronal false positives. This result has been collected in the new supplementary figure 4F, and addressed in the following text in the results section: “Testing against a curated list of non-neuronal genes from fluorescent reporters and genomic enrichment studies, we found that of 445 non-neuronal markers, each gene was detected in an average of 12.5 cells or a median of 3 cells in the single-cell dataset, and an average of 8.7 cells or a median of 1 cell in the integrated dataset, at a 14% FDR threshold.”

      We also included a list of “unexpressed” gene identities and tissue annotations as new supplementary tables 16 and 17.

      Reviewer #2 (Recommendations for the authors):

      The utility of the bulk RNA-seq data would be significantly increased if the authors were to analyze which isoforms are expressed in individual neurons. Also, it would be very useful to know if there are instances where a gene is expressed in several neurons, but different isoforms are specific to individual neurons.

      We appreciate this suggestion. Indeed, as we put our source data online prior to publishing this manuscript, two published papers already use this source data set to analyze alternative splicing. Further, these works include validation of splicing patterns observed in this source data, indicating the biological relevance of these data sets. This is now noted in our discussion section “In addition, the bulk RNA-seq dataset contains transcript information across the gene body, which parallel efforts have used to identify mRNA splicing patterns that are not found in the scRNA-seq dataset.” These works can be found in references 26 and 27.

      Reviewer #3 (Recommendations for the authors):

      (1) Describe the number of L4 animals processed to obtain good-quality bulk RNAseq libraries from the different neuronal types. If the number of worms would be different for different neuronal types, then please make a supplementary table listing the minimum number of worms needed for each neuronal type.

      We appreciate the reviewer’s recommendation, and agree that it would be a useful resource to provide suggestions for how many worms are needed per experiment. Unfortunately, we did not track the total number of animals for each sample. We aimed to start with 200-300 µl of packed worms for each strain, generally equating to >500,000 worms, but yields of FACS-isolated cells varied among cell types. Because replicates for specific neuron types were also variable in some instances (see additions to supplemental Table 1), yields likely depend on multiple factors. We have previously noted (Taylor et al., 2021), for example, that some cell types (e.g., pharyngeal neurons) were under-represented in scRNA-seq data relative to their in vivo abundance, presumably because of the difficulty of isolation or sub-viability in the cell dissociation-FACS protocol.

      (2) List the thresholds for the parameters used during the FASTQC quality control and the threshold number of reads that would make a sample not useful.

      We now include parameters for sample exclusion in the methods section. “Samples were excluded after sequencing if they had: fewer than 1 million read pairs, <1% of uniquely mapping reads to the C. elegans genome, > 50% duplicate reads (low umi diversity), or failed deduplication steps in the nudup package.”
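
      The exclusion criteria quoted above can be sketched as a simple filter. This is a minimal illustration only: the `SampleQC` record, its field names, and the `exclusion_reasons` helper are hypothetical and are not part of the authors' pipeline or of the nudup package. Only the four thresholds are taken from the quoted methods text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleQC:
    """Hypothetical per-sample QC summary (field names are assumptions)."""
    name: str
    read_pairs: int             # total sequenced read pairs
    pct_unique_mapped: float    # % of reads mapping uniquely to the genome
    pct_duplicates: float       # % duplicate reads (proxy for low UMI diversity)
    dedup_succeeded: bool       # whether the deduplication step completed


def exclusion_reasons(sample: SampleQC) -> List[str]:
    """Return every quoted criterion the sample fails (empty list = keep)."""
    reasons = []
    if sample.read_pairs < 1_000_000:
        reasons.append("fewer than 1 million read pairs")
    if sample.pct_unique_mapped < 1.0:
        reasons.append("<1% uniquely mapping reads")
    if sample.pct_duplicates > 50.0:
        reasons.append(">50% duplicate reads")
    if not sample.dedup_succeeded:
        reasons.append("failed deduplication")
    return reasons


# Made-up example values, for illustration only.
good = SampleQC("sample_A", 5_000_000, 62.0, 20.0, True)
bad = SampleQC("sample_B", 400_000, 0.5, 75.0, True)

for s in (good, bad):
    reasons = exclusion_reasons(s)
    print(s.name, "excluded" if reasons else "kept", reasons)
```

      Returning the list of failed criteria, rather than a single boolean, makes it easy to report why a given library was dropped.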

      (3) In Figure 5C, include an overlapping bar that shows the total number of genes in each cell type. You may need to use a log scale to see both (new and all) numbers of genes in the same graph. Add supplementary tables with the names of all new genes assigned to each neuronal type.

      We agree that this figure panel needed additional context. On further reflection we concluded that figure 5 was not sufficiently distinct from figure 4 to warrant separation, and incorporated some key findings from figure 5 into figure S4.

    1. eLife Assessment

      This valuable study describes an interesting infection phenotype that differs between adult male and female zebrafish. The authors present data indicating that male-biased expression of Cyp17a2 appears to mediate viral infection through STING and USP8 activity regulation. Through experimentation on male fish, the authors present solid evidence linking this factor to direct and indirect antiviral outcomes through ubiquitination pathways. These findings raise interesting questions about immune mechanisms that underlie sex-dimorphism and the selective pressures that might shape it.

    2. Reviewer #1 (Public review):

      Summary:

      In this manuscript Lu & Cui et al. observe that adult male zebrafish are more resistant to infection and disease following exposure to Spring Viremia of Carp Virus (SVCV) than female fish. The authors then attempt to identify some of the molecular underpinnings of this apparent sexual dimorphism and focus their investigations on a gene called cytochrome P450, family 17, subfamily A, polypeptide 2 (cyp17a2) because it was among genes that they found to be more highly expressed in kidney tissue from males than in females. Their investigations lead them to propose a direct connection between cyp17a2 and modulation of interferon signaling as the key underlying driver of difference between male and female susceptibility to SVCV.

      Strengths:

      Strengths of this study include the interesting observation of a substantial difference between adult male and female zebrafish in their susceptibility to SVCV, and also the breadth of experiments that were performed linking cyp17a2 to infection phenotypes and molecularly to the stability of host and virus proteins in cell lines. The authors place the infection phenotype in an interesting and complex context of many other sexual dimorphisms in infection phenotypes in vertebrates. This study succeeds in highlighting an unexpected factor involved in antiviral immunity that will be an important subject for future investigations of infection, metabolism, and other contexts.

      Weaknesses:

      Weaknesses of this study include a proposed mechanism underlying the sexual dimorphism phenotype based on experimentation in only males, and widespread reliance on over-expression when investigating protein-protein interaction and localization.

    3. Reviewer #2 (Public review):

      This study conducted by Lu et al. explores the molecular underpinnings of sexual dimorphism in antiviral immunity in zebrafish, with a particular emphasis on the male-biased gene cyp17a2. The authors demonstrate that male zebrafish exhibit stronger antiviral responses than females, and they identify a teleost-specific gene cyp17a2 as a key regulator of this dimorphism. Utilizing a combination of in vivo and in vitro methodologies, they demonstrate that Cyp17a2 potentiates IFN responses by stabilizing STING via K33-linked polyubiquitination and directly degrades the viral P protein via USP8-mediated deubiquitination. The work challenges conventional views of sex-based immunity and proposes a novel, hormone- and sex chromosome-independent mechanism.

      Strengths:

      (1) Novel concept: the finding that sexual dimorphism in immunity can be driven by an autosomal gene, rather than by sex chromosomes or hormones, represents a significant advance in the field, offering a more comprehensive understanding of immune evolution.

      (2) The present study provides a comprehensive molecular pathway, from gene expression to protein-protein interactions and post-translational modifications, thereby establishing a link between Cyp17a2 and both host immune enhancement (via STING) and direct antiviral activity (via viral protein degradation).

      (3) In order to substantiate their claims, the authors utilize a wide range of techniques, including transcriptomics, Co-IP, ubiquitination assays, confocal microscopy, and knockout models.

      (4) The choice of model is well suited to the question. Zebrafish, which lack sex chromosomes, offer a clear genetic background for the dissection of autosomal contributions to sexual dimorphism.

    4. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Weaknesses:

      (1) Weaknesses of this study include a proposed mechanism underlying the sexual dimorphism phenotype based on experimentation in only males, and widespread reliance on over-expression when investigating protein-protein interaction and localization. Additionally, a minor weakness is that the text describing the identification of cyp17a2 as a candidate contains errors that are confusing.

      We thank the reviewer for these insightful comments, which have helped us improve the manuscript.

      (1) Experimentation in males. We focused on male zebrafish for our mechanistic studies to preclude potential confounding effects from female hormones and to directly interrogate the basis of the observed male-biased resistance. As confirmed in the manuscript (lines 151-153), both wild-type and cyp17a2⁻/⁻ males developed normal male sex organs and exhibited comparable androgen levels. This crucial control gives us confidence that the differences in antiviral immunity we observed are a direct consequence of Cyp17a2 loss-of-function, rather than secondary to developmental or hormonal abnormalities. We fully agree that elucidating the mechanism in females represents a valuable and interesting direction for future research.

      (2) Over-expression studies. We acknowledge that overexpression approaches can have inherent limitations. To mitigate this and strengthen our conclusions, we complemented these experiments with loss-of-function data from both knockout zebrafish and knockdown cells, as well as validation at the endogenous level (e.g., Fig. 4J and S4C). The consistent results obtained across these diverse experimental models collectively reinforce our conclusion that Cyp17a2 interacts with and stabilizes STING.

      (3) We thank the reviewer for pointing out the lack of clarity in the text regarding the selection process of Cyp17a2. We have thoroughly revised the manuscript to provide a precise and accurate description of our methodology. The relevant text is now as follows: “Differential expression analysis identified 1511 upregulated and 1117 downregulated genes (Fig. 2A and Table S2). We then focused on a subset of known or putative sex-related genes. Among these eight candidates, cyp17a2 exhibited the most significant male-biased upregulation, a finding that was subsequently confirmed by qPCR (Fig. 2B and S1A)” (lines 142-144).

      (2) Lines 139-140 describe the data for Figure 2 as deriving from "healthy hermaphroditic adult zebrafish". This appears to be a language error and should be corrected to something that specifies that the comparison made is between healthy adult male and female kidneys.

      We thank the reviewer for pointing out this inaccuracy. This was a terminological error, and we have corrected the text to accurately state “transcriptome sequencing was performed on head-kidney tissues from healthy adult male and female zebrafish” (lines 139-140). We have carefully reviewed the manuscript to ensure no similar errors are present.

      (3) In Figure 2A and associated text cyp17a2 is highlighted but the volcano plot does not indicate why this was an obvious choice. For example, many other genes are also highly induced in male vs female kidneys. Figure 2B and line 143 describe a subset of "eight sex-related genes" but it is not clear how these relate to Figure 2A. The narrative could be improved to clarify how cyp17a2 was selected from Figure 2A and it seems that the authors made an attempt to do this with Figure 2B but it is not clear how these are related. This is important because the available data do not rule out the possibility that other factors also mediate the sexual dimorphism they observed either in combination, in a redundant fashion, or in a more complex genetic fashion. The narrative of the text and title suggests that they consider this to be a monogenic trait but more evidence is needed.

      We thank the reviewer for raising these important points. We have revised the manuscript to clarify the candidate gene selection process and to avoid any implication that the trait is monogenic.

      The selection of cyp17a2 was not based solely on its position in the volcano plot (Fig. 2A), but on a multi-faceted rationale. We first prioritized genes with known or putative sex-related functions from the pool of differentially expressed genes. From this subset, cyp17a2 emerged as the lead candidate due to a combination of unique attributes: it exhibited the most significant and consistent male-biased upregulation among the validated candidates (Fig. 2B and S1A); it is a teleost-specific autosomal gene, suggesting a novel mechanism for sexual dimorphism independent of canonical sex chromosomes; and it showed conserved male-biased expression across multiple tissues (Fig. 2C and 2D). Regarding its representation in the volcano plot, cyp17a2 was included in the underlying dataset but was not explicitly labeled in the revised Figure 2A to maintain visual clarity, as the plot aimed to illustrate the global transcriptomic landscape rather than highlight individual genes.

      We agree with the reviewer that other genetic factors may contribute to the observed sexual dimorphism. Accordingly, we have modified the text throughout the manuscript to remove any suggestion of a purely monogenic trait. Our functional data position cyp17a2 as a key and sufficient factor, as its knockout in males was sufficient to ablate the antiviral resistance phenotype (Fig. 2E-G), demonstrating a major, nonredundant role without precluding potential contributions from other genes.

      The following specific changes have been made to the text.

      (1) The title has been revised by replacing “governs” with “orchestrates.” (line 1)  

      (2) The abstract now states “the male-biased gene cyp17a2 as a critical mediator of this enhanced response” instead of “which are driven by the male-biased gene Cyp17a2 rather than by hormones or sex chromosomes.” (lines 33-34)

      (3) The discussion now states “Our study leverages this unique context to demonstrate that enhanced antiviral immunity in males is mediated by the male-biased expression of the autosomal gene cyp17a2,” removing the comparative phrasing regarding hormones or sex chromosomes. (lines 364-366)

    1. eLife Assessment

      This study makes an important contribution by revealing how saccades selectively disrupt spatial working memory while sparing other object features, and by demonstrating how this mechanism is altered in aging and neurodegeneration. The findings are supported by convincing evidence derived from well-controlled eye-tracking experiments and systematic generative model comparisons. Together, the work provides a computationally grounded framework that is of importance for understanding trans-saccadic memory and its clinical relevance.

    2. Reviewer #1 (Public review):

      Summary:

      This study employed a saccade-shifting sequential working memory paradigm, manipulating whether a saccade occurred after each memory array to directly compare retinotopic and transsaccadic working memory for both spatial location and color. Across four participant groups (young and older healthy adults, and patients with Parkinson's disease and Alzheimer's disease), the authors found a consistent saccade-related cost specifically for spatial memory - but not for color - regardless of differences in memory precision. Using computational modeling, they demonstrate that data from healthy participants are best explained by a complex saccade-based updating model that incorporates distractor interference. Applying this model to the patient groups further elucidates the sources of spatial memory deficits in PD and AD. The authors then extend the model to explain copying deficits in these patient groups, providing evidence for the ecological validity of the proposed saccade-updating retinotopic mechanism.

      Strengths:

      Overall, the manuscript is well written, and the experimental design is both novel and appropriate for addressing the authors' key research questions. I found the study to be particularly comprehensive: it first characterizes saccade-related costs in healthy young adults, then replicates these findings in healthy older adults, demonstrating how this "remapping" cost in spatial working memory is age-independent. After establishing and validating the best-fitting model using data from both healthy groups, the authors apply this model to clinical populations to identify potential mechanisms underlying their spatial memory impairments. The computational modeling results offer a clearer framework for interpreting ambiguities between allocentric and retinotopic spatial representations, providing valuable insight into how the brain represents and updates visual information across saccades. Moreover, the findings from the older adult and patient groups highlight factors that may contribute to spatial working memory deficits in aging and neurological disease, underscoring the broader translational significance of this work.

      Weaknesses:

      Several concerns should be addressed to enhance the clarity of the manuscript:

      (1) Relevance of the figure-copy results (pp. 13-15).

      Is it necessary to include the figure-copy task results within the main text? The manuscript already presents a clear and coherent narrative without this section. The figure-copy task represents a substantial shift from the LOCUS paradigm to an entirely different task that does not measure the same construct. Moreover, the ROCF findings are not fully consistent with the LOCUS results, which introduces confusion and weakens the manuscript's coherence. While I understand the authors' intention to assess the ecological validity of their model, this section does not effectively strengthen the manuscript and may be better removed or placed in the Supplementary Materials.

      (2) Model fitting across age groups (p. 9).

      It is unclear whether it is appropriate to fit healthy young and healthy elderly participants' data to the same model simultaneously. If the goal of the model fitting is to account for behavioral performance across all conditions, combining these groups may be problematic, as the groups differ significantly in overall performance despite showing similar remapping costs. This suggests that model performance might differ meaningfully between age groups. For example, in Figure 4A, participants 22-42 (presumably the elderly group) show the best fit for the Dual (Saccade) model, implying that the Interference component may contribute less to explaining elderly performance.

      Furthermore, although the most complex model emerges as the best-fitting model, the manuscript should explain how model complexity is penalized or balanced in the model comparison procedure. Additionally, are Fixation Decay and Saccade Update necessarily alternative mechanisms? Could both contribute simultaneously to spatial memory representation? A model that includes both mechanisms-e.g., Dual (Fixation) + Dual (Saccade) + Interference-could be tested to determine whether it outperforms Model 7 to rule out the sole contribution of complexity.

      Minor point: On p. 9, line 336, Figure 4A does not appear to include the red dashed vertical line that is mentioned as separating the age groups.

      (3) Clarification of conceptual terminology.

      Some conceptual distinctions are unclear. For example, the relationship between "retinal memory" and "transsaccadic memory," as well as between "allocentric map" and "retinotopic representation," is not fully explained. Are these constructs related or distinct? Additionally, the manuscript uses terms such as "allocentric map," "retinotopic representation," and "reference frame" interchangeably, which creates ambiguity. It would be helpful for the authors to clarify the relationships among these terms and apply them consistently.

      (4) Rationale for the selective disruption hypothesis (p. 4, lines 153-154).

      The authors hypothesize that "saccades would selectively disrupt location memory while leaving colour memory intact." Providing theoretical or empirical justification for this prediction would strengthen the argument.

      (5) Relationship between saccade cost and individual memory performance (p. 4, last paragraph).

      The authors report that larger saccades were associated with greater spatial memory disruption. It would be informative to examine whether individual differences in the magnitude of saccade cost correlate with participants' overall/baseline memory performance (e.g. their memory precision in the no-saccade condition). Such analyses might offer insights into how memory capacity/ability relates to resilience against saccade-induced updating.

      (6) Model fitting for the healthy elderly group to reveal memory-deficit factors (pp. 11-12).

      The manuscript discusses model-based insights into components that contribute to spatial memory deficits in AD and PD, but does not discuss components that contribute to spatial memory deficits in the healthy elderly group. Given that the EC group also shows impairments in certain parameters, explaining and discussing these outcomes of the EC group could provide additional insights into age-related memory decline, which would strengthen the study's broader conclusions.

      (7) Presentation of saccade conditions in Figure 5 (p. 11).

      In Figure 5, it may be clearer to group the four saccade conditions together within each patient group. Since the main point is that saccadic interference on spatial memory remains robust across patient groups, grouping conditions by patient type rather than intermixing conditions would emphasize this interpretation.

    3. Reviewer #2 (Public review):

      Summary:

      Zhao et al investigate how object location and colour are degraded across saccadic eye movements. They employ an eye-tracking task that requires participants to remember two sequentially presented items and subsequently report the colour and position of either one of these. Through counterbalancing of the presence or absence of saccades across items, the authors endeavour to dissect the impact of saccades independently on item location or colour. These behavioural findings form the basis of generative models designed to test competing, nested accounts of how stored information is stored and updated across saccades.

      Strengths:

      The combination of eye-tracking and generative modelling is a strength of the paper, which opens new perspectives into the impact of Alzheimer's and Parkinson's disease on the performance of visuospatial cognitive tests. The finding that the model parameters covary with clinical performance on the ROCF test is a nice example of a "computational assay" of disease.

      Weaknesses:

      I have a number of substantial and minor concerns for the authors to consider in a revision:

    4. Reviewer #3 (Public review):

      Summary:

      The manuscript introduces a visual paradigm aimed at studying trans-saccadic memory.

      The authors observe how memory of object location is selectively impaired across eye movements, whereas object colour memory is relatively immune to intervening eye movements. Results are reported for young and elderly healthy controls, as well as PD and AD participants.

      A computational model is introduced to account for these results, indicating how early differences in memory encoding and decay (but not trans-saccadic updating per se) can account for the observed differences between healthy controls and clinical groups.

      Strengths:

      The data presented encompasses healthy and elderly controls, as well as clinical groups.

      The authors introduce an interesting modelling strategy, aimed at isolating and identifying the main components behind the observed pattern of results.

      Weaknesses:

      The models tested differ in terms of the number of parameters. In general, a larger number of parameters leads to a better goodness of fit. It is not clear how the difference in the number of parameters between the models was taken into account.

      It is not clear whether the modelling results could be influenced by overfitting (it is not clear how well the model can generalize to new observations).

      Results specificity: it is not clear how specific the modelling results are with respect to constructional ability (measured via the Rey-Osterrieth Complex Figure test). As with any cognitive test, performance can also be influenced by general, non-specific abilities that contribute broadly to test success.

    5. Author response:

      (1) About ROCF figure-copy results

      Reviewer #1 queried the necessity of including the Rey-Osterrieth Complex Figure (ROCF) results in the main text. We appreciate the reviewer’s perspective on the narrative flow and the transition between the LOCUS paradigm and the ROCF results. However, we remain keen to retain these findings in the main text, as they provide critical ecological and clinical validation for the computational mechanisms identified in our study.

      We argue that the following points support the retention of these results:

      (1)  The ROCF we used is a standard neuropsychological tool for identifying constructional apraxia. Our results bridge the gap between basic cognitive neuroscience and clinical application by demonstrating that specific remapping parameters—rather than general memory precision—predict real-world deficits in patients.

      (2)  The finding that our winning model explains approximately 62% of the variance in ROCF copy scores across all diagnostic groups further indicates that these parameters from the LOCUS task represent core computational phenotypes that underpin complex, real-life visuospatial construction (copying drawings).

      (3)  Previous research has often observed only a weak or indirect link between drawing ability and traditional working memory measures, such as digit span  (Senese et al., 2020). This was previously attributed to “deictic” strategies—like frequent eye movements—that minimise the need to hold large amounts of information in memory (Ballard et al., 1995; Cohen, 2005; Draschkow et al., 2021). While our study was not exclusively designed to catalogue all cognitive contributions to drawing, our findings provide significant and novel evidence indicating that transsaccadic integration is a critical driver of constructional (copying drawing) ability. By demonstrating this link, we offer a new direction for future research, shifting the focus from general memory capacity toward the precision of spatial updating across eye movements.

      By including the ROCF results in the main text, we provide evidence for a functional role for spatial remapping that extends beyond perceptual stability into the domain of complex visuomotor control. We will expand on these points in the Discussion in our revised manuscript.

      (2) Model complexity and overfitting

      We would like to clarify that the Bayesian model selection (BMS) procedure utilised in this manuscript inherently balances model fit with parsimony. Unlike maximum likelihood inference, where overfitting is a primary concern often requiring cross-validation via out-of-sample prediction, our approach depends upon the comparison of marginal likelihoods. This method directly penalises model complexity — a principle often described as the “Bayesian Occam’s Razor” (Rasmussen and Ghahramani, 2000). This means that a model is only favoured if the improvement in fit justifies the additional parameter space. If a parameter were redundant, it would lower the model's evidence by “diluting” the probability mass over the parameter space. The emergence of the “Dual (Saccade) + Interference” model as the winning candidate suggests it offers the most plausible generative account of the data while maintaining necessary parsimony. We would be happy to point toward literature that discusses how these marginal likelihood approximations provide a more robust guard against overfitting than standard metrics like BIC or AIC (MacKay, 2003; Murray and Ghahramani, 2005; Penny, 2012).
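
      As a minimal illustration of this principle, consider a toy coin-flip comparison. This sketch is not the models or the marginal-likelihood approximation used in the manuscript, and the function names are hypothetical. A zero-parameter model fixes the heads probability at 0.5; a one-parameter model places a uniform prior on it. The marginal likelihood of each model is the likelihood averaged over its prior, so the flexible model is only favoured when the data clearly call for the extra parameter.

```python
from math import factorial


def evidence_fixed(heads: int, tails: int, theta: float = 0.5) -> float:
    """Marginal likelihood of a zero-parameter model with theta fixed."""
    return theta ** heads * (1.0 - theta) ** tails


def evidence_uniform(heads: int, tails: int) -> float:
    """Marginal likelihood with a uniform prior on theta.

    Integrating theta^h * (1 - theta)^t over [0, 1] gives the Beta function
    B(h + 1, t + 1) = h! * t! / (h + t + 1)!.
    """
    return factorial(heads) * factorial(tails) / factorial(heads + tails + 1)


def bayes_factor(heads: int, tails: int) -> float:
    """Evidence ratio: flexible model over fixed model."""
    return evidence_uniform(heads, tails) / evidence_fixed(heads, tails)


# Balanced data: the extra parameter is not needed, so the simpler model wins.
print("5 heads, 5 tails:", bayes_factor(5, 5))   # ratio below 1

# Strongly biased data: the improvement in fit justifies the extra parameter.
print("9 heads, 1 tail: ", bayes_factor(9, 1))   # ratio above 1
```

      With balanced data the flexible model spreads its prior mass over many values of theta that fit poorly, which lowers its evidence relative to the fixed model. This is the "dilution" effect described above.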

      (3) On model fitting across age groups

      This approach is primarily supported by our empirical findings: there was no significant interaction between age group and saccade condition for either location or colour memory. While older adults demonstrated lower baseline precision, the specific disruptive effect of saccades (the “saccade cost”) was remarkably consistent across cohorts. This justifies the use of a common generative model to assess quantitative differences in parameter estimates.

      This approach does implicitly assume that participants perform the task in a qualitatively similar way. However, this assumption is mitigated by the fact that our winning model nests simpler models as special cases, which supports the assessment of group differences in parameters that play consistent mechanistic roles. This flexibility allows the model to naturally accommodate groups where certain components—such as interference—may play a reduced role, while remaining sensitive to the specific mechanistic failures that differentiate healthy aging from neurodegeneration.

      (4) Conceptual terminology and patient group descriptions

      We will clarify our conceptual terminology, explicitly defining the relationships between retinotopic (eye-centred), transsaccadic (across-saccade), and spatiotopic (world-centred) representations.

      Regarding the demographics of the clinical cohorts, we apologise for any lack of clarity in our initial presentation. The patient demographics for both the Parkinson’s disease (PD) and Alzheimer’s disease (AD) groups—including age, gender, education, and ACE-III scores—are currently detailed alongside the healthy control data (two groups: Young Healthy Controls and Elderly Healthy Controls) in the table within the Participants section of the Materials and Methods. In our revision, we will ensure that this table is correctly labelled as Table 2 and will provide more comprehensive recruitment and characterisation details for both patient groups within the main text. Finally, we will include a detailed discussion in the Supplementary Materials regarding eye-tracking data quality across all cohorts, specifically comparing calibration accuracy, trace stability, and trial rejection rates to demonstrate that our findings are not confounded by differences in recording quality between healthy and clinical populations.

      References

      Ballard DH, Hayhoe MM, Pelz JB. 1995. Memory Representations in Natural Tasks. Journal of Cognitive Neuroscience 7:66–80. DOI: https://doi.org/10.1162/jocn.1995.7.1.66

      Cohen DJ. 2005. Look little, look often: The influence of gaze frequency on drawing accuracy. Perception & Psychophysics 67:997–1009. DOI: https://doi.org/10.3758/BF03193626

      Draschkow D, Kallmayer M, Nobre AC. 2021. When Natural Behavior Engages Working Memory. Current Biology 31:869-874.e5. DOI: https://doi.org/10.1016/j.cub.2020.11.013, PMID: 33278355

      MacKay DJC. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press.

      Murray I, Ghahramani Z. 2005. A note on the evidence and Bayesian Occam’s razor (Technical report No. GCNU TR 2005-003). Gatsby Unit.

      Penny WD. 2012. Comparing Dynamic Causal Models using AIC, BIC and Free Energy. Neuroimage 59:319–330. DOI: https://doi.org/10.1016/j.neuroimage.2011.07.039, PMID: 21864690

      Rasmussen C, Ghahramani Z. 2000. Occam’s Razor. Advances in Neural Information Processing Systems. MIT Press.

      Senese VP, Zappullo I, Baiano C, Zoccolotti P, Monaco M, Conson M. 2020. Identifying neuropsychological predictors of drawing skills in elementary school children. Child Neuropsychology 26:345–361. DOI: https://doi.org/10.1080/09297049.2019.1651834, PMID: 31390949

    1. eLife Assessment

      This important study presents the rational redesign and engineering of interleukin-7. The data from the integrated approach of using computational, biophysical, and cellular experiments are convincing. This paper is broadly relevant to those studying immunomodulation using biologics.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript describes the use of computational tools to design a mimetic of the interleukin-7 (IL-7) cytokine with superior stability and receptor binding activity compared to the naturally occurring molecule. The authors focused their engineering efforts on the loop regions to preserve receptor interfaces while remediating structural irregularities that destabilize the protein. They demonstrated the enhanced thermostability, production yield, and bioactivity of the resulting molecule through biophysical and functional studies. Overall, the manuscript is well written, novel, and of high interest to the fields of molecular engineering, immunology, biophysics, and protein therapeutic design. The experimental methodologies used are convincing; however, the article would benefit from more quantitative comparisons of bioactivity through titrations.

      Comments on revisions:

      All comments have been sufficiently addressed, with the exception of comment 24 from Reviewer 1. The authors need to modify the manuscript abstract, introduction, and/or discussion to clarify which limitations of IL-7 were addressed by their molecule and to note the limitations of their approach in terms of mitigating toxicity or enhancing half-life.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This manuscript describes the use of computational tools to design a mimetic of the interleukin-7 (IL-7) cytokine with superior stability and receptor binding activity compared to the naturally occurring molecule. The authors focused their engineering efforts on the loop regions to preserve receptor interfaces while remediating structural irregularities that destabilize the protein. They demonstrated the enhanced thermostability, production yield, and bioactivity of the resulting molecule through biophysical and functional studies. Overall, the manuscript is well written, novel, and of high interest to the fields of molecular engineering, immunology, biophysics, and protein therapeutic design. The experimental methodologies used are convincing; however, the article would benefit from more quantitative comparisons of bioactivity through titrations.

      Reviewer #2 (Public review):

      Summary:

      This manuscript presents the computational design and experimental validation of Neo-7, an engineered variant of interleukin-7 (IL-7) with improved folding efficiency, expression yield, and therapeutic activity. The authors employed a rational protein design approach using Rosetta loop remodeling to reconnect IL-7's functional helices through shorter, more efficient loops, resulting in a protein with superior stability and binding affinity compared to wild-type IL-7. The work demonstrates promising translational potential for cancer immunotherapy applications.

      Strengths:

      (1) The integration of Rosetta loop remodeling with AlphaFold validation represents an established computational pipeline for rational protein design. The iterative refinement process, using both single-sequence and multimer AlphaFold predictions, is methodologically sound.

      (2) The authors provide thorough characterization across multiple platforms (yeast display, bacterial expression, mammalian cell expression) and assays (binding kinetics, thermostability, bioactivity), strengthening the robustness of their findings.

      (3) The identification of the critical helix 1 kink stabilized by disulfide bonding and its recreation through G4C/L96C mutations demonstrates deep structural understanding and successful problem-solving.

      (4) The MC38 tumor model results show clear therapeutic advantages of Neo-7 variants, with compelling immune profiling data supporting CD8+ T cell-mediated anti-tumor mechanisms.

      (5) The transcriptomic profiling provides valuable mechanistic insights into T cell activation states and suggests reduced exhaustion markers, which are clinically relevant.

      Weaknesses:

      (1) While computational predictions are extensive, the manuscript lacks experimental structural validation of the designed Neo-7 variants. The term "Structural Validation" should not be used in the header.

      We thank the reviewer for this constructive comment. To better reflect the work conducted, we have revised the section title from “Structural Validation of Neo-7 in AlphaFold single sequence mode” to “Structural Modeling of Neo-7 in AlphaFold single sequence mode.” This change clarifies that our study employed in silico modeling approaches rather than experimental structural validation.

      We thank the reviewer for this insightful comment. We speculate that the slower off-rate observed for Neo-7 variants is primarily attributable to their enhanced structural stability, which promotes the formation of a more stable cytokine–receptor complex. This is consistent with prior observations in other engineered cytokines, such as IL-2 mimetics (Neo-2/15).

      In terms of biological consequences, we believe the slower off-rate is unlikely to result in signaling bias or qualitatively distinct pathways for several reasons:

      IL-7’s mechanism of action is inherently regulated to prevent over-signaling. T cells downregulate IL7R-α expression upon IL-7 stimulation, ensuring a built-in negative feedback mechanism.

      IL-7 signaling is dominated by STAT5 activation, without the signaling plasticity observed in cytokines like IL-21 or IL-22, which can bias toward STAT1/3 and drive divergent functional outcomes.

      Our RNA-seq data support this interpretation, as Neo-7–treated CD8⁺ T cells exhibited transcriptional profiles highly similar to those induced by WT-IL-7, with the difference being an enhanced magnitude of response rather than novel pathway engagement.

      Taken together, we infer that the slower off-rate of Neo-7 enhances the potency and durability of IL-7 signaling without altering its downstream specificity, thereby strengthening the magnitude of immune responses while maintaining the canonical STAT5-driven biology of IL-7.

      (3) While computational immunogenicity prediction is provided, these methods are very limited.

      We fully agree with the reviewer that current in silico immunogenicity prediction tools are limited and cannot be considered definitive. Indeed, to date, none of these algorithms has demonstrated a strong correlation with clinical immunogenicity outcomes of biologics. For example, the presence of anti-drug antibodies (ADA) in murine or non-human primate models often does not translate into ADA induction in human clinical trials. This disconnect underscores the inherent challenges of predicting immunogenicity based solely on computational or preclinical models.

      Our strategy to mitigate potential immunogenicity was therefore not to rely exclusively on prediction software, but instead to apply a conservative design principle: preserving the vast majority of the parental IL-7 sequence while introducing only the minimal number of amino acid substitutions required to achieve our engineering objectives. By maintaining sequence continuity with the native cytokine, we aim to minimize the risk of introducing novel epitopes while improving stability and developability. We acknowledge that definitive immunogenicity assessment can only be addressed in future clinical studies.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Specific Points:

      (1) The authors should describe the molecular composition of CYT-107.

      We thank the reviewer for this suggestion and have added clarification regarding the molecular composition of CYT-107. CYT-107 is a recombinant form of wild-type human interleukin-7 (IL-7) expressed in eukaryotic cells, which introduces N-linked glycosylation modifications to the protein. As a glycosylated recombinant IL-7, CYT-107 more closely mimics the natural human cytokine compared to bacterial expression systems that produce non-glycosylated IL-7. This feature contributes to its stability and bioavailability in clinical applications.

      (Reference: U.S. National Center for Advancing Translational Sciences, GSRS record for IL-7, https://gsrs.ncats.nih.gov/ginas/app/ui/substances/46bd8013-1e2d-4b6e-afcf-340f447e8710)

      (2) The authors should indicate the receptor layout for IL-7 in the introduction and indicate available structural data. Also, in line 93, the authors should indicate that IL-7Ra is one subunit of the heterodimeric receptor complex.

      We thank the reviewer for this insightful suggestion. However, due to page limitations, we have chosen to orient the introduction around the design rationale, computational workflow, and biological functionality of IL-7. To address the reviewer’s point while maintaining brevity, we have now included a concise description of the IL-7 receptor layout and its available structural data in the main text. Specifically, in line 93 we revised the sentence to read: “We began by examining the crystal structure of IL-7 bound to its receptor, IL7R-α (interleukin-7 receptor alpha; PDB ID: 3DI2), which recruits IL-2Rγ to form a heterodimeric receptor complex essential for downstream signaling.”

      (3) The abbreviation IL-7Ra should be defined at first use.

      We thank the reviewer for the comment. The abbreviation has now been defined at its first appearance in the manuscript. Specifically, at Line 93 we revised the sentence as follows:

      “We began by examining the crystal structure of IL-7 bound to its receptor, IL7R-α (interleukin-7 receptor alpha; PDB ID: 3DI2), which recruits IL-2Rγ to form a heterodimeric receptor complex essential for downstream signaling.”

      (4) The authors need to clarify whether the human or murine IL-7Ra is being used in each experiment mentioned in the results text.

      We thank the reviewer for this important point. We have now specified in the main text and corresponding subsection titles whether human or murine IL-7Rα was used in each experiment.

      (5) The authors sometimes use a dash in IL7Ra and IL2Rg and sometimes do not. This should be standardized.

      We appreciate the reviewer’s observation. We have standardized the terminology throughout the manuscript to “IL7Rα” and “IL2Rγ” to maintain consistency.

      (6) In Figure 3E, the authors left out the v in "Neo7-LDv1".

      We have corrected the omission of “v” and updated the label to read Neo7-LDv1.

      (7) In Figure 3E, the authors must indicate in the bottom row that they are visualizing sequential binding to IL-2Rg following incubation with IL-7Ra. This should be stated in the results text and the figure caption as well.

      We have revised the results text and figure caption to clearly state that the bottom row illustrates sequential binding to IL-2Rγ following incubation with IL-7Rα.

      “for detection of IL-2Rγ binding, yeast cells were first incubated with recombinant IL-7Rα, washed, and subsequently incubated with IL-2Rγ”

      (8) In Figure 3E, "IL-7Rg" should be corrected to "IL-2Rg".

      We have corrected “IL-7Rγ” to “IL-2Rγ” in Figure 3E for accuracy and consistency.

      (9) In line 140, the authors claim that Neo7-LDv1 is partially folded based on the binding to the heterodimeric receptor complex. However, the data are insufficient to support this conclusion.

      We understand the reviewer’s concern and have rephrased the sentence for clarity: “A degree of binding to IL2Rγ was detected, possibly reflecting partial folding of the displayed protein in the yeast display platform.” While we do not claim the protein to be fully or uniformly folded, this deduction is supported by the yeast display data and further corroborated by AlphaFold structural predictions.

      (10) In lines 185-186, the authors claim that the binding affinity for IL-2Rg is improved, but this is not shown in Figure 3, which looks only at a single concentration and shows comparable binding between WT-IL7 and Neo7-LDv2.

      We thank the reviewer for this valuable observation. Our original wording was ambiguous and may have implied a direct comparison with WT-IL7, which was not intended. The sentence was meant to highlight that within the Neo-7 variant series, Neo7-LDv2 displayed stronger binding to both IL-7Rα and IL-2Rγ compared to other Neo-7 variants. To avoid misinterpretation, we have revised the text as follows:

      “Importantly, the enhanced binding affinity towards IL7Rα also led to improved binding towards the common IL2Rγ, relative to other variants in the Neo-7 series.”

      (11) Lines 202-203 appear to be an error.

      We thank the reviewer for pointing this out. The lines in question were indeed an error and have now been removed from the manuscript.

      (12) In yeast display validation, negative controls showing binding to the fluorescent antibody only and an irrelevant control protein should be shown for all constructs in order to evaluate nonspecific interactions.

      We agree with the reviewer that appropriate negative controls are important to validate specificity. To address this, we will include yeast display data with negative controls, namely native yeast (EBY100) stained with the corresponding fluorescent antibody, in the Supplementary Information. This addition will provide clearer validation of binding specificity and reduce concerns regarding nonspecific interactions.

      (13) For yeast display studies, titrations rather than single concentrations should be used to compare constructs (Figures 3 and 4). The claim that any of the constructs has a higher affinity than any other construct must be supported by performing titrations.

      We thank the reviewer for this comment. We respectfully note that yeast display titrations provide relative rather than absolute estimates of binding affinity. In our study, constructs were compared under identical antigen concentrations, where the observed fluorescence intensity reflected their relative binding strength. These yeast display results served as an initial screening strategy, which we subsequently validated using surface plasmon resonance (SPR). SPR provided quantitative binding parameters and confirmed the binding differences observed in yeast display. Thus, while yeast titrations were not performed, the combination of side-by-side yeast display comparisons and orthogonal validation by SPR supports our affinity claims with both qualitative and quantitative evidence.

      (14) The acronym SPR needs to be defined, and the authors should mention that this technique was used for quantitative binding studies in line 259.

      We thank the reviewer for this suggestion. The acronym has now been defined in the main text at its first use, and we have clarified its role in the study. The revised text reads:

      “We then characterized the binding affinities of Neo-7 variants to mouse IL-7 receptor alpha (mIL-7Rα) in a quantitative manner using surface plasmon resonance (SPR).”

      (15) A titration of 2E8 cell proliferation versus concentration should be presented for IL-7 versus Neo-7 variants to directly compare EC50 values and make claims regarding potency in Figure 5H. Also, the authors should clarify whether a proliferation or viability assay was performed.

      We thank the reviewer for the helpful comment regarding the use of EC₅₀ values when discussing potency. In response, we have revised the manuscript to avoid overinterpreting the data. Specifically, we replaced the term potency with ability to stimulate, as the 2E8 cell assay was designed to validate whether receptor binding by IL-7 and Neo-7 variants translates into biological function—namely, supporting immune cell viability and proliferation under limiting cytokine conditions. The assay was not optimized to determine formal EC₅₀ values, but rather to demonstrate functional activity consistent with IL-7 receptor engagement.

      We have also clarified in the text that the experiment was a proliferation assay, with cell viability assessed as part of the readout. This revision better reflects the scope of the assay while aligning our claims with the data presented.

      (16) Isotype control is not an appropriate name for the Fc-Only construct. This should be denoted as Fc Only.

      We thank the reviewer for this comment. We have revised the terminology throughout the manuscript, changing isotype control to Fc control.

      (17) A titration of mouse splenocyte proliferation versus concentration should be presented for IL-7 versus Neo-7 variants to directly compare EC50 values and make claims regarding potency in Figure 6.

      We thank the reviewer for this insightful suggestion regarding EC₅₀ analysis. In this study, the splenocyte proliferation assay was designed as a preliminary in vitro screen to confirm the biological activity of Neo-7 variants relative to wild-type IL-7 prior to in vivo testing. The assay was not optimized for quantitative potency determination, but rather to provide an initial functional validation of the constructs. We have therefore revised the manuscript wording to avoid overinterpreting the data and refrained from making claims regarding EC₅₀-based potency. Instead, we emphasize that the in vivo tumor model provides a more physiologically relevant and rigorous platform for assessing cytokine functionality, including proliferation and immunomodulation.

      (18) The legends in Figure 6 should indicate the colors used for each construct.

      We thank the reviewer for pointing this out. We have revised the legend for Figure 6 to include the color codes corresponding to each construct.

      (19) Metabolism should be singular in lines 433 and 435.

      We have corrected the wording so that “metabolism” is consistently used in the singular form.

      (20) In Figure 8D, "cycling" should be changed to "cycle".

      The word “cycling” has been corrected to “cycle” in Figure 8D.

      (21) The treatments need to be indicated in Figure 8D. Also, a color scale is needed.

      We agree with the reviewer, and a color scale description has now been included in the Figure legend to aid interpretation: “The gene expression heatmap is derived from Z-scores calculated from the RNA sequencing data, with expression levels color-coded from high (red) to low (blue).”
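
      For readers unfamiliar with Z-scored heatmaps, the scaling can be sketched as follows. This is a minimal illustration, assuming each gene is standardized across samples using the population standard deviation; the response does not state these details, and the gene names and expression values below are hypothetical rather than taken from the study.

```python
from statistics import mean, pstdev


def row_zscores(expression):
    """Z-score each gene (row) across its samples.

    `expression` maps a gene name to a list of normalized expression
    values, one per sample. Per-gene scaling is an assumption here.
    """
    scored = {}
    for gene, values in expression.items():
        mu = mean(values)
        sigma = pstdev(values)  # population SD; a sample SD is an equally plausible choice
        # A gene with no variation across samples carries no contrast, so map it to zeros.
        scored[gene] = [0.0 if sigma == 0 else (v - mu) / sigma for v in values]
    return scored


# Hypothetical values for two genes measured in three samples.
example = {
    "gene_a": [2.0, 4.0, 6.0],
    "gene_b": [5.0, 5.0, 5.0],
}
zscores = row_zscores(example)
```

      After this scaling, positive values mark samples in which a gene is expressed above its own mean (shown in red in the legend quoted above) and negative values mark samples below it (blue), so colors are comparable across genes with very different absolute expression levels.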

      (22) More comparisons between RNASeq data for Fc-WTIL7 versus Fc-Neo7 (Figure 8) should be presented in the results section.

      We thank the reviewer for this suggestion. Due to space limitations in the main manuscript, we are unable to include an expanded description of all RNA-Seq comparisons. However, we will provide a more detailed analysis of Fc-WT-IL7 versus Fc-Neo7 in the supplementary section, including expanded differential gene expression comparisons and pathway enrichment analyses. This will allow readers to fully appreciate the differences while maintaining focus in the main text.

      (23) The strikethrough in line 464 needs to be corrected.

      We have corrected the strikethrough error in line 464.

      (24) It is unclear how stabilizing IL-7 improves its toxicity or half-life. The authors should indicate more clearly which limitations of IL-7 were addressed by their molecule in the abstract, introduction, and discussion.

      Native IL-7 demonstrates an excellent safety profile but faces two major challenges in clinical application: (1) short plasma half-life and (2) suboptimal developability due to poor stability. The short half-life is typically addressed through Fc-fusion strategies, which extend systemic exposure via FcRn recycling. However, wild-type IL-7 exhibits a strong aggregation tendency when fused to Fc, rendering the fusion protein poorly developable. By redesigning IL-7 into the more stable Neo-7 format, we substantially improved the folding efficiency and purity of the Fc-fusion protein after affinity purification, thereby enabling its advancement as a recombinant biologic candidate.

      We do not intend to claim that increased stability directly reduces in vivo toxicity. The favorable safety profile of IL-7 arises primarily from its intrinsic biology (mechanism of action and downstream signaling), rather than from its structural stability. That said, improved stability and reduced aggregation propensity could potentially lower the immunogenicity risk of protein biologics. Nevertheless, there are currently no validated in vitro or in vivo assays that reliably correlate protein stability or aggregation with clinical immunogenicity outcomes.

      (25) The acronym MSA needs to be defined.

      We have defined the acronym MSA (Multiple Sequence Alignment) on page 7, line 142.

      (26) The acronym CPD needs to be defined.

      We have defined the acronym CPD (Computational Protein Design) on page 23, line 468.

      Reviewer #2 (Recommendations for the authors):

      Any experimental structural data would be good to have.

      We plan to pursue X-ray crystallography of Neo-7 in future studies to obtain high-resolution structural confirmation. However, we emphasize that such experiments require significant time and resources, and the results would not alter the biological claims made in this study. Our focus here is to demonstrate that with recent advances in in silico protein structure prediction algorithms, such as AlphaFold2, it is now feasible to redesign therapeutic proteins with sufficient accuracy to achieve improved developability and biological performance. This study highlights how computational approaches can streamline protein drug engineering, reducing reliance on labor-intensive structural studies during the early stages of therapeutic development.

      Please add details of how the changed kinetics might affect downstream pathways.

      We appreciate the reviewer’s suggestion to elaborate on the biological implications of the altered binding kinetics.

      Our data show that Neo-7 variants display a slower off-rate relative to WT-IL-7, which likely reflects enhanced stabilization of the cytokine–receptor complex. In principle, this could prolong receptor occupancy and modestly extend downstream signaling duration. However, several biological features of IL-7 constrain the risk of excessive or aberrant signaling:

      Receptor Regulation: IL-7 signaling induces rapid downregulation of IL7Rα on T cells, serving as a feedback mechanism to prevent sustained or uncontrolled activation. This "hardwired" receptor regulation reduces the likelihood that a slower off-rate translates into pathological over-signaling.

      Pathway Specificity: IL-7 primarily signals through the JAK/STAT5 axis, with little evidence of signaling bias. Unlike other cytokines (e.g., IL-21, IL-22) that can activate STAT1 or STAT3 and drive distinct functional outcomes, IL-7’s pathway specificity minimizes concerns about altered signaling directionality.

      Transcriptional Evidence: Our RNA-seq analysis further supports this, showing that Neo-7 and WT-IL-7 activate similar transcriptional programs. The differences we observed were in the magnitude of response, not in the qualitative nature of the pathways engaged. This suggests that Neo-7 variants enhance the intensity of canonical IL-7 signaling rather than redirecting it toward alternative or unintended pathways.

      Together, these findings support the interpretation that the slower off-rate of Neo-7 variants likely contributes to stronger or more sustained activation of IL-7’s canonical STAT5 pathway, while intrinsic regulatory mechanisms and pathway fidelity safeguard against inappropriate signaling outcomes.
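
      For readers less familiar with binding kinetics, the link between off-rate, affinity, and complex lifetime can be written out explicitly. This is a minimal sketch that assumes a simple 1:1 binding model, which is an assumption made for illustration and not necessarily the model fitted to the SPR data. Here $k_{\mathrm{on}}$ and $k_{\mathrm{off}}$ denote the association and dissociation rate constants, $K_D$ the equilibrium dissociation constant, and $t_{1/2}$ the half-life of the cytokine–receptor complex.

```latex
K_D = \frac{k_{\mathrm{off}}}{k_{\mathrm{on}}},
\qquad
t_{1/2} = \frac{\ln 2}{k_{\mathrm{off}}}
```

      Under this model, a lower $k_{\mathrm{off}}$ at an unchanged $k_{\mathrm{on}}$ gives a lower $K_D$ (tighter binding) and a proportionally longer complex half-life, which is the sense in which a slower off-rate could prolong receptor occupancy.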

      Minor:

      (1) The Figure 3 text is hard to read.

      We acknowledge the reviewer’s concern regarding the readability of Figure 3. In the revised manuscript, we will provide a higher-resolution version of the figure to ensure that all labels and text are clearly visible upon magnification.

      (2) The manuscript switches between "Neo-7" and "Neo7" .

      We agree with the reviewer’s observation. To maintain consistency throughout the manuscript, all references have been standardized to Neo-7.

    1. eLife Assessment

      This study addresses a key question in developmental cognitive neuroscience by identifying early neural correlates of variability in language learning and showing how syllable tracking and word segmentation develop from birth to two years in infants with differing likelihoods of autism. The evidence is generally strong, with rigorous longitudinal EEG acquisition, careful preprocessing, and validated statistical approaches, though several methodological clarifications would further strengthen confidence in the inferences. Overall, the findings offer important insights with clear theoretical implications for understanding early mechanisms of speech perception and statistical learning, supported by convincing evidence.

    2. Reviewer #1 (Public review):

      Summary:

      This manuscript reports a prospective longitudinal study examining whether infants with high likelihood (HL) for autism differ from low-likelihood (LL) infants in two levels of word learning: brain-to-speech cortical entrainment and implicit word segmentation. The authors report reduced syllable tracking and post-learning word recognition in the HL group relative to the LL group. Importantly, both the syllable-tracking entrainment measure and the word recognition ERP measure are positively associated with verbal outcomes at 18-20 months, as indexed by the Mullen Verbal Developmental Quotient. Overall, I found this to be a thoughtfully designed and carefully executed study that tackles a difficult and important set of questions. With some clarifications and modest additional analyses or discussion on the points below, the manuscript has strong potential to make a substantial contribution to the literature on early language development and autism.

      Strengths:

      This is an important study that addresses a central question in developmental cognitive neuroscience: what mechanisms underlie variability in language learning, and what are the early neural correlates of these individual differences? While language development has a relatively well-defined sensitive period in typical development, the mechanisms of variability - particularly in the context of neurodevelopmental conditions - remain poorly understood, in part because longitudinal work in very young infants and toddlers is rare. The present study makes a valuable contribution by directly targeting this gap and by grounding the work in a strong theoretical tradition on statistical learning as a foundational mechanism for early language acquisition.

      I especially appreciate the authors' meticulous approach to data quality and their clear, transparent description of the methods. The choice of partial least squares correlation (PLS-c) is well motivated, given the multidimensional nature of the data and collinearity among variables, and the manuscript does a commendable job explaining this technique to readers who may be less familiar with it.

      The results reveal interesting developmental changes in syllable tracking and word segmentation from birth to 2 years in both HL and LL infants. Simply mapping these trajectories in both groups is highly valuable. Moreover, the associations between neural indices of brain-to-speech entrainment and word segmentation with later verbal outcomes in the LL group support a critical role for speech perception and statistical learning in early language development, with clear implications for understanding autism. Overall, this is a rich dataset with substantial potential to inform theory.

      Weaknesses:

      (1) Clarifying longitudinal vs. concurrent associations

      Because the current analytical approach incorporates all time points, including the final visit, it is challenging to determine to what extent the brain-language associations are driven by longitudinal relationships vs. concurrent correlations at the last time point. This does not undermine the main findings, but clarifying this issue could significantly enhance the impact of the individual-differences results. If feasible, the authors might consider (a) showing that a model excluding the final visit still predicts verbal outcomes at the last visit in a similar way, or (b) more explicitly acknowledging in the discussion that the observed associations may be partly or largely driven by concurrent correlations. Either approach would help readers interpret the strength and nature of the longitudinal claims.

      (2) Incorporating sleep status into longitudinal models

      Sleep status changes systematically across developmental stages in this cohort. Given that some of the papers cited to justify the paradigm also note limitations in speech entrainment and word segmentation during sleep or in patients with impaired consciousness, it would be helpful to account for sleep more directly. Including sleep status as a factor or covariate in the longitudinal models, or at least elaborating more fully on its potential role and limitations, would further strengthen the conclusions and reassure readers that these effects are not primarily driven by differences in sleep-wake state.

      (3) Use of PLS-c and potential group × condition interactions

      I am relatively new to PLS-c. One question that arose is whether PLS-c could be extended to handle a two-way interaction between group and condition contrasts (STR vs. RND). If so, some of the more complex supplementary models testing developmental trajectories within each group (Page 8, Lines 258-265) might be more directly captured within a single, unified framework. Even a brief comment in the methods or discussion about the feasibility (or limitations) of modeling such interactions within PLS-c would be informative for readers and could streamline the analytic narrative.

      (4) STR-only analyses and the role of RND

      Page 8, Lines 241-245: This analysis is conducted only within the STR condition. The lack of group difference observed here appears consistent with the lack of group difference in word-level entrainment (Page 9, Lines 292-294), suggesting that HL and LL groups may not differ in statistical learning per se, but rather in syllabic-level entrainment. As a useful sanity check and potential extension, it might be informative to explore whether syllable-level entrainment in the RND condition differs between groups to a similar extent as in Figure 2C-D. In other work (e.g., adults vs. children; Moreau et al., 2022), group differences can be more pronounced for syllable-level than for word-level entrainment. Figure S6 seems to hint that a similar pattern may exist here. If feasible, including or briefly reporting such an analysis could help clarify the asymmetry between the two learning measures and further support the interpretation of syllabic-level differences.

      (5) Multi-speaker input and voice perception (Page 15, Lines 475-483)

      The multi-speaker nature of the speech input is an interesting and ecologically relevant feature of the design, but it does add interpretive complexity. The literature on voice perception in autism is still mixed: for example, Boucher et al. (2000) reported no differences in voice recognition and discrimination between children with autism and language-matched non-autistic peers, whereas behavioral work in autistic adults suggests atypical voice perception (e.g., Schelinski et al., 2016; Lin et al., 2015). I found the current interpretation in this paragraph somewhat difficult to follow, partly because the data do not directly test how HL and LL infants integrate or suppress voice information. I think the authors could strengthen this section by slightly softening and clarifying the claims.

      (6) Asymmetry between EEG learning measures

      Page 16, Lines 502-507 touch on the asymmetry between the two EEG learning measures but leave some questions for the reader. The presence of word recognition ERPs in the LL group suggests that a failure to suppress voice information during learning did not prevent successful word learning. At the same time, there is an interesting complementary pattern in the HL group, who show LL-like word-level entrainment but do not exhibit robust word recognition. Explicitly discussing this asymmetry - why HL infants might show relatively preserved word-level entrainment yet reduced word recognition ERPs, whereas LL infants show both - would enrich the theoretical contribution of the manuscript.

      References:

      (1) Moreau, C. N., Joanisse, M. F., Mulgrew, J., & Batterink, L. J. (2022). No statistical learning advantage in children over adults: Evidence from behaviour and neural entrainment. Developmental Cognitive Neuroscience, 57, 101154. https://doi.org/10.1016/j.dcn.2022.101154

      (2) Boucher, J., Lewis, V., & Collis, G. M. (2000). Voice processing abilities in children with autism, children with specific language impairments, and young typically developing children. Journal of Child Psychology and Psychiatry, 41(7), 847-857. https://doi.org/10.1111/1469-7610.00672

      (3) Schelinski, S., Borowiak, K., & von Kriegstein, K. (2016). Temporal voice areas exist in autism spectrum disorder but are dysfunctional for voice identity recognition. Social Cognitive and Affective Neuroscience, 11(11), 1812-1822. https://doi.org/10.1093/scan/nsw089

      (4) Lin, I.-F., Yamada, T., Komine, Y., Kato, N., Kato, M., & Kashino, M. (2015). Vocal identity recognition in autism spectrum disorder. PLOS ONE, 10(6), e0129451. https://doi.org/10.1371/journal.pone.0129451

    3. Reviewer #2 (Public review):

      Summary:

      This article looks at differences in how the brain entrains to, or tracks, the rhythmic presentation of syllables and words in speech in infants at increased likelihood versus low likelihood for autism. The authors first sought to characterize how brain responses are modulated by learning the statistical probability of a given syllable following the one before it over the first two years of life. They then sought to identify at which stages of word learning infants with increased likelihood of autism showed difficulties, and whether those difficulties worsened over time. Finally, they sought to indicate whether infants' statistical learning and word learning abilities could predict later verbal skills. The authors found similar developmental trajectories of neural entrainment to syllables in infants at high and low likelihood for autism, but infants at high likelihood for autism had overall weaker syllable-level entrainment. Infants at high versus low likelihood for autism showed different developmental trajectories for word entrainment. Lower syllable entrainment in high-likelihood infants corresponded with poorer verbal outcomes, but word entrainment was not associated with verbal outcomes. Event-related potential responses to words and part words, however, were positively associated with verbal outcomes, but only in low-likelihood infants.

      Strengths:

      Overall, the article provides rigorous statistical analysis of longitudinal EEG data to provide strong support for the claims that neural entrainment to syllable and word features of speech may be a useful marker for language development difficulties, particularly in infants at increased likelihood for neurodevelopmental disorders. The EEG data collection and preprocessing procedures are well within standards in the field. Readers should take care to note that the authors indexed neural entrainment to speech using phase-locking values instead of spectral power.
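
      To illustrate why the distinction between phase-locking values and spectral power matters, the following is a minimal sketch, not the authors' pipeline. It assumes one common definition of phase locking (inter-trial phase coherence at a single frequency) and uses simulated single-channel epochs; the function names, the 4 Hz rate, the sampling rate, and the amplitudes are all hypothetical choices made for illustration.

```python
import numpy as np


def _coeffs_at(epochs, sfreq, target_hz):
    """Complex Fourier coefficient of each trial at the bin nearest target_hz."""
    freqs = np.fft.rfftfreq(epochs.shape[1], d=1.0 / sfreq)
    k = int(np.argmin(np.abs(freqs - target_hz)))
    return np.fft.rfft(epochs, axis=1)[:, k]


def phase_locking_value(epochs, sfreq, target_hz):
    """Length of the mean unit phase vector across trials (0 = random, 1 = locked)."""
    c = _coeffs_at(epochs, sfreq, target_hz)
    return float(np.abs(np.mean(c / np.abs(c))))


def spectral_power(epochs, sfreq, target_hz):
    """Mean squared Fourier amplitude across trials; ignores phase consistency."""
    c = _coeffs_at(epochs, sfreq, target_hz)
    return float(np.mean(np.abs(c) ** 2))


# Simulated data (all values are illustrative assumptions).
rng = np.random.default_rng(0)
sfreq, n_trials, n_samples, rate_hz = 250.0, 200, 500, 4.0
t = np.arange(n_samples) / sfreq


def simulate(amplitude, phases):
    """Sinusoid at rate_hz with one phase per trial, plus weak Gaussian noise."""
    signal = amplitude * np.sin(2 * np.pi * rate_hz * t[None, :] + phases[:, None])
    return signal + 0.1 * rng.standard_normal(signal.shape)


# Weak but phase-consistent response vs. strong response with random phase.
locked = simulate(1.0, np.zeros(n_trials))
jittered = simulate(3.0, rng.uniform(0.0, 2.0 * np.pi, n_trials))

plv_locked = phase_locking_value(locked, sfreq, rate_hz)
plv_jittered = phase_locking_value(jittered, sfreq, rate_hz)
power_locked = spectral_power(locked, sfreq, rate_hz)
power_jittered = spectral_power(jittered, sfreq, rate_hz)

print(f"PLV:   locked={plv_locked:.2f}  jittered={plv_jittered:.2f}")
print(f"Power: locked={power_locked:.0f}  jittered={power_jittered:.0f}")
```

      In this toy setting the phase-consistent signal has the higher phase-locking value even though the phase-jittered signal carries more power at the same frequency, which is why the two measures should not be read interchangeably.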

      Weaknesses:

      While the statistical analyses are rigorous, a few of the components of the models are not clearly defined, and some corrections and thresholds for significance warrant further justification. Further, a few stimuli and participant details that could influence results are not specified. It is not clear whether all participants came from majority French-speaking families; differences in the amount of French language exposure (compared to other languages that may be spoken by a participant's family) could influence results. The standardized volume of the stimuli is also not included. As a result, readers can reasonably conclude that neural entrainment to speech features is likely a useful mechanism to explain differences in language development, but should treat this interpretation with some caution.

    1. eLife Assessment

      This is an important study showing that movement vigor is not solely an individual property but emerges through interaction when two people are physically linked. The evidence is convincing, supported by a well-controlled experimental design and modeling that closely match the observed behavior. While the authors provided a helpful comparison of several candidate models of human-human interaction dynamics, the statistical power and the statistical analyses could be further improved.

    2. Reviewer #1 (Public review):

      Summary:

      The authors present a novel investigation of the movement vigor of individuals completing a synchronous extension-flexion task. Participants were placed into groups of two (so-called "dyads") and asked to complete shared movements (connected via a virtual loaded spring) to targets placed at varying amplitudes. The authors attempted to quantify what, if any, adjustments in movement vigor individual participants made during the dyadic movements, given the combined or co-dependent nature of the task. This is a novel, timely question of interest within the broader field of human sensorimotor control.

      Participants from each dyad were labeled as "slow" (low vigor) or "fast" (high vigor), and their respective contributions to the combined movement metrics were assessed. The authors presented four candidate models for dyad interactions: (a) independent motor plans (i.e., co-activity hypothesis), (b) individual-led motor plans (i.e., leader-follower hypothesis), (c) generalization to a weighted average motor plan (i.e., weighted adaptation hypothesis), and (d) an uncertainty-based model of dynamic partner-partner interaction (i.e., interactive adaptation hypothesis). The final model allowed for dynamic changes in individual motor plans (and therefore, movement vigor) based on partner-partner interactions and observations. After detailed observations of interaction torque and movement duration (or vigor), the authors concluded that the interactive adaptation model provided the best explanation of human-human interaction during self-paced dyadic movements.

      Strengths:

      The experimental setup (simultaneous wrist extension-flexion movements) has been thoroughly vetted. The task was designed particularly well, with adequate block pseudo-randomization to ensure general validity of the results. The analyses of torque interaction, movement kinematics, and vigor are sound, as are the statistical measures used to assess significance. The authors structured the work via a helpful comparison of several candidate models of human-human interaction dynamics, and how well said models explained variance in the vigor of solo and combined movements. The research question is timely and extends current neuroscientific understanding of sensorimotor control, particularly in social contexts.

      Weaknesses:

      (1) My chief concern about the study as it currently stands is the relatively low number of data points (n=10). The authors recruited 20 participants, but the primary conclusions are based on dyad-specific interactions (i.e., analyses of "fast" vs "slow" participants in each pair). Some of these analyses would benefit greatly, in terms of power, from the addition of more data points.

      1a) The distribution of delta-vigor (Fast group vs Slow group) is highly skewed (see Figures 3D, S6D), with over half of the dyads exhibiting delta-vigor less than 0.2 (i.e., less than 20% of unit vigor). Given the relatively low number of dyads, it would be helpful for the authors to provide explicit listings of VigorFast, VigorSlow, and VigorCombined for each of the 10 separate dyads or pairings.

      1b) The authors concluded that the interactive adaptation hypothesis provided the best summary of the combined movement dynamics in the study. If this is indeed the case, then the relative degree of difference in vigor between the fast and slow participants in a dyad should matter. How well did the interactive adaptation model explain variance in the dyads with relatively low delta-vigor (e.g., less than 0.2) vs relatively high delta-vigor?

      (2) The authors shared the results of one analysis of reaction time, showing that the reaction times of the slow partners and the fast partners did not differ during the initial passive block. Did the authors observe any changes in RT of either the slow or fast partner during the combined (primary task) blocks (KL, KH, etc.)? If the pairs of participants did indeed employ a form of interactive adaptation, then it is certainly plausible that this interaction would manifest in the initial movement planning phase (i.e., RT) in addition to the vigor and smoothness of the movements themselves.

    3. Reviewer #2 (Public review):

      Summary:

      This study examines how individual movement vigor is integrated into a shared, dyadic vigor when two individuals are physically coupled. Participants performed wrist-reaching movements toward targets at different distances while mechanically linked via a virtual elastic band, and dyads were formed by pairing participants with different baseline vigor profiles. Under interaction conditions, movements converged to coordinated patterns that could not be explained by simple averaging, indicating that each dyad behaved as a single functional unit. Notably, under coupling, movement durations for both partners were shorter than in the solo condition, arguing against the view that each individual simply executed an independent movement plan. Furthermore, dyadic vigor was primarily predicted by the slower partner's vigor rather than by the faster partner's, suggesting that neither a leader-follower strategy nor a weighted averaging account fully explains the observed behavior. The authors propose a computational model in which both partners adapt to the emerging interaction dynamics ("interactive adaptation strategy"), providing a coherent explanation of the behavioral observations.

      Strengths:

      The study is carefully designed and addresses an important question about how individual movement vigor is integrated during joint action. The experimental paradigm allows systematic manipulation of interaction strength and partner asymmetry. The behavioral results show clear and robust patterns, particularly the shortening of movement durations under elastic coupling (KL and KH conditions) and the asymmetrical contribution of the slower partner's vigor to dyadic vigor. The computational model captures the main behavioral patterns well and provides a principled framework for interpreting dyadic vigor not as a simple combination of two independent motor plans, but as an emergent property arising from mutual adaptation. Conceptually, the study is notable in extending the notion of vigor from an individual attribute to a dyad-level construct, opening a new perspective on coordinated movement and motor decision-making.

      Weaknesses:

      A key conceptual issue concerns the apparent asymmetry between partners in the computational framework. While dyadic vigor is empirically better predicted by the slower partner's vigor, the model formulation appears to emphasize the faster partner's time-related cost and interaction forces. Although the cost function includes an uncertainty-related component associated with the slower partner, it remains unclear from the current formulation and description how dyadic vigor is formally derived from the slower partner's control policy within the same modeling framework. This raises an important question regarding whether the model offers a symmetric account of dyadic vigor formation for both partners or whether it is effectively anchored to the faster partner's control architecture.

      A second conceptual issue concerns the interpretation of the term "motor plan." It remains unclear whether this term refers primarily to movement-related characteristics such as speed or duration, or more broadly to the underlying optimization structure that governs these variables. This distinction is theoretically important, as it determines whether the reported interaction effects should be understood as adjustments in movement characteristics or as changes in the structure of the control policy itself.

    4. Reviewer #3 (Public review):

      Summary:

      This study provides novel insights into how individuals regulate the speed of their movements both alone and in pairs, highlighting consistent differences in movement vigor across people and showing that these differences can adapt in dyadic contexts. The findings are significant because they reveal stable individual patterns of action that are flexible when interacting with others, and they suggest that multiple factors, beyond reward sensitivity, may contribute to these idiosyncrasies. The evidence is generally strong, supported by careful behavioral measurements and appropriate modeling, though clarifying some statistical choices and including additional measures of accuracy and smoothness would further strengthen the support for the conclusions.

      Major Comments:

      (1) Given the idiosyncrasies in individual vigor, would linear mixed models (LMMs) be more appropriate than ANOVAs in some analyses (e.g., in the section "Solo session"), as they can account for random intercepts and slopes on vigor measures? Some figures (e.g., Figure 2.B and 3.E) indeed seem to show that some aspects of behaviour may present variability in slopes and intercepts across participants. In fact, I now realize that LMMs are used in the "Emergence of dyadic vigor from the partners' individual vigor" section, so could the authors clarify why different statistical approaches were applied depending on the sections?

      (2) If I understand correctly, the introduction suggests that idiosyncrasies in movement vigor may be driven by inter-individual differences in reward sensitivity. However, the current task does not involve any explicit rewards, yet the authors still observe idiosyncrasies in vigor, which is interesting. Could this indicate that other factors contribute to these consistent individual differences? For example, could sensitivity to temporal costs or physical effort explain the slow versus fast subgrouping? Specifically, might individuals more sensitive to temporal costs move faster to minimize opportunity costs, and might those less sensitive to effort costs also move faster? Along the same lines, could the two subgroups (slow vs. fast) be characterized in terms of underlying computational "phenotypes," such as their sensitivities to time and effort? If this is not feasible with the current dataset, it would still be valuable to discuss whether these factors could plausibly account for the observed patterns, based on existing literature.

      (3) The observation that dyads did not lose accuracy or smoothness despite changes in vigor is interesting and suggests a shift in the speed-accuracy tradeoff. Could the authors include accuracy and smoothness measures in the main figures rather than only in supplementary materials? I think it would make the manuscript more complete.

      (4) It is a bit unclear to me whether the variance assumptions for ANOVAs were checked, for instance, in Figure 3H.
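
      As a sketch of the model class raised in comment (1), a linear mixed model with participant-specific intercepts and slopes can be written as below. The notation is assumed here for illustration and is not taken from the manuscript: i indexes participants, j indexes observations, y is a vigor-related measure, and x is a within-participant predictor such as movement amplitude.

```latex
y_{ij} = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})\,x_{ij} + \varepsilon_{ij},
\qquad
\begin{pmatrix} u_{0i} \\ u_{1i} \end{pmatrix} \sim \mathcal{N}(\mathbf{0}, \Sigma_u),
\qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

      Under these assumptions, a standard repeated-measures ANOVA is closely related to the special case with a random intercept only (u_{1i} = 0), so between-participant variability in slopes of the kind the figures appear to show would not be modeled explicitly.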

    1. eLife Assessment

      This important study combines optogenetic manipulations and wide-field imaging to show that the retrosplenial cortex controls behavioral responses to whisker deflection in a context-dependent manner. The evidence is convincing, but the study would benefit from additional analyses to disentangle the contributions of movement initiation to the recorded neural signals. The paper should be of strong interest to neuroscientists studying cortical mechanisms of sensorimotor processing.

    2. Reviewer #1 (Public review):

      Summary

      The strength of this manuscript lies in the behavior: mice use a continuous auditory background (pink vs brown noise) to set a rule for interpreting an identical single-whisker deflection (lick in W+ and withhold in W− contexts) while always licking to a brief 10 kHz tone. Behaviorally, animals acquire the rule and switch rapidly at block transitions, taking a few trials to fully integrate the context cue. What's nice about this behavior is the separate auditory cue, which shows the animals remain engaged in the task, so it's not just that the mice check out (i.e., become disengaged in the W− context). The authors then use optical tools, combining cortex-wide optogenetic inactivation (using localized inhibition in a grid-like fashion) with widefield calcium imaging to map what regions are necessary for the task and what the local and global dynamics are. Classic whisker sensorimotor nodes (wS1/wS2/wM/ALM) behave as expected with silencing reducing whisker-evoked licking. Retrosplenial cortex (RSC) emerges as a somewhat unexpected, context-specific node: silencing RSC (and tjS1) increases licking selectively in W−, arguing that these regions contribute to applying the "don't lick" policy in that context. I say somewhat because work from the Delamater group points to this possibility, albeit in a Pavlovian conditioning task and without neural data. I would still recommend the authors of the current manuscript review that work to see whether there is a relevant framework or concept (Castiello, Zhang, Delamater, 'The retrosplenial cortex as a possible 'sensory integration' area: a neural network modeling approach of the differential outcomes effect of negative patterning', 2021, Neurobiology of Learning and Memory).

      The widefield imaging shows that RSC is the earliest dorsal cortical area to show W+ vs W− divergence after the whisker stimulus, preceding whisker motor cortex, consistent with RSC injecting context into the sensorimotor flow. A "Context Off" control (continuous white noise; same block structure) impairs context discrimination, indicating the continuous background is actually used to set the rule (an important addition!). Pre-stimulus functional-connectivity analyses suggest that there is some activity correlation that maps to the context, presumably due to the continuous background auditory context. Simultaneous opto+imaging projects perturbations into a low-dimensional subspace that separates lick vs no-lick trajectories in an interpretable way.

      In my view, this is a clear, rigorous systems-level study that identifies an important role for RSC in context-dependent sensorimotor transformation, thereby expanding RSC's involvement beyond navigation/memory into active sensing and action selection. The behavioral paradigm is thoughtfully designed, the claims related to the imaging are well defended, and the causal mapping is strong. I have a few suggestions for clarity that may require a bit of data analysis. I also outline one key limitation that should be discussed, but is likely beyond the scope of this manuscript.

      Major strengths

      (1) The task is a major strength. It asks the animal to generate differential motor output to the same sensory stimulus, does so in a block-based manner, and the Context-Off condition convincingly shows that the continuous contextual cue is necessary. The auditory tone control ensures this is more than a 'motivational' context but is decision-related. In fact, the slightly higher bias to lick on the catch trials in the W+ context is further evidence for this.

      (2) The dorsal-cortex optogenetic grid avoids a 'look-where-we-expect' approach and lets RSC fall out as a key node. The authors then follow this up with pharmacology and latency analyses to rule out simple motor confounds. Overall, this is rigorous and thoughtfully done.

      (3) While the mesoscale imaging doesn't allow for cellular resolution, it allows for mapping of the flow of information. It places RSC early in the context-specific divergence after whisker onset, a valuable piece that complements prior work.

      (4) The baseline (pre-stim) functional connectivity and the opto-perturbation projections into a task subspace increase the significance of the work by moving beyond local correlates.

      Key limitation

      The current optogenetic window begins ~10 ms before the sensory cue and extends 1 s after, which is ideal for perturbing within-trial dynamics but cannot isolate whether RSC is required to maintain the context-specific rule during the baseline. Because context is continuously available, it makes me wonder whether RSC is the locus maintaining or, instead, gating the context signal. The paper's results are fully consistent with that possibility, but causality in the pre-stimulus window remains an open question. (As a pointer for future work, pre-stimulus-only inactivation, silencing around block switches, or context-omission probe trials (e.g., removing the background noise unexpectedly within a W+ or W− context block) could help separate 'holding' from 'gating' of the rule. I'm not suggesting these are needed for this manuscript, but they would be interesting for future studies.)

    3. Reviewer #2 (Public review):

      Summary:

      The authors aim to understand the neural basis of context-dependent sensory processing and decision-making.

      Strengths:

      They used an innovative behavioral paradigm where the action-outcome association changes independent of the sensory stimulus. This theoretically allows the authors to disentangle the effect of behavioral context on sensory processing. Using this approach combined with optogenetic silencing, they discover that RSC activity is necessary for suppressing a lick response when the stimulus switches to the unrewarded context.

      Weaknesses:

      Sensory processing appears to be entangled with jaw/tongue movement initiation. Activity in M1 and RSC during auditory-evoked lick responses appears to be identical to activity during whisker-evoked lick responses, indicating that movement initiation is the main driver of M1/RSC activity, rather than changes in the flow of sensory information. If sensory information were the main driver of the initial M1/RSC response, then auditory evoked responses should have a longer latency. Perhaps this is beyond the resolution of the calcium indicator or imaging frame rate. It is not clear from the data shown if differences in S1 activity when comparing W+ and W- stimulation are caused by context-sensitive sensory processing or whisker movement following whisker deflection.

    1. eLife Assessment

      This study presents SynaptoGen, a differentiable extension of connectome models that links gene expression, protein-protein interaction probabilities, synaptic multiplicity, and synaptic weights, and demonstrates its use in reinforcement learning agents and a C. elegans-inspired case study. The work is a valuable contribution to computational connectomics and neuro-inspired machine learning, with solid mathematical and computational evidence supporting the proposed optimization framework. However, the broader biological and synthetic-biology claims - particularly genomic control of synaptogenesis and drug-discovery applications - are currently overstated and would benefit from a more tempered framing and clearer articulation of biological limitations.

    2. Reviewer #1 (Public review):

      The authors address a set of important and challenging questions at the interface of (developmental) neuroscience, genetics, and computation. They ask how complex neural circuits could emerge from compact genomic information, and they outline a bold vision in which this process might eventually be harnessed to design synthetic biological intelligence through genetic control of synaptogenesis. These are significant and stimulating ideas that merit rigorous theoretical and experimental exploration.

      However, the present work does not convincingly engage with these questions at a mechanistic level. Most of the circuit formation aspects appear to be adopted from prior models, and it is not clear how the main methodological modifications - introducing synaptic conductance and stochastic formalisms - provide new conceptual insight into genomic specification of neural circuitry. The manuscript does not include significant biological data or validation to support the proposed framework, and the results provided instead use artificial reinforcement learning benchmarks, which do not appear informative with respect to the biological claims.

      Overall, while the manuscript raises intriguing themes and ambitions, the proposed model is conceptually disconnected from the biological problem it purports to address. The strength of evidence does not support the strong interpretative or translational claims, and substantial rethinking of the modeling framework, in particular its validation strategy, would be required for the work to match its claims of an improved understanding of the genomic basis of neural circuit formation and of our ability to engineer it.

    3. Reviewer #2 (Public review):

      In this manuscript, the authors built upon the Connectome Model literature and proposed SynaptoGen, a differentiable model that explicitly takes into account multiplicity and conductance in neural connectivity. The authors evaluated SynaptoGen through simulated reinforcement learning tasks and established its performance as often superior to two considered baselines. This work is a valuable addition to the field, supported by a solid methodology with some details and limitations missing.

      Major points:

      (1) The genetic features in the X and Y matrices in the CM were originally introduced as combinatorial gene expression patterns that correspond to the presence and even absence of a subset of genes. The authors oversimplify this original scope by only considering single-gene expression features. While this was arguably a reasonable first approximation for a case study of gap junctions in C. elegans, it is by no means a plausible assumption for chemical synapses. As the authors appear to motivate their model by chemical synapses that have polarities, they should either consider combinatorial rules in the model or at least present this explicitly as a key limitation of the model. Omitting combinatorial effects also renders the presented "bioplausible" baseline much less bioplausible, likely calling for a different name.

      (2) It is not fully explained how Equation (11) is obtained, even conceptually. It is unclear why \bar{B} and \bar{G} should be element-wise multiplied together, both already being expected values. Moreover, the authors acknowledged in lines 147-149 that the components of \bar{G} actually depend on gene expression X, which is a component in \bar{B}, so the logic here seems circular.

      (3) The authors considered two baselines, namely SNES and a bioplausible control. However, it would be of interest to also investigate: a) Vanilla DQN with the same size trained on the same MLP, to judge whether the biological insights behind SynaptoGen parameterization add value to performance. b) Using Equation (7) instead of Equation (11) to construct the weight matrices, to judge whether incorporating the conductance adds value to performance.
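
      The concern in point (2) can be made concrete with a standard identity. The notation here is an assumption for illustration: B_{ij} and G_{ij} are taken to be the random quantities whose expectations are \bar{B}_{ij} and \bar{G}_{ij}; the manuscript's exact form of Equation (11) is not reproduced.

```latex
\mathbb{E}\!\left[B_{ij}\,G_{ij}\right]
  = \bar{B}_{ij}\,\bar{G}_{ij} + \operatorname{Cov}\!\left(B_{ij},\, G_{ij}\right)
```

      Under this reading, the element-wise product of the two expectations equals the expected product only when the covariance term vanishes. If the components of \bar{G} depend on the same gene expression X that enters \bar{B}, that term need not be zero, so the step would call for an explicit independence or conditioning argument.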

    4. Reviewer #3 (Public review):

      Summary

      Boccato et al. present an ambitious and thoughtfully developed framework, SynaptoGen, which proposes a differentiable model of synaptogenesis grounded in gene-expression vectors, protein interaction probabilities, and conductance rules. The authors aim to bridge the gap between computational connectomics and synthetic biological intelligence by enabling gradient-based optimization of genetically encoded circuit architectures. They support this goal with mathematical derivations, simulation experiments across several RL benchmarks, and a biologically grounded validation using C. elegans adhesion-molecule co-expression data. The paper is timely and conceptually compelling, offering a unified formulation of synaptic multiplicity and synaptic weight formation that can be integrated directly into learning systems.

      Strengths

      (1) Well-motivated framework with clear conceptual contributions.

      (2) Rigorous mathematical development.

      (3) Compelling empirical validation.

      (4) Excellent framing and discussion of future impact.

      Weaknesses

      (1) Overstated claims in the abstract and discussion.

      (2) Ambiguity in "first of its kind" assertions.

    1. eLife Assessment

      This is an important contribution that largely confirms prior evidence that word recognition - a cornerstone of development - improves across early childhood and is related to vocabulary growth. This study is distinguished by its use of a large, multi-study dataset that is uncommon in prior research on cognitive development. It provides solid evidence that speed, accuracy, and consistency of word learning improve with age, and will therefore prove of interest to those studying language, and more broadly, perception and development.

    2. Reviewer #1 (Public review):

      Summary:

      The study examined the extent to which children's word recognition skill improves across early development, becoming faster, more accurate and less variable, and the extent to which word recognition skill is related to children's concurrent and later vocabulary knowledge.

      Strengths:

      The main strength of the study comes from the dataset, which recycles previously collected data from 24 studies to examine the development of word recognition skill using data from 1963 children. This maximizes the impact of previously collected data while also allowing the study to reliably ask big-picture questions on the development of word recognition skill and its relation to chronological age and vocabulary knowledge. Data analysis is rigorous, thought through and very clearly described. Data and code necessary to reproduce the manuscript are shared on the project's GitHub.

      Weaknesses:

      The limitations of the study are acknowledged to some extent, but this acknowledgment needs to be strengthened and carried throughout the manuscript. In the discussion, the authors note that the approach is observational and exploratory, and highlight what is for me a key alternative explanation of the findings, namely that faster children could be faster due to their larger vocabulary, rather than faster children learning more words. Indeed, the latter explanation for the relationship is called into question, given that growth in speed was not related to growth in vocabulary. Here, the authors note that the null result may be related to the fact that they do not have sufficiently precise estimates of growth slopes, rather than taking seriously the alternative explanation that there may not be as causal a link between being a faster word learner and a better word learner (i.e., one who learns more words). This matters especially because, correct me if I'm wrong here, current vocabulary size is not taken into consideration in the model examining vocabulary growth. Given the increasing number of studies showing that current vocabulary knowledge predicts vocabulary growth (Laing, Kalinowski et al, Siew & Vitevitch), one simple alternative explanation is that current vocabulary knowledge predicts both current word recognition skill and later vocabulary knowledge. Is there anything in the data speaking against this hypothesis?

      Equally, while the SEM examines vocabulary growth controlling for age, I wonder about the other way around. What would happen to the effect of age on word recognition skill (in the LME model, S8) if one were to add concurrent vocabulary size? So does chronological age explain word recognition skill or vocabulary knowledge? Right now, the manuscript describes this effect purely related to chronological age, but is it age per se or other cognitive abilities, including a key change across development, namely, vocabulary size? Thus, the presentation of the skill learning hypothesis suggests that age is a proxy for experience, while you actually have here a very nice proxy for experience in terms of children's vocabulary size.

      Critically, while the discussion is more nuanced, the way the abstract is concluded and the way the Introduction is phrased suggest that the study is able to answer a causal question, which, as the authors themselves note, is not possible. The abstract, for instance, states that word recognition becomes faster, more accurate and less variable...consistent with a process of skill learning. It also states that this skill plays a role in supporting early language learning, which is very causal language. I don't think you can really claim that you are testing the two hypotheses you suggest here. The work is definitely embedded in the context of these hypotheses, but are you really able to test them? My worry is that, while the discussion is more nuanced, this study will then be cited down the line as showing that children learn more words because they are faster at recognizing words. Anything that you can do to temper such interpretations would be good for the literature. For me, this should not just be relegated to the discussion but should be touched upon in the abstract and Introduction.

      Finally, it would help to talk more about the mechanisms at work in any relationship between word recognition and language learning. It seems to me that this would rely on some predictive processing framework, given the description on page 4, and it would be good to make this clear (the faster and more accurately you can recognize a ball, the better you can use this evidence to infer the speaker's intended meaning). Equally, when referring to word recognition, it would be good to clarify what this refers to: how well a child knows what a word refers to (and in the context of LWL, what it does not refer to) or how quickly the child directs attention to what is referred to.

      With regards to the data, I wonder if there is a clustering of kids past 24 months that is happening here, looking at Figures 1 and 2, where it seems like there is less change past the 24-month point. Is there any way to look at whether the effect of age or vocabulary on word recognition is not linear but asymptotic?

    3. Reviewer #2 (Public review):

      Summary:

      This paper presents a series of analyses of a large dataset combining many prior studies of early word recognition (Peekbank). The analyses demonstrate that the speed, accuracy and consistency of word learning improve with age. Moreover, the speed of word learning early in development was related to vocabulary growth over time.

      Strengths:

      A key strength of the paper is the use of a large multi-study dataset. This is particularly valuable in the field of early cognitive development, which has (due to practical limitations) often been based on small-scale studies that necessarily provide a shaky foundation for conclusions. The analyses are also well-motivated.

      Weaknesses:

      The weaknesses I saw are primarily in some aspects of the conceptual motivation for the research.

      First, I wasn't entirely clear about what the authors meant by "word recognition ability". For much of the manuscript (including the use of the term "word recognition ability" itself), this comes across as an intrinsic ability or skill that improves with development. Alternatively, the speed and accuracy metrics taken from studies in Peekbank might capture children's increasing knowledge of the common, concrete words typically used in these studies. To me, this is a somewhat different construct from a general skill at recognizing words. It would be helpful if the authors could clarify which construct they intend to capture, or if it is not possible to distinguish between these constructs from the Peekbank data.

      Second, and relatedly, if the source of the age-related improvements is increasing experience with the common concrete words used in the Peekbank studies, then one might expect word recognition and improvements with age to be related to word frequency, given that more frequent words are experienced more often. Word frequency predicts word knowledge when assessed using CDI data. Can effects of frequency be detected in Peekbank word recognition metrics? If not, why? Similarly, is the speed and accuracy of word recognition in Peekbank data related to CDI-derived word age of acquisition, and again, if not, why?

      Finally, there is a bit of a risk of the main findings of this paper coming across as a foregone conclusion. I.e., how could it be otherwise that word recognition improves with development?

    1. eLife Assessment

      This important work investigates cooperative behaviors in adolescents using a repeated Prisoner's Dilemma game. The computational modeling approach used in the study is solid and rigorous. The work could be further strengthened with the consideration of modeling higher-order social inferences and non-linear relationships between age and observed behavior. Findings from this study will be of interest to developmental psychologists, economists, and social psychologists.

    2. Reviewer #1 (Public review):

      Summary:

      Wu and colleagues aimed to explain previous findings that adolescents, compared to adults, show reduced cooperation following cooperative behaviour from a partner in several social scenarios. The authors analysed behavioural data from adolescents and adults performing a zero-sum Prisoner's Dilemma task and compared a range of social and non-social reinforcement learning models to identify potential algorithmic differences. Their findings suggest that adolescents' lower cooperation is best explained by a reduced learning rate for cooperative outcomes, rather than differences in prior expectations about the cooperativeness of a partner. The authors situate their results within the broader literature, proposing that adolescents' behaviour reflects a stronger preference for self-interest rather than a deficit in mentalising.

      Strengths:

      The work as a whole suggests that, in line with past work, adolescents prioritise value accumulation, and this can be, in part, explained by algorithmic differences in weighted value learning. The authors situate their work very clearly in past literature, and make obvious the gap they are testing and trying to explain. The work also includes social contexts which move the field beyond non-social value accumulation in adolescents. The authors compare a series of formal approaches that might explain the results and establish generative and model-comparison procedures to demonstrate the validity of their winning model and individual parameters. The writing was clear, and the presentation of the results was logical and well-structured.

      Weaknesses:

      I had some concerns about the methods used to fit and approximate parameters of interest. Namely, the use of maximum likelihood versus hierarchical methods to fit models on an individual level, which may reduce some of the outliers noted in the supplement, and also may improve model identifiability.

      There was also little discussion of the structure of the Prisoner's Dilemma and the strategy of the game (that defection is always dominant), meaning that the preferences of the adolescents cannot necessarily be distinguished from the incentives of the game, i.e. they may seem less cooperative simply because they want to play the dominant strategy, rather than because of a lower preference for cooperation if all else was the same.
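
      To make the point about dominance concrete, the following is a minimal sketch of a dominance check for a single round of a Prisoner's Dilemma. The payoff values and the names `PAYOFF` and `strictly_dominates` are hypothetical illustrations, not the payoffs or code used in the study, and the check applies to the one-shot stage game only, not to strategies in the repeated game.

```python
# Hypothetical stage-game payoffs for the row player (NOT the study's values).
# Keys are (own action, partner action); "C" = cooperate, "D" = defect.
# The ordering temptation > reward > punishment > sucker's payoff defines a
# Prisoner's Dilemma.
PAYOFF = {
    ("C", "C"): 3,  # reward for mutual cooperation
    ("C", "D"): 0,  # sucker's payoff
    ("D", "C"): 5,  # temptation to defect
    ("D", "D"): 1,  # punishment for mutual defection
}


def strictly_dominates(action, other, payoff, partner_actions=("C", "D")):
    """Return True if `action` pays strictly more than `other`
    against every possible partner action."""
    return all(
        payoff[(action, partner)] > payoff[(other, partner)]
        for partner in partner_actions
    )


if __name__ == "__main__":
    print("Defection dominates cooperation:", strictly_dominates("D", "C", PAYOFF))
    print("Mutual cooperation beats mutual defection:",
          PAYOFF[("C", "C")] > PAYOFF[("D", "D")])
```

      Under these assumed payoffs, defecting is the better reply whatever the partner does, even though both players earn more under mutual cooperation than under mutual defection. This is why a low cooperation rate alone cannot separate a weak preference for cooperation from a simple best response to the incentives.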

      The authors have now addressed my comments and concerns in their revised version.

      Appraisal & Discussion:

      Overall, I believe this work has the potential to make a meaningful contribution to the field. Its impact would be strengthened by more rigorous modelling checks and fitting procedures, as well as by framing the findings in terms of the specific game-theoretic context, rather than general cooperation.

      Comments on revisions:

      Thank you to the authors for addressing my comments and concerns.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript investigates age-related differences in cooperative behavior by comparing adolescents and adults in a repeated Prisoner's Dilemma Game (rPDG). The authors find that adolescents exhibit lower levels of cooperation than adults. Specifically, adolescents reciprocate partners' cooperation to a lesser degree than adults do. Through computational modeling, they show that this relatively low cooperation rate is not due to impaired expectations or mentalizing deficits, but rather a diminished intrinsic reward for reciprocity. A social reinforcement learning model with asymmetric learning rate best captured these dynamics, revealing age-related differences in how positive and negative outcomes drive behavioral updates. These findings contribute to understanding the developmental trajectory of cooperation and highlight adolescence as a period marked by heightened sensitivity to immediate rewards at the expense of long-term prosocial gains.

      Strengths:

      (1) Rigid model comparison and parameter recovery procedure.

      (2) Conceptually comprehensive model space.

      (3) Well-powered samples.

      Weaknesses:

      A key conceptual distinction between learning from non-human agents (e.g., bandit machines) and human partners is that the latter are typically assumed to possess stable behavioral dispositions or moral traits. When a non-human source abruptly shifts behavior (e.g., from 80% to 20% reward), learners may simply update their expectations. In contrast, a sudden behavioral shift by a previously cooperative human partner can prompt higher-order inferences about the partner's trustworthiness or the integrity of the experimental setup (e.g., whether the partner is truly interactive or human). The authors may consider whether their modeling framework captures such higher-order social inferences. Specifically, trait-based models-such as those explored in Hackel et al. (2015, Nature Neuroscience)-suggest that learners form enduring beliefs about others' moral dispositions, which then modulate trial-by-trial learning. A learner who believes their partner is inherently cooperative may update less in response to a surprising defection, effectively showing a trait-based dampening of learning rate.

      This asymmetry in belief updating has been observed in prior work (e.g., Siegel et al., 2018, Nature Human Behaviour) and could be captured using a dynamic or belief-weighted learning rate. Models incorporating such mechanisms (e.g., dynamic learning rate models as in Jian Li et al., 2011, Nature Neuroscience) could better account for flexible adjustments in response to surprising behavior, particularly in the social domain.

      Second, the developmental interpretation of the observed effects would be strengthened by considering possible non-linear relationships between age and model parameters. For instance, certain cognitive or affective traits relevant to social learning-such as sensitivity to reciprocity or reward updating-may follow non-monotonic trajectories, peaking in late adolescence or early adulthood. Fitting age as a continuous variable, possibly with quadratic or spline terms, may yield more nuanced developmental insights.

      Finally, the two age groups compared-adolescents (high school students) and adults (university students)-differ not only in age but also in sociocultural and economic backgrounds. High school students are likely more homogenous in regional background (e.g., Beijing locals), while university students may be drawn from a broader geographic and socioeconomic pool. Additionally, differences in financial independence, family structure (e.g., single-child status), and social network complexity may systematically affect cooperative behavior and valuation of rewards. Although these factors are difficult to control fully, the authors should more explicitly address the extent to which their findings reflect biological development versus social and contextual influences.

      Comments on revisions:

      The authors have adequately addressed my previous comments.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public reviews:

      Reviewer #1 (Public review):

      Summary:

      Wu and colleagues aimed to explain previous findings that adolescents, compared to adults, show reduced cooperation following cooperative behaviour from a partner in several social scenarios. The authors analysed behavioural data from adolescents and adults performing a zero-sum Prisoner's Dilemma task and compared a range of social and non-social reinforcement learning models to identify potential algorithmic differences. Their findings suggest that adolescents' lower cooperation is best explained by a reduced learning rate for cooperative outcomes, rather than differences in prior expectations about the cooperativeness of a partner. The authors situate their results within the broader literature, proposing that adolescents' behaviour reflects a stronger preference for self-interest rather than a deficit in mentalising.

      Strengths:

      The work as a whole suggests that, in line with past work, adolescents prioritise value accumulation, and this can be, in part, explained by algorithmic differences in weighted value learning. The authors situate their work very clearly in past literature, and make obvious the gap they are testing and trying to explain. The work also includes social contexts that move the field beyond non-social value accumulation in adolescents. The authors compare a series of formal approaches that might explain the results and establish generative and model-comparison procedures to demonstrate the validity of their winning model and individual parameters. The writing was clear, and the presentation of the results was logical and well-structured.

      We thank the reviewer for recognizing the strengths of our work.

      Weaknesses:

      (1) I also have some concerns about the methods used to fit and approximate parameters of interest. Namely, the use of maximum likelihood versus hierarchical methods to fit models on an individual level, which may reduce some of the outliers noted in the supplement, and also may improve model identifiability.

      We thank the reviewer for this suggestion. Following the comment, we added a hierarchical Bayesian estimation. We built a hierarchical model with both group-level (adolescent group and adult group) and individual-level structures for the best-fitting model. Four Markov chains with 4,000 samples each were run, and the model converged well (see Figure supplement 7).

      We then analyzed the posterior parameters for adolescents and adults separately. The results were consistent with those from the MLE analysis. These additional results have been included in the Appendix Analysis section (also see Figure supplement 5 and 7). In addition, we have updated the code and provided the link for reference. We appreciate the reviewer’s suggestion, which improved our analysis.

      (2) There was also little discussion of the structure of the Prisoner's Dilemma and the strategy of the game (that defection is always dominant), meaning that the preferences of the adolescents cannot necessarily be distinguished from the incentives of the game, i.e. they may seem less cooperative simply because they want to play the dominant strategy, rather than because of a lower preference for cooperation if all else was the same.

      We thank the reviewer for this comment and agree that adolescents’ lower cooperation may partly reflect a rational response to the incentive structure of the Prisoner’s Dilemma. 

      However, our computational modeling explicitly addressed this possibility. Model 4 (inequality aversion) captures decisions that are driven purely by self-interest or aversion to unequal outcomes, including a parameter reflecting disutility from advantageous inequality, which represents self-oriented motives. If participants’ behavior were solely guided by the payoff-dominant strategy, this model should have provided the best fit. However, our model comparison showed that Model 5 (social reward) performed better in both adolescents and adults, suggesting that cooperative behavior is better explained by valuing social outcomes beyond payoff structures.

      Moreover, if adolescents’ lower cooperation arose because they strategically responded to the payoff structure by adopting defection as the more rewarding option, then adolescents should show reduced cooperation across all rounds. Instead, adolescents and adults behaved similarly when partners defected, but adolescents cooperated less when partners cooperated and showed little increase in cooperation even after consecutive cooperative responses. This pattern suggests that adolescents’ lower cooperation cannot be explained solely by strategic responses to payoff structures but rather reflects a reduced sensitivity to others’ cooperative behavior or weaker social reciprocity motives. We have expanded our Discussion to acknowledge this important point and to clarify how the behavioral and modeling results address the reviewer’s concern.

      “Overall, these findings indicate that adolescents’ lower cooperation is unlikely to be driven solely by strategic considerations, but may instead reflect differences in the valuation of others’ cooperation or reduced motivation to reciprocate. Although defection is the payoff-dominant strategy in the Prisoner’s Dilemma, the selective pattern of adolescents’ cooperation and the model comparison results indicate that their reduced cooperation cannot be fully explained by strategic incentives, but rather reflects weaker valuation of social reciprocity.”

      Appraisal & Discussion:

      (3) The authors have partially achieved their aims, but I believe the manuscript would benefit from additional methodological clarification, specifically regarding the use of hierarchical model fitting and the inclusion of Bayes Factors, to more robustly support their conclusions. It would also be important to investigate the source of the model confusion observed in two of their models.

      We thank the reviewer for this comment. In the revised manuscript, we have clarified the hierarchical Bayesian modeling procedure for the best-fitting model, including the group- and individual-level structure and convergence diagnostics. The hierarchical approach produced results that fully replicated those obtained from the original maximum-likelihood estimation, confirming the robustness of our findings. Please also see the response to (1).

      Regarding the model confusion between the inequality aversion (Model 4) and social reward (Model 5) models in the model recovery analysis, both models’ simulated behaviors were best captured by the baseline model. This pattern arises because neither model includes learning or updating processes. Given that our task involves dynamic, multi-round interactions, models lacking a learning mechanism cannot adequately capture participants’ trial-by-trial adjustments, resulting in similar behavioral patterns that are better explained by the baseline model during model recovery. We have added a clarification of this point to the Results:

      “The overlap between Models 4 and 5 likely arises because neither model incorporates a learning mechanism, making them less able to account for trial-by-trial adjustments in this dynamic task.”

      (4) I am unconvinced by the claim that failures in mentalising have been empirically ruled out, even though I am theoretically inclined to believe that adolescents can mentalise using the same procedures as adults. While reinforcement learning models are useful for identifying biases in learning weights, they do not directly capture formal representations of others' mental states. Greater clarity on this point is needed in the discussion, or a toning down of this language.

      We sincerely thank the reviewer for this professional comment. We agree that our prior wording regarding adolescents’ capacity to mentalise was somewhat overgeneralized. Accordingly, we have toned down the language in both the Abstract and the Discussion to better align our statements with what the present study directly tests. Specifically, our revisions focus on adolescents’ and adults’ ability to predict others’ cooperation in social learning. This is consistent with the evidence from our analyses examining adolescents’ and adults’ model-based expectations and self-reported scores on partner cooperativeness (see Figure 4). In the revised Discussion, we state:

      “Our results suggest that the lower levels of cooperation observed in adolescents stem from a stronger motive to prioritize self-interest rather than a deficiency in predicting others’ cooperation in social learning”.

      (5) Additionally, a more detailed discussion of the incentives embedded in the Prisoner's Dilemma task would be valuable. In particular, the authors' interpretation of reduced adolescent cooperativeness might be reconsidered in light of the zero-sum nature of the game, which differs from broader conceptualisations of cooperation in contexts where defection is not structurally incentivised.

      We thank the reviewer for this comment and agree that adolescents’ lower cooperation may partly reflect a rational response to the incentive structure of the Prisoner’s Dilemma. However, our behavioral and computational evidence suggests that this pattern cannot be explained solely by strategic responses to payoff structures, but rather reflects a reduced sensitivity to others’ cooperative behavior or weaker social reciprocity motives. We have expanded the Discussion to acknowledge this point and to clarify how both behavioral and modeling results address the reviewer’s concern (see also our response to 2).

      (6) Overall, I believe this work has the potential to make a meaningful contribution to the field. Its impact would be strengthened by more rigorous modelling checks and fitting procedures, as well as by framing the findings in terms of the specific game-theoretic context, rather than general cooperation.

      We thank the reviewer for the professional comments, which have helped us improve our work.

      Reviewer #2 (Public review):

      Summary:

      This manuscript investigates age-related differences in cooperative behavior by comparing adolescents and adults in a repeated Prisoner's Dilemma Game (rPDG). The authors find that adolescents exhibit lower levels of cooperation than adults. Specifically, adolescents reciprocate partners' cooperation to a lesser degree than adults do. Through computational modeling, they show that this relatively low cooperation rate is not due to impaired expectations or mentalizing deficits, but rather a diminished intrinsic reward for reciprocity. A social reinforcement learning model with asymmetric learning rate best captured these dynamics, revealing age-related differences in how positive and negative outcomes drive behavioral updates. These findings contribute to understanding the developmental trajectory of cooperation and highlight adolescence as a period marked by heightened sensitivity to immediate rewards at the expense of long-term prosocial gains.

      Strengths:

      (1) Rigid model comparison and parameter recovery procedure.

      (2) Conceptually comprehensive model space.

      (3) Well-powered samples.

      We thank the reviewer for highlighting the strengths of our work.

      Weaknesses:

      A key conceptual distinction between learning from non-human agents (e.g., bandit machines) and human partners is that the latter are typically assumed to possess stable behavioral dispositions or moral traits. When a non-human source abruptly shifts behavior (e.g., from 80% to 20% reward), learners may simply update their expectations. In contrast, a sudden behavioral shift by a previously cooperative human partner can prompt higher-order inferences about the partner's trustworthiness or the integrity of the experimental setup (e.g., whether the partner is truly interactive or human). The authors may consider whether their modeling framework captures such higher-order social inferences. Specifically, trait-based models-such as those explored in Hackel et al. (2015, Nature Neuroscience)-suggest that learners form enduring beliefs about others' moral dispositions, which then modulate trial-by-trial learning. A learner who believes their partner is inherently cooperative may update less in response to a surprising defection, effectively showing a trait-based dampening of learning rate.

      We thank the reviewer for this thoughtful comment. We agree that social learning from human partners may involve higher-order inferences beyond simple reinforcement learning from non-human sources. To address this, we had previously included such mechanisms in our behavioral modeling. In Model 7 (Social Reward Model with Influence), we tested a higher-order belief-updating process in which participants’ expectations about their partner’s cooperation were shaped not only by the partner’s previous choices but also by the inferred influence of their own past actions on the partner’s subsequent behavior. In other words, participants could adjust their belief about the partner’s cooperation by considering how their partner’s belief about them might change. Model comparison showed that Model 7 did not outperform the best-fitting model, suggesting that incorporating higher-order influence updates added limited explanatory value in this context. As suggested by the reviewer, we have further clarified this point in the revised manuscript.

      Regarding trait-based frameworks, we appreciate the reviewer’s reference to Hackel et al. (2015). That study elegantly demonstrated that learners form relatively stable beliefs about others’ social dispositions, such as generosity, especially when the task structure provides explicit cues for trait inference (e.g., resource allocations and giving proportions). By contrast, our study was not designed to isolate trait learning, but rather to capture how participants update their expectations about a partner’s cooperation over repeated interactions. In this sense, cooperativeness in our framework can be viewed as a trait-like latent belief that evolves as evidence accumulates. Thus, while our model does not include a dedicated trait module that directly modulates learning rates, the belief-updating component of our best-fitting model effectively tracks a dynamic, partner-specific cooperativeness, potentially reflecting a prosocial tendency.

      This asymmetry in belief updating has been observed in prior work (e.g., Siegel et al., 2018, Nature Human Behaviour) and could be captured using a dynamic or belief-weighted learning rate. Models incorporating such mechanisms (e.g., dynamic learning rate models as in Jian Li et al., 2011, Nature Neuroscience) could better account for flexible adjustments in response to surprising behavior, particularly in the social domain.

      We thank the reviewer for the suggestion. Following the comment, we implemented an additional model incorporating a dynamic learning rate based on the magnitude of prediction errors. Specifically, we developed Model 9:  Social reward model with Pearce–Hall learning algorithm (dynamic learning rate), in which participants’ beliefs about their partner’s cooperation probability are updated using a Rescorla–Wagner rule with a learning rate dynamically modulated by the Pearce–Hall (PH) Error Learning mechanism. In this framework, the learning rate increases following surprising outcomes (larger prediction errors) and decreases as expectations become more stable (see Appendix Analysis section for details).
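
      As an illustration of the mechanism described above, the following is a minimal sketch of a Rescorla–Wagner belief update whose learning rate is modulated by a Pearce–Hall associability term. It uses a common hybrid formulation and is not the authors' Model 9; the function names, the parameters `kappa` and `eta`, and all starting values are assumptions made for this example.

```python
def pearce_hall_update(belief, associability, partner_cooperated,
                       kappa=0.5, eta=0.3):
    """One trial of a hybrid Rescorla-Wagner / Pearce-Hall update.

    belief         current estimate of P(partner cooperates), in [0, 1]
    associability  current dynamic learning-rate term, in [0, 1]
    kappa          fixed scaling of the learning rate (assumed value)
    eta            how quickly associability tracks surprise (assumed value)
    """
    outcome = 1.0 if partner_cooperated else 0.0
    prediction_error = outcome - belief
    # Belief moves toward the outcome; the step size is scaled by the
    # associability carried over from earlier trials.
    new_belief = belief + kappa * associability * prediction_error
    # Associability rises after surprising outcomes (large |error|) and
    # decays as outcomes become predictable.
    new_associability = (eta * abs(prediction_error)
                         + (1.0 - eta) * associability)
    return new_belief, new_associability


def simulate_beliefs(partner_choices, belief=0.5, associability=0.5,
                     kappa=0.5, eta=0.3):
    """Return the (belief, associability) pair after each trial."""
    trajectory = []
    for cooperated in partner_choices:
        belief, associability = pearce_hall_update(
            belief, associability, cooperated, kappa, eta)
        trajectory.append((belief, associability))
    return trajectory


if __name__ == "__main__":
    # Ten cooperative choices followed by one unexpected defection.
    for trial, (b, a) in enumerate(simulate_beliefs([True] * 10 + [False]), 1):
        print(f"trial {trial:2d}: belief={b:.3f} associability={a:.3f}")
```

      In this sketch the associability shrinks while the partner keeps cooperating and jumps after the unexpected defection, which is the qualitative behaviour that a dynamic learning rate is meant to capture.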

      The results showed that this dynamic learning rate model did not outperform our best-fitting model in either adolescents or adults (see Figure supplement 6). We greatly appreciate the reviewer’s suggestion, which has strengthened the scope of our analysis. We have now added these analyses to the Appendix Analysis section (see Figure supplement 6) and expanded the Discussion to acknowledge this modeling extension and further discuss its implications.

      Second, the developmental interpretation of the observed effects would be strengthened by considering possible non-linear relationships between age and model parameters. For instance, certain cognitive or affective traits relevant to social learning-such as sensitivity to reciprocity or reward updating-may follow non-monotonic trajectories, peaking in late adolescence or early adulthood. Fitting age as a continuous variable, possibly with quadratic or spline terms, may yield more nuanced developmental insights.

      We thank the reviewer for this professional comment. In addition to the linear analyses, we further conducted exploratory analyses to examine potential non-linear relationships between age and the model parameters. Specifically, we fit LMMs for each of the four parameters as outcomes (α+, α−, β, and ω). The fixed effects included age, a quadratic age term, and gender, and the random effects included subject-specific random intercepts and random slopes for age and gender. Model comparison using BIC did not indicate improvement for the quadratic models over the linear models for α+ (ΔBIC(quadratic−linear) = 5.09), α− (ΔBIC(quadratic−linear) = 3.04), β (ΔBIC(quadratic−linear) = 3.9), or ω (ΔBIC(quadratic−linear) = 0). Moreover, the quadratic age term was not significant for α+, α−, or β (all ps > 0.10). For ω, we observed a significant linear age effect (b = 1.41, t = 2.65, p = 0.009) and a significant quadratic age effect (b = −0.03, t = −2.39, p = 0.018; see Author response image 1). This pattern is broadly consistent with the group effect reported in the main text. The shaded area in the figure represents the 95% confidence interval. As shown, the interval widens at older ages (≥ 26 years) due to fewer participants in that range, which limits the robustness of the inferred quadratic effect. In consideration of the limited precision at older ages and the lack of BIC improvement, we did not emphasize the quadratic effect in the revised manuscript and present these results here as exploratory.

      Author response image 1.

      Linear and quadratic model fits showing the relationship between age and the ω parameter, with 95% confidence intervals.
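
      The linear-versus-quadratic comparison reported above can be illustrated with a simplified sketch. It uses ordinary least squares on synthetic data instead of the linear mixed models the authors describe, so it shows only how a ΔBIC (quadratic minus linear) is computed and read. The function names, the synthetic ages, and the coefficients are assumptions made for this example.

```python
import numpy as np


def gaussian_bic(y, X):
    """BIC of an ordinary least-squares fit with Gaussian errors,
    dropping additive constants that cancel when models are compared
    on the same data."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    # k regression coefficients plus one residual-variance parameter.
    return n * np.log(rss / n) + (k + 1) * np.log(n)


def delta_bic_quadratic_minus_linear(age, y):
    """Positive values favour the linear model, negative the quadratic."""
    age_c = age - age.mean()  # centring reduces collinearity with age squared
    X_linear = np.column_stack([np.ones_like(age_c), age_c])
    X_quadratic = np.column_stack([X_linear, age_c ** 2])
    return gaussian_bic(y, X_quadratic) - gaussian_bic(y, X_linear)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    age = np.linspace(14.0, 30.0, 81)  # synthetic ages, not study data
    age_c = age - age.mean()
    y_linear = 1.0 + 0.5 * age_c + rng.normal(0.0, 1.0, age.size)
    y_curved = y_linear - 0.3 * age_c ** 2
    print("delta BIC, linear truth:", delta_bic_quadratic_minus_linear(age, y_linear))
    print("delta BIC, curved truth:", delta_bic_quadratic_minus_linear(age, y_curved))
```

      Because the two models are nested, the quadratic fit can never have a larger residual sum of squares, so in this sketch ΔBIC is bounded above by the penalty for the one extra parameter, ln(n). A ΔBIC at or above zero, as reported for all four parameters, therefore means the quadratic term did not improve the fit enough to justify its cost.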

      Finally, the two age groups compared - adolescents (high school students) and adults (university students) - differ not only in age but also in sociocultural and economic backgrounds. High school students are likely more homogenous in regional background (e.g., Beijing locals), while university students may be drawn from a broader geographic and socioeconomic pool. Additionally, differences in financial independence, family structure (e.g., single-child status), and social network complexity may systematically affect cooperative behavior and valuation of rewards. Although these factors are difficult to control fully, the authors should more explicitly address the extent to which their findings reflect biological development versus social and contextual influences.

      We appreciate this comment. Indeed, adolescents (high school students) and adults (university students) differ not only in age but also in sociocultural and socioeconomic backgrounds. In our study, all participants were recruited from Beijing and surrounding regions, which helps minimize large regional and cultural variability. Moreover, we accounted for individual-level random effects and included participants’ social value orientation (SVO) as an individual difference measure. 

      Nonetheless, we acknowledge that other contextual factors, such as differences in financial independence, socioeconomic status, and social experience, may also contribute to group differences in cooperative behavior and reward valuation. Although our results are broadly consistent with developmental theories of reward sensitivity and social decision-making, sociocultural influences cannot be entirely ruled out. Future work with more demographically matched samples or with socioeconomic and regional variables explicitly controlled will help clarify the relative contributions of biological and contextual factors. Accordingly, we have revised the Discussion to include the following statement: “Third, although both age groups were recruited from Beijing and nearby regions, minimizing major regional and cultural variation, adolescents and adults may still differ in socioeconomic status, financial independence, and social experience. Such contextual differences could interact with developmental processes in shaping cooperative behavior and reward valuation. Future research with demographically matched samples or explicit measures of socioeconomic background will help disentangle biological from sociocultural influences.”

      Reviewer #3 (Public review):

      Summary:

      Wu and colleagues find that in a repeated Prisoner's Dilemma, adolescents, compared to adults, are less likely to increase their cooperation behavior in response to repeated cooperation from a simulated partner. In contrast, after repeated defection by the partner, both age groups show comparable behavior.

      To uncover the mechanisms underlying these patterns, the authors compare eight different models. They report that a social reward learning model, which includes separate learning rates for positive and negative prediction errors, best fits the behavior of both groups. Key parameters in this winning model vary with age: notably, the intrinsic value of cooperating is lower in adolescents. Adults and adolescents also differ in learning rates for positive and negative prediction errors, as well as in the inverse temperature parameter.

      Strengths: 

      The modeling results are compelling in their ability to distinguish between learned expectations and the intrinsic value of cooperation. The authors skillfully compare relevant models to demonstrate which mechanisms drive cooperation behavior in the two age groups.

      We thank the reviewer for recognizing the strengths of our work.

      Weaknesses:

      Some of the claims made are not fully supported by the data:

      The central parameter reflecting preference for cooperation is positive in both groups. Thus, framing the results as self-interest versus other-interest may be misleading.

      We thank the reviewer for this insightful comment. In the social reward model, the cooperation preference parameter is positive by definition, as defection in the rPDG always yields a +2 monetary advantage regardless of the partner’s action. This positive value represents the additional subjective reward assigned to mutual cooperation (e.g., reciprocity value) that counterbalances the monetary gain from defection. Although the estimated social reward parameter ω was positive, the effective advantage of cooperation is Δ = p×ω − 2. Given participants’ inferred beliefs p, Δ was negative for most trials (p×ω < 2), indicating that the social reward was insufficient to offset the +2 advantage of defection. Thus, both adolescents and adults valued cooperation positively, but adolescents’ smaller ω and weaker responsiveness to sustained partner cooperation suggest a stronger weighting on immediate monetary payoffs.

      In this light, our framing of adolescents as more self-interested derives from their behavioral pattern: even when they recognized sustained partner cooperation and held high expectations of partner cooperation, adolescents showed lower cooperative behavior and reciprocity rewards compared with adults. Whereas adults increased cooperation after two or three consecutive partner cooperations, this pattern was absent among adolescents. We therefore interpret their behavior as relatively more self-interested, reflecting reduced sensitivity to the social reward from mutual cooperation rather than a categorical shift from self-interest to other-interest, as elaborated in the Discussion.
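
The relation Δ = p×ω − 2 described above can be illustrated with a minimal sketch. The function name and the example values of p and ω are hypothetical and chosen only for illustration; they are not estimates from the study.

```python
def cooperation_advantage(p: float, omega: float, defect_bonus: float = 2.0) -> float:
    """Effective advantage of cooperating, Delta = p * omega - 2.

    p            : believed probability that the partner cooperates (0..1)
    omega        : social reward assigned to mutual cooperation
    defect_bonus : fixed monetary advantage of defecting (+2 in the rPDG)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return p * omega - defect_bonus


# Hypothetical values: a positive omega can still leave Delta negative.
for p, omega in [(0.5, 3.0), (0.9, 2.0), (0.8, 3.0)]:
    delta = cooperation_advantage(p, omega)
    print(f"p={p}, omega={omega}, Delta={delta:+.2f}")
```

Under this reading, cooperation is favored only when p×ω exceeds 2, so a positive ω alone does not imply a net preference for cooperating.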

      It is unclear why the authors assume adolescents and adults have the same expectations about the partner's cooperation, yet simultaneously demonstrate age-related differences in learning about the partner. To support their claim mechanistically, simulations showing that differences in cooperation preference (i.e., the w parameter), rather than differences in learning, drive behavioral differences would be helpful.

      We thank the reviewer for raising this important point. In our model, both adolescents and adults updated their beliefs about partner cooperation using an asymmetric reinforcement learning (RL) rule. Although adolescents exhibited a higher positive and a lower negative learning rate than adults, the two groups did not differ significantly in their overall updating of partner cooperation probability (Fig. 4a-b). We then examined the social reward parameter ω, which was significantly smaller in adolescents and determined the intrinsic value of mutual cooperation (i.e., p×ω). This variable differed significantly between groups and closely matched the behavioral pattern.
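
As a rough illustration of an asymmetric learning rule, the sketch below updates the expected partner cooperation probability with separate learning rates for positive and negative prediction errors. The delta-rule form, the 1/0 outcome coding, the function names, and the parameter values are assumptions made for this example and may differ from the authors' exact implementation.

```python
def update_belief(p: float, partner_cooperated: bool,
                  alpha_pos: float, alpha_neg: float) -> float:
    """One asymmetric delta-rule update of the belief p (assumed form)."""
    outcome = 1.0 if partner_cooperated else 0.0
    prediction_error = outcome - p
    # Use alpha_pos when the partner did better than expected, alpha_neg otherwise.
    rate = alpha_pos if prediction_error > 0 else alpha_neg
    return p + rate * prediction_error


def belief_trajectory(partner_actions, alpha_pos, alpha_neg, p0=0.5):
    """Beliefs held before each trial, given the partner's action history."""
    beliefs, p = [], p0
    for action in partner_actions:
        beliefs.append(p)
        p = update_belief(p, action, alpha_pos, alpha_neg)
    return beliefs


# Hypothetical partner history and learning rates.
history = [True, True, False, True]
print(belief_trajectory(history, alpha_pos=0.4, alpha_neg=0.2))
```

With learning rates between 0 and 1, each update moves the belief part of the way toward the observed outcome, so the belief stays within [0, 1].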

      Following the reviewer’s suggestion, we conducted additional simulations varying one model parameter at a time while holding the others constant. The difference in mean cooperation probability between adults and adolescents served as the index (positive = higher cooperation in adults). As shown in Author response image 2, decreases in ω most effectively reproduced the observed group difference (shaded area), indicating that age-related differences in cooperation are primarily driven by variation in the social reward parameter ω rather than by the other parameters.

      Author response image 2.

      Simulation results showing how variations in each model parameter affect the group difference in mean cooperation probability (Adults – Adolescents). Based on the best-fitting Model 8 and parameters estimated from all participants, each line represents one parameter (i.e., α+, α-, ω, β) systematically varied within the tested range (α±:0.1–0.9; ω, β:1–9) while other parameters were held constant. Positive values indicate higher cooperation in adults. Smaller ω values most strongly reproduced the observed group difference, suggesting that reduced social reward weighting primarily drives adolescents’ lower cooperation.
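
A one-parameter-at-a-time sweep of the kind described in this caption could be sketched as follows. This is only an illustrative reconstruction: the logistic choice rule, the delta-rule belief update, the baseline parameter values, and the partner schedule are all hypothetical assumptions, not the authors' Model 8 code or fitted estimates.

```python
import math


def mean_coop_probability(partner_actions, alpha_pos, alpha_neg, omega, beta, p0=0.5):
    """Mean model-implied cooperation probability over a partner schedule."""
    p, total = p0, 0.0
    for partner_cooperated in partner_actions:
        delta_u = p * omega - 2.0                           # advantage of cooperating
        total += 1.0 / (1.0 + math.exp(-beta * delta_u))    # assumed logistic choice rule
        pe = (1.0 if partner_cooperated else 0.0) - p       # prediction error
        p += (alpha_pos if pe > 0 else alpha_neg) * pe      # asymmetric update
    return total / len(partner_actions)


# Hypothetical reference parameters and a hypothetical 120-trial partner schedule.
BASELINE = dict(alpha_pos=0.4, alpha_neg=0.3, omega=4.0, beta=2.0)
PARTNER = [True, True, True, False] * 30


def sweep(name, values):
    """Vary one parameter; return (value, reference minus varied cooperation)."""
    reference = mean_coop_probability(PARTNER, **BASELINE)
    results = []
    for value in values:
        params = dict(BASELINE, **{name: value})
        results.append((value, reference - mean_coop_probability(PARTNER, **params)))
    return results


for name, values in [("alpha_pos", [0.1, 0.5, 0.9]), ("alpha_neg", [0.1, 0.5, 0.9]),
                     ("omega", [1.0, 5.0, 9.0]), ("beta", [1.0, 5.0, 9.0])]:
    print(name, [(v, round(d, 3)) for v, d in sweep(name, values)])
```

Here a positive difference means the varied parameter set cooperates less than the reference set, which loosely mirrors the "Adults – Adolescents" index in the caption.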

      Two different schedules of 120 trials were used: one with stable partner behavior and one with behavior changing after 20 trials. While results for order effects are reported, the results for the stable vs. changing phases within each schedule are not. Since learning is influenced by reward structure, it is important to test whether key findings hold across both phases.

      We thank the reviewer for this thoughtful and professional comment. In our GLMM and LMM analyses, we focused on trial order rather than explicitly including the stable vs. changing phase factor, due to concerns about multicollinearity. In our design, phases occur in specific temporal segments, which introduces strong collinearity with trial order. In multi-round interactions, order effects also capture variance related to phase transitions. 

      Nonetheless, to directly address this concern, we conducted additional robustness analyses by adding a phase variable (stable vs. changing) to GLMM1, LMM1, and LMM3 alongside the original covariates. Across these specifications, the key findings were replicated (see GLMM<sub>sup</sub>2 and LMM<sub>sup</sub>4–5; Tables 9-11), and the direction and significance of main effects remained unchanged, indicating that our conclusions are robust to phase differences.

      The division of participants at the legal threshold of 18 years should be more explicitly justified. The age distribution appears continuous rather than clearly split. Providing rationale and including continuous analyses would clarify how groupings were determined.

      We thank the reviewer for this thoughtful comment. We divided participants at the legal threshold of 18 years for both conceptual and practical reasons grounded in prior literature and policy. In many countries and regions, 18 marks the age of legal majority and is widely used as the boundary between adolescence and adulthood in behavioral and clinical research. Empirically, prior studies indicate that psychosocial maturity and executive functions approach adult levels around this age, with key cognitive capacities stabilizing in late adolescence (Icenogle et al., 2019; Tervo-Clemmens et al., 2023). We have clarified this rationale in the Introduction section of the revised manuscript.

      “Based on legal criteria for majority and prior empirical work, we adopt 18 years as the boundary between adolescence and adulthood (Icenogle et al., 2019; Tervo-Clemmens et al., 2023).”

      We fully agree that the underlying age distribution is continuous rather than sharply divided. To address this, we conducted additional analyses treating age as a continuous predictor (see GLMM<sub>sup</sub>1 and LMM<sub>sup</sub>1–3; Tables S1-S4), which generally replicated the patterns observed with the categorical grouping. Nevertheless, given the limited age range of our sample, the generalizability of these findings to fine-grained developmental differences remains constrained. Therefore, our primary analyses continue to focus on the contrast between adolescents and adults, rather than attempting to model a full developmental trajectory.

      Claims of null effects (e.g., in the abstract: "adults increased their intrinsic reward for reciprocating... a pattern absent in adolescents") should be supported with appropriate statistics, such as Bayesian regression.

      We thank the reviewer for highlighting the importance of rigor when interpreting potential null effects. To address this concern, we conducted Bayes factor analyses of the intrinsic reward for reciprocity and reported the corresponding BF10 for all relevant post hoc comparisons. This approach quantifies the relative evidence for the alternative versus the null hypothesis, thereby providing a more direct assessment of null effects. The analysis procedure is now described in the Methods and Materials section: 

      “Post hoc comparisons were conducted using Bayes factor analyses with MATLAB’s bayesFactor Toolbox (version v3.0, Krekelberg, 2024), with a Cauchy prior scale σ = 0.707.”

      Once claims are more closely aligned with the data, the study will offer a valuable contribution to the field, given its use of relevant models and a well-established paradigm.

      We are grateful for the reviewer’s generous appraisal and insightful comments.

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors):

      I commend the authors on a well-structured, clear, and interesting piece of work. I have several questions and recommendations that, if addressed, I believe will strengthen the manuscript.

      We thank the reviewer for commending the organization of our paper.

      Introduction: - Why use a zero-sum (Prisoner's Dilemma; PD) versus a mixed-motive game (e.g. Trust Task) to study cooperation? In a finite set of rounds, the dominant strategy can be to defect in a PD.

      We thank the reviewer for this helpful comment. We agree that both the rationale for using the repeated Prisoner’s Dilemma (rPDG) and the limitations of this framework should be clarified. We chose the rPDG to isolate the core motivational conflict between self-interest and joint welfare, as its symmetric and simultaneous structure avoids the sequential trust and reputation dynamics inherent to asymmetric tasks such as the Trust Game (King-Casas et al., 2005; Rilling et al., 2002).

      Although a finitely repeated rPDG theoretically favors defection, extensive prior research shows that cooperation can still emerge in long repeated interactions when players rely on learning and reciprocity rather than backward induction (Rilling et al., 2002; Fareri et al., 2015). Our design employed 120 consecutive rounds, allowing participants to update expectations about partner behavior and to establish stable reciprocity patterns over time. We have added the following clarification to the Introduction:

      “The rPDG provides a symmetric and simultaneous framework that isolates the motivational conflict between self-interest and joint welfare, avoiding the sequential trust and reputation dynamics characteristic of asymmetric tasks such as the Trust Game (Rilling et al., 2002; King-Casas et al., 2005)”

      Methods:

      Did the participants know how long the PD would go on for?

      Were the participants informed that the partner was real/simulated?

      Were the participants informed that the partner was going to be the same for all rounds?

      We thank the reviewer for the meticulous review, which helped us present the experimental design and reporting details more clearly. We provide the following clarifications: I. Participants were not informed of the total number of rounds in the rPDG. This prevented endgame expectations and avoided distraction from counting rounds, which could introduce additional effects. II. Participants were told that their partner was another human participant in the laboratory. However, the partner’s behavior was predetermined by a computer program. This design enabled tighter experimental control and ensured consistent conditions across age groups, supporting valid comparisons. III. Participants were informed that they would interact with the same partner across all rounds, aligning with the essence of a multi-round interaction paradigm and stabilizing partner-related expectations. For transparency, we have clarified these points in the Methods and Materials section:

      “Participants were told that their partner was another human participant in the laboratory and that they would interact with the same partner across all rounds. However, in reality, the actions of the partner were predetermined by a computer program. This setup allowed for a clear comparison of the behavioral responses between adolescents and adults. Participants were not informed of the total number of rounds in the rPDG.”

      The authors mention that an SVO was also recorded to indicate participant prosociality. Where are the results of this? Did this track game play at all? Could cooperativeness be explained broadly as an SVO preference that penetrated into game-play behaviour?

      We thank the reviewer for pointing this out. We agree that individual differences in prosociality may shape cooperative behavior, so we conducted additional analyses incorporating SVO. Specifically, we extended GLMM1 and LMM3 by adding the measured SVO as a fixed effect with random slopes, yielding GLMM<sub>sup</sub>3 and LMM<sub>sup</sub>6 (Tables 12–13). The results showed that higher SVO was associated with greater cooperation, whereas its effect on the reward for reciprocity was not significant. Importantly, the primary findings remained unchanged after controlling for SVO. These results indicate that cooperativeness in our task cannot be explained solely by a broad SVO preference, although a more prosocial orientation was associated with greater cooperation. We have reported these analyses and results in the Appendix Analysis section.

      Why was AIC chosen rather than BIC to compare model dominance?

      We apologize for the lack of clarity. Both the Akaike Information Criterion (AIC; Akaike, 1974) and the Bayesian Information Criterion (BIC; Schwarz, 1978) are information-theoretic criteria for model comparison, and neither depends on whether the models being compared are nested (Burnham & Anderson, 2002). We have added the following clarification to the Methods.

      “We chose to use the AICc as the metric of goodness-of-fit for model comparison for the following statistical reasons. First, BIC is derived based on the assumption that the “true model” must be one of the models in the limited model set one compares (Burnham & Anderson, 2002; Gelman & Shalizi, 2013), which is unrealistic in our case. In contrast, AIC does not rely on this unrealistic “true model” assumption and instead selects the model that has the highest predictive power in the model set (Gelman et al., 2014). Second, AIC is also more robust than BIC for finite sample sizes (Vrieze, 2012).”
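
For readers less familiar with these criteria, the standard definitions of AIC, the small-sample corrected AICc, and BIC can be written as a short sketch. The function names and the example log-likelihood, parameter count, and trial count are hypothetical and serve only to show how the penalties differ.

```python
import math


def aic(log_lik: float, k: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def aicc(log_lik: float, k: int, n: int) -> float:
    """AIC with the small-sample correction 2k(k+1)/(n-k-1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_lik: float, k: int, n: int) -> float:
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_lik


# Hypothetical fit: 4 free parameters, 120 trials, log-likelihood of -60.
ll, k, n = -60.0, 4, 120
print(f"AIC={aic(ll, k):.2f}  AICc={aicc(ll, k, n):.2f}  BIC={bic(ll, k, n):.2f}")
```

Lower values indicate a better trade-off between fit and complexity under each criterion; BIC penalizes each parameter more heavily than AIC whenever ln n exceeds 2.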

      I believe the model fitting procedure might benefit from hierarchical estimation, rather than maximum likelihood methods. Adolescents in particular seem to show multiple outliers in a^+ and w^+ at the lower end of the distributions in Figure S2. There are several packages to allow hierarchical estimation and model comparison in MATLAB (which I believe is the language used for this analysis; see https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007043).

      We thank the reviewer for this helpful comment and for referring us to relevant methodological work (Piray et al., 2019). We have addressed this point by incorporating hierarchical Bayesian estimation, which effectively mitigates outlier effects and improves model identifiability. The results replicated those obtained with MLE fitting and further revealed group-level differences in key parameters. Please see our detailed response to Reviewer #1 Q1 for the full description of this analysis and results.

      Results: Model confusion seems to show that the inequality aversion and social reward models were consistently confused with the baseline model. Is this explained or investigated? I could not find an explanation for this.

      The apparent overlap between the inequality aversion (Model 4) and social reward (Model 5) models in the recovery analysis likely arises because neither model includes a learning mechanism, making them unable to capture trial-by-trial adjustments in this dynamic task. Consequently, both were best fit by the baseline model. Please see Response to Reviewer #1 Q3 for related discussion.

      Figures 3e and 3f show the correlation between asymmetric learning rates and age. It seems that both a^+ and a^- are around 0.35-0.40 for young adolescents, and this becomes more polarised with age. Could it be that with age comes an increasing discernment of positive and negative outcomes on beliefs, and younger ages compress both positive and negative values together? Given the higher stochasticity in younger ages (\beta), it may also be that these values simply represent higher uncertainty over how to act in any given situation within a social context (assuming the differences in groups are true).

      We appreciate this insightful interpretation. Indeed, both α+ and α- cluster around 0.35–0.40 in younger adolescents and become increasingly polarized with age, suggesting that sensitivity to positive versus negative feedback is less differentiated early in development and becomes more distinct over time. This interpretation remains tentative and warrants further validation. Based on this comment, we have revised the Discussion to include this developmental interpretation.

      We also clarify that in our model β denotes the inverse temperature parameter; higher β reflects greater choice precision and value sensitivity, not higher stochasticity. Accordingly, adolescents showed higher β values, indicating more value-based and less exploratory choices, whereas adults displayed relatively greater exploratory cooperation. These group differences were also replicated using hierarchical Bayesian estimation (see Response to Reviewer #1 Q1). In response to this comment, we have added a statement in the Discussion highlighting this developmental interpretation.

      “Together, these findings suggest that the differentiation between positive and negative learning rates changes with age, reflecting more selective feedback sensitivity in development, while higher β values in adolescents indicate greater value sensitivity. This interpretation remains tentative and requires further validation in future research.”
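
The role of β as an inverse temperature can be illustrated with a two-option softmax, which reduces to a logistic function of the utility difference. The exact choice rule used in the study is not restated here, so this form, the function name, and the example values are assumptions for illustration.

```python
import math


def p_cooperate(delta_u: float, beta: float) -> float:
    """Probability of cooperating given the utility difference
    delta_u = U_cooperation - U_defection and inverse temperature beta."""
    return 1.0 / (1.0 + math.exp(-beta * delta_u))


# Hypothetical utility difference slightly favoring defection.
delta_u = -0.5
for beta in (1.0, 3.0, 5.0):
    print(f"beta={beta}: P(cooperate)={p_cooperate(delta_u, beta):.3f}")
```

As β grows, choices track the sign of the utility difference more closely (less stochastic); as β approaches 0, choices approach 50/50 regardless of value.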

      A parameter partial correlation matrix (off-diagonal) would be helpful to understand the relationship between parameters in both adolescents and adults separately. This may provide a good overview of how the model properties may change with age (e.g. a^+'s relation to \beta).

      We thank the reviewer for this helpful comment. We fully agree that a parameter partial correlation matrix can further elucidate the relationships among parameters. Accordingly, we conducted a partial correlation analysis and added the visually presented results to the revised manuscript as Figure 2-figure supplement 4.

      It would be helpful to have Bayes Factors reported with each statistical test given that several p-values fall between 0.01 and 0.10.

      We thank the reviewer for this important recommendation. We have conducted Bayes factor analyses and reported BF10 for all relevant post hoc comparisons. We also clarified our analysis in the Methods and Materials section: 

      “Post hoc comparisons were conducted using Bayes factor analyses with MATLAB’s bayesFactor Toolbox (version v3.0, Krekelberg, 2024), with a Cauchy prior scale σ = 0.707.”

      Discussion: I believe the language around ruling out failures in mentalising needs to be toned down. RL models do not enable formal representational differences required to assess mentalising, but they can distinguish biases in value learning, which in itself is interesting. If the authors were to show that more complex 'ToM-like' Bayesian models were beaten by RL models across the board, and this did not differ across adults and adolescents, there would be a stronger case to make this claim. I think the authors either need to include Bayesian models in their comparison, or tone down their language on this point, and/or suggest ways in which this point might be more thoroughly investigated (e.g., using structured models on the same task and running comparisons: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0087619).

      We thank the reviewer for the comments. Please see our response to Reviewer 1 (Appraisal & Discussion section) for details.

      Reviewer #2 (Recommendations for the authors):

      The authors may want to show the winning model earlier (perhaps near the beginning of the Results section, when model parameters are first mentioned).

      We thank the reviewer for this suggestion. We agree that highlighting the winning model early improves clarity. Currently, we have mentioned the winning model before the beginning of the Results section. Specifically, in the penultimate paragraph of the Introduction we state:

      “We identified the asymmetric RL learning model as the winning model that best explained the cooperative decisions of both adolescents and adults.”

      Reviewer #3 (Recommendations for the authors):

      In addition to the points mentioned above, I suggest the following:

      (1) Clarify plots by clearly explaining each variable. In particular, the indices 1 vs. 1,2 vs. 1,2,3 were not immediately understandable.

      We thank the reviewer for this suggestion. We agree that the indices were not immediately clear. We have revised the figure captions (Figures 1 and 4) to define these terms explicitly:

      “The x-axis represents the consistency of the partner’s actions in previous trials (t<sub>−1</sub>: last trial; t<sub>−1,2</sub>: last two trials; t<sub>−1,2,3</sub>: last three trials).”
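
One way to read these indices is as a check on whether the partner's last one, two, or three actions before a given trial were all the same. The sketch below assumes actions coded as "C" (cooperate) and "D" (defect); the function name and coding are hypothetical and not taken from the authors' analysis scripts.

```python
def consistent_streak(partner_actions, t, k):
    """Return "C" or "D" if the partner's k actions before trial t
    (0-indexed) were all identical, otherwise None."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if t < k:
        return None  # not enough history before trial t
    window = partner_actions[t - k:t]
    return window[0] if all(a == window[0] for a in window) else None


# Hypothetical partner history.
actions = ["C", "C", "D", "D", "D", "C"]
for k in (1, 2, 3):
    print(k, [consistent_streak(actions, t, k) for t in range(len(actions))])
```

In this reading, k = 1, 2, 3 would correspond to t<sub>−1</sub>, t<sub>−1,2</sub>, and t<sub>−1,2,3</sub>, respectively.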

      It's unclear why the index stops at 3. If this isn't the maximum possible number of consecutive cooperation trials, please consider including all relevant data, as adolescents might show a trend similar to adults over more trials.

      We thank the reviewer for raising this point. In our exploratory analyses, we also examined longer streaks of consecutive partner cooperation or defection (up to four or five trials). Two empirical considerations led us to set the cutoff at three in the final analyses. First, the influence of partner behavior diminished sharply with temporal distance. In both GLMMs and LMMs, coefficients for earlier partner choices were small and unstable, and their inclusion substantially increased model complexity and multicollinearity. This recency pattern is consistent with learning and decision models emphasizing stronger weighting of recent evidence (Fudenberg & Levine, 2014; Fudenberg & Peysakhovich, 2016). Second, streaks longer than three were rare, especially among some participants, leading to data sparsity and inflated uncertainty. Including these sparse conditions risked biasing group estimates rather than clarifying them. Balancing informativeness and stability, we therefore restricted the index to three consecutive partner choices in the main analyses, which we believe sufficiently capture individuals’ general tendencies in reciprocal cooperation.

      The term "reciprocity" may not be necessary. Since it appears to reflect a general preference for cooperation, it may be clearer to refer to the specific behavior or parameter being measured. This would also avoid confusion, especially since adolescents do show negative reciprocity in response to repeated defection.

      We thank the reviewer for this comment. In our work, we compute the intrinsic reward for reciprocity as p × ω, where p is the partner cooperation expectation and ω is the cooperation preference. In the rPDG, this value framework manifests as a reciprocity-derived reward: sustained mutual cooperation maximizes joint benefits, and the resulting choice pattern reflects a value for reciprocity, contingent on the expected cooperation of the partner. This quantity enters the trade-off between U<sub>cooperation</sub> and U<sub>defection</sub> and captures the participant’s intrinsic reward for reciprocity versus the additional monetary payoff of defection. Therefore, we consider the term “reciprocity” an appropriate label for this construct.

      Interpretation of parameters should closely reflect what they specifically measure.

      We thank the reviewer for pointing this out. We have refined the relevant interpretations of parameters in the current Results and Discussion sections.

      Prior research has shown links between Theory of Mind (ToM) and cooperation (e.g., Martínez-Velázquez et al., 2024). It would be valuable to test whether this also holds in your dataset.

      We thank the reviewer for this thoughtful comment. Although we did not directly measure participants’ ToM, our design allowed us to estimate participants’ trial-by-trial inferences (i.e., expectations) about their partner’s cooperation probability. We therefore treat these cooperation expectations as an indirect proxy for belief inference, which is related to ToM processes. To test whether this belief-inference component relates to cooperation in our dataset, we further conducted an exploratory analysis (GLMM<sub>sup</sub>4) in which participants’ choices were regressed on their cooperation expectations, group, and the group × cooperation-expectation interaction, controlling for trial number and gender, with random effects. Consistent with the ToM–cooperation link in prior research (Martínez-Velázquez et al., 2024), participants’ expectations about their partner’s cooperation significantly predicted their cooperative behavior (Table 14), suggesting that decisions were shaped by social learning about others’ inferred actions. Moreover, the interaction between group and cooperation expectation was not significant, indicating that this inference-driven social learning process likely operates similarly in adolescents and adults. This aligns with our primary modeling results showing that both age groups update beliefs via an asymmetric learning process. We have reported these analyses in the Appendix Analysis section.

      More informative table captions would help the reader. Please clarify how variables are coded (e.g., is female = 0 or 1? Is adolescent = 0 or 1?), to avoid the need to search across the manuscript for this information.

      We thank the reviewer for raising this point. We have added clear and standardized variable coding in the table notes of all tables to make them more informative and avoid the need to search the paper. We have ensured consistent wording and formatting across all tables.

      I hope these comments are helpful and support the authors in further strengthening their manuscript.

      We thank the three reviewers for their comments, which have been helpful in strengthening this work.

      References

      (1) Fudenberg, D., & Levine, D. K. (2014). Recency, consistent learning, and Nash equilibrium. Proceedings of the National Academy of Sciences of the United States of America, 111(Suppl. 3), 10826–10829. https://doi.org/10.1073/pnas.1400987111

      (2) Fudenberg, D., & Peysakhovich, A. (2016). Recency, records, and recaps: Learning and nonequilibrium behavior in a simple decision problem. ACM Transactions on Economics and Computation, 4(4), Article 23, 1–18. https://doi.org/10.1145/2956581

      (3) Hackel, L., Doll, B., & Amodio, D. (2015). Instrumental learning of traits versus rewards: Dissociable neural correlates and effects on choice. Nature Neuroscience, 18, 1233–1235. https://doi.org/10.1038/nn.4080

      (4) Icenogle, G., Steinberg, L., Duell, N., Chein, J., Chang, L., Chaudhary, N., Di Giunta, L., Dodge, K. A., Fanti, K. A., Lansford, J. E., Oburu, P., Pastorelli, C., Skinner, A. T., Sorbring, E., Tapanya, S., Uribe Tirado, L. M., Alampay, L. P., Al-Hassan, S. M., Takash, H. M. S., & Bacchini, D. (2019). Adolescents’ cognitive capacity reaches adult levels prior to their psychosocial maturity: Evidence for a “maturity gap” in a multinational, cross-sectional sample. Law and Human Behavior, 43(1), 69–85. https://doi.org/10.1037/lhb0000315

      (5) Krekelberg, B. (2024). Matlab Toolbox for Bayes Factor Analysis (v3.0) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.13744717

      (6) Martínez-Velázquez, E. S., Ponce-Juárez, S. P., Díaz Furlong, A., & Sequeira, H. (2024). Cooperative behavior in adolescents: A contribution of empathy and emotional regulation? Frontiers in Psychology, 15, 1342458. https://doi.org/10.3389/fpsyg.2024.1342458

      (7) Tervo-Clemmens, B., Calabro, F. J., Parr, A. C., et al. (2023). A canonical trajectory of executive function maturation from adolescence to adulthood. Nature Communications, 14, 6922. https://doi.org/10.1038/s41467-023-42540-8

      (8) King-Casas, B., Tomlin, D., Anen, C., Camerer, C. F., Quartz, S. R., & Montague, P. R. (2005). Getting to know you: reputation and trust in a two-person economic exchange. Science, 308(5718), 78-83. https://doi.org/10.1126/science.1108062

      (9) Rilling, J. K., Gutman, D. A., Zeh, T. R., Pagnoni, G., Berns, G. S., & Kilts, C. D. (2002). A neural basis for social cooperation. Neuron, 35(2), 395-405. https://doi.org/10.1016/s0896-6273(02)00755-9

      (10) Fareri, D. S., Chang, L. J., & Delgado, M. R. (2015). Computational substrates of social value in interpersonal collaboration. Journal of Neuroscience, 35(21), 8170-8180. https://doi.org/10.1523/JNEUROSCI.4775-14.2015

      (11) Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. https://doi.org/10.1109/TAC.1974.1100705

      (12) Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/aos/1176344136

      (13) Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). Springer. https://doi.org/10.1007/b97636

      (14) Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1), 8–38. https://doi.org/10.1111/j.2044-8317.2011.02037.x

      (15) Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian data analysis (3rd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b16018

      (16) Vrieze, S. I. (2012). Model selection and psychological theory: A discussion of the differences between the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Psychological Methods, 17(2), 228–243. https://doi.org/10.1037/a0027127.

    1. eLife Assessment

      This study offers important insights into how entorhinal and hippocampal activity support human thinking in feature spaces. It replicates hexagonal symmetry in entorhinal cortex, reports a novel three-fold symmetry in both behavior and hippocampal signals, and links these findings with a computational model. The task and analyses are sophisticated, and the results appear solid and of broad interest to neuroscientists.

    2. Reviewer #1 (Public review):

      Summary:

      Zhang and colleagues examine neural representations underlying abstract navigation in entorhinal cortex (EC) and hippocampus (HC) using fMRI. This paper replicates a previously identified hexagonal modulation of abstract navigation vectors in abstract space in EC in a novel task involving navigating in a conceptual Greeble space. In HC, the authors identify a three-fold signal of the navigation angle. They also use a novel analysis technique (spectral analysis) to look at spatial patterns in these two areas and identify phase coupling between HC and EC. Interestingly, the three-fold pattern identified in the hippocampus explains quirks in participants' behavior where navigation performance follows a three-fold periodicity. Finally, the authors propose an EC-HPC PhaseSync Model to understand how the EC and HC construct cognitive maps. The wide array and creativity of the techniques used are impressive, but because of their unique nature, the paper would benefit from more details on how some of these techniques were implemented.

      Comments on revisions:

      Most of my concerns were adequately addressed, and I believe the paper is greatly improved. I have two more points. I noticed that the legend for Figure 4 still refers to some components of the previous figure version; this should be updated to reflect the current version of the figure. I also think the paper would benefit from more details regarding some of the analyses. Specifically, the phase-amplitude coupling analysis should have a section in the Methods that clarifies how the BOLD signals were reconstructed.

    3. Reviewer #2 (Public review):

      The authors report results from behavioral data, fMRI recordings, and computer simulations during a conceptual navigation task. They report 3-fold symmetry in behavioral and simulated model performance, 3-fold symmetry in hippocampal activity, and 6-fold symmetry in entorhinal activity (all as a function of movement directions in conceptual space). The analyses seem thoroughly done, and the results and simulations are very interesting.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary: 

      Zhang and colleagues examine neural representations underlying abstract navigation in the entorhinal cortex (EC) and hippocampus (HC) using fMRI. This paper replicates a previously identified hexagonal modulation of abstract navigation vectors in abstract space in EC in a novel task involving navigating in a conceptual Greeble space. In HC, the authors claim to identify a three-fold signal of the navigation angle. They also use a novel analysis technique (spectral analysis) to look at spatial patterns in these two areas and identify phase coupling between HC and EC. Finally, the authors propose an EC-HPC PhaseSync Model to understand how the EC and HC construct cognitive maps. While the wide array of techniques used is impressive and their creativity in analysis is admirable, overall, I found the paper a bit confusing and unconvincing. I recommend a significant rewrite of their paper to motivate their methods and clarify what they actually did and why. The claim of three-fold modulation in HC, while potentially highly interesting to the community, needs more background to motivate why they did the analysis in the first place, more interpretation as to why this would emerge in biology, and more care taken to consider alternative hypotheses steeped in existing models of HC function. I think this paper does have potential to be interesting and impactful, but I would like to see these issues improved first.

      General comments:

      (1) Some of the terminology used does not match the terminology used in previous relevant literature (e.g., sinusoidal analysis, 1D directional domain).

      We thank the reviewer for this valuable suggestion, which helps to improve the consistency of our terminology with previous literature and to reduce potential ambiguity. Accordingly, we have replaced “sinusoidal analysis” with “sinusoidal modulation” (Doeller et al., 2010; Bao et al., 2019; Raithel et al., 2023) and “1D directional domain” with “angular domain of path directions” throughout the manuscript.

      (2) Throughout the paper, novel methods and ideas are introduced without adequate explanation (e.g., the spectral analysis and three-fold periodicity of HC).

      We thank the reviewer for raising this important point. In the revised manuscript, we have substantially extended the Introduction (paragraphs 2–4) to clarify our hypothesis, explicitly explaining why the three primary axes of the hexagonal grid cell code may manifest as vector fields. We have also revised the first paragraph of the “3-fold periodicity in the HPC” section in the Results to clarify the rationale for using spectral analysis. Please refer to our responses to comments 2 and 3 below for details.

      Reviewer #2 (Public review):

      The authors report results from behavioral data, fMRI recordings, and computer simulations during a conceptual navigation task. They report 3-fold symmetry in behavioral and simulated model performance, 3-fold symmetry in hippocampal activity, and 6-fold symmetry in entorhinal activity (all as a function of movement directions in conceptual space). The analyses are thoroughly done, and the results and simulations are very interesting.

      We sincerely thank the reviewer for the positive and encouraging comments on our study.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) This paper has quite a few spelling and grammatical mistakes, making it difficult to understand at times.

      We apologize for the wording and grammatical errors. We have carefully re-read and edited the entire manuscript to correct typographical and grammatical errors and to improve clarity and readability.

      (2) Introduction - It's not clear why the three primary axes of hexagonal grid cell code would manifest as vector fields.

      We thank the reviewer for raising this important point. In the revised Introduction (paragraphs 2, 3, and 4), we now explicitly explain the rationale behind our hypothesis that the three primary axes of the hexagonal grid cell code manifest as vector fields.

      In paragraph 2, we present empirical evidence from rodent, bat, and human studies demonstrating that mental simulation of prospective paths relies on vectorial representations in the hippocampus (Sarel et al., 2017; Ormond and O’Keefe, 2022; Muhle-Karbe et al., 2023).

      In paragraphs 3 and 4, we introduce our central hypothesis: vectorial representations may originate from population-level projections of entorhinal grid cell activity, based on three key considerations:

      (1) The EC serves as the major source of hippocampal input (Witter and Amaral, 1991; van Groen et al., 2003; Garcia and Buffalo, 2020).

      (2) Grid codes exhibit nearly invariant spatial orientations (Hafting et al., 2005; Gardner et al., 2022), which makes it plausible that their spatially periodic activity can be detected using fMRI.

      (3) A model-based inference: for example, in the simplest case, when one mentally simulates a straight pathway aligned with the grid orientation, a subpopulation of grid cells would be activated. The resulting population activity would form a near-perfect vectorial representation, with constant activation strength along the path. In contrast, if the simulated path is misaligned with the grid orientation, the population response becomes a distorted vectorial code. Consequently, simulating all possible straight paths spanning 0°–360° results in 3-fold periodicity in the activity patterns—due to the 180° rotational symmetry of the hexagonal grid, orientations separated by 180° are indistinguishable.

      We therefore speculate that vectorial representations embedded in grid cell activity exhibit 3-fold periodicity across spatial orientations and serve as a periodic structure to represent spatial direction. Supporting this view, reorientation paradigms in both rodents and young children have shown that subjects search equally in two opposite directions, reflecting successful orientation encoding but a failure to integrate absolute spatial direction (Hermer and Spelke, 1994; Julian et al., 2015; Gallistel, 2017; Julian et al., 2018).

      (3) It took me a few reads to understand what the spectral analysis was. After understanding, I do think this is quite clever. However, this paper needs more motivation to understand why you are performing this analysis. E.g., why not just take the average regressor at the 10º, 70º, etc. bins and compare it to the average regressor at 40º, 100º bins? What does the Fourier transform buy you?

      We are sorry for the confusion. Below, we outline the rationale for employing Fast Fourier Transform (FFT) analysis to identify neural periodicity. In the revised manuscript, we have added these clarifications to the first paragraph of the “3-fold periodicity in the HPC” subsection in the Results.

      First, FFT serves as an independent approach to cross-validate the sinusoidal modulation results, providing complementary evidence for the 6-fold periodicity in EC and the 3-fold periodicity in HPC.

      Second, FFT enables unbiased detection of multiple candidate periodicities (e.g., 3–7-fold) simultaneously without requiring prior assumptions about spatial phase (orientation). By contrast, directly comparing “aligned” versus “misaligned” angular bins (e.g., 10°/70° vs. 40°/100°) would implicitly assume knowledge of the phase offset, which was not known a priori.

      Finally, FFT uniquely allows periodicity analysis of behavioral performance, which is not feasible with standard sinusoidal GLM approaches. This methodological consistency makes it possible to directly compare periodicities across neural and behavioral domains.

      (4) A more minor point: at one point, you say it’s a spectral analysis of the BOLD signals, but the methods description makes it sound like you estimated regressors at each of the bins before performing FFT. Please clarify. 

      We apologize for the confusion. In our manuscript, we use the term spectral analysis to distinguish this approach from sinusoidal modulation analysis. Conceptually, our spectral analysis involves a three-level procedure:

      (1) First level: We estimated direction-dependent activity maps using a general linear model (GLM), which included 36 regressors corresponding to path directions, down-sampled in 10° increments.

      (2) Second level: We applied a Fast Fourier Transform (FFT) to the direction-dependent activity maps derived from the GLM to examine the spectral magnitude of potential spatial periodicities.

      (3) Third level: We conducted group-level statistical analyses across participants to assess the consistency of the observed periodicities.

      We have revised the “Spectral analysis of MRI BOLD signals” subsection in the Methods to clarify this multi-level procedure.
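
      The second level of the procedure above can be illustrated with a minimal sketch. This is not the analysis code used in the study; it assumes a hypothetical vector of 36 direction-wise estimates (one per 10° bin) and uses synthetic data. The function name `fold_spectrum` and the normalization are illustrative choices.

```python
import numpy as np

def fold_spectrum(betas):
    """FFT magnitude per rotational fold for evenly spaced direction bins.

    betas: 1-D array of direction-dependent estimates covering 0-360 degrees
           (e.g., 36 values for 10-degree bins).
    Index k of the output is k cycles per 360 degrees, i.e., k-fold periodicity.
    """
    betas = np.asarray(betas, dtype=float)
    centered = betas - betas.mean()  # remove the mean so fold 0 carries no power
    return np.abs(np.fft.rfft(centered)) / betas.size

# Synthetic example: a 3-fold signal with an arbitrary phase offset plus noise.
rng = np.random.default_rng(0)
directions = np.deg2rad(np.arange(0, 360, 10))
betas = np.cos(3 * (directions - np.deg2rad(17))) + 0.1 * rng.standard_normal(directions.size)

spectrum = fold_spectrum(betas)
peak_fold = int(np.argmax(spectrum[1:]) + 1)  # skip fold 0 when locating the peak
print("fold with the largest magnitude:", peak_fold)
```

      Because the magnitude of a Fourier component does not depend on its phase, the same peak is obtained for any phase offset, which is why no orientation estimate is needed at this step.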

      (5) Figure 4a:

      Why do the phases go all the way to 2*pi if periodicity is either three-fold or six-fold? 

      When performing correlation between phases, you should perform a circular-circular correlation instead of a Pearson's correlation.

      We thank the reviewer for raising this important point. In the original Figure 4a, both EC and HPC phases spanned 0–2π because their sinusoidal phase estimates were projected into a common angular space by scaling them according to their symmetry factors (i.e., multiplying the 3-fold phase by 3 and the 6-fold phase by 6), followed by taking the modulo 2π. However, this projection forced signals with distinct intrinsic periodicities (120° vs. 60° cycles) into a shared 360° space, thereby distorting their relative angular distances and disrupting the one-to-one correspondence between physical directions and phase values. Consequently, this transformation could bias the estimation of their phase relationship.

      In the revised analysis and Figure 4a, we retained the original phase estimates derived from the sinusoidal modulation within their native periodic ranges (0–120° for 3-fold and 0–60° for 6-fold) by applying modulo operations directly. Following your suggestion, the relationship between EC and HPC phases was then quantified using circular–circular correlation (Jammalamadaka & Sengupta, 2001), as implemented in the CircStat MATLAB toolbox. This updated analysis avoids the rescaling artifact and provides a statistically stronger and conceptually clearer characterization of the phase correspondence between EC and HPC.
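
      For readers unfamiliar with circular–circular correlation, the following sketch implements the standard Jammalamadaka–SenGupta coefficient in NumPy. It is an illustration only, not the CircStat implementation or the authors' code, and it assumes both phase vectors are already expressed in radians on a full circle; how phases in their native periodic ranges were passed to the toolbox is not specified here.

```python
import numpy as np

def circular_mean(angles):
    """Mean direction of a sample of angles (radians)."""
    return np.angle(np.mean(np.exp(1j * np.asarray(angles, dtype=float))))

def circ_corrcc(alpha, beta):
    """Circular-circular correlation coefficient between two angle samples.

    rho = sum(sin(a - a_bar) * sin(b - b_bar))
          / sqrt(sum(sin(a - a_bar)^2) * sum(sin(b - b_bar)^2))
    """
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    sa = np.sin(alpha - circular_mean(alpha))
    sb = np.sin(beta - circular_mean(beta))
    return np.sum(sa * sb) / np.sqrt(np.sum(sa ** 2) * np.sum(sb ** 2))

# Synthetic example: one phase set is a noisy, shifted copy of the other.
rng = np.random.default_rng(0)
ec_phase = rng.vonmises(mu=0.0, kappa=2.0, size=33)
hpc_phase = ec_phase + 0.5 + 0.2 * rng.standard_normal(ec_phase.size)
print("circular-circular correlation:", circ_corrcc(ec_phase, hpc_phase))
```

      Unlike Pearson correlation, the coefficient is unchanged when either sample is shifted by a constant angle, so it does not depend on where the origin of the phase axis is placed.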

      (6) Figure 4d needs additional clarification:

      Phase-locking is typically used to describe data with a high temporal precision. I understand you adopted an EEG analysis technique to this reconstructed fMRI time-series data, but it should be described differently to avoid confusion. This needs additional control analyses (especially given that 3 is a multiple of 6) to confirm that this result is specific to the periodicities found in the paper.

      We thank the reviewer for this insightful comment. We have extensively revised the description of Figure 4 to avoid confusion with EEG-based phase-locking techniques. The revised text now explicitly clarifies that our approach quantifies spatial-domain periodic coupling across path directions, rather than temporal synchronization of neural signals.

      To further address the reviewer’s concern about potential effects of the integer multiple relationship between the 3-fold HPC and 6-fold EC periodicities, we additionally performed two control analyses using the 9-fold and 12-fold EC components, both of which are also integer multiples of the 3-fold HPC periodicity. Neither control analysis showed significant coupling (p > 0.05), confirming that the observed 3-fold–6-fold coupling was specific and not driven by their harmonic relationship.

      The description of the revised Figure 4 has been updated in the “Phase Synchronization Between HPC and EC Activity” subsection of the Results.

      (7) Figure 5a is misleading. In the text, you say you test for propagation to egocentric cortical areas, but I don’t see any analyses done that test this. This feels more like a possible extension/future direction of your work that may be better placed in the discussion.

      We are sorry for the confusion. Figure 5a was intended as a hypothesis-driven illustration to motivate our analysis of behavioral periodicity based on participants’ task performance. However, we agree with the reviewer that, on its own, Figure 5a could be misleading, as it does not directly present supporting analyses.

      To provide empirical support for the interpretation depicted in Figure 5a, we conducted a whole-brain analysis (Figure S8), which revealed significant 3-fold periodic signals in egocentric cortical regions, including the parietal cortex (PC), precuneus (PCU), and motor regions.

      To avoid potential misinterpretation, we have revised the main text to include these results and explicitly referenced Figure S8 in connection with Figure 5a.

      The updated description in the “3-fold periodicity in human behavior” subsection in the Results is as follows:

      “Considering the reciprocal connectivity between the medial temporal lobe (MTL), where the EC and HPC reside, and the parietal cortex implicated in visuospatial perception and action, together with the observed 3-fold periodicity within the DMN (including the PC and PCu; Fig. S8), we hypothesized that the 3-fold periodic representations of path directions extend beyond the MTL to the egocentric cortical areas, such as the PC, thereby influencing participants' visuospatial task performance (Fig. 5a)”.

      Additionally, Figure 5a has been modified to more clearly highlight the hypothesized link between activity periodicity and behavioral periodicity, rather than suggesting a direct anatomical pathway.

      (8) PhaseSync model: I am not an expert in this type of modeling, so please put a lower weight on this comment (especially compared to some of the other reviewers). While the PhaseSync model seems interesting, it’s not clear from the discussion how this compares to current models. E.g., Does it support them by adding the three-fold HC periodicity? Does it demonstrate that some of them can't be correct because they don't include this three-fold periodicity?

      We thank the reviewer for the insightful comment regarding the PhaseSync model. We agree that further clarifying its relationship to existing computational frameworks is important.

      The EC–HPC PhaseSync model is not intended to replace or contradict existing grid–place cell models of navigation (e.g., Bicanski and Burgess, 2019; Whittington et al., 2020; Edvardsen et al., 2020). Instead, it offers a hierarchical extension by proposing that vectorial representations in the hippocampus emerge from the projections of periodic grid codes in the entorhinal cortex. Specifically, the model suggests that grid cell populations encode integrated path information, forming a vectorial gradient toward goal locations.

      To simplify the theoretical account, our model was implemented in an idealized square layout. In more complex real-world environments, hippocampal 3-fold periodicity may interact with additional spatial variables, such as distance, movement speed, and environmental boundaries.

      We have revised the final two paragraphs of the Discussion to clarify this conceptual framework and emphasize the importance of future studies in exploring how periodic activity in the EC–HPC circuit interacts with environmental features to support navigation.

      Reviewer #2 (Recommendations for the authors):

      (1) Please show a histogram of movement direction sampling for each participant.

      We thank the reviewer for this helpful suggestion. We have added a new supplementary figure (Figure S2) showing histograms of path direction sampling for each participant (36 bins of 10°). The figure is also included. Rayleigh tests for circular uniformity revealed no significant deviations from uniformity (all ps > 0.05, Bonferroni-corrected across participants), confirming that path directions were sampled evenly across 0°–360°.
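
      A Rayleigh test of the kind mentioned above can be sketched as follows. This is an illustrative NumPy version using a widely used closed-form approximation of the p-value (Zar, Biostatistical Analysis), not the authors' code; the sample data and the participant count used for the Bonferroni threshold are hypothetical.

```python
import numpy as np

def rayleigh_test(angles):
    """Rayleigh test for non-uniformity of circular data (angles in radians).

    Returns (z, p), where z = n * r^2, r is the mean resultant length, and
    p is a standard closed-form approximation of the p-value.
    """
    angles = np.asarray(angles, dtype=float)
    n = angles.size
    r = np.abs(np.mean(np.exp(1j * angles)))  # mean resultant length
    big_r = n * r
    z = big_r ** 2 / n
    p = np.exp(np.sqrt(1 + 4 * n + 4 * (n ** 2 - big_r ** 2)) - (1 + 2 * n))
    return z, p

# Perfectly even sampling of 36 direction bins: no evidence against uniformity.
even = np.deg2rad(np.arange(0, 360, 10))
z_even, p_even = rayleigh_test(even)

# All paths in nearly the same direction: strong evidence against uniformity.
clustered = np.deg2rad(np.full(36, 45.0))
z_clu, p_clu = rayleigh_test(clustered)

# Bonferroni correction across participants (hypothetical count of 33).
alpha_corrected = 0.05 / 33
print(p_even > alpha_corrected, p_clu < alpha_corrected)
```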

      (2) Why didn’t you use participants’ original trajectories (instead of the trajectories inferred from the movement start and end points) for the hexadirectional analyses? 

      In our paradigm, participants used two MRI-compatible 2-button response boxes (one for each hand) to adjust the two features of the greebles. As a result, the raw adjustment path contained only four cardinal directions (up, down, left, right). If we were to use the raw stepwise trajectories, the analysis would be restricted to these four directions, which would severely limit the angular resolution. By instead defining direction as the vector from the start to the end position in feature space, we can expand the effective range of directions to the full 0–360°. This approach follows previous literature on abstract grid-like coding in humans (e.g., Constantinescu et al., 2016), where direction was similarly defined by the relative change between two feature dimensions rather than the literal stepwise path. We have added this clarification in the “Sinusoidal modulation” subsection of the revised Methods.
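
      The difference between stepwise and start-to-end directions can be made concrete with a short sketch. The coordinate convention (first feature on the x-axis, second on the y-axis, 0° along the positive x-axis) and the function name are assumptions made for illustration.

```python
import math

def path_direction_deg(start, end):
    """Direction of the vector from start to end in a 2-D feature space, in [0, 360)."""
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    if dx == 0 and dy == 0:
        raise ValueError("start and end coincide; direction is undefined")
    return math.degrees(math.atan2(dy, dx)) % 360.0

# Button presses move one feature at a time, so every step is cardinal...
steps = [(1, 0), (1, 0), (0, 1)]  # right, right, up
start = (0, 0)
end = (start[0] + sum(s[0] for s in steps), start[1] + sum(s[1] for s in steps))

# ...but the start-to-end vector can take any direction.
direction = path_direction_deg(start, end)
print(round(direction, 2))
```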

      (3) Legend of Figure 2: the statement "localizing grid cell activity" seems too strong because it is still not clear whether hexadirectional signals indeed result from grid-cell activity (e.g., Bin Khalid et al., eLife, 2024). I would suggest rephrasing this statement (here and elsewhere). 

      Thank you for this helpful suggestion. We have removed the statement “localizing grid cell activity” to avoid ambiguity and revised the legend of Figure 2a to more explicitly highlight its main purpose—defining how path directions and the aligned/misaligned conditions were constructed in the 6-fold modulation. We have also modified similar expressions throughout the manuscript to ensure consistency and clarity.

      (4) Legend of Figure 2: “cluster-based SVC correction for multiple comparisons” - what is the small volume you are using for the correction? Bilateral EC?

      For both Figure 2 and Figure 3, the anatomical mask of the bilateral medial temporal lobe (MTL), as defined by the AAL atlas, was used as the small volume for correction. This has been clarified in the revised Statistical Analysis section of the Methods as “… with small-volume correction (SVC) applied within the bilateral MTL”.

      (5) Legend of Figure 2: "ROI-based analysis" - what kind of ROI are you using? "corrected for multiple comparisons" - which comparisons are you referring to? Different symmetries and also the right/left hemisphere?

      In Figure 2b, the ROI was defined as a functional mask derived from the significant activation cluster in the right entorhinal cortex (EC). Since no robust clusters were observed in the left EC, the functional ROI was restricted to the right hemisphere. We indeed included Figure 2c to illustrate this point; however, we recognize that our description in the text was not sufficiently clear.

      Regarding the correction for multiple comparisons, this refers specifically to the comparisons across different rotational symmetries (3-, 4-, 5-, 6-, and 7-fold). Only the 6-fold symmetry survived correction, whereas no significant effects were detected for the other symmetries.

      We have clarified these points in the “6-fold periodicity in the EC” subsection of the Results as “… The ROI was defined as a functional mask of the right EC identified in the voxel-based analysis and further restricted within the anatomical EC. These analyses revealed significant periodic modulation only at 6-fold (Figure 2c; t(32) = 3.56, p = 0.006, two-tailed, corrected for multiple comparisons across rotational symmetries; Cohen’s d = 0.62) …”.

      We have also revised the “3-fold periodicity in the HPC” subsection of the Results as “… ROI analysis, using a functional mask of the HPC identified in the spectral analysis and further restricted within the anatomical HPC, indicated that HPC activity selectively fluctuated at 3-fold periodicity (Figure 3e; t(32) = 3.94, p = 0.002, corrected for multiple comparisons across rotational symmetries; Cohen’s d = 0.70) …”.

      (6) Figure 2d: Did you rotationally align 0° across participants? Please state explicitly whether (or not) 0° aligns with the x-axis in Greeble space.

      We thank the reviewer for this helpful question. Yes, before reconstructing the directional tuning curve in Figure 2d, path directions were rotationally aligned for each participant by subtracting the participant-specific grid orientation (ϕ) estimated from the independent dataset (odd sessions). We have now made this description explicit in the revised manuscript in the “6-fold periodicity in the EC” subsection of the Results, stating “… To account for individual difference in spatial phase, path directions were calibrated by subtracting the participant-specific grid orientation estimated from the odd sessions ...”.

      (7) Clustering of grid orientations in 30 participants: What does “Bonferroni corrected” refer to? Also, the Rayleigh test is sensitive to the number of voxels - do you obtain the same results when using pair-wise phase consistency? 

      “Bonferroni corrected” here refers to correction across participants. We have clarified this in the first paragraph of the “6-fold periodicity in the EC” subsection of the Results and in the legend of Supplementary Figure S5 as “Bonferroni-corrected across participants.”

      To examine whether our findings were sensitive to the number of voxels, we followed the reviewer’s guidance and computed pairwise phase consistency (PPC; Vinck et al., 2010) for each participant. The PPC results replicated those obtained with the Rayleigh test. We have added the new results to Supplementary Figure S5. We also updated the “Statistical Analysis” subsection of the Methods to describe PPC as “For the PPC (Vinck et al., 2010), significance was tested using 5,000 permutations of uniformly distributed random phases (0–2π) to generate a null distribution for comparison with the observed PPC”.
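
      As an illustration of the PPC statistic and the random-phase null described above, the sketch below computes PPC in closed form and compares it with 5,000 sets of uniformly distributed phases. It is not the authors' code; the function names, the seed, and the synthetic voxel phases are assumptions, and phases are taken to be in radians on a full circle.

```python
import numpy as np

def ppc(phases):
    """Pairwise phase consistency: mean of cos(theta_i - theta_j) over all pairs i < j.

    Uses the identity |sum(exp(i*theta))|^2 = n + 2 * sum_{i<j} cos(theta_i - theta_j).
    Works on the last axis, so a 2-D array yields one PPC value per row.
    """
    phases = np.asarray(phases, dtype=float)
    n = phases.shape[-1]
    resultant_sq = np.abs(np.sum(np.exp(1j * phases), axis=-1)) ** 2
    return (resultant_sq - n) / (n * (n - 1))

def ppc_random_phase_test(phases, n_perm=5000, seed=0):
    """Compare the observed PPC with a null built from uniform random phases."""
    rng = np.random.default_rng(seed)
    observed = ppc(phases)
    null = ppc(rng.uniform(0.0, 2.0 * np.pi, size=(n_perm, len(phases))))
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value

# Synthetic example: 50 voxel phases clustered around a common value.
rng = np.random.default_rng(1)
voxel_phases = rng.vonmises(mu=1.0, kappa=5.0, size=50)
observed_ppc, p_value = ppc_random_phase_test(voxel_phases)
print(observed_ppc, p_value)
```

      Because PPC averages over pairs rather than scaling with the sample size, its expected value under uniformity is zero regardless of how many voxels contribute, which is the property that motivates it as a check on the Rayleigh result.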

      (8) 6-fold periodicity in the EC: Do you compute an average grid orientation across all EC voxels, or do you compute voxel-specific grid orientations?

      Following the protocol originally described by Doeller et al. (2010), we estimated voxel-wise grid orientations within the EC and then obtained a participant-specific orientation by averaging across voxels within a hand-drawn bilateral EC mask. The procedure is described in detail in the “Sinusoidal modulation” subsection of the Methods.
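
      The general logic of estimating a grid orientation from sine and cosine regressors can be sketched as follows. This is a simplified illustration on synthetic data, not the authors' pipeline: the least-squares fit stands in for the fMRI GLM, and averaging voxel-wise orientations with a circular mean in 6-fold space is an assumption about how the averaging could be done.

```python
import numpy as np

def estimate_orientation(directions_rad, signal, fold=6):
    """Fit signal ~ b_cos*cos(fold*theta) + b_sin*sin(fold*theta) + c by least squares.

    Since A*cos(fold*(theta - phi)) = A*cos(fold*phi)*cos(fold*theta)
                                    + A*sin(fold*phi)*sin(fold*theta),
    the orientation is phi = atan2(b_sin, b_cos) / fold, in (-pi/fold, pi/fold].
    """
    theta = np.asarray(directions_rad, dtype=float)
    design = np.column_stack([np.cos(fold * theta), np.sin(fold * theta), np.ones_like(theta)])
    b_cos, b_sin, _ = np.linalg.lstsq(design, np.asarray(signal, dtype=float), rcond=None)[0]
    return np.arctan2(b_sin, b_cos) / fold

def mean_orientation(orientations, fold=6):
    """Circular mean of orientations that repeat every 2*pi/fold."""
    phases = fold * np.asarray(orientations, dtype=float)
    return np.angle(np.mean(np.exp(1j * phases))) / fold

# Synthetic example: 10 voxels sharing a true orientation of 20 degrees.
rng = np.random.default_rng(1)
true_phi = np.deg2rad(20.0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=200)  # path direction on each trial
voxel_phis = [
    estimate_orientation(theta, np.cos(6 * (theta - true_phi)) + 0.3 * rng.standard_normal(theta.size))
    for _ in range(10)
]
participant_phi_deg = np.rad2deg(mean_orientation(voxel_phis))
print(round(float(participant_phi_deg), 1))
```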

      (9) Hand-drawn bilateral EC mask: What was your procedure for drawing this mask? What results do you get with a standard mask, for example, from Freesurfer or SPM? Why do you perform this analysis bilaterally, given that the earlier analysis identified 6-fold symmetry only in the right EC? What do you mean by "permutation corrected for multiple comparisons"?

      We thank the reviewer for raising these important methodological points. To our knowledge, no standard volumetric atlas provides an anatomically defined entorhinal cortex (EC) mask. For example, the built-in Harvard–Oxford cortical structural atlas in FSL contains only a parahippocampal region that encompasses, but does not isolate, the EC. The AAL atlas likewise does not contain an EC region. In FreeSurfer, an EC label is available, but only in the fsaverage surface space, which is not directly compatible with MNI-based volumetric group-level analyses.

      Therefore, we constructed a bilateral EC mask by manually delineating the EC according to the detailed anatomical landmarks described by Insausti et al. (1998). Masks were created using ITK-SNAP (Version 3.8, www.itksnap.org). For transparency and reproducibility, the mask has been made publicly available at the Science Data Bank (link: https://www.scidb.cn/s/NBriAn), as indicated in the revised Data and Code availability section.

      We used a bilateral EC mask, despite voxel-wise effects being strongest in the right EC, for two reasons. First, we did not have any a priori hypothesis regarding the laterality of EC involvement before performing the analyses. Second, previous studies estimated grid orientation using a bilateral EC mask in their sinusoidal analyses (Doeller et al., 2010; Constantinescu et al., 2016; Bao et al., 2019; Wagner et al., 2023; Raithel et al., 2023). We therefore followed this established approach to estimate grid orientation.

      By “permutation corrected for multiple comparisons” we refer to the family-wise error correction applied to the reconstructed directional tuning curves (Figure 2d for the EC, Figure 3f for the HPC). Specifically, directional labels were randomly shuffled 5,000 times, and an FFT was applied to each shuffled dataset to compute spectral power at each fold. This procedure generated null distributions of spectral power for each symmetry. For each fold, the 95th percentile of the maximal power across permutations was used as the uncorrected threshold. To correct across folds, the 95th percentile of the maximal suprathreshold power across all symmetries was taken as the family-wise error–corrected threshold. We have clarified this procedure in the revised “Statistical Analysis” subsection of the Methods.
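
      A generic max-statistic permutation scheme of this kind can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: it uses a synthetic 36-bin tuning curve, takes per-fold thresholds as the 95th percentile of each fold's null distribution, and takes the family-wise threshold as the 95th percentile of the maximum across folds; the exact thresholding details in the study may differ.

```python
import numpy as np

def fold_power(curve, folds):
    """FFT magnitude of a direction tuning curve at the requested folds."""
    curve = np.asarray(curve, dtype=float)
    spectrum = np.abs(np.fft.rfft(curve - curve.mean())) / curve.size
    return spectrum[list(folds)]

def permutation_thresholds(curve, folds=(3, 4, 5, 6, 7), n_perm=5000, alpha=0.05, seed=0):
    """Shuffle direction labels to build null spectra; return per-fold and FWE thresholds."""
    rng = np.random.default_rng(seed)
    curve = np.asarray(curve, dtype=float)
    null = np.empty((n_perm, len(folds)))
    for i in range(n_perm):
        null[i] = fold_power(rng.permutation(curve), folds)  # shuffled labels
    uncorrected = np.quantile(null, 1.0 - alpha, axis=0)     # one threshold per fold
    fwe = np.quantile(null.max(axis=1), 1.0 - alpha)         # max across folds
    return uncorrected, fwe

# Synthetic tuning curve with a 3-fold component plus noise (36 bins of 10 degrees).
rng = np.random.default_rng(2)
directions = np.deg2rad(np.arange(0, 360, 10))
curve = np.cos(3 * directions) + 0.2 * rng.standard_normal(directions.size)

folds = (3, 4, 5, 6, 7)
observed = fold_power(curve, folds)
uncorrected, fwe = permutation_thresholds(curve, folds)
significant = observed > fwe
print(dict(zip(folds, significant.tolist())))
```

      Taking the maximum across folds within each permutation is what controls the family-wise error rate: a fold is declared significant only if it exceeds what the largest of the tested folds would reach by chance.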

      (10) Figures 3b and 3d: Why do different hippocampal voxels show significance for the sinusoidal versus spectral analysis? Shouldn’t the analyses be redundant and, thus, identify the same significant voxels? 

      We thank the reviewer for this insightful question. Although both sinusoidal modulation and spectral analysis aim to detect periodic neural activity, the two approaches are methodologically distinct and are therefore not expected to identify exactly the same significant voxels.

      Sinusoidal modulation relies on a GLM with sine and cosine regressors to test for phase-aligned periodicity (e.g., 3-fold or 6-fold), calibrated according to the estimated grid orientation. This approach is highly specific but critically depends on accurate orientation estimation. In contrast, spectral analysis applies Fourier decomposition to the directional tuning profile, enabling the detection of periodic components without requiring orientation calibration.

      Accordingly, the two analyses are not redundant but complementary. The FFT approach allows for an unbiased exploration of multiple candidate periodicities (e.g., 3–7-fold) without predefined assumptions, thereby providing a critical cross-validation of the sinusoidal GLM results. This strengthens the evidence for 6-fold periodicity in EC and 3-fold periodicity in HPC. Furthermore, FFT uniquely facilitates the analysis of periodicities in behavioral performance data, which is not feasible with standard sinusoidal GLM approaches. This methodological consistency enables direct comparison of periodicities across neural and behavioral domains.

      Additionally, the anatomical distributions of the HPC clusters appear more similar between Figure 3b and Figure 3d after re-plotting Figure 3d using the peak voxel coordinates (x = –24, y = –18), which are closer to those used for Figure 3b (x = –24, y = –20), as shown in the revised Figure 3.

      Taken together, the two analyses serve distinct but complementary purposes.

      (11) 3-fold sinusoidal analysis in hippocampus: What kind of small volume are you using to correct for multiple comparisons?

      We thank the reviewer for this comment. The same small-volume correction procedure was applied as described in R4. Specifically, the anatomical mask of the bilateral medial temporal lobe (MTL), as defined by the AAL atlas, was used as the small volume for correction. This procedure has been clarified in the revised Statistical Analysis section of the Methods as follows: “… with small-volume correction (SVC) applied within the bilateral MTL.”

      (12) Figure S5: “right HPC” – isn’t the cluster in the left hippocampus? 

      We are sorry for the confusion. The brain image was presented in radiological orientation (i.e., the left and right orientations are flipped). We also checked the figure and confirmed that the cluster shown in the original Figure S5 (i.e., Figure S6 in the revised manuscript) is correctly labeled as the right hippocampus, as indicated by the MNI coordinate (x = 22), where positive x values denote the right hemisphere. To avoid potential confusion, we have explicitly added the statement “Volumetric results are displayed in radiological orientation” to the figure legends of all volume-based results.

      (13) Figure S5: Why are the significant voxels different from the 3-fold symmetry analysis using 10° bins?

      As shown in R10, the apparent differences largely reflect variation in MNI coordinates. After adjusting for display coordinates, the anatomical locations of the significant clusters are in fact highly similar between the 10°-binned (Figure 3d, shown above) and the 20°-binned results (Figure S6).

      Although both analyses rely on sinusoidal modulation, they differ in the resolution of the input angular bins (10° vs. 20°). Combined with the inherent noise in fMRI data, this makes it unlikely that the two approaches would yield exactly the same set of significant voxels. Importantly, both analyses consistently reveal robust 3-fold periodicity in the hippocampus, indicating that the observed effect is not dependent on angular bin size.

      (14) Figure 4a and corresponding text: What is the unit? Phase at which frequency? Are you using a circular-circular correlation to test for the relationship?

      We thank the reviewer for raising this important point. In the revised manuscript, we have clarified that the unit of the phase values is radians, corresponding to the 6-fold periodic component in the EC and the 3-fold periodic component in the HPC. In the original Figure 4a, both EC and HPC phases, estimated from sinusoidal modulation, were analyzed using Pearson correlation. We have since realized issues with this approach, as also noted in R5 to Reviewer #1.

      In the revised analysis and Figure 4a (as shown above), we re-evaluated the relationship between EC and HPC phases using a circular–circular correlation (Jammalamadaka & Sengupta, 2001), implemented in the CircStat MATLAB toolbox. The “Phase synchronization between the HPC and EC activity” subsection of the Results has been updated accordingly as follows:

      “To examine whether the spatial phase structure in one region could predict that in another, we tested whether the orientations of the 6-fold EC and 3-fold HPC periodic activities, estimated from odd-numbered sessions using sinusoidal modulation with rotationally symmetric parameters (in radians), were correlated across participants. A cross-participant circular–circular correlation was conducted between the spatial phases of the two areas to quantify the spatial correspondence of their activity patterns (EC: purple dots; HPC: green dots) (Jammalamadaka & Sengupta, 2001). The analysis revealed a significant circular correlation (Figure 4a; r = 0.42, p < 0.001) …”.

      In the “Statistical analysis” subsection of the Methods:

      “… The relationship between EC and HPC phases was evaluated using the circular–circular correlation (Jammalamadaka & Sengupta, 2001) implemented in the CircStat MATLAB toolbox …”.
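
      For readers unfamiliar with the statistic, the circular–circular correlation coefficient can be sketched as follows. This is a NumPy illustration of the published formula, not the CircStat code itself; the phases are assumed to be already expressed in radians on a full circle, the example data are synthetic, and the permutation p-value is a substitute chosen for simplicity rather than the significance test the toolbox uses.

```python
import numpy as np

def circ_mean(angles):
    """Circular mean direction in radians."""
    return np.angle(np.mean(np.exp(1j * angles)))

def circ_corrcc(alpha, beta):
    """Circular-circular correlation coefficient (Jammalamadaka & SenGupta form)."""
    a = np.sin(alpha - circ_mean(alpha))
    b = np.sin(beta - circ_mean(beta))
    return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))

def permutation_pvalue(alpha, beta, n_perm=1000, seed=0):
    """Two-sided permutation p-value (an assumed substitute for the analytic test)."""
    rng = np.random.default_rng(seed)
    observed = abs(circ_corrcc(alpha, beta))
    null = np.array([abs(circ_corrcc(alpha, rng.permutation(beta))) for _ in range(n_perm)])
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

# Synthetic per-participant phases (hypothetical values, 30 participants).
rng = np.random.default_rng(1)
ec_phase = rng.vonmises(mu=0.0, kappa=1.0, size=30)
hpc_phase = np.angle(np.exp(1j * (ec_phase + 0.3 * rng.standard_normal(30))))

rho = circ_corrcc(ec_phase, hpc_phase)
p_value = permutation_pvalue(ec_phase, hpc_phase)
```

      Unlike Pearson correlation on raw phase values, this coefficient is unaffected by where the circle is cut, because each phase enters only through the sine of its deviation from the circular mean.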

      (15) Paragraph following “We further examined amplitude-phase coupling...” - please clarify what data goes into this analysis.

      We thank the reviewer for this helpful comment. In this analysis, the input data consisted of hippocampal (HPC) phase and entorhinal (EC) amplitude, both extracted using the Hilbert transform from the reconstructed BOLD signals of the EC and HPC derived through sinusoidal modulation. We have substantially revised the description of the amplitude–phase coupling analysis in the third paragraph of the “Phase Synchronization Between HPC and EC Activity” subsection of the Results to clarify this procedure.

      (16) Alignment between EC 6-fold phases and HC 3-fold phases: Why don't you simply test whether the preferred 6-fold orientations in EC are similar to the preferred 3-fold phases in HC? The phase-amplitude coupling analyses seem sophisticated but are complex, so it is somewhat difficult to judge to what extent they are correct. 

      We thank the reviewer for this thoughtful comment. We employed two complementary analyses to examine the relationship between EC and HPC activity. In the revised Figure 4 (as shown in Figure 4 for Reviewer #1), Figure 4a provides a direct and intuitive measure of the phase relationship between the two regions using circular–circular correlation. Figure 4b–c examines whether the activity peaks of the two regions are aligned across path directions using cross-frequency amplitude–phase coupling, given our hypothesis that the spatial phase of the HPC depends on EC projections. These two analyses are complementary: a phase correlation does not necessarily imply peak-to-peak alignment, and conversely, peak alignment does not always yield a statistically significant phase correlation. We therefore combined multiple analytical approaches as a cross-validation across methods, providing convergent evidence for robust EC–HPC coupling.

      (17) Figure 5: Do these results hold when you estimate performance just based on “deviation from the goal to ending locations” (without taking path length into account)? 

      We thank the reviewer for this thoughtful suggestion. Following the reviewer’s advice, we re-estimated behavioral performance using the deviation between the goal and ending locations (i.e., error size) and path length independently. As shown in the new Figure S9, no significant periodicity was observed in error size (p > 0.05), whereas a robust 3-fold periodicity was found for path length (p < 0.05, corrected for multiple comparisons).

      We employed two behavioral metrics, (1) path length and (2) error size, for complementary reasons. In our task, participants navigated using four discrete keys corresponding to the cardinal directions (north, south, east, and west). This design inherently induces a 4-fold bias in path directions, as described in the “Behavioral performance” subsection of the Methods. To minimize this artifact, we computed the objectively optimal path length and used it to calibrate participants’ path lengths. However, error size could not be corrected in the same manner and retained a residual 4-fold tendency (see Figure S9d).

      Given that both path length and error size are behaviorally relevant and capture distinct aspects of task performance, we decided to retain both measures when quantifying behavioral periodicity. This clarification has been incorporated into the “Behavioral performance” subsection of the Methods, and the 2<sup>nd</sup> paragraph of the “3-fold periodicity in human behavior” subsection of the Results.

      (18) Phase locking between behavioral performance and hippocampal activity: What is your way of creating surrogates here?

      We thank the reviewer for this helpful question. Surrogate datasets were generated by circularly shifting the signal series along the direction axis across all possible offsets (following Canolty et al., 2006). This procedure preserves the internal phase structure within each domain while disrupting consistent phase alignment, thereby removing any systematic coupling between the two signals. Each surrogate dataset underwent identical filtering and coherence computation to generate a null distribution, and the observed coherence strength was compared with this distribution using paired t-tests across participants. The statistical analysis section has been systematically revised to incorporate these methodological details.
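
      The surrogate construction described above can be sketched as follows. The coupling metric used here (cosine of the phase difference between the 3-fold Fourier components) is a hypothetical stand-in for the filtering and coherence computation in the manuscript, and the two signals are synthetic.

```python
import numpy as np

def circular_shift_surrogates(series):
    """All non-zero circular shifts of a series sampled along the direction axis."""
    n = len(series)
    return np.stack([np.roll(series, shift) for shift in range(1, n)])

def fold_phase_alignment(x, y, k=3):
    """Hypothetical coupling metric: cos of the phase difference of the k-fold components."""
    phase_x = np.angle(np.fft.rfft(x)[k])
    phase_y = np.angle(np.fft.rfft(y)[k])
    return np.cos(phase_x - phase_y)

# Synthetic signals over 36 direction bins (assumed data, perfectly aligned at 3-fold).
directions = np.deg2rad(np.arange(0, 360, 10))
behavior = np.cos(3 * directions)
hpc = np.cos(3 * directions) + 0.2 * np.sin(5 * directions)

observed = fold_phase_alignment(behavior, hpc)
null = np.array([fold_phase_alignment(behavior, s) for s in circular_shift_surrogates(hpc)])
```

      Each shift keeps the internal structure of the shifted series intact and changes only its alignment with the other signal. In this sketch, shifts equal to a whole period of the 3-fold component reproduce the original alignment, so the null distribution contains a few values as large as the observed one.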

      (19) I could not follow why the authors equate 3-fold symmetry with vectorial representations. This includes statements such as “these empirical findings provide a potential explanation for the formation of vectorial representation observed in the HPC.” Please clarify.

      We thank the reviewer for raising this point. Please refer to our response to R2 for Reviewer #1 and the revised Introduction (paragraphs 2–4), where we explicitly explain why the three primary axes of the hexagonal grid cell code can manifest as vector fields.

      (20) It was unclear whether the sentence “The EC provides a foundation for the formation of periodic representations in the HPC” is based on the authors’ observations or on other findings. If based on the authors’ findings, this statement seems too strong, given that no other studies have reported periodic representations in the hippocampus to date (to the best of my knowledge).

      We thank the reviewer for this comment. We agree that the original wording lacked sufficient rigor. We have extensively revised the 3rd paragraph of the Discussion section with more cautious language by reducing overinterpretation and emphasizing the consistency of our findings with prior empirical evidence, as follows: “The EC–HPC PhaseSync model demonstrates how a vectorial representation may emerge in the HPC from the projections of populations of periodic grid codes in the EC. The model was motivated by two observations. First, the EC intrinsically serves as the major source of hippocampal input (Witter and Amaral, 1991; van Groen et al., 2003; Garcia and Buffalo, 2020), and grid codes exhibit nearly invariant spatial orientations (Hafting et al., 2005; Gardner et al., 2022). Second, mental planning, characterized by “forward replay” (Dragoi and Tonegawa, 2011; Pfeiffer, 2020), has the capacity to activate populations of grid cells that represent sequential experiences in the absence of actual physical movement (Nyberg et al., 2022). We hypothesize that an integrated path code of sequential experiences may eventually be generated in the HPC, providing a vectorial gradient toward the goal location. The path code exhibits regular, vector-like representations when the path direction aligns with the orientations of grid axes, and becomes irregular when they misalign. This explanation is consistent with the band-like representations observed in the dorsomedial EC (Krupic et al., 2012) and the irregular activity fields of trace cells in the HPC (Poulter et al., 2021). ”

    1. eLife Assessment

      TrASPr is an important contribution that leverages transformer models focused on regulatory regions to enhance predictions of tissue-specific splicing events. The revisions strengthen the manuscript by clarifying methodology and expanding analyses across exon and intron sizes, and the evidence supporting TrASPr's predictive performance is compelling. This work will be of interest to researchers in computational genomics and RNA biology, offering an improved model for splicing prediction and a promising approach to RNA sequence design.

    2. Reviewer #1 (Public review):

      Summary

      The authors propose a transformer-based model for prediction of condition- or tissue-specific alternative splicing and demonstrate its utility in design of RNAs with desired splicing outcomes, which is a novel application. The model is compared to relevant existing approaches (Pangolin and SpliceAI) and the authors clearly demonstrate its advantage. Overall, a compelling method that is well thought out and evaluated.

      Strengths:

      (1) The model is well thought out: rather than modeling a cassette exon using a single generic deep learning model as has been done e.g. in SpliceAI and related work, the authors propose a modular architecture that focuses on different regions around a potential exon skipping event, which enables the model to learn representations that are specific to those regions. Because each component in the model focuses on a fixed length short sequence segment, the model can learn position-specific features. Furthermore, the architecture of the model is designed to model alternative splicing events, whereas Pangolin and SpliceAI are focused on modeling individual splice junctions, which is an easier problem.

      (2) The model is evaluated in a rigorous way - it is compared to the most relevant state-of-the-art models, uses machine learning best practices, and an ablation study demonstrates the contribution of each component of the architecture.

      (3) Experimental work supports the computational predictions: Regulatory elements predicted by the model were experimentally verified; novel tissue-specific cassette exons were verified by LSV-seq.

      (4) The authors use their model for sequence design to optimize splicing outcome, which is a novel application.

      Weaknesses:

      None noted.

    3. Reviewer #2 (Public review):

      Summary:

      The authors present a transformer-based model, TrASPr, for the task of tissue-specific splicing prediction (with experiments primarily focused on the case of cassette exon inclusion) as well as an optimization framework (BOS) for the task of designing RNA sequences for desired splicing outcomes.

      For the first task, the main methodological contribution is to train four transformer-based models on the 400bp regions surrounding each splice site, the rationale being that this is where most splicing regulatory information is. In contrast, previous work trained one model on a long genomic region. This new design should help the model capture more easily interactions between splice sites. It should also help in cases of very long introns, which are relatively common in the human genome.

      TrASPr's performance is evaluated in comparison to previous models (SpliceAI, Pangolin, and SpliceTransformer) on numerous tasks including splicing predictions on GTEx tissues, ENCODE cell lines, RBP KD data, and mutagenesis data. The scope of these evaluations is ambitious; however, significant details on most of the analyses are missing, making it difficult to evaluate the strength of evidence.

      In the second task, the authors combine Latent Space Bayesian Optimization (LSBO) with a Transformer-based variational auto encoder to optimize RNA sequences for a given splicing-related objective function. This method (BOS) appears to be a novel application of LSBO, with promising results on several computational evaluations and the potential to be impactful on sequence design for both splicing-related objectives and other tasks. However, comparison of BOS against existing methods for sequence design is lacking.

      Strengths:

      - A novel machine learning model for an important problem in RNA biology with excellent prediction accuracy.

      - Instead of being based on a generic design as in previous work, the proposed model incorporates biological domain knowledge (that regulatory information is concentrated around splice sites). This way of using inductive bias can be important to future work on other sequence-based prediction tasks.

      Weaknesses:

      - Most of the analyses presented in the manuscript are described in broad strokes and are often confusing. As a result, it is difficult to assess the significance of the contribution.

      - As more and more models are being proposed for splicing prediction (SpliceAI, Pangolin, SpliceTransformer, TrASPr), there is a need for establishing standard benchmarks, similar to those in computer vision (ImageNet). Without such benchmarks, it is exceedingly difficult to compare models.
        *This point is now addressed in the revision.*
        *Moreover, datasets have been made available by the authors on BitBucket.*

      - Related to the previous point, as discussed in the manuscript, SpliceAI and Pangolin are not designed to predict PSI of cassette exons. Instead, they assign a "splice site probability" to each nucleotide. Converting this to a PSI prediction is not obvious, and the method chosen by the authors (averaging the two probabilities (?)) is likely not optimal. It would be interesting to see what happens if an MLP is used on top of the four predictions (or the outputs of the top layers) from SpliceAI/Pangolin. This could also indicate where the improvement in TrASPr comes from: is it because TrASPr combines information from all four splice sites? Also consider fine-tuning Pangolin on cassette exons only (as you do for your model).
        *This point is still not addressed in the revision.*

      - L141, "TrASPr can handle cassette exons spanning a wide range of window sizes from 181 to 329,227 bases-thanks to its multi-transformer architecture." This is reported to be one of the primary advantages compared to existing models. Additional analysis should be included on how TrASPr performs across varying exon and intron sizes, with comparison to SpliceAI, etc.

      Added after revision: The authors have added additional analyses of performance based on both the length of the exon under consideration and the total length of the surrounding intronic contexts. The result that TrASPr performs well across various context sizes (i.e., the length of the sequence between the upstream and downstream exons, ranging from <1k to >10k) is highly encouraging and supports the claim that most of the sequence-based splicing logic is located proximal to the splice sites. It is also noteworthy that TrASPr performs well for exons longer than 200, suggesting that most of the "regulatory code" is present at the exon boundaries rather than in its center (which TrASPr is blind to).
      Additionally, Pearson correlation is used as the sole performance metric in many analyses (e.g., Fig 2 - Supp 2). The authors should consider alternative accuracy metrics, such as RMSE, which better convey the magnitude of prediction error and are more easily comparable across datasets. Pearson correlation may also be more sensitive to outliers on the smaller samples that arise when binning sequences.

      - L171, "training it on cassette exons". This seems like an important point: previous models were trained mostly on constitutive exons, whereas here the model is trained specifically on cassette exons. This should be discussed in more detail.
        *Our initial comment was incorrect, as pointed out by the authors.*

      - L214, ablations of individual features are missing.
        *This was addressed in the revision.*

      - L230, "ENCODE cell lines", it is not clear why other tissues from GTEx were not included.
        *This was addressed in the revision.*

      - L239, it is surprising that SpliceAI performs so badly, and might suggest a mistake in the analysis. Additional analysis and possible explanations should be provided to support these claims. Similarly for the complete failure of SpliceAI and Pangolin shown in Fig 4d.
        *The authors should consider adding SpliceAI/Pangolin predictions for the alternative 5' and 3' splice site selection tasks (and code for related analyses) to the BitBucket repository.*

      - BOS seems like a separate contribution that belongs in a separate publication. Instead, consider providing more details on TrASPr.

      *Minor comment added after revision: regarding the author response that "A completely independent evaluation would have required a high-throughput experimental system to assess designs, which is beyond the scope of the current paper.":*
      *It's not clear why BOS cannot be evaluated as a separate contribution by using different "teacher" models instead of TrASPr. Additionally, BOS lacks evaluation against existing methods for sequence optimization.*

      - The authors should consider evaluating BOS using Pangolin or SpliceTransformer as the oracle, in order to measure the contribution to the sequence generation task provided by BOS vs TrASPr.
        *See comment above.*

    4. Author response:

      The following is the authors’ response to the original reviews

      A point-by-point response is included below. Before we turn to that, we want to note one change that we decided to introduce, related to generalization on unseen tissues/cell types (Figure 3a in the original submission and the related question by Reviewer #2 below). This analysis was based on adding a latent “RBP state” representation during learning of condition/tissue-specific splicing. The “RBP state” per condition is captured by a dedicated encoder. Our original plan was to have a paper describing a new RBP-AE model we developed in parallel, which also served as the base to capture this “RBP state”. However, we were delayed in finalizing this second paper (it was led by other lab members, some of whom have already left the lab). This delay affected the TrASPr manuscript, as TrASPr’s code should be available and its analysis reproducible upon publication. After much deliberation, we decided that, in order to comply with reproducibility standards while not self-scooping the RBP-AE paper, we would take out the RBP-AE and replace it with a vanilla PCA-based embedding for the “RBP state”. The PCA approach is simpler and reproducible, based on a linear transformation of the RBP expression vector into a lower dimension. The qualitative results included in Figure 3a still hold, and we also produced the new results suggested by Reviewer #2 in other GTEx tissues with this PCA-based embedding (below).

      We don’t believe the switch to PCA based embedding should have any bearing on the current manuscript evaluation but wanted to take this opportunity to explain the reasoning behind this additional change.
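
      As a rough illustration of what a PCA-based “RBP state” could look like, the sketch below fits a linear projection on a conditions-by-RBPs expression matrix and applies it to an unseen condition. The matrix sizes, the number of components, and the function names are assumptions made for the example; they are not taken from the TrASPr code.

```python
import numpy as np

def fit_pca(expression, n_components):
    """Fit PCA on a (conditions x RBPs) matrix; returns the mean and top components."""
    mean = expression.mean(axis=0)
    _, _, vt = np.linalg.svd(expression - mean, full_matrices=False)
    return mean, vt[:n_components]

def embed(expression, mean, components):
    """Project expression vectors onto the fitted components (a linear map)."""
    return (expression - mean) @ components.T

# Assumed toy data: 20 training tissues and 100 RBPs, plus one unseen tissue.
rng = np.random.default_rng(0)
train_expression = rng.normal(size=(20, 100))
unseen_expression = rng.normal(size=(1, 100))

mean, components = fit_pca(train_expression, n_components=8)
train_state = embed(train_expression, mean, components)
unseen_state = embed(unseen_expression, mean, components)
```

      Because the projection is fixed after fitting, a new tissue only needs its RBP expression vector to obtain an embedding, which is what makes this kind of representation usable for unseen conditions.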

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors propose a transformer-based model for the prediction of condition- or tissue-specific alternative splicing and demonstrate its utility in the design of RNAs with desired splicing outcomes, which is a novel application. The model is compared to relevant existing approaches (Pangolin and SpliceAI) and the authors clearly demonstrate its advantage. Overall, a compelling method that is well thought out and evaluated.

      Strengths:

      (1) The model is well thought out: rather than modeling a cassette exon using a single generic deep learning model as has been done e.g. in SpliceAI and related work, the authors propose a modular architecture that focuses on different regions around a potential exon skipping event, which enables the model to learn representations that are specific to those regions. Because each component in the model focuses on a fixed length short sequence segment, the model can learn position-specific features. Another difference compared to Pangolin and SpliceAI which are focused on modeling individual splice junctions is the focus on modeling a complete alternative splicing event.

      (2) The model is evaluated in a rigorous way - it is compared to the most relevant state-of-the-art models, uses machine learning best practices, and an ablation study demonstrates the contribution of each component of the architecture.

      (3) Experimental work supports the computational predictions.     

      (4) The authors use their model for sequence design to optimize splicing outcomes, which is a novel application.

      We wholeheartedly thank Reviewer #1 for these positive comments regarding the modeling approach we took to this task and the evaluations we performed. We have put a lot of work and thought into this and it is gratifying to see the results of that work acknowledged like this.

      Weaknesses:

      No weaknesses were identified by this reviewer, but I have the following comments:

      (1) I would be curious to see evidence that the model is learning position-specific representations.

      This is an excellent suggestion to further assess what the model is learning. To get a better sense of the position-specific representation we performed the following analyses:

      (1) Switching the transformers’ relative order: All transformers are pretrained on 3’ and 5’ splice site regions before fine-tuning for the PSI and dPSI prediction task. We hypothesized that if relative position is important, switching the order of the transformers would make a large difference in prediction accuracy. Indeed, if we switch the 3’ and 5’ we see, as expected, a severe drop in performance, with Pearson correlation on test data dropping from 0.82 to 0.11. Next, we switched the two 5’ and 3’ transformers, observing a drop to 0.65 and 0.78 respectively. When focusing only on changing events the drop was from 0.66 to 0.54 (for 3’ SS transformers), 0.48 (for 5’ SS transformers), and 0.13 (when the 3’ and 5’ transformers flanking the alternative exon were switched).

      (2) Position specific effect of RBPs: We wanted to test whether the model is able to learn position-specific effects for RBPs. For this we focused on two RBPs, FOX (a family of three highly related RBPs) and QKI, both of which have a relatively well-defined motif and known condition- and position-specific effects identified via RBP KD experiments combined with CLIP experiments (e.g. PMID: 23525800, PMID: 24637117, PMID: 32728246). For each, we randomly selected 40 highly and 40 lowly included cassette exon sequences. We then ran in-silico mutagenesis experiments where we replaced small windows of sequence with the RBP motifs (80 for RBFOX and 80 for QKI), then compared TrASPr’s predictions to the average prediction for 5 random sequences inserted in the same location. The results of this are now shown in Figure 4 Supp 3, where the y-axis represents the dPSI effect per position (x-axis), and the color represents the percentile of observed effects over inserting motifs in that position across all 80 sequences tested. We see that both RBPs have strong positional preferences for exerting a strong effect on the alternative exon. We also see differences between binding upstream and downstream of the alternative exon. These results, learned by the model from natural tissue-specific variations, recapitulate nicely the results derived from high-throughput experimental assays. However, we also note that effects were highly sequence specific. For example, RBFOX is generally expected to increase inclusion when binding downstream of the alternative exon and decrease inclusion when binding upstream. While we do observe such a trend, we also see cases where the opposite effects are observed. These sequence-specific effects have been reported in the literature but may also represent cases where the model errs in the effect’s direction. We discuss these new results in the revised text.

      (3) Assessing BOS sequence edits to achieve tissue-specific splicing: Here we decided to test whether BOS edits in intronic regions (at least 8b away from the nearest splice site) are important for the tissue-specific effect. The results are now included in Figure 6 Supp 1, clearly demonstrating that most of the neuronal specific changes achieved by BOS were based on changing the introns, with a strong effect observed for both up and downstream intron edits.
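
      The motif-insertion experiment in point (2) above can be sketched as a position scan. The predictor below is a toy stand-in and explicitly not TrASPr; `predict_psi`, the poly-A background sequence, and the scoring rule are hypothetical, and only the overall procedure (motif insertion compared against the mean of 5 random insertions at the same position) follows the description.

```python
import random

def replace_window(seq, pos, fragment):
    """Overwrite the window starting at pos with fragment, keeping sequence length."""
    return seq[:pos] + fragment + seq[pos + len(fragment):]

def positional_motif_effect(seq, motif, predict_psi, n_random=5, seed=0):
    """Per-position effect: PSI with motif minus mean PSI over random same-size inserts."""
    rng = random.Random(seed)
    effects = []
    for pos in range(len(seq) - len(motif) + 1):
        psi_motif = predict_psi(replace_window(seq, pos, motif))
        psi_controls = []
        for _ in range(n_random):
            random_fragment = "".join(rng.choice("ACGU") for _ in motif)
            psi_controls.append(predict_psi(replace_window(seq, pos, random_fragment)))
        effects.append(psi_motif - sum(psi_controls) / n_random)
    return effects

def toy_predict_psi(seq):
    """Toy stand-in for a splicing model: PSI rises with the number of GCAUG sites."""
    return min(1.0, 0.2 + 0.3 * seq.count("GCAUG"))

background = "A" * 30                      # hypothetical background sequence
effects = positional_motif_effect(background, "GCAUG", toy_predict_psi)
```

      With a real model, plotting `effects` against position for many exons would give the kind of positional profile summarized in Figure 4 Supp 3; the toy predictor here is position-blind, so it only demonstrates the bookkeeping.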

      (2) The transformer encoders in TrASPr model sequences with a rather limited sequence size of 200 bp; therefore, for long introns, the model will not have good coverage of the intronic sequence. This is not expected to be an issue for exons.

      The reviewer is raising a good question here. On one hand, one may hypothesize that, as the reviewer seems to suggest, TrASPr may not do well on long introns as it lacks the full intronic sequence.

      Conversely, one may also hypothesize that for long introns, where the flanking exons are outside the window of SpliceAI/Pangolin, TrASPr may have an advantage.

      Given this good question and a related one by Reviewer #2, we divided prediction accuracy by intron length and the alternative exon length.

      For short exons (<100bp) we find TrASPr and Pangolin perform similarly, but for longer exons, especially those > 200, TrASPr results are better. When dividing samples by the total length of the upstream and downstream intron, we find TrASPr outperforms all other models for introns of combined length up to 6K, but Pangolin gets better results when the combined intron length is over 10K. This latter result is interesting as it means that, contrary to the second hypothesis laid out above, Pangolin’s performance did not degrade for events where the flanking exons were outside its field of view. We note that all of the above holds whether we assess all events or just cases of tissue-specific changes. It is interesting to think about the mechanistic causes for this. For example, it is possible that cassette exons involving very long introns evoke a different splicing mechanism where the flanking exons are not as critical and/or there is more signal in the introns which is missed by TrASPr. We include these new results now as Figure 2 - Supp 1,2 and discuss these in the main text.
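
      A length-stratified evaluation of this kind can be sketched as below. The bin edges (6K and 10K combined intron length) echo the ranges mentioned above, but the data are synthetic and the function name is hypothetical; RMSE is reported next to Pearson correlation since the reviewers also asked for an error-magnitude metric.

```python
import numpy as np

def accuracy_by_length_bin(lengths, observed, predicted, edges, min_events=3):
    """Pearson r and RMSE of PSI predictions within each length bin."""
    bin_index = np.digitize(lengths, edges)  # 0: below first edge, len(edges): above last
    report = {}
    for b in np.unique(bin_index):
        mask = bin_index == b
        if mask.sum() < min_events:
            continue  # too few events for a meaningful correlation
        r = np.corrcoef(observed[mask], predicted[mask])[0, 1]
        rmse = np.sqrt(np.mean((observed[mask] - predicted[mask]) ** 2))
        report[int(b)] = {"n": int(mask.sum()), "pearson": float(r), "rmse": float(rmse)}
    return report

# Synthetic events (assumed data): combined intron length, observed PSI, predicted PSI.
rng = np.random.default_rng(0)
intron_length = rng.integers(200, 30000, size=300)
observed_psi = rng.uniform(0, 1, size=300)
predicted_psi = np.clip(observed_psi + 0.05 * rng.standard_normal(300), 0, 1)

report = accuracy_by_length_bin(intron_length, observed_psi, predicted_psi, edges=[6000, 10000])
```

      Reporting the event count per bin alongside both metrics makes it easier to judge whether a difference between models in a sparse bin is meaningful.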

      (3) In the context of sequence design, creating a desired tissue- or condition-specific effect would likely require disrupting or creating motifs for splicing regulatory proteins. In your experiments for neuronal-specific Daam1 exon 16, have you seen evidence for that? Most of the edits are close to splice junctions, but a few are further away.

      That is another good question. Regarding Daam1 exon 16, in the original paper describing the mutation locations some motif similarities were noted to PTB (CU) and CUG/Mbnl-like elements (Barash et al Nature 2010). In order to explore this question beyond this specific case we assessed the importance of intronic edits by BOS to achieve a tissue specific splicing profile - see above.

      (4) For sequence design of a tissue- or condition-specific effect in neuronal-specific Daam1 exon 16, the upstream exonic splice junction had the most sequence edits. Is that a general observation? How about the relative importance of the four transformer regions in TrASPr prediction performance?

      This is another excellent question. Please see new experiments described above for RBP positional effect and BOS edits in intronic regions which attempt to give at least partial answers to these questions. We believe a much more systematic analysis can be done to explore these questions but such evaluation is beyond the scope of this work.

      (5) The idea of lightweight transformer models is compelling, and is widely applicable. It has been used elsewhere. One paper that came to mind in the protein realm:

      Singh, Rohit, et al. "Learning the language of antibody hypervariability." Proceedings of the National Academy of Sciences 122.1 (2025): e2418918121.

      We definitely do not make any claim that this approach of using lighter, dedicated models instead of a large ‘foundation’ model has not been taken before. We believe Singh et al., mentioned above, represents a somewhat different approach, where their model (AbMAP) fine-tunes large general protein foundational models (PLM) for antibody-sequence inputs by supervising on antibody structure and binding specificity examples. We added a description of this modeling approach citing the above work and another one which specifically handles RNA splicing (intron retention, PMID: 39792954).

      Reviewer #2 (Public review):

      Summary:

      The authors present a transformer-based model, TrASPr, for the task of tissue-specific splicing prediction (with experiments primarily focused on the case of cassette exon inclusion) as well as an optimization framework (BOS) for the task of designing RNA sequences for desired splicing outcomes.

      For the first task, the main methodological contribution is to train four transformer-based models on the 400bp regions surrounding each splice site, the rationale being that this is where most splicing regulatory information is. In contrast, previous work trained one model on a long genomic region. This new design should help the model capture more easily interactions between splice sites. It should also help in cases of very long introns, which are relatively common in the human genome.

      TrASPr's performance is evaluated in comparison to previous models (SpliceAI, Pangolin, and SpliceTransformer) on numerous tasks including splicing predictions on GTEx tissues, ENCODE cell lines, RBP KD data, and mutagenesis data. The scope of these evaluations is ambitious; however, significant details on most of the analyses are missing, making it difficult to evaluate the strength of the evidence. Additionally, state-of-the-art models (SpliceAI and Pangolin) are reported to perform extremely poorly in some tasks, which is surprising in light of previous reports of their overall good prediction accuracy; the reasoning for this lack of performance compared to TrASPr is not explored.

      In the second task, the authors combine Latent Space Bayesian Optimization (LSBO) with a Transformer-based variational autoencoder to optimize RNA sequences for a given splicing-related objective function. This method (BOS) appears to be a novel application of LSBO, with promising results on several computational evaluations and the potential to be impactful on sequence design for both splicing-related objectives and other tasks.

      We thank Reviewer #2 for this detailed summary and positive view of our work. It seems the main issue raised in this summary regards the evaluations: the reviewer finds details of the evaluations missing and finds it surprising that SpliceAI and Pangolin perform poorly on some of the tasks. We made a concerted effort to include the required details, including code and data tables. In short, some of the concerns were addressed by adding evaluations, some by clarifying missing details, and some by better explaining where Pangolin and SpliceAI may excel vs. settings where they may not do as well. More details are given below.

      Strengths:

      (1) A novel machine learning model for an important problem in RNA biology with excellent prediction accuracy.

      (2) Instead of being based on a generic design as in previous work, the proposed model incorporates biological domain knowledge (that regulatory information is concentrated around splice sites). This way of using inductive bias can be important to future work on other sequence-based prediction tasks.

      Weaknesses:

      (1) Most of the analyses presented in the manuscript are described in broad strokes and are often confusing. As a result, it is difficult to assess the significance of the contribution.

      We made an effort to make the tasks specific and detailed, including making their code and data available. We believe this improved clarity in the revised version.

      (2) As more and more models are being proposed for splicing prediction (SpliceAI, Pangolin, SpliceTransformer, TrASPr), there is a need for establishing standard benchmarks, similar to those in computer vision (ImageNet). Without such benchmarks, it is exceedingly difficult to compare models. For instance, Pangolin was apparently trained on a different dataset (Cardoso-Moreira et al. 2019), and using a different processing pipeline (based on SpliSER) than the ones used in this submission. As a result, the inferior performance of Pangolin reported here could potentially be due to subtle distribution shifts. The authors should add a discussion of the differences in the training set, and whether they affect your comparisons (e.g., in Figure 2). They should also consider adding a table summarizing the various datasets used in their previous work for training and testing. Publishing their training and testing datasets in an easy-to-use format would be a fantastic contribution to the community, establishing a common benchmark to be used by others.

      There are several good points to unpack here. Starting from the last one, we very much agree that a standard benchmark will be useful to include. For tissue specific splicing quantification we used the GTEx dataset from which we select six representative human tissues (heart, cerebellum, lung, liver, spleen, and EBV-transformed lymphocytes). In total, we collected 38394 cassette exon events quantified across 15 samples (here a ‘sample’ is a cassette exon quantified in two tissues) from the GTEx dataset with high-confidence quantification for their PSIs based on MAJIQ. A detailed description of how this data was derived is now included in the Methods section, and the data itself is made available via the bitbucket repository with the code.

      Next, regarding the usage of different data and distribution shifts for Pangolin: The reviewer is right to note there are many differences between how Pangolin and TrASPr were trained. This makes it hard to determine whether the improvements we saw are not just a result of different training data/labels. To address this issue, we first tried to fine-tune the pre-trained Pangolin with MAJIQ’s PSI dataset: we used the subset of the GTEx dataset described above, focusing on the three tissues analyzed in Pangolin’s paper (heart, cerebellum, and liver) for a fair comparison. In total, we obtained 17,218 events, and we followed the same training and test split as reported in the Pangolin paper. We obtained Pearson 0.78 and Spearman 0.68, values similar to what we got without this extra fine-tuning. Next, we retrained Pangolin from scratch, with the full set of tissues and the training set used for TrASPr, which was derived from MAJIQ’s quantifications. Since our model was trained only on human data with 6 tissues at the same time, we modified Pangolin from its original 4 splice site usage outputs to 6 PSI outputs. We tried taking the sequence centered on either the first or the second splice site of the middle exon. This test resulted in low performance (Pearson 0.21 for the 3’ SS and 0.26 for the 5’ SS).

      The above tests are obviously not exhaustive but their results suggest that the differences we observe are unlikely to be driven by distribution shifts. Notably, the original Pangolin was trained on much more data (four species, four tissues each, and sliding windows across the entire genome). This training seems to be important for performance while the fact we switched from Pangolin’s splice site usage to MAJIQ’s PSI was not a major contributor. Other potential reasons for the improvements we observed include the architecture, target function, and side information (see below) but a complete delineation of those is beyond the scope of this work. 

      (3) Related to the previous point, as discussed in the manuscript, SpliceAI, and Pangolin are not designed to predict PSI of cassette exons. Instead, they assign a "splice site probability" to each nucleotide. Converting this to a PSI prediction is not obvious, and the method chosen by the authors (averaging the two probabilities (?)) is likely not optimal. It would be interesting to see what happens if an MLP is used on top of the four predictions (or the outputs of the top layers) from SpliceAI/Pangolin. This could also indicate where the improvement in TrASPr comes from: is it because TrASPr combines information from all four splice sites? Also, consider fine-tuning Pangolin on cassette exons only (as you do for your model).

      Please see the above response. We did not investigate more sophisticated models that adjust Pangolin’s architecture further as such modifications constitute new models which are beyond the scope of this work.

      (4) L141, "TrASPr can handle cassette exons spanning a wide range of window sizes from 181 to 329,227 bases - thanks to its multi-transformer architecture." This is reported to be one of the primary advantages compared to existing models. Additional analysis should be included on how TrASPr performs across varying exon and intron sizes, with comparison to SpliceAI, etc.

      This was a good suggestion, related to another comment made by Reviewer #1. Please see above our response to them with a breakdown by exon/intron length.

      (5) L171, "training it on cassette exons". This seems like an important point: previous models were trained mostly on constitutive exons, whereas here the model is trained specifically on cassette exons. This should be discussed in more detail.

      Previous models were not trained exclusively on constitutive exons and Pangolin specifically was trained with their version of junction usage across tissues. That said, the reviewer’s point is valid (and similar to ones made above) about a need to have a matched training/testing and potential distribution shifts. Please see response and evaluations described above. 

      (6) L214, ablations of individual features are missing.

      These were now added to the table which we moved to the main text (see table also below).

      (7) L230, "ENCODE cell lines", it is not clear why other tissues from GTEx were not included.

      Good question. The task here was to assess predictions in unseen conditions, hence we opted to test on completely different data of human cell lines rather than additional tissue samples. Following the reviewer’s suggestion we also evaluated predictions on two additional GTEx tissues, Cortex and Adrenal Gland. These new results, as well as the previous ones for ENCODE, were updated to use the PCA-based embedding of “RBP-State” as described above. We also compared the predictions using the PCA-based embedding of the “RBP-State” to training directly on data (not the test data of course) from these tissues. See updated Figure 3a,b and Figure 3 Supp 1,2.

      (8) L239, it is surprising that SpliceAI performs so badly, and might suggest a mistake in the analysis. Additional analysis and possible explanations should be provided to support these claims. Similarly, the complete failure of SpliceAI and Pangolin is shown in Figure 4d.

      Line 239 refers to predicting relative inclusion levels between competing 3’ and 5’ splice sites. We admit we too expected this to be better for SpliceAI and Pangolin but we were not able to find bugs in our analysis (which is all made available for readers and reviewers alike). Regarding this expectation to perform better, first we note that we are not aware of a similar assessment being done for either of those algorithms (i.e. relative inclusion for 3’ and 5’ alternative splice site events). Instead, our initial expectation, and likely the reviewer’s as well, was based on their detection of splice site strengthening/weakening due to mutations, including cryptic splice site activation. More generally though, it is worth noting in this context that given how SpliceAI, Pangolin and other algorithms have been presented in papers/media/scientific discussions, we believe there is a potential misperception regarding tasks that SpliceAI and Pangolin excel at vs other tasks where they should not necessarily be expected to excel. Both algorithms focus on cryptic splice site creation/disruption. This has been the focus of those papers and subsequent applications.  While Pangolin added tissue specificity to SpliceAI training, the authors themselves admit “...predicting differential splicing across tissues from sequence alone is possible but remains a considerable challenge and requires further investigation”. The actual performance on this task is not included in Pangolin’s main text, but we refer Reviewer #2 to supplementary figure S4 in the Pangolin manuscript to get a sense of Pangolin’s reported performance on this task. Similar to that, Figure 4d in our manuscript is for predicting ‘tissue specific’ regulators. We do not think it is surprising that SpliceAI (tissue agnostic) and Pangolin (slight improvement compared to SpliceAI in tissue specific predictions) do not perform well on this task. Similarly, we do not find the results in Figure 4C surprising either. 
      These are for mutations that slightly alter the inclusion level of an exon, not something SpliceAI was trained on: SpliceAI was trained on genomic splice sites with yes/no labels across the genome. As noted elsewhere in our response, re-training Pangolin on this mutagenesis dataset results in performance much closer to that of TrASPr. That is to be expected as well: Pangolin is constructed to capture changes in PSI (or splice site usage as defined by the authors), those changes are not even tissue specific for the CD19 data, and the model has no problem/lack of capacity to generalize from the training set, just like TrASPr does. In fact, if you only use combinations of known mutations seen during training, a simple regression model gives a correlation of ~92-95% (Cortés-López et al 2022). In summary, we believe that a better understanding of what one can realistically expect from models such as SpliceAI, Pangolin, and TrASPr will go a long way toward having them better understood and used effectively. We have tried to make this clearer in the revision.

      (9) BOS seems like a separate contribution that belongs in a separate publication. Instead, consider providing more details on TrASPr.

      We thank the reviewer for the suggestion. We agree those are two distinct contributions/algorithms and we indeed considered having them as two separate papers. However, there is strong coupling between the design algorithm (BOS) and the predictor that enables it (TrASPr). This coupling is both conceptual (TrASPr as a “teacher”) and practical in terms of evaluations. While we use experimental data (experiments done involving Daam1 exon 16, CD19 exon 2) we still rely heavily on evaluations by TrASPr itself. A completely independent evaluation would have required a high-throughput experimental system to assess designs, which is beyond the scope of the current paper. For those reasons we eventually decided to make it into what we hope is a more compelling combined story about generative models for prediction and design of RNA splicing.

      (10) The authors should consider evaluating BOS using Pangolin or SpliceTransformer as the oracle, in order to measure the contribution to the sequence generation task provided by BOS vs TrASPr.

      We can definitely see the logic behind trying BOS with different predictors. That said, as we note above most of BOS evaluations are based on the “teacher”. As such, it is unclear what value replacing the teacher would bring. We also note that given this limitation we focus mostly on evaluations in comparison to existing approaches (genetic algorithm or random mutations as a strawman). 

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors):

      Additional comments:

      (1) Is your model picking up transcription factor binding sites in addition to RBPs? TFs have been recently shown to have a role in splicing regulation:

      Daoud, Ahmed, and Asa Ben-Hur. "The role of chromatin state in intron retention: A case study in leveraging large scale deep learning models." PLOS Computational Biology 21.1 (2025): e1012755.

      We agree this is an interesting point to explore, especially given the series of works from the Ben-Hur’s group. We note though that these works focus on intron retention (IR) which we haven’t focused on here, and we only cover short intronic regions flanking the exons. We leave this as a future direction as we believe the scope of this paper is already quite extensive.

      (2) SpliceNouveau is a recently published algorithm for the splicing design problem:

      Wilkins, Oscar G., et al. "Creation of de novo cryptic splicing for ALS and FTD precision medicine." Science 386.6717 (2024): 61-69.

      Thank you for pointing out the recent publication by Wilkins et al.; we now refer to it as well.

      (3) Please discuss the relationship between your model and this deep learning model. You will also need to change the following sentence: "Since the splicing sequence design task is novel, there are no prior implementations to reference."

      We revised this statement and now refer to several recent publications that propose similar design tasks.  

      (4) I would suggest adding a histogram of PSI values - they appear to be mostly close to 1 or 0.

      PSI values are indeed typically close to either 0 or 1. This is a known phenomenon illustrated in previous studies of splicing (e.g. Shen et al NAR 2012 ). We are not sure what is meant by the comment to add a histogram but we made sure to point this out in the main text: 

      “...Still, those statistics are dominated by extreme values, such that 33.2% are smaller than 0.15 and 56.0% are higher than 0.85. Furthermore, most cassette exons do not change between a given tissue pair (only 14.0% of the samples in the dataset, i.e. a cassette exon measured across two tissues, exhibit |ΔΨ| ≥ 0.15).”

      (5) Part of the improvement of TrASPr over Pangolin could be the result of a more extensive dataset.

      Please see above responses and new analysis.

      (6) In the discussion of the roles of alternative splicing, protein diversity is mentioned, but I suggest you also mention the importance of alternative splicing as a regulatory mechanism:

      Lewis, Benjamin P., Richard E. Green, and Steven E. Brenner. "Evidence for the widespread coupling of alternative splicing and nonsense-mediated mRNA decay in humans." Proceedings of the National Academy of Sciences 100.1 (2003): 189-192.

      Thank you for the suggestion. We added that point and citation. 

      (7) Line 96: You use dPSI without defining it (although quite clear that it should be Delta PSI).

      Fixed.

      (8) Pretrained transformers: Have you trained separate transformers on acceptor and donor sites, or a single splice junction transformer?

      Single splice junction pre-training.

      (9) "TrASPr measures the probability that the splice site in the center of Se is included in some tissue" - that's not my understanding of what TrASPr is designed to do.

      We revised the above sentence to make it more precise: “Given a genomic sequence context S<sub>e</sub> = (s<sub>e</sub>,...,s<sub>e</sub>), made of  a cassette exon e and flanking intronic/exonic regions, TrASPr predicts for tissue c the fraction of transcripts where exon e is included or skipped over, ΔΨ-<sub>e,c,c’</sub>.”

      (10) Please include the version of the human genome annotations that you used. 

      We used GENCODE v40 annotations for the human genome (hg38); this is now included in the Data section.

      (11) I did not see a description of the RBP-AE component in the methods section. A bit more detail on the model would be useful as well.

      Please see above details about replacing RBP-AE with a simpler linear PCA “RBP-State” encoding. We added details about how the PCA was performed to the Methods section.

      (12) Typos, grammar:

      -   Fix the following sentence: ATP13A2, a lysosomal transmembrane cation transporter, linked to an early-onset form of Parkinson's Disease (PD) when 306 loss-of-function mutations disrupt its function.

      Sentence was fixed to now read: “The first example is of a brain cerebellum-specific cassette exon skipping event predicted by TrASPr in the ATP13A2 gene (aka PARK9). ATP13A2 is a lysosomal transmembrane cation transporter, for which loss of function mutation has been linked to early-onset of Parkinson’s Disease (PD)”.

      -   Line 501: "was set to 4e−4"(the - is a superscript). 

      Fixed

      -   A couple of citations are missing in lines 580 and 581.

      Thank you for catching this error. Citations in line 580, 581 were fixed.

      (13) Paper title: Generative modeling for RNA splicing predictions and design - it would read better as "Generative modeling for RNA splicing prediction and design", as you are solving the problems of splicing prediction and splicing design.  

      Thank you for the suggestion. We updated the title and removed the plural form.

      Reviewer #2 (Recommendations for the authors):

      (1) Appendices are not very common in biology journals. It is also not clear what purpose the appendix serves exactly - it seems to repeat some of the things said earlier. Consider merging it into the methods or the main text. 

      We merged the appendices into the Methods section and removed redundancy.

      (2) L112, "For instance, the model could be tasked with designing a new version of the cassette exon, restricted to no more than N edit locations and M total base changes." How are N and M different? Is there a difference between an edit location and a base change? 

      Yes, N is the number of locations (one can think of each as a start position) of edits of various lengths (e.g. a SNP is of length 1) and the total number of positions edited is M. The text now reads “For instance, the model could be tasked with designing a new version of the cassette exon, restricted to no more than $N$ edit locations (i.e. start position of one or more consecutive bases) and $M$ total base changes.”
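
      The distinction between edit locations and base changes can be illustrated with a minimal sketch. It is not taken from the BOS code: the function name `count_edits` is hypothetical, and the sketch assumes substitutions only (equal-length sequences), with each run of consecutive changed bases counted as one edit location.

```python
def count_edits(original: str, designed: str) -> tuple[int, int]:
    """Return (N, M): edit locations and total base changes.

    Assumption for this sketch: only substitutions are considered,
    so both sequences must have the same length.
    """
    if len(original) != len(designed):
        raise ValueError("this sketch only handles equal-length sequences")

    locations = 0  # N: runs of one or more consecutive changed bases
    changes = 0    # M: individual changed bases
    in_run = False
    for a, b in zip(original, designed):
        if a != b:
            changes += 1
            if not in_run:
                locations += 1  # a new run starts at this position
            in_run = True
        else:
            in_run = False
    return locations, changes


# Two edit locations (positions 1-2 and position 7), three changed bases.
n_locations, m_changes = count_edits("ACGTACGT", "ATTTACGA")
```

      A single SNP therefore counts as one location and one base change, while a three-base substitution counts as one location and three base changes.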

      (3) L122: "DEN was developed for a distinct problem". What prevents one from adapting DEN to your sequence design task? The method should be generic. I do not see what "differs substantially" means here. (Finally, wasn't DEN developed for the task you later refer to as "alternative splice site" (as opposed to "splice site selection")? Use consistent terminology. And in L236 you use "splice site variation" - is that also the same?).

      Indeed, our original description was not clear/precise enough. DEN was designed and trained for two tasks: APA, and 5’ alternative splice site usage. The terms “selection”, “usage”, and “variation” were indeed used interchangeably in different locations and the reviewer was right to note the lack of precision. We have now revised the text to make sure the term “relative usage” is used.

      Nonetheless, we hold that DEN was indeed defined for different tasks. See Figures 2A and 6A of Linder et al 2020 (the reference was also incorrect as we cited the preprint and not the final paper):

      In both cases DEN is trying to optimize a short region for selecting an alternative PA site (left) or a 5’ splice site (right). This work focused on an MPRA dataset of short synthetic sequences inserted in the designated region for train/test. We hold this is indeed a different type of data and task than the one we focus on here. Yes, one can potentially adapt DEN for our task, but this is beyond the scope of this paper. Finally, we note that a more closely related algorithm recently proposed is Ledidi (Schreiber et al 2025), which was posted as a pre-print. Similar to BOS, Ledidi tries to optimize a given sequence and adapt it with a few edits for a given task. Regardless, we updated the main text to make the differences between DEN and the task we defined here for BOS clearer, and we also added a reference to Ledidi and other recent works in the discussion section.

      (4) L203, exons with DeltaPSI very close to 0.15 are going to be nearly impossible to classify (or even impossible, considering that the DeltaPSI measurements are not perfect). Consider removing such exons to make the task more feasible.

      Yes, this is how it was done. As described in more details below, we defined changing samples as ones where the change was >= 0.15 and non-changing as ones where the change in PSI was < 0.05 to avoid ambiguous cases affecting the classification task.  

      (5) L230, RBP-AE is not explained in sufficient detail (and does not appear in the methods, apparently). It is not clear how exactly it is trained on each new cellular condition.

      Please see the response at the opening of this document and Q11 from Reviewer #1.

      (6) L230, "significantly improving": the r value actually got worse; it is therefore not clear you can claim any significant improvement. Please mention that fact in the text.

      This is a fair point. We note that we view the “a” statistic as potentially more interesting/relevant here as the Pearson “r” is dominated by points being generally close to 0/1.  Regardless, revisiting this we realized one can also make a point that the term “significant” is imprecise/misplaced since there is no statistical test done here (side note: given the amount of points, a simple null of same distribution yes/no would pass significance but we don’t think this is an interesting/relevant test here). Also, we note that with the transition to PCA instead of RBP-AE we actually get improvements in both a and r values, both for the ENCODE samples shown in Figure 3a and the two new GTEX tissues we tested (see above). We now changed the text to simply state: 

      “...As shown in Figure 3a, this latent space representation allows TrASPr to generalize from the six GTEX tissues to unseen conditions, including unseen GTEX tissues (top row), and ENCODE cell lines (bottom row). It improves prediction accuracy compared to TrASPr lacking PCA (e.g. a=88.5% vs a=82.3% for ENCODE cell lines), though naturally training on the additional GTEX and ENCODE conditions can lead to better performance (e.g. a=91.7% for ENCODE, Figure 3a left column).”

      (7) L233, "Notably, previous splicing codes focused solely on cassette exons", Rosenberg et al. focused solely on alternative splice site choice.

      Right - we removed that sentence.

      (8) L236, "trained TrASPr on datasets for 3' and 5' splice site variations". Please provide more details on this task. What is the input to TrASPr and what is the prediction target (splice site usage, PSI of alternative isoforms)? What datasets are used for this task?

      The data for this task was the same GTEx tissue data, processed for alternative 3’ and 5’ splice site events. We revised the description of this task in the main text and added information in the Methods section. The data is also included in the repo.

      (9) L243, "directly from genomic sequences", and conservation?

      Yes, we changed the sentence to read “...directly from genomic sequences combined with related features” 

      (10) L262, what is the threshold for significant splicing changes?

      The threshold is 0.15. We updated the main text to read the following:

      The total number of mutations hitting each of the 1198 genomic positions across the 6106 sequences is shown in Figure 4b (left), while the distribution of effects ($|\Delta \Psi|$) observed across those 6106 samples is shown in Figure 4b (right). To this data we applied three testing schemes. The first is a standard 5-fold CV where 20% of combinations of point mutations were hidden in every fold, while the second test involved 'unseen mutation' (UM) where we hide any sample that includes mutations in specific positions for a total of 1480 test samples. As illustrated by the CDF in Figure 4b, most samples (each sample may involve multiple positions mutated) do not involve significant splicing changes. Thus, we also performed a third test using only the 883 samples where mutations cause significant changes ($|\Delta \Psi|\geq 0.15$).
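
      The 'unseen mutation' (UM) scheme described above can be sketched as follows. This is only an illustration under stated assumptions, not the code in our repository: the function name `unseen_mutation_split`, the sample identifiers, and the representation of a sample as a set of mutated positions are all hypothetical.

```python
def unseen_mutation_split(
    samples: dict[str, set[int]],
    held_out_positions: set[int],
) -> tuple[list[str], list[str]]:
    """Split samples so that held-out positions never appear in training.

    `samples` maps a sample id to the set of positions mutated in it.
    Any sample touching at least one held-out position goes to the test set.
    """
    train, test = [], []
    for sample_id, positions in samples.items():
        if positions & held_out_positions:
            test.append(sample_id)   # contains an unseen mutation position
        else:
            train.append(sample_id)  # only positions allowed in training
    return train, test


# Toy data: position 20 is hidden, so s1 and s3 become test samples.
toy_samples = {"s1": {10, 20}, "s2": {30}, "s3": {20, 40}}
train_ids, test_ids = unseen_mutation_split(toy_samples, {20})
```

      Unlike the 5-fold CV, where a test sample may combine mutations that were each seen in training, this split guarantees that every test sample includes at least one position the model has never seen mutated.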

      (11) L266, Pangolin performance is only provided for one of the settings (and it is not clear which). Please provide details of its performance in all settings.

      The description was indeed not clear. Pangolin’s performance was similar to SpliceAI as mentioned above but retraining it on the CD19 data yielded much closer performance to TrASPr. We include all the matching tests for Pangolin after retraining in Figure 4 Supp Figure 1. 

      (12) Please specify "n=" in all relevant plots. 

      Fixed.

      (13) Figure 3a, "The tissues were first represented as tokens, and new cell line results were predicted based on the average over conditions during training." Please explain this procedure in more detail. What are these tokens and how are they provided to the model? Are the cell line predictions the average of the predictions for the training tissues?

      Yes, as a baseline to assess improvements we simply used the average over the predictions for the training tissues for that specific event (see related work pointing to the need for similar baselines in DL for genomics in https://pubmed.ncbi.nlm.nih.gov/33213499/). Regarding the tokens - we encode each tissue type as a possible value and feed the two tissues as two tokens to the transformer.
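
      The averaging baseline can be written in a few lines. The sketch below is illustrative only: the function name `tissue_average_baseline` and the nested-dictionary layout (event id, then training tissue, then predicted PSI) are assumptions made for this example.

```python
from statistics import mean


def tissue_average_baseline(
    train_predictions: dict[str, dict[str, float]],
) -> dict[str, float]:
    """For each event, average the predicted PSI over the training tissues.

    The result is used as the prediction for any unseen condition, so the
    baseline carries no condition-specific information.
    """
    return {
        event: mean(per_tissue.values())
        for event, per_tissue in train_predictions.items()
    }


# Toy predictions for two events in two training tissues.
toy_predictions = {
    "event_1": {"heart": 0.25, "liver": 0.75},
    "event_2": {"heart": 1.0, "liver": 1.0},
}
baseline = tissue_average_baseline(toy_predictions)
```

      Any gain over this baseline must come from information about the new condition, since the baseline predicts the same value for an event regardless of the cell line.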

      (14) Figure 4b, the total count in the histogram is much greater than 6106. Please explain the dataset you're using in more detail, and what exactly is shown here.

      We updated the text to read: 

      “...we used 6106 sequence samples where each sample may have multiple positions mutated (i.e. mutation combinations) in exon 2 of CD19 and its flanking introns and exons (Cortés-López et al 2022). The total number of mutations hitting each of the 1198 genomic positions across the 6106 sequences is shown in Figure 4b (left).”

      (15) Figure 5a, how are the prediction thresholds (TrASPr passed, TrASPr stringent, and TrASPr very stringent) defined?

      Passed: dpsi>0.1, Stringent: dpsi>0.15, Very stringent: dpsi>0.2. This is now included in the main text.

      (16) L417, please include more detail on the relative size of TrASPr compared to other models (e.g. number of parameters, required compute, etc.).

      SpliceAI is a general-purpose splicing predictor with a 32-layer deep residual neural network that captures long-range dependencies in genomic sequences. Pangolin is a deep learning model specifically designed for predicting tissue-specific splicing, with an architecture similar to that of SpliceAI. The implementation of SpliceAI available at https://huggingface.co/multimolecule/spliceai involves an ensemble of 5 such models for a total of ~3.5M parameters. TrASPr has 4 BERT transformers (each with 6 layers and 12 heads) and an MLP on top of those, for a total of ~189M parameters. Evo 2, a genomic ‘foundation’ model, has 40B parameters, DNABERT has ~86M (a single BERT with 12 layers and 12 heads), and Borzoi has 186M parameters (as stated in https://www.biorxiv.org/content/10.1101/2025.05.26.656171v2). We note that the difference here is not just in model size but also in the amount of data used to train the model. We edited the original L417 to reflect that.

      (17) L546, please provide more detail on the VAE. What is the dimension of the latent representation?

      We added more details in the Methods section like the missing dimension (256) and definitions for P(Z) and P(S). 

      (18) Consider citing (and possibly comparing BOS to) Ghari et al., NeurIPS 2024 ("GFlowNet Assisted Biological Sequence Editing").

      Added.

      (19) Appendix Figure 2, and corresponding main text: it is not clear what is shown here. What is dPSI+ and dPSI-? What pairs of tissues are you comparing? Spearman correlation is reported instead of Pearson, which is the primary metric used throughout the text.

      The dPSI+ and dPSI- sets were indeed not well defined in the original submission. Moreover, we found our own code lacked consistency due to different tests executed at different times/by different people. We apologize for this lack of consistency and clarity which we worked to remedy in the revised version. To answer the reviewer’s question, given two tissues ($c,c'$), dPSI+ and dPSI- is for correctly classifying the exons that are significantly differentially included or excluded. Specifically, differential included exons are those for which  $\Delta \Psi_{e,c1,c2} = \Psi_\Psi_{e,c1} - \Psi_{e,c2}  \geq 0.15$, compared to those that are not  ($\Delta \Psi_{e,c1,c2} < 0.05). Similarly, dPSI- is for correctly classifying the exons that are significantly differentially excluded in the first tissue or included in the second tissue ($\Delta \Psi_{e,c1,c2} = \Psi_\Psi_{e,c1} - \Psi_{e,c2}  \leq -0.15$) compared to those that are not  ($\Delta \Psi_{e,c1,c2} > -0.05). This means dPSI+ and dPSI- are dependent on the order of c1, c2. In addition, we also define a direction/order agnostic test for changing vs non changing events i.e. $|\Delta \Psi_{e,c1,c2}| \geq 0.15$ vs $|\Delta \Psi_{e,c1,c2}| < 0.05$. These test definitions are consistent with previous publications (e.g. Barash et al Nature 2010, Jha et al 2017) and also answer different biological questions: For example “Exons that go up in brain” and “Exons that go up in Liver” can reflect distinct mechanisms, while changing exons capture a model’s ability to identify regulated exons even if the direction of prediction may be wrong. The updated Appendix Figure 2 is now in the main text as Figure 2d and uses Pearson, while AUPRC and AUROC refer to the changing vs no-changing classification task described above such that we avoid dPSI+ and dPSI- when summarizing in this table over 3 pairs of tissues . 
Finally, we note that making sure all tests comply with the above definition also resulted in an update to Figure 2b/c labels and values, where TrASPr’s improvements over Pangolin reach up to 1.8-fold in AUPRC, compared to 2.4-fold in the earlier version. We again apologize for the lack of clarity and consistent evaluations in the original submission.
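
      The three test definitions above can be sketched in code as follows. This is a minimal illustration that assumes only the 0.15 and 0.05 thresholds stated in the response; the function name `label_dpsi`, the constant names, and the convention of returning `None` for samples that fall between the two thresholds are hypothetical and are not taken from the authors' code.

```python
from typing import Dict, Optional

CHANGE_THRESHOLD = 0.15     # |dPSI| at or above this counts as changing
NO_CHANGE_THRESHOLD = 0.05  # |dPSI| below this counts as non-changing


def _label(is_positive: bool, is_negative: bool) -> Optional[int]:
    """Return 1 for a positive, 0 for a negative, None if excluded."""
    if is_positive:
        return 1
    if is_negative:
        return 0
    return None


def label_dpsi(psi_c1: float, psi_c2: float) -> Dict[str, Optional[int]]:
    """Label one (exon, c1, c2) sample for the three classification tests.

    Hypothetical helper: samples between the two thresholds are
    excluded from the corresponding test (label None).
    """
    dpsi = psi_c1 - psi_c2  # order matters: first tissue minus second

    return {
        # Differentially included in c1: dPSI >= 0.15 vs dPSI < 0.05
        "dPSI+": _label(dpsi >= CHANGE_THRESHOLD, dpsi < NO_CHANGE_THRESHOLD),
        # Differentially excluded in c1: dPSI <= -0.15 vs dPSI > -0.05
        "dPSI-": _label(dpsi <= -CHANGE_THRESHOLD, dpsi > -NO_CHANGE_THRESHOLD),
        # Order-agnostic: |dPSI| >= 0.15 vs |dPSI| < 0.05
        "changing": _label(abs(dpsi) >= CHANGE_THRESHOLD,
                           abs(dpsi) < NO_CHANGE_THRESHOLD),
    }
```

      Swapping the two tissues flips the dPSI+ and dPSI- labels but leaves the "changing" label untouched, which is the order dependence described above.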

      (20) Minor typographical comments:

      -   Some plots could use more polishing (e.g., thicker stroke, bigger font size, consistent style (compare 4a to the other plots)...).

      Agreed. While not critical for the science itself we worked to improve figure polishing in the revision to make those more readable and pleasant. 

      -   Consider using 2-dimensional histograms instead of the current kernel density plots, which tend to over-smooth the data and hide potentially important details. 

      We were not sure what the exact suggestion is here and opted to leave the plots as is.

      -   L53: dPSI_{e, c, c'} is never formally defined. Is it PSI_{e, c} - PSI_{e, c'} or vice versa?  

      Definition now included (see above).

      -   L91: Define/explain "transformer" and provide reference. 

      We added the explanation and related reference of the transformer in the introduction section and BERT in the method section.  

      -   L94: exons are short. Are you referring here to the flanking introns? Please explain. 

      We apologize for the lack of clarity. We are referring to a cassette exon alternative splicing event, as commonly defined by the splice junctions involved, that is, from the 5’ SS of the upstream exon to the 3’ SS of the downstream exon. The text now reads:

      “...In contrast, 24% of the cassette exons analyzed in this study span a region between the flanking exons' upstream 3' and downstream 5' splice sites that are larger than 10 kb.”

      -   L132: It's unclear whether a single, shared transformer or four different transformers (one for each splice site) are being pre-trained. One would at least expect 5' and 3' splice sites to have a different transformer. In Methods, L506, it seems that each transformer is pre-trained separately. 

      We updated the text to read:

      “We then center a dedicated transformer around each of the splice sites of the cassette exon and its upstream and downstream (competing) exons (four separate transformers for four splice sites in total).”

      -   L471: You explain here that it is unclear what tasks 'foundation' models are good for. Also in L128, you explain that you are not using a 'foundation' model. But then in L492, you describe the BERT model you're using as a foundation model! 

      Line 492 was simply a poor choice of wording as “foundation” is meant here simply as the “base component”. We changed it accordingly.

      -   L169, "pre-training ... BERT", explain what exactly this means. Is it using masking? Is it self-supervised learning? How many splice sites do you provide? Also explain more about the BERT architecture and provide references. 

      We added more details about the BERT architecture and training in the Methods section.

      -   L186 and later, the values for a and r provided here and in the below do not correspond to what is shown in Figure 2. 

      Fixed, thank you for noticing this.

      -   L187,188: What exactly do you mean by "events" and "samples"? Are they the same thing? If so, are they (exon, tissue) pairs? Please use consistent terminology. Moreover, when you say "changing between two conditions": do you take all six tissues whenever there is a 0.15 spread in PSI among them? Or do you take just the smallest PSI tissue and the largest PSI tissue when there is a 0.15 spread between them? Or something else altogether?

      Reviewer #2 is yet again correct that the definitions were not precise. A “sample” involves a specific exon skipping “event” measured in two tissues.  The text now reads: 

      “....most cassette exons do not change between a given tissue pair (only 14.0% of the samples in the dataset, i.e., a cassette exon measured across two tissues, exhibit |∆Ψ| ≥ 0.15). Thus, when we repeat this analysis only for samples involving exons that exhibited a change in inclusion (|∆Ψ| ≥ 0.15) between at least two tissues, performance degrades for all three models, but the differences between them become more striking (Figure 2a, right column).”

      -   Figure 1a, explain the colors in the figure legend. The 3D effect is not needed and is confusing (ditto in panel C).

      Color explanation is now added: “exons and introns are shown as blue rectangles and black lines. The blue dashed line indicates the inclusive pattern and the red junction indicates an alternative splicing pattern.” 

      These are not 3D effects but stacks to indicate multiple events/cases. We agree these are not needed in Fig1a to illustrate types of AS and removed those. However, in Fig1c and matching caption we use the stacks to  indicate HT data captures many such LSVs over which ML algorithms can be trained. 

      -   Figure 1b, this cartoon seems unnecessary and gives the wrong impression that this paper explores mechanistic aspects of splicing. The only relevant fact (RBPs serving as splicing factors) can be explained in the text (and is anyway not really shown in this figure).

      We removed Figure 1b cartoon.

      -   Figure 1c, what is being shown by the exon label "8"? 

      This was meant to convey exon ID, now removed to simplify the figure. 

      -   Figure 1e, left, write "Intron Len" in one line. What features are included under "..."? Based on the text, I did not expect more features.

      Also, the arrows emanating from the features do not make sense. Is "Embedding" a layer? I don't think so. Do not show it as a thin stripe. Finally, what are dPSI'+ and dPSI'-? are those separate outputs? are those logits of a classification task?

      We agree this description was not good and have updated it in the revised version. 

      -   Figure 1e, the right-hand side should go to a separate figure much later, when you introduce BOS.

      We appreciate the suggestion. However, we feel that Figure 1e serves as a visual representation of the entire framework. Just like we opted to not turn this work into two separate papers (though we fully agree it is a valid option that would also increase our publication count), we also prefer to leave this unified visual representation as is.

      -   Figure 2, does the n=2456 refer to the number of (exons, tissues) pairs? So each exon contributes potentially six times to this plot? Typo "approximately". 

      The “n” refers to the number of samples which is a cassette event measured in two tissues. The same cassette event may appear in multiple samples if it was confidently quantified in more than two tissues. We updated the caption to reflect this and corrected the typo.
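
      The notion of a "sample" described here can be illustrated with a short sketch. It assumes that a sample is an unordered pair of tissues in which the same cassette event was quantified; the input layout, the function names `build_samples` and `fraction_changing`, and the PSI values are all hypothetical and are not drawn from the study's data.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# (event, tissue_1, tissue_2, dPSI) with dPSI = PSI(tissue_1) - PSI(tissue_2)
Sample = Tuple[str, str, str, float]


def build_samples(psi: Dict[str, Dict[str, float]]) -> List[Sample]:
    """Turn per-event PSI values into (event, tissue pair) samples.

    `psi` maps event -> {tissue: PSI}; a tissue is simply absent when the
    event was not confidently quantified there (an assumed convention).
    """
    samples: List[Sample] = []
    for event, by_tissue in psi.items():
        # Every pair of quantified tissues yields one sample, so an event
        # quantified in more than two tissues appears in several samples.
        for t1, t2 in combinations(sorted(by_tissue), 2):
            samples.append((event, t1, t2, by_tissue[t1] - by_tissue[t2]))
    return samples


def fraction_changing(samples: List[Sample], threshold: float = 0.15) -> float:
    """Fraction of samples with |dPSI| at or above the threshold."""
    if not samples:
        return 0.0
    changing = sum(1 for *_, dpsi in samples if abs(dpsi) >= threshold)
    return changing / len(samples)


# Made-up values for illustration only.
example_psi = {
    "exonA": {"brain": 0.90, "liver": 0.40, "heart": 0.45},
    "exonB": {"brain": 0.50, "liver": 0.52},
}
example_samples = build_samples(example_psi)
```

      With these made-up values, `exonA` contributes three samples and `exonB` one, which is why n counts samples rather than distinct exons.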

      -   Figure 2b, typo "differentially included (dPSI+) or excluded" .

      Fixed.

      -   L221, "the DNABERT" => "DNABERT".

      Fixed.

      -   L232, missing percent sign.

      Fixed.

      -   L246, "see Appendix Section 2 for details" seems to instead refer to the third section of the appendix.

      We do not have this as an Appendix, the reference has been updated.

      -   Figure 3, bottom panels, PSI should be "splice site usage"? 

      PSI is correct here - we hope the revised text/definitions make it more clear now.

      -   Figure 3b: typo: "when applied to alternative alternative 3'".

      Fixed.

      -   p252, "polypyrimidine" (no capitalization).

      Fixed.

      -   Strange capitalization of tissue names (e.g., "Brain-Cerebellum"). The tissue is called "cerebellum" without capitalization.

      We used EBV (capital) for the abbreviation and lower case for the rest.

      -   Figure 4c: "predicted usage" on the left but "predicted PSI" on the right. 

      Right. We opted to leave it as is since Pangolin and SpliceAI do predict their definition of “usage” and not directly PSI, we just measure correlations to observed PSI as many works have done in the past. 

      -   Figure 4 legend typo: "two three".

      Fixed.

      -   L351, typo: "an (unsupervised)" (and no need to capitalize Transformer).

      Fixed.

      -   L384, "compared to other tissues at least" => "compared to other tissues of at least".

      Fixed.

      -   L549, P(Z) and P(S) are not defined in the text.

      Fixed.

      -   L572, remove "Subsequently". Add missing citations at the end of the paragraph.

      Fixed.

      -   L580-581, citations missing.

      Fixed.

      -   L584-585, typo: "high confidince predictions"

      Fixed.

      -   L659-660, BW-M and B-WM are both used. Typo?

      Fixed.

      -   L895, "calculating the average of these two", not clear; please rewrite.

      Fixed.

      -   L897, "Transformer" and "BERT", do these refer to the same thing? Be consistent.  

      BOS is a transformer and not a BERT but TrASPr uses the BERT architecture. BERT is a type of transformer as the reviewer is surely well aware so the sentence is correct. Still, to follow the reviewer’s recommendation for consistency/clarity we changed it here to state BERT.

      -   Appendix Figure 5: The term dPSI appears to be overloaded to also represent the difference between predicted PSI and measured PSI, which is inconsistent with previous definitions. 

      Indeed! We thank the reviewer again for their sharp eye and attention to details that we missed. We changed Supp Figure 5, now Figure 4 Supplementary Figure 2, to |PSI’-PSI| and defined those as the difference between TrASPr’s predictions (PSI’) and MAJIQ based PSI quantifications.

    1. eLife Assessment

      This important work advances our understanding of the role of kisspeptin neurons in regulating the luteinizing hormone (LH) surge in females. The evidence demonstrating increased neuronal activity in anterior hypothalamic kisspeptin neurons just before the LH surge is compelling, though additional neuroanatomical evidence showing the specificity of the methods would strengthen the study. It also confirms that high circulating levels of estradiol, but also other unidentified factors, are required for the full daily activation. This research will be of interest to reproductive biologists and neuroscientists studying the female ovarian cycle.

    2. Joint Public Review:

      Summary:

      This is an excellent, timely study investigating and characterizing the underlying neural activity that generates the neuroendocrine GnRH and LH surges that are responsible for triggering ovulation. Abundant evidence accumulated over the past 20 years implicated the population of kisspeptin neurons in the hypothalamic RP3V region (also referred to as the POA or AVPV/PeN kisspeptin neurons) as being involved in driving the GnRH surge in response to elevated estradiol (E2), also known as the "estrogen positive feedback". However, while former studies used Cfos coexpression as a marker of RP3V kisspeptin neuron activation at specific times and found this correlates with the timing of the LH surge, detailed examination of the live in vivo activity of these neurons before, during, and after the LH surge remained elusive due to technical challenges.

      Here, Zhou and colleagues use fiber photometry to measure the long-term synchronous activity of RP3V kisspeptin neurons across different stages of the mouse estrous cycle, including on proestrus when the LH surge occurs, as well as in a well-established OVX+E2 mouse model of the LH surge.

      The authors report that RP3V kisspeptin neuron activity is low on estrus and diestrus, but increases on proestrus several hours before the late afternoon LH surge, mirroring prior reports of rising GnRH neuron activity in proestrus female mice. The measured increase in RP3V kisspeptin activation is long, spanning ~13 hours in proestrus females and extending well beyond the end of the LH secretion, and is shown by the authors to be E2 dependent.

      For this work, Kiss-Cre female mice received a Cre-dependent AAV injection, containing GCaMP6, to measure the neuronal activation of RP3V Kiss1 cells. Females exhibited periods of increased neuronal activation on the day of proestrus, beginning several hours prior to the LH surge and lasting for about 12 hours. Though oscillations in the pattern of GCaMP fluorescence were occasionally observed throughout the ovarian cycle, the frequency, duration, and amplitude of these oscillations were significantly higher on the day of proestrus. This increase in RP3V Kiss1 neuronal activation that precedes the increase in LH supports the hypothesis that these neurons are critical in regulating the LH surge. The authors compare this data to new data showing a similar increased activation pattern in GnRH neurons just prior to the LH surge, further supporting the hypothesis that RP3V Kiss1 cell activation causes the release of kisspeptin to stimulate GnRH neurons and produce the LH surge.

      Strengths:

      This study provides compelling data demonstrating that RP3V kisspeptin neuronal activity changes throughout the ovarian cycle, likely in response to changes in estradiol levels, and that neuronal activation increases on the day of the LH surge.

      The observed increase in RP3V kisspeptin neuronal activation precedes the LH surge, which lends support to the hypothesis that these neurons play a role in regulating the estradiol-induced LH surge. Continuing to examine the complexities of the LH surge and the neuronal populations involved, as done in this study, is critical for developing therapeutic treatments for women's reproductive disorders.

      This innovative study uses a within-subject design to examine neuronal activation in vivo across multiple hormone milieus, providing a thorough examination of the changes in activation of these neurons. The variability in neuronal activity surrounding the LH surge across ovarian cycles in the same animals is interesting and could not be achieved without this within-subject design. The inclusion and comparison of ovary-intact females and OVX+E2 females is valuable for testing mechanisms under these two LH surge conditions, and allows future studies to tease apart minor differences in the LH surge pattern between them.

      This study provides an excellent experimental setup able to monitor the daily activity of preoptic kisspeptin neurons in freely moving female mice. It will be a valuable tool to assess the putative role of these kisspeptin neurons in various aspects of altered female fertility (aging, pathologies...). This approach also offers novel and useful insights into the impact of E2 and circadian cues on the electrical activity of RP3V kisspeptin neurons.

      Kisspeptin neural activity shows an intriguing cyclical oscillation every 90 minutes, which may offer critical insight into how the RP3V kisspeptin system operates. Interestingly, there was also variability in the onset and duration of RP3V kisspeptin neuron activity between and within mice in naturally cycling females. Preoptic kisspeptin neurons show an increased activity around the light/dark transition only on the day of proestrus, and this is associated with an increase in LH secretion. An original finding is the observation that the peak of kisspeptin neuron activation continues a few hours past the peak of LH, and the authors hypothesize that this prolonged activity could drive female sexual behaviors, which usually appear after the LH surge.

      The authors demonstrated that ovariectomy resulted in very little neuronal activity in RP3V kisspeptin neurons. When these ovariectomized females were treated with estradiol benzoate (EB) and an LH surge was induced, there was an increase in RP3V kisspeptin neuronal activation, as was seen during proestrus. However, the magnitude of the change in activity was greater during proestrus than during the EB-induced LH surge. Interestingly, the authors noted a consistent peak in activity about 90 minutes prior to lights out on each day of the ovarian cycle and during EB treatment, but not in ovariectomized females. The functional purpose of this consistent neuronal activity at this time remains to be determined.

      Though not part of this study, the comparison of neuronal activation of GnRH neurons during the LH surge to the current data was convincing, demonstrating a similar pattern of increased activation that precedes the LH surge.

      In summary, the study is well-designed, uses proper controls and analyses, has robust data, and the paper is nicely organized and written. The data from these experiments is compelling, and the authors' claims and conclusions are nicely supported and justified by the data. The data support the hypothesis in the field that these RP3V neurons regulate the LH surge. Overall, these findings are important and novel, and lend valuable insight into the underlying neural mechanisms for neuroendocrine control of ovulation.

      Weaknesses:

      (1) LH levels were not measured in many mice or in robust temporal detail, such as every 30 or 60 min, to allow a more detailed comparison between the fine-scale timing of RP3V neuron activation with onset and timing of LH surge dynamics.

      (2) The authors report that the peak LH value occurred 3.5 hours after the first RP3V kisspeptin neuron oscillation. However, it is likely, and indeed evident from the 2 example LH patterns shown in Figures 3A-B, that LH values start to increase several hours before the peak LH. This earlier rise in LH levels ("onset" of the surge) occurs much closer in time to the first RP3V kisspeptin neuron oscillatory activation, and as such, the ensuing LH secretion may not be as delayed as the authors suggest.

      (3) The authors nicely show that there is some variation (~2 hours) in the peak of the first oscillation in proestrus females. Was this same variability present in OVX+E2 females, or was the variability smaller or absent in OVX+E2 versus proestrus? It is possible that the variability in proestrus mice is due to variability in the timing and magnitude of rising E2 levels, which would, in theory, be more tightly controlled and similar among mice in the OVX+E2 model. If so, the OVX+E2 mice may have less variability between mice for the onset of RP3V kisspeptin activity.

      (4) One concern regarding this study is the lack of data showing the specificity of the AAV and the GCaMP6s signals. There are no data showing that GCaMP6s is limited to the RP3V and is not expressed in other Kiss1 populations in the brain. Given that 2 µl of the AAV was injected, which seems like a lot considering it was close to the ventricle, it is important to show that the signal and measured activity are specific to the RP3V region. Though the authors discuss potential reasons for the low co-expression of GCaMP6 and kisspeptin immunoreactivity, it does raise some concern regarding the interpretation of these results. The low co-expression makes it difficult to confirm the Kiss1 cell-specificity of the Cre-dependent AAV injections. In addition, if GFP (GCaMP6s) and kisspeptin protein co-localization is low, it is possible that the activation of these neurons does not coincide with changes in kisspeptin, or that these neurons are not even expressing Kiss1 or kisspeptin at the time of activation. It is important to remember that the study measures activation of the kisspeptin neuron, and it does not reveal anything specific about the activity of the kisspeptin protein.

      (5) One additional minor concern is that LH levels were not measured in the ovariectomized females during the expected time of the LH surge. The authors suggest that the lower magnitude of activation during the LH surge in these females, in comparison to proestrus females, may be the result of lower LH levels. It's hard to interpret the difference in magnitude of neuronal activation between EB-treated and proestrus females without knowing LH levels. In addition, it's possible that an LH surge did not occur in all EB-treated females, and thus, having LH levels would confirm the success of the EB treatment.

      (6) This kisspeptin neuron peak activity is abolished in ovariectomized mice, and estradiol replacement restored this activity, but only partially. Circulating levels of estradiol were not measured in these different setups, but the authors hypothesize that the lack of full restoration may be due to the absence of other ovarian signals, possibly progesterone.

      (7) Recordings in several mice show inter- and intra-individual variability in the time of peak onset. It is not shown whether this variability is associated with a similar variability in the timing of the LH surge onset in the recorded mice. The authors hypothesized that this variability indicates a poor involvement of the circadian input. However, no experiments were done to investigate the role of the (vasopressinergic-driven) circadian input on the kisspeptin neuron activation at the light/dark transition. Thus, we suggest that the authors be more tentative about this hypothesis.

    1. eLife Assessment

      This study aims to identify the proteins that make up the electrical synapse, which are much less understood than those of the chemical synapse. These findings represent an important step toward understanding the molecular function of electrical synapses and will have broad utility for the wider neuroscience field. The experimental evidence is convincing.

    2. Reviewer #1 (Public review):

      This study aims to identify the proteins that compose the electrical synapse, which are much less understood than those of the chemical synapse. Identifying these proteins is important to understand how synaptogenesis and conductance are regulated in these synapses.

      Using a proteomics approach, the authors identified more than 50 new proteins and used immunoprecipitation and immunostaining to validate their interaction or localization. One new protein, a scaffolding protein (Sipa1l3), shows particularly strong evidence of being an integral component of the electrical synapse. The function of Sipa1l3 remains to be determined.

      Another strength is the use of two different model organisms (zebrafish and mice) to determine which components are conserved across species. This approach also expands the utility of this work to benefit researchers working with both species.

      The methodology is robust and there is compelling evidence supporting the findings.

      Comments on revisions:

      I thank the authors for responding to the comments. No further recommendations.

    3. Reviewer #3 (Public review):

      Summary:

      This study by Tetenborg S et al. identifies proteins that are physically closely associated with gap junctions in retinal neurons of mice and zebrafish using BioID, a technique that labels and isolates proteins in proximity to a protein of interest. These proteins include scaffold proteins, adhesion molecules, chemical synapse proteins, components of the endocytic machinery, and cytoskeleton-associated proteins. Using a combination of genetic tools and meticulously executed immunostaining, the authors further verified the colocalizations of some of the identified proteins with connexin-positive gap junctions. The findings in this study highlight the complexity of gap junctions. Electrical synapses are abundant in the nervous system, yet their regulatory mechanisms are far less understood than those of chemical synapses. This work will provide valuable information for future studies aiming to elucidate the regulatory mechanisms essential for the function of neural circuits.

      Strengths:

      A key strength of this work is the identification of novel gap junction-associated proteins in AII amacrine cells and photoreceptors using BioID in combination with various genetic tools. The well-studied functions of gap junctions in these neurons will facilitate future research into the functions of the identified proteins in regulating electrical synapses.

      Comments on revisions:

      The authors have addressed my concerns in the revised manuscript.

    4. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer 1

      The authors should clarify the statement regarding the expression in horizontal cells (lines 170-172). In line 170, it is stated that GFP was observed in horizontal cells. Since GFP is fused to Cx36, the observation of GFP in horizontal cells would suggest the expression of Cx36-GFP.

      We believe that there appears to be a misunderstanding. GFP is observed in horizontal cells, because the test AAV construct, which consists of the HKamac promoter and a downstream GFP sequence, was used to validate the promoter specificity in wildtype animals. This was just a test to confirm that HKamac is indeed active in AII amacrine cells as previously described by Khabou et al. 2023. This construct was not used for the large scale BioID screen. For these experiments, V5-dGBP-Turbo was expressed under the control of the HKamac promoter as illustrated in Figure 2A.

      Fig 7: the legend is missing the descriptions for panels A-C.

      We apologize for this mistake. We had missed the label “(A-C)” and have now added it to the legend.

      Supplemental files are not referenced in the manuscript.

      We have added a reference to these files in lines 221-226.

      Reviewer 2

      Supplementary Files 1 and 2 are presented as two replicates of the zebrafish proteomic datasets, but they appear to be identical.

      This appears to be a misunderstanding. These two replicates contain slightly different hits, although the most abundant candidates are identical.

      Reviewer 3

      Thank you for the positive comments.

    1. eLife Assessment

      This study presents a valuable finding on how the locus coeruleus modulates the involvement of medial prefrontal cortex in set shifting using calcium imaging. The evidence supporting the claims was viewed as incomplete in comparisons of extra- (EDS) and intradimensional shifts (IDS). The work is of broad interest to those studying flexible cognition.

    2. Reviewer #1 (Public review):

      Summary:

      The authors note that there is a large corpus of research establishing the importance of LC-NE projections to medial prefrontal cortex (mPFC) of rats and mice in attentional set or 'rule' shifting behaviours. However, this is a complex behavior, and the authors were attempting to gain an understanding of how locus coeruleus modulation of the mPFC contributes to set shifting.

      The authors replicated the ED-shift impairment following NE denervation of mPFC by chemogenetic inhibition of the LC. They further showed that LC inhibition changed the way neurons in mPFC responded to the cues, with a greater proportion of individual neurons responsive to 'switching', but the individual neurons also had broader tuning, responding to other aspects of the task (i.e., response choice and response history). The population dynamics were also changed by LC inhibition, with reduced separation of population vectors between early-post-switch trials, when responding was at chance, and later trials when responding was correct. This was what they set out to demonstrate and so one can conclude they achieved their aims.

      The authors concluded that LC inhibition disrupted mPFC "encoding capacity for switching" and suggest that this "underlie[s] the behavioral deficits."

      Strengths:

      The principal strength is combining inactivation of LC with calcium imaging in mPFC. This enabled detailed consideration of the change in behavior (i.e., defining epochs of learning, with an 'early phase' when responding is at chance being compared to a 'later phase' when the behavioral switch has occurred) and how these are reflected in neuronal activity in the mPFC, with and without LC-NE input.

      Comments on revised version:

      In their response to reviewers, the authors say "We report p values using 2 decimal points and standard language as suggested by this reviewer". However, no changes were made in the manuscript: for example, "P = 4.2e-3" rather than "p = 0.004".

      In their response to the reviewers, they wrote: "Upon closer examination of the behavioral data, we exclude several sessions where more trials were taken in IDS than in EDS." If those sessions in which EDS < IDS are excluded, it is unsurprising that the remaining data show EDS > IDS. Most problematic is the fact that the manuscript now reads "Importantly, control mice (pooled from Fig. 1e, 1h, Supp. Fig. 1a, 1b) took more trials to complete EDS than IDS (Trials to criterion: IDS vs. EDS, 10 ± 1 trials vs. 16 ± 1 trials, P < 1e-3, Supp. Fig. 1c), further supporting the validity of attentional switching (as in Fig. 1c)" without mentioning that data has been excluded.

    3. Reviewer #3 (Public review):

      Summary:

      Nigro et al examine how the locus coeruleus (LC) influences the medial prefrontal cortex (mPFC) during attentional shifts required for behavioral flexibility. Specifically, they propose that LC-mPFC inputs enable mice to shift attention effectively from texture to odor cues to optimize behavior. The LC and its noradrenergic projections to the mPFC have previously been implicated in this behavior. The authors further establish this by using chemogenetics to inhibit LC terminals in mPFC and show a selective deficit in extradimensional set shifting behavior. But the study's primary innovation is the simultaneous inhibition of LC while recording multineuron patterns of activity in mPFC. Analysis at the single neuron and population levels revealed broadened tuning properties, less distinct population dynamics, and disrupted predictive encoding when LC is inhibited. These findings add to our understanding of how neuromodulatory inputs shape attentional encoding in mPFC and are an important advance. There are some methodological limitations and/or caveats that should be considered when interpreting the findings, and these are described below.

      Strengths:

      The naturalistic set-shifting task in freely-moving animals is a major strength, and the inclusion of localized suppression of LC-mPFC terminals builds confidence in the specificity of their behavioral effect. Combining chemogenetic inhibition of LC while simultaneously recording neural activity in mPFC with miniscopes is state-of-the-art. The authors apply analyses to population dynamics in particular that can advance our understanding of how the LC modifies patterns of mPFC neural activity. The authors show that neural encoding at both the single cell level and the population level is disrupted when LC is inhibited. They also show that activity is less able to predict key aspects of the behavior when the influence of LC is disrupted. This is quite interesting and adds to a growing understanding of how neuromodulatory systems sharpen tuning of mPFC activity.

      Weaknesses:

      Weaknesses are mostly minor, but there are some caveats that should be considered. First, the authors use a DBH-Cre mouse line and provide histological confirmation of overlap between hM4Di expression and TH immunostaining. While this strongly suggests modulation of noradrenergic circuit activity, the results should be interpreted conservatively, as there is no independent confirmation that norepinephrine (NE) release is suppressed and these neurons are known to release other neurotransmitters and signaling peptides. In the absence of additional control experiments, it is important to recognize that effects on mPFC activity may or may not be directly due to LC-mPFC NE.

      Another caveat is that the imaging analyses are entirely from the extradimensional shift session. Without analyzing activity data from the intradimensional shift (IDS) session, one cannot be certain that the observed changes are to some feature of activity that is specific to extradimensional shifts. Future experiments should examine animals with LC suppression during the IDS as well, which would show whether the observed effects are specific to an extradimensional shift and might explain behavioral effects.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      We thank the reviewers and editors for this peer review. Following the editorial assessment and specific review comments, in this revision we have included new analysis to support the validity of the behavioral task (Reviewer #2). We have improved data presentation by including 1) data points from individual animals (Reviewer #1, #3), 2) updated histology showing the expression of hM4Di in LC neurons as well as LC terminals in the mPFC (Reviewer #3), and 3) more detailed descriptions of methodology and data analysis (Reviewer #1, #2, #3).

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) Planned t-tests should be performed in both control and experimental animals to determine if the number of trials needed to reach criterion on the ID is lower than on the ED. Based on the data analyses showing no difference among the control group, the data could be pooled to demonstrate that the task is valid. Reporting all p-values using 2 decimal points and standard language e.g., p < 0.001 would greatly improve the readability of the data. 

      Thank you for this suggestion. As pointed out by this reviewer, more trials to reach performance criterion in EDS than IDS is indicative of successful acquisition and switching of the attentional sets. Upon closer examination of the behavioral data, we exclude several sessions where more trials were taken in IDS than in EDS, and our conclusions that DREADD inhibition of the LC or LC input to the mPFC impaired rule switching in EDS remain robust (e.g., new Fig. 1e, 1h). We also pool control and test data (Fig. 1e, 1h, new Supp. Fig. 1a, 1b) to demonstrate the validity of this task (new Supp. Fig. 1c, IDS vs. EDS in the control group, 10 ± 1 trials vs. 16 ± 1 trials, P < 1e-3). The validity of set shifting is also supported by the new Fig. 1c.  

      We report p values using 2 decimal points and standard language as suggested by this reviewer.

      Relevant to the comments from Reviewer #1 in the public review, we now show individual data points on the bar charts (new Fig. 1e, 1h).  

      (2) It may also be helpful to provide the average time between CNO infusion and onset of the ED as well as information about when maximal effects are expected after these treatments.

      Systemic CNO injections were administered immediately after IDS, and we waited approximately one hour before proceeding to EDS. Maximal effects of systemic CNO activation were reported to occur after 30 minutes and last for at least 4-6 hours. Both control and test groups received the CNO injections in the same manner. This is now better described in Methods.  

      Reviewer #3 (Recommendations for the authors):

      (1) Add better histology images showing colocalization of TH and HM4Di. Quantification of colocalization would be optimal.

      We now include better histology images (new Fig. 1d) and have quantified the colocalization of TH and HM4Di in the main text (line 115-116).  

      (2) If possible, images showing HM4Di expression in mPFC axon terminals would be useful. If these are colocalized with TH immunostaining, that would increase confidence in their identity. This would be much more useful than the images provided in Figure 1C.

      We now include a new image to show hM4Di expression (mCherry) in LC terminals in the mPFC (new Fig. 1f). However, due to technical limitations (species of the primary antibody), we did not co-stain with TH.

      (3) Include behavior of mice from the miniscope experiment in Figure 2 to show they are similar to those from Figure 1.

      This is now included in Supp. Fig. 1b.

      (4) More details about the processing and segmentation of miniscope data would be helpful (e.g., how many neurons were identified from each animal?). 

      We use standard preprocessing and segmentation pipelines in Inscopix data processing software (version 1.6), which includes modules for motion correction and signal extraction. Briefly, raw imaging videos underwent preprocessing, including a 4x spatial downsampling to reduce file size and processing time. No temporal downsampling was performed. The images were then cropped to eliminate post-registration borders and areas where cells were not visible. Prior to the calculation of the dF/F0 traces, lateral movement was corrected. For ROI identification, we used a constrained non-negative matrix factorization algorithm optimized for endoscopic data (CNMF-E) to extract fluorescence traces from ROIs. We identified 128 ± 31 neurons after manual selection, depending on recording quality and field of view. The number of neurons acquired from each animal is now included in Methods, where the pipeline is described in more detail (lines 405-415).

      (5) Add more methodological detail for how cell tuning was analyzed, including how z-scoring was performed (across the entire session?), and how neurons in each category were classified. 

      We have expanded the Methods section to clarify how cell tuning was analyzed (lines 419-430). Calcium traces were z-scored on a per-neuron basis across the entire session. For each neuron, we computed trial-averaged activity aligned to specific task events (e.g., digging in one of the two ramekins available). A neuron was classified as responsive if its activity showed a significant difference (p < 0.05) between two conditions within the defined time window in the ROC analysis.
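      The classification steps described above can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, and the permutation test used to attach a p-value to the ROC area is an assumption, since the response does not say how significance was assessed.

```python
import numpy as np

def zscore_per_neuron(traces):
    """Z-score each neuron (row) across the entire session (columns)."""
    mean = traces.mean(axis=1, keepdims=True)
    std = traces.std(axis=1, keepdims=True)
    return (traces - mean) / std

def roc_auc(a, b):
    """Area under the ROC curve for separating samples a from samples b.

    Equals the probability that a random value from `a` exceeds a random
    value from `b`, with ties counted as one half.
    """
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return float(np.mean((a > b) + 0.5 * (a == b)))

def is_responsive(a, b, n_perm=1000, alpha=0.05, seed=0):
    """Hypothetical criterion: permutation test on |AUC - 0.5|.

    `a` and `b` hold one neuron's trial-wise activity (within the chosen
    time window) for the two conditions being compared.
    """
    rng = np.random.default_rng(seed)
    observed = abs(roc_auc(a, b) - 0.5)
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        null = abs(roc_auc(shuffled[:len(a)], shuffled[len(a):]) - 0.5)
        count += null >= observed
    p_value = (count + 1) / (n_perm + 1)
    return p_value < alpha, p_value
```

      In this sketch the z-scoring uses each neuron's mean and standard deviation over the whole session, matching the description above, while the time window and the choice of conditions are left to the caller.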

      (6) For data from Figure 2F it would be very useful to plot data from individual mice in addition to this aggregated representation.

      We now include data from individual mice in Supp. Table 1.

      (7) I think it would be helpful to move some parts of Figure S1 to the main Figure 1, in particular the table from S1A. 

      Fig. S1 is now part of the new Fig. 1.

      (8) Clarify whether Figure S2 is an independent replication, as implied, or whether the same test data is shown twice in two separate figures (In Figure 1b and Supplementary Figure 2).

      The test group in Fig. S2 (new Fig. S1) is the same as the test group in Fig. 1b (new Fig. 1e), but the control group is a separate cohort. This is now clarified in the figure legends.  

      (9) The authors should add a limitations section to the discussion where they specifically discuss the caveats involved in relating their results specifically to NE. This should include the possible involvement of co-transmitters and off-target expression of Cre in other populations.

      Thank you for this comment. Previous pharmacology and lesion studies showed that LC input or NE content in the mPFC was specifically required for EDS-type switching processes (Lapiz, M.D. et al., 2006; Tait, D.S. et al. 2007; McGaughy, J. et al. 2008), in light of which we interpret our mPFC neurophysiological effects with LC inhibition as at least partially mediated by the direct LC-NE input.  When discussing the limitations of our study, we now explicitly acknowledge the potential involvement of co-transmitters released by LC neurons (line 253-256).  

      (10) The authors should provide details about the TH antibody used for IHC.

      We now include more details in immunohistochemistry (line 384-388).

      (11) Throughout, it would be helpful to include datapoints from individual animals - these are included in some supplementary figures, but are missing in a number of the main plots.

      Reviewer #1 made a similar comment, and we now include individual data points in the figures (e.g., Fig. 1e, 1h).

    1. eLife Assessment

      This study introduces a novel method for estimating spatial spectra from irregularly sampled intracranial EEG data, revealing cortical activity across all spatial frequencies, which supports the global and integrated nature of cortical dynamics. It showcases important technical innovations and rigorous analyses, including tests to rule out potential confounds. However, further direct evaluation of the model, for example by using simulated cortical activity with a known spatial spectrum (e.g., an iEEG volume-conductor model that describes the mapping from cortical current source density to iEEG signals, and that incorporates the reference electrodes and the particular montage used), would even further strengthen the incomplete evidence.

    2. Reviewer #1 (Public review):

      Summary:

      The paper uses rigorous methods to determine phase dynamics from human cortical stereotactic EEGs. It finds that the power of the phase is higher at the lowest spatial phase. The application to data illustrates the solidity of the method and its potential for discovery.

      Comments on revised submission:

      The authors have provided responses to the previous recommendations.

    3. Reviewer #3 (Public review):

      Summary:

      The authors propose a method for estimating the spatial power spectrum of cortical activity from irregularly sampled data and apply it to iEEG data from human patients during a delayed free recall task. The main findings are that the spatial spectra of cortical activity peak at low spatial frequencies and decrease with increasing spatial frequency. This is observed over a broad range of temporal frequencies (2-100 Hz).

      Strengths:

      A strength of the study is the type of data that is used. As pointed out by the authors, spatial spectra of cortical activity are difficult to estimate from non-invasive measurements (EEG and MEG) and from commonly used intracranial measurements (i.e. electrocorticography or Utah arrays) due to their limited spatial extent. In contrast, iEEG measurements are easier to interpret than EEG/MEG measurements and typically have larger spatial coverage than Utah arrays. However, iEEG is irregularly sampled within the three-dimensional brain volume and this poses a methodological problem that the proposed method aims to address.

      Weaknesses:

      Although the proposed method is evaluated in several indirect ways, a direct evaluation is lacking. This would entail simulating cortical current source density (CSD) with known spatial spectrum and using a realistic iEEG volume-conductor model to generate iEEG signals.

      Comments on revised version:

      In my original review, I raised the following issue:

      "The proposed method of estimating wavelength from irregularly sampled three-dimensional iEEG data involves several steps (phase-extraction, singular value-decomposition, triangle definition, dimension reduction, etc.) and it is not at all clear that the concatenation of all these steps actually yields accurate estimates. Did the authors use more realistic simulations of cortical activity (i.e. on the convoluted cortical sheet) to verify that the method indeed yields accurate estimates of phase spectra?"

      And the authors' response was:

      "We now included detailed surrogate testing, in which varying combinations of sEEG phase data and veridical surrogate wavelengths are added together. See our reply from the public reviewer comments. We assess that real neurophysiological data (here, sEEG plus surrogate and MEG manipulated in various ways) is a more accurate way to address these issues. In our experience, large scale TWs appear spontaneously in realistic cortical simulations, and we now cite the relevant papers in the manuscript (line 53)."

      The point that I wanted to make is not that traveling waves appear in computational models of cortical activity, as the authors seem to think. My point was that the only direct way to evaluate the proposed method for estimating spatial spectra is to use simulated cortical activity with known spatial spectrum. In particular, with "realistic simulations" I refer to the iEEG volume-conductor model that describes the mapping from cortical current source density (CSD) to iEEG signals, and that incorporates the reference electrodes and the particular montage used.

      Although in the revised manuscript the authors have provided indirect evidence for the soundness of the proposed estimation method, the lack of a direct evaluation using realistic simulations with ground truth as described above means that I remain sceptical about the soundness of the method.

    4. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study introduces a novel method for estimating spatial spectra from irregularly sampled intracranial EEG data, revealing cortical activity across all spatial frequencies, which supports the global and integrated nature of cortical dynamics. The study showcases important technical innovations and rigorous analyses, including tests to rule out potential confounds; however, the lack of comprehensive theoretical justification and assumptions about phase consistency across time points renders the strength of evidence incomplete. The dominance of low spatial frequencies in cortical phase dynamics continues to be of importance, and further elaboration on the interpretation and justification of the results would strengthen the link between evidence and conclusions.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The paper uses rigorous methods to determine phase dynamics from human cortical stereotactic EEGs. It finds that the power of the phase is higher at the lowest spatial phase.

      Strengths:

      Rigorous and advanced analysis methods.

      Weaknesses:

      The novelty and significance of the results are difficult to appreciate from the current version of the paper.

      (1) It is very difficult to understand which experiments were analysed, and from where they were taken, reading the abstract. This is a problem both for clarity with regard to the reader and for attribution of merit to the people who collected the data.

      We now explicitly state the experiments that were used, lines 715-716.

      (2) The finding that the power is higher at the lowest spatial phase seems in tune with a lot of previous studies. The novelty here is unclear and it should be elaborated better.

      It is not generally accepted in neuroscience that power is higher at the lowest spatial frequencies, and recent research concludes that traveling waves at this scale may be the result of artefactual measurement (Orczyk et al., 2022; Hindriks et al., 2014; Zhigalov & Jensen, 2023). The question we answer is therefore timely and a source of controversy to researchers analysing TWs in cortex. While, in our view, the previous literature points in the direction of our conclusions (notably the work of Freeman et al. 2003; 2000; Barrie et al. 1996), it is not conclusive at the scale we are interested in, specifically >8cm, and certainly not convincing to the proponents of ‘artefactual measurement’.

      We have added a sentence to make this explicit in the abstract, lines 20-22. Please also note previous text at the end of the introduction, lines 140-148 and in the first paragraph of the discussion, lines 563-569.

      I could not understand reading the paper the advantage I would have if I used such a technique on my data. I think that this should be clear to every reader.

      We have made the core part of the code available on github (line 1154), which should simplify adoption of the technique. We have urged, in the Discussion (lines 653-663), why habitual measurement of SF spectra is desirable, since the same task measured with EEG, sEEG or ECoG does not encompass the same spatial scales, and researchers may be comparing signals with different functional properties. Until reliable methods for estimating SF are available, not dependent on the layout of the recording array, data cannot be analysed to resolve this question. Publication of our results and methods will help this process along.

      (3) It seems problematic to trust in a strong conclusion that they show low spatial frequency dynamics of up to 15-20 cm given the sparsity of the arrays. The authors seem to agree with this concern in the last paragraph of page 12. 

      The new surrogate testing supports our conclusions. The sEEG arrays would not normally be a first choice to estimate SF spectra, for reasons of their sparsity, which may be why such estimates have not been done before. Yet, this is the research challenge that we sought to solve, and a problem for which there was no ready method to hand. Nevertheless, it is a problem that urgently needed to be solved given the current debate on the origin of large-scale TWs. We have now included detailed surrogate testing of real data plus varying strength model waves (Figure 6A and Supplementary Figure 4). We believe this should convince the reader that we are measuring the spatial frequency spectrum with sufficient accuracy to answer the central research question.

      They also say that it would be informative to repeat the analyses presented here after the selection of more participants from all available datasets. It begs the question of why this was not done. It should be done if possible.

      We have now doubled the number of participants in the main analyses. Since each participant comprises a test of the central hypothesis, the hypothesis test now has 23 replications (Supplementary Figures 2 and 3). There were four failures to reach significance due to under-powered tests, i.e., not enough contacts. This is a sufficient test of the hypothesis and, in our opinion, not the primary obstacle to scientific acceptance of our results. The main obstacle is providing convincing tests that the method is accurate, and this is what we have focussed on. Publication of python code and the detailed methods described here enable any interested researcher to extend our method to other datasets.

      (4) Some of the analyses seem not to exploit in full the power of the dataset. Usually, a figure starts with an example participant but then the analysis of the entire dataset is not as exhaustive. For example, in Figure 6 we have a first row with the single participants and then an average over participants. One would expect quantifications of results from each participant (i.e. from the top rows of Fig. 6) extracting some relevant features of results from each participant and then showing the distribution of these features across participants. This would complement the subject average analysis.

      The results are now clearly split into sections, where we first deal with all the single participant analyses, then the surrogate testing to confirm the basic results, then the participant aggregate results (Figure 7 and Supplementary Figure 7). The participant aggregate results reiterate the basic findings for the single participants. The key finding is straightforward (SF power decreases with SF) and required only one statistical analysis per subject.

      (5) The function of brain phase dynamics at different frequencies and scales has been examined in previous papers at frequencies and scales relevant to what the authors treat. The authors may want to be more extensive with citing relevant studies and elaborating on the implications for them. Some examples below:

      Womelsdorf T, et al. Science. 2007

      Besserve M et al. PLoS Biology 2015

      Nauhaus I et al. Nat Neurosci 2009

      We have added two paragraphs to the discussion, in response to the reviewer suggestion (lines 606-623). These paragraphs place our high TF findings in the context of previous research.

      Reviewer #2 (Public review):

      Summary:

      In this paper, the authors analyze the organization of phases across different spatial scales. The authors analyze intracranial, stereo-electroencephalogram (sEEG) recordings from human clinical patients. The authors estimate the phase at each sEEG electrode at discrete temporal frequencies. They then use higher-order SVD (HOSVD) to estimate the spatial frequency spectrum of the organization of phase in a data-driven manner. Based on this analysis, the authors conclude that most of the variance explained is due to spatially extended organizations of phase, suggesting that the best description of brain activity in space and time is in fact a globally organized process. The authors' analysis is also able to rule out several important potential confounds for the analysis of spatiotemporal dynamics in EEG.

      Strengths:

      There are many strengths in the manuscript, including the authors' use of SVD to address the limitation of irregular sampling and their analyses ruling out potential confounds for these signals in the EEG.

      Weaknesses:

      Some important weaknesses are not properly acknowledged, and some conclusions are overinterpreted given the evidence presented.

      The central weakness is that the analyses estimate phase from all signal time points using wavelets with a narrow frequency band (see Methods - "Numerical methods"). This step makes the assumption that phase at a particular frequency band is meaningful at all times; however, this is not necessarily the case. Take, for example, the analysis in Figure 3, which focuses on a temporal frequency of 9.2 Hz. If we compare the corresponding wavelet to the raw sEEG signal across multiple points in time, this will look like an amplitude-modulated 9.2 Hz sinusoid to which the raw sEEG signal will not correspond at all. While the authors may argue that analyzing the spatial organization of phase across many temporal frequencies will provide insight into the system, there is no guarantee that the spatial organization of phase at many individual temporal frequencies converges to the correct description of the full sEEG signal. This is a critical point for the analysis because while this analysis of the spatial organization of phase could provide some interesting results, this analysis also requires a very strong assumption about oscillations, specifically that the phase at a particular frequency (e.g. 9.2 Hz in Figure 3, or 8.0 Hz in Figure 5) is meaningful at all points in time. If this is not true, then the foundation of the analysis may not be precisely clear. This has an impact on the results presented here, specifically where the authors assert that "phase measured at a single contact in the grey matter is more strongly a function of global phase organization than local". Finally, the phase examples given in Supplementary Figure 5 are not strongly convincing to support this point.

      “using wavelets with a narrow frequency band … this analysis also requires a very strong assumption about oscillations, specifically that the phase at a particular frequency (e.g. 9.2 Hz in Figure 3, or 8.0 Hz in Figure 5) is meaningful at all points in time”

      Our method uses very short time-window Morlet wavelets to avoid the assumption of oscillations, i.e., long-lasting sinusoids in the signal, in the sense of sinusoidal waveforms, or limit cycles extending in time. Cortical TWs can last only one or two cycles (Alexander et al., 2006), requiring methods that are compact in the time domain to avoid underreporting the desired phenomena. Additionally, the short time-window Morlet wavelets have low frequency resolution, so they are robust with respect to shifts in frequency between sites. We now discuss this issue explicitly in the Methods (lines 658-674). This means the phase estimation methods used in the manuscript precisely do not have the problem of assuming narrow-band oscillations in the signal. The methods are also robust to the exact shape of the waveforms; the signal need only be approximately sinusoidal, that is, it need only rise and fall. This means the Fourier variant we use does not introduce the ringing artefacts that longer time-series methods, such as the FFT, can introduce.
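      The trade-off described above, compactness in time against resolution in frequency, can be made concrete with a minimal sketch of phase estimation by a short complex Morlet wavelet. The function names and the parameterization (Gaussian envelope with temporal standard deviation n_cycles / (2 * pi * freq), truncated at four standard deviations) are assumptions chosen for illustration and may differ from the manuscript's implementation.

```python
import numpy as np

def morlet_wavelet(freq, fs, n_cycles=2.0, n_sd=4.0):
    """Complex Morlet wavelet: a complex sinusoid under a Gaussian envelope.

    Assumed convention: the envelope's temporal standard deviation is
    n_cycles / (2 * pi * freq); the wavelet is truncated at +/- n_sd of it.
    """
    sigma_t = n_cycles / (2.0 * np.pi * freq)
    half = int(np.ceil(n_sd * sigma_t * fs))
    t = np.arange(-half, half + 1) / fs
    return np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2.0 * sigma_t**2))

def wavelet_phase(signal, fs, freq, n_cycles=2.0):
    """Instantaneous phase (radians) of `signal` at `freq`."""
    wavelet = morlet_wavelet(freq, fs, n_cycles)
    analytic = np.convolve(signal, wavelet, mode="same")
    return np.angle(analytic)

def spectral_sd(freq, n_cycles=2.0):
    """Standard deviation (Hz) of the wavelet's Gaussian amplitude spectrum.

    Under the convention above this is freq / n_cycles, so fewer cycles
    give a broader passband and lower frequency resolution.
    """
    return freq / n_cycles
```

      Under these assumptions a two-cycle wavelet at 9.2 Hz has a spectral standard deviation of 4.6 Hz, which illustrates why such an estimate tolerates modest shifts in frequency between sites.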

      “This step makes the assumption that phase at a particular frequency band is meaningful at all times”

      This important consideration is entrenched in our choice of methods. By way of explanatory background, we point out that this step is not the final step. Aggregation methods can be used to distinguish between signal and noise. In the simple case, event-locked time-series of phase can be averaged. This would allow consistent (non-noise) phase relations to be preserved, while the inconsistent (including noise) phase relations would be washed out. This is part of the logic behind all such aggregation procedures, e.g., phase-locking, coherence. SVD has the advantage of capturing consistent relations in this sense, but without loss of information as occurs in averaging (up to the choice of number of singular vectors in the final model). Specifically, maps of the spatial covariances in phase are captured in the order of the variance explained. Noise (in the sense conveyed by the reviewer) in the phase measurements will not contribute to highest rank singular vectors. SVD is commonly used to remove noise, and that is one of its purposes here. This point can be seen by considering the very smooth singular vectors derived from MEG (Figure 3F) in this new version of the manuscript. These maps of phase gradients pull out only the non-noisy relations, even as their weighted sums reproduce any individual sample to any desired accuracy.
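      The argument that SVD retains consistent phase relations while relegating inconsistent ones to low-variance components can be illustrated with a toy sketch. It applies a plain SVD to complex unit phasors rather than the higher-order SVD used in the manuscript, and the function name and the synthetic data suggested in the note below are assumptions made for illustration only.

```python
import numpy as np

def leading_phase_map(phases, rank=1):
    """Plain SVD of unit phasors built from a (samples x contacts) phase matrix.

    Returns the first `rank` spatial maps (rows of Vh), the fraction of
    variance each explains, and the rank-limited reconstruction.
    """
    phasors = np.exp(1j * phases)
    u, s, vh = np.linalg.svd(phasors, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    low_rank = (u[:, :rank] * s[:rank]) @ vh[:rank]
    return vh[:rank], explained[:rank], low_rank
```

      If the input is a fixed spatial phase gradient with a random global offset per sample plus independent phase noise per contact, the first map recovers the gradient up to a constant offset and explains most of the variance, while keeping all components reproduces every sample exactly.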

      To summarize, the next step (of incorporating the phase measure into the SVD) neatly bypasses the issue of non-meaningful phase quantification. This is one of the reasons why we do not undertake the spatial frequency estimates on the raw matrices of estimated phase.

      We now include a new sub-paragraph on this topic in the methods, lines 831-838.

      In addition, we have reworded the first description of the methods with a new paragraph at the end of the introduction, which better balances the description of the steps involved. The two sentences (lines 162-166) highlight the issue of concern to the reviewer.

      “there is no guarantee that the spatial organization of phase at many individual temporal frequencies converges to the correct description of the full sEEG signal.”

      The correct description of the full sEEG signal is beyond the scope of the present research. Our main goal, as stated, is to show that the hypothesis that ‘extra-cranial measurements of TWs are the result of projection from localized activity’ is not supported by the evidence of spatial patterns of activity in the cortex. Since this activity can be accessed as a single frequency band (especially if localized sources create the large-scale patterns), analysis of SF on a TF-by-TF basis is sufficient.

      “This has an impact on the results presented here, specifically where the authors assert that "phase measured at a single contact in the grey matter is more strongly a function of global phase organization than local".”

      We agree with the reviewer, even though we expect that the strongest influences on local phase are due to other cortical signals in the same band. The implicit assumption of the focus on bands of the same temporal frequency is now made explicit in the abstract (lines 31-34).

      A sentence addressing this issue has been added to the first paragraph of the discussion (lines 579-582).

      Inclusion of cross-frequency interactions would likely require a highly regular measurement array over the scales of interest here, i.e., the noise levels inherent in the spatial organization of sEEG contacts would not support such analyses.

      “Finally, the phase examples given in Supplementary Figure 5 are not strongly convincing to support this point.”

      We have removed the phase examples that were previously in Supplementary Figure 5 (and Figure 5 in the previous version of the main text), since further surrogate testing and modelling (Supplementary Figure 11) shows the LSVs from irregular arrays will inevitably capture mixtures of low and high SF signals. The final section of the Methods explains this effect in some detail. Instead, the new version of the manuscript relies on new surrogate testing to validate our methods.

      Another weakness is in the discussion on spatial scale. In the analyses, the authors separate contributions at (approximately) > 15 cm as macroscopic and < 15 cm as mesoscopic. The problem with the "macroscopic" here is that 15 cm is essentially on the scale of the whole brain, without accounting for the fact that organization in sub-systems may occur. For example, if a specific set of cortical regions, spanning over a 10 cm range, were to exhibit a consistent organization of phase at a particular temporal frequency (required by the analysis technique, as noted above), it is not clear why that would not be considered a "macroscopic" organization of phase, since it comprises multiple areas of the brain acting in coordination. Further, while this point could be considered as mostly semantic in nature, there is also an important technical consideration here: would spatial phase organizations occurring in varying subsets of electrodes and with somewhat variable temporal frequency reliably be detected? If this is not the case, then could it be possible that the lowest spatial frequencies are detected more often simply because it would be difficult to detect variable organizations in subsets of electrodes?

      The motivation for our study was to show that large-scale TWs measured outside the cortex cannot be the result of more localized activity being ‘projected up’. In this case, the temporal frequency of the artefactual waves would be the same as the localized sources, so the criticism does not apply.

      “while this point could be considered as mostly semantic in nature”

      We have changed the terminology in the paper to better coincide with standard usage. Macroscopic now refers to >1cm, while we refer to >8cm as large-scale.

      “15 cm is essentially on the scale of the whole brain, without accounting for the fact that organization in sub-systems may occur.”

      We can assume that subtle frequency variation (e.g., within an alpha phase binding) is greatest at the largest scales of cortex, or at least not less varying than measurements within regions. This means that not considering frequency-drift effects will not inflate low spatial frequency power over high spatial frequency power. Even so, the power spectrum we estimated is approximately 1/SF, so that unmeasured cross-frequency effects in binding (causal influences on local phase) would have to overcome the strength of this relation for this criticism to apply, which seems unlikely.

      “would spatial phase organizations occurring in varying subsets of electrodes and with somewhat variable temporal frequency reliably be detected?”

      See our previous comments about the low temporal frequency resolution of two-cycle Morlet wavelets. The answer is yes, up to the range approximated by the half-power bandwidth, which is large in the case of this method (see lines 760-764).

      Another weakness is disregarding the potential spike waveform artifact in the sEEG signal in the context of these analyses. Specifically, Zanos et al. (J Neurophysiol, 2011) showed that spike waveform artifacts can contaminate electrode recordings down to approximately 60 Hz. This point is important to consider in the context of the manuscript's results on spatial organization at temporal frequencies up to 100 Hz. Because the spike waveform artifact might affect signal phase at frequencies above 60 Hz, caution may be important in interpreting this point as evidence that there is significant phase organization across the cortex at these temporal frequencies.

      We have now added a sentence on this issue to the discussion (lines 600-602).

      However, our reading of the Zanos et al. paper is that the low temporal frequency (60-100Hz) contribution of spikes and spike patterns is negligible compared to genuine post-synaptic membrane fluctuations (see their Figure 3). These considerations come more strongly into play when correlations between LFP and spikes are calculated or spike triggered averaging is undertaken, since then a signal is being partly correlated with itself, or, partly averaged over the supposedly distinct signal with which it was detected.

      A last point is that, even though the present results provide some insight into the organization of phase across the human brain, the analyses do not directly link this to spiking activity. The predictive power that these spatial organizations of phase could provide for spiking activity - even if the analyses were not affected by the distortion due to the narrow-frequency assumption - remains unknown. This is important because relating back to spiking activity is the key factor in assessing whether these specific analyses of phase can provide insight into neural circuit dynamics. This type of analysis may be possible to do with the sEEG recordings, as well, by analyzing high-gamma power (Ray and Maunsell, PLoS Biology, 2011), which can provide an index of multi-unit spiking activity around the electrodes.

      “even if the analyses were not affected by the distortion due to the narrow-frequency assumption”

      See our earlier comment about narrow TFs; this is not the case in the present work.

      The spiking activity analysis would be an interesting avenue for future research. It appears the 1000Hz sampling frequency in the present data is not sufficient for the method described in Ray & Maunsell (2011). On a related topic, we have shown that large-scale traveling waves in the MEG and 8cm waves in ECoG can both be used to predict future localized phase at a single sensor/contact, two cycles into the future (Alexander et al., 2019). This approach could be used to predict spiking activity, by combining it with the reviewer’s suggestion. However, the current manuscript is motivated by the argument that measured large-scale extra-cranial TWs are merely projections of localized cortical activity. Since spikes do not arise in this argument, we feel it is outside the scope of the present research. We have added this suggestion to the discussion as a potential line of future research (lines 686-688).

      Reviewer #3 (Public review):

      Summary:

      The authors propose a method for estimation of the spatial spectra of cortical activity from irregularly sampled data and apply it to publicly available intracranial EEG data from human patients during a delayed free recall task. The authors' main findings are that the spatial spectra of cortical activity peak at low spatial frequencies and decrease with increasing spatial frequency. This is observed over a broad range of temporal frequencies (2-100 Hz).

      Strengths:

      A strength of the study is the type of data that is used. As pointed out by the authors, spatial spectra of cortical activity are difficult to estimate from non-invasive measurements (EEG and MEG) due to signal mixing and from commonly used intracranial measurements (i.e. electrocorticography or Utah arrays) due to their limited spatial extent. In contrast, iEEG measurements are easier to interpret than EEG/MEG measurements and typically have larger spatial coverage than Utah arrays. However, iEEG is irregularly sampled within the three-dimensional brain volume and this poses a methodological problem that the proposed method aims to address.

      Weaknesses:

      The used method for estimating spatial spectra from irregularly sampled data is weak in several respects.

      First, the proposed method is ad hoc, whereas there exist well-developed (Fourier-based) methods for this. The authors don't clarify why no standard methods are used, nor do they carry out a comparative evaluation.

      We disagree that the method is ad hoc, though the specific combination of SVD and multiscale differencing is novel in its application to sEEG. The SVD method has been used to isolate both ~30cm TWs in MEG and EEG (Alexander et al., 2013; 2016), as well as 8cm waves in ECoG (Alexander et al., 2013; 2019). Our opening examples in the results now reiterate these previous related findings by way of an example analysis of MEG data (Figure 3). This will better inform the reader on the extent of continuity of the method from previous research.

      Standard FFT has been used after interpolating between EEG electrodes to produce a uniform array (Alamia et al., 2023). There exist well-developed Fourier methods for nonuniform grids, such as simple interpolation, the butterfly algorithm, wavefield extrapolation and multi-scale vector field techniques. However, the problems for which these methods are designed require non-sparse sampling or less irregular arrays. The sEEG contacts (reduced in number to grey matter contacts) are well outside the spatial irregularity range of any Fourier-related methods that we are aware of, particularly at the broad range of spatial scales of interest here (2cm up to 24cm). This would make direct comparison of these specialized Fourier methods to our novel methods, in the sEEG, something of a straw-man comparison.

      We now include a summary paragraph in the introduction, which is a brief review of Fourier methods designed to deal with non-uniform sampling (lines 159-162).

      Second, the proposed method lacks a theoretical foundation and hinges on a qualitative resemblance between Fourier analysis and singular value decomposition.

      We have improved our description of the theoretical relation between Fourier analysis and SVD (additional material at lines 839-861 and 910-922). In fact, there are very strong links between the two methods, and now it should be clearer that our method does not rely on a mere qualitative resemblance.
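
      One textbook instance of this link can be checked numerically. The sketch below is an illustration only and is not the analysis code of the manuscript: it assumes a ring of regularly spaced points and a translation-invariant (circulant) matrix built from a hypothetical kernel. In that idealized case the singular values equal the magnitudes of the kernel's discrete Fourier coefficients.

```python
import numpy as np

def circulant(kernel):
    """Return the circulant matrix C with C[i, j] = kernel[(i - j) mod n]."""
    n = len(kernel)
    i, j = np.indices((n, n))
    return kernel[(i - j) % n]

# Hypothetical smooth kernel on a ring of 16 regularly spaced points.
n = 16
x = np.arange(n)
ring_distance = np.minimum(x, n - x)
kernel = np.exp(-0.5 * (ring_distance / 2.0) ** 2)

C = circulant(kernel)
singular_values = np.linalg.svd(C, compute_uv=False)  # sorted descending
fourier_magnitudes = np.sort(np.abs(np.fft.fft(kernel)))[::-1]

# True for any circulant matrix: singular values are |DFT(kernel)|.
print(np.allclose(singular_values, fourier_magnitudes))
```

      The exact equality relies on regular sampling; on irregular arrays the correspondence can only be approximate, which is the regime the response addresses.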

      Third, the proposed method is not thoroughly tested using simulated data. Hence it remains unclear how accurate the estimated power spectra actually are.

      We now include a new surrogate testing procedure, which takes as inputs the empirical data and a model signal (of known spatial frequency) in various proportions. Thus, we test both the impact of a small amount of surrogate signal on the empirical signal, and the impact of ‘noise’ (in the form of a small amount of empirical signal) added to the well-defined surrogate signal.
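
      The idea of mixing empirical and model signals in varying proportions can be sketched as follows. This is a hypothetical illustration, not the procedure used in the manuscript: the contact positions, the random stand-in for empirical phase, the function names, and the choice to blend unit phasors linearly are all assumptions.

```python
import numpy as np

def plane_wave_phasors(positions_m, cycles_per_m):
    """Unit phasors of a 1-D plane wave with a known spatial frequency."""
    return np.exp(2j * np.pi * cycles_per_m * positions_m)

def mix_phase(empirical_phasors, surrogate_phasors, proportion):
    """Blend two phasor fields, then keep only the phase (unit magnitude).

    proportion = 0 returns the empirical phase, 1 the surrogate phase.
    """
    blend = (1.0 - proportion) * empirical_phasors + proportion * surrogate_phasors
    return np.exp(1j * np.angle(blend))

rng = np.random.default_rng(0)
positions_m = rng.uniform(0.0, 0.24, size=50)                  # irregular contacts (assumed)
empirical = np.exp(1j * rng.uniform(-np.pi, np.pi, size=50))   # stand-in for measured phase
surrogate = plane_wave_phasors(positions_m, cycles_per_m=8.0)  # known spatial frequency

mostly_empirical = mix_phase(empirical, surrogate, 0.1)  # small amount of surrogate
mostly_surrogate = mix_phase(empirical, surrogate, 0.9)  # small amount of 'noise'
```

      Feeding both mixtures to a spectral estimator would then show whether a known spatial frequency is recovered at each proportion.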

      In addition, there are a number of technical issues and limitations that need to be addressed or clarified (see recommendations to the authors).

      My assessment is that the conclusions are not completely supported by the analyses. What would convince me, is if the method is tested on simulated cortical activity in a more realistic set-up. I do believe, however, that if the authors can convincingly show that the estimated spatial spectra are accurate, the study will have an impact on the field. Regarding the methodology, I don't think that it will become a standard method in the field due to its ad hoc nature and well-developed alternatives.

      Simulations of cortical activity do not seem the most direct way to achieve this goal. The first author has published in this area (Liley et al., 1999; Wright et al., 2001), and such simulations, for both bulk and neuronally based simulations, readily display traveling wave activity at low spatial frequencies (indeed, this was the origin of the present scientific journey). The manuscript outlines these results in the introduction, as well as theoretical treatments proposing the same. Several other recent studies have highlighted the appearance of large-scale travelling waves using connectome-based models (https://www.biorxiv.org/content/10.1101/2025.07.05.663278v1; https://www.nature.com/articles/s41467-024-47860-x), which we do not include in the manuscript for reasons of brevity. In short, the emergence of TW phenomena in models is partly a function of the assumptions put into them (i.e., spatial damping, boundary conditions, parameterization of connection fields) and would therefore be inconclusive in our view.

      Instead, we rely on the advantages provided by the way our central research question has been posed: that the spatial frequency distribution of grey matter signal can determine whether extra-cranial TWs are artefactual. The newly introduced surrogate methods reflect this advantage by directly adding ground truth spatial frequency components to individual sample measurements. This is a less expensive option than making cortical simulations to achieve the same goal.

      For the same reasons, we include testing of the methods using real cortical signals with MEG arrays (for which we could test the effects of increasing sparseness of contacts, test the effects of average referencing, and also construct surrogate time-series with alternative spectra).

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Major points

      Methods, Page 18: "... using notch filters to remove the 50Hz line signal and its harmonics ...": The sEEG data appear to have been recorded in North America, where the line frequency is 60 Hz. Is this perhaps a typo, or was a 50 Hz notch filter in fact applied here (which would be a mistake)?

      This has now been fixed in the text to read 60Hz. This is the notch filter that was applied.

      Minor points

      (1) While the authors do state that they are analyzing the "spatial frequency spectrum of phase dynamics" in the abstract, this could be more clearly emphasized. Specifically, the difference between signal power at different spatial frequencies (as analyzed by a standard Fourier analysis) and the organization of phase in space (as done here) could be more clearly distinguished.

      We now address this point explicitly on lines 167-172. We now include at the end of the results additional analyses where the TF power is included. This means that the effects of including signal power at different temporal frequencies can be directly compared to our main analysis of the SF spectrum of the phase dynamics.

      (2) Figure 1A-C: It was not immediately clear what the lengths provided in these panels (e.g."> 40 cm cortex", "< 10 cm", "< 30 cm") were meant to indicate. This could be made clearer.

      Now fixed in the caption.

      (3) Figure 2A: If this is surrogate data to explain the analysis technique, it would be helpful to note explicitly at this point.

      This Figure has been completely reworked, and now the status of the examples (from illustrative toy models to actual MEG data) should be clearer.

      (4) Figure 4A: Why change from "% explained variance" for the example data in Figure 2C to arbitrary units at this point?

      This has now been explicitly stated in the methods (lines 1033-1036).

      (5) Page 15: "This means either the results were biased by a low pass filter, or had a maximum measurable...": If the authors mean that the low-pass filter is due to spatial blurring of neural activity in the EEG signal, it would be helpful to state that more directly at this point.

      Now stated directly, lines 567-568.

      (6) Page 23: "...where |X| is the complex magnitude of X...": The modulus operation is defined on a complex number, yet here is applied to a vector of complex numbers. If the operation is elementwise, it should be defined explicitly.

      ‘Elementwise’ is now stated explicitly (line 1020).

      Reviewer #3 (Recommendations for the authors):

      In the submitted manuscript, the authors propose a method to estimate spatial (phase) spectra from irregularly sampled oscillatory cortical activity. They apply the method to intracranial (iEEG) data and argue that cortical activity is organized into global waves up to the size of the entire cortex. If true, this finding is certainly of interest, and I can imagine that it has profound implications for how we think about the functional organization of cortical activity.

      We have added a section to the discussion outlining the most radical of these implications: what does it mean to do source localization when non-local signals dominate? Lines 670-681.

      The manuscript is well-written, with comprehensive introduction and discussion sections, detailed descriptions of the results, and clear figures. However, the proposed method comprises several ad hoc elements and is not well-founded mathematically, its performance is not adequately assessed, and its limitations are not sufficiently discussed. As such, the study failed to convince (me) of the correctness of the main conclusions.

      We now have a direct surrogate testing of the method. We have also improved the mathematical explanation to show that the link between Fourier analysis and SVD is not ad hoc, but well understood in both literatures. We have explicitly addressed in the text all of the limitations raised by the reviewers.

      Major comments

      (1) The main methodological contribution of the study is summarized in the introduction section:

      "The irregular sampling of cortical spatial coordinates via stereotactic EEG was partly overcome by the resampling of the phase data into triplets corresponding to the vertices of approximately equilateral triangles within the cortical sheet."

      There exist well-established Fourier methods for handling irregularly sampled data so it is unclear why the authors did not resort to these and instead proposed a rather ad hoc method without theoretical justification (see next comment).

      We have re-reviewed the literature on non-uniform Fourier analysis. We now briefly review the Fourier methods for handling irregularly sampled data (lines 155-162) and conclude that none of the existing methods can deal with the degree of irregularity, and especially sparsity, found for the grey-matter sEEG contacts.

      (2) In the Appendix, the authors write:

      "For appropriate signals, i.e., those with power that decreases monotonically with frequency, each of the first few singular vectors, v_k, is an approximate complex sinusoid with wavenumber equal to k."

      I don't think this is true in general and if it is, there must be a formal argument that proves it. Furthermore, is it also true for irregularly sampled data? And in more than one spatial dimension? Moreover, it is also unclear exactly how the spatial Fourier spectrum is estimated from the SVD.

      In response to these reviewer queries, we now spend considerably more time in the conceptual set-up of the manuscript, giving examples of where SVD can be used to estimate the Fourier spectrum. We have now unpacked the word ‘appropriate’ and we are now more exact in our phrasing. This is laid out in lines 843-850 of the manuscript. In addition, the methods now describe the mathematical links between Fourier analysis and SVD (lines 851-861 and 910-922).

      The authors write:

      "The spatial frequency spectrum can therefore be estimated using SVD by summing over the singular values assigned to each set of singular vectors with unique (or by binning over a limited range of) spatial frequencies. This procedure is illustrated in Figure 1A-C."

      First, the singular vectors are ordered by decreasing values of the corresponding singular values. Hence, if the singular values are used to estimate spectral power, the estimated spectrum will necessarily decrease with increasing spatial frequency (as can be seen in Figure 2C). Then how can traveling waves be detected by looking for local maxima of the estimated power spectra?

      TWs are not detected by looking for local maxima in the spectra. Our work has focussed on the global wave maps derived from the SVD of phase (i.e., k=1-3), which also explain most of the variance in phase. This is now mentioned in the caption to Figure 3 (lines 291-294).

      Second, how are spatial frequencies assigned to the different singular vectors? The proposed method for estimating spatial power spectra from irregularly sampled data seems rather ad hoc and it is not at all clear if, and under what conditions, it works and how accurate it is.

      The new version of the manuscript uses a combination of the method previously presented (the multi-scale differencing) and the method previously outlined in the supplementary materials (doing complex-valued SVD on the spatial vectors of phase). We hope that along with the additional expository material in the methods the new version is clearer and seems less ad hoc to the reviewer. Certainly, there are deep and well-understood links between Fourier analysis and SVD, and we hope we have brought these into focus now.

      (3) The authors define spatial power spectra in three-dimensional Euclidean space, whereas the actual cortical activity occurs on a two-dimensional sheet (the union of two topological 2-spheres). As such, it is not at all clear how the estimated wavelengths in three-dimensional space relate to the actual wavelengths of the cortical activity.

      We define spatial power spectra on the folded cortical sheet, rather than Cartesian coordinates. We use geodesic distances in all cases where a distance measurement is required. We have included two new figures (Figure 5 and Supplementary Figure 1) showing the mapping of the triangles onto the cortical sheet, which should bring this point home.

      (4) The authors' analysis of the iEEG data is subject to a caveat that is not mentioned in the manuscript: As a reference for the local field potentials, the average white-matter signal was used and this can lead to artifactual power at low spatial frequencies. This is because fluctuations in the reference signal are visible as standing waves in the recording array. This might also explain the observation that

      "A surprising finding was that the shape of the spatial frequency spectrum did not vary much with temporal frequency."

      because fluctuations in the reference signal are expected to have power at all temporal frequencies (1/f spectrum). When superposed with local activity at the recording electrodes, this leads to spurious power at low spatial frequencies. Can the authors exclude this interpretation of the results?

      The new version of the manuscript deals explicitly with this potential confound (lines 454-467). First, the artefactual global synchrony due to the reference signal (the DC component in our spatial frequency spectra of phase) is at a distinct frequency from the lowest SF of interest here. The lowest spatial frequency is a function of the maximum spatial range of the recording array and not overlapping in our method with the DC component, despite the loss of SF resolution due to the noise of the spatial irregularity of the recording array. This can be seen from consideration of the SF tuning (Figure 4) for the MEG wave maps shown in Figure 3, and the spectra generated for sparse MEG arrays in Supplementary Figure 5. Additionally, this question led us to a series of surrogate tests which are now included in the manuscript. We used MEG to test for the effects of average reference, since in this modality the reference free case is available. The results show that even after imposing a strong and artefactual global synchrony, the method is highly robust to inflation of the DC component, which either way does not strongly influence the SF estimates in the range of interest (4c/m to 12c/m for the case of MEG).

      (5) Related to the previous comment: Contrary to the authors' claims, local field potentials are susceptible to volume conduction, particularly when average references are used (see e.g. https://www.cell.com/neuron/fulltext/S0896-6273(11)00883-X)

      Methods exist to mitigate these effects (e.g. taking first- or second-order spatial differences of the signals). I think this issue deserves to be discussed.

      We have reviewed this research and do not find it to be a problem. The authors cited by the reviewer were concerned with unacknowledged volume conduction up to 1 cm for LFP. The maximum spatial frequency we report here is 50c/m, equivalent to a wavelength of 2cm. While the inter-contact distance on the sEEG electrodes was 0.5cm, in practice the smallest equilateral triangles (i.e., between two electrodes) to be found in the grey matter were around 2cm in linear size. We make no statements about SF in the 1cm range. We do now cite this paper and mention this short-range volume conduction (lines 602-605). The method of taking derivatives has the same problems as source localization methods: both remove artefactual correlations (volume conduction) as well as real correlations (the low SF interactions of interest here). We mention this now at lines 667-669. In addition, our method to remove negative SF components from the LSVs ameliorates the effects of average referencing. There are now more details in the Methods about this step (lines 924-947), as well as a new supplementary figure illustrating its effects on signal with a known SF spectrum (MEG, Supplementary Figure 6).
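
      For reference, the conversion between spatial frequency in cycles per metre and wavelength in centimetres is a simple reciprocal. The helper name below is hypothetical and serves only to make the arithmetic explicit.

```python
def wavelength_cm(cycles_per_m):
    """Convert spatial frequency (cycles per metre) to wavelength (cm)."""
    if cycles_per_m <= 0:
        raise ValueError("spatial frequency must be positive")
    return 100.0 / cycles_per_m

print(wavelength_cm(50))  # 2.0, the shortest wavelength reported in the response
print(wavelength_cm(4))   # 25.0
```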

      (6) Could the authors add an analysis that excludes the possibility that the observed local maxima in the spectra are a necessary consequence of the analysis method, rather than reflecting true maxima in the spectra? A (possibly) similar effect can be observed in ordinary Fourier spectra that are estimated from zero-mean signals: Because the signals have zero mean, the power spectrum at frequency zero is close to zero and this leads to an artificial local maximum at low frequencies.

      We acknowledge the reviewer’s mathematical point. We do not agree that it could be an issue, though it is important to rule it out definitively. First, removing the DC component will only produce an artefactual low SF peak if the power at low SF is high. This may occur in the reviewer’s example only because temporal frequency has a ~1/f spectrum. If the true spectrum is flat, or increasing in power with f, no such artificial low SF will be produced (see Supplementary Figure 5G). Additionally,

      (1) The DC component is well separated from the low SF components in our method;

      (2) We now include several surrogate methods which show that our method finds the correct spectral distribution and is not just finding a maximum at low SFs due to the suggested effect (subtraction of the DC component). Analysis of separated wave maps in MEG (Figures 3 & 4) shows the expected peaks in SF, increasing in peak SF for each family of maps when wavenumber increases (roughly three k=1 maps, three k=2 etc.). A specific surrogate test for this query was also undertaken by creating a reverse SF spectrum in MEG phase data, in which the spectrum goes linearly with f over the SF range of interest, rather than the usual 1/f. Our method correctly finds the former spectrum (Supplementary Figure 5). Additionally, we tested for the effects of introducing the average reference and the effects of our method to remove the DC component of the phase SF spectrum (Supplementary Figure 6). We can definitively rule out the reviewer’s concern.
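
      The reviewer's mathematical point and the reply to it can both be reproduced with a toy example. The sketch below is an assumption-laden illustration on a regularly sampled one-dimensional signal, not the surrogate analysis in the manuscript: it builds a signal with a constant offset plus cosines whose amplitudes either fall or rise with frequency, removes the mean, and reports the bin with the most power.

```python
import cmath
import math

def power_spectrum(signal):
    """Power at the non-negative frequency bins 0..n//2 via a direct DFT."""
    n = len(signal)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(signal))) ** 2
        for k in range(n // 2 + 1)
    ]

def demeaned_peak_bin(amplitude_of, n=64, max_k=8, offset=5.0):
    """Build offset + sum_k a(k) cos(2*pi*k*t/n), subtract the mean,
    and return the frequency bin with the largest power."""
    signal = [
        offset + sum(amplitude_of(k) * math.cos(2 * math.pi * k * t / n)
                     for k in range(1, max_k + 1))
        for t in range(n)
    ]
    mean = sum(signal) / n
    power = power_spectrum([s - mean for s in signal])
    return max(range(len(power)), key=power.__getitem__)

falling = demeaned_peak_bin(lambda k: 1.0 / k)  # 1/f-like amplitudes
rising = demeaned_peak_bin(lambda k: float(k))  # amplitudes growing with f
print(falling, rising)
```

      With falling amplitudes the peak sits at the lowest non-zero bin once the mean is removed; with rising amplitudes it sits at the highest populated bin, so mean removal alone does not create a low-frequency maximum.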

      A related issue (perhaps) is the observation that the location of the maximum (i.e. the peak spatial frequency of cortical activity) depends on array size: If cortical activity indeed has a characteristic wavelength (in the sense of its spectrum having a local maximum) would one not expect it to be independent of array size?

      This is only true when making estimates for relatively clean sinusoidal signals, and not from broad-band signals. Fourier analysis and our related SVD methods are very much dependent on maximum array size used to measure cortical signals. This is why the first frequency band (after the DC component) in Fourier analysis is always at a frequency equivalent to 1/array_size, even if the signal is known to contain lower frequency components. We now include a further illustration of this in Figure 3, a more detailed exposition of this point in the methods, and in Supplementary Figure 11 we provide a more detailed example of the relation between Fourier analysis and SVD when grids with two distinct scales are used.
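
      The dependence of the first frequency band on array size can be seen directly for a regular grid. The sketch below uses NumPy's `np.fft.fftfreq`; the two array lengths and the 1 cm spacing are hypothetical values chosen for illustration and are not taken from the recordings.

```python
import numpy as np

def lowest_nonzero_frequency(n_samples, spacing):
    """First non-DC frequency of a regular DFT grid, in cycles per unit length."""
    return np.fft.fftfreq(n_samples, d=spacing)[1]

# Same spacing (0.01 m), two array lengths: 0.08 m and 0.24 m.
short_array = lowest_nonzero_frequency(8, 0.01)   # cycles per metre
long_array = lowest_nonzero_frequency(24, 0.01)   # cycles per metre

# The lowest resolvable band is the reciprocal of the array length.
print(np.isclose(short_array, 1 / 0.08), np.isclose(long_array, 1 / 0.24))
```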

      In short, it is not possible, mathematically, to measure wavelengths greater than the array size in broad-band data. This is now stated explicitly in the manuscript (lines 143-144). A common approach in Neuroscience research is to first do narrowband filtering, then use a method that can accurately estimate ‘instantaneous’ phase change, such as the Hilbert transform. This is not possible for highly irregular sEEG arrays.

      (7) The proposed method of estimating wavelength from irregularly sampled three-dimensional iEEG data involves several steps (phase-extraction, singular value decomposition, triangle definition, dimension reduction, etc.) and it is not at all clear that the concatenation of all these steps actually yields accurate estimates.

      Did the authors use more realistic simulations of cortical activity (i.e. on the convoluted cortical sheet) to verify that the method indeed yields accurate estimates of phase spectra?

      We now include detailed surrogate testing, in which varying combinations of sEEG phase data and veridical surrogate wavelengths are added together.

      See our reply to the public review comments. We assess that real neurophysiological data (here, sEEG plus surrogate and MEG manipulated in various ways) is a more accurate way to address these issues. In our experience, large scale TWs appear spontaneously in realistic cortical simulations, and we now cite the relevant papers in the manuscript (line 53).

      Minor comments

      (1) Perhaps move the first paragraph of the results section to the Introduction (it does not describe any results).

      So moved.

      (2) The authors write:

      "The stereotactic EEG contacts in the grey matter were re-referenced using the average of low-amplitude white matter contacts"

      Does this mean that the average is taken over a subset of white-matter contacts (namely those with low amplitude)? Or do the authors refer to all white-matter contacts as "low-amplitude"? And did contacts at different needles have different references? Or were the contacts from all needles pooled?

      A subset of white-matter contacts was used for re-referencing, namely the 50% with the lowest amplitude signals. This subset was used to construct a pooled, single, average reference. We have rephrased the sentences referring to this procedure to improve clarity (lines 202 and 743-745).
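
      The re-referencing step described here can be sketched in a few lines. This is a minimal illustration, not the code used for the manuscript: the function names are hypothetical, and measuring amplitude as the standard deviation over time is an assumption, since the response does not state which amplitude measure was used.

```python
import numpy as np

def white_matter_reference(white_matter, fraction=0.5):
    """Average the lowest-amplitude white-matter contacts into one reference.

    white_matter: array of shape (n_contacts, n_samples).
    Amplitude is measured as the standard deviation over time,
    which is an assumption of this sketch.
    """
    amplitude = white_matter.std(axis=1)
    n_keep = max(1, int(round(fraction * len(amplitude))))
    quietest = np.argsort(amplitude)[:n_keep]
    return white_matter[quietest].mean(axis=0)

def rereference(grey_matter, white_matter, fraction=0.5):
    """Subtract the pooled white-matter reference from every grey-matter contact."""
    return grey_matter - white_matter_reference(white_matter, fraction)

# Toy data: four white-matter contacts with different amplitudes.
base = np.sin(np.linspace(0.0, 10.0, 200))
white = np.outer([4.0, 1.0, 3.0, 2.0], base)
grey = np.ones((3, 200))
referenced = rereference(grey, white)
```

      In the toy data the two quietest contacts have scales 1 and 2, so the pooled reference is 1.5 times the base signal and is shared by all grey-matter contacts.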

    1. eLife Assessment

      This study offers important insight into the pathogenic basis of intragenic frameshift deletions in the carboxy-terminal domain of MECP2, which account for some Rett syndrome cases, yet similar variants also appear in unaffected individuals. Using base editing and mouse models, the authors present convincing evidence supporting the pathogenicity of select deletion variants, with potential implications for therapeutic development. However, comments regarding the analysis of publicly available genetic databases should be addressed to strengthen the conclusions and provide greater clarity to the field.

    2. Reviewer #1 (Public review):

      Summary:

      The authors scrutinized differences in C-terminal region variant profiles between Rett syndrome patients and healthy individuals and pinpointed that subtle genetic alterations can have benign or pathogenic outcomes, which has important implications for Rett syndrome diagnosis and suggests a therapeutic strategy. This work will be beneficial to clinicians and basic scientists who work on Rett syndrome, and carries the potential to be applied to other Mendelian rare diseases.

      Strengths:

      Well-designed genetic and molecular experiments translate genetic differences into functional and clinical changes. This is a unique study resolving subtle changes in sequences that give rise to dramatic phenotypic consequences.

      Weaknesses:

      There are many base-editing and protein-expression changes throughout the manuscript, and they cause confusion. It would be helpful to readers if authors could provide a simple summary diagram at the end of the paper.

    3. Reviewer #2 (Public review):

      Summary:

      This study by Guy and Bird and colleagues is a natural follow-up to their 2018 Human Molecular Genetics paper, further clarifying the molecular basis of C-terminal deletions (CTDs) in MECP2 and how they contribute to Rett syndrome. The authors combine human genetic data with well-designed experiments in embryonic stem cells, differentiated neurons, and knock-in mice to explain why some CTD mutations are disease-causing while others are harmless. They show that pathogenic mutations create a specific amino acid motif at the C-terminus, where +2 frameshifts produce a PPX ending that greatly reduces MeCP2 protein levels (likely due to translational stalling) whereas +1 frameshifts generating SPRTX endings are well tolerated.

      Strengths:

      This is a comprehensive and rigorous study that convincingly pinpoints the molecular mechanism behind CTD pathogenicity, with strong agreement between the cell-based and animal data. The authors also provide a proof of principle that modifying the PPX termination codon can restore MeCP2-CTD protein levels and rescue symptoms in mice. In addition, they demonstrate that adenine base editing can correct this defect in cultured cells and increase MeCP2-CTD protein levels. Overall, this is a well-executed study that provides important mechanistic and translational insight into a clinically important class of MECP2 mutations.

      Weaknesses:

      The adenine base editing to change the termination codon is shown to be feasible in generated cell lines, but has yet to be shown in vivo in animal models.

    4. Reviewer #3 (Public review):

      Summary:

      Guy et al. explored the variation in the pathogenicity of carboxy-terminal frameshift deletions in the X-linked MECP2 gene. Loss-of-function variants in MECP2 are associated with Rett syndrome, a severe neurodevelopmental disorder. Although hundreds of pathogenic MECP2 variants have been found in people with Rett syndrome, 8 recurrent point mutations are found in ~65% of disease cases, and frameshift insertion/deletion (indel) variants resulting in production of carboxy-terminal truncated (CTT) MeCP2 protein account for ~10% of cases. Many of these occur in a "deletion prone region" (DPR) between c.1110-1210, with common recurrent deletions c.1157_1197del (CTD1) and c.1164_1207del (CTD2). While two major protein functional domains have been defined in MeCP2, the methyl-binding domain (MBD) and the NCoR interacting domain (NID), the functional role of the carboxy-terminal domain (CTD, beyond the NID, predicted to have a disordered protein structure) has not been identified, and previous work by this group and others demonstrated that a Mecp2 "minigene" lacking the CTD retains MeCP2 function, suggesting that the CTD is dispensable. This raises an important question: If the CTD is dispensable, what is the pathogenic basis of the various CTT frameshift variants? Prior work from this group demonstrated that genetically engineered mice expressing the CTD1 variant had decreased expression of Mecp2 RNA and MeCP2 protein and decreased survival, but those expressing the CTD2 variant had normal Mecp2 RNA and protein and survival. However, they noted that differences between the mouse and human coding sequences resulted in different terminal sequences between the two common CTD, with CTD1 ending in -PPX in both mouse and human, but CTD2 ending in -PPC in human but -SPX in mouse, and in the previous paper they demonstrated that, in humanized mouse ES cells (edited to have the same -PPX termination), the CTD2 deletion resulted in decreased Mecp2 RNA and protein levels.

      This previous work provides the underlying hypothesis that they sought to explore: that the pathological basis of disease-causing CTD relates to the formation of truncated proteins that end with a specific amino acid sequence (-PPX), which leads to decreased mRNA and protein levels, whereas tolerated, non-pathogenic CTD do not lead to production of truncated proteins ending in this sequence and retain normal mRNA/protein expression.

      In this manuscript, they evaluate missense variants, in-frame deletions, and frameshift deletions within the DPR from the Genome Aggregation Database (gnomAD) and find that the "apparently" normal individuals within gnomAD had numerous tolerated missense variants and in-frame deletions within this region, as well as frameshift deletions (in hemizygous males). All of the gnomAD deletions within this region resulted in the terminal amino acid sequence -SPRTX (due to a +1 frameshift), whereas nearly all deletion variants in this region from people with Rett syndrome (from the ClinVar copy of the former RettBase database) had a terminal -PPX sequence, due to a +2 frameshift. They hypothesized that terminal proline codons causing ribosomal stalling and "nonsense-mediated decay-like" degradation of mRNA (with subsequent decreased protein expression) were the basis of the specific pathogenicity of the +2 frameshift variants, and that utilizing adenine base editors (ABE) to convert the termination codon to a tryptophan codon could correct this issue. They demonstrate this by engineering the change into mouse embryonic stem cell lines and mouse lines containing the CTD1 deletion and show that this change normalized Mecp2 mRNA and protein levels and mouse phenotypes. Finally, they performed an initial proof-of-concept in an inducible HEK cell line and showed the ability of targeted ABE to edit the correct adenine and cause production of the expected larger truncated MeCP2 protein from CTD1 constructs.
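
      As a rough illustration of the n+1 versus n+2 distinction used in this review, the sketch below computes the length of a simple deletion from its coordinates and takes the remainder modulo 3. The function names are hypothetical, and the mapping of a remainder of 1 or 2 to the "n+1" and "n+2" labels is an assumption based on the two recurrent deletions named above. It does not predict the terminal amino acid sequence, which depends on the underlying coding sequence.

```python
import re


def deletion_length(variant: str) -> int:
    """Return the number of deleted bases for a simple 'c.START_ENDdel' string.

    Both '_' and '-' are accepted as range separators, because the review
    text writes the coordinates both ways.
    """
    match = re.fullmatch(r"c\.(\d+)[_-](\d+)del", variant)
    if match is None:
        raise ValueError(f"unsupported variant string: {variant}")
    start, end = int(match.group(1)), int(match.group(2))
    if end < start:
        raise ValueError(f"end precedes start in: {variant}")
    return end - start + 1  # coordinates are inclusive


def frame_class(variant: str) -> str:
    """Classify a deletion by its length modulo 3 (labels are an assumption)."""
    remainder = deletion_length(variant) % 3
    return {0: "in-frame", 1: "n+1", 2: "n+2"}[remainder]


if __name__ == "__main__":
    for name in ("c.1157_1197del", "c.1164_1207del"):
        print(name, deletion_length(name), frame_class(name))
```

      Under these assumptions, both recurrent deletions (41 and 44 bases) fall into the n+2 class, which matches the review's description of them as the presumed pathogenic category.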

      The findings of this manuscript provide a level of support for their hypothesis about the pathogenicity versus non-pathogenicity of some MECP2 CTT intragenic deletions and provide preliminary evidence for a novel therapeutic approach for Rett syndrome; however, limitations in their analysis do not fully support the broader conclusions presented.

      Strengths:

      (1) Utilization of publicly available databases containing aggregated genetic sequencing data from adult cohorts (gnomAD) and people with Rett syndrome (ClinVar copy of RettBase) to compare the terminal amino acid sequences resulting from deletions presumed to be pathogenic (n+2) versus presumed to be tolerated (n+1).

      (2) Evaluation of a unique human pedigree containing an n+1 deletion in this region that was reported as pathogenic, with demonstration of inheritance of this from the unaffected father and presence within other unaffected family members.

      (3) Development of a novel engineered mouse model of a previously assumed n+1 pathogenic variant to demonstrate lack of detrimental effect, supporting that this is likely a benign variant and not causative of Rett syndrome.

      (4) Creation and evaluation of novel cell lines and mouse models to test the hypothesis that the pathogenicity of the n+2 deletion variants could be altered by a single base change in the frameshifted stop codon.

      (5) Initial proof-of-concept experiments demonstrating the potential of ABE to correct the pathogenicity of these n+2 deletion variants.

      Weaknesses:

      (1) While the use of the large aggregated gnomAD genetic data benefits from the overall size of the data, the presence of genetic variants within this collection does not inherently mean that they are "neutral" or benign. While gnomAD does not include children, it does include aggregated data from a variety of projects targeting neuropsychiatric (and other) conditions, so there is information in gnomAD from people with various medical/neuropsychiatric conditions. The authors do make some acknowledgement of this and argue that the presence of intragenic deletion variants in their region of interest in hemizygous males indicates that it is highly likely that these are tolerated, non-pathogenic variants. Although it is likely true that gnomAD MECP2 variants found in hemizygous males are unlikely to cause Rett syndrome in heterozygous females, this does not necessarily mean that these variants have no potential to cause other, milder, neuropsychiatric disorders. As a clear example, within gnomAD, there is a hemizygous male with the rs28934908 C>T variant that results in p.A140V (p.A152V in e1 transcript numbering convention). This pathogenic variant has been found in a number of pedigrees with an X-linked intellectual disability pattern, in which males have a clear neurodevelopmental disorder and heterozygous females have mild intellectual disability (see PMIDs 12325019, 24328834 as representative examples of a large number of publications describing this). Thus, while their claim that hemizygous deletion variants in gnomAD are unlikely to cause Rett syndrome is reasonable, they cannot make the definitive statement that these variants are not pathogenic and completely benign, especially when they are found in only a very small number of individuals in gnomAD.

      (2) The authors focus exclusively on deletions within the "DPR", which they define as c.1110-1210, and say that these deletions account for 10% of Rett syndrome cases. However, the published studies that are the basis for this 10% estimate include all genetic variants (frameshift deletions, insertions, complex insertion/deletions, nonsense variants) resulting in truncations beyond the NID. For example, Bebbington 2010 (PMID: 19914908) includes frameshift indels as early as c.905 and beyond c.1210. Further specific examples from RettBase are described below, but the important point is that their evaluation of only frameshift variants within c.1110-1210 is not truly representative of the totality of genetic variants that collectively are considered CTT and account for 10% of Rett cases.

      (3) The authors say that they evaluated the putative pathogenic variants contained within RettBase (which is no longer available, but the data were transferred to ClinVar) for all cases with Classic Rett syndrome and de novo deletion variants within their defined DPR domain. Looking at the data from the ClinVar copy of RettBase, there are a number (n=143) of C-terminal truncating variants (either frameshift or nonsense) present beyond the NID, but the authors only discuss 14 deletion frameshift variants in this manuscript. A number of these variants have molecular features that do not fall into the pathogenic classification proposed by the authors and are not addressed in the manuscript. They do not support the generalization of the conclusions presented in this manuscript, especially the conclusion that the pathogenicity of all C-terminal truncating variants can be determined according to their proposed n+2 rule, or that all of the 10% of people with Rett syndrome and C-terminal truncating variants could be treated by using a base editor to correct the -PPX termination codon.

      (4) The HEK-based system utilized is convenient for doing the initial experiments testing ABE; however, it represents an artificial system expressing cDNA without splicing. Canonical NMD is dependent on splicing, and while non-canonical "NMD-like" processes are less well understood, a concern is whether the artificial system used can adequately predict efficacy in a native setting that includes introns and splicing.

    5. Author response:

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors scrutinized differences in C-terminal region variant profiles between Rett syndrome patients and healthy individuals and pinpointed that subtle genetic alterations can produce either a benign or a pathogenic outcome. This finding has important implications for Rett syndrome diagnosis and suggests a therapeutic strategy. This work will be beneficial to clinicians and basic scientists who work on Rett syndrome, and carries the potential to be applied to other rare Mendelian diseases.

      Strengths:

      Well-designed genetic and molecular experiments translate genetic differences into functional and clinical changes. This is a unique study resolving subtle changes in sequences that give rise to dramatic phenotypic consequences.

      Weaknesses:

      There are many base-editing and protein-expression changes throughout the manuscript, and they cause confusion. It would be helpful to readers if authors could provide a simple summary diagram at the end of the paper.

      We thank Reviewer #1 for their encouraging comments. As suggested, we will include a summary figure of the genetic changes we have made, and the resulting expression and phenotypic consequences.

      Reviewer #2 (Public review):

      Summary:

      This study by Guy and Bird and colleagues is a natural follow-up to their 2018 Human Molecular Genetics paper, further clarifying the molecular basis of C-terminal deletions (CTDs) in MECP2 and how they contribute to Rett syndrome. The authors combine human genetic data with well-designed experiments in embryonic stem cells, differentiated neurons, and knock-in mice to explain why some CTD mutations are disease-causing while others are harmless. They show that pathogenic mutations create a specific amino acid motif at the C-terminus, where +2 frameshifts produce a PPX ending that greatly reduces MeCP2 protein levels (likely due to translational stalling) whereas +1 frameshifts generating SPRTX endings are well tolerated.

      Strengths:

      This is a comprehensive and rigorous study that convincingly pinpoints the molecular mechanism behind CTD pathogenicity, with strong agreement between the cell-based and animal data. The authors also provide a proof of principle that modifying the PPX termination codon can restore MeCP2-CTD protein levels and rescue symptoms in mice. In addition, they demonstrate that adenine base editing can correct this defect in cultured cells and increase MeCP2-CTD protein levels. Overall, this is a well-executed study that provides important mechanistic and translational insight into a clinically important class of MECP2 mutations.

      Weaknesses:

      The adenine base editing to change the termination codon is shown to be feasible in generated cell lines, but has yet to be shown in vivo in animal models.

      We thank Reviewer #2 for their positive comments. We agree that an in vivo study demonstrating effective DNA base editing in our CTD-1 mouse model is the obvious next step, and this work is in progress. However, given the ever-increasing use of pre- and neonatal screening for genetic diseases, we felt it important to disseminate our findings as soon as possible. The family pedigree in Figure 3C is a clear demonstration of this need.

      Reviewer #3 (Public review):

      Summary:

      Guy et al. explored the variation in the pathogenicity of carboxy-terminal frameshift deletions in the X-linked MECP2 gene. Loss-of-function variants in MECP2 are associated with Rett syndrome, a severe neurodevelopmental disorder. Although hundreds of pathogenic MECP2 variants have been found in people with Rett syndrome, 8 recurrent point mutations are found in ~65% of disease cases, and frameshift insertion/deletion (indel) variants resulting in production of carboxy-terminal truncated (CTT) MeCP2 protein account for ~10% of cases. Many of these occur in a "deletion prone region" (DPR) between c.1110 and c.1210, with common recurrent deletions c.1157_1197del (CTD1) and c.1164_1207del (CTD2). While two major protein functional domains have been defined in MeCP2, the methyl-binding domain (MBD) and the NCoR interacting domain (NID), the functional role of the carboxy-terminal domain (CTD, beyond the NID, predicted to have a disordered protein structure) has not been identified, and previous work by this group and others demonstrated that a Mecp2 "minigene" lacking the CTD retains MeCP2 function, suggesting that the CTD is dispensable. This raises an important question: If the CTD is dispensable, what is the pathogenic basis of the various CTT frameshift variants? Prior work from this group demonstrated that genetically engineered mice expressing the CTD1 variant had decreased expression of Mecp2 RNA and MeCP2 protein and decreased survival, but those expressing the CTD2 variant had normal Mecp2 RNA and protein and survival. However, they noted that differences between the mouse and human coding sequences resulted in different terminal sequences for the two common CTDs, with CTD1 ending in -PPX in both mouse and human, but CTD2 ending in -PPC in human but -SPX in mouse, and in the previous paper they demonstrated that humanized mouse ES cells (edited to have the same -PPX termination) containing the CTD2 deletion had decreased Mecp2 RNA and protein levels.

      This previous work provides the underlying hypothesis that they sought to explore, which is that the pathological basis of disease-causing CTDs relates to the formation of truncated proteins that end with a specific amino acid sequence (-PPX), which leads to decreased mRNA and protein levels, whereas tolerated, non-pathogenic CTDs do not lead to production of truncated proteins ending in this sequence and retain normal mRNA/protein expression.

      In this manuscript, they evaluate missense variants, in-frame deletions, and frameshift deletions within the DPR from the Genome Aggregation Database (gnomAD) and find that the "apparently" normal individuals within gnomAD had numerous tolerated missense variants and in-frame deletions within this region, as well as frameshift deletions (in hemizygous males). All of the gnomAD deletions within this region resulted in the terminal amino acid sequence -SPRTX (due to a +1 frameshift), whereas nearly all deletion variants in this region from people with Rett syndrome (from the ClinVar copy of the former RettBase database) had a terminal -PPX sequence, due to a +2 frameshift. They hypothesized that terminal proline codons causing ribosomal stalling and "nonsense-mediated decay-like" degradation of mRNA (with subsequent decreased protein expression) were the basis of the specific pathogenicity of the +2 frameshift variants, and that utilizing adenine base editors (ABE) to convert the termination codon to a tryptophan codon could correct this issue. They demonstrate this by engineering the change into mouse embryonic stem cell lines and mouse lines containing the CTD1 deletion and show that this change normalized Mecp2 mRNA and protein levels and mouse phenotypes. Finally, they performed an initial proof-of-concept in an inducible HEK cell line and showed the ability of targeted ABE to edit the correct adenine and cause production of the expected larger truncated MeCP2 protein from CTD1 constructs.

      The findings of this manuscript provide a level of support for their hypothesis about the pathogenicity versus non-pathogenicity of some MECP2 CTT intragenic deletions and provide preliminary evidence for a novel therapeutic approach for Rett syndrome; however, limitations in their analysis do not fully support the broader conclusions presented.

      Strengths:

      (1) Utilization of publicly available databases containing aggregated genetic sequencing data from adult cohorts (gnomAD) and people with Rett syndrome (ClinVar copy of RettBase) to compare the terminal amino acid sequences resulting from deletions presumed to be pathogenic (n+2) versus presumed to be tolerated (n+1).

      (2) Evaluation of a unique human pedigree containing an n+1 deletion in this region that was reported as pathogenic, with demonstration of inheritance of this from the unaffected father and presence within other unaffected family members.

      (3) Development of a novel engineered mouse model of a previously assumed n+1 pathogenic variant to demonstrate lack of detrimental effect, supporting that this is likely a benign variant and not causative of Rett syndrome.

      (4) Creation and evaluation of novel cell lines and mouse models to test the hypothesis that the pathogenicity of the n+2 deletion variants could be altered by a single base change in the frameshifted stop codon.

      (5) Initial proof-of-concept experiments demonstrating the potential of ABE to correct the pathogenicity of these n+2 deletion variants.

      Weaknesses:

      (1) While the use of the large aggregated gnomAD genetic data benefits from the overall size of the data, the presence of genetic variants within this collection does not inherently mean that they are "neutral" or benign. While gnomAD does not include children, it does include aggregated data from a variety of projects targeting neuropsychiatric (and other) conditions, so there is information in gnomAD from people with various medical/neuropsychiatric conditions. The authors do make some acknowledgement of this and argue that the presence of intragenic deletion variants in their region of interest in hemizygous males indicates that it is highly likely that these are tolerated, non-pathogenic variants. Although it is likely true that gnomAD MECP2 variants found in hemizygous males are unlikely to cause Rett syndrome in heterozygous females, this does not necessarily mean that these variants have no potential to cause other, milder, neuropsychiatric disorders. As a clear example, within gnomAD, there is a hemizygous male with the rs28934908 C>T variant that results in p.A140V (p.A152V in e1 transcript numbering convention). This pathogenic variant has been found in a number of pedigrees with an X-linked intellectual disability pattern, in which males have a clear neurodevelopmental disorder and heterozygous females have mild intellectual disability (see PMIDs 12325019, 24328834 as representative examples of a large number of publications describing this). Thus, while their claim that hemizygous deletion variants in gnomAD are unlikely to cause Rett syndrome is reasonable, they cannot make the definitive statement that these variants are not pathogenic and completely benign, especially when they are found in only a very small number of individuals in gnomAD.

      (2) The authors focus exclusively on deletions within the "DPR", which they define as c.1110-1210, and say that these deletions account for 10% of Rett syndrome cases. However, the published studies that are the basis for this 10% estimate include all genetic variants (frameshift deletions, insertions, complex insertion/deletions, nonsense variants) resulting in truncations beyond the NID. For example, Bebbington 2010 (PMID: 19914908) includes frameshift indels as early as c.905 and beyond c.1210. Further specific examples from RettBase are described below, but the important point is that their evaluation of only frameshift variants within c.1110-1210 is not truly representative of the totality of genetic variants that collectively are considered CTT and account for 10% of Rett cases.

      (3) The authors say that they evaluated the putative pathogenic variants contained within RettBase (which is no longer available, but the data were transferred to ClinVar) for all cases with Classic Rett syndrome and de novo deletion variants within their defined DPR domain. Looking at the data from the ClinVar copy of RettBase, there are a number (n=143) of C-terminal truncating variants (either frameshift or nonsense) present beyond the NID, but the authors only discuss 14 deletion frameshift variants in this manuscript. A number of these variants have molecular features that do not fall into the pathogenic classification proposed by the authors and are not addressed in the manuscript. They do not support the generalization of the conclusions presented in this manuscript, especially the conclusion that the pathogenicity of all C-terminal truncating variants can be determined according to their proposed n+2 rule, or that all of the 10% of people with Rett syndrome and C-terminal truncating variants could be treated by using a base editor to correct the -PPX termination codon.

      (4) The HEK-based system utilized is convenient for doing the initial experiments testing ABE; however, it represents an artificial system expressing cDNA without splicing. Canonical NMD is dependent on splicing, and while non-canonical "NMD-like" processes are less well understood, a concern is whether the artificial system used can adequately predict efficacy in a native setting that includes introns and splicing.

      We thank Reviewer #3 for their constructive comments. A number of these relate to our analysis of databases of pathogenic (RettBASE) and non-pathogenic (gnomAD) variants. We disagree with their assertion that we are looking at only a small subset of RTT CTD mutations. We detail 14 different RTT CTDs in the manuscript, but these include the 3 most frequently occurring, which alone account for 121 RTT cases in RettBASE.

      We used the original RettBASE database for our analysis, which contained significantly more information than was transferred to Clinvar. We may not have made this sufficiently clear and will remedy this during revision of the manuscript.

      We stress that RettBASE contained many non-RTT causing mutations. For this reason, we employed stringent selection criteria to define a high-confidence set of RTT CTD alleles. Importantly, this set was chosen before any investigation of reading frame or C-terminal amino acid sequence. Our stringent set was selected based on three criteria: location within the C-terminal deletion prone region (CT-DPR), a diagnosis of Classical RTT and at least one case where that mutation had been shown to be absent from both parents (i.e. that it was a de novo mutation). This excluded a large number of CTD alleles which fitted the +2 frameshift/PPX ending category as well as some in other categories. There are good reasons to believe that the vast majority of genuinely pathogenic RTT CTD mutations do fall into this class.

      Concerning gnomAD CTDs, we chose to restrict our detailed analysis to those which are present in the hemizygous state, to exclude individuals in whom a pathogenic mutation is masked by skewed X-inactivation. Data from all zygosities are shown in Fig. 3, SF1.

      We will revise the manuscript to include tables of all extracted data relevant to this region, from both gnomAD and RettBASE, along with analysis of a less stringent set of RettBASE CTDs for reading frame and C-terminal amino acid sequence. We hope this will make clear the strength of the evidence for our conclusions.

      We agree with Reviewer #3 that inclusion of variants in gnomAD does not exclude the possibility that they may cause medical/psychiatric conditions other than RTT. This point is already mentioned in the Discussion, but we plan to emphasise it further. The pedigree included in the paper, as well as others that we have learned of, argue that loss of the C-terminus of MeCP2 has few if any phenotypic consequences, but more cases are needed to robustly assess this conclusion.

      We disagree that our HEK cell-based system is not suitable for testing efficacy of reagents for use on endogenous alleles in vivo. The editing process is not dependent on splicing, and we have shown in this manuscript that making this single base change has the same effect on an endogenous knock-in allele (CTD1 X>W) or a cDNA-based transgene (Flp-In T-REx CTD1 + base editing), namely, to increase the amount of truncated MeCP2 produced.

    1. eLife Assessment

      This study provides an important assessment of how body size influences the occurrence of macro-organisms in urban areas across the globe. Size in most plants, but only some animal families, was positively associated with urban tolerance. The data set is impressive, but the evidence for broad-scale conclusions is incomplete due to methodological issues that need to be resolved.

    2. Reviewer #1 (Public review):

      Summary:

      The authors integrate multiple large databases to test whether body size is positively associated with species' tolerance of urban areas. In general, many plant families showed a positive association between body size and urban tolerance, whereas a smaller, though still non-trivial, percentage of animal families showed the same pattern. Notably, the authors are careful in the interpretation of their findings and provide helpful context for the ways that this analysis can be generative in shaping new hypotheses and theory around how urbanization influences biodiversity at large. They are careful to discuss how body size is an important trait, but the absence of a relationship between body size and urban tolerance in many families suggests a variety of other traits undergird urban success.

      Strengths:

      The authors aggregated a large dataset, but they also applied robust filters to ensure they had an adequate and representative number of detections for a given species, family, geography, etc. The authors also applied their analysis at multiple taxonomic scales (family and order), which allowed for a better interpretation of the patterns in the data and at what taxonomic scale body size might be important.

      Weaknesses:

      My main concern is that it is not fully clear how the measure of body size might influence the result. The authors were unable to obtain consistent measures of body size (mean, median, maximum, or sex variation). This, of course, could be very consequential as means and medians can differ quite a bit, and they certainly will differ substantially from a maximum. And of course, sex differences can be marked in multiple directions or absent altogether. The authors do note that they selected the measure that was most common in a family, but it was not clear whether species in that family that did not have that measure were removed or not. This could potentially shape the variability in the dataset and obscure true patterns. This may require additional clarity from the authors and is also a real constraint in compiling large data from disparate sources.

    3. Reviewer #2 (Public review):

      I have completed a thorough review of this paper, which seeks to use the large datasets of species occurrences available through GBIF to estimate variation in how large numbers of plant and animal species are associated with urbanization throughout the world, describing what they call the "species urbanness distribution" or SUD. They explore how these SUDs differ between regions and different taxonomic levels. They then calculate a measure of urban tolerance and seek to explore whether organism size predicts variation in tolerance among species and across regions.

      The study is impressive in many respects. Over the course of several papers, Callaghan and coauthors have been leaders in using "big [biodiversity] data" to create metrics of how species' occurrence data are associated with urban environments, and in describing variation in urban tolerance among taxa and regions. This work has been creative, novel, and it has pushed the boundaries of understanding how urbanization affects a wide diversity of taxa. The current paper takes this to a new level by performing analyses on over 94000 observations from >30,000 species of plants and animals, across more than 370 plant and animal taxonomic families. All of these analyses were focused on answering two main questions:

      (1) What is the shape of species' urban tolerance distributions within regional communities?

      (2) Does body size consistently correlate with species' urban tolerance across taxonomic groups and biogeographic contexts?

      Overall, I think the questions are interesting and important, the size and scope of the data and analyses are impressive, and this paper has a potentially large contribution to make in pushing forward urban macroecology specifically and urban ecology and evolution more generally.

      Despite my enthusiasm for this paper and its potential impact, there are aspects that could be improved, and I believe the paper requires major revision.

      Some of these revisions ideally involve being clearer about the methodology or arguments being made. In other cases, I think their metrics of urban tolerance are flawed and need to be rethought and recalculated, and some of the conclusions are inaccurate. I hope the authors will address these comments carefully and thoroughly. I recognize that there is no obligation for authors to make revisions. However, revising the paper along the lines of the comments made below would increase the impact of the paper and its clarity to a broad readership.

      Major Comments:

      (1) Subrealms

      Where does the concept of "subrealms" come from? No citation is given, and it could be said that this sounds like an idea straight out of Middle Earth. How do subrealms relate to known bioclimatic designations like Köppen climate classifications, which would arguably be more appropriate? Or are subrealms more socio-ecologically oriented? From what I can tell, each subrealm lumps together climatically diverse areas. It might be better and more tractable to break things down by continent, as the rationale for subrealms is unclear, and it makes the analyses and results more confusing. The authors rationalized the use of subrealms to account for potential intraspecific differences in species' response to urbanization, but that is never a core part of the questions or interpretation in the paper, and averaging across subrealms also accounts for intraspecific variation. Another issue with using the subrealm approach is that the authors only included a species if it had 100 observations in a given subrealm, leading to a focus on only the most common species, which may be biased in their SUD distribution. How many more species would be included if they did their analysis at the continental or global scale, and would this change the shape of SUDs?

      (2) Methods - urban score

      The authors describe their "urban score" as being calculated as "the mean of the distribution of VIIRS values as a relative species specific measure of a response to urban land cover."

      I don't understand how this is a "relative species-specific measure". What is it relative to? Figures S4 and S5 show the mean distribution of VIIRS for various taxa, and this mean looks to be an absolute measure. Mean VIIRS for a given species would be fine and appropriate as an "urban score", but the authors then state in the next sentence: "this urban score represents the relative ranking of that species to other species in response to urban land cover".

      That doesn't follow from the description of how this is calculated. Something is missing here. Please clarify and add an explicit equation for how the urban score is calculated because the text is unclear and confusing.

      (3) Methods - urban tolerance

      How the authors are defining and calculating tolerance is unclear, confusing, and flawed in my opinion.

      Tolerance is a common concept in ecology, evolution, and physiology, typically defined as the ability of an organism to maintain some measure of performance (e.g., fitness, growth, physiological homeostasis) in the presence versus absence of some stressor. As one example, in the herbivory literature, tolerance is often measured as the absolute or relative difference in fitness of plants that are damaged versus undamaged (e.g., https://academic.oup.com/evolut/article/62/9/2429/6853425?login=true).

      On line 309, after describing the calculation of urban scores across subrealms, they write: "Therefore, a species could be represented across multiple subrealms with differing measures of urban tolerance (Fig. S4). Importantly, this continuous metric of urban tolerance is a relative measure of a species' preference, or affinity, to urban areas: it should be interpreted only within each subrealm".

      This is problematic on several fronts. First, the authors never define what they mean by the term "tolerance". Second, they refer to urban tolerance throughout the paper, but don't describe the calculation until lines 315-319, where they write (text in [ ] is from the reviewer):

      "Within each subrealm, we further accounted for the potential of different levels of urbanization by scaling each species' urban score by subtracting the mean VIIRS of all observations in the subrealm (this value is hereafter referred to as urban tolerance). This 'urban tolerance' (Fig. S5) value can be negative - when species under-occupy urban areas [relative to the average across all species] suggesting they actively avoid them - or positive - when species over-occupy urban areas [relative to the average across all species] suggesting they prefer them (i.e., ranging from urban avoiders to urban exploiters, respectively)."

      They are taking a relativized urban score and then subtracting the mean VIIRS of all observations across species in a subrealm. How exactly one interprets the magnitude isn't clear, and they admit this metric is "not interpretative across subrealms".

      This is not a true measure of tolerance, at least not in the conventional sense of how tolerance is typically defined. The problem is that a species distribution isn't being compared to some metric of urbanness, but instead it is relative to other species' urban scores, where species may, on average, be highly urban or highly nonurban in their distribution, and this may vary from subrealm to subrealm. A measure of urban tolerance should be independent of how other species are responding, and should be interpretable across subrealms, continents, and the globe.
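
      To make the concern concrete, the following is a minimal sketch of the subtraction described in the quoted passage. The radiance values and the function name are hypothetical and are not taken from the manuscript; the sketch only illustrates that the same species distribution can receive a positive or a negative "tolerance" depending on what else was observed in the subrealm.

```python
from statistics import mean

def relative_urban_tolerance(species_viirs, subrealm_viirs):
    """Species mean VIIRS minus the mean VIIRS of all observations in the subrealm."""
    return mean(species_viirs) - mean(subrealm_viirs)

# Hypothetical radiance values, for illustration only.
species_obs = [10.0, 12.0, 14.0]  # identical species records in both subrealms

# Each subrealm list holds all observations, including the focal species' own.
dark_subrealm = [1.0, 2.0, 3.0] + species_obs
bright_subrealm = [20.0, 25.0, 30.0] + species_obs

tol_dark = relative_urban_tolerance(species_obs, dark_subrealm)
tol_bright = relative_urban_tolerance(species_obs, bright_subrealm)
print(tol_dark, tol_bright)
```

      Under these made-up numbers the species is scored as an "exploiter" in one subrealm and an "avoider" in the other, although its own records are unchanged.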

      I propose the authors use one of two metrics of urban tolerance:

      (i) Absolute Urban Tolerance = Mean VIIRS of species_i - Mean VIIRS of city centers

      Here, the mean VIIRS of city centers could be taken from the center of multiple cities throughout a subrealm, across a continent, or across the world. Here, the units are in the original VIIRS units where 0 would correspond to species being centered on the most extreme urban habitats, and the most extreme negative values would correspond to species that occupy the most non-urban habitats (i.e., no artificial light at night). In essence, this measure of tolerance would quantify how far a species' distribution is shifted relative to the most highly urbanized habitat available.

      (ii) % Urban Tolerance = (Mean VIIRS of species_i - Mean VIIRS of city centers) / Mean VIIRS of city centers * 100%

      This metric provides a % change in species mean VIIRS distribution relative to the most urban habitats. This value could theoretically be negative or positive, but will typically be negative, with -100% being completely non-urban, and 0% being completely urban tolerant.

      Both of these metrics can be compared across the world, as it would provide either absolute (equation 1) or relative (equation 2) metrics of urban tolerance that are comparable and easily interpretable in any region.
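
      The two proposed metrics can be sketched as follows. The function names and the example radiance values are hypothetical, chosen only so that the end points described above (0 and -100%) are easy to verify.

```python
from statistics import mean

def absolute_urban_tolerance(species_viirs, city_center_viirs):
    """Equation (i): species mean VIIRS minus mean VIIRS of city centers."""
    return mean(species_viirs) - mean(city_center_viirs)

def percent_urban_tolerance(species_viirs, city_center_viirs):
    """Equation (ii): the same difference, as a percentage of the city-center mean."""
    center = mean(city_center_viirs)
    return (mean(species_viirs) - center) / center * 100.0

# Hypothetical radiance values, for illustration only.
city_centers = [40.0, 50.0, 60.0]   # mean of 50
urban_species = [50.0, 50.0]        # centered on the most urban habitat
dark_sky_species = [0.0, 0.0]       # no artificial light at night

abs_urban = absolute_urban_tolerance(urban_species, city_centers)
abs_dark = absolute_urban_tolerance(dark_sky_species, city_centers)
pct_urban = percent_urban_tolerance(urban_species, city_centers)
pct_dark = percent_urban_tolerance(dark_sky_species, city_centers)
print(abs_urban, abs_dark, pct_urban, pct_dark)
```

      Because the reference value is the city-center mean rather than the mean of co-occurring species, neither quantity changes when other species' records are added or removed.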

      In summary, the definition of tolerance should be clear, the metric should be a true measure of tolerance that is comparable across regions, and an equation should be given.

      (4) Figure 1: The figure does not stand alone. For example, what is the hypothesis for thermophily or the temperature-size rule? The authors should expand the legend slightly to make the hypotheses being illustrated clearer.

      (5) SUDs: I don't agree with the conclusion given on line 83 ("pattern was consistent across subrealms and several taxonomic levels") or in the legend of Figure 2 ("there were consistent patterns for kingdoms, classes, and orders, as shown by generally similar density histograms shapes for each of these").

      The shapes of the curves are quite different, especially for the two Kingdoms and the different classes. I agree they are relatively consistent for the different taxonomic Orders of insects.

    4. Reviewer #3 (Public review):

      Summary:

      This paper reports on an association between body size and the occurrence of species in cities, which is quantified using an 'urban score' that can be visualized as a 'Species Urbanness Distribution' for particular taxa. The authors use species records from the Global Biodiversity Information Facility (GBIF) and link the occurrence data to nighttime lighting quantified using satellite data (Visible Infrared Imaging Radiometer Suite-VIIRS). They link the urban score to body size data to find a 'heterogeneous relationship between body size and urban tolerance across the tree'. The results are then discussed with reference to potential mechanisms that could possibly produce the observed effects (cf. Figure 1).

      Strengths:

      The novelty of this study lies in the huge number of species analyzed and the comparison of results among animal taxa, rather than in a thorough analysis of what traits allow species to persist under urban conditions. Such analyses have been done by other studies using a much more thorough approach that employs presence-absence data as well as a suite of traits (e.g., Hahs et al. 2023, Neate-Clegg et al. 2023). The dataset that the authors produced would also be very valuable if these raw data were published, both the cleaned species records as well as the body sizes.

      The paper could strongly add to our understanding of what species occur in cities when the open questions are addressed.

      Weaknesses:

      I value the approach of the authors, but I think the paper needs to be revised.

      In my view, the authors could more carefully validate their approach. Currently, any weakness or biases in the approach are quickly explained away rather than carefully explored. This concerns particularly the use of presence-only data, but also the calculation of the urban score.

      The vast majority of data in GBIF is presence-only data. This produces a strong bias in the analysis presented in the paper. For some taxa, it is likely that occurrences within the city are overrepresented, and for other taxa, the opposite is true (cf. Sweet et al. 2022). I think the authors should try to address this problem.

      The authors should compare their results to studies focusing on particular taxa where extensive trait-based analyses have already been performed, i.e., plants and birds. In fact, I strongly suggest that the authors should compare their results to previous studies on the relationship between traits, including body size and occurrences along a gradient of urbanisation, to draw conclusions about the validity of the approach used in the current study, which has a number of weaknesses.

      They should be more careful in coming up with post-hoc explanations of why the pattern found in this study makes sense or suggests a particular mechanism. This reviewer considers that there is no way in which the current study can disentangle the different possible mechanisms without further analyses and data, so I would suggest pointing out carefully how the mechanisms could be studied.

      More details should be given about the methodology. The readers should be able to understand the methods without having to read a number of other papers.

      References:

      Hahs, A. K., B. Fournier, M. F. Aronson, C. H. Nilon, A. Herrera-Montes, A. B. Salisbury, C. G. Threlfall, C. C. Rega-Brodsky, C. A. Lepczyk, and F. A. La Sorte. 2023. Urbanisation generates multiple trait syndromes for terrestrial animal taxa worldwide. Nature Communications 14:4751.

      Neate-Clegg, M. H. C., B. A. Tonelli, C. Youngflesh, J. X. Wu, G. A. Montgomery, Ç. H. Şekercioğlu, and M. W. Tingley. 2023. Traits shaping urban tolerance in birds differ around the world. Current Biology 33:1677-1688.

      Sweet, F. S. T., B. Apfelbeck, M. Hanusch, C. Garland Monteagudo, and W. W. Weisser. 2022. Data from public and governmental databases show that a large proportion of the regional animal species pool occur in cities in Germany. Journal of Urban Ecology 8:juac002.

    1. eLife Assessment

      The goal of this useful study is to examine learning-related changes in neural representations of global and local spatial reference frames in a spatial navigation task. Although the study addresses an interesting question, the evidence for neural representations in the hippocampus and retrosplenial cortex remains incomplete because of confounds in the experimental design and partial data analysis. There are further concerns about the framing of the study in the context of the relevant literature as well as the discussion.

    2. Reviewer #1 (Public review):

      Summary:

      In this paper, Qiu et al. developed a novel spatial navigation task to investigate the formation of multi-scale representations in the human brain. Over multiple sessions and diverse tasks, participants learned the location of 32 objects distributed across 4 different rooms. The key task was a "judgement of relative direction" task delivered in the scanner, which was designed to assess whether object representations reflect local (within-room) or global (across-room) similarity structures. In between the two scanning sessions, participants received extensive further training. The goal of this manipulation was to test how spatial representations change with learning.

      Strengths:

      The authors designed a very comprehensive set of tasks in virtual reality to teach participants a novel spatial map. The spatial layout is well-designed to address the question of interest in principle. Participants were trained in a multi-day procedure, and representations were assessed twice, allowing the authors to investigate changes in the representation over multiple days.

      Weaknesses:

      Unfortunately, I see multiple problems with the experimental design that make it difficult to draw conclusions from the results.

      (1) In the JRD task (the key task in this paper), participants were instructed to imagine standing in front of the reference object and judge whether the second object was to their left or right. The authors assume that participants solve this task by retrieving the corresponding object locations from memory, rotating their imagined viewpoint and computing the target object's relative orientation. This is a challenging task, so it is not surprising that participants do not perform particularly well after the initial training (performance between 60-70% accuracy). Notably, the authors report that after extensive training, they reached more than 90% accuracy.

      However, I wonder whether participants indeed perform the task as intended by the authors, especially after the second training session. A much simpler behavioural strategy is memorising the mapping between a reference object and an associated button press, irrespective of the specific target object. This basic strategy should lead to quite high success rates, since the same direction is always correct for four of the eight objects (the two objects located at the door and the two opposite the door). For the four remaining objects, the correct button press is still the same for four of the six target objects that are not located opposite to the reference object. Simply memorising the button press associated with each reference object should therefore lead to a high overall task accuracy without the necessity to mentally simulate the spatial geometry of the object relations at all.
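
      As a back-of-the-envelope check of this argument, the expected accuracy of pure button memorisation can be computed from the counts given above. The sketch rests on two assumptions that are not stated in the manuscript: every reference object is probed equally often, and targets are drawn uniformly from the six objects that are not opposite the reference.

```python
from fractions import Fraction

# Counts taken from the description above for one room of eight objects.
n_reference_objects = 8
n_fixed_direction = 4                 # references whose correct answer never changes
n_variable = n_reference_objects - n_fixed_direction
modal_hit_rate = Fraction(4, 6)       # 4 of the 6 non-opposite targets share one answer

# Assumption: uniform sampling of reference objects and of non-opposite targets.
accuracy = (n_fixed_direction * Fraction(1) + n_variable * modal_hit_rate) / n_reference_objects
print(f"Expected accuracy of pure memorisation: {float(accuracy):.1%}")
```

      Under these assumptions the strategy reaches 5/6, or roughly 83%, without any spatial computation. The actual figure depends on how reference-target pairs were sampled in the experiment.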

      I also wonder whether the random effect coefficients might reflect interindividual differences in such a strategy shift - someone who learnt this relationship between objects and buttons might show larger increases in RTs compared to someone who did not.

      (2) On a related note, the neural effect that appears to reflect the emergence of a global representation might be more parsimoniously explained by the formation of pairwise associations between reference and target objects. Since both objects always came from the same room, an RDM reflecting how many times an object pair acted as a reference-target pair will correlate with the categorical RDM reflecting the rooms corresponding to each object. Since the categorical RDM is highly correlated with the global RDM, this means that what the authors measure here might not reflect the formation of a global spatial map, but simply the formation of pairwise associations between objects presented jointly.
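
      The following sketch illustrates why a pairing-frequency RDM would correlate with the categorical room RDM. The pairing counts are randomly generated and hypothetical; the only property taken from the design is that reference-target pairs always come from the same room. The mapping from counts to dissimilarity is likewise an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rooms, per_room = 4, 8
n_objects = n_rooms * per_room
room = np.repeat(np.arange(n_rooms), per_room)   # room label of each object

# Categorical room RDM: 0 for same-room pairs, 1 for across-room pairs.
room_rdm = (room[:, None] != room[None, :]).astype(float)

# Hypothetical, symmetric pairing counts; across-room pairs never co-occur.
counts = np.triu(rng.integers(0, 5, size=(n_objects, n_objects)), k=1)
counts = counts + counts.T
counts = counts * (room_rdm == 0)

# Assumed mapping: more joint presentations means lower dissimilarity.
pair_rdm = 1.0 / (1.0 + counts)

# Correlate the two RDMs over the upper triangle (unique pairs only).
iu = np.triu_indices(n_objects, k=1)
r = np.corrcoef(room_rdm[iu], pair_rdm[iu])[0, 1]
print(f"Correlation between room RDM and pairing RDM: {r:.2f}")
```

      Any neural pattern that tracks the pairing RDM would therefore also fit the categorical RDM, and by extension the global RDM, without a global spatial map being involved.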

      (3) In general, the authors attribute changes in neural effects to new learning. But of course, many things can change between sessions (expectancy, fatigue, change in strategy, but also physiological factors...). Baseline physiological effects are less likely to influence patterns of activity, so the RSA analyses should be less sensitive to this problem, but especially the basic differences in activation for the contrast of post-learning > pre-learning stages in the judgment of relative direction (JRD) task could in theory just reflect baseline differences in blood oxygenation, possibly due to differences in time of day, caffeine intake, sleep, etc. To really infer that any change in activity or representation is due to learning, an active control would have been great.

      (4) RSA typically compares voxel patterns associated with specific stimuli. However, the authors always presented two objects on the screen simultaneously. From what I understand, this is not considered in the analysis ("The β-maps for each reference object were averaged across trials to create an overall β-map for that object."). Furthermore, participants were asked to perform a complex mental operation on each trial ("imagine standing at A, looking at B, then perform the corresponding motor response"). Assuming that participants did this (although see points 1 and 2 above), this means that the resulting neural representation likely reflects a mixture of the two object representations, the mental transformation and the corresponding motor command, and possibly additionally the semantic and perceptual similarity between the two presented words. This means that the βs taken to reflect the reference object representation must be very noisy.

      This problem is aggravated by two additional points. Firstly, not all object pairs occurred equally often, because only a fraction of all potential pairs were sampled. If the selection of the object pairs is not carefully balanced, this could easily lead to sampling biases, which RSA is highly sensitive to.

      Secondly, the events in the scanner are not jittered. Instead, they are phase-locked to the TR (1.2 sec TR, 1.2 sec fixation, 4.8 sec stimulus presentation). This means that every object onset starts at the same phase of the image acquisition, making HRF sampling inefficient and hurting trial-wise estimation of betas used for the RSA. This likely significantly weakens the strength of the neural inferences that are possible using this dataset.
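
      The phase-locking can be checked with simple arithmetic on the timings quoted above. In the sketch below, the durations come from the text, whereas the jittered schedule, the number of trials, and the assumption that the first trial starts on a TR boundary are hypothetical. Times are in integer milliseconds to avoid floating-point remainders.

```python
import random

TR_MS = 1200      # repetition time
FIX_MS = 1200     # fixation duration
STIM_MS = 4800    # stimulus duration

def onset_phases(onsets_ms, tr_ms=TR_MS):
    """Distinct positions within the TR at which stimulus onsets occur."""
    return sorted({onset % tr_ms for onset in onsets_ms})

n_trials = 40  # hypothetical number of trials

# Fixed schedule as described: fixation, then stimulus, repeated.
locked_onsets = [FIX_MS + k * (FIX_MS + STIM_MS) for k in range(n_trials)]

# Hypothetical alternative: add 0-1100 ms of jitter (100 ms steps) to each fixation.
rng = random.Random(0)
jittered_onsets = []
t = 0
for _ in range(n_trials):
    t += FIX_MS + rng.randrange(0, TR_MS, 100)
    jittered_onsets.append(t)
    t += STIM_MS

locked_phases = onset_phases(locked_onsets)
jittered_phases = onset_phases(jittered_onsets)
print(locked_phases, jittered_phases)
```

      With the fixed schedule every onset falls at the same position within the TR, so the haemodynamic response is always sampled at the same post-stimulus delays; jitter spreads the onsets across the TR.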

      (5) It is not clear why the authors focus their report of the results in the main manuscript on the preselected ROIs instead of showing whole-brain results. This can be misleading, as it provides the false impression that the neural effects are highly specific to those regions.

      (6) I am missing behavioural support for the authors' claims.

      Overall, I am not convinced that the main conclusion that global spatial representations emerge during learning is supported by the data. Unfortunately, I think there are some fundamental problems in the experimental design that might make it difficult to address the concerns.

      However, if the authors can provide convincing evidence for their claims, I think the paper will have an impact on the field. The question of how multi-scale representations are represented in the human brain is a timely and important one.

    3. Reviewer #2 (Public review):

      Summary:

      Qiu and colleagues studied human participants who learned about the locations of 32 different objects located across 4 different rooms in a common spatial environment. Participants were extensively trained on the object locations, and fMRI scans were done during a relative direction judgement task in a pre- and post-session. Using RSA analysis, the authors report that the hippocampus increased global relative to local representations with learning; the RSC showed a similar pattern, but also increased effects of both global and local information with time.

      Strengths:

      (1) The manuscript asks a generally interesting question concerning the learning of global versus local spatial information.

      (2) The virtual environment task provides a rich and naturalistic spatial setting for participants, and the setup with 32 objects across 4 rooms is interesting.

      (3) The within-subject design and use of verbal cues for spatial retrieval are elegant.

      Weaknesses:

      (1) My main concern is that the global Euclidean distances and room identity are confounded. I fear this means that all neural effects in the RSA could be alternatively explained by associations to the visual features of the rooms that build up over time.

      (2) The direction judgement task is not very informative about cognitive changes, as only objects in a room are compared. The setup also discourages global learning, and leaves unclear whether participants focussed on learning the left/right relationships required by the task.

      (3) With N = 23, the power is low, and the effects are weak.

      (4) It appears no real multiple comparisons correction is done for the ROI based approach, and significance across ROIs is not tested directly.
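
      Regarding point (4), one standard option is a step-down correction across the set of ROIs. The sketch below implements Holm's procedure; the ROI labels and p-values are placeholders invented for illustration and are not results from the study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure. Returns {label: True if the null is rejected}."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    decisions = {}
    still_rejecting = True
    for rank, (label, p) in enumerate(ordered):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if still_rejecting and p <= alpha / (m - rank):
            decisions[label] = True
        else:
            # Once one test fails, all larger p-values are retained as well.
            still_rejecting = False
            decisions[label] = False
    return decisions

# Placeholder p-values for four hypothetical ROIs.
roi_p = {"roi_a": 0.010, "roi_b": 0.020, "roi_c": 0.030, "roi_d": 0.200}
decisions = holm_bonferroni(roi_p)
print(decisions)
```

      In this toy case three of the four ROIs pass an uncorrected threshold of 0.05, but only one survives the correction.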

    4. Reviewer #3 (Public review):

      Summary:

      The manuscript by Qiu et al. explores the issue of spatial learning in both local (rooms) and global (connected rooms) environments. The authors perform a pointing task, which involves either pressing the right or left button in the scanner to indicate where an object is located relative to another object. Participants are repeatedly exposed to rooms over sessions of learning, with one "pre" and one "post" learning session. The authors report that the hippocampus shifted from lower to higher RSA for the global but not the local environment after learning. RSC and OFC showed higher RSA for global object pointing. Other brain regions also showed effects, including ACC, which seemed to show a similar pattern as the hippocampus, as well as other regions shown in Figure S5. The authors attempt to tie their results in with local vs. global spatial representations.

      Strengths:

      Extensive testing of subjects before and after learning a spatial environment, with data suggesting that there may be fMRI codes sensitive to both global and local codes. Behavioral data suggest that subjects are performing well at the task and learning both global and local object locations, although see further comments.

      Weaknesses:

      (1) The authors frame the entire introduction around confirming the presence of the cognitive map either locally or globally. There are some significant issues with this framing. For one, the introduction appears to be confirmatory and not testing specific hypotheses that can be falsified. What exactly are the hypotheses being tested? I believe that this relates to testing whether neural representations are global and/or local. However, this is not clear. Given that a previous paper (Marchette et al. 2014 Nature Neuro, which bears many similarities) showed only local coding in RSC, this paper needs to be discussed in far more depth in terms of its similarities and differences. This paper looked at both position and direction, while the current paper looks at direction. Even here, direction in the current study is somewhat impoverished: it involves either pointing right or left to an object, and much of this could be categorical or even lucky guesses. From what I could tell, all behavioral inferences are based on reaction time and not accuracy, and therefore, it is difficult to determine if the subject's behavior actually reflects knowledge gained or simply faster reaction time, either due to motor learning or a speed-accuracy trade-off. The pointing task is largely egocentric: it can be solved by remembering a facing direction and an object relative to that. It is not the JRD task as has been used in other studies (e.g., Huffman et al. 2019 Neuron), which is continuous and has an allocentric component. This "version" of the task would be largely egocentric. In this way, the pointing task used does not test the core tenets of the cognitive map during navigation, which is defined as allocentric and Euclidean (please see O'Keefe and Nadel 1978, The Hippocampus as a Cognitive Map). Since neither of these assumptions appears valid, the paper should be reframed to reflect spatial representations more broadly or even egocentric spatial representations.

      (2) The fMRI data workup is insufficient. What do the authors mean by "deactivations" in Figure 3b? Does this mean the object task showed more activation than the spatial task in HSC? Given that HSC is one of these regions, this would seem to suggest that the hippocampus is more involved in object than spatial processing, although it is difficult to tell from how things are written. The RSA is more helpful, but now a concern is that the analysis focuses on small clusters that are based on analyses determined previously. This appears to be the case for the correlations shown in Figure 3e as well. The issues here are several-fold. For one, it has been shown in previous work that basing secondary analyses on related first analyses can inflate the risk of false positives (i.e., Kriegeskorte et al. 2009 Nature Neuro). The authors should perform secondary analyses in ways that are unbiased by the first analyses, preferably, selecting cluster centers (if they choose to go this route) from previous papers rather than their own analyses. Another option would be to perform analyses at the level of the entire ROI, meaning that the results would generalize more readily. The authors should also perform permutation tests to ensure that the RSA results are reliable, as these can run the risk of false positives (e.g., Nolan et al. 2018 eNeuro). If these results hold, the authors should perform post-hoc (corrected) t-tests for global vs. local before and after learning to ensure these differences are robust and not simply rely on the interaction effect. The figures were difficult to follow in this regard, and an interaction effect does not necessarily mean the differences that are critical (global higher than local after) are necessarily significant. The end part of the results was hard to follow. If ACC showed a similar effect to HC and RSC, why is it not being considered? Many other areas that seemed to show local vs. global effects were dismissed, but these should instead be discussed in terms of whether they are consistent or inconsistent with the hypotheses.
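
      One common form of the permutation test requested above shuffles the condition labels of the neural RDM and recomputes the model fit. The sketch below is a generic version under stated assumptions: Pearson correlation over the upper triangle is used as the fit measure, the function name is hypothetical, and the toy data are simulated rather than taken from the study.

```python
import numpy as np

def rsa_permutation_test(neural_rdm, model_rdm, n_perm=1000, seed=0):
    """One-sided permutation p-value for the correlation between two RDMs."""
    rng = np.random.default_rng(seed)
    n = neural_rdm.shape[0]
    iu = np.triu_indices(n, k=1)
    observed = np.corrcoef(neural_rdm[iu], model_rdm[iu])[0, 1]
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Permute rows and columns together so the RDM stays a valid RDM.
        perm = rng.permutation(n)
        shuffled = neural_rdm[np.ix_(perm, perm)]
        null[i] = np.corrcoef(shuffled[iu], model_rdm[iu])[0, 1]
    # Add one to numerator and denominator so the p-value is never exactly zero.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value

# Toy data: 16 items in 4 groups; the "neural" RDM is the model plus symmetric noise.
toy_rng = np.random.default_rng(1)
labels = np.repeat(np.arange(4), 4)
model = (labels[:, None] != labels[None, :]).astype(float)
noise = toy_rng.normal(scale=0.1, size=model.shape)
neural = model + (noise + noise.T) / 2
np.fill_diagonal(neural, 0.0)

obs, p = rsa_permutation_test(neural, model)
print(f"observed r = {obs:.2f}, permutation p = {p:.4f}")
```

      Permuting rows and columns jointly preserves the dependence structure among RDM entries, which a parametric test on the vectorised entries ignores.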

      (3) Concerns about the discussion: there are areas involving reverse inference about brain areas rather than connecting the findings with hypotheses (see Poldrack et al. 2006 Trends in Cognitive Science). The authors also argue for "transfer" of information (for example, from ACC to OFC), but did not perform any connectivity analyses, so these conclusions are not based on any results. Instead, the authors should carefully compare what can be concluded from the reaction time findings and the fMRI data. What is consistent vs. inconsistent with the hypotheses? The authors should also provide a much more detailed comparison with past work. The Marchette et al. paper comes to different conclusions regarding RSC and involves more detailed analyses than those done here, including position. What is different in the current paper that might explain the differences in results? Another previous paper came to a different conclusion (hippocampus local, retrosplenial global) and should be carefully considered and compared, as it also involved learning of environments and comparisons at different phases (e.g., Wolbers & Buchel 2005 J Neuro). Other papers that have used the JRD task have demonstrated similar, although not identical, networks (e.g., Huffman et al. 2019 Neuron) and the results here should be more carefully compared, as the current task is largely egocentric while the Huffman et al. paper involves a continuous and allocentric version of the JRD task.

      (4) The authors cite rodent papers involving single neuron recordings. These are quite different experiments, however: they involve rodents, the rodents are freely moving, and single neurons are recorded. Here, the study involves humans who are supine and an indirect vascular measure of neural activity. Citations should be to studies of spatial memory and navigation in humans using fMRI: over-reliance on rodent studies should be avoided for the reasons mentioned above.

    1. eLife Assessment

      This study presents a valuable approach for revealing large-scale brain attractor dynamics during resting states, task processing, and disease conditions using insights from Hopfield neural networks. The evidence supporting the findings is convincing across the many datasets analysed. The work will be of broad interest to neuroscientists using neuroimaging data with interest in computational modelling of brain activity.

    2. Reviewer #1 (Public review):

      Summary:

      Englert et al. proposed a functional connectivity-based Attractor Neural Network (fcANN) to reveal attractor states and activity flows across various conditions, including resting state, task-evoked, and pathological conditions. The large-scale brain attractors reconstructed by fcANNs exhibit an orthogonal organization, which is in line with the free-energy theoretical framework. Additionally, the fcANN demonstrates differences in attractor states between individuals with autism and typically developing individuals.

      The study used seven datasets, which ensures robust replication and validation of generalization across various conditions. The study is a representative example that combines experimental evidence based on fcANN and the theoretical framework. The fcANN projection offers an interesting way of visualization, allowing researchers to observe attractor states and activity flow patterns directly. Overall, the study may offer valuable insights into brain dynamics and computational neuroscience.

      Comments on revision:

      The authors have addressed my previous concerns and substantially improved the manuscript. Fig.4 and Fig.5 still keep fcHNN rather than the updated fcANN.

    3. Reviewer #2 (Public review):

      Summary:

      Englert et al. use a novel modelling approach called functional connectome-based Hopfield Neural Networks (fcHNN) to describe spontaneous and task-evoked brain activity, and the alterations in brain disorders. Given its novelty, the authors first validate the model parameters (the temperature and noise) with empirical resting-state functional data and against null models. Through the optimisation of the temperature parameter, they first show that the optimal number of attractor states is four before fixing the optimal noise that best reflects the empirical data, through stochastic relaxation. Then, they demonstrate how these fcHNN-generated dynamics predict task-based functional activity relating to pain and self-regulation. To do so, they characterise the different brain states (here as different conditions of the experimental pain paradigm) in terms of the distribution of the data on the fcHNN projections and flow analysis. Lastly, a similar analysis was performed on a population with autism condition. Through Hopfield modeling, this work proposes a comprehensive framework that links various types of functional activity under a unified interpretation with high predictive validity.
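
      For readers unfamiliar with attractor dynamics, the following is a generic, deterministic continuous-state Hopfield relaxation. It is a sketch only and is not the authors' implementation: the update rule, the treatment of `beta` as an inverse temperature, and the toy Hebbian weights (used here in place of a functional connectome) are all assumptions made for illustration.

```python
import numpy as np

def relax(weights, state, beta=4.0, max_iter=100, tol=1e-9):
    """Iterate state <- tanh(beta * W @ state) until the state stops changing."""
    for _ in range(max_iter):
        new_state = np.tanh(beta * weights @ state)
        if np.max(np.abs(new_state - state)) < tol:
            return new_state
        state = new_state
    return state

# Two orthogonal toy patterns over eight "regions".
p1 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
p2 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)

# Hebbian weights storing both patterns; self-connections are removed.
weights = (np.outer(p1, p1) + np.outer(p2, p2)) / p1.size
np.fill_diagonal(weights, 0.0)

# Start from p1 with one region flipped and let the network relax.
start = p1.copy()
start[0] = -1.0
attractor = relax(weights, start)
print(np.sign(attractor))
```

      The corrupted starting state settles back onto the stored pattern, which is the sense in which a state acts as an attractor. Adding noise to each update would give a stochastic relaxation of the kind mentioned above.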

      Strengths:

      The phenomenological nature of the Hopfield model and its validation across multiple datasets presents a comprehensive and intuitive framework for the analysis of functional activity. The results presented in this work further motivate the study of phenomenological models as an adequate mechanistic characterisation of large-scale brain activity.

      Following up from Cole et al. 2016, the authors put forward a hypothesis that many of the changes to the brain activity, here, in terms of task-evoked and clinical data, can be inferred from the resting-state brain data alone. This brings together neatly the idea of different facets of brain activity emerging from a common space of functional (ghost) attractors.

      The use of the null models motivates the benefit for non-linear dynamics in the context of phenomenological models when assessing the similarity to the real empirical data.

      Comments on revision:

      I am happy with how the authors addressed the comments and am happy to move ahead without further comments.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Englert et al. proposed a functional connectome-based Hopfield artificial neural network (fcHNN) architecture to reveal attractor states and activity flows across various conditions, including resting state, task-evoked, and pathological conditions. The fcHNN can reconstruct characteristics of resting-state and task-evoked brain activities. Additionally, the fcHNN demonstrates differences in attractor states between individuals with autism and typically developing individuals.

      Strengths:

      (1) The study used seven datasets, which somewhat ensures robust replication and validation of generalization across various conditions.

      (2) The proposed fcHNN improves upon existing activity flow models by mimicking artificial neural networks, thereby enhancing the representational ability of the model. This advancement enables the model to more accurately reconstruct the dynamic characteristics of brain activity.

      (3) The fcHNN projection offers an interesting visualization, allowing researchers to observe attractor states and activity flow patterns directly.

      We are grateful to the reviewer for highlighting the robustness of our findings across multiple datasets and for appreciating the novelty and representational advantages of our fcHNN model (which has been renamed to fcANN in the revised manuscript).

      Weaknesses:

      (1) The fcHNN projection can offer low-dimensional dynamic visualizations, but its interpretability is limited, making it difficult to make strong claims based on these projections. The interpretability should be enhanced in the results and discussion.

      We thank the reviewer for these important points. We agree that the interpretability of the low-dimensional projection is limited. In the revised manuscript, we have reframed the fcANN projection primarily as a visualization tool (see e.g. line 359) and moved the corresponding part of Figure 2 to the Supplementary Material (Supplementary Figure 2). We have also implemented a substantial revision of the manuscript, which now directly links our analysis to the novel theoretical framework of self-orthogonalizing attractor networks (Spisak & Friston, 2025), opening several new avenues in terms of interpretation and shedding light on the computational principles underlying attractor dynamics in the brain (see the revised introduction and the new section “Theoretical background”, starting at line 128, but also the Mathematical Appendices 1-2 in the Supplementary Material for a comprehensive formal derivation). As part of these efforts, we now provide evidence for the brain’s functional organization approximating a special, computationally efficient class of attractor networks, the so-called Kanter-Sompolinsky projector network (Figure 2A-C, line 346, see also our answer to your next comment). This is exactly what the theoretical framework of free-energy-minimizing attractor networks predicts.

      (2) The presentation of results is not clear enough, including figures, wording, and statistical analysis, which contributes to the overall difficulty in understanding the manuscript. This lack of clarity in presenting key findings can obscure the insights that the study aims to convey, making it challenging for readers to fully grasp the implications and significance of the research.

      We have thoroughly revised the manuscript for clarity in wording and figures (see e.g. lines 257, 482, 529 in the Results and lines 1128, 1266, 1300, 1367 in the Methods). We carefully improved statistical reporting and ensured that we always report test statistics and effect sizes and clearly refer to the null modelling approach used (e.g. lines 461, 542, 550, 565, 573, 619, as well as Figures 2-4). As absolute effect sizes, in many analyses, do not have a straightforward interpretation, we provided Glass’ Δ as a standardized effect size measure, expressing the distance of the true observation from the null distribution as a ratio of the null standard deviation. To further improve clarity, we now clearly define our research questions and the corresponding analyses and null models in the revised manuscript, both in the main text and in two new tables (Tables 1 and 2). We denote research questions and null models with Q1-7 and NM1-5, respectively, and refer to them at multiple instances when detailing the analyses and the results.

      Reviewer #2 (Public Review):

      Summary:

      Englert et al. use a novel modelling approach called functional connectome-based Hopfield Neural Networks (fcHNN) to describe spontaneous and task-evoked brain activity and the alterations in brain disorders. Given its novelty, the authors first validate the model parameters (the temperature and noise) with empirical resting-state function data and against null models. Through the optimisation of the temperature parameter, they first show that the optimal number of attractor states is four before fixing the optimal noise that best reflects the empirical data, through stochastic relaxation. Then, they demonstrate how these fcHNN-generated dynamics predict task-based functional activity relating to pain and self-regulation. To do so, they characterise the different brain states (here as different conditions of the experimental pain paradigm) in terms of the distribution of the data on the fcHNN projections and flow analysis. Lastly, a similar analysis was performed on a population with autism condition. Through Hopfield modeling, this work proposes a comprehensive framework that links various types of functional activity under a unified interpretation with high predictive validity.

      Strengths:

      The phenomenological nature of the Hopfield model and its validation across multiple datasets presents a comprehensive and intuitive framework for the analysis of functional activity. The results presented in this work further motivate the study of phenomenological models as an adequate mechanistic characterisation of large-scale brain activity.

      Following up on Cole et al. 2016, the authors put forward a hypothesis that many of the changes to the brain activity, here, in terms of task-evoked and clinical data, can be inferred from the resting-state brain data alone. This brings together neatly the idea of different facets of brain activity emerging from a common space of functional (ghost) attractors.

      The use of the null models motivates the benefit of non-linear dynamics in the context of phenomenological models when assessing the similarity to the real empirical data.

      We thank the reviewer for recognizing the comprehensive and intuitive nature of our framework and for acknowledging the strength of our hypothesis that diverse brain activity facets emerge from a common resting state attractor landscape.

      Weaknesses:

      While the use of the Hopfield model is neat and very well presented, it still begs the question of why to use the functional connectome (as derived by activity flow analysis from Cole et al. 2016). Deriving the functional connectome on the resting-state data that are then used for the analysis reads as circular.

      We agree that starting from functional couplings to study dynamics is in stark contrast with the common practice of estimating the interregional couplings based on structural connectome data. We now explicitly discuss how this affects the scope of the questions we can address with the approach, with explicit notes on the inability of this approach to study the structure-function coupling and its limitations in deriving mechanistic insights at the level of biophysical implementation.

      Line 894:

      “The proposed approach is not without limitations. First, the proposed approach does not incorporate information about anatomical connectivity and does not explicitly model biophysical details. Thus, in its present form, the model is not suitable to study the structure-function coupling and cannot yield mechanistic explanations underlying (altered) polysynaptic connections, at the level of biophysical details.”

      We are confident, however, that our approach is not circular. At the high level, our approach can be considered as a function-to-function generative model, with twofold aims.

      First, we link large-scale brain dynamics to theoretical artificial neural network models and show that the functional connectome displays characteristics that render it an exceptionally “well-behaving” attractor network (e.g. superior convergence properties, as contrasted against appropriate respective null models). In the revised manuscript, we have significantly improved upon this aspect by explicitly linking the fcANN model to the theoretical framework of self-orthogonalizing attractor networks (Spisak & Friston, 2025) (see the revised introduction and the new section “Theoretical background”, starting at line 128, but also the Mathematical Appendices 1-2 in the Supplementary Material for a comprehensive formal derivation). As part of these efforts, we now provide evidence for the brain’s functional organization approximating a special, computationally efficient class of attractor networks, the so-called Kanter-Sompolinsky projector network (Figure 2A-C, line 346, see also our answer to your next comment). This is exactly what the theoretical framework of free-energy-minimizing attractor networks predicts. This result is not circular, as the empirical model does not use the key mechanism (the Hebbian/anti-Hebbian learning rule) that induces self-orthogonalization in the theoretical framework. We clarify this in the revised manuscript, e.g. in line 736.

      Second, we benchmark the ability of the proposed function-to-function generative model to predict unseen data (new datasets) or data characteristics that are not directly encompassed in the connectivity matrix (e.g. non-Gaussian conditional dependencies, temporal autocorrelation, dynamical responses to perturbations on the system). These benchmarks are constructed against well-defined null models, which provide reasonable references. We have now significantly improved the discussion of these null models in the revised manuscript (Tables 1 and 2, line 257). We not only show that our model - when reconstructing resting state dynamics - can generalize to unseen data over and beyond what is possible with the baseline descriptive measures (e.g. covariance measures and PCA), but also demonstrate the ability of the framework to reconstruct the effects of perturbations on these dynamics (such as task-evoked changes), based solely on the resting state data from another sample.

      If the fcHNN derives the basins of four attractors that reflect the first two principal components of functional connectivity, it perhaps suffices to use the empirically derived components alone and project the task and clinical data on it without the need for the fcHNN framework.

      We are thankful to the reviewer for highlighting this important point, which encouraged us to develop a detailed understanding of the origins of the close alignment between attractors and principal components (eigenvectors of the coupling matrix) and the corresponding (approximate) orthogonality. Here, we would like to emphasize that the attractor-eigenvector correspondence is by no means a general feature of any arbitrary attractor network. In fact, such networks are a very special class of attractor neural networks (the so-called Kanter-Sompolinsky projector neural network (Kanter & Sompolinsky, 1987)), with a high degree of computational efficiency, maximal memory capacity and perfect memory recall. It has been rigorously shown that in such networks, the eigenvectors of the coupling matrix (i.e. PCA on the timeseries data) and the attractors become equivalent (Kanter & Sompolinsky, 1987). This in turn made us ask the question: what are the learning and plasticity rules that drive attractor networks towards developing approximately orthogonal attractors? We found that this is a general tendency of networks obeying the free energy principle (Figure 2A-C, line 346, see also our answer to your next comment). The formal derivation of this framework is now presented in an accompanying theoretical piece (Spisak & Friston, 2025). In the revised manuscript, we provide a short, high-level overview of these results (in the Introduction from line 55 and in the new section “Theoretical background”, line 128, but also the Mathematical Appendices 1-2 in the Supplementary Material for a comprehensive formal derivation). According to this new theoretical model, attractor states can be understood as a set of priors (in the Bayesian sense) that together constitute an optimal orthogonal basis, equipping the update process (which is akin to a Markov-chain Monte Carlo sampling) to find posteriors that generalize effectively within the spanned subspace.

      Thus, in sum, understanding brain function in terms of attractor dynamics - instead of PCA-like descriptive projections - provides important links towards a Bayesian interpretation of brain activity. At the same time, the eigenvector-attractor correspondence also explains why descriptive decomposition approaches, like PCA or ICA, are so effective at capturing the dynamics of the system in the first place.
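
      The projector-network property referred to above can be illustrated with a small toy example. The sketch below is not the analysis from the manuscript: it assumes random binary patterns, builds the coupling matrix with the pseudo-inverse (projector) rule, and checks that each stored pattern is both a fixed point of the sign update and an eigenvector of the coupling matrix with eigenvalue 1. All variable names and sizes are illustrative.

```python
import numpy as np

# Toy setup (assumed for illustration): 4 random +/-1 patterns over 50 units.
rng = np.random.default_rng(0)
n_units, n_patterns = 50, 4
patterns = rng.choice([-1.0, 1.0], size=(n_units, n_patterns))

# Projector (pseudo-inverse) rule: W projects onto the span of the patterns.
weights = patterns @ np.linalg.pinv(patterns)

# Each stored pattern satisfies W @ xi = xi, so it is an eigenvector of W
# with eigenvalue 1 ...
patterns_are_eigenvectors = np.allclose(weights @ patterns, patterns)

# ... and therefore also a fixed point of the deterministic sign update.
patterns_are_fixed_points = np.array_equal(np.sign(weights @ patterns), patterns)

# The remaining eigenvalues are 0: W has exactly n_patterns unit eigenvalues.
eigenvalues = np.linalg.eigvalsh((weights + weights.T) / 2)
```

      In this idealized case the stored patterns need not be mutually orthogonal; the projector rule removes their overlaps. With approximately orthogonal attractors, as discussed in the response, the leading eigenvectors and the attractors themselves come close to coinciding.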

      As presented here, the Hopfield model is excellent in its simplicity and power, and it seems suited to tackle the structure-function relationship with the power of going further to explain task-evoked and clinical data. The work could be strengthened if that was taken into consideration. As such the model would not suffer from circularity problems and it would be possible to claim its mechanistic properties. Furthermore, as mentioned above, in the current setup, the connectivity matrix is based on statistical properties of functional activity amongst regions, and as such it is difficult to talk about a certain mechanism. This contention has for example been addressed in the Cole et al. 2016 paper with the use of a biophysical model linking structure and function, thus strengthening the mechanistic claim of the work.

      We agree that investigating how the structural connectome constrains macro-scale dynamics is a crucial next step. Linking our results with the theoretical framework of self-orthogonalizing attractor networks provides a principled approach to this, as the “self-orthogonalizing” learning rule in the accompanying theoretical work provides the opportunity to fit attractor networks with structural constraints to functional data, shedding light on the plastic processes which maintain the observed approximate orthogonality even in the presence of these structural constraints. We have revised the manuscript to clarify that our phenomenological approach is inherently limited in its ability to answer mechanistic questions at the level of biophysical details (line 894) and discuss this promising direction as follows:

      Line 803:

      “A promising application of this is to consider structural brain connectivity (as measured by diffusion MRI) as a sparsity constraint for the coupling weights and then train the fcANN model to match the observed resting-state brain dynamics. If the resulting structural-functional ANN model is able to closely match the observed functional brain substate dynamics, it can be used as a novel approach to quantify and understand the structural functional coupling in the brain”.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) The statistical analyses are poorly described throughout the manuscript. The authors should provide more details on the statistical methods used for each comparison, as well as the corresponding statistics and degrees of freedom, rather than solely reporting p-values.

      We thank the reviewer for pointing this out. We have revised the manuscript to include the specific test statistics, precise p-values and raw effect sizes for all reported analyses to ensure full transparency and replicability, see e.g. lines 461, 542, 550, 565, 573, 619, as well as Figures 2-4. Additionally, as absolute effect sizes - in many analyses - do not have a straightforward interpretation, we provided Glass’ Δ, as a standardized effect size measure, expressing the distance of the true observation from the null distribution as a ratio of the null standard deviation.
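
      The standardized effect size described above can be sketched as follows. This is a minimal illustration of the definition given in the response (distance of the observed statistic from the null mean, in units of the null standard deviation); the function name, its signature and the toy numbers are assumptions, not taken from the manuscript's code.

```python
import numpy as np

def glass_delta(observed, null_values):
    """Distance of an observed statistic from a null distribution,
    expressed in units of the null standard deviation.

    Hypothetical helper for illustration only.
    """
    null_values = np.asarray(null_values, dtype=float)
    null_sd = null_values.std(ddof=1)  # sample SD of the null distribution
    if null_sd == 0:
        raise ValueError("null distribution has zero variance")
    return (observed - null_values.mean()) / null_sd

# Made-up numbers, not data from the study.
null = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
delta = glass_delta(6.0, null)
```

      Because only the null standard deviation enters the denominator, the measure stays interpretable when the observed value is a single number rather than a sample.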

      We have also improved the description of the statistical methods used in the manuscript (lines 1270, 1306, 1339, 1367, 1404) and added two overview tables (Tables 1 and 2) that summarize the methodological approaches and the corresponding null models.

      Furthermore, we have fully revised the analysis corresponding to noise optimization. We only retained null model 2 (covariance-matched Gaussian) in the main text and in Figure 3, and moved null model 1 (spatial phase randomization) into the Supplementary Material (Supplementary Figure 6), as it is less appropriate for this analysis (trivially significant in all cases). Furthermore, as the test statistic, we now use a Wasserstein distance between the 122-dimensional empirical and simulated data (instead of focusing on the 2-dimensional projection). This analysis now directly quantifies the capacity of the fcANN model to capture non-Gaussian conditionals in the data.
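
      The response does not state how the Wasserstein distance was computed in 122 dimensions. As one possible illustration, the sketch below uses a sliced approximation (averaging one-dimensional Wasserstein distances over random projections). This is a swapped-in technique chosen for brevity, not necessarily the authors' estimator; the function names and the synthetic Gaussian data are assumptions.

```python
import numpy as np

def wasserstein_1d(u, v):
    """1-Wasserstein distance between two equal-sized 1D samples."""
    return np.mean(np.abs(np.sort(u) - np.sort(v)))

def sliced_wasserstein(x, y, n_projections=200, seed=0):
    """Average 1D Wasserstein distance over random unit directions.

    x, y: arrays of shape (n_samples, n_features) with equal n_samples.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(n_projections, x.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return float(np.mean([wasserstein_1d(x @ d, y @ d) for d in directions]))

# Synthetic stand-ins for empirical and simulated activity (122 regions).
rng = np.random.default_rng(42)
empirical = rng.normal(size=(500, 122))
simulated_close = rng.normal(size=(500, 122))          # same distribution
simulated_far = rng.normal(loc=1.0, size=(500, 122))   # shifted distribution

d_close = sliced_wasserstein(empirical, simulated_close)
d_far = sliced_wasserstein(empirical, simulated_far)
```

      A smaller distance indicates that the simulated samples are distributed more like the empirical ones; comparing such distances against a null model is what turns the number into a test statistic.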

      (2) The convergence procedure is not clearly explained in the manuscript. Is this an optimization procedure to minimize energy? If so, the authors should provide more details about the optimizer used.

      We apologize for the lack of clarity. The convergence is not an optimization procedure per se, in the sense that it does not involve any external optimizer. It is simply the repeated (deterministic) application of the same update rule also known from Hopfield networks or Boltzmann machines. However, as detailed in the accompanying theoretical paper, this update rule (or inference rule) inherently solves an optimization problem: it performs gradient descent on the free energy landscape of the network. As such, it is guaranteed to converge to a local free energy minimum in the deterministic case. We have clarified this process in the Results and Methods sections as follows:

      Line 161:

      “Inference arises from minimizing free energy with respect to the states σ. For a single unit, this yields a local update rule homologous to the relaxation dynamics in Hopfield networks”.

      Line 181:

      “In the basis framework (Spisak & Friston, 2025), inference is a gradient descent on the variational free energy landscape with respect to the states σ and can be interpreted as a form of approximate Bayesian inference, where the expected value of the state σ_i is interpreted as the posterior mean given the attractor states currently encoded in the network (serving as a macro-scale prior) and the previous state, including external inputs (serving as likelihood in the Bayesian sense)”.

      Line 1252:

      “As the inference rule was derived as a gradient descent on free energy, iterations monotonically decrease the free energy function and therefore converge to a local free‑energy minimum without any external optimizer. Thus, convergence does not require any optimization procedure with an external optimizer. Instead, it arises as the fixed point of repeated local inference updates, which implement gradient descent on free energy in the deterministic symmetric case.”
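
      The relaxation described above, repeated application of one deterministic update rule until a fixed point is reached, can be sketched as follows. The specific form of the rule (a synchronous tanh update scaled by an inverse temperature `beta`), the toy one-pattern weight matrix and all names are assumptions made for illustration; the manuscript's exact rule may differ.

```python
import numpy as np

def relax(weights, state, beta, tol=1e-10, max_iter=1000):
    """Iterate state <- tanh(beta * W @ state) until the state stops changing.

    No external optimizer is involved: the loop only reapplies the update.
    """
    state = np.asarray(state, dtype=float)
    for n_iter in range(1, max_iter + 1):
        new_state = np.tanh(beta * (weights @ state))
        if np.max(np.abs(new_state - state)) < tol:
            return new_state, n_iter
        state = new_state
    raise RuntimeError("no convergence within max_iter")

# Toy symmetric coupling matrix storing a single pattern (illustrative only).
pattern = np.array([1.0, -1.0, 1.0, -1.0])
weights = np.outer(pattern, pattern) / pattern.size

start = np.array([0.5, -0.2, 0.1, -0.4])
attractor, n_iter = relax(weights, start, beta=2.0)
```

      Starting from the sign-flipped initial state leads to the mirrored attractor, which is one way to see why attractors of such symmetric networks come in pairs.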

      (3) In Figure 2G, the beta values range from 0.035 to 0.06, but they are reported as 0.4 in the main text and the Supplementary Figure. Please clarify this discrepancy.

      We are grateful to the reviewer for spotting this typo. The correct value for β is 0.04, as reported in the Methods section. We have corrected this inconsistency in the revised manuscript as well as in Supplementary Figure 5.

      (4) Line 174: What type of null model was used to evaluate the impact of the beta values? The authors did not provide details on this anywhere in the manuscript.

      We apologize for this omission. The null model is based on permuting the connectome weights while retaining the matrix symmetry, which destroys the specific topological structure but preserves the overall weight distribution. We have now clarified this at multiple places in the revised manuscript (line 432, Tables 1-2, Figure 2), and added new overview tables (Tables 1 and 2) to summarize the methodological approaches and the corresponding null models.
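
      A symmetry-preserving permutation of this kind could look like the sketch below. It shuffles the off-diagonal upper-triangle weights and mirrors them into the lower triangle; keeping the diagonal unchanged is an assumption of this sketch, as are the function name and the toy matrix.

```python
import numpy as np

def permute_symmetric(weights, rng):
    """Shuffle off-diagonal weights of a symmetric matrix, keeping symmetry.

    The multiset of weights is preserved; their arrangement is randomized.
    The diagonal is left untouched (an assumption of this sketch).
    """
    n = weights.shape[0]
    rows, cols = np.triu_indices(n, k=1)
    shuffled = rng.permutation(weights[rows, cols])
    null = np.zeros_like(weights)
    null[rows, cols] = shuffled
    null[cols, rows] = shuffled          # mirror to keep the matrix symmetric
    null[np.diag_indices(n)] = np.diag(weights)
    return null

# Toy symmetric "connectome" (random numbers, not real data).
rng = np.random.default_rng(0)
a = rng.normal(size=(6, 6))
connectome = (a + a.T) / 2
null_connectome = permute_symmetric(connectome, rng)
```

      Repeating the permutation many times and recomputing the statistic of interest on each permuted matrix yields the null distribution against which the observed value is compared.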

      (5) Figure 3B: It appears that the authors only demonstrate the reproducibility of the “internal” attractor across different datasets. What about other states?

      Thank you for noticing this. We now visualize all attractor states in Figure 3B (note that these essentially consist of two symmetric pairs).

      (6) Figure 3: What does “empirical” represent in Figure 3? Is it PCA? If the “empirical” method, which is a much simpler method, can achieve results similar to those of the fcHNN in terms of state occupancy, distribution, and activity flow, what are the benefits of the proposed method? Furthermore, the authors claim that the explanatory power of the fcHNN is higher than that of the empirical model and shows significant differences. However, from my perspective, this difference is not substantial (37.0% vs. 39.9%). What does this signify, particularly in comparison to PCA?

      This is a crucial point that is now a central theme of our revised manuscript. The reviewer is correct that the “empirical” method is PCA. PCA - by identifying variance-heavy orthogonal directions - aims to explain the highest amount of variance possible in the data (with the assumption of Gaussian conditionals). While empirical attractors are closely aligned to the PCs (i.e. eigenvectors of the inverse covariance matrix, as shown in the new analysis Q1), the alignment is only approximate. We basically take advantage of this small “gap” to quantify whether attractor states are a better fit to the unseen data than the PCs. Obviously, due to the otherwise strong PC-attractor correspondence, this is expected to be only a small improvement. However, it is an important piece of evidence for the validity of our framework, as it shows that attractors are not just a complementary, perhaps “noisier” variety of the PCs, but a “substrate” that generalizes better to unseen data than the PCs themselves. We have revised the manuscript to clarify this point (line 528).

      Reviewer #2 (Recommendations For The Authors):

      For clarity, it might be useful to define certain key terms and use them consistently. “Connectome” often refers to structural (anatomical) connectivity; unless defined specifically, this should be considered, for example in the title of Figure 1B. “Brain state” often refers to different conditions, i.e. autism, neurotypical, sleep, etc. (for a review, see Kringelbach et al. 2020, Cell Reports). When referring to attractors of brain activity, they might be called substates.

      We thank the reviewer for these helpful suggestions. We have carefully revised the manuscript to ensure our terminology is precise and consistent. We now explicitly refer to the “functional connectome” (including the title) and avoid using the too general term “brain state” and use “substates” instead.

      In Figure 2 some terms are not defined. Noise is sigma in the text but epsilon in the figure. Only in the methods does the link become clear. Perhaps define epsilon in the caption for clarity. The same applies to μ in the methods. It is only described above in the methods; I suggest repeating the epsilon definition for clarity.

      We appreciate this feedback and apologize for the inconsistency. We have revised all figures and the Methods section to ensure that all mathematical symbols (including ε, σ, and μ) are clearly and consistently defined upon their first appearance and in all figure captions. For instance, noise level is now consistently referred to as ϵ. We improved the consistency and clarity for other terms, too, including:

      functional connectome-based Hopfield network (fcHNN) => functional connectivity-based attractor network (fcANN);

      temperature => inverse temperature;

      And improved grammar and language throughout the manuscript.

      References

      Kanter, I., & Sompolinsky, H. (1987). Associative recall of memory without errors. Physical Review A, 35(1), 380–392. 10.1103/physreva.35.380

      Spisak, T., & Friston, K. (2025). Self-orthogonalizing attractor neural networks emerging from the free energy principle. arXiv preprint arXiv:2505.22749.

    1. eLife Assessment

      O'Brien and co-authors provide important data demonstrating that tissue-resident macrophages can exert physiological functions and influence endocrine systems. Their model in which AMs negatively regulate aldosterone production via effects exerted in the lung is solid. The work will be of broad interest to cell biologists and immunologists.

    2. Reviewer #2 (Public review):

      Summary:

      Tissue-resident macrophages are increasingly thought to exert key homeostatic functions and contribute to physiological responses. In the report of O'Brien and colleagues, the idea that the macrophage-expressed scavenger receptor MARCO could regulate adrenal corticosteroid output at steady-state was explored. The authors found that male MARCO-deficient mice exhibited higher plasma aldosterone levels and higher lung ACE expression as compared to wild-type mice, while the availability of cholesterol and the machinery required to produce aldosterone in the adrenal gland were not affected by MARCO deficiency. The authors take these data to conclude that MARCO in alveolar macrophages can negatively regulate ACE expression and aldosterone production at steady-state and that MARCO-deficient mice suffer from a secondary hyperaldosteronism.

      Strengths:

      If properly demonstrated and validated, the fact that tissue-resident macrophages can exert physiological functions and influence endocrine systems would be highly significant and could be amenable to novel therapies.

      Major weakness:

      The comparison between C57BL/6J wild-type mice and knock-out mice, for which precise information about the genetic background and the history of breedings and crossings is lacking, can lead to misinterpretations of the results obtained. Hence, MARCO-deficient mice should be compared with true littermate controls.

    1. eLife Assessment

      This is an important account of replay as recency-weighted context-guided memory reactivation that explains a number of empirical findings across human and rodent memory literatures. The evidence is compelling and the work is likely to inspire further adaptions to incorporate additional biological and cognitive features.

    2. Reviewer #1 (Public review):

      Summary:

      Zhou and colleagues developed a computational model of replay that heavily builds on cognitive models of memory in context (e.g., the context-maintenance and retrieval model), which have been successfully used to explain memory phenomena in the past. Their model produces results that mirror previous empirical findings in rodents and offers a new computational framework for thinking about replay.

      Strengths:

      The model is compelling and seems to explain a number of findings from the rodent literature. It is commendable that the authors implement commonly used algorithms from wakefulness to model sleep/rest, thereby linking wake and sleep phenomena in a parsimonious way. Additionally, the manuscript's comprehensive perspective on replay, bridging humans and non-human animals, enhanced its theoretical contribution.

      Weaknesses:

      This reviewer is not a computational neuroscientist by training, so some comments may stem from misunderstandings. I hope the authors would see those instances as opportunities to clarify their findings for broader audiences.

      (1) The model predicts that temporally close items will be co-reactivated, yet evidence from humans suggests that temporal context doesn't guide sleep benefits (instead, semantic connections seem to be of more importance; Liu and Ranganath 2021, Schechtman et al 2023). Could these findings be reconciled with the model or is this a limitation of the current framework?

      (2) During replay, the model is set so that the next reactivated item is sampled without replacement (i.e., the model cannot get "stuck" on a single item). I'm not sure what the biological backing behind this is and why the brain can't reactivate the same item consistently. Furthermore, I'm afraid that such a rule may artificially generate sequential reactivation of items regardless of wake training. Could the authors explain this better or show that this isn't the case?

      (3) If I understand correctly, there are two ways in which novelty (i.e., less exposure) is accounted for in the model. The first and more talked about is the suppression mechanism (lines 639-646). The second is a change in learning rates (lines 593-595). It's unclear to me why both procedures are needed, how they differ, and whether these are two different mechanisms that the model implements. Also, since the authors controlled the extent to which each item was experienced during wakefulness, it's not entirely clear to me which of the simulations manipulated novelty on an individual item level, as described in lines 593-595 (if any).

      As to the first mechanism - experience-based suppression - I find it challenging to think of a biological mechanism that would achieve this and is selectively activated immediately before sleep (somehow anticipating its onset). In fact, the prominent synaptic homeostasis hypothesis suggests that such suppression, at least on a synaptic level, is exactly what sleep itself does (i.e., prune or weaken synapses that were enhanced due to learning during the day). This begs the question of whether certain sleep stages (or ultradian cycles) may be involved in pruning, whereas others leverage its results for reactivation (e.g., a sequential hypothesis; Rasch & Born, 2013). That could be a compelling synthesis of this literature. Regardless of whether the authors agree, I believe that this point is a major caveat to the current model. It is addressed in the discussion, but perhaps it would be beneficial to explicitly state to what extent the results rely on the assumption of a pre-sleep suppression mechanism.

      (4) As the manuscript mentions, the only difference between sleep and wake in the model is the initial conditions (a0). This is an obvious simplification, especially given the last author's recent models discussing the very different roles of REM vs NREM. Could the authors suggest how different sleep stages may relate to the model or how it could be developed to interact with other successful models such as the ones the last author has developed (e.g., C-HORSE)? Finally, I wonder how the model would explain findings (including the authors') showing a preference for reactivation of weaker memories. The literature seems to suggest that it isn't just a matter of novelty or exposure, but encoding strength. Can the model explain this? Or would it require additional assumptions or some mechanism for selective endogenous reactivation during sleep and rest?

      (5) Lines 186-200 - Perhaps I'm misunderstanding, but wouldn't it be trivial that an external cue at the end-item of Figure 7a would result in backward replay, simply because there is no potential for forward replay for sequences starting at the last item (there simply aren't any subsequent items)? The opposite is true, of course, for the first-item replay, which can't go backward. More generally, my understanding of the literature on forward vs backward replay is that neither is linked to the rodent's location. Both commonly happen at a resting station that is further away from the track. It seems as though the model's result may not hold if replay occurs away from the track (i.e. if a0 would be equal for both pre- and post-run).

      (6) The manuscript describes a study by Bendor & Wilson (2012) and tightly mimics their results. However, notably, that study did not find triggered replay immediately following sound presentation, but rather a general bias toward reactivation of the cued sequence over longer stretches of time. In other words, it seems that the model's results don't fully mirror the empirical results. One idea that came to mind is that perhaps it is the R/L context - not the first R/L item - that is cued in this study. This is in line with other TMR studies showing what may be seen as contextual reactivation. If the authors think that such a simulation may better mirror the empirical results, I encourage them to try. If not, however, this limitation should be discussed.

      (7) There is some discussion about replay's benefit to memory. One point of interest could be whether this benefit changes between wake and sleep. Relatedly, it would be interesting to see whether the proportion of forward replay, backward replay, or both correlated with memory benefits. I encourage the authors to extend the section on the function of replay and explore these questions.

      (8) Replay has been mostly studied in rodents, with few exceptions, whereas CMR and similar models have mostly been used in humans. Although replay is considered a good model of episodic memory, it is still limited due to limited findings of sequential replay in humans and its reliance on very structured and inherently autocorrelated items (i.e., place fields). I'm wondering if the authors could speak to the implications of those limitations on the generalizability of their model. Relatedly, I wonder if the model could or does lead to generalization to some extent in a way that would align with the complementary learning systems framework.

    3. Reviewer #3 (Public review):

      In this manuscript, Zhou et al. present a computational model of memory replay. Their model (CMR-replay) draws from temporal context models of human memory (e.g., TCM, CMR) and claims replay may be another instance of a context-guided memory process. During awake learning, CMR-replay (like its predecessors) encodes items alongside a drifting mental context that maintains a recency-weighted history of recently encoded contexts/items. In this way, the presently encoded item becomes associated with other recently learned items via their shared context representation - giving rise to typical effects in recall such as primacy, recency and contiguity. Unlike its predecessors, CMR-replay has built in replay periods. These replay periods are designed to approximate sleep or wakeful quiescence, in which an item is spontaneously reactivated, causing a subsequent cascade of item-context reactivations that further update the model's items-context associations.
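
      The drifting context described in this summary can be illustrated with the standard update from temporal context models, in which the new context is a weighted sum of the previous context and the input associated with the current item. The sketch below is an illustration under assumptions: orthogonal one-hot item inputs, a fixed drift rate `beta`, and a scaling factor `rho` chosen to keep the context at unit length. It is not CMR-replay's implementation, and all names are hypothetical.

```python
import numpy as np

def drift_context(context, item_input, beta):
    """One step of context drift: c_t = rho * c_{t-1} + beta * c_in.

    rho is chosen so that the updated context keeps unit length
    (both inputs are assumed to be unit vectors).
    """
    overlap = context @ item_input
    rho = np.sqrt(1.0 + beta**2 * (overlap**2 - 1.0)) - beta * overlap
    return rho * context + beta * item_input

# Present four items in sequence; each item has its own one-hot input.
n_items = 4
context = np.zeros(n_items + 1)
context[0] = 1.0                      # pre-list context, orthogonal to all items
for item in range(n_items):
    item_input = np.zeros(n_items + 1)
    item_input[item + 1] = 1.0
    context = drift_context(context, item_input, beta=0.5)

# Contribution of each item to the final context, in presentation order.
item_weights = context[1:]
```

      After the list, the most recently presented item carries the largest weight and earlier items decay geometrically, which is the recency-weighted history the summary refers to.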

      Using this model of replay, Zhou et al. were able to reproduce a variety of empirical findings in the replay literature: e.g., greater forward replay at the beginning of a track and more backwards replay at the end; more replay for rewarded events; the occurrence of remote replay; reduced replay for repeated items, etc. Furthermore, the model diverges considerably (in implementation and predictions) from other prominent models of replay that, instead, emphasize replay as a way of predicting value from a reinforcement learning framing (i.e., EVB, expected value backup).

      Overall, I found the manuscript clear and easy to follow, despite not being a computational modeller myself. (Which is pretty commendable, I'd say). The model also was effective at capturing several important empirical results from the replay literature while relying on a concise set of mechanisms - which will have implications for subsequent theory building in the field.

      The authors addressed my concerns with respect to adding methodological detail. I am satisfied with the changes.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Zhou and colleagues developed a computational model of replay that heavily builds on cognitive models of memory in context (e.g., the context-maintenance and retrieval model), which have been successfully used to explain memory phenomena in the past. Their model produces results that mirror previous empirical findings in rodents and offers a new computational framework for thinking about replay.

      Strengths:

      The model is compelling and seems to explain a number of findings from the rodent literature. It is commendable that the authors implement commonly used algorithms from wakefulness to model sleep/rest, thereby linking wake and sleep phenomena in a parsimonious way. Additionally, the manuscript's comprehensive perspective on replay, bridging humans and non-human animals, enhanced its theoretical contribution.

      Weaknesses:

      This reviewer is not a computational neuroscientist by training, so some comments may stem from misunderstandings. I hope the authors would see those instances as opportunities to clarify their findings for broader audiences.

      (1) The model predicts that temporally close items will be co-reactivated, yet evidence from humans suggests that temporal context doesn't guide sleep benefits (instead, semantic connections seem to be of more importance; Liu and Ranganath 2021, Schechtman et al 2023). Could these findings be reconciled with the model or is this a limitation of the current framework?

      We appreciate the encouragement to discuss this connection. Our framework can accommodate semantic associations as determinants of sleep-dependent consolidation, which can in principle outweigh temporal associations. Indeed, prior models in this lineage have extensively simulated how semantic associations support encoding and retrieval alongside temporal associations. It would therefore be straightforward to extend our model to simulate how semantic associations guide sleep benefits, and to compare their contribution against that conferred by temporal associations across different experimental paradigms. In the revised manuscript, we have added a discussion of how our framework may simulate the role of semantic associations in sleep-dependent consolidation.

      “Several recent studies have argued for dominance of semantic associations over temporal associations in the process of human sleep-dependent consolidation (Schechtman et al., 2023; Liu and Ranganath 2021; Sherman et al., 2025), with one study observing no role at all for temporal associations (Schechtman et al., 2023). At first glance, these findings appear in tension with our model, where temporal associations drive offline consolidation. Indeed, prior models have accounted for these findings by suppressing temporal context during sleep (Liu and Ranganath 2024; Sherman et al., 2025). However, earlier models in the CMR lineage have successfully captured the joint contributions of semantic and temporal associations to encoding and retrieval (Polyn et al., 2009), and these processes could extend naturally to offline replay. In a paradigm where semantic associations are especially salient during awake learning, the model could weight these associations more and account for greater co-reactivation and sleep-dependent memory benefits for semantically related than temporally related items. Consistent with this idea, Schechtman et al. (2023) speculated that their null temporal effects likely reflected the task’s emphasis on semantic associations. When temporal associations are more salient and task-relevant, sleep-related benefits for temporally contiguous items are more likely to emerge (e.g., Drosopoulos et al., 2007; King et al., 2017).”

      The reviewer’s comment points to fruitful directions for future work that could employ our framework to dissect the relative contributions of semantic and temporal associations to memory consolidation.

      (2) During replay, the model is set so that the next reactivated item is sampled without replacement (i.e., the model cannot get "stuck" on a single item). I'm not sure what the biological backing behind this is and why the brain can't reactivate the same item consistently.

      Furthermore, I'm afraid that such a rule may artificially generate sequential reactivation of items regardless of wake training. Could the authors explain this better or show that this isn't the case?

      We appreciate the opportunity to clarify this aspect of the model. We first note that this mechanism has long been a fundamental component of this class of models (Howard & Kahana 2002). Many classic memory models (Brown et al., 2000; Burgess & Hitch, 1991; Lewandowsky & Murdock 1989) incorporate response suppression, in which activated items are temporarily inhibited. The simplest implementation, which we use here, removes activated items from the pool of candidate items. Alternative implementations achieve this through transient inhibition, often conceptualized as neuronal fatigue (Burgess & Hitch, 1991; Grossberg 1978). Our model adopts a similar perspective, interpreting this mechanism as mimicking a brief refractory period that renders reactivated neurons unlikely to fire again within a short physiological event such as a sharp-wave ripple. Importantly, this approach does not generate spurious sequences. Instead, the model’s ability to preserve the structure of wake experience during replay depends entirely on the learned associations between items (without these associations, item order would be random). Similar assumptions are also common in models of replay. For example, reinforcement learning models of replay incorporate mechanisms such as inhibition to prevent repeated reactivations (e.g., Diekmann & Cheng, 2023) or prioritize reactivation based on ranking to limit items to a single replay (e.g., Mattar & Daw, 2018). We now discuss these points in the section titled “A context model of memory replay”:

      “This mechanism of sampling without replacement, akin to response suppression in established context memory models (Howard & Kahana 2002), could be implemented by neuronal fatigue or refractory dynamics (Burgess & Hitch, 1991; Grossberg 1978). Non-repetition during reactivation is also a common assumption in replay models that regulate reactivation through inhibition or prioritization (Diekmann & Cheng 2023; Mattar & Daw 2018; Singh et al., 2022).”
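
      To make the point about spurious sequences concrete, here is a minimal sketch of sampling without replacement under a softmax choice rule. It is an illustration under simplifying assumptions, not the model's implementation: the item-to-item association matrix `assoc`, the `temperature` parameter, and the function name are hypothetical, and the full model retrieves items through context rather than through direct item-to-item weights.

```python
import math
import random

def replay_event(assoc, start, temperature=1.0, rng=None):
    """Generate one replay sequence in which no item is reactivated twice.
    assoc[i][j] is a hypothetical association strength from item i to item j."""
    rng = rng or random.Random()
    available = [i for i in range(len(assoc)) if i != start]
    sequence = [start]
    current = start
    while available:
        # Softmax over the items that have not yet been reactivated.
        scores = [assoc[current][j] / temperature for j in available]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        current = rng.choices(available, weights=weights, k=1)[0]
        sequence.append(current)
        available.remove(current)  # response suppression: remove from the pool
    return sequence

# Strong learned chain 0 -> 1 -> 2 -> 3 -> 4: replay follows the learned order.
chain = [[0.0] * 5 for _ in range(5)]
for i in range(4):
    chain[i][i + 1] = 50.0
print(replay_event(chain, 0, rng=random.Random(0)))

# No learned associations: every item appears once, but in arbitrary order.
flat = [[0.0] * 5 for _ in range(5)]
print(replay_event(flat, 0, rng=random.Random(0)))
```

      In this sketch the suppression rule only guarantees that each item appears once; any sequential order in the output comes from the association matrix.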

      (3) If I understand correctly, there are two ways in which novelty (i.e., less exposure) is accounted for in the model. The first and more talked about is the suppression mechanism (lines 639-646). The second is a change in learning rates (lines 593-595). It's unclear to me why both procedures are needed, how they differ, and whether these are two different mechanisms that the model implements. Also, since the authors controlled the extent to which each item was experienced during wakefulness, it's not entirely clear to me which of the simulations manipulated novelty on an individual item level, as described in lines 593-595 (if any).

      We agree that these mechanisms and their relationships would benefit from clarification. As noted, novelty influences learning through two distinct mechanisms. First, the suppression mechanism is essential for capturing the inverse relationship between the amount of wake experience and the frequency of replay, as observed in several studies. This mechanism ensures that items with high wake activity are less likely to dominate replay. Second, the decrease in learning rates with repetition is crucial for preserving the stochasticity of replay. Without this mechanism, the model would increase weights linearly, leading to an exponential increase in the probability of successive wake items being reactivated back-to-back due to the use of a softmax choice rule. This would result in deterministic replay patterns, which are inconsistent with experimental observations.

      We have revised the Methods section to explicitly distinguish these two mechanisms:

      “This experience-dependent suppression mechanism is distinct from the reduction of learning rates through repetition; it does not modulate the update of memory associations but exclusively governs which items are most likely to initiate replay.”

      We have also clarified our rationale for including a learning rate reduction mechanism:

      “The reduction in learning rates with repetition is important for maintaining a degree of stochasticity in the model’s replay during task repetition, since linearly increasing weights would, through the softmax choice rule, exponentially amplify differences in item reactivation probabilities, sharply reducing variability in replay.”
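
      The interaction between weight growth and the softmax rule can be illustrated numerically. The sketch below assumes a single learned successor competing against items with zero weight, a unit weight increment on the first repetition, and a geometric decay of the learning rate; all of these (including the names `successor_probability` and `rate_decay`) are hypothetical choices for illustration rather than the schedule used in the simulations.

```python
import math

def softmax(weights, temperature=1.0):
    """Numerically stable softmax over a list of weights."""
    top = max(weights)
    exps = [math.exp((w - top) / temperature) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def successor_probability(n_repetitions, rate_decay=1.0, n_competitors=9):
    """Probability of reactivating the learned successor after n repetitions.
    rate_decay = 1.0 keeps the learning rate constant (linear weight growth);
    rate_decay < 1.0 shrinks the increment on every repetition."""
    successor_weight = sum(rate_decay ** k for k in range(n_repetitions))
    return softmax([successor_weight] + [0.0] * n_competitors)[0]

for n in (1, 5, 10):
    constant = successor_probability(n)
    decaying = successor_probability(n, rate_decay=0.5)
    print(n, round(constant, 4), round(decaying, 4))
```

      With a constant rate, the odds of choosing the successor over any single competitor grow as exp(n), so replay becomes nearly deterministic after a handful of repetitions. With a decaying rate, the weight saturates and the choice stays stochastic.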

      Finally, we now specify exactly where the learning-rate reduction applies, namely in simulations where sequences are repeated across multiple sessions:

      “In this simulation, the learning rates progressively decrease across sessions, as described above.”

      As to the first mechanism - experience-based suppression - I find it challenging to think of a biological mechanism that would achieve this and is selectively activated immediately before sleep (somehow anticipating its onset). In fact, the prominent synaptic homeostasis hypothesis suggests that such suppression, at least on a synaptic level, is exactly what sleep itself does (i.e., prune or weaken synapses that were enhanced due to learning during the day). This begs the question of whether certain sleep stages (or ultradian cycles) may be involved in pruning, whereas others leverage its results for reactivation (e.g., a sequential hypothesis; Rasch & Born, 2013). That could be a compelling synthesis of this literature. Regardless of whether the authors agree, I believe that this point is a major caveat to the current model. It is addressed in the discussion, but perhaps it would be beneficial to explicitly state to what extent the results rely on the assumption of a pre-sleep suppression mechanism.

      We appreciate the reviewer raising this important point. Unlike the mechanism proposed by the synaptic homeostasis hypothesis, the suppression mechanism in our model does not suppress items based on synapse strength, nor does it modify synaptic weights. Instead, it determines the level of suppression for each item based on activity during awake experience. The brain could implement such a mechanism by tagging each item according to its activity level during wakefulness. During subsequent consolidation, the initial reactivation of an item during replay would reflect this tag, influencing how easily it can be reactivated.

      A related hypothesis has been proposed in recent work, suggesting that replay avoids recently active trajectories due to spike frequency adaptation in neurons (Mallory et al., 2024). Similarly, the suppression mechanism in our model is critical for explaining the observed negative relationship between the amount of recent wake experience and the degree of replay.

      We discuss the biological plausibility of this mechanism and its relationship with existing models in the Introduction. In the section titled “The influence of experience”, we have added the following:

      “Our model implements an activity‑dependent suppression mechanism that, at the onset of each offline replay event, assigns each item a selection probability inversely proportional to its activation during preceding wakefulness. The brain could implement this by tagging each memory trace in proportion to its recent activation; during consolidation, that tag would then regulate starting replay probability, making highly active items less likely to be reactivated. A recent paper found that replay avoids recently traversed trajectories through awake spike‑frequency adaptation (Mallory et al., 2025), which could implement this kind of mechanism. In our simulations, this suppression is essential for capturing the inverse relationship between replay frequency and prior experience. Note that, unlike the synaptic homeostasis hypothesis (Tononi & Cirelli 2006), which proposes that the brain globally downscales synaptic weights during sleep, this mechanism leaves synaptic weights unchanged and instead biases the selection process during replay.”
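
      As a rough illustration of this selection rule, the sketch below turns per-item wake activity into starting probabilities for a replay event by normalizing the inverse of each item's activity. The function name, the `eps` guard against division by zero, and the exact functional form are assumptions made for illustration; the manuscript's Methods define the form actually used.

```python
def initial_replay_probabilities(wake_activity, eps=1e-9):
    """Map each item's wake activity to a probability of initiating replay.
    Probabilities are inversely proportional to activity and sum to one.
    No association weights are read or modified here."""
    inverse = [1.0 / (a + eps) for a in wake_activity]
    total = sum(inverse)
    return [v / total for v in inverse]

# Three items with increasing amounts of wake experience.
probs = initial_replay_probabilities([1.0, 2.0, 4.0])
print([round(p, 3) for p in probs])
```

      The item with the least wake activity is the most likely to start a replay event, while the learned associations that determine what follows are left untouched.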

      (4) As the manuscript mentions, the only difference between sleep and wake in the model is the initial conditions (a0). This is an obvious simplification, especially given the last author's recent models discussing the very different roles of REM vs NREM. Could the authors suggest how different sleep stages may relate to the model or how it could be developed to interact with other successful models such as the ones the last author has developed (e.g., C-HORSE)? 

      We appreciate the encouragement to comment on the roles of different sleep stages in the manuscript, especially since, as noted, the lab is very interested in this and has explored it in other work. We chose to focus on NREM in this work because the vast majority of electrophysiological studies of sleep replay have identified these events during NREM. In addition, our lab’s theory of the role of REM (Singh et al., 2022, PNAS) is that it is a time for the neocortex to replay remote memories, in complement to the more recent memories replayed during NREM. The experiments we simulate all involve recent memories. Indeed, our view is that part of the reason that there is so little data on REM replay may be that experimenters are almost always looking for traces of recent memories (for good practical and technical reasons).

      Regarding the simplicity of the distinction between simulated wake and sleep replay, we view it as an asset of the model that it can account for many of the different characteristics of awake and NREM replay with very simple assumptions about differences in the initial conditions. There are of course many other differences between the states that could be relevant to the impact of replay, but the current target empirical data did not necessitate us taking those into account. This allows us to argue that differences in initial conditions should play a substantial role in an account of the differences between wake and sleep replay.

      We have added discussion of these ideas and how they might be incorporated into future versions of the model in the Discussion section:

      “Our current simulations have focused on NREM, since the vast majority of electrophysiological studies of sleep replay have identified replay events in this stage. We have proposed in other work that replay during REM sleep may provide a complementary role to NREM sleep, allowing neocortical areas to reinstate remote, already-consolidated memories that need to be integrated with the memories that were recently encoded in the hippocampus and replayed during NREM (Singh et al., 2022). An extension of our model could undertake this kind of continual learning setup, where the student but not teacher network retains remote memories, and the driver of replay alternates between hippocampus (NREM) and cortex (REM) over the course of a night of simulated sleep. Other differences between stages of sleep and between sleep and wake states are likely to become important for a full account of how replay impacts memory. Our current model parsimoniously explains a range of differences between awake and sleep replay by assuming simple differences in initial conditions, but we expect many more characteristics of these states (e.g., neural activity levels, oscillatory profiles, neurotransmitter levels, etc.) will be useful to incorporate in the future.”

      Finally, I wonder how the model would explain findings (including the authors') showing a preference for reactivation of weaker memories. The literature seems to suggest that it isn't just a matter of novelty or exposure, but encoding strength. Can the model explain this? Or would it require additional assumptions or some mechanism for selective endogenous reactivation during sleep and rest?

      We appreciate the encouragement to discuss this, as we do think the model could explain findings showing a preference for reactivation of weaker memories, as in Schapiro et al. (2018). In our framework, memory strength is reflected in the magnitude of each memory’s associated synaptic weights, so that stronger memories yield higher retrieved‑context activity during wake encoding than weaker ones. Because the model’s suppression mechanism reduces an item’s replay probability in proportion to its retrieved‑context activity, items with larger weights (strong memories) are more heavily suppressed at the onset of replay, while those with smaller weights (weaker memories) receive less suppression. When items have matched reward exposure, this dynamic would bias offline replay toward weaker memories.

      In the section titled “The influence of experience”, we updated a sentence to discuss this idea more explicitly: 

      “Such a suppression mechanism may be adaptive, allowing replay to benefit not only the most recently or strongly encoded items but also to provide opportunities for the consolidation of weaker or older memories, consistent with empirical evidence (e.g., Schapiro et al. 2018; Yu et al., 2024).”

      (5) Lines 186-200 - Perhaps I'm misunderstanding, but wouldn't it be trivial that an external cue at the end-item of Figure 7a would result in backward replay, simply because there is no potential for forward replay for sequences starting at the last item (there simply aren't any subsequent items)? The opposite is true, of course, for the first-item replay, which can't go backward. More generally, my understanding of the literature on forward vs backward replay is that neither is linked to the rodent's location. Both commonly happen at a resting station that is further away from the track. It seems as though the model's result may not hold if replay occurs away from the track (i.e. if a0 would be equal for both pre- and post-run).

      In studies where animals run back and forth on a linear track, replay events are decoded separately for left and right runs, identifying both forward and reverse sequences for each direction, for example using direction-specific place cell sequence templates. Accordingly, in our simulation of, e.g., Ambrose et al. (2016), we use two independent sequences, one for left runs and one for right runs (an approach that has been taken in prior replay modeling work). Crucially, our model assumes a context reset between running episodes, preventing the final item of one traversal from acquiring contextual associations with the first item of the next. As a result, learning in the two sequences remains independent, and when an external cue is presented at the track’s end, replay predominantly unfolds in the backward direction, only occasionally producing forward segments when the cue briefly reactivates an earlier sequence item before proceeding forward.

      We added a note to the section titled “The context-dependency of memory replay” to clarify this:

      “In our model, these patterns are identical to those in our simulation of Ambrose et al. (2016), which uses two independent sequences to mimic the two run directions. This is because the drifting context resets before each run sequence is encoded, with the pause between runs acting as an event boundary that prevents the final item of one traversal from associating with the first item of the next, thereby keeping learning in each direction independent.”

      To our knowledge, no study has observed a similar asymmetry when animals are fully removed from the track, although both types of replay can be observed when animals are away from the track. For example, Gupta et al. (2010) demonstrated that when animals replay trajectories far from their current location, the ratio of forward vs. backward replay appears more balanced. We now highlight this result in the manuscript and explain how it aligns with the predictions of our model:

      “For example, in tasks where the goal is positioned in the middle of an arm rather than at its end, CMR-replay predicts a more balanced ratio of forward and reverse replay, whereas the EVB model still predicts a dominance of reverse replay due to backward gain propagation from the reward. This contrast aligns with empirical findings showing that when the goal is located in the middle of an arm, replay events are more evenly split between forward and reverse directions (Gupta et al., 2010), whereas placing the goal at the end of a track produces a stronger bias toward reverse replay (Diba & Buzsaki 2007).” 

      Although no studies, to our knowledge, have observed a context-dependent asymmetry between forward and backward replay when the animal is away from the track, our model does posit conditions under which it could. Specifically, it predicts that deliberation on a specific memory, such as during planning, could generate an internal context input that biases replay: actively recalling the first item of a sequence may favor forward replay, while thinking about the last item may promote backward replay, even when the individual is physically distant from the track.

      We now discuss this prediction in the section titled “The context-dependency of memory replay”:

      “Our model also predicts that deliberation on a specific memory, such as during planning, could serve to elicit an internal context cue that biases replay: actively recalling the first item of a sequence may favor forward replay, while thinking about the last item may promote backward replay, even when the individual is physically distant from the track. While not explored here, this mechanism presents a potential avenue for future modeling and empirical work.”

      (6) The manuscript describes a study by Bendor & Wilson (2012) and tightly mimics their results. However, notably, that study did not find triggered replay immediately following sound presentation, but rather a general bias toward reactivation of the cued sequence over longer stretches of time. In other words, it seems that the model's results don't fully mirror the empirical results. One idea that came to mind is that perhaps it is the R/L context - not the first R/L item - that is cued in this study. This is in line with other TMR studies showing what may be seen as contextual reactivation. If the authors think that such a simulation may better mirror the empirical results, I encourage them to try. If not, however, this limitation should be discussed.

      Although our model predicts that replay is triggered immediately by the sound cue, it also predicts a sustained bias toward the cued sequence. Replay in our model unfolds across the rest phase as multiple successive events, so the bias observed in our sleep simulations indeed reflects a prolonged preference for the cued sequence.

      We now discuss this issue, acknowledging the discrepancy:

      “Bendor and Wilson (2012) found that sound cues during sleep did not trigger immediate replay, but instead biased reactivation toward the cued sequence over an extended period of time. While the model does exhibit some replay triggered immediately by the cue, it also captures the sustained bias toward the cued sequence over an extended period.”

      Regarding the suggestion that the R/L context, rather than the first R/L item, is what is cued: within this framework, context is modeled as a weighted average of the features associated with items. As a result, cueing the model with the first R/L item produces qualitatively similar outcomes to cueing it with a more extended R/L cue that incorporates features of additional items. This is because both approaches ultimately use context features unique to the two sides.

      (7) There is some discussion about replay's benefit to memory. One point of interest could be whether this benefit changes between wake and sleep. Relatedly, it would be interesting to see whether the proportion of forward replay, backward replay, or both correlated with memory benefits. I encourage the authors to extend the section on the function of replay and explore these questions.

      We thank the reviewer for this suggestion. Regarding differences in the contribution of wake and sleep to memory, our current simulations predict that compared to rest in the task environment, sleep is less biased toward initiating replay at specific items, leading to a more uniform benefit across all memories. Regarding the contributions of forward and backward replay, our model predicts that both strengthen bidirectional associations between items and contexts, benefiting memory in qualitatively similar ways. Furthermore, we suggest that the offline learning captured by our teacher-student simulations reflects consolidation processes that are specific to sleep.

      We have expanded the section titled “The influence of experience” to discuss these predictions of the model:

      “The results outlined above arise from the model's assumption that replay strengthens bidirectional associations between items and contexts to benefit memory. This assumption leads to several predictions about differences across replay types. First, the model predicts that sleep yields different memory benefits compared to rest in the task environment: Sleep is less biased toward initiating replay at specific items, resulting in a more uniform benefit across all memories. Second, the model predicts that forward and backward replay contribute to memory in qualitatively similar ways but tend to benefit different memories. This divergence arises because forward and backward replay exhibit distinct item preferences, with backward replay being more likely to include rewarded items, thereby preferentially benefiting those memories.”

      We also updated the “The function of replay” section to include our teacher-student speculation:

      “We speculate that the offline learning observed in these simulations corresponds to consolidation processes that operate specifically during sleep, when hippocampal-neocortical dynamics are especially tightly coupled (Klinzing et al., 2019).”

      (8) Replay has been mostly studied in rodents, with few exceptions, whereas CMR and similar models have mostly been used in humans. Although replay is considered a good model of episodic memory, it is still limited due to limited findings of sequential replay in humans and its reliance on very structured and inherently autocorrelated items (i.e., place fields). I'm wondering if the authors could speak to the implications of those limitations on the generalizability of their model. Relatedly, I wonder if the model could or does lead to generalization to some extent in a way that would align with the complementary learning systems framework.

      We appreciate these insightful comments. Traditionally, replay studies have focused on spatial tasks with autocorrelated item representations (e.g., place fields). However, an increasing number of human studies have demonstrated sequential replay using stimuli with distinct, unrelated representations. Our model is designed to accommodate both scenarios. In our current simulations, we employ orthogonal item representations while leveraging a shared, temporally autocorrelated context to link successive items. We anticipate that incorporating autocorrelated item representations would further enhance sequence memory by increasing the similarity between successive contexts. Overall, we believe that the model generalizes across a broad range of experimental settings, regardless of the degree of autocorrelation between items. Moreover, the underlying framework has been successfully applied to sequential memory in both spatial domains, where it explains place cell firing properties (e.g., Howard et al., 2004), and non-spatial domains, such as free recall experiments where items are arbitrarily related.

      In the section titled “A context model of memory replay”, we added this comment to address this point:

      “Its contiguity bias stems from its use of shared, temporally autocorrelated context to link successive items, despite the orthogonal nature of individual item representations. This bias would be even stronger if items had overlapping representations, as observed in place fields.”

      Since CMR-replay learns distributed context representations where overlap across context vectors captures associative structure, and replay helps strengthen that overlap, this could indeed be viewed as consonant with complementary learning systems integration processes. 

      Reviewer #2 (Public Review):

      This manuscript proposes a model of replay that focuses on the relation between an item and its context, without considering the value of the item. The model simulates awake learning, awake replay, and sleep replay, and demonstrates parallels between memory phenomena driven by encoding strength, replay of sequence learning, and activation of nearest neighbor to infer causality. There is some discussion of the importance of suppression/inhibition in preventing only dominant memories from being replayed, potentially boosting memories that are weakly encoded. Very nice replications of several key replay findings including the effect of reward and remote replay, demonstrating the equally salient cue of context for offline memory consolidation.

      I have no suggestions for the main body of the study, including methods and simulations, as the work is comprehensive, transparent, and well-described. However, I would like to understand how the CMR-replay model fits with the current understanding of the importance of excitation vs inhibition, remembering vs forgetting, activation vs deactivation, strengthening vs elimination of synapses, and even NREM vs REM as Schapiro has modeled. There seems to be a strong association with the efforts of the model to instantiate a memory as well as how that reinstantiation changes across time. But that is not all there is to consolidation. The specific roles of different brain states and how they might change replay are also an important consideration.

      We are gratified that the reviewer appreciated the work, and we agree that the paper would benefit from comment on the connections to these other features of consolidation.

      Excitation vs. inhibition: CMR-replay does not model variations in the excitation-inhibition balance across brain states (as in other models, e.g., Chenkov et al., 2017), since it does not include inhibitory connections. However, we posit that the experience-dependent suppression mechanism in the model might, in the brain, involve inhibitory processes. Supporting this idea, studies have observed increased inhibition with task repetition (Berners-Lee et al., 2022). We hypothesize that such mechanisms may underlie the observed inverse relationship between task experience and replay frequency in many studies. We discuss this in the section titled “A context model of memory replay”:

      “The proposal that a suppression mechanism plays a role in replay aligns with models that regulate place cell reactivation via inhibition (Malerba et al., 2016) and with empirical observations of increased hippocampal inhibitory interneuron activity with experience (Berners-Lee et al., 2022). Our model assumes the presence of such inhibitory mechanisms but does not explicitly model them.”

      Remembering/forgetting, activation/deactivation, and strengthening/elimination of synapses: The model does not simulate synaptic weight reduction or pruning, so it does not forget memories through the weakening of associated weights. However, forgetting can occur when a memory is replayed less frequently than others, leading to reduced activation of that memory compared to its competitors during context-driven retrieval. In the Discussion section, we acknowledge that a biologically implausible aspect of our model is that it implements only synaptic strengthening: 

      “Aspects of the model, such as its lack of regulation of the cumulative positive weight changes that can accrue through repeated replay, are biologically implausible (as biological learning results in both increases and decreases in synaptic weights) and limit the ability to engage with certain forms of low level neural data (e.g., changes in spine density over sleep periods; de Vivo et al., 2017; Maret et al., 2011). It will be useful for future work to explore model variants with more elements of biological plausibility.”

      Different brain states and NREM vs REM: Reviewer 1 also raised this important issue (see above). We have added the following thoughts on differences between these states and the relationship to our prior work to the Discussion section:

      “Our current simulations have focused on NREM, since the vast majority of electrophysiological studies of sleep replay have identified replay events in this stage. We have proposed in other work that replay during REM sleep may provide a complementary role to NREM sleep, allowing neocortical areas to reinstate remote, already-consolidated memories that need to be integrated with the memories that were recently encoded in the hippocampus and replayed during NREM (Singh et al., 2022). An extension of our model could undertake this kind of continual learning setup, where the student but not teacher network retains remote memories, and the driver of replay alternates between hippocampus (NREM) and cortex (REM) over the course of a night of simulated sleep. Other differences between stages of sleep and between sleep and wake states are likely to become important for a full account of how replay impacts memory. Our current model parsimoniously explains a range of differences between awake and sleep replay by assuming simple differences in initial conditions, but we expect many more characteristics of these states (e.g., neural activity levels, oscillatory profiles, neurotransmitter levels, etc.) will be useful to incorporate in the future.”

      We hope these points clarify the model’s scope and its potential for future extensions.

      Do the authors suggest that these replay systems are more universal to offline processes beyond episodic memory? What about procedural memories and working memory?

      We thank the reviewer for raising this important question. We have clarified in the manuscript:

      “We focus on the model as a formulation of hippocampal replay, capturing how the hippocampus may replay past experiences through simple and interpretable mechanisms.”

      With respect to other forms of memory, we now note that:

      “This motor memory simulation using a model of hippocampal replay is consistent with evidence that hippocampal replay can contribute to consolidating memories that are not hippocampally dependent at encoding (Schapiro et al., 2019; Sawangjit et al., 2018). It is possible that replay in other, more domain-specific areas could also contribute (Eichenlaub et al., 2020).”

      Though this is not a biophysical model per se, can the authors speak to the neuromodulatory milieus that give rise to the different types of replay?

      Our work aligns with the perspective proposed by Hasselmo (1999), which suggests that waking and sleep states differ in the degree to which hippocampal activity is driven by external inputs. Specifically, high acetylcholine levels during waking bias activity to flow into the hippocampus, while low acetylcholine levels during sleep allow hippocampal activity to influence other brain regions. Consistent with this view, our model posits that wake replay is more biased toward items associated with the current resting location due to the presence of external input during waking states. In the Discussion section, we have added a comment on this point:

      “Our view aligns with the theory proposed by Hasselmo (1999), which suggests that the degree of hippocampal activity driven by external inputs differs between waking and sleep states: High acetylcholine levels during wakefulness bias activity into the hippocampus, while low acetylcholine levels during slow-wave sleep allow hippocampal activity to influence other brain regions.”

      Reviewer #3 (Public Review):

      In this manuscript, Zhou et al. present a computational model of memory replay. Their model (CMR-replay) draws from temporal context models of human memory (e.g., TCM, CMR) and claims replay may be another instance of a context-guided memory process. During awake learning, CMR-replay (like its predecessors) encodes items alongside a drifting mental context that maintains a recency-weighted history of recently encoded contexts/items. In this way, the presently encoded item becomes associated with other recently learned items via their shared context representation - giving rise to typical effects in recall such as primacy, recency, and contiguity. Unlike its predecessors, CMR-replay has built-in replay periods. These replay periods are designed to approximate sleep or wakeful quiescence, in which an item is spontaneously reactivated, causing a subsequent cascade of item-context reactivations that further update the model's item-context associations.

      Using this model of replay, Zhou et al. were able to reproduce a variety of empirical findings in the replay literature: e.g., greater forward replay at the beginning of a track and more backward replay at the end; more replay for rewarded events; the occurrence of remote replay; reduced replay for repeated items, etc. Furthermore, the model diverges considerably (in implementation and predictions) from other prominent models of replay that, instead, emphasize replay as a way of predicting value from a reinforcement learning framing (i.e., EVB, expected value backup).

      Overall, I found the manuscript clear and easy to follow, despite not being a computational modeller myself. (Which is pretty commendable, I'd say). The model also was effective at capturing several important empirical results from the replay literature while relying on a concise set of mechanisms - which will have implications for subsequent theory-building in the field.

      With respect to weaknesses, additional details for some of the methods and results would help the readers better evaluate the data presented here (e.g., explicitly defining how the various 'proportion of replay' DVs were calculated).

      For example, for many of the simulations, the y-axis scale differs from the empirical data despite using comparable units, like the proportion of replay events (e.g., Figures 1B and C). Presumably, this was done to emphasize the similarity between the empirical and model data. But, as a reader, I often found myself doing the mental manipulation myself anyway to better evaluate how the model compared to the empirical data. Please consider using comparable y-axis ranges across empirical and simulated data wherever possible.

      We appreciate this point. As in many replay modeling studies, our primary goal is to provide a qualitative fit that demonstrates the general direction of differences between our model and empirical data, without engaging in detailed parameter fitting for a precise quantitative fit. Still, we agree that where possible, it is useful to better match the axes. We have updated figures 2B and 2C so that the y-axis scales are more directly comparable between the empirical and simulated data. 

      In a similar vein to the above point, while the DVs in the simulations/empirical data made intuitive sense, I wasn't always sure precisely how they were calculated. Consider the "proportion of replay" in Figure 1A. In the Methods (perhaps under Task Simulations), it should specify exactly how this proportion was calculated (e.g., proportions of all replay events, both forwards and backwards, combining across all simulations from Pre- and Post-run rest periods). In many of the examples, the proportions seem to possibly sum to 1 (e.g., Figure 1A), but in other cases, this doesn't seem to be true (e.g., Figure 3A). More clarity here is critical to help readers evaluate these data. Furthermore, sometimes the labels themselves are not the most informative. For example, in Figure 1A, the y-axis is "Proportion of replay" and in 1C it is the "Proportion of events". I presumed those were the same thing - the proportion of replay events - but it would be best if the axis labels were consistent across figures in this manuscript when they reflect the same DV.

      We appreciate these useful suggestions. We have revised the Methods section to explain in detail how DVs are calculated for each simulation. The revisions clarify the differences between related measures, such as those shown in Figures 1A and 1C, so that readers can more easily see how the DVs are defined and interpreted in each case. 

      Reviewer #4/Reviewing Editor (Public Review):

      Summary:

      With their 'CMR-replay' model, Zhou et al. demonstrate that the use of spontaneous neural cascades in a context-maintenance and retrieval (CMR) model significantly expands the range of captured memory phenomena.

      Strengths:

      The proposed model compellingly outperforms its CMR predecessor and, thus, makes important strides towards understanding the empirical memory literature, as well as highlighting a cognitive function of replay.

      Weaknesses:

      Competing accounts of replay are acknowledged but there are no formal comparisons and only CMR-replay predictions are visualized. Indeed, other than the CMR model, only one alternative account is given serious consideration: A variant of the 'Dyna-replay' architecture, originally developed in the machine learning literature (Sutton, 1990; Moore & Atkeson, 1993) and modified by Mattar et al (2018) such that previously experienced event-sequences get replayed based on their relevance to future gain. Mattar et al acknowledged that a realistic Dyna-replay mechanism would require a learned representation of transitions between perceptual and motor events, i.e., a 'cognitive map'. While Zhou et al. note that the CMR-replay model might provide such a complementary mechanism, they emphasize that their account captures replay characteristics that Dyna-replay does not (though it is unclear to what extent the reverse is also true).

      We thank the reviewer for these thoughtful comments and appreciate the opportunity to clarify our approach. Our goal in this work is to contrast two dominant perspectives in replay research: replay as a mechanism for learning reward predictions and replay as a process for memory consolidation. These models were chosen as representatives of their classes of models because they use simple and interpretable mechanisms that can simulate a wide range of replay phenomena, making them ideal for contrasting these two perspectives.

      Although we implemented CMR-replay as a straightforward example of the memory-focused view, we believe the proposed mechanisms could be extended to other architectures, such as recurrent neural networks, to produce similar results. We now discuss this possibility in the revised manuscript (see below). However, given our primary goal of providing a broad and qualitative contrast of these two broad perspectives, we decided not to undertake simulations with additional individual models for this paper.

      Regarding the Mattar & Daw model, it is true that a mechanistic implementation would require a mechanism that avoids precomputing priorities before replay. However, the "need" component of their model already incorporates learned expectations of transitions between actions and events. Thus, the model's limitations are not due to the absence of a cognitive map.

      In contrast, while CMR-replay also accumulates memory associations that reflect experienced transitions among events, it generates several qualitatively distinct predictions compared to the Mattar & Daw model. As we note in the manuscript, these distinctions make CMR-replay a contrasting rather than complementary perspective.

      Another important consideration, however, is how CMR-replay compares to alternative mechanistic accounts of cognitive maps. For example, Recurrent Neural Networks are adept at detecting spatial and temporal dependencies in sequential input; these networks are being increasingly used to capture psychological and neuroscientific data (e.g., Zhang et al, 2020; Spoerer et al, 2020), including hippocampal replay specifically (Haga & Fukai, 2018). Another relevant framework is provided by Associative Learning Theory, in which bidirectional associations between static and transient stimulus elements are commonly used to explain contextual and cue-based phenomena, including associative retrieval of absent events (McLaren et al, 1989; Harris, 2006; Kokkola et al, 2019). Without proper integration with these modeling approaches, it is difficult to gauge the innovation and significance of CMR-replay, particularly since the model is applied post hoc to the relatively narrow domain of rodent maze navigation.

      First, we would like to clarify that our principal aim in this work is to characterize the nature of replay, rather than to model cognitive maps per se. Accordingly, CMR-replay is not designed to simulate head-direction signals, perform path integration, or explain the spatial firing properties of neurons during navigation. Instead, it focuses squarely on sequential replay phenomena, simulating classic rodent maze reactivation studies and human sequence-learning tasks. These simulations span a broad array of replay experimental paradigms to ensure extensive coverage of the replay findings reported across the literature. As such, the contribution of this work is in explaining the mechanisms and functional roles of replay, and demonstrating that a model that employs simple and interpretable memory mechanisms not only explains replay phenomena traditionally interpreted through a value-based lens but also accounts for findings not addressed by other memory-focused models.

      As the reviewer notes, CMR-replay shares features with other memory-focused models. However, to our knowledge, none of these related approaches have yet captured the full suite of empirical replay phenomena, suggesting the combination of mechanisms employed in CMR-replay is essential for explaining these phenomena. In the Discussion section, we now discuss the similarities between CMR-replay and related memory models and the possibility of integrating these approaches:

      “Our theory builds on a lineage of memory-focused models, demonstrating the power of this perspective in explaining phenomena that have often been attributed to the optimization of value-based predictions. In this work, we focus on CMR-replay, which exemplifies the memory-centric approach through a set of simple and interpretable mechanisms that we believe are broadly applicable across memory domains. Elements of CMR-replay share similarities with other models that adopt a memory-focused perspective. The model learns distributed context representations whose overlaps encode associations among items, echoing associative learning theories in which overlapping patterns capture stimulus similarity and learned associations (McLaren & Mackintosh 2002). Context evolves through bidirectional interactions between items and their contextual representations, mirroring the dynamics found in recurrent neural networks (Haga & Fukai 2018; Levenstein et al., 2024). However, these related approaches have not been shown to account for the present set of replay findings and lack mechanisms, such as reward-modulated encoding and experience-dependent suppression, that our simulations suggest are essential for capturing these phenomena. While not explored here, we believe these mechanisms could be integrated into architectures like recurrent neural networks (Levenstein et al., 2024) to support a broader range of replay dynamics.”

      Recommendations For The Authors

      Reviewer #1 (Recommendations For The Authors):

      (1) Lines 94-96: These lines may be better positioned earlier in the paragraph.

      We now introduce these lines earlier in the paragraph.

      (2) Line 103 - It's unclear to me what is meant by the statement that "the current context contains contexts associated with previous items". I understand why a slowly drifting context will coincide and therefore link with multiple items that progress rapidly in time, so multiple items will be linked to the same context and each item will be linked to multiple contexts. Is that the idea conveyed here or am I missing something? I'm similarly confused by line 129, which mentions that a context is updated by incorporating other items' contexts. How could a context contain other contexts?

      In the model, each item has an associated context that can be retrieved via Mfc. This is true even before learning, since Mfc is initialized as an identity matrix. During learning and replay, we have a drifting context c that is updated each time an item is presented. At each timestep, the model first retrieves the current item’s associated context cf by Mfc, and incorporates it into c. Equation #2 in the Methods section illustrates this procedure in detail. Because of this procedure, the drifting context c is a weighted sum of past items’ associated contexts. 

      We recognize that these descriptions can be confusing. We have updated the Results section to better distinguish the drifting context from items’ associated context. For example, we note that:

      “We represent the drifting context during learning and replay with c and an item's associated context with cf.”

      We have also updated our description of the context drift procedure to distinguish these two quantities: 

      “During awake encoding of a sequence of items, for each item f, the model retrieves its associated context cf via Mfc. The drifting context c incorporates the item's associated context cf and downweights its representation of previous items' associated contexts (Figure 1c). Thus, the context layer maintains a recency weighted sum of past and present items' associated contexts.”
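
      The drift procedure described in this response can be sketched in a few lines. This is a minimal illustration, not the published simulation code: the function names, the extra "pre-list" context unit, the value beta = 0.5, and the unit-length normalization via rho (the form commonly used in retrieved-context models) are all assumptions, since Equation 2 of the Methods is not reproduced here.

```python
import math

def drift_context(c, c_f, beta):
    """Blend an item's associated context c_f into the drifting context c.

    Assumes c and c_f both have unit length; rho is chosen so the
    updated context also has unit length (assumed normalization rule).
    """
    dot = sum(a * b for a, b in zip(c, c_f))
    rho = math.sqrt(1.0 + beta**2 * (dot**2 - 1.0)) - beta * dot
    return [rho * a + beta * b for a, b in zip(c, c_f)]

def encode_sequence(n_items=4, beta=0.5):
    dim = n_items + 1  # one extra unit for a hypothetical pre-list context
    # M_fc starts as an identity map: item i retrieves context unit i.
    M_fc = [[1.0 if row == col else 0.0 for col in range(n_items)]
            for row in range(dim)]
    c = [0.0] * n_items + [1.0]  # drifting context before the first item
    for item in range(n_items):
        c_f = [M_fc[row][item] for row in range(dim)]  # item's associated context
        c = drift_context(c, c_f, beta)
    return c

c = encode_sequence()
print([round(x, 3) for x in c])
```

      After the last item, the first four entries of c are the weights on each item's associated context. Under these assumptions they grow from the earliest to the most recent item, which is the sense in which c is a recency-weighted sum.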

      (3) Figure 1b and 1d - please clarify which axis in the association matrices represents the item and the context.

      We have added labels to show what the axes represent in Figure 1.

      (4) The terms "experience" and "item" are used interchangeably and it may be best to stick to one term.

      We now use the term “item” wherever we describe the model results. 

      (5) The manuscript describes Figure 6 ahead of earlier figures - the authors may want to reorder their figures to improve readability.

      We appreciate this suggestion. We decided to keep the current figure organization since it allows us to group results into different themes and avoid redundancy. 

      (6) Lines 662-664 are repeated with a different ending, this is likely an error.

      We have fixed this error.

      Reviewer #3 (Recommendations For The Authors):

      Below, I have outlined some additional points that came to mind in reviewing the manuscript - in no particular order.

      (1) Figure 1: I found the ordering of panels a bit confusing in this figure, as the reading direction changes a couple of times in going from A to F. Would perhaps putting panel C in the bottom left corner and then D at the top right, with E and F below (also on the right) work?

      We agree that this improves the figure. We have restructured the ordering of panels in this figure. 

      (2) Simulation 1: When reading the intro/results for the first simulation (Figure 2a; Diba & Buzsáki, 2007; "When animals traverse a linear track...", page 6, line 186), it wasn't clear to me why pre-run rest would have any forward replay, particularly if pre-run implied that the animal had no experience with the track yet. But in the Methods this becomes clearer, as the model encodes the track eight times prior to the rest periods. Making this explicit in the text would make it easier to follow. Also, was there any reason why specifically eight sessions of awake learning, in particular, were used?

      We now make more explicit that the animals have experience with the track before pre-run rest recording:

      “Animals first acquire experience with a linear track by traversing it to collect a reward. Then, during the pre-run rest recording, forward replay predominates.”

      We included eight sessions of awake learning to match the number of sessions in Shin et al. (2019), since this simulation attempts to explain data from that study. After each repetition, the model engages in rest. We have revised the Methods section to indicate the motivation for this choice: 

      “In the simulation that examines context-dependent forward and backward replay through experience (Figs. 2a and 5a), CMR-replay encodes an input sequence shown in Fig. 7a, which simulates a linear track run with no ambiguity in the direction of inputs, over eight awake learning sessions (as in Shin et al. 2019)”

      (3) Frequency of remote replay events: In the simulation based on Gupta et al, how frequently overall does remote replay occur? In the main text, the authors mention the mean frequency with which shortcut replay occurs (i.e., the mean proportion of replay events that contain a shortcut sequence = 0.0046), which was helpful. But, it also made me wonder about the likelihood of remote replay events. I would imagine that remote replay events are infrequent as well - given that it is considerably more likely to replay sequences from the local track, given the recency-weighted mental context. Reporting the above mean proportion for remote and local replay events would be helpful context for the reader.

      In Figure 4c, we report the proportion of remote replay in the two experimental conditions of Gupta et al. that we simulate. 

      (4) Point of clarification re: backwards replay: Is backwards replay less likely to occur than forward replay overall because of the forward asymmetry associated with these models? For example, for a backwards replay event to occur, the context would need to drift backwards at least five times in a row, in spite of a higher probability of moving one step forward at each of those steps. Am I getting that right?

      The reviewer’s interpretation is correct: CMR-replay is more likely to produce forward than backward replay in sleep because of its forward asymmetry. We note that this forward asymmetry leads to a high likelihood of forward replay in the section titled “The context-dependency of memory replay”: 

      “As with prior retrieved context models (Howard & Kahana 2002; Polyn et al., 2009), CMR-replay encodes stronger forward than backward associations. This asymmetry exists because, during the first encoding of a sequence, an item's associated context contributes only to its ensuing items' encoding contexts. Therefore, after encoding, bringing back an item's associated context is more likely to reactivate its ensuing than preceding items, leading to forward asymmetric replay (Fig. 6d left).”
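
      The asymmetry argument in this passage can be checked with a toy calculation. The sketch below is a deliberate simplification and not the authors' implementation: it holds the item-to-context map at identity, learns only a Hebbian context-to-item matrix (called M_cf here, a hypothetical name), and uses an assumed drift rate of 0.5. Under those assumptions, cueing with one item's associated context supports that item and the items that followed it, and gives no support to the items that preceded it.

```python
import math

def drift(c, c_f, beta):
    # Unit-length context update (assumed retrieved-context-style rule).
    dot = sum(a * b for a, b in zip(c, c_f))
    rho = math.sqrt(1.0 + beta**2 * (dot**2 - 1.0)) - beta * dot
    return [rho * a + beta * b for a, b in zip(c, c_f)]

def cue_support(n_items=5, beta=0.5, cue=2):
    dim = n_items + 1                      # extra unit: hypothetical pre-list context
    c = [0.0] * n_items + [1.0]
    # M_cf[j][k]: how strongly context unit k supports reactivating item j.
    M_cf = [[0.0] * dim for _ in range(n_items)]
    for item in range(n_items):
        c_f = [1.0 if k == item else 0.0 for k in range(dim)]  # identity item-to-context map
        c = drift(c, c_f, beta)
        for k in range(dim):
            M_cf[item][k] += c[k]          # bind the item to its encoding context
    # Cue with the associated context of item `cue` (one-hot under the identity map).
    return [M_cf[j][cue] for j in range(n_items)]

support = cue_support()
print([round(s, 3) for s in support])
```

      In the full model, backward associations also arise through learning, so preceding items receive weaker rather than zero support. The sketch isolates only the first-encoding effect described in the quoted passage.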

      (5) On terminating a replay period: "At any t, the replay period ends with a probability of 0.1 or if a task-irrelevant item is reactivated." (Figure 1 caption; see also pg 18, line 635). How was the 0.1 decided upon? Also, could you please add some detail as to what a 'task-irrelevant item' would be? From what I understood, the model only learns sequences that represent the points in a track - wouldn't all the points in the track be task-relevant?

      This value was arbitrarily chosen as a small value that allows probabilistic stopping. It was not motivated by prior modeling or a systematic search. We have added: “At each timestep, the replay period ends either with a stop probability of 0.1 or if a task-irrelevant item becomes reactivated. (The choice of the value 0.1 was arbitrary; future work could explore the implications of varying this parameter).” 

      In addition, we now explain in the paper that task-irrelevant items “do not appear as inputs during awake encoding, but compete with task-relevant items for reactivation during replay, simulating the idea that other experiences likely compete with current experiences during periods of retrieval and reactivation.”
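
      The two stopping conditions can be illustrated with a short simulation. This is a sketch under stated assumptions rather than the model itself: the uniform choice among candidate items stands in for the model's context-driven reactivation, the stop check is assumed to come before each reactivation, and the function name, item labels, and max_steps safeguard are hypothetical.

```python
import random

def replay_period(task_items, irrelevant_items, p_stop=0.1, max_steps=1000, rng=None):
    """Return the task items reactivated in one simulated replay period."""
    rng = rng or random.Random()
    candidates = task_items + irrelevant_items
    replayed = []
    for _ in range(max_steps):
        # Condition 1: probabilistic stopping at each timestep.
        if rng.random() < p_stop:
            break
        # Placeholder for context-driven reactivation: pick uniformly.
        item = rng.choice(candidates)
        # Condition 2: reactivating a task-irrelevant item ends the period.
        if item in irrelevant_items:
            break
        replayed.append(item)
    return replayed

print(replay_period(["A", "B", "C"], ["X"], rng=random.Random(0)))
```

      With probabilistic stopping alone (no task-irrelevant items and an uncapped loop), the number of reactivations before the period ends is geometric with mean (1 - 0.1) / 0.1 = 9, so a stop probability of 0.1 yields replay periods of roughly nine items on average. Competition from task-irrelevant items shortens them further.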

      (6) Minor typos:

      Turn all instances of "nonlocal" into "non-local", or vice versa

      "For rest at the end of a run, cexternal is the context associated with the final item in the sequence. For rest at the end of a run, cexternal is the context associated with the start item." (pg 20, line 663) - I believe this is a typo and that the second sentence should begin with "For rest at the START of a run".

      We have updated the manuscript to correct these typos. 

      (7) Code availability: I may have missed it, but it doesn't seem like the code is currently available for these simulations. Including the commented code in a public repository (Github, OSF) would be very useful in this case.

      We now include a Github link to our simulation code: https://github.com/schapirolab/CMR-replay.

    1. eLife Assessment

      This study combines genetic, cell biological, and interaction data to propose a model of meiotic double-strand break regulation in C. elegans. Solid evidence supports the main conclusions, while by nature of a screening-type study, more may be needed to solidify speculations in future studies. Yet, comprehensive cataloging of the physical and genetic interactions of factors required for meiotic double-strand break formation is useful information for the field.

    2. Joint Public Review:

      Meiotic recombination begins with DNA double-strand breaks (DSBs) generated by the conserved enzyme Spo11, which relies on several accessory factors that vary widely across eukaryotes. In C. elegans, multiple proteins have been implicated in promoting DSB formation, but their functional relationships and how they collectively recruit the DSB machinery to chromosome axes have remained unclear.

      In this study, Raices et al. investigate the biochemical and genetic interactions among known DSB-promoting factors in C. elegans meiosis. Using yeast two-hybrid assays and co-immunoprecipitation, they map pairwise protein interactions and identify a connection between the chromatin-associated protein HIM-17 and the transcription factor XND-1. They also confirm the established interaction between DSB-1 and SPO-11 and show that DSB-1 associates with the nematode-specific factor HIM-5, which is required for X-chromosome DSB formation.

      The authors extend these findings with genetic analyses, placing these factors into four epistasis groups based on single- and double-mutant phenotypes. Together, these biochemical and genetic data support a model describing how these proteins engage chromatin loops and localize to chromosome axes. The work provides a clearer view of how C. elegans assembles its DSB-forming machinery and how this process compares to mechanisms in other organisms.

      Comment from the Reviewing Editor on the revised version:

      The authors have adequately addressed the prior review comments. At this point, after going through multiple rounds of reviews and revisions, the community will be better served by having this paper out in public. This version was assessed by the editors without further input from the reviewers.

    3. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary: 

      The manuscript by Raices et al., provides some novel insights into the role and interactions between SPO-11 accessory proteins in C. elegans. The authors propose a model of meiotic DSBs regulation, critical to our understanding of DSB formation and ultimately crossover regulation and accurate chromosome segregation. The work also emphasizes the commonalities and species-specific aspects of DSB regulation. 

      Strengths: 

      This study capitalizes on the strengths of the C. elegans system to uncover genetic interactions between SPO-11 accessory proteins. In combination with physical interactions, the authors synthesize their findings into a model, which will serve as the basis for future work, to determine mechanisms of DSB regulation. 

      Weaknesses: 

      The methodology, although standard, still lacks some rigor, especially with the IPs. 

      Reviewer #2 (Public review): 

      Summary: 

      Meiotic recombination initiates with DNA double-strand break (DSB) formation, catalyzed by the conserved topoisomerase-like enzyme Spo11. Spo11 requires accessory factors that are poorly conserved across eukaryotes. Previous genetic studies have identified several proteins required for DSB formation in C. elegans to varying degrees; however, how these proteins interact with each other to recruit the DSB-forming machinery to chromosome axes remains unclear. 

      In this study, Raices et al. characterized the biochemical and genetic interactions among proteins that are known to promote DSB formation during C. elegans meiosis. The authors examined pairwise interactions using yeast two-hybrid (Y2H) and co-immunoprecipitation and revealed an interaction between a chromatin-associated protein HIM-17 and a transcription factor XND-1. They further confirmed the previously known interaction between DSB-1 and SPO-11 and showed that DSB-1 also interacts with the nematode-specific factor HIM-5, which is essential for DSB formation on the X chromosome. They also assessed genetic interactions among these proteins, categorizing them into four epistasis groups by comparing phenotypes in double vs. single mutants. Combining these results, the authors proposed a model of how these proteins interact with chromatin loops and are recruited to chromosome axes, offering insights into the process in C. elegans compared to other organisms. 

      Weaknesses: 

      This work relies heavily on Y2H, which is notorious for having high rates of false positives and false negatives. Although the interactions between HIM-17 and XND-1 and between DSB-1 and HIM-5 were validated by co-IP, the significance of these interactions was not tested in vivo. Cataloging Y2H and genetic interactions does not yield much more insight. The model proposed in Figure 4 is also highly speculative. 

      Reviewer #3 (Public review): 

      The goal of this work is to understand the regulation of double-strand break formation during meiosis in C. elegans. The authors have analyzed physical and genetic interactions among a subset of factors that have been previously implicated in DSB formation or the number or timing of DSBs: CEP-1, DSB-1, DSB-2, DSB-3, HIM-5, HIM-17, MRE-11, REC-1, PARG-1, and XND-1. 

      The 10 proteins that are analyzed here include a diverse set of factors with different functions, based on prior analyses in many published studies. The term "Spo11 accessory factors" has been used in the meiosis literature to describe proteins that directly promote Spo11 cleavage activity, rather than factors that are important for the expression of meiotic proteins or that influence the genome-wide distribution or timing of DSBs. Based on this definition, the known SPO-11 accessory factors in C. elegans include DSB-1, DSB-2, DSB-3, and the MRN complex (at least MRE-11 and RAD-50). These are all homologs of proteins that have been studied biochemically and structurally in other organisms. DSB-1 & DSB-2 are homologs of Rec114, while DSB-3 is a homolog of Mei4. Biochemical and structural studies have shown that Rec114 and Mei4 directly modulate Spo11 activity by recruiting Spo11 to chromatin and promoting its dimerization, which is essential for cleavage. The other factors analyzed in this study affect the timing, distribution, or number of RAD-51 foci, but they likely do so indirectly. As elaborated below, XND-1 and HIM-17 are transcription factors that modulate the expression of other meiotic genes, and their role in DSB formation is parsimoniously explained by this regulatory activity. The roles of HIM-5 and REC-1 remain unclear; the reported localization of HIM-5 to autosomes is consistent with a role in transcription (the autosomes are transcriptionally active in the germline, while the X chromosome is largely silent), but its loss-of-function phenotypes are much more limited than those of HIM-17 and XND-1, so it may play a more direct role in DSB formation. The roles of CEP-1 (a p53 homolog) and PARG-1 are also ambiguous, but their homologs in other organisms contribute to DNA repair rather than DSB formation. 

      We appreciate the reviewer’s clarification. However, the definition of Spo11 accessory factors varies across the literature. Only Keeney and colleagues define these as proteins that physically associate with and activate Spo11 to catalyze DSB formation (Keeney, Lange & Mohibullah, 2014; Lam & Keeney, 2015). In contrast, other authors have used the term more broadly to refer to proteins that promote or regulate Spo11-dependent DSB formation, without necessarily implying a direct interaction with Spo11 (e.g., Panizza et al., 2011; Robert et al., 2016; Stanzione et al., 2016; Li et al., 2021; Lange et al., 2016). Thus, our usage of the term follows this broader functional definition.

      An additional significant limitation of the study, as stated in my initial review, is that much of the analysis here relies on cytological visualization of RAD-51 foci as a proxy for DSBs. RAD-51 associates transiently with DSB sites as they undergo repair and is thus limited in its ability to reveal details about the timing or abundance of DSBs since its loading and removal involve additional steps that may be influenced by the factors being analyzed. 

      We agree with the reviewer that counting RAD-51 foci provides only an indirect measure of SPO-11–dependent DSBs, as RAD-51 marks sites of repair rather than the breaks themselves. However, we would like to clarify that our current study does not rely on RAD-51 foci quantification for any of the analyses or conclusions presented. None of the figures or datasets in this manuscript are based on RAD-51 cytology. Instead, our conclusions are drawn from genetic interactions, biochemical assays, and protein–protein interaction analyses.

      The paper focuses extensively on HIM-5, which was previously shown through genetic and cytological analysis to be important for breaks on the X chromosome. The revised manuscript still claims that "HIM-5 mediates interactions with the different accessory factors sub-groups, providing insights into how components on the DNA loops may interact with the chromosome axis." The weak interactions between HIM-5 and DSB-1/2 detected in the Y2H assay do not convincingly support such a role. The idea that HIM-5 directly promotes break formation is also inconsistent with genetic data showing that him-5 mutants lack breaks on the X chromosomes, while HIM-5 has been shown to be enriched on autosomes. Additionally, as noted in my comment to the authors, the localization data for HIM-5 shown in this paper are discordant with prior studies; this discrepancy should be addressed experimentally.

      We appreciate the reviewer’s concerns regarding the interpretation of HIM-5 function.  The weak Y2H interactions between HIM-5 and DSB-1 are not interpreted as direct biochemical evidence of a strong physical interaction, but rather as a potential point of regulatory connection between these pathways. Importantly, these Y2H data are further supported by co-immunoprecipitation experiments, genetic interactions, and the observed mislocalization of HIM-5 in the absence of DSB-1. Together, these complementary results strengthen our conclusion that HIM-5 functionally associates with DSB-promoting complexes.

      Regarding HIM-5 localization, the pattern we observe using both anti-GFP staining of the eaIs4 transgene (Phim-5::him-5::GFP) and anti-HA staining of the HIM-5::HA strain is consistent with that reported by McClendon et al. (2016), who validated the same eaIs4 transgene. The pattern differs slightly from that reported by Meneely et al. (2012), who used a HIM-5 antibody that is no longer functional and that has been discontinued by the commercial source. In this prior study, a weak signal was detected in the mitotic region and late pachytene, but a stronger signal was seen in early to mid-pachytene. Our imaging, optimized for low background and stable signal, similarly shows robust HIM-5 localization in early and mid-pachytene, supporting the reliability of our GFP and HA-tagged analyses.

      The recent analysis of DSB formation in C. elegans males (Engebrecht et al.; PLoS Genetics; PMID: 41124211) shows that in the absence of him-5 there is a significant reduction of CO designation (measured as COSA-1 foci) on autosomes. This study strongly supports a direct and general role for HIM-5 in crossover formation, on both autosomes and the hermaphrodite X.

      This paper describes REC-1 and HIM-5 as paralogs, based on prior analysis in a paper that included some of the same authors (Chung et al., 2015; DOI 10.1101/gad.266056.115). In my initial review I mentioned that this earlier conclusion was likely incorrect and should not be propagated uncritically here. Since the authors have rebutted this comment rather than amending it, I feel it is important to explain my concerns about the conclusions of the previous study. Chung et al. found a small region of potential homology between the C. elegans rec-1 and him-5 genes and also reported that him-5; rec-1 double mutants have more severe defects than either single mutant, indicative of a stronger reduction in DSBs. Based on these observations and an additional argument based on microsynteny, they concluded that these two genes arose through recent duplication and divergence. However, as they noted, genes resembling rec-1 are absent from all other Caenorhabditis species, even those most closely related to C. elegans. The hypothesis that two genes are paralogs that arose through duplication and divergence is thus based on their presence in a single species, in the absence of extensive homology or evidence for conserved molecular function. Further, the hypothesis that gene duplication and divergence has given rise to two paralogs that share no evident structural similarity or common interaction partners in the few million years since C. elegans diverged from its closest known relatives is implausible. In contrast, DSB-1 and DSB-2 are both homologs of Rec114 that clearly arose through duplication and divergence within the Caenorhabditis lineage, but much earlier than the proposed split between REC-1 and HIM-5. Two genes that can be unambiguously identified as dsb-1 and dsb-2 are present in genomes throughout the Elegans supergroup and absent in the Angaria supergroup, placing the duplication event at around 18-30 MYA, yet DSB-1 and DSB-2 share much greater similarity in their amino acid sequence, predicted structure, and function than HIM-5 and REC-1. Further, Raices place HIM-5 and REC-1 in different functional complexes (Figure 3B).

      We respectfully disagree with the reviewer’s characterization of the relationship between HIM-5 and REC-1. Our use of the term “paralog” follows the conclusions of Chung et al. (2015), a peer-reviewed study that provided both sequence and microsynteny evidence supporting this relationship. While we acknowledge that the degree of sequence conservation is limited, the evolutionary scenario proposed by Chung et al. remains the only published framework addressing this question. Further, the degree of homology between either HIM-5 or REC-1 and the ancestral locus is similar to that observed for DSB-1 and DSB-2 with REC-114 (Hinman et al., 2021). We therefore retain the use of the term “paralog” in reference to these genes. Importantly, our conclusions regarding their distinct molecular and functional roles are independent of this classification.

      The authors acknowledge that HIM-17 is a transcription factor that regulates many meiotic genes. Like HIM-17, XND-1 is cytologically enriched along the autosomes in germline nuclei, suggestive of a role in transcription. The Reinke lab performed ChIP-seq in a strain expressing an XND-1::GFP fusion protein and showed that it binds to promoter regions, many of which overlap with the HIM-17-regulated promoters characterized by the Ahringer lab (doi: 10.1126/sciadv.abo4082). Work from the Yanowitz lab has shown that XND-1 influences the transcription of many other genes involved in meiosis (doi: 10.1534/g3.116.035725) and work from the Colaiacovo lab has shown that XND-1 regulates the expression of CRA-1 (doi: 10.1371/journal.pgen.1005029). Additionally, loss of HIM-17 or XND-1 causes pleiotropic phenotypes, consistent with a broad role in gene regulation. Collectively, these data indicate that XND-1 and HIM-17 are transcription factors that are important for the proper expression of many germline-expressed genes. Thus, as stated above, the roles of HIM-17 and XND-1 in DSB formation, as well as their effects on histone modification, are parsimoniously explained by their regulation of the expression of factors that contribute more directly to DSB formation and chromatin modification. I feel strongly that transcription factors should not be described as "SPO-11 accessory factors." 

      The ChIP analysis of XND-1 binding sites (using the XND-1::GFP transgene we provided to the Reinke lab) was performed, and Table S3 in the Ahringer paper suggests it is found at germline promoters, although the analysis is not actually provided. We completely agree that at least a subset of XND-1 functions is explained by its regulation of transcriptional targets (as we previously showed for HIM-5). However, like the MES proteins, a subset of which are also autosomal and impact X chromosome gene expression, XND-1 could also be directly regulating chromatin architecture which could have profound effects on DSB formation.  As stated in our prior comments, precedent for the involvement of a chromatin factor in DSB formation is provided by yeast Spp1. 

      Recommendations for the authors: 

      Editor comments: 

      As you can see, the reviewers have additional comments, and the authors can include revisions to address those points prior to publicizing 'a version of record' (e.g. hatching rate assay mentioned by reviewer #1). This type of study, trying to catalog interactions of many factors, inevitably has loose ends, but in my opinion, it does not reduce the value of the study, as long as statements are not misleading. I suggest that the authors address issues by making changes to the main text. After the next round of adjustments by authors, I feel that it will be ready for a version of record, based on the spirit of the current eLife publication model. 

      Reviewer #1 (Recommendations for the authors): 

      I still have concerns about the HIM-17 IP and immunoblot probing with XND-1 antibodies. While the newly provided whole extract immunoblot clearly shows an XND-1-specific band that goes away in the mutant extracts, there are additional bands that are recognized - the pattern looks different than in the input in Figure 1B. Additionally, there is still a band of the corresponding size in the IPs from extracts not containing the tagged allele of HIM-17, calling into question whether XND-1 is specifically pulled down.

      The authors did not include the hatching rate as pointed out in the original reviews. In the rebuttal: 

      "Great question. I guess we need to do this while back out for review. If anyone has suggestions of what to say here. Clearly we overlooked this point but do have the strain." 

      We thank the reviewer for this suggestion. We had intended to include a hatching analysis; however, during the course of this work we discovered that our him-17 stock had acquired an additional linked mutation(s) that altered its phenotype and led to inconsistent results. This strain was used to rederive the him-17; eaIs4 double mutant after our original did not survive freeze/thaw. Given the abnormal behavior observed in this line, we concluded that proceeding with the hatching assays could yield unreliable data. We are currently reestablishing a verified him-17 strain, but in the interest of accuracy and reproducibility, we have restricted our analysis in this manuscript to validated datasets derived from confirmed strains.

      Reviewer #2 (Recommendations for the authors): 

      The authors have addressed most of the previous concerns and substantially improved the manuscript. The new data demonstrate that HIM-5 localization depends on DSB-1, and together with the Y2H and co-IP results, strengthen the link between HIM-5 and the DSB-forming machinery in C. elegans. The remaining points are outlined below:

      Specific comments: 

      The font size of texts and labels in the Figure is very small and is hardly legible. Please enlarge them and make them clearly visible (Fig 1A, 1B, 2A, 2B, 2C, 2D, 2E, 3A, 3B, 3C, 3D, 3F)

      Done

      Although the authors have addressed the specificity of the XND-1 antibody, it remains unclear whether the boxed band is specific to the him-17::3xHA IP, since the same band appears in the control IP, albeit with lower intensity (Fig 1B). Is the ~100 kDa band in the him-17::3xHA IP a modified form of XND-1? While antibody specificity was previously demonstrated by IF using xnd-1 mutants, it would be ideal to confirm this on a western blot as well.

      A Western blot performed using whole cell extracts and probed with the anti-XND-1 antibody has been provided in the revised version of the manuscript (Fig. S1A). This confirms that the antibody specifically recognizes XND-1 protein. We believe that the ~100 kDa band mentioned by the reviewer is likely to be a non-specific cross-reacting band detected by the antibody, since a band of the same MW was also detected in xnd-1 null mutants (Fig. S1A).

      Regarding the IP negative controls, we are firmly convinced that the boxed band is specific, and the fact that a (very) low intensity band is also found in the negative control should not undermine the validity of the specific HIM-17-XND-1 interaction. There are many similar examples across the literature, as it is widely acknowledged amongst biochemists that some proteins may “stick” to the beads due to their intrinsic biochemical properties despite the use of highly stringent IP buffers. However, the high level of enrichment detected in the IP (as also underlined by the reviewer) corroborates that XND-1 specifically immunoprecipitates with HIM-17 despite low, non-specific binding to the HA beads. If the interaction between XND-1 and HIM-17 were non-specific, we would expect the band in the IP and the band in the negative control to be of very similar intensity, which is clearly not the case.

      Although co-IP assays are generally not considered strictly quantitative, we want to emphasize that a comparable amount of nuclear extract was employed in both samples, as evidenced by the inputs. The inputs also show that, if anything, slightly less nuclear extract was employed in the him-17::3xHA; him-5::GFP::3xFLAG sample than in the him-5::GFP::3xFLAG negative control, corroborating the above-mentioned points.

      Lastly, it is crucial to mention that mass spectrometry analyses performed on HIM-17::3xHA pulldowns show XND-1 as a highly enriched interacting protein (Blazickova et al.; 2025 Nature Comms.), which strongly supports our co-IP results.

      The subheading "HIM-5 is the essential factor for meiotic breaks in the X chromosome" does not accurately represent the work described in the Results or in Figure 1. I disagree with the authors' response to the earlier criticism. The issue is not merely semantic. The data do not demonstrate that HIM-5 is required for DSB formation on the X chromosome - this conclusion can only be inferred. What Figure 1 shows is that XND-1 and HIM-17 interact, and that pie-1p-driven HIM-5 expression can partially rescue meiotic defects of him-17 mutants. This supports the conclusion that him-5 is a target of HIM-17/XND-1 in promoting CO formation on the X chromosome. However, the data provide no direct evidence for the claim stated in the subheading. I strongly encourage authors to revise the subheading to more accurately represent the findings presented in the paper. 

      After considering the reviewer’s comments, we have revised the subheading to more accurately describe our findings.

      In Fig 1C, please fix the typo in the last row - "pie1p::him5-::GFP" to "pie-1p::him-5::GFP".

      Done

      In Fig 2C, "p" is missing from the label on the right for Phim-5::him-5::GFP.

      Done

      In Fig 3I, bring the labels (DSB-1/2/3) at the lower right to the front.

      Done

      In Concluding Remarks, please fix the typo "frequently".

      Done

      Reviewer #3 (Recommendations for the authors): 

      The experiments that analyze HIM-5 in dsb-1 mutants should be repeated using antibodies against the endogenous HIM-5 protein, and localization of the HIM-5::HA and HIM-5::GFP proteins should be compared directly to antibody staining. This work uses an epitope-tagged protein and a GFP-tagged protein to analyze the localization of HIM-5, while prior work (Meneely et al., 2012) used an antibody against the endogenous protein. In Figures 2 and S4 of this paper, neither HIM-5::HA nor HIM-5::GFP appears to localize strongly to chromatin, and autosomal enrichment of HIM-5, as previously reported for the endogenous protein based on antibody staining, is not evident. Moreover, HIM-5::GFP and HIM-5::HA look different from each other, and neither resembles the low-resolution images shown in Figure 6 in Meneely et al 2012, which showed nuclear staining throughout the germline, including in the mitotic zone, and also in somatic sheath cells. Given the differences in localization between the tagged transgenes and the endogenous protein, it is important to analyze the behavior of the endogenous, untagged protein. A minor issue: a wild-type control should also be shown for HIM-5::HA in Figure S4.

      Wild type control added to figure S4

      Evidence that XND-1 and HIM-17 form a complex is weak; it is supported by the Y2H and co-IP data but opposed by functional analysis or localization. The diversity of proteins found in the co-IP of HIM-17::GFP (Table S2) indicates that these interactions are unlikely to be specific. The independent localization of these proteins to chromatin is clear evidence that they do not form an obligate complex; additionally, they have been found to regulate distinct (although overlapping) sets of genes. The predicted structure generated by AlphaFold3 has very low confidence and should not be taken as evidence for an interaction. The newly added argument that the apparent lack of overlap between HIM-17 and XND-1 is due to the distance between the HA tag on HIM-17 and XND-1 is flawed and should be removed - the extended C-terminus of HIM-17 in the AlphaFold3 prediction has been interpreted as if it were a structured domain. Moreover, the predicted distance of 180 Å (18 nm) is comparable to the distance between a fluorophore on a secondary antibody and the epitope recognized by the primary antibody (~20-25 nm) and is far below the resolution limit of light microscopy.

      We appreciate the reviewer’s thoughtful comment. The evidence supporting a physical interaction between XND-1 and HIM-17 is not only shown by our co-IP experiments, but has also been recently shown in an independent study where MS analyses were conducted on HIM-17::3xHA pull downs to identify novel HIM-17 interactors (Blazickova et al.; 2025 Nature Comms). As shown in the data provided in that study, XND-1 was also identified as a highly enriched putative HIM-17 interactor under these experimental settings. We do acknowledge that their chromatin localization patterns are distinct and that they regulate overlapping but not identical sets of genes. However, it is worth noting that protein–protein interactions in meiosis are often transient or context-dependent, and may not necessarily result in co-localization detectable by microscopy. In line with this, the same work cited above reported a similar situation for BRA-2 and HIM-17, which were shown to interact biochemically despite the absence of overlapping staining patterns.

      Minor issues: 

      The images shown in Panel D in Figure 1 seem to have very different resolutions; the HTP-3/HIM-17 colocalization image is particularly blurry/low-resolution and should be replaced. The contrast between blue and green cannot be seen clearly; colors with stronger contrast should be used, and grayscale images should also be shown for individual channels. High-resolution images should probably be included for all of the factors analyzed here to facilitate comparisons.

    1. eLife Assessment

      This study reports important advances in our understanding of how enteropathogenic E. coli (EPEC) interacts at the intestinal interface. Solid data describe a novel model of spatially coordinated calcium signaling to modulate NF-κB activation; additional data and clarification of methods would improve the strength of these conclusions. These findings, which integrate imaging, genetics, and computational modeling, provide a new way to consider host-pathogen interactions in EPEC infections that may lead to improved therapies.

    2. Reviewer #1 (Public review):

      Summary:

      In their article, Guo and coworkers investigate the Ca²⁺ signaling responses induced by Enteropathogenic Escherichia coli (EPEC) in epithelial cells and how these responses regulate NF-κB activation. The authors show that EPEC induces rapid, spatially coordinated Ca²⁺ transients mediated by extracellular ATP released through the type III secretion system (T3SS). Using high-speed Ca²⁺ imaging and stochastic modeling, they propose that low ATP levels trigger "Coordinated Ca²⁺ Responses from IP₃R Clusters" (CCRICs) via fast Ca²⁺ diffusion and Ca²⁺-induced Ca²⁺ release. These responses may dampen TNF-α-induced NF-κB activation through Ca²⁺-dependent modulation of O-GlcNAcylation of p65. The interdisciplinary work suggests a new perspective on calcium-mediated immune response by combining quantitative imaging, bacterial genetics, and computational modeling.

      Strengths:

      The study provides a new concept for host responses to bacterial infections and introduces the concept of Coordinated Ca²⁺ Responses from IP₃R Clusters (CCRICs) as synchronized, whole-cell-scale Ca²⁺ transients with the fast kinetics typical of local events. This is elegantly done by an interdisciplinary approach using quantitative measurements and mechanistic modelling.

      Weaknesses:

      (1) The effect of coordination by fast diffusion for small eATP concentrations is explained by the resulting low Ca²⁺ concentration that is not as strongly affected by calcium buffers compared to higher concentrations. While I agree with this statement on the relative level, CICR is based on the resulting absolute concentration at neighboring IP₃Rs (to activate them). Thus, I do not fully agree with the explanation, or at least would expect to use the modelling approach to demonstrate this effect. Simulations for different activation and buffer concentrations could strengthen this point and exclude potential inhibition of channels at higher stimulation levels.

      In this respect, I would also include the details of the modelling, such as implementation environment, parameters, and benchmarking. The description in the Supplementary Methods is very similar to the description in the main text. In terms of reproducibility, it would be important to at least provide simulation parameters, and providing the code would align with the emerging standards for reproducible science.

      (2) Quantitative characterization of CCRICs:

      The paper would benefit from a clearer definition of the term CCRICs and quantitative descriptors like duration, amplitude distribution, frequency, and spatial extent (also in relation to the comment on the EGTA measurements below). Furthermore, it remains unclear to me whether CCRICs represent a population of rapidly propagating micro-waves or truly simultaneous events. Maybe kymographs or wave-front propagation analyses (at least from simulations if experimental resolution is too bad) would strengthen this point.

      (3) Specificity of pharmacological tools:

      Suramin and U73122 are known to have off-target effects. Control experiments using alternative P2 receptor antagonists like PPADS or inactive U73343 analogs would strengthen the causal link.

    3. Reviewer #2 (Public review):

      Summary:

      The authors of this study are trying to resolve how cellular infection by enteropathogenic E. coli (EPEC) subverts cellular signaling pathways to promote infection and dampen immune responses. Specifically, alteration in calcium dynamics has been evidenced in the prior literature as a potential initiator of these adaptations, and this study provides ideas and mechanistic detail as to how cellular calcium dynamics may be subverted by pathogens.

      Strengths:

      The clear strengths of this paper relate to the new ideas inherent in the proposed hypothesis and their support from the experimental approaches used. Overall, the proposed work provides new ideas in this area, which will benefit from further investigation. Certainly, this is an interesting and challenging paradigm to pick apart mechanistically, and is important for improving treatments for intestinal infections.

      Weaknesses:

      Additional insight is needed in three specific areas to convincingly support the conclusions drawn by the authors. These three areas are: first, a better description of the infection-associated calcium signals. Second, a mechanistic definition of the relevant purinoceptors versus other pathways to increase cellular calcium. Third, an effort to show that the proposed pathways have relevance in a polarized epithelial cell.

    1. Author response:

      Reviewer #1:

      We thank the reviewer for this important point. Beyond long reaction times, we did not originally exclude participants based on low EMA variability. We agree this is a relevant concern, particularly given the need to add small random noise to some EMA series for model convergence. In the revised manuscript, we will assess additional indicators of careless responding, including within-person EMA variability (e.g., standard deviation or proportion of modal responses) following Jaso et al., 2022 criteria. We will conduct sensitivity analyses excluding low-variability responses or participants and report whether these checks affect the robustness of the results. We will also clarify in the Discussion that minimal EMA variance may reflect either true affective stability or reduced engagement, and discuss how this ambiguity may affect interpretation.
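
      The two within-person variability indicators named above (standard deviation and proportion of modal responses) could be computed along the following lines. This is only a minimal sketch under stated assumptions: the function names, the input layout, and both thresholds (`sd_floor`, `modal_ceiling`) are hypothetical placeholders for illustration and are not taken from Jaso et al. (2022) or from the manuscript.

```python
from collections import Counter
from statistics import stdev


def ema_variability(responses):
    """Return (sample SD, proportion of modal responses) for one EMA series."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to compute an SD")
    sd = stdev(responses)
    # Count of the single most frequent response value in the series.
    modal_count = Counter(responses).most_common(1)[0][1]
    return sd, modal_count / len(responses)


def flag_low_variability(series_by_participant, sd_floor=0.5, modal_ceiling=0.8):
    """List participants whose series looks suspiciously flat.

    sd_floor and modal_ceiling are hypothetical thresholds chosen only
    for illustration; real cut-offs would come from the cited criteria.
    """
    flagged = []
    for participant, responses in series_by_participant.items():
        sd, modal_share = ema_variability(responses)
        if sd < sd_floor or modal_share > modal_ceiling:
            flagged.append(participant)
    return flagged
```

      A sensitivity analysis of the kind described would then rerun the models with and without the flagged participants and compare the estimates.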

      Reviewer #2:

      We thank the reviewer for raising this fundamental conceptual concern. We agree that more research is needed to fully understand the processes captured by DQRT. In the revised manuscript, we will more clearly reference and summarize prior validation work from our lab providing strong support for a cognitive characterization of DQRT as a measure of cognitive processing speed, while also explicitly acknowledging potential confounds and limitations (Teckentrup et al., 2025). We will clarify that our DQRT computation followed those validated procedures, including exclusion of extreme values above the sample-specific median + 2 SD. In addition, consistent with Reviewer #1’s comment, we will expand the Discussion of how potential careless responding and non-cognitive factors may influence DQRT. We will further tone down language implying causal inference.
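
      The exclusion rule mentioned above (values above the sample-specific median + 2 SD) can be illustrated with a short sketch. It assumes that the median and SD are computed over the same pooled list of response times and that the sample SD is used; the validated procedure may define these differently, and the function names are hypothetical.

```python
from statistics import median, stdev


def dqrt_cutoff(times):
    """Upper cut-off: median of the sample plus two sample SDs (assumed definition)."""
    return median(times) + 2 * stdev(times)


def exclude_extreme(times):
    """Keep only response times at or below the median + 2 SD cut-off."""
    cutoff = dqrt_cutoff(times)
    return [t for t in times if t <= cutoff]
```

      Because the cut-off is anchored on the median rather than the mean, a single very long response time shifts the threshold only through the SD term.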

      References

      Jaso, B. A., Kraus, N. I., & Heller, A. S. (2022). Identification of careless responding in ecological momentary assessment research: From posthoc analyses to real-time data monitoring. Psychological Methods, 27(6), 958.

      Teckentrup, V., Rosická, A. M., Donegan, K. R., Gallagher, E., Hanlon, A. K., & Gillan, C. M. (2025). Digital questionnaire response time (DQRT): A ubiquitous and low-cost digital assay of cognitive processing speed. Behavior Research Methods, 57(7), 200.

    1. eLife Assessment

      This useful manuscript reports findings indicating that cell cycle progression and cytokinesis both contribute to the transition from early to late neural stem cell fates. Although orthogonal approaches would help confirm the findings, which are based on loss-of-function, the experimental evidence is convincing. Lastly, an investigation of the underlying mechanisms linking the cell cycle to temporal factor expression is still needed.

    2. Reviewer #1 (Public review):

      Summary:

      Drosophila larval type II neuroblasts generate diverse types of neurons by sequentially expressing different temporal identity genes during development. Previous studies have shown that transition from early temporal identity genes (such as Chinmo and Imp) to late temporal identity genes (such as Syp and Broad) depends on the activation of the expression of EcR by Seven-up (Svp) and progression through the G1/S transition of the cell cycle. In this study, Chaya and Syed examined if the expression of Syp and EcR is regulated by cell cycle and cytokinesis by knocking down CDK1 or Pav, respectively, throughout development or at specific developmental stages. They find that knocking down CDK1 or Pav either in all type II neuroblasts throughout the development or in single type neuroblast clones after larval hatching consistently leads to failure to activate late temporal identity genes Syp and EcR. To determine whether the failure of the activation of Syp and EcR is due to impaired Svp expression, they also examined Svp expression using a Svp-lacZ reporter line. They find that Svp is expressed normally in CDK1 RNAi neuroblasts. Further, knocking down CDK1 or Pav after Svp activation still leads to loss of Syp and EcR expression. Finally, they also extended their analysis to type I neuroblasts. They find that knocking down CDK1 or Pav, either at 0 hours or at 42 hours after larval hatching, also results in loss of Syp and EcR expression in type I neuroblasts. Based on these findings, the authors conclude that cell cycle and cytokinesis are required for the transition from early to late temporal identity genes in both types of neuroblasts. These findings add mechanistic details to our understanding of the temporal patterning of Drosophila larval neuroblasts.

      Strengths:

      The data presented in the paper are solid and largely support their conclusion. Images are of high quality. The manuscript is well-written and clear.

      Weaknesses:

      The authors have addressed all the weaknesses in this revision.

    3. Reviewer #2 (Public review):

      Summary:

      Neural stem cells produce a wide variety of neurons during development. The regulatory mechanisms of neural diversity are based on the spatial and temporal patterning of neural stem cells. Although the molecular basis of spatial patterning is well-understood, the temporal patterning mechanism remains unclear. In this manuscript, the authors focused on the roles of cell cycle progression and cytokinesis in temporal patterning and found that both are involved in this process.

      Strengths:

      They conducted RNAi-mediated disruption on cell cycle progression and cytokinesis. As they expected, both disruptions affected temporal patterning in NSCs.

      Weaknesses:

      Although the authors showed clear results, they needed to provide additional data to support their conclusion sufficiently.

      For example, they can examine the effects of cell cycle acceleration on the temporal patterning.

    4. Reviewer #3 (Public review):

      Summary:

      The manuscript by Chaya and Syed focuses on understanding the link between cell cycle and temporal patterning in central brain type II neural stem cells (NSCs). To investigate this, the authors perturb the progression of the cell cycle by delaying the entry into M phase and preventing cytokinesis. Their results convincingly show that temporal factor expression requires progression of the cell cycle in both Type 1 and Type 2 NSCs in the Drosophila central brain. Overall, this study establishes an important link between the two timing mechanisms of neurogenesis.

      Strengths:

      The authors provide solid experimental evidence for the coupling of cell cycle and temporal factor progression in Type 2 NSCs. The quantified phenotype shows an all-or-none effect of cell cycle block on the emergence of subsequent temporal factors in the NSCs, strongly suggesting that both nuclear division and cytokinesis are required for temporal progression. The authors also extend this phenotype to Type 1 NSCs in the central brain, providing a generalizable characterization of the relationship between cell cycle and temporal patterning.

      Weaknesses:

      One major weakness of the study is that the authors do not explore the mechanistic relationship between cell cycle and temporal factor expression. Although their results are quite convincing, they do not provide an explanation as to why Cdk1 depletion affects Syp and EcR expression but not the onset of svp. This result suggests that at least a part of the temporal cascade in NSCs is cell-cycle independent, which isn't addressed or sufficiently discussed.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Drosophila larval type II neuroblasts generate diverse types of neurons by sequentially expressing different temporal identity genes during development. Previous studies have shown that the transition from early temporal identity genes (such as Chinmo and Imp) to late temporal identity genes (such as Syp and Broad) depends on the activation of the expression of EcR by Seven-up (Svp) and progression through the G1/S transition of the cell cycle. In this study, Chaya and Syed examined whether the expression of Syp and EcR is regulated by cell cycle and cytokinesis by knocking down CDK1 or Pav, respectively, throughout development or at specific developmental stages. They find that knocking down CDK1 or Pav either in all type II neuroblasts throughout development or in single-type neuroblast clones after larval hatching consistently leads to failure to activate late temporal identity genes Syp and EcR. To determine whether the failure of the activation of Syp and EcR is due to impaired Svp expression, they also examined Svp expression using a Svp-lacZ reporter line. They find that Svp is expressed normally in CDK1 RNAi neuroblasts. Further, knocking down CDK1 or Pav after Svp activation still leads to loss of Syp and EcR expression. Finally, they also extended their analysis to type I neuroblasts. They find that knocking down CDK1 or Pav, either at 0 hours or at 42 hours after larval hatching, also results in loss of Syp and EcR expression in type I neuroblasts. Based on these findings, the authors conclude that cell cycle and cytokinesis are required for the transition from early to late temporal identity genes in both types of neuroblasts. These findings add mechanistic details to our understanding of the temporal patterning of Drosophila larval neuroblasts.

      Strengths:

      The data presented in the paper are solid and largely support their conclusion. Images are of high quality. The manuscript is well-written and clear.

      We appreciate the reviewer’s detailed summary and recognition of the study’s strengths.

      Weaknesses:

      The quantifications of the expression of temporal identity genes and the interpretation of some of the data could be more rigorous.

      (1) Expression of temporal identity genes may not be just positive or negative. Therefore, it would be more rigorous to quantify the expression of Imp, Syp, and EcR based on the staining intensity rather than simply counting the number of neuroblasts that are positive for these genes, which can be very subjective. Or the authors should define clearly what qualifies as "positive" (e.g., a staining intensity at least 2x background).

      We thank the reviewer for this helpful suggestion. In the new version, we have now clarified how positive expression was defined and added more details of our quantification strategy to the Methods section (page 11, lines 380-388; lines 426-434 in tracked changes file). Fluorescence intensity for each neuroblast was normalized to the mean intensity of neighboring wild-type neuroblasts imaged in the same field. A neuroblast was considered positive for a given factor when its normalized nuclear intensity was at least 2× the local background. This scoring criterion was applied uniformly across all genotypes and time points. All quantifications were performed on the raw LSM files in Fiji prior to assembling the figure panels.
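
      The scoring rule described above can be illustrated with a minimal sketch. This is not the authors' Fiji workflow; the function names, the example intensity values, and the choice to normalize the background measurement by the same wild-type reference are assumptions made only for illustration.

```python
from statistics import mean


def normalized_intensity(raw, wildtype_neighbors):
    """Divide a raw intensity by the mean intensity of neighboring
    wild-type neuroblasts measured in the same field (assumed reference)."""
    reference = mean(wildtype_neighbors)
    if reference <= 0:
        raise ValueError("wild-type reference intensity must be positive")
    return raw / reference


def is_positive(nuclear, background, wildtype_neighbors, fold=2.0):
    """Score a neuroblast as positive when its normalized nuclear intensity
    is at least `fold` times the normalized local background.
    Assumption: background is normalized by the same wild-type reference."""
    signal = normalized_intensity(nuclear, wildtype_neighbors)
    local_background = normalized_intensity(background, wildtype_neighbors)
    return signal >= fold * local_background


# Hypothetical measurements (arbitrary units), not data from the study.
wildtype = [400.0, 600.0]
print(is_positive(300.0, 100.0, wildtype))
print(is_positive(150.0, 100.0, wildtype))
```

      Under these assumptions, the same threshold is applied to every genotype and time point, so the positive/negative call does not depend on how brightness and contrast were set for display.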

      (2) The finding that inhibiting cytokinesis without affecting nuclear divisions by knocking down Pav leads to the loss of expression of Syp and EcR does not support their conclusion that nuclear division is also essential for the early-late gene expression switch in type II NSCs (at the bottom of the left column on page 5). No experiments were done in this study to specifically block nuclear division. This conclusion should be revised.

      We blocked both cell cycle progression and cytokinesis, and both these manipulations affected temporal gene transitions, suggesting that both cell cycle and cytokinesis are essential. To our knowledge, no mechanism/tool exists that selectively blocks nuclear division while leaving cell cycle progression intact. We have added more clarification on page 4, line 123 onwards (lines 126 onwards in tracked changes file).

      (3) Knocking down CDK1 in single random neuroblast clones does not make the CDK1 knockdown neuroblast develop in the same environment (except still in the same brain) as wild-type neuroblast lineages. It does not help address the concern whether "type 2 NSCS with cell cycle arrest failed to undergo normal temporal progression is indirectly due to a lack of feedback signaling from their progeny", as discussed (from the bottom of the right column on page 9 to the top of the left column on page 10). The CDK1 knockdown neuroblasts do not divide to produce progeny and thus do not receive a feedback signal from their progeny as wild-type neuroblasts do. Therefore, it cannot be ruled out that the loss of Syp and EcR expression in CDK1 knockdown neuroblasts is due to the lack of the feedback signal from their progeny. This part of the discussion needs clarification.

      Thanks to the reviewer for raising this critical point. We agree and have added more clarification of our interpretations and limitations to our studies in the revised text on page 8, lines 278-282 (lines 296-300 in tracked changes file).

      (4) In Figure 2I, there is a clear EcR staining signal in the clone, which contradicts the quantification data in Figure 2J that EcR is absent in Pav RNAi neuroblasts. The authors should verify that the image and quantification data are consistent and correct.

      When cytokinesis is blocked using pav-RNAi, the neuroblasts become extremely large and multinucleated. In some large pav RNAi clones, we observed a weak EcR signal near the cell membrane. However, more importantly, none of the nuclear compartments showed detectable EcR staining, where EcR is typically localized. We selected a representative nuclear image for the figure panel. To clarify this observation, we have now added an explanatory note to the discussion section on page 8, lines 283-291 (lines 301-309 in tracked changes file).

      Reviewer #2 (Public review):

      Summary:

      Neural stem cells produce a wide variety of neurons during development. The regulatory mechanisms of neural diversity are based on the spatial and temporal patterning of neural stem cells. Although the molecular basis of spatial patterning is well-understood, the temporal patterning mechanism remains unclear. In this manuscript, the authors focused on the roles of cell cycle progression and cytokinesis in temporal patterning and found that both are involved in this process.

      Strengths:

      They conducted RNAi-mediated disruption on cell cycle progression and cytokinesis. As they expected, both disruptions affected temporal patterning in NSCs.

      We appreciate the reviewer’s positive assessment of our experimental results.

      Weaknesses:

      Although the authors showed clear results, they needed to provide additional data to support their conclusion sufficiently.

      For example, they need to identify type II NSCs using molecular markers (Ase/Dpn). The authors are encouraged to provide a more detailed explanation of each experiment. The current version of the manuscript is difficult for non-expert readers to understand.

      Thanks for your feedback. We have now included a detailed description of how we identify type II NSCs in both wild-type and mutant clones. We have also added a representative Asense staining to clearly distinguish type 1 (Ase<sup>+</sup>) from type 2 (Ase<sup>-</sup>) NSCs (see Figure S1). We have also added a resources table explaining the genotypes associated with each figure, which was omitted due to an error in the previous version of the manuscript.

      Reviewer #3 (Public review):

      Summary:

      The manuscript by Chaya and Syed focuses on understanding the link between cell cycle and temporal patterning in central brain type II neural stem cells (NSCs). To investigate this, the authors perturb the progression of the cell cycle by delaying the entry into M phase and preventing cytokinesis. Their results convincingly show that temporal factor expression requires progression of the cell cycle in both Type 1 and Type 2 NSCs in the Drosophila central brain. Overall, this study establishes an important link between the two timing mechanisms of neurogenesis.

      Strengths:

      The authors provide solid experimental evidence for the coupling of cell cycle and temporal factor progression in Type 2 NSCs. The quantified phenotype shows an all-or-none effect of cell cycle block on the emergence of subsequent temporal factors in the NSCs, strongly suggesting that both nuclear division and cytokinesis are required for temporal progression. The authors also extend this phenotype to Type 1 NSCs in the central brain, providing a generalizable characterization of the relationship between cell cycle and temporal patterning.

      We thank the reviewer for recognizing the robustness of our data linking the cell cycle to temporal progression.

      Weaknesses:

      One major weakness of the study is that the authors do not explore the mechanistic relationship between the cell cycle and temporal factor expression. Although their results are quite convincing, they do not provide an explanation as to why Cdk1 depletion affects Syp and EcR expression but not the onset of svp. This result suggests that at least a part of the temporal cascade in NSCs is cell-cycle independent, which isn't addressed or sufficiently discussed.

      Thank you for bringing up this important point. We are equally interested in uncovering the mechanism by which the cell cycle regulates temporal gene transitions; however, such mechanistic exploration is beyond the scope of the present study. Interestingly, while the temporal switching factor Svp is expressed independently of the cell cycle, the subsequent temporal transitions are not. We have expanded our discussion on this intriguing finding (page 9, line 307-315; lines 345-355 in tracked changes file). Specifically, we propose that svp activation marks a cell-cycle–independent phase, whereas EcR/Syp induction likely depends on cell-cycle–coupled mechanisms, such as mitosis-dependent chromatin remodeling or daughter-cell feedback. Although further dissection of this mechanism lies beyond the current study, our findings establish a foundation for future work aimed at identifying how developmental timekeeping is molecularly coupled to cell-cycle progression.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) Figure 1 C and D, it would be better to put a question mark to indicate that these are hypotheses to be tested. 

      We appreciate this suggestion and have added question marks in Figure 1C and 1D to clearly indicate that these panels represent hypotheses under investigation.

      (2) Figure 2A-I, Figure 4A-I, Figure 5A-I and K-S, in addition to enlarged views of single type II neuroblasts, it would be more convincing to include zoomed-out images of the entire larval brain or at least a portion of the brain to include neighboring wild-type type II neuroblasts as internal controls. Also, it would be ideal to show EcR staining from the same neuroblasts as IMP and Syp staining. 

      We thank the reviewer for this valuable input. In our imaging setup, the number of available antibody channels was limited to four (anti-Ase, anti-GFP, anti-Syp, and anti-Imp). Adding EcR in the same sample was therefore not technically possible, so we performed EcR staining separately.

      (3) The authors cited "Syed et al., 2024" (in the middle of the right column on page 5), but this reference is missing in the "References" section and should be added. 

      The missing citation has been added to the reference section.  

      (4) It would be better to include Ase staining in the relevant figure to indicate neuroblast identity as type I or type II. 

      We agree and now include representative Ase staining for both type 1 and type 2 NSC clones in Figure S1, along with corresponding text updates that describe these markers.

      Reviewer #2 (Recommendations for the authors): 

      Major comments 

      (1) The present conclusion relies on the results using Cdk1 RNAi and pav RNAi. It is still possible that Cdk1 and Pav are involved in the regulation of temporal patterning independent of the regulation of cell cycle or cytokinesis, respectively. To avoid this possibility, the authors need to inhibit cell cycle progression or cytokinesis in another alternative manner. 

      We thank the reviewer for raising this important point. While we cannot completely exclude gene-specific, cell-cycle-independent roles for Cdk1 or Pav, we observe consistent phenotypes across several independent manipulations that slow or block the cell cycle. Also, earlier studies using orthogonal approaches that delay G1/S (Dacapo/Rbf) or impair mitochondrial OxPhos (which lengthens G1/S; van den Ameele & Brand, 2019) produce similar temporal delays. These concordant phenotypes strongly support the interpretation that altered cell-cycle progression—rather than specific roles of a single gene—is the primary cause of the defect. While we cannot exclude additional, gene-specific effects of Cdk1 or Pav, the concordant phenotypes across independent perturbations make the cell-cycle disruption model the most parsimonious interpretation. We have clarified this reasoning in the discussion section on pages 8-9, lines 293-305 (lines 311-343 in tracked changes file).

      (2) To reach the present conclusion, the authors need to address the effects of acceleration of cell cycle progression or cytokinesis on temporal patterning. 

      We thank the reviewer for this insightful suggestion. To our knowledge, there are currently no established genetic tools that can specifically accelerate cell-cycle progression in Drosophila neuroblasts. However, our results demonstrate that blocking the cell cycle impairs the transition from early to late temporal gene expression. These findings suggest that proper cell-cycle progression is essential for the transition from early to late temporal identity in neuroblasts.

      Minor comments 

      (3) P3L2 (right), ... we blocked the NSC cell cycle...

      How did they do it? 

      Which fly lines were used?

      Why did they use the line? 

      These details are now included in the Materials and Methods and the Resource Table (pages 11-13). We used Wor-Gal4, Ase-Gal80 to drive UAS-Cdk1RNAi and UAS-pavRNAi in type 2 NSCs.

      (4) P5L1(left), ... we used the flip-out approach...

      Why did they conduct it? 

      Probably, the authors have reasons other than "to further ensure." 

      We have clarified in the text on page 4, lines 137-139, that the flip-out approach was used to generate random single-cell clones, enabling quantitative analysis of type 2 NSCs within an otherwise wild-type brain. 

      (5) P5L8(left), ... type 2 hits were confirmed by lack of the type 1 Asense... The authors must examine Deadpan (Dpn) expression as well, because there are many Asense (Ase)-negative cells in the brain (neurons, glial cells, and neuroepithelial cells).

      Type II NSCs can be identified as Dpn+/Ase- cells.

      We agree that Dpn is a helpful marker. However, we reliably distinguished type II NSCs by their lack of Ase and larger cell size relative to surrounding neurons and glia, which are smaller in size and located deeper within the clone. These differences, together with established lineage patterns, allow unambiguous identification of type 2 NSCs across all genotypes. We have now added representative type I and type 2 NSC clones to the supplemental figure S1 (E-G’) with Asense stains to demonstrate how we differentiate type I from type II NSCs. 

      (6) P5L32(left), To do this, we induced... 

      This sentence should be made more concise.

      Please rephrase it. 

      The sentence has been rewritten for clarity and concision.

      (7)  P5L42(left), ...lack of EcR/Syp expression (Figure 2).  However, EcR expression is still present (Figure 2I). 

      In some large pavRNAi clones, a weak EcR signal can be observed near the cell membrane; however, none of the nuclear compartments—where EcR is typically localized—show detectable staining. We selected a representative nuclear image for the figure and addressed this observation on page 8, lines 283-291 (lines 301-309 in tracked changes file).

      (8) P7L29(left), ......had persistent Imp expression...

      Imp expression is faint compared to that in Figure 2G.

      The differences between Figures 2G and 3G should be discussed. 

      We thank the reviewer for this comment. We have added a note in the Methods section clarifying that brightness and contrast were adjusted per panel for optimal visualization; thus, apparent differences in signal intensity do not reflect biological variation. Fluorescence intensity for each neuroblast was normalized to the mean intensity of neighboring wild-type neuroblasts imaged in the same field. A neuroblast was considered Imp-positive when its normalized nuclear intensity was at least 2× the local background. This scoring criterion was applied uniformly across all genotypes and time points. All quantifications were performed on the raw LSM files in Fiji prior to assembling the figure panels.

      (9) P8 (Figure 5)

      The Imp expression is faint compared to that in Figure 5Q.

      The difference between Figure 5G and 5Q should be discussed further. 

      As mentioned above, we have clarified our image processing approach in the Methods section to explain any differences in signal appearance between these figures.

      (10) P10 Materials and Methods

      The authors did not mention the fly lines used. This is very important for the readers. 

      We thank the reviewer for bringing this oversight to our attention. The Resource Table was inadvertently omitted from the initial submission. The complete list of fly lines and reagents used in this study is now provided in the updated Resource Table.

      Reviewer #3 (Recommendations for the authors): 

      Major points 

      (1) The authors mention that the heat-shock induction at 42ALH is well after svp temporal window and therefore the cell cycle block independently affects Syp and EcR expression. However, Figure 3 shows svp-LacZ expression at 48ALH. If svp expression is indeed transient in Type 2 NSCs, then this must be validated using an immunostaining of the svp-LacZ line with svp antibody. This is crucial as the authors claim that cell cycle block doesn't affect svp expression and is required independently.

      We thank the reviewer for bringing this important issue to our attention. As noted, Svp protein is expressed transiently and stochastically in type 2 NSCs (Syed et al., 2017), making direct antibody quantification challenging upon cell cycle block. Consistent with previous work (Syed et al., 2017), we used the svp-LacZ reporter line to visualize stabilized Svp expression, which reliably captures Svp expression in type 2 NSCs (Syed et al., 2017 https://doi.org/10.7554/eLife.26287, and Dhilon et al., 2024 https://doi.org/10.1242/dev.202504).

      (2) The authors have successfully slowed down the cell cycle and showed that it affects temporal progression. However, a converse experiment where the cell cycle is sped up in NSCs would be an important test for the direct coupling of temporal factor expression and cell cycle, wherein the expectation would be the precocious expression of late temporal factors in faster-cycling NSCs.

      We agree that such an experiment would be ideal. However, as noted above (Reviewer #2 comment 2), to our knowledge, no suitable tools currently exist to accelerate neuroblast cell-cycle progression without pleiotropic effects.

      Minor point 

      The authors must include Ray and Li (https://doi.org/10.7554/eLife.75879) in the references when describing that "...cell cycle has been shown to influence temporal patterning in some systems,...".  

      We thank the reviewer for this helpful suggestion. The cited reference (Ray and Li, eLife, 2022) has now been included and appropriately referenced in the revised manuscript.

    1. eLife Assessment

      The authors investigate arrestin2-mediated CCR5 endocytosis in the context of clathrin and AP2 contributions. Using an extensive set of NMR experiments, and supported by microscopy and other biophysical assays, the authors provide compelling data on the roles of AP2 and clathrin in CCR5 endocytosis. This important work will appeal to an audience beyond those studying chemokine receptors, including those studying GPCR regulation and trafficking. The distinct role of AP2 and not clathrin will be of particular interest to those studying GPCR internalization mechanisms.

    2. Reviewer #1 (Public review):

      Petrovic et al. investigate CCR5 endocytosis via arrestin2, with a particular focus on clathrin and AP2 contributions. The study is thorough and methodologically diverse. The NMR titration data clearly demonstrate chemical shift changes at the canonical clathrin-binding site (LIELD), present in both the 2S and 2L arrestin splice variants. To assess the effect of arrestin activation on clathrin binding, the authors compare: truncated arrestin (1-393), full-length arrestin, and 1-393 incubated with CCR5 phosphopeptides. All three bind clathrin comparably, whereas controls show no binding. These findings are consistent with prior crystal structures showing peptide-like binding of the LIELD motif, with disordered flanking regions. The manuscript also evaluates a non-canonical clathrin binding site specific to the 2L splice variant. Though this region has been shown to enhance beta2-adrenergic receptor binding, it appears not to affect CCR5 internalization.

      Similar analyses applied to AP2 show a different result. AP2 binding is activation-dependent and influenced by the presence and level of phosphorylation of CCR5-derived phosphopeptides. These findings are reinforced by cellular internalization assays.

      In sum, the results highlight splice-variant-dependent effects and phosphorylation-sensitive arrestin-partner interactions. The data argue against a (rapidly disappearing) one-size-fits-all model for GPCR-arrestin signaling and instead support a nuanced, receptor-specific view, with one example summarized effectively in the mechanistic figure.

    3. Reviewer #2 (Public review):

      Summary:

      Based on extensive live cell assays, SEC, and NMR studies of reconstituted complexes, these authors explore the roles of clathrin and the AP2 protein in facilitating clathrin mediated endocytosis via activated arrestin-2. NMR, SEC, proteolysis, and live cell tracking confirm a strong interaction between AP2 and activated arrestin using a phosphorylated C-terminus of CCR5. At the same time a weak interaction between clathrin and arrestin-2 is observed, irrespective of activation.

      These results contrast with previous observations of class A GPCRs and the more direct participation by clathrin. The results are discussed in terms of the importance of short and long phosphorylated bar codes in class A and class B endocytosis.

      Strengths:

      The 15N,1H and 13C,methyl TROSY NMR and assignments represent a monumental amount of work on arrestin-2, clathrin, and AP2. Weak NMR interactions between arrestin-2 and clathrin are observed irrespective of activation of arrestin. A second interface, proposed by crystallography, was suggested to be a possible crystal artifact. NMR establishes realistic information on the clathrin and AP2 affinities to activated arrestin with both kD and description of the interfaces.

    4. Reviewer #3 (Public review):

      Summary:

      Overall, this is a well-done study, and the conclusions are largely supported by the data, which will be of interest to the field.

      Strengths:

      Strengths of this study include experiments with solution NMR that can resolve high-resolution interactions of the highly flexible C-terminal tail of arr2 with clathrin and AP2. Although mainly confirmatory in defining the arr2 CBL 376LIELD380 as the clathrin binding site, the use of the NMR is of high interest (Fig. 1). The 15N-labeled CLTC-NTD experiment with arr2 titrations reveals a span from 39-108 that mediates an arr2 interaction, which corroborates previous crystal data, but does not reveal a second area in CLTC-NTD that in previous crystal structures was observed to interact with arr2.

      SEC and NMR data suggest that full-length arr2 (1-418) binding with the β2-adaptin subunit of AP2 is enhanced in the presence of CCR5 phospho-peptides (Fig. 3). The pp6 peptide shows the highest degree of arr2 activation and β2-adaptin binding, compared to less phosphorylated or unphosphorylated peptides. It is interesting that the arr2 interaction with CLTC NTD and pp6 cannot be detected using the SEC approach, further suggesting that clathrin binding is not dependent on arrestin activation. Overall, the data suggest that receptor activation promotes arrestin binding to AP2, not clathrin, suggesting the AP2 interaction is necessary for CCR5 endocytosis.

      To validate the solid biophysical data, the authors pursue validation experiments in a HeLa cell model by confocal microscopy. This requires transient transfection of tagged receptor (CCR5-Flag) and arr2 (arr2-YFP). CCR5 displays a "class B"-like behavior in that arr2 is rapidly recruited to the receptor at the plasma membrane upon agonist activation, which forms a stable complex that internalizes onto endosomes (Fig. 4). The data suggest that complex internalization is dependent on AP2 binding not clathrin (Fig. 5).

      The addition of the antagonist experiment/data adds rigor to the study.

      Overall, this is a solid study that will be of interest to the field.

    5. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews: 

      Reviewer #1 (Public review): 

      Petrovic et al. investigate CCR5 endocytosis via arrestin 2, with a particular focus on clathrin and AP2 contributions. The study is thorough and methodologically diverse. The NMR titration data clearly demonstrate chemical shift changes at the canonical clathrin-binding site (LIELD), present in both the 2S and 2L arrestin splice variants. 

      To assess the effect of arrestin activation on clathrin binding, the authors compare: truncated arrestin (1-393), full-length arrestin, and 1-393 incubated with CCR5 phosphopeptides. All three bind clathrin comparably, whereas controls show no binding. These findings are consistent with prior crystal structures showing peptide-like binding of the LIELD motif, with disordered flanking regions. The manuscript also evaluates a non-canonical clathrin binding site specific to the 2L splice variant. Though this region has been shown to enhance beta2-adrenergic receptor binding, it appears not to affect CCR5 internalization. 

      Similar analyses applied to AP2 show a different result. AP2 binding is activation-dependent and influenced by the presence and level of phosphorylation of CCR5-derived phosphopeptides. These findings are reinforced by cellular internalization assays. 

      In sum, the results highlight splice-variant-dependent effects and phosphorylation-sensitive arrestin-partner interactions. The data argue against a (rapidly disappearing) one-size-fits-all model for GPCR-arrestin signaling and instead support a nuanced, receptor-specific view, with one example summarized effectively in the mechanistic figure.

      We thank the referee for this positive assessment of our manuscript. Indeed, by stepping away from the common receptor models for understanding internalization (b2AR and V2R), we revealed the phosphorylation level of the receptor as a key factor in driving the sequestration of the receptor from the plasma membrane. We hope that the proposed mechanistic model will aid further studies to obtain an even more detailed understanding of forces driving receptor internalization.

      Weaknesses: 

      Figure 1 shows regions of the AlphaFold model that are intrinsically disordered without making it clear that this is not an expected stable position. The authors' NMR titration data are n=1. Many figure panels require that readers pinch and zoom to see the data.

      In the “Recommendations for the Authors” section, we addressed the reviewer’s stated weaknesses. In short, for the AlphaFold representation in Figure 1A, we added explicit labeling and revised the legend and main text to clearly state that the depicted loops are intrinsically disordered, absent from crystal structures due to flexibility, and shown only for visualization of their location. We also clarified that the NMR titration experiments inherently have n = 1 due to technical limitations, and that this is standard practice in the field, while ensuring individual data points remain visible. The supplementary NMR figures now have more vibrant coloring, allowing easier data assessment. However, we have not changed the zooming of the microscopy and NMR spectra. We believe that the presentation of microscopy data, which already show zoomed-in regions of interest, follows standard practices in the field. Furthermore, we strongly believe that we should display full NMR spectra in the supplementary figures to allow the reader to assess the overall quality and behavior. As indicated previously, the reader can zoom in to very high resolution, since the spectra are provided as vector graphics. Zoomed regions of the relevant details are provided in the main figures.

      Reviewer #2 (Public review): 

      Summary: 

      Based on extensive live cell assays, SEC, and NMR studies of reconstituted complexes, these authors explore the roles of clathrin and the AP2 protein in facilitating clathrin mediated endocytosis via activated arrestin-2. NMR, SEC, proteolysis, and live cell tracking confirm a strong interaction between AP2 and activated arrestin using a phosphorylated C-terminus of CCR5. At the same time a weak interaction between clathrin and arrestin-2 is observed, irrespective of activation. 

      These results contrast with previous observations of class A GPCRs and the more direct participation by clathrin. The results are discussed in terms of the importance of short and long phosphorylated bar codes in class A and class B endocytosis. 

      Strengths: 

      The 15N,1H and 13C,methyl TROSY NMR and assignments represent a monumental amount of work on arrestin-2, clathrin, and AP2. Weak NMR interactions between arrestin-2 and clathrin are observed irrespective of activation of arrestin. A second interface, proposed by crystallography, was suggested to be a possible crystal artifact. NMR establishes realistic information on the clathrin and AP2 affinities to activated arrestin with both kD and description of the interfaces.

      We sincerely thank the referee for this encouraging evaluation of our work and appreciate the recognition of the NMR efforts and insights into the arrestin–clathrin–AP2 interactions.

      Weaknesses: 

      This reviewer has identified only minor weaknesses with the study. 

      (1) I don't observe two overlapping spectra of Arrestin2 (1-393) +/- CLTC NTD in Supp Figure 1

      We believe the referee is referring to Figure 1 – figure supplement 2. We have now made the colors of the spectra more vibrant and used different contouring to make the differences between the two spectra clearer. The spectra are provided as vector graphics, which allows zooming in to the very fine details.

      (2) Arrestin-2 1-418 resonances all but disappear with CCR5pp6 addition. Are they recovered with Ap2Beta2 addition, and is this what is shown in Supp Fig 2D?

      We believe the reviewer is referring to Figure 3 - figure supplement 1. In this figure, panels E and F show that the resonances of arrestin2<sup>1-418</sup> (apo state shown with black outline) disappear upon addition of CCR5pp6 (arrestin2<sup>1-418</sup>•CCR5pp6 complex spectrum in red). Panels C and D show the resonances of arrestin2<sup>1-418</sup> (apo state shown with black outline), which remain unchanged upon addition of AP2β2<sup>701-937</sup> (orange), indicating no complex formation. We also recorded a spectrum of the arrestin2<sup>1-418</sup>•CCR5pp6 complex after addition of AP2β2<sup>701-937</sup> (not shown), but the arrestin2 resonances in the arrestin2<sup>1-418</sup>•CCR5pp6 complex were already too broad for further analysis. This had already been explained in the text:

      “In agreement with the AP2β2 NMR observations, no interaction was observed in the arrestin2 methyl and backbone NMR spectra upon addition of AP2β2 in the absence of phosphopeptide (Figure 3-figure supplement 1C, D). However, the significant line broadening of the arrestin2 resonances upon phosphopeptide addition (Figure 3-figure supplement 1E, F) precluded a meaningful assessment of the effect of the AP2β2 addition on arrestin2 in the presence of phosphopeptide”.

      (3) I don't understand how methyl TROSY spectra of arrestin2 with phosphopeptide could look so broadened unless there are sample stability problems?

      We thank the referee for this comment. We would like to clarify that, in general, a spectrum broadened beyond what is expected from the rotational correlation time does not necessarily indicate sample stability problems. Rather, it is evidence of intermediate conformational exchange on the micro- to millisecond time scale.
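
      As a brief reminder of why exchange alone can broaden resonances, consider a textbook two-site exchange model. This is an illustrative assumption; the response does not specify an exchange model for arrestin2. The symbols $p_A$, $p_B$ (populations), $\Delta\omega$ (chemical shift difference in rad/s), $k_{\mathrm{ex}}$ (exchange rate), and $R_2^{0}$ (exchange-free transverse relaxation rate, set by the rotational correlation time) are introduced here for illustration only:

```latex
\begin{align*}
k_{\mathrm{ex}} &= k_{AB} + k_{BA}, \qquad p_A + p_B = 1,\\
R_2^{\mathrm{obs}} &= R_2^{0} + R_{\mathrm{ex}},\\
R_{\mathrm{ex}} &\approx \frac{p_A\,p_B\,\Delta\omega^{2}}{k_{\mathrm{ex}}}
  \quad (\text{fast-exchange limit, } k_{\mathrm{ex}} \gg \Delta\omega).
\end{align*}
```

      In this picture the extra linewidth comes from $R_{\mathrm{ex}}$, not from a change in molecular size or sample integrity, and broadening is most severe in the intermediate regime where $k_{\mathrm{ex}} \approx \Delta\omega$.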

      The displayed <sup>1</sup>H-<sup>15</sup>N spectra of apo arrestin2 already suffer from line broadening due to such intrinsic mobility of the protein. These spectra were recorded with acquisition times of 50 ms (<sup>15</sup>N) and 55 ms (<sup>1</sup>H) and resolution-enhanced by a 60°-shifted sine-bell filter for <sup>15</sup>N and a 60°-shifted squared sine-bell filter for <sup>1</sup>H, respectively, which leads to the observed resolution with still reasonable sensitivity. The <sup>1</sup>H-<sup>15</sup>N resonances in Fig. 1b (arrestin2<sup>1-393</sup>) look particularly narrow. However, this region contains a large number of flexible residues. The full spectrum, e.g. Figure 1-figure supplement 2, shows the entire situation with a clear variation of linewidths and intensities. The linewidth variation becomes stronger when omitting the resolution enhancement filters.
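
      For readers less familiar with apodization, the following is a minimal sketch of a shifted sine-bell window. It assumes a common convention in which the sine phase runs from the stated shift (60°) to 180° across the acquired points; the function name `shifted_sine_bell`, the 8-point length, and the end phase are illustrative assumptions, since the response does not state the processing software or its exact parameters.

```python
import math


def shifted_sine_bell(n_points, shift_deg=60.0, power=1):
    """Sine-bell window whose phase runs from shift_deg to 180 degrees.

    power=1 gives a sine bell, power=2 a squared sine bell.
    The 180-degree end phase is an assumed (common) convention.
    """
    if n_points < 2:
        raise ValueError("need at least two points")
    start = math.radians(shift_deg)
    end = math.pi
    window = []
    for i in range(n_points):
        # Linear phase ramp from the shift angle to 180 degrees.
        phase = start + (end - start) * i / (n_points - 1)
        window.append(math.sin(phase) ** power)
    return window


# Hypothetical 8-point windows, only to show the shape of each filter.
sine_15n = shifted_sine_bell(8, shift_deg=60.0, power=1)
squared_1h = shifted_sine_bell(8, shift_deg=60.0, power=2)
```

      Each window would be multiplied point by point with the time-domain signal before Fourier transformation. Because the window starts below its maximum and decays to zero, it narrows lines at some cost in sensitivity, which matches the trade-off described above.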

      The addition of the CCR5pp6 phosphopeptide does not change protein stability, which we assessed by measuring the melting temperature of arrestin2<sup>1-418</sup> and arrestin2<sup>1-418</sup>•CCR5pp6 complex (Tm = 57°C in both cases). We believe that the explanation for the increased broadening of the arrestin2 resonances is that addition of the CCR5pp6, possibly due to the release of the arrestin2 strand β20, amplifies the mentioned intermediate timescale protein dynamics. This results in the disappearance of arrestin2 resonances.

      We have now included the assessment of arrestin2<sup>1-418</sup> and arrestin2<sup>1-418</sup>•CCR5pp6 stability in the manuscript:

      “The observed line broadening of arrestin2 in the presence of phosphopeptide must be a result of increased protein motions and is not caused by a decrease in protein stability, since the melting temperatures of arrestin2 in the absence and presence of phosphopeptide are identical (56.9 ± 0.1 °C)”.

      (4) At one point the authors added excess fully phosphorylated CCR5 phosphopeptide (CCR5pp6). Does the phosphopeptide rescue resolution of arrestin2 (NH or methyl) to the point where interaction dynamics with clathrin (CLTC NTD) are now more evident on the arrestin2 surface?

      Unfortunately, when we titrate arrestin2 with CCR5pp6 (please see Isaikina & Petrovic et al., Mol. Cell, 2023 for more details), the arrestin2 resonances undergo fast-to-intermediate exchange upon binding. In the presence of phosphopeptide excess, very few resonances remain, the majority of which are in the disordered region, including resonances from the clathrin-binding loop. Due to the peak overlap, we could not unambiguously assign arrestin2 resonances in the bound state, which precluded our assessment of the arrestin2-clathrin interaction in the presence of phosphopeptide. We have now made this clearer in the paragraph ‘The arrestin2-clathrin interaction is independent of arrestin2 activation’:

      “Due to significant line broadening and peak overlap of the arrestin2 resonances upon phosphopeptide addition, the influence of arrestin activation on the clathrin interaction could not be detected on either backbone or methyl resonances”.

      (5) Once phosphopeptide activates arrestin-2 and AP2 binds can phosphopeptide be exchanged off? In this case, would it be possible for the activated arrestin-2 AP2 complex to re-engage a new (phosphorylated) receptor?

      This would be an interesting mechanism. In principle, this should be possible as long as the other (phosphorylated) receptor outcompetes the initial phosphopeptide with higher affinity towards the binding site. However, we do not have experiments to assess this process directly. Therefore, we prefer not to speculate further.

      (6) I'd be tempted to move the discussion of class A and class B GPCRs and their presumed differences to the intro and then motivate the paper with specific questions. 

      We appreciate the referee’s suggestion and had a similar idea previously. However, as we do not have data on other class-A or class-B receptors, we would rather not motivate the entire manuscript with this question.

      (7) Did the authors ever try SEC measurements of arrestin-2 + AP2beta2+CCR5pp6 with and without PIP2, and with and without clathrin (CLTC NTD)? The question becomes what the active complex is and how PIP2 modulates this cascade of complexation events in class B receptors.

      We thank the referee for this question. Indeed, we tested whether PIP2 can stabilize the arrestin2•CCR5pp6•AP2 complex by SEC experiments. Unfortunately, the addition of PIP2 increased the formation of arrestin2 dimers and higher oligomers, presumably due to the presence of additional charges. The resolution of SEC experiments was not sufficient to distinguish arrestin2 in oligomeric form or in arrestin2•CCR5pp6•AP2 complex. We now mention this in the text:

      “We also attempted to stabilize the arrestin2-AP2β2-phosphopeptide complex through the addition of PIP2, which can stabilize arrestin complexes with the receptor (Janetzko et al., 2022). The addition of PIP2 increased the formation of arrestin2 dimers and higher oligomers, presumably due to the presence of additional charges. Unfortunately, the resolution of the SEC experiments was not sufficient to separate the arrestin2 oligomers from complexes with AP2β2”.

      Reviewer #3 (Public review): 

      Summary: 

      Overall, this is a well-done study, and the conclusions are largely supported by the data, which will be of interest to the field. 

      Strengths: 

      Strengths of this study include experiments with solution NMR that can resolve high-resolution interactions of the highly flexible C-terminal tail of arr2 with clathrin and AP2. Although mainly confirmatory in defining the arr2 CBL <sup>376</sup>LIELD<sup>380</sup> as the clathrin binding site, the use of the NMR is of high interest (Fig. 1). The 15N-labeled CLTC-NTD experiment with arr2 titrations reveals a span from 39-108 that mediates an arr2 interaction, which corroborates previous crystal data, but does not reveal a second area in CLTC-NTD that in previous crystal structures was observed to interact with arr2. 

      SEC and NMR data suggest that binding of full-length arr2 (1-418) to the β2-adaptin subunit of AP2 is enhanced in the presence of CCR5 phospho-peptides (Fig. 3). The pp6 peptide shows the highest degree of arr2 activation and β2-adaptin binding, compared to less phosphorylated or non-phosphorylated peptides. It is interesting that the arr2 interaction with CLTC NTD and pp6 cannot be detected using the SEC approach, further suggesting that clathrin binding is not dependent on arrestin activation. Overall, the data suggest that receptor activation promotes arrestin binding to AP2, not clathrin, suggesting the AP2 interaction is necessary for CCR5 endocytosis. 

      To validate the solid biophysical data, the authors pursue validation experiments in a HeLa cell model by confocal microscopy. This requires transient transfection of tagged receptor (CCR5-Flag) and arr2 (arr2-YFP). CCR5 displays a "class B"-like behavior in that arr2 is rapidly recruited to the receptor at the plasma membrane upon agonist activation, which forms a stable complex that internalizes onto endosomes (Fig. 4). The data suggest that complex internalization is dependent on AP2 binding not clathrin (Fig. 5). 

      The addition of the antagonist experiment/data adds rigor to the study. 

      Overall, this is a solid study that will be of interest to the field.

      We thank the referee for the careful and encouraging evaluation of our work. We appreciate the recognition of the solidity of our data and the support for our conclusions regarding the distinct roles of AP2 and clathrin in arrestin-mediated receptor internalization.

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors): 

      I believe that the authors have made efforts to improve the accessibility to a broader audience. In a few cases, I believe that the authors' response either did not truly address the concern or made the problem worse. I am grouping these as 'very strong opinions' and 'sticking point'. 

      Very strong opinion 1: 

      While data presentation is somewhat at the authors' discretion, there were several figures where the presentation did not make the work approachable, including microscopy insets and NMR spectra. A suggestion to 'pinch and zoom' does not really address this. For the overlapping NMR spectra in supporting Figure 1, I actually -can- see this on zooming, but I did not recognize this on first pass because the colors are almost identical for the two spectra. This is an easy fix. Changing the presentation by coloring these distinctly would alleviate this. The Supplemental figure to Fig. 2 looks strange with pinch and zoom. But at the end of the day, data presentation where the reader is to infer that they must zoom in is not very approachable and may prevent readers from being able to independently assess the data. In this case, there doesn't seem to be a strong rationale to not make these panels easier to see at 100% size. 

      We appreciate the reviewer’s thoughtful comments regarding figure accessibility and agree that data presentation should be clear and interpretable without requiring readers to zoom in extensively. However, we must note that the presentation of the microscopy data follows standard practices in the field and that the panels already include zoomed-in regions, which enable easier access to key results and observations.

      Regarding the NMR data, we have revised Figure 1—figure supplement 2 and Figure 2— figure supplement 1 to match the presentation style of Figure 3—figure supplement 1, which the reviewer apparently found more accessible. We also made the colors of the spectra more vibrant, as the referee suggested. We would like to emphasize that it is absolutely necessary to display the full NMR spectra in order to allow independent assessment of signal assignment, data quality, and overall protein behavior. Zoomed regions of the relevant details are provided in the main figures.

      Very strong opinion 2: 

      The authors' response to lack of individual data points and error bars is that this is an n=1 experiment. I do not believe this meets the minimum standard for best practices in the field.

      We respectfully disagree with the reviewer’s assessment. The figure displays individual data points, as it already did in the initial submission. Performing NMR titrations with isotopically labeled protein samples is inherently resource-intensive, and single-sample (n = 1) experiments are widely accepted and routinely reported in the field. Numerous studies have used the same approach, including Rosenzweig et al., Science (2013); Nikolaev et al., Nat. Methods (2019); and Hobbs et al., J. Biomol. NMR (2022), as well as our own recent work (Isaikina & Petrovic et al., Mol. Cell, 2023). These studies demonstrate that such NMR-based affinity measurements, even when performed on a single sample, are highly reproducible, precise, and consistent with orthogonal evidence and across different sample conditions.
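
      To make concrete how a single titration series yields an affinity, the sketch below fits the standard 1:1 binding isotherm with ligand depletion to a set of chemical shift changes. This is an assumption about the general approach, not a statement of the equation or software used in the manuscript. All numbers are synthetic and noise-free, and the names `bound_fraction`, `fit_kd`, `P_TOTAL`, `TRUE_KD`, and `TRUE_DMAX` are hypothetical.

```python
import math


def bound_fraction(p_total, l_total, kd):
    """Fraction of protein in a 1:1 complex, accounting for ligand depletion."""
    s = p_total + l_total + kd
    complex_conc = (s - math.sqrt(s * s - 4.0 * p_total * l_total)) / 2.0
    return complex_conc / p_total


def fit_kd(p_total, ligand_concs, shifts, kd_grid):
    """Grid-search K_D; for each K_D the best maximal shift has a closed form."""
    best = None
    for kd in kd_grid:
        f = [bound_fraction(p_total, l, kd) for l in ligand_concs]
        denom = sum(x * x for x in f)
        # Linear least squares for the maximal shift at this fixed K_D.
        dmax = sum(x * y for x, y in zip(f, shifts)) / denom
        sse = sum((dmax * x - y) ** 2 for x, y in zip(f, shifts))
        if best is None or sse < best[0]:
            best = (sse, kd, dmax)
    return best[1], best[2]


# Synthetic titration (hypothetical values; concentrations in micromolar).
P_TOTAL = 100.0
TRUE_KD = 50.0
TRUE_DMAX = 0.2  # ppm
ligand = [0.0, 25.0, 50.0, 100.0, 200.0, 400.0, 800.0]
observed = [TRUE_DMAX * bound_fraction(P_TOTAL, l, TRUE_KD) for l in ligand]

kd_grid = [float(k) for k in range(1, 501)]
kd_fit, dmax_fit = fit_kd(P_TOTAL, ligand, observed, kd_grid)
```

      Each concentration in the series contributes one data point to the fit, so the curve is constrained by the whole titration even though only one sample is used. The sketch does not address how uncertainties on the fitted K_D would be estimated.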

      Sticking point:

      Figure 1A - the alphaFold model of arrestin2L depicts the disordered loops as ordered. The depiction is misleading at best, and inaccurate in truth. To use an analogy, what the authors depict is equivalent to publishing an LLM hallucination in the text. Unlike LLMs, alphaFold will actually flag its hallucination with the confidence (pLDDT) in the output. Both for LLMs and for alphaFold, we are spending much time teaching our students in class how to use computation appropriately - both to improve efficiency but also to ensure accuracy by removing hallucinations.

      The original review indicated that confidences needed to be shown and that this needed to be depicted in a way that clarifies that this is NOT a structural state of the loops. The newly added description ("The model was used to visualize the clathrin-binding loop and the 344-loop of the arrestin2 C-domain, which are not detected in the available crystal structures...") worsens the concern because it even more strongly implies that a 0 confidence computational output is a likely structural state. It also indicates that these regions were 'not detected' in crystal structures. These regions of arrestin are intrinsically disordered. AlphaFold (by its nature) must put out something in terms of coordinates, even if the pLDDT suggests that the region cannot be predicted or is not in a stable position, which is the case here. In crystal structures, these regions are not associated with interpretable electron density, meaning that coordinates are omitted in these regions because adding them would imply that under the conditions used, the protein adopts a low energy structural state in this region. This region is instead intrinsically disordered. 

      A good description of why showing disordered loops in a defined position is incorrect and how to instead depict disorder correctly is in Brotzakis et al. Nat communications 16, 1632 (2025) "AlphaFold prediction of structural ensembles of disordered proteins", where figures 3A, 4A, and 5A show one AlphaFold prediction colored by confidence and 3B, 4B and 5B are more accurate depictions of the structural ensemble. 

      Coming back to the original comment "The AlphaFold model could benefit from a more transparent discussion of prediction confidence and caveats. The younger crowd (part of the presumed intended readership) tends to be more certain that computational output is 'true'...." Right now, the authors are still showing in Fig 1A a depiction of arrestin with models for the loops that are untrue. They now added text indicating that these loops are visualized in an AlphaFold prediction and 'true' but 'not detected in crystal structures'. There is no indication in the text that these are intrinsically disordered. The lack of showing the pLDDT confidence and the lack of any indication that these are disordered regions is simply incorrect. 
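
      As a practical illustration of the confidence information discussed above: AlphaFold writes the per-residue pLDDT into the B-factor field of its PDB-format output, so low-confidence regions can be listed with a few lines of code. The sketch below is a minimal example under stated assumptions. The three records are made-up values built by a hypothetical helper (`make_ca_line`), not data from the arrestin2 model, and the confidence bands follow those used by the AlphaFold Protein Structure Database.

```python
def make_ca_line(serial, res_seq, plddt):
    """Build a fixed-column PDB ATOM record for a C-alpha atom (illustrative)."""
    return (f"ATOM  {serial:5d}  CA  ALA A{res_seq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")


def plddt_per_residue(pdb_lines):
    """Map residue number to pLDDT, read from the B-factor field of CA atoms."""
    scores = {}
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            res_seq = int(line[22:26])           # PDB columns 23-26
            scores[res_seq] = float(line[60:66])  # PDB columns 61-66
    return scores


def confidence_class(plddt):
    """Confidence band for a pLDDT value (AlphaFold DB convention)."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


def low_confidence_residues(scores, cutoff=50.0):
    """Residues whose pLDDT falls below the cutoff, in ascending order."""
    return sorted(res for res, value in scores.items() if value < cutoff)


# Made-up records for three residues; values are hypothetical.
example_lines = [
    make_ca_line(1, 1, 92.5),
    make_ca_line(2, 2, 45.1),
    make_ca_line(3, 3, 31.0),
]
scores = plddt_per_residue(example_lines)
flagged = low_confidence_residues(scores)
```

      A list like `flagged` could then be used to color or fade the corresponding loops in a figure, which is one way to show where a predicted position should not be read as a structural state.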

      We appreciate the concern of the reviewer towards AlphaFold models. As NMR spectroscopists we are highly aware of intrinsic biomolecular motions. However, our AlphaFold2 model is used as a graphical representation to display the interaction sites of loops; it is not intended to depict the loops as fixed structural states. The flexibility of the loops had been clearly described in the main text before:

      “Arrestin2 consists of two consecutive (N- and C-terminal) β-sandwich domains (Figure 1A), followed by the disordered clathrin-binding loop (CBL, residues 353–386), strand β20 (residues 386–390), and a disordered C-terminal tail after residue 393”.

      and

      “Figure 1B depicts part of a <sup>1</sup>H-<sup>15</sup>N TROSY spectrum (full spectrum in Figure 1-figure supplement 2A) of the truncated <sup>15</sup>N-labeled arrestin2 construct arrestin2<sup>1-393</sup> (residues 1-393), which encompasses the C-terminal strand β20, but lacks the disordered C-terminal tail. Due to intrinsic microsecond dynamics, the assignment of the arrestin2<sup>1-393</sup> <sup>1</sup>H-<sup>15</sup>N resonances by triple resonance methods is largely incomplete, but 16 residues (residues 367-381, 385-386) within the mobile CBL could be assigned. This region of arrestin is typically not visible in either crystal or cryo-EM structures due to its high flexibility”.

      as well as in the legend to Figure 1:

      “The model was used to visualize the clathrin-binding loop and the 344-loop of the arrestin2 C-domain, which are not detected in the available crystal structures of apo arrestin2 [bovine: PDB 1G4M (Han et al., 2001), human: PDB 8AS4 (Isaikina et al., 2023)]. In the other structured regions, the model is virtually identical to the crystal structures”.

      We have now further added a label ‘AlphaFold2 model’ to Figure 1A and amended the respective Figure legend to

      “The model was used to visualize the clathrin-binding loop and the 344-loop of the arrestin2 C-domain, which are not detected in the available crystal structures of apo arrestin2 [bovine: PDB 1G4M (Han et al., 2001), human: PDB 8AS4 (Isaikina et al., 2023)] due to flexibility. In the other structured regions, the model is virtually identical to the crystal structures”.

      Reviewer #2 (Recommendations for the authors): 

      I appreciated the response by the authors to all of my questions. I have no further comments.

      We thank the referee for the raised questions, which we believe have improved the quality of the manuscript.

    1. eLife Assessment

      This revised paper provides a valuable and novel neural network-based framework for parameterizing individual differences and predicting individual decision-making across task conditions. The methods and analyses are solid yet could benefit from further validation of the superiority of the proposed framework against other baseline models. With these concerns addressed, this study would offer a proof-of-concept neural network approach to scientists working on the generalization of cognitive skills across contexts.

    2. Reviewer #1 (Public review):

      Summary

      The manuscript presents EIDT, a framework that extracts an "individuality index" from a source task to predict a participant's behaviour in a related target task under different conditions. However, the evidence that it truly enables cross-task individuality transfer is not convincing.

      Strengths

      The EIDT framework is clearly explained, and the experimental design and results are generally well-described. The performance of the proposed method is tested on two distinct paradigms: a Markov Decision Process (MDP) task (comparing 2-step and 3-step versions) and a handwritten digit recognition (MNIST) task under various conditions of difficulty and speed pressure. The results indicate that the EIDT framework generally achieved lower prediction error compared to baseline models and that it was better at predicting a specific individual's behaviour when using their own individuality index compared to using indices from others.

      Furthermore, the individuality index appeared to form distinct clusters for different individuals.

      Comments on revisions:

      I thank the author for the additional analyses. They have fully addressed all of my previous concerns, and I have no further recommendations.